Unexplained Process Termination, bare metal, 8.12.0, SORT, consistent?
Environment
Description
Conclusion
Activity

Gavin Halliday April 18, 2023 at 1:53 PM
Fix released. It is data related, and the change is likely to do with the way very unusual Unicode characters are converted to strings.

Gavin Halliday April 18, 2023 at 10:36 AM
Some digging later...
The record that it crashes on has an incoming utf8 field with the value:
0x02 0x00 0x00 0x00 0xe3 0x85 0xa4 0xe3 0x85 0xa4
I will see if I can write a test case using that value.
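
For reference, a minimal sketch of how the quoted bytes decode. It assumes the first four bytes are a little-endian character count and the rest is the UTF-8 payload; that layout is an assumption made for illustration, not something stated in this ticket.

```python
import struct
import unicodedata

# Bytes quoted in the comment above.
raw = bytes([0x02, 0x00, 0x00, 0x00, 0xE3, 0x85, 0xA4, 0xE3, 0x85, 0xA4])

# Assumption: a 4-byte little-endian length prefix counting characters.
(char_count,) = struct.unpack_from("<I", raw, 0)

# The remaining six bytes decode as UTF-8 (three bytes per character here).
text = raw[4:].decode("utf-8")

codepoints = [f"U+{ord(ch):04X}" for ch in text]
names = [unicodedata.name(ch) for ch in text]

print(char_count, codepoints, names)
```

Under that assumption the field holds two copies of U+3164 (HANGUL FILLER), a character that renders as blank, which fits the "very unusual Unicode characters" description.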

Gavin Halliday April 18, 2023 at 9:32 AM
Thanks. I edited the disassembly to remove the exception code, which suggests it's failing on the destruction of vTR7 (since 3 more rtlDataAttr are destroyed afterwards). There is nothing particularly different about that variable, or the call that initialises it.

Mark Kelly April 17, 2023 at 7:00 PM
rax 0x0 0
rbx 0x7f71d0003b30 140126797708080
rcx 0xffffffffffffffff -1
rdx 0x6 6
rsi 0x1b761c 1799708
rdi 0x6e861 452705
rbp 0x7f71e16ec510 0x7f71e16ec510 <CStreamFileOwner::prefetchRow()>
rsp 0x7f70aa8a8410 0x7f70aa8a8410
r8 0x0 0
r9 0x27 39
r10 0x8 8
r11 0x202 514
r12 0x7f71c4057c22 140126596725794
r13 0x7f71d0003bf8 140126797708280
r14 0x1829ef0 25337584
r15 0x7f71c1be977a 140126558525306
rip 0x7f71e16efa82 0x7f71e16efa82 <CDiskReadSlaveActivity::CDiskPartHandler::nextRow()+146>
eflags 0x202 [ IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

Mark Kelly April 17, 2023 at 7:00 PM (edited)
I see 35 calls to rtlDataAttr::~rtlDataAttr() starting at the address:
(gdb) x/200i 0x00007f71c1be94c8
0x7f71c1be94c8: lea -0x60(%rbp),%rax
0x7f71c1be94cc: mov %rax,%rdi
0x7f71c1be94cf: callq 0x7f71c1ac8660 <_ZN11rtlDataAttrD1Ev@plt>
0x7f71c1be94d4: lea -0x50(%rbp),%rax
0x7f71c1be94d8: mov %rax,%rdi
0x7f71c1be94db: callq 0x7f71c1ac8660 <_ZN11rtlDataAttrD1Ev@plt>
0x7f71c1be94e0: lea -0x40(%rbp),%rax
0x7f71c1be94e4: mov %rax,%rdi
0x7f71c1be94e7: callq 0x7f71c1ac8660 <_ZN11rtlDataAttrD1Ev@plt>
0x7f71c1be94ec: mov %ebx,%eax
0x7f71c1be94ee: jmpq 0x7f71c1be9768
.......
0x7f71c1be9768: add $0x878,%rsp
0x7f71c1be976f: pop %rbx
0x7f71c1be9770: pop %r12
0x7f71c1be9772: pop %r13
0x7f71c1be9774: pop %r14
0x7f71c1be9776: pop %r15
0x7f71c1be9778: pop %rbp
0x7f71c1be9779: retq
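
A rough sketch of how destructor calls in a gdb listing like the one above could be counted mechanically. The `calls_to` helper and the `CALL_RE` pattern are hypothetical names, and only the first lines of the listing are embedded, so the count here covers the excerpt rather than the full 35.

```python
import re

# Excerpt of the gdb `x/200i` output quoted above (not the full listing).
GDB_OUTPUT = """\
0x7f71c1be94c8: lea -0x60(%rbp),%rax
0x7f71c1be94cc: mov %rax,%rdi
0x7f71c1be94cf: callq 0x7f71c1ac8660 <_ZN11rtlDataAttrD1Ev@plt>
0x7f71c1be94d4: lea -0x50(%rbp),%rax
0x7f71c1be94d8: mov %rax,%rdi
0x7f71c1be94db: callq 0x7f71c1ac8660 <_ZN11rtlDataAttrD1Ev@plt>
0x7f71c1be94e0: lea -0x40(%rbp),%rax
0x7f71c1be94e4: mov %rax,%rdi
0x7f71c1be94e7: callq 0x7f71c1ac8660 <_ZN11rtlDataAttrD1Ev@plt>
0x7f71c1be94ec: mov %ebx,%eax
0x7f71c1be94ee: jmpq 0x7f71c1be9768
"""

# Matches "<address>: callq <target address> <symbol>" at the start of a line.
CALL_RE = re.compile(r"^(0x[0-9a-f]+):\s+callq\s+0x[0-9a-f]+\s+<(\S+?)>", re.M)


def calls_to(listing, symbol):
    """Return the addresses of callq instructions that target `symbol`."""
    return [int(addr, 16)
            for addr, target in CALL_RE.findall(listing)
            if target == symbol]


# _ZN11rtlDataAttrD1Ev is the mangled name of rtlDataAttr::~rtlDataAttr().
dtor_calls = calls_to(GDB_OUTPUT, "_ZN11rtlDataAttrD1Ev@plt")
print(len(dtor_calls), [hex(a) for a in dtor_calls])
```

Comparing the saved return address against these call sites shows how many destructors remain after the one that failed.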
FIDO, 8.12.0-1, consistent failures that started back in February when 8.12.0-1 was deployed (8.10.10 previously). This does include a global SORT for an index build.
I have some suspicions it could be data related, because day-to-day groupings present with the same IP address reporting the error (of late, for many days, consistently 10.194.83.18).
0012B226 USR 2023-04-13 00:15:25.845 1637012 198236 "================================================"
0012B227 USR 2023-04-13 00:15:25.845 1637012 198236 "Program: 10.194.83.18:/mnt/disk1/HPCCSystems/bin/thorslave_lcr"
0012B228 USR 2023-04-13 00:15:25.845 1637012 198236 "Signal: 6 Aborted"
0012B229 USR 2023-04-13 00:15:25.845 1637012 198236 "Fault IP: 00007FBED0437387"
0012B22A USR 2023-04-13 00:15:25.845 1637012 198236 "Accessing: 000001F40018FA94"
0012B22B PRG 2023-04-13 00:15:25.845 1637012 198236 "Backtrace:"
W20230214-230300 (internal_8.10.10-1) and a spot-checked group of seven WUs I restored from between Jan 25 and Feb 14 were successful.
W20230215-230318 (internal_8.12.0-1) and apparently every similar WU after that failed. That includes dozens of WUs, as recent as W20230413-000415.