Segfault during Smart Join

Environment

http://10.173.71.54:8010

Description

The job segfaults the thorslave. It looks like core files were generated as well.

TS logs

10.173.71.44:/var/lib/HPCCSystems/thor100_71_5/thorslave.94.2018_11_29.log

0032F51B 2018-11-29 16:17:00.714 189606 1471742 "recvLoop - received bcast_stop, from : node=3, slave=3 - activity(ch=0, smartjoin, 1317)"

0032F51C 2018-11-29 16:17:00.785 189606 1471729 "clearNonLocalRows[slave=2], numCommitted=87621, totalRows(inc uncommitted)=87646, flushMarker=0 - activity(ch=0, smartjoin, 1317)"

0032F51D 2018-11-29 16:17:00.789 189606 1471729 "clearAllNonLocalRows(100): CThorSpillableRowArray::save (skipNulls=true, emptyRowSemantics=0) max rows = 87621 - activity(ch=0, smartjoin, 1317)"

0032F51E 2018-11-29 16:17:00.802 189606 1471729 "clearAllNonLocalRows(100): CThorSpillableRowArray::save done, rows written = 87621, bytes = 1051452 - activity(ch=0, smartjoin, 1317)"

0032F51F 2018-11-29 16:17:00.944 189606 1471729 "clearNonLocalRows[slave=2], numCommitted=67, totalRows(inc uncommitted)=87711, flushMarker=0 - activity(ch=0, smartjoin, 1317)"

0032F520 2018-11-29 16:17:00.944 189606 1471729 "================================================"

0032F521 2018-11-29 16:17:00.944 189606 1471729 "Program:   10.173.71.44:/mnt/disk1/HPCCSystems/bin/thorslave_lcr"

0032F522 2018-11-29 16:17:00.944 189606 1471729 "Signal:    11 Segmentation fault"

0032F523 2018-11-29 16:17:00.944 189606 1471729 "Fault IP:  00007FF65D81EF22"

0032F524 2018-11-29 16:17:00.944 189606 1471729 "Accessing: 0000000000000000"

0032F525 2018-11-29 16:17:00.944 189606 1471729 "Backtrace:"

0032F526 2018-11-29 16:17:00.961 189606 1471729 "  /var/lib/HPCCSystems/queries/thor100_71_5_24200/V4167451050_libW20181129-154014.so(+0x6d3f22) [0x7ff65d81ef22]"

0032F527 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libactivityslaves_lcr.so(+0xe44ab) [0x7ff669eca4ab]"

0032F528 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libactivityslaves_lcr.so(+0xe45e3) [0x7ff669eca5e3]"

0032F529 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libroxiemem.so(+0x16432) [0x7ff66488a432]"

0032F52A 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/libroxiemem.so(+0x16660) [0x7ff66488a660]"

0032F52B 2018-11-29 16:17:00.962 189606 1471729 "  /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2c) [0x7ff6645c5cbc]"

0032F52C 2018-11-29 16:17:00.962 189606 1471729 "  /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x1e) [0x7ff6645c768e]"

0032F52D 2018-11-29 16:17:00.962 189606 1471729 "  /lib64/libpthread.so.0(+0x7e25) [0x7ff6631dee25]"

0032F52E 2018-11-29 16:17:00.962 189606 1471729 "  /lib64/libc.so.6(clone+0x6d) [0x7ff662f08bad]"

0032F52F 2018-11-29 16:17:00.962 189606 1471729 "Registers:"

0032F530 2018-11-29 16:17:00.962 189606 1471729 "EAX:0000000000000000  EBX:0000000000000000  ECX:0000000000000000  EDX:00007FF64D9C0080  ESI:0000000000000000  EDI:00000000015F4168"

0032F531 2018-11-29 16:17:00.962 189606 1471729 "R8 :0000000000000001  R9 :00007FF662E5716D  R10:61202D20303D7265  R11:0000000000000000"

0032F532 2018-11-29 16:17:00.962 189606 1471729 "R12:00007FF410008440  R13:0000000000000000  R14:0000000000000043  R15:00007FF4100084B0"

0032F533 2018-11-29 16:17:00.962 189606 1471729 "CS:EIP:0033:00007FF65D81EF22"

0032F534 2018-11-29 16:17:00.962 189606 1471729 "   ESP:00007FF40DFFA320  EBP:00007FF40DFFA350"

0032F535 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA320]: 0000000000000000 015F416800000000 00000000015F4168 0DFFA3C000000000 00007FF40DFFA3C0 645402DC00007FF4 00007FF6645402DC 0000000000007FF6"

0032F536 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA340]: 00007FF400000000 0000000000007FF4 0000000000000000 0193A07000000000 000000000193A070 69ECA4AB00000000 00007FF669ECA4AB 0DFFA3E000007FF6"

0032F537 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA360]: 00007FF40DFFA3E0 1000844000007FF4 00007FF410008440 0193A07000007FF4 000000000193A070 0000000200000000 0000000000000002 0000006400000000"

0032F538 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA380]: 0000000000000064 0000000200000000 0000000000000002 0193A4C000000000 000000000193A4C0 69ECA5E300000000 00007FF669ECA5E3 100360F000007FF6"

0032F539 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA3A0]: 00007FF4100360F0 6487FE8100007FF4 00007FF66487FE81 0193A53000007FF6 000000000193A530 8870000000000000 0000000188700000 0000000100000001"

0032F53A 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA3C0]: 0000000000000001 100015D000000000 00007FF4100015D0 540022A000007FF4 00007FF6540022A0 0000000A00007FF6 000008000000000A 0000000000000800"

0032F53B 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA3E0]: 0000000000000000 648899E700000000 00007FF6648899E7 0000000000007FF6 00007FF600000000 0000004E00007FF6 000000800000004E 8884003800000080"

0032F53C 2018-11-29 16:17:00.962 189606 1471729 "Stack[00007FF40DFFA400]: 00007FF488840038 1000163000007FF4 00007FF410001630 5400185000007FF4 00007FF654001850 64889D1400007FF6 00007FF664889D14 0000021000007FF6"

0032F53D 2018-11-29 16:17:00.962 189606 1471729 "ThreadList:

7FF6617E6700 140696174356224 189613: CMPNotifyClosedThread

7FF660FE5700 140696165963520 189614: CSocketBaseThread

7FF6607E4700 140696157570816 189615: MP Connection Thread

7FF65FFE3700 140696149178112 189617: CMemoryUsageReporter

7FF65F7E2700 140696140785408 189619: CBackupHandler

7FF65EFE1700 140696132392704 189621: CGraphProgressHandler

7FF40DFFB700 140686183610112 1471729: BackgroundReleaseBufferThread

7FF42CBF2700 140686699472640 1471733: ProcessSlaveActivity

7FF4529B6700 140687334663936 1471734: CGraphExecutor pool

7FF41D7FA700 140686443652864 1471742: CBroadcaster::CRecv

7FF4539B8700 140687351449344 1471743: CBroadcaster::CSend

7FF4531B7700 140687343056640 1471744: CRowProcessor

7FF4521B5700 140687326271232 1471745: CDistributorBase::cRecvThread

7FF41FFFF700 140686485616384 1471746: CDistributorBase::cSendThread

7FF41F7FE700 140686477223680 1471747: CDistributorBase::cRecvThread

7FF41EFFD700 140686468830976 1471748: CDistributorBase::cSendThread

7FF41DFFB700 140686452045568 1471752: CRowStreamLookAhead
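Most frames in the backtrace above are given only as a module path plus a hex offset, and the top frame (inside the generated query .so) has the same address as the reported Fault IP. As a possible aid to narrowing down the root cause, here is a minimal Python sketch that pulls the module/offset pairs out of log lines in this format and builds an `addr2line` command for each. The helper names `parse_frames` and `addr2line_command` are hypothetical, the regular expression assumes the frame layout seen in this log, and the commands are only useful if the same binaries (ideally with debug symbols) are available where they are run.

```python
import re

# Assumed frame layout, taken from the log excerpt in this ticket:
#   "  /path/to/module.so(+0x6d3f22) [0x7ff65d81ef22]"
#   "  /path/to/module.so(_ZN6Thread5beginEv+0x2c) [0x7ff6645c5cbc]"
FRAME_RE = re.compile(
    r"(?P<module>/\S+?)"
    r"\((?P<symbol>[^+()]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
    r"\s+\[0x(?P<addr>[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text (hypothetical helper)."""
    frames = []
    for line in log_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "module": m.group("module"),
                # Empty symbol means the offset is relative to the module itself.
                "symbol": m.group("symbol") or None,
                "offset": int(m.group("offset"), 16),
                "addr": int(m.group("addr"), 16),
            })
    return frames


def addr2line_command(frame):
    """Build an addr2line argument list for a module-relative frame.

    Frames of the form symbol+offset are skipped (None is returned), because
    their offset is relative to the symbol rather than to the module.
    """
    if frame["symbol"] is not None:
        return None
    # -f prints function names, -C demangles C++ names, -e selects the binary.
    return ["addr2line", "-f", "-C", "-e", frame["module"], hex(frame["offset"])]


if __name__ == "__main__":
    sample = (
        '0032F526 2018-11-29 16:17:00.961 189606 1471729 "  /var/lib/HPCCSystems/queries/'
        'thor100_71_5_24200/V4167451050_libW20181129-154014.so(+0x6d3f22) [0x7ff65d81ef22]"\n'
        '0032F527 2018-11-29 16:17:00.961 189606 1471729 "  /opt/HPCCSystems/lib/'
        'libactivityslaves_lcr.so(+0xe44ab) [0x7ff669eca4ab]"\n'
    )
    for frame in parse_frames(sample):
        cmd = addr2line_command(frame)
        if cmd:
            print(" ".join(cmd))
```

The sketch only prints the commands rather than running them, since the binaries live on the thorslave node and may not be present locally.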

 

From Jake:

Not sure what the root cause is, but it's crashing during a Smart Join, whilst spilling.
As the RHS of this join is pretty big and it's spilling quite a bit, it may be better to use a standard join rather than a smart join.
If the bug is in Smart Join, as it appears to be, using a standard join will work around the problem.

Conclusion

None

Fixed

Details

Components

Assignee

Reporter

Priority

Compatibility

Minor

Fix versions

Affects versions

Created November 29, 2018 at 11:21 PM
Updated December 14, 2018 at 12:10 PM
Resolved December 14, 2018 at 12:10 PM