A race condition involving shared spills (e.g. as used by graph results) can cause an assert to be hit and the job to fail

Environment

Production Thor400_60

Description

The problem occurs while running the attribute that does the Normalize of a large dataset. My dataset contains 2.8B records that I'm normalizing. If I run the attribute by itself I don't get an error. It only occurs when I run it as part of the entire build process.

I haven't seen this error on Dataland, which is running internal_5.2.6-rc1.

Please let me know if there is any additional information needed.

Conclusion

None

Activity

Jacob Cobbett-Smith August 21, 2015 at 9:34 AM
Edited

The problem occurs when dealing with a shared spilling stream, in this case the one used by Graph Results, which a helper function is calling.

NB: This bug is likely to only occur when there is a lot of memory contention and spilling.

If, at the moment the stream is created, an OOM event causes a spilling callback to the stream's owner, the stream construction will block on a mutex and the spilling callback will have no knowledge of the pending new stream.
As a consequence, the spilling callback does not set up the new stream correctly (it must prepare the stream to deal with the pending disk read).
When the spilling callback is done, the new stream's constructor completes.
The next time the stream is read, it hits the "assertex(((offset_t)-1) != outputOffset);" assert, which indicates that the stream was not set up.
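
The sequence above can be reproduced in a small, single-threaded sketch. This is not the HPCC-Platform code and not the actual fix: SharedRows, Stream, createStream, checkSpilled, beforeRegister and readsAfterRacingSpill are all hypothetical names, the "disk" is just a vector, and the racing OOM event is simulated by a callback that runs while stream construction is still pending. It only illustrates why a stream registered after the spill ends up with outputOffset still at (offset_t)-1, and one possible way to guard against it (re-checking the owner's spilled state under the lock).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these names come from the HPCC-Platform source.
using offset_t = std::uint64_t;
constexpr offset_t kNoOffset = static_cast<offset_t>(-1);

class SharedRows
{
public:
    struct Stream
    {
        SharedRows &owner;
        std::size_t nextIndex = 0;          // next in-memory row to return
        offset_t outputOffset = kNoOffset;  // "disk" position, valid only once set up

        explicit Stream(SharedRows &o) : owner(o) {}

        // Returns false at end of data.
        bool nextRow(int &row)
        {
            std::lock_guard<std::mutex> guard(owner.lock);
            if (owner.spilled)
            {
                // Plays the role of the failing assert: a spilled owner needs a prepared stream.
                if (outputOffset == kNoOffset)
                    throw std::logic_error("stream was not set up for reading from disk");
                if (outputOffset >= owner.disk.size())
                    return false;
                row = owner.disk[outputOffset++];
                return true;
            }
            if (nextIndex >= owner.memory.size())
                return false;
            row = owner.memory[nextIndex++];
            return true;
        }
    };

    explicit SharedRows(std::vector<int> rows) : memory(std::move(rows)) {}

    // Spilling callback: moves rows to "disk" and prepares every stream it knows about.
    void spill()
    {
        std::lock_guard<std::mutex> guard(lock);
        if (spilled)
            return;
        disk = std::move(memory);
        memory.clear();
        spilled = true;
        for (auto &s : streams)
            s->outputOffset = s->nextIndex;
    }

    // beforeRegister simulates an OOM event arriving while construction is pending.
    // With checkSpilled == false the stream is registered without noticing the spill.
    Stream &createStream(bool checkSpilled, const std::function<void()> &beforeRegister = nullptr)
    {
        std::unique_ptr<Stream> stream(new Stream(*this));
        assert(stream->outputOffset == kNoOffset); // a fresh stream has no disk offset yet
        if (beforeRegister)
            beforeRegister();
        std::lock_guard<std::mutex> guard(lock);
        if (checkSpilled && spilled)
            stream->outputOffset = stream->nextIndex; // set up for the pending disk read
        streams.push_back(std::move(stream));
        return *streams.back();
    }

private:
    std::mutex lock;
    bool spilled = false;
    std::vector<int> memory;
    std::vector<int> disk;
    std::vector<std::unique_ptr<Stream>> streams;
};

// Returns true if the first read from a stream created during a spill succeeds.
bool readsAfterRacingSpill(bool checkSpilled)
{
    SharedRows rows(std::vector<int>{10, 20, 30});
    SharedRows::Stream &s = rows.createStream(checkSpilled, [&rows] { rows.spill(); });
    try
    {
        int row = 0;
        return s.nextRow(row) && row == 10;
    }
    catch (const std::logic_error &)
    {
        return false;
    }
}
```

In this sketch, readsAfterRacingSpill(false) follows the failing path described above, while readsAfterRacingSpill(true) shows the stream being prepared when it is registered after the spill.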

Jacob Cobbett-Smith August 20, 2015 at 3:06 PM

Slave stack:

000072E2 2015-08-18 20:01:39.626 53589 47797 "Backtrace:"
000072E3 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_Z16printStackReportv+0x28) [0x7fc8f66bf198]"
000072E4 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_Z20raiseAssertExceptionPKcS0_j+0x26) [0x7fc8f66c0b96]"
000072E5 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN22CSharedSpillableRowSet7CStream7nextRowEv+0x160) [0x7fc8f46c7460]"
000072E6 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN16CThorGraphResult15getLinkedResultERjRPPh+0x6a) [0x7fc8f46df7ea]"
000072E7 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN17CThorGraphResults15getLinkedResultERjRPPhj+0x2c) [0x7fc8f46ddb0c]"
000072E8 2015-08-18 20:01:39.705 53589 47797 " /var/lib/HPCCSystems/queries/thor400_60_3_22050/V106379470_libW20150818-152417.so(+0x54952a) [0x7fc7f9d6952a]"
000072E9 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN22NormalizeSlaveActivity7nextRowEv+0xe8) [0x7fc8f3f87f18]"
000072EA 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN22NSplitterSlaveActivity10writeaheadEyRKbR9Semaphore+0x14d) [0x7fc8f3f8a6dd]"
000072EB 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN15CSplitterOutput7nextRowEv+0xdd) [0x7fc8f3f891ed]"
000072EC 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN22NSplitterSlaveActivity13CDelayedInput7nextRowEv+0x2a) [0x7fc8f3f8993a]"
000072ED 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN21CProjectSlaveActivity7nextRowEv+0x5d) [0x7fc8f3f92c2d]"
000072EE 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN15CParallelFunnel13CInputHandler4mainEv+0x7e) [0x7fc8f3f1a98e]"
000072EF 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN19CThreadedPersistent4mainEv+0x65) [0x7fc8f6768465]"
000072F0 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN19CThreadedPersistent8CAThread3runEv+0x10) [0x7fc8f676c890]"
000072F1 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2f) [0x7fc8f676832f]"
000072F2 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x1c) [0x7fc8f676717c]"
000072F3 2015-08-18 20:01:39.705 53589 47797 " /lib64/libpthread.so.0(+0x7851) [0x7fc8f049a851]"
000072F4 2015-08-18 20:01:39.705 53589 47797 " /lib64/libc.so.6(clone+0x6d) [0x7fc8f01e890d]"
000072F5 2015-08-18 20:01:39.705 53589 47797 "ERROR: assert(((offset_t)-1) != outputOffset) failed - file: /var/lib/jenkins/workspace/LN-Candidate-5.2.4-1/LN/centos-6.4-x86_64/HPCC-Platform/thorlcr/thorutil/thmem.cpp, line 271"

Kevin Garrity August 19, 2015 at 2:35 PM

ZAP Report already attached. ZAPReport_W20150818-152417_kgarrity_prod.zip

The ECL Error I get is:
System error: 3000: Graph[233], normalize[257]: SLAVE #218 [10.241.60.18:22050]: assert(((offset_t)-1) != outputOffset) failed - file: /var/lib/jenkins/workspace/LN-Candidate-5.2.4-1/LN/centos-6.4-x86_64/HPCC-Platform/thorlcr/thorutil/thmem.cpp, line 271,

Russ Whitehead August 19, 2015 at 2:29 PM

Can you attach a ZAP report? It is available through the workunit view in ECL Watch.

Fixed
Created August 19, 2015 at 2:07 PM
Updated August 21, 2015 at 2:25 PM
Resolved August 21, 2015 at 2:25 PM