A race condition involving shared spills (e.g. as used by graph results) can cause an assert to be hit and the job to fail
Environment
Description
Conclusion
Activity
Jacob Cobbett-Smith August 21, 2015 at 9:34 AM (edited)
The problem occurs with a shared spilling stream, in this case the one used by Graph Results, which a helper function is accessing.
NB: This bug is only likely to occur when there is a lot of memory contention and spilling.
If, at the moment the stream is created, an OOM event triggers a spilling callback on the stream's owner, the stream construction blocks on a mutex and the spilling callback has no knowledge of the pending new stream.
As a consequence, the spilling callback does not set up the new stream correctly (as it must, so that the stream can deal with the pending disk read).
When the spilling callback is done, the new stream's constructor completes.
The next time the stream is read, it hits the assert "assertex(((offset_t)-1) != outputOffset);", which indicates that the stream was not set up.
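The interleaving described above can be sketched in a few lines of C++. This is an illustration only, not the actual CSharedSpillableRowSet code: SketchOwner, SketchStream, registerStream and simulateLateRegistration are hypothetical names, the offset value 0 is a placeholder, and the checkSpilled option shows just one conceivable way of closing the window, which is not necessarily the fix that was applied. The interleaving is replayed sequentially so that the outcome is deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

typedef std::uint64_t offset_t;
const offset_t UNSET_OFFSET = (offset_t)-1;

// Hypothetical stand-in for a stream reading from a shared spillable row set.
struct SketchStream {
    offset_t outputOffset = UNSET_OFFSET;
    bool readyForDiskRead() const { return UNSET_OFFSET != outputOffset; }
};

// Hypothetical stand-in for the owner that receives the spilling callback.
class SketchOwner {
    std::mutex lock;
    bool spilled = false;
    std::vector<SketchStream *> streams;
public:
    // Spilling callback: prepares every stream it knows about for disk reads.
    void spill() {
        std::lock_guard<std::mutex> guard(lock);
        spilled = true;
        for (SketchStream *s : streams)
            s->outputOffset = 0; // placeholder for the real disk position
    }
    // Called during stream construction; takes the same mutex as spill().
    void registerStream(SketchStream &s, bool checkSpilled) {
        std::lock_guard<std::mutex> guard(lock);
        streams.push_back(&s);
        // Assumed mitigation: a late-registering stream notices a prior spill.
        if (checkSpilled && spilled)
            s.outputOffset = 0;
    }
};

// Replays the problematic order: the spill completes while construction
// is still waiting, and only then does the stream register itself.
bool simulateLateRegistration(bool checkSpilled) {
    SketchOwner owner;
    SketchStream stream;
    assert(!stream.readyForDiskRead()); // a fresh stream starts unset
    owner.spill();                      // callback cannot see the new stream
    owner.registerStream(stream, checkSpilled);
    return stream.readyForDiskRead();
}
```

In this sketch, simulateLateRegistration(false) returns false, which corresponds to the state that trips the assert on the next read, while simulateLateRegistration(true) returns true.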
Jacob Cobbett-Smith August 20, 2015 at 3:06 PM
Slave stack:
000072E2 2015-08-18 20:01:39.626 53589 47797 "Backtrace:"
000072E3 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_Z16printStackReportv+0x28) [0x7fc8f66bf198]"
000072E4 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_Z20raiseAssertExceptionPKcS0_j+0x26) [0x7fc8f66c0b96]"
000072E5 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN22CSharedSpillableRowSet7CStream7nextRowEv+0x160) [0x7fc8f46c7460]"
000072E6 2015-08-18 20:01:39.704 53589 47797 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN16CThorGraphResult15getLinkedResultERjRPPh+0x6a) [0x7fc8f46df7ea]"
000072E7 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libgraph_lcr.so(_ZN17CThorGraphResults15getLinkedResultERjRPPhj+0x2c) [0x7fc8f46ddb0c]"
000072E8 2015-08-18 20:01:39.705 53589 47797 " /var/lib/HPCCSystems/queries/thor400_60_3_22050/V106379470_libW20150818-152417.so(+0x54952a) [0x7fc7f9d6952a]"
000072E9 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN22NormalizeSlaveActivity7nextRowEv+0xe8) [0x7fc8f3f87f18]"
000072EA 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN22NSplitterSlaveActivity10writeaheadEyRKbR9Semaphore+0x14d) [0x7fc8f3f8a6dd]"
000072EB 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN15CSplitterOutput7nextRowEv+0xdd) [0x7fc8f3f891ed]"
000072EC 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN22NSplitterSlaveActivity13CDelayedInput7nextRowEv+0x2a) [0x7fc8f3f8993a]"
000072ED 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN21CProjectSlaveActivity7nextRowEv+0x5d) [0x7fc8f3f92c2d]"
000072EE 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libactivityslaves_lcr.so(_ZN15CParallelFunnel13CInputHandler4mainEv+0x7e) [0x7fc8f3f1a98e]"
000072EF 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN19CThreadedPersistent4mainEv+0x65) [0x7fc8f6768465]"
000072F0 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN19CThreadedPersistent8CAThread3runEv+0x10) [0x7fc8f676c890]"
000072F1 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x2f) [0x7fc8f676832f]"
000072F2 2015-08-18 20:01:39.705 53589 47797 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x1c) [0x7fc8f676717c]"
000072F3 2015-08-18 20:01:39.705 53589 47797 " /lib64/libpthread.so.0(+0x7851) [0x7fc8f049a851]"
000072F4 2015-08-18 20:01:39.705 53589 47797 " /lib64/libc.so.6(clone+0x6d) [0x7fc8f01e890d]"
000072F5 2015-08-18 20:01:39.705 53589 47797 "ERROR: assert(((offset_t)-1) != outputOffset) failed - file: /var/lib/jenkins/workspace/LN-Candidate-5.2.4-1/LN/centos-6.4-x86_64/HPCC-Platform/thorlcr/thorutil/thmem.cpp, line 271"
Kevin Garrity August 19, 2015 at 2:35 PM
ZAP Report already attached. ZAPReport_W20150818-152417_kgarrity_prod.zip
The ECL Error I get is:
System error: 3000: Graph[233], normalize[257]: SLAVE #218 [10.241.60.18:22050]: assert(((offset_t)-1) != outputOffset) failed - file: /var/lib/jenkins/workspace/LN-Candidate-5.2.4-1/LN/centos-6.4-x86_64/HPCC-Platform/thorlcr/thorutil/thmem.cpp, line 271,
Russ Whitehead August 19, 2015 at 2:29 PM
Can you attach a ZAP report? It is available through the workunit view in ECL Watch.
The problem occurs while running the attribute that does the Normalize of a large dataset. My dataset contains 2.8B records that I'm normalizing. If I run the attribute by itself I don't get an error. It only occurs when I run it as part of the entire build process.
I haven't seen this error on Dataland, which is running internal_5.2.6-rc1.
Please let me know if there is any additional information needed.