Prevent Thor manager watchdog from stopping (on e.g. deserialization error), causing a build up of orphaned worker MP messages

Description

The Thor manager watchdog runs at the start of each graph and waits for watchdog/progress packets from the workers.
If an exception is thrown while processing one of those packets, the watchdog stops.

Workers continue to send progress packets, and the MP messaging system keeps all of them pending, waiting to be read.
Over time this causes a massive build-up of pending messages, which wastes memory and, I think, also causes a huge slowdown in MP communication between manager and workers
(seen primarily as very slow sorts).
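
To illustrate the failure mode, here is a minimal standalone sketch, not the actual mawatchdog.cpp code. ProgressPacket, deserializeStats, drainStopOnError and drainKeepGoing are hypothetical names made up for this example, and a std::deque stands in for the MP pending-message queue. It contrasts a receive loop that exits on the first bad packet with one that reports the error and carries on reading.

```cpp
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a worker progress packet (not an HPCC type).
struct ProgressPacket {
    unsigned worker;
    bool corrupt; // simulates a payload that fails to deserialize
};

// Hypothetical stand-in for per-packet stats deserialization.
void deserializeStats(const ProgressPacket &p) {
    if (p.corrupt)
        throw std::runtime_error("bad stats payload from worker " + std::to_string(p.worker));
}

struct DrainResult {
    std::size_t processed = 0;       // packets handled successfully
    std::vector<std::string> errors; // messages from failed packets
};

// Behaviour described in this ticket: the try/catch wraps the whole loop,
// so the first exception ends it and later packets are never read.
DrainResult drainStopOnError(std::deque<ProgressPacket> &pending) {
    DrainResult r;
    try {
        while (!pending.empty()) {
            ProgressPacket p = pending.front();
            pending.pop_front();
            deserializeStats(p);
            ++r.processed;
        }
    } catch (const std::exception &e) {
        r.errors.push_back(e.what());
    }
    return r;
}

// One possible remedy: catch per packet, record the error, keep draining.
DrainResult drainKeepGoing(std::deque<ProgressPacket> &pending) {
    DrainResult r;
    while (!pending.empty()) {
        ProgressPacket p = pending.front();
        pending.pop_front();
        try {
            deserializeStats(p);
            ++r.processed;
        } catch (const std::exception &e) {
            r.errors.push_back(e.what());
        }
    }
    return r;
}
```

With four queued packets where the second is corrupt, drainStopOnError leaves the last two unread in the queue, while drainKeepGoing empties the queue and reports a single error. In the real system the workers keep sending, so anything left unread keeps accumulating.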

I believe this is being seen now because a serialization/deserialization issue related to the sub file stats has been introduced in recent builds.

Conclusion

None

Activity

Jacob Cobbett-Smith April 22, 2022 at 11:39 AM

For reference, a stack trace from a recent incident that stopped the Thor manager watchdog:

00003F2A 2022-04-07 15:34:43.094 1562904 1562933 "Backtrace:"
00003F2B 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libjlib.so(_Z16printStackReportx+0x45) [0x7f07a17f64d5]"
00003F2C 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libjlib.so(_Z20raiseAssertExceptionPKcS0_j+0x83) [0x7f07a17f6763]"
00003F2D 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libjlib.so(+0x20ee18) [0x7f07a18ece18]"
00003F2E 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libactivitymasters_lcr.so(_ZN14CIndexReadBase16deserializeStatsEjR12MemoryBuffer+0x48) [0x7f07a932fb98]"
00003F2F 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libgraphmaster_lcr.so(_ZN12CMasterGraph16deserializeStatsEjR12MemoryBuffer+0x1e4) [0x7f07a8c8f244]"
00003F30 2022-04-07 15:34:43.094 1562904 1562933 " /mnt/disk1/var/lib/HPCCSystems/thor400_198_6/thormaster_thor400_198_6() [0x40fe44]"
00003F31 2022-04-07 15:34:43.094 1562904 1562933 " /mnt/disk1/var/lib/HPCCSystems/thor400_198_6/thormaster_thor400_198_6() [0x40ae23]"
00003F32 2022-04-07 15:34:43.094 1562904 1562933 " /mnt/disk1/var/lib/HPCCSystems/thor400_198_6/thormaster_thor400_198_6(_ZThn16_N9CThreaded3runEv+0x10) [0x40b250]"
00003F33 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread5beginEv+0x28) [0x7f07a18fb2b8]"
00003F34 2022-04-07 15:34:43.094 1562904 1562933 " /opt/HPCCSystems/lib/libjlib.so(_ZN6Thread11_threadmainEPv+0x22) [0x7f07a18fa162]"
00003F35 2022-04-07 15:34:43.094 1562904 1562933 " /usr/lib64/libpthread.so.0(+0x7e65) [0x7f079fd74e65]"
00003F36 2022-04-07 15:34:43.094 1562904 1562933 " /usr/lib64/libc.so.6(clone+0x6d) [0x7f079fa9d88d]"
00003F37 2022-04-07 15:34:43.094 1562904 1562933 "assert(SELF::isItem(pos)) failed - file: jarray.hpp, line 269"
00003F38 2022-04-07 15:34:43.095 1562904 1562933 "mawatchdog.cpp(259) : Watchdog Server Exception : assert(SELF::isItem(pos)) failed - file: jarray.hpp, line 269"

 

Fixed
Created April 22, 2022 at 11:37 AM
Updated June 27, 2022 at 3:35 PM
Resolved April 25, 2022 at 4:26 PM
