Thor worker registration timeout causes job to remain blocked indefinitely

Environment

K8s 1.27.7, Ubuntu 22.04.3 LTS

Description

We are seeing instances of jobs going into a blocked state, where eclwatch shows the job as

blocked [ECLagent on thor400-...]

Looking at the eclagent logs, you can see it claiming to queue the job onto thor400:

0000000A OPR INF 2023-12-12 15:20:29.577 43604 43604 W20231212-150757 "hthor build internal_9.4.14-1"
0000000D PRG INF 2023-12-12 15:20:29.606 43604 43604 W20231212-150757 "Loading dll (libW20231212-150757.so) from location /var/lib/HPCCSystems/queries/libW20231212-150757.so"
0000000E PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Starting process"
0000000F PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "RoxieMemMgr: Setting memory limit to 7199522816 bytes (27464 pages)"
00000010 OPR ERR 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "WARNING: The OS is configured to always use transparent huge pages.  This may cause unexplained pauses while transparent huge pages are coalesced. The recommended setting for /sys/kernel/mm/transparent_hugepage/enabled is madvise"
00000011 PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Transparent huge pages used for roxiemem heap"
00000012 PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Memory released to OS in 8192k blocks"
00000013 PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "RoxieMemMgr: 27488 Pages successfully allocated for the pool - memsize=7205814272 base=0x7fe615800000 alignment=262144 bitmapSize=859"
00000014 USR PRO 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Total memory = 7629 MB, query memory = 6866 MB"
00000017 PRG INF 2023-12-12 15:20:29.764 43604 43604 W20231212-150757 "Queueing wuid=W20231212-150757, graph=graph14, on queue=thor400.thor, timelimit=172800 seconds"

But it is not clear whether the job ever attempts to run. When the user aborts the WU and resubmits, the job gets picked up and runs on the next attempt.
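
For context, the "Queueing wuid=..." entry corresponds to the agent placing the graph on the Thor job queue and then blocking until a Thor manager picks it up and reports back; while nothing dequeues or answers, eclwatch shows the blocked state above. Below is a minimal, self-contained sketch of that enqueue-and-wait pattern; the names (ThorQueue, waitForCompletion, etc.) are purely illustrative and are not the actual eclagent/Thor classes:

// Minimal sketch of an enqueue-and-wait handshake (illustrative names only,
// not the real eclagent/Thor code).
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <string>

struct ThorQueue {
    std::mutex m;
    std::condition_variable cv;
    bool graphDone = false;   // would be set by the Thor manager when the graph finishes

    void enqueue(const std::string &wuid, const std::string &graph) {
        std::cout << "Queueing wuid=" << wuid << ", graph=" << graph << "\n";
    }
    // Returns true if the graph completed, false if the time limit expired first.
    bool waitForCompletion(std::chrono::seconds timeLimit) {
        std::unique_lock<std::mutex> lock(m);
        return cv.wait_for(lock, timeLimit, [this] { return graphDone; });
    }
};

int main() {
    ThorQueue queue;
    queue.enqueue("W20231212-150757", "graph14");
    // If the Thor manager never starts (or dies without replying), nothing sets
    // graphDone, so the agent sits in this wait and the WU stays "blocked" until
    // the time limit (172800 s in the log above; shortened here for the demo).
    bool ok = queue.waitForCompletion(std::chrono::seconds(2));
    std::cout << (ok ? "graph completed\n" : "timed out waiting for Thor\n");
    return ok ? 0 : 1;
}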

Attaching the eclagent log and values.yaml.

Conclusion

None

Activity

Jacob Cobbett-Smith December 13, 2023 at 7:14 PM

Ultimately, this was the cause:

Logging from the thormanager job:

00000296 USR WRN 2023-12-12 01:47:58.845 9 9 W20231211-234443 "Timeout waiting for all workers to register within timeout period (60 mins)"
00000297 USR PRO 2023-12-12 01:47:58.845 9 9 W20231211-234443 "Registration aborted"

I don't know why the workers didn't register within 1 hour; my guess would be a resourcing issue.

The reason the job stalled indefinitely is that the manager reported the error to its log and quit, but didn't propagate the error back to the workunit to cause it to abort.
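
In other words, the registration wait needs to fail the workunit, not just log and exit. A rough, self-contained sketch of that shape of fix follows; the types and calls (WorkUnit, setState, waitForWorkerRegistration) are stand-ins chosen for illustration, not the actual Thor manager or workunit interfaces:

// Sketch of propagating a registration timeout back to the workunit
// (hypothetical names, not the real HPCC code).
#include <chrono>
#include <iostream>
#include <string>

enum class WUState { Blocked, Running, Failed };

struct WorkUnit {                                   // stand-in for the workunit interface
    std::string wuid;
    WUState state = WUState::Blocked;
    void addException(const std::string &msg) { std::cerr << wuid << ": " << msg << "\n"; }
    void setState(WUState s) { state = s; }
};

// Stand-in for "wait for all workers to register within the timeout period".
bool waitForWorkerRegistration(unsigned expectedWorkers, std::chrono::minutes timeout) {
    (void)expectedWorkers; (void)timeout;
    return false;                                   // simulate the timeout seen in the log
}

int main() {
    WorkUnit wu;
    wu.wuid = "W20231211-234443";
    if (!waitForWorkerRegistration(400, std::chrono::minutes(60))) {
        std::cerr << "Timeout waiting for all workers to register within timeout period (60 mins)\n";
        // Previously the manager effectively stopped here (log + quit), leaving
        // the WU blocked. The fix is to also report the failure on the workunit:
        wu.addException("Thor worker registration timed out");
        wu.setState(WUState::Failed);               // WU fails instead of staying blocked forever
        return 1;
    }
    wu.setState(WUState::Running);
    return 0;
}

The actual fix would presumably hook into the existing workunit state/exception handling at the same point where the timeout is currently logged, so the agent's wait ends and the WU leaves the blocked state.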

I will issue a fix for that.
In the meantime, I have aborted W20231211-234443.

Russell Wagner III December 13, 2023 at 3:43 PM
Edited

It looks like one of the users didn't abort and resubmit when I noticed the problem and opened that ticket. W20231211-234443 is a still-blocked/queuing example of https://hpccsystems.atlassian.net/browse/HPCC-31009

Jacob Cobbett-Smith December 13, 2023 at 3:38 PM

I may need to look at the system when it's in this state. Could you reach out to me the next time it happens?

Fixed

Details

Created December 12, 2023 at 10:13 PM
Updated January 3, 2024 at 2:22 PM
Resolved January 3, 2024 at 2:21 PM
