Fixed
Details
Assignee: Jacob Cobbett-Smith
Reporter: Russell Wagner III
Priority: Not specified
Created December 12, 2023 at 10:13 PM
Updated January 3, 2024 at 2:22 PM
Resolved January 3, 2024 at 2:21 PM
We are seeing some jobs go into a blocked state, where eclwatch shows the job as:
blocked [ECLagent on thor400-...]
The eclagent log shows it claiming to queue the job onto thor400:
0000000A OPR INF 2023-12-12 15:20:29.577 43604 43604 W20231212-150757 "hthor build internal_9.4.14-1"
0000000D PRG INF 2023-12-12 15:20:29.606 43604 43604 W20231212-150757 "Loading dll (libW20231212-150757.so) from location /var/lib/HPCCSystems/queries/libW20231212-150757.so"
0000000E PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Starting process"
0000000F PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "RoxieMemMgr: Setting memory limit to 7199522816 bytes (27464 pages)"
00000010 OPR ERR 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "WARNING: The OS is configured to always use transparent huge pages. This may cause unexplained pauses while transparent huge pages are coalesced. The recommended setting for /sys/kernel/mm/transparent_hugepage/enabled is madvise"
00000011 PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Transparent huge pages used for roxiemem heap"
00000012 PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Memory released to OS in 8192k blocks"
00000013 PRG INF 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "RoxieMemMgr: 27488 Pages successfully allocated for the pool - memsize=7205814272 base=0x7fe615800000 alignment=262144 bitmapSize=859"
00000014 USR PRO 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 "Total memory = 7629 MB, query memory = 6866 MB"
00000017 PRG INF 2023-12-12 15:20:29.764 43604 43604 W20231212-150757 "Queueing wuid=W20231212-150757, graph=graph14, on queue=thor400.thor, timelimit=172800 seconds"
But it is not clear whether the job ever attempts to run. If the user aborts the WU and resubmits it, the job gets picked up and runs on the next attempt.
Attaching eclagent log and values yaml.
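To check whether other workunits show the same pattern, the "Queueing" entries can be pulled out of an eclagent log with a small script. The following is only a sketch: the regular expression is inferred from the single log line quoted above, not from any documented log format, and the helper name `find_queue_events` is hypothetical.

```python
import re

# Pattern inferred from the "Queueing" line in the log excerpt above.
# Assumption: other eclagent builds may format these fields differently.
QUEUE_RE = re.compile(
    r'^(?P<seq>[0-9A-F]{8}) (?P<aud>\w{3}) (?P<cls>\w{3}) '
    r'(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) '
    r'(?P<pid>\d+) (?P<tid>\d+) (?P<job>\S+) '
    r'"Queueing wuid=(?P<wuid>[^,]+), graph=(?P<graph>[^,]+), '
    r'on queue=(?P<queue>[^,]+), timelimit=(?P<timelimit>\d+) seconds"$'
)


def find_queue_events(lines):
    """Return one dict per log line that reports queueing a graph."""
    events = []
    for line in lines:
        match = QUEUE_RE.match(line.strip())
        if match:
            event = match.groupdict()
            event["timelimit"] = int(event["timelimit"])  # seconds
            events.append(event)
    return events


# Two lines copied from the log excerpt in this ticket.
sample = [
    '00000014 USR PRO 2023-12-12 15:20:29.757 43604 43604 W20231212-150757 '
    '"Total memory = 7629 MB, query memory = 6866 MB"',
    '00000017 PRG INF 2023-12-12 15:20:29.764 43604 43604 W20231212-150757 '
    '"Queueing wuid=W20231212-150757, graph=graph14, on queue=thor400.thor, '
    'timelimit=172800 seconds"',
]

events = find_queue_events(sample)
for e in events:
    print(e["date"], e["time"], e["wuid"], e["graph"], e["queue"], e["timelimit"])
```

Comparing the timestamps of these entries against the Thor-side logs for the same wuid and graph should show whether the queued graph was ever picked up.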