Fixed
Details
Components
Assignee: Jacob Cobbett-Smith
Reporter: Russell Wagner III
Priority: Critical
Fix versions
Affects versions
Created April 11, 2022 at 3:03 PM
Updated April 13, 2022 at 3:20 PM
Resolved April 13, 2022 at 3:20 PM
version: 8.6.16-rc4
We are running a job that fails because slave #1 cannot find specific spill subdirs, which are not present on the node it is running on. When this failure occurs, the thormanager pod vanishes, even with keepJobs: all set. Nothing is written to the debug plane when this occurs. I was only able to gather logs by actively tailing the thormanager log.
I've attached the logs from the thor job. Thor slave #1 should be thorworker-w20220411-142716-graph2-qpsl5. Looking at the logs from that thorworker, I do not see the error messages that the thormanager got.
Looking in that pod's spill dir, I see old WU subdirs as well:
hpcc@thorworker-w20220411-142716-graph2-qpsl5:/var/lib/HPCCSystems$ ls -la spill/
total 76
drwxr-xr-x 19 hpcc hpcc 4096 Apr 11 14:28 .
drwxrwxrwx 7 root root 4096 Apr 11 14:28 ..
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 13:32 W20220411-132643_graph2_0
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_1
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_2
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_3
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_4
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_5
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_1
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_2
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_3
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_4
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_5
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:28 W20220411-142716_graph2_0
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_1
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_2
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_3
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_4
drwxr-xr-x 2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_5
However, I do not see the _graph2_11 through _graph2_20 subdirs that thormanager says slave #1 is missing.
I've attached values.yaml and the thor pod logs.
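The manual check of the spill directory described above could be scripted roughly as follows. This is only a sketch: the function name missing_spill_dirs and its arguments are hypothetical, and it assumes the subdirs follow the <WUID>_graph2_<N> naming seen in the listing. It prints each expected subdir that is absent under the given spill root.

```shell
#!/bin/sh
# Hypothetical helper (not part of HPCC): print every expected spill subdir
# that does not exist under the spill root.
# Usage: missing_spill_dirs SPILL_ROOT PREFIX FIRST LAST
missing_spill_dirs() {
    spill_root=$1
    prefix=$2
    i=$3
    last=$4
    while [ "$i" -le "$last" ]; do
        dir="$spill_root/${prefix}_$i"
        # Report the path only when the directory is missing.
        [ -d "$dir" ] || printf '%s\n' "$dir"
        i=$((i + 1))
    done
}

# Example with the values from this report (run inside the thorworker pod):
# missing_spill_dirs /var/lib/HPCCSystems/spill W20220411-142716_graph2 11 20
```

Running the commented example inside the thorworker pod should list each of the _graph2_11 through _graph2_20 subdirs that is absent, which can then be compared with the errors thormanager reports for slave #1.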