spill subdirs missing on worker #1

Description

version: 8.6.16-rc4

We are running a job that fails because slave #1 cannot find specific spill subdirs, which are not present on the node it is running on. When this failure occurs, the thormanager pod vanishes, even with keepJobs: all set, and nothing is written to the debug plane. I was only able to gather logs by actively tailing the thormanager log.

I've attached the logs from the thor job. Thor slave #1 should be thorworker-w20220411-142716-graph2-qpsl5. Looking at the logs from that thorworker, I do not see the error messages that the thormanager got.

Looking in that pod's spill dir, I see old WU subdirs as well:
hpcc@thorworker-w20220411-142716-graph2-qpsl5:/var/lib/HPCCSystems$ ls -la spill/
total 76
drwxr-xr-x 19 hpcc hpcc 4096 Apr 11 14:28 .
drwxrwxrwx  7 root root 4096 Apr 11 14:28 ..
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 13:32 W20220411-132643_graph2_0
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_1
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_2
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_3
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_4
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 13:34 W20220411-132643_graph2_5
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_1
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_2
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_3
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_4
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:02 W20220411-140027_graph2_5
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:28 W20220411-142716_graph2_0
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_1
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_2
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_3
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_4
drwxr-xr-x  2 hpcc hpcc 4096 Apr 11 14:29 W20220411-142716_graph2_5

 

However, I do not see the _graph2_11 through _graph2_20 subdirs that thormanager says slave #1 is missing.
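The manual comparison above could be scripted. The following is a minimal sketch, assuming the `<wuid>_<graph>_<n>` naming seen in the listing; the function name `missing_spill_subdirs` and the index range are illustrative and not part of HPCC.

```python
import os
import tempfile


def missing_spill_subdirs(spill_root, wuid, graph, indices):
    """Return the expected subdir names that are not present under spill_root.

    Assumes subdirs are named "<wuid>_<graph>_<n>", as in the listing above.
    """
    missing = []
    for i in indices:
        name = f"{wuid}_{graph}_{i}"
        if not os.path.isdir(os.path.join(spill_root, name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Demo against a throwaway directory that mimics the listing (subdirs 0..5).
    with tempfile.TemporaryDirectory() as root:
        for i in range(0, 6):
            os.mkdir(os.path.join(root, f"W20220411-142716_graph2_{i}"))
        # Report which of subdirs 0..20 are absent.
        for name in missing_spill_subdirs(root, "W20220411-142716", "graph2", range(0, 21)):
            print(name)
```

On a real worker pod the first argument would be the spill directory, for example /var/lib/HPCCSystems/spill.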

I've attached values.yaml and the thor pod logs.

Conclusion

None

Attachments

3

Activity

Tony Kirk April 12, 2022 at 8:21 PM

Confirming success, as expected.

Jacob Cobbett-Smith April 12, 2022 at 8:11 PM

The thor worker pods that did not hit this temp directory issue were ones that happened to be placed on the same node as a pod that shared the same plane and did run a forcePermissions init container (the thor manager pod).

Russell Wagner III April 12, 2022 at 7:02 PM

Jake provided a possible fix: using this file as my helm templates/thor.yaml. I've confirmed we did not have a job failure after using it.

https://raw.githubusercontent.com/hpcc-systems/HPCC-Platform/a68428d11a02536cde38a3b17b81f2764e252c22/helm/hpcc/templates/thor.yaml

Russell Wagner III April 12, 2022 at 6:15 PM

  1. Found that it occurs even when no scaling happens.

  2. forcePermissions is needed because otherwise the hostPath mount is owned by root:root, causing permission denied failures.

  3. Ran the test query and reported the results to you. It appears that thorworkers are not enforcing forcePermissions.
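The ownership problem in point 2 can be checked from inside a pod. This is a rough sketch only: `describe_access` is a hypothetical helper name, and it merely reports who owns a directory and whether the current user can create entries in it. It does not reproduce what forcePermissions itself does.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Report owner, group, mode and writability of a directory for the current user."""
    st = os.stat(path)
    return {
        "uid": st.st_uid,
        "gid": st.st_gid,
        "mode": stat.filemode(st.st_mode),
        # Creating subdirs needs both write and search (execute) permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Demo on a temp dir; on a worker pod, pass the spill mount path instead.
    with tempfile.TemporaryDirectory() as d:
        print(describe_access(d))
```

A mount owned by root:root would show uid 0 and gid 0, and "writable" would be False for a non-root user such as hpcc unless the mode allows other users to write.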

Jacob Cobbett-Smith April 12, 2022 at 3:19 PM

Summary of questions/things to try:

1) Does this happen only when the node pool has to scale? I.e., if the job has already run and failed, and is run again after the pool has scaled, does it still fail?
2) Why is forcePermissions needed on the spill plane? (I am wondering whether it and multiple pods on the same node are interfering with each other during the chown that forcePermissions performs.)
3) Ahead of the fix to https://hpccsystems.atlassian.net/browse/HPCC-27504#icft=HPCC-27504 and as a way to provide additional insight, we can run a test query that is deliberately slow and then, whilst it is running, bash into the pods and inspect the temp dirs, which should exist as soon as the query starts (test query attached).
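The inspection in point 3 could also be done by polling rather than by eye. A minimal sketch, assuming the temp dirs appear under the spill directory with the workunit id as a name prefix; `wait_for_spill_subdirs` and its default timings are hypothetical.

```python
import os
import time


def wait_for_spill_subdirs(spill_root, wuid, timeout=30.0, interval=1.0):
    """Poll spill_root until entries prefixed "<wuid>_" appear or the timeout expires.

    Returns the sorted list of matching names (empty if none appeared in time).
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            found = sorted(
                n for n in os.listdir(spill_root) if n.startswith(wuid + "_")
            )
        except FileNotFoundError:
            # The spill directory itself may not exist yet.
            found = []
        if found or time.monotonic() >= deadline:
            return found
        time.sleep(interval)
```

Run inside each worker pod while the slow query executes; an empty result after the timeout would suggest the temp dirs were never created on that node.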

Fixed

Created April 11, 2022 at 3:03 PM
Updated April 13, 2022 at 3:20 PM
Resolved April 13, 2022 at 3:20 PM
