Some orphaned temp files never being deleted

Description

For temporary files based on generated temporary outputs, e.g. graph disk write spills, the temporary directory for the in-flight partial outputs are created based on the temp directory defined by the master. Each slave has it's own temporary subdirectory.

When the slave starts up, it deletes any orphaned temporaries, it will not delete the files that are in the parent temp. directory whose paths have been designated by the master.
However, any file that is the progress of being written by a slave, is tracked by Files-In-Progress mechanism, which keeps a written log of files in flight.
When a file is done, it is removed from this list and the list is committed to disk.

On slave startup, anything still remaining in this list is deleted.

So all temporary files should be deleted one way or another via one of these mechanisms.
However there is evidence that the temporary files in the parent directory named by the master are not always being deleted and eventually building up / consuming space.

Conclusion

None

Activity

Show:

Jacob Cobbett-Smith April 14, 2016 at 12:30 PM

You should be able to kill -9 the slave process mid-flight, before any error and it should still delete the file (on restart)
The above log sequence may be interesting too though and related.
However whatever happened here, an unsuccessful abort, caused Thor to recycle, i.e. master exited and the script killed the slave(s).

The main issue is that the temp files were not where the slave expected them to be, if they had been, they would have been deleted on slave restart.

Mark Kelly April 12, 2016 at 6:52 PM
Edited

It appears registerFile() is being called but then a graph of the job is aborted, the slave restarts, but deregisterFile() or clearTemps() is never called and thus the file is not removed.

Here is a sequence:

0006151A 2016-02-23 12:00:20.852 32342 32342 "Started wuid=W20160223-114942, user=*, graph=graph1

00061550 2016-02-23 12:00:24.178 32342 25098 "registerTmpFile name=/var/lib/HPCCSystems/thor400_64b_linking/temp/1__w20160223-114942._26_of_400, usage
Count=1"

0006359D 2016-02-23 12:15:14.073 32342 32342 "GraphAbort: W20160223-114942graph1"
0006359E 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(diskwrite, 204)"
0006359F 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(degroup, 202)"
000635A0 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(sort, 201)"
000635A1 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(group, 200)"
000635A2 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(sort, 185)"
000635A3 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(hashdistribute, 184)"
000635A4 2016-02-23 12:15:14.073 32342 32342 "Abort condition set - activity(diskread, 182)"
000635A5 2016-02-23 12:15:14.078 32342 14648 "Distribute send finishing - activity(hashdistribute, 184)"
000635A6 2016-02-23 12:15:14.078 32342 14648 "HDIST: waiting for threads - activity(hashdistribute, 184)"
000635A7 2016-02-23 12:15:20.428 32342 32342 "Finished wuid=W20160223-114942, graph=graph1"
000635A8 2016-02-23 12:15:20.428 32342 32342 "QueryDone, removing W20160223-114942graph1 from jobs"
000635A9 2016-02-23 12:15:20.428 32342 32342 "Job ended : graph1"

shortly after this ...

000635D1 2016-02-23 12:16:19.656 32342 1408 "ERROR: 4: /var/lib/jenkins/workspace/LN-Candidate-withplugins-5.4.6-1/LN/centos-6.4-x86_64/HPCC-Platform
/system/jlib/jthread.cpp(302) : CGraphProgressHandler : MP link closed (10.194.64.201:16500)"
000635D2 2016-02-23 12:16:19.656 32342 1408 "ERROR: 4: /var/lib/jenkins/workspace/LN-Candidate-withplugins-5.4.6-1/LN/centos-6.4-x86_64/HPCC-Platform
/thorlcr/slave/slavmain.cpp(122) : No job active exception: : MP link closed (10.194.64.201:16500)"
000635D3 2016-02-23 12:16:34.227 32342 32342 "Unregistering slave : 10.194.64.26:16600"
000635D4 2016-02-23 12:16:34.227 32342 32342 "ERROR: 4: /var/lib/jenkins/workspace/LN-Candidate-withplugins-5.4.6-1/LN/centos-6.4-x86_64/HPCC-Platform
/thorlcr/slave/thslavemain.cpp(190) : slave unregistration error : MP link closed (10.194.64.201:16500)"

slave restart ...

00000000 2016-02-23 12:17:00.538 25753 25753 "Opened log file //10.194.64.26/mnt/disk1/var/log/HPCCSystems/thor400_64b_linking/thorslave.26.2016_02_23
.log"
00000001 2016-02-23 12:17:00.538 25753 25753 "Build internal_5.4.6-1"
00000002 2016-02-23 12:17:00.563 25753 25753 "registering 10.194.64.26:16600 - master 10.194.64.201:16500"
00000003 2016-02-23 12:17:17.017 25753 25753 "Initialization received"
00000004 2016-02-23 12:17:17.017 25753 25753 "Master build: internal_5.4.6-1"
00000005 2016-02-23 12:17:17.017 25753 25753 "Registration confirmation sent"
00000006 2016-02-23 12:17:17.038 25753 25753 "verifying mp connection to rest of cluster"
00000007 2016-02-23 12:17:18.126 25753 25753 "verified mp connection to rest of cluster"
00000008 2016-02-23 12:17:18.126 25753 25753 "registered 10.194.64.26:16600"
00000009 2016-02-23 12:17:18.126 25753 25753 "setIORetryCount set to : 0"
0000000A 2016-02-23 12:17:18.126 25753 25753 "Disk space: /var/lib/HPCCSystems/hpcc-data/thor = 445416 MB, /var/lib/HPCCSystems/hpcc-mirror/thor = 557
887 MB, /var/lib/HPCCSystems/thor400_64b_linking/temp = 445416 MB"

and temp file remains

rw-rr- 1 hpcc hpcc 9890089 Feb 23 12:00 /var/lib/HPCCSystems/thor400_64b_linking/temp/1__w20160223-114942._26_of_400

Jacob Cobbett-Smith February 22, 2016 at 10:51 AM

Yes, I know what the issue is.
It's caused when a temp file output (disk write activity) is named by thor master. It's name and directory is based on the temp directory configuration.
However the slave processes all use an individual temp directory (based on port), partly to avoid any potential clashing.

As a consequent, these disk write temps are written outside of the slave temp dir. and are not autocleared up on termination or restart.

The fix is probably to remap these temps into the correct slave temp dirs.

Mark Kelly February 16, 2016 at 3:34 PM

There is some evidence that both a failed job and an aborted job leave behind temp files.

Fixed
Pinned fields
Click on the next to a field label to start pinning.

Details

Components

Assignee

Reporter

Priority

Compatibility

Major

Fix versions

Pull Request URL

Created December 3, 2015 at 6:41 PM
Updated April 19, 2016 at 8:35 AM
Resolved April 19, 2016 at 8:35 AM