checkAbnormalTermination might (rarely) incorrectly report that a job failed

Description

The workunit routine waitForWorkUnit() loops, querying the WU state. If the state is running, it gets the WU session id and then checks whether that session is still present. There is a very rare but real chance that the WU ended cleanly between the query for the running state and the check that the session id is still present.
We could call conn->reload() after the session-id check fails, and then query the state again to see whether it has changed to completed/failed/aborted, in which case we know it is not an unexpected termination. But is that really doing anything meaningful? I am not sure what the check for the session id being present is really for.
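
The proposal above can be sketched as follows. This is only an illustration of the re-check idea, not the actual HPCC code: `FakeWorkunit`, `WUState`, `looksAbnormallyTerminated()` and the `liveSessions` set are hypothetical stand-ins, and `reload()` merely plays the role that conn->reload() would play in the real routine.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-ins for illustration; these are NOT the real HPCC types.
enum class WUState { Running, Completed, Failed, Aborted };

struct FakeWorkunit {
    WUState cached = WUState::Running;  // state as last read by this client
    WUState stored = WUState::Running;  // state currently held by the store
    uint64_t sessionId = 0;             // session of the process running the WU

    void reload() { cached = stored; }  // plays the role of conn->reload()
    WUState getState() const { return cached; }
};

static bool isTerminal(WUState s) { return s != WUState::Running; }

// Returns true only if the workunit still claims to be running after its
// owning session has gone AND a fresh read of the state still says "running".
bool looksAbnormallyTerminated(FakeWorkunit &wu,
                               const std::set<uint64_t> &liveSessions)
{
    if (isTerminal(wu.getState()))
        return false;                   // already finished, nothing to check
    if (liveSessions.count(wu.sessionId) != 0)
        return false;                   // owner session is still present
    // Session is gone. The WU may have ended cleanly between the two checks
    // above, so refresh the state before concluding anything.
    wu.reload();
    return !isTerminal(wu.getState());
}

// Usage sketch: the WU completed cleanly, but this client holds a stale
// "running" state and the session has already been removed.
void demoCleanExitRace()
{
    FakeWorkunit wu;
    wu.sessionId = 0x30028f491ULL;
    wu.stored = WUState::Completed;
    std::set<uint64_t> liveSessions;    // empty: session already stopped
    assert(!looksAbnormallyTerminated(wu, liveSessions));
}
```

With the second read in place, a clean exit that lands between the two checks is no longer reported as a failure, while a WU whose session vanished and whose stored state is still running is still flagged.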

Conclusion

None

Activity

Mark Kelly August 25, 2016 at 6:25 PM

Closing and issuing a new PR to candidate-6.0.6 after review comments.

Mark Kelly August 2, 2016 at 2:10 PM

The PR shows one possible solution.
The log files confirm what can happen, as seen below.

eclagent:
--------------
00000092 2016-07-18 15:46:09.667 17308 17308 "Releasing run lock"
00000093 2016-07-18 15:46:09.669 17308 17308 "Releasing persist read locks"
00000094 2016-07-18 15:46:09.669 17308 17308 "Released persist read locks"
00000095 2016-07-18 15:46:09.669 17308 17308 "Process complete"
00000096 2016-07-18 15:46:09.670 17308 17308 "check sessionStopped(30028f015, 0)"
00000097 2016-07-18 15:46:09.670 17308 17308 "getProcessSessionNode(30028f015) returning node.getClear()"
00000098 2016-07-18 15:46:09.670 17308 17308 "sessionStopped() timeout is zero"
00000099 2016-07-18 15:46:09.676 17308 17308 "Workunit written complete"

esp:
------
0000D8E2 2016-07-18 15:46:09.851 26604 17329 "check sessionStopped(30028f491, 0)"
0000D8E3 2016-07-18 15:46:09.852 26604 17329 "getProcessSessionNode(30028f491) node->endpoint().isNull() is true"
0000D8E4 2016-07-18 15:46:09.852 26604 17329 "sessionStopped() getProcessSessionNode(id) returns NULL"
0000D8E5 2016-07-18 15:46:09.852 26604 17329 "WARNING: checkAbnormalTermination: workunit terminated: 12887585937 state = 4"

dali:
------
000031FA 2016-07-18 15:46:09.712 25830 26048 "Session starting 30028f491 (172.16.99.24:7473) : role=EclAgent"
000031FB 2016-07-18 15:46:09.847 25830 26048 "getProcessSessionNode(30028f491) getClientProcessEndpoint() ok - about to return createINode()"
000031FC 2016-07-18 15:46:09.849 25830 26048 "Session stopping 30028f491 ok"
000031FD 2016-07-18 15:46:09.852 25830 26048 "getProcessSessionNode(30028f491) getClientProcessEndpoint() not ok"
000031FE 2016-07-18 15:46:09.860 25830 25831 "Client closed (172.16.99.24:7473)"

Mark Kelly July 29, 2016 at 7:46 PM

I'll submit a PR to show the issue and a possible solution.

Gavin Halliday July 29, 2016 at 2:40 PM

Any comments?

Fixed

Created July 27, 2016 at 7:13 PM
Updated August 31, 2016 at 4:40 PM
Resolved August 31, 2016 at 4:40 PM