stopdali and dfuserver and their inits living past stop_component call

Environment

centos 6

Description

The killed functions in the init_dfuserver and init_dali scripts call binaries to shut down their respective processes. You can watch the stopdali and dfuserver STOP=1 commands hang after hpcc-init has reported the component as shut down.

Conclusion

None

Activity

Michael Gardner May 14, 2015 at 6:26 PM

Mark made a necessary change recently that fixed a looping issue in 5.2.0 -> 5.2.2. With this change we now use check_status as the primary way of checking the health of a component within the loop. It made everything much simpler, but it caused a problem: a return value of 4 could mean that a process was still alive, yet a success message would still be shown.

I modified the RESULT test to only show success if the value was '1' (Stopped in healthy state.) and I moved the unlock and removePid to before the loop so it won't immediately fail. This is what was causing issues I was seeing in 5.4 where dali and dfuserver were having issues on restart (because they were still trying to shut down) and I initially thought it was a problem with the stopdali and dfuserver action=STOP calls. The bug that Srinivas Sivasubramanian had the other day with 5.2.0 helped me isolate and find the issue when I was testing 5.2.2.
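The change described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the actual hpcc-init code: check_status, unlock and removePid are replaced by stand-in stubs, and stop_component_sketch, POLLS_UNTIL_STOPPED, polls and log are hypothetical names introduced only for this example. The only behaviour taken from the comment is that unlock and removePid run before the loop and that success is reported only when check_status returns 1.

```shell
#!/bin/bash
# Stand-in stubs for illustration only; NOT the real hpcc-init helpers.
POLLS_UNTIL_STOPPED=3   # hypothetical: the fake component needs 3 polls to exit
polls=0
log=""

unlock()    { log="$log unlock"; }
removePid() { log="$log removePid"; }

# Stub: returns 1 ("stopped in healthy state") once enough polls have happened,
# and 4 before that. Per the comment above, 4 does not prove the process is gone.
check_status() {
    polls=$((polls + 1))
    if [ "$polls" -ge "$POLLS_UNTIL_STOPPED" ]; then
        return 1
    fi
    return 4
}

stop_component_sketch() {
    local max_polls=${1:-10}
    local result
    local i=0

    # Release the lock file and pid file before polling, not after.
    unlock
    removePid

    while [ "$i" -lt "$max_polls" ]; do
        result=0
        check_status || result=$?
        # Only a return value of 1 counts as a successful stop.
        if [ "$result" -eq 1 ]; then
            log="$log stopped"
            return 0
        fi
        i=$((i + 1))   # a real loop would also sleep between polls
    done

    log="$log timeout"
    return 1
}
```

Calling `stop_component_sketch 10` with these stubs polls until the stub reports 1; any other value, including 4, keeps the loop going instead of printing success.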

Michael Gardner May 7, 2015 at 8:51 PM

So there are two issues.

1) We don't unlock the lock file before we eval the stop command within stop_component. This means that check_status is going to return 4 from within stop_component, and it will say the component is stopped before it has actually shut down properly.

2) There is an issue where PIDPATH gets overwritten somewhere within (I suspect) the pid.sh file. I'm hunting it down now.

Jacob Cobbett-Smith May 6, 2015 at 10:08 AM

The stop handler / signal handler should already shut down dali/dfuserver cleanly.
So the STOP=1 method should not be necessary for this purpose.
But they do offer a way to remotely quit Dali for example, so I'm not sure they should be completely removed.

However, for service start/stop the signal handlers are a better route, and I think the STOP=1 calls should be removed.
As I mentioned in another PR, they can also deadlock if the process they're trying to talk to is already down: they then sit there trying to talk to a port of a process which is no longer there.
Which may be what you've been seeing.
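As a generic illustration of how a call that may block on a dead peer can be bounded, here is a small sketch using the GNU coreutils `timeout` utility. This is not something proposed in this ticket, and run_with_deadline is a hypothetical helper name; the command being wrapped is whatever stop call is in use.

```shell
#!/bin/bash
# Hypothetical helper: run a command, but give up after a number of seconds.
run_with_deadline() {
    local seconds=$1
    shift
    local rc=0

    # GNU coreutils timeout exits with status 124 when the deadline is hit.
    timeout "$seconds" "$@" || rc=$?

    if [ "$rc" -eq 124 ]; then
        echo "gave up after ${seconds}s: $*" >&2
    fi
    return "$rc"
}
```

The caller can then treat a non-zero status as "the stop request did not complete" and fall back to the signal-based path rather than hanging.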

Michael Gardner May 5, 2015 at 1:06 PM

I was able to get it to happen once with dfuserver in 5.0.14, but not with dali. And I can't seem to recreate it with dfuserver in any consistent manner in the 5.0.14 branch.

Michael Gardner May 5, 2015 at 12:23 PM

OK, let's think just about Dali here. Is this not handled by the signal handler already? Does stopServer() not shut down cleanly? If it does, then we should probably change the kill_process function to wait longer before sending a SIGKILL after the initial SIGTERM.

What's the best way to handle this?
1) daserver gets a SIGTERM from kill_process and shuts down cleanly. The init script waits for the process to die before cleaning up the PID_NAME file.
2) Call another program, dalistop, to send a message to the daserver instance and have it shut down cleanly, then have kill_process immediately try to do a SIGKILL on the process if it still exists after dalistop returns?
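Option 1 could look something like the sketch below: send SIGTERM, poll until the process is gone, and only escalate to SIGKILL after a grace period. This is an assumption-laden illustration, not the existing kill_process function; terminate_with_grace and its grace-period parameter are hypothetical names.

```shell
#!/bin/bash
# Hypothetical helper, not the real kill_process.
# Usage: terminate_with_grace PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was needed.
terminate_with_grace() {
    local pid=$1
    local grace=${2:-10}
    local waited=0

    # If the signal cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # kill -0 sends no signal; it only tests whether the pid still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With something like this, the pid file cleanup would run only after the function returns, so the init script does not report the component as stopped while it is still shutting down.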

Fixed
Details

Created May 4, 2015 at 5:25 PM
Updated May 18, 2015 at 9:16 AM
Resolved May 18, 2015 at 9:16 AM
