Post-mortem debug ability in cloud

Description

We should add a process that runs after the main process in a pod terminates and, if the main process terminated abnormally, preserves information useful for debugging.

We can use a sentinel file to determine whether the main process terminated cleanly.
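
A minimal sketch of the sentinel check, assuming the main process writes a marker file as its last step and the post-mortem step looks for it. The file name and the `run_dir` argument are hypothetical; the ticket does not specify either.

```python
import os

# Hypothetical sentinel name; the ticket does not say what the file is called.
SENTINEL_NAME = "clean_exit.sentinel"


def mark_clean_exit(run_dir):
    """Called by the main process as its final step before a normal exit."""
    path = os.path.join(run_dir, SENTINEL_NAME)
    with open(path, "w") as f:
        f.write("ok\n")
    return path


def terminated_cleanly(run_dir):
    """Called by the post-mortem step: True only if the sentinel exists."""
    return os.path.exists(os.path.join(run_dir, SENTINEL_NAME))
```

If the process crashes or is killed, the marker is never written, so a missing file is treated as an abnormal termination.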

Information to gather includes:
- Core files (optional)
- Extracted info from core files: all stacks, variables, etc.
- Syslog info
- Full log (may be more than goes to ELK)
- Spill files that were the input to the current graph (optional)
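
One possible way to extract all stacks and variables from a core file is gdb in batch mode. This is a sketch only: it assumes gdb is available in the image, and the binary and core paths passed in are placeholders supplied by the caller.

```python
import subprocess


def gdb_stack_command(binary, core):
    """Build a gdb batch command that prints every thread's stack with locals."""
    return [
        "gdb", "--batch",
        "-ex", "thread apply all bt full",  # all threads, full backtrace with locals
        binary, core,
    ]


def extract_stacks(binary, core, out_path):
    """Run gdb and save its combined output to out_path; returns gdb's exit code."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_stack_command(binary, core),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

Note that `bt full` prints local variables, so the output may need the same handling as the core file itself if it can contain sensitive data.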

Conclusion

None

Activity

Gavin Halliday February 21, 2022 at 3:44 PM

Gavin Halliday February 17, 2022 at 4:02 PM

Gavin Halliday February 14, 2022 at 6:11 PM

Also see https://github.com/hpcc-systems/HPCC-Platform/pull/15771, which fixes a problem with non-debug builds.

Gavin Halliday January 31, 2022 at 10:59 AM

Also include k8s information, especially details of OOM kills.
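
A sketch of how OOM kills could be picked out of the pod status, assuming the pod description is available as parsed JSON (for example from `kubectl get pod <name> -o json`). Kubernetes reports a terminated container with `reason: OOMKilled` under `state` or `lastState`; the function name and the sample pod in the usage note are made up for illustration.

```python
def oom_killed_containers(pod):
    """Return the names of containers whose current or last state is OOMKilled."""
    names = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break  # count each container once
    return names
```

Usage, with an illustrative (not real) pod document:

```python
pod = {"status": {"containerStatuses": [
    {"name": "thor", "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
]}}
print(oom_killed_containers(pod))
```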

Richard Chapman January 27, 2022 at 12:51 PM

What info would be useful?
- Every stack of every process (so do we need gdb?)
  o With local variables? But they may contain PII.
- Disassembly of ‘top’ function
- Full rather than concise logs
- Core files (but there are security concerns)?
  o Generate locally then copy? Or redirect cores straight to destination?
  o Destination is WU dependent so better to generate locally first.
  o By default, just save the extracted info not the whole core file (too big).
- System logs
- Stderr and stdout?
- Node-level system logs?
- Resource stats like file handle counts, disk space, etc.
- Current workunit
- Info on what else is running (at node level)?
- k8s pod event log
- The spill files that the current subgraph was reading?
- Current locks on Dali
- May need more earlier in the migration…
- May need ZAP to be encrypted
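
As an illustration of the resource stats item in the list above, a small collector could look like the following. It is a sketch under stated assumptions: it relies on Linux `/proc` for the file handle count, and it can only inspect processes that are still alive (by default the collector itself), not the process that has already terminated.

```python
import os
import shutil


def resource_stats(path="/", pid=None):
    """Collect disk usage for `path` and the open file handle count for `pid`."""
    usage = shutil.disk_usage(path)
    stats = {
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }
    # Linux only: each entry in /proc/<pid>/fd is one open file descriptor.
    fd_dir = "/proc/{}/fd".format(pid if pid is not None else os.getpid())
    if os.path.isdir(fd_dir):
        stats["open_fds"] = len(os.listdir(fd_dir))
    return stats
```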

What to gather, versus what to put in default ZAP?

Where to gather it?
- Directory in ‘debug’ plane, with reference to it in the log
  o Multiple pods? If any pod crashes, the master tells all of them to do the post-crash collection.
- Some of it may go to the log
- Retrievable via parameterized ZAP request
- When are they deleted? Depends on size…
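
A rough sketch of the "directory in the debug plane, with a reference in the log" idea. The layout (`<debug_root>/<wuid>/<pod_name>`), the manifest file and all parameter names are assumptions for illustration, not an agreed design.

```python
import json
import os
import shutil
import time


def store_artifacts(debug_root, wuid, pod_name, files):
    """Copy collected files into a per-workunit, per-pod directory and log its path."""
    dest = os.path.join(debug_root, wuid, pod_name)
    os.makedirs(dest, exist_ok=True)
    copied = []
    for src in files:
        if os.path.isfile(src):  # skip anything that was not produced
            shutil.copy2(src, dest)
            copied.append(os.path.basename(src))
    manifest = {"wuid": wuid, "pod": pod_name, "created": time.time(), "files": copied}
    with open(os.path.join(dest, "manifest.json"), "w") as f:
        json.dump(manifest, f, indent=2)
    # The reference in the log lets a later ZAP request find the directory.
    print("post-mortem info saved to " + dest)
    return dest
```

Keeping one subdirectory per pod means several pods of the same workunit can write without clashing, and a size recorded in the manifest could later drive the deletion policy.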

Implementation
- A post-termination action on pods that gathers it
- Some of the above are useful/available on a pod eviction…
- Debug plane can be on cheap storage
- Possible coding exercise
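
One possible shape for the post-termination action is a thin wrapper around the main process that calls a collection hook when the exit is abnormal. This is a sketch, not the implemented mechanism; `run_with_post_mortem` and the `on_abnormal_exit` callback are hypothetical names.

```python
import subprocess


def run_with_post_mortem(cmd, on_abnormal_exit):
    """Run `cmd`; if it exits non-zero or is killed by a signal, call the hook."""
    result = subprocess.run(cmd)
    rc = result.returncode
    # On POSIX a negative return code means the child was killed by a signal.
    if rc != 0:
        on_abnormal_exit(rc)
    return rc
```

A wrapper like this cannot run if the whole container is killed, which is one reason a sentinel file checked by a separate step may still be needed.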

Fixed
Created January 27, 2022 at 12:16 PM
Updated February 21, 2022 at 3:44 PM
Resolved February 10, 2022 at 6:18 PM