Avoid keeping the graph property tree in memory

Description

When roxie loads a query it accesses the property tree for the graph. The property tree can be large for complex queries, and the sizes add up when you have large numbers of queries deployed.

Currently the tree is cached within the workunit code in case it is re-requested, but that is almost certainly not worth it. Many of the other places that access the graphs have no benefit in holding on to it, and when the stats are also merged into the tree it will mean the tree is unnecessarily cloned.

Conclusion

Default changed to not keep the graph in memory in version 9.8 and later. It is possible that this should be cherry-picked into 9.8.0 if it proves worthwhile.

Activity

HPCC JiraBot May 22, 2024 at 12:27 PM

Jirabot Action Result:
Workflow Transition: Merge Pending
Additional PR: https://github.com/hpcc-systems/HPCC-Platform/pull/18690

HPCC JiraBot April 10, 2024 at 11:28 AM

Jirabot Action Result:
Workflow Transition: Merge Pending
Updated PR

Gavin Halliday March 22, 2024 at 12:03 PM

Note: It would be relatively easy to add a list of files. It could reuse the existing workunit->addFile() function, but add two new file types to WUFileKind: WUFileInputIndex and WUFileInputFile. Some care would be needed to ensure the extra entries did not.

It would have the extra advantage that roxie queries would have a list of inputs associated with them. That in itself would be a useful feature.
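
As a rough illustration of how a separate input list could feed a query hash without expanding any graph, here is a minimal stand-alone sketch. Everything in it is hypothetical: `InputKind`, `QueryInput` and `hashQueryInputs` are illustrative names, not the real `WUFileKind` enum or workunit interface, and FNV-1a is used only because it is short. It models just the file-list part of the hash.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-ins for the two proposed file kinds; not the real WUFileKind.
enum class InputKind { Index, File };

struct QueryInput
{
    std::string name;
    InputKind kind;
};

// One FNV-1a pass over a string, continuing from an existing hash value.
inline uint64_t fnv1a(uint64_t hash, const std::string & text)
{
    for (unsigned char c : text)
    {
        hash ^= c;
        hash *= 1099511628211ULL; // FNV 64-bit prime
    }
    return hash;
}

// Hash the inputs in a canonical (sorted) order so the result does not
// depend on the order in which the inputs were recorded.
uint64_t hashQueryInputs(std::vector<QueryInput> inputs)
{
    std::sort(inputs.begin(), inputs.end(),
              [](const QueryInput & l, const QueryInput & r)
              { return std::tie(l.name, l.kind) < std::tie(r.name, r.kind); });

    uint64_t hash = 14695981039346656037ULL; // FNV-1a 64-bit offset basis
    for (const QueryInput & input : inputs)
    {
        hash = fnv1a(hash, input.name);
        hash = fnv1a(hash, input.kind == InputKind::Index ? "|index;" : "|file;");
    }
    return hash;
}
```

Walking a flat list like this only touches the file names, which is the property the comment above is after.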

Gavin Halliday March 22, 2024 at 10:39 AM

Roxie uses the graphs in the following situations:

  • Package manager to update the stats workunit. Much better to not cache the graph xml since it is modified anyway.

  • getQueryHash() - used when a query is loaded to generate a unique id (and avoid reloading if nothing has changed). The graphs are used to get a list of all files that are accessed by the query.

  • CQueryFactory::load() - used to walk the list of activities and create the activity factory. Note: The graph xml is not needed after the query factory has been created.

  • cloneQueryXGMML() - used by the roxie debugger

  • gatherStats() - used from control:queryStats

  • getGraphNames(). Used from control:queryStats action=listGraphNames.
    Very expensive to expand the whole graph just to extract a single name, but not used very often, so probably not worth addressing.

 

Notes:

  • We could maintain a separate list of all files accessed from the query. It would be quicker to walk and might facilitate resolving the filenames in bulk asynchronously.

  • Storing the number of graphs would allow the graph names to be generated without needing to look at any of the graphs.

  • If the graph is still held in memory it would make sense to strip ecl, definition, recordSize, predictedCount, label, name, metaLocalSortOrder (and variants). Using a simple search and replace on the xml from a large query (21K activities) it saved about 50% of the file size.
    Extra information stripped: Source xml: 8MB, in memory: ~23MB
    With extra information: Source xml: 16MB, in memory ~42MB.
    So it would save maybe 20MB per complex query. Avoiding storing it would save another 20MB. If there are, say, 150 queries that is 3GB from stripping alone, or 6GB if the graph is not stored at all, although not all queries will be that large.
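
Spelling out the estimate, using only the figures quoted above and the assumption that all 150 queries are as large as the 21K-activity example (an upper bound, since they will not all be):

```latex
\begin{align*}
\text{saving from stripping} &\approx 42\,\text{MB} - 23\,\text{MB} = 19\,\text{MB} \approx 20\,\text{MB per query} \\
\text{saving from not storing the stripped graph} &\approx 23\,\text{MB} \approx 20\,\text{MB per query} \\
150 \times 20\,\text{MB} &= 3000\,\text{MB} \approx 3\,\text{GB} \\
150 \times (20 + 20)\,\text{MB} &= 6000\,\text{MB} \approx 6\,\text{GB}
\end{align*}
```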

So a couple of possibilities:

  • Add a flag to getXGMML() to indicate that extra information is not needed. Only cache the stripped version.

  • Add a separate list of files used to be used by getQueryHash(), then avoid caching any graphs in memory

  • Improve the graph format.
    Some simple changes - e.g., storing some common values (kind, grouped, colocal, parentActivity) as an attribute would likely reduce the memory consumption and size when serialized. It would require matching changes in the engines, and wudetails.
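
To make the first possibility more concrete, the following is a minimal sketch of stripping the extra information from a tree before it is cached. It assumes a simplified, hypothetical `GraphNode` with a flat name/value map rather than the real property tree, and `stripExtraInfo` is an illustrative name, not an existing function. The name list is the one given in the notes above; the "variants" are not handled.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified node; the real graph is held in a property tree.
struct GraphNode
{
    std::map<std::string, std::string> atts;
    std::vector<GraphNode> children;
};

// Names taken from the notes above; variants of these names are not covered.
static const std::set<std::string> strippable = {
    "ecl", "definition", "recordSize", "predictedCount",
    "label", "name", "metaLocalSortOrder"
};

// Recursively remove the strippable values; returns how many were removed.
std::size_t stripExtraInfo(GraphNode & node)
{
    std::size_t removed = 0;
    for (const std::string & name : strippable)
        removed += node.atts.erase(name);
    for (GraphNode & child : node.children)
        removed += stripExtraInfo(child);
    return removed;
}
```

A flag on the accessor could then decide whether this pass is run before the result is cached, so that only the stripped version is ever held in memory.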

Thoughts/comments?

My branch issue31511 contains some initial work on avoiding caching - which might be useful if we follow one of these routes.

 

NOTE: Program used to look at the memory impact:

Gavin Halliday March 21, 2024 at 4:04 PM

I looked at adding a parameter to indicate whether the graph should be cached. That saves some of the work.

However in roxie it first calculates a hash for the query, and then loads the query - which would load the graph twice.

A better approach might be to keep the flag on the function, especially for the cases where progress is being merged, but also allow the cache to be cleared.
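
A minimal sketch of that idea, assuming a hypothetical `GraphHolder` wrapper (none of these names exist in the workunit code): the caller chooses per request whether the tree is kept, and the cache can be dropped explicitly once the query has been loaded. The sketch is not thread-safe.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for the graph property tree.
struct GraphTree
{
    std::string xml;
};

class GraphHolder
{
public:
    using Loader = std::function<std::shared_ptr<GraphTree>()>;

    explicit GraphHolder(Loader _loader) : loader(std::move(_loader)) {}

    // Return the graph, loading it if necessary. It is only retained
    // when the caller asks for it to be kept in the cache.
    std::shared_ptr<GraphTree> getGraph(bool keepInCache)
    {
        if (cached)
            return cached;
        std::shared_ptr<GraphTree> tree = loader();
        if (keepInCache)
            cached = tree;
        return tree;
    }

    // Release the cached tree, e.g. once the query factory has been created.
    void clearCache() { cached.reset(); }

    bool isCached() const { return cached != nullptr; }

private:
    Loader loader;
    std::shared_ptr<GraphTree> cached;
};
```

With this shape, roxie could request the graph with caching enabled for the hash calculation, reuse it for the load, and then call `clearCache()`, so the graph is loaded once but not held afterwards. Callers that merge progress would pass `false` and never populate the cache.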

Fixed

Created March 21, 2024 at 3:58 PM
Updated June 24, 2024 at 9:57 AM
Resolved May 17, 2024 at 12:22 PM