Provide a mechanism for specifying a preferred plane
Activity
Jacob Cobbett-Smith January 13, 2023 at 5:56 PM
Documentation JIRA opened: https://hpccsystems.atlassian.net/browse/HPCC-28781#icft=HPCC-28781
Gavin Halliday January 10, 2023 at 9:15 AM
> If it's a single logical file that resides on multiple planes, this should be fairly trivial to implement I think.
I think this is the case. It should be a simple addition to, or replacement for, the code that is already there to select for bare-metal based on IP.
Anthony Fishbeck January 9, 2023 at 11:10 PM (edited)
> Unrelated: It also strikes me that it might be useful to have a dfu copy command which copies the contents of a superfile, but only copies the files if they do not already exist. Does this already exist? If not would it be useful?
> I don't know; I suspect that functionality does not exist, though. Is it useful? If used as a way to 'rsync'/update copies, then yes, probably. But is this functionality supplanted in any way by package deployments, or are packages necessarily tied to queries and deployment to roxies (Anthony Fishbeck)? How do packages handle copying superfiles?
When packagemaps are used for copying and they contain superfiles, we do something similar to what was suggested. The packagemap service copies the metadata for the superfile and for each of the subfiles whose metadata doesn't already exist (overwrite may be specified). Then roxie itself copies the physical parts for any files referenced by queries when they don't already exist.
In the new DFU based packagemap processing... if we had the DFU server feature described (that could copy the contents of a superfile), it would actually be quite helpful. Currently each file copy becomes a DFU workunit and having only one for a superfile would be more efficient.
All that being said, I don't think the data team mixes DALI superfiles with packagemaps. Packagemaps are (in a sense) a replacement for the superfile mechanism, but with packagemaps each query may have its own definition for what the superfile contains.
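The copy-only-what-is-missing behaviour discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DFU server or packagemap service API: `plan_superfile_copy`, its parameters, and the file names are all hypothetical, and the set of files already present on the target is assumed to be known up front.

```python
from typing import Iterable, List, Set


def plan_superfile_copy(subfiles: Iterable[str],
                        existing_on_target: Set[str],
                        overwrite: bool = False) -> List[str]:
    """Return the subfiles a single combined copy job would need to transfer.

    Hypothetical helper: a subfile is skipped when it already exists on the
    target, unless overwrite is requested.
    """
    to_copy = []
    for name in subfiles:
        if overwrite or name not in existing_on_target:
            to_copy.append(name)
    return to_copy
```

With a plan like this, one job could cover the whole superfile instead of one DFU workunit per file, which is the efficiency gain mentioned above.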
Jacob Cobbett-Smith January 9, 2023 at 7:36 PM
> i) Each thor instance (or roxie) could have an ordered list of preferred planes. If a file is on more than one plane, the list would be used to select which one was preferred.
If it's a single logical file that resides on multiple planes, this should be fairly trivial to implement I think.
Jacob Cobbett-Smith January 9, 2023 at 7:31 PM
> Jake Smith, is the restriction on mixed-width indexes still relevant?
It is. For example, Thor does not support keyed joins (KJs) on indexes composed of sub-indexes of different widths.
The way it is coded (and deals with 1 big super IFileDescriptor) assumes that they are of equal size (and the master enforces it). It could be recoded to handle mixed widths with a bit of effort.
Not sure what else if anything enforces this restriction.
But for multiple copies of the same logical index on different planes, I think it would be okay if they kept the same width, at least in the short to medium term.
For discussion....
For cloud systems, the default storage for files and indexes is likely to be blob/S3 storage, which has reasonably high throughput but poor latency.
When Thor is performing keyed joins, the performance is likely to be very poor if the indexes are stored on blob storage. A couple of options spring to mind:
1. Write all indexes to a faster storage plane
HPCC-28502 (Provide a helm option to define the default plane for index builds, now resolved) may help in this case. The disadvantage is that if files are kept for a long time and not actively used, the storage costs will be much higher.
2. Write indexes as normal, and then create copies on another storage plane for the active indexes.
The issue with option (2) is that the file will exist on multiple storage planes. How will Thor know which one to prefer? It is possible to include the plane name in the logical filename, e.g. a::b::c@myplane, but that is not a great solution if the filenames are part of a superfile.
Two possibilities suggest themselves (both might be useful):
i) Each thor instance (or roxie) could have an ordered list of preferred planes. If a file is on more than one plane, the list would be used to select which one was preferred.
ii) Add ,CLUSTER('x' [,OPT]) syntax to indexes and files to indicate the preferred source.
On reflection this doesn't seem like a very good solution - it isn't the correct logical place for that information.
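As a rough illustration of possibility (i), the selection could look something like the sketch below. It is not platform code: `select_plane`, the plane names, and the fallback to the first listed plane when no preferred plane holds the file are all assumptions made for the example.

```python
from typing import Iterable, Optional, Sequence


def select_plane(file_planes: Iterable[str],
                 preferred_planes: Sequence[str]) -> Optional[str]:
    """Pick the plane to read a file from.

    file_planes: planes the logical file currently resides on.
    preferred_planes: the engine's ordered preference list (most preferred first).
    Returns None if the file is on no plane at all.
    """
    available = list(file_planes)
    if not available:
        return None
    for plane in preferred_planes:
        if plane in available:
            return plane
    # Assumed fallback: no preferred plane holds the file, so use the first one listed.
    return available[0]
```

For example, a Thor instance configured with the hypothetical list `["fastplane", "blob"]` would read an index from `fastplane` when a copy exists there, and from `blob` otherwise.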
So @Tony Kirk, are we going to face this problem (a file on multiple planes), and would the preferred plane list be a good solution? (It could apply equally well to bare-metal, where the decision is currently (poorly) made based on the IP distance.)
Unrelated: It also strikes me that it might be useful to have a dfu copy command which copies the contents of a superfile, but only copies the files if they do not already exist. Does this already exist? If not would it be useful?
@Jacob Cobbett-Smith