Thanks Tony, do you think Jim should add a blurb to the documentation?
Anthony Fishbeck April 17, 2015 at 8:13 PM
The fix I am submitting will make sure these situations don't output invalid XML, but it won't make writing and reading symmetrical. That is, if you use a record layout like this to write a file, the same layout won't read it back. Building out the full xpath on fields like this would usually produce fairly nonsensical XML and lead to mostly unintended results.
A layout like this could read in several different XML formats, and we wouldn't know which one to write out.
So we won't write out bad XML, but we also won't try to guess an inexact format.
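To illustrate the ambiguity outside ECL, here is a minimal Python sketch using the standard library's ElementTree as a stand-in for the XML reader (it is not the HPCC implementation). The two documents are the ones quoted in this comment. The relative paths `article/description/content` and `article/page` are hypothetical, chosen only to mirror the kind of multi-level xpath under discussion.

```python
import xml.etree.ElementTree as ET

# The two input shapes quoted in this comment.
doc_a = ('<Row><article><description><content>abc</content></description>'
         '<page id="101">xyz</page></article></Row>')
doc_b = ('<Row><article><description><content>abc</content></description></article>'
         '<article><page id="101">xyz</page></article></Row>')

def read_fields(xml_text):
    """Pull two fields out of a <Row> using multi-level paths (hypothetical paths)."""
    row = ET.fromstring(xml_text)
    return (row.findtext('article/description/content'),
            row.findtext('article/page'))

fields_a = read_fields(doc_a)
fields_b = read_fields(doc_b)

# Both shapes yield the same field values, so the values alone cannot
# tell a writer which of the two shapes to reproduce.
print(fields_a, fields_b, fields_a == fields_b)
```

Because both documents reduce to the same pair of values, a writer holding only those values has no basis for choosing one structure over the other.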
So after my fix, if we read in either:
<Row><article><description><content>abc</content></description><page id="101">xyz</page></article></Row>
or
<Row><article><description><content>abc</content></description></article><article><page id="101">xyz</page></article></Row>
we will write out valid XML, but simplified to:
The OUTPUT action using the XML option is documented as follows:
XML - Specifies the file is output as XML data with the name of each field in the format becoming the XML tag for that field's data.
Consider the following field definition:
Layout := RECORD
  STRING20   article_id {XPATH('/Row/article/@id')};     // attribute on the article tag
  STRING1000 content    {XPATH('description/content')};  // multi-level ("complex") xpath
END;
If I were to use OUTPUT to write this XML:
ds:=dataset('~forum::xmlread',layout,xml('Row/article'));
output(ds,,'~xmlOut',xml,overwrite);
The tag generated for the content field is <description/content>, which throws an error when trying to read the file.
I would expect to see <content> according to the documentation.
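As a quick check of why that tag breaks readers, the following Python sketch runs three candidate fragments through the standard library's XML parser. The exact bytes OUTPUT writes are not shown in this ticket, so the `written` string is an assumption based on the tag name reported above. The `expected` and `nested` strings are hypothetical alternatives for comparison.

```python
import xml.etree.ElementTree as ET

# Assumed shape of what OUTPUT wrote: the xpath used verbatim as the tag name.
written = '<Row><description/content>abc</description/content></Row>'
# Shape expected from the documentation: the field name as the tag.
expected = '<Row><content>abc</content></Row>'
# Fully nested alternative, for comparison.
nested = '<Row><description><content>abc</content></description></Row>'

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

results = {name: is_well_formed(text)
           for name, text in [('written', written),
                              ('expected', expected),
                              ('nested', nested)]}
print(results)
```

A "/" is not a legal character in an XML element name, so any conforming parser rejects the first fragment, while the other two parse.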
If I use embedded datasets to reduce the XPATH to a single reference, it works correctly.
But shouldn’t the OUTPUT action use the actual field name instead of the XPATH information?
In summary, OUTPUT using the XML option should ignore complex xpaths.
I think it’s supposed to ignore complex xpaths on output and simply use the field name. In any case it should never create invalid xml.
Only the complex xpaths should be ignored, not the simple ones.
Note: using NOXPATH will eliminate the bad tag generation, but you need to rewrite the RECORD to a simplified layout to read the new file again.
See the following Forum post for additional discussion on this issue:
http://hpccsystems.com/bb/viewtopic.php?f=10&t=1610&sid=35f59a57694f45838ab8685288b1d918