Improve error when detecting unsupported UTF format (e.g. UTF16)

Description

From Jose Bello:

Hi, I sprayed a UTF16LE XML file to thor but I receive this error when I try to read it. Any ideas why? Is there an option in the dataset statement that I should be using? Below is a screenshot of the first two bytes of the file in fileview.

System error: 2: Error - syntax error "Unsupported unicode detected in BOM header" [file offset 2] (//1.1.1.129:7100/var/lib/HPCCSystems/hpcc-data/thor/thor_data400/in/aurorapd/crash_xml._1_of_40) (in Xml Read G1 E2)

Here is the dataset statement:

dataset('~thor_data400::in::aurorapd::crash_xml',r,xml('CRASH_XML/CRASH_TABLE/CRASH'));

I got the input file (batch_mCO00A_i340753_d20160226120000_CRASH_2425_20160226_102324527.xml), sprayed it, and ran this code:

r := record
    STRING dataProviderId {XPATH('dataProviderId')};
    STRING caseNumber {XPATH('caseNumber')};
    STRING reportNumber {XPATH('reportNumber')};
    STRING reportDate {XPATH('reportDate')};
    STRING address {XPATH('address')};
    STRING county {XPATH('county')};
    STRING city {XPATH('city')};
    STRING state {XPATH('state')};
    STRING x_coordinate {XPATH('x_coordinate')};
    STRING y_coordinate {XPATH('y_coordinate')};
    STRING coordinate {XPATH('coordinate')};
    STRING hitAndRun {XPATH('hitAndRun')};
    STRING intersectionRelated {XPATH('intersectionRelated')};
    STRING officerName {XPATH('officerName')};
    STRING crashType {XPATH('crashType')};
    STRING locationType {XPATH('locationType')};
    STRING accidentClass {XPATH('accidentClass')};
    STRING specialCircumstance1 {XPATH('specialCircumstance1')};
    STRING specialCircumstance2 {XPATH('specialCircumstance2')};
    STRING specialCircumstance3 {XPATH('specialCircumstance3')};
    STRING lightCondition {XPATH('lightCondition')};
    STRING weatherCondition {XPATH('weatherCondition')};
    STRING surfaceType {XPATH('surfaceType')};
    STRING roadSpecialFeature1 {XPATH('roadSpecialFeature1')};
    STRING roadSpecialFeature2 {XPATH('roadSpecialFeature2')};
    STRING roadSpecialFeature3 {XPATH('roadSpecialFeature3')};
    STRING surfaceCondition {XPATH('surfaceCondition')};
    STRING trafficControlPresent {XPATH('trafficControlPresent')};
    STRING narrative {XPATH('narrative')};
    STRING quarantined {XPATH('quarantined')};
    STRING action {XPATH('action')};
end;

dataset('~batch_mco00a_i340753_d20160226120000_crash_2425_20160226_102324527.xml_copy',r,xml('CRASH_XML/CRASH_TABLE/CRASH'));

The result is the same error message.
I have some questions:

  • Do we support UTF16LE/BE in our engine, and is that error message caused by a bug?

  • The XML file has a complex structure

    <CRASH_XML>
      <CRASH_TABLE>
        <CRASH> </CRASH>
        <CRASH> </CRASH>
        ...
      </CRASH_TABLE>
      <PERSON_TABLE> ... </PERSON_TABLE>
      <VEHICLE_TABLE> ... </VEHICLE_TABLE>
    </CRASH_XML>

    and I'm not sure this simple dataset instruction is enough to read a subset (the CRASH records from CRASH_TABLE) of that file.

Conclusion

None

Activity


Gavin Halliday March 23, 2016 at 11:44 AM

We do not currently support UTF-16 as an XML input file format.

Attila Vamos March 2, 2016 at 2:31 PM

There is no command line option for conversion. I can't see any footprint of conversion in spray code.

The error message above comes from XMLRead if the (logical) file has a BOM and it is not UTF-8. From this I think we don't support UTF-16.

If we don't support UTF-16, how do we support Unicode characters?
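The BOM check described in this comment could report which encoding it actually found, which is what this ticket asks for. Below is a minimal Python sketch of that idea. It is not the XMLRead implementation; the function names `detect_bom` and `check_xml_encoding` and the wording of the message are assumptions made for illustration.

```python
import codecs

# BOM signatures, longest first: the UTF-32LE BOM (FF FE 00 00) begins with
# the UTF-16LE BOM bytes (FF FE), so it must be tested before UTF-16LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(head: bytes):
    """Return the encoding name implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


def check_xml_encoding(head: bytes) -> None:
    """Hypothetical check: reject any BOM other than UTF-8 and name the encoding found."""
    encoding = detect_bom(head)
    if encoding not in (None, "UTF-8"):
        raise ValueError(
            f"Unsupported encoding {encoding} detected in BOM header; "
            "only UTF-8 XML input is supported"
        )
```

Passing the first four bytes of a file part to `check_xml_encoding` is enough, since no BOM in the table is longer than that.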

Richard Chapman March 2, 2016 at 1:51 PM

Do we support utf16 ?

Gavin Halliday March 2, 2016 at 1:49 PM

A better error message would be a good start.

There used to be the ability for dfu spray to convert formats as it sprayed. I'm not sure that the command line options were ever added to dfuplus. I'm not sure how efficient it is either.

Attila Vamos March 1, 2016 at 1:22 PM

If I convert the input file from UTF-16 to UTF-8 and then spray it, the resulting logical file is readable by the ECL code provided in the description.
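The comment does not say which tool was used for the conversion. One possible way to do it before spraying is sketched below in Python, assuming the source file starts with a BOM (as the reported file does). The function name `convert_utf16_to_utf8` and the chunk size are illustrative choices, not part of any HPCC tooling.

```python
def convert_utf16_to_utf8(src_path: str, dst_path: str, chunk_chars: int = 1 << 20) -> None:
    """Re-encode a BOM-prefixed UTF-16 file as UTF-8 (written without a BOM)."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    # newline="" disables newline translation so line endings are preserved.
    with open(src_path, "r", encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # read in chunks to bound memory use
            if not chunk:
                break
            dst.write(chunk)
```

If the file carries an XML declaration that names UTF-16, this sketch leaves that text unchanged, so it may need to be edited separately.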

Fixed

Details

Components

Assignee

Reporter

Priority

Compatibility

Minor

Fix versions

Created March 1, 2016 at 11:23 AM
Updated March 23, 2016 at 11:44 AM
Resolved March 23, 2016 at 11:44 AM