Clarify Unicode types in ECL Lang Ref. 5.2.0-1
Activity

Gavin Halliday May 11, 2015 at 12:22 PM
There are two different things which are often confused:
a) What is the format of the file
b) What fields are used to read data from the file.
The file formats supported by Thor are csv, xml and flat (thor).
csv and xml assume that the input is utf8.
flat is generally used for files generated by Thor. For instance, string fields are preceded by a 4-byte length.
If you can choose the input format then you are likely to want to spray a utf8 format file. (There is support in the spray mechanism for converting a utf16 file to utf8, but I'm not sure if it is available from the esp user interface.)
When reading from that file you want to use ,CSV/,UTF8 on your file definition. (I think you were using THOR.)
The field types determine how the data will be stored in memory. You will want them to be unicode (which is stored as a 4-byte length followed by UTF-16 characters).
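To make the file-format versus field-type distinction concrete, here is a minimal Python sketch (not HPCC code) of the same text as UTF-8 bytes, as it would sit in a csv file, and as a length-prefixed UTF-16 value, as described above for a unicode field. The helper names are hypothetical, and two details are assumptions that this thread does not confirm: little-endian byte order, and a length that counts UTF-16 code units rather than bytes.

```python
import struct


def pack_unicode_field(text: str) -> bytes:
    """Build a 4-byte length followed by UTF-16 characters.

    Assumptions (not confirmed in this thread): little-endian byte order,
    and a length that counts UTF-16 code units rather than bytes.
    """
    payload = text.encode("utf-16-le")  # explicit endianness, so no BOM is written
    return struct.pack("<I", len(payload) // 2) + payload


def unpack_unicode_field(buf: bytes) -> str:
    """Inverse of pack_unicode_field, under the same assumptions."""
    (units,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + units * 2].decode("utf-16-le")


sample = "beta\u00e9"                    # "beta" plus one non-ASCII character
on_disk_utf8 = sample.encode("utf-8")    # what a utf8 csv file would contain
in_memory = pack_unicode_field(sample)   # sketch of the in-memory unicode layout

print(len(on_disk_utf8), len(in_memory))
```

The point is only that the encoding of the sprayed file and the encoding of the field in memory are independent choices, so a UTF-8 file can legitimately be read into UTF-16 fields.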

Vic Kovacs May 5, 2015 at 8:14 PM
I've now tried to spray some data to use with the VARUNICODE and UNICODEnnnn types:
1. I write the file in UTF-8 and get a "Premature end of data" error.
2. I write it in UTF-16, which causes Perl to put a BOM at the start of the file, and I get the same error.
How does HPCC handle UTF-16, if that's what it sprays?
Here is the start of the od output for the file that was sprayed:
[kovacsvx1@alpiadpprod01:~]$ zcat /data2/tmp/Data_Access_to_HPCC/000000000/2015040604.BOCA__search.spray.gz | head -n1 | od -t x1a
0000000 fe ff 00 32 00 30 00 31 00 35 00 30 00 34 00 30
~ del nul 2 nul 0 nul 1 nul 5 nul 0 nul 4 nul 0
0000020 00 36 00 30 00 34 00 30 00 30 00 30 00 30 00 09
nul 6 nul 0 nul 4 nul 0 nul 0 nul 0 nul 0 nul ht
0000040 00 62 00 65 00 74 00 61 00 09 00 09 00 73 00 65
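For reference, the leading fe ff in that dump is the UTF-16 big-endian byte order mark. A minimal Python sketch (not part of HPCC or its spray tooling) that decodes the 48 bytes shown above and re-encodes them as UTF-8 before spraying might look like the following. The function name detect_encoding is hypothetical, and treating a file without a BOM as UTF-8 is an assumption.

```python
import codecs

# The 48 bytes copied from the od output above.
head = bytes.fromhex(
    "feff0032003000310035003000340030"
    "00360030003400300030003000300009"
    "00620065007400610009000900730065"
)


def detect_encoding(buf: bytes) -> str:
    """Choose a Python codec name from a leading byte order mark."""
    if buf.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # strips the UTF-8 BOM while decoding
    if buf.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"      # reads the BOM to choose the byte order, then drops it
    return "utf-8"           # assumption: no BOM means UTF-8


text = head.decode(detect_encoding(head))
as_utf8 = text.encode("utf-8")   # BOM-free UTF-8, the input csv is said to expect

print(repr(text))
```

Decoded this way, the sample is a tab-separated record, so writing it back out as UTF-8 without a BOM should give a file that matches what Gavin's comment describes for csv input.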

Vic Kovacs May 5, 2015 at 8:13 PM
This is the search file which contains the "search" data structure.

Vic Kovacs May 5, 2015 at 8:10 PM
Screenshot showing errors.

Richard Chapman May 5, 2015 at 9:38 AM
Can you comment?
Details
Assignee: Jim DeFabia
Reporter: Vic Kovacs
Priority: Minor
The question is: on p. 11, under "Unicode string constants", the text says that the constant is UTF-8 and that this is the same as casting to UNICODE.
P. 42, however, says that UNICODE is UTF-16 encoded. Which is it? If string constants are automatically converted to UTF-16, perhaps that should be clarified on p. 11.