Clarify Unicode types in ECL Lang Ref. 5.2.0-1

Environment

ECL

Description

On p. 11, under "Unicode string constants", the text says that the constant is UTF-8 and that this is the same as casting to Unicode.

P. 42, however, says that UNICODE is UTF-16 encoded. Which is it? If string constants are automatically converted to UTF-16, perhaps that should be clarified on p. 11.

Conclusion

None

Attachments

Activity

Gavin Halliday May 11, 2015 at 12:22 PM

There are two different things which are often confused:

a) What the format of the file is.
b) What fields are used to read data from the file.

The file formats supported by thor are csv, xml and flat (thor).

csv and xml assume that the input is utf8.
flat is generally used for files generated by thor. For instance, string fields are preceded by a 4-byte length.

If you can choose the input format, you will probably want to spray a utf8 file. (The spray mechanism supports converting a utf16 file to utf8, but I'm not sure whether that is available from the ESP user interface.)
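If the conversion is easier to do before spraying, it can be done outside HPCC. The following is a minimal Python sketch, not part of HPCC; the function name utf16_to_utf8 and the chunk size are illustrative assumptions. It relies only on the standard "utf-16" codec, which reads the byte order mark to pick the byte order and drops it from the output.

```python
import io


def utf16_to_utf8(src, dst, chunk_chars=64 * 1024):
    """Copy text from a UTF-16 binary stream to a UTF-8 binary stream.

    Hypothetical helper for illustration; src and dst are binary file objects.
    """
    # The "utf-16" codec consumes the BOM and uses it to choose the byte order.
    reader = io.TextIOWrapper(src, encoding="utf-16", newline="")
    # Plain "utf-8" writes no BOM; newline="" leaves line endings untouched.
    writer = io.TextIOWrapper(dst, encoding="utf-8", newline="")
    while True:
        chunk = reader.read(chunk_chars)
        if not chunk:
            break
        writer.write(chunk)
    writer.flush()
    # Detach so the caller keeps ownership of the underlying streams.
    reader.detach()
    writer.detach()


if __name__ == "__main__":
    source = io.BytesIO("20150406040000\tbeta\n".encode("utf-16"))
    target = io.BytesIO()
    utf16_to_utf8(source, target)
    print(target.getvalue())
```

For real files, open the input with mode "rb" and the output with mode "wb" and pass both file objects to the function.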

When reading from that file you want to use ,CSV/,UTF8 on your file definition. (I think you were using THOR.)

The field types determine how the data will be stored in memory. Here they should be unicode (which is stored as a 4-byte length followed by UTF-16 characters).
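To make the "4-byte length followed by UTF-16 characters" layout concrete, here is a small Python sketch. It is an illustration only, not HPCC code: the little-endian byte order and the choice to count UTF-16 code units (rather than bytes) in the length prefix are assumptions, and the function names are hypothetical.

```python
import struct


def pack_unicode_field(text):
    """Encode text as a 4-byte length prefix followed by UTF-16 data."""
    # ASSUMPTION: little-endian byte order for both the prefix and the data.
    data = text.encode("utf-16-le")
    # ASSUMPTION: the prefix counts UTF-16 code units, i.e. bytes // 2.
    return struct.pack("<I", len(data) // 2) + data


def unpack_unicode_field(buf):
    """Decode a field produced by pack_unicode_field."""
    (count,) = struct.unpack_from("<I", buf, 0)
    return buf[4:4 + 2 * count].decode("utf-16-le")


if __name__ == "__main__":
    packed = pack_unicode_field("beta")
    print(packed.hex(" "))
    print(unpack_unicode_field(packed))
```

The point of the sketch is that the on-disk encoding of a csv or xml file (utf8) and the in-memory form of a unicode field (UTF-16) are separate things, which is why both statements in the language reference can hold.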

Vic Kovacs May 5, 2015 at 8:14 PM

I've now tried to spray some data to use with the VARUNICODE and UNICODEnnnn types:
1. I write the file in UTF-8 and get a "Premature end of data" error.
2. I write it in UTF-16, which causes Perl to put a BOM at the start of the file, and I get the same error.

How does HPCC handle UTF-16, if that's what it sprays?

Here is the start of the od output for the file that was sprayed:

[kovacsvx1@alpiadpprod01:~]$ zcat /data2/tmp/Data_Access_to_HPCC/000000000/2015040604.BOCA__search.spray.gz | head -n1 | od -t x1a
0000000 fe ff 00 32 00 30 00 31 00 35 00 30 00 34 00 30
~ del nul 2 nul 0 nul 1 nul 5 nul 0 nul 4 nul 0
0000020 00 36 00 30 00 34 00 30 00 30 00 30 00 30 00 09
nul 6 nul 0 nul 4 nul 0 nul 0 nul 0 nul 0 nul ht
0000040 00 62 00 65 00 74 00 61 00 09 00 09 00 73 00 65
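
For reference, the leading bytes fe ff in the od output above are the UTF-16 big-endian byte order mark. A minimal Python sketch that checks this and decodes the 48 bytes shown follows; the hex string is copied from the dump, while the names DUMP, sniff_bom and decode_dump are illustrative assumptions.

```python
import codecs

# The 48 bytes shown in the od output above.
DUMP = bytes.fromhex(
    "feff0032003000310035003000340030"
    "00360030003400300030003000300009"
    "00620065007400610009000900730065"
)


def sniff_bom(data):
    """Return a codec name based on a leading BOM, or None if there is none.

    UTF-32 BOMs are deliberately not handled in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


def decode_dump(data):
    """Decode UTF-16 data that starts with a 2-byte BOM, dropping the BOM."""
    codec = sniff_bom(data)
    if codec not in ("utf-16-be", "utf-16-le"):
        raise ValueError("expected a UTF-16 byte order mark")
    return data[2:].decode(codec)


if __name__ == "__main__":
    print(sniff_bom(DUMP))
    print(repr(decode_dump(DUMP)))
```

Encoding the decoded text with .encode("utf-8") gives the same fields as plain single-byte ASCII with no BOM.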

Vic Kovacs May 5, 2015 at 8:13 PM

This is the search file which contains the "search" data structure.

Vic Kovacs May 5, 2015 at 8:10 PM

Screenshot showing errors.

Richard Chapman May 5, 2015 at 9:38 AM

Can you comment?

Fixed

Details

Components:
Assignee: Jim DeFabia
Reporter: Vic Kovacs
Priority: Minor
Fix versions: 8.10.18
Pull Request URL:

Created May 4, 2015 at 2:30 PM

Updated January 13, 2023 at 3:49 PM

Resolved January 13, 2023 at 3:49 PM
