This project was completed as an intern opportunity with HPCC Systems in 2019. Curious about projects we are offering for future internships? Take a look at our Ideas List.
Find out about the HPCC Systems Summer Internship Program.
The project proposal application period for 2020 summer internships is now open. Please see our list of Available Projects. Contact the project mentor for more information and to discuss your ideas. You may suggest a project idea of your own but it must leverage HPCC Systems in some way. Contact us for support from an HPCC Systems mentor with experience in your chosen project area.
Project Description
As part of the drive to make improvements in the way HPCC Systems handles unstructured text, we need to ensure that all standard library functions have Unicode implementations. While a number of functions already have Unicode implementations, the following need to be provided:
- ExcludeFirstWord
- ExcludeLastWord
- ExcludeNthWord
For the functions below, the definition in Str.ecl can probably be copied to Uni.ecl – although see the note below about normalization. The main complication might be extending the test cases.
- EndsWith (*)
- StartsWith (*)
- RemoveSuffix (*,**)
It is possible that most of these functions will not need to call any special ICU functions. They are likely to be similar to the string implementations.
- CountWords (aka count delimited tokens)
- FindCount (*)
- Repeat (**)
- SplitWords
- Translate (**)
The following are no longer required for this project:
- FromHexPairs (no longer required)
- ToHexPairs (no longer required)
Extra work - Optimize the way break iterators are created
The current String version can serve as a sufficient specification except for:
- Functions marked (*) will need an additional optional parameter indicating whether a normalization is to be performed and, if so, which one. The default is to perform no normalization. The normalization will be one of NFC, NFKC, NFD, or NFKD, and the parameter will be a string literal naming it. See http://unicode.org/reports/tr15/ for the details concerning the normalizations. The ICU Normalizer2 class is to be used.
You can assume the strings coming in are normalized, but the translation may result in an unnormalized string. Look at unicodeEnsureIsNormalized() in rtl/eclrtl/eclrtl.cpp and the linked reference above. The test case will need to include examples where the normalization is required. There may be scope for another function which explicitly normalizes a Unicode string to a specific normal form.
- Functions marked (**) must verify that unpaired surrogates cannot be created.
For Repeat(), if the inputs are well-formed Unicode strings then the output cannot contain any unpaired surrogates. For Translate(), the function needs to make sure that it maps code points rather than UTF-16 code units, otherwise it would be possible to create unmatched surrogates.
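The difference between mapping code units and mapping code points can be shown with a small sketch. It is Python rather than the C/C++ of the real implementation, and it models UTF-16 text as a list of 16-bit integers. All function names here are hypothetical and exist only for illustration.

```python
from typing import List


def to_utf16(text: str) -> List[int]:
    """Encode a well-formed Python string as a list of UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_well_formed(units: List[int]) -> bool:
    """True when every surrogate code unit belongs to a high+low pair."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: must be followed by a low one
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:  # low surrogate with no preceding high
            return False
        else:
            i += 1
    return True


def translate_code_units(units: List[int], search: List[int], replacement: List[int]) -> List[int]:
    """Naive version: maps 16-bit code units one for one (can split a pair)."""
    table = dict(zip(search, replacement))
    return [table.get(u, u) for u in units]


def translate_code_points(text: str, search: str, replacement: str) -> str:
    """Maps whole code points; a Python str is indexed by code point."""
    table = {ord(s): r for s, r in zip(search, replacement)}
    return text.translate(table)


text = "a\U0001F600"  # U+1F600 is encoded in UTF-16 as the pair D83D DE00

# Code-unit mapping replaces only the high surrogate and strands the low one.
naive = translate_code_units(to_utf16(text), to_utf16("\U0001F600"), to_utf16("x"))
print(is_well_formed(naive))                       # False

# Code-point mapping replaces the whole character.
safe = translate_code_points(text, "\U0001F600", "x")
print(safe, is_well_formed(to_utf16(safe)))        # ax True

# Repeating a well-formed sequence keeps every pair intact.
print(is_well_formed(to_utf16("\U0001F600") * 3))  # True
```

A regression test for Translate could use the same idea: include a character outside the Basic Multilingual Plane in the search or replacement string and check that the result is still well-formed.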
Completion of this project involves:
- C/C++ implementation of the function
- Code usage examples to be added into the HPCC Systems regression suite
- Documentation update to add the function name to the Standard Library Reference page for the function. In the 4 cases where a parameter is added, the parameter description table will be updated.
- An accepted pull request for the above three deliverables for each function.
By the GSoC mid-term review we would expect you to have:
- Accepted pull requests for 6 of the 13 functions listed above.
Mentor | John Holt (Backup Mentor: Gavin Halliday) |
Skills needed |
|
Deliverables |
|
Other resources |
|