This project was completed by Farah Al Shanik, a PhD student studying Computer Science at Clemson University. Farah joined the HPCC Systems intern program in 2018.
This project has to take many term variants into account, such as plural and singular forms, language variants, the punctuation used in acronyms and initials, and alternative spellings (for example, color with and without the ‘u’).
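As a rough illustration of the variants listed above, the following Python sketch folds several surface forms of a term onto one comparison key. It is not taken from the project, which is written in ECL. The function name `equivalence_key`, the `SPELLING_VARIANTS` table, and the naive plural rule are all assumptions made for this example.

```python
# Hypothetical table of alternative spellings; a real list would be far larger.
SPELLING_VARIANTS = {"colour": "color", "organise": "organize"}


def equivalence_key(term: str) -> str:
    """Map a term to a canonical key so that simple variants compare equal."""
    key = term.lower()
    # Drop the periods used in acronyms and initials, e.g. "U.S.A." -> "usa".
    key = key.replace(".", "")
    # Naive plural folding; an assumption for illustration, not a real stemmer.
    if key.endswith("ies") and len(key) > 4:
        key = key[:-3] + "y"
    elif key.endswith("s") and not key.endswith("ss") and len(key) > 3:
        key = key[:-1]
    # Fold known alternative spellings onto one form.
    return SPELLING_VARIANTS.get(key, key)
```

Plural folding runs before the spelling lookup so that a form such as "colours" is reduced to "colour" first and then mapped to "color". A rule this simple will mis-handle many words (it turns "this" into "thi", for instance), which is one reason equivalence handling is listed as a sub-project of its own.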
Find out about the HPCC Systems Summer Internship Program.
The project proposal application period for 2020 summer internships is now closed. Check back in the Fall for details about applying to join our 2021 program.
Project Description
There is a detailed description of the work in the JIRA issue TS1, which includes the Open Source Text Search document as an attachment. This JIRA issue also lists a series of sub-tasks describing the work.
There is a preliminary collection of ECL attributes drawn from several earlier proprietary text search applications. The intent is to provide a framework for building generally useful text search applications that search XML text documents.
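To make the idea of searching XML text documents concrete, here is a minimal Python sketch of building an inversion (an inverted index) from a few XML documents and resolving an all-terms query against it. It does not reproduce the ECL attributes mentioned above. The names `build_inversion` and `search_all`, the sample documents, and the choice to index all element text with word positions are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_inversion(docs):
    """Build term -> {doc_id: [word positions]} from a dict of doc_id -> XML string."""
    inversion = defaultdict(dict)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        # Index the text of every element; tag names and attributes are ignored.
        text = " ".join(root.itertext())
        for pos, term in enumerate(re.findall(r"\w+", text.lower())):
            inversion[term].setdefault(doc_id, []).append(pos)
    return inversion


def search_all(inversion, query):
    """Return the sorted ids of documents that contain every term in the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return []
    hits = set(inversion.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(inversion.get(term, {}))
    return sorted(hits)


# Hypothetical sample documents.
SAMPLE_DOCS = {
    1: "<doc><title>Text search</title><body>Search XML documents quickly.</body></doc>",
    2: "<doc><title>Data lake</title><body>Store documents in bulk.</body></doc>",
}
```

Calling `search_all(build_inversion(SAMPLE_DOCS), "search documents")` returns only document 1, because document 2 contains "documents" but not "search". The positions are kept so that phrase or proximity searches could be added later.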
The sub-projects are:
- Initial build version. Build the inversion datasets.
- Initial search version. Search the initial inversions.
- Regression tests. Regressions for search request parsing, inversion builds, and search resolution.
- Document add, replace, and delete. Attributes to maintain the inversion.
- Slice Rollup. Automation to roll up the incremental data.
- Wildcard processing. Alter the wildcard processing to work with large numbers of terms that match a pattern.
- Retrieval application. An application to retrieve documents from the search resolve hit lists.
- Equivalence terms. Language equivalence (like stemming) and ad hoc phrase equivalencing.
There is enough work that it is unlikely that a single intern would be able to complete all of the sub-projects in a single period.
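The maintenance, rollup, and wildcard sub-projects listed above can be pictured with a small Python sketch. It assumes a design in which each incremental slice holds its own postings plus the set of document ids it replaces or deletes. That design, and the names `Slice`, `SlicedInversion`, `lookup`, and `rollup`, are hypothetical and are not taken from the JIRA sub-tasks.

```python
import fnmatch


class Slice:
    """One increment of the inversion."""

    def __init__(self):
        self.postings = {}       # term -> set of doc ids
        self.superseded = set()  # doc ids replaced or deleted by this slice


class SlicedInversion:
    def __init__(self):
        self.slices = [Slice()]  # oldest first, newest last

    def start_slice(self):
        self.slices.append(Slice())

    def delete(self, doc_id):
        current = self.slices[-1]
        current.superseded.add(doc_id)  # hides the doc in all older slices
        for ids in current.postings.values():
            ids.discard(doc_id)

    def add_or_replace(self, doc_id, text):
        self.delete(doc_id)  # any earlier version must no longer match
        for term in text.lower().split():
            self.slices[-1].postings.setdefault(term, set()).add(doc_id)

    def lookup(self, pattern):
        """Ids of live documents with a term matching a wildcard pattern (* and ?)."""
        pattern = pattern.lower()
        hits, hidden = set(), set()
        for sl in reversed(self.slices):  # newest slice first
            for term, ids in sl.postings.items():
                if fnmatch.fnmatchcase(term, pattern):
                    hits |= ids - hidden
            hidden |= sl.superseded
        return sorted(hits)

    def rollup(self):
        """Merge all slices into one, dropping superseded postings."""
        merged, hidden = Slice(), set()
        for sl in reversed(self.slices):
            for term, ids in sl.postings.items():
                live = ids - hidden
                if live:
                    merged.postings.setdefault(term, set()).update(live)
            hidden |= sl.superseded
        self.slices = [merged]
```

A search gives the same answer before and after `rollup`; the rollup only removes the cost of visiting every slice. The wildcard lookup here scans every term in every slice, which is exactly the kind of approach that stops scaling when a pattern matches a very large number of terms.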
Completion of this project involves:
Code will be checked in weekly, and each commit will be pushed. The developer can decide whether to amend a single commit or to provide a sequence of weekly commits.
Each sub-project will be done in sequence, and each sub-project will have a separate pull request.
The exported attributes intended for use by application developers working with the framework will be documented using Javadoc-style comments.
By the midterm review we would expect you to have completed:
- Initial build version: See https://track.hpccsystems.com/browse/TS-2
- Initial search version: See https://track.hpccsystems.com/browse/TS-3
- Regression tests: See https://track.hpccsystems.com/browse/TS-4
Mentor: John Holt
Backup Mentor: Roger Dev