This project is available as a student work experience opportunity with HPCC Systems this summer. Curious about other projects we are offering? Take a look at our Ideas List.

Find out about the HPCC Systems Summer Internship Program.
Deadline for proposals - Monday April 3rd 2017

Project Description

There is a detailed description of the work in the JIRA issue TS1, which includes an attachment to the Open Source Text Search document. This JIRA issue also details a series of sub-tasks describing the work.

There is a preliminary collection of ECL attributes drawn from several earlier proprietary text search applications. The intent is to provide a framework for building generally useful text search applications that support searching XML text documents.

The sub-projects are:

  1. Initial build version.  Build the inversion datasets.
  2. Initial search version.  Search the initial inversions.
  3. Regression tests.  Regressions for search request parsing, inversion builds, and search resolution.
  4. Document add, replace, and delete.  Attributes to maintain the inversion.
  5. Slice Rollup.  Automation to roll up the incremental data.
  6. Wildcard processing.  Alter the wildcard processing to work with large numbers of terms that match a pattern.
  7. Retrieval application.  An application to retrieve documents from the search resolve hit lists.
  8. Equivalence terms.  Language equivalence (like stemming) and ad hoc phrase equivalencing.
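
To make the first few sub-projects concrete, here is a minimal sketch of what "building an inversion" and "searching it" mean. The project itself is written in ECL; Python is used here purely for illustration. Every name below (`tokenize`, `build_inversion`, `expand_wildcard`, `search_all`) is hypothetical and not part of the existing attribute collection. The sketch assumes a positional inverted index keyed by term and a simple AND search, and it expands wildcards with a naive scan over all indexed terms, which is the kind of approach that would not scale to large numbers of matching terms.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase


def tokenize(text):
    """Strip XML tags, lowercase the text, and split it into word tokens."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", no_tags.lower())


def build_inversion(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    inversion = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            inversion[term].setdefault(doc_id, []).append(pos)
    return inversion


def expand_wildcard(inversion, pattern):
    """Return the indexed terms matching a pattern that may contain * or ?.

    Naive on purpose: it scans every indexed term.
    """
    pattern = pattern.lower()
    return sorted(term for term in inversion if fnmatchcase(term, pattern))


def search_all(inversion, patterns):
    """Return sorted doc ids that contain a match for every pattern (AND)."""
    hits = None
    for pattern in patterns:
        docs = set()
        for term in expand_wildcard(inversion, pattern):
            docs.update(inversion[term])
        hits = docs if hits is None else hits & docs
    return sorted(hits or [])


# Two tiny, made-up XML documents used only for this illustration.
docs = {
    1: "<doc><title>Text search</title><body>Searching XML documents</body></doc>",
    2: "<doc><title>Build notes</title><body>Inversion build for text</body></doc>",
}
inversion = build_inversion(docs)
```

With this sample data, `search_all(inversion, ["text", "build"])` narrows the hit list to document 2, while the pattern `search*` expands to both `search` and `searching`. Keeping term positions in the inversion is a design choice that would also allow phrase or proximity matching later; the sketch does not implement that.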

There is enough work that it is unlikely that a single intern would be able to complete all of the sub-projects in a single period.  

Completion of this project involves:

  • Code will be checked in weekly, and each commit will be pushed. The developer can decide whether to amend a single commit or provide a sequence of weekly commits.
  • The sub-projects will be done in sequence, and each sub-project will have a separate pull request.
  • The attribute exports intended for use by application developers using the framework will be documented with Javadoc-style comments.

By the midterm review we would expect you to have completed:

  1. Initial build version: See https://track.hpccsystems.com/browse/TS-2
  2. Initial search version: See https://track.hpccsystems.com/browse/TS-3
  3. Regression tests: See https://track.hpccsystems.com/browse/TS-4

Mentor

John Holt
Contact details

Backup Mentor: Roger Dev
Contact Details 

Skills needed
  • Ability to code in ECL.
  • Knowledge of regular expression parsing.
  • Ability to build and test the HPCC system (guidance will be provided).
  • Ability to write test code.
Deliverables
  • Checked-in code
  • Test cases demonstrating the correct behaviour and performance
  • Documentation
Other resources