...

LSA’s notion of term-document similarity can be applied to information retrieval, creating a system known as Latent Semantic Indexing (LSI). An LSI system calculates the similarity between the terms provided in a query and the documents by creating a k-dimensional query vector as the sum of the k-dimensional vector representations of the individual terms, and comparing it to the k-dimensional document vectors.
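The query step described above can be sketched as follows. This is a minimal illustration in Python rather than ECL, and it rests on assumptions not stated in the project description: the reduced k-dimensional term and document vectors are taken as already computed, cosine similarity is assumed as the comparison measure, and the names `query_vector`, `cosine`, `rank_documents`, and the toy vectors are hypothetical.

```python
import math


def query_vector(terms, term_vectors):
    """Sum the k-dimensional vectors of the query terms.

    Terms missing from the vocabulary are skipped (an assumption).
    """
    k = len(next(iter(term_vectors.values())))
    q = [0.0] * k
    for term in terms:
        vec = term_vectors.get(term)
        if vec is None:
            continue  # unknown terms contribute nothing to the query
        for i, value in enumerate(vec):
            q[i] += value
    return q


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(terms, term_vectors, doc_vectors):
    """Return (doc_id, similarity) pairs, most similar document first."""
    q = query_vector(terms, term_vectors)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors, invented purely for illustration.
TERM_VECTORS = {
    "cluster": [1.0, 0.0],
    "node": [0.8, 0.2],
    "recipe": [0.0, 1.0],
}
DOC_VECTORS = {
    "doc_hpcc": [0.9, 0.1],
    "doc_cooking": [0.1, 0.9],
}

ranking = rank_documents(["cluster", "node"], TERM_VECTORS, DOC_VECTORS)
```

In an ECL implementation the same computation would presumably be expressed as joins and aggregations over distributed datasets of term and document vectors; the sketch only shows the arithmetic.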

The implementation can be done entirely in ECL, so only knowledge of ECL and distributed computing techniques is required. Knowledge of linear algebra will be helpful.

Completion of this project involves:

Selection of the test data. The test data will be a collection of open data text documents. The collection must have an open data license or be completely free of copyright restrictions. The most important aspect of the collection is that you are familiar with its subjects, so that you can judge the effectiveness of your implementation. The test text collection should be composed of 1,000 to 10,000 documents.

Development of the algorithm using ECL.
Testing the algorithm for correctness and performance, which involves comparing the approximate solution to the exact solution and validating that the results are within the tolerance specified.

By the GSoC mid-term review we would expect you to have written the ECL needed to process the text documents into a dataset of term vectors.
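To make the mid-term target concrete, the sketch below shows one way text documents could be turned into term vectors. It is written in Python rather than ECL and is only an illustration: the tokenization rule, the raw-count weighting, the sparse `(doc_id, term_id, count)` row layout, and the names `tokenize` and `build_term_vectors` are all assumptions, not requirements of the project.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_term_vectors(docs):
    """Convert {doc_id: text} into a vocabulary and sparse term-count rows.

    Returns (vocabulary, rows) where vocabulary maps term -> term_id and
    rows is a list of (doc_id, term_id, count) tuples, one per term that
    occurs in a document.
    """
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    all_terms = sorted({term for c in counts.values() for term in c})
    vocabulary = {term: term_id for term_id, term in enumerate(all_terms)}
    rows = [
        (doc_id, vocabulary[term], n)
        for doc_id in sorted(counts)
        for term, n in sorted(counts[doc_id].items())
    ]
    return vocabulary, rows


# Two tiny documents, invented for illustration.
docs = {1: "ECL runs on HPCC", 2: "HPCC runs ECL code"}
vocabulary, rows = build_term_vectors(docs)
```

The sparse row layout is chosen because it resembles a flat record dataset, which may map more naturally onto distributed processing than a dense matrix; a real implementation might also apply stop-word removal or a weighting such as TF-IDF.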

Mentor	John Holt Contact details: Contact Details
Skills needed	Knowledge of ECL. Training manuals and online courses are available on the HPCC Systems website. Knowledge of distributed computing techniques.
Deliverables	Test code demonstrating the correctness and performance of the algorithm. Supporting documentation.
Other resources
HPCC Systems website
JIRA issue for this project:
Learning ECL documentation and online training courses
Examples of existing code
HPCC Systems Machine Learning documentation
The Wikipedia article on Latent Semantic Indexing: https://en.wikipedia.org/wiki/Latent_semantic_indexing
For use in Latent Semantic Analysis: https://en.wikipedia.org/wiki/Latent_semantic_analysis
http://www.siam.org/meetings/la03/proceedings/Dvorsky.pdf
http://www.apluswebservices.com/wp-content/uploads/2012/05/latent-semantic-indexing-fast-track-tutorial.pdf
