Project Description
SVD has many applications. For example, it can be applied in natural language processing to latent semantic analysis (LSA). LSA starts with a matrix whose rows represent words, whose columns represent documents, and whose elements are the counts of each word in each document. It then applies SVD to this matrix and uses a subset of the most significant singular vectors, together with their corresponding singular values, to map words and documents into a new space, called the 'latent semantic space', in which words and documents with similar co-occurrence patterns are placed near each other, even if those words never co-occurred in the training corpus.
LSA's notion of term-document similarity can be applied to information retrieval, creating a system known as Latent Semantic Indexing (LSI). An LSI system computes the similarity between the terms supplied in a query and the documents by forming a k-dimensional query vector as the sum of the k-dimensional vector representations of the individual query terms, and comparing it against the k-dimensional document vectors.
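The LSA/LSI pipeline described above can be sketched in a few lines. This is an illustrative Python/NumPy example on a hypothetical toy term-document matrix, not the ECL implementation the project asks for; the term names, matrix values, and choice of k are made up for demonstration.

```python
import numpy as np

# Hypothetical toy term-document count matrix: rows = terms, columns = documents.
terms = ["svd", "matrix", "query", "vector"]
A = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 2, 1],
    [1, 1, 2],
], dtype=float)

# Full SVD, then keep only the k most significant singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# k-dimensional document vectors: columns of diag(sk) @ Vtk, one per document.
doc_vectors = (np.diag(sk) @ Vtk).T

# Fold a query into the latent space as the sum of its terms' k-dimensional vectors
# (the rows of Uk), as described in the text.
query_terms = ["svd", "vector"]
q = sum(Uk[terms.index(t)] for t in query_terms)

def cos(a, b):
    # Cosine similarity between two latent-space vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query vector.
scores = [cos(q, d) for d in doc_vectors]
```

Note that cosine similarity is used here for ranking; scalings of the query vector (e.g. by the inverse singular values, as in some LSI formulations) change magnitudes but not the general scheme.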
The implementation can be done entirely in the ECL language; only knowledge of ECL and distributed-computing techniques is required.
Completion of this project involves:
- Generation of the test data. Three sets of test data are required, one for each of the three test cases:
  - A roughly uniform distribution in which each node has data from the entire range.
  - Skewed data in which at least half of the nodes have no observations in at least 50% of the range.
  - Highly skewed data in which the per-node ranges do not overlap.
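The three per-node distributions above could be sketched as follows. This is a Python sketch purely to make the three cases concrete; the actual generators must be written in ECL, and the node count, range, and record count here are arbitrary placeholders.

```python
import random

NODES = 4            # placeholder cluster size
RANGE = (0.0, 1000.0)  # placeholder value range
PER_NODE = 100       # placeholder records per node

def uniform_case(rng):
    # Case 1: every node draws from the entire range.
    return {n: [rng.uniform(*RANGE) for _ in range(PER_NODE)]
            for n in range(NODES)}

def skewed_case(rng):
    # Case 2: at least half the nodes are confined to one half of the range,
    # so they have no observations in at least 50% of it.
    lo, hi = RANGE
    mid = lo + (hi - lo) / 2
    data = {}
    for n in range(NODES):
        if n < NODES // 2:
            data[n] = [rng.uniform(lo, mid) for _ in range(PER_NODE)]  # lower half only
        else:
            data[n] = [rng.uniform(lo, hi) for _ in range(PER_NODE)]   # full range
    return data

def disjoint_case(rng):
    # Case 3: highly skewed; each node owns its own slice of the range,
    # so the per-node ranges never overlap.
    lo, hi = RANGE
    width = (hi - lo) / NODES
    return {n: [rng.uniform(lo + n * width, lo + (n + 1) * width)
                for _ in range(PER_NODE)]
            for n in range(NODES)}
```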
- Development of the algorithm using ECL.
- Testing the algorithm for correctness and performance, which involves comparing the approximate solution to the exact solution and validating that the results are within the specified tolerance.
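For the correctness check, one natural yardstick (sketched here in Python/NumPy, not the ECL test harness itself) is the Eckart–Young theorem: the spectral-norm error of the best rank-k approximation produced by SVD equals the (k+1)-th singular value, which gives an exact tolerance to validate against. The matrix size and rank below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # arbitrary test matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Exact (full-rank) reconstruction and a rank-3 approximation.
A_exact = U @ np.diag(s) @ Vt
A_approx = U[:, :3] @ np.diag(s[:3]) @ Vt[:3, :]

# Eckart-Young: the spectral-norm error of the best rank-3
# approximation is exactly the 4th singular value.
tol = s[3]
err = np.linalg.norm(A - A_approx, ord=2)
```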
By the GSoC mid-term review we would expect you to have written the ECL needed to generate the test data for the three cases.
Mentor | John Holt |
Skills needed | |
Deliverables | |
Other resources | |