Create an NLP Dictionary for Tamil

If you are interested in this project contact David de Hilster.

Find out about the HPCC Systems Summer Internship Program.

Project Description

To eventually create digital human readers for Tamil, a dictionary must first be established. This project will use the Tamil dictionary from Wiktionary.

Completion of this project involves:

Download the Tamil dictionary from Wiktionary
Write an NLP++ parser to extract the vocabulary from the Wiktionary files into text files
Write an NLP++ parser to transform the text files into knowledge base files
Create Tamil test files for part-of-speech tagging
Write an NLP++ part-of-speech tagger
Run the tests using the NLP++ Plugin in ECL to show enhancements
Create an NLP++ repository for the Tamil dictionary and analyzers
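
The extraction step above can be illustrated outside NLP++ with a short Python sketch. It is not the project's analyzer. It assumes the downloaded entries use English Wiktionary style wikitext, where a level-2 "==Tamil==" heading is followed by part-of-speech headings such as "===Noun===". The function names, the part-of-speech list, and the tab-separated output format are all hypothetical choices made for illustration.

```python
import re

# Assumption: English Wiktionary style wikitext, with a level-2 language
# heading ("==Tamil==") and level-3 or deeper part-of-speech headings.
LANGUAGE_HEADING = re.compile(r"^==\s*([^=]+?)\s*==\s*$")
SUB_HEADING = re.compile(r"^={3,}\s*([^=]+?)\s*={3,}\s*$")

# Hypothetical, incomplete list of part-of-speech heading names.
POS_NAMES = {"Noun", "Verb", "Adjective", "Adverb", "Pronoun", "Postposition"}


def extract_pos(title, wikitext, language="Tamil"):
    """Return (title, part_of_speech) pairs found in one language section."""
    pairs = []
    in_language = False
    for line in wikitext.splitlines():
        lang_match = LANGUAGE_HEADING.match(line)
        if lang_match:
            # A new level-2 heading starts or ends the section of interest.
            in_language = lang_match.group(1) == language
            continue
        if not in_language:
            continue
        sub_match = SUB_HEADING.match(line)
        if sub_match and sub_match.group(1) in POS_NAMES:
            pairs.append((title, sub_match.group(1).lower()))
    return pairs


def to_vocab_lines(pairs):
    """Format pairs as tab-separated text lines, one entry per line."""
    return [f"{word}\t{pos}" for word, pos in pairs]


if __name__ == "__main__":
    sample = "==Tamil==\n===Noun===\n# water\n==Telugu==\n===Verb===\n"
    for entry in to_vocab_lines(extract_pos("நீர்", sample)):
        print(entry)
```

Text lines of this kind could then serve as input to the second parser, which turns them into knowledge base files. The actual file formats are left to the project.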

By the mid term review we would expect you to have:

More details coming soon

Mentor	David de Hilster
Skills needed	Keen interest in natural language; ability to learn and program in NLP++; ability to create test cases; ability to write test code in ECL using the NLP++ plugin to test the enhanced dictionary
Deliverables	Midterm: parts-of-speech text files. End of project: a Tamil dictionary repository in the VisualText open source GitHub, including the dictionary files and NLP++ analyzers
Other resources	HPCC Systems website; JIRA issue for this project; Wiktionary; Blog: Understanding Natural Language Processing; GitHub Repository; Video: Deploying Digital Human Readers Leveraging HPCC Systems; Video: NLP++ ECL Plugin; VisualText Open Source Website; NLP++ Language Extension; Formal language description; Learning ECL documentation and online training courses
