Create an NLP Dictionary for Tamil



If you are interested in this project, contact David de Hilster.

Find out about the HPCC Systems Summer Internship Program.

Project Description

Creating digital human readers for Tamil first requires a dictionary. This project will build that dictionary from the Tamil entries in Wiktionary.

Completion of this project involves:

  • Download the Tamil dictionary from Wiktionary

  • Write an NLP++ parser to extract the vocabulary from the Wiktionary files into text files

  • Write an NLP++ parser to transform the text files into knowledge base files

  • Create Tamil test files for part-of-speech tagging

  • Write an NLP++ part-of-speech tagger

  • Run the tests using the NLP++ Plugin in ECL to show enhancements

  • Create an NLP++ repository for the Tamil dictionary and analyzers
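
To give a feel for the extraction step, here is a minimal sketch in Python. The project itself calls for NLP++ analyzers, so this only illustrates the shape of the input and output and is not the project's method. It assumes the English Wiktionary wikitext layout, where a `==Tamil==` heading opens the language section and deeper headings such as `===Noun===` name the part of speech. The function names, the `POS_HEADINGS` set, and the tab-separated output format are all hypothetical choices made for illustration.

```python
import re

# Assumed set of part-of-speech heading names; a real run would extend this.
POS_HEADINGS = {
    "Noun", "Proper noun", "Verb", "Adjective", "Adverb", "Pronoun",
    "Postposition", "Conjunction", "Interjection", "Particle", "Numeral",
}

# Level-2 headings (==Language==) and level-3-or-deeper headings (===Noun===).
LANG_HEADING = re.compile(r"^==[ \t]*([^=\n]+?)[ \t]*==[ \t]*$", re.MULTILINE)
SUB_HEADING = re.compile(r"^={3,}[ \t]*([^=\n]+?)[ \t]*={3,}[ \t]*$", re.MULTILINE)


def tamil_section(wikitext):
    """Return the text under the ==Tamil== heading, or '' if there is none."""
    headings = list(LANG_HEADING.finditer(wikitext))
    for i, match in enumerate(headings):
        if match.group(1) == "Tamil":
            # The section runs until the next language heading or end of page.
            end = headings[i + 1].start() if i + 1 < len(headings) else len(wikitext)
            return wikitext[match.end():end]
    return ""


def extract_pos(title, wikitext):
    """Return (title, part_of_speech) pairs found in the Tamil section."""
    found = []
    for match in SUB_HEADING.finditer(tamil_section(wikitext)):
        name = match.group(1)
        if name in POS_HEADINGS and name not in found:
            found.append(name)
    return [(title, name.lower()) for name in found]


def to_lines(pairs):
    """Format pairs as tab-separated lines (hypothetical text-file format)."""
    return ["\t".join(pair) for pair in pairs]
```

For example, calling `extract_pos` with a page title and that page's wikitext returns one pair per part-of-speech heading in the Tamil section, and `to_lines` turns those pairs into lines that could be written to a text file. Headings in other language sections on the same page are ignored.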

By the midterm review, we would expect you to have:

  • More details coming soon

Mentor

David de Hilster

Skills needed

  • Keen interest in natural language

  • Ability to learn and program in NLP++

  • Ability to create test cases

  • Ability to write test code in ECL using the NLP++ plugin to test the enhanced dictionary

Deliverables

Midterm

  • Parts-of-speech text files

End of project

  • A Tamil dictionary repository in the VisualText open-source GitHub, including the dictionary files and NLP++ analyzers

Other resources
