
Ananya Gupta is a PhD student studying Human-Centered Computing at Clemson University.

Ananya joined the program to work on an NLP project involving the development of an analyzer for Nepali text. This analyzer needed to be incorporated into Wiktionary for lookup and additional NLP analysis. Ananya generated additional interest in her project by issuing a press release calling for help in building the Nepali Wiktionary. Her Nepali NLP Forum Facebook page, whose readership is increasing daily, may make her a pioneer for her native language in the world of NLP.

As well as the resources included here, read Ananya's intern blog journal, which includes a more in-depth look at her work.

2022 Winner of the Best Poster Data Analytics Award

Poster Abstract

Words are the foundation of Natural Language Processing (NLP) in any language. To analyze a language by computer, we need language resources (tokenizer, analyzer, dictionary, etc.). Among the tools available, Wiktionary is one that contains dictionaries for natural languages. The web presence of the Nepali language, however, is limited. Although Nepali has more than 100,000 words, the Nepali Wiktionary contains only around 16,800. Most of the entries in the Nepali Wiktionary are inconsistent in format because of the lack of sufficient data. To increase the number of words entered in Wiktionary in a proper format with complete information, we took a number of initiatives.

In the first initiative, we developed a parser and an analyzer. We first conducted background research and came up with a standard template, in wikitext format, for entering words into Wiktionary with complete information. Using the wikitext of these word entries as input files, we built rules in NLP++ and parsed the content of each word entry. Finally, we developed an analyzer, NeWiktionary, that builds a knowledge base for words, and used HPCC Systems to build a dictionary record structure from the Wiktionary data. Ultimately, this dictionary will be used for NLP in Nepali using HPCC Systems. Together, these Wiktionary entries and the online dictionary can be a great resource for NLP projects in the Nepali language, such as translation, sentiment analysis, Word Sense Disambiguation (WSD), and much more.
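The flow described above, from a wikitext entry to a dictionary record, can be illustrated with a minimal sketch. This is not the project's implementation: the actual work used NLP++ rules and HPCC Systems, whereas this sketch uses plain Python as a stand-in. The sample entry, its section names, the `DictionaryRecord` fields, and the `parse_entry` function are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical wikitext entry. The section names and layout are assumptions,
# not the standard template developed in the project.
SAMPLE_ENTRY = """\
== Nepali ==
=== Noun ===
# water
# liquid for drinking
=== Synonyms ===
* जल
"""


@dataclass
class DictionaryRecord:
    """Illustrative record structure for one dictionary word."""
    word: str
    language: str = ""
    senses: dict = field(default_factory=dict)   # part of speech -> definitions
    synonyms: list = field(default_factory=list)


# Matches wikitext headings such as "== Nepali ==" or "=== Noun ===".
HEADING = re.compile(r"^(=+)\s*(.*?)\s*\1$")


def parse_entry(word, wikitext):
    """Parse one wikitext entry into a DictionaryRecord (hypothetical helper)."""
    record = DictionaryRecord(word=word)
    section = None
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if level == 2:
                # A level-2 heading names the language of the entry.
                record.language = title
                section = None
            else:
                # Deeper headings name a part of speech or a relation.
                section = title
            continue
        if line.startswith("#") and section and section != "Synonyms":
            # Numbered list items under a part of speech are definitions.
            record.senses.setdefault(section, []).append(line.lstrip("# ").strip())
        elif line.startswith("*") and section == "Synonyms":
            # Bulleted items under "Synonyms" are related words.
            record.synonyms.append(line.lstrip("* ").strip())
    return record


record = parse_entry("पानी", SAMPLE_ENTRY)
print(record)
```

A consistent entry template is what makes this kind of rule-based parsing feasible: each heading level and list marker maps to exactly one field of the record.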

The second initiative focused on involving the community in building a better dictionary for the Nepali language. We first published a press release and created a Facebook group to recruit enthusiastic contributors and bring them together. Additionally, we built connections with contributors to the Nepali Wiktionary and Nepali Wikimedia in order to extend the dictionary's use in the community.

In summary, we took the initiative to make Nepali NLP better resourced through cutting-edge technology and with the help of the community.

Presentation

In this Video Recording, Ananya provides a tour and explanation of her poster content.

Nepali NLP Initiative

Click on the poster for a larger image.
