Artificial Haitian Creole Language Database

Project Motivation and Description:

Following my work at Intel, I wanted to explore text generation for a low-resource language, Haitian Creole, rather than for a widely spoken language such as English. The goal of this work was to quickly expand an existing database using limited datasets collected online. I opted to use a specific type of generative adversarial network, LeakGAN, because it had performed quite well on English text generation, and I assumed this would also hold true for Haitian Creole.

The Most Challenging Part of the Project:

The most challenging part of the project was finding a way to improve the quality of the generated Haitian Creole sentences. The results from using LeakGAN to generate Haitian Creole sentences were poor. The BLEU (Bilingual Evaluation Understudy) score, a number from 0 to 1 that compares the generated text to one or more reference texts, ranged between 0.5 and 0.7. I tried fine-tuning the network in an effort to generate higher-quality sentences, but there was only a slight improvement. It was not until I read a paper about Facebook’s InferSent that my results changed. To train InferSent on Haitian Creole data, I first had to use the FastText embeddings (for Haitian Creole word embeddings). Then I used the trained InferSent to evaluate the sentences generated by the LeakGAN.
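
As a rough illustration of what the BLEU score measures, here is a minimal Python sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty, uniform weights, no smoothing). This is not the evaluation code used in this project. The function names (ngrams, modified_precision, bleu) and the toy Haitian Creole sentences are assumptions made purely for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped by
    the largest count of that n-gram in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        # Without smoothing, one missing n-gram order gives a score of 0.
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty, using the reference length closest to the candidate.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Toy example (hypothetical sentences, whitespace-tokenized).
reference = "mwen renmen manje diri ak pwa".split()
generated = "mwen renmen manje diri".split()
print(bleu(generated, [reference]))
```

In the toy example every n-gram of the generated sentence appears in the reference, so the score falls below 1 only because of the brevity penalty for the shorter output. A real evaluation would typically score a whole corpus and apply smoothing, which this sketch leaves out.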

Results:

I noticed an immediate difference in sentence quality in comparison to the BLEU score metric: almost a 20% increase in quality. While I was impressed with these results, I wanted to continue to increase my accuracy. This work is ongoing; I am still trying to improve my accuracy and will update my code soon.