Speech to Image

Project Motivation and Description:

Automatic synthesis of realistic images from descriptions is a fascinating topic in AI. With the recent development of deep neural networks and deep convolutional generative adversarial networks (DCGANs), one can generate synthetic images that look realistic. Han Zhang's StackGAN takes this a step further by generating synthetic images from plain text descriptions. My project extends this by generating images from speech. With this goal, my teammates and I, for our Machine Learning Signal Processing final project, set out to create art from your voice by combining Mozilla's DeepSpeech speech-to-text engine with the StackGAN network.

Team and Roles:

I was the primary technical director of this project. Specifically, I upgraded the existing StackGAN code and slightly improved Mozilla DeepSpeech's speech-to-text accuracy while cutting its latency with TensorRT, a step toward running the application in real time. My teammates built a GUI for our demo on presentation day, which ran each trained model separately on Amazon Web Services.

Most Challenging Part of the Project:

The most challenging part of the project was updating the existing StackGAN code, which was written for TensorFlow 0.11, to TensorFlow 1.0+. In particular, the author Han Zhang used PrettyTensor, a TensorFlow API that is deprecated for TensorFlow 1.0+. I had to rewrite much of his code and figure out how to properly run the batch normalization updates on each training iteration. Another difficulty was integrating TensorRT with DeepSpeech to cut latency and improve accuracy. While there is a lot of TensorRT documentation online, including differing blog posts and gists on how best to use it, I found it challenging to work out the specifics of the integration. Ultimately, after multiple rounds of trial and error, I applied it to the DeepSpeech evaluation step, improving the word error rate by 1%.
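
Porting away from PrettyTensor meant wiring up batch normalization's moving-average updates by hand. The snippet below is a minimal sketch of the standard TF 1.x pattern, not the actual StackGAN code: the tiny generator block and the loss are placeholders. The key piece is making the train step depend on tf.GraphKeys.UPDATE_OPS, which is exactly what is easy to miss when leaving PrettyTensor behind.

```python
import tensorflow as tf

def generator(z, training):
    # Stand-in for one generator block; the real StackGAN architecture
    # differs. This only illustrates the batch-norm wiring.
    net = tf.layers.dense(z, 4 * 4 * 512, use_bias=False)
    net = tf.layers.batch_normalization(net, training=training)
    net = tf.nn.relu(net)
    return tf.layers.dense(net, 64 * 64 * 3, activation=tf.nn.tanh)

z = tf.placeholder(tf.float32, [None, 100])
training = tf.placeholder(tf.bool, [])
fake = generator(z, training)
loss = tf.reduce_mean(tf.square(fake))  # placeholder loss for the sketch

# In TF 1.x, batch norm's moving mean/variance updates live in
# UPDATE_OPS and must run alongside the optimizer step; otherwise the
# statistics used at inference time are never updated.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(2e-4).minimize(loss)
```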
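
One common way to apply TensorRT to a TensorFlow model is the TF-TRT contrib module (TensorFlow 1.7+). The sketch below shows that general conversion pattern rather than my exact setup: the frozen-graph path and the "logits" output node name are illustrative assumptions, and the real DeepSpeech export may differ.

```python
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt  # TF-TRT (contrib, TF 1.7+)

# Load a frozen inference graph. The file name is illustrative.
with tf.gfile.GFile("deepspeech_frozen.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Replace TensorRT-compatible subgraphs with optimized TRT ops.
# FP16 trades a little precision for lower inference latency.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["logits"],              # assumed output node name
    max_batch_size=1,
    max_workspace_size_bytes=1 << 30,
    precision_mode="FP16",
)

with tf.Graph().as_default():
    tf.import_graph_def(trt_graph, name="")
```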

Moving Forward:

Though this project may be over, it was only a stepping stone! At the moment, Speech-to-Image works via the common denominator of text: DeepSpeech first produces text from speech, and StackGAN then uses that text to generate an image, as sketched below. In total, the whole procedure takes roughly 20 seconds. Currently, I am working on generating pictures directly from speech, without the intermediate text. I am also expanding the training datasets so the network can produce more diverse images than just birds.
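
For reference, here is a minimal sketch of the current two-stage pipeline. run_deepspeech and run_stackgan are hypothetical wrappers around the two trained models, shown only to make the text-in-the-middle architecture explicit.

```python
import time

def run_deepspeech(audio_path):
    # Stand-in for the DeepSpeech inference call (speech -> text).
    raise NotImplementedError

def run_stackgan(text):
    # Stand-in for the StackGAN inference call (text -> image).
    raise NotImplementedError

def speech_to_image(audio_path):
    # The two models communicate only through the intermediate text,
    # which is why the end-to-end run takes roughly 20 seconds.
    start = time.time()
    text = run_deepspeech(audio_path)
    image = run_stackgan(text)
    print("end-to-end: %.1fs" % (time.time() - start))
    return image
```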

We were voted best presentation in the class on presentation day!