Producing Voice Over for Video Tutorials with Open Source

by Sebastien Mirolo on Wed, 26 Jul 2023

The team has embarked on producing a series of video tutorials. Since none of the team members is a native English speaker and we have been eager to test some new Artificial Intelligence (AI) tools, we decided to experiment with Text-to-Speech technologies - Open Source ones of course.

Finding an Open Source Text-to-Speech model

First, we typed "open source voice generator" into a search engine. It turned up a few recent round-ups (Top 14 Open Source AI Voice Projects 2023).

Many of the tools are written in Python. We tried Uberduck without much success: CUDA usage is hard-coded all over the place, and unfortunately CUDA is no longer available on MacOSX. (As a side note, you can also read about Apple M1/M2 GPU Support in PyTorch.) Nonetheless, in the process, we found out that most Text-to-Speech (TTS) models are trained on The LJ Speech Dataset, a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. Also, most models are referenced on Hugging Face.
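
For models that do not hard-code CUDA, recent PyTorch builds can still use the Apple GPU through the MPS backend. Here is a minimal sketch of how we would select the device, falling back to the CPU when MPS is unavailable - the checkpoint path is a placeholder, not Uberduck's actual API:

import torch

# Prefer the Apple Metal (MPS) backend on M1/M2 Macs, fall back to the CPU otherwise.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Load a checkpoint onto the chosen device instead of calling .cuda() unconditionally.
ckpt = torch.load("model.pth.tar", map_location=device)  # placeholder checkpoint path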

Trying out FastSpeech2

FastSpeech2 has some samples readily available. The command-line interface (CLI) looks pretty straightforward, so we decided to give it a shot.

Despite FastSpeech2 being relatively recent (last commit in 2021), many dependencies are broken. To install it on MacOSX, it is best to install the prerequisites that rely on native libraries through MacPorts instead of using the requirements.txt as specified.

Terminal
$ port install py310-librosa py310-llvmlite py310-numba py310-numpy py310-pytorch py310-regex py310-resampy py310-scikit-learn py310-soundfile py310-sympy py310-tensorboard py310-unidecode py310-yaml

$ git clone https://github.com/ming024/FastSpeech2.git
$ cd FastSpeech2
$ python3.10 -m venv --system-site-packages .venv
$ source .venv/bin/activate
$ pip install g2p-en==2.1.0 inflect==4.1.0 pypinyin==0.39.0 pyworld==0.2.10 tgt==1.4.4 tqdm==4.46.1
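
Before going further, it is worth double-checking that the virtual environment actually sees the MacPorts packages (that is what the --system-site-packages flag is for). A quick sanity check:

# sanity_check.py - confirm the MacPorts-installed packages are visible in the venv
import torch
import librosa
import soundfile

print("torch", torch.__version__)
print("librosa", librosa.__version__)
print("soundfile", soundfile.__version__)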

The source code repository is 512 MB, and there are also 3 pre-trained models (1.1 GB) to download from a Google Drive.

Terminal
$ mkdir -p output/ckpt/LJSpeech output/result/LJSpeech
$ pushd output/ckpt/LJSpeech
#### Download 900000.pth.tar
$ popd

$ mkdir -p output/ckpt/AISHELL3 output/result/AISHELL3
$ pushd output/ckpt/AISHELL3
#### Download AISHELL3_600000.pth.tar-20230728T162101Z-001.zip
$ popd

$ mkdir -p output/ckpt/LibriTTS output/result/LibriTTS
$ pushd output/ckpt/LibriTTS
#### Download LibriTTS_800000.pth.tar
$ popd

We will also need to unzip files in the hifigan directory, and make a few updates to the source code to run with CUDA disabled.

Terminal
$ pushd hifigan
$ unzip generator_LJSpeech.pth.tar.zip
$ unzip generator_universal.pth.tar.zip
$ popd
$ diff -u prev utils/model.py
@@ -17,7 +17,7 @@ def get_model(args, configs, device, train=False):
             train_config["path"]["ckpt_path"],
             "{}.pth.tar".format(args.restore_step),
         )
-        ckpt = torch.load(ckpt_path)
+        ckpt = torch.load(ckpt_path, map_location=torch.device('cpu'))
         model.load_state_dict(ckpt["model"])

     if train:
@@ -60,9 +60,9 @@ def get_vocoder(config, device):
         config = hifigan.AttrDict(config)
         vocoder = hifigan.Generator(config)
         if speaker == "LJSpeech":
-            ckpt = torch.load("hifigan/generator_LJSpeech.pth.tar")
+            ckpt = torch.load("hifigan/generator_LJSpeech.pth.tar", map_location=torch.device('cpu'))
         elif speaker == "universal":
-            ckpt = torch.load("hifigan/generator_universal.pth.tar")
+            ckpt = torch.load("hifigan/generator_universal.pth.tar", map_location=torch.device('cpu'))
         vocoder.load_state_dict(ckpt["generator"])
         vocoder.eval()
         vocoder.remove_weight_norm()
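
Before running the whole pipeline, we can verify that the checkpoints downloaded earlier load correctly on the CPU. A small sketch, assuming the files were placed in the directories created above:

import torch

# Load each downloaded checkpoint on the CPU and show the keys it exposes.
for path in (
    "output/ckpt/LJSpeech/900000.pth.tar",
    "hifigan/generator_LJSpeech.pth.tar",
    "hifigan/generator_universal.pth.tar",
):
    ckpt = torch.load(path, map_location=torch.device("cpu"))
    print(path, "->", sorted(ckpt.keys()))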

At this point we are ready to synthesize our first sentence. Let's go!

Terminal
$ python3 synthesize.py --text "To respond to a questionnaire for your organization, you will have to create a user account and an organization profile." --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
$ open output/result/LJSpeech/To\ respond\ to\ a\ questionnaire\ for\ your\ organization,\ you\ will\ have\ to\ create\ a\ user\ account\ and\ an\ o.wav
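
A tutorial script is made of many sentences, so typing that command for each one gets old quickly. Below is a minimal sketch of a wrapper that reads a hypothetical script.txt (one sentence per line) and calls synthesize.py with the same flags as above:

import subprocess

# Read the tutorial script, one sentence per line, and synthesize each sentence.
with open("script.txt") as script:
    for line in script:
        sentence = line.strip()
        if not sentence:
            continue
        subprocess.run([
            "python3", "synthesize.py",
            "--text", sentence,
            "--restore_step", "900000",
            "--mode", "single",
            "-p", "config/LJSpeech/preprocess.yaml",
            "-m", "config/LJSpeech/model.yaml",
            "-t", "config/LJSpeech/train.yaml",
        ], check=True)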

Shortcomings of FastSpeech2 voice generator

Acronyms typically need to be spelled out (ex: T S P instead of TSP), though some work out-of-the-box (unique ID).

We also had to rewrite one-time as one time, e-mail as mail or email, and support@example.com as support at example dot com.

Sometimes we had to get creative, replacing to with too or rewriting the text so the generated voice over sounds better - especially the intonation at the beginning and end of sentences. Most of these substitutions can be scripted, as sketched below.
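
Here is a minimal sketch of the kind of text pre-processing we ended up applying before handing sentences to the model - the substitution table is illustrative, not exhaustive:

# Substitutions that make the synthesized voice sound more natural.
SUBSTITUTIONS = [
    ("one-time", "one time"),
    ("e-mail", "email"),
    ("TSP", "T S P"),                              # spell out acronyms
    ("support@example.com", "support at example dot com"),
]

def preprocess(text):
    for source, target in SUBSTITUTIONS:
        text = text.replace(source, target)
    return text

print(preprocess("Send a one-time e-mail to support@example.com."))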

Clicks are sometimes noticeable at the start of a clip (clicks also seem to be present in the LJ Speech Dataset itself). We used background music to smooth out the digital-sounding voice artifacts.
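
A short fade-in also helps mask the click at the start of a clip. A sketch using the numpy and soundfile packages installed earlier - the file name and the 20 ms fade length are just illustrative:

import numpy as np
import soundfile as sf

# Apply a short linear fade-in to mask the click at the start of the clip.
data, samplerate = sf.read("output/result/LJSpeech/clip.wav")  # illustrative file name
fade_samples = min(int(0.02 * samplerate), len(data))          # 20 ms fade-in
fade = np.linspace(0.0, 1.0, fade_samples)
if data.ndim > 1:
    fade = fade[:, np.newaxis]                                 # broadcast over channels
data[:fade_samples] *= fade
sf.write("clip-faded.wav", data, samplerate)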

Check out the final Register and create a profile video tutorial. All in all, it is not bad. It is worth investing more time in trying out other Open Source models to compare the generated results.

More to read


More technical posts are also available on the DjaoDjin blog, as well as business lessons we learned running a SaaS application hosting platform.
