Generating picture-in-picture of a speaker with Synthesia

by Sebastien Mirolo on Fri, 4 Aug 2023

After experimenting with AI-generated voice-over in a previous post, it was time to amp it up. In this post, we will sign up for Synthesia and attempt to generate a picture-in-picture of an AI-generated actor taking us through a video tutorial.

Onboarding

On the Synthesia website, I clicked "Create an Account" in the top menu bar and was presented with the pricing page. There is really only one self-service plan you can sign up for at this time, so let's go with it.

The next screen offers the choice to sign up with a Google e-mail address or another e-mail address. I picked the second option. After I entered a name and e-mail address, the site waited for a code to confirm the e-mail I registered with. (Verifying e-mail addresses is more reliable than Captcha at this point to limit the number of bots signing up. Good!)

After confirmation, I was redirected to a payment checkout page on Stripe. (We have come a long way from the days when Stripe sold itself as a Paypal alternative that did not require redirecting to a third-party website for checkout.) With the payment completed, a couple of survey questions about referral and use case are presented, then it is straight into the video editor!

All in all, a decent onboarding workflow.

First scene

We started by copy/pasting the video script into the text input field. For some reason, the site keeps asking for access to the machine's copy/paste buffer. This is an overreach for data privacy, as well as a cybersecurity risk (if you have a tendency to copy/paste your passwords from a password file), with no explanation as to why the website needs those permissions. Everything seems to work without granting them, so good, let's not do that.

The Synthesia voice generator seems to have fewer problems with acronyms than the Open Source Voice Generator we previously used (ex: ESG is pronounced correctly). We still had to rewrite support@example.com as support at example.com, but we didn't have to rewrite example.com as example dot com. In a few places we introduced an extra comma to make the tone sound better, but otherwise rewrites for the sake of the voice generator were minimal.
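The e-mail rewrite described above is easy to automate before pasting a script into the editor. The following is only a minimal sketch: the `speakable` helper and its regular expression are hypothetical, are not part of Synthesia, and only handle plain addresses of the form name@domain.tld.

```python
import re

# Hypothetical helper (not a Synthesia feature): match simple e-mail
# addresses so the "@" can be spelled out for a text-to-speech engine.
EMAIL_RE = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")


def speakable(script: str) -> str:
    """Replace "name@domain.tld" with "name at domain.tld"."""
    return EMAIL_RE.sub(r"\1 at \2", script)


if __name__ == "__main__":
    print(speakable("Write to support@example.com for help."))
```

The domain is left untouched because, in our experience above, example.com did not need to be rewritten as example dot com.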

The slightly annoying thing is that clicking "play" starts playing the generated speech from the beginning instead of from where the text cursor is. We got around this problem by creating a second scene, entering each sentence one by one in that second scene, checking that the generated voice sounds good, then copy/pasting the result at the end of the text in scene one.

You can upload your own videos to Synthesia, but here we are much more interested in using our preferred local tool for video editing. After a few experiments in iMovie, it seems the best approach is to generate a video of a blue background with an avatar in the bottom right corner. We will do that.

It is pretty fun to change avatars and voices. There is no lip movement in edit mode, but that hasn't been a problem.

We have a couple of sentences that sound good, and an avatar in a circle in the bottom right corner of a blue background slide. So let's generate the scene and import it into iMovie to check our workflow.

The site states it would take about 10 min to generate the video. I clicked generate movie. Nothing happened. After about 15 min, I tried again and this time I was redirected to the videos workspace, where my video was shown as being processed. It seems to be a recurring issue that some clicks are not taken into account: you open a menu, click on an action and nothing happens; when you repeat the same steps, the action starts. I haven't been able to copy/paste or duplicate entire scenes, but that hasn't proved a big issue since we are not interested in the online editor beyond the AI-generated avatar here. Keynote works great.

If you select the subtitles option, the subtitles are rendered into the video stream itself. The subtitles text is always generated as a separate track regardless of that option. So it is best not to enable the option, and to download the subtitles text track later on.

Overlaying first scene in iMovie

The video generated by Synthesia is 1920x1080, which is perfect because we recorded the screen grabs with the same dimensions. YouTube also likes 1920x1080.

We added the Synthesia-generated video as an overlay track in iMovie, then used the Green/Blue Screen option, and done. The overlay works.
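For readers curious about what a blue screen overlay does, the general technique is chroma keying: pixels of the overlay that are close to the key color are replaced by the pixel underneath. The sketch below illustrates the idea on tiny frames represented as lists of RGB tuples. It is an assumption-laden toy, not iMovie's actual algorithm; the `chroma_key` function, the key color and the tolerance value are all hypothetical.

```python
def chroma_key(overlay, background, key=(0, 0, 255), tolerance=100):
    """Composite `overlay` on `background`, treating `key` as transparent.

    Both frames are lists of rows of (r, g, b) tuples with identical
    dimensions. A pixel counts as "blue screen" when its Euclidean
    distance to `key` in RGB space is at most `tolerance`.
    """
    limit = tolerance ** 2
    result = []
    for overlay_row, background_row in zip(overlay, background):
        row = []
        for overlay_px, background_px in zip(overlay_row, background_row):
            distance2 = sum((a - b) ** 2 for a, b in zip(overlay_px, key))
            # Keep the walk-through pixel where the overlay is key-colored,
            # otherwise keep the avatar pixel.
            row.append(background_px if distance2 <= limit else overlay_px)
        result.append(row)
    return result


if __name__ == "__main__":
    avatar_frame = [[(0, 0, 255), (200, 150, 120)]]   # blue, skin tone
    screen_frame = [[(10, 10, 10), (20, 20, 20)]]     # walk-through pixels
    print(chroma_key(avatar_frame, screen_frame))
```

This is also why a uniform blue background matters: anything in the overlay close to the key color disappears.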

It is time to produce a full video.

Producing the full video tutorial

We tested the voice generator on each sentence in a separate scene, then copy/pasted that sentence into scene one, and added a 1s silence afterwards so we could easily cut the resulting video in iMovie to synchronize it with the application walk-through recording.

At some point we hit a "The script can't be more than 3500 characters per scene" error. There is no indication of how many characters we have in the scene, or which characters are considered extra. There also seems to be some kind of background task recomputing the character count, because cutting text does not immediately tell you whether you are now within the limit.

We thus created multiple identical scenes, each with text under the 3500-character limit, until the play movie option stopped complaining.
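Since the editor does not show a character count, one option is to split the script locally before pasting it. The following is a minimal sketch that packs whole sentences into chunks of at most 3500 characters. The `split_script` helper is hypothetical, and it assumes the limit counts characters the same way Python's `len` does, which we could not verify.

```python
import re

MAX_CHARS = 3500  # limit quoted in the error message


def split_script(script: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Group whole sentences into chunks no longer than `max_chars`.

    Sentences are split after ".", "!" or "?" followed by whitespace.
    A single sentence longer than `max_chars` is kept whole and would
    still need to be shortened by hand.
    """
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    scenes, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            scenes.append(current)   # close the current scene
            current = sentence       # start the next one
        else:
            current = candidate
    if current:
        scenes.append(current)
    return scenes


if __name__ == "__main__":
    for number, text in enumerate(split_script("One. Two. Three.", 10), 1):
        print(f"scene {number}: {len(text)} characters")
```

Printing the length of each chunk also gives the character count that the online editor does not display.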

In Synthesia, using a template with the avatar in the bottom right corner and then removing the background does not place the avatar in quite the same position as starting from a blank template and adding the avatar in the bottom right corner. The difference is slight but noticeable. Since we were not able to duplicate scenes, we needed to be consistent in how we created them.

After the final video was generated, editing the overlay track to match the application walk-through recording was a bit tedious, though straightforward. In future video tutorials, we might write down the whole voice-over text and generate it beforehand, then record the application walk-through, to see whether that workflow requires less editing.

Check out the resulting Responding to an ESG/Environmental practices Assessment video tutorial. What do you think?