Free neural text to speech with emotion!

Don't you wish VAM had more story-based content? I think the major thing holding this back is the lack of voice. Are you sick of using speech bubbles in VAM? Don't you wish there was an easy way to generate TTS (text to speech) that sounded good and even included emotions? Even better, don't you wish you could do it for free? Well, you're in luck!

Microsoft Azure has amazing neural text to speech, and if you use their Free tier, you can generate 500,000 characters per month! This includes ALL features, such as many voices in a wide range of languages, voices that speak with an accent from another language (for example, speaking English with a Spanish accent), and emotions (called 'speaking styles' in the editor).
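If you want a rough idea of how far the free allowance goes, here is a minimal Python sketch that counts the characters in a script against the 500,000 figure above. The function name and the report layout are made up for illustration, and it only counts plain characters in your text; how Azure actually meters usage may differ, so treat it as an estimate.

```python
# Rough estimate only: counts plain characters per line of dialogue.
# The 500,000 limit is the monthly figure quoted in this guide.
FREE_TIER_CHARS_PER_MONTH = 500_000


def quota_report(lines, limit=FREE_TIER_CHARS_PER_MONTH):
    """Return how many characters the given lines use and what is left."""
    used = sum(len(line) for line in lines)
    return {
        "used": used,
        "remaining": limit - used,
        "percent": round(100 * used / limit, 2),
    }


if __name__ == "__main__":
    script = ["Hello there!", "Hi."]
    print(quota_report(script))
```

Even a long scene is usually only a few thousand characters, so a script would need to be very large before the monthly allowance becomes a concern.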

If you want to hear how the voices sound without going through any setup (you won't be able to download any of the files, though), click this link:

Text to Speech – Realistic AI Voice Generator | Microsoft Azure

Start by setting up a free Azure trial. It's completely free for the first year, and once that first year is up, just set your subscription to the 'no support' option and it will remain free!

Azure Free Trial | Microsoft Azure

Then add Azure speech services.

Create Speech Services - Microsoft Azure

Make sure you select the Free Tier (F0)

Once you have that, just go to Speech Studio and get creating! See below;

Speech Studio - Microsoft Azure

Click 'Audio Content Creation' (the one the red arrow is pointing at.)

Below is what the editor looks like. Select a voice and just start typing!

I've attached the audio samples so you can hear how it sounds, along with the actual transcript in the screenshot above. It's all point and click, and very simple! Just highlight the text that you want spoken with a different emotion, then select the emotion from the drop-down on the right by 'speaking style.'

What's also great is that it can store your projects, so you can always have your written text backed up, and even back up the generated audio files!

If you break your writing up into paragraphs (as distinguished by the numbers on the left), then when you do your audio export you will have the option of saving each paragraph as a separate audio file. So, put each character's lines in a separate paragraph, do your export and bam, you've got all of it in a single shot!

Also, this is all API based... So, if any of you devs out there would like to write a VAM plugin....
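For any devs curious what that could look like, here is a minimal Python sketch, not a finished plugin. It builds the SSML markup for one line spoken in a chosen style and posts it to the Speech REST endpoint. The voice name, the style, the User-Agent string and the function names are just example choices of mine; the key and region have to come from your own Speech resource, and I have not run this against a live account.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", style="cheerful", lang="en-US"):
    """Wrap one line of dialogue in SSML with a speaking style (emotion)."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"
        "</mstts:express-as></voice></speak>"
    )


def synthesize(ssml, key, region):
    """POST the SSML to the Speech service and return the WAV bytes.

    key and region are placeholders for your own Speech resource values.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    request = urllib.request.Request(
        url,
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
            "User-Agent": "vam-tts-sketch",  # arbitrary example name
        },
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    # Only prints the markup; call synthesize() once you have a key and region.
    print(build_ssml("I can't believe you came back!", style="excited"))
```

Not every voice supports every style, so check which styles are offered for the voice you pick in Speech Studio before hard-coding one.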

I hope this is helpful!

1 minute long video clip of a VAM character using the Azure AI generated audio;

Part 2;

Bob Nothing

Typically, I will write up an entire script with various emotions used. (You can even use different emotions within the same line.) Then I export .wav files and select the 'each paragraph is its own file' option. Then, in animation triggers, I use HeadAudio on the Person atom to play the audio file, using the RT Lip Sync plugin. It works really well. If you're going to have a conversation back and forth, here is how I get the spacing right: I'll have my first audio file play 5 seconds in, for example, hit play, then hit stop once the audio file is done playing. Then I hit play again for a second or so to leave a pause, then stop. Then on the next audio trigger I click the 'current time' button, and so on and so on, so the audio spacing is correct.
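If you would rather work out the trigger times up front instead of playing and stopping the timeline by hand, the same spacing can be calculated from the lengths of the exported .wav files. This is a minimal Python sketch under my own assumptions: the file names are hypothetical, and the 5 second start and 1 second pause are just the example values from above. You would still enter the resulting times into the animation triggers yourself.

```python
import wave


def wav_duration(path):
    """Length of a .wav file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def schedule(durations, first_start=5.0, gap=1.0):
    """Start time for each line: previous start + its length + a pause."""
    times = []
    current = first_start
    for duration in durations:
        times.append(round(current, 3))
        current += duration + gap
    return times


if __name__ == "__main__":
    # Hypothetical file names for the exported paragraphs.
    files = ["line_01.wav", "line_02.wav", "line_03.wav"]
    starts = schedule([wav_duration(f) for f in files])
    for name, start in zip(files, starts):
        print(f"{name}: trigger at {start} s")
```

For example, lines that are 2.0, 3.5 and 1.0 seconds long would get triggers at 5.0, 8.0 and 12.5 seconds.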

Well, hopefully I'm about to make your day. Some of the US English voices are multilingual. There is a "Jenny multilingual" voice, for example, that speaks German. Also, if it helps, you can give the English-speaking voices that aren't multilingual a German accent.
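For anyone going the API route rather than the editor, this is roughly what the markup for a multilingual voice looks like. It is a minimal Python sketch that only builds the SSML string: the function name is my own, and the voice name en-US-JennyMultilingualNeural and the de-DE language code are the values I would expect for the "Jenny multilingual" voice speaking German, so double check them against the voice list in Speech Studio.

```python
from xml.sax.saxutils import escape, quoteattr


def build_multilingual_ssml(text, speak_lang="de-DE",
                            voice="en-US-JennyMultilingualNeural"):
    """SSML asking a multilingual voice to speak the text in another language."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f"<lang xml:lang={quoteattr(speak_lang)}>{escape(text)}</lang>"
        "</voice></speak>"
    )


if __name__ == "__main__":
    print(build_multilingual_ssml("Guten Tag! Wie geht es dir?"))
```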

I haven't tested Google's yet, but Azure is light years ahead of AWS Polly, it's free, and there are a lot more options in voices plus emotions. Honestly it crushes AWS. I'm really looking forward to seeing some content from you that uses it! :-)

I agree, it's the best I've tested. I was using AWS Polly for a bit since I had a free year to try it out. It was only costing me like $.35 a month with my usage but I thought hey, why not see if there's anything else out there free, and the Azure stuff is SOOOO much better!

Well, you generate your audio files and then they are standalone/portable. It creates audio files that you download and then use in VAM. Via the API it could be done in real time/live, but I doubt anyone would want to front the cost of that, and I don't really see the advantage.
