Hi! I'm looking for recommendations on AI text-to-speech generators that create believable and emotional voices that fit the scene.
(Voice in the video was generated with Elevenlabs, see below.)
What I'm looking for?
- Text-to-Speech (Speech-to-Speech may work, too)
- Reproducible results. Generating several voice lines one by one with the same settings should produce the same tone/voice, so they all match.
- Ideally direct control over emotion, speed, emphasis of individual words
- Good selection of voices
- Ideally also giggling / laughing
- Ideally allowed for commercial use
- As little noise/artifacts as possible
Why? Some context...
I'm currently working on adding the ability to my Life plugin to play voice audio files between breaths. It won't be as good as an actual recorded voice with breathing, but you won't have to decide whether you want breathing or voice anymore. My medium-term goal is to upgrade my "The Remote" scene with randomly played voice lines. For that I obviously need a confident, dominant female voice.
Based on searching the forums here (@atani in particular seems to have done some research), I have tried the following AIs so far. Sadly, none of them is really any good for building games where you have individual voice lines, which is what we would need for building non-linear scenes in VAM. Since these AIs are popping up everywhere at the moment, maybe someone here has more time and has already tried them all?
TTSMP3
- Based on Amazon Polly
- Allows control of speed/pitch via meta-tags in the text input:
- Example: <prosody rate="fast" pitch="20%">Sending out the ninja sharks.</prosody>
- OK, but the voice sounds somewhat robotic and unnatural.
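Since the prosody tag gives per-line control, one way to keep a batch of voice lines consistent is to wrap every line in the same settings before pasting it into the text input. Below is a minimal Python sketch of that idea, assuming the service accepts the same <prosody> attributes as in the example above. The helper name `prosody_line` and the second sample line are made up for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody_line(text, rate="medium", pitch="+0%"):
    """Wrap one voice line in a <prosody> tag with the given settings.

    The text is XML-escaped so characters like & or < do not break the markup.
    """
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )


# Hypothetical batch of voice lines, all sharing one set of settings.
voice_lines = [
    "Sending out the ninja sharks.",
    "Stay where you are & wait.",
]

for line in voice_lines:
    print(prosody_line(line, rate="fast", pitch="20%"))
```

Each printed line can then be submitted on its own. Whether the resulting audio really matches from line to line still depends on the service.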
Uberduck AI
- Very few female voices?
- Terrible quality, lots of noise/artifacts/glitches
- No control at all, beyond choosing a voice
- => Useless?
Elevenlabs
- If you aren't paying attention, you might believe it's a real voice. Best I have seen so far.
- Emotion, speed, tone, etc. for a given input are random and different every time you run it. It's near-impossible to generate matching voice lines one by one. You can only generate them all at once in one big input and then cut the result into pieces by hand. However, that way you burn through your character quota VERY quickly. Also, you can't add more voice lines later without regenerating everything.
- Generating voices is also extremely random and frustrating.
- (I have not tried the voice cloning, which requires payment. Is that any better?)
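The "generate everything in one input, then cut it into pieces" workaround could probably be automated by splitting the long recording at the pauses between lines. Below is a rough Python sketch using only the standard library. It assumes the audio has first been converted to a 16-bit mono WAV file and that the lines are separated by clear pauses. The function names, the amplitude threshold of 500 and the 0.4 s minimum pause are illustrative guesses that would need tuning per voice.

```python
import array
import sys
import wave


def find_segments(samples, rate, threshold=500, min_silence_s=0.4):
    """Return (start, end) sample index pairs for stretches of sound that are
    separated by more than min_silence_s seconds below the threshold."""
    min_gap = int(min_silence_s * rate)
    segments = []
    start = None      # index where the current segment began
    last_loud = None  # index of the most recent loud sample
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            elif i - last_loud > min_gap:
                # The pause was long enough: close the previous segment.
                segments.append((start, last_loud + 1))
                start = i
            last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments


def split_wav(path, out_prefix, threshold=500, min_silence_s=0.4):
    """Write one WAV file per detected segment and return the file names."""
    with wave.open(path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("this sketch expects a 16-bit mono WAV file")
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    out_paths = []
    segments = find_segments(samples, rate, threshold, min_silence_s)
    for n, (a, b) in enumerate(segments, 1):
        chunk = samples[a:b]
        if sys.byteorder == "big":
            chunk.byteswap()  # back to little-endian for writing
        out_path = f"{out_prefix}_{n:02d}.wav"
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(chunk.tobytes())
        out_paths.append(out_path)
    return out_paths
```

For example, `split_wav("all_lines.wav", "line")` (with a hypothetical input file) would write `line_01.wav`, `line_02.wav`, and so on. This only saves the manual cutting. It does not fix the quota problem or the need to regenerate everything when adding lines later.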