AI Voice recommendations?

MacGruber

Hi! I'm looking for recommendations on AI text-to-speech generators that create believable and emotional voices that fit the scene.

What am I looking for?
  • Text-to-Speech (Speech-to-Speech may work, too)
  • Reproducible results. Generating multiple different voice-lines one by one with the same settings should use the same tone/voice, so they all match.
  • Ideally direct control over emotion, speed, emphasis of individual words
  • Good selection of voices
  • Ideally also giggling / laughing
  • Ideally allowed for commercial use
  • As little noise/artifacts as possible

Why? Some context...
I'm currently working on adding the ability to my Life plugin to play voice audio files between breaths. It won't be as good as an actual recorded voice with breathing, but you won't have to choose between breathing and voice anymore. The medium-term goal is to upgrade my "The Remote" scene with randomly played voice lines. For that I obviously need a confident, dominant female voice.
(Voice in the video was generated with Elevenlabs, see below.)


Based on searching the forums here (especially @atani seems to have done some research), I have tried the following AIs so far. Sadly, none of them are really any good for building games where you have individual voice lines, which is what we would need for building non-linear scenes in VAM 🙁 Since these AIs are coming up everywhere at the moment, maybe someone here has more time and has already tried them all?

TTSMP3
  • Based on Amazon Polly
  • Allows control of speed/pitch via meta-tags in the text input:
    • Example: <prosody rate="fast" pitch="20%">Sending out the ninja sharks.</prosody>
  • OK, but the sound is somewhat robotic and not natural.
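
For generating many lines with the same settings, the prosody tag above could be built by a small helper instead of being typed by hand. This is only a minimal sketch: the function name `prosody` and its defaults are made up for illustration, and which rate/pitch values a given service accepts is something to check against that service.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate="medium", pitch="+0%"):
    """Wrap text in an SSML <prosody> tag, escaping special characters.

    Hypothetical helper: rate and pitch are passed through unchanged,
    so it is up to the caller to use values the TTS service accepts.
    """
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )


line = prosody("Sending out the ninja sharks.", rate="fast", pitch="20%")
print(line)
```

Keeping rate and pitch in one place like this makes it easier to reuse exactly the same settings for every voice line.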

Uberduck AI
  • Very few female voices?
  • Terrible quality, lots of noise/artifacts/glitches
  • No control at all, beyond choosing a voice
  • => Useless?

Elevenlabs
  • If you aren't paying attention, you might believe it's a real voice. Best I have seen so far.
  • Emotion, speed, tone, etc. for a given input is random and different every time you run it. It's near-impossible to generate matching voice lines. You can only generate them all at once in one big input and then cut it into pieces by hand. However, that way you burn through your character quota VERY quickly. Also you can't add more voice lines later, without regenerating everything.
  • Generating voices is also extremely random and frustrating.
  • (Have not tried the voice cloning, which requires payment, is that any better?)
 
It will be incredibly fun to see the development of this!

My experience with Elevenlabs is that voices created with cloning give better results.
I don't know why, but they feel more alive. You do need good sound quality in the clips you clone from, but then a one-minute audio clip is enough for good results.
But yes, the results are very random. You may have to regenerate the same sentence several times while changing the voice settings to get the feeling you want. But then it can be really good.
 
Well now, let's put some quotation marks on my "research" 😄
I tried Uberduck, Azure TTS, ElevenLabs, and Tortoise TTS.

Uberduck is very limited in female voices and generally gives poor results, acceptable at best when a voice had a lot of training sources.

I would expect Azure TTS to be similar to the TTSMP3 service you mention. You get a lot more control over the provided voices through tags (SSML or something), and the sources are real voices. They tend to sound robotic, like a news anchor reading the news, which is not great for VaM "action" purposes, but the configuration options and language availability make it interesting in some scenarios.

Tortoise TTS is like Stable Diffusion but for audio. I have limited experience, about a day's worth, and I tried some training for voice cloning but failed to produce anything decent. I've heard results from others that sounded really good, but it takes a long path, or a better understanding of the jargon, to get there. However, once you know your way around you have immense control and replicability, so in the long run this may be the best option if configuration and replicability are crucial. Maybe someday I can do something decent here.

ElevenLabs has been quite a surprising tool. It's probably Tortoise on cocaine, but limited in configuration and optimised for American English accents. I've tried the voice cloning and damn, I am surprised at how well it handled the voices; they do sound like the source material. Of course, the source material has to be a fairly plain US accent with decent clarity. I have not yet seen such good results for a natural voice, and so quickly, from any other tool like this one.
The configuration is limited to a stability slider (variable to stable) and a clarity+enhancement slider (low to high), the same system as in the free tier. Lower stability values give more expressive results, and higher clarity makes the voice cleaner but also more Americanised. You are charged for each character you process, so every run costs you; sometimes you get a keeper on the first run, sometimes you need a few runs.

So far I have made many voice clones for personal use, using around 20 for stability and 85 for clarity, and they deliver good results in expression and voice likeness. The output is not predictable though: a run can go through different emotions and pitches, and sometimes it goes nuts too, which is funny to hear but also sad as you wasted characters. For personal use they are good, the best I've seen so far, but replication, consistency, and configurability are things you don't have.

There's no killer app yet; they all have their pros and cons. In the long run, TortoiseTTS is likely the tool to use for configurability and lack of restrictions, but it's much slower and it takes a long time to learn to use it well. 11 Labs has the best results for a real person speaking, but little configuration, unreliable output, and weird payment schemes.
 
I've tried Tortoise. It's not bad, especially for being offline, but it's not real-time. You do have to play with the outputs to get it right. I made little clips and joined them together, but the result is inconsistent. For example, to read out the current hour I chain multiple single-word clips together.

You'd have to feed in a dictionary of common words, then build a database of wav files, or one large wav file which you can seek through with a table. There is no inflection with this method, but it's good for a monotone VR-butler type of app. You could probably shift the pitch a bit to vary it, but I'm still looking for better methods.

"It's one oclock" = Its.wav + oneoclock.wav
"It's now two oclock" = Its.wav + now.wav + twooclock.wav
"It's currently four oclock" = Its.wav + currently.wav + fouroclock.wav
 
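
The clip-chaining idea above could be sketched roughly like this with Python's standard `wave` module. The lookup table, its keys and the function names are hypothetical; the sketch assumes every clip was exported with the same channel count, sample width and sample rate, and it refuses to mix formats.

```python
import wave

# Hypothetical lookup table: phrase -> pre-generated clip file.
CLIPS = {
    "its": "Its.wav",
    "now": "now.wav",
    "currently": "currently.wav",
    "one oclock": "oneoclock.wav",
}


def clips_for(tokens, table=CLIPS):
    """Translate a list of phrase keys into the clip files to play."""
    return [table[token] for token in tokens]


def concat_wavs(clip_paths, out_path):
    """Append the audio frames of each clip, in order, into one WAV file."""
    fmt = None
    frames = []
    for path in clip_paths:
        with wave.open(path, "rb") as clip:
            this_fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
            if fmt is None:
                fmt = this_fmt
            elif this_fmt != fmt:
                raise ValueError(f"{path} has format {this_fmt}, expected {fmt}")
            frames.append(clip.readframes(clip.getnframes()))
    if fmt is None:
        raise ValueError("no clips given")
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(frames))
```

Something like `concat_wavs(clips_for(["its", "now", "one oclock"]), "line.wav")` would then produce one file per sentence, with the same lack of inflection described above.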

Attachments

  • oclock.mp4
    191 KB
So, I played with TortoiseTTS for a couple of hours today. Besides struggling to get it working despite broken Python dependencies, I couldn't get any useful results from it. The included voices are mostly newscaster or narrator-style voices, so not useful for our purposes. Even with them, anything longer than 1-2 sentences seems to have a high chance of producing artifacts. I tried to train my own voice, but it's really hard to get good enough audio samples without music, clicks, noises or other voices in the background. It might need some more experimentation, but it's not looking promising at the moment.

For anyone having the same install trouble:
  • Installing the dependencies from requirements.txt as per the install instructions failed for me with obscure error messages, likely because it's trying to install ancient versions of some libraries. If that happens to you too, open requirements.txt and remove the version locks for the scipy, numpy and numba packages, so it just installs the current versions of those. Then save the file and try again.
  • Also, "pysoundfile" doesn't seem to be needed. Not sure whether trying to install it was the cause of my troubles.
  • My attempts at fixing the issues apparently broke my Anaconda install. I could only get it working after uninstalling Anaconda and reinstalling to start fresh.
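
The requirements.txt edit described above can also be done with a few lines of Python instead of by hand. This is a minimal sketch, not part of Tortoise itself: the function name `unpin` is made up, and the sample text in the usage note is invented, not the real file contents.

```python
import re

# Packages whose version locks should be dropped, as listed above.
UNPIN = {"scipy", "numpy", "numba"}


def unpin(requirements_text, packages=UNPIN):
    """Return requirements text with version specifiers removed for `packages`.

    All other lines (including comments and blank lines) are kept unchanged.
    """
    result = []
    for line in requirements_text.splitlines():
        match = re.match(r"\s*([A-Za-z0-9_.\-]+)", line)
        name = match.group(1).lower() if match else ""
        result.append(name if name in packages else line)
    return "\n".join(result) + "\n"
```

For example, `unpin("tqdm\nnumpy==1.20.0\n")` gives `"tqdm\nnumpy\n"`. To apply it, read requirements.txt, pass the text through `unpin`, and write the result back (keeping a backup of the original is probably wise).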
 
This is one I tried:

It will likely never reach the quality you seek, it's more to the crude enjoyment of the masses.
 
Hi MacGruber

I've tried Silero TTS. It is not perfect, but quite good, and they have 100+ female voices.

I use it as an extension in the Oobabooga text-generation-webui to get TTS during chats.



It would be great to be able to integrate it into VAM.
 
Hello,


I find Tortoise to be top notch and trouble-free once properly installed! I have a 4070 Ti and, I confess, generation takes a long time, and my card makes a fucking coil whine while it writes the voice lol.

Personally, I did it like this to avoid problems:
Install default Miniconda and git.

My commands :

conda create -n Tortoise python=3.8

conda activate Tortoise

cd miniconda3

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

git clone https://github.com/neonbjb/tortoise-tts

cd tortoise-tts

pip install numpy==1.23.0

python setup.py install

conda install --channel=numba llvmlite

pip uninstall pydantic

pip install pydantic==1.9.1

To test:

python tortoise/do_tts.py --text "Hi, nice to see you." --voice mol --preset ultra_fast



It's true that it takes very long with the fast preset... With ultra_fast it's 8x faster, but the quality is worse; you can hear the robot coming back...
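
For offline generation of several voice lines, the test command above could be repeated from a small script. This is only a sketch: it uses just the `--text`, `--voice` and `--preset` options shown in the test command, the second text line is an invented example, and it only builds and prints the commands rather than running them.

```python
import shlex


def tortoise_commands(lines, voice, preset="fast"):
    """Build one do_tts.py command (as an argument list) per voice line."""
    return [
        ["python", "tortoise/do_tts.py",
         "--text", text, "--voice", voice, "--preset", preset]
        for text in lines
    ]


cmds = tortoise_commands(
    ["Hi, nice to see you.", "See you tomorrow."],  # second line is made up
    voice="mol",
    preset="ultra_fast",
)
for cmd in cmds:
    print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` from inside the tortoise-tts folder. It is worth checking where do_tts.py writes its output first, so that later runs don't overwrite earlier ones.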
 
Hey. How fast does it generate ?
 
That is really bad :(. Anything beyond 1sec is not good enough
We have different goals: you want live responses. Speed doesn't matter that much when you just want to generate voice lines offline.
Still, I played around a bit more with TortoiseTTS. I tried some better-quality voice inputs, still not ideal ones, but definitely better than the included samples. However, I still can't get anything useful out of it.

I also played with Elevenlabs again and am getting decent results with it. But you really need to spend your entire monthly free character budget on hitting the "Generate" button some 50x to produce random voices until you get a decent one. At least you can store the voice once you have it. Still, a somewhat weird/annoying business model. Other interesting discoveries:
  • You can emphasize individual words by putting quotation marks around them. Like so:
    • From now on, I'm your "goddess".
  • You can create pauses by using three dots: ...
  • I had some success editing the generated audio afterwards with Audacity. E.g. playing with tempo/pitch, sometimes for individual words.
  • Some have been experimenting with spamming question or exclamation marks, underscores, etc.
  • Also interesting, emotion is derived from context. So you need to add a bit around your actual sentence: https://www.reddit.com/r/ElevenLabs/comments/11ftdss/how_do_you_put_emotions_into_the_voice/
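
The text tricks listed above (quotation marks for emphasis, three dots for pauses, surrounding context for emotion) could be wrapped in small helpers so every line is prepared the same way. This is a sketch with hypothetical function names; it only prepares the input text and does not call any Elevenlabs API. The narration lead-in in the usage note is an invented example.

```python
def emphasize(sentence, word):
    """Put quotation marks around the first occurrence of `word`."""
    return sentence.replace(word, f'"{word}"', 1)


def with_pause(*parts):
    """Join text fragments with three dots to suggest a pause between them."""
    return " ... ".join(part.strip() for part in parts)


def with_context(line, lead_in):
    """Prefix the spoken line with narration that hints at the emotion."""
    return f'{lead_in} "{line}"'


print(emphasize("From now on, I'm your goddess.", "goddess"))
print(with_pause("From now on", "I'm your goddess."))
print(with_context("From now on, I'm your goddess.", "She said confidently,"))
```

With the context approach the lead-in presumably gets spoken too, so it would have to be cut from the generated audio afterwards, for example in Audacity.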
 
Very interesting. I need to try those methods. Three dots and quotes. Yeah it's very weird. I've already spent more than I wanted on it.
 
It's true that the sample was an original voice... not a good example on my part...

I had fun taking all the voice lines of the Overwatch 2 characters in .ogg format and converting them to .wav.
Tortoise did the job, and I find the result surprising given how little information it has...

And I like being able to provoke a different intonation with, for example,
[Sad] or [Angry] before a sentence or word to give it weight.

With punctuation and some expression brackets, you can create some good stuff with tortoise TTS.


This:

to save time, but I can't get it to work :/


And this: https://huggingface.co/Snowad/French-Tortoise

to replace the intonation with a bit of French. It could change things, but same problem: I don't know how to install it :/ ^^ I suspect it's not great anyway, which is too bad for me since I'm French.
 
I'm in my second month of subscription with 11 Labs. The first month was only $1, the second $5, and I think it was worth it, at least for me. The audio cloning is really amazing. However, as said, you need a good audio source; the higher the volume (dB), the better. I cloned some voices and generated lines, but because of the different volume levels I now run into issues when using them in scenes. The low-volume ones won't work well with LipSync, while the standard 11 Labs voices do. If you can rework the clips and bring all the different voices to the same level, you are fine.
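
Bringing clips to the same level could be sketched like this with the Python standard library. Note the assumptions: it handles 16-bit PCM WAV files only, and it uses simple peak normalization as a stand-in, which is not the same as matching perceived loudness. The function name and the 0.9 default are made up for illustration.

```python
import array
import sys
import wave


def normalize_peak(in_path, out_path, target_peak=0.9):
    """Scale a 16-bit PCM WAV so its loudest sample hits target_peak of full scale."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    peak = max((abs(s) for s in samples), default=0)
    if peak:
        gain = target_peak * 32767 / peak
        # Clamp to the 16-bit range to guard against rounding overflow.
        samples = array.array(
            "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
        )

    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(samples.tobytes())
```

Running every clip through `normalize_peak` with the same `target_peak` gives them the same maximum level. Audacity offers similar normalization interactively, if a script is overkill.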

Edit: And yes, it is limited to mostly US/UK English. But I trained the voice on some other languages (certain words) and uff, it is so cute with the US/UK accent.
 
How about running Silero locally for TTS, and Whisper for speech recognition?

Add Oogabooga and Pygmalion and, oh boy...
 
Is there any tutorial out there for such a set-up? I'm completely new to this and I worry a bit that my PC specs are too low to run or train it locally.
 
I used Tortoise TTS for months before ElevenLabs was created. Without fine-tuning, which the author hasn't provided straightforward instructions to do, it just isn't worth the effort compared to paying 5 bucks a month for ElevenLabs voice cloning, imo.

I find I get the most consistent results from ElevenLabs by framing it like the voice is narrating an erotic audiobook. As an example, I've gotten a lot of mileage out of this text from various voices, with Voice Settings > Style Exaggeration set to 50% and all other settings at default:

She moaned, feeling intense pleasure, "Oh god... Oh my god... Grab my breasts.... Oh my god... Oh god... Grab my breasts.... Oh my god... Oh god... Oh my god... Push deeper inside me.... Oh my god... Oh god... Push deeper inside me.... Oh my god... Oh god... This feels so good.... Oh god... This feels so good.... Oh god... I'm gonna come....".

Then, her breath just a whisper as she could barely speak, her legs trembling, "Oh god... I'm gonna come....Oh god... I'm coming... Oh god... I'm coming..."
 
I've been waiting for this feature... Azure AI's Speech Studio is, as of now, the most user-friendly tool I've encountered. It meets at least the following criteria:

  • Text-to-Speech
  • Reproducible results.
  • Ideally direct control over emotion, speed, emphasis of individual words
  • Good selection of voices
  • As little noise/artifacts as possible
Azure's AI models allow you to adjust the emotional tone, speech rate, and other aspects of individual words within a sentence through markup tags. However, when it comes to sound effects like laughter, it may not perform as well as desired.
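
As a rough illustration of that kind of per-line control, Azure's SSML could be generated like this. It is a sketch under assumptions: the voice name `en-US-JennyNeural` and the `cheerful` style are just example values, not every voice supports every style, and the function only builds the SSML string without sending it to Azure.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def azure_ssml(text, voice="en-US-JennyNeural", style="cheerful", rate="+0%"):
    """Build an SSML document with a speaking style and a speech rate."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</mstts:express-as></voice></speak>"
    )


ssml = azure_ssml("From now on, I'm your goddess.", rate="-10%")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Because style and rate are explicit parameters, the same values can be reused for every line, which is what makes the results reproducible compared to the slot-machine feeling mentioned below.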

Elevenlabs does indeed produce more realistic voices, and you can upload and train your own models (which takes about a month, if I recall correctly). However, it may not offer the same richness and flexibility in handling emotional tones as Azure. As you mentioned, using it to generate speech can sometimes feel like playing a slot machine.
 
I'm not sure what you are referencing that would take a month. They do offer "Professional Voice Cloning" but only of your own voice, which isn't going to be much use to most people for erotic utterance generation.

Custom voice cloning literally takes a minute or less once you upload a sample MP3.
 
Indeed... It's unlikely that anyone would use their own voice for adult-oriented purposes :sneaky: but who knows. I have a female friend who said she could voice my scenes, but that's quite cumbersome in practice. I once thought about having her clone her voice with Elevenlabs, which seems like a solution. I don't know if she would consent to this...
 
I used Tortoise TTS for the voiced intro I wrote for my "Drums in the Deep" scene, and adjusted the speed/pitch of the output myself. It was way more effort than I expected, and it took ages. The first word was Moria, and I had to try so many times with different spellings just to make it sound right. It kept wanting to say "Mariah" regardless of how I spelled it. Eventually it accepted "Moorear".
I was surprised to find it did seem to have a kind of context-based emotional tone, like it added emotion to phrases like "her dreams" without any prompts. I was happy with the final result, the timing and cadence worked perfectly, but in future I'll probably try something a little more user-friendly.
 