AI Voice recommendations?

MacGruber · Apr 9, 2023

Hi! I'm looking for recommendations on AI text-to-speech generators that create believable and emotional voices that fit the scene.

What I'm looking for?

Text-to-Speech (Speech-to-Speech may work, too)
Reproducible results. Generating multiple different voice-lines one by one with the same settings should use the same tone/voice, so they all match.
Ideally direct control over emotion, speed, emphasis of individual words
Good selection of voices
Ideally also giggling / laughing
Ideally allowed for commercial use
As little noise/artifacts as possible

Why? Some context...
I'm currently working on adding the ability to play voice audio files between breaths to my Life plugin. It won't be as good as an actual recorded voice with breathing, but you won't have to decide whether you want breathing or voice anymore. The medium-term goal is to upgrade my "The Remote" scene with randomly played voice lines. For that I obviously need a confident, dominant female voice.

(Voice in the video was generated with Elevenlabs, see below.)

Based on searching the forums here (@atani especially seems to have done some research), I have tried the following AIs so far. Sadly, none of them are really any good for the purpose of building games where you have individual voice lines, which is what we would need for building non-linear scenes in VAM. Since these AIs are coming up everywhere at the moment, maybe someone here has more time and has already tried them all?

TTSMP3

Based on Amazon Polly
Allows control of speed/pitch via meta-tags in the text input:
- Example: <prosody rate="fast" pitch="20%">Sending out the ninja sharks.</prosody>
OK, but it sounds somewhat robotic and unnatural.
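
The prosody tag above can be generated per voice line instead of being typed by hand. This is only a minimal sketch: the helper name prosody() and its defaults are made up for illustration, and it only builds the tag text shown in the example; it does not talk to TTSMP3 or Amazon Polly.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, rate="medium", pitch="medium"):
    """Wrap text in an SSML <prosody> tag (hypothetical helper).

    XML special characters in the text are escaped so the tag stays valid.
    """
    return (
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}</prosody>"
    )


# Same settings as the example in the post above.
line = prosody("Sending out the ninja sharks.", rate="fast", pitch="20%")
print(line)
```

Keeping rate and pitch in one place like this makes it easier to reuse identical settings for every line of a scene.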

Uberduck AI

Very few female voices?
Terrible quality, lots of noise/artifacts/glitches
No control at all, beyond choosing a voice
=> Useless?

Elevenlabs

If you aren't paying attention, you might believe it's a real voice. Best I have seen so far.
Emotion, speed, tone, etc. for a given input is random and different every time you run it. It's near-impossible to generate matching voice lines. You can only generate them all at once in one big input and then cut it into pieces by hand. However, that way you burn through your character quota VERY quickly. Also you can't add more voice lines later, without regenerating everything.
Generating voices is also extremely random and frustrating.
(Have not tried the voice cloning, which requires payment, is that any better?)

Rickyziggy · Apr 9, 2023

It will be incredibly fun to see the development of this!

My experience with Elevenlabs is that voices created with cloning give better results.
I don't know why, but they feel more alive. You do need good sound quality in the audio you clone from, but then a one-minute clip is enough for good results.
And yes, it gives very random results. You may have to regenerate the same sentence several times, changing the voice settings each time, to get the feeling you want. But then it can be really good.

atani · Apr 9, 2023

Well now, let's put some quotation marks on my "research".
I tried Uberduck, Azure TTS, ElevenLabs, and Tortoise TTS.

Uberduck is very limited on female voices and generally not so good results, acceptable at best if it had a lot of training sources.

I would expect Azure TTS to be similar to the TTSMP3 service you mention. You get a lot more control over the provided voices through tags (SSML or something), and the sources are real voices. They tend to sound robotic, like a news anchor reading the news, which is not great for VaM "action" purposes, but the configuration options and language availability make it interesting in some scenarios.

Tortoise TTS is like Stable Diffusion but for audio. I have limited experience, mainly a day's worth of it, and tried to do some training on voice cloning, but failed to produce anything decent. I've seen results from others that sounded really good, but it takes quite a long path, or a better understanding of the jargon, to get good results. However, when you know your way around, you can have immense control and replicability. In the long run this may be the best option if configuration and replicability are crucial. Maybe someday I can do something decent here.

ElevenLabs has been quite a surprising tool. It's probably Tortoise on cocaine, but limited in configuration and optimised for American English accents. I've tried the voice cloning and damn, I am surprised at how well it handled the voices; they do sound like the source material. Of course, the source material has to be in a fairly plain US accent with decent clarity. I have not yet seen such good results for a natural voice, and so quickly, from other tools like this one.
The configuration is limited to two sliders, stability (variable to stable) and clarity + enhancement (low to high), the same system as in the free tier. Lower stability values give results with more expression, and higher clarity makes the voice cleaner but also more Americanised. You are charged for each character you process, so every round costs you; sometimes you get a keeper on the first run, sometimes you need a few runs.

So far I have made many voice clonings for personal use, using around 20 for stability and 85 for clarity, and they deliver good results in expression and voice cloning. These values are not predictable though, a run can go through different emotions and pitches, sometimes they go nuts too, which is funny to hear but also sad as you wasted characters. For personal use they are good, best I've seen so far, but replication, consistency, or configuration ability is something you don't have.

There's no killer app yet, they all have their pros and cons. In the long run, TortoiseTTS is likely the tool to use for configuration capability and no restrictions, but it's much slower and takes a long time to be able to use it well. 11 Labs has the best results for a real person speaking, but little configuration, unreliable, and weird paying schemes.

VRStudy · Apr 9, 2023

I've tried Tortoise. It's not bad, especially for being offline, but it's not realtime. You do have to play with the outputs to get it right. I did make little clips and join them together, but it's inconsistent. For example, to read out the current hour, you chain multiple single-word clips together.

You'd have to feed in a dictionary of common words, then build a database of wav files or one large wav file which you can seek through with a table. There is no inflection in this method, but it's good for a monotone VR butler type app. You could probably shift the pitch a bit to vary it, but I'm still looking for better methods.
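
A minimal sketch of this clip-chaining idea, using only Python's standard wave module. The function name concat_wavs() is made up for illustration, and it assumes every clip was exported with the same channel count, sample width and sample rate (which should hold if they all come from the same TTS run settings).

```python
import wave


def concat_wavs(clip_paths, out_path):
    """Join WAV clips end to end into one file (hypothetical helper).

    All clips must share channels, sample width and sample rate;
    a mismatch raises ValueError instead of writing a corrupt file.
    """
    fmt = None
    chunks = []
    for path in clip_paths:
        with wave.open(path, "rb") as clip:
            clip_fmt = (
                clip.getnchannels(),
                clip.getsampwidth(),
                clip.getframerate(),
            )
            if fmt is None:
                fmt = clip_fmt
            elif clip_fmt != fmt:
                raise ValueError(f"{path}: format {clip_fmt} differs from {fmt}")
            chunks.append(clip.readframes(clip.getnframes()))
    if fmt is None:
        raise ValueError("no clips given")
    with wave.open(out_path, "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(chunks))
```

Called as concat_wavs(["Its.wav", "now.wav", "twooclock.wav"], "sentence.wav"), this would build one sentence from the word clips. The joins are hard cuts, so the lack of inflection mentioned above still applies.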

"It's one oclock" = Its.wav + oneoclock.wav
"It's now two oclock" = Its.wav + now.wav + twooclock.wav
"It's currently four oclock" = Its.wav + currently.wav + fouroclock.wav

MacGruber · Apr 10, 2023

So, I was playing with TortoiseTTS for a couple of hours today. Besides struggling to get it working despite broken Python dependencies, I couldn't get any useful results from it. The included voices are mostly newscaster or narrator-style voices, so not useful for our purposes. Even with them, anything longer than 1-2 sentences seems to have a high chance of producing artifacts. I tried to train my own voice, but it's really hard to get good enough quality audio samples where there is no music, clicks, noise or other voices in the background. It might need some more experimentation, but it's not looking promising at the moment.

For anyone having the same install trouble:

Installing the dependencies from requirements.txt as per install instructions failed for me with obscure error messages. Likely because it's trying to install century old versions of some libraries. If that happens to you too, open requirements.txt and remove the version-locks for scipy, numpy and numba packages. Just have it install the current version of those. Then save the file and try again.
Also, "pysoundfile" doesn't seem to be needed. I'm not sure whether trying to install it was the cause of my troubles.
My attempts at fixing the issues apparently broke my Anaconda install. I could only get it working after uninstalling Anaconda and reinstalling it to start fresh.
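
For anyone who prefers not to edit the file by hand, here is a rough sketch of the same fix in Python. The function names are made up for illustration, and the sample lines in the comment are hypothetical, not copied from the real requirements.txt.

```python
import re

# Packages whose version locks get removed, as described above.
UNPIN = {"scipy", "numpy", "numba"}


def unpin(lines, packages=UNPIN):
    """Return requirement lines with version specifiers dropped for `packages`.

    E.g. a hypothetical "numpy==1.20.0" becomes "numpy"; other lines are kept.
    """
    result = []
    for line in lines:
        match = re.match(r"\s*([A-Za-z0-9_.\-]+)", line)
        if match and match.group(1).lower() in packages:
            result.append(match.group(1))
        else:
            result.append(line.rstrip("\n"))
    return result


def unpin_file(path="requirements.txt"):
    """Rewrite the requirements file in place (keep a backup copy first)."""
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(unpin(lines)) + "\n")
```

After running unpin_file() in the tortoise-tts folder, retry the install as per the instructions.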

atani · Apr 10, 2023

This is one I tried:

ai-voice-cloning

Collection of utilities aimed to voice clone through AI

git.ecker.tech

It will likely never reach the quality you seek; it's more for the crude enjoyment of the masses.

subboyinla · Apr 11, 2023

Hi MacGruber,

I've tried Silero TTS. It is not perfect, but quite good, and they have 100+ female voices.

With the Oobabooga text generation webui, you can use it as an extension to have TTS during chats.

GitHub - snakers4/silero-models: Silero Models: pre-trained text-to-speech models made embarrassingly simple

GitHub - oobabooga/textgen: Open-source desktop app for local LLMs. Text, vision, tool-calling, OpenAI/Anthropic-compatible API. 100% private.

It would be great to be able to integrate it into VAM.

LUXDAR · Apr 14, 2023

Hello,

I find Tortoise to be top notch and trouble free once properly installed! I have a 4070 Ti and, I confess, it's slow, and my card makes a fucking coil whine when it generates the voice, lol.

Personally, I did it like this to avoid problems:
Install default miniconda and git.

My commands:

conda create -n Tortoise python=3.8

conda activate Tortoise

cd miniconda3

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

git clone https://github.com/neonbjb/tortoise-tts

cd tortoise-tts

pip install numpy==1.23.0

python setup.py install

conda install --channel=numba llvmlite

pip uninstall pydantic

pip install pydantic==1.9.1

To test:

python tortoise/do_tts.py --text "Hi, nice to see you." --voice mol --preset ultra_fast

It's true that it's very slow with the fast preset... With ultra_fast, it's 8x faster but the quality is worse; you can feel the robot coming back...
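
For generating several voice lines with the same voice and preset, the test command above can be built in a loop. This is only a sketch: build_commands() and the example lines are made up, and it only uses the --text, --voice and --preset flags shown above. Identical settings do not guarantee an identical tone, since generation is still random.

```python
import shlex

# Script path as used in the test command above (run from the tortoise-tts folder).
DO_TTS = "tortoise/do_tts.py"


def build_commands(lines, voice, preset="fast"):
    """Return one argv list per voice line, all sharing voice and preset."""
    return [
        ["python", DO_TTS, "--text", text, "--voice", voice, "--preset", preset]
        for text in lines
    ]


# Hypothetical example lines.
voice_lines = [
    "Hi, nice to see you.",
    "How are you today?",
]

for argv in build_commands(voice_lines, voice="mol", preset="ultra_fast"):
    # Print the shell-quoted command; pass argv to subprocess.run() to execute it.
    print(shlex.join(argv))
```

Before running these back to back, check where do_tts.py writes its output, so a later run does not overwrite an earlier one.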

LUXDAR · Apr 14, 2023

For this: "Hello! Nice to see you. How are you today, my boy?"

Preset ultra_fast: 20 seconds.
Sample sound: https://anonfiles.com/8cd0k9laz3/train_grace_ultra_fast_wav

Preset fast: 54 seconds.
Sample sound: https://anonfiles.com/T8d2k4laz0/train_grace_fast_wav

MacGruber · Apr 14, 2023

ReignMocap said:
That is really bad. Anything beyond 1 sec is not good enough.

We have different goals, you want live responses. However, speed doesn't matter that much when you just want to offline generate voice lines.
Still, I played around a bit more with TortoiseTTS. I tried some better-quality voice inputs. They are still not ideal, but definitely better than the included samples. However, I still can't get anything useful out of it.

Also played again with Elevenlabs. Getting decent results with it. But you really need to spend your entire monthly free character budget on hitting the "Generate" button some 50x to produce random voices until you get some decent one. Once you have the voice, you can store it at least. Still, somewhat weird/annoying business model. Other interesting discoveries:

You can emphasize individual words by putting quotation marks around them. Like so:
- From now on, I'm your "goddess".
You can create pauses by using three dots: ...
I had some success editing the generated audio afterwards with Audacity. E.g. playing with tempo/pitch, sometimes for individual words.
Some have been experimenting with spamming question or exclamation marks, underscores, etc.
Also interesting, emotion is derived from context. So you need to add a bit around your actual sentence: https://www.reddit.com/r/ElevenLabs/comments/11ftdss/how_do_you_put_emotions_into_the_voice/

LUXDAR · Apr 15, 2023

It's true that the sample was an original voice... Not a good example to pick...

I had fun taking all the voice lines of the Overwatch 2 characters in .ogg format and converting them to .wav.
Tortoise did the job and I find the result surprising for the little information it has...

I also like that you can provoke a different intonation by putting, for example,
[Sad] or [Angry] before a sentence or word to give it emphasis.

With punctuation and some expression brackets, you can create some good stuff with tortoise TTS.

This:

GitHub - 152334H/tortoise-tts-fast: Fast TorToiSe inference (5x or your money back!)

would save time, but I can't get it to work :/

And this: https://huggingface.co/Snowad/French-Tortoise

to give the intonation a bit of French. It could make a difference, but likewise, it's something I don't know how to install :/ ^^ I think it's not great though, too bad for me as I'm French.

Sunny74 · Apr 24, 2023

I'm in my second month of subscription with 11 Labs. The first month was only $1, the second $5, and I think it was worth it, at least for me. Cloning audio is really amazing. However, as said, you need a good audio source. The higher the dB, the better. I did some cloning and lines, but because of the different dB levels I now run into issues when using them in scenes. The low-volume ones won't work well with LipSync, while the standard 11 Labs voices do. If you can rework them and bring all the different voices to the same level, you are fine.

Edit: And yes, it is limited to mostly US/UK English. But I trained the voice to some other languages (certain words) and uff, it is so cute with the US/UK accent.
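
One way to bring clips to the same level is peak normalization. The following is a rough sketch using only the Python standard library; normalize_peak() and its target_peak parameter are made up for illustration, and it only handles 16-bit PCM WAV files. Peak normalization matches the loudest sample, not perceived loudness, so clips may still need adjusting by ear.

```python
import array
import sys
import wave


def normalize_peak(in_path, out_path, target_peak=0.9):
    """Scale a 16-bit PCM WAV so its loudest sample hits target_peak of full scale."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    if peak > 0:
        gain = target_peak * 32767 / peak
        # Clamp to the 16-bit range to guard against rounding overshoot.
        samples = array.array(
            "h",
            (max(-32768, min(32767, round(s * gain))) for s in samples),
        )
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(samples.tobytes())
```

Running every clip through normalize_peak() with the same target_peak gives them a common peak level, which may help with the LipSync issue described above.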

vertiphon · May 23, 2023

How about running Silero locally for TTS, and Whisper for speech recognition?

Add Oobabooga and Pygmalion and, oh boy...

Sunny74 · May 23, 2023

vertiphon said:
How about running Silero locally for TTS, and Whisper for speech recognition?

Add Oobabooga and Pygmalion and, oh boy...

Is there any tutorial out there for such a set-up? I'm completely new to this and I worry a bit that my PC specs are too low to run or train it locally.

TommyTomahawk · May 24, 2023

atani said:
This is one I tried:

ai-voice-cloning

Collection of utilities aimed to voice clone through AI

git.ecker.tech

It will likely never reach the quality you seek, it's more to the crude enjoyment of the masses.

Where did you get that idea? There's a saying... ignore people when they make claims about what future technology will NOT be able to do.

Clocksmith · Sep 21, 2023

I used Tortoise TTS for months before ElevenLabs was created. Without fine-tuning, which the author hasn't provided straightforward instructions to do, it just isn't worth the effort compared to paying 5 bucks a month for ElevenLabs voice cloning, imo.

I find I get the most consistent results from ElevenLabs by framing it like the voice is narrating an erotic audiobook. As an example, I've gotten a lot of mileage out of this text from various voices, with Voice Settings > Style Exaggeration set to 50% and all other settings at default:

She moaned, feeling intense pleasure, "Oh god... Oh my god... Grab my breasts.... Oh my god... Oh god... Grab my breasts.... Oh my god... Oh god... Oh my god... Push deeper inside me.... Oh my god... Oh god... Push deeper inside me.... Oh my god... Oh god... This feels so good.... Oh god... This feels so good.... Oh god... I'm gonna come....".

Then, her breath just a whisper as she could barely speak, her legs trembling, "Oh god... I'm gonna come....Oh god... I'm coming... Oh god... I'm coming..."

Shadow Venom · Sep 29, 2023

I've been waiting for this feature... Azure AI's Speech Studio is, as of now, the most user-friendly tool I've encountered. It meets at least the following criteria:

Text-to-Speech
Reproducible results.
Ideally direct control over emotion, speed, emphasis of individual words
Good selection of voices
As little noise/artifacts as possible

Azure's AI models allow you to adjust the emotional tone, speech rate, and other aspects of individual words within a sentence through labeling. However, when it comes to sound effects like laughter, it may not perform as well as desired.

Elevenlabs does indeed produce more realistic voices, and you can upload and train your own models (which takes about a month, if I recall correctly). However, it may not offer the same richness and flexibility in handling emotional tones as Azure. As you mentioned, using it to generate speech can sometimes feel like playing a slot machine.

Clocksmith · Sep 29, 2023

Shadow Venom said:
Elevenlabs does indeed produce more realistic voices, and you can upload and train your own models (which takes about a month, if I recall correctly).

I'm not sure what you are referencing that would take a month. They do offer "Professional Voice Cloning" but only of your own voice, which isn't going to be much use to most people for erotic utterance generation.

Custom voice cloning literally takes a minute or less once you upload a sample MP3.

Shadow Venom · Sep 29, 2023

Clocksmith said:
I'm not sure what you are referencing that would take a month. They do offer "Professional Voice Cloning" but only of your own voice, which isn't going to be much use to most people for erotic utterance generation.

Custom voice cloning literally takes a minute or less once you upload a sample MP3.

Indeed... It's unlikely that anyone would use their own voice for adult-oriented purposes.

But who knows. I have a female friend who said she could voice my scenes, but it's quite cumbersome in practice. I once thought about having her clone her voice using Elevenlabs, which seems like a solution. I don't know if she would consent to this...

Origin69 · Sep 30, 2023

Clocksmith said:
I used Tortoise TTS for months before ElevenLabs was created. Without fine-tuning, which the author hasn't provided straightforward instructions to do, it just isn't worth the effort compared to paying 5 bucks a month for ElevenLabs voice cloning, imo.

I find I get the most consistent results from ElevenLabs by framing it like the voice is narrating an erotic audiobook. As an example, I've gotten a lot of mileage out of this text from various voices, with Voice Settings > Style Exaggeration set to 50% and all other settings at default:

I used Tortoise TTS for the voiced intro I wrote for my "Drums in the Deep" scene, and adjusted the speed/pitch of the output myself. It was way more effort than I expected, and it took ages. The first word was Moria, and I had to try so many times with different spellings just to make it sound right. It kept wanting to say "Mariah" regardless of how I spelled it. Eventually it accepted "Moorear".
I was surprised to find it did seem to have a kind of context based emotional tone, like it added emotion to phrases like "her dreams" without any prompts. I was happy with the final result, the timing and cadence worked perfectly, but in future I'll probably try something a little more user friendly.
