My dream VAM plugin/scenario

@Hedgepig could you just use Amazon Sumerian for the TTS? What is the quality like? Right now I'm using Microsoft Azure, which is a lot better than IBM Watson or Google WaveNet. I looked at Amazon Polly, but that TTS engine sucks. So would Amazon Sumerian be better? I checked out their page and it looks like some kind of 3D online tool. I am only interested in the TTS part, however...

Is it better than this? https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#features
Using Aria (Neural) with the Cheerful setting? Because that is what I'm using right now... and it's the best one I could find online. I hope Amazon Sumerian sounds better; I'd love to hear your opinion.
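For reference, the Aria + Cheerful combination is selected through SSML. A minimal sketch with Azure's Python Speech SDK, with placeholder key/region values (the `mstts:express-as` element is what picks the speaking style):

```python
# Minimal Azure TTS sketch: en-US-AriaNeural with the "cheerful" style.
# Subscription key and region are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="cheerful">
      I think using artificial intelligence opens a world of possibilities.
    </mstts:express-as>
  </voice>
</speak>
"""
synthesizer.speak_ssml_async(ssml).get()  # plays through the default speaker
```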

I can't look at that; Azure is all about the code and, these days, it does my head in. Totally my fault, but I'm a writer and visual thinker; I only do XML and AIML. They are language-related and make sense to me. If you want to share a video of you using Azure, please do, I'd love to see it.

Sumerian has a feature that allows you to annotate text with different kinds of emotional emphasis, and you can set the speed of the spoken text. It's not perfect but it's near enough.

Sumerian is free to use, but if you need higher quality, there are companies offering this kind of thing: "4 Best AI Voice Generators (Text-to-Speech) for 2022" on Victory Tale.

But it looks like you'll be paying $30.00 per month to be able to use it outside the sandbox.

Please let me know how you get on with TTS generation.
 
It would be worth doing a few QC checks first to ensure that the problem is not your data (a rough sketch for automating the first few checks follows the list).
  1. Did you split your audio into smaller samples (usually 2 to 10 s long)?
  2. Did you trim the beginning of your audio (or text) to remove audio that doesn't match the text?
  3. Did you check the samples to ensure that the text matches the audio?
  4. Did you use transfer learning, or did you start training a new model?
  5. For how many epochs did you train the model?
  6. Were all those clips from the same person and single-voiced?
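Here's a rough sketch of how one might automate checks 1-3, assuming an LJSpeech-style dataset layout (a metadata.csv of `id|transcript` lines next to a wavs/ folder); the paths and thresholds below are placeholders, not a fixed recipe:

```python
# QC sketch: flag clips that are too short/long, missing, or untranscribed.
# Assumes LJSpeech-style layout: my_dataset/metadata.csv + my_dataset/wavs/*.wav
import csv
import wave
from pathlib import Path

DATASET = Path("my_dataset")   # placeholder dataset root
MIN_S, MAX_S = 2.0, 10.0       # typical clip-length range for TTS training

with open(DATASET / "metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        clip_id, text = row[0], row[-1]
        wav_path = DATASET / "wavs" / f"{clip_id}.wav"
        if not wav_path.exists():
            print(f"{clip_id}: missing audio file")
            continue
        with wave.open(str(wav_path)) as w:
            seconds = w.getnframes() / w.getframerate()
        if not MIN_S <= seconds <= MAX_S:
            print(f"{clip_id}: {seconds:.1f}s is outside the {MIN_S}-{MAX_S}s range")
        if not text.strip():
            print(f"{clip_id}: empty transcript")
```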
Another workflow that I played with a bit today is voice conversion. It gives you some control over the intonation. What you do is train two voices: yours and the voice you want. Then you can say something and it will convert your audio directly to the other voice while keeping the emotion.
Regarding 1 to 6, I did most of that, but they were smallish samples with a few words per line. The quality was fine and they matched syllable for syllable. I only ran it for 500 epochs though, so that might matter. Right now I'm doing tests with a female writer of whom I have 10+ hours of text + audio. To see how much it all matters, I'm going to train a few models:

- 0.5 hours of source material @ 500 epochs
- 2 hours of source material @ 500 epochs
- 0.5 hours of source material @ 3000 epochs
- 2 hours of source material @ 3000 epochs

Then I'm going to compare the results for these cases so I get a better feel for what matters most for quality.
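If it helps, that grid is easy to script; this is only a schematic sketch (the `train_tts.py` command is a placeholder for whatever trainer entry point and config you actually use):

```python
# Schematic 2x2 experiment grid: dataset size vs. number of epochs.
# The printed command is a placeholder, not a real trainer invocation.
from itertools import product

for hours, epochs in product([0.5, 2.0], [500, 3000]):
    run_name = f"clone_{hours}h_{epochs}ep"
    print(
        f"python train_tts.py --run_name {run_name} "
        f"--meta_file subset_{hours}h.csv --epochs {epochs}"
    )
```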

I also used a pretrained model for the sample.

However: I really like your voice-conversion approach. It sounds super interesting. How does that work? Can you point me to some material?
 
The page I gave you, if you scroll down, allows you to try out the different voices. You can just click on a voice and type what it has to say. So if you're willing, maybe you can look at it again. In the meantime, I'll try to find some YouTube videos detailing the Amazon Sumerian software.
 
@checkking @Hedgepig
First results are in, and I'm pretty stoked to be honest. This looks very promising.

2 hours of source material @ 500 epochs:

"I'm happy to show you my ass sir"

"Did you know I always swallow when I suck dick?"

"I think using artificial intelligence opens a world of possibilities."
 
Just tried it; the SSML is similar to that in Sumerian. The Microsoft voices are way too harsh for me.

To get more expression than the AWS Polly voices, you could try CereProc with RT-Voice in Unity. It's a bit fiddly to set up, but even a non-coder like me managed it, after a day-long blitz of red console warnings. :)
 
That looks promising!!
 
Directly in the VITS Colab example, there is a voice conversion example with the multi-voice model. Beware, that notebook is a bit messy and you need to manually change the active folder a few times to get it to work.

The example uses voice actor 81 as the seed, then uses an audio-to-audio approach to convert the voice to the other speakers' voices. It works very well, but sadly all those examples are boring narrative voices. A practical usage would be to train at least two voices, your own and the voice you want, and add those two voices to the model. Then you just record with your own voice and convert it.
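If you want to try the audio-to-audio step without wrestling with that notebook: newer Coqui TTS releases also ship a dedicated voice-conversion model (FreeVC-based, so not the VITS speaker-embedding approach the notebook uses). A minimal sketch, with all file names as placeholders:

```python
# Voice-conversion sketch with the Coqui TTS pip package (pip install TTS).
# Uses the FreeVC-based VC model, not the VITS notebook from this thread.
from TTS.api import TTS

vc = TTS(model_name="voice_conversion_models/multilingual/vctk/freevc24")
vc.voice_conversion_to_file(
    source_wav="my_recording.wav",           # what you said (keeps the intonation)
    target_wav="target_speaker_sample.wav",  # whose voice you want
    file_path="converted.wav",
)
```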

I did try with a very emotional sample, and it did keep the emotion, but I didn't train for that speaker, so the result is not good. That was just a quick test to figure out how much emotion/intonation can be transferred that way, and it seems to work.
 
Yes, very promising for only 500 epochs. I can still hear that Tacotron distortion that has been bugging me all the time.

Maybe I am more sensitive to it. I keep seeing people everywhere stating that Tacotron DDC > VITS, yet I listen to their cherry-picked samples and, to me, VITS is many times cleaner. Maybe because the quality is more consistent. With Tacotron, sometimes the voice is very clean, then you hear a small distortion, and that's what kills the magic.
 
Just a demo; the results are expected to be really bad.

I took a small audio sample from gonewildaudio and used voice ID 21, which is not that similar but kinda worked. No training on any voice; this is directly from the Colab notebook.

Original audio:

ID 12

ID 15

It keeps the intonation. Now, you can hear how good it sounds on trained voices in the notebook, even with a different gender, so... lots of potential.
 
I've been experimenting with Replika. Using regolos engine plugin (disabling speech and possibly expressions) and adding microphone input and rt_lipsync in VR, I can get a walking, interacting "AI", and I can talk to her and she responds. The way I'm getting live audio into VAM is very slow though; the delay is quite a few seconds.
 
I'm still busy training a voice clone. Within another week I will hopefully have successfully trained a model at max quality, and if so, I can share some results.
 
@pinosante any success with training? It sounds very cool! Can you share the model?
It looks like YourTTS is much easier to use, but at the moment it only works on Colab; I can't get it to run offline.
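If the local route pans out, YourTTS also ships with the Coqui TTS pip package, so a run outside Colab should look roughly like this (untested sketch; the reference clip and output paths are placeholders):

```python
# Local YourTTS sketch via the Coqui TTS package (pip install TTS).
# speaker_wav is a short reference clip of the target voice.
from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
tts.tts_to_file(
    text="I think using artificial intelligence opens a world of possibilities.",
    speaker_wav="reference_clip.wav",
    language="en",
    file_path="output.wav",
)
```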
 
Yes, here it is :)

 