Voice Model (Text-To-Speech, Neural Network based)

pinosante submitted a new resource:

Voice Model (Text-To-Speech) - Generate your own beautiful speech using this voice model trained on a neural network

Hi everyone!

This is a voice model based on a neural network. You can make it say anything you like. There is a lot of TTS software out there, and I have tried just about everything I could find, but in the end I trained my own model using a neural network. After about 700k iterations of training, I'm sharing the model.

First time use (installation):
  • Download the latest release from...

Read more about this resource...
 
Changing the advanced settings when submitting makes the speech hilarious :LOL: like Sylvester from Looney Tunes

 
Very well done, thx for the hard work 👍🏻 Excellent result, impressive!

Just to let you know: I have fiddled a lot with Azure's TTS engine (https://speech.microsoft.com/portal) lately, I find it quite satisfying.
They have a lot of female and male voices, and many of them with up to 15 expressions like Angry, Cheerful, Whispering, Shouting etc...
They also cover most languages, but the non-US ones only have one standard voice each. Best of all: you get everything on their free plan, forever.

Even with the standard presets you get pretty decent results. If you go advanced, almost everything is possible. And if you understand standard SSML, the possibilities are endless.
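A minimal sketch of driving Azure's neural TTS with SSML expression styles, as described above. It assumes the `azure-cognitiveservices-speech` package plus a subscription key and region; the voice name (`en-US-JennyNeural`) and style (`cheerful`) are illustrative examples, not a recommendation:

```python
def build_ssml(text, voice="en-US-JennyNeural", style="cheerful"):
    """Wrap text in SSML with an expression style (mstts:express-as)."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
        '</voice></speak>'
    )

def synthesize(ssml, key, region):
    """Send SSML to the Azure Speech service (needs a paid/free-tier key)."""
    # Deferred import so build_ssml is usable without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    return synthesizer.speak_ssml_async(ssml).get()
```

Swapping the `style` attribute (angry, whispering, shouting, etc.) is what produces the different expressions mentioned above.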

I made a 20-minute animated short story with it in VAM, and everyone was very impressed. Too bad I can't share the scene here because of paid content and its sheer size.

Anyway, thank you for sharing; this is still amazing work.
 
Here are two snippets from the scene:

 
@Saint66 yeah, I also looked at Azure and the Google speech packages. Azure is by far the best of the online speech offerings (imho). The problem, though, is that they only have this kind of support for English (as you mentioned). And although the speech is really clean, it feels a bit "monotone" / "business-like" to my ears. That is not surprising, since their customer base probably wants a professional-sounding voice. The snippets you sent are super high quality, but the voices lack a bit of feeling, so to speak (to my ears, at least). The one part I really liked was the seductive part by the girl in the second video. That voice had more feeling to it.

When you train a voice with a neural net like I did, it also captures the cadence/tone of the speaker, which makes it very natural to listen to. I prefer that over the professional voice-over style of the online voices. To get better quality, I need more material (audiobooks / hours of available speech). The problem is that if you only use audiobooks, you run the risk of generating the voice-of-someone-who-is-reading-a-book, which also lacks the timbre/emotion/feel that you want for a project like VAM. So I have been listening to hundreds of audiobook speakers to find a voice I liked. If you (or anyone else) have any suggestions for a good voice, please let me know.

Oh, btw, there is another *big* reason why I like the trained neural net approach a bit better: you can submit the same phrase 10 times, and it will be spoken 10 different ways. When you do this with Google or Azure, you get the same speech 10 times with the exact same tone. So when I am writing dialogue for a scene, I will sometimes submit a phrase, render 10 different takes, and pick the one that sounds closest to what I had in mind. (Hope this makes sense.)
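The "ten takes" workflow above can be sketched like this. Because the neural model is non-deterministic at inference time, synthesizing the same line repeatedly yields different readings; `synthesize` here is a hypothetical placeholder for whatever TTS call you actually use, not the tool's real API:

```python
def generate_takes(synthesize, line, n=10):
    """Render the same line of dialogue n times and collect the candidates.

    Each call to synthesize() may produce a different reading, so you
    can listen through the list and keep the take you like best.
    """
    return [synthesize(line) for _ in range(n)]

# Usage (with your own TTS function):
# takes = generate_takes(my_tts, "Hello there!", n=10)
# ...listen, then keep takes[best_index]
```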

Anyway, super cool that you made this whole voice-animated scene. Would having a male neural voice make you consider doing a scene like this with a neural-net-based voice? Or do you still feel that Azure is the way to go? (Btw, the Azure and Google voices are most probably also trained with neural nets, but I think they were chosen for professionalism.)
 
Scrappy little nobody in progress, it's doing 3,000 epochs, 6 1/2 hours of audio, going to take 5 1/2 days on the 3090.
 
Awesome @Bob Nothing ! So 3000 epochs per 5.5 days? You can increase the batch size btw; that will boost speed significantly. I set my batch size to 40, and you should be able to do that as well, maybe even more. I trained my model for 10k epochs.
The speed on my 3080 ti is roughly 1000 iterations per 30 minutes.

Oh, and you can use transfer learning, which will also improve the quality of your model. When starting training you can choose my model (my savepoint @ 720000 something) as a starting point, or use the default state dict they advise. I hope you did this; otherwise I'd start over with training. It really helps to start from a pretrained model.
Keep me posted! I think it’s super cool that you are also training a model!
 
Very interested in this, can it be used standalone and to train other voices? I'd like to do that as well.
 
I haven't tried yet.
Can I teach the neural network to speak Russian? Will it work?
You probably could, but it involves some steps. You need to download a Russian language model somewhere and create a Russian Cyrillic alphabet file. And of course you need a Russian audiobook + raw text, which you have to split into sentences, probably by yourself. This last step involves splitting thousands of lines of text. There are smarter ways, like using YouTube captioning or other methods, but it is still a lot of work.
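The transcript-splitting step above can be roughed out with a naive regex split; real data will still need manual cleanup (abbreviations, dialogue, ellipses), so treat this as a starting point, not the actual dataset-prep tool:

```python
import re

def split_sentences(text):
    """Split raw book text into sentence-sized lines for a filelist.

    Splits after sentence-ending punctuation (. ! ? ...) followed by
    whitespace. Works for Cyrillic text too, since the regex only
    matches punctuation, not letters.
    """
    parts = re.split(r'(?<=[.!?\u2026])\s+', text.strip())
    return [p for p in parts if p]
```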
 

So I did download the default Nvidia model and that's what I started the training on.

I'm at about 20 hours in, and this is the progress so far:

(screenshot of training progress)


So this is looking more like it will be done in 3 days. It also says you can stop/resume training, but I don't see a way to do that. Do I just close the command prompt and re-run it to resume? Also, it looks like my batch size is set to 58.
 
Yeah, you can just close the app and continue from where you left off, except that it will continue from the latest save point, so it's best to quit at a round iteration number (31000, 32000, 33000, etc.). It depends on what you set for the save frequency. You can also go to the training/models directory and look at what intervals the saves are being done. I always set the save-during-training interval to 1000 iterations, with a backup every 2500 iterations. This means that when I quit training, in the worst case I only lose 30 minutes of compute (1000 iterations ≈ 30 minutes of computation). But in practice, I just look at the numbers and do something else until it reaches the next 1000 mark, saves, and updates the validation score, and then I quit.
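The "continue from the latest save point" step can be sketched as a scan of the models directory for the highest-numbered checkpoint. The `checkpoint_<iteration>` filename pattern is an assumption; adjust the regex to whatever your trainer actually writes:

```python
import os
import re

def latest_checkpoint(model_dir):
    """Return the path of the checkpoint with the highest iteration number.

    Assumes filenames like 'checkpoint_33000' (hypothetical naming);
    files that don't match the pattern are ignored.
    """
    best, best_iter = None, -1
    for name in os.listdir(model_dir):
        m = re.search(r'checkpoint_(\d+)', name)
        if m and int(m.group(1)) > best_iter:
            best_iter = int(m.group(1))
            best = os.path.join(model_dir, name)
    return best
```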

Batch size 58, I'm jealous :). I think you can actually set your batch size to something like 80. The thing is: you can just try to set it as high as you can, and if it's too high, it'll just error out at the start. Then you lower it a bit. I put it a little lower still, just to make sure it won't crash overnight while it's running. But I have a 12 GB GPU running batch size 40 (11 GB reported in use), so with ~23 GB reported available you should be able to manage roughly double that.

Doing that would increase your speed by a factor of 80/58, roughly another 38%.
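The batch-size search above can be sketched as a simple try-and-step-down loop. `try_training_step` is a placeholder for running one real training step at the given batch size; on a GPU an out-of-memory failure typically surfaces as a `RuntimeError`:

```python
def find_max_batch_size(try_training_step, start=80, step=8, floor=1):
    """Find the largest batch size that doesn't error out at startup.

    Tries `start` first; on failure (e.g. CUDA out-of-memory raising
    RuntimeError) steps down by `step` until something fits. In
    practice you'd then back off a little further for overnight safety.
    """
    bs = start
    while bs >= floor:
        try:
            try_training_step(bs)
            return bs
        except RuntimeError:
            bs -= step
    return None
```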

Looking good so far: the loss looks good and the attention score is in the 0.60-0.70 range. Did you stop training and do some synthesis yet? At 1000 epochs you can probably get a feel for the voice already (but you still want more epochs).

A small tip: I'd also advise you to copy and paste the reports it gives you, with all the validation errors, into a notepad file somewhere. The reason: what you want to do is an "early stop". Meaning: if you train this thing to 10000 epochs, there is a risk that at some point the neural net becomes overtrained. You can tell this is happening when the validation score stops improving for a long time. So what you are looking for is the epoch where the validation score reaches its maximum plateau value, before it starts wobbling around that plateau. To put it differently: keep checking the validation scores to make sure they are still improving. Sometimes they drop a little and rise a little, but in the long run they should still improve.
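The early-stop check described above can be sketched as a plateau detector over the logged validation scores. This assumes higher-is-better scores (matching the post's framing) and a hypothetical `patience` window of checkpoints with no improvement:

```python
def plateau_epoch(scores, patience=5, min_delta=1e-3):
    """Return the index of the best validation score if training has plateaued.

    'Plateaued' means at least `patience` later checkpoints failed to
    beat the best score by more than `min_delta`. Returns None while
    the scores are still improving.
    """
    best_i, best = 0, scores[0]
    for i, s in enumerate(scores):
        if s > best + min_delta:
            best_i, best = i, s
    if len(scores) - 1 - best_i >= patience:
        return best_i   # reload the checkpoint saved at this point
    return None         # still improving: keep training
```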

Anyway, I'm very curious how the voice sounds. And I'd most definitely train this to 10000 epochs (especially at the speed you can train) and then check the validation scores in your notepad to see whether it plateaued somewhere before the 10000 epoch mark. If so, just load the model from the epoch where the plateau started and use that.
 
So I'm currently using 22.2 GB of the 24 GB of VRAM on the GPU, so I think I'm probably at the correct batch size. I'm currently at 1019, so I should close the command prompt, re-launch it, and then try to see how the voice sounds? Sorry, obviously a total NOOB on this at this point :)
 
Also, if I click 'settings' in the web UI, it says I'll lose any changes I've made. Does clicking out of the training tab tell the training to stop? Am I able to click out of the web UI and test the voice without stopping training? Will training just auto-resume if I close and re-launch the app's command prompt?
 
Also, looking at my checkpoint files;

(screenshot of checkpoint files)


I'm not sure what made it create a checkpoint at 33,000 when before it seemed to be every 10,000? It looks like if I stop it now I'll lose roughly 30 minutes of training.
 
Also, it really seems to be wobbling around the 0.67-0.68 mark.
 
Yeah, so during training it probably saves every 1000 iterations but overwrites that save each time, while keeping the 10000-spaced checkpoints permanently. So you get kept checkpoints at 10k, 20k, 30k, and rolling saves at 33k, 34k, etc. until the next kept checkpoint at 40k. You can quit at any 1000-iteration mark and continue from that save.
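The save scheme described above can be sketched as a small scheduling function: a rolling save at every short interval (overwritten each time) plus a permanent checkpoint at the long interval. The filenames and intervals are illustrative assumptions, not the trainer's actual behavior:

```python
def save_paths(iteration, rolling_every=1000, keep_every=10000):
    """Which files get written at this iteration, under the assumed scheme.

    'latest_save' is overwritten on every rolling save; numbered
    checkpoints are written only at the long interval and kept forever.
    """
    paths = []
    if iteration % rolling_every == 0:
        paths.append("latest_save")
    if iteration % keep_every == 0:
        paths.append(f"checkpoint_{iteration}")
    return paths
```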
 
Ok dude, holy freaking crap! She seems to have a hard time with the word "Hi" for some reason? Maybe it isn't good at short words yet? But I'm doing entire paragraphs here and it sounds exactly like her. Sometimes the timing is off by a hair, but it sounds EXACTLY like her. How do I upload audio clips to this thread? :) Ok, I had to .zip it. I almost don't want to iterate anymore... It's really freaking good!
 

Attachments

  • Everyone sample.zip
    467.2 KB
Ok, I'll probably kick the training back off tonight, but I've gotta say, it's pretty dang near perfect already. For most paragraphs, if I submit the exact same text a second time, it comes out perfectly: the first render always seems to be read really fast, and the second comes out just right. I tried to find samples of text that would cover a wide variety of scenarios. First I used a Wikipedia entry about her, so that's in there. Then I thought, "why not just use a bunch of her tweets?" So I found a Variety article with a bunch of their 'favorite tweets' of hers and had the AI read them out. Pretty freaking perfect, listen for yourself. Attached.
 

Attachments

  • Anna AI Samples.zip
    4.7 MB