Voice Model (Text-To-Speech, Neural Network based)

Ok, I'll probably kick the training back off tonight, but I've gotta say, it's pretty dang near perfect already. For most paragraphs, if I submit the exact same text a second time, it comes out perfectly: the first render always seems to be read really fast, and the second render comes out just right. I tried to find some good samples of text that would cover a wide variety of scenarios. First I tried using a Wikipedia entry about her, so that's there. Then I thought "why not just use a bunch of her tweets?" So I found a Variety article with a bunch of their 'favorite tweets' of hers, and I had the AI read out her tweets. Pretty freaking perfect, listen for yourself. Attached.
Yeah, those are really good! Especially the timing for some of her tweets. Like, she actually gets the timing for the punchline right. You can just keep on training and write down somewhere that you like the quality at this iteration. So you have a model at this checkpoint you like, but if you train it a bit more, it will probably sound a bit more crisp.

By the way, before you continue training, you can also try some custom HiFi-GAN training (train your own vocoder). This works the same way: when you start/continue training your model, click on "Train Custom one", select the base vocoder files, and it will start training. Vocoder training goes pretty fast; in half a day to a day you will probably have the vocoder trained. The effect for me was minimal, but it does make it sound more crisp.

Question: when you said that the score was wobbling around 67-68 did you mean validation score? Because that's like an insanely high validation score. Mine are between 0.3 and 1.0 for the voices I train.
 
Thanks, can you share some resources for good audiobooks?
A good tip is to become an (online) member of a library. Most libraries allow you to check out digital books, so you can look for both the audiobook and the ereader version. And the DRM protection on library digital files is not that hard to crack. You can also buy stuff on Audible (I've done so) and then crack/download it using some googling. Same for Kindle ebooks (which are a major pain in the ass to convert to epub/pdf). You can also look at torrent sites, and if you google enough, there's also a website I remember that shared free audiobooks. But then you also need to find the corresponding ebook.

So: library > torrents > audible > amazon/kindle.
 
  • Will/Would you use this model?


Definitely. The free online TTS is limiting and I don't have the money for one of the subscription ones. Having my own, on my own system, is a boon. Thank you.
 
Question: when you said that the score was wobbling around 67-68 did you mean validation score? Because that's like an insanely high validation score. Mine are between 0.3 and 1.0 for the voices I train.

Sorry, I was referring to attention score. So, I had kicked off the training and it's fully complete now.

(screenshot attached)


The attention score went up much higher, to 737. The loss is also very low. I'll test it out and we'll see how it goes.
 
Hmmm I feel like it may have been better before... I have plenty of checkpoints to fall back on...
 
I kicked off training a custom vocoder for it, but oddly that says it's going to take 4 days??? Longer than the data set training.
 
In practice it goes really fast. Each iteration is about 0.69 seconds instead of the usual 1.98.
 
pinosante updated Voice Model (Text-To-Speech, Neural Network based) with a new update entry:

Improved the model (a lot!) and retrained it, please download this improved model!

Hi everyone,

It took me another long while, but I've been cleaning up the source audio using some custom scripts I created myself. Long story short, I fixed the "pitch" of the source material to all be in the same range, and removed the "breath" noises from inhaling before speaking. After that I retrained the model on this improved source data. It improved the quality of the model by a lot.

Improvements:
  • Less raspy, cleaner audio
  • No more sudden pitch drops...

Read the rest of this update entry...
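For anyone curious what that kind of cleanup can look like in code, here is a minimal sketch of the general idea, not the actual scripts used for this model. It assumes librosa and soundfile are installed, picks an arbitrary target pitch, and uses a crude energy gate for the breath/silence removal (which also shortens pauses, so check the result against your transcript).

```python
# Rough sketch of per-clip cleanup: shift each clip toward a common median pitch,
# then drop low-energy segments (breaths, long silences) with a simple gate.
import librosa
import numpy as np
import soundfile as sf

TARGET_MEDIAN_HZ = 200.0   # assumed target pitch for the speaker; tune per voice

def clean_clip(path_in, path_out, sr=22050):
    y, _ = librosa.load(path_in, sr=sr)

    # Estimate the clip's median f0 and shift it toward the target range.
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    median_f0 = np.nanmedian(f0[voiced]) if voiced.any() else TARGET_MEDIAN_HZ
    semitones = 12 * np.log2(TARGET_MEDIAN_HZ / median_f0)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)

    # Crude breath/silence removal: keep only segments louder than the gate.
    intervals = librosa.effects.split(y, top_db=30)
    if len(intervals):
        y = np.concatenate([y[s:e] for s, e in intervals])

    sf.write(path_out, y, sr)

clean_clip("raw/clip_001.wav", "clean/clip_001.wav")
```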
 
First of all thanks for sharing this tutorial with us. I have a model training at the moment but have a question and a suggestion that may open up the door to more voice models.

Is it strictly wav and text files that we are able to use, or are there other supported file formats? There are quite a few audiobooks on Audible with no Kindle ebook counterpart. I was thinking about maybe using AI transcription to get a counterpart text; has anyone tried this yet?
 
Good question. So yeah, there is software which transcribes the audiobook to written text. In fact, the app does that to match the audio with the written source material: it transcribes what is in the audio and then uses this to match it with the written material. The problem is that the quality of transcribed material is not really good. If you want to take that route, I see two options. 1) Use the Python library called VOSK (blog post and jupyter notebook with code); this is what I have been using for some other projects. 2) Use YouTube's transcription. You would have to google how to do that (I have done this in the past). What I did was: create a video with only a black screen and add the audio as sound, upload this black video with the audio to YouTube, and then somewhere choose automatic subtitling. It will then subtitle the audio. This gives decent results, but not perfect. If you are willing to invest the time, you can then proofread the subtitles for weird sentences, or, more time intensive, listen to the whole audio while checking the subtitles.
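If you go the VOSK route, the basic usage is pretty small; a minimal sketch (assumes the audio has already been converted to 16 kHz mono 16-bit PCM WAV, and that you've downloaded a model folder such as "vosk-model-en-us-0.22"):

```python
# Minimal VOSK transcription loop: feed the wav in chunks and collect the text.
import json
import wave
from vosk import Model, KaldiRecognizer

def transcribe(wav_path, model_dir="vosk-model-en-us-0.22"):
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    parts = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            parts.append(json.loads(rec.Result()).get("text", ""))
    parts.append(json.loads(rec.FinalResult()).get("text", ""))
    return " ".join(p for p in parts if p)

print(transcribe("audiobook_chapter01.wav"))
```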

Personally, I've made the decision to always make sure I have a good written source text, because that saves a lot of time. But if you're willing to invest the time, you can try the suggestions above.

How is the training going?
 
Thanks, will definitely take a look at those programs. The training is going pretty decently IMO; I was actually surprised that I was able to get some decent results after just 1000 iterations. Left it going overnight and am at 20K now. Here is a sample https://mega.nz/file/zB8EWCoL#IrFXGdD32FdahpdXDCCchEG9m5l3_bXWnUWqZz2bcWI

Are you the author of the voice cloning app? I see it has a link to a page to share voice models, but it seems to be dead, which is a shame. Perhaps we could all share some models once we get them trained up, as it's so time consuming. Most certainly an interesting project and I would love to see more results from other people.
 
Nice sample! Yeah, it's an amazing app. No, I'm not the author (I wish), and regarding sharing... maybe it's because of potential liability / copyright issues. Same with all the AI digital art happening right now.
 
idk, i only looked into it a bit to see if there's an easy way to save tts as audio files, like a command line utility. couldn't find much and went back to classic voices

personally i'm content with the Ivona offline voices for now, they're quite cheap too - compared to all that cloud nonsense anyway. The intonation with AI voices is amazing but for me it's not worth the trouble with the UI and all that extra work, they're difficult to automate. This is what I'm working on now with classic tts: vimeo link
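For the save-to-file part with classic voices, here's a minimal sketch using pyttsx3 (just an idea, assuming pyttsx3 fits your setup; on Windows it drives the same offline SAPI voices):

```python
# Batch-save classic (offline SAPI) voices to an audio file with pyttsx3.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 170)   # speaking rate in words per minute, tweak to taste
engine.save_to_file("Chapter one. It was a dark and stormy night.", "chapter01.wav")
engine.runAndWait()               # blocks until the file has been written
```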
 
I understand. For me the robotic voice is kind of an immersion breaker. The voice clone software for which I made this VAM model also has its source code available, so it is possible to just download the Python command-line program and compile it to an .exe. So theoretically it would be possible to have a command-line executable which could do "voice_python_executable.exe X or Y" and then generate a wav file.
 
Hey guys, here are 2 apps that you could be interested in.

xVASynth v2 - https://store.steampowered.com/app/1765720/xVASynth_v2/

Kinda like the voice training app, but aimed at training and sharing voices of video game NPCs; there are already a few models for Fallout/Overwatch/Skyrim characters, etc...

As I'm a bit new to this, I'm not sure what's different between all these models/apps, but I saw that this one gives you a bit of control over some of the generated voices, like the pitch for example.

xVATrainer - https://store.steampowered.com/app/1922750/xVATrainer/

This one is very useful; it contains tools to prepare your dataset for voice training, and can also train a voice to use in the first app.

These tools include: AI source separation (to extract dialogue from background noise), AI speaker diarization (to split an audiobook into multiple sound files), audio formatting, silence splitting, auto transcription, and transcript quality evaluation.

Even if you don't want to use the xVASynth app, xVATrainer helps you do like 99% of the work of dataset preparation, which can then be used to train a voice in the Voice Cloning app.
 
Wow, that's super useful! Thanks. I'll look into it!
 
Just stumbled on to this - well done @pinosante - awesome wee tool!
Question, does punctuation influence the outcome of the synthesis? And can it be used in place of a period to end a sentence?
It seems to do something in the tests I've tried, but I cannot tell if that's simply a result of another iteration.
 
Yeah, it matters. You can try ending a sentence with a , or … for instance; that might give a different tone. Another thing you can try is a crude form of "prompt engineering", where you preface what you want to say with something emotional. For instance, you can say "I fucking hate you, why should I do this?" and it will probably have a more hateful sound. Then you can cut out the "why should I do this?" part with Audacity, for instance. Same idea: "I love you, why should I do this?" will probably sound different as well.
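If you don't want to open Audacity every time, the trimming can also be scripted; a minimal sketch with pydub, where the preface length and the silence-gap settings are made-up values you would have to tune per clip:

```python
# Drop the emotional preface from a generated clip and keep the part you wanted.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

clip = AudioSegment.from_wav("render.wav")

# Option A: hard-coded cut point in milliseconds (listen once to find it).
trimmed = clip[1400:]

# Option B: cut at the start of the second spoken chunk, assuming a clear pause
# between the preface and the rest of the sentence.
chunks = detect_nonsilent(clip, min_silence_len=250, silence_thresh=clip.dBFS - 16)
if len(chunks) > 1:
    trimmed = clip[chunks[1][0]:]

trimmed.export("render_trimmed.wav", format="wav")
```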
 