Voice Model (Text-To-Speech, Neural Network based)

Hi everyone!

This is a voice model based on a neural network, and you can make it say your own text. There is a lot of TTS software out there and I tried just about everything I could find, but I ended up training my own model. After about 700 hours of training, I'm sharing it.

First time use (installation):

Download the latest release from https://github.com/BenAAndrew/Voice-Cloning-App/releases/tag/v1.1.1
- You can choose the version with or without GPU support. If you only use this to generate voices, you can download the CPU-only version.
Put the .exe in a directory you like, but IMPORTANT: NOT in C:\Program Files or C:\Program Files (x86). Choose a directory like C:\Voice or C:\Games\Voice or C:\VAM\Voice.
Run the .exe and wait until it opens a web browser, then close it again. (IMPORTANT: this takes about a minute or so. Be patient: it looks like it's doing nothing, but it needs time to create some directories!)
The app will have created some directories ('data' with subdirectories 'datasets, hifigan, languages, models, results, training')
Download these two files: g_02500000 and config.json and save them somewhere (not important where)
Now download the "VAM Voice Model v2.zip" from mega.nz and unzip it in the "models" directory. You should end up with YOUR_PATH/data/models/VAM Voice Model v2.0/checkpoint_510426
IMPORTANT: delete the original zip file after unzipping!
Start the app and wait until a web browser appears
Click on Synthesis
In the last row, "Vocoder", click on "Add more" next to it. This opens a new menu.
- In the section "Add a Hifi_gan Vocoder" click next to "Hifi-gan model" and select the g_02500000 file you downloaded earlier
- Click next to "Hifi-gan config" and select the config.json file you downloaded earlier
- As a name, choose whatever you like, but "g_model" is probably a smart choice
- Click on "Submit" directly below the Hifi-gan config
- Click on back (on the top left)
Done!
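
If you want to double-check the installation, here is a minimal Python sketch that looks for the directories and the checkpoint named in the steps above. The directory names and the checkpoint path come from this post; the install location `C:\Voice` is just one of the example directories, and the function name `missing_paths` is made up for this illustration.

```python
from pathlib import Path

# Subdirectories the app creates inside 'data' on first run (as listed above).
EXPECTED_SUBDIRS = ["datasets", "hifigan", "languages", "models", "results", "training"]

# Where the unzipped model should end up (taken from the steps above).
MODEL_DIR = "VAM Voice Model v2.0"
CHECKPOINT = "checkpoint_510426"


def missing_paths(install_dir):
    """Return every expected path that does not exist under install_dir."""
    data = Path(install_dir) / "data"
    expected = [data / name for name in EXPECTED_SUBDIRS]
    expected.append(data / "models" / MODEL_DIR / CHECKPOINT)
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    # Assumption: the .exe was placed in C:\Voice. Change this to your own directory.
    for path in missing_paths(r"C:\Voice"):
        print("missing:", path)
```

If nothing is printed, the folders and the model checkpoint are where the app expects them. This does not check the vocoder you added through the "Add more" menu.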

Generating voices:

Run the app
Go to "Synthesis"
Click on Submit
Write a sentence in the text box, and click Submit
Click on the play button next to the line graph to hear what is being said
Click on the three dots on the right to download the clip if you like it
IMPORTANT: always end your sentences with a dot (.)
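
If you prepare your lines in a script or a text file before pasting them into the text box, a small helper can make sure the final dot is never forgotten. This is only a sketch: the function name `ensure_final_dot` is made up, and it only checks for a dot because that is the only ending this post says works. It does not special-case "?" or "!".

```python
def ensure_final_dot(text: str) -> str:
    """Strip surrounding whitespace and make sure the text ends with a dot."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("empty sentence")
    # Leave the text alone if it already ends with a dot, otherwise add one.
    return cleaned if cleaned.endswith(".") else cleaned + "."


print(ensure_final_dot("I love apples"))
print(ensure_final_dot("  I love what you are doing.  "))
```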

Caveats:

The voice can sound a bit "raspy" / "tinny". I solve this by having music in the background. v2.0 improved the sound quality of the voice a lot!
~~Trying to combine two sentences can end up with the voice model tripping up and generating garbage.~~ v2.0 of the model should solve this (for 95% of the cases).
Sometimes the voice sounds very low (not girly at all). This is due to the source material I used: in the audiobook, a female narrator also voices the male characters. Solution: add "You are the best," or "I love you," or "Did you know," before the sentence, and it will often raise the pitch of the generated voice. (You can experiment a little.) v2.0 of the model should solve this!
Not ending your sentence with a dot (.) will generate garbage most of the time. The model needs to know when the sentence ends.

Some tips:

If you don't like how the text is spoken, just submit the same text again. Even if the text is exactly the same every time, the generated speech will be different, with different accents and tone. Sometimes I run a line a few times, to pick the generated speech I like best.
If the generated speech doesn't sound how you want it, consider breaking it up into smaller sentences.
Sometimes you can also combine different sentences to make a good one. Let's say you want the speech for "I just love doing some programming" and it doesn't come out right. You can then generate "I just love you" and "I am doing some programming", use Audacity to cut out the 'you' and the 'I am', and join the two remaining parts.
Experiment with word order and commas.
The tone of a sentence can change by what is being said. "I love apples" can sound different from "I love what you are doing."
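
The cut-and-join tip above uses Audacity. If you prefer a script, the same idea can be sketched with Python's standard `wave` module. This is not what I use myself, just an illustration under some assumptions: the downloaded clips are uncompressed WAV files with the same sample rate, sample width and channel count (this post does not say which format the app saves), and you have already found the cut times by ear. The function names, file names and times are all made up.

```python
import wave


def read_segment(path, start_s, end_s):
    """Read the audio between start_s and end_s (in seconds) from a WAV file."""
    with wave.open(str(path), "rb") as clip:
        fmt = (clip.getnchannels(), clip.getsampwidth(), clip.getframerate())
        rate = clip.getframerate()
        first = int(start_s * rate)
        count = int(end_s * rate) - first
        clip.setpos(first)
        frames = clip.readframes(count)
    return fmt, frames


def splice(segments, out_path):
    """Join (path, start_s, end_s) segments, in order, into one WAV file."""
    fmt = None
    chunks = []
    for path, start_s, end_s in segments:
        seg_fmt, frames = read_segment(path, start_s, end_s)
        if fmt is None:
            fmt = seg_fmt
        elif seg_fmt != fmt:
            # Mixing formats would produce garbage, so stop here instead.
            raise ValueError(f"{path} has a different audio format")
        chunks.append(frames)
    if fmt is None:
        raise ValueError("no segments given")
    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(fmt[0])
        out.setsampwidth(fmt[1])
        out.setframerate(fmt[2])
        out.writeframes(b"".join(chunks))


# Hypothetical usage for the example above (file names and times are invented):
# splice([("i_just_love_you.wav", 0.0, 0.9),
#         ("i_am_doing_some_programming.wav", 0.4, 2.1)],
#        "i_just_love_doing_some_programming.wav")
```

A hard cut like this can click at the join. Audacity makes it easier to see and hear where to cut, so treat the script as a rough alternative.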

Training your own voice:

Disclaimer: it is a lot of work and not for the faint of heart!
The general approach is this: find an audiobook you like, and use that to train a voice on.
You will need a very good (NVIDIA) GPU, or you can do this online with Google Colab (which is a pain in the ass).
You will need at least 4 hours of good quality material to work with.
Training will take at least 30 full days (30 x 24h) to get some reasonable result.
If you are interested in training your own voice, this is the Discord channel where you can ask for help: https://discord.com/invite/wQd7zKCWxT
You can also ask me here or send me a PM.

Resources:

The FAQ site for the app: https://benaandrew.github.io/Voice-Cloning-App/
The GitHub page for the app: https://github.com/BenAAndrew/Voice-Cloning-App
The Discord channel for this voice cloning app: https://discord.com/invite/wQd7zKCWxT
Free sound editing software, Audacity: https://www.audacityteam.org/

A small tutorial below (showing the v1.0 installation, but v2.0 works the same way):

More resources from pinosante

Latest updates

Very minor update, updated the instructions

Improved the model (a lot!) and retrained it, please download this improved model!