Big problem with lipsync

checkking

Does anyone here have experience with the lipsync plugin or the built-in lipsync? I can make both work, but I get very bad results. So I did a few tests with simple vocals: basically me singing "aaaaaaa" at different pitches, then "ooo", "iiii", etc... and the results are really bad:

  • The detected phoneme is sensitive to pitch, but in reality pitch should have zero impact on the shape of the mouth. If I make the sound "aaa" or "ooo" at specific pitches like "do re mi fa sol la si do", I get completely different phonemes with the built-in lipsync. Same with the plugin, but with the built-in one we can visualize the detection in real time. The detection is also repeatable: I can make the mouth look completely different, even open or fully closed, just by singing "ooo" with a pitch pattern like "do re do re do", for example.
  • The detection seems to work a bit better on very high-pitched voices, but seems almost random on lower-pitched voices.
  • I have seen a lipsync asset for Unity that requires a calibration for each voice, but there is no such option in VAM, which may explain why the demo audio works well while my results are so bad.
I could have gone further, but I believe the vowels are the most important sounds to get right first: if the mouth can't even take the correct open shape, or stay open, for a simple "ooo", there is no point investigating further.
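
For what it's worth, the pitch sensitivity is exactly what you'd expect from a detector that looks at the raw spectrum instead of formants. Below is a minimal Python sketch (assuming librosa and numpy are installed; the test file name and the formant table values are illustrative assumptions of mine, and this has nothing to do with VAM's closed internals) showing how a vowel can be classified from its first two formants, which barely move when the same vowel is sung at different notes:

```
import numpy as np
import librosa

def estimate_formants(frame, sr, order=12):
    """Estimate formant frequencies (Hz) from one windowed frame via LPC roots."""
    a = librosa.lpc(frame.astype(np.float64), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return [f for f in freqs if f > 90.0]  # drop near-DC artifacts

# Rough (F1, F2) centres per vowel -- illustrative textbook-style values.
# A real system would measure these per speaker: that is what the
# "calibration" in the Unity asset mentioned above amounts to.
VOWELS = {"a": (700.0, 1200.0), "i": (300.0, 2300.0),
          "o": (450.0, 850.0), "u": (350.0, 700.0)}

def classify_vowel(f1, f2):
    """Nearest-neighbour match in (F1, F2) space."""
    return min(VOWELS, key=lambda v: (VOWELS[v][0] - f1) ** 2
                                   + (VOWELS[v][1] - f2) ** 2)

# "aaa_do_re_mi.wav" is a hypothetical test file: one vowel sung at
# several pitches. The classified vowel should stay "a" for every note.
y, sr = librosa.load("aaa_do_re_mi.wav", sr=16000, mono=True)
n = int(0.03 * sr)                      # one 30 ms analysis window
frame = y[:n] * np.hamming(n)
formants = estimate_formants(frame, sr)
if len(formants) >= 2:
    print(classify_vowel(formants[0], formants[1]))
```

Since F1/F2 come from the vocal tract shape rather than from the vocal fold rate, the same table entry keeps matching across "do re mi": that is the invariance the built-in detection seems to lack, and per-voice calibration would just mean replacing the table with measured values.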

I have a few ideas to fix the problem, worth investigating:
  • Find a way to bypass the phoneme detection and feed phonemes based on text. That might seem complex, but in reality many audio sets already come with the corresponding text, and the phonemes would be 100% accurate.
  • Just find a way to calibrate a voice in VAM.
  • Preprocess with a better AI solution and add an option to drop the real-time detection. The preprocessing could be done from VAM or externally, but the idea is that if we know the format required to trigger the morphs, we could match timestamped phoneme files with their corresponding audio and use better but resource-hungry AI libraries to compute them. For example, some STT and TTS libraries generate phonemes as a step in their process (see the sketch after this list).
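
To make that last idea concrete, here is a minimal sketch of what a preprocessed, timestamped phoneme file and its playback-time lookup could look like. The JSON layout, file names, and ARPABET-style phoneme labels are all assumptions on my part (VAM's actual morph-trigger format isn't documented), but an extension plugin could consume something like this instead of detecting live:

```
import bisect
import json

# Hypothetical output of an offline aligner/TTS pipeline for the word
# "hello": each entry is a start time, end time (seconds) and phoneme.
timeline = [
    {"start": 0.00, "end": 0.08, "phoneme": "HH"},
    {"start": 0.08, "end": 0.16, "phoneme": "EH"},
    {"start": 0.16, "end": 0.30, "phoneme": "L"},
    {"start": 0.30, "end": 0.55, "phoneme": "OW"},
]
with open("clip.phonemes.json", "w") as f:
    json.dump(timeline, f, indent=2)

def phoneme_at(timeline, t):
    """Return the phoneme active at playback time t, or None during silence."""
    starts = [seg["start"] for seg in timeline]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and t < timeline[i]["end"]:
        return timeline[i]["phoneme"]
    return None

print(phoneme_at(timeline, 0.35))   # -> "OW"
```

At playback, a plugin would call phoneme_at with the audio source's current time each frame and blend the matching mouth morphs, so accuracy would depend entirely on the offline aligner rather than on any real-time detection inside VAM.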
Or, am I overthinking this? Is there a simple fix that I am not aware of?
 
There's no more VaM 1.x development; this won't be addressed unless someone makes a plugin to extend the lip sync feature.
 
I know... and the plugin is just a DLL, so it's not open for improvement, and the author is not active...

Do you know if the built-in lipsync is "closed", or whether it can be extended with a plugin? It would be cool if there were a way to directly feed the phonemes, or to add more phonemes.
 
Unfortunately I have zero knowledge of how it works or how it's built. Perhaps you could send that question to Meshed or VAMdev via the Hub or Discord?
 