Does anyone here have experience with the lipsync plugin or the built-in lipsync? I can make both work, but I get very bad results. So I ran a few tests with simple vocals: basically me making "aaaa" at different pitches, then "ooo", "iiii", etc. The results are really bad.
Here is what I observed during those tests:
- The detected phoneme is sensitive to pitch, when in reality pitch should have zero impact on the shape of the mouth. If I make the sound "aaa" or "ooo" at specific pitches like "do re mi fa sol la si do", I get completely different phonemes with the built-in lipsync. Same with the plugin, but with the built-in one you can visualize the detection in real time. The effect is also repeatable: I can make the mouth look completely different, even open or fully closed, just by singing "ooo" at a pitch pattern like "do re do re do".
- The detection seems to work a bit better on very high-pitched voices, but seems almost random on lower-pitched voices.
- I have seen a lipsync asset for Unity that requires a calibration step for each voice, but there is no such option in VAM, which may explain why the demo audio works well while my results are so bad.
I have a few ideas to fix the problem, worth investigating:
- Find a way to bypass the phoneme detection and feed the phonemes based on text. That might seem complex, but in reality many audio sets already come with the corresponding text, and the phonemes would be 100% accurate.
- Just find a way to calibrate a voice in VAM.
- Preprocess with a better AI solution, and add the option to drop the real-time detection. The preprocessing could be done from VAM or externally, but the idea is that if we know the format required to trigger the morphs, we could match timestamped phoneme files with their corresponding audio and use better but resource-hungry AI libraries to compute them. For example, some STT and TTS libraries generate phonemes as a step in their process.
- For reference, Whisper can provide timestamped words or tokens: https://github.com/linto-ai/whisper-timestamped
- The words could then be converted back to phonemes with a text-to-phoneme library.
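To illustrate the preprocessing idea, here is a minimal sketch of the last step: turning timestamped words (like those produced by whisper-timestamped) into timestamped phonemes a lipsync engine could consume. The `TIMESTAMPED_WORDS` data and the tiny `LEXICON` dictionary are hypothetical placeholders; a real pipeline would use an actual STT output and a proper grapheme-to-phoneme library (e.g. one based on the CMU Pronouncing Dictionary). It naively spreads each word's duration evenly across its phonemes, which is crude but enough to drive mouth morphs:

```python
# Hypothetical output of a speech-to-text pass: (word, start_s, end_s)
TIMESTAMPED_WORDS = [
    ("hello", 0.00, 0.42),
    ("world", 0.50, 1.00),
]

# Tiny hand-rolled lexicon (an assumption for this sketch,
# not a real grapheme-to-phoneme model)
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def words_to_phonemes(words):
    """Spread each word's duration evenly across its phonemes."""
    out = []
    for word, start, end in words:
        phones = LEXICON.get(word.lower())
        if not phones:
            continue  # unknown word: skip (or fall back to a g2p library)
        step = (end - start) / len(phones)
        for i, ph in enumerate(phones):
            out.append((ph,
                        round(start + i * step, 3),
                        round(start + (i + 1) * step, 3)))
    return out

if __name__ == "__main__":
    for ph, s, e in words_to_phonemes(TIMESTAMPED_WORDS):
        print(f"{s:.3f}-{e:.3f}  {ph}")
```

The result is a flat list of (phoneme, start, end) triples, which is exactly the kind of timestamped phoneme file the idea above depends on; the open question is what format VAM would need to consume it.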