My dream VAM plugin/scenario

My dream for VAM is a VR version of the Holodeck.

Within this virtual space, you have living, breathing, walking, talking people who know how to interact with their environment, have a library of actions they can perform when they feel like doing so, and know who the players in the space are and how to interact with them.

This requires a framework that enables a character to "download" sets of actions (like kung fu, hip-hop dancing, ballet, the Kama Sutra, etc.).

It also requires the character to know how to talk, understand speech, and respond to commands (if willing). Going further, they would exhibit personality, have preferences, and react to how they are being interacted with based on those traits.

This would allow developers who create animations to package them up as actions that combine seamlessly with libraries from other creators and work on any character.

2.0 graphics are looking amazing. But we're going to get pretty bored with them if there aren't any brains behind the polygons. We need both sides of the equation.


I'm very impressed with the strides already made towards this, but I would love to see talented developers team up on a strategy to bring living, breathing, walking, talking and acting virtual people into VAM.

Thanks, and thanks in advance, for all the work being done to bring this into reality. You are all on the cutting edge of cyberspace here.
 
You are talking about "Minority Report" stuff? What scenario do you want to be in? Kill your boss? :D

I think this is where all of this leads, but it will take decades. What you want sounds like 2040-2050 to me. Maybe I'm wrong.
The first thing would be to bring characters to life. There are a lot of plugins already that try to simulate stuff like that, and even combining all of them, it still doesn't feel real.

Yeah, interesting idea. Far away from what is possible right now, if you ask me. If this program becomes able to simulate a kind of lifelike character, then let's talk again.
 
I think this is the dream of most of us. Unfortunately, I don't see it in the near future. There are those Japanese/Korean VR simulations where you can meet a girl and "talk" to her... but even those feel lifeless.

What I recently had in mind is something that could be doable even with today's functions:
Do some of the older PC users still remember "Little Computer People"?
That was a very simple, Tamagotchi-like "simulation" where you could just sit and watch a little pixel guy live in his house, go to his job, clean, cook, sleep, ... Not nearly as complex as The Sims! You had only a very few options to interact with him, so you almost had the impression someone was living independently in your computer. It was fun and relaxing.
With a big amount of scripting work triggering some MoCap animations, I think it is already possible in VaM.
 
It's feasible short term, but it would require almost two years of full-time work. You'd need to concentrate solely on the AI and the brain, delegate all the animated content to creators, and create a framework to interface with that content. Considering the time required, it's not really worth doing as a plugin alone; it needs to be used for something else.
 

It's all doable now. The fatal mistake, however, is to think of the conversational AI as the 'brain'. If we can move away from this kind of false mind-body dualism and think of everything the model says and does, a holistic simultaneity driven by the plugins, as the AI, then VAM will get there much quicker.

Big companies, the embodied-AI market leaders, made this critical mistake, received millions in investment, and ended up with badly lip-synced talking heads, no torsos. Fair enough, it's also because they are using something like WebGL to interact with users, and this limits them to (I think) 400mbs (hence only heads and head animations), but VAM is not limited to using WebGL. We can make whatever we want, anything. Give me the C#/XML/AIML-modded speech recognition plugin and I can build the rest before this Christmas. The rest is already on my hard drive in various scenes. I know how to do this; I've been studying it for the last three years.
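For anyone wondering what the AIML layer of such a pipeline can look like, here is a minimal sketch using the open-source python-aiml package rather than a C# plugin (that choice is my assumption for illustration). The two pattern/response rules are made up; in a real setup the input would come from the speech-recognition plugin and the templates would fire scene triggers.

Python:
import os
import tempfile

import aiml  # pip install python-aiml

# Two illustrative AIML rules; a real "brain" would load many category files.
RULES = """<?xml version="1.0" encoding="UTF-8"?>
<aiml version="1.0.1">
  <category>
    <pattern>HELLO *</pattern>
    <template>Hi! What would you like to do?</template>
  </category>
  <category>
    <pattern>DANCE FOR ME</pattern>
    <template>Okay, starting the dance animation.</template>
  </category>
</aiml>
"""

kernel = aiml.Kernel()
with tempfile.NamedTemporaryFile("w", suffix=".aiml", delete=False) as f:
    f.write(RULES)
    path = f.name
kernel.learn(path)   # load the rules into the pattern matcher
os.unlink(path)

# Text that would normally arrive from speech recognition:
print(kernel.respond("Hello there"))    # -> Hi! What would you like to do?
print(kernel.respond("dance for me"))   # -> Okay, starting the dance animation.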

At the moment the limit is giving the illusion of self-awareness, but there may be ways of building levels of self-reflexivity and self-awareness into the AI. That is a long way off, though. This is what Luca tried to do with Replika AI: Replika uses at least two 'neural nets'. They model a primitive brain and a more (representationally) conscious, socially developed brain, higher and lower functions. What Luca didn't do was find a way to integrate them so one fires off the other to create a reflexive 'mirror' mimicking human self-awareness. That's why you have two states of Replika, with higher and lower functions. You will never get AI consciousness until you can simulate full sensory awareness: touch, taste, the beat of a human heart, the smell of clean skin, and so on. Without that, it's like trying to play music on an air guitar and wondering why there's no sound. Full sensory simulation and machine consciousness are probably decades away. But we can mimic it, right now.
 
With a big amount of scripting work triggering some MoCap animations, I think it is already possible in VaM.

You are absolutely right, 100%. Does this translate correctly: Verkettung (Eng: concatenation)? We are past the tipping point and this process will increasingly drive itself.
 
I have posted on this in the past, and a lot of it is already done for game AI. There are papers going back to 2016 showing 3D humanoid models that can walk realistically around an environment, stepping over objects, ducking under low areas, and even sitting down. It's all handled by a pre-trained AI and doesn't use too much CPU. Even getting something like this working in VaM would be amazing, because you could give a model a start and end point and have them walk there. No animation. No motion capture.
It wouldn't take long to train the AI from motion-captured porn videos to work out how to perform those actions too, without fixed animation. The level of realism this would bring would be astounding.
Conversation is great, but we could have models that walk, move and pick things up NOW if somebody good at coding gets to it and integrates free, open-source game AI into VaM.
The "Balance" plugin and related stuff is an example of how amazing this sort of thing can be, and it's nowhere near what the AI models do.
 
It's funny how so many people are AI enthusiasts and have thought about how to do it, yet there are so many different visions of how to do it and of which part is the most important. In my case, perfecting movement is nothing without the emotions, the context, etc. That would just create an even bigger gap between the immersion/realistic behavior and the visuals/animations.

I believe that the brain is the emotional AI and the big chunk of work, and it should use the conversational AI more as a means to interface with reality (or VR in this case). Animations/expressions are another module or AI that interfaces with the emotional AI. To accomplish this, heavy labeling / special tokens are required to properly interface with emotions. This was done in a few SOTA papers released this year, but they almost went under the radar and they don't seem to be part of the hype.
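As a rough sketch of the special-token idea (my own toy example, not from those papers): with the Hugging Face transformers library and DialoGPT you can register invented emotion tags as special tokens and let the emotional module prepend them to the conversational model's input. The tags below are made up, and they only become meaningful after fine-tuning on dialogue data labeled the same way.

Python:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

# Invented emotion tags, registered so the tokenizer never splits them.
emotion_tags = ["<happy>", "<annoyed>", "<shy>", "<excited>"]
tokenizer.add_special_tokens({"additional_special_tokens": emotion_tags})
model.resize_token_embeddings(len(tokenizer))

# The emotional AI would choose the tag; here it is hard-coded.
prompt = "<shy> Do you want to dance with me?" + tokenizer.eos_token
inputs = tokenizer(prompt, return_tensors="pt")

reply_ids = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated part of the sequence.
reply = tokenizer.decode(
    reply_ids[0, inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(reply)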

Fine-tuning is required, and it must be done on custom datasets, not the usual Reddit scraping that everyone is using. That in turn requires other AI models, like STT, to properly extract the content, plus a lot of editing. I mean, if you want to create a chatbot, sure, use chat data. But if you want to create a realistic face-to-face conversation, you're using the wrong data.

Assessing the level of work required feels like going ever deeper down the rabbit hole and continuously finding another ML tool that needs to be added, tweaked and fed the right data to move forward.
 
Conversational AI is WAY off. Certainly not VaM ready. It's sad; I would love it, but we are not there yet.
AI movement to remove the need for programmed animations and mocap? It exists! There is no reason it can't be in VaM right now.
Of course fully conversant moving AI models are the "final goal" but that's a long way off. I just want what we could have to be done right now while the rest is worked on.
I see many existing game AI engines as being things that could easily be integrated into VaM for low cost (CPU/GPU) that would make the existing experience much better.
 
Conversational AI is WAY off. Certainly not VaM ready.
Having played around with a few BERT and GPT models, I have to disagree, except that there is a budget issue. It's often badly implemented because companies try to rely almost solely on the model without adding the required layer of complexity. Even with a generic model, you can have an interesting conversation if you tweak the input properly. The truth is, the free AI games or apps that people try for fun don't rely on the best models, because those models are very GPU intensive. Also, very few people would buy a second $2000 graphics card just for an AI. But if provided as a service, it should be viable.
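To illustrate "tweak the input properly" (my own toy example): much of the trick is just wrapping the user's line with a persona description and the recent history before sending it to whatever generic model you have. The persona text, formatting and history limit below are invented.

Python:
# Toy "input tweaking" layer around a generic language model. The returned
# string is what would actually be sent to the model for completion.
PERSONA = (
    "The following is a conversation with Ava. "
    "Ava is playful, curious and a little sarcastic."
)

def build_prompt(history, user_line, max_turns=6):
    """Assemble persona + recent history + the new user line."""
    recent = history[-max_turns:]        # keep the prompt short enough for the model
    lines = [PERSONA, ""]
    for speaker, text in recent:
        lines.append(f"{speaker}: {text}")
    lines.append(f"You: {user_line}")
    lines.append("Ava:")                 # the model completes from here
    return "\n".join(lines)

history = [("You", "Hi Ava."), ("Ava", "Oh, you again. What now?")]
print(build_prompt(history, "Want to dance?"))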

I see many existing game AI engines as being things that could easily be integrated into VaM for low cost (CPU/GPU) that would make the existing experience much better.
Could you provide a few examples? You still need some way to trigger those animations at the right moment.

P.S. Things are moving super fast in that field. A research team could publish a new revolutionary distillation technique with code anytime and bam, the economic state of conversational AI changes instantly.
 
By conversational AI I mean STT and TTS; that is not ready for local PC use, or anywhere near it. A text-only chatbot is nothing these days. Full, accurate, real-time speech recognition, though? That is a nightmare. Realistic text-to-speech is pretty hard too.
As for a "way to trigger those animations": no, you don't. That's the point of them. They are not triggered animations. The AI moves like a 'real' person, including walking, sitting, lying down and picking things up. Your scene could simply have instructions for "get out of bed, walk to the sofa, sit down" and the model would do that. It's not triggering animations but generating realistic motion for tasks.
I am not talking about actual AI here, simply making programmed scenes look realistic. Some pre-recorded voice work with an AI motion plugin would look pretty close to a real person walking around. Certainly a lot better than the preprogrammed animation we have now. Better than the glitchy mocap too, because it can interact with the environment in real time.

Check out this video, which features work published in 2019.
This is not pre-programmed animation, but real-time, AI-generated movement that interacts with a dynamic environment.
 
Looks like we were not talking about the same thing. TTS and STT are usable today for real time applications.

The current SOTA for STT has a word error rate (WER) below 3%, and it runs 2 to 6 times faster than the original audio. There are open-source libraries with streaming capabilities, meaning that words are converted live instead of waiting for the full sentence. The delay between the end of speech and the text can be within 200 ms, which is more than enough for a game. By the way, real humans don't get 100% either; when in doubt, they ask "What did you say?", and that can be implemented too, because confidence level is an available metric. You probably haven't seen this in a game yet because things move so fast that these are the results for the latest 2021 models and open-source libraries.
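As one concrete, freely available example of that kind of streaming STT, here is a sketch using the open-source Vosk library with a microphone stream from sounddevice. The model path, the 0.6 confidence threshold and the "What did you say?" fallback are my own illustrative choices.

Python:
import json
import queue

import sounddevice as sd                 # pip install sounddevice
from vosk import Model, KaldiRecognizer  # pip install vosk

SAMPLE_RATE = 16000
model = Model("model")            # path to a downloaded Vosk model folder
rec = KaldiRecognizer(model, SAMPLE_RATE)
rec.SetWords(True)                # include per-word confidence in the results

audio_q = queue.Queue()

def callback(indata, frames, time, status):
    # Runs on the audio thread: just hand the raw bytes to the main loop.
    audio_q.put(bytes(indata))

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000,
                       dtype="int16", channels=1, callback=callback):
    while True:
        data = audio_q.get()
        if rec.AcceptWaveform(data):            # True when an utterance ends
            result = json.loads(rec.Result())
            words = result.get("result", [])
            if not words:
                continue
            conf = sum(w["conf"] for w in words) / len(words)
            if conf < 0.6:
                print("NPC: What did you say?")  # low confidence, ask again
            else:
                print("Heard:", result["text"])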

Regarding TTS, most SOTA techniques have a real-time ratio of 8:1 to 10:1. That means 5 seconds of generated audio is computed in about 500 ms. That's a bit slow, but you can create the illusion of a faster pace with shorter sentences before longer ones, pre-recorded adverbs, mouth sounds or body reactions at the beginning of the response, and that 500 ms is gone.
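A small sketch of that latency trick: start synthesis in a background thread and cover the delay with a short pre-recorded filler. Here synthesize() and play() are stand-ins (they just sleep) for a real TTS engine and audio player; only the overlap pattern is the point.

Python:
import threading
import time

def synthesize(text):
    time.sleep(0.5)                      # stands in for ~500 ms of TTS compute
    return f"generated audio for {text!r}"

def play(clip, duration):
    print("playing", clip)
    time.sleep(duration)                 # stands in for blocking playback

def speak(text):
    result = {}
    worker = threading.Thread(target=lambda: result.update(audio=synthesize(text)))
    worker.start()                       # synthesis runs in the background...
    play("mmm.wav", duration=0.6)        # ...while a short filler hides the delay
    worker.join()                        # usually already finished by now
    play(result["audio"], duration=5.0)

speak("I was hoping you would say that.")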

The current problem with TTS is quality, or maybe that was a Q3 2021 thing. Have a look at VITS for quality:
https://jaywalnut310.github.io/vits-demo/index.html
Every time I hear a Tacotron model my ears bleed, and everyone says "wow, remarkable", and I'm like, really? The first time I heard VITS samples I was blown away, even though I think those narrator voices are boring. Then I heard amateur samples shared on Discord of voices made with VITS, and I was blown away again, because people got outstanding results with a completely different voice/pace/emotion with just a bit of fine-tuning. There is actually an even bigger hype around TalkNet now; the quality is lower, but the usability for pitch/emotion is much better.

Those two will surely be beaten by something even better in the near future, but my point is: we have reached a point of usability. Most of the work should not be the fine-tuning but the integration; these models are just interchangeable blocks that fit into an architecture. Don't focus too much on the models but on the AI wrapping them and making everything work together, and we can have something working soon enough.
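One way to read "interchangeable blocks that fit into an architecture": write the wrapper against small interfaces, so VITS, TalkNet, a different STT or a different dialogue model can be swapped without touching the rest. All names below are invented for illustration.

Python:
from typing import Protocol

class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class DialogueModel(Protocol):
    def reply(self, text: str) -> str: ...

class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class ConversationLoop:
    """The wrapper that makes the blocks work together."""

    def __init__(self, stt: SpeechToText, brain: DialogueModel, tts: TextToSpeech):
        self.stt, self.brain, self.tts = stt, brain, tts

    def step(self, audio_in: bytes) -> bytes:
        heard = self.stt.transcribe(audio_in)   # any STT block
        answer = self.brain.reply(heard)        # any dialogue block
        return self.tts.synthesize(answer)      # any TTS block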

From an economic POV, STT, TTS, GPT, etc. only make sense as a service (for now), or for people without a budget limit.
 
As for "way to trigger those animations" no. You don't. That's the point of them. They are not triggered animations. The AI moves like a 'real' person, including walking, sitting, laying, picking things up. You scene could simply have instructions for "get out of bed, walk to the sofa, sit down" and the model would do that. It's not triggering animations but getting realistic motion for tasks.
I am not talking about actual AI here, simply making programmed scenes look realistic.
I understand your point, but I am not talking about making a scene; I'm talking about having an "unconstrained" experience. Even though it's just a few simple instructions, that's still just a higher-level usage of animations. The same results could be accomplished with a lot of coding and solving algorithms; maybe not as good, but better than what we have. Then you will have a lot of custom instructions to implement, and you still need to assess how much time is required to train and gather the necessary data versus a more conventional framework to blend animations. Have we reached a more economical way to generate animations with ML, or does it just look better with the currently available mainstream datasets?

It would be really cool if you didn't need to trigger those higher-level instructions or control them explicitly with a timeline; you could have a brain controlling those instructions. Otherwise, once immersed, you will soon find the boundaries of the scene, mainly because your expectations will grow with your level of immersion.

I was thinking about a middle ground between making a scene and making an unconstrained AI: like a story with chapters, where you can go off-track, and at some point the AI figures out a way to put you back on track with the story.
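A very rough sketch of that chapter/off-track idea; the chapter goals, the keyword check and the nudge lines are placeholders for whatever the AI would actually track.

Python:
# Placeholder chapters: each has a goal the player should reach and a line
# the character can use to steer the player back toward the story.
CHAPTERS = [
    {"goal": "sit on the sofa", "nudge": "Come sit with me for a minute."},
    {"goal": "pour some wine",  "nudge": "Would you grab us a drink first?"},
    {"goal": "start the dance", "nudge": "You promised me a dance, remember?"},
]

class StoryDirector:
    def __init__(self, chapters, patience=3):
        self.chapters = chapters
        self.current = 0
        self.off_track = 0
        self.patience = patience      # off-track turns tolerated before nudging

    def observe(self, player_action):
        chapter = self.chapters[self.current]
        if chapter["goal"] in player_action:
            self.current = min(self.current + 1, len(self.chapters) - 1)
            self.off_track = 0
            return None               # on track, let the scene continue
        self.off_track += 1
        if self.off_track >= self.patience:
            self.off_track = 0
            return chapter["nudge"]   # gently steer back toward the story
        return None

director = StoryDirector(CHAPTERS)
for action in ["look around", "open the fridge", "check the window", "sit on the sofa"]:
    line = director.observe(action)
    if line:
        print("NPC:", line)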

VAM can be much more than a scene generator.
 
I just watched that video. They had to use many mocap clips of people sitting on chairs just to achieve someone sitting on different types of chairs. Do you have any idea how much editing was probably required on a collection of mocaps to isolate them, and then how they had to prepare those reference points on the chairs? I mean, it's cool, but it's very far from a turnkey solution that can simply be imported. Just for the action of sitting on chairs, you would need to standardize all chair assets with reference points and categorize them (some don't have armrests) so that the ML animation could work. Which is also what you would need if you just wanted to create a generic framework for higher-level animations.
 
Hey, since you seem very knowledgeable in this field, could you help me out? I'm currently recording some TTS samples for a scene I'm making, for the girl's responses. So far I have looked at Azure and Google, and I've gone with Azure since their Aria model has "emotion" patterns which you can use for the voice. Do you know of any other high-quality TTS software? The Azure model is quite hit or miss; I have to mess around with how I phrase things, because most of the time the AI comes up with an unnatural speech pattern. If you know an online source where I could just try stuff, that would be great, but using Python would be no problem either. Hope you can point me to something. I liked the page with the VITS examples, but they are super dry. I'd want a more female, sexy kind of voice instead of a bored narrator. Same with the Tacotron 2 voices: they are pretty good, but again, how do I change that to a sexier voice? Azure from Microsoft is the closest I've got…
 
I mostly follow the open-source projects available on GitHub, so I can't advise on the commercial services.

It's all about the data. You will need a dataset to train the AI, but if you can get the dataset, the training part is easier than you think. For example, this project is almost turnkey; you just need to get the dataset and push a button for training:
https://github.com/BenAAndrew/Voice-Cloning-App

For the data, you will need audio files and their matching sentences in text as a CSV. That's the hard part. The Voice-Cloning-App project can generate those files automatically from a Kindle book or many audiobooks, but you don't want to take them from a book. What you can do is find a sexy voice with a script, but you will need at least 1 hour of audio for any passable quality, and ideally 4 hours for optimal results. That's the hardest part, because when new models are released they will be integrated into GitHub projects, and if you have your data, the TTS you produce will improve just by training the new model on the same data.

Your best shot is to find or create enough audio of the sexiest voice along with its script. The script must match the spoken words perfectly for best results, and the audio should have no background music or noise.
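For reference, the audio-plus-matching-text dataset usually ends up looking like an LJSpeech-style layout: a folder of short wav clips and a pipe-separated metadata file mapping each clip name to its exact transcript (some tools expect an extra normalized-text column). The file names and lines below are made up.

Python:
import csv
from pathlib import Path

# Made-up clip names and transcripts; the text must match the audio exactly.
clips = [
    ("clip_0001", "Hey there, I was hoping you would come by tonight."),
    ("clip_0002", "Pleeaaase stay a little longer."),
    ("clip_0003", "I can't do that, not yet."),
]

dataset = Path("my_voice_dataset")
(dataset / "wavs").mkdir(parents=True, exist_ok=True)   # wavs/clip_0001.wav, ...

with open(dataset / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for name, text in clips:
        writer.writerow([name, text])    # one "name|transcript" line per clip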

Now, if you can't find or afford an actress to read a script for 4 hours, you can try to extract a voice from a place like
https://www.reddit.com/r/gonewildaudio/, where the scripts are linked to the voices. But if you intend to release the result, ask permission.

Another solution is to find JOI or solo porn with a woman pretending to be with you, and find a way to subtitle the video. Do it manually, or dare to use STT on it to generate the text. That's the rabbit hole of machine learning: there is always a better way, by using another tool or method to improve the data.

The best high-quality solution with the current state of TTS: find an actress and record her multiple times with different emotions or situations, so that you have a sexy voice that works for foreplay and a voice better suited to intense action, like when she screams your name.

But TalkNet is supposed to improve the emotion problem a lot.
 
Oh, and if using an actress, you should use a variety of scripts that match the probable situations. Don't make her read just any book. The intonation and pitch are trained based on how the words were pronounced in context; the model will try to match the tone to the text provided. For example, if you are training a submissive female voice, don't force all sentences to be 100% submissive. Some sentences, like "I can't do that" or "I am not ready for this", should be acted accordingly, so that your trained voice will be able to roleplay those lines properly.

The AI will look for voice patterns based on the words and sentences. For example, begging sentences with the word "please" will be said differently based on your data. If the actress is always saying "pleeaaase" slowly and softly, the model should learn that sentences with "please" should be said that way.
 
Thanks for the feedback. I'm interested in taking the dive, probably using some voices from gonewildaudio. The one thing I'm a bit afraid of is the quality of the produced sound: I hope it's going to be clear, without too much grainy noise in the background, if you know what I mean.
 
I know. Check their discord for voice samples.

The result can only be as good as the data you have. The project I linked is a Tacotron 2 implementation, but it's the most plug-and-play project I have seen. One advantage is that it can also reformat your audio and script. If the results are not good enough, you can always try something else later.

This one already has VITS implemented and has streaming capabilities:
but I find it harder to learn because it's documented for people who already know what they are doing.
 
Ok, so I'm following this tutorial:

Currently training the model on 74 hand-selected voice samples. Curious what it'll produce. Just a test drive; I'll update with results.

Edit: ugh. So I hit a roadblock: the Google Colab notebook requires a good GPU assigned to it, but I never get a good GPU assigned. So I need to run this locally. I'll need to dig further into this and find some local Python version to try this out.
 
The free version of Colab is not very reliable for training if you train for long hours. Are you new to Python?
 
No, I have programmed some tkinter apps in Python and installed some third-party libraries. I also worked with foto2vam for Python. I am not a professional Python programmer by any means, though. It's just that this tutorial was very clear, and just clicking in a web notebook was super easy. I had no idea, however, that Colab needed a Pro version. If running it locally is too much of a hassle I might buy the Pro version, but I haven't looked into the repos for a local version yet. I'll give that Voice-Cloning-App you mentioned a whirl.

Edit: update. So I gave it a whirl on the 72 voice clips. I got the software to work, which is awesome. The "tone" of the voice was perfect, except that I could not make sense of what was said. 72 voice clips is way too small. I am now trying a new one with a woman with a nice voice reading a book. I know. But let's see how this works out.
 

If it's just MP3 voice recordings that you want to produce, with TTS with emotional emphasis, you could use Amazon Sumerian. It's old tech, rubbish for animation compared to VAM, but the TTS is very good for what you want to do. You can use screen capture (video) to record the audio as MP3, then load it into one of VAM's speech recognition plugins. Another similar way to go is RT Voice in Unity; it's good, but you will have to pay for the asset. The dev, Steph, is super helpful if you get stuck. Hope that helps. I spent a long time messing with cloud-based chatbots. My advice: don't bother, they're too fragile to embody in a 3D model in a game engine, too many links that can break in the chain. And stay the hell away from IBM Watson embodied in a 3D model; Windows' speech recognition is way better.
 
@Hedgepig could you just use Amazon Sumerian for the TTS? What is the quality? Right now I'm using Microsoft Azure, which is a lot better than IBM Watson or Google Wave. I looked at Amazon Polly, but that TTS engine sucks. So Amazon Sumerian would be better? I checked out their page and it looks like some kind of 3D online thing you can use; I'm only interested in the TTS part, however...

Is it better than this? https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#features
Using Aria (Neural) with the Cheerful setting? Because that is what I'm using right now, and it's the best one I could find online. I hope Amazon Sumerian sounds better; would love to hear your opinion.
 
Edit: update. So I gave it a whirl on the 72 voice clips. I got the software to work, which is awesome. The "tone" of the voice was perfect, except that I could not make sense of what was said. 72 voice clips is way too small. I am now trying a new one with a woman with a nice voice reading a book. I know. But let's see how this works out.

It would be worth doing a few QC checks first to ensure that the problem is not your data (a quick sanity-check sketch for the first few points follows the list):
  1. Did you split your audio into smaller samples (usually 2 to 10s long)?
  2. Did you trim the beginning of your audio or text to remove audio that doesn't match text?
  3. Did you check the samples to ensure that the text is matching with the audio?
  4. Did you use transfer learning or did you start training a new model?
  5. How many epochs did you train the model?
  6. Were all those clips from the same person and single voiced?
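Here is the sanity-check sketch mentioned above, covering roughly points 1 to 3. It assumes an LJSpeech-style layout (a wavs/ folder plus a pipe-separated metadata.csv); the paths and the 2-10 s thresholds are illustrative.

Python:
import wave
from pathlib import Path

dataset = Path("my_voice_dataset")       # adjust to your dataset location

for line in (dataset / "metadata.csv").read_text(encoding="utf-8").splitlines():
    name, text = line.split("|")[:2]     # "name|transcript" (extra columns ignored)
    wav_path = dataset / "wavs" / f"{name}.wav"

    if not wav_path.exists():
        print(f"{name}: missing audio file")
        continue
    if not text.strip():
        print(f"{name}: empty transcript")

    with wave.open(str(wav_path)) as w:
        duration = w.getnframes() / w.getframerate()
    if not 2.0 <= duration <= 10.0:
        print(f"{name}: {duration:.1f}s is outside the usual 2-10s range")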
Another workflow that I played with a bit today is voice conversion. It gives you some control over the intonation: you train two voices, yours and the voice you want, and then you can say something and it will convert your audio directly to the other voice while keeping the emotion.
 