Performance Patch (up to 30% more physics speed)

Wait, you're telling me ReadInt32() and ReadSingle() in the original code are actually going back to the file every single time? That's insane; I had no idea the zip library would work like that.

C#:
        public void LoadDeltasFromBinaryFile(string path) {
                //Debug.Log("Loading deltas for morph " + morphName + " from " + path);
                try {
                        using (FileEntryStream fes = FileManager.OpenStream(path, true)) {
                                using (BinaryReader binReader = new BinaryReader(fes.Stream)) {
                                        // Every ReadInt32()/ReadSingle() below is a 4-byte read
                                        // issued directly against the zip entry stream.
                                        numDeltas = binReader.ReadInt32();
                                        deltas = new DAZMorphVertex[numDeltas];
                                        for (int ind = 0; ind < numDeltas; ind++) {
                                                DAZMorphVertex dmv = new DAZMorphVertex();
                                                dmv.vertex = binReader.ReadInt32();  // target vertex index
                                                Vector3 v;                           // delta, three floats
                                                v.x = binReader.ReadSingle();
                                                v.y = binReader.ReadSingle();
                                                v.z = binReader.ReadSingle();
                                                dmv.delta = v;
                                                deltas[ind] = dmv;
                                        }
                                }
                        }
                }
                catch (System.Exception e) {
                        Debug.LogError("Error while loading binary delta file " + path + " " + e);
                }
        }
For every float, it is not opening the file every time, but it calls the ZipLibrary, which in turn does an allocation of 4 KB, reads 4 KB from disk (the minimum amount in the zip library), decompresses 4 KB (yes, that's the minimum amount it can decompress), truncates that to 4 bytes, casts those 4 bytes to a single float and returns the float. This happens FOR EVERY SINGLE FLOAT. You see millions of ReadFile(4kbytes) in procmon when loading a scene lol. The amount actually read from the hard drive is exactly 1024 times as large as needed.

The MemoryStream isn't an optimal solution, but at least it skips the unneeded, expensive zip decompression. The perfect solution would be to load the whole zip from disk once, keep it in memory and decompress the individual files from memory. Not sure if VaM loads all files from the .var or only the needed parts, so memory usage might increase if you do it that way. But at least do the small fix I mentioned (sketched below); it just skips the redundant decompression.
It's actually a common mistake in Unity game dev lol.

I think the callstack is something like BinaryReader.ReadSingle() -> BinaryReader.ReadBytes(4) -> ZipLibrarySomething.ReadBytes(4) -> ZipLibrarySomething.ReadInternal(4096) -> ZipLibrarySomething.Decompress(4096)
You really need to do decompression in blocks as big as possible; there is also huge additional overhead in initializing the decompression state machine. You might get away with the OS caching the disk reads, but all those small decompressions are burning the CPU.
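The small fix looks roughly like this (a sketch of the idea, not the patch's exact code; FileEntryStream/FileManager are VaM's existing wrappers):
C#:
        public void LoadDeltasFromBinaryFileBuffered(string path) {
                try {
                        // One sequential pass: pull the whole decompressed entry into memory.
                        byte[] buffer;
                        using (FileEntryStream fes = FileManager.OpenStream(path, true))
                        using (MemoryStream ms = new MemoryStream()) {
                                byte[] chunk = new byte[64 * 1024];
                                int n;
                                while ((n = fes.Stream.Read(chunk, 0, chunk.Length)) > 0)
                                        ms.Write(chunk, 0, n);
                                buffer = ms.ToArray();
                        }
                        // Then parse from RAM; no zip call happens per ReadSingle() anymore.
                        using (BinaryReader binReader = new BinaryReader(new MemoryStream(buffer))) {
                                numDeltas = binReader.ReadInt32();
                                deltas = new DAZMorphVertex[numDeltas];
                                for (int ind = 0; ind < numDeltas; ind++) {
                                        DAZMorphVertex dmv = new DAZMorphVertex();
                                        dmv.vertex = binReader.ReadInt32();
                                        Vector3 v;
                                        v.x = binReader.ReadSingle();
                                        v.y = binReader.ReadSingle();
                                        v.z = binReader.ReadSingle();
                                        dmv.delta = v;
                                        deltas[ind] = dmv;
                                }
                        }
                }
                catch (System.Exception e) {
                        Debug.LogError("Error while loading binary delta file " + path + " " + e);
                }
        }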
 
I think if you also do IL2CPP for those special jobs [...] you might see some benefit, but it will not match a .dll compiled for AVX2 and optimized with clang.
Pretty sure you can get similar performance using Burst now, which makes it pretty convenient.
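For reference, a minimal Burst sketch of the idea (a hypothetical job, not VaM code; assumes a recent Unity with the Burst/Jobs/Mathematics packages, so not VaM's Unity 2018):
C#:
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;
using Unity.Mathematics;

// Burst compiles this to vectorized native code; applies morph deltas to vertices.
[BurstCompile]
struct ApplyDeltasJob : IJobParallelFor {
    [ReadOnly] public NativeArray<int> vertexIndices; // which vertex each delta targets
    [ReadOnly] public NativeArray<float3> deltas;     // per-vertex morph deltas
    public float weight;                              // morph strength
    // Scattered writes are safe here because each vertex index appears
    // at most once in a single morph's delta list.
    [NativeDisableParallelForRestriction]
    public NativeArray<float3> vertices;

    public void Execute(int i) {
        vertices[vertexIndices[i]] += deltas[i] * weight;
    }
}

// Usage: new ApplyDeltasJob { ... }.Schedule(deltas.Length, 64).Complete();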

Edit: Oh and really nicely done @ this patch. I was just wondering about morphs and how they are applied currently a couple of days ago.
 
Not sure what ExtractSample does exactly, but it's in the top 4 methods when I profile my scene.

Thread | Call count | Method name | Total runtime (ns) | Total allocation (bytes)
56 | 435 | DAZCharacterRun:RunThreaded (bool) | 1663311300 | 90755072
43 | 435 | DAZCharacterRun:RunThreaded (bool) | 1026883000 | 7184384
68 | 380 | SpeechBlendEngine.ExtractFeatures:ExtractSample (single[],single[2],single[2],single[],single[2]&,int,int,SpeechBlendEngine.SpeechUtil/Accuracy) | 902135000 | 11866112
67 | 357 | SpeechBlendEngine.ExtractFeatures:ExtractSample (single[],single[2],single[2],single[],single[2]&,int,int,SpeechBlendEngine.SpeechUtil/Accuracy) | 843778400 | 11665408

First, the method normalizes the spectrum; then there are two nested for loops that look weird.
At the end it calls CepstralCoefficients, but that one is somewhere near the bottom when it comes to total runtime.
Thread | Call count | Method name | Total runtime (ns) | Total allocation (bytes) | Avg per call (µs)
67 | 357 | SpeechBlendEngine.ExtractFeatures:CepstralCoefficients (single[],single[2],single[2]&,single[],int,int) | 1130400 | 0 | 3.166386555
68 | 380 | SpeechBlendEngine.ExtractFeatures:CepstralCoefficients (single[],single[2],single[2]&,single[],int,int) | 1110000 | 0 | 2.921052632
 
Not sure what ExtractSample does exactly, but it's in the top 4 methods when I profile my scene. [...]
Looks like it parses the audio for something; very float-math heavy, very expensive in C#, no wonder it's slow. Could be heavily improved, but since the lip approximation is very bad anyway, I have no desire for it lol
 
Just want to throw my thanks to you! Getting nearly double the frame rate in the test, going from 60fps to 106-121fps!
Intel 9700K and a 4090 - yes, those are two very ill-matched components, but this patch brings them much closer!

Top is the latest patch, bottom is without - with my bloated VaM directory
speed_compare.jpg

For reference, this is with glute soft body off but breast physics on. Glute still tanks the frame rate: with glute on, it's 39 FPS without the fix and 44 FPS with it.
 
Just want to throw my thanks to you! Getting nearly double the frame rate in the test, going from 60fps to 106-121fps! [...]
Nice. Soft body makes the patch negligible because my patch is a flat reduction in a few specific parts, and sadly there is that FPS vs frametime conundrum: if rendering improves from 6ms to 4ms you go from 166fps to 250fps, but if it improves from 30ms to 28ms you go from 33fps to 35fps lol. When you enable soft body, you tell the Unity engine to use kinematic bodies instead of simple colliders. Sadly their performance is out of reach for VaM or me; no one but Unity can improve it. The best you can do is avoid cluttering the CPU cache in other places so they run faster.
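Spelled out: fps = 1000 / frametime in ms, so the same 2ms saved gives 1000/6 ≈ 167 -> 1000/4 = 250 (+83 fps) at the fast end, but only 1000/30 ≈ 33 -> 1000/28 ≈ 36 (+3 fps) at the slow end.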
 
Nice. Soft body makes the patch negligible because my patch is a flat reduction in a few specific parts [...]
Still - this improves things a lot for me as I turn off glute anyway - just being able to get a steady 90fps with one girl and one guy in VR has been something my system (CPU) hasn't been able to do - until now!
 
Been watching this over the weekend, very excited to see some continued discussion in here - even from the caveman himself.

I had high hopes this would be a miracle worker, but I actually lost 1 frame on average throughout the benchmark test. :ROFLMAO:

Before plugin
Benchmark-1.png
After plugin
Benchmark-2.png

I am overclocking my i7-11700K; the default INI settings match my CPU.
Overall I would say the difference between patch / no patch is negligible on my end.
 
Do you remember where you found the Benchmark scene with only Baseline 3 enabled?
 
Been watching this over the weekend, very excited to see some continued discussion in here - even from the caveman himself. [...] Overall I would say the difference between patch / no patch is negligible on my end.
>baseline3 timings improved (lower is better)
>baseline3 fps got worse
weird
 
>baseline3 timings improved (lower is better)
>baseline3 fps got worse
weird
Right? The timings are actually better, but there's no real gain.
I'll tinker around with my typical setups, outside of the benchmark environment and see if I can actually tell any difference.

Judging by some of the other benchmarks floating around here, I'm getting the itch to upgrade and rebuild my PC again xD
 
Been watching this over the weekend, very excited to see some continued discussion in here - even from the caveman himself. [...] Overall I would say the difference between patch / no patch is negligible on my end.
The reason is simple. VaM at 4K can easily bring current-gen top-tier CPUs paired with a 4090 to their knees.
Even at 1080p there will be differences with the same CPU when using different GPUs.
So... as turtlebackgoofy mentioned, your physics time did actually improve, but since you are at 4K, you're bottlenecked by your GPU. The differences in fps are within the margin of error.
Try running the benchmark at 1080p and then compare :)

Edit:
So here are my results at 4K.

Benchmark-20240206-021001.png

Benchmark-20240206-021240.png

Only a ~3-5 fps difference while using a 4090. Physics time didn't even improve by 0.2. Funny, I got more avg and max fps in the test on 'vanilla' at 4K too, but the 1% lows went higher with the patch.
At 1080p I could see much more of a difference [1.80 -> 1.48].
But... since real-life VaM usage is not only heavy physics, my VR experience improved by about 15-20 fps overall.
 
The reason is simple. VaM at 4K can easily bring current-gen top-tier CPUs paired with a 4090 to their knees. [...] Try running the benchmark at 1080p and then compare :)
That's a good call, followup soon™

Update for 1920x1080: my goodness, the UI is abysmal on a 4K monitor; actual game rendering still looks pretty decent though.
Been a long time since I switched over to 4K.

Before patch
Benchmark-1080-1.png
After
Benchmark-1080-2.png
Definitely a more noticeable difference, I'd like to try out 1440 as well, but I've had enough fiddling for the evening.

I do think it's odd that the performance is actually worse in the 'simpler physics' scene.
 
Benchmark-20240206-022010.png

hmm found something weird while experimenting, gotta investigate what happened, that additional 15% fps is sus
 
hmm found something weird while experimenting, gotta investigate what happened, that additional 15% fps is sus

Hmm yes, it appears there is something fundamentally wrong with how the Unity engine runs on AMD CPUs on Windows. But I think it can be fixed. Screwing around with thread scheduling makes my patch even faster.
Benchmark-20240206-030259.png

For comparison, the vanilla VaM out-of-the-box experience:
vanilla.png

Vanilla VaM with better scheduling:
vanilla_ccd1_no_HT.png

There seem to be moments where the speculative execution just flies through the code, but something often prevents it. Modern CPUs are actually capable of running blazingly fast if the conditions are right. It seems like Intel CPUs get more of those conditions in VaM.
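A crude sketch of that kind of scheduling experiment (standalone C#, Windows-only; the 0x5555 mask is hypothetical - it keeps every other logical core, i.e. one per physical core on a 16-thread CPU, and the right mask depends entirely on your topology):
C#:
using System;
using System.Diagnostics;

class PinVaM {
    static void Main() {
        foreach (Process p in Process.GetProcessesByName("VaM")) {
            // 0x5555 = logical cores 0,2,4,...,14: skips the SMT/HT siblings.
            p.ProcessorAffinity = (IntPtr)0x5555;
            Console.WriteLine("Pinned PID " + p.Id + " to mask 0x5555");
        }
    }
}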
 
I guess I will trash the zukkiniquest3, the magic wireless fapping instrument,

and I will spend the rest of my (short) vam-life playing during my pervert nights with a 1080p bench.

Because:

According to the Discord (VaM2) announcement, the VaM1 apocalypse is coming
Those with AMD cpus will be burnt eternally tortured by cigarettes fire, each unfaithful vamAMDfapper, by that blessed virtual fingers tracking of 72 (72 years old) old used virgins (blessed be ullah, his flying donkey and all hub prophets)
Those without hope to buy a 4090 will get the same terrible destiny
Those who were thinking to just play and so ignoring in their misery any benchmarking failures, as well
Those apostates, all those who like still eating pork steaks and to drink italian wine :eek:... they will never see vam2 light
Those retired old fappers who were not believing in those young (?) peaceful benchmark martyrs (blessed be intellah) ... better be ready to spend 100% of their misery pension (they will likewise never get those joyful 100 faps-per-second)

because even with vam2 (blessed be) unitylight approaching, their sins will not be forgiven
is there enough time for repenting? is there still enough money in your arid pockets for upgrading all your crapware???
 
Try doing it again but turn soft body physics off - I use GiveMeFPS to turn off just glute physics, and with this patch it's a huge improvement
 
OMG!
40%++++ improvement

CPU: R5 5600 OC4.8
[threads]
computeColliders=4
skinmeshPart=1
CCD=0
IterateCCD=0

Fresh VaM
With patch9 and HT off
Benchmark-20240206-142107.png

No patch and HT on
3080ti.png


But there is no improvement in VR.
Is it possible to improve the GPU side of things? orz...
With patch9
3080ti 5600 patch9 关HT VR.png

Without patch
Benchmark-20240206-145542.png
 
For every float, it is not opening the file every time, but it calls the ZipLibrary, which in turn does an allocation of 4 KB, reads 4 KB from disk, decompresses 4 KB, truncates that to 4 bytes, casts those 4 bytes to a single float and returns the float. This happens FOR EVERY SINGLE FLOAT. [...]

Reading the whole var at once needs to be avoided. Some of these are huge. Zip allows random access and is optimal for memory and runtime with large var files. I see that in this case it is not optimal if the zip library is not buffering the read blocks on each Read call, as I thought it would when using a stream. If you are reading from a stream without seeking, that library should be buffering to avoid this. That sucks if it is not. I can investigate further if I am able to spend more time on original VaM.
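If it turns out it is not buffering, even wrapping the entry stream on the caller side would batch the decompression into 64 KB chunks without holding whole files in memory - a sketch, assuming the entry stream is exposed as a plain Stream:
C#:
using (FileEntryStream fes = FileManager.OpenStream(path, true))
// BufferedStream turns the per-float 4-byte reads into one big Read
// (and one decompression call) per 64 KB of data.
using (BufferedStream buffered = new BufferedStream(fes.Stream, 64 * 1024))
using (BinaryReader binReader = new BinaryReader(buffered)) {
    numDeltas = binReader.ReadInt32();
    // ... same parsing loop as in LoadDeltasFromBinaryFile ...
}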

Unfortunately I have little to no time to work on VaM. I have already declared VaM is done for development. I might consider opening it up again to roll in some performance fixes and a couple of other outstanding items if I think it is worth the time and delay to VaM2.
 
I actually tried the JobSystem first (without the Burst compiler) and it made things even worse, because there was a lot of additional memory copying, and in the end the Unity jobs were just glorified C# threads like the ones you already use in skinmeshing. The newest version of Unity jobs allows you to access transforms without the C#->native transition. I think if you also do IL2CPP for those special jobs (is it even possible without making the whole game IL2CPP and breaking all third-party scripts?) you might see some benefit, but it will not match a .dll compiled for AVX2 and optimized with clang.

I can't move original VaM to a newer version of Unity, which has a much better jobs/Burst compile system. Without Burst, jobs are useless. Without reorganizing everything into proper structs with proper jobs, it is useless. Hence VaM2.

And yes, using IL2CPP breaks the plugin system. It can't work with IL2CPP.
 
Yeah, gotta look it up again; I think I mixed something up. Please correct me if I didn't understand it correctly: when you process skinmeshing, you iterate over all bones and then apply the weights (not an expert on game engines), and as a result you get morphed vertices that get rendered with the skin. What I learned was that you need to morph the vertices in the order of the bones, otherwise you get a wrong result and the skin is all over the place. Basically you need to process every DAZSkinV2VertexWeights of every bone one by one and in theory can't multithread it. You, however, did a trick where each thread processes each bone in the same order, but each thread has a range of final vertices that it is allowed to touch.
Each vertex can be processed independently and safely. The skinning sub-threads are split over ranges of vertices to operate on. There is no conflict or bug here. If you have come up with a method that is faster than using sub-threads with vertex ranges, that is great, because this is unnecessary complexity. I was operating within the confines of C#, and this multithreading is what was allowing (in most cases) the skinning thread to complete before anything else was waiting for it. The main thread would operate in parallel while skinning was happening in other threads. I realize there is a lot of cache contention (especially after your description and evidence here), so it is not ideal.
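A toy sketch of that vertex-range split (hypothetical types, not VaM's actual code): every worker walks all bones in the same order, but only writes vertices inside its own slice, so the writes never conflict:
C#:
using UnityEngine;

// Hypothetical illustration type; not VaM's actual class.
struct VertexWeight { public int vertex; public float weight; }

// Each worker thread gets its own [start, end) vertex slice.
void SkinRange(Vector3[] inVerts, Vector3[] outVerts,
               Matrix4x4[] boneMatrices, VertexWeight[][] boneWeights,
               int start, int end) {
        // All bones, in the same order on every thread...
        for (int b = 0; b < boneWeights.Length; b++) {
                VertexWeight[] weights = boneWeights[b];
                for (int w = 0; w < weights.Length; w++) {
                        int v = weights[w].vertex;
                        // ...but only touching this thread's own slice.
                        if (v < start || v >= end) continue;
                        outVerts[v] += boneMatrices[b].MultiplyPoint3x4(inVerts[v]) * weights[w].weight;
                }
        }
}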

For whatever reason your solution is helping AMD processors a lot more than Intel. 90% of my performance tuning was done on Intel.

I don't think we'll have as much of an issue in VaM2, due to everything operating in sequence without 1-frame async sub-threads (except for 1 current exception), the use of tighter structs and jobs/Burst, and the use of native methods in those jobs where Unity provides access.
 
I think the best path is probably that you keep developing this and work out the kinks, and possibly at some future point I can roll it into the application, if you are willing to let me do that. If not, it can live on as a side patch.

Seriously great work and I'm very impressed!
 
Little question:
Limiting frames per second in the Nvidia control panel (I was setting a max of 90 for VaM) seems to become effective only if I select desktop v-sync... The purpose is to save some stress on my 3080 GPU (I also use Afterburner with an undervolt preset), playing desktop mode.

The result is that instead of an average of >150fps (with a standard two-person animated scene and some behaviour plugins) I get near half that value (around 70-80fps, enough for fun playing). Is this setting without relevant practical consequences on GPU temp and max power stress, or should I consider it really useful? (Yes, I test positive for the energy-saving virus) :unsure:
Have you tried using FrameRate control -> https://hub.virtamate.com/resources/macgruber-essentials.160/
It allows me to restrict desktop mode to a set frame rate (I use 30fps) so the GPU isn't going flat out while I'm coding a plugin, but it is disabled for VR.
 
I tried your latest patch 9, but it's crashing on my really old CPU (i7-3770K) while opening the benchmark.
[threads]
computeColliders=6
skinmeshPart=1
CCD=0
IterateCCD=0

Error log:
Faulting application name: VaM.exe, version: 2018.1.9.48088, timestamp: 0x5b47339d
Faulting module name: ntdll.dll, version: 10.0.19041.3636, timestamp: 0x9b64aa6f
Exception code: 0xc0000005
Fault offset: 0x000000000002faad
Faulting process ID: 0x1f38
Faulting application start time: 0x01da590ec5b9090f
Faulting application path: C:\VaM_DEV\VaM.exe
Faulting module path: C:\WINDOWS\SYSTEM32\ntdll.dll
Report ID: 4d0cbe68-f1ce-42da-8dd1-69686d3805cc
Faulting package full name:
Faulting package-relative application ID:
 