Performance Patch (up to 30% more physics speed)

turtlebackgoofy · Feb 5, 2024

Wait, you are telling me ReadInt32() and ReadSingle() in the original code are actually opening the file every single time? That's insane, and I had no idea the zip library would work like that.

C#:

        public void LoadDeltasFromBinaryFile(string path) {
                //Debug.Log("Loading deltas for morph " + morphName + " from "+path);
                try {
                        using (FileEntryStream fes = FileManager.OpenStream(path, true)) {
                                using (BinaryReader binReader = new BinaryReader(fes.Stream)) {
                                        numDeltas = binReader.ReadInt32();
                                        deltas = new DAZMorphVertex[numDeltas];
                                        for (int ind = 0; ind < numDeltas; ind++) {
                                                DAZMorphVertex dmv = new DAZMorphVertex();
                                                dmv.vertex = binReader.ReadInt32();
                                                Vector3 v;
                                                v.x = binReader.ReadSingle();
                                                v.y = binReader.ReadSingle();
                                                v.z = binReader.ReadSingle();
                                                dmv.delta = v;
                                                deltas[ind] = dmv;
                                        }
                                }
                        }
                }
                catch (System.Exception e) {
                        Debug.LogError("Error while loading binary delta file " + path + " " + e);
                }
        }

For every float, it is not opening the file every time, but it calls the zip library, which in turn allocates 4 KB, reads 4 KB from disk (the minimum amount in the zip library), decompresses 4 KB (yes, that's the minimum amount it can decompress), truncates to 4 bytes, casts those 4 bytes to a single float and returns the float. This happens FOR EVERY SINGLE FLOAT. You see millions of ReadFile(4 KB) calls in procmon when loading a scene lol. The amount actually read from the hard drive is exactly 1024 times as large as needed (4096 bytes read for every 4 bytes used). The MemoryStream isn't an optimal solution, but at least it skips the unneeded, expensive zip decompression. The perfect solution would be to somehow load the whole zip once from disk, keep it in memory and decompress the individual files from memory. Not sure if VaM loads all files from the .var or only the needed parts, so memory usage might increase if you do it that way. But at least do the small fix I mentioned, it just skips the redundant decompression.
It's actually a common mistake in Unity game dev lol.

I think the call stack is something like BinaryReader.ReadSingle() -> BinaryReader.ReadBytes(4) -> ZipLibrarySomething.ReadBytes(4) -> ZipLibrarySomething.ReadInternal(4096) -> ZipLibrarySomething.Decompress(4096)
You really need to decompress in blocks as big as possible, because there is additional huge overhead in initializing the decompression state machine. You might get away with the OS caching the disk reads, but many small decompressions burn the CPU.
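
A minimal sketch of the MemoryStream workaround described above, assuming the entry stream behaves like any ordinary `Stream`. `MorphDelta` and `DeltaLoader` are hypothetical stand-ins for `DAZMorphVertex` and the VaM loader, and the 64 KB chunk size is an arbitrary choice, not a value taken from VaM or its zip library. The idea is to drain the source in a few large reads and only then parse the small values from memory.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for DAZMorphVertex (no Unity types needed here).
public struct MorphDelta {
    public int Vertex;
    public float X, Y, Z;
}

public static class DeltaLoader {
    // Drain the source with large reads so that the source stream
    // (for example a zip entry stream) never sees a 4-byte request.
    public static MemoryStream BufferFully(Stream source) {
        MemoryStream ms = new MemoryStream();
        byte[] chunk = new byte[64 * 1024]; // assumed chunk size
        int n;
        while ((n = source.Read(chunk, 0, chunk.Length)) > 0) {
            ms.Write(chunk, 0, n);
        }
        ms.Position = 0; // rewind so the reader starts at the beginning
        return ms;
    }

    // Same layout as LoadDeltasFromBinaryFile: an int32 count, then
    // (int32 vertex, float x, float y, float z) for each delta.
    public static MorphDelta[] ReadDeltas(Stream source) {
        using (MemoryStream ms = BufferFully(source))
        using (BinaryReader reader = new BinaryReader(ms)) {
            int count = reader.ReadInt32();
            MorphDelta[] deltas = new MorphDelta[count];
            for (int i = 0; i < count; i++) {
                deltas[i].Vertex = reader.ReadInt32();
                deltas[i].X = reader.ReadSingle();
                deltas[i].Y = reader.ReadSingle();
                deltas[i].Z = reader.ReadSingle();
            }
            return deltas;
        }
    }
}
```

In the original method this would amount to passing `fes.Stream` through `BufferFully` before constructing the `BinaryReader`. The cost is holding one whole delta file in memory at a time.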

Eye9 · Feb 5, 2024

turtlebackgoofy said:
I think if you also do IL2CPP for those special jobs [...] you might see some benefit, but it will not match a .dll compiled for AVX2 and optimized with clang.

Pretty sure you can get similar performance using Burst now, which makes it pretty convenient.

Edit: Oh and really nicely done @ this patch. I was just wondering about morphs and how they are applied currently a couple of days ago.

hijku · Feb 5, 2024

Not sure what ExtractSample does exactly, but it's in the top 4 methods when I profile my scene.

Thread	Call count	Method name SMALL	Total runtime (ns)	Total allocation (bytes)
56	435	DAZCharacterRun:RunThreaded (bool)	1663311300	90755072
43	435	DAZCharacterRun:RunThreaded (bool)	1026883000	7184384
68	380	SpeechBlendEngine.ExtractFeatures:ExtractSample (single[],single[2],single[2],single[],single[2]&,int,int,SpeechBlendEngine.SpeechUtil/Accuracy)	902135000	11866112
67	357	SpeechBlendEngine.ExtractFeatures:ExtractSample (single[],single[2],single[2],single[],single[2]&,int,int,SpeechBlendEngine.SpeechUtil/Accuracy)	843778400	11665408

First, the method normalizes the spectrum, then there are two nested for loops that look weird.
At the end it calls CepstralCoefficients, but that is somewhere near the bottom when it comes to total runtime.

67	357	SpeechBlendEngine.ExtractFeatures:CepstralCoefficients (single[],single[2],single[2]&,single[],int,int)	1130400	0	3,166386555
68	380	SpeechBlendEngine.ExtractFeatures:CepstralCoefficients (single[],single[2],single[2]&,single[],int,int)	1110000	0	2,921052632

turtlebackgoofy · Feb 5, 2024

hijku said:
Not sure what ExtractSample does exactly, but it's in the top 4 methods when I profile my scene.

Thread	Call count	Method name SMALL	Total runtime (ns)	Total allocation (bytes)
56	435	DAZCharacterRun:RunThreaded (bool)	1663311300	90755072
43	435	DAZCharacterRun:RunThreaded (bool)	1026883000	7184384
68	380	SpeechBlendEngine.ExtractFeatures:ExtractSample (single[],single[2],single[2],single[],single[2]&,int,int,SpeechBlendEngine.SpeechUtil/Accuracy)	902135000	11866112
67	357	SpeechBlendEngine.ExtractFeatures:ExtractSample (single[],single[2],single[2],single[],single[2]&,int,int,SpeechBlendEngine.SpeechUtil/Accuracy)	843778400	11665408

First, the method normalizes the spectrum, then there are two nested for loops that look weird.
At the end it calls CepstralCoefficients, but that is somewhere near the bottom when it comes to total runtime.

67	357	SpeechBlendEngine.ExtractFeatures:CepstralCoefficients (single[],single[2],single[2]&,single[],int,int)	1130400	0	3,166386555
68	380	SpeechBlendEngine.ExtractFeatures:CepstralCoefficients (single[],single[2],single[2]&,single[],int,int)	1110000	0	2,921052632

Looks like it parses the audio for something. Very float-math heavy, which is very expensive in C#, so no wonder it's slow. It could be heavily improved, but since the lip approximation is very bad anyway I have no desire to do it lol

redeyes · Feb 5, 2024

Just want to throw my thanks to you! Getting nearly double the frame rate in my test, going from 60fps to 106-121fps!
Intel 9700K and a 4090 - yes, those are 2 very ill-matched components, but this patch brings them much closer!

Top is with the latest patch, bottom is without, both with my bloated VaM directory

For reference, this is with glute soft body off but breast physics on. Glute still tanks the frame rate. With glute on, it is 39FPS without the fix and 44FPS with it.

turtlebackgoofy · Feb 5, 2024

redeyes said:
Just want to throw my thanks to you! Getting nearly double the frame rate in my test, going from 60fps to 106-121fps!
Intel 9700K and a 4090 - yes, those are 2 very ill-matched components, but this patch brings them much closer!

Top is with the latest patch, bottom is without, both with my bloated VaM directory
View attachment 331988
For reference, this is with glute soft body off but breast physics on. Glute still tanks the frame rate. With glute on, it is 39FPS without the fix and 44FPS with it.

Nice. Soft body makes the patch negligible because my patch is a flat reduction on a few specific parts, and sadly there is that FPS vs frametime conundrum: if rendering improves from 6ms to 4ms, you go from 166fps to 250fps, but if it improves from 30ms to 28ms you only go from 33fps to 35fps lol. When you enable soft body, you tell the Unity engine to use kinematic bodies instead of simple colliders. Sadly their performance is out of reach for VaM or me; no one but Unity can improve it. The best you can do is avoid cluttering the CPU cache in other places so they run faster.
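
The frametime arithmetic above can be sketched in a few lines. `FrameTime`, `Fps` and `FpsGain` are hypothetical helper names used only for illustration; the only assumption is that FPS is 1000 divided by the frame time in milliseconds.

```csharp
using System;

public static class FrameTime {
    // Frames per second for a given frame time in milliseconds.
    public static double Fps(double frameMs) {
        return 1000.0 / frameMs;
    }

    // FPS gained when a fixed number of milliseconds is shaved off a frame.
    public static double FpsGain(double beforeMs, double savedMs) {
        return Fps(beforeMs - savedMs) - Fps(beforeMs);
    }
}
```

Saving 2ms on a 6ms frame gives `FpsGain(6, 2)`, roughly 83fps, while the same 2ms saved on a 30ms frame gives `FpsGain(30, 2)`, roughly 2.4fps. That is why a flat reduction looks large in light scenes and small once soft body physics dominates the frame.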

redeyes · Feb 5, 2024

turtlebackgoofy said:
Nice. Soft body makes the patch negligible because my patch is a flat reduction on a few specific parts, and sadly there is that FPS vs frametime conundrum: if rendering improves from 6ms to 4ms, you go from 166fps to 250fps, but if it improves from 30ms to 28ms you only go from 33fps to 35fps lol. When you enable soft body, you tell the Unity engine to use kinematic bodies instead of simple colliders. Sadly their performance is out of reach for VaM or me; no one but Unity can improve it. The best you can do is avoid cluttering the CPU cache in other places so they run faster.

Still - this improves things a lot for me as I turn off glute anyway - just being able to get a steady 90fps with 1 girl and guy in VR has been something my system (CPU) hasn't been able to do - until now!

Seraphim · Feb 5, 2024

turtlebackgoofy said:
when number goes up, its a new version

Do you remember where you found the Benchmark scene with only Baseline 3 enabled?

Lamp · Feb 5, 2024

Been watching this over the weekend, very excited to see some continued discussion in here - even from the caveman himself.

I had high hopes this would be a miracle worker, but I actually lost 1 frame on average throughout the benchmark test.

Before plugin

After plugin

I am overclocking my i7-11700K, default settings for the INI match my CPU.
Overall I would say the difference between patch / no patch is negligible on my end.

turtlebackgoofy · Feb 5, 2024

Seraphim said:
Do you remember where you found the Benchmark scene with only Baseline 3 enabled?

Benchmark Result Discussion

Another test with a version modified by me. It will only run the Baseline 3 test. This is very useful to quickly share results in the most critical test of the entire benchmark. If anyone is interested, I'll drop my modification in case you want to try it.

hub.virtamate.com

turtlebackgoofy · Feb 5, 2024

Lamp said:
Been watching this over the weekend, very excited to see some continued discussion in here - even from the caveman himself.

I had high hopes this would be a miracle worker, but I actually lost 1 frame on average throughout the benchmark test.

Before plugin
View attachment 332045
After plugin
View attachment 332046

I am overclocking my i7-11700K, default settings for the INI match my CPU.
Overall I would say the difference between patch / no patch is negligible on my end.

>baseline3 timings improved (lower is better)
>baseline3 fps got worse
weird

Lamp · Feb 5, 2024

turtlebackgoofy said:
>baseline3 timings improved (lower is better)
>baseline3 fps got worse
weird

Right? The timings are actually better, but there's no real gain.
I'll tinker around with my typical setups, outside of the benchmark environment and see if I can actually tell any difference.

Judging by some of the other benchmarks floating around here, I'm getting the itch to upgrade and rebuild my PC again xD

trety · Feb 5, 2024

Lamp said:
Been watching this over the weekend, very excited to see some continued discussion in here - even from the caveman himself.

I had high hopes this would be a miracle worker, but I actually lost 1 frame on average throughout the benchmark test.

Before plugin
View attachment 332045
After plugin
View attachment 332046

I am overclocking my i7-11700K, default settings for the INI match my CPU.
Overall I would say the difference between patch / no patch is negligible on my end.

The reason is simple. VaM at 4K can easily bring current-gen top-tier CPUs paired with a 4090 to their knees.
Even at 1080p there will be differences between the same CPU while using different GPUs.
So... as turtlebackgoofy mentioned, your physics time actually did improve, but since you are at 4K, you're bottlenecked by your GPU. The differences in fps are within the margin of error.
Try running the benchmark at 1080p and then compare.

\ Edit \
So here are my results at 4K.

Only ~3-5 fps difference while using a 4090. Physics time didn't even improve by 0.2. Funny, I got more avg and max fps in the test on 'vanilla' at 4K too. But the 1% low went higher with the patch.
At 1080p I could see much more of a difference [1.80->1.48].
But ... since real-life VaM usage is not only heavy physics, my VR experience improved by about 15-20 fps overall.

Lamp · Feb 5, 2024

trety said:
The reason is simple. VaM at 4K can easily bring current-gen top-tier CPUs paired with a 4090 to their knees.
Even at 1080p there will be differences between the same CPU while using different GPUs.
So... as turtlebackgoofy mentioned, your physics time actually did improve, but since you are at 4K, you're bottlenecked by your GPU.
Try running the benchmark at 1080p and then compare.

That's a good call, followup soon™

Update for 1920x1080, my goodness the UI is abysmal on a 4k monitor, actual game rendering still looks pretty decent though.
Been a long time since I switched over to 4k.

Before patch

After

Definitely a more noticeable difference, I'd like to try out 1440 as well, but I've had enough fiddling for the evening.

I do think it's odd that the performance is actually worse in the 'simpler physics' scene.

turtlebackgoofy · Feb 5, 2024

hmm found something weird while experimenting, gotta investigate what happened, that additional 15% fps is sus

turtlebackgoofy · Feb 5, 2024

turtlebackgoofy said:
View attachment 332072
hmm found something weird while experimenting, gotta investigate what happened, that additional 15% fps is sus

Hmm yes, it appears there is something fundamentally wrong with how the Unity engine runs on AMD CPUs on Windows. But I think it can be fixed. Screwing around with thread scheduling makes my patch even faster.

For comparison, the vanilla VaM out-of-the-box experience:

Vanilla VaM with better scheduling:

There seem to be moments where the speculative execution just flies through the code, but something often prevents it. Modern CPUs are actually capable of running blazingly fast if the conditions are right. Seems like Intel CPUs get more such conditions in VaM.

redeyes · Feb 6, 2024

Try doing it again but turn soft body physics off - I use GiveMeFPS to turn off just glute physics, and with this patch it's a huge improvement

Lamp said:
That's a good call, followup soon™

Update for 1920x1080, my goodness the UI is abysmal on a 4k monitor, actual game rendering still looks pretty decent though.
Been a long time since I switched over to 4k.

Before patch
View attachment 332088
After
View attachment 332089
Definitely a more noticeable difference, I'd like to try out 1440 as well, but I've had enough fiddling for the evening.

I do think it's odd that the performance is actually worse in the 'simpler physics' scene.

fsdcx · Feb 6, 2024

OMG!
Improve 40%++++

CPU: R5 5600 OC4.8
[threads]
computeColliders=4
skinmeshPart=1
CCD=0
IterateCCD=0

Fresh VaM
With patch 9 and HT disabled

No patch and HT enabled

But there is no improvement in VR.
Is it possible to improve anything on the GPU side? orz...
With patch 9

Without patch

meshedvr · Feb 6, 2024

turtlebackgoofy said:
For every float, it is not opening the file every time, but it calls the zip library, which in turn allocates 4 KB, reads 4 KB from disk (the minimum amount in the zip library), decompresses 4 KB (yes, that's the minimum amount it can decompress), truncates to 4 bytes, casts those 4 bytes to a single float and returns the float. This happens FOR EVERY SINGLE FLOAT. You see millions of ReadFile(4 KB) calls in procmon when loading a scene lol. The amount actually read from the hard drive is exactly 1024 times as large as needed (4096 bytes read for every 4 bytes used). The MemoryStream isn't an optimal solution, but at least it skips the unneeded, expensive zip decompression. The perfect solution would be to somehow load the whole zip once from disk, keep it in memory and decompress the individual files from memory. Not sure if VaM loads all files from the .var or only the needed parts, so memory usage might increase if you do it that way. But at least do the small fix I mentioned, it just skips the redundant decompression.
It's actually a common mistake in Unity game dev lol.

I think the call stack is something like BinaryReader.ReadSingle() -> BinaryReader.ReadBytes(4) -> ZipLibrarySomething.ReadBytes(4) -> ZipLibrarySomething.ReadInternal(4096) -> ZipLibrarySomething.Decompress(4096)
You really need to decompress in blocks as big as possible, because there is additional huge overhead in initializing the decompression state machine. You might get away with the OS caching the disk reads, but many small decompressions burn the CPU.

Reading the whole var at once needs to be avoided. Some of these are huge. Zip allows random access and is optimal for memory and runtime with large var files. I see in this case it is not optimal if the zip library is not buffering the read blocks with each Read call as I thought it would be doing when using a stream. If you are reading from a stream without seeking, that library should be buffering to avoid this. That sucks if it is not. I can investigate further if I am able to spend more time on original VaM.
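
One way to get the buffering described here without reading the whole var is to wrap the entry stream in the standard .NET `BufferedStream`. The sketch below is an untested assumption as far as VaM's zip library is concerned: `CountingStream` and `BufferedReadDemo` are hypothetical names, a plain `MemoryStream` stands in for the zip entry, and the 64 KB buffer size is arbitrary. It only shows how many read calls reach the source stream with and without the wrapper.

```csharp
using System;
using System.IO;

// Hypothetical helper: forwards reads to an inner stream and counts them.
public sealed class CountingStream : Stream {
    private readonly Stream inner;
    public int ReadCalls;
    public CountingStream(Stream inner) { this.inner = inner; }
    public override int Read(byte[] buffer, int offset, int count) {
        ReadCalls++;
        return inner.Read(buffer, offset, count);
    }
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}

public static class BufferedReadDemo {
    // Serialize n floats (0.0, 0.5, 1.0, ...) the way BinaryWriter stores them.
    public static byte[] MakeFloats(int n) {
        MemoryStream ms = new MemoryStream();
        BinaryWriter w = new BinaryWriter(ms);
        for (int i = 0; i < n; i++) w.Write(i * 0.5f);
        w.Flush();
        return ms.ToArray();
    }

    // Read n floats one at a time and return how many Read calls
    // actually reached the source stream.
    public static int CountSourceReads(byte[] data, int n, bool buffered, out float sum) {
        CountingStream source = new CountingStream(new MemoryStream(data));
        Stream s = buffered ? (Stream)new BufferedStream(source, 64 * 1024) : source;
        sum = 0f;
        using (BinaryReader reader = new BinaryReader(s)) {
            for (int i = 0; i < n; i++) sum += reader.ReadSingle();
        }
        return source.ReadCalls;
    }
}
```

Without the wrapper, every `ReadSingle()` reaches the source stream. With it, the source only sees requests of the buffer size, and memory use stays bounded by that buffer rather than by the size of the var.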

Unfortunately I have little to no time to work on VaM. I have already declared VaM is done for development. I might consider opening it up again to roll in some performance fixes and a couple of other outstanding items if I think it is worth the time and delay to VaM2.

meshedvr · Feb 6, 2024

turtlebackgoofy said:
I actually tried the JobSystem first (without the Burst compiler) and it made things even worse, because there was a lot of additional memory copying, and in the end the Unity jobs were just glorified C# threads like the ones you already use in skinmeshing. The newest version of Unity jobs allows you to access transformations without the C#->native transition. I think if you also do IL2CPP for those special jobs (is it even possible without making the whole game IL2CPP and breaking all third-party scripts?) you might see some benefit, but it will not match a .dll compiled for AVX2 and optimized with clang.

I can't move original VaM to a newer version of Unity, which has a much better jobs/Burst compile system. Without Burst, jobs are useless. Without reorganizing everything into proper structs with proper jobs it is useless. Hence VaM2.

And yes using il2cpp breaks the plugin system. It can't work with il2cpp.

meshedvr · Feb 6, 2024

turtlebackgoofy said:
Yeah, gotta look it up again, I think I mixed something up. Please correct me if I didn't understand it correctly: when you process skinmeshing you iterate over all bones and then apply the weights (not an expert on game engines), and as a result you get morphed vertices that get rendered with the skin. What I learned was that you need to morph the vertices in the order of the bones, otherwise you get a wrong result and the skin is all over the place. Basically you need to process every DAZSkinV2VertexWeights of every bone one by one and can't multithread it in theory. You however did a trick where each thread processes each bone in the same order, but each thread has a range of final vertices that it is allowed to touch.

Each vertex can be processed independently and safely. The skinning sub-threads are split on a range of vertices to operate on. There is no conflict or bug here. If you have come up with a faster method that makes it so using sub-threads with vertex range is faster, that is great, because this is unnecessary complexity. I was operating within the confines of C#, and this multithreading is what allowed (in most cases) the skinning thread to complete before anything else was waiting for it. The main thread would operate in parallel while skinning was happening in other threads. I realize there is a lot of cache contention (especially after your description and evidence here), so it is not ideal.
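
A minimal sketch of the vertex-range split described above, not VaM's actual skinning code. `VertexRanges`, `Split` and `RunOnRanges` are hypothetical names, and plain `System.Threading.Thread` objects stand in for the skinning sub-threads. Each worker receives a half-open range `[start, end)` and only ever writes to vertices inside it, so no two threads touch the same output element.

```csharp
using System;
using System.Threading;

public static class VertexRanges {
    // Split [0, vertexCount) into `parts` contiguous, non-overlapping
    // ranges, returned as { start, end } pairs with end exclusive.
    // Assumes parts >= 1.
    public static int[][] Split(int vertexCount, int parts) {
        int[][] ranges = new int[parts][];
        int baseSize = vertexCount / parts;
        int remainder = vertexCount % parts;
        int start = 0;
        for (int p = 0; p < parts; p++) {
            // The first `remainder` ranges take one extra vertex.
            int size = baseSize + (p < remainder ? 1 : 0);
            ranges[p] = new int[] { start, start + size };
            start += size;
        }
        return ranges;
    }

    // Run `work(start, end)` once per range, each on its own thread,
    // and wait for all of them to finish.
    public static void RunOnRanges(int vertexCount, int parts, Action<int, int> work) {
        int[][] ranges = Split(vertexCount, parts);
        Thread[] threads = new Thread[parts];
        for (int p = 0; p < parts; p++) {
            int s = ranges[p][0]; // per-iteration copies for the closure
            int e = ranges[p][1];
            threads[p] = new Thread(() => work(s, e));
            threads[p].Start();
        }
        foreach (Thread t in threads) t.Join();
    }
}
```

A worker would loop over the bones in their fixed order and apply each bone's weights only to the vertices in its own range. The ranges being disjoint is what makes the writes safe, although threads working on neighbouring ranges can still compete for the same cache lines at the boundaries.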

For whatever reason your solution is helping AMD processors a lot more than Intel. 90% of my performance tuning was done on Intel.

I don't think we'll have as much issue on VaM2, due to everything operating in sequence without 1-frame async subthreads (except for 1 current exception) and the use of tighter structs and jobs/burst and use of native methods in those jobs when possible where Unity provided access.

meshedvr · Feb 6, 2024

I think the best path is probably you keep developing this and work out the kinks, and possibly at some future point I can roll into application if you are willing to let me do that. If not, it can live on as a side patch.

Seriously great work and I'm very impressed!

redeyes · Feb 6, 2024

keycode said:
little question:
limiting frames per second in Nvidia control panel (was setting max 90 for vam) seems to become effective only if I select desktop v-sync ... and the purpose is to save some stress on my 3080 gpu (I use also Afterburner in under-volt preset) ... playing desktop mode.

The result is that instead of average >150fps (with standard two persons animated scene and some behaviour plugins) I get near to the half value (around 70/80fps, enough for fun playing) : is this setting without relevant practical consequences on gpu temp and its max power stress or should I consider it really useful ? (Yes, I result positive to energy-saving virus)

Have you tried using FrameRate control? -> https://hub.virtamate.com/resources/macgruber-essentials.160/
It lets me restrict desktop mode to a fixed frame rate (I use 30fps) so the GPU isn't running flat out while I'm coding a plugin, but it is disabled for VR

MaryJane · Feb 6, 2024

I tried your latest patch 9, but it's crashing on my really old CPU (i7-3770K) while opening the benchmark.

[threads]
computeColliders=6
skinmeshPart=1
CCD=0
IterateCCD=0

Error log:

Faulting application name: VaM.exe, version: 2018.1.9.48088, timestamp: 0x5b47339d
Faulting module name: ntdll.dll, version: 10.0.19041.3636, timestamp: 0x9b64aa6f
Exception code: 0xc0000005
Error offset: 0x000000000002faad
Bad process ID: 0x1f38
Faulting application start time: 0x01da590ec5b9090f
Path of the faulty application: C:\VaM_DEV\VaM.exe
Path of the faulty module: C:\WINDOWS\SYSTEM32\ntdll.dll
Report ID: 4d0cbe68-f1ce-42da-8dd1-69686d3805cc
Full name of the offending package:
Application ID that is relative to the failing package:
