Performance Patch (up to 30% more physics speed)

turtlebackgoofy · Feb 2, 2024

Please continue discussion in https://hub.virtamate.com/threads/c...to-30-faster-physics-up-to-60-more-fps.49738/

Old message

As mentioned in the Benchmark Results thread, here is my performance patch.
I offloaded the most CPU-demanding functions to a native DLL and built it with clang optimizations and profiling.
Mainly SkinMeshPart and ComputeColliders were sped up and made multithreaded.
Instructions are in the included Readme.

Resource pending approval:

Other - CPU Performance Patch (Up to 30% faster physics, up to 60% more FPS)

As requested in this thread https://hub.virtamate.com/threads/benchmark-result-discussion.13131/page-37 here is a release of the CPU performance patch. FAQ at bottom. Only VaM 1.22.0.3 is supported! Please share before and after benchmarks with...


VaM is unoptimized and does a lot of random memory access, which makes skin meshing very slow in scenes with multiple persons or many morphs. Heavy multithreading lets you brute-force the best RAM access order, although you can sometimes get very high frame times for a few frames. Overall you can expect about 15% better physics time. Please post your benchmarks from https://hub.virtamate.com/threads/benchmark-result-discussion.13131/ with and without the patch.

My tests with a clean VaM install:

vanilla:

vanilla with process affinity set to CCD1:

with performance patch (automatically sets CCD at start)

So yeah, a crispy 60% FPS increase for me.

There is also a version which removes the performance hit from having a lot of vars installed. The hit occurred because morphs that were installed but not used were still checked every frame. Also, loading a clothing/hair item no longer loads the clothing/hair from every var again for each item loaded, which speeds up scene loading when you have a lot of vars installed.

link removed, wait for official release in https://hub.virtamate.com/resources...to-30-faster-physics-up-to-60-more-fps.43427/

It even helps a little when no vars are installed, since the 1000 unused built-in morphs already hinder FPS.

with performance patch and morph clutter patch

AshArm · Feb 2, 2024

Great work, post it to Plugins or Other so more people can enjoy it.

turtlebackgoofy · Feb 3, 2024

This is still in testing.
Please post your
- Physics Time gains (In the performance monitor)
- CPU (and if you use gaming mode or something)
- RAM speed and CAS
- Any visual or other bugs you experience

So I can find out why the newest generation of CPUs is 400% faster in VaM despite being only 20% faster in other benchmarks lol

trety · Feb 3, 2024

I wouldn't say it's 400% faster. It's not even 2x faster compared to 4-year-old 10th gen Intel CPUs.
The thing is, there is no way to isolate CPU performance in VaM, and even at 1080p 'cpu stress' baseline3\simpler physics, VaM is still highly demanding on the GPU.
And now, with the 40xx series [and let's be honest, most people with a top-tier CPU have a 4080 \ 4090 GPU...] we are reaching much higher results, almost eliminating the GPU bottleneck.
Example, same system, same cpu\ram, VaM install.
13900k + 3090 [1080p] :

13900k + 4090 [at 1440 lol]:

Out of curiosity, my very old 10th gen benchmark.

turtlebackgoofy · Feb 3, 2024

trety said:
I wouldn't say it's 400% faster. It's not even 2x faster compared to 4-year-old 10th gen Intel CPUs.
The thing is, there is no way to isolate CPU performance in VaM, and even at 1080p 'cpu stress' baseline3\simpler physics, VaM is still highly demanding on the GPU.
And now, with the 40xx series [and let's be honest, most people with a top-tier CPU have a 4080 \ 4090 GPU...] we are reaching much higher results, almost eliminating the GPU bottleneck.
Example, same system, same cpu\ram, VaM install.
13900k + 3090 [1080p] :

View attachment 330822

13900k + 4090 [at 1440 lol]:

View attachment 330823

Out of curiosity, my very old 10th gen benchmark.

View attachment 330826

Your physics time is 1.86 ms on the 13900K, while my 5950X is at about 9 ms, so yeah, it's a 400% increase.
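To make the arithmetic behind the "400%" figure explicit, here is a minimal C# sketch. The class and method names are made up for illustration; only the two physics times come from the posts above.

```csharp
using System;

public static class PhysicsTimeComparison
{
    // Ratio of the slower physics time to the faster one (4.0 means "4x as fast").
    public static double SpeedupFactor(double slowerMs, double fasterMs)
    {
        if (slowerMs <= 0 || fasterMs <= 0)
            throw new ArgumentException("Times must be positive.");
        return slowerMs / fasterMs;
    }

    // "Percent faster" is the factor minus one, expressed in percent.
    public static double PercentFaster(double slowerMs, double fasterMs)
    {
        return (SpeedupFactor(slowerMs, fasterMs) - 1.0) * 100.0;
    }
}
```

With 9 ms and 1.86 ms this gives a factor of roughly 4.8, i.e. a bit under 400% faster in physics time. It says nothing about how much of that gap comes from the CPU alone, which is the point being argued in the replies.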

trety · Feb 3, 2024

GPU?
The difference between 13900k and 10900k [while both using 3090, and 10th gen using DDR4 3600, 13th gen DDR5 5600] is not even x2.
Also, yeah, seems like VaM doesn't like AMD at all if it's not x3d cache.

mostvanvege · Feb 3, 2024


turtlebackgoofy said:
your physics time is 1.86ms on 13900k, while my 5950x is about 9ms, so yeah its a 400% increase

I looked back at your result and I think your CPU heavily underperformed for some reason. I had a 6-core 5600X before, and I know Zen 3 CPUs can give around 5 ms average and 8 ms low for Baseline 3 physics time (that is about 80-90 FPS for the lows with a 6800 XT) at 1080p. You need to check the usual stuff (fresh VaM, no apps hogging resources in the Windows background, etc.). If it is still bad, test outside VaM with something like Cinebench to see whether you get what you should from your CPU. If there is a problem, check your BIOS settings as well; maybe you did not enable XMP or something. CPU and RAM tweaks are also rewarded by VaM: use some curve optimization, set the power limits well (a good DDR4 kit can run at 3800-3900 MHz with nice timings), use Process Lasso to tie VaM to a specific CCX, and turn on SMT (AMD's HT) if it was disabled for some reason, since AMD performs better with it. With the 4090 you have, that CPU should perform way better.
On the other hand, I am also glad that it made you take on the work of this patch.

turtlebackgoofy · Feb 3, 2024

mostvanvege said:

I looked back at your result and I think your CPU heavily underperformed for some reason. I had a 6-core 5600X before, and I know Zen 3 CPUs can give around 5 ms average and 8 ms low for Baseline 3 physics time (that is about 80-90 FPS for the lows with a 6800 XT) at 1080p. You need to check the usual stuff (fresh VaM, no apps hogging resources in the Windows background, etc.). If it is still bad, test outside VaM with something like Cinebench to see whether you get what you should from your CPU. If there is a problem, check your BIOS settings as well; maybe you did not enable XMP or something. CPU and RAM tweaks are also rewarded by VaM: use some curve optimization, set the power limits well (a good DDR4 kit can run at 3800-3900 MHz with nice timings), use Process Lasso to tie VaM to a specific CCX, and turn on SMT (AMD's HT) if it was disabled for some reason, since AMD performs better with it. With the 4090 you have, that CPU should perform way better.
On the other hand, I am also glad that it made you take on the work of this patch.

slowest possible ram, my benchmark matches other 5950x systems

mostvanvege · Feb 4, 2024

turtlebackgoofy said:
slowest possible ram, my benchmark matches other 5950x systems

Cmon then, you are starving out that zen then!

On AM4, DDR4 overclocking is a breeze: almost any 3200 CL16 kit can be tweaked optimally for Zen 3, and if you check the bench results here, they are not much behind in lows compared to current gens. I admit the latest i9 and Zen 4 X3D are brutal, but VaM is not an esports title needing insanely high FPS. Just dial in your setup so the lows reach your screen refresh rate and you are golden. I doubt you need anything above a stable 90 FPS, which is already possible with Zen 3. Of course resolution matters, but that depends on the GPU.

supplovr · Feb 4, 2024

Thank you very much for this patch. This is a game changer!

Question: Do you know if for an i9-14900KF with P&E&HT Cores the same numbers are to be applied?

P: 8/16 (8 HT)
E: 16/16 (0 HT)

That gives 24 physical and 8 HT cores, or 32 cores altogether. 75% of 32 cores equals 24 cores to be entered in your ini config file.
Not sure if the performance and efficiency cores are counted the same way as physical cores.

turtlebackgoofy · Feb 4, 2024

supplovr said:
Thank you very much for this patch. This is a game changer!

Question: Do you know if for an i9-14900KF with P&E&HT Cores the same numbers are to be applied?

P: 8/16 (8 HT)

E: 16/16 (0 HT)

That gives 24 physical and 8 HT cores, or 32 cores altogether. 75% of 32 cores equals 24 cores to be entered in your ini config file.
Not sure if the performance and efficiency cores are counted the same way as physical cores.

the 14900kf seems to have "smart cache". No idea how it is wired to the cores. I guess you need vam to run on only the performance cores which are directly connected to the L3 cache and leave 25% for other unity work for the best performance.
Either way you could just put it at

[threads]
computeColliders=6
skinmeshPart=1
CCD=0
IterateCCD=0

And afterwards try to set the process affinity in Task Manager to cores 0-7.

hijku · Feb 4, 2024

I was getting slightly more FPS when forcing VAM to use only 8P cores (non HT ones) and switching all background processes to E cores with Process Lasso.
This patch also helps, not sure if it's 15% but definitely noticeable difference.

//edit
@turtlebackgoofy I think he needs to set affinity to cores 0, 2, 4, 6, 8, 10, 12, 14 (non HT Perf cores).
0-7 means 4 P cores + 4 P HT cores.

What does the CCD setting do?
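For reference, the two affinity choices discussed above (cores 0-7 versus cores 0, 2, ..., 14) are just different bitmasks. Below is a minimal C# sketch of how such a mask could be built and applied with the standard .NET `Process.ProcessorAffinity` property. The helper names are hypothetical, this is not how the patch or Process Lasso does it internally, and which logical core numbers map to P-cores or their HT siblings depends on the system.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class AffinityMask
{
    // Builds a bitmask with one bit set per logical core index.
    public static long FromCores(IEnumerable<int> cores)
    {
        long mask = 0;
        foreach (int core in cores)
        {
            if (core < 0 || core > 62)
                throw new ArgumentOutOfRangeException(nameof(cores), "Core index must be 0..62.");
            mask |= 1L << core;
        }
        return mask;
    }

    // Applies the mask to the current process (supported on Windows and Linux).
    public static void ApplyToCurrentProcess(long mask)
    {
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)mask;
    }
}
```

Cores 0-7 give the mask 0xFF, while cores 0, 2, 4, 6, 8, 10, 12, 14 give 0x5555. Task Manager and Process Lasso set the same kind of mask through their UI.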

hijku · Feb 4, 2024

One part that I can't understand: why does VaM with ~5k vars give my scenes about 20-30 FPS less than a clean VaM instance?
I understand that a lot of vars can slow down startup and loading times, but why does it affect FPS when everything is loaded?

turtlebackgoofy · Feb 4, 2024

hijku said:
I was getting slightly more FPS when forcing VAM to use only 8P cores (non HT ones) and switching all background processes to E cores with Process Lasso.
This patch also helps, not sure if it's 15% but definitely noticeable difference.

//edit
@turtlebackgoofy I think he needs to set affinity to cores 0, 2, 4, 6, 8, 10, 12, 14 (non HT Perf cores).
0-7 means 4 P cores + 4 P HT cores.

What does the CCD setting do?

The 14900KF doesn't have HT. CCD just limits VaM to the first half or the second half of the cores, so it does the same as Process Lasso.

hijku said:
One part that I can't understand: why does VaM with ~5k vars give my scenes about 20-30 FPS less than a clean VaM instance?
I understand that a lot of vars can slow down startup and loading times, but why does it affect FPS when everything is loaded?

It's because when a var is installed, its morphs are loaded into the morph bank, even if they aren't used in the current scene. On every frame, the morph bank gets searched in the method ApplyMorphsThreadedFast, which in turn calls GetMorph, whose execution time grows the more morph vars you have installed. But I will recheck it.

hijku · Feb 4, 2024

Didn't know that 14900 doesn't have HT. I'm on 13900k
Ohh, thanks, so lowering number of vars with morphs could help. Or making GetMorph faster, can't imagine why this method would have different complexity than O(1).
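To illustrate the complexity question, here is a small C# sketch contrasting a linear scan with a hash lookup. All types and names here are hypothetical stand-ins; how VaM's GetMorph is actually implemented is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a morph entry; not VaM's DAZMorph.
public sealed class MorphEntry
{
    public string Name;
    public float Value;
    public MorphEntry(string name) { Name = name; }
}

public sealed class MorphLookup
{
    private readonly List<MorphEntry> _all = new List<MorphEntry>();
    private readonly Dictionary<string, MorphEntry> _byName = new Dictionary<string, MorphEntry>();

    public void Add(MorphEntry m) { _all.Add(m); _byName[m.Name] = m; }

    // O(n): cost grows with the number of installed morphs.
    public MorphEntry FindLinear(string name)
    {
        foreach (MorphEntry m in _all)
            if (m.Name == name) return m;
        return null;
    }

    // Expected O(1): a hash lookup, independent of the list length.
    public MorphEntry FindHashed(string name)
    {
        MorphEntry m;
        return _byName.TryGetValue(name, out m) ? m : null;
    }
}
```

If a lookup behaves like `FindLinear`, each call gets slower as more morph vars are installed; a dictionary keyed by name keeps the cost roughly flat.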

turtlebackgoofy · Feb 4, 2024

hijku said:
Didn't know that 14900 doesn't have HT. I'm on 13900k
Ohh, thanks, so lowering number of vars with morphs could help. Or making GetMorph faster, can't imagine why this method would have different complexity than O(1).

Doing a profile without vars and with a lot of vars right now; it's already taking 1000 times as long lol

hijku · Feb 4, 2024

Code:

public bool ApplyMorphsThreadedFast(
    Vector3[] verts,
    Vector3[] visibleNonPoseVerts,
    DAZBones bones)
  {
    bool flag1 = true;
    int num1 = 5;
    bool flag2 = false;
    int length = verts.Length;
    while (flag1)
    {
      --num1;
      if (num1 != 0)
      {
        flag1 = false;
        for (int index = 0; index < this._morphs.Count; ++index)
        {
          DAZMorph morph1 = this._morphs[index];
          if (!morph1.disable)
)

does it iterate over ALL morphs I have in VaM on every frame?? Or rather all morphs that are used in current scene?
Why num1=5?

//edit
ohh ok, so it goes over this._morphs up to 5 times or unless no morphs are changed in given iteration

turtlebackgoofy · Feb 4, 2024

hijku said:
Code:

public bool ApplyMorphsThreadedFast(
    Vector3[] verts,
    Vector3[] visibleNonPoseVerts,
    DAZBones bones)
  {
    bool flag1 = true;
    int num1 = 5;
    bool flag2 = false;
    int length = verts.Length;
    while (flag1)
    {
      --num1;
      if (num1 != 0)
      {
        flag1 = false;
        for (int index = 0; index < this._morphs.Count; ++index)
        {
          DAZMorph morph1 = this._morphs[index];
          if (!morph1.disable)
)

does it iterate over ALL morphs I have in VaM on every frame?? Or rather all morphs that are used in current scene?
Why num1=5?

//edit
ohh ok, so it goes over this._morphs up to 5 times or unless no morphs are changed in given iteration

this._morphs contains all morphs from all vars that fit in the morphbank specifications (male, female, etc) and MIGHT be enabled in the scene later. Even though it iterates up to 5 times, it still iterates over ALL morphs in this._morphs and skips most, since they are disabled. So the "if (!morph1.disable)" check is what eats up performance. It's mainly because it evicts the CPU cache when iterating over not-needed morphs.

Comparing the profiler timings for the same scene per render frame (the ms difference in a non-profiled run will be smaller, but it's still wasted CPU time):

no vars
All rows: Method name = DAZMorphBank:ApplyMorphsThreadedFast (UnityEngine.Vector3[],UnityEngine.Vector3[],DAZBones), Method name2 = DAZCharacterRun:RunThreaded (bool), Method name3 = unknown
Thread	Call count	Total runtime (ms)	Total runtime no waste (ms)	Total allocation (bytes)	Per frame (ms)
56	258	155.23	155.172	0	1.20333333
90	258	151.443	151.394	4096	1.17397674
64	258	149.812	149.763	0	1.16133333
77	258	148.149	148.091	4096	1.14844186
43	258	118.103	118.062	0	0.91552713
Sum per frame (ms): 5.6026124

lots vars
All rows: Method name = DAZMorphBank:ApplyMorphsThreadedFast (UnityEngine.Vector3[],UnityEngine.Vector3[],DAZBones), Method name2 = DAZCharacterRun:RunThreaded (bool), Method name3 = unknown
Thread	Call count	Total runtime (ms)	Total runtime no waste (ms)	Total allocation (bytes)	Per frame (ms)
77	322	306.531	306.448	8192	1.90391925
90	322	287.761	287.677	8192	1.7873354
64	322	281.346	281.281	0	1.74749068
56	322	272.618	272.554	0	1.6932795
43	322	205.434	205.376	0	1.27598758
Sum per frame (ms): 8.40801242

hijku · Feb 4, 2024

So something as simple as making that method use a new list, this._enabledMorphs, so we don't iterate over everything could help a lot for most people (most of us have more than a few hundred vars, I assume).
The question is how to keep _enabledMorphs up to date when morphs are enabled/disabled.

turtlebackgoofy · Feb 4, 2024

turtlebackgoofy said:
this._morphs contains all morphs from all vars that fit in the morphbank specifications (male, female, etc) and MIGHT be enabled in the scene later. Even though it iterates up to 5 times, it still iterates over ALL morphs in this._morphs and skips most, since they are disabled. So the "if (!morph1.disable)" check is what eats up performance. It's mainly because it evicts the CPU cache when iterating over not-needed morphs.

Comparing the profiler timings for the same scene per render frame (the ms difference in a non-profiled run will be smaller, but it's still wasted CPU time):

[profiler tables as in the quoted post: summed per-frame values of 5.6026124 with no vars versus 8.40801242 with lots of vars]

A solution would be to somehow make all morphs not load into this._morphs but into this._unactivatedMorphs inside the function BuildMorphsList()

hijku · Feb 4, 2024

True, but this still requires updating all places that can flip morph.disable so that we remove the morph from one list and add it to the other. It can also make enabling a morph slower, since the update would run in O(n), but maybe it wouldn't be noticeable.
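One common way around the O(n) update cost is a "swap-remove" list that remembers each item's index, so both adding and removing are O(1). The C# sketch below is a generic illustration with a made-up `ActiveSet<T>` type, not code from VaM or from the patch.

```csharp
using System;
using System.Collections.Generic;

// Dense list with O(1) add and remove; iteration order is not preserved.
public sealed class ActiveSet<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly Dictionary<T, int> _indexOf = new Dictionary<T, int>();

    public int Count { get { return _items.Count; } }
    public T this[int i] { get { return _items[i]; } }

    public bool Add(T item)
    {
        if (_indexOf.ContainsKey(item)) return false;
        _indexOf[item] = _items.Count;
        _items.Add(item);
        return true;
    }

    public bool Remove(T item)
    {
        int index;
        if (!_indexOf.TryGetValue(item, out index)) return false;
        int last = _items.Count - 1;
        T moved = _items[last];
        _items[index] = moved;      // overwrite the removed slot with the last item
        _indexOf[moved] = index;    // record the moved item's new position
        _items.RemoveAt(last);      // dropping the tail is O(1)
        _indexOf.Remove(item);
        return true;
    }
}
```

The trade-off is that the order of the active items changes on removal, which only matters if the per-frame pass depends on morph order.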

hijku · Feb 4, 2024

[Error :VAMPatches] Morph count: 13602
[Error :VAMPatches] Inactive count: 103630
[Error :VAMPatches] Disabled morphs count: 0

Very interesting, The disabled morphs count is always 0 but I don't think Benchmark scene is using 13.6k morphs.

Clean vam for comparison:

[Error :VAMPatches] Morph count: 1361
[Error :VAMPatches] Inactive count: 0
[Error :VAMPatches] Disabled morphs count: 0

turtlebackgoofy · Feb 4, 2024

hijku said:
Very interesting, The disabled morphs count is always 0 but I don't think Benchmark scene is using 13.6k morphs.

Clean vam for comparison:

True, it appears "enabled" means the morph is visible in the morphs UI. The important part is whether a morph's morphValue != 0.0f; 0.0f is the slider position if you don't want to use a morph in a scene.
So a solution would be to have two separate lists, one for all possible morphs and one for all morphs whose morphValue is != 0.0f. When a morph is set to a value != 0.0f, it should be moved to the active morph list, which gets used in ApplyMorphsThreadedFast, and when it is set to 0.0f it should be removed from that list. Working on it right now. It might even improve performance on a clean VaM, since that still has 1300 unused morphs.
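The two-list idea described above could be sketched in C# as follows. `SketchMorph` and `SketchMorphBank` are hypothetical stand-ins written for this illustration; the real DAZMorph and DAZMorphBank classes have many more fields, and the patch's actual implementation is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins; the names do not come from VaM's code.
public sealed class SketchMorph
{
    private readonly SketchMorphBank _bank;
    private float _value;
    public string Name { get; private set; }

    public SketchMorph(string name, SketchMorphBank bank) { Name = name; _bank = bank; }

    public float Value
    {
        get { return _value; }
        set
        {
            bool wasActive = _value != 0.0f;
            bool isActive = value != 0.0f;
            _value = value;
            // Move the morph between the lists only when it crosses zero.
            if (isActive && !wasActive) _bank.Activate(this);
            else if (!isActive && wasActive) _bank.Deactivate(this);
        }
    }
}

public sealed class SketchMorphBank
{
    private readonly List<SketchMorph> _allMorphs = new List<SketchMorph>();
    private readonly HashSet<SketchMorph> _activeMorphs = new HashSet<SketchMorph>();

    public int TotalCount { get { return _allMorphs.Count; } }
    public int ActiveCount { get { return _activeMorphs.Count; } }

    public SketchMorph Create(string name)
    {
        SketchMorph m = new SketchMorph(name, this);
        _allMorphs.Add(m);
        return m;
    }

    internal void Activate(SketchMorph m) { _activeMorphs.Add(m); }
    internal void Deactivate(SketchMorph m) { _activeMorphs.Remove(m); }

    // The per-frame pass touches only morphs with a non-zero value.
    public float SumActiveValues()
    {
        float sum = 0.0f;
        foreach (SketchMorph m in _activeMorphs) sum += m.Value;
        return sum;
    }
}
```

Hooking the setter keeps the active collection in sync without a per-frame scan, provided every code path that changes a morph value goes through that setter.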

hijku · Feb 4, 2024

Maybe even simply changing the condition could help a lot (without going into two lists straight away)?
from
if(!morph.disable)
to
if(morph.active)

turtlebackgoofy · Feb 4, 2024

hijku said:
Maybe even simply changing the condition could help a lot (without going into two lists straight away)?
from
if(!morph.disable)
to
if(morph.active)

No, it has to be two separate lists, because DAZMorph is an object with about 80 fields, which is about 640 bytes per DAZMorph. When the list is 13602 long, that's 8.3 MB of memory scattered all over the RAM that gets iterated through, which will evict the entire CPU L3 cache. This itself takes a long time and makes everything afterwards slower.

_morphs[13602] is a chunk of 13602 * 8 bytes of memory with pointers to DAZMorphs all over the heap.
A DAZMorph itself is a 640-byte chunk of memory, whose fields get read at the beginning and the end.

Since we are talking about saving 100 µs each frame, even such small reads tank performance.
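The size estimate above can be reproduced with a few lines of C#. The helper names are made up for this sketch, and the 8 bytes per field is the same simplifying assumption used in the post; real object layouts also include headers and fields of other sizes.

```csharp
using System;

public static class MorphMemoryEstimate
{
    // Assumes every field occupies the same number of bytes.
    public static long BytesPerObject(int fieldCount, int bytesPerField)
    {
        return (long)fieldCount * bytesPerField;
    }

    // Total bytes for a number of equally sized objects.
    public static long TotalBytes(int objectCount, long bytesPerObject)
    {
        return objectCount * bytesPerObject;
    }

    // Converts bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
    public static double ToMiB(long bytes)
    {
        return bytes / (1024.0 * 1024.0);
    }
}
```

With 80 fields of 8 bytes each this gives 640 bytes per object, 13602 objects come to 8,705,280 bytes (about 8.3 MiB), and the pointer array alone is 13602 * 8 = 108,816 bytes.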
