https://bugs.winehq.org/show_bug.cgi?id=49113
Bug ID: 49113
Summary: Wine heap performs badly when multiple threads are concurrently allocating or freeing memory
Product: Wine
Version: 5.7
Hardware: x86
OS: Linux
Status: UNCONFIRMED
Severity: normal
Priority: P2
Component: ntdll
Assignee: wine-bugs@winehq.org
Reporter: rbernon@codeweavers.com
Distribution: ---
This can be easily reproduced with any synthetic heap benchmark, such as https://github.com/mjansson/rpmalloc-benchmark or https://github.com/daanx/mimalloc-bench.
Performance gets really bad as the number of concurrent threads increases.
For instance, running the rpmalloc benchmark with the "<num threads> 0 0 2 20000 50000 5000 16 1000" parameter set and 2 concurrent threads gives the following results ("wine staging" was tested with the staging heap improvement patches from https://bugs.winehq.org/show_bug.cgi?id=43224):
* win10 crt: 11977625 memory ops/CPU second, 106% overhead
* linux crt: 5675754 memory ops/CPU second, 53% overhead
* wine rpmalloc: 19700003 memory ops/CPU second, 131% overhead
* wine upstream: 248333 memory ops/CPU second, 62% overhead
* wine staging: 914004 memory ops/CPU second, 61% overhead
Increasing the number of threads makes the difference even worse for Wine.
In general this does not translate into much of a slowdown, as memory allocation is rarely done in such a highly concurrent way, but in some situations the difference is clearly noticeable, in particular with many games during their loading times.
Rémi Bernon rbernon@codeweavers.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rbernon@codeweavers.com
--- Comment #1 from Rémi Bernon rbernon@codeweavers.com ---
Created attachment 67088
  --> https://bugs.winehq.org/attachment.cgi?id=67088
Thread local heap implementation
Attaching a patch series that I sent to the mailing list. I believe it greatly improves the heap performance. It may be interesting to have it in staging.
Zebediah Figura z.figura12@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |z.figura12@gmail.com
--- Comment #2 from Zebediah Figura z.figura12@gmail.com ---
(In reply to Rémi Bernon from comment #0)
> In general this does not translate into much of a slowdown, as memory allocation is rarely done in such a highly concurrent way, but in some situations the difference is clearly noticeable, in particular with many games during their loading times.
I'm not expecting leagues of difference of course, as you say, but all the same could you give some exact numbers for a handful of specific titles?
--- Comment #3 from Dmitry Timoshkov dmitry@baikal.ru ---
This seems to be going in the wrong direction (is the actual problem due to locking primitives being inefficient?), since the whole effort has been driven by artificial tests, and as a result there's no visible improvement for real-world applications. By contrast, Sebastian's patchset in the staging tree was based on research and proper heap manager design, and as a result provided huge performance improvements for real-world applications.
Just for reference, I'll copy/paste Sebastian's comment from the internal wine-staging patch tracker that accompanies his patchset:
==============================================================================
https://dev.wine-staging.com/patches/submission/145/

New comments by Sebastian Lackner (slackner):
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sorry for the delay, but such a complicated patchset took a bit more time to evaluate. During the last few weeks, Michael Müller wrote tools to evaluate heap allocation performance. These tools make it possible to record the heap allocation patterns of various applications and to replay them with different heap implementations. The result (tested with various applications and games) confirms what I already suspected:
Although these patches by Niels Kuhnhenn help for some applications with a "bad allocation pattern" (see bug 43224 for example), they reduce performance by up to 30% in the "good case". This means that they are not really suitable for Wine Staging; users would certainly be upset about such a severe performance regression.
This result is also not really surprising: the whole idea is based on heuristics and trial and error instead of a proper heap allocator design. I can understand that users are willing to use this workaround for certain apps, but it is not the correct solution. Since no one else seems to be working in this area at the moment, I've decided to give it a shot myself. The result is available at https://dev.wine-staging.com/patches/156/, please give it a try.
Inspired by the way it works on Windows, the new heap allocator uses various fixed-size free lists and a tree data structure for large elements. With this implementation, I get up to a 60% improvement for apps with the "bad allocation pattern", and up to a 15% improvement in the "good case". I am not aware of any application where this reduces performance, but of course this needs more careful testing. Michael was also planning to provide a more precise evaluation in the release notes or as a separate blog post after it has been merged.
Nevertheless, the newly proposed patchset is certainly better than this attempt, so I'm going to mark all patches as superseded (except the prefetch and wined3d patch, which still have to be evaluated separately).
==============================================================================
--- Comment #4 from Dmitry Timoshkov dmitry@baikal.ru ---
I'd suggest spending the effort on mainlining Sebastian's patch instead.
--- Comment #5 from Rémi Bernon rbernon@codeweavers.com ---
Created attachment 67093
  --> https://bugs.winehq.org/attachment.cgi?id=67093
Dishonored 2 loading time
(In reply to Zebediah Figura from comment #2)
> (In reply to Rémi Bernon from comment #0)
> > In general this does not translate into much of a slowdown, as memory allocation is rarely done in such a highly concurrent way, but in some situations the difference is clearly noticeable, in particular with many games during their loading times.
>
> I'm not expecting leagues of difference of course, as you say, but all the same could you give some exact numbers for a handful of specific titles?
I may be overselling it a bit, and it's actually hard to measure precisely.
Here, for instance, are the individual frame times during the loading of Dishonored 2, with the standard heap and with the thread-local implementation.
(In reply to Dmitry Timoshkov from comment #3)
> This seems to be going in the wrong direction (is the actual problem due to locking primitives being inefficient?), since the whole effort has been driven by artificial tests, and as a result there's no visible improvement for real-world applications. By contrast, Sebastian's patchset in the staging tree was based on research and proper heap manager design, and as a result provided huge performance improvements for real-world applications.
Of course, optimizing locking primitives also helps, and esync and fsync have an effect there as well. I don't think the two approaches are mutually exclusive.
(In reply to Dmitry Timoshkov from comment #4)
> I'd suggest spending the effort on mainlining Sebastian's patch instead.
Sure.
--- Comment #6 from Zebediah Figura z.figura12@gmail.com ---
I'd like to point out that the patches currently in wine-staging don't have much along the lines of traceability either; i.e. we don't know what real applications are improved by them or how bad they were in the first place. The comment recovered from the old wine-staging.com website helps, but it only gives a couple of numbers. The bug linked to the aforementioned staging patch (bug 43224) is aimed at an artificial benchmark.
(Incidentally, Dmitry, do you have access to the old Staging website, or had you saved those paragraphs locally at some point? I've been unable to find any content from the Staging website ever since it got taken down [and unable to find any content from the old bug tracker ever], and I'd really appreciate it if anyone does have access, or even a significant amount of content saved, essentially for these reasons.)
Rémi, that graph is detailed, but it is kind of difficult to get an overall picture from it. I was hoping instead for an absolute measurement of the loading time with both heap managers. Adding Staging might also be good...
--- Comment #7 from Dmitry Timoshkov dmitry@baikal.ru ---
(In reply to Zebediah Figura from comment #6)
> (Incidentally, Dmitry, do you have access to the old Staging website, or had you saved those paragraphs locally at some point?

The retired wine-staging patch tracker sent the links and comments (without the actual patches) to subscribers; I still have the comments saved for the patches I was interested in and for my own work.

> I've been unable to find any content from the Staging website ever since it got taken down [and unable to find any content from the old bug tracker ever], and I'd really appreciate it if anyone does have access, or even a significant amount of content saved, essentially for these reasons.)
You probably should ask Sebastian and Michael for more details.
--- Comment #8 from Zebediah Figura z.figura12@gmail.com ---
(In reply to Dmitry Timoshkov from comment #7)
> (In reply to Zebediah Figura from comment #6)
> > (Incidentally, Dmitry, do you have access to the old Staging website, or had you saved those paragraphs locally at some point?
>
> The retired wine-staging patch tracker sent the links and comments (without the actual patches) to subscribers; I still have the comments saved for the patches I was interested in and for my own work.

Mmh, okay, I see.

> > I've been unable to find any content from the Staging website ever since it got taken down [and unable to find any content from the old bug tracker ever], and I'd really appreciate it if anyone does have access, or even a significant amount of content saved, essentially for these reasons.)
>
> You probably should ask Sebastian and Michael for more details.
Yes, I've tried. I've also tried asking them about specific patches. I haven't gotten a response in either case.
--- Comment #9 from Rémi Bernon rbernon@codeweavers.com ---
Created attachment 67097
  --> https://bugs.winehq.org/attachment.cgi?id=67097
Rbtree heap optimization
(In reply to Zebediah Figura from comment #6)
> Rémi, that graph is detailed, but it is kind of difficult to get an overall picture from it. I was hoping instead for an absolute measurement of the loading time with both heap managers. Adding Staging might also be good...
Well, the graph shows that on the x axis. Each point is a frame time, and there's a fixed number of frames until the menu opens, which corresponds to the last spike on the right. Both graphs show the exact same activity, but the LFH (low fragmentation heap) version is slightly more compressed, which means a shorter loading time.
Then, to be completely honest, it's pretty hard to measure precisely, and it also varies a lot. This graph uses the shortest loading time I could get with each heap manager, and every run was different.
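For readers who want the single absolute number asked for above, the total loading time can be derived from the same per-frame data by summing the frame times up to the frame where the menu opens. The sketch below assumes frame times recorded in milliseconds; the function name and the sample values are made up for illustration and are not taken from the attached graph.

```python
def loading_time_seconds(frame_times_ms):
    """Sum per-frame durations (in milliseconds) into a total in seconds."""
    return sum(frame_times_ms) / 1000.0


# Made-up frame times for the same fixed number of frames, one list per
# heap implementation. These are NOT measurements from the attachment.
standard_heap = [16.0, 250.0, 900.0, 33.0, 1200.0]
thread_local_heap = [16.0, 210.0, 800.0, 33.0, 1000.0]

print(f"standard heap:     {loading_time_seconds(standard_heap):.3f}s")
print(f"thread-local heap: {loading_time_seconds(thread_local_heap):.3f}s")
```

Because run-to-run variation is large, a comparison like this would presumably need several runs per heap implementation rather than a single best run.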
(In reply to Dmitry Timoshkov from comment #3)
> This seems to be going in the wrong direction (is the actual problem due to locking primitives being inefficient?), since the whole effort has been driven by artificial tests, and as a result there's no visible improvement for real-world applications. By contrast, Sebastian's patchset in the staging tree was based on research and proper heap manager design, and as a result provided huge performance improvements for real-world applications.
>
> ==============================================================================
> https://dev.wine-staging.com/patches/submission/145/
>
> New comments by Sebastian Lackner (slackner):
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Sorry for the delay, but such a complicated patchset took a bit more time to evaluate. During the last few weeks, Michael Müller wrote tools to evaluate heap allocation performance. These tools make it possible to record the heap allocation patterns of various applications and to replay them with different heap implementations. The result (tested with various applications and games) confirms what I already suspected:
So, in order to be fair, I spent some time doing the same thing today. I don't know of any application that actually suffers from bad allocation patterns, so I recorded the heap allocations while running Steam for Windows, starting a game, and then quitting everything after a bit.
Then, I replayed the allocations as quickly as possible, but using a single thread of execution -- just to take threading out of the equation, and because the allocations were not replayed with their original timing. Replaying them faithfully in time could be another experiment to do.
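A single-threaded replay of this kind can be sketched as follows. This is only an illustration under assumptions: the thread does not describe the recording format or the replay tool, so the trace tuples, the `replay` function, and the stand-in allocator callbacks are all hypothetical.

```python
import time


def replay(trace, alloc, free):
    """Replay a recorded allocation trace on one thread, as fast as possible.

    `trace` is a sequence of ("alloc", block_id, size) and ("free", block_id)
    events (a hypothetical format). Returns the elapsed time in seconds and
    the number of blocks still live at the end of the trace.
    """
    live = {}
    start = time.perf_counter()
    for event in trace:
        if event[0] == "alloc":
            _, block_id, size = event
            live[block_id] = alloc(size)
        else:
            _, block_id = event
            free(live.pop(block_id))
    return time.perf_counter() - start, len(live)


# Tiny made-up trace; a real recording would contain millions of events.
trace = [
    ("alloc", 1, 32),
    ("alloc", 2, 4096),
    ("free", 1),
    ("alloc", 3, 16),
    ("free", 3),
]

# bytearray stands in for the heap under test; freeing is a no-op here.
elapsed, still_live = replay(trace, alloc=bytearray, free=lambda block: None)
print(f"replayed {len(trace)} events in {elapsed:.6f}s, {still_live} block(s) still live")
```

Passing the allocator in as a pair of callbacks keeps the trace fixed while the heap implementation varies, which is the comparison described here.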
I also spent some time studying the staging patches to determine what the optimizations were exactly, in order to eventually try to clean up the patches and upstream them. My understanding is that there are four different optimizations in the patch:
* The number of free list buckets is increased to 128 (it's somewhere around 16 or 32 in mainline, depending on the pointer size).
* The largest buckets are replaced with an rbtree using block size as key.
* The empty status of the list buckets is cached in a bitmask array (which makes little sense on its own but helps mitigate the impact of increasing the number of free buckets).
* The list buckets are unlinked from each other and directly use a struct list instead of a full arena header.
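As a rough illustration of how the first three items fit together, here is a small Python sketch. It is not Wine's code: the 8-byte granularity, the cut-off between bucketed and "large" blocks, and all names are assumptions, a sorted list stands in for the rbtree, and neither the fourth item (the struct list layout) nor block splitting is modelled.

```python
import bisect

ALIGNMENT = 8                               # assumed block granularity
NUM_BUCKETS = 128                           # bucket count named in the list above
LARGE_THRESHOLD = ALIGNMENT * NUM_BUCKETS   # assumed cut-off for "large" blocks


class FreeLists:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]
        self.nonempty = 0   # bit i is set exactly when buckets[i] is non-empty
        self.large = []     # sorted (size, address) pairs, stand-in for the rbtree

    @staticmethod
    def _round_up(size):
        return -(-size // ALIGNMENT) * ALIGNMENT

    @staticmethod
    def _bucket_index(size):
        # Valid for sizes that are multiples of ALIGNMENT: 8 -> 0, 16 -> 1, ...
        return size // ALIGNMENT - 1

    def add_free(self, address, size):
        assert size >= ALIGNMENT and size % ALIGNMENT == 0
        if size > LARGE_THRESHOLD:
            bisect.insort(self.large, (size, address))
            return
        i = self._bucket_index(size)
        self.buckets[i].append((size, address))
        self.nonempty |= 1 << i

    def find_free(self, size):
        """Return a (size, address) block of at least `size` bytes, or None."""
        size = self._round_up(size)
        if size <= LARGE_THRESHOLD:
            i = self._bucket_index(size)
            candidates = self.nonempty >> i   # drop buckets that are too small
            if candidates:
                # Index of the lowest set bit: first non-empty bucket that fits.
                j = i + (candidates & -candidates).bit_length() - 1
                block = self.buckets[j].pop()
                if not self.buckets[j]:
                    self.nonempty &= ~(1 << j)
                return block
        # Smallest large block that fits (size-ordered lookup, as in a tree).
        k = bisect.bisect_left(self.large, (size, 0))
        return self.large.pop(k) if k < len(self.large) else None


heap = FreeLists()
heap.add_free(0x1000, 64)
heap.add_free(0x2000, 4096)
print(heap.find_free(40))   # served from the 64-byte bucket
print(heap.find_free(40))   # buckets are now empty, falls back to the large set
```

The bitmask is what keeps the lookup cheap when there are many buckets: finding the first non-empty bucket that fits is one shift and one lowest-set-bit operation, instead of a walk over up to 128 list heads.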
Then, I replayed my recorded allocations, with the individual optimizations split and added separately, as well as the other versions, and the results are as follows:
Wine:                        ~13.5s
  + increase freelist count: ~14.5s
  + rbtree for large blocks: ~12s
  + cache freelist state:    ~13.5s
  + struct list direct use:  ~13.5s
Wine + staging patch:        ~12s
Low fragmentation heap:      ~10s
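To make these figures easier to compare, the short sketch below expresses each one as a change relative to the unpatched Wine replay. The times are copied from the list above and are themselves approximate, so the percentages are indicative only; the variable names are illustrative.

```python
# Approximate replay times in seconds, copied from the results above.
replay_seconds = {
    "Wine": 13.5,
    "+ increase freelist count": 14.5,
    "+ rbtree for large blocks": 12.0,
    "+ cache freelist state": 13.5,
    "+ struct list direct use": 13.5,
    "Wine + staging patch": 12.0,
    "Low fragmentation heap": 10.0,
}

baseline = replay_seconds["Wine"]
change_percent = {
    name: 100.0 * (seconds - baseline) / baseline
    for name, seconds in replay_seconds.items()
}

for name, change in change_percent.items():
    print(f"{name:27s} {change:+6.1f}%")
```

On these numbers, the rbtree change alone gives the same replay time as the full staging patch.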
I'm not going to conclude much based on just this small experiment, but I think the optimizations in staging aren't that sophisticated. It's optimizing the worst cases by making the lookup of large blocks faster thanks to the rbtree, and moving some smaller sizes to individual free lists. The rbtree optimization can make sense, but it's also not very CPU friendly (neither are the linked lists, anyway).
The free list count increase, on the other hand, makes little sense to me. It's hardcoding some value without trying to optimize the distribution. Mainline doesn't do much, but there is at least a bit of categorization with a few larger-size buckets. The patch drops all this and simply increases the number of small-size buckets, relying on the rbtree and the state cache to handle the size categories and mitigate the induced CPU cache load. I mean, it's possible that it was giving the best results for /some/ other experiment, but unless there's some data to back it up, for now I think it's just tweaking numbers.
I'm attaching the extracted rbtree optimization for reference because it seems useful. I'm also very interested to know if there are specific applications suffering from bad allocation patterns.
--- Comment #10 from Dmitry Timoshkov dmitry@baikal.ru ---
(In reply to Rémi Bernon from comment #9)
> So, in order to be fair, I spent some time doing the same thing today. I don't know of any application that actually suffers from bad allocation patterns, so I recorded the heap allocations while running Steam for Windows, starting a game, and then quitting everything after a bit.

Bugs for the apps with a bad allocation pattern are linked to bug 43224:
https://bugs.winehq.org/show_bug.cgi?id=24256
https://bugs.winehq.org/show_bug.cgi?id=37773
--- Comment #11 from Zebediah Figura z.figura12@gmail.com ---
(In reply to Dmitry Timoshkov from comment #10)
> Bugs for the apps with a bad allocation pattern are linked to bug 43224:
> https://bugs.winehq.org/show_bug.cgi?id=24256
> https://bugs.winehq.org/show_bug.cgi?id=37773
While they're probably worth testing regardless, I'm not sure how interesting these two are in terms of improving current Wine, since they're both apparently addressed by 2175852f5 (24256 isn't marked as such, but the patch seems to be fundamentally the same). Maybe those applications could be improved further, though...
--- Comment #12 from Rémi Bernon rbernon@codeweavers.com ---
Well then, from the fix it looks like they were doing a lot of very small allocations, as most applications do -- in my recording there are 100 to 1000 times more allocations for each size below 128 than above (with a few exceptions for sizes around 256 and some others). This is why free list distribution matters.
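A size distribution like this can be checked on any recorded trace with a simple histogram. The sketch below is illustrative only: the function name is hypothetical, the request sizes are made up rather than taken from the recording mentioned above, and the 128-byte threshold is simply the boundary named in the text.

```python
from collections import Counter


def small_allocation_share(sizes, threshold=128):
    """Return the fraction of allocation requests smaller than `threshold` bytes."""
    histogram = Counter(sizes)
    total = sum(histogram.values())
    small = sum(count for size, count in histogram.items() if size < threshold)
    return small / total if total else 0.0


# Made-up request sizes, for illustration only; not data from the recording.
sizes = [16] * 500 + [32] * 300 + [64] * 150 + [256] * 40 + [4096] * 10
print(f"{small_allocation_share(sizes):.0%} of requests are below 128 bytes")
```

When small sizes dominate to this degree, how the buckets are spread over the small size range matters far more than the handling of the rare large blocks.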
I believe there's another issue with the default heap: it causes a lot of memory fragmentation, which may very well be the problem for these apps as well, since fragmentation also fills the free lists with useless blocks. The low fragmentation heap design is also supposed to address that.
François Gouget fgouget@codeweavers.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fgouget@codeweavers.com
           Keywords|                            |patch
soredake gi85qht0z@relay.firefox.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gi85qht0z@relay.firefox.com
Rémi Bernon rbernon@codeweavers.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
      Fixed by SHA1|                            |40b7c3e89a95d6ccb190b234d4ad13b3a8304495
         Resolution|---                         |FIXED

--- Comment #13 from Rémi Bernon rbernon@codeweavers.com ---
Marking as fixed.
Alexandre Julliard julliard@winehq.org changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |CLOSED

--- Comment #14 from Alexandre Julliard julliard@winehq.org ---
Closing bugs fixed in 8.3.