https://bugs.winehq.org/show_bug.cgi?id=49113
--- Comment #9 from Rémi Bernon <rbernon@codeweavers.com> ---
Created attachment 67097 --> https://bugs.winehq.org/attachment.cgi?id=67097
Rbtree heap optimization
(In reply to Zebediah Figura from comment #6)
> Rémi, that graph is detailed but kind of difficult to get an overall picture from. I was hoping for instead an absolute measurement of how long loading time is, with both heap managers. Adding Staging also might be good...
Well, the graph does show that, on the x axis. Each point is a frame time, and there's a fixed number of frames until the menu opens, which corresponds to the last spike on the right. Both graphs show the exact same activity, but the LFH version is slightly more compressed, and thus a shorter loading time.
To be completely honest, though, it's pretty hard to measure precisely, and it varies a lot between runs. This graph used the shortest loading time I could get with each heap manager, and every run was different.
(In reply to Dmitry Timoshkov from comment #3)
> This seems to be going in the wrong direction (is the actual problem due to locking primitives being inefficient?) since the whole effort has been driven by artificial tests, and as a result there's no visible improvement for real world applications. On the contrary, Sebastian's patchset in the staging tree was based on research and proper heap manager design, and as a result provided huge performance improvements for real world applications.
=============================================================================
https://dev.wine-staging.com/patches/submission/145/
New comments by Sebastian Lackner (slackner):
> Sorry for the delay, but such a complicated patchset took a bit more time to evaluate. During the last few weeks, Michael Müller wrote tools to evaluate heap allocation performance. These tools allow recording the heap allocation pattern of various applications, and replaying it with different heap implementations. The result (tested with various applications and games) confirms what I already suspected:
So, in order to be fair, I spent some time doing the same thing today. I don't know of any application that actually suffers from bad allocation patterns, so I recorded the heap allocations while running Steam for Windows, then starting a game, then quitting everything after a bit.
Then, I replayed the allocations as quickly as possible, but using a single thread of execution -- just to get threading out of the equation. The allocations were not replayed with their original timing; doing so faithfully could be another experiment.
I also spent some time studying the staging patches to determine what exactly the optimizations were, in order to eventually clean up the patches and upstream them. My understanding is that there are four different optimizations in the patch:
* The number of free list buckets is increased to 128 (mainline has somewhere around 16 or 32, depending on the pointer size).
* The largest buckets are replaced with an rbtree keyed on block size.
* The empty status of the list buckets is cached in a bitmask array (which makes little sense on its own, but helps mitigate the impact of increasing the number of free buckets).
* The list buckets are unlinked from each other and use a struct list directly instead of a full arena header.
Then I replayed my recorded allocations with each of the individual optimizations split out and applied separately, as well as with the other heap versions. The results are as follows:
Wine:                      ~13.5s
+ increase freelist count: ~14.5s
+ rbtree for large blocks: ~12s
+ cache freelist state:    ~13.5s
+ struct list direct use:  ~13.5s
Wine + staging patch:      ~12s
Low fragmentation heap:    ~10s
I'm not going to conclude much based on just this small experiment, but I think the optimizations in staging aren't that sophisticated. They optimize the worst cases by making the lookup of large blocks faster thanks to the rbtree, and by moving some smaller sizes to individual free lists. The rbtree optimization can make sense, but it's also not very CPU-cache friendly (neither are the linked lists, for that matter).
The free list count increase, on the other hand, makes little sense to me. It hardcodes some value without trying to optimize the distribution. Mainline doesn't do much, but there is at least a bit of categorization, with a few larger-size buckets. The patch drops all of this and simply increases the number of small-size buckets, relying on the rbtree and the state cache to handle the size categories and mitigate the induced CPU cache load. It's possible that this gave the best results for /some/ other experiment, but unless there's data to back it up, for now I think it's just tweaking numbers.
I'm attaching the extracted rbtree optimization for reference, because it seems useful. I'm also very interested to know if there are specific applications suffering from bad allocation patterns.