Currently, the free list consists of a "small list" of linearly spaced
buckets for sizes below 256, and a "large list" which is manually split
into a few chunks.
This patch replaces it with a single log-linear policy, while expanding
the range the large list covers.
The old implementation had issues when a lot of large allocations
happened. In that case, all of the allocations went into the last
catch-all bucket of the "large list", with two consequences:
1. The linked list grew over time, causing search costs to skyrocket.
2. The first-fit allocation policy made fragmentation worse.
The new bucketing covers the entire range up to the point where we
start allocating large blocks, which bypass the free list entirely. It
also brings the allocation policy closer to best fit (although not
exactly), reducing fragmentation.
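As an illustration of the scheme, here is a minimal sketch with made-up
names; the exact identifiers and arithmetic in ntdll/heap.c may differ:

```c
/* Map an allocation size to a log-linear bucket index: sizes below
 * 2^LINEAR_BITS get one bucket each, and every power-of-two range above
 * that is split into 2^LINEAR_BITS linearly spaced buckets. */
#define LINEAR_BITS 3  /* 8 buckets per doubling, as in the final revision */
#define LINEAR_MASK ((1u << LINEAR_BITS) - 1)

static unsigned int bucket_from_size( size_t size )
{
    unsigned int msb, low;

    if (size < (1u << LINEAR_BITS)) return (unsigned int)size; /* purely linear range */
    msb = 63 - __builtin_clzll( size );                /* position of the highest set bit */
    low = (size >> (msb - LINEAR_BITS)) & LINEAR_MASK; /* next LINEAR_BITS bits below it */
    return ((msb - LINEAR_BITS + 1) << LINEAR_BITS) + low;
}
```

With LINEAR_BITS = 3, every doubling of the size range contributes eight
buckets, which is where the "8 buckets per doubling" figure below comes
from.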
The increased number of free lists does incur some cost when empty
lists have to be skipped over, but the improvement in allocation
performance outweighs it.
For future work, these ideas (mostly from glibc) might or might not
benefit performance:
- Use an exact best-fit allocation policy.
- Add a bitmap for the free lists, allowing empty lists to be skipped
with a single bit scan (see the sketch after this list).
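A hypothetical sketch of the bitmap idea (not part of this patch; the
names are made up):

```c
#include <stdint.h>

/* One bit per free-list bucket, set while the bucket is non-empty. */
static uint64_t freelist_bitmap;  /* assumes at most 64 buckets */

/* Find the first non-empty bucket at or above `first`, or -1 if the
 * request has to fall through to fresh heap memory. */
static int find_nonempty_bucket( unsigned int first )
{
    uint64_t candidates;

    if (first >= 64) return -1;
    candidates = freelist_bitmap & (~0ull << first);
    if (!candidates) return -1;
    return __builtin_ctzll( candidates );  /* single bit scan (GCC/Clang builtin) */
}
```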
As a benchmark, this patch drastically improves initial shader loading performance in Overwatch 2. In this workload, 78k shaders are passed to DXVK for DXBC -> SPIR-V translation, and for each shader a few allocations happen in the 4K – 100K range for the staging buffer.
Before this patch, malloc accounted for a whopping 43% of the overhead. The overhead with log-linear bucketing is drastically lower, resulting in a ~2x improvement in loading time.
The overhead for each `FREE_LIST_LINEAR_BITS` value is as follows:
- 0: 7.7%
- 1: 2.9%
- 2: 1.3%
- 3: 0.6%
Since performance seems to scale linearly with the number of buckets (up to the point I have tested), I've opted for 3 (8 buckets per doubling) in the current revision of the patch.
Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@gmail.com>
--
v5: ntdll: Use log-linear bucketing for free lists.
https://gitlab.winehq.org/wine/wine/-/merge_requests/2622
--
v2: imm32: Use INPUTCONTEXT directly in ImmSetConversionStatus.
imm32: Use INPUTCONTEXT directly in ImmGetConversionStatus.
imm32: Compare open status values in ImmSetOpenStatus.
imm32: Cache INPUTCONTEXT values for every IME.
imm32: Use INPUTCONTEXT directly in ImmSetOpenStatus.
imm32: Use INPUTCONTEXT directly in ImmGetOpenStatus.
imm32: Serialize ImeInquire / ImeDestroy calls.
imm32/tests: Cleanup the cross thread IMC tests.
imm32/tests: Reduce the number of IME installations.
https://gitlab.winehq.org/wine/wine/-/merge_requests/2627
The main advantage is that this way we get valid DXBC checksums for DXBC blobs generated by d3dcompiler. See also https://bugs.winehq.org/show_bug.cgi?id=54464.
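For context, a minimal sketch of the kind of section lookup this enables,
assuming the vkd3d_shader_parse_dxbc() API from vkd3d_shader.h
(illustrative only, not the actual d3dcompiler code):

```c
#include <stdbool.h>
#include <vkd3d_shader.h>

/* Parse a DXBC container and return the data of the section matching
 * `tag`; the returned code pointer aliases the caller's blob.
 * Error handling is trimmed for brevity. */
static bool find_dxbc_section( const void *data, size_t size, uint32_t tag,
        struct vkd3d_shader_code *out )
{
    struct vkd3d_shader_code dxbc = { data, size };
    struct vkd3d_shader_dxbc_desc desc;
    bool found = false;
    size_t i;

    if (vkd3d_shader_parse_dxbc( &dxbc, 0, &desc, NULL ) < 0) return false;

    for (i = 0; i < desc.section_count; ++i)
    {
        if (desc.sections[i].tag != tag) continue;
        *out = desc.sections[i].data;
        found = true;
        break;
    }
    vkd3d_shader_free_dxbc( &desc );
    return found;
}
```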
--
v4: d3dcompiler: Use vkd3d_shader_parse_dxbc() in d3dcompiler_shader_reflection_init().
d3dcompiler: Use vkd3d_shader_parse_dxbc() in d3dcompiler_strip_shader().
d3dcompiler: Use vkd3d_shader_parse_dxbc() in d3dcompiler_get_blob_part().
https://gitlab.winehq.org/wine/wine/-/merge_requests/2577
--
v3: ntdll: Use log-linear bucketing for free lists.
https://gitlab.winehq.org/wine/wine/-/merge_requests/2622
To simplify SM 6 support, insert a control point id relative address where needed, and declare control point phase inputs where missing.
--
v5: vkd3d-shader/ir: Insert hull shader control point input declarations if no control point phase is defined.
https://gitlab.winehq.org/wine/vkd3d/-/merge_requests/141
The deleted test is a bad test: the behavior differs across drivers;
for example, the test fails on the Windows 10 and 11 testbot machines
that have AMD video cards. Furthermore, MSDN does not say that the
destination context "must not" have any display lists yet, but rather
that it "should not" have any.[1] The Khronos OpenGL Wiki similarly
advises against calling wglShareLists after the source or destination
context has at least one object, but it only says that "there is a
chance that wglShareLists will fail", not that it will necessarily
fail.[2] Since there is no clear "right" behavior here, we can adopt
the more permissive behavior that some programs expect, as long as it
doesn't corrupt the context.
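For illustration, the permissive pattern in question might look like the
following hypothetical sketch (error handling omitted; the real tests in
opengl32/tests are more thorough):

```c
#include <windows.h>
#include <GL/gl.h>

void share_after_creation( HDC hdc )
{
    HGLRC src = wglCreateContext( hdc );
    HGLRC dst = wglCreateContext( hdc );

    wglMakeCurrent( hdc, dst );
    glGenLists( 1 );                /* the destination context now owns an object */
    wglMakeCurrent( NULL, NULL );

    /* Sharing after the fact: MSDN says you "should not" do this, yet
     * some applications expect it to succeed anyway. */
    if (!wglShareLists( src, dst )) { /* handle failure */ }
}
```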
[1] https://learn.microsoft.com/en-us/windows/win32/api/wingdi/nf-wingdi-wglsha…
[2] https://www.khronos.org/opengl/wiki/Platform_specifics:_Windows#wglShareLis…
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=11436
--
v4: winex11: Allow replacing either context in wglShareLists.
opengl32/tests: Make the wglShareLists tests comprehensive.
https://gitlab.winehq.org/wine/wine/-/merge_requests/2032
include: _InterlockedExchangePointer and _InterlockedCompareExchangePointer are intrinsics in x86 msvc.
I fixed this issue in ad05f33d67, but a40973f20 regressed it again. I had been carrying
a patch for quite a while, with a sense of déjà vu.
The _MSC_VER cutoff of 1900 is taken from Boost's interlocked.hpp, and matches MSVC 2015
(toolset version v140). Boost has a comment claiming that in MSVC 2012 those functions
were defined in intrin.h, but those definitions are broken in combination with
Microsoft's winnt.h.
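A sketch of the kind of guard this implies (illustrative; the actual
include/winnt.h change may differ):

```c
/* On x86, declare these as compiler intrinsics only for MSVC 2015
 * (_MSC_VER >= 1900) and later; older compilers instead get inline
 * helpers built on _InterlockedExchange / _InterlockedCompareExchange. */
#if defined(_MSC_VER) && _MSC_VER >= 1900 && defined(_M_IX86)

void *_InterlockedExchangePointer( void *volatile *target, void *value );
void *_InterlockedCompareExchangePointer( void *volatile *destination,
                                          void *exchange, void *comparand );
#pragma intrinsic(_InterlockedExchangePointer)
#pragma intrinsic(_InterlockedCompareExchangePointer)

#endif
```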
--
v2: include: x86 msvc has _InterlockedExchangePointer and _InterlockedCompareExchangePointer
https://gitlab.winehq.org/wine/wine/-/merge_requests/2591
--
v2: ntdll: Use log-linear bucketing for free lists.
https://gitlab.winehq.org/wine/wine/-/merge_requests/2622