Hi All,
In the wake of the new WOW64 implementation (recent explanation [1]), there has been discussion in informal channels about how we are going to handle pointers to mapped graphics resource memory that we receive from the graphics API, since they may fall outside the 32-bit address space.
Over time, a few creative solutions have been proposed and discussed, with a common theme being that we need changes in either the kernel or the graphics drivers to do this properly. As we already know the requirements for a solution to this problem, I think it would be responsible to hash this out now and engage the relevant project maintainers early, so as to avoid blocking work on the Wine side for too long and to possibly let more users test the new path sooner.
The solutions which I've seen laid out so far:
- Use the mremap(2) interface, allowing us to duplicate the mapping we receive into the 32-bit address space. This solution would match what is already done for CrossOver Mac's 32on64 support using Mac's mach_vm_remap functionality [2]. However, right now it is not possible to use the MREMAP_DONTUNMAP flag with mappings that aren't private and anonymous, which rules out their use on mapped FDs from libdrm. Due to this, a kernel change would be necessary. (A sketch of the intended call follows this list.)
Pro: A uniform solution across all APIs, which could help in the future with any unforeseen need to access host-allocated memory in 32-bit windows code.
Cons: Requires a kernel change, which of all the options may take the longest to get upstreamed and into the hands of users.
- Work with Khronos to introduce extensions into the relevant APIs enabling us to tell drivers where in the address space we want resources mapped.
Pro: Wouldn't require going behind the driver's back, resulting in a more hardened solution. (Out there, but what if a creative driver returns a mapping without read or write permission and handles accesses through a page fault handler?)
Cons: The extension would have to be implemented by each individual vendor for every relevant API. This would implicitly drop support for those with cards whose graphics drivers are no longer being updated.
- Hook the driver's mmap calls while we invoke its memory-mapping functions, overriding the address to something in the 32-bit address space. (See the interposer sketch at the end of this message.)
Pro: Unlike the other solutions, this wouldn't require any changes to other projects, and shares the advantage of the first solution.
Cons: Susceptible to breakage if the driver uses its own mapping mechanism separate from mmap (a custom ioctl, or a CPU driver returning something from the heap).
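For concreteness, the intended mremap(2) call might look like the following minimal sketch. It assumes the hypothetical kernel change that lets MREMAP_DONTUNMAP act on file-backed mappings (today the call fails with EINVAL for them), and that low_addr points into address space we already reserved below 4 GB:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Move the driver's mapping [ptr, ptr + size) to low_addr, leaving the
     * original range in place (zero-filled on access, per the man page). */
    static void *remap_below_4g(void *ptr, size_t size, void *low_addr)
    {
        void *ret = mremap(ptr, size, size,
                           MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP,
                           low_addr);
        if (ret == MAP_FAILED)
            fprintf(stderr, "mremap: %s\n", strerror(errno));
        return ret;
    }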
1: https://www.winehq.org/pipermail/wine-devel/2022-April/213677.html
2: https://www.codeweavers.com/crossover/source - see function `remap_memory` in `wine/dlls/winemac.drv/opengl.c`
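And a minimal sketch of the hooking idea as an mmap interposer; force_low_mappings is a hypothetical flag we would set around the API's map entry points, and MAP_32BIT places mappings in the low 2 GB on x86-64 Linux:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    static __thread int force_low_mappings; /* set around vkMapMemory() etc. */

    void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
    {
        static void *(*real_mmap)(void *, size_t, int, int, int, off_t);
        if (!real_mmap)
            real_mmap = (void *(*)(void *, size_t, int, int, int, off_t))
                        dlsym(RTLD_NEXT, "mmap");
        if (force_low_mappings)
            flags |= MAP_32BIT; /* x86-64 Linux: allocate in the low 2 GB */
        return real_mmap(addr, length, prot, flags, fd, offset);
    }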
On 4/24/22 21:18, Derek Lesho wrote:
Hi All,
In the wake of the new WOW64 implementation (recent explanation [1]), there has been discussion in informal channels about how we are going to handle pointers to mapped graphics resource memory that we receive from the graphics API, since they may fall outside the 32-bit address space.
Over time, a few creative solutions have been proposed and discussed, with a common theme being that we need changes in either the kernel or the graphics drivers to do this properly. As we already know the requirements for a solution to this problem, I think it would be responsible to hash this out now and engage the relevant project maintainers early, so as to avoid blocking work on the Wine side for too long and to possibly let more users test the new path sooner.
Thank you for starting this conversation! I agree with all of these points. WoW64 emulation is still a long way off, if it'll even happen by default on platforms other than Mac, but nevertheless this is something we should look into supporting sooner rather than later.
It would probably be good to start a dri-devel/mesa-dev thread to discuss this as well.
The solutions which I've seen laid out so far:
- Use the mremap(2) interface, allowing us to duplicate the mapping we
receive into the 32-bit address space. This solution would match what is already done for CrossOver Mac's 32on64 support using Mac's mach_vm_remap functionality [2]. However, right now it is not possible to use the MREMAP_DONTUNMAP flag with mappings that aren't private and anonymous, which rules out their use on mapped FDs from libdrm. Due to this, a kernel change would be necessary.
Pro: A uniform solution across all APIs, which could help in the future with any unforeseen need to access host-allocated memory in 32-bit windows code.
Cons: Requires a kernel change, which of all the options may take the longest to get upstreamed and into the hands of users.
Frankly, I think it may be worth looking into this even if we do try to implement another solution for GPU mappings specifically. As you say, it may potentially come in useful in other places.
In fact, in general I think looking into multiple solutions, and being able to fall back from one to another, is not necessarily a bad idea.
Also: it may be worth looking into kernel extensions other than mremap(2). We already have to deal with the problem of reserving the low 2 GB for Win32 memory, and our current solutions to that can cause problems (I was recently bitten by this, in bug 52840 [1]).
A personality switch or pair of switches like "map everything under 2/4 GB" and "prefer mapping above 2/4 GB" would be helpful, so that we can force mappings under 2 GB for NtAllocateVirtualMemory() and GPU mappings, and above 2 GB otherwise. Unlike extending mremap(2), these would be useful for normal allocations as well, i.e. they'd allow us to do a better job of placing system libraries where we want them.
See also below s.v. ADDR_LIMIT_32BIT.
[1] https://bugs.winehq.org/show_bug.cgi?id=52840
- Work with Khronos to introduce extensions into the relevant APIs
enabling us to tell drivers where in the address space we want resources mapped.
Pro: Wouldn't require going behind the driver's back, resulting in a more hardened solution. (Out there, but what if a creative driver returns a mapping without read or write permission and handles accesses through a page fault handler?)
Cons: The extension would have to be implemented by each individual vendor for every relevant API. This would implicitly drop support for those with cards whose graphics drivers are no longer being updated.
- Hook the driver's mmap calls while we invoke its memory-mapping functions, overriding the address to something in the 32-bit address space.
Pro: Unlike the other solutions, this wouldn't require any changes to other projects, and shares the advantage of the first solution.
Cons: Susceptible to breakage if the driver uses its own mapping mechanism separate from mmap (a custom ioctl, or a CPU driver returning something from the heap).
Here are a few other ideas / considerations I think are worth mentioning:
- Reserve the entire address space above 2G (or 3G with the appropriate image flags). This is essentially what we already do for 32-bit programs. I'm not sure whether reserving 2**48 bytes of memory will run into problems, though. Has this been tried?
- Linux has a personality(2) switch ADDR_LIMIT_32BIT. The documentation is terse, so I'm not fully sure what this does, but it might be sufficient to ensure that new mappings are placed under 2 GB, while not breaking old mappings? And presumably it's also toggleable. It's not ideal exactly—we'd like to be able to set a 3 GB or 4 GB limit instead if the binary allows—but it's potentially already usable.
- We can emulate mappings for everything except coherent memory by manually implementing mapping functions with a separate sysmem location. We can implement persistent mappings this way, too, by copying on a flush, but unfortunately we can't expose GL_ARB_buffer_storage without coherent mappings.
[Fortunately d3d doesn't require coherent memory or ARB_buffer_storage, and the Vulkan backend doesn't require coherent memory for map acceleration. The GL backend currently does, but could be made not to. We'd have to add a private extension to use ARB_buffer_storage while not actually marking any maps as coherent. Of course, d3d isn't the only user of GL or Vulkan, and unfortunately ARB_buffer_storage is core in 4.3, so I'm sure there are GL applications out there that rely on it...]
I think we can actually emulate coherent memory as well, by tracking resource bindings and manually flushing on draws. That's a little painful, though.
- Crazy idea: On Linux, parse /proc/self/maps to allow remapping non-anonymous pages. Combined with mremap(2) or manual emulation, this allows mapping everything except for shared anonymous pages [and I can't imagine that a GPU driver would use those, especially given that the only way to make use of the SHARED flag is fork(2)].
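Roughly, the parsing half of that idea might look like the sketch below. Error handling is elided, ptr is assumed page-aligned, and the backing path is assumed to still be accessible (see the follow-ups below about deleted files):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Find the file backing ptr in /proc/self/maps, reopen it, and map the
     * same range again below 2 GB. */
    static void *remap_from_maps(void *ptr, size_t size)
    {
        uint64_t start, end, pgoff;
        char perms[8], path[4096], line[4400];
        FILE *f = fopen("/proc/self/maps", "r");
        void *ret = NULL;

        while (fgets(line, sizeof(line), f))
        {
            path[0] = 0;
            if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %7s %" SCNx64 " %*s %*s %4095[^\n]",
                       &start, &end, perms, &pgoff, path) < 4)
                continue;
            if ((uintptr_t)ptr < start || (uintptr_t)ptr >= end)
                continue;
            if (path[0] == '/') /* file-backed: reopen and map it low */
            {
                int fd = open(path, O_RDWR);
                ret = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_32BIT, fd,
                           pgoff + ((uintptr_t)ptr - start));
                close(fd);
            }
            break;
        }
        fclose(f);
        return ret;
    }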
ἔρρωσθε, Zeb
On Monday, 25 April 2022 at 08:31:51 EAT, Zebediah Figura wrote:
- Reserve the entire address space above 2G (or 3G with the appropriate
image flags). This is essentially what we already do for 32-bit programs. I'm not sure whether reserving 2**48 bytes of memory will run into problems, though. Has this been tried?
Yes, I am doing this in hangover. It does not seem to cost any physical memory (e.g. for allocating page tables), but it takes some time.
What I am doing is this: try to grab every single page below 4 GB, iterating over them at 4k granularity, and note in a bitmap which pages were grabbed that way.
(A hangover-specific step: first load as many host libraries as I think I might need, so that they end up above 4 GB and save address space below.)
Then mmap 1 << x bytes of memory, starting with x = 64 and decrementing x whenever such an mmap fails. Throw away the returned pointer.
Then unmap the < 4 GB pages marked in the bitmap.
The stuff that takes time here is reserving / freeing the low 4 GB. Hangover process creation takes about 2 seconds due to that, which breaks some kernel32 tests. The part of actually grabbing all address space above 4 GB is relatively fast, I think ~0.5 sec.
If we go this way, we could maybe speed up the process by using the preloader to grab the lower 4 GB in one block (similar to the page-zero thing on macOS).
One drawback is that it makes it much harder to use the address space above 4 GB. We *want* that space, just not if we may have to pass a pointer to it to 32-bit code...
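Compressed into code, the scheme looks roughly like the sketch below. The MAP_FIXED_NOREPLACE usage and the constants are my assumptions; the real code also keeps the bitmap mentioned above so that step 3 releases exactly the pages step 1 grabbed:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <sys/mman.h>

    static void reserve_above_4g(void)
    {
        uintptr_t page;
        int x;

        /* 1. Grab every free page below 4 GB (bitmap bookkeeping elided);
         *    MAP_FIXED_NOREPLACE leaves pages already in use alone. */
        for (page = 0x10000; page < 0x100000000ull; page += 0x1000)
            mmap((void *)page, 0x1000, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);

        /* 2. Reserve the biggest block the kernel will hand out; with the
         *    low 4 GB full, it necessarily lands above 4 GB. The pointer is
         *    thrown away on purpose: the reservation itself is the point. */
        for (x = 63; x >= 32; x--)
            if (mmap(NULL, (size_t)1 << x, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0) != MAP_FAILED)
                break;

        /* 3. Unmap the pages grabbed in step 1 (those marked in the bitmap),
         *    freeing the low 4 GB for 32-bit allocations again. */
    }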
On 25.04.22 07:31, Zebediah Figura wrote:
- We can emulate mappings for everything except coherent memory by
manually implementing mapping functions with a separate sysmem location. We can implement persistent mappings this way, too, by copying on a flush, but unfortunately we can't expose GL_ARB_buffer_storage without coherent mappings.
[Fortunately d3d doesn't require coherent memory or ARB_buffer_storage, and the Vulkan backend doesn't require coherent memory for map acceleration. The GL backend currently does, but could be made not to. We'd have to add a private extension to use ARB_buffer_storage while not actually marking any maps as coherent. Of course, d3d isn't the only user of GL or Vulkan, and unfortunately ARB_buffer_storage is core in 4.3, so I'm sure there are GL applications out there that rely on it...]
Obvious performance issues of this solution aside, many Vulkan applications require coherent memory. There is also this requirement in the vk spec:
There *must* be at least one memory type with both the
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bits set in its propertyFlags.
I think we can actually emulate coherent memory as well, by tracking resource bindings and manually flushing on draws. That's a little painful, though.
This is not possible in Vulkan, especially with newer features like buffer device address or update after bind.
- Crazy idea: On Linux, parse /proc/self/maps to allow remapping
non-anonymous pages. Combined with mremap(2) or manual emulation, this allows mapping everything except for shared anonymous pages [and I can't imagine that a GPU driver would use those, especially given that the only way to make use of the SHARED flag is fork(2)].
ἔρρωσθε, Zeb
On 4/25/22 03:48, Georg Lehmann wrote:
On 25.04.22 07:31, Zebediah Figura wrote:
- We can emulate mappings for everything except coherent memory by
manually implementing mapping functions with a separate sysmem location. We can implement persistent mappings this way, too, by copying on a flush, but unfortunately we can't expose GL_ARB_buffer_storage without coherent mappings.
[Fortunately d3d doesn't require coherent memory or ARB_buffer_storage, and the Vulkan backend doesn't require coherent memory for map acceleration. The GL backend currently does, but could be made not to. We'd have to add a private extension to use ARB_buffer_storage while not actually marking any maps as coherent. Of course, d3d isn't the only user of GL or Vulkan, and unfortunately ARB_buffer_storage is core in 4.3, so I'm sure there are GL applications out there that rely on it...]
Obvious performance issues of this solution aside, many Vulkan applications require coherent memory. There is also this requirement in the vk spec:
There *must* be at least one memory type with both the
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bits set in its propertyFlags.
Right, I should clarify, the problem isn't supporting coherent memory, but rather device-visible coherent memory.
Unfortunately, it seems that Vulkan requires this as well. From the Vulkan 1.0.211 specification § 12.6, regarding vkGetBufferMemoryRequirements():
If buffer is a VkBuffer not created with the VK_BUFFER_CREATE_SPARSE_BINDING_BIT bit set, or if image is a linear image, then the memoryTypeBits member always contains at least one bit set corresponding to a VkMemoryType with a propertyFlags that has both the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT bit and the VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bit set. In other words, mappable coherent memory can always be attached to these objects.
Obviously tracking things manually will hurt performance in some applications, but I mention it since we may (or may not!) need a fallback, in order to guarantee we have *some* way of correctly emulating WoW64 maps. Personally I think it's better not to require WoW64 emulation in such cases.
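As a sketch of what that fallback might look like (hypothetical wine-side names; low_malloc() is an assumed allocator returning memory below 4 GB; dirty-range tracking and partial flushes are elided):

    #include <stddef.h>
    #include <string.h>

    void *low_malloc(size_t size); /* hypothetical: allocates below 4 GB */

    struct shadow_map
    {
        void   *host_ptr; /* 64-bit pointer the driver gave us */
        void   *low_copy; /* shadow allocation below 4 GB */
        size_t  size;
    };

    /* Hand 32-bit code a low shadow copy instead of the host mapping. */
    static void *shadow_map_memory(struct shadow_map *m, void *host_ptr, size_t size)
    {
        m->host_ptr = host_ptr;
        m->size = size;
        m->low_copy = low_malloc(size);
        memcpy(m->low_copy, host_ptr, size);
        return m->low_copy;
    }

    /* On flush or unmap, propagate the guest's writes back to the real map. */
    static void shadow_map_flush(struct shadow_map *m)
    {
        memcpy(m->host_ptr, m->low_copy, m->size);
    }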
I think we can actually emulate coherent memory as well, by tracking resource bindings and manually flushing on draws. That's a little painful, though.
This is not possible in Vulkan, especially with newer features like buffer device address or update after bind.
Interesting. What extensions do these correspond to?
On Mon, Apr 25, 2022 at 12:31:51AM -0500, Zebediah Figura wrote:
- Linux has a personality(2) switch ADDR_LIMIT_32BIT. The documentation is
terse, so I'm not fully sure what this does, but it might be sufficient to ensure that new mappings are placed under 2 GB, while not breaking old mappings? And presumably it's also toggleable. It's not ideal exactly—we'd like to be able to set a 3 GB or 4 GB limit instead if the binary allows—but it's potentially already usable.
FWIW, currently this only appears to affect the alpha and arm architectures. That's not to say we couldn't try to get something in for x86. If we want something more useful, we'd likely be better off adding to the prctl(2) interface.
Huw.
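For reference, using the switch would look something like this minimal sketch, assuming it gained an effect on x86-64 (which, per the above, it currently has not):

    #include <sys/mman.h>
    #include <sys/personality.h>

    static void *map_below_4g(size_t size)
    {
        int old = personality(0xffffffff);   /* query the current persona */
        void *ptr;

        personality(old | ADDR_LIMIT_32BIT); /* new mappings go below 4 GB */
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        personality(old);                    /* restore; old maps unaffected */
        return ptr;
    }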
On 25.04.22 01:31, Zebediah Figura wrote:
On 4/24/22 21:18, Derek Lesho wrote:
Hi All,
In the wake of the new WOW64 implementation (recent explanation [1]), there has been discussion in informal channels about how we are going to handle pointers to mapped graphics resource memory that we receive from the graphics API, since they may fall outside the 32-bit address space.
Over time, a few creative solutions have been proposed and discussed, with a common theme being that we need changes in either the kernel or the graphics drivers to do this properly. As we already know the requirements for a solution to this problem, I think it would be responsible to hash this out now and engage the relevant project maintainers early, so as to avoid blocking work on the Wine side for too long and to possibly let more users test the new path sooner.
Thank you for starting this conversation! I agree with all of these points. WoW64 emulation is still a long way off, if it'll even happen by default on platforms other than Mac, but nevertheless this is something we should look into supporting sooner rather than later.
It would probably be good to start a dri-devel/mesa-dev thread to discuss this as well.
Agreed, I just filed a feature request at the Vulkan-Docs repo so that we can also hear the opinions of those working on non-mesa drivers like NV.
https://github.com/KhronosGroup/Vulkan-Docs/issues/1832
- Work with Khronos to introduce extensions into the relevant APIs
enabling us to tell drivers where in the address space we want resources mapped.
Pro: Wouldn't require going behind the driver's back, resulting in a more hardened solution. (Out there, but what if a creative driver returns a mapping without read or write permission and handles accesses through a page fault handler?)
Cons: The extension would have to be implemented by each individual vendor for every relevant API. This would implicitly drop support for those with cards whose graphics drivers are no longer being updated.
- Hook the driver's mmap calls while we invoke its memory-mapping functions, overriding the address to something in the 32-bit address space.
Pro: Unlike the other solutions, this wouldn't require any changes to other projects, and shares the advantage of the first solution.
Cons: Susceptible to breakage if the driver uses its own mapping mechanism separate from mmap (a custom ioctl, or a CPU driver returning something from the heap).
Here are a few other ideas / considerations I think are worth mentioning:
- Reserve the entire address space above 2G (or 3G with the
appropriate image flags). This is essentially what we already do for 32-bit programs. I'm not sure whether reserving 2**48 bytes of memory will run into problems, though. Has this been tried?
- Linux has a personality(2) switch ADDR_LIMIT_32BIT. The
documentation is terse, so I'm not fully sure what this does, but it might be sufficient to ensure that new mappings are placed under 2 GB, while not breaking old mappings? And presumably it's also toggleable. It's not ideal exactly—we'd like to be able to set a 3 GB or 4 GB limit instead if the binary allows—but it's potentially already usable.
- We can emulate mappings for everything except coherent memory by
manually implementing mapping functions with a separate sysmem location. We can implement persistent mappings this way, too, by copying on a flush, but unfortunately we can't expose GL_ARB_buffer_storage without coherent mappings.
[Fortunately d3d doesn't require coherent memory or ARB_buffer_storage, and the Vulkan backend doesn't require coherent memory for map acceleration. The GL backend currently does, but could be made not to. We'd have to add a private extension to use ARB_buffer_storage while not actually marking any maps as coherent. Of course, d3d isn't the only user of GL or Vulkan, and unfortunately ARB_buffer_storage is core in 4.3, so I'm sure there are GL applications out there that rely on it...]
I think we can actually emulate coherent memory as well, by tracking resource bindings and manually flushing on draws. That's a little painful, though.
- Crazy idea: On Linux, parse /proc/self/maps to allow remapping
non-anonymous pages. Combined with mremap(2) or manual emulation, this allows mapping everything except for shared anonymous pages [and I can't imagine that a GPU driver would use those, especially given that the only way to make use of the SHARED flag is fork(2)].
Would this still work if the driver closed the FD after mmap-ing it?
ἔρρωσθε, Zeb
- Crazy idea: On Linux, parse /proc/self/maps to allow remapping
non-anonymous pages. Combined with mremap(2) or manual emulation, this allows mapping everything except for shared anonymous pages [and I can't imagine that a GPU driver would use those, especially given that the only way to make use of the SHARED flag is fork(2)].
Would this still work if the driver closed the FD after mmap-ing it?
Yes, procfs still displays the mapping in that case. Note that what's listed is the path (and inode), so we would reopen the path and then map it.
Hi,
On 25/04/22 20:20, Zebediah Figura wrote:
Yes, procfs still displays the mapping in that case. Note that what's listed is the path (and inode), so we would reopen the path and then map it.
Couldn't the driver have deleted the file too? Or can you create a new link to a file given its inode? I couldn't find how after a brief internet search.
Giovanni.
On 4/28/22 03:10, Giovanni Mascellani wrote:
Hi,
On 25/04/22 20:20, Zebediah Figura wrote:
Yes, procfs still displays the mapping in that case. Note that what's listed is the path (and inode), so we would reopen the path and then map it.
Couldn't the driver have deleted the file too?
Indeed, that is possible. I don't think it'd happen in practice, but it's one of the reasons it might be nice to have a fallback like I had mentioned.
Or can you create a new link to a file given its inode? I couldn't find how after a brief internet search.
I don't think you can do this; it would be a security hole.
On 25.04.22 12:38, Derek Lesho wrote:
On 25.04.22 01:31, Zebediah Figura wrote:
On 4/24/22 21:18, Derek Lesho wrote:
Hi All,
In the wake of the new WOW64 implementation (recent explanation [1]), there has been discussion in informal channels about how we are going to handle pointers to mapped graphics resource memory that we receive from the graphics API, since they may fall outside the 32-bit address space.
Over time, a few creative solutions have been proposed and discussed, with a common theme being that we need changes in either the kernel or the graphics drivers to do this properly. As we already know the requirements for a solution to this problem, I think it would be responsible to hash this out now and engage the relevant project maintainers early, so as to avoid blocking work on the Wine side for too long and to possibly let more users test the new path sooner.
Thank you for starting this conversation! I agree with all of these points. WoW64 emulation is still a long way off, if it'll even happen by default on platforms other than Mac, but nevertheless this is something we should look into supporting sooner rather than later.
It would probably be good to start a dri-devel/mesa-dev thread to discuss this as well.
Agreed, I just filed a feature request at the Vulkan-Docs repo so that we can also hear the opinions of those working on non-mesa drivers like NV.
It looks like Jason Ekstrand has drafted two extension approaches for us here and would like to know which one would suit us best; he is even willing to write the extension text for us.
As explained in the thread, the two approaches are
1) Introduce a MAP_32BIT flag to vkMapMemory which the driver would forward to mmap.
2) Read the ppData parameter used to return the memory mapping as a suggestion for the mapping location, similar to how BaseAddress is used in NtAllocateVirtualMemory.
I think the more flexible second solution would be best for us, as it allows us to handle the LAA case, but what do you guys think?
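To make the second proposal concrete, a purely illustrative sketch; neither extension exists yet, and the in/out treatment of ppData is the drafted proposal, not current Vulkan behavior:

    #include <stdint.h>
    #include <vulkan/vulkan.h>

    /* Hypothetical: ask the driver to map 'memory' at 'hint', an address in
     * space we reserved below 4 GB, mirroring BaseAddress semantics in
     * NtAllocateVirtualMemory(). */
    static void *map_memory_low(VkDevice device, VkDeviceMemory memory, void *hint)
    {
        void *ptr = hint; /* in/out under the proposed extension */

        if (vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &ptr) != VK_SUCCESS)
            return NULL;
        if ((uintptr_t)ptr >= 0x100000000ull) /* hint not honoured */
        {
            vkUnmapMemory(device, memory);
            return NULL;
        }
        return ptr;
    }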
Derek Lesho dlesho@codeweavers.com writes:
On 25.04.22 12:38, Derek Lesho wrote:
On 25.04.22 01:31, Zebediah Figura wrote:
On 4/24/22 21:18, Derek Lesho wrote:
Hi All,
In the wake of the new WOW64 implementation (recent explanation [1]), there has been discussion in informal channels about how we are going to handle pointers to mapped graphics resource memory that we receive from the graphics API, since they may fall outside the 32-bit address space.
Over time, a few creative solutions have been proposed and discussed, with a common theme being that we need changes in either the kernel or the graphics drivers to do this properly. As we already know the requirements for a solution to this problem, I think it would be responsible to hash this out now and engage the relevant project maintainers early, so as to avoid blocking work on the Wine side for too long and to possibly let more users test the new path sooner.
Thank you for starting this conversation! I agree with all of these points. WoW64 emulation is still a long way off, if it'll even happen by default on platforms other than Mac, but nevertheless this is something we should look into supporting sooner rather than later.
It would probably be good to start a dri-devel/mesa-dev thread to discuss this as well.
Agreed, I just filed a feature request at the Vulkan-Docs repo so that we can also hear the opinions of those working on non-mesa drivers like NV.
It looks like Jason Ekstrand has drafted two extension approaches for us here and would like to know which one would suit us best; he is even willing to write the extension text for us.
As explained in the thread, the two approaches are
- Introduce a MAP_32BIT flag to vkMapMemory which the driver would
forward to mmap.
- Read the ppData parameter used to return the memory mapping as a
suggestion for the mapping location, similar to how BaseAddress is used in NtAllocateVirtualMemory.
I think the more flexible second solution would be best for us, as it allows us to handle the LAA case, but what do you guys think?
I don't think that's sufficient, because there's no way to ensure that the address that we picked is still available.
What we would want is to have it map into already reserved memory, which would require the driver to use our specified address with MAP_FIXED. Also it would have to avoid calling munmap() on free and let us take care of remapping anonymous memory.
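In code, that contract might look roughly like this (a sketch only; 'reserved' is assumed to be a PROT_NONE range Wine set aside below 4 GB):

    #include <sys/types.h>
    #include <sys/mman.h>

    /* Driver side: MAP_FIXED atomically replaces part of the reservation. */
    static void *map_into_reservation(void *reserved, size_t size, int fd, off_t offset)
    {
        return mmap(reserved, size, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, offset);
    }

    /* On free the driver must NOT munmap(); Wine turns the range back into
     * a reservation itself so nothing else can claim the address. */
    static void restore_reservation(void *reserved, size_t size)
    {
        mmap(reserved, size, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_NORESERVE, -1, 0);
    }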
On 25.04.22 04:18, Derek Lesho wrote:
Hi All,
In the wake of the new WOW64 implementation (recent explanation [1]), there has been discussion in informal channels about how we are going to handle pointers to mapped graphics resource memory that we receive from the graphics API, since they may fall outside the 32-bit address space.
Over time, a few creative solutions have been proposed and discussed, with a common theme being that we need changes in either the kernel or the graphics drivers to do this properly. As we already know the requirements for a solution to this problem, I think it would be responsible to hash this out now and engage the relevant project maintainers early, so as to avoid blocking work on the Wine side for too long and to possibly let more users test the new path sooner.
The solutions which I've seen laid out so far:
- Use the mremap(2) interface, allowing us to duplicate the mapping we
receive into the 32-bit address space. This solution would match what is already done for CrossOver Mac's 32on64 support using Mac's mach_vm_remap functionality [2]. However, right now it is not possible to use the MREMAP_DONTUNMAP flag with mappings that aren't private and anonymous, which rules out their use on mapped FDs from libdrm. Due to this, a kernel change would be necessary.
Pro: A uniform solution across all APIs, which could help in the future with any unforeseen need to access host-allocated memory in 32-bit windows code.
Cons: Requires a kernel change, which of all the options may take the longest to get upstreamed and into the hands of users.
If we can be sure this works for every possible situation, this seems like the best idea that has come up. Maybe even the only feasible one, apart from reserving all non-32-bit address space as zf suggested.
If we can't be sure, then this isn't much better than the third idea; we shouldn't make assumptions about what drivers are doing.
- Work with Khronos to introduce extensions into the relevant APIs
enabling us to tell drivers where in the address space we want resources mapped.
Pro: Wouldn't require going behind the driver's back, resulting in a more hardened solution. (Out there, but what if a creative driver returns a mapping without read or write permission and handles accesses through a page fault handler?)
I'm not sure if relying on page faults is something drivers can even do; I don't think we should worry much about this hypothetical case.
Cons: The extension would have to be implemented by each individual vendor for every relevant API. This would implicitly drop support for those with cards whose graphics drivers are no longer being updated.
Another downside of this approach is that I'm not sure you can even get a new cross-vendor extension for OpenGL these days. And what about other APIs like OpenCL or CUDA?
Also, I'm not sure going through Khronos is really faster than the kernel idea. And how many Khronos vendors even care about an extension that's only useful for Wine? It's also unclear how much work in the drivers this requires; those with a custom mapping mechanism might even need kernel changes.
- Hook the driver's mmap calls while we invoke its memory-mapping functions, overriding the address to something in the 32-bit address space.
Pro: Unlike the other solutions, this wouldn't require any changes to other projects, and shares the advantage of the first solution.
Cons: Susceptible to breakage if the driver uses its own mapping mechanism separate from mmap (a custom ioctl, or a CPU driver returning something from the heap).
I don't like this idea; we shouldn't guess what the drivers are doing. It will only cause issues in the future if driver behavior changes.
1: https://www.winehq.org/pipermail/wine-devel/2022-April/213677.html
2: https://www.codeweavers.com/crossover/source - see function `remap_memory` in `wine/dlls/winemac.drv/opengl.c`
On Sunday, April 24, 2022 7:18:49 PM PDT Derek Lesho wrote:
The solutions which I've seen laid out so far:
- Use the mremap(2) interface, allowing us to duplicate the mapping we
receive into the 32-bit address space. This solution would match what is already done for CrossOver Mac's 32on64 support using Mac's mach_vm_remap functionality [2]. However, right now it is not possible to use the MREMAP_DONTUNMAP flag with mappings that aren't private and anonymous, which rules out their use on mapped FDs from libdrm. Due to this, a kernel change would be necessary.
This doesn't sound at all safe, since it's essentially moving the memory mapping out from under the driver. MREMAP_DONTUNMAP avoids unmapping the original memory space, but that memory space becomes all but invalid: it page-faults when accessed and is otherwise zero-filled to satisfy accesses:
MREMAP_DONTUNMAP (since Linux 5.7) ... After completion, any access to the range specified by old_address and old_size will result in a page fault. The page fault will be handled by a userfaultfd(2) handler if the address is in a range previously registered with userfaultfd(2). Otherwise, the kernel allocates a zero-filled page to handle the fault.
You can't know what a driver will do with mapped memory or pointer addresses it returns to the application, or where such memory comes from, so you can't be sure it doesn't have some bookkeeping with it or does manual copying using a cached pointer instead of the remapped location. You also can't know if it's using a preallocated pool that it returns to the app when "mapping" and reuses after "unmapping".
What you'd need for something like this is a method to duplicate a memory mapping, leaving the original intact instead of wiping it, so different pages/addresses refer to the same underlying hardware memory. There doesn't seem to be an option for that, currently.
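For concreteness, a tiny demonstration of the move-and-zero-fill behaviour described above (Linux 5.7 or later, private anonymous memory, error checks elided):

    #define _GNU_SOURCE
    #include <assert.h>
    #include <sys/mman.h>

    int main(void)
    {
        char *src = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *dst = mmap(NULL, 4096, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* target slot */

        src[0] = 42;
        dst = mremap(src, 4096, 4096,
                     MREMAP_MAYMOVE | MREMAP_FIXED | MREMAP_DONTUNMAP, dst);
        assert(dst != MAP_FAILED);
        assert(dst[0] == 42); /* the pages moved with the mapping... */
        assert(src[0] == 0);  /* ...and the old range now zero-fills */
        return 0;
    }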
- Work with Khronos to introduce extensions into the relevant APIs
enabling us to tell drivers where in the address space we want resources mapped.
This seems like the only real option to me. It's the only way to be sure a driver knows what you actually want and won't have its assumptions broken when its memory mappings are changed out from under it. This can also be useful in other non-Wine situations where a 64-bit app is running native 32-bit code and needs GPU memory in the lower 32-bit address space. It can even tell the driver to keep unmappable memory (memory the app itself won't directly access) out of the 32-bit address space to leave more for the 32-bit code, where in a pure 64-bit process such a thing wouldn't matter and it may not bother to.
- Hook the driver's mmap calls while we invoke its memory-mapping functions, overriding the address to something in the 32-bit address space.
Similar to point 1, you can't be sure how the driver handles memory mapping. It could have preallocated memory that mapping simply returns a chunk of, meaning there wouldn't be an mmap call during the mapping function since it was done some time earlier. On 64-bit systems, the driver could also use a memory management style that's more efficient with a large address space instead of a smaller one. If you simply force 32-bit addresses on the driver, it could make the driver's memory management less efficient or be more wasteful with the already-limited 32-bit address space. Explicitly telling the driver you want 32-bit addresses for mapped memory would ensure the driver knows it needs to be more frugal with mappable memory.
On Tue, Apr 26, 2022 at 12:04 AM Chris Robinson chris.kcat@gmail.com wrote:
On Sunday, April 24, 2022 7:18:49 PM PDT Derek Lesho wrote:
The solutions which I've seen laid out so far:
- Use the mremap(2) interface, allowing us to duplicate the mapping we
receive into the 32-bit address space. This solution would match what is already done for CrossOver Mac's 32on64 support using Mac's mach_vm_remap functionality [2]. However, right now it is not possible to use the MREMAP_DONTUNMAP flag with mappings that aren't private and anonymous, which rules out their use on mapped FDs from libdrm. Due to this, a kernel change would be necessary.
This doesn't sound at all safe, since it's essentially moving the memory mapping out from under the driver. MREMAP_DONTUNMAP avoids unmapping the original memory space, but that memory space becomes all but invalid: it page-faults when accessed and is otherwise zero-filled to satisfy accesses:
MREMAP_DONTUNMAP (since Linux 5.7) ... After completion, any access to the range specified by old_address and old_size will result in a page fault. The page fault will be handled by a userfaultfd(2) handler if the address is in a range previously registered with userfaultfd(2). Otherwise, the kernel allocates a zero-filled page to handle the fault.
You can't know what a driver will do with mapped memory or pointer addresses it returns to the application, or where such memory comes from, so you can't be sure it doesn't have some bookkeeping with it or does manual copying using a cached pointer instead of the remapped location. You also can't know if it's using a preallocated pool that it returns to the app when "mapping" and reuses after "unmapping".
What you'd need for something like this is a method to duplicate a memory mapping, leaving the original intact instead of wiping it, so different pages/addresses refer to the same underlying hardware memory. There doesn't seem to be an option for that, currently.
Indeed, that's what CrossOver's macOS hack does at the moment, by means of the mach_vm_remap() API. It sounds like there is currently no equivalent functionality on Linux. I have no clue whether the Linux memory management infrastructure makes supporting that particularly hard, but otherwise I wouldn't expect any opposition in principle to introducing the support we need into the kernel.
Personally I find this to be the only really practical strategy, at least among those that have been mentioned. As Zeb pointed out, that's not to say there won't be any use in also pursuing other options though.
On 4/25/22 16:51, Chris Robinson wrote:
- Hook the driver's mmap calls while we invoke its memory-mapping functions, overriding the address to something in the 32-bit address space.
Similar to point 1, you can't be sure how the driver handles memory mapping. It could have preallocated memory that mapping simply returns a chunk of, meaning there wouldn't be an mmap call during the mapping function since it was done some time earlier. On 64-bit systems, the driver could also use a memory management style that's more efficient with a large address space instead of a smaller one. If you simply force 32-bit addresses on the driver, it could make the driver's memory management less efficient or be more wasteful with the already-limited 32-bit address space. Explicitly telling the driver you want 32-bit addresses for mapped memory would ensure the driver knows it needs to be more frugal with mappable memory.
This is a good point. We've already been bitten by drivers *not* being frugal with mappable memory even when the process is truly a 32-bit one. The Mac GL driver is the obvious offender here, but I've also seen this become a problem with radeonsi. So assuming we can get drivers to care in the first place, we should implement this to ensure that they can make that decision on WoW64 as well.
N.B. as stated elsewhere in this thread, I don't necessarily think this is the *only* option we should pursue—many of the solutions have benefits outside of the specific problem posed in Derek's original post—but it does currently seem like the ideal code path for GPU mappings specifically.