My hardware is Ryzen 7 4700u with the integrated graphics
Flatout (Direct3D 9): 20 fps (renders correctly)
Unigine Heaven (OpenGL): ~40-60 fps (renders correctly)
Unigine Heaven (Direct3D 9): 20-30 fps (renders correctly)
Unigine Heaven (Direct3D 11): 20-30 fps as well (renders correctly)
Elder Scrolls IV (Direct3D 9): 20 fps (renders correctly)
BeamNG Tech Demo 0.3 (Direct3D 9): 2 fps (renders correctly, but still runs poorly)
Massive step up from getting 2 fps across many wined3d games, but it's still pretty bad, ~~and sometimes runs worse than the original code~~. Now with a combination of using the new and old code dynamically you can get the best of both worlds!
Unfortunately, we lose the ability to get lucky with the mapping just happening to be in the 32 bit range.
--
v7: opengl32: speed up wow64 mapping.
https://gitlab.winehq.org/wine/wine/-/merge_requests/5145
(For after the code freeze.)
I think we want to have this DLL living in Wine for easier development, and probably dynamically load our custom Chromium fork from here. (just like MSHTML and wine-Gecko)
The code for that fork could then be created in its own repo.
--
v7: embeddedbrowserwebview: Create CreateWebViewEnvironmentWithOptionsInternal stub.
embeddedbrowserwebview: Add stub dll.
https://gitlab.winehq.org/wine/wine/-/merge_requests/7032
Wine-Bug: https://bugs.winehq.org/show_bug.cgi?id=54194
If the SendMessageTimeout call takes a long time, we can get other
messages which also set the observed wparam value. Apparently,
this is especially likely on Windows 7.
This also removes the (wParam == 0xbaadbeef) check which may have
been intended to serve the same goal but doesn't work because the
observed wParam value is still assigned.
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/3862
x86_64 Windows and macOS both use `%gs` to access thread-specific data (Windows TEB, macOS TSD). To date, Wine has worked around this conflict by filling the most important TEB fields (`0x30`/`Self`, `0x58`/`ThreadLocalStorage`) in the macOS TSD structure (Apple reserved the fields for our use). This was sufficient for most Windows apps.
CrossOver's Wine had an additional hack to handle `0x60`/`ProcessEnvironmentBlock`, and binary patches for certain CEF binaries which directly accessed `0x8`/`StackBase`. Additionally, Apple's libd3dshared could activate a special mode in Rosetta 2 where code executing in certain regions would use the Windows TEB when accessing `%gs`.
Now that the PE separation is complete, GSBASE can be swapped when entering/exiting PE code. This is done in the syscall dispatcher, unix-call dispatcher, and for user-mode callbacks. GSBASE also needs to be set to the macOS TSD when entering signal handlers (in `init_handler()`), and then restored to the Windows TEB when exiting (in `leave_handler()`). There is a special-case needed in `usr1_handler`: when inside a syscall (on the kernel stack), GSBASE may need to be reset to either the TEB or the TSD. The only way to tell is to determine what GSBASE was set to on entry to the signal handler.
---
macOS does not have a public API for setting GSBASE, but the private `_thread_set_tsd_base()` works and was added in macOS 10.12.
`_thread_set_tsd_base()` is a small thunk that sets `%esi`, `%eax`, and does the `syscall`: https://github.com/apple-oss-distributions/xnu/blob/8d741a5de7ff4191bf97d57….
The syscall instruction itself clobbers `%rcx` and `%r11`.
I've tried to save as few registers as possible when calling `_thread_set_tsd_base()`, but there may be room for improvement there.
---
I've tested this successfully on macOS 15 (Apple Silicon and Intel) with several apps and games, including the `cefclient.exe` CEF sample.
I still need to test this patch on macOS 10.13, and I'd also like to do some performance testing.
---
I also tested an alternate implementation strategy for this which took advantage of the expanded "full" thread state which is passed to signal handlers when a process has set a user LDT. The full thread state includes GSBASE, so GSBASE is set back to whatever is in the sigcontext on return (like every other field in the context). This would avoid needing to explicitly reset GSBASE in `leave_handler()`, and avoid the special-case in `usr1_handler()`.
This strategy was simpler, but I'm not using it for 2 reasons:
- the "full" thread state is only available starting with macOS 10.15, and we still support 10.13.
- more crucially, Rosetta 2 doesn't seem to correctly implement the GS.base field of the full thread state. It's set to 0 on entry, and isn't read on exit.
--
v3: ntdll: Remove x86_64 Mac-specific TEB access workarounds that are no longer needed.
ntdll: On macOS x86_64, swap GSBASE between the TEB and macOS TSD when entering/leaving PE code.
ntdll: Ensure init_handler runs in signal handlers before any compiler-generated memset calls.
ntdll: Remove ugly fallback method for getting a thread's GSBASE on macOS.
ntdll: Leave kernel stack before accessing %gs in x86_64 syscall dispatcher.
ntdll: Do %gs accesses before switching to kernel stack in x86_64 syscall dispatcher.
https://gitlab.winehq.org/wine/wine/-/merge_requests/6866
This introduces a faster implementation of signal and wait operations on NT
events, semaphores, and mutexes, which improves performance to native levels for
a wide variety of games and other applications.
The goal here is similar to the long-standing out-of-tree "esync" and "fsync"
patch sets, but without the flaws that make those patch sets not upstreamable.
The Linux "ntsync" driver is not currently released. It has been accepted into
the trunk Linux tree for 6.14, so barring any extraordinary circumstances, the
API is frozen and it will be released in its current form in about 2 months.
Since it has passed all relevant reviewers on the kernel side, and the API is
all but released, it seems there is no reason any more not to submit the Wine
side to match.
Some important notes:
* This patch set does *not* include any way to disable ntsync support, since
that kind of configuration often seems to be dispreferred where not necessary.
In essence, ntsync should just work everywhere.
Probably the easiest way to effectively disable ntsync, for the purposes of
testing, is to chmod the /dev/ntsync device to prevent its being opened.
Regardless, a Wine switch to disable ntsync can be added simply enough. Note
that it should probably not take the form of a registry key, however, since it
needs to be easily accessible from the server itself.
* It is, generally speaking, not possible for only some objects, or some
processes, to have backing ntsync objects, while others use the old server
path. The esync/fsync patch sets explicitly protected against this by making
sure every process had a consistent view of whether esync was enabled. This is
not provided here, since no switch is provided to toggle ntsync, and it should
not be possible to get into such an inconsistent state without gross
misconfiguration.
* Similarly, no diagnostic messages are provided to note that ntsync is in use,
or not in use. These messages are part of esync/fsync, as well as part of
ntsync "testing" trees unofficially distributed. However, if ntsync is working
correctly, no message should be necessary.
The basic structure is:
* Each type of server object which can be waited on by the client (including
events, semaphores, mutexes, but also other types such as processes, files)
must store an "inproc_sync" object.
This "inproc_sync" object is a full server object (note that this differs from
esync/fsync). A vector and server request is introduced to retrieve an NT
handle to this object from an arbitrary NT handle.
Since the actual ntsync objects are simply distinct file descriptions, we then
call get_handle_fd from the client to retrieve an fd to the object, and then
perform ioctls on it.
* Objects signaled by the server (processes, files, etc) perform ntsync ioctls
on that object. The backing object in all such cases is simply an event.
* Signal and wait operations on the client side attempt to defer to an
"inproc_\*" function, falling back to the server implementation if it returns
STATUS_NOT_IMPLEMENTED. This mirrors how in-process synchronization objects
(critical sections, SRW locks, etc) used to be implemented—attempting to use
an architecture-specific "fast_\*" function and falling back if it returned
STATUS_NOT_IMPLEMENTED.
* The inproc_sync handles, once retrieved, are cached per-process. This caching
takes a similar form to the fd cache. It does not reuse the same
infrastructure, however.
The primary reason for this is that the fd cache is designed to fit within a
64-bit value and uses 64-bit atomic operations to ensure consistency. However,
we need to store more than 64 bits of information. [We also need to modify
them after caching, in order to correctly implement handle closing—see below.]
The secondary reason is that retrieving the ntsync fd from the inproc_sync
handle itself uses the fd cache.
* In order to keep the Linux driver simple, it does not implement access flags
(EVENT_MODIFY_STATE etc.) Instead, the flags are cached locally and validated
there. This too mirrors the fd cache. Note that this means that a malicious
process can now modify objects it should not be able modify—which is less true
than it is with wineserver—but this is no different from the way other objects
(notably fds) are handled, and would require manual syscalls.
* In order to achieve correct behaviour related to closing objects while they
are used, this patch set essentially relies on refcounting. This is broadly
true of the server as well, but because we need to avoid server calls when
performing object operations, significantly more care must be taken.
In particular, because waits need to be interruptable by signals and then be
restarted, we need the backing ntsync object to remain valid until all users
are done with it. On a process level, this is achieved by letting multiple
processes own handles to the underlying inproc_sync server object.
On a thread level, multiple simultaneous calls need to refcount the process's
local handle. This refcount is stored in the sync object cache. When it
reaches zero, the cache is cleared.
Punting this behaviour to the Linux driver would have introduced a great deal
more complexity, which is best kept in userspace and out of the kernel.
* The cache is, as such, treated as a cache. The penultimate commit, which
introduces client support but does not yet cache the objects, effectively
illustrates this by never actually caching anything, and retrieving a new NT
handle and fd every time.
* Certain waits, on internal handles (such as async, startup_info, completion),
are delegated to the server even when ntsync is used. Those server objects do
not create an underlying ntsync object.
--
v4: ntdll: Cache in-process synchronization objects.
ntdll: Use server_wait_for_object() when waiting on only the queue object.
ntdll: Use in-process synchronization objects.
ntdll: Introduce a helper to wait on an internal server handle.
server: Allow creating an event object for client-side user APC signaling.
server: Introduce select_inproc_queue and unselect_inproc_queue requests.
server: Add a request to retrieve the in-process synchronization object from a handle.
server: Create in-process synchronization objects for fd-based objects.
server: Create in-process synchronization objects for timers.
server: Create in-process synchronization objects for threads.
server: Create in-process synchronization objects for message queues.
server: Create in-process synchronization objects for jobs.
server: Create in-process synchronization objects for processes.
server: Create in-process synchronization objects for keyed events.
server: Create in-process synchronization objects for device managers.
server: Create in-process synchronization objects for debug objects.
server: Create in-process synchronization objects for console servers.
server: Create in-process synchronization objects for consoles.
server: Create in-process synchronization objects for completion ports.
server: Create in-process synchronization objects for mutexes.
server: Create in-process synchronization objects for semaphores.
server: Create in-process synchronization objects for events.
server: Add an object operation to retrieve an in-process synchronization object.
ntdll: Retrieve and cache an ntsync device in wait calls.
ntdll: Add stub functions for in-process synchronization.
ntdll: Add some traces to synchronization methods.
This merge request has too many patches to be relayed via email.
Please visit the URL below to see the contents of the merge request.
https://gitlab.winehq.org/wine/wine/-/merge_requests/7226
As shown by the testbot, doubling is not always sufficient.
--
v2: iphlpapi/tests: Call GetExtendedTcp/UdpTable() in a loop.
iphlpapi/tests: Call GetAdaptersAddresses() in a loop.
https://gitlab.winehq.org/wine/wine/-/merge_requests/3833