[Bug 59333] New: VNyan leaks memory (race in __wine_syscall_dispatcher $rsp switching breaks Unity GC)
http://bugs.winehq.org/show_bug.cgi?id=59333

            Bug ID: 59333
           Summary: VNyan leaks memory (race in __wine_syscall_dispatcher $rsp switching breaks Unity GC)
           Product: Wine
           Version: 11.0
          Hardware: x86-64
                OS: Linux
            Status: UNCONFIRMED
          Severity: major
          Priority: P2
         Component: ntdll
          Assignee: wine-bugs@list.winehq.org
          Reporter: lina@lina.yt
      Distribution: ---

Created attachment 80270
  --> http://bugs.winehq.org/attachment.cgi?id=80270
Debug & gdb logs

VNyan (https://suvidriel.itch.io/vnyan) under recent versions of Wine starts leaking memory at a random time (from minutes to hours after startup). It took a few days of debugging and reverse engineering, but I tracked this down to an issue with Unity's GC observing a weird stack switch in Wine.

For reference, Proton 8-5 seems to be fine and GE Proton 10-15 triggers the problem. The problem reproduces with plain `wine-11.0 (Staging)` (wine-11.0-2.fc43.x86_64 from Fedora 43), which is what I used for the traces in this bug.

This is the Win32 threads code in the Unity fork of bdwgc:

https://github.com/Unity-Technologies/bdwgc/blob/unity-master/win32_threads....

Normally, bdwgc will use SuspendThread() on all application threads (see GC_suspend()) and then call GetThreadContext() to retrieve the thread context. It then uses the sp value to determine which stack segment to mark as roots for GC. The logic is complex and there is a safety check in case sp is out of bounds (then it collects the whole stack).

However, on line 1695 above, sp is blindly subtracted from thread->stack_base and used to compute the stack usage in bytes, without any bounds checks. This means that when sp ends up above the stack, the subtraction overflows and returns a bogus huge stack size. This ends up throwing off the GC collection threshold, and the GC never runs again.
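To make the unchecked subtraction concrete, here is a minimal C sketch. It is not bdwgc's code: the struct, the function names, and the stack_base value 0x7a8c0000 are made up for illustration, and 64-bit unsigned arithmetic is assumed. Only the two sp values come from the traces in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the GC's per-thread record (not bdwgc's struct). */
struct gc_thread_model {
    uint64_t stack_base; /* highest address of the thread's stack */
};

/* Unchecked: same shape as the subtraction described in the report.
 * If sp is above stack_base, the unsigned result wraps to a huge value. */
uint64_t stack_usage_unchecked(const struct gc_thread_model *t, uint64_t sp)
{
    return t->stack_base - sp;
}

/* Checked variant: reports failure instead of returning a wrapped size. */
int stack_usage_checked(const struct gc_thread_model *t, uint64_t sp,
                        uint64_t *usage)
{
    assert(usage != NULL);
    if (sp > t->stack_base) {
        *usage = 0;
        return 0; /* sp is not on this stack */
    }
    *usage = t->stack_base - sp;
    return 1;
}
```

With the assumed base of 0x7a8c0000, the pre-bug sp 0x7a8bf3e8 gives 0xc18 bytes of usage, while the bogus sp 0x12938eab0 wraps around to a value near 2^64, which is the kind of number that would throw off a threshold computed from it.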
During a normal stop-the-world GC ([1] in vnyan_debug.txt), all threads are stopped with $rip in Windows code (0x00006fffffxxxxxx) (other than the main thread which is in 0x000000018xxxxxx because I patched the Mono DLL to not relocate, for consistent debugging). Most threads are in NtWaitForSingleObject or NtWaitForAlertByThreadId, a couple in NtDelayExecution/NtNotifyChangeKey/NtUserMsgWaitForMultipleObjectsEx/NtWaitForMultipleObjects, and two threads in a mmdevapi wine_unix_call().

When the bug occurs, one thread is caught in the middle of UNIX code ([2] in vnyan_debug.txt). The stack pointer changes from this (pre-bug):

    Thread 82 (Thread 32.0x2c0):
    $201 = 0x7a8bf3e8

To this (during bug):

    Thread 82 (Thread 32.0x2c0):
    $295 = 0x12938eab0

The GC logs this warning (needs patches to enable debug logs):

    --> Marking for collection #2607 after 3617216 allocated bytes
    Marked from 150 dirty pages
    GC Warning: Thread stack pointer 000000012938EAB0 out of range, pushing everything
    Pushed 34 thread stacks

And then the GC stops working.

In the particular repro logged, rip is pointing here in __wine_syscall_dispatcher (it's usually within a few instructions of this area):

https://github.com/wine-mirror/wine/blob/db11d0fe6a169c457e23d007e20404643d0...

This means that is_inside_syscall() returned false and allowed the thread state to be captured directly from native thread state. This is defined as:

    static inline BOOL is_inside_syscall( ULONG_PTR sp )
    {
        return ((char *)sp >= (char *)ntdll_get_thread_data()->kernel_stack &&
                (char *)sp <= (char *)get_syscall_frame());
    }

Just a few lines before the instruction the thread was stopped in:

    "leaq 0x70(%rcx),%rsp\n\t" /* %rsp > frame means no longer inside syscall */

Indeed, %rsp as seen by the user app is 0x70 into syscall_frame (which is 0x12938ea40).
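The check can be replayed with the addresses from this report. The following is a simplified, standalone model rather than Wine's code: the struct and function names are hypothetical, and the kernel_stack value is an assumption chosen only to sit below the frame. The frame address 0x12938ea40 and the 0x70 offset are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the per-thread state the check reads. */
struct syscall_model {
    uint64_t kernel_stack;  /* lowest address of the Unix-side stack (assumed) */
    uint64_t syscall_frame; /* address of the saved syscall frame */
};

/* Same comparison as is_inside_syscall(), on plain integers. */
int model_is_inside_syscall(const struct syscall_model *t, uint64_t sp)
{
    assert(t->kernel_stack <= t->syscall_frame);
    return sp >= t->kernel_stack && sp <= t->syscall_frame;
}

/* The value %rsp holds right after "leaq 0x70(%rcx),%rsp". */
uint64_t rsp_after_leaq(const struct syscall_model *t)
{
    return t->syscall_frame + 0x70;
}
```

With syscall_frame at 0x12938ea40, rsp_after_leaq() yields 0x12938eab0, which matches the out-of-range stack pointer in the GC warning and sits just above the range that the check treats as "inside a syscall".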
Later in the function the stack is switched to the proper user stack:

    /* switch to user stack */
    "movq 0x88(%rcx),%rsp\n\t"

I believe when a thread is stopped in the entire range between those two instructions, user code can observe %rsp set to a bogus value that should not be possible.

This code was moved around by commit 245e8cedf059 and previously introduced by 0a5f7a71036. I'm not sure why %rsp is being set to point into &frame->rip instead of simply restoring it to the user stack pointer earlier.

-- 
Do not reply to this email, post in Bugzilla using the above URL to reply.
You are receiving this mail because:
You are watching all bug changes.
http://bugs.winehq.org/show_bug.cgi?id=59333

easyaspi314 <easy-as-pi314.ttv@proton.me> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |easy-as-pi314.ttv@proton.me

--- Comment #1 from easyaspi314 <easy-as-pi314.ttv@proton.me> ---
While it is difficult to tell when it does NOT happen due to how spontaneous it is, it appears that 8.0 Staging from Debian Bookworm is fine, as well as Proton 8-5 which is the "standard" for VNyan users.

9.0 (specifically Proton 9) has seen reports of the leak happening but I have not directly tested it.

10.0 Staging (wine-devel from the Ubuntu Questing PPA) does reproduce it.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #2 from Hoshino Lina <lina@lina.yt> ---
I tracked down the real stack trace (after restoring %rsp):

    #0  0x00007ffff7d0dffd in ?? ()
    #1  0x00006fffffc1bee6 in RtlWaitOnAddress (addr=0x7fffe65582bc, addr@entry=0x180750db8, cmp=0x7ffff7d22ea0, cmp@entry=0x7a8bf4ac, size=size@entry=4, timeout=0x12938ead8, timeout@entry=0x7a8bf508) at dlls/ntdll/sync.c:910
    #2  0x00006fffffc177bb in RtlSleepConditionVariableSRW (variable=0x180750db8, lock=0x180750d80, timeout=0x7a8bf508, flags=<optimized out>) at dlls/ntdll/sync.c:806
    #3  0x00006fffff40576c in SleepConditionVariableSRW (variable=<optimized out>, lock=<optimized out>, timeout=<optimized out>, flags=<optimized out>) at dlls/kernelbase/sync.c:1202
    #4  0x0000000180084261 in sleep_interruptable () from Z:\mnt\nas\home\lina\vt\app\VNyan\MonoBleedingEdge\EmbedRuntime\mono-2.0-bdwgc.dll
    #5  0x000000018014153f in FUN_180141480 () from Z:\mnt\nas\home\lina\vt\app\VNyan\MonoBleedingEdge\EmbedRuntime\mono-2.0-bdwgc.dll

So it's just a thread sleeping. I believe this can probably affect a huge number of Unity apps/games with different probability of happening, but for VNyan one that stands out is this thread poll loop in uOSC:

https://github.com/hecomi/uOSC/blob/3c80df7ee3ce7c58cb33f5820921a3b192e22c04...

Not sure if that's the one but it's definitely one candidate.
http://bugs.winehq.org/show_bug.cgi?id=59333

Adalyn <adibtw@tuta.io> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |adibtw@tuta.io
http://bugs.winehq.org/show_bug.cgi?id=59333

Zeb Figura <z.figura12@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |z.figura12@gmail.com

--- Comment #3 from Zeb Figura <z.figura12@gmail.com> ---
Fundamentally we shouldn't be reporting the "real" context if we're anywhere inside __wine_syscall_dispatcher(). This is probably tricky to do, because the shuffling around we do to actually save the context means that the correct context values to report depend on *exactly* where we are in the function.

This is another case that makes me think we really just need to be "masking" usr1 until we're ready to deal with it.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #4 from Hoshino Lina <lina@lina.yt> ---
Yeah, I was thinking about that yesterday and it's not just SP. For the GC not to break, you need to make sure all registers are either preserved, or saved to the *user* stack, for any frames reported to user code. Otherwise the GC could miss some roots and free something it shouldn't.

In practice, I think that means not clobbering any non-volatile registers while you are saving the frame. I think if you do that, and change the context code to something like:

    if (rip in __wine_syscall_dispatcher && rip during context save/restore sections)
        Just fake rip to be before/after the call so user does not observe it
    else if (is_inside_syscall())
        Syscall path
    else
        Normal context path

Then that would probably be reasonably fine? The user could observe volatile registers "magically" changing during the syscall in the corner cases but not much else.
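The three-way decision proposed in comment #4 could be sketched roughly as follows. This is only an illustration of the branching order, not Wine code: the range struct, the enum, and both function names are hypothetical, and a real implementation would need the actual address ranges of the save and restore sections of __wine_syscall_dispatcher.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical half-open address range [begin, end). */
struct addr_range {
    uint64_t begin;
    uint64_t end;
};

/* Which path the context-capture code would take. */
enum context_path {
    CONTEXT_FAKE_RIP, /* report rip as just before/after the syscall */
    CONTEXT_SYSCALL,  /* report the saved syscall frame */
    CONTEXT_NORMAL    /* report the native thread state */
};

int in_range(const struct addr_range *r, uint64_t addr)
{
    assert(r->begin <= r->end);
    return addr >= r->begin && addr < r->end;
}

/* save/restore: the dispatcher's context save and restore sections.
 * inside_syscall: result of the existing stack-pointer based check. */
enum context_path choose_context_path(const struct addr_range *save,
                                      const struct addr_range *restore,
                                      uint64_t rip, int inside_syscall)
{
    if (in_range(save, rip) || in_range(restore, rip))
        return CONTEXT_FAKE_RIP;
    if (inside_syscall)
        return CONTEXT_SYSCALL;
    return CONTEXT_NORMAL;
}
```

The point of checking rip first is that, in the window described in the original report, the stack-pointer check alone gives the wrong answer, so the rip-based test has to take priority over it.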
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #5 from Zeb Figura <z.figura12@gmail.com> ---
I don't think a partial solution like that is particularly likely to be accepted. I could be wrong, though.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #6 from Hoshino Lina <lina@lina.yt> ---
It was just a suggestion, in case masking signals has too much overhead and full tracking of IP position too complex.

I'm not really comfortable enough with the code to attempt a proper solution myself... there's a lot of subtlety in there (e.g. CFI) that I have no idea about. I hope someone else who knows the code better can pick this up.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #7 from easyaspi314 <easy-as-pi314.ttv@proton.me> ---
I mean I may be overlooking things, but uh... Is Wine actually checking whether it is in a syscall when calling NtGetContextThread and returning nonzero? I don't see it in any of the instances of where it sets the return value.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #8 from Zeb Figura <z.figura12@gmail.com> ---
(In reply to easyaspi314 from comment #7)
> I mean I may be overlooking things, but uh... Is Wine actually checking
> whether it is in a syscall when calling NtGetContextThread and returning
> nonzero? I don't see it in any of the instances of where it sets the return
> value.

NtGetContextThread() is supposed to return a valid context while in a syscall. It's just that it's the context that will be restored on return to user mode, not the context of the kernel.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #9 from Hoshino Lina <lina@lina.yt> ---
I was thinking about the "masking" USR1 and... I guess I was confused about the meaning. Since USR1 is just a normally handled signal, I guess it would be sufficient to just check if %rip is inside the syscall dispatcher and just set a flag and return? Then the syscall handler itself can check the flag, and if set, jump to the rest of the USR1 handling code.

You'd need some subtlety around "carving out" the check itself so it would go something like:

- Between start of __wine_syscall_dispatcher and loading the flag, do the above
- After the flag is loaded, and before jumping into the syscall, just handle it normally in the signal handler (considered "in syscall").
- If it's after syscall return and back in __wine_syscall_dispatcher, perhaps the easiest thing to do is just have the USR1 handler alter the signal return context to restart the syscall return path (you'd have to ensure the syscall return code is idempotent with respect to user state). This only really matters if the context is allowed to be modified by another thread "during" a syscall.
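The three cases listed in comment #9 can be written down as a small decision table. The sketch below is a plain C model with no real signal handling: the phase and action enums, the struct, and both functions are hypothetical names, and working out which phase a given %rip falls into is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Where the interrupted thread is relative to the dispatcher (hypothetical). */
enum dispatcher_phase {
    PHASE_OUTSIDE,           /* not in __wine_syscall_dispatcher at all */
    PHASE_ENTRY_BEFORE_FLAG, /* entered, pending flag not yet loaded */
    PHASE_ENTRY_AFTER_FLAG,  /* flag loaded, about to jump into the syscall */
    PHASE_RETURN             /* syscall returned, restoring user state */
};

/* What the USR1 handler would do in each phase. */
enum usr1_action {
    USR1_HANDLE_NOW,    /* run the usual handling code */
    USR1_DEFER,         /* set the flag and return immediately */
    USR1_RESTART_RETURN /* rewind the signal context to restart the return path */
};

struct usr1_model {
    bool usr1_pending;
};

/* Signal-handler side of the scheme. */
enum usr1_action on_usr1(struct usr1_model *t, enum dispatcher_phase phase)
{
    assert(t != NULL);
    switch (phase) {
    case PHASE_ENTRY_BEFORE_FLAG:
        t->usr1_pending = true;
        return USR1_DEFER;
    case PHASE_RETURN:
        return USR1_RESTART_RETURN;
    case PHASE_ENTRY_AFTER_FLAG:
    case PHASE_OUTSIDE:
    default:
        return USR1_HANDLE_NOW;
    }
}

/* Dispatcher side: called at the point where the flag is loaded.
 * Returns true if the deferred USR1 work must be run now. */
bool dispatcher_take_pending(struct usr1_model *t)
{
    assert(t != NULL);
    bool pending = t->usr1_pending;
    t->usr1_pending = false;
    return pending;
}
```

Keeping the decision in one function makes the "carve out" visible: only the window before the flag load defers, and the return path is the one case that needs the signal context rewritten.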
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #10 from easyaspi314 <easy-as-pi314.ttv@proton.me> ---
This definitely doesn't just affect VNyan: the bug appeared to reproduce in VTube Studio on Lina's last stream, and one other person recalls this leak in VRM Posing Desktop (and I think there have also been reports of VRChat leaking?).

While most of these are VTuber apps and there is a lot of sample bias in the VTuber community, I believe this affects all Unity software, and possibly all Mono programs.

I think the reason it is notably affecting us VTubers is because:

- We often run this software alongside other games, which results in far more memory pressure than normal gamers see
- It's not uncommon for us to run these programs for many hours, compared to most games which you run for a few hours and quit
- The memory allocation patterns are far more aggressive than in most games because, unlike games which have predictable scripted animations and physics, VTuber software is animating based on tracking data parsed in realtime and user-controlled integrations and scripts.
- Most VTuber models are extremely unoptimized compared to normal game assets.

As for the Unity versions:

- VNyan: 2022.3.62f3
- VTube Studio: 6000.0.58f2
- VRM Posing Desktop: Not 100% sure but it appears to be at least 2022.3 given the UniVRM version.
http://bugs.winehq.org/show_bug.cgi?id=59333

Esme Povirk <madewokherd@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |madewokherd@gmail.com

--- Comment #11 from Esme Povirk <madewokherd@gmail.com> ---
I think Unity is the only common user of bdwgc anymore. So I would guess it affects all Unity titles but not Mono in general.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #12 from Esme Povirk <madewokherd@gmail.com> ---
Well, even then it wouldn't affect Unity titles that use IL2CPP or .NET. Although I think Mono is still the most common runtime.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #13 from Adalyn <adibtw@tuta.io> ---
I personally haven't noticed a memory leak like this in VRChat, which is an IL2CPP game. With how inconsistent the bug is, it's hard to say for sure that it doesn't happen, but I've played thousands of hours on Linux without this type of memory leak happening, and VRChat shares many similarities with VTubing software, including OSC usage and poorly optimized user-made models.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #14 from easyaspi314 <easy-as-pi314.ttv@proton.me> ---
That is fair, Adalyn. I only remember hearing about it a while back. Did VRChat use Mono before they added EAC?

IDK though, this is just conjecture. All I know is that VNyan and VTube Studio have confirmed leaks and they are both Unity+Mono based games.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #15 from Adalyn <adibtw@tuta.io> ---
Based on the GitHub archives of mods from before EAC, it looks like they were already using IL2CPP before adding EAC. emmVRC has a blog post that mentions IL2CPP was added some time in April 2020, which was long before the commit that theoretically introduced this bug (0a5f7a71036 was made in 2023).
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #16 from Hoshino Lina <lina@lina.yt> ---
My understanding is IL2CPP still uses the Mono runtime and bdwgc (it just replaces the managed bytecode with native code) but I could be wrong about how all the pieces fit together in different configurations. See this random IL2CPP build tree I found:

https://git.noc.ruhr-uni-bochum.de/krekem24/ANN_Visualisation_in_AR_Builds/-...

It's not just any Unity app though, it's specifically Unity apps that use managed threads that perform system calls, and it's a tight race that triggers it. uOSC (what VNyan uses) has a managed thread with a blatant poll loop. It's likely other managed libraries also cause the issue. VRChat doesn't use that though, it uses this:

https://github.com/stella3d/OscCore

While it does use a receive thread, it uses blocking reads, which means the bug can only trigger *each time an OSC message is received*. If you don't use OSC or use it sparingly chances are ~zero.

Meanwhile uOSC:

https://github.com/hecomi/uOSC/blob/3c80df7ee3ce7c58cb33f5820921a3b192e22c04...

... polls 1000 times per second, no matter what.

So I think this still affects ~every Unity app that uses threads, IL2CPP or not. It's just that the race is so tight, that you're only likely to hit it if the Unity app does Very Silly Things Indeed, like uOSC does.

To put things into perspective: If you do use OSC with VRChat but only get messages once per frame at ~60Hz, then you're 16 times less likely to hit the bug than on VTube Studio. If it takes 5 hours to hit the bug on average on VTS, it would take 80 hours on average to hit the bug on VRChat. It's all probabilistic, you can still get unlucky and hit it in 5 minutes, but you get the idea.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #17 from Hoshino Lina <lina@lina.yt> ---
Also VNyan does OSC and 4x VMC receivers, so that's another factor of 5, for a total of 5000 chances to hit the race per second.
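The back-of-the-envelope numbers in comments #16 and #17 can be checked with a few lines of C. The sketch assumes, as those comments do, that the chance of hitting the race scales linearly with how often a managed thread enters and leaves a syscall; the function names are made up for illustration.

```c
#include <assert.h>

/* How many times more often app A reaches the race window than app B. */
double rate_ratio(double rate_a_hz, double rate_b_hz)
{
    assert(rate_b_hz > 0.0);
    return rate_a_hz / rate_b_hz;
}

/* Expected time to hit the bug, scaled from a reference app by the ratio. */
double scaled_hours(double reference_hours, double ratio)
{
    return reference_hours * ratio;
}

/* Total race opportunities per second across several polling receivers. */
long chances_per_second(long receivers, long polls_per_second)
{
    return receivers * polls_per_second;
}
```

With these inputs, 1000 Hz polling against 60 Hz message delivery gives a ratio of about 16.7, so a 5-hour average would stretch to roughly 83 hours; the 16x and 80-hour figures in the comment are the same estimate rounded down. Five receivers polling at 1000 Hz give the 5000 chances per second mentioned for VNyan.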
http://bugs.winehq.org/show_bug.cgi?id=59333

epyon_avenger@chaos-axiom.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |epyon_avenger@chaos-axiom.com
http://bugs.winehq.org/show_bug.cgi?id=59333

Madalee <madalee@ohlmeyers.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |madalee@ohlmeyers.com

--- Comment #18 from Madalee <madalee@ohlmeyers.com> ---
We wanted to report that we've seen this same behavior in Warudo (also Unity-based VTuber software) in all versions after Proton-8.

This leak usually doesn't happen quickly. We've seen apps leak in as little as an hour, and we've also seen it not manifest until around 30 hours in. We believe we've seen this in other Unity games as well. Basically, anything Unity that we run on Linux through Proton newer than 8 seems to leak... eventually.

As Hoshino Lina pointed out, the common thread among VTubers is that we often stream for many hours. We think this extended time with Unity is why it tends to affect us more than most users and why this very serious bug often goes unnoticed.

Not sure how this is for other users, but for us when a Unity app does this, our whole system hangs, sometimes for several minutes while we wait for the memory goblin to rescue us by killing apps.

We're happy to provide any additional details that can help, or run testing/debug versions of Wine and provide logs.

Thanks,
-Madalee
http://bugs.winehq.org/show_bug.cgi?id=59333

Austin English <austinenglish@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |regression

--- Comment #19 from Austin English <austinenglish@gmail.com> ---
A regression test to find what changed in upstream wine would be helpful:
https://gitlab.winehq.org/wine/wine/-/wikis/Regression-Testing
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #20 from Hoshino Lina <lina@lina.yt> ---
I didn't bisect the commit with testing, but I did identify 0a5f7a71036 as the likely culprit based on git blame (mentioned in my original report). This was introduced in wine-9, so that tracks with all the user reports that this starts happening in proton-9.

I want to try writing an intentional reproducer for this that's faster than the test cases we have, though since it's a race I don't know if it will be fast enough to be incorporated into an actual test suite.

@Madalee Yeah, Linux doesn't handle progressive memory leaks well. Here's one time it happened on stream:

https://www.youtube.com/live/X3EfEZGe_Og?t=33280s

You can't really see much because my whole stream system (separate from the one I was working on) froze pretty badly, but you can see the 4-minute jump in the clock when I came back (watched live, the stream was down for that time period, YouTube just cuts that out in the VOD). I don't remember if I had to restart the whole machine, but it definitely took down VTube Studio, OBS, and my audio system at least.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #21 from Adalyn <adibtw@tuta.io> ---
I made a minimal Unity project that recreates this issue more consistently.

I have yet to be able to reproduce the issue on commit 0a5f7a71036, and I was able to recreate the issue within seconds on commit 245e8cedf059. I'll continue attempting to narrow down the range of commits, but it's taking quite a while to build wine between tests, so it could be a bit.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #22 from Zeb Figura <z.figura12@gmail.com> ---
I don't think bisecting is going to be all that useful at this point; the problem is pretty well understood.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #23 from Madalee <madalee@ohlmeyers.com> ---
We forked GE-Proton and Kron4ek Wine-Builds, then took the changes that Hoshino Lina pointed to and reverted them in Proton 10's Wine.

We've got a build of GE-Proton with those changes that we're testing now. If any of you have a better way to repeat the issue, like Adalyn, you can test this:
https://github.com/madalee-com/proton-ge-custom-unity-leak-fix

For GE-Proton we applied the change after the source code gets pulled and before any of the patches are applied. Not sure if these changes make their way through to the final build or not. Nonetheless, we're testing it.

For Kron4ek/Wine-Builds, this one is way cleaner. We put the changes in place between the git clone and the build. Pretty sure that one should be a valid test. It's building now, though it won't be available for several hours. This one will be available in build artifacts here, if you have a logged-in GitHub account, we think. If not, we'll put it in a release for others to test:
https://github.com/madalee-com/Wine-Builds-fix-unity-leak/actions/runs/22272...

If you want to take a look at how we reverted the changes, here is the specific file we changed. This is pulled from the Proton-10/wine branch and the reverted code was from the Proton-8/wine branch.
https://github.com/madalee-com/Wine-Builds-fix-unity-leak/blob/master/signal...
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #24 from Adalyn <adibtw@tuta.io> ---
The GE-Proton build started leaking memory after running my test program for about 5 minutes, with the exact same behaviour (all leaked memory is Managed Reserved memory according to Unity's Memory Profiler).

I'll give the wine build a try as well once it's ready.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #25 from Hoshino Lina <lina@lina.yt> ---
I don't think you reverted any of the changes. The issue is in __wine_syscall_dispatcher which is in dlls/ntdll/unix/signal_x86_64.c. The file being modified there is dlls/ntdll/signal_x86_64.c (note the path).

It's probably not that easy to revert the change, since the signal handler has changed since then. This is really subtle assembly code.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #26 from Madalee <madalee@ohlmeyers.com> ---
We can't even get the right file, and yet we're gunna say. What do you mean it's not that easy? Easy game for babies!
https://github.com/madalee-com/Wine-Builds-fix-unity-leak/blob/master/signal...

Heh, we simply overwrote the whole __wine_syscall_dispatcher with the one from Proton 8. Then put the right file in the right place. And now we wait for them to compile again.

To be clear, we don't understand what is happening in the code here. And this is a bit of a shot in the dark hoping that reverting the function will "resolve" the issue, with the intent of proving out the actual problem.

We suppose if it works, this can be an okay workaround until a proper fix is made? But this is not, in any way, a thoughtful resolution of the issue at hand.

Adalyn, any chance you can share your testing app?
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #27 from Adalyn <adibtw@tuta.io> ---
It took about 50 minutes this time, but the new Proton reversion also started leaking memory eventually.

My current test setup is simply an empty Unity project, with this script attached to any object:
https://gist.github.com/AdalynBlack/169c193fccd644a04af7cef8cef208e1
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #28 from Madalee <madalee@ohlmeyers.com> ---
FYI, after a few more changes to get the revert right, and applying it directly to a localish build of Proton 10.0-4, we might be on to something?

We have a Proton 10.0-4 with the revert that's been running Warudo for around 5 hours now, no leak. Again, on rare occasions we've seen the leak hit after a whole day, so it could be a coincidence.

At the same time we are testing a Proton-GE 10-32 (on another machine) with the script as Adalyn suggested.

We're working on putting up a GitHub repo to repeat the Proton build with the revert for Proton-10.0-4 and Proton Experimental. Will post links once those are all ready.

The GE build is already available, though we have less confidence in it, since it has a ton of other patches:
https://github.com/madalee-com/proton-ge-custom-unity-leak-fix/releases
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #29 from Madalee <madalee@ohlmeyers.com> ---
Proton and Proton Experimental build with the revert are here:
https://github.com/madalee-com/proton-unity-leak-test/releases

This is just raw Proton with the revert, no other patches, so it's the best bet.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #30 from Madalee <madalee@ohlmeyers.com> ---
Warudo has been running under this modified Proton 10 for 18 hours now and has not experienced "THE" memory leak.

We say "THE" because it does seem like it's experiencing "A" memory leak. It's been, very slowly, increasing in memory use since around hour 2 or so. It's gone from 1.5G to 5.8G after 18 hours. Along with the increase in memory usage, the CPU usage has also climbed, with a setup that's basically just idling (not receiving VRM data or anything): the model stands and moves around in an idle pose. CPU was idling around 12% initially and it's up to an average of 17% now.

The slowness of this leak and CPU increase could be on the Warudo side at this point.

On a side note, we tried running the Unity script for several hours, with normal Proton 10, and haven't seen the leak. We likely haven't set up the script properly. We don't develop in Unity, just use it enough to work with our VRM model, so maybe we set it up wrong?
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #31 from Adalyn <adibtw@tuta.io> ---
If the script is applied correctly, it should be using at least 800 threads, and basically every available CPU cycle the kernel will provide.

All that should be necessary to use it is adding the script as a component to any object in the scene.
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #32 from Madalee <madalee@ohlmeyers.com> ---
It is, indeed, hammering the CPU and running around 886 threads. The memory usage, however, didn't go up after leaving it running all night on GE-Proton10-10 (original, no leak patch).

We know Unity has several different engine modes and stuff, maybe our project is in the wrong one? If you're willing, feel free to reach out to us on Discord at "madalee".

(In reply to Adalyn from comment #31)
> if the script is applied correctly, it should be using at least 800 threads,
> and basically every available CPU cycle the kernel will provide
>
> all that should be necessary to use it is adding the script as a component
> to any object in the scene
http://bugs.winehq.org/show_bug.cgi?id=59333

--- Comment #33 from Madalee <madalee@ohlmeyers.com> ---
While we haven't seen the big memory leak with Warudo while running our reverted Proton-10.0-4, we did manage to run into it while running Adalyn's debugger. This would suggest that either that code block isn't at fault, or we reverted it wrong.

As we said, we don't really understand what's happening in the assembly so it's possible we reverted it wrong. However, that entire __wine_unix_call_dispatcher function is identical to Proton 8 (which does not have the bug) in our version except for the following:

Changed:
    ".globl " __ASM_NAME("__wine_unix_call_dispatcher_prolog_end") "\n"
    __ASM_NAME("__wine_unix_call_dispatcher_prolog_end") ":\n\t"
To:
    __ASM_LOCAL_LABEL("__wine_unix_call_dispatcher_prolog_end") ":\n\t"

Changed:
    "jnz .L__wine_syscall_dispatcher_return\n\t"
To:
    "jnz " __ASM_LOCAL_LABEL("__wine_syscall_dispatcher_return") "\n\t"

We needed to make these changes as Proton 10 has an asm block at the end that references __wine_syscall_dispatcher_return as opposed to L__wine_syscall_dispatcher_return.

It kinda looks like these may be the same thing with different syntax, but we are just guessing here.
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #34 from Hoshino Lina <lina@lina.yt> --- Created attachment 80435 --> http://bugs.winehq.org/attachment.cgi?id=80435 Bug repro Attached is a non-Unity repro that catches the bug within a few seconds. It also shows that very often contexts within __wine_syscall_dispatcher_return are returned (which is still wrong, but won't break Unity by itself), and only a fraction of those have the bad sp (which breaks Unity). I still don't understand the point of the `leaq 0x70(%rcx),%rsp`. Why is rsp being set to anything arbitrary at all? Just replacing that line with `movq 0x88(%rcx),%rsp` might go a long way towards fixing this, at least for Unity.
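To make the difference between the two instructions concrete, here is a small C model of the check. It is only a sketch: `struct thread_model` and its field names are hypothetical, the kernel stack address used in the usage note is made up, and it assumes (as the suggestion above implies) that offset 0x88 of the frame holds the saved user %rsp. Only the comparison in `model_is_inside_syscall()` mirrors the is_inside_syscall() definition quoted in the original report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one thread's syscall state. Field names are
 * illustrative; they are not Wine's actual structures. */
struct thread_model
{
    uint64_t kernel_stack;   /* lowest kernel stack address (assumed value) */
    uint64_t syscall_frame;  /* address of the syscall frame */
    uint64_t saved_user_rsp; /* user %rsp stored in the frame (assumed slot 0x88) */
};

/* Same bounds test as the is_inside_syscall() quoted in the report. */
int model_is_inside_syscall(const struct thread_model *t, uint64_t sp)
{
    assert(t->kernel_stack <= t->syscall_frame);
    return sp >= t->kernel_stack && sp <= t->syscall_frame;
}

/* leaq 0x70(%rcx),%rsp: %rsp becomes an address 0x70 bytes into the frame. */
uint64_t sp_after_leaq(const struct thread_model *t)
{
    return t->syscall_frame + 0x70;
}

/* movq 0x88(%rcx),%rsp: %rsp becomes the value stored in the frame. */
uint64_t sp_after_movq(const struct thread_model *t)
{
    return t->saved_user_rsp;
}
```

With the frame at 0x12938ea40 and a saved user %rsp of 0x7a8bf3e8 (both taken from the logs in this bug), `sp_after_leaq()` yields 0x12938eab0, the bad sp that the GC warned about. Both results fail the inside-syscall test, so either could be reported from native thread state, but only the `movq` result is an address on the user stack.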
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #35 from Hoshino Lina <lina@lina.yt> --- Created attachment 80436 --> http://bugs.winehq.org/attachment.cgi?id=80436 Partial fix Partial fix attached. I believe this fixes the Unity case but it probably has corner cases (instrumentation, other situations where threads might leak user state). I'm not really familiar enough with the code to go further, so I think someone else will have to take over for a "proper" fix.
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #36 from easyaspi314 <easy-as-pi314.ttv@proton.me> --- Did a quick scan over the patch. Is the local label supposed to be 11? I don't understand this code so I might be wrong, but the `jz 1f` followed by `11:` seems a bit off to me.
http://bugs.winehq.org/show_bug.cgi?id=59333 Hoshino Lina <lina@lina.yt> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #80436|0                           |1
        is obsolete|                            |

--- Comment #37 from Hoshino Lina <lina@lina.yt> --- Created attachment 80441 --> http://bugs.winehq.org/attachment.cgi?id=80441 Revised fix Whoops, yeah, that was a copy and paste error while editing things. I'm surprised it worked... Fixed patch attached.
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #38 from Madalee <madalee@ohlmeyers.com> --- Doing a build with Hoshino Lina's patch. Will post a link once ready. We applied this to Proton 10.0-4. @Hoshino Lina: what version did you base the patch on?
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #39 from easyaspi314 <easy-as-pi314.ttv@proton.me> --- Lina said "that patch was based on the latest tag (11.3 I think)."
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #40 from easyaspi314 <easy-as-pi314.ttv@proton.me> --- Lina's patched version doesn't report a bogus %rsp or trip the assert, but I can confirm that it does report %rip in the 0x7xxxxxxxxxxx Wine range very often.
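For readers wondering why a bogus %rsp matters more than a stray %rip, here is a hedged C sketch of the two checks involved. The function names and the stack bounds in the usage note are hypothetical and are not taken from the attached repro or from bdwgc; the sketch only illustrates a range test of the kind the repro asserts on, plus the unchecked `stack_base - sp` subtraction described in the original report.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical range test: is the reported sp within the thread's stack? */
int sp_in_range(uint64_t sp, uint64_t stack_limit, uint64_t stack_base)
{
    assert(stack_limit <= stack_base);
    return sp >= stack_limit && sp <= stack_base;
}

/* Unchecked stack usage, modeled on the subtraction described in the
 * report. Unsigned arithmetic wraps around when sp > stack_base. */
uint64_t stack_usage_unchecked(uint64_t sp, uint64_t stack_base)
{
    return stack_base - sp;
}
```

Taking a made-up stack of 0x7a7c0000 to 0x7a8c0000, the pre-bug sp 0x7a8bf3e8 is in range and gives a small usage figure, while the bad sp 0x12938eab0 is out of range and the subtraction wraps to a value close to 2^64. A %rip inside Wine code does not feed into either computation, which fits the observation that it does not break Unity by itself.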
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #41 from Madalee <madalee@ohlmeyers.com> --- The Proton 10 build we did with Lina's patch seems to work correctly as well. We hit the assert after 16 tries with the latest Wine staging on Arch. When we run it through the patched Proton 10, we go through all 2000 tries and see plenty of RIP reports but no assert for in_range. The patched Proton 10 is here for anyone who wants to test it out with stuff in Steam: https://github.com/madalee-com/proton-unity-leak-test/releases/tag/Proton-un... We are also building Proton Experimental, and will link to it when it's ready. FYI, these builds are being done in GitHub Actions, pulling source straight from the other repos.
http://bugs.winehq.org/show_bug.cgi?id=59333 --- Comment #42 from easyaspi314 <easy-as-pi314.ttv@proton.me> --- Madalee's results check out. So far my own test results:
- wine-8.0 (Debian 8.0~repack-4) on Bookworm: seems to pass
- wine-10.0 (Ubuntu 10.0~repack-6ubuntu1) on Ubuntu 25.10: fail
- wine-11.3 from source: fail
- wine-11.3 from source with Lina's patch: seems to pass
- GE-Proton 10-28/29: fail
I'm putting "seems to pass" here since, this being a race condition, it's hard to say with 100% certainty. Again, all of these (including Wine 8) regularly report %rip in Wine code though, meaning that this part of the behavior is not a regression and may be a different bug.