>Yes, in my case I'm mostly only interested in the case with DWARF debug info in PE files. The applications I'm working with usually don't have debug information anyway and I'm trying to get stack traces and tooling support for Wine code specifically.
Okay, yeah that makes a lot of sense now. Just to note that Windows "runtime function" unwind information was almost always available when we checked.
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1065#note_10931
> > FWIW this also works with Valgrind, and you might be interested in https://gitlab.winehq.org/wine/wine/-/merge_requests/1074 too.
>
> That is interesting. So, but do I see correctly that this only works, if the Windows Pe/Coff files, actually have **DWARF** unwind information embedded. Which is as far as I am aware only the case for cross-compilation with mingw, right?
> > I should probably add that it only does that when it has the debug info for the PE modules, this isn't currently well supported, and needs to be done somehow manually.
>
> Are you referring to the Windows unwind information (runtime function), or the DWARF information mentioned above? Our LLDB patch (which we hopefully upstream soon), as well as the libunwindstack implementation actually takes care of all kind of PE/Coff modules, and is able to use runtime function as well, as DWARF information.
Yes, in my case I'm mostly only interested in the case with DWARF debug info in PE files. The applications I'm working with usually don't have debug information anyway and I'm trying to get stack traces and tooling support for Wine code specifically.
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1065#note_10930
>Hmm, I need to think about that. When I said it fully works, I should maybe indicate that GDB doesn't usually display local variables for the first frame below the syscall, so maybe it's because of this even though it manages to catch up for lower frames.
As far as I remember, local variables were available in all frames in our case. That is probably because we specify the callee-saved registers right away when they get stored. It is also likely that, if the computation is not correct in the first frame, the unwinder catches up later, since the following unwind information mostly says "register foo is saved at that position on the stack".
>FWIW this also works with Valgrind, and you might be interested in https://gitlab.winehq.org/wine/wine/-/merge_requests/1074 too.
That is interesting. But do I see correctly that this only works if the Windows PE/COFF files actually have **DWARF** unwind information embedded? As far as I am aware, that is only the case for cross-compilation with mingw, right?
>It's not exactly the `pushfl` but more that the dispatcher pops `rip` first, which effectively removes the return address from the stack, then `pushfl` overwrites its value. It's not causing problems later because the dispatcher uses the popped `rip` value to return to the caller, but it temporarily confuses GDB stack unwinding. Maybe with your CFI instructions it's not required.
Right! And that is the reason why we don't need it: we specify "RIP" based on the syscall frame and not based on the CFA, right as it gets popped. I think that's another case where it is actually better to use the `breg` instructions, as they give you the freedom to specify some values relative to the CFA and some relative to register content.
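
To illustrate the `breg` approach, here is a minimal Python sketch that builds the raw bytes of a `DW_CFA_expression` rule saying "the return address is stored at base register plus offset", in a form that could be pasted into a `.cfi_escape` directive. The opcode values and the x86-64 DWARF register numbers come from the DWARF and psABI specifications. The choice of `rcx` as the base register and `0x70` as the offset of `rip` inside `syscall_frame` are assumptions for illustration only, not taken from the actual patch.

```python
# Opcode values from the DWARF standard.
DW_CFA_expression = 0x10
DW_OP_breg0 = 0x70  # DW_OP_bregN is DW_OP_breg0 + N, for N in 0..31

# DWARF register numbers from the x86-64 System V psABI.
DWARF_RCX = 2
DWARF_RIP = 16  # return address column


def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("uleb128 needs a non-negative value")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def sleb128(value: int) -> bytes:
    """Encode a signed integer as signed LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift, keeps the sign
        sign_bit_set = bool(byte & 0x40)
        done = (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set)
        out.append(byte if done else byte | 0x80)
        if done:
            return bytes(out)


def saved_at_breg(reg: int, base_reg: int, offset: int) -> bytes:
    """CFI bytes for: 'reg' is saved in memory at (base_reg + offset)."""
    if not 0 <= base_reg <= 31:
        raise ValueError("DW_OP_bregN only covers registers 0..31")
    expr = bytes([DW_OP_breg0 + base_reg]) + sleb128(offset)
    return bytes([DW_CFA_expression]) + uleb128(reg) + uleb128(len(expr)) + expr


def cfi_escape(data: bytes) -> str:
    """Format raw CFI bytes as an assembler .cfi_escape directive."""
    return ".cfi_escape " + ",".join(f"0x{b:02x}" for b in data)


# Hypothetical layout: rcx points at the syscall frame, rip is stored at +0x70.
HYPOTHETICAL_RIP_OFFSET = 0x70
print(cfi_escape(saved_at_breg(DWARF_RIP, DWARF_RCX, HYPOTHETICAL_RIP_OFFSET)))
```

Because the rule is expressed relative to a register rather than the CFA, the CFA itself can keep its usual meaning while the return address is looked up in the frame structure.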
>The reason I've been told, is that some DRMs are checking that the NT syscalls didn't touch the stack. I don't really know more.
Oh, interesting. That makes a lot of sense, and makes our lives so much harder... :smile:
>Yes, it's already an issue for `perf` captures when you want stack traces that cross user and kernel stacks, and I don't really have any good solution for that. Having an optional working mode which would interleave kernel frames and user frames may be possible but it doesn't seem much tractable.
Yep, `perf` is actually using the same `perf_event_open` syscall that we are using, so it makes sense that we are observing the same issue. The "brute-force" way we worked around that is to hook into the syscall dispatcher (using a uprobe) and always collect the "user-space" stack there; then, for the actual samples, we make both stacks available to the unwinder. This obviously comes with a big performance hit and might distort your performance characteristics. However, it actually turned out to be useful for finding some issues.
An alternative solution that we never ended up implementing would be to use an eBPF program for the sampling/stack collection. We could make it copy both stacks explicitly.
Obviously, the more elegant solution would be, as you mentioned, either to have a switch in Wine to use only one stack or, alternatively, to make it possible for `perf_event_open` to collect both stacks. It would not be the first time that the Linux kernel has special handling for Wine.
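
As a rough illustration of "making both stacks available to the unwinder", the following Python sketch models a memory reader that is backed by two captured stack snapshots and answers reads from whichever one covers the requested address. All class names, function names, and addresses here are hypothetical; this is not the Orbit or libunwindstack code, just a sketch of the idea under those assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StackSnapshot:
    """A copied chunk of stack memory, starting at address 'start'."""
    start: int
    data: bytes

    def contains(self, addr: int, size: int) -> bool:
        return self.start <= addr and addr + size <= self.start + len(self.data)

    def read(self, addr: int, size: int) -> bytes:
        offset = addr - self.start
        return self.data[offset:offset + size]


class DualStackMemory:
    """Memory view for an unwinder, backed by several stack snapshots."""

    def __init__(self, *snapshots: StackSnapshot) -> None:
        self.snapshots = snapshots

    def read_u64(self, addr: int) -> Optional[int]:
        """Return the little-endian 64-bit value at 'addr', or None if unmapped."""
        for snapshot in self.snapshots:
            if snapshot.contains(addr, 8):
                return int.from_bytes(snapshot.read(addr, 8), "little")
        return None


# Hypothetical sample: the sampled (kernel_stack) chunk plus the user stack
# chunk that was saved earlier at the syscall dispatcher.
kernel_chunk = StackSnapshot(0x7000_0000, (0x1122).to_bytes(8, "little"))
user_chunk = StackSnapshot(0x1000, (0x401000).to_bytes(8, "little") * 2)
memory = DualStackMemory(kernel_chunk, user_chunk)

print(hex(memory.read_u64(0x7000_0000)))  # served from the kernel_stack chunk
print(hex(memory.read_u64(0x1008)))       # served from the user stack chunk
```

With such a view the unwinder does not need to know that a stack switch happened; it only needs the unwind rules at the dispatcher to point it from one region into the other.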
>I should probably add that it only does that when it has the debug info for the PE modules, this isn't currently well supported, and needs to be done somehow manually.
Are you referring to the Windows unwind information (runtime function), or the DWARF information mentioned above?
Our LLDB patch (which we hope to upstream soon), as well as the libunwindstack implementation, handles all kinds of PE/COFF modules and is able to use runtime function information as well as DWARF information.
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1065#note_10929
> Ah, that is good to know, thanks. Apparently gdb's unwinder works slightly differently from the ones we looked at. Also, I am surprised it can actually unwind Windows PE/COFF-based frames.
I should probably add that it only does that when it has the debug info for the PE modules. This isn't currently well supported, and needs to be done somewhat manually.
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1065#note_10927
> > BTW, if the issue is only that the CFA is supposed to stay constant within a CFI procedure, would it work if we were splitting the dispatcher with some jumps / call / fake CFI procedures, at the points where we need to change its register?
>
> My point is that the issue is actually not about the CFA not being constant, but that the CFA has a semantic that many unwinding tools/libraries rely on, i.e. that the CFA is supposed to be the value of the stack pointer at the caller, right before the call. In particular, the CFA is supposed to be a value that would be a valid stack pointer.
>
> In your change, however, CFA gets later set to be a pointer into syscall_frame.
Hmm, I need to think about that. When I said it fully works, I should maybe have mentioned that GDB doesn't usually display local variables for the first frame below the syscall, so maybe it's because of this, even though it manages to catch up for lower frames.
> > With GDB and with the two other changes I mentioned, it fully works yes.
>
> Ah, that is good to know, thanks. Apparently gdb's unwinder works slightly differently from the ones we looked at. Also, I am surprised it can actually unwind Windows PE/COFF-based frames.
FWIW this also works with Valgrind, and you might be interested in https://gitlab.winehq.org/wine/wine/-/merge_requests/1074 too.
> > I'm a bit surprised that you don't need the flags push/pop change, as I believe it corrupts the return address on the stack, but maybe the unwinder has other ways to get it back.
>
> Interesting. I am wondering why the `pushfl` instruction corrupts the return address. If this would be the case, wouldn't that not lead to actual runtime issues when the dispatcher returns? Maybe the reason why we don't see an issue with that, is that we are actually setting the CFI rule for the return address based on `syscall_frame` right away?
It's not exactly the `pushfl` but more that the dispatcher pops `rip` first, which effectively removes the return address from the stack, then `pushfl` overwrites its value. It's not causing problems later because the dispatcher uses the popped `rip` value to return to the caller, but it temporarily confuses GDB stack unwinding. Maybe with your CFI instructions it's not required.
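
The sequence described above can be modelled with a toy stack in Python. This is only a simulation under simplifying assumptions (hypothetical addresses and values, 8-byte slots, a downward-growing stack); it is not the dispatcher code itself.

```python
class ToyStack:
    """Minimal downward-growing stack with 8-byte slots."""

    def __init__(self, sp: int) -> None:
        self.sp = sp
        self.mem = {}

    def push(self, value: int) -> None:
        self.sp -= 8
        self.mem[self.sp] = value

    def pop(self) -> int:
        value = self.mem[self.sp]
        self.sp += 8
        return value


RETURN_ADDRESS = 0x140001234  # hypothetical caller address
FLAGS = 0x246                 # hypothetical flags value


def simulate_dispatcher_entry() -> dict:
    stack = ToyStack(sp=0x1000)
    sp_before_call = stack.sp

    stack.push(RETURN_ADDRESS)   # the 'call' into the dispatcher
    return_slot = stack.sp       # where an unwinder expects the return address

    frame_rip = stack.pop()      # dispatcher pops rip into the syscall frame
    stack.push(FLAGS)            # pushing flags reuses the very same slot
    slot_value_after_push = stack.mem[return_slot]
    frame_eflags = stack.pop()   # flags are popped into the syscall frame

    return {
        "frame_rip": frame_rip,
        "frame_eflags": frame_eflags,
        "slot_value_after_push": slot_value_after_push,
        "sp_before_call": sp_before_call,
        "sp_after": stack.sp,
    }


result = simulate_dispatcher_entry()
print(hex(result["slot_value_after_push"]))  # the flags, not the return address
print(hex(result["frame_rip"]))              # the return address survives in the frame
```

The return address is safe in the frame structure, so returning works, but anything that still looks for it in the original stack slot finds the flags instead.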
> > Yes, that's what I've been told (maybe even *you* did).
>
> No, that wasn't me :smile:, it's actually my first PR. Also note that I am not a Wine expert, so please forgive me if I miss something obvious.
Well, nothing in this area is obvious to me either.
> > Now the other part where unwinding doesn't work well is with the other side of syscalls, with KiUserCallbackDispatcher callbacks.
>
> Unfortunately, I am not aware of the `KiUserCallbackDispatcher`. So I can't tell if there is more work to be done/how this could be fixed.
>
> We basically looked at reported unwinding errors in Orbit, and discovered the missing unwind info (CFI) being the root cause of most of them. With this, and some adjustments to our unwinder to cope with certain corner-cases of PE/Coff files, we ended up with less than 1% of unwinding errors.
>
> Actually, another issue that kept us quite busy is, that the `kernel_stack` is not allocated on the "normal"/"user" stack. That apparently used to be the case before Wine 7. This way, Linux performance collector tools, such as perf_event_open, fail to collect the "user-space" part of the stack. For debugging this is fine, as the program execution is halted and we can just read the data from memory, but for Profiling, we needed to massively work around that. Does someone know about the reason for the two separated stacks? Would it be maybe possible, to actually allocate that "kernel_stack" at the normal stack?
The reason I've been told, is that some DRMs are checking that the NT syscalls didn't touch the stack. I don't really know more.
Yes, it's already an issue for `perf` captures when you want stack traces that cross user and kernel stacks, and I don't really have any good solution for that. Having an optional working mode which would interleave kernel frames and user frames may be possible, but it doesn't seem very tractable.
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1065#note_10925
>BTW, if the issue is only that the CFA is supposed to stay constant within a CFI procedure, would it work if we were splitting the dispatcher with some jumps / call / fake CFI procedures, at the points where we need to change its register?
My point is that the issue is actually not about the CFA not being constant, but that the CFA has a semantic that many unwinding tools/libraries rely on, i.e. that the CFA is supposed to be the value of the stack pointer at the caller, right before the call. In particular, the CFA is supposed to be a value that would be a valid stack pointer.
In your change, however, CFA gets later set to be a pointer into syscall_frame.
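
To make the reliance on that semantic concrete, here is a small Python sketch of how an unwinder might restore the caller's stack pointer: without an explicit rule for the stack pointer it simply takes the CFA. The function name, the addresses, and the `0x88` offset of the saved `rsp` are hypothetical and only serve the illustration.

```python
from typing import Dict, Optional


def caller_sp(cfa: int, memory: Dict[int, int],
              sp_saved_at_cfa_offset: Optional[int] = None) -> int:
    """Stack pointer of the caller frame.

    Without an explicit rule for the stack pointer, the usual convention
    applies: the caller's SP is the CFA itself. With a rule, the saved SP
    is loaded from memory at (CFA + offset).
    """
    if sp_saved_at_cfa_offset is None:
        return cfa
    return memory[cfa + sp_saved_at_cfa_offset]


# Hypothetical addresses: the real user stack pointer before the call, and
# a syscall frame that lives somewhere else entirely.
USER_SP = 0x7FFD_0000_1000
SYSCALL_FRAME = 0x7F00_0000_0F00
HYPOTHETICAL_RSP_SLOT = 0x88  # assumed offset of the saved rsp in the frame
memory = {SYSCALL_FRAME + HYPOTHETICAL_RSP_SLOT: USER_SP}

# Conventional CFA: the default rule yields the right caller SP.
print(hex(caller_sp(USER_SP, memory)))

# CFA pointing into the syscall frame, no SP rule: caller SP is bogus.
print(hex(caller_sp(SYSCALL_FRAME, memory)))

# Same CFA, but with an explicit SP rule: the caller SP is recovered.
print(hex(caller_sp(SYSCALL_FRAME, memory, HYPOTHETICAL_RSP_SLOT)))
```

Tools that bake in the default rule, or that sanity-check the CFA as a stack address, will trip over the second case even when the remaining register rules are correct.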
>With GDB and with the two other changes I mentioned, it fully works yes.
Ah, that is good to know, thanks. Apparently gdb's unwinder works slightly differently from the ones we looked at. Also, I am surprised it can actually unwind Windows PE/COFF-based frames.
>I'm a bit surprised that you don't need the flags push/pop change, as I believe it corrupts the return address on the stack, but maybe the unwinder has other ways to get it back.
Interesting. I am wondering why the `pushfl` instruction corrupts the return address. If that were the case, wouldn't it lead to actual runtime issues when the dispatcher returns?
Maybe the reason why we don't see an issue with that is that we set the CFI rule for the return address based on `syscall_frame` right away?
>Yes, that's what I've been told (maybe even *you* did).
No, that wasn't me :smile:, it's actually my first PR. Also note that I am not a Wine expert, so please forgive me if I miss something obvious.
>Now the other part where unwinding doesn't work well is with the other side of syscalls, with KiUserCallbackDispatcher callbacks.
Unfortunately, I am not familiar with `KiUserCallbackDispatcher`, so I can't tell if there is more work to be done or how this could be fixed.
We basically looked at reported unwinding errors in Orbit and found that missing unwind info (CFI) was the root cause of most of them. With this, and some adjustments to our unwinder to cope with certain corner cases of PE/COFF files, we ended up with less than 1% of unwinding errors.
Actually, another issue that kept us quite busy is that the `kernel_stack` is not allocated on the "normal"/"user" stack. That apparently used to be the case before Wine 7. As a result, Linux performance collection tools, such as anything built on `perf_event_open`, fail to collect the "user-space" part of the stack. For debugging this is fine, as the program execution is halted and we can just read the data from memory, but for profiling we needed to work around that massively.
Does anyone know the reason for the two separate stacks? Would it maybe be possible to actually allocate that `kernel_stack` on the normal stack?
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1065#note_10922
Valgrind support requires a fork, which I've published to https://gitlab.winehq.org/rbernon/valgrind. The fork implements loading DWARF debug info from PE files, instead of the old and broken upstream PDB support. I've tried to upstream these changes a long time ago but didn't receive any feedback.
I think we could maybe consider keeping a fork, which I'm happy to maintain, as the changes aren't too large. We may want to investigate adding 32-on-64 support, which may require a few more changes (to VEX specifically, because its amd64 guest doesn't support segment register manipulation).
The changes here are not all related to Valgrind, and I'll create separate MRs for those which may make sense independently from Valgrind / GDB.
Also included is a suppression file to silence some annoying false positives, many of which come from the cross-stack accesses during syscalls, which confuse Valgrind's stack heuristics. One can try this out with something like:
`WINELOADERNOEXEC=1 valgrind --suppressions=tools/valgrind.supp wine64/loader/wine64 wine64/programs/winecfg/winecfg.exe`
--
https://gitlab.winehq.org/wine/wine/-/merge_requests/1074