[PATCH v2 0/28] MR9928: Draft: dsound: Speed up resampling.
This implements a number of optimizations, in particular:

- Swapping around the resampling loops in case of downsampling, allowing the FIR step to stay fixed regardless of the resampling ratio.
- Rearranging the FIR array elements to make the access sequential.
- Adding SSE versions of the resampling functions.

Together, these amount to more than a 5x reduction of `cp_fields_resample` execution time. The quality of the resampling should be the same, or even improve slightly, due to a more precise `rem` calculation and removal of the FIR step rounding, although I haven't yet conducted any measurements.

-- v2:
dsound: Add a 32-bit SSE version of downsample.
dsound: Add a 32-bit SSE version of upsample.

This merge request has too many patches to be relayed via email. Please visit the URL below to see the contents of the merge request.
https://gitlab.winehq.org/wine/wine/-/merge_requests/9928
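To illustrate the second bullet, here is a minimal sketch of reordering a FIR table so that all taps of one phase sit next to each other. The layout assumed here (tap `j` of phase `p` stored at `src[j * phases + p]`) and the helper names `fir_to_phase_major` and `fir_dot` are hypothetical; the thread does not show the actual table layout or code used by the MR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: reorder an interleaved FIR table, where tap j of
 * phase p lives at src[j * phases + p], into phase-major order, where it
 * lives at dst[p * taps + j]. */
static void fir_to_phase_major(const float *src, float *dst,
                               size_t phases, size_t taps)
{
    size_t p, j;

    assert(src != dst); /* the reordering is not done in place */
    for (p = 0; p < phases; ++p)
        for (j = 0; j < taps; ++j)
            dst[p * taps + j] = src[j * phases + p];
}

/* With the phase-major table, the inner loop reads both arrays
 * sequentially, which is also the access pattern SIMD loads want. */
static float fir_dot(const float *phase_coeffs, const float *in, size_t taps)
{
    float sum = 0.0f;
    size_t j;

    for (j = 0; j < taps; ++j)
        sum += phase_coeffs[j] * in[j];
    return sum;
}
```

A caller would pick the phase from the fractional position and pass `dst + phase * taps` to `fir_dot`, instead of striding through the table by `phases` on every tap.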
On Sat Jan 24 06:47:40 2026 +0000, Matteo Bruni wrote:
Hi Anton! I ran this MR through my tests and it looks pretty good!

I wrote a couple of tools during my dsound work: a "loopback" one to record the output of the resampler when given an impulse or sine wave signal in input, to be able to study the output for distortion and such, and a "mixer" one which plays a bunch of audio buffers with different parameters (notably frequency) at the same time, to investigate performance - by manually looking at the CPU usage of the process from top :sweat_smile: I realize just now that I should probably clean them up a bit and make them available...

Anyway, the loopback output looks good. The distortion I mentioned previously in !9588 is indeed gone (probably fixed by 0846c910ad4a31ad05bd5891d9c7e9ba92839241) and I don't see obvious new artifacts.

As for performance, this is what I get with the mixer test with 128 buffers:

- this one (FIR) SSE: 42% CPU usage
- this one (FIR) non-SSE: 64%
- !9588 (cubic) SSE: 27%
- !9588 (cubic) non-SSE: 31%

The numbers fluctuate quite a bit but this should be a reasonably fair representation of the relative performance.

Giving it a look with perf, most of the time is spent inside the resampler's inner loop, symbol `upsample_sse.L2` (~35% of total system-wide time, according to the tool). After it `DSOUND_MixToPrimary` still takes about 13%, `putieee32` almost 5%, the rest (including other parts of the resampler) below that. It looks like my "general mixer" improvements should help a bit here as well, although proportionally they will make significantly less of an impact. It's possible that a handmade SSE version can squeeze a bit more performance out of it, although it's clear that this is largely up to the huge complexity difference between the two filtering algorithms.

For reference, with !9588 `DSOUND_MixToPrimary` takes 24% of the CPU samples, `putieee32` is at 5.5% and `cubic_resample_sse2` only comes in 3rd at 5.25%. That shows that "mixing performance" in !9588 is dragged down by things other than the resampler and suggests that we can afford a slower resampler, up to a point. I haven't retested the game with this MR yet (I'll do it soon and report back) but my guess is that it's fast enough for our needs.

I had only a quick look at the actual patches but they generally look very reasonable. From what I'd seen here, I don't think it's much of a big deal to avoid 64-bit integers, even on 32-bit, so maybe that part is mostly unnecessary. Actually in one of my followup patches I start storing the buffer "subsample" cursor position in fixed point, which allows some simplifications throughout the mixer. See https://gitlab.winehq.org/Mystral/wine/-/commit/aca8b39927dd75268cb18fe19307... for the general idea.

Hi Matteo!
> I realize just now that I should probably clean them up a bit and make them available...

That would be great.

> From what I'd seen here, I don't think it's much of a big deal to avoid 64-bit integers, even on 32-bit, so maybe that part is mostly unnecessary.

That's mainly to simplify the assembly, which is already quite hard to follow.

> Actually in one of my followup patches I start storing the buffer "subsample" cursor position in fixed point, which allows some simplifications throughout the mixer.

Thanks for the idea. Fixed point might help eliminate the divisions in the outer loops of `downsample` and `upsample`. Although instead of 48.16, I'd go for 32.32, as this would make the resampling ratios more precise and also give a shorter assembly code. And in case of downsampling, I'd actually invert the fraction so that `freq_adjust_num` is fixed as we are dividing by `freq_adjust_num` there.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/9928#note_127902
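As a rough illustration of the 32.32 idea, the sketch below keeps the read position in a 64-bit accumulator (integer sample index in the high 32 bits, fraction in the low 32 bits), so the per-sample loop needs one addition and one shift instead of a division. The helper names `make_step_32_32` and `resample_linear_fixed` are hypothetical, and linear interpolation stands in for the FIR only to keep the example short; this is not the code from the MR or from the linked commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: input samples consumed per output sample,
 * as a 32.32 fixed-point number. The single division happens here,
 * once per rate change rather than once per sample. */
static uint64_t make_step_32_32(uint32_t in_rate, uint32_t out_rate)
{
    assert(out_rate != 0);
    return ((uint64_t)in_rate << 32) / out_rate;
}

/* Walk the input with a 32.32 cursor. Linear interpolation is used
 * purely for illustration. Returns the number of samples written. */
static size_t resample_linear_fixed(const float *in, size_t in_len,
                                    float *out, size_t out_cap,
                                    uint64_t step)
{
    uint64_t pos = 0; /* high 32 bits: index, low 32 bits: fraction */
    size_t n = 0;

    while (n < out_cap)
    {
        size_t i = (size_t)(pos >> 32);
        float frac;

        if (i + 1 >= in_len)
            break; /* need in[i] and in[i + 1] */
        frac = (float)(uint32_t)pos * (1.0f / 4294967296.0f);
        out[n++] = in[i] + (in[i + 1] - in[i]) * frac;
        pos += step; /* no division in the loop */
    }
    return n;
}
```

For example, `make_step_32_32(24000, 48000)` yields a step of one half, so each input interval produces two output samples.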
> > I realize just now that I should probably clean them up a bit and make them available...
>
> That would be great.

I pushed them to https://gitlab.winehq.org/Mystral/audio-test-tools. They're still not especially pretty, but they should get the job done.

> > Actually in one of my followup patches I start storing the buffer "subsample" cursor position in fixed point, which allows some simplifications throughout the mixer.
>
> Thanks for the idea. Fixed point might help eliminate the divisions in the outer loops of `downsample` and `upsample`. Although instead of 48.16, I'd go for 32.32, as this would make the resampling ratios more precise and also give a shorter assembly code. And in case of downsampling, I'd actually invert the fraction so that `freq_adjust_num` is fixed as we are dividing by `freq_adjust_num` there.

Right, it seems to offer opportunities for simplification elsewhere as well. Sure, no problems with picking up a different split.

I did test the MR a bit with the game I wrote !9588 for. It's certainly a large improvement but I don't feel like it's rock solid in the "for sure it's not going to be an issue anymore" territory. TLDR: I think we want to simplify the filter as well.

As I mentioned in https://gitlab.winehq.org/wine/wine/-/merge_requests/9588#note_127395, the FIR we are currently using is very complex. I'm convinced it's too complex, in fact. Looking at the dsound impulse response on Win10 (e.g. by running "loopback i" and opening the capture.wav file on Audacity) you can see that 8 output samples are non-0 for each impulse, and they're shaped like the first 2 lobes of a sinc i.e. they're very likely using a 4-tap sinc filter. dsoal goes even further and uses cubic interpolation by default.

FTR, I ended up picking that one in !9588 because I wanted to be sure that the resampling filter wouldn't be a problem going forward. Sticking with the current sinc filter is fine but I think we want to tweak the parameters to bring it roughly in line with the complexity of the native filter. I'm going to attach a few patches to make_fir showing a few options. Unsurprisingly, performance improves a lot with shorter filters.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/9928#note_128291
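For a sense of how small a cubic interpolator is compared to a long FIR, here is a minimal sketch of one common 4-point cubic (the Catmull-Rom form). This is an assumption chosen for illustration: the thread does not say which cubic dsoal or !9588 actually use, and the function name `cubic_interp` is hypothetical.

```c
#include <assert.h>

/* Catmull-Rom cubic through four consecutive samples. t in [0, 1] is
 * the fractional position between y0 and y1; t = 0 returns y0 and
 * t = 1 returns y1. Only 4 input samples contribute per output. */
static float cubic_interp(float ym1, float y0, float y1, float y2, float t)
{
    float a, b, c;

    assert(t >= 0.0f && t <= 1.0f);
    a = -0.5f * ym1 + 1.5f * y0 - 1.5f * y1 + 0.5f * y2;
    b = ym1 - 2.5f * y0 + 2.0f * y1 - 0.5f * y2;
    c = -0.5f * ym1 + 0.5f * y1;
    /* Horner evaluation of a*t^3 + b*t^2 + c*t + y0 */
    return ((a * t + b) * t + c) * t + y0;
}
```

Each output sample costs a handful of multiply-adds on 4 inputs, which is consistent with the resampler no longer dominating the profile when the filter gets this short.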
On Wed Jan 28 16:57:57 2026 +0000, Matteo Bruni wrote:
> > > I realize just now that I should probably clean them up a bit and make them available...
> >
> > That would be great.
> I pushed them to https://gitlab.winehq.org/Mystral/audio-test-tools. They're still not especially pretty, but they should get the job done.
> > > Actually in one of my followup patches I start storing the buffer "subsample" cursor position in fixed point, which allows some simplifications throughout the mixer.
> >
> > Thanks for the idea. Fixed point might help eliminate the divisions in the outer loops of `downsample` and `upsample`. Although instead of 48.16, I'd go for 32.32, as this would make the resampling ratios more precise and also give a shorter assembly code. And in case of downsampling, I'd actually invert the fraction so that `freq_adjust_num` is fixed as we are dividing by `freq_adjust_num` there.
> Right, it seems to offer opportunities for simplification elsewhere as well. Sure, no problems with picking up a different split.
> I did test the MR a bit with the game I wrote !9588 for. It's certainly a large improvement but I don't feel like it's rock solid in the "for sure it's not going to be an issue anymore" territory. TLDR: I think we want to simplify the filter as well.
> As I mentioned in https://gitlab.winehq.org/wine/wine/-/merge_requests/9588#note_127395, the FIR we are currently using is very complex. I'm convinced it's too complex, in fact. Looking at the dsound impulse response on Win10 (e.g. by running "loopback i" and opening the capture.wav file on Audacity) you can see that 8 output samples are non-0 for each impulse, and they're shaped like the first 2 lobes of a sinc i.e. they're very likely using a 4-tap sinc filter. dsoal goes even further and uses cubic interpolation by default.
> FTR, I ended up picking that one in !9588 because I wanted to be sure that the resampling filter wouldn't be a problem going forward. Sticking with the current sinc filter is fine but I think we want to tweak the parameters to bring it roughly in line with the complexity of the native filter. I'm going to attach a few patches to make_fir showing a few options. Unsurprisingly, performance improves a lot with shorter filters.

- [dsound-fir-order24.txt](/uploads/9956a4793cc611e0cfdcbae5193ccc47/dsound-fir-order24.txt): This one generates an order 24 filter, mostly keeping the general response characteristics of the original filter.
- [dsound-fir-order8.txt](/uploads/c9f67299b558532451a51e40fa71aa01/dsound-fir-order8.txt): This is the simplest filter we can get with the current generation approach. Still twice as long as Win10 / Win11, starts breaking WRT aliasing (which I'm pretty sure we can fix somehow, Windows doesn't have this issue).
- [dsound-fir-order4.txt](/uploads/b926e76bea683ec74cbd2d1f6cc79858/dsound-fir-order4.txt): This is just a rough draft, most certainly missing pieces, like the resampler changes. The idea is to only store 1 "wing" but still fit the first 2 lobes. I haven't really thought it through, there might be issues with this approach.

For reference, with this MR and the order 4 FIR (but otherwise code unchanged) here I get "mixer 128" at maybe 30% CPU usage, while perf shows `DSOUND_MixToPrimary` at 21%, `putieee32` around 8% and both `upsample_sse.L3` and `upsample_sse.L2` close to 6.5%. Once my mixer patches get rid of the top two symbols in this perf output here, this should go safely into "fast enough" territory.

Bottom line: I'm okay with optimizing the current resampler along the lines of this MR (again, great job, I thought it was unsalvageable :sweat_smile:). Afterwards I'll want to simplify the actual filter as well, ideally to be competitive against modern Windows in performance and quality.
-- https://gitlab.winehq.org/wine/wine/-/merge_requests/9928#note_128292
v2:
- Use `__ASM_CFI` for the CFI directives to fix the arm64 build failure.

-- https://gitlab.winehq.org/wine/wine/-/merge_requests/9928#note_129003
participants (2)
- Anton Baskanov (@baskanov)
- Matteo Bruni (@Mystral)