Alexandros Frantzis (@afrantzis) commented about dlls/winewayland.drv/window_surface.c:
    if (width_bytes == src_stride && width_bytes == dst_stride)
    {
        memcpy(dst, src, height * width_bytes);
        for (x = 3; x < height * width_bytes; x += bpp) dst[x] = 0xff;
From a few synthetic benchmarks I ran locally, this change has a significant performance cost: it takes 1.7x-3x as long as just doing the memcpy, with the relative overhead being higher for smaller regions.
Switching to a manual 32-bit pixel copy that sets the alpha in the same pass is a definite improvement; the results I am seeing are in the range of 1.2x-2x compared to just the memcpy:
```
width = rc.right - rc.left;
for (x = 0; x < height * width; ++x)
    ((UINT32 *)dst)[x] = ((UINT32 *)src)[x] | 0xff000000;
```
So perhaps it's worth switching to such a loop?
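
For anyone who wants to compare the two approaches outside of the driver, here is a minimal standalone sketch. The function names (`copy_then_set_alpha`, `copy_with_alpha`, `variants_agree`) are hypothetical and not part of the patch. It assumes 4 bytes per pixel, contiguous rows, and a little-endian host, so that byte 3 of each pixel is the same alpha channel that `0xff000000` sets; on a big-endian host the two variants would not be equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Variant A (two passes): bulk copy, then write the alpha byte of each pixel.
 * Assumes 4 bytes per pixel and a little-endian host (alpha is byte 3). */
static void copy_then_set_alpha(uint8_t *dst, const uint8_t *src, size_t pixel_count)
{
    size_t i;

    assert(dst && src);
    memcpy(dst, src, pixel_count * 4);
    for (i = 0; i < pixel_count; i++) dst[i * 4 + 3] = 0xff;
}

/* Variant B (single pass): copy each 32-bit pixel and force alpha to 0xff. */
static void copy_with_alpha(uint32_t *dst, const uint32_t *src, size_t pixel_count)
{
    size_t i;

    assert(dst && src);
    for (i = 0; i < pixel_count; i++) dst[i] = src[i] | 0xff000000u;
}

/* Returns 1 if both variants produce identical output for a small sample. */
static int variants_agree(void)
{
    const uint32_t src[4] = { 0x00000000u, 0x00123456u, 0x80abcdefu, 0xffffffffu };
    uint32_t a[4], b[4];

    copy_then_set_alpha((uint8_t *)a, (const uint8_t *)src, 4);
    copy_with_alpha(b, src, 4);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

Variant B touches each destination pixel only once, which is probably where the difference comes from, but that is an assumption on my part; the actual numbers will depend on the compiler, optimization level and region size.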