To be honest I find this much less readable than the current code.
It's a little long, for sure. The other advantage is that fewer atomics translate to higher performance in the uncontended case. We won't know how it actually performs in practice without experimenting, though.
The loop suggests that it can somehow spin there, which I don't think it ever should.
For what it's worth, many interlocked (atomic) operations are actually implemented in terms of either compare-and-swap or LL/SC. As an example, `InterlockedAnd` uses `lock cmpxchg` if its return value is used; see https://godbolt.org/z/vhdEc8Yfc and https://godbolt.org/z/9c4dxaKhx.
In general, the CAS loop is a fairly common pattern for implementing complex read-modify-write atomic operations, both inside the Wine codebase (grep `while.*InterlockedCompareExchange`) and in other projects ([example][1]).
As for the forward progress guarantee, the loop is guaranteed to terminate in a bounded number of iterations even under contention, since concurrent threads can only set additional bits in `group->free_bits` via `InterlockedOr`, so its value changes monotonically. Each failed compare-exchange means another thread has set at least one more bit, and there are only finitely many bits left to set.
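To make the pattern concrete, here is a minimal sketch in portable C11 atomics rather than the Win32 `Interlocked*` functions. It is not the code from this merge request: `struct group` is reduced to a single field, and `release_bits` and `claim_lowest_bit` are hypothetical names chosen for illustration. It assumes a single owning thread claims bits while any other thread may only set them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the structure under discussion. */
struct group
{
    _Atomic uint32_t free_bits;
};

/* May be called by any thread. It only ever sets bits, never clears them. */
static void release_bits(struct group *group, uint32_t mask)
{
    assert(mask != 0); /* releasing nothing is treated as a caller bug in this sketch */
    atomic_fetch_or(&group->free_bits, mask);
}

/* Assumed to be called by the owning thread only. Claims the lowest free bit
 * and returns it as a mask, or returns 0 if no bit is free. */
static uint32_t claim_lowest_bit(struct group *group)
{
    uint32_t old = atomic_load(&group->free_bits);

    while (old)
    {
        uint32_t bit = old & (0u - old); /* isolate the lowest set bit */

        /* The strong variant fails only if free_bits changed, i.e. another
         * thread set more bits. On failure, old is reloaded with the
         * current value and the loop retries. */
        if (atomic_compare_exchange_strong(&group->free_bits, &old, old & ~bit))
            return bit;
    }
    return 0;
}
```

Under these assumptions the retry count in one call is limited by the number of bits that were still clear when the loop started, which is the same argument as above. The strong compare-exchange is used deliberately: the weak variant may fail spuriously on LL/SC architectures, which would weaken that bound.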
[1]: https://www.kernel.org/doc/html/v4.10/core-api/atomic_ops.html#atomic-bitmas...