On Tue Feb 14 14:55:14 2023 +0000, Huw Davies wrote:
The optimization below doesn't seem to be worth it. On x86_64 I couldn't measure a difference between the algorithm above and the optimization below. On i386, while the optimization was about 10% faster, the algorithm above matched native's performance, so there seems little point in adding the complexity. If you really need the optimization, you could potentially introduce it in a later MR.
The optimization was an attempt to implement it in an unusual way, so that the algorithm would be unlikely to match any particular existing implementation.
I'll switch to the modulo one anyway.