STRD is an ARMv7 instruction that stores two consecutive 32-bit words in one operation. A single STRD saves one clock cycle compared to two LDRs. Shortened code also leads to less time spent on instruction fetch and more efficient I-cache utilization.
Running llvm-mca --timeline on the old code reports: 0123 Index 0123456789
[0,0] DeeeeER . . ldr.w r1, [r1, #472] [0,1] D====eER . . add.w r0, r1, #16 [0,2] D=====eeeeeER. stm.w r0, {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr} [0,3] .DeE--------R. add r2, sp, #16 [0,4] .D=====eE---R. str r2, [r1, #56] [0,5] .D======eE--R. str r3, [r1, #60] [0,6] . D---------R. mrs r0, apsr [0,7] . DeeE------R. bfi r0, lr, #5, #1 [0,8] . D======eE-R. str r0, [r1, #64] [0,9] . DeE-------R. mov.w r0, #0 [0,10] . D======eER. str r0, [r1, #68] [0,11] . D======E-R. vmrs r0, fpscr [0,12] . D=======eER str r0, [r1, #72] [0,13] . D=eE------R add.w r0, r1, #96
Running llvm-mca --timeline on the new code reports: 012 Index 0123456789
[0,0] DeeeeER . . ldr.w r1, [r1, #472] [0,1] D====eER . . add.w r0, r1, #16 [0,2] D=====eeeeeER stm.w r0, {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr} [0,3] .DeE--------R add r2, sp, #16 [0,4] .D=====eE---R strd r2, r3, [r1, #56] [0,5] .D----------R mrs r0, apsr [0,6] .DeeE-------R bfi r0, lr, #5, #1 [0,7] . D=====eE--R str r0, [r1, #64] [0,8] . DeE-------R mov.w r0, #0 [0,9] . D======eE-R str r0, [r1, #68] [0,10] . D=====E--R vmrs r0, fpscr [0,11] . D======eER str r0, [r1, #72] [0,12] . D=eE-----R add.w r0, r1, #96
Signed-off-by: Jinoh Kang jinoh.kang.kr@gmail.com --- dlls/ntdll/unix/signal_arm.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/dlls/ntdll/unix/signal_arm.c b/dlls/ntdll/unix/signal_arm.c index a1bcb0ddd32..8c8750153df 100644 --- a/dlls/ntdll/unix/signal_arm.c +++ b/dlls/ntdll/unix/signal_arm.c @@ -1157,8 +1157,7 @@ __ASM_GLOBAL_FUNC( __wine_syscall_dispatcher, "add r0, r1, #0x10\n\t" "stm r0, {r4-r12,lr}\n\t" "add r2, sp, #0x10\n\t" - "str r2, [r1, #0x38]\n\t" - "str r3, [r1, #0x3c]\n\t" + "strd r2, r3, [r1, #0x38]\n\t" "mrs r0, CPSR\n\t" "bfi r0, lr, #5, #1\n\t" /* set thumb bit */ "str r0, [r1, #0x40]\n\t"
LDRD is an ARMv7 instruction that loads two consecutive 32-bit words in one operation. A single LDRD saves one clock cycle compared to two LDRs.
The optimization introduced by this commit may not significantly affect performance, since the final LDM instruction may have to wait for the previous load instruction(s) to free up the load pipeline. However, future modifications that reduce the load pressure from the succeding instructions can benefit from this optimization. Also, shortened code leads to less time spent on instruction fetch and more efficient I-cache utilization.
Running llvm-mca --timeline on the old code reports: 01234567 Index 0123456789
[0,0] DeER . . . . tst.w r12, #2 [0,1] D--R . . . . it ne [0,2] .DeeeeER . . . ldmne.w r8, {r0, r1, r2, r3} [0,3] . D===eeeeER . . ldr.w lr, [r8, #60] [0,4] . D====eeeeER . . ldr.w sp, [r8, #56] [0,5] . DeE------R . . add.w r8, r8, #16 [0,6] . D===eeeeeeeeER ldm.w r8, {r4, r5, r6, r7, r8, r9, r10, r11, r12, pc}
Running llvm-mca --timeline on the new code reports: 01234567 Index 0123456789
[0,0] DeER . . . . tst.w r12, #2 [0,1] D--R . . . . it ne [0,2] .DeeeeER . . . ldmne.w r8, {r0, r1, r2, r3} [0,3] . D===eeeeER . . ldrd sp, lr, [r8, #56] [0,4] . DeE-----R . . add.w r8, r8, #16 [0,5] . D===eeeeeeeeER ldm.w r8, {r4, r5, r6, r7, r8, r9, r10, r11, r12, pc}
Signed-off-by: Jinoh Kang jinoh.kang.kr@gmail.com --- dlls/ntdll/unix/signal_arm.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/dlls/ntdll/unix/signal_arm.c b/dlls/ntdll/unix/signal_arm.c index 8c8750153df..c63f669668c 100644 --- a/dlls/ntdll/unix/signal_arm.c +++ b/dlls/ntdll/unix/signal_arm.c @@ -1208,8 +1208,7 @@ __ASM_GLOBAL_FUNC( __wine_syscall_dispatcher, "tst ip, #2\n\t" /* CONTEXT_INTEGER */ "it ne\n\t" "ldmne r8, {r0-r3}\n\t" - "ldr lr, [r8, #0x3c]\n\t" - "ldr sp, [r8, #0x38]\n\t" + "ldrd sp, lr, [r8, #0x38]\n\t" "add r8, r8, #0x10\n\t" "ldm r8, {r4-r12,pc}\n" "5:\tmovw r0, #0x000d\n\t" /* STATUS_INVALID_PARAMETER */
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=107343
Your paranoid android.
=== debian11 (32 bit Chinese:China report) ===
ntdll: virtual: Timeout
=== debian11 (32 bit WoW report) ===
ntdll: virtual: Timeout
Shortened code leads to less time spent on instruction fetch and more efficient I-cache utilization.
Signed-off-by: Jinoh Kang jinoh.kang.kr@gmail.com --- dlls/ntdll/unix/signal_arm.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/dlls/ntdll/unix/signal_arm.c b/dlls/ntdll/unix/signal_arm.c index c63f669668c..9bd03cc8aae 100644 --- a/dlls/ntdll/unix/signal_arm.c +++ b/dlls/ntdll/unix/signal_arm.c @@ -1208,9 +1208,8 @@ __ASM_GLOBAL_FUNC( __wine_syscall_dispatcher, "tst ip, #2\n\t" /* CONTEXT_INTEGER */ "it ne\n\t" "ldmne r8, {r0-r3}\n\t" - "ldrd sp, lr, [r8, #0x38]\n\t" - "add r8, r8, #0x10\n\t" - "ldm r8, {r4-r12,pc}\n" + "ldrd sp, lr, [r8, #0x38]!\n\t" + "ldmdb r8, {r4-r12,pc}\n" "5:\tmovw r0, #0x000d\n\t" /* STATUS_INVALID_PARAMETER */ "movt r0, #0xc000\n\t" "add sp, sp, #0x10\n\t"
Signed-off-by: Jinoh Kang jinoh.kang.kr@gmail.com --- dlls/ntdll/unix/signal_arm.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/dlls/ntdll/unix/signal_arm.c b/dlls/ntdll/unix/signal_arm.c index 9bd03cc8aae..9e8a227f2e6 100644 --- a/dlls/ntdll/unix/signal_arm.c +++ b/dlls/ntdll/unix/signal_arm.c @@ -1206,9 +1206,11 @@ __ASM_GLOBAL_FUNC( __wine_syscall_dispatcher, "3:\n\t" #endif "tst ip, #2\n\t" /* CONTEXT_INTEGER */ - "it ne\n\t" - "ldmne r8, {r0-r3}\n\t" - "ldrd sp, lr, [r8, #0x38]!\n\t" + "beq 3f\n\t" + "ldr r4, [r8, #0x40]\n\t" + "ldm r8, {r0-r3}\n\t" + "msr cpsr_fsxc, r4\n" + "3:\tldrd sp, lr, [r8, #0x38]!\n\t" "ldmdb r8, {r4-r12,pc}\n" "5:\tmovw r0, #0x000d\n\t" /* STATUS_INVALID_PARAMETER */ "movt r0, #0xc000\n\t"
Hi,
While running your changed tests, I think I found new failures. Being a bot and all I'm not very good at pattern recognition, so I might be wrong, but could you please double-check?
Full results can be found at: https://testbot.winehq.org/JobDetails.pl?Key=107345
Your paranoid android.
=== debian11 (64 bit WoW report) ===
ntdll: change.c:277: Test failed: should be ready
We had to avoid strd previously due to some processors not supporting it [1]. Is this instruction portable enough?
On Wed, Feb 9, 2022, 3:29 AM Zebediah Figura zfigura@codeweavers.com wrote:
We had to avoid strd previously due to some processors not supporting it [1]. Is this instruction portable enough?
LDRD and STRD are ARMv7 instructions. There is at least one instance of LDRD in dlls/ntdll/signal_arm.c, in call_consolidate_callback. If we're aiming for pre-ARMv7 CPUs, we might as well have to remove it.
Otherwise I'd just blame the assembler. Maybe use .inst instead?
Jin-oh Kang jinoh.kang.kr@gmail.com writes:
On Wed, Feb 9, 2022, 3:29 AM Zebediah Figura zfigura@codeweavers.com wrote:
We had to avoid strd previously due to some processors not supporting it [1]. Is this instruction portable enough?
[1] https://bugs.winehq.org/show_bug.cgi?id=44365
LDRD and STRD are ARMv7 instructions. There is at least one instance of LDRD in dlls/ntdll/signal_arm.c, in call_consolidate_callback. If we're aiming for pre-ARMv7 CPUs, we might as well have to remove it.
These days we basically require ARMv7, for multiple reasons.