This is v7 of this series. The six previous submissions can be found here [1], here [2], here [3], here [4], here [5] and here [6]. This version addresses the comments received on v6, improves the handling of exceptions unrelated to UMIP, and covers corner cases in virtual-8086 mode. Please see details in the change log.
=== What is UMIP?
User-Mode Instruction Prevention (UMIP) is a security feature present in new Intel processors. If enabled, it prevents the execution of certain instructions when the Current Privilege Level (CPL) is greater than 0. If these instructions could be executed with CPL > 0, user space applications would have access to system-wide settings such as the bases of the global and interrupt descriptor tables, the segment selectors of the current task state and the local descriptor table, and the machine status word. Hiding these system resources reduces the tools available to craft privilege escalation attacks such as [7].
These are the instructions covered by UMIP:
 * SGDT - Store Global Descriptor Table
 * SIDT - Store Interrupt Descriptor Table
 * SLDT - Store Local Descriptor Table
 * SMSW - Store Machine Status Word
 * STR  - Store Task Register
If any of these instructions is executed with CPL > 0, a general protection exception is issued when UMIP is enabled.
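For illustration, the following user space snippet (a hypothetical test of mine, not part of this series) executes SGDT. On a kernel with UMIP enabled and no emulation, it would receive SIGSEGV; with the emulation in this series, it prints the dummy limit of 0:

/*
 * Hypothetical user space test, not part of this series. Without UMIP
 * (or with the emulation in this series), SGDT succeeds at CPL 3; with
 * CR4.UMIP set and no emulation, it raises #GP, seen as SIGSEGV.
 */
#include <stdio.h>

int main(void)
{
	unsigned char gdtr[10];	/* 6 bytes on 32-bit, 10 on 64-bit */

	asm volatile("sgdt %0" : "=m" (gdtr));
	printf("GDT limit: 0x%x\n", *(unsigned short *)gdtr);
	return 0;
}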
=== How does it impact applications?
We want to have UMIP enabled by default. However, UMIP changes the behavior that certain applications expect from the operating system. For instance, programs running on Wine and DOSEMU2 rely on some of these instructions to function. Stas Sergeev found that Microsoft Windows 3.1 and dos4gw use the instruction SMSW when running in virtual-8086 mode [8]. SGDT and SIDT can also be executed in virtual-8086 mode.
In order to not change the behavior of the system, this patchset emulates SGDT, SIDT and SMSW. This should be sufficient to not break the applications mentioned above. Regarding the two remaining instructions, the WineHQ team has shown interest in catching the general protection fault caused by STR and SLDT and using it as a vehicle to fix broken applications [9]. Furthermore, STR and SLDT can only be executed in protected and long modes.
DOSEMU2 emulates virtual-8086 mode via KVM. No applications will be broken unless DOSEMU2 decides to enable the CR4.UMIP bit in platforms that support it. Also, this should not pose a security risk, as no system resources would be revealed. Instead, code running inside the KVM guest would only see the guest's GDT, IDT and MSW.
Please note that UMIP is always enabled for both 64-bit and 32-bit Linux builds. However, emulation of the UMIP-protected instructions is not done for 64-bit processes. 64-bit user space applications will receive the SIGSEGV signal when a UMIP-protected instruction causes a general protection fault.
=== How are UMIP-protected instructions emulated?
This version keeps UMIP enabled at all times and by default. If a general protection fault caused by one of the instructions protected by UMIP is detected, the fault is fixed up by returning dummy values as follows:
 * SGDT and SIDT return hard-coded dummy values as the base of the global
   descriptor and interrupt descriptor tables. These hard-coded values
   correspond to memory addresses that are near the end of the kernel
   memory map. This is also the case for virtual-8086 mode tasks. In all
   my experiments on x86_32, the base of the GDT and IDT was always a
   4-byte address, even for 16-bit operands. Thus, my emulation code does
   the same. In all cases, the limit of the table is set to 0.
 * SMSW returns the value with which the CR0 register is programmed in
   head_32/64.S at boot time. That is, the following bits are enabled:
   CR0.0 (Protection Enable), CR0.1 (Monitor Coprocessor), CR0.4
   (Extension Type, which will always be 1 in recent processors with
   UMIP), CR0.5 (Numeric Error), CR0.16 (Write Protect) and CR0.18
   (Alignment Mask). As per the Intel 64 and IA-32 Architectures Software
   Developer's Manual, SMSW returns a 16-bit result for memory operands.
   However, when the operand is a register, the result can be up to
   CR0[63:0]. Since the emulation code only kicks in on x86_32, we return
   up to CR0[31:0].
 * The proposed emulation code handles faults that happen in both
   protected and virtual-8086 mode.
 * Again, STR and SLDT are not emulated.
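For reference, the SMSW dummy value described above amounts to the following bit composition (my illustration; the constant name is hypothetical, the X86_CR0_* masks come from arch/x86/include/uapi/asm/processor-flags.h):

/*
 * Illustration of the bits that make up the dummy MSW (the constant
 * name is hypothetical):
 *
 *   X86_CR0_PE (bit 0)  = 0x00001
 *   X86_CR0_MP (bit 1)  = 0x00002
 *   X86_CR0_ET (bit 4)  = 0x00010
 *   X86_CR0_NE (bit 5)  = 0x00020
 *   X86_CR0_WP (bit 16) = 0x10000
 *   X86_CR0_AM (bit 18) = 0x40000
 */
#define UMIP_DUMMY_MSW	(X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | \
			 X86_CR0_NE | X86_CR0_WP | X86_CR0_AM)	/* 0x50033 */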
=== How is this series laid out?
++ Preparatory work
As per suggestions from Andy Lutomirski and Borislav Petkov, I moved the x86 page fault error codes to a header. Also, I made user_64bit_mode() available to x86_32 builds. This helps to reuse code and reduce the number of #ifdef's in these patches.
++ Fix bugs in MPX address evaluator
The code that Intel MPX (Memory Protection Extensions) uses to parse opcodes and the memory locations contained in the general purpose registers when used as operands proved very useful. I put some of this code in a separate library file that both MPX and UMIP can access, to avoid code duplication. Before creating the new library, I fixed a couple of bugs that I found in how MPX determines the address contained in the instruction operands.
++ Provide a new x86 instruction evaluating library
With the bugs fixed, the MPX evaluating code is relocated into a new insn-eval.c library. The basic functionality of this library is extended to obtain the segment descriptor selected by either segment override prefixes or the default segment of the registers involved in the calculation of the effective address. It was also extended to obtain the default address and operand sizes as well as the segment base address. Support to process 16-bit address encodings was added as well. Armed with this arsenal, it is now possible to determine the linear address onto which the emulated results shall be copied.
This code supports regular 32-bit and 64-bit (i.e., __USER32_CS and/or __USER_CS) protected mode, virtual-8086 mode, and 16-bit protected mode with a 32-bit base address.
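For virtual-8086 mode tasks, the library applies real-mode-style address translation. A minimal sketch of that computation, assuming the constraints described in the change log (effective addresses below 0x10000, 20-bit linear addresses); the helper name is mine, not the library's:

/*
 * Sketch (hypothetical helper name): real-mode-style address
 * translation as used in virtual-8086 mode. The effective address
 * wraps at 64KB and the linear address is limited to 20 bits.
 */
static unsigned long v8086_linear_addr(unsigned short seg, unsigned long eff_addr)
{
	eff_addr &= 0xffff;			  /* effective address < 0x10000 */
	return ((seg << 4) + eff_addr) & 0xfffff; /* 20-bit linear address */
}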
++ Emulate UMIP instructions
A new function, fixup_umip_exception(), inspects the instruction at the instruction pointer. If it is a UMIP-protected instruction, it executes the emulation code. This uses all the address-computing code of the previous section.
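A simplified sketch of how this fixup can hook into the general protection fault handler (the exact code in the patches may differ in details):

/*
 * Simplified sketch (the patches may differ in detail): give the UMIP
 * fixup a chance before the #GP fault is otherwise handled.
 */
dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code)
{
	if (static_cpu_has(X86_FEATURE_UMIP)) {
		if (user_mode(regs) && fixup_umip_exception(regs))
			return;	/* instruction emulated, resume the task */
	}

	/* ... regular #GP handling follows ... */
}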
++ Add self-tests
Lastly, self-tests are added to entry_from_vm86.c to exercise the most typical use cases of UMIP-protected instructions in virtual-8086 mode.
++ Extensive tests
Extensive tests were performed to cover all the combinations of ModRM, SIB and displacements for 16-bit and 32-bit encodings for the ss, ds, es, fs and gs segments. Tests also include a 64-bit program that uses segmentation via fs and gs. For this purpose, I temporarily enabled UMIP support for 64-bit processes; that change is not part of this patchset. The intention is to test the computation of linear addresses in 64-bit mode, including the extra R8-R15 registers. Extensive tests were also implemented for virtual-8086 tasks. The code of these tests can be found here [10] and here [11].
++ Merging this series?
Am I any closer to seeing these patches merged? :)
[1]. https://lwn.net/Articles/705877/
[2]. https://lkml.org/lkml/2016/12/23/265
[3]. https://lkml.org/lkml/2017/1/25/622
[4]. https://lkml.org/lkml/2017/2/23/40
[5]. https://lkml.org/lkml/2017/3/3/678
[6]. https://lkml.org/lkml/2017/3/7/866
[7]. http://timetobleed.com/a-closer-look-at-a-recent-privilege-escalation-bug-in...
[8]. https://www.winehq.org/pipermail/wine-devel/2017-April/117159.html
[10]. https://github.com/01org/luv-yocto/tree/rneri/umip/meta-luv/recipes-core/umi...
[11]. https://github.com/01org/luv-yocto/commit/a72a7fe7d68693c0f4100ad86de6ecabde...
Thanks and BR, Ricardo
Changes since V6:
 * Reworded and added more details on the special cases of ModRM and SIB
   bytes. To avoid confusion, I omitted mentioning the involved registers
   (EBP and ESP).
 * Replaced BUG() with printk_ratelimited() in function get_reg_offset()
   of insn-eval.c.
 * Removed unused utility functions that obtain a register value from
   pt_regs given a SIB base and index.
 * Clarified nomenclature to call CS, DS, ES, FS, GS and SS segment
   registers, and their values segment selectors.
 * Reworked function resolve_seg_register() to issue an error when more
   than one segment override prefix is used in the instruction.
 * Added logic in resolve_seg_register() to ignore the segment register
   when in long mode and not using FS or GS.
 * Added logic to ensure the effective address is within the limits of
   the segment in protected mode.
 * Added logic to ensure segment override prefixes are ignored when
   resolving the segments of EIP and EDI with string instructions.
 * Added code to make user_64bit_mode() available in CONFIG_X86_32...
   and make it return false, of course.
 * Merged the two functions that obtain the default address and operand
   sizes of a code segment into one, as they are always used together.
 * Corrected the logic of displacement-only addressing in long mode to
   make the displacement relative to the RIP of the next instruction.
 * Reworked the logic to sign-extend 32-bit memory offsets into 64-bit
   signed memory offsets. This includes more checks and putting it all
   together in a utility function.
 * Removed the 'unlikely' from conditional statements, as we are not in
   a critical path.
 * In virtual-8086 mode, ensure that effective addresses are always less
   than 0x10000, even when address override prefixes are used. Also,
   ensure that linear addresses have a size of 20 bits.
Changes since V5:
 * Relocated the page fault error code enumerations to traps.h.
Changes since V4:
 * Audited patches to use braces in all branches of conditional
   statements, except those in which the conditional action takes only
   one line.
 * Implemented support in 64-bit builds for both 32-bit and 64-bit tasks
   in the instruction evaluating library.
 * Split the segment selector function in the instruction evaluating
   library into two functions: one to resolve the segment type by
   instruction override or default, and a separate function to actually
   read the segment selector.
 * Fixed a bug when evaluating 32-bit effective addresses with 64-bit
   kernels.
 * Split patches further for easier review.
 * Used signed variables for the computation of effective addresses.
 * Fixed an issue with a spurious static modifier in function
   insn_get_addr_ref() found by the kbuild test bot.
 * Removed a comparison between true and fixup_umip_exception().
 * Reworked the check logic when identifying erroneous vs invalid values
   of the SIB base and index.
Changes since V3:
 * Limited emulation to 32-bit and 16-bit modes. For 64-bit mode, a
   general protection fault is still issued when UMIP-protected
   instructions are executed with CPL > 0.
 * Expanded the instruction-evaluating code to obtain segment descriptors
   along with their attributes, such as base address and default address
   and operand sizes. Also, support for 16-bit encodings in protected
   mode was implemented.
 * When getting a segment descriptor, this includes support to obtain
   those in a local descriptor table.
 * Now the instruction-evaluating code returns -EDOM when the value of a
   register should not be used in calculating the effective address. The
   value -EINVAL is left for errors.
 * Incorporated the value of the segment base address in the computation
   of linear addresses.
 * Renamed the new instruction evaluation library from insn-kernel.c to
   insn-eval.c.
 * Exported functions insn_get_reg_offset_*() to obtain the register
   offset by ModRM r/m, SIB base and SIB index.
 * Improved the documentation of functions.
 * Split patches further for easier review.
Changes since V2:
 * Added new utility functions to decode the memory addresses contained
   in registers when the 16-bit addressing encodings are used. This
   includes code to obtain and compute memory addresses using segment
   selectors for real-mode address translation.
 * Added support to emulate UMIP-protected instructions for virtual-8086
   tasks.
 * Added self-tests for virtual-8086 mode that contain representative
   use cases: address represented as a displacement, address in
   registers and registers as operands.
 * Instead of maintaining a static variable for the dummy base addresses
   of the IDT and GDT, a hard-coded value is used.
 * The emulated SMSW instruction now returns the value with which the
   CR0 register is programmed in head_32/64.S. This is: PE | MP | ET |
   NE | WP | AM. For x86_64, PG is also enabled.
 * The new file arch/x86/lib/insn-utils.c is now renamed to
   arch/x86/lib/insn-kernel.c. It also has its own header. This helps
   keep the kernel and objtool instruction decoders in sync. Also, the
   new insn-kernel.c contains utility functions that are only relevant
   in a kernel context.
 * Removed printed warnings for errors that occur when decoding
   instructions with invalid operands.
 * Added more comments on fixes in the instruction-decoding MPX
   functions.
 * Now user_64bit_mode(regs) is used instead of
   test_thread_flag(TIF_IA32) to determine whether the task is 32-bit or
   64-bit.
 * Found and fixed a bug in the instruction decoder in which
   X86_MODRM_RM was incorrectly used to obtain the mod part of the ModRM
   byte.
 * Added more explanatory comments in the emulation and instruction
   decoding code. This includes a comment noting that copy_from_user()
   could fail if a memory protection key is in place.
 * Tested the code with CONFIG_X86_DECODER_SELFTEST=y and everything
   passes now.
 * Prefixed get_reg_offset_rm() with insn_, as this function is exposed
   via a header file. For clarity, this function was added in a separate
   patch.
Changes since V1:
 * Virtual-8086 mode tasks are not treated in a special manner. All code
   for this purpose was removed.
 * Instead of attempting to disable UMIP during a context switch or when
   entering virtual-8086 mode, UMIP remains enabled all the time. General
   protection faults that occur are fixed up by returning dummy values as
   detailed above.
 * Removed the umip= kernel parameter in favor of using clearcpuid=514 to
   disable UMIP.
 * Removed self-tests designed to detect the absence of SIGSEGV signals
   when running in virtual-8086 mode.
 * Reused code from MPX to decode instruction operands. For this purpose,
   the code was put in a common location.
 * Fixed two bugs in the MPX code that decodes operands.
Ricardo Neri (26):
  ptrace,x86: Make user_64bit_mode() available to 32-bit builds
  x86/mm: Relocate page fault error codes to traps.h
  x86/mpx: Use signed variables to compute effective addresses
  x86/mpx: Do not use SIB.index if its value is 100b and ModRM.mod is not 11b
  x86/mpx: Do not use SIB.base if its value is 101b and ModRM.mod = 0
  x86/mpx, x86/insn: Relocate insn util functions to a new insn-eval file
  x86/insn-eval: Do not BUG on invalid register type
  x86/insn-eval: Add a utility function to get register offsets
  x86/insn-eval: Add utility function to identify string instructions
  x86/insn-eval: Add utility functions to get segment selector
  x86/insn-eval: Add utility function to get segment descriptor
  x86/insn-eval: Add utility functions to get segment descriptor base address and limit
  x86/insn-eval: Add function to get default params of code segment
  x86/insn-eval: Indicate a 32-bit displacement if ModRM.mod is 0 and ModRM.rm is 5
  x86/insn-eval: Incorporate segment base and limit in linear address computation
  x86/insn-eval: Support both signed 32-bit and 64-bit effective addresses
  x86/insn-eval: Handle 32-bit address encodings in virtual-8086 mode
  x86/insn-eval: Add support to resolve 16-bit addressing encodings
  x86/insn-eval: Add wrapper function for 16-bit and 32-bit address encodings
  x86/cpufeature: Add User-Mode Instruction Prevention definitions
  x86: Add emulation code for UMIP instructions
  x86/umip: Force a page fault when unable to copy emulated result to user
  x86/traps: Fixup general protection faults caused by UMIP
  x86: Enable User-Mode Instruction Prevention
  selftests/x86: Add tests for User-Mode Instruction Prevention
  selftests/x86: Add tests for instruction str and sldt
 arch/x86/Kconfig                              |   10 +
 arch/x86/include/asm/cpufeatures.h            |    1 +
 arch/x86/include/asm/disabled-features.h      |    8 +-
 arch/x86/include/asm/insn-eval.h              |   25 +
 arch/x86/include/asm/ptrace.h                 |    6 +-
 arch/x86/include/asm/traps.h                  |   18 +
 arch/x86/include/asm/umip.h                   |   15 +
 arch/x86/include/uapi/asm/processor-flags.h   |    2 +
 arch/x86/kernel/Makefile                      |    1 +
 arch/x86/kernel/cpu/common.c                  |   16 +-
 arch/x86/kernel/traps.c                       |    4 +
 arch/x86/kernel/umip.c                        |  286 +++++++
 arch/x86/lib/Makefile                         |    2 +-
 arch/x86/lib/insn-eval.c                      | 1066 +++++++++++++++++++++++++
 arch/x86/mm/fault.c                           |   88 +-
 arch/x86/mm/mpx.c                             |  120 +--
 tools/testing/selftests/x86/entry_from_vm86.c |   89 ++-
 17 files changed, 1580 insertions(+), 177 deletions(-)
 create mode 100644 arch/x86/include/asm/insn-eval.h
 create mode 100644 arch/x86/include/asm/umip.h
 create mode 100644 arch/x86/kernel/umip.c
 create mode 100644 arch/x86/lib/insn-eval.c
In its current form, user_64bit_mode() can only be used when CONFIG_X86_64 is selected. This implies that code built with CONFIG_X86_64=n cannot use it. If a piece of code needs to be built for both CONFIG_X86_64=y and CONFIG_X86_64=n and wants to use this function, it needs to wrap it in an #ifdef/#endif; potentially, in multiple places.
This can be easily avoided with a single #ifdef/#endif pair within user_64bit_mode() itself.
Suggested-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/ptrace.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 2b5d686..ea78a84 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -115,9 +115,9 @@ static inline int v8086_mode(struct pt_regs *regs)
 #endif
 }

-#ifdef CONFIG_X86_64
 static inline bool user_64bit_mode(struct pt_regs *regs)
 {
+#ifdef CONFIG_X86_64
 #ifndef CONFIG_PARAVIRT
 	/*
 	 * On non-paravirt systems, this is the only long mode CPL 3
@@ -128,8 +128,12 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 	/* Headers are too twisted for this to go in paravirt.h. */
 	return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;
 #endif
+#else /* !CONFIG_X86_64 */
+	return false;
+#endif
 }

+#ifdef CONFIG_X86_64
 #define current_user_stack_pointer()	current_pt_regs()->sp
 #define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
On Fri, May 05, 2017 at 11:16:59AM -0700, Ricardo Neri wrote:
In its current form, user_64bit_mode() can only be used when CONFIG_X86_64 is selected. This implies that code built with CONFIG_X86_64=n cannot use it. If a piece of code needs to be built for both CONFIG_X86_64=y and CONFIG_X86_64=n and wants to use this function, it needs to wrap it in an #ifdef/#endif; potentially, in multiple places.
This can be easily avoided with a single #ifdef/#endif pair within user_64bit_mode() itself.
Suggested-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/include/asm/ptrace.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)
Reviewed-by: Borislav Petkov <bp@suse.de>
Up to this point, only fault.c used the definitions of the page fault error codes. Thus, it made sense to keep them within such file. Other portions of code might be interested in those definitions too. For instance, the User-Mode Instruction Prevention emulation code will use such definitions to emulate a page fault when it is unable to successfully copy the results of the emulated instructions to user space.
While relocating the error code enumeration, the prefix X86_ is used to make it consistent with the rest of the definitions in traps.h. Of course, code using the enumeration had to be updated as well. No functional changes were performed.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: x86@kernel.org
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/traps.h | 18 +++++++++
 arch/x86/mm/fault.c          | 88 +++++++++++++++++---------------------------
 2 files changed, 52 insertions(+), 54 deletions(-)
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 01fd0a7..4a2e585 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -148,4 +148,22 @@ enum {
 	X86_TRAP_IRET = 32,	/* 32, IRET Exception */
 };

+/*
+ * Page fault error code bits:
+ *
+ *   bit 0 ==	 0: no page found	1: protection fault
+ *   bit 1 ==	 0: read access		1: write access
+ *   bit 2 ==	 0: kernel-mode access	1: user-mode access
+ *   bit 3 ==				1: use of reserved bit detected
+ *   bit 4 ==				1: fault was an instruction fetch
+ *   bit 5 ==				1: protection keys block access
+ */
+enum x86_pf_error_code {
+	X86_PF_PROT	=	1 << 0,
+	X86_PF_WRITE	=	1 << 1,
+	X86_PF_USER	=	1 << 2,
+	X86_PF_RSVD	=	1 << 3,
+	X86_PF_INSTR	=	1 << 4,
+	X86_PF_PK	=	1 << 5,
+};
 #endif /* _ASM_X86_TRAPS_H */
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 8ad91a0..32f3070 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -29,26 +29,6 @@
 #include <asm/trace/exceptions.h>

 /*
- * Page fault error code bits:
- *
- *   bit 0 ==	 0: no page found	1: protection fault
- *   bit 1 ==	 0: read access		1: write access
- *   bit 2 ==	 0: kernel-mode access	1: user-mode access
- *   bit 3 ==				1: use of reserved bit detected
- *   bit 4 ==				1: fault was an instruction fetch
- *   bit 5 ==				1: protection keys block access
- */
-enum x86_pf_error_code {
-
-	PF_PROT		=	1 << 0,
-	PF_WRITE	=	1 << 1,
-	PF_USER		=	1 << 2,
-	PF_RSVD		=	1 << 3,
-	PF_INSTR	=	1 << 4,
-	PF_PK		=	1 << 5,
-};
-
-/*
  * Returns 0 if mmiotrace is disabled, or if the fault is not
  * handled by mmiotrace:
  */
@@ -149,7 +129,7 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	 * If it was a exec (instruction fetch) fault on NX page, then
 	 * do not ignore the fault:
 	 */
-	if (error_code & PF_INSTR)
+	if (error_code & X86_PF_INSTR)
 		return 0;

 	instr = (void *)convert_ip_to_linear(current, regs);
@@ -179,7 +159,7 @@ is_prefetch(struct pt_regs *regs, unsigned long error_code, unsigned long addr)
 	 * siginfo so userspace can discover which protection key was set
 	 * on the PTE.
 	 *
-	 * If we get here, we know that the hardware signaled a PF_PK
+	 * If we get here, we know that the hardware signaled a X86_PF_PK
 	 * fault and that there was a VMA once we got in the fault
 	 * handler. It does *not* guarantee that the VMA we find here
 	 * was the one that we faulted on.
@@ -205,7 +185,7 @@ static void fill_sig_info_pkey(int si_code, siginfo_t *info,
 	/*
 	 * force_sig_info_fault() is called from a number of
 	 * contexts, some of which have a VMA and some of which
-	 * do not. The PF_PK handing happens after we have a
+	 * do not. The X86_PF_PK handing happens after we have a
 	 * valid VMA, so we should never reach this without a
 	 * valid VMA.
 	 */
@@ -695,7 +675,7 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code,
 	if (!oops_may_print())
 		return;

-	if (error_code & PF_INSTR) {
+	if (error_code & X86_PF_INSTR) {
 		unsigned int level;
 		pgd_t *pgd;
 		pte_t *pte;
@@ -779,7 +759,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
 	 */
 	if (current->thread.sig_on_uaccess_err && signal) {
 		tsk->thread.trap_nr = X86_TRAP_PF;
-		tsk->thread.error_code = error_code | PF_USER;
+		tsk->thread.error_code = error_code | X86_PF_USER;
 		tsk->thread.cr2 = address;

 		/* XXX: hwpoison faults will set the wrong code. */
@@ -899,7 +879,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	struct task_struct *tsk = current;

 	/* User mode accesses just cause a SIGSEGV */
-	if (error_code & PF_USER) {
+	if (error_code & X86_PF_USER) {
 		/*
 		 * It's possible to have interrupts off here:
 		 */
@@ -920,7 +900,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		 * Instruction fetch faults in the vsyscall page might need
 		 * emulation.
 		 */
-		if (unlikely((error_code & PF_INSTR) &&
+		if (unlikely((error_code & X86_PF_INSTR) &&
 			     ((address & ~0xfff) == VSYSCALL_ADDR))) {
 			if (emulate_vsyscall(regs, address))
 				return;
@@ -933,7 +913,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 		 * are always protection faults.
 		 */
 		if (address >= TASK_SIZE_MAX)
-			error_code |= PF_PROT;
+			error_code |= X86_PF_PROT;

 		if (likely(show_unhandled_signals))
 			show_signal_msg(regs, error_code, address, tsk);
@@ -989,11 +969,11 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code,

 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return false;
-	if (error_code & PF_PK)
+	if (error_code & X86_PF_PK)
 		return true;
 	/* this checks permission keys on the VMA: */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
-				(error_code & PF_INSTR), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & X86_PF_WRITE),
+				(error_code & X86_PF_INSTR), foreign))
 		return true;
 	return false;
 }
@@ -1021,7 +1001,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	int code = BUS_ADRERR;

 	/* Kernel mode? Handle exceptions or die: */
-	if (!(error_code & PF_USER)) {
+	if (!(error_code & X86_PF_USER)) {
 		no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
 		return;
 	}
@@ -1050,14 +1030,14 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	       unsigned long address, struct vm_area_struct *vma,
 	       unsigned int fault)
 {
-	if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+	if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
 		no_context(regs, error_code, address, 0, 0);
 		return;
 	}

 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
-		if (!(error_code & PF_USER)) {
+		if (!(error_code & X86_PF_USER)) {
 			no_context(regs, error_code, address,
 				   SIGSEGV, SEGV_MAPERR);
 			return;
@@ -1082,16 +1062,16 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,

 static int spurious_fault_check(unsigned long error_code, pte_t *pte)
 {
-	if ((error_code & PF_WRITE) && !pte_write(*pte))
+	if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
 		return 0;

-	if ((error_code & PF_INSTR) && !pte_exec(*pte))
+	if ((error_code & X86_PF_INSTR) && !pte_exec(*pte))
 		return 0;
 	/*
 	 * Note: We do not do lazy flushing on protection key
-	 * changes, so no spurious fault will ever set PF_PK.
+	 * changes, so no spurious fault will ever set X86_PF_PK.
 	 */
-	if ((error_code & PF_PK))
+	if ((error_code & X86_PF_PK))
 		return 1;

 	return 1;
@@ -1137,8 +1117,8 @@ spurious_fault(unsigned long error_code, unsigned long address)
 	 * change, so user accesses are not expected to cause spurious
 	 * faults.
 	 */
-	if (error_code != (PF_WRITE | PF_PROT)
-	    && error_code != (PF_INSTR | PF_PROT))
+	if (error_code != (X86_PF_WRITE | X86_PF_PROT) &&
+	    error_code != (X86_PF_INSTR | X86_PF_PROT))
 		return 0;

 	pgd = init_mm.pgd + pgd_index(address);
@@ -1198,19 +1178,19 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 	 * always an unconditional error and can never result in
 	 * a follow-up action to resolve the fault, like a COW.
 	 */
-	if (error_code & PF_PK)
+	if (error_code & X86_PF_PK)
 		return 1;

 	/*
 	 * Make sure to check the VMA so that we do not perform
-	 * faults just to hit a PF_PK as soon as we fill in a
+	 * faults just to hit a X86_PF_PK as soon as we fill in a
 	 * page.
 	 */
-	if (!arch_vma_access_permitted(vma, (error_code & PF_WRITE),
-				(error_code & PF_INSTR), foreign))
+	if (!arch_vma_access_permitted(vma, (error_code & X86_PF_WRITE),
+				(error_code & X86_PF_INSTR), foreign))
 		return 1;

-	if (error_code & PF_WRITE) {
+	if (error_code & X86_PF_WRITE) {
 		/* write, present and write, not present: */
 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
 			return 1;
@@ -1218,7 +1198,7 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
 	}

 	/* read, present: */
-	if (unlikely(error_code & PF_PROT))
+	if (unlikely(error_code & X86_PF_PROT))
 		return 1;

 	/* read, not present: */
@@ -1241,7 +1221,7 @@ static inline bool smap_violation(int error_code, struct pt_regs *regs)
 	if (!static_cpu_has(X86_FEATURE_SMAP))
 		return false;

-	if (error_code & PF_USER)
+	if (error_code & X86_PF_USER)
 		return false;

 	if (!user_mode(regs) && (regs->flags & X86_EFLAGS_AC))
@@ -1297,7 +1277,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * protection error (error_code & 9) == 0.
 	 */
 	if (unlikely(fault_in_kernel_space(address))) {
-		if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
+		if (!(error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
 			if (vmalloc_fault(address) >= 0)
 				return;

@@ -1325,7 +1305,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	if (unlikely(kprobes_fault(regs)))
 		return;

-	if (unlikely(error_code & PF_RSVD))
+	if (unlikely(error_code & X86_PF_RSVD))
 		pgtable_bad(regs, error_code, address);

 	if (unlikely(smap_violation(error_code, regs))) {
@@ -1351,7 +1331,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 */
 	if (user_mode(regs)) {
 		local_irq_enable();
-		error_code |= PF_USER;
+		error_code |= X86_PF_USER;
 		flags |= FAULT_FLAG_USER;
 	} else {
 		if (regs->flags & X86_EFLAGS_IF)
@@ -1360,9 +1340,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,

 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

-	if (error_code & PF_WRITE)
+	if (error_code & X86_PF_WRITE)
 		flags |= FAULT_FLAG_WRITE;
-	if (error_code & PF_INSTR)
+	if (error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;

 	/*
@@ -1382,7 +1362,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * space check, thus avoiding the deadlock:
 	 */
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
-		if ((error_code & PF_USER) == 0 &&
+		if ((error_code & X86_PF_USER) == 0 &&
 		    !search_exception_tables(regs->ip)) {
 			bad_area_nosemaphore(regs, error_code, address, NULL);
 			return;
@@ -1409,7 +1389,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 		bad_area(regs, error_code, address);
 		return;
 	}
-	if (error_code & PF_USER) {
+	if (error_code & X86_PF_USER) {
 		/*
 		 * Accessing the stack below %sp is always a bug.
 		 * The large cushion allows instructions like enter
On Fri, May 05, 2017 at 11:17:00AM -0700, Ricardo Neri wrote:
Up to this point, only fault.c used the definitions of the page fault error codes. Thus, it made sense to keep them within such file. Other portions of code might be interested in those definitions too. For instance, the User-Mode Instruction Prevention emulation code will use such definitions to emulate a page fault when it is unable to successfully copy the results of the emulated instructions to user space.
While relocating the error code enumeration, the prefix X86_ is used to make it consistent with the rest of the definitions in traps.h. Of course, code using the enumeration had to be updated as well. No functional changes were performed.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: x86@kernel.org
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/include/asm/traps.h | 18 +++++++++
 arch/x86/mm/fault.c          | 88 +++++++++++++++++---------------------------
 2 files changed, 52 insertions(+), 54 deletions(-)
...
@@ -1382,7 +1362,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * space check, thus avoiding the deadlock:
 	 */
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
-		if ((error_code & PF_USER) == 0 &&
+		if ((error_code & X86_PF_USER) == 0 &&
if (!(error_code & X86_PF_USER))
With that fixed:
Reviewed-by: Borislav Petkov <bp@suse.de>
On Sun, 2017-05-21 at 16:23 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:00AM -0700, Ricardo Neri wrote:
Up to this point, only fault.c used the definitions of the page fault error codes. Thus, it made sense to keep them within such file. Other portions of code might be interested in those definitions too. For instance, the User-Mode Instruction Prevention emulation code will use such definitions to emulate a page fault when it is unable to successfully copy the results of the emulated instructions to user space.
While relocating the error code enumeration, the prefix X86_ is used to make it consistent with the rest of the definitions in traps.h. Of course, code using the enumeration had to be updated as well. No functional changes were performed.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: x86@kernel.org
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/include/asm/traps.h | 18 +++++++++
 arch/x86/mm/fault.c          | 88 +++++++++++++++++---------------------------
 2 files changed, 52 insertions(+), 54 deletions(-)
...
@@ -1382,7 +1362,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
 	 * space check, thus avoiding the deadlock:
 	 */
 	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
-		if ((error_code & PF_USER) == 0 &&
+		if ((error_code & X86_PF_USER) == 0 &&
if (!(error_code & X86_PF_USER))
This change was initially intended to only rename the error codes, without functional changes. Would making this change be considered a change in functionality? The behavior would be preserved, though.
Thanks and BR, Ricardo
With that fixed:
Reviewed-by: Borislav Petkov <bp@suse.de>
Thank you for your review!
BR, Ricardo
--
Regards/Gruss,
    Boris.
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
On Fri, May 26, 2017 at 08:40:26PM -0700, Ricardo Neri wrote:
This change was initially intended to only rename the error codes, without functional changes. Would making this change be considered a change in functionality?
How?
The before-and-after asm should be identical.
On Sat, 2017-05-27 at 12:13 +0200, Borislav Petkov wrote:
On Fri, May 26, 2017 at 08:40:26PM -0700, Ricardo Neri wrote:
This change was initially intended to only rename the error codes, without functional changes. Would making this change be considered a change in functionality?
How?
The before-and-after asm should be identical.
Yes, but it reads differently. I just wanted to double check. I will make this change, which keeps the functionality but is written differently.
Thanks and BR, Ricardo
Even though memory addresses are unsigned, the operands used to compute the effective address do have a sign. This is true for the ModRM.rm, SIB.base and SIB.index operands as well as the displacement bytes. Thus, signed variables shall be used when computing the effective address from these operands. Once the signed effective address has been computed, it is cast to an unsigned long to determine the linear address.
Variables are renamed to better reflect the type of address being computed.
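As a standalone illustration of why the sign matters (not code from this patch): a 1-byte displacement of 0x80 encodes -128 and must be sign-extended before it is added to the base.

/*
 * Standalone illustration (not patch code): an 8-bit displacement of
 * 0x80 encodes -128. Sign extension makes the addition subtract 128;
 * zero extension would wrongly add 128.
 */
long base = 0x100000;
long disp = (signed char)0x80;			/* sign-extended: -128   */
long eff_addr = base + disp;			/* 0xfff80, as intended  */
long bad_addr = base + (unsigned char)0x80;	/* 0x100080, off by 256  */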
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/mm/mpx.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 1c34b76..ebdead8 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -138,7 +138,8 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
  */
 static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
-	unsigned long addr, base, indx;
+	unsigned long linear_addr;
+	long eff_addr, base, indx;
 	int addr_offset, base_offset, indx_offset;
 	insn_byte_t sib;
@@ -150,7 +151,7 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
 		if (addr_offset < 0)
 			goto out_err;
-		addr = regs_get_register(regs, addr_offset);
+		eff_addr = regs_get_register(regs, addr_offset);
 	} else {
 		if (insn->sib.nbytes) {
 			base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
@@ -163,16 +164,18 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)

 			base = regs_get_register(regs, base_offset);
 			indx = regs_get_register(regs, indx_offset);
-			addr = base + indx * (1 << X86_SIB_SCALE(sib));
+			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
 			if (addr_offset < 0)
 				goto out_err;
-			addr = regs_get_register(regs, addr_offset);
+			eff_addr = regs_get_register(regs, addr_offset);
 		}
-		addr += insn->displacement.value;
+		eff_addr += insn->displacement.value;
 	}
-	return (void __user *)addr;
+	linear_addr = (unsigned long)eff_addr;
+
+	return (void __user *)linear_addr;
 out_err:
 	return (void __user *)-1;
 }
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod != 11b and ModRM.rm = 100b, indexed register-indirect addressing is used. In other words, a SIB byte follows the ModRM byte. In the specific case of SIB.index = 100b, the scale*index portion of the computation of the effective address is null. To signal callers of this particular situation, get_reg_offset() can return -EDOM (-EINVAL continues to indicate an error when decoding the SIB byte).
An example of this situation can be the following instruction:
   8b 4c 23 80    mov    -0x80(%rbx,%riz,1),%rcx

   ModRM:        0x4c [mod:1b][reg:1b][rm:100b]
   SIB:          0x23 [scale:0b][index:100b][base:11b]
   Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
The %riz 'register' indicates a null index.
In long mode, a REX prefix may be used. When a REX prefix is present, REX.X adds a fourth bit to the register selection of SIB.index. This gives the ability to refer to all the 16 general purpose registers. When REX.X is 1b and SIB.index is 100b, the index is indicated in %r12. In our example, this would look like:
   42 8b 4c 23 80    mov    -0x80(%rbx,%r12,1),%rcx

   REX:          0x42 [W:0b][R:0b][X:1b][B:0b]
   ModRM:        0x4c [mod:1b][reg:1b][rm:100b]
   SIB:          0x23 [scale:0b][.X:1b, index:100b][.B:0b, base:11b]
   Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/mm/mpx.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index ebdead8..7397b81 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -110,6 +110,14 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 		regno = X86_SIB_INDEX(insn->sib.value);
 		if (X86_REX_X(insn->rex_prefix.value))
 			regno += 8;
+		/*
+		 * If ModRM.mod !=3 and SIB.index (regno=4) the scale*index
+		 * portion of the address computation is null. This is
+		 * true only if REX.X is 0. In such a case, the SIB index
+		 * is used in the address computation.
+		 */
+		if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4)
+			return -EDOM;
 		break;

 	case REG_TYPE_BASE:
@@ -159,11 +167,19 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 				goto out_err;

 			indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX);
-			if (indx_offset < 0)
+			/*
+			 * A negative offset generally means a error, except
+			 * -EDOM, which means that the contents of the register
+			 * should not be used as index.
+			 */
+			if (indx_offset == -EDOM)
+				indx = 0;
+			else if (indx_offset < 0)
 				goto out_err;
+			else
+				indx = regs_get_register(regs, indx_offset);

 			base = regs_get_register(regs, base_offset);
-			indx = regs_get_register(regs, indx_offset);
 			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
On Fri, May 05, 2017 at 11:17:02AM -0700, Ricardo Neri wrote:
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod != 11b and ModRM.rm = 100b, indexed register-indirect addressing is used. In other words, a SIB byte follows the ModRM byte. In the specific case of SIB.index = 100b, the scale*index portion of the computation of the effective address is null. To signal callers of this particular situation, get_reg_offset() can return -EDOM (-EINVAL continues to indicate an error when decoding the SIB byte).
An example of this situation can be the following instruction:
   8b 4c 23 80    mov    -0x80(%rbx,%riz,1),%rcx

   ModRM:        0x4c [mod:1b][reg:1b][rm:100b]
   SIB:          0x23 [scale:0b][index:100b][base:11b]
   Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
The %riz 'register' indicates a null index.
In long mode, a REX prefix may be used. When a REX prefix is present, REX.X adds a fourth bit to the register selection of SIB.index. This gives the ability to refer to all the 16 general purpose registers. When REX.X is 1b and SIB.index is 100b, the index is indicated in %r12. In our example, this would look like:
   42 8b 4c 23 80    mov    -0x80(%rbx,%r12,1),%rcx

   REX:          0x42 [W:0b][R:0b][X:1b][B:0b]
   ModRM:        0x4c [mod:1b][reg:1b][rm:100b]
   SIB:          0x23 [scale:0b][.X:1b, index:100b][.B:0b, base:11b]
   Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/mm/mpx.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index ebdead8..7397b81 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -110,6 +110,14 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 		regno = X86_SIB_INDEX(insn->sib.value);
 		if (X86_REX_X(insn->rex_prefix.value))
 			regno += 8;
<--- newline.
/*
* If ModRM.mod !=3 and SIB.index (regno=4) the scale*index
* portion of the address computation is null. This is
* true only if REX.X is 0. In such a case, the SIB index
* is used in the address computation.
*/
if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4)
return -EDOM;
break;
case REG_TYPE_BASE:
@@ -159,11 +167,19 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 				goto out_err;
indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX);
if (indx_offset < 0)
<--- newline.
/*
* A negative offset generally means a error, except
an
* -EDOM, which means that the contents of the register
* should not be used as index.
*/
if (indx_offset == -EDOM)
indx = 0;
			else if (indx_offset < 0)
				goto out_err;
			else
				indx = regs_get_register(regs, indx_offset);

			base = regs_get_register(regs, base_offset);
-			indx = regs_get_register(regs, indx_offset);
			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
		} else {
			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-- 2.9.3
On Wed, 2017-05-24 at 15:37 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:02AM -0700, Ricardo Neri wrote:
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod != 11b and ModRM.rm = 100b, indexed register-indirect addressing is used. In other words, a SIB byte follows the ModRM byte. In the specific case of SIB.index = 100b, the scale*index portion of the computation of the effective address is null. To signal callers of this particular situation, get_reg_offset() can return -EDOM (-EINVAL continues to indicate an error when decoding the SIB byte).
An example of this situation can be the following instruction:
   8b 4c 23 80    mov    -0x80(%rbx,%riz,1),%rcx

   ModRM:        0x4c [mod:1b][reg:1b][rm:100b]
   SIB:          0x23 [scale:0b][index:100b][base:11b]
   Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
The %riz 'register' indicates a null index.
In long mode, a REX prefix may be used. When a REX prefix is present, REX.X adds a fourth bit to the register selection of SIB.index. This gives the ability to refer to all the 16 general purpose registers. When REX.X is 1b and SIB.index is 100b, the index is indicated in %r12. In our example, this would look like:
   42 8b 4c 23 80    mov    -0x80(%rbx,%r12,1),%rcx

   REX:          0x42 [W:0b][R:0b][X:1b][B:0b]
   ModRM:        0x4c [mod:1b][reg:1b][rm:100b]
   SIB:          0x23 [scale:0b][.X:1b, index:100b][.B:0b, base:11b]
   Displacement: 0x80 (1-byte, as per ModRM.mod = 1b)
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/mm/mpx.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index ebdead8..7397b81 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -110,6 +110,14 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 		regno = X86_SIB_INDEX(insn->sib.value);
 		if (X86_REX_X(insn->rex_prefix.value))
 			regno += 8;
<--- newline.
I will add a new line here.
/*
* If ModRM.mod !=3 and SIB.index (regno=4) the scale*index
* portion of the address computation is null. This is
* true only if REX.X is 0. In such a case, the SIB index
* is used in the address computation.
*/
if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4)
return -EDOM;
break;
case REG_TYPE_BASE:
@@ -159,11 +167,19 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 				goto out_err;
indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX);
if (indx_offset < 0)
<--- newline.
I will add a new line here.
/*
* A negative offset generally means a error, except
an
* -EDOM, which means that the contents of the register
* should not be used as index.
*/
if (indx_offset == -EDOM)
indx = 0;
			else if (indx_offset < 0)
				goto out_err;
			else
				indx = regs_get_register(regs, indx_offset);

			base = regs_get_register(regs, base_offset);
-			indx = regs_get_register(regs, indx_offset);
			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
		} else {
			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-- 2.9.3
--
Regards/Gruss,
    Boris.
Thanks for reviewing!
BR, Ricardo
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when a SIB byte is used and SIB.base is 101b and the mod part of the ModRM byte is zero, the base portion of the effective address computation is null. In this case, a 32-bit displacement follows the SIB byte. This displacement is obtained when the instruction decoder parses the operands.
To signal this scenario, a -EDOM error is returned to indicate callers that they should ignore the base.
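An example of this encoding could look as follows (my own illustration, in the spirit of the example in the previous patch):

   8b 0c 25 80 00 00 00    mov    0x80,%ecx

   ModRM:        0x0c [mod:0b][reg:1b][rm:100b]
   SIB:          0x25 [scale:0b][index:100b][base:101b]
   Displacement: 0x00000080 (4 bytes, as per ModRM.mod = 0b and SIB.base = 101b)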
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/mm/mpx.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 7397b81..30aef92 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -122,6 +122,15 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,

 	case REG_TYPE_BASE:
 		regno = X86_SIB_BASE(insn->sib.value);
+		/*
+		 * If ModRM.mod is 0 and SIB.base == 5, the base of the
+		 * register-indirect addressing is 0. In this case, a
+		 * 32-bit displacement is expected in this case; the
+		 * instruction decoder finds such displacement for us.
+		 */
+		if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
+			return -EDOM;
+
 		if (X86_REX_B(insn->rex_prefix.value))
 			regno += 8;
 		break;
@@ -162,16 +171,21 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		eff_addr = regs_get_register(regs, addr_offset);
 	} else {
 		if (insn->sib.nbytes) {
+			/*
+			 * Negative values in the base and index offset means
+			 * an error when decoding the SIB byte. Except -EDOM,
+			 * which means that the registers should not be used
+			 * in the address computation.
+			 */
 			base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
-			if (base_offset < 0)
+			if (base_offset == -EDOM)
+				base = 0;
+			else if (base_offset < 0)
 				goto out_err;
+			else
+				base = regs_get_register(regs, base_offset);

 			indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX);
-			/*
-			 * A negative offset generally means a error, except
-			 * -EDOM, which means that the contents of the register
-			 * should not be used as index.
-			 */
 			if (indx_offset == -EDOM)
 				indx = 0;
 			else if (indx_offset < 0)
@@ -179,7 +193,6 @@ static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 			else
 				indx = regs_get_register(regs, indx_offset);

-			base = regs_get_register(regs, base_offset);
 			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
On Fri, May 05, 2017 at 11:17:03AM -0700, Ricardo Neri wrote:
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when a SIB byte is used and SIB.base is 101b and the mod part of the ModRM byte is zero, the base portion of the effective address computation is null. In this case, a 32-bit displacement follows the SIB byte. This displacement is obtained when the instruction decoder parses the operands.
To signal this scenario, a -EDOM error is returned to indicate callers that they should ignore the base.
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/mm/mpx.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 7397b81..30aef92 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -122,6 +122,15 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 	case REG_TYPE_BASE:
 		regno = X86_SIB_BASE(insn->sib.value);
/*
* If ModRM.mod is 0 and SIB.base == 5, the base of the
* register-indirect addressing is 0. In this case, a
* 32-bit displacement is expected in this case; the
* instruction decoder finds such displacement for us.
That last sentence reads funny. Just say:
"In this case, a 32-bit displacement follows the SIB byte."
*/
if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
return -EDOM;
+
 		if (X86_REX_B(insn->rex_prefix.value))
 			regno += 8;
 		break;
On Mon, 2017-05-29 at 15:07 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:03AM -0700, Ricardo Neri wrote:
Section 2.2.1.2 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when a SIB byte is used and SIB.base is 101b and the mod part of the ModRM byte is zero, the base portion of the effective address computation is null. In this case, a 32-bit displacement follows the SIB byte. This displacement is obtained when the instruction decoder parses the operands.
To signal this scenario, a -EDOM error is returned to indicate callers that they should ignore the base.
Cc: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Qiaowei Ren <qiaowei.ren@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Howard <liverlint@gmail.com>
Cc: Adan Hawthorn <adanhawthorn@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
 arch/x86/mm/mpx.c | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 7397b81..30aef92 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -122,6 +122,15 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 	case REG_TYPE_BASE:
 		regno = X86_SIB_BASE(insn->sib.value);
/*
* If ModRM.mod is 0 and SIB.base == 5, the base of the
* register-indirect addressing is 0. In this case, a
* 32-bit displacement is expected in this case; the
* instruction decoder finds such displacement for us.
That last sentence reads funny. Just say:
"In this case, a 32-bit displacement follows the SIB byte."
Agreed. I will update the comment to make more sense.
Thanks and BR, Ricardo
Other kernel submodules can benefit from using the utility functions defined in mpx.c to obtain the addresses and values of operands contained in the general purpose registers. An instance of this is the emulation code used for instructions protected by the Intel User-Mode Instruction Prevention feature.
Thus, these functions are relocated to a new insn-eval.c file. The reason to not relocate these utilities into insn.c is that the latter solely analyzes instructions given by a struct insn without any knowledge of the meaning of the values of the instruction operands. This new utility, insn-eval.c, aims to be used to resolve userspace linear addresses based on the contents of the instruction operands as well as the contents of the pt_regs structure.
These utilities come with a separate header. This is to avoid taking insn.c out of sync with the instruction decoders under tools/objtool and tools/perf. It also avoids adding cumbersome #ifdef's for the #include'd files required to decode instructions in a kernel context.
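As a rough sketch of the intended call pattern (my illustration, not code from this patch), a kernel-context caller would decode the instruction at the faulting user instruction pointer and then resolve its memory operand:

/*
 * Illustrative call pattern (not from this patch): decode the
 * instruction at the faulting user IP and resolve the linear address
 * of its memory operand. insn_get_addr_ref() returns (void __user *)-1
 * on error.
 */
unsigned char buf[MAX_INSN_SIZE];
struct insn insn;
void __user *addr;

if (copy_from_user(buf, (void __user *)regs->ip, sizeof(buf)))
	return false;

insn_init(&insn, buf, sizeof(buf), user_64bit_mode(regs));
insn_get_length(&insn);

addr = insn_get_addr_ref(&insn, regs);
if (addr == (void __user *)-1)
	return false;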
Functions are simply relocated. There are no functional or indentation changes. The checkpatch script issues the following warning with this commit:
WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather than BUG() or BUG_ON()
+	BUG();
This warning will be fixed in a subsequent patch.
Cc: Borislav Petkov bp@suse.de Cc: Andy Lutomirski luto@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/include/asm/insn-eval.h | 16 ++++ arch/x86/lib/Makefile | 2 +- arch/x86/lib/insn-eval.c | 159 +++++++++++++++++++++++++++++++++++++++ arch/x86/mm/mpx.c | 152 +------------------------------------ 4 files changed, 178 insertions(+), 151 deletions(-) create mode 100644 arch/x86/include/asm/insn-eval.h create mode 100644 arch/x86/lib/insn-eval.c
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
new file mode 100644
index 0000000..5cab1b1
--- /dev/null
+++ b/arch/x86/include/asm/insn-eval.h
@@ -0,0 +1,16 @@
+#ifndef _ASM_X86_INSN_EVAL_H
+#define _ASM_X86_INSN_EVAL_H
+/*
+ * A collection of utility functions for x86 instruction analysis to be
+ * used in a kernel context. Useful when, for instance, making sense
+ * of the registers indicated by operands.
+ */
+
+#include <linux/compiler.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <asm/ptrace.h>
+
+void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
+
+#endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index 34a7413..675d7b0 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -23,7 +23,7 @@ lib-y := delay.o misc.o cmdline.o cpu.o
 lib-y += usercopy_$(BITS).o usercopy.o getuser.o putuser.o
 lib-y += memcpy_$(BITS).o
 lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
-lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o
+lib-$(CONFIG_INSTRUCTION_DECODER) += insn.o inat.o insn-eval.o
 lib-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
obj-y += msr.o msr-reg.o msr-reg-export.o hweight.o diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c new file mode 100644 index 0000000..e746a6f --- /dev/null +++ b/arch/x86/lib/insn-eval.c @@ -0,0 +1,159 @@ +/* + * Utility functions for x86 operand and address decoding + * + * Copyright (C) Intel Corporation 2017 + */ +#include <linux/kernel.h> +#include <linux/string.h> +#include <asm/inat.h> +#include <asm/insn.h> +#include <asm/insn-eval.h> + +enum reg_type { + REG_TYPE_RM = 0, + REG_TYPE_INDEX, + REG_TYPE_BASE, +}; + +static int get_reg_offset(struct insn *insn, struct pt_regs *regs, + enum reg_type type) +{ + int regno = 0; + + static const int regoff[] = { + offsetof(struct pt_regs, ax), + offsetof(struct pt_regs, cx), + offsetof(struct pt_regs, dx), + offsetof(struct pt_regs, bx), + offsetof(struct pt_regs, sp), + offsetof(struct pt_regs, bp), + offsetof(struct pt_regs, si), + offsetof(struct pt_regs, di), +#ifdef CONFIG_X86_64 + offsetof(struct pt_regs, r8), + offsetof(struct pt_regs, r9), + offsetof(struct pt_regs, r10), + offsetof(struct pt_regs, r11), + offsetof(struct pt_regs, r12), + offsetof(struct pt_regs, r13), + offsetof(struct pt_regs, r14), + offsetof(struct pt_regs, r15), +#endif + }; + int nr_registers = ARRAY_SIZE(regoff); + /* + * Don't possibly decode a 32-bit instructions as + * reading a 64-bit-only register. + */ + if (IS_ENABLED(CONFIG_X86_64) && !insn->x86_64) + nr_registers -= 8; + + switch (type) { + case REG_TYPE_RM: + regno = X86_MODRM_RM(insn->modrm.value); + if (X86_REX_B(insn->rex_prefix.value)) + regno += 8; + break; + + case REG_TYPE_INDEX: + regno = X86_SIB_INDEX(insn->sib.value); + if (X86_REX_X(insn->rex_prefix.value)) + regno += 8; + /* + * If ModRM.mod !=3 and SIB.index (regno=4) the scale*index + * portion of the address computation is null. This is + * true only if REX.X is 0. In such a case, the SIB index + * is used in the address computation. + */ + if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4) + return -EDOM; + break; + + case REG_TYPE_BASE: + regno = X86_SIB_BASE(insn->sib.value); + /* + * If ModRM.mod is 0 and SIB.base == 5, the base of the + * register-indirect addressing is 0. In this case, a + * 32-bit displacement is expected in this case; the + * instruction decoder finds such displacement for us. + */ + if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5) + return -EDOM; + + if (X86_REX_B(insn->rex_prefix.value)) + regno += 8; + break; + + default: + pr_err("invalid register type"); + BUG(); + break; + } + + if (regno >= nr_registers) { + WARN_ONCE(1, "decoded an instruction with an invalid register"); + return -EINVAL; + } + return regoff[regno]; +} + +/* + * return the address being referenced be instruction + * for rm=3 returning the content of the rm reg + * for rm!=3 calculates the address using SIB and Disp + */ +void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs) +{ + unsigned long linear_addr; + long eff_addr, base, indx; + int addr_offset, base_offset, indx_offset; + insn_byte_t sib; + + insn_get_modrm(insn); + insn_get_sib(insn); + sib = insn->sib.value; + + if (X86_MODRM_MOD(insn->modrm.value) == 3) { + addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); + if (addr_offset < 0) + goto out_err; + eff_addr = regs_get_register(regs, addr_offset); + } else { + if (insn->sib.nbytes) { + /* + * Negative values in the base and index offset means + * an error when decoding the SIB byte. 
Except -EDOM, + * which means that the registers should not be used + * in the address computation. + */ + base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE); + if (base_offset == -EDOM) + base = 0; + else if (base_offset < 0) + goto out_err; + else + base = regs_get_register(regs, base_offset); + + indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX); + if (indx_offset == -EDOM) + indx = 0; + else if (indx_offset < 0) + goto out_err; + else + indx = regs_get_register(regs, indx_offset); + + eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); + } else { + addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); + if (addr_offset < 0) + goto out_err; + eff_addr = regs_get_register(regs, addr_offset); + } + eff_addr += insn->displacement.value; + } + linear_addr = (unsigned long)eff_addr; + + return (void __user *)linear_addr; +out_err: + return (void __user *)-1; +} diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c index 30aef92..c3f02be 100644 --- a/arch/x86/mm/mpx.c +++ b/arch/x86/mm/mpx.c @@ -12,6 +12,7 @@ #include <linux/sched/sysctl.h>
#include <asm/insn.h> +#include <asm/insn-eval.h> #include <asm/mman.h> #include <asm/mmu_context.h> #include <asm/mpx.h> @@ -60,155 +61,6 @@ static unsigned long mpx_mmap(unsigned long len) return addr; }
-enum reg_type { - REG_TYPE_RM = 0, - REG_TYPE_INDEX, - REG_TYPE_BASE, -}; - -static int get_reg_offset(struct insn *insn, struct pt_regs *regs, - enum reg_type type) -{ - int regno = 0; - - static const int regoff[] = { - offsetof(struct pt_regs, ax), - offsetof(struct pt_regs, cx), - offsetof(struct pt_regs, dx), - offsetof(struct pt_regs, bx), - offsetof(struct pt_regs, sp), - offsetof(struct pt_regs, bp), - offsetof(struct pt_regs, si), - offsetof(struct pt_regs, di), -#ifdef CONFIG_X86_64 - offsetof(struct pt_regs, r8), - offsetof(struct pt_regs, r9), - offsetof(struct pt_regs, r10), - offsetof(struct pt_regs, r11), - offsetof(struct pt_regs, r12), - offsetof(struct pt_regs, r13), - offsetof(struct pt_regs, r14), - offsetof(struct pt_regs, r15), -#endif - }; - int nr_registers = ARRAY_SIZE(regoff); - /* - * Don't possibly decode a 32-bit instructions as - * reading a 64-bit-only register. - */ - if (IS_ENABLED(CONFIG_X86_64) && !insn->x86_64) - nr_registers -= 8; - - switch (type) { - case REG_TYPE_RM: - regno = X86_MODRM_RM(insn->modrm.value); - if (X86_REX_B(insn->rex_prefix.value)) - regno += 8; - break; - - case REG_TYPE_INDEX: - regno = X86_SIB_INDEX(insn->sib.value); - if (X86_REX_X(insn->rex_prefix.value)) - regno += 8; - /* - * If ModRM.mod !=3 and SIB.index (regno=4) the scale*index - * portion of the address computation is null. This is - * true only if REX.X is 0. In such a case, the SIB index - * is used in the address computation. - */ - if (X86_MODRM_MOD(insn->modrm.value) != 3 && regno == 4) - return -EDOM; - break; - - case REG_TYPE_BASE: - regno = X86_SIB_BASE(insn->sib.value); - /* - * If ModRM.mod is 0 and SIB.base == 5, the base of the - * register-indirect addressing is 0. In this case, a - * 32-bit displacement is expected in this case; the - * instruction decoder finds such displacement for us. - */ - if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5) - return -EDOM; - - if (X86_REX_B(insn->rex_prefix.value)) - regno += 8; - break; - - default: - pr_err("invalid register type"); - BUG(); - break; - } - - if (regno >= nr_registers) { - WARN_ONCE(1, "decoded an instruction with an invalid register"); - return -EINVAL; - } - return regoff[regno]; -} - -/* - * return the address being referenced be instruction - * for rm=3 returning the content of the rm reg - * for rm!=3 calculates the address using SIB and Disp - */ -static void __user *mpx_get_addr_ref(struct insn *insn, struct pt_regs *regs) -{ - unsigned long linear_addr; - long eff_addr, base, indx; - int addr_offset, base_offset, indx_offset; - insn_byte_t sib; - - insn_get_modrm(insn); - insn_get_sib(insn); - sib = insn->sib.value; - - if (X86_MODRM_MOD(insn->modrm.value) == 3) { - addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); - if (addr_offset < 0) - goto out_err; - eff_addr = regs_get_register(regs, addr_offset); - } else { - if (insn->sib.nbytes) { - /* - * Negative values in the base and index offset means - * an error when decoding the SIB byte. Except -EDOM, - * which means that the registers should not be used - * in the address computation. 
- */ - base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE); - if (base_offset == -EDOM) - base = 0; - else if (base_offset < 0) - goto out_err; - else - base = regs_get_register(regs, base_offset); - - indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX); - if (indx_offset == -EDOM) - indx = 0; - else if (indx_offset < 0) - goto out_err; - else - indx = regs_get_register(regs, indx_offset); - - eff_addr = base + indx * (1 << X86_SIB_SCALE(sib)); - } else { - addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM); - if (addr_offset < 0) - goto out_err; - eff_addr = regs_get_register(regs, addr_offset); - } - eff_addr += insn->displacement.value; - } - linear_addr = (unsigned long)eff_addr; - - return (void __user *)linear_addr; -out_err: - return (void __user *)-1; -} - static int mpx_insn_decode(struct insn *insn, struct pt_regs *regs) { @@ -321,7 +173,7 @@ siginfo_t *mpx_generate_siginfo(struct pt_regs *regs) info->si_signo = SIGSEGV; info->si_errno = 0; info->si_code = SEGV_BNDERR; - info->si_addr = mpx_get_addr_ref(&insn, regs); + info->si_addr = insn_get_addr_ref(&insn, regs); /* * We were not able to extract an address from the instruction, * probably because there was something invalid in it.
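To make the relocated interface concrete, here is a hedged sketch of how an in-kernel caller could use the new helper; the wrapper name and buffer handling are illustrative, not from the patch:

#include <asm/insn.h>
#include <asm/insn-eval.h>

/*
 * Hypothetical caller: resolve the memory operand of an instruction
 * whose bytes were already copied from user space into 'buf'.
 */
static void __user *resolve_operand(unsigned char *buf, int nbytes,
                                    struct pt_regs *regs)
{
        struct insn insn;

        insn_init(&insn, buf, nbytes, 0);       /* 0: decode as 32-bit code */
        insn_get_length(&insn);                 /* parses prefixes through immediates */

        return insn_get_addr_ref(&insn, regs);
}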
We are not in a critical failure path. The invalid register type occurs when trying to decode invalid instruction bytes from a user-space program. Thus, simply print an error message. To prevent this warning from being abused by user-space programs, use the rate-limited variant of printk.
Cc: Borislav Petkov bp@suse.de Cc: Andy Lutomirski luto@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index e746a6f..182e2ae 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -5,6 +5,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/string.h>
+#include <linux/ratelimit.h>
 #include <asm/inat.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
@@ -85,9 +86,8 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
                break;
        default:
-               pr_err("invalid register type");
-               BUG();
-               break;
+               printk_ratelimited(KERN_ERR "insn-eval: x86: invalid register type");
+               return -EINVAL;
        }
if (regno >= nr_registers) {
On Fri, May 05, 2017 at 11:17:05AM -0700, Ricardo Neri wrote:
We are not in a critical failure path. The invalid register type occurs when trying to decode invalid instruction bytes from a user-space program. Thus, simply print an error message. To prevent this warning from being abused by user-space programs, use the rate-limited variant of printk.
Cc: Borislav Petkov bp@suse.de Cc: Andy Lutomirski luto@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index e746a6f..182e2ae 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -5,6 +5,7 @@ */ #include <linux/kernel.h> #include <linux/string.h> +#include <linux/ratelimit.h> #include <asm/inat.h> #include <asm/insn.h> #include <asm/insn-eval.h> @@ -85,9 +86,8 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, break;
        default:
-               pr_err("invalid register type");
-               BUG();
-               break;
+               printk_ratelimited(KERN_ERR "insn-eval: x86: invalid register type");
You can use pr_err_ratelimited() and define "insn-eval" with pr_fmt. Look for examples in the tree.
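For reference, the idiom being suggested looks roughly like this (a sketch; the prefix string is whatever pr_fmt is defined to):

/* At the top of the .c file, before any #include: */
#define pr_fmt(fmt) "insn-eval: " fmt

#include <linux/printk.h>
#include <linux/ratelimit.h>

/* ...and the error print then becomes: */
        pr_err_ratelimited("invalid register type");
/* which emits "insn-eval: invalid register type", rate-limited. */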
Btw, "insn-eval" is perhaps not the right name - since we're building an instruction decoder, maybe it should be called "insn-dec" or so. I'm looking at those other arch/x86/lib/insn.c, arch/x86/include/asm/inat.h things and how they're starting to morph into one decoding facility, AFAICT.
On Mon, 2017-05-29 at 18:37 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:05AM -0700, Ricardo Neri wrote:
We are not in a critical failure path. The invalid register type occurs when trying to decode invalid instruction bytes from a user-space program. Thus, simply print an error message. To prevent this warning from being abused by user-space programs, use the rate-limited variant of printk.
Cc: Borislav Petkov bp@suse.de Cc: Andy Lutomirski luto@kernel.org Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index e746a6f..182e2ae 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -5,6 +5,7 @@ */ #include <linux/kernel.h> #include <linux/string.h> +#include <linux/ratelimit.h> #include <asm/inat.h> #include <asm/insn.h> #include <asm/insn-eval.h> @@ -85,9 +86,8 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, break;
        default:
-               pr_err("invalid register type");
-               BUG();
-               break;
+               printk_ratelimited(KERN_ERR "insn-eval: x86: invalid register type");
You can use pr_err_ratelimited() and define "insn-eval" with pr_fmt. Look for examples in the tree.
Will do. I have looked at the examples.
Btw, "insn-eval" is perhaps not the right name - since we're building an instruction decoder, maybe it should be called "insn-dec" or so. I'm looking at those other arch/x86/lib/insn.c, arch/x86/include/asm/inat.h things and how they're starting to morph into one decoding facility, AFAICT.
I agree that insn-eval reads somewhat funny. I did not want to go with insn-dec.c as insn.c, in my opinion, already decodes the instruction (i.e., it finds prefixes, opcodes, ModRM, SIB and displacement bytes). In insn-eval.c I simply take those decoded parameters and evaluate them to obtain the values they contain (e.g., a specific memory location). Perhaps insn-resolve.c could be a better name? Or maybe insn-operands?
Thanks and BR, Ricardo
On Mon, Jun 05, 2017 at 11:06:58PM -0700, Ricardo Neri wrote:
I agree that insn-eval reads somewhat funny. I did not want to go with insn-dec.c as insn.c, in my opinion, already decodes the instruction (i.e., it finds prefixes, opcodes, ModRM, SIB and displacement bytes). In insn-eval.c I simply take those decoded parameters and evaluate them to obtain the values they contain (e.g., a specific memory location). Perhaps insn-resolve.c could be a better name? Or maybe insn-operands?
So actually I'm gravitating towards calling all that instruction "massaging" code with a single prefix to denote this comes from the insn decoder/handler/whatever...
I.e.,
"insn-decoder: x86: invalid register type"
or
"inat: x86: invalid register type"
or something to that effect.
I mean, if we're going to grow our own - as we do, apparently - maybe it all should be a separate entity with its proper name.
Hmm.
On Tue, 2017-06-06 at 13:58 +0200, Borislav Petkov wrote:
On Mon, Jun 05, 2017 at 11:06:58PM -0700, Ricardo Neri wrote:
I agree that insn-eval reads somewhat funny. I did not want to go with insn-dec.c as insn.c, in my opinion, already decodes the instruction (i.e., it finds prefixes, opcodes, ModRM, SIB and displacement bytes). In insn-eval.c I simply take those decoded parameters and evaluate them to obtain the values they contain (e.g., a specific memory location). Perhaps insn-resolve.c could be a better name? Or maybe insn-operands?
So actually I'm gravitating towards calling all that instruction "massaging" code with a single prefix to denote this comes from the insn decoder/handler/whatever...
I.e.,
"insn-decoder: x86: invalid register type"
or
"inat: x86: invalid register type"
or something to that effect.
I mean, if we're going to grow our own - as we do, apparently - maybe it all should be a separate entity with its proper name.
I see. You were more concerned about the naming of the coding artifacts (e.g., function names, error prints, etc.) than the actual filenames. I think I have aligned with the function naming of insn.c in all the functions that are exposed via the header by using the insn_ prefix. For static functions I don't use that prefix. Perhaps I can use the __ prefix as insn.c does.
Thanks and BR, Ricardo
On Tue, Jun 06, 2017 at 05:28:52PM -0700, Ricardo Neri wrote:
I see. You were more concerned about the naming of the coding artifacts (e.g., function names, error prints, etc.) than the actual filenames.
Well, I'm not sure here. We could either have a generalized prefix or put the function name in there - __func__ - for easier debuggability, i.e., find the origin of the error message faster.
But I'm sensing that we're already well inside the bikeshed so let's not change anything now. :)
The function get_reg_offset() returns the offset into pt_regs of the register specified by its argument of enumerated type reg_type. Callers of this function would need the definition of that enumeration; this is not necessary. Instead, add helper functions for this purpose. These functions are useful in cases when, for instance, the caller needs to decide whether the operand is a register or a memory location by looking at the rm part of the ModRM byte. As of now, only one such helper function is needed.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/include/asm/insn-eval.h | 1 + arch/x86/lib/insn-eval.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 5cab1b1..7e8c963 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -12,5 +12,6 @@ #include <asm/ptrace.h>
void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); +int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
#endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 182e2ae..8b16761 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -97,6 +97,21 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, return regoff[regno]; }
+/** + * insn_get_reg_offset_modrm_rm() - Obtain register in r/m part of ModRM byte + * @insn: Instruction structure containing the ModRM byte + * @regs: Structure with register values as seen when entering kernel mode + * + * Return: The register indicated by the r/m part of the ModRM byte. The + * register is obtained as an offset from the base of pt_regs. In specific + * cases, the returned value can be -EDOM to indicate that the particular value + * of ModRM does not refer to a register and shall be ignored. + */ +int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) +{ + return get_reg_offset(insn, regs, REG_TYPE_RM); +} + /* * return the address being referenced be instruction * for rm=3 returning the content of the rm reg
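As an aside, a brief, hypothetical usage sketch of the new helper (not part of the patch): once the offset is known, the register value can be read through regs_get_register().

        int off;
        unsigned long val;

        /* Offset of the ModRM.rm register within struct pt_regs. */
        off = insn_get_modrm_rm_off(insn, regs);
        if (off < 0)
                return off;     /* -EDOM/-EINVAL: no usable register */

        val = regs_get_register(regs, off);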
On Fri, May 05, 2017 at 11:17:06AM -0700, Ricardo Neri wrote:
The function get_reg_offset() returns the offset into pt_regs of the register specified by its argument of enumerated type reg_type. Callers of this function would need the definition of that enumeration; this is not necessary. Instead, add helper functions for this purpose. These functions are useful in cases when, for instance, the caller needs to decide whether the operand is a register or a memory location by looking at the rm part of the ModRM byte. As of now, only one such helper function is needed.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/insn-eval.h | 1 + arch/x86/lib/insn-eval.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 5cab1b1..7e8c963 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -12,5 +12,6 @@ #include <asm/ptrace.h>
void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); +int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
#endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 182e2ae..8b16761 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -97,6 +97,21 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, return regoff[regno]; }
+/**
- insn_get_reg_offset_modrm_rm() - Obtain register in r/m part of ModRM byte
That name needs to be synced with the function name below.
- @insn: Instruction structure containing the ModRM byte
- @regs: Structure with register values as seen when entering kernel mode
- Return: The register indicated by the r/m part of the ModRM byte. The
- register is obtained as an offset from the base of pt_regs. In specific
- cases, the returned value can be -EDOM to indicate that the particular value
- of ModRM does not refer to a register and shall be ignored.
- */
+int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
^^^^^^^^^^^^^^^^^^^^
On Mon, 2017-05-29 at 19:16 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:06AM -0700, Ricardo Neri wrote:
The function get_reg_offset() returns the offset into pt_regs of the register specified by its argument of enumerated type reg_type. Callers of this function would need the definition of that enumeration; this is not necessary. Instead, add helper functions for this purpose. These functions are useful in cases when, for instance, the caller needs to decide whether the operand is a register or a memory location by looking at the rm part of the ModRM byte. As of now, only one such helper function is needed.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/insn-eval.h | 1 + arch/x86/lib/insn-eval.c | 15 +++++++++++++++ 2 files changed, 16 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 5cab1b1..7e8c963 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -12,5 +12,6 @@ #include <asm/ptrace.h>
void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs); +int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
#endif /* _ASM_X86_INSN_EVAL_H */ diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 182e2ae..8b16761 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -97,6 +97,21 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, return regoff[regno]; }
+/**
- insn_get_reg_offset_modrm_rm() - Obtain register in r/m part of ModRM byte
That name needs to be synced with the function name below.
Ugh! I missed this. I will update accordingly. Thanks for the detailed review.
BR, Ricardo
String instructions are special because, in protected mode, the linear address is always obtained via the ES segment register in operands that use the (E)DI register; segment override prefixes are ignored. Non-string instructions use DS as the default segment register, which can be overridden with a segment override prefix.
This function will be used in a subsequent commit that introduces a function to determine the segment register to use given the instruction, its operands and any segment override prefixes.
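For example (an illustrative listing added here, not from the patch), for MOVSB the destination segment is fixed to ES while the source segment can be overridden:

        a4              movsb   ; copies from DS:(E)SI to ES:(E)DI
        2e a4           movsb   ; the CS override applies to the source
                                ; only: copies from CS:(E)SI to ES:(E)DI;
                                ; the destination segment is always ES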
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 8b16761..1634762 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -16,6 +16,73 @@ enum reg_type { REG_TYPE_BASE, };
+enum string_instruction { + INSB = 0x6c, + INSW_INSD = 0x6d, + OUTSB = 0x6e, + OUTSW_OUTSD = 0x6f, + MOVSB = 0xa4, + MOVSW_MOVSD = 0xa5, + CMPSB = 0xa6, + CMPSW_CMPSD = 0xa7, + STOSB = 0xaa, + STOSW_STOSD = 0xab, + LODSB = 0xac, + LODSW_LODSD = 0xad, + SCASB = 0xae, + SCASW_SCASD = 0xaf, +}; + +/** + * is_string_instruction - Determine if instruction is a string instruction + * @insn: Instruction structure containing the opcode + * + * Return: true if the instruction, determined by the opcode, is any of the + * string instructions as defined in the Intel Software Development manual. + * False otherwise. + */ +static bool is_string_instruction(struct insn *insn) +{ + insn_get_opcode(insn); + + /* all string instructions have a 1-byte opcode */ + if (insn->opcode.nbytes != 1) + return false; + + switch (insn->opcode.bytes[0]) { + case INSB: + /* fall through */ + case INSW_INSD: + /* fall through */ + case OUTSB: + /* fall through */ + case OUTSW_OUTSD: + /* fall through */ + case MOVSB: + /* fall through */ + case MOVSW_MOVSD: + /* fall through */ + case CMPSB: + /* fall through */ + case CMPSW_CMPSD: + /* fall through */ + case STOSB: + /* fall through */ + case STOSW_STOSD: + /* fall through */ + case LODSB: + /* fall through */ + case LODSW_LODSD: + /* fall through */ + case SCASB: + /* fall through */ + case SCASW_SCASD: + return true; + default: + return false; + } +} + static int get_reg_offset(struct insn *insn, struct pt_regs *regs, enum reg_type type) {
On Fri, May 05, 2017 at 11:17:07AM -0700, Ricardo Neri wrote:
String instructions are special because in protected mode, the linear address is always obtained via the ES segment register in operands that use the (E)DI register.
... and DS for rSI.
If we're going to account for both operands of string instructions with two operands.
Btw, LODS and OUTS use only DS:rSI as a source operand. So we have to be careful with the generalization here. So if ES:rDI is the only seg. reg we want, then we don't need to look at those insns... (we assume DS by default).
...
+/**
- is_string_instruction - Determine if instruction is a string instruction
- @insn: Instruction structure containing the opcode
- Return: true if the instruction, determined by the opcode, is any of the
- string instructions as defined in the Intel Software Development manual.
- False otherwise.
- */
+static bool is_string_instruction(struct insn *insn) +{
- insn_get_opcode(insn);
- /* all string instructions have a 1-byte opcode */
- if (insn->opcode.nbytes != 1)
return false;
- switch (insn->opcode.bytes[0]) {
- case INSB:
/* fall through */
- case INSW_INSD:
/* fall through */
- case OUTSB:
/* fall through */
- case OUTSW_OUTSD:
/* fall through */
- case MOVSB:
/* fall through */
- case MOVSW_MOVSD:
/* fall through */
- case CMPSB:
/* fall through */
- case CMPSW_CMPSD:
/* fall through */
- case STOSB:
/* fall through */
- case STOSW_STOSD:
/* fall through */
- case LODSB:
/* fall through */
- case LODSW_LODSD:
/* fall through */
- case SCASB:
/* fall through */
That "fall through" for every opcode is just too much. Also, you can use the regularity of the x86 opcode space and do:
        case 0x6c ... 0x6f:     /* INS/OUTS */
        case 0xa4 ... 0xa7:     /* MOVS/CMPS */
        case 0xaa ... 0xaf:     /* STOS/LODS/SCAS */
                return true;
        default:
                return false;
        }
And voila, there's your compact is_string_insn() function! :^)
(Modulo the exact list, as I mentioned above).
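Putting the suggestion together with the earlier parts of the patch, the compact helper would read something like this (modulo the exact opcode list, as noted):

static bool is_string_insn(struct insn *insn)
{
        insn_get_opcode(insn);

        /* All string instructions have a 1-byte opcode. */
        if (insn->opcode.nbytes != 1)
                return false;

        switch (insn->opcode.bytes[0]) {
        case 0x6c ... 0x6f:     /* INS/OUTS */
        case 0xa4 ... 0xa7:     /* MOVS/CMPS */
        case 0xaa ... 0xaf:     /* STOS/LODS/SCAS */
                return true;
        default:
                return false;
        }
}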
Thanks.
On Mon, 2017-05-29 at 23:48 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:07AM -0700, Ricardo Neri wrote:
String instructions are special because in protected mode, the linear address is always obtained via the ES segment register in operands that use the (E)DI register.
... and DS for rSI.
Right, I omitted this in the commit message.
If we're going to account for both operands of string instructions with two operands.
Btw, LODS and OUTS use only DS:rSI as a source operand. So we have to be careful with the generalization here. So if ES:rDI is the only seg. reg we want, then we don't need to look at those insns... (we assume DS by default).
My intention with this function is to write a function that does only one thing: identify string instructions, irrespective of the operands they use. A separate function, resolve_seg_register(), will have the logic to decide which segment register to use based on the registers used as operands, whether we are looking at a string instruction, whether we have segment override prefixes, and whether such overrides should be ignored.
If I were to leave out string instructions from this function, it would have to be renamed to something like is_string_instruction_non_lods_outs, and I would end up having the logic that decides which segment register to use in two places. In my opinion, this separation makes the code clearer. Does it make sense to you?
...
+/**
- is_string_instruction - Determine if instruction is a string instruction
- @insn: Instruction structure containing the opcode
- Return: true if the instruction, determined by the opcode, is any of the
- string instructions as defined in the Intel Software Development manual.
- False otherwise.
- */
+static bool is_string_instruction(struct insn *insn) +{
- insn_get_opcode(insn);
- /* all string instructions have a 1-byte opcode */
- if (insn->opcode.nbytes != 1)
return false;
- switch (insn->opcode.bytes[0]) {
- case INSB:
/* fall through */
- case INSW_INSD:
/* fall through */
- case OUTSB:
/* fall through */
- case OUTSW_OUTSD:
/* fall through */
- case MOVSB:
/* fall through */
- case MOVSW_MOVSD:
/* fall through */
- case CMPSB:
/* fall through */
- case CMPSW_CMPSD:
/* fall through */
- case STOSB:
/* fall through */
- case STOSW_STOSD:
/* fall through */
- case LODSB:
/* fall through */
- case LODSW_LODSD:
/* fall through */
- case SCASB:
/* fall through */
That "fall through" for every opcode is just too much. Also, you can use the regularity of the x86 opcode space and do:
        case 0x6c ... 0x6f:     /* INS/OUTS */
        case 0xa4 ... 0xa7:     /* MOVS/CMPS */
        case 0xaa ... 0xaf:     /* STOS/LODS/SCAS */
                return true;
        default:
                return false;
        }
And voila, there's your compact is_string_insn() function! :^)
Thanks for the suggestion! It looks really nice. I will implement accordingly.
Thanks and BR, Ricardo
On Mon, Jun 05, 2017 at 11:01:21PM -0700, Ricardo Neri wrote:
If I were to leave out string instructions from this function, it would have to be renamed to something like is_string_instruction_non_lods_outs, and I would end up having the logic that decides which segment register to use in two places. In my opinion, this separation makes the code clearer. Does it make sense to you?
Ok, sure.
Thanks.
When computing a linear address and segmentation is used, we need to know the base address of the segment involved in the computation. In most cases, the segment base address will be zero, as with USER_DS/USER32_DS. However, a user space program may define its own segments via a local descriptor table. In such a case, the segment base address may not be zero. Thus, the segment base address is needed to correctly calculate the linear address.
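For background, a user-space program typically obtains such a non-zero base via the modify_ldt(2) syscall; a minimal sketch, with arbitrary example field values:

#include <asm/ldt.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct user_desc desc = {
                .entry_number   = 0,
                .base_addr      = 0x100000,     /* non-zero segment base */
                .limit          = 0xfffff,
                .seg_32bit      = 1,
                .limit_in_pages = 1,
                .useable        = 1,
        };

        /* func = 1: write an LDT entry. Its selector would be 0x7
         * (index 0, table indicator = LDT, RPL 3). */
        return syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
}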
The segment selector to be used when computing a linear address is determined by either any of segment override prefixes in the instruction or inferred from the registers involved in the computation of the effective address; in that order. Also, there are cases when the overrides shall be ignored (code segments are always selected by the CS segment register; string instructions always use the ES segment register along with the EDI register).
For clarity, this process can be split into two steps: resolving the relevant segment register to use and, once known, reading its value to obtain the segment selector.
The method to obtain the segment selector depends on several factors. In 32-bit builds, segment selectors are saved into the pt_regs structure when switching to kernel mode. The same is also true for virtual-8086 mode. In 64-bit builds, segmentation is mostly ignored, except when running a program in 32-bit legacy mode. In this case, CS and SS can be obtained from pt_regs. DS, ES, FS and GS can be read directly from the respective segment registers.
Lastly, the only two segment registers that are not ignored in long mode are FS and GS. In these two cases, base addresses are obtained from the respective MSRs.
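In kernel terms, reading those bases amounts to something like the following sketch (ignoring the per-thread cached copies and any FSGSBASE fast paths):

#include <asm/msr.h>

        unsigned long fs_base, gs_base;

        rdmsrl(MSR_FS_BASE, fs_base);   /* base for FS-relative accesses */
        rdmsrl(MSR_GS_BASE, gs_base);   /* base for GS-relative accesses */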
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 256 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 256 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 1634762..0a496f4 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -9,6 +9,7 @@ #include <asm/inat.h> #include <asm/insn.h> #include <asm/insn-eval.h> +#include <asm/vm86.h>
enum reg_type { REG_TYPE_RM = 0, @@ -33,6 +34,17 @@ enum string_instruction { SCASW_SCASD = 0xaf, };
+enum segment_register { + SEG_REG_INVAL = -1, + SEG_REG_IGNORE = 0, + SEG_REG_CS = 0x23, + SEG_REG_SS = 0x36, + SEG_REG_DS = 0x3e, + SEG_REG_ES = 0x26, + SEG_REG_FS = 0x64, + SEG_REG_GS = 0x65, +}; + /** * is_string_instruction - Determine if instruction is a string instruction * @insn: Instruction structure containing the opcode @@ -83,6 +95,250 @@ static bool is_string_instruction(struct insn *insn) } }
+/** + * resolve_seg_register() - obtain segment register + * @insn: Instruction structure with segment override prefixes + * @regs: Structure with register values as seen when entering kernel mode + * @regoff: Operand offset, in pt_regs, used to deterimine segment register + * + * The segment register to which an effective address refers depends on + * a) whether segment override prefixes must be ignored: always use CS when + * the register is (R|E)IP; always use ES when operand register is (E)DI with + * string instructions as defined in the Intel documentation. b) If segment + * overrides prefixes are used in the instruction instruction prefixes. C) Use + * the default segment register associated with the operand register. + * + * The operand register, regoff, is represented as the offset from the base of + * pt_regs. Also, regoff can be -EDOM for cases in which registers are not + * used as operands (e.g., displacement-only memory addressing). + * + * This function returns the segment register as value from an enumeration + * as per the conditions described above. Please note that this function + * does not return the value in the segment register (i.e., the segment + * selector). The segment selector needs to be obtained using + * get_segment_selector() and passing the segment register resolved by + * this function. + * + * Return: Enumerated segment register to use, among CS, SS, DS, ES, FS, GS, + * ignore (in 64-bit mode as applicable), or -EINVAL in case of error. + */ +static enum segment_register resolve_seg_register(struct insn *insn, + struct pt_regs *regs, + int regoff) +{ + int i; + int sel_overrides = 0; + int seg_register = SEG_REG_IGNORE; + + if (!insn) + return SEG_REG_INVAL; + + /* First handle cases when segment override prefixes must be ignored */ + if (regoff == offsetof(struct pt_regs, ip)) { + if (user_64bit_mode(regs)) + return SEG_REG_IGNORE; + else + return SEG_REG_CS; + return SEG_REG_CS; + } + + /* + * If the (E)DI register is used with string instructions, the ES + * segment register is always used. + */ + if ((regoff == offsetof(struct pt_regs, di)) && + is_string_instruction(insn)) { + if (user_64bit_mode(regs)) + return SEG_REG_IGNORE; + else + return SEG_REG_ES; + return SEG_REG_CS; + } + + /* Then check if we have segment overrides prefixes*/ + for (i = 0; i < insn->prefixes.nbytes; i++) { + switch (insn->prefixes.bytes[i]) { + case SEG_REG_CS: + seg_register = SEG_REG_CS; + sel_overrides++; + break; + case SEG_REG_SS: + seg_register = SEG_REG_SS; + sel_overrides++; + break; + case SEG_REG_DS: + seg_register = SEG_REG_DS; + sel_overrides++; + break; + case SEG_REG_ES: + seg_register = SEG_REG_ES; + sel_overrides++; + break; + case SEG_REG_FS: + seg_register = SEG_REG_FS; + sel_overrides++; + break; + case SEG_REG_GS: + seg_register = SEG_REG_GS; + sel_overrides++; + break; + default: + return SEG_REG_INVAL; + } + } + + /* + * Having more than one segment override prefix leads to undefined + * behavior. If this is the case, return with error. + */ + if (sel_overrides > 1) + return SEG_REG_INVAL; + + if (sel_overrides == 1) { + /* + * If in long mode all segment registers but FS and GS are + * ignored. 
+ */ + if (user_64bit_mode(regs) && !(seg_register == SEG_REG_FS || + seg_register == SEG_REG_GS)) + return SEG_REG_IGNORE; + + return seg_register; + } + + /* In long mode, all segment registers except FS and GS are ignored */ + if (user_64bit_mode(regs)) + return SEG_REG_IGNORE; + + /* + * Lastly, if no segment overrides were found, determine the default + * segment register as described in the Intel documentation: SS for + * (E)SP or (E)BP. DS for all data references, AX, CX and DX are not + * valid register operands in 16-bit address encodings. + * -EDOM is reserved to identify for cases in which no register is used + * the default segment register (displacement-only addressing). The + * default segment register used in these cases is DS. + */ + + switch (regoff) { + case offsetof(struct pt_regs, ax): + /* fall through */ + case offsetof(struct pt_regs, cx): + /* fall through */ + case offsetof(struct pt_regs, dx): + if (insn && insn->addr_bytes == 2) + return SEG_REG_INVAL; + case offsetof(struct pt_regs, di): + /* fall through */ + case -EDOM: + /* fall through */ + case offsetof(struct pt_regs, bx): + /* fall through */ + case offsetof(struct pt_regs, si): + return SEG_REG_DS; + case offsetof(struct pt_regs, bp): + /* fall through */ + case offsetof(struct pt_regs, sp): + return SEG_REG_SS; + case offsetof(struct pt_regs, ip): + return SEG_REG_CS; + default: + return SEG_REG_INVAL; + } +} + +/** + * get_segment_selector() - obtain segment selector + * @regs: Structure with register values as seen when entering kernel mode + * @seg_reg: Segment register to use + * + * Obtain the segment selector from any of the CS, SS, DS, ES, FS, GS segment + * registers. In CONFIG_X86_32, the segment is obtained from either pt_regs or + * kernel_vm86_regs as applicable. In CONFIG_X86_64, CS and SS are obtained + * from pt_regs. DS, ES, FS and GS are obtained by reading the actual CPU + * registers. This done for only for completeness as in CONFIG_X86_64 segment + * registers are ignored. + * + * Return: Value of the segment selector, including null when running in + * long mode. -1 on error. 
+ */ +static unsigned short get_segment_selector(struct pt_regs *regs, + enum segment_register seg_reg) +{ +#ifdef CONFIG_X86_64 + unsigned short sel; + + switch (seg_reg) { + case SEG_REG_IGNORE: + return 0; + case SEG_REG_CS: + return (unsigned short)(regs->cs & 0xffff); + case SEG_REG_SS: + return (unsigned short)(regs->ss & 0xffff); + case SEG_REG_DS: + savesegment(ds, sel); + return sel; + case SEG_REG_ES: + savesegment(es, sel); + return sel; + case SEG_REG_FS: + savesegment(fs, sel); + return sel; + case SEG_REG_GS: + savesegment(gs, sel); + return sel; + default: + return -1; + } +#else /* CONFIG_X86_32 */ + struct kernel_vm86_regs *vm86regs = (struct kernel_vm86_regs *)regs; + + if (v8086_mode(regs)) { + switch (seg_reg) { + case SEG_REG_CS: + return (unsigned short)(regs->cs & 0xffff); + case SEG_REG_SS: + return (unsigned short)(regs->ss & 0xffff); + case SEG_REG_DS: + return vm86regs->ds; + case SEG_REG_ES: + return vm86regs->es; + case SEG_REG_FS: + return vm86regs->fs; + case SEG_REG_GS: + return vm86regs->gs; + case SEG_REG_IGNORE: + /* fall through */ + default: + return -1; + } + } + + switch (seg_reg) { + case SEG_REG_CS: + return (unsigned short)(regs->cs & 0xffff); + case SEG_REG_SS: + return (unsigned short)(regs->ss & 0xffff); + case SEG_REG_DS: + return (unsigned short)(regs->ds & 0xffff); + case SEG_REG_ES: + return (unsigned short)(regs->es & 0xffff); + case SEG_REG_FS: + return (unsigned short)(regs->fs & 0xffff); + case SEG_REG_GS: + /* + * GS may or may not be in regs as per CONFIG_X86_32_LAZY_GS. + * The macro below takes care of both cases. + */ + return get_user_gs(regs); + case SEG_REG_IGNORE: + /* fall through */ + default: + return -1; + } +#endif /* CONFIG_X86_64 */ +} + static int get_reg_offset(struct insn *insn, struct pt_regs *regs, enum reg_type type) {
On Fri, May 05, 2017 at 11:17:08AM -0700, Ricardo Neri wrote:
When computing a linear address and segmentation is used, we need to know the base address of the segment involved in the computation. In most cases, the segment base address will be zero, as with USER_DS/USER32_DS. However, a user space program may define its own segments via a local descriptor table. In such a case, the segment base address may not be zero. Thus, the segment base address is needed to correctly calculate the linear address.
The segment selector to be used when computing a linear address is determined by either any of segment override prefixes in the instruction or inferred from the registers involved in the computation of the effective address; in that order. Also, there are cases when the overrides shall be ignored (code segments are always selected by the CS segment register; string instructions always use the ES segment register along with the EDI register).
For clarity, this process can be split into two steps: resolving the relevant segment register to use and, once known, reading its value to obtain the segment selector.
The method to obtain the segment selector depends on several factors. In 32-bit builds, segment selectors are saved into the pt_regs structure when switching to kernel mode. The same is also true for virtual-8086 mode. In 64-bit builds, segmentation is mostly ignored, except when running a program in 32-bit legacy mode. In this case, CS and SS can be obtained from pt_regs. DS, ES, FS and GS can be read directly from the respective segment registers.
Lastly, the only two segment registers that are not ignored in long mode are FS and GS. In these two cases, base addresses are obtained from the respective MSRs.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 256 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 256 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 1634762..0a496f4 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -9,6 +9,7 @@ #include <asm/inat.h> #include <asm/insn.h> #include <asm/insn-eval.h> +#include <asm/vm86.h>
enum reg_type { REG_TYPE_RM = 0, @@ -33,6 +34,17 @@ enum string_instruction { SCASW_SCASD = 0xaf, };
+enum segment_register {
- SEG_REG_INVAL = -1,
- SEG_REG_IGNORE = 0,
- SEG_REG_CS = 0x23,
- SEG_REG_SS = 0x36,
- SEG_REG_DS = 0x3e,
- SEG_REG_ES = 0x26,
- SEG_REG_FS = 0x64,
- SEG_REG_GS = 0x65,
+};
Yuck, didn't we talk about this already?
Those are segment override prefixes so call them as such.
#define SEG_OVR_PFX_CS  0x23
#define SEG_OVR_PFX_SS  0x36
...
and we already have those!
arch/x86/include/asm/inat.h:
...
#define INAT_PFX_CS     5       /* 0x2E */
#define INAT_PFX_DS     6       /* 0x3E */
#define INAT_PFX_ES     7       /* 0x26 */
#define INAT_PFX_FS     8       /* 0x64 */
#define INAT_PFX_GS     9       /* 0x65 */
#define INAT_PFX_SS     10      /* 0x36 */
well, kinda, they're numbers there and not the actual prefix values.
And then there's:
arch/x86/kernel/uprobes.c::is_prefix_bad() which looks at some of those.
Please add your defines to inat.h and make that function is_prefix_bad() use them instead of naked numbers. We need to pay attention to all those different things needing to look at insn opcodes and not let them go unwieldy by each defining and duplicating stuff.
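A sketch of what that could look like; the SEG_OVR_PFX_* names are placeholders from this discussion, not existing kernel symbols, and the prefix byte values are those documented in inat.h above:

/* arch/x86/include/asm/inat.h: the actual prefix byte values */
#define SEG_OVR_PFX_ES  0x26
#define SEG_OVR_PFX_CS  0x2e
#define SEG_OVR_PFX_SS  0x36
#define SEG_OVR_PFX_DS  0x3e
#define SEG_OVR_PFX_FS  0x64
#define SEG_OVR_PFX_GS  0x65

/* arch/x86/kernel/uprobes.c: use the defines instead of naked numbers */
        switch (insn->prefixes.bytes[i]) {
        case SEG_OVR_PFX_ES:
        case SEG_OVR_PFX_CS:
        case SEG_OVR_PFX_SS:
        case SEG_OVR_PFX_DS:
                return true;    /* prefixes uprobes cannot deal with */
        }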
/**
- is_string_instruction - Determine if instruction is a string instruction
- @insn: Instruction structure containing the opcode
@@ -83,6 +95,250 @@ static bool is_string_instruction(struct insn *insn) } }
+/**
- resolve_seg_register() - obtain segment register
That function is still returning the segment override prefix and we use *that* to determine the segment register.
- @insn: Instruction structure with segment override prefixes
- @regs: Structure with register values as seen when entering kernel mode
- @regoff: Operand offset, in pt_regs, used to determine segment register
- The segment register to which an effective address refers depends on
- a) whether segment override prefixes must be ignored: always use CS when
- the register is (R|E)IP; always use ES when operand register is (E)DI with
- string instructions as defined in the Intel documentation. b) If segment
- override prefixes are present in the instruction, use them. c) Otherwise, use
- the default segment register associated with the operand register.
- The operand register, regoff, is represented as the offset from the base of
- pt_regs. Also, regoff can be -EDOM for cases in which registers are not
- used as operands (e.g., displacement-only memory addressing).
- This function returns the segment register as value from an enumeration
- as per the conditions described above. Please note that this function
- does not return the value in the segment register (i.e., the segment
- selector). The segment selector needs to be obtained using
- get_segment_selector() and passing the segment register resolved by
- this function.
- Return: Enumerated segment register to use, among CS, SS, DS, ES, FS, GS,
- ignore (in 64-bit mode as applicable), or -EINVAL in case of error.
- */
+static enum segment_register resolve_seg_register(struct insn *insn,
struct pt_regs *regs,
int regoff)
+{
- int i;
- int sel_overrides = 0;
- int seg_register = SEG_REG_IGNORE;
- if (!insn)
return SEG_REG_INVAL;
- /* First handle cases when segment override prefixes must be ignored */
- if (regoff == offsetof(struct pt_regs, ip)) {
if (user_64bit_mode(regs))
return SEG_REG_IGNORE;
else
return SEG_REG_CS;
return SEG_REG_CS;
Simplify:
if (user_64bit_mode(regs))
        return SEG_REG_IGNORE;
return SEG_REG_CS;
- }
- /*
* If the (E)DI register is used with string instructions, the ES
* segment register is always used.
*/
- if ((regoff == offsetof(struct pt_regs, di)) &&
is_string_instruction(insn)) {
if (user_64bit_mode(regs))
return SEG_REG_IGNORE;
else
return SEG_REG_ES;
return SEG_REG_CS;
What is that second return actually supposed to do?
- }
- /* Then check if we have segment overrides prefixes*/
Missing space and fullstop: "... overrides prefixes. */"
- for (i = 0; i < insn->prefixes.nbytes; i++) {
switch (insn->prefixes.bytes[i]) {
case SEG_REG_CS:
seg_register = SEG_REG_CS;
sel_overrides++;
break;
case SEG_REG_SS:
seg_register = SEG_REG_SS;
sel_overrides++;
break;
case SEG_REG_DS:
seg_register = SEG_REG_DS;
sel_overrides++;
break;
case SEG_REG_ES:
seg_register = SEG_REG_ES;
sel_overrides++;
break;
case SEG_REG_FS:
seg_register = SEG_REG_FS;
sel_overrides++;
break;
case SEG_REG_GS:
seg_register = SEG_REG_GS;
sel_overrides++;
break;
default:
return SEG_REG_INVAL;
So SEG_REG_NONE or so? It is not invalid if it is not a segment override prefix.
- /*
* Having more than one segment override prefix leads to undefined
* behavior. If this is the case, return with error.
*/
- if (sel_overrides > 1)
return SEG_REG_INVAL;
Yuck, wrapping of -E value in a SEG_REG enum. Just return -EINVAL here and make the function return an int, not that ugly enum.
And the return convention should be straight-forward: default segment if no prefix or ignored, -EINVAL if error and the actual override prefix if present.
Also, that test should be *after* the user_64bit_mode() because in long mode, segment overrides get ignored. IOW, those three if-tests around here should be combined into a single one, i.e., something like this:
if (64-bit) {
        if (!FS || !GS)
                ignore
        else
                return seg_override_pfx;        <--- Yes, that variable should be
                                                     called seg_override_pfx to
                                                     denote what it is.
} else if (sel_overrides > 1)
        -EINVAL
else if (sel_overrides)
        return seg_override_pfx;
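Rendered as real C, that restructuring might look like the following (one possible rendering of the suggestion, not the final patch):

        if (user_64bit_mode(regs)) {
                /* In long mode, only FS and GS overrides are honored. */
                if (seg_override_pfx != SEG_REG_FS &&
                    seg_override_pfx != SEG_REG_GS)
                        return SEG_REG_IGNORE;

                return seg_override_pfx;
        } else if (sel_overrides > 1) {
                return -EINVAL;
        } else if (sel_overrides) {
                return seg_override_pfx;
        }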
- if (sel_overrides == 1) {
/*
* If in long mode all segment registers but FS and GS are
* ignored.
*/
if (user_64bit_mode(regs) && !(seg_register == SEG_REG_FS ||
seg_register == SEG_REG_GS))
return SEG_REG_IGNORE;
return seg_register;
- }
- /* In long mode, all segment registers except FS and GS are ignored */
- if (user_64bit_mode(regs))
return SEG_REG_IGNORE;
- /*
* Lastly, if no segment overrides were found, determine the default
* segment register as described in the Intel documentation: SS for
* (E)SP or (E)BP. DS for all data references, AX, CX and DX are not
* valid register operands in 16-bit address encodings.
* -EDOM is reserved to identify for cases in which no register is used
* the default segment register (displacement-only addressing). The
* default segment register used in these cases is DS.
*/
- switch (regoff) {
- case offsetof(struct pt_regs, ax):
/* fall through */
- case offsetof(struct pt_regs, cx):
/* fall through */
- case offsetof(struct pt_regs, dx):
if (insn && insn->addr_bytes == 2)
return SEG_REG_INVAL;
- case offsetof(struct pt_regs, di):
/* fall through */
- case -EDOM:
/* fall through */
- case offsetof(struct pt_regs, bx):
/* fall through */
- case offsetof(struct pt_regs, si):
return SEG_REG_DS;
- case offsetof(struct pt_regs, bp):
/* fall through */
- case offsetof(struct pt_regs, sp):
return SEG_REG_SS;
- case offsetof(struct pt_regs, ip):
return SEG_REG_CS;
- default:
return SEG_REG_INVAL;
- }
So group all the fall through cases together so that you don't have this dense block of code with "/* fall through */" on every other line.
On Tue, 2017-05-30 at 12:35 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:08AM -0700, Ricardo Neri wrote:
When computing a linear address and segmentation is used, we need to know the base address of the segment involved in the computation. In most cases, the segment base address will be zero, as with USER_DS/USER32_DS. However, a user space program may define its own segments via a local descriptor table. In such a case, the segment base address may not be zero. Thus, the segment base address is needed to correctly calculate the linear address.
The segment selector to be used when computing a linear address is either determined by a segment override prefix in the instruction or inferred from the registers involved in the computation of the effective address, in that order. Also, there are cases when the overrides shall be ignored (code segments are always selected by the CS segment register; string instructions always use the ES segment register along with the EDI register).
For clarity, this process can be split into two steps: resolving the relevant segment register to use and, once known, read its value to obtain the segment selector.
The method to obtain the segment selector depends on several factors. In 32-bit builds, segment selectors are saved into the pt_regs structure when switching to kernel mode. The same is also true for virtual-8086 mode. In 64-bit builds, segmentation is mostly ignored, except when running a program in 32-bit legacy mode. In this case, CS and SS can be obtained from pt_regs. DS, ES, FS and GS can be read directly from the respective segment registers.
Lastly, the only two segment registers that are not ignored in long mode are FS and GS. In these two cases, base addresses are obtained from the respective MSRs.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 256 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 256 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 1634762..0a496f4 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -9,6 +9,7 @@
 #include <asm/inat.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
+#include <asm/vm86.h>
 enum reg_type {
     REG_TYPE_RM = 0,
@@ -33,6 +34,17 @@ enum string_instruction {
     SCASW_SCASD = 0xaf,
 };
+enum segment_register {
+    SEG_REG_INVAL = -1,
+    SEG_REG_IGNORE = 0,
+    SEG_REG_CS = 0x23,
+    SEG_REG_SS = 0x36,
+    SEG_REG_DS = 0x3e,
+    SEG_REG_ES = 0x26,
+    SEG_REG_FS = 0x64,
+    SEG_REG_GS = 0x65,
+};
Yuck, didn't we talk about this already?
I am sorry Borislav. I thought you agreed that I could use the values of the segment override prefixes to identify the segment registers [1].
Those are segment override prefixes so call them as such.
#define SEG_OVR_PFX_CS 0x23
#define SEG_OVR_PFX_SS 0x36
...
and we already have those!
arch/x86/include/asm/inat.h:

...
#define INAT_PFX_CS	5	/* 0x2E */
#define INAT_PFX_DS	6	/* 0x3E */
#define INAT_PFX_ES	7	/* 0x26 */
#define INAT_PFX_FS	8	/* 0x64 */
#define INAT_PFX_GS	9	/* 0x65 */
#define INAT_PFX_SS	10	/* 0x36 */
well, kinda, they're numbers there and not the actual prefix values.
These numbers can be 'translated' to the actual value of the prefixes via inat_get_opcode_attribute(). In my next version I am planning to use this function and reuse the aforementioned definitions.
And then there's:
arch/x86/kernel/uprobes.c::is_prefix_bad() which looks at some of those.
Please add your defines to inat.h
Will do.
and make that function is_prefix_bad() use them instead of naked numbers. We need to pay attention to all those different things needing to look at insn opcodes and not let them go unwieldy by each defining and duplicating stuff.
I have implemented this change and will be part of my next version.
 /**
  * is_string_instruction - Determine if instruction is a string instruction
  * @insn:	Instruction structure containing the opcode
@@ -83,6 +95,250 @@ static bool is_string_instruction(struct insn *insn)
 	}
 }
/**
 * resolve_seg_register() - obtain segment register
That function is still returning the segment override prefix and we use *that* to determine the segment register.
Once I add new definitions for the segment registers and reuse the existing definitions of the segment override prefixes this problem will be fixed.
 * @insn:	Instruction structure with segment override prefixes
 * @regs:	Structure with register values as seen when entering kernel mode
 * @regoff:	Operand offset, in pt_regs, used to determine segment register
 *
 * The segment register to which an effective address refers depends on
 * a) whether segment override prefixes must be ignored: always use CS when
 * the register is (R|E)IP; always use ES when operand register is (E)DI with
 * string instructions as defined in the Intel documentation. b) If segment
 * override prefixes are present in the instruction prefixes, use them. c) Use
 * the default segment register associated with the operand register.
 *
 * The operand register, regoff, is represented as the offset from the base of
 * pt_regs. Also, regoff can be -EDOM for cases in which registers are not
 * used as operands (e.g., displacement-only memory addressing).
 *
 * This function returns the segment register as a value from an enumeration
 * as per the conditions described above. Please note that this function
 * does not return the value in the segment register (i.e., the segment
 * selector). The segment selector needs to be obtained using
 * get_segment_selector() and passing the segment register resolved by
 * this function.
 *
 * Return: Enumerated segment register to use, among CS, SS, DS, ES, FS, GS,
 * ignore (in 64-bit mode as applicable), or -EINVAL in case of error.
 */
static enum segment_register resolve_seg_register(struct insn *insn,
                                                  struct pt_regs *regs,
                                                  int regoff)
{
    int i;
    int sel_overrides = 0;
    int seg_register = SEG_REG_IGNORE;

    if (!insn)
        return SEG_REG_INVAL;

    /* First handle cases when segment override prefixes must be ignored */
    if (regoff == offsetof(struct pt_regs, ip)) {
        if (user_64bit_mode(regs))
            return SEG_REG_IGNORE;
        else
            return SEG_REG_CS;
        return SEG_REG_CS;
Simplify:
    if (user_64bit_mode(regs))
        return SEG_REG_IGNORE;

    return SEG_REG_CS;
Will do.
    }

    /*
     * If the (E)DI register is used with string instructions, the ES
     * segment register is always used.
     */
    if ((regoff == offsetof(struct pt_regs, di)) &&
        is_string_instruction(insn)) {
        if (user_64bit_mode(regs))
            return SEG_REG_IGNORE;
        else
            return SEG_REG_ES;
        return SEG_REG_CS;
What is that second return actually supposed to do?
This is not correct and I will remove it. Actually, it will never run due to the if/else above it. Thanks for noticing it.
    }

    /* Then check if we have segment overrides prefixes*/
Missing space and fullstop: "... overrides prefixes. */"
Will fix.
    for (i = 0; i < insn->prefixes.nbytes; i++) {
        switch (insn->prefixes.bytes[i]) {
        case SEG_REG_CS:
            seg_register = SEG_REG_CS;
            sel_overrides++;
            break;
        case SEG_REG_SS:
            seg_register = SEG_REG_SS;
            sel_overrides++;
            break;
        case SEG_REG_DS:
            seg_register = SEG_REG_DS;
            sel_overrides++;
            break;
        case SEG_REG_ES:
            seg_register = SEG_REG_ES;
            sel_overrides++;
            break;
        case SEG_REG_FS:
            seg_register = SEG_REG_FS;
            sel_overrides++;
            break;
        case SEG_REG_GS:
            seg_register = SEG_REG_GS;
            sel_overrides++;
            break;
        default:
            return SEG_REG_INVAL;
So SEG_REG_NONE or so? It is not invalid if it is not a segment override prefix.
Right, we can have more prefixes. We shouldn't need a default action, as we are only looking for the segment override prefixes, as you mention.
    /*
     * Having more than one segment override prefix leads to undefined
     * behavior. If this is the case, return with error.
     */
    if (sel_overrides > 1)
        return SEG_REG_INVAL;
Yuck, wrapping of -E value in a SEG_REG enum. Just return -EINVAL here and make the function return an int, not that ugly enum.
Will do.
And the return convention should be straight-forward: default segment if no prefix or ignored, -EINVAL if error and the actual override prefix if present.
Wouldn't this be ending up mixing the actual segment register and segment register overrides? I plan to have a function that parses the segment override prefixes and returns SEG_REG_CS/DS/ES/FS/GS or SEG_REG_IGNORE for long mode or SEG_REG_DEFAULT when the default segment register needs to be used. A separate function will determine what such default segment register is. Does this make sense?
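(A rough sketch of that split, for illustration only; get_overridden_seg_reg() and seg_reg_from_prefix() are hypothetical names, the latter mapping a prefix byte to a SEG_REG_* value or a negative number for non-override prefixes:)

    static int get_overridden_seg_reg(struct insn *insn, struct pt_regs *regs)
    {
        int i, num = 0;
        int seg_reg = SEG_REG_DEFAULT;

        for (i = 0; i < insn->prefixes.nbytes; i++) {
            int s = seg_reg_from_prefix(insn->prefixes.bytes[i]);

            if (s < 0)      /* not a segment override prefix */
                continue;
            seg_reg = s;
            num++;
        }

        /* More than one segment override prefix is undefined behavior. */
        if (num > 1)
            return -EINVAL;

        /* In long mode, overrides other than FS and GS are ignored. */
        if (user_64bit_mode(regs) &&
            !(seg_reg == SEG_REG_FS || seg_reg == SEG_REG_GS))
            return SEG_REG_IGNORE;

        return seg_reg;
    }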
Also, that test should be *after* the user_64bit_mode() because in long mode, segment overrides get ignored. IOW, those three if-tests around here should be combined into a single one, i.e., something like this:
    if (64-bit) {
        if (!FS || !GS)
            ignore
        else
            return seg_override_pfx;   <--- Yes, that variable should be called
                                            seg_override_pfx to denote what it is.
Perhaps it can return what I have described above?
    } else if (sel_overrides > 1)
        -EINVAL
    else if (sel_overrides)
        return seg_override_pfx;
Will re-do these tests as you mention.
    if (sel_overrides == 1) {
        /*
         * If in long mode all segment registers but FS and GS are
         * ignored.
         */
        if (user_64bit_mode(regs) && !(seg_register == SEG_REG_FS ||
            seg_register == SEG_REG_GS))
            return SEG_REG_IGNORE;

        return seg_register;
    }

    /* In long mode, all segment registers except FS and GS are ignored */
    if (user_64bit_mode(regs))
        return SEG_REG_IGNORE;
    /*
     * Lastly, if no segment overrides were found, determine the default
     * segment register as described in the Intel documentation: SS for
     * (E)SP or (E)BP; DS for all other data references. AX, CX and DX
     * are not valid register operands in 16-bit address encodings.
     * -EDOM identifies cases in which no register is used
     * (displacement-only addressing); the default segment register in
     * such cases is DS.
     */
    switch (regoff) {
    case offsetof(struct pt_regs, ax):
        /* fall through */
    case offsetof(struct pt_regs, cx):
        /* fall through */
    case offsetof(struct pt_regs, dx):
        if (insn && insn->addr_bytes == 2)
            return SEG_REG_INVAL;
    case offsetof(struct pt_regs, di):
        /* fall through */
    case -EDOM:
        /* fall through */
    case offsetof(struct pt_regs, bx):
        /* fall through */
    case offsetof(struct pt_regs, si):
        return SEG_REG_DS;
    case offsetof(struct pt_regs, bp):
        /* fall through */
    case offsetof(struct pt_regs, sp):
        return SEG_REG_SS;
    case offsetof(struct pt_regs, ip):
        return SEG_REG_CS;
    default:
        return SEG_REG_INVAL;
    }
So group all the fall through cases together so that you don't have this dense block of code with "/* fall through */" on every other line.
Will do.
Thanks and BR, Ricardo
On Thu, 2017-06-15 at 11:37 -0700, Ricardo Neri wrote:
Yuck, didn't we talk about this already?
I am sorry Borislav. I thought you agreed that I could use the values of the segment override prefixes to identify the segment registers [1].
This time with the reference: [1]. https://lkml.org/lkml/2017/5/5/377
On Thu, Jun 15, 2017 at 12:04:21PM -0700, Ricardo Neri wrote:
On Thu, 2017-06-15 at 11:37 -0700, Ricardo Neri wrote:
Yuck, didn't we talk about this already?
I am sorry Borislav. I thought you agreed that I could use the values of the segment override prefixes to identify the segment registers [1].
Yes, I agreed with that but...
This time with the reference: [1]. https://lkml.org/lkml/2017/5/5/377
... this says it already: "... but you should call them what they are: "enum seg_override_pfxs" or "enum seg_ovr_pfx" or..." IOW, those are segment *override* prefixes and should be called such and not "enum segment_register" as this way is misleading.
IOW, here's what I think you should do:
/* Segment override prefixes: */
#define SEG_CS_OVERRIDE 0x23
#define SEG_SS_OVERRIDE 0x36
#define SEG_DS_OVERRIDE 0x3e
... and so on...
and use the defines directly. The enum is fine and dandy but then you need to return an error value too so you can just as well have the function return an int simply and make sure you check the retval.
On Thu, Jun 15, 2017 at 11:37:51AM -0700, Ricardo Neri wrote:
Wouldn't this be ending up mixing the actual segment register and segment register overrides? I plan to have a function that parses the segment override prefixes and returns SEG_REG_CS/DS/ES/FS/GS or SEG_REG_IGNORE for long mode or SEG_REG_DEFAULT when the default segment register needs to be used. A separate function will determine what such default segment register is. Does this make sense?
Yes.
The segment descriptor contains information that is relevant to how linear addresses need to be computed. It contains the default size of addresses as well as the base address of the segment. Thus, given a segment selector, we ought to look at the segment descriptor to correctly calculate the linear address.
In protected mode, the segment selector might indicate a segment descriptor from either the global descriptor table or a local descriptor table. Both cases are considered in this function.
This function is the initial implementation for subsequent functions that will obtain the aforementioned attributes of the segment descriptor.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 55 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 0a496f4..f46cb31 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -6,9 +6,13 @@
 #include <linux/kernel.h>
 #include <linux/string.h>
 #include <linux/ratelimit.h>
+#include <linux/mmu_context.h>
+#include <asm/desc_defs.h>
+#include <asm/desc.h>
 #include <asm/inat.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
+#include <asm/ldt.h>
 #include <asm/vm86.h>
 enum reg_type {
@@ -421,6 +425,57 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 }
 /**
+ * get_desc() - Obtain address of segment descriptor
+ * @sel:	Segment selector
+ *
+ * Given a segment selector, obtain a pointer to the segment descriptor.
+ * Both global and local descriptor tables are supported.
+ *
+ * Return: pointer to segment descriptor on success. NULL on failure.
+ */
+static struct desc_struct *get_desc(unsigned short sel)
+{
+    struct desc_ptr gdt_desc = {0, 0};
+    struct desc_struct *desc = NULL;
+    unsigned long desc_base;
+
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+    if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT) {
+        /* Bits [15:3] contain the index of the desired entry. */
+        sel >>= 3;
+
+        mutex_lock(&current->active_mm->context.lock);
+        /* The size of the LDT refers to the number of entries. */
+        if (!current->active_mm->context.ldt ||
+            sel >= current->active_mm->context.ldt->size) {
+            mutex_unlock(&current->active_mm->context.lock);
+            return NULL;
+        }
+
+        desc = &current->active_mm->context.ldt->entries[sel];
+        mutex_unlock(&current->active_mm->context.lock);
+        return desc;
+    }
+#endif
+    native_store_gdt(&gdt_desc);
+
+    /*
+     * Segment descriptors have a size of 8 bytes. Thus, the index is
+     * multiplied by 8 to obtain the offset of the desired descriptor from
+     * the start of the GDT. As bits [15:3] of the segment selector contain
+     * the index, it can be regarded multiplied by 8 already. All that
+     * remains is to clear bits [2:0].
+     */
+    desc_base = sel & ~(SEGMENT_RPL_MASK | SEGMENT_TI_MASK);
+
+    if (desc_base > gdt_desc.size)
+        return NULL;
+
+    desc = (struct desc_struct *)(gdt_desc.address + desc_base);
+    return desc;
+}
+
+/**
  * insn_get_reg_offset_modrm_rm() - Obtain register in r/m part of ModRM byte
  * @insn:	Instruction structure containing the ModRM byte
  * @regs:	Structure with register values as seen when entering kernel mode
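(To make the selector arithmetic above concrete, a couple of worked values, illustrative only:)

    unsigned short sel = 0x000f;

    /* Bit 2 is the table indicator: 1 means LDT, 0 means GDT. */
    int uses_ldt = (sel & SEGMENT_TI_MASK) == SEGMENT_LDT;
    unsigned int index = sel >> 3;              /* bits [15:3]: entry index */
    unsigned int rpl = sel & SEGMENT_RPL_MASK;  /* bits [1:0]: privilege    */

    /* sel = 0x000f -> LDT entry 1, RPL 3.
     * sel = 0x0010 -> GDT byte offset 0x10, i.e., entry 2, RPL 0. */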
With segmentation, the base address of the segment descriptor is needed to compute a linear address. The segment descriptor used in the address computation depends on either any segment override prefixes in the instruction or the default segment determined by the registers involved in the address computation. Thus, both the instruction as well as the register (specified as the offset from the base of pt_regs) are given as inputs, along with a boolean variable to select between override and default.
The segment selector is determined by get_seg_selector() with the inputs described above. Once the selector is known, the base address is determined. In protected mode, the selector is used to obtain the segment descriptor and then its base address. If in 64-bit user mode, the segment base address is zero except when FS or GS are used. In virtual-8086 mode, the base address is computed as the value of the segment selector shifted 4 positions to the left.
In protected mode, segment limits are enforced. Thus, a function to determine the limit of the segment is added. Segment limits are not enforced in long or virtual-8086 mode. For the latter, addresses are limited to 20 bits; address size will be handled when computing the linear address.
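(Worked example of the virtual-8086 computation, with arbitrary numbers:)

    unsigned short sel = 0x1234;    /* segment selector  */
    unsigned short off = 0x0010;    /* effective address */
    unsigned long linear = ((unsigned long)sel << 4) + off;    /* 0x12350 */

    /* Maximum reachable address: (0xffff << 4) + 0xffff = 0x10ffef, ~20 bits. */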
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/include/asm/insn-eval.h | 2 + arch/x86/lib/insn-eval.c | 127 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 129 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 7e8c963..7f3c7fe 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -13,5 +13,7 @@
 
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
+unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
+                                int regoff);
 
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index f46cb31..c77ed80 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -476,6 +476,133 @@ static struct desc_struct *get_desc(unsigned short sel)
 }
 /**
+ * insn_get_seg_base() - Obtain base address of segment descriptor.
+ * @regs:	Structure with register values as seen when entering kernel mode
+ * @insn:	Instruction structure with selector override prefixes
+ * @regoff:	Operand offset, in pt_regs, of which the selector is needed
+ *
+ * Obtain the base address of the segment descriptor as indicated by either
+ * any segment override prefixes contained in insn or the default segment
+ * applicable to the register indicated by regoff. regoff is specified as the
+ * offset in bytes from the base of pt_regs.
+ *
+ * Return: In protected mode, base address of the segment. Zero for long
+ * mode, except when FS or GS are used. In virtual-8086 mode, the segment
+ * selector shifted 4 positions to the left. -1L in case of error.
+ */
+unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
+                                int regoff)
+{
+    struct desc_struct *desc;
+    unsigned short sel;
+    enum segment_register seg_reg;
+
+    seg_reg = resolve_seg_register(insn, regs, regoff);
+    if (seg_reg == SEG_REG_INVAL)
+        return -1L;
+
+    sel = get_segment_selector(regs, seg_reg);
+    if ((short)sel < 0)
+        return -1L;
+
+    if (v8086_mode(regs))
+        /*
+         * Base is simply the segment selector shifted 4
+         * positions to the left.
+         */
+        return (unsigned long)(sel << 4);
+
+    if (user_64bit_mode(regs)) {
+        /*
+         * Only FS or GS will have a base address, the rest of
+         * the segments' bases are forced to 0.
+         */
+        unsigned long base;
+
+        if (seg_reg == SEG_REG_FS)
+            rdmsrl(MSR_FS_BASE, base);
+        else if (seg_reg == SEG_REG_GS)
+            /*
+             * swapgs was called at the kernel entry point. Thus,
+             * MSR_KERNEL_GS_BASE will have the user-space GS base.
+             */
+            rdmsrl(MSR_KERNEL_GS_BASE, base);
+        else if (seg_reg != SEG_REG_IGNORE)
+            /* We should ignore the rest of segment registers */
+            base = -1L;
+        else
+            base = 0;
+        return base;
+    }
+
+    /* In protected mode the segment selector cannot be null */
+    if (!sel)
+        return -1L;
+
+    desc = get_desc(sel);
+    if (!desc)
+        return -1L;
+
+    return get_desc_base(desc);
+}
+
+/**
+ * get_seg_limit() - Obtain the limit of a segment descriptor
+ * @regs:	Structure with register values as seen when entering kernel mode
+ * @insn:	Instruction structure with selector override prefixes
+ * @regoff:	Operand offset, in pt_regs, of which the selector is needed
+ *
+ * Obtain the limit of the segment descriptor. The segment selector is obtained
+ * by inspecting any segment override prefixes or the default selector
+ * inferred by regoff. regoff is specified as the offset in bytes from the base
+ * of pt_regs.
+ *
+ * Return: In protected mode, the limit of the segment descriptor in bytes.
+ * In long mode and virtual-8086 mode, segment limits are not enforced. Thus,
+ * limit is returned as -1L to imply a limit-less segment. Zero is returned on
+ * error.
+ */
+static unsigned long get_seg_limit(struct pt_regs *regs, struct insn *insn,
+                                   int regoff)
+{
+    struct desc_struct *desc;
+    unsigned short sel;
+    unsigned long limit;
+    enum segment_register seg_reg;
+
+    seg_reg = resolve_seg_register(insn, regs, regoff);
+    if (seg_reg == SEG_REG_INVAL)
+        return 0;
+
+    sel = get_segment_selector(regs, seg_reg);
+    if ((short)sel < 0)
+        return 0;
+
+    if (user_64bit_mode(regs) || v8086_mode(regs))
+        return -1L;
+
+    if (!sel)
+        return 0;
+
+    desc = get_desc(sel);
+    if (!desc)
+        return 0;
+
+    /*
+     * If the granularity bit is set, the limit is given in multiples
+     * of 4096. When the granularity bit is set, the least 12 significant
+     * bits are not tested when checking the segment limits. In practice,
+     * this means that the segment ends in (limit << 12) + 0xfff.
+     */
+    limit = get_desc_limit(desc);
+    if (desc->g)
+        limit <<= 12 | 0x7;
+
+    return limit;
+}
+
+/**
  * insn_get_reg_offset_modrm_rm() - Obtain register in r/m part of ModRM byte
  * @insn:	Instruction structure containing the ModRM byte
  * @regs:	Structure with register values as seen when entering kernel mode
On Fri, May 05, 2017 at 11:17:10AM -0700, Ricardo Neri wrote:
With segmentation, the base address of the segment descriptor is needed to compute a linear address. The segment descriptor used in the address computation depends on either any segment override prefixes in the instruction or the default segment determined by the registers involved in the address computation. Thus, both the instruction as well as the register (specified as the offset from the base of pt_regs) are given as inputs, along with a boolean variable to select between override and default.
...
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index f46cb31..c77ed80 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -476,6 +476,133 @@ static struct desc_struct *get_desc(unsigned short sel)
 }
/**
 * insn_get_seg_base() - Obtain base address of segment descriptor.
 * @regs:	Structure with register values as seen when entering kernel mode
 * @insn:	Instruction structure with selector override prefixes
 * @regoff:	Operand offset, in pt_regs, of which the selector is needed
 *
 * Obtain the base address of the segment descriptor as indicated by either
 * any segment override prefixes contained in insn or the default segment
 * applicable to the register indicated by regoff. regoff is specified as the
 * offset in bytes from the base of pt_regs.
 *
 * Return: In protected mode, base address of the segment. Zero for long
 * mode, except when FS or GS are used. In virtual-8086 mode, the segment
 * selector shifted 4 positions to the left. -1L in case of error.
 */
unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
                                int regoff)
{
    struct desc_struct *desc;
    unsigned short sel;
    enum segment_register seg_reg;

    seg_reg = resolve_seg_register(insn, regs, regoff);
    if (seg_reg == SEG_REG_INVAL)
        return -1L;

    sel = get_segment_selector(regs, seg_reg);
    if ((short)sel < 0)
I guess it would be better if that function returned a signed short so you don't have to cast it here. (You're casting it to an unsigned long below anyway.)
        return -1L;

    if (v8086_mode(regs))
        /*
         * Base is simply the segment selector shifted 4
         * positions to the left.
         */
        return (unsigned long)(sel << 4);
...
static unsigned long get_seg_limit(struct pt_regs *regs, struct insn *insn,
                                   int regoff)
{
    struct desc_struct *desc;
    unsigned short sel;
    unsigned long limit;
    enum segment_register seg_reg;

    seg_reg = resolve_seg_register(insn, regs, regoff);
    if (seg_reg == SEG_REG_INVAL)
        return 0;

    sel = get_segment_selector(regs, seg_reg);
    if ((short)sel < 0)
Ditto.
        return 0;

    if (user_64bit_mode(regs) || v8086_mode(regs))
        return -1L;

    if (!sel)
        return 0;

    desc = get_desc(sel);
    if (!desc)
        return 0;

    /*
     * If the granularity bit is set, the limit is given in multiples
     * of 4096. When the granularity bit is set, the least 12 significant
the 12 least significant bits
     * bits are not tested when checking the segment limits. In practice,
     * this means that the segment ends in (limit << 12) + 0xfff.
     */
    limit = get_desc_limit(desc);
    if (desc->g)
        limit <<= 12 | 0x7;
That 0x7 doesn't look like 0xfff - it shifts limit by 15 instead. You can simply write it like you mean it:
limit = (limit << 12) + 0xfff;
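(As a sanity check of the corrected formula: a flat 4GB segment has a limit field of 0xfffff with the granularity bit set, and (0xfffff << 12) + 0xfff = 0xffffffff, as expected.)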
On Wed, 2017-05-31 at 18:58 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:10AM -0700, Ricardo Neri wrote:
With segmentation, the base address of the segment descriptor is needed to compute a linear address. The segment descriptor used in the address computation depends on either any segment override prefixes in the instruction or the default segment determined by the registers involved in the address computation. Thus, both the instruction as well as the register (specified as the offset from the base of pt_regs) are given as inputs, along with a boolean variable to select between override and default.
...
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index f46cb31..c77ed80 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -476,6 +476,133 @@ static struct desc_struct *get_desc(unsigned short sel)
 }
/**
 * insn_get_seg_base() - Obtain base address of segment descriptor.
 * @regs:	Structure with register values as seen when entering kernel mode
 * @insn:	Instruction structure with selector override prefixes
 * @regoff:	Operand offset, in pt_regs, of which the selector is needed
 *
 * Obtain the base address of the segment descriptor as indicated by either
 * any segment override prefixes contained in insn or the default segment
 * applicable to the register indicated by regoff. regoff is specified as the
 * offset in bytes from the base of pt_regs.
 *
 * Return: In protected mode, base address of the segment. Zero for long
 * mode, except when FS or GS are used. In virtual-8086 mode, the segment
 * selector shifted 4 positions to the left. -1L in case of error.
 */
unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
                                int regoff)
{
    struct desc_struct *desc;
    unsigned short sel;
    enum segment_register seg_reg;

    seg_reg = resolve_seg_register(insn, regs, regoff);
    if (seg_reg == SEG_REG_INVAL)
        return -1L;

    sel = get_segment_selector(regs, seg_reg);
    if ((short)sel < 0)
I guess it would be better if that function returned a signed short so you don't have to cast it here. (You're casting it to an unsigned long below anyway.)
Yes, this makes sense. I will make this change.
        return -1L;

    if (v8086_mode(regs))
        /*
         * Base is simply the segment selector shifted 4
         * positions to the left.
         */
        return (unsigned long)(sel << 4);
...
static unsigned long get_seg_limit(struct pt_regs *regs, struct insn *insn,
                                   int regoff)
{
    struct desc_struct *desc;
    unsigned short sel;
    unsigned long limit;
    enum segment_register seg_reg;

    seg_reg = resolve_seg_register(insn, regs, regoff);
    if (seg_reg == SEG_REG_INVAL)
        return 0;

    sel = get_segment_selector(regs, seg_reg);
    if ((short)sel < 0)
Ditto.
Here as well.
        return 0;

    if (user_64bit_mode(regs) || v8086_mode(regs))
        return -1L;

    if (!sel)
        return 0;

    desc = get_desc(sel);
    if (!desc)
        return 0;

    /*
     * If the granularity bit is set, the limit is given in multiples
     * of 4096. When the granularity bit is set, the least 12 significant
the 12 least significant bits
     * bits are not tested when checking the segment limits. In practice,
     * this means that the segment ends in (limit << 12) + 0xfff.
     */
    limit = get_desc_limit(desc);
    if (desc->g)
        limit <<= 12 | 0x7;
That 0x7 doesn't look like 0xfff - it shifts limit by 15 instead. You can simply write it like you mean it:
limit = (limit << 12) + 0xfff;
You are right, this is wrong. I will implement it as you mention.
Thanks and BR, Ricardo
This function returns the default values of the address and operand sizes as specified in the segment descriptor. This information is determined from the D and L bits. Hence, it can be used for both IA-32e 64-bit and 32-bit legacy modes. For virtual-8086 mode, the default address and operand sizes are always 2 bytes.
The D bit is only meaningful for code segments. Thus, this function always uses the code segment selector contained in regs.
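For reference, the decoding described above is:

    CS.L=0, CS.D=0 -> 2-byte addresses, 2-byte operands (16-bit legacy code)
    CS.L=0, CS.D=1 -> 4-byte addresses, 4-byte operands (32-bit legacy code)
    CS.L=1, CS.D=0 -> 8-byte addresses, 4-byte operands (IA-32e 64-bit code)
    CS.L=1, CS.D=1 -> invalid combination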
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/include/asm/insn-eval.h | 6 ++++ arch/x86/lib/insn-eval.c | 65 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 7f3c7fe..9ed1c88 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -11,9 +11,15 @@
 #include <linux/err.h>
 #include <asm/ptrace.h>
+struct insn_code_seg_defaults {
+    unsigned char address_bytes;
+    unsigned char operand_bytes;
+};
+
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
                                 int regoff);
+struct insn_code_seg_defaults insn_get_code_seg_defaults(struct pt_regs *regs);
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index c77ed80..693e5a8 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -603,6 +603,71 @@ static unsigned long get_seg_limit(struct pt_regs *regs, struct insn *insn,
 }
 /**
+ * insn_get_code_seg_defaults() - Obtain code segment default parameters
+ * @regs:	Structure with register values as seen when entering kernel mode
+ *
+ * Obtain the default parameters of the code segment: address and operand
+ * sizes. The code segment is obtained from the selector contained in the CS
+ * register in regs. In protected mode, the default address is determined by
+ * inspecting the L and D bits of the segment descriptor. In virtual-8086
+ * mode, the default is always two bytes for both address and operand sizes.
+ *
+ * Return: A populated insn_code_seg_defaults structure on success. The
+ * structure contains only zeros on failure.
+ */
+struct insn_code_seg_defaults insn_get_code_seg_defaults(struct pt_regs *regs)
+{
+    struct desc_struct *desc;
+    struct insn_code_seg_defaults defs;
+    unsigned short sel;
+    /*
+     * The most significant byte of AR_TYPE_MASK determines whether a
+     * segment contains data or code.
+     */
+    unsigned int type_mask = AR_TYPE_MASK & (1 << 11);
+
+    memset(&defs, 0, sizeof(defs));
+
+    if (v8086_mode(regs)) {
+        defs.address_bytes = 2;
+        defs.operand_bytes = 2;
+        return defs;
+    }
+
+    sel = (unsigned short)regs->cs;
+
+    desc = get_desc(sel);
+    if (!desc)
+        return defs;
+
+    /* if data segment, return */
+    if (!(desc->b & type_mask))
+        return defs;
+
+    switch ((desc->l << 1) | desc->d) {
+    case 0: /* Legacy mode. CS.L=0, CS.D=0 */
+        defs.address_bytes = 2;
+        defs.operand_bytes = 2;
+        break;
+    case 1: /* Legacy mode. CS.L=0, CS.D=1 */
+        defs.address_bytes = 4;
+        defs.operand_bytes = 4;
+        break;
+    case 2: /* IA-32e 64-bit mode. CS.L=1, CS.D=0 */
+        defs.address_bytes = 8;
+        defs.operand_bytes = 4;
+        break;
+    case 3: /* Invalid setting. CS.L=1, CS.D=1 */
+        /* fall through */
+    default:
+        defs.address_bytes = 0;
+        defs.operand_bytes = 0;
+    }
+
+    return defs;
+}
+
+/**
  * insn_get_reg_offset_modrm_rm() - Obtain register in r/m part of ModRM byte
  * @insn:	Instruction structure containing the ModRM byte
  * @regs:	Structure with register values as seen when entering kernel mode
On Fri, May 05, 2017 at 11:17:11AM -0700, Ricardo Neri wrote:
This function returns the default values of the address and operand sizes as specified in the segment descriptor. This information is determined from the D and L bits. Hence, it can be used for both IA-32e 64-bit and 32-bit legacy modes. For virtual-8086 mode, the default address and operand sizes are always 2 bytes.
The D bit is only meaningful for code segments. Thus, these functions always use the code segment selector contained in regs.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/insn-eval.h | 6 ++++ arch/x86/lib/insn-eval.c | 65 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h
index 7f3c7fe..9ed1c88 100644
--- a/arch/x86/include/asm/insn-eval.h
+++ b/arch/x86/include/asm/insn-eval.h
@@ -11,9 +11,15 @@
 #include <linux/err.h>
 #include <asm/ptrace.h>
+struct insn_code_seg_defaults {
A whole struct for a function which gets called only once?
Bah, that's a bit too much, if you ask me.
So you're returning two small unsigned integers - i.e., you can just as well return a single u8 and put address and operand sizes in there:
ret = oper_sz | addr_sz << 4;
No need for special structs for that.
    unsigned char address_bytes;
    unsigned char operand_bytes;
};
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
                                 int regoff);
+struct insn_code_seg_defaults insn_get_code_seg_defaults(struct pt_regs *regs);
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index c77ed80..693e5a8 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -603,6 +603,71 @@ static unsigned long get_seg_limit(struct pt_regs *regs, struct insn *insn,
 }
/**
 * insn_get_code_seg_defaults() - Obtain code segment default parameters
 * @regs:	Structure with register values as seen when entering kernel mode
 *
 * Obtain the default parameters of the code segment: address and operand
 * sizes. The code segment is obtained from the selector contained in the CS
 * register in regs. In protected mode, the default address is determined by
 * inspecting the L and D bits of the segment descriptor. In virtual-8086
 * mode, the default is always two bytes for both address and operand sizes.
 *
 * Return: A populated insn_code_seg_defaults structure on success. The
 * structure contains only zeros on failure.
s/failure/error/
 */
struct insn_code_seg_defaults insn_get_code_seg_defaults(struct pt_regs *regs)
{
    struct desc_struct *desc;
    struct insn_code_seg_defaults defs;
    unsigned short sel;
    /*
     * The most significant byte of AR_TYPE_MASK determines whether a
     * segment contains data or code.
     */
    unsigned int type_mask = AR_TYPE_MASK & (1 << 11);

    memset(&defs, 0, sizeof(defs));

    if (v8086_mode(regs)) {
        defs.address_bytes = 2;
        defs.operand_bytes = 2;
        return defs;
    }

    sel = (unsigned short)regs->cs;

    desc = get_desc(sel);
    if (!desc)
        return defs;

    /* if data segment, return */
    if (!(desc->b & type_mask))
        return defs;
So you can simplify that into:
    /* A code segment? */
    if (!(desc->b & BIT(11)))
        return defs;
and remove that type_mask thing.
On Wed, 2017-06-07 at 14:59 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:11AM -0700, Ricardo Neri wrote:
This function returns the default values of the address and operand sizes as specified in the segment descriptor. This information is determined from the D and L bits. Hence, it can be used for both IA-32e 64-bit and 32-bit legacy modes. For virtual-8086 mode, the default address and operand sizes are always 2 bytes.
The D bit is only meaningful for code segments. Thus, these functions always use the code segment selector contained in regs.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/insn-eval.h | 6 ++++ arch/x86/lib/insn-eval.c | 65 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 71 insertions(+)
diff --git a/arch/x86/include/asm/insn-eval.h b/arch/x86/include/asm/insn-eval.h index 7f3c7fe..9ed1c88 100644 --- a/arch/x86/include/asm/insn-eval.h +++ b/arch/x86/include/asm/insn-eval.h @@ -11,9 +11,15 @@ #include <linux/err.h> #include <asm/ptrace.h>
+struct insn_code_seg_defaults {
A whole struct for a function which gets called only once?
Bah, that's a bit too much, if you ask me.
So you're returning two small unsigned integers - i.e., you can just as well return a single u8 and put address and operand sizes in there:
ret = oper_sz | addr_sz << 4;
No need for special structs for that.
OK. This makes sense. Perhaps I can use a couple of #define's to set and get the address and operand sizes in a single u8. This would make the code more readable.
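(For instance, with illustrative names only, nothing in the tree yet:)

    /* Pack/unpack both sizes in a single u8. */
    #define INSN_CODE_SEG_PARAMS(oper_sz, addr_sz)  ((oper_sz) | ((addr_sz) << 4))
    #define INSN_CODE_SEG_OPND_SZ(params)           ((params) & 0xf)
    #define INSN_CODE_SEG_ADDR_SZ(params)           (((params) >> 4) & 0xf)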
    unsigned char address_bytes;
    unsigned char operand_bytes;
};
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs);
 int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs);
 unsigned long insn_get_seg_base(struct pt_regs *regs, struct insn *insn,
                                 int regoff);
+struct insn_code_seg_defaults insn_get_code_seg_defaults(struct pt_regs *regs);
 #endif /* _ASM_X86_INSN_EVAL_H */
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index c77ed80..693e5a8 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -603,6 +603,71 @@ static unsigned long get_seg_limit(struct pt_regs *regs, struct insn *insn,
 }
/**
 * insn_get_code_seg_defaults() - Obtain code segment default parameters
 * @regs:	Structure with register values as seen when entering kernel mode
 *
 * Obtain the default parameters of the code segment: address and operand
 * sizes. The code segment is obtained from the selector contained in the CS
 * register in regs. In protected mode, the default address is determined by
 * inspecting the L and D bits of the segment descriptor. In virtual-8086
 * mode, the default is always two bytes for both address and operand sizes.
 *
 * Return: A populated insn_code_seg_defaults structure on success. The
 * structure contains only zeros on failure.
s/failure/error/
Will correct.
 */
struct insn_code_seg_defaults insn_get_code_seg_defaults(struct pt_regs *regs)
{
    struct desc_struct *desc;
    struct insn_code_seg_defaults defs;
    unsigned short sel;
    /*
     * The most significant byte of AR_TYPE_MASK determines whether a
     * segment contains data or code.
     */
    unsigned int type_mask = AR_TYPE_MASK & (1 << 11);

    memset(&defs, 0, sizeof(defs));

    if (v8086_mode(regs)) {
        defs.address_bytes = 2;
        defs.operand_bytes = 2;
        return defs;
    }

    sel = (unsigned short)regs->cs;

    desc = get_desc(sel);
    if (!desc)
        return defs;

    /* if data segment, return */
    if (!(desc->b & type_mask))
        return defs;
So you can simplify that into:
    /* A code segment? */
    if (!(desc->b & BIT(11)))
        return defs;
and remove that type_mask thing.
Alternatively, I can do desc->type & BIT(3) to avoid using desc->b, which is less elegant.
Thanks and BR, Ricardo
On Thu, Jun 15, 2017 at 12:24:35PM -0700, Ricardo Neri wrote:
OK. This makes sense. Perhaps I can use a couple of #define's to set and get the address and operand sizes in a single u8. This would make the code more readable.
Sure but don't get too tangled in defines if it is going to be used in one place only. Sometimes a clear comment and the naked bitwise operations are already clear enough.
Alternatively, I can do desc->type & BIT(3) to avoid using desc-b, which is less elegant.
Sure.
Section 2.2.1.3 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod is zero and ModRM.rm is 101b, a 32-bit displacement follows the ModRM byte. This means that none of the registers are used in the computation of the effective address. A return value of -EDOM signals callers that they should not use the value of registers when computing the effective address for the instruction.
In IA-32e 64-bit mode (long mode), the effective address is given by the 32-bit displacement plus the value of RIP of the next instruction. In IA-32e compatibility mode (protected mode), only the displacement is used.
The instruction decoder takes care of obtaining the displacement.
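(A worked example of the computation described above, with arbitrary numbers:)

    /*
     * Illustrative: in 64-bit mode, "mov 0x10(%rip), %rax" encoded at
     * 0x401000 with a total instruction length of 7 bytes references
     * 0x401000 + 7 + 0x10 = 0x401017.
     * In 32-bit protected mode, the same ModRM encoding uses only the
     * 32-bit displacement as the effective address.
     */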
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 693e5a8..4f600de 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -379,6 +379,12 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 	switch (type) {
 	case REG_TYPE_RM:
 		regno = X86_MODRM_RM(insn->modrm.value);
+		/*
+		 * ModRM.mod == 0 and ModRM.rm == 5 means a 32-bit displacement
+		 * follows the ModRM byte.
+		 */
+		if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
+			return -EDOM;
 		if (X86_REX_B(insn->rex_prefix.value))
 			regno += 8;
 		break;
@@ -730,9 +736,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-			if (addr_offset < 0)
+			/*
+			 * -EDOM means that we must ignore the address_offset.
+			 * In such a case, in 64-bit mode the effective address
+			 * relative to the RIP of the following instruction.
+			 */
+			if (addr_offset == -EDOM) {
+				eff_addr = 0;
+				if (user_64bit_mode(regs))
+					eff_addr = (long)regs->ip +
+						   insn->length;
+			} else if (addr_offset < 0) {
 				goto out_err;
-			eff_addr = regs_get_register(regs, addr_offset);
+			} else {
+				eff_addr = regs_get_register(regs, addr_offset);
+			}
 		}
 		eff_addr += insn->displacement.value;
 	}
On Fri, May 05, 2017 at 11:17:12AM -0700, Ricardo Neri wrote:
Section 2.2.1.3 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod is zero and ModRM.rm is 101b, a 32-bit displacement follows the ModRM byte. This means that none of the registers are used in the computation of the effective address. A return value of -EDOM signals callers that they should not use the value of registers when computing the effective address for the instruction.
In IA-32e 64-bit mode (long mode), the effective address is given by the 32-bit displacement plus the value of RIP of the next instruction. In IA-32e compatibility mode (protected mode), only the displacement is used.
The instruction decoder takes care of obtaining the displacement.
...
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 693e5a8..4f600de 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -379,6 +379,12 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 	switch (type) {
 	case REG_TYPE_RM:
 		regno = X86_MODRM_RM(insn->modrm.value);
<---- newline here.
+		/*
+		 * ModRM.mod == 0 and ModRM.rm == 5 means a 32-bit displacement
+		 * follows the ModRM byte.
+		 */
+		if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
+			return -EDOM;
 		if (X86_REX_B(insn->rex_prefix.value))
 			regno += 8;
 		break;
@@ -730,9 +736,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-			if (addr_offset < 0)
ditto.
+			/*
+			 * -EDOM means that we must ignore the address_offset.
+			 * In such a case, in 64-bit mode the effective address
+			 * relative to the RIP of the following instruction.
+			 */
+			if (addr_offset == -EDOM) {
+				eff_addr = 0;
+				if (user_64bit_mode(regs))
+					eff_addr = (long)regs->ip +
+						   insn->length;
Let that line stick out and write it balanced:
    if (addr_offset == -EDOM) {
        if (user_64bit_mode(regs))
            eff_addr = (long)regs->ip + insn->length;
        else
            eff_addr = 0;
should be easier parseable this way.
On Wed, 2017-06-07 at 15:15 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:12AM -0700, Ricardo Neri wrote:
Section 2.2.1.3 of the Intel 64 and IA-32 Architectures Software Developer's Manual volume 2A states that when ModRM.mod is zero and ModRM.rm is 101b, a 32-bit displacement follows the ModRM byte. This means that none of the registers are used in the computation of the effective address. A return value of -EDOM signals callers that they should not use the value of registers when computing the effective address for the instruction.
In IA-32e 64-bit mode (long mode), the effective address is given by the 32-bit displacement plus the value of RIP of the next instruction. In IA-32e compatibility mode (protected mode), only the displacement is used.
The instruction decoder takes care of obtaining the displacement.
...
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 693e5a8..4f600de 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -379,6 +379,12 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 	switch (type) {
 	case REG_TYPE_RM:
 		regno = X86_MODRM_RM(insn->modrm.value);
<---- newline here.
Will add the new line.
+		/*
+		 * ModRM.mod == 0 and ModRM.rm == 5 means a 32-bit displacement
+		 * follows the ModRM byte.
+		 */
+		if (!X86_MODRM_MOD(insn->modrm.value) && regno == 5)
+			return -EDOM;
 		if (X86_REX_B(insn->rex_prefix.value))
 			regno += 8;
 		break;
@@ -730,9 +736,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
-			if (addr_offset < 0)
ditto.
Will add the new line.
+			/*
+			 * -EDOM means that we must ignore the address_offset.
+			 * In such a case, in 64-bit mode the effective address
+			 * relative to the RIP of the following instruction.
+			 */
+			if (addr_offset == -EDOM) {
+				eff_addr = 0;
+				if (user_64bit_mode(regs))
+					eff_addr = (long)regs->ip +
+						   insn->length;
Let that line stick out and write it balanced:
    if (addr_offset == -EDOM) {
        if (user_64bit_mode(regs))
            eff_addr = (long)regs->ip + insn->length;
        else
            eff_addr = 0;
should be easier parseable this way.
Will rewrite as you suggest.
Thanks and BR, Ricardo
insn_get_addr_ref() returns the effective address as defined in section 3.7.5.1, Vol. 1 of the Intel 64 and IA-32 Architectures Software Developer's Manual. In order to compute the linear address, we must add to the effective address the segment base address as set in the segment descriptor. Furthermore, the segment descriptor to use depends on the register that is used as the base of the effective address. That register varies depending on whether the operand is a register or a memory address and on whether a SiB byte is used.
In most cases, the segment base address will be 0 if the USER_DS/USER32_DS segment is used or if segmentation is not used. However, the base address is not necessarily zero if a user program defines its own segments. This is possible by using a local descriptor table.
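(For illustration, a user-space sketch of how such a segment could be installed; values arbitrary, error handling omitted:)

    #include <asm/ldt.h>        /* struct user_desc */
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static void install_ldt_entry(void)
    {
        struct user_desc desc;

        memset(&desc, 0, sizeof(desc));
        desc.entry_number = 0;
        desc.base_addr = 0x100000;    /* a non-zero segment base */
        desc.limit = 0xfffff;
        desc.seg_32bit = 1;
        desc.limit_in_pages = 1;
        desc.useable = 1;

        /* func == 1: write an LDT entry */
        syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));

        /* The matching selector is (0 << 3) | 4 | 3 = 0x07 (LDT, RPL 3). */
    }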
Since the effective address is a signed quantity, the unsigned segment base address is saved in a separate variable and added to the final effective address.
Before returning the linear address, we check if the computed effective address is within the segment limit. In long mode, segment limits are not enforced. We can keep the check, as get_seg_limit() returns -1L in this case.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 4f600de..1a5f5a6 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -695,7 +695,7 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
  */
 void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
-	unsigned long linear_addr;
+	unsigned long linear_addr, seg_base_addr, seg_limit;
 	long eff_addr, base, indx;
 	int addr_offset, base_offset, indx_offset;
 	insn_byte_t sib;
@@ -709,6 +709,10 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		if (addr_offset < 0)
 			goto out_err;
 		eff_addr = regs_get_register(regs, addr_offset);
+		seg_base_addr = insn_get_seg_base(regs, insn, addr_offset);
+		if (seg_base_addr == -1L)
+			goto out_err;
+		seg_limit = get_seg_limit(regs, insn, addr_offset);
 	} else {
 		if (insn->sib.nbytes) {
 			/*
@@ -734,6 +738,11 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 				indx = regs_get_register(regs, indx_offset);
 
 			eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
+			seg_base_addr = insn_get_seg_base(regs, insn,
+							  base_offset);
+			if (seg_base_addr == -1L)
+				goto out_err;
+			seg_limit = get_seg_limit(regs, insn, base_offset);
 		} else {
 			addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
 			/*
@@ -751,10 +760,25 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 			} else {
 				eff_addr = regs_get_register(regs, addr_offset);
 			}
+			seg_base_addr = insn_get_seg_base(regs, insn,
+							  addr_offset);
+			if (seg_base_addr == -1L)
+				goto out_err;
+			seg_limit = get_seg_limit(regs, insn, addr_offset);
 		}
 		eff_addr += insn->displacement.value;
 	}
+
+	linear_addr = (unsigned long)eff_addr;
+	/*
+	 * Make sure the effective address is within the limits of the
+	 * segment. In long mode, the limit is -1L. Thus, the second part
+	 * of the check always succeeds.
+	 */
+	if (linear_addr > seg_limit)
+		goto out_err;
+
+	linear_addr += seg_base_addr;
 
 	return (void __user *)linear_addr;
 out_err:
The 32-bit and 64-bit address encodings are identical. This means that we can use the same function in both cases. In order to reuse the function for 32-bit address encodings, we must sign-extend our 32-bit signed operands to 64-bit signed variables (only for 64-bit builds). To decide on whether sign extension is needed, we rely on the address size as given by the instruction structure.
Once the effective address has been computed, a special verification is needed for 32-bit processes. If running on a 64-bit kernel, such processes can address up to 4GB of memory. Hence, for instance, an effective address of 0xffff1234 would be misinterpreted as 0xffffffffffff1234 due to the sign extension mentioned above. For this reason, the 4 must be truncated to obtain the true effective address.
Lastly, before computing the linear address, we verify that the effective address is within the limits of the segment. The check is kept for long mode because in such a case the limit is set to -1L. This is the largest unsigned number possible. This is equivalent to a limit-less segment.
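(A worked example of the sign extension and truncation described above:)

    /* Illustrative, on a 64-bit kernel: */
    unsigned long val = 0xffff1234;            /* value from a 32-bit register */
    long ext = (long)(int)val;                 /* 0xffffffffffff1234           */
    unsigned long addr = (unsigned long)ext & 0xffffffff;  /* back to 0xffff1234 */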
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 99 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 88 insertions(+), 11 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 1a5f5a6..c7c1239 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -688,6 +688,62 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
 	return get_reg_offset(insn, regs, REG_TYPE_RM);
 }
 
+/**
+ * _to_signed_long() - Cast an unsigned long into signed long
+ * @val		A 32-bit or 64-bit unsigned long
+ * @long_bytes	The number of bytes used to represent a long number
+ * @out		The casted signed long
+ *
+ * Return: A signed long of either 32 or 64 bits, as per the build
+ * configuration of the kernel.
+ */
+static int _to_signed_long(unsigned long val, int long_bytes, long *out)
+{
+	if (!out)
+		return -EINVAL;
+
+#ifdef CONFIG_X86_64
+	if (long_bytes == 4) {
+		/* higher bytes should all be zero */
+		if (val & ~0xffffffff)
+			return -EINVAL;
+
+		/* sign-extend to a 64-bit long */
+		*out = (long)((int)(val));
+		return 0;
+	} else if (long_bytes == 8) {
+		*out = (long)val;
+		return 0;
+	} else {
+		return -EINVAL;
+	}
+#else
+	*out = (long)val;
+	return 0;
+#endif
+}
+
+/**
+ * get_mem_offset() - Obtain the memory offset indicated in operand register
+ * @regs	Structure with register values as seen when entering kernel mode
+ * @reg_offset	Offset from the base of pt_regs of the operand register
+ * @addr_size	Address size of the code segment in use
+ *
+ * Obtain the offset (a signed number with size as specified in addr_size)
+ * indicated in the register used for register-indirect memory addressing.
+ *
+ * Return: A memory offset to be used in the computation of effective address.
+ */
+long get_mem_offset(struct pt_regs *regs, int reg_offset, int addr_size)
+{
+	int ret;
+	long offset = -1L;
+	unsigned long uoffset = regs_get_register(regs, reg_offset);
+
+	ret = _to_signed_long(uoffset, addr_size, &offset);
+	if (ret)
+		return -1L;
+	return offset;
+}
+
 /*
  * return the address being referenced be instruction
  * for rm=3 returning the content of the rm reg
@@ -697,18 +753,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
 	unsigned long linear_addr, seg_base_addr, seg_limit;
 	long eff_addr, base, indx;
-	int addr_offset, base_offset, indx_offset;
+	int addr_offset, base_offset, indx_offset, addr_bytes;
 	insn_byte_t sib;
 
 	insn_get_modrm(insn);
 	insn_get_sib(insn);
 	sib = insn->sib.value;
+	addr_bytes = insn->addr_bytes;
 
 	if (X86_MODRM_MOD(insn->modrm.value) == 3) {
 		addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
 		if (addr_offset < 0)
 			goto out_err;
-		eff_addr = regs_get_register(regs, addr_offset);
+		eff_addr = get_mem_offset(regs, addr_offset, addr_bytes);
+		if (eff_addr == -1L)
+			goto out_err;
 		seg_base_addr = insn_get_seg_base(regs, insn, addr_offset);
 		if (seg_base_addr == -1L)
 			goto out_err;
@@ -722,20 +781,28 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		 * in the address computation.
 		 */
 		base_offset = get_reg_offset(insn, regs, REG_TYPE_BASE);
-		if (base_offset == -EDOM)
+		if (base_offset == -EDOM) {
 			base = 0;
-		else if (base_offset < 0)
+		} else if (base_offset < 0) {
 			goto out_err;
-		else
-			base = regs_get_register(regs, base_offset);
+		} else {
+			base = get_mem_offset(regs, base_offset,
+					      addr_bytes);
+			if (base == -1L)
+				goto out_err;
+		}
 
 		indx_offset = get_reg_offset(insn, regs, REG_TYPE_INDEX);
-		if (indx_offset == -EDOM)
+		if (indx_offset == -EDOM) {
 			indx = 0;
-		else if (indx_offset < 0)
+		} else if (indx_offset < 0) {
 			goto out_err;
-		else
-			indx = regs_get_register(regs, indx_offset);
+		} else {
+			indx = get_mem_offset(regs, indx_offset,
+					      addr_bytes);
+			if (indx == -1L)
+				goto out_err;
+		}
 
 		eff_addr = base + indx * (1 << X86_SIB_SCALE(sib));
 		seg_base_addr = insn_get_seg_base(regs, insn,
@@ -758,7 +825,10 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		} else if (addr_offset < 0) {
 			goto out_err;
 		} else {
-			eff_addr = regs_get_register(regs, addr_offset);
+			eff_addr = get_mem_offset(regs, addr_offset,
+						  addr_bytes);
+			if (eff_addr == -1L)
+				goto out_err;
 		}
 		seg_base_addr = insn_get_seg_base(regs, insn,
 						  addr_offset);
@@ -771,6 +841,13 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 
 	linear_addr = (unsigned long)eff_addr;
 	/*
+	 * If address size is 32-bit, truncate the 4 most significant bytes.
+	 * This is to avoid phony negative offsets.
+	 */
+	if (addr_bytes == 4)
+		linear_addr &= 0xffffffff;
+
+	/*
 	 * Make sure the effective address is within the limits of the
 	 * segment. In long mode, the limit is -1L. Thus, the second part
 	 * of the check always succeeds.
On Fri, May 05, 2017 at 11:17:14AM -0700, Ricardo Neri wrote:
The 32-bit and 64-bit address encodings are identical. This means that we can use the same function in both cases. In order to reuse the function for 32-bit address encodings, we must sign-extend our 32-bit signed operands to 64-bit signed variables (only for 64-bit builds). To decide on whether sign extension is needed, we rely on the address size as given by the instruction structure.
Once the effective address has been computed, a special verification is needed for 32-bit processes. If running on a 64-bit kernel, such processes can address up to 4GB of memory. Hence, for instance, an effective address of 0xffff1234 would be misinterpreted as 0xffffffffffff1234 due to the sign extension mentioned above. For this reason, the 4 must be
Which 4?
truncated to obtain the true effective address.
Lastly, before computing the linear address, we verify that the effective address is within the limits of the segment. The check is kept for long mode because in such a case the limit is set to -1L. This is the largest unsigned number possible. This is equivalent to a limit-less segment.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 99 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 88 insertions(+), 11 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 1a5f5a6..c7c1239 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -688,6 +688,62 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) return get_reg_offset(insn, regs, REG_TYPE_RM); }
+/**
+ * _to_signed_long() - Cast an unsigned long into signed long
+ * @val		A 32-bit or 64-bit unsigned long
+ * @long_bytes	The number of bytes used to represent a long number
+ * @out		The casted signed long
+ *
+ * Return: A signed long of either 32 or 64 bits, as per the build
+ * configuration of the kernel.
+ */
+static int _to_signed_long(unsigned long val, int long_bytes, long *out)
+{
+	if (!out)
+		return -EINVAL;
+
+#ifdef CONFIG_X86_64
+	if (long_bytes == 4) {
+		/* higher bytes should all be zero */
+		if (val & ~0xffffffff)
+			return -EINVAL;
+
+		/* sign-extend to a 64-bit long */
So this is a 32-bit userspace on a 64-bit kernel, right?
If so, how can a memory offset be > 32-bits and we have to extend it to a 64-bit long?!?
I *think* you want to say that you want to convert it to long so that you can do the calculation in longs.
However!
If you're a 64-bit kernel running a 32-bit userspace, you need to do the calculation in 32-bits only so that it overflows, as it would do on 32-bit hardware. IOW, the clamping to 32-bits at the end is not something you wanna do but actually let it wrap if it overflows.
Or am I missing something?
I am sorry Boris, while working on this series I missed a few of your feedback comments.
On Wed, 2017-06-07 at 17:48 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:14AM -0700, Ricardo Neri wrote:
The 32-bit and 64-bit address encodings are identical. This means that we can use the same function in both cases. In order to reuse the function for 32-bit address encodings, we must sign-extend our 32-bit signed operands to 64-bit signed variables (only for 64-bit builds). To decide on whether sign extension is needed, we rely on the address size as given by the instruction structure.
Once the effective address has been computed, a special verification is needed for 32-bit processes. If running on a 64-bit kernel, such processes can address up to 4GB of memory. Hence, for instance, an effective address of 0xffff1234 would be misinterpreted as 0xffffffffffff1234 due to the sign extension mentioned above. For this reason, the 4 must be
Which 4?
I meant to say the 4 most significant bytes. In this case, the 64-bit address 0xffffffffffff1234 would lie in the kernel memory while 0xffff1234 would correctly be in the user space memory.
truncated to obtain the true effective address.
Lastly, before computing the linear address, we verify that the effective address is within the limits of the segment. The check is kept for long mode because in such a case the limit is set to -1L. This is the largest unsigned number possible. This is equivalent to a limit-less segment.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 99 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 88 insertions(+), 11 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 1a5f5a6..c7c1239 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -688,6 +688,62 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) return get_reg_offset(insn, regs, REG_TYPE_RM); }
+/**
+ * _to_signed_long() - Cast an unsigned long into signed long
+ * @val		A 32-bit or 64-bit unsigned long
+ * @long_bytes	The number of bytes used to represent a long number
+ * @out		The casted signed long
+ *
+ * Return: A signed long of either 32 or 64 bits, as per the build
+ * configuration of the kernel.
+ */
+static int _to_signed_long(unsigned long val, int long_bytes, long *out)
+{
+	if (!out)
+		return -EINVAL;
+
+#ifdef CONFIG_X86_64
+	if (long_bytes == 4) {
+		/* higher bytes should all be zero */
+		if (val & ~0xffffffff)
+			return -EINVAL;
+
+		/* sign-extend to a 64-bit long */
So this is a 32-bit userspace on a 64-bit kernel, right?
Yes.
If so, how can a memory offset be > 32-bits and we have to extend it to a 64-bit long?!?
Yes, perhaps the check above is not needed. I included that check as part of my argument validation. In a 64-bit kernel, this function could be called with a val whose most significant bytes are non-zero.
I *think* you want to say that you want to convert it to long so that you can do the calculation in longs.
That is exactly what I meant. More specifically, I want to convert my 32-bit variables into 64-bit signed longs; this is the reason I need the sign extension.
However!
If you're a 64-bit kernel running a 32-bit userspace, you need to do the calculation in 32-bits only so that it overflows, as it would do on 32-bit hardware. IOW, the clamping to 32-bits at the end is not something you wanna do but actually let it wrap if it overflows.
I have looked into this closely and as far as I can see, the 4 least significant bytes will wrap around when using 64-bit signed numbers as they would when using 32-bit signed numbers. For instance, for two positive numbers we have:
7fff:ffff + 7000:0000 = efff:ffff.
The addition above overflows. When sign-extended to 64-bit numbers we would have:
0000:0000:7fff:ffff + 0000:0000:7000:0000 = 0000:0000:efff:ffff.
The addition above does not overflow. However, the 4 least significant bytes overflow as we expect. We can clamp the 4 most significant bytes.
For a two's complement negative numbers we can have:
ffff:ffff + 8000:0000 = 7fff:ffff with a carry flag.
The addition above overflows.
When sign-extending to 64-bit numbers we would have:
ffff:ffff:ffff:ffff + ffff:ffff:8000:0000 = ffff:ffff:7fff:ffff with a carry flag.
The addition above does not overflow. However, the 4 least significant bytes overflowed and wrapped around as they would when using 32-bit signed numbers.
Or am I missing something?
Now, am I missing something?
Thanks and BR, Ricardo
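[The arithmetic above can be checked with a small user-space program; this is a sketch using the exact values from the examples.]

#include <stdio.h>

int main(void)
{
	/* Two positive 32-bit numbers whose sum exceeds the signed
	 * 32-bit range: 0x7fffffff + 0x70000000. */
	long a = (long)(int)0x7fffffff;
	long b = (long)(int)0x70000000;
	printf("%#lx\n", (unsigned long)(a + b) & 0xffffffff);	/* 0xefffffff */

	/* Two negative numbers: 0xffffffff (-1) + 0x80000000 (INT_MIN).
	 * The 4 least significant bytes wrap around exactly as they
	 * would on 32-bit hardware. */
	long c = (long)(int)0xffffffff;
	long d = (long)(int)0x80000000;
	printf("%#lx\n", (unsigned long)(c + d) & 0xffffffff);	/* 0x7fffffff */
	return 0;
}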
On Tue, Jul 25, 2017 at 04:48:13PM -0700, Ricardo Neri wrote:
I meant to say the 4 most significant bytes. In this case, the 64-bit address 0xffffffffffff1234 would lie in the kernel memory while 0xffff1234 would correctly be in the user space memory.
That explanation is better.
Yes, perhaps the check above is not needed. I included that check as part of my argument validation. In a 64-bit kernel, this function could be called with a val whose most significant bytes are non-zero.
So say that in the comment so that it is obvious *why*.
I have looked into this closely and as far as I can see, the 4 least significant bytes will wrap around when using 64-bit signed numbers as they would when using 32-bit signed numbers. For instance, for two positive numbers we have:
7fff:ffff + 7000:0000 = efff:ffff.
The addition above overflows.
Yes, MSB changes.
When sign-extended to 64-bit numbers we would have:
0000:0000:7fff:ffff + 0000:0000:7000:0000 = 0000:0000:efff:ffff.
The addition above does not overflow. However, the 4 least significant bytes overflow as we expect.
No they don't - you are simply using 64-bit regs:
   0x00005555555546b8 <+8>:	movq   $0x7fffffff,-0x8(%rbp)
   0x00005555555546c0 <+16>:	movq   $0x70000000,-0x10(%rbp)
   0x00005555555546c8 <+24>:	mov    -0x8(%rbp),%rdx
   0x00005555555546cc <+28>:	mov    -0x10(%rbp),%rax
=> 0x00005555555546d0 <+32>:	add    %rdx,%rax

rax            0xefffffff	4026531839
rbx            0x0	0
rcx            0x0	0
rdx            0x7fffffff	2147483647
...
eflags 0x206 [ PF IF ]
(OF flag is not set).
We can clamp the 4 most significant bytes.
For a two's complement negative numbers we can have:
ffff:ffff + 8000:0000 = 7fff:ffff with a carry flag.
The addition above overflows.
Yes.
When sign-extending to 64-bit numbers we would have:
ffff:ffff:ffff:ffff + ffff:ffff:8000:0000 = ffff:ffff:7fff:ffff with a carry flag.
The addition above does not overflow. However, the 4 least significant bytes overflowed and wrapped around as they would when using 32-bit signed numbers.
Right. Ok.
And come to think of it now, I'm wondering, whether it would be better/easier/simpler/more straight-forward, to do the 32-bit operations with 32-bit types and separate 32-bit functions and have the hardware do that for you.
This way you can save yourself all that ugly and possibly error-prone casting back and forth and have the code much more readable too.
Hmmm.
On Thu, 2017-07-27 at 15:26 +0200, Borislav Petkov wrote:
On Tue, Jul 25, 2017 at 04:48:13PM -0700, Ricardo Neri wrote:
I meant to say the 4 most significant bytes. In this case, the 64-bit address 0xffffffffffff1234 would lie in the kernel memory while 0xffff1234 would correctly be in the user space memory.
That explanation is better.
Yes, perhaps the check above is not needed. I included that check as part of my argument validation. In a 64-bit kernel, this function could be called with a val whose most significant bytes are non-zero.
So say that in the comment so that it is obvious *why*.
I have looked into this closely and as far as I can see, the 4 least significant bytes will wrap around when using 64-bit signed numbers as they would when using 32-bit signed numbers. For instance, for two positive numbers we have:
7fff:ffff + 7000:0000 = efff:ffff.
The addition above overflows.
Yes, MSB changes.
When sign-extended to 64-bit numbers we would have:
0000:0000:7fff:ffff + 0000:0000:7000:0000 = 0000:0000:efff:ffff.
The addition above does not overflow. However, the 4 least significant bytes overflow as we expect.
No they don't - you are simply using 64-bit regs:
   0x00005555555546b8 <+8>:	movq   $0x7fffffff,-0x8(%rbp)
   0x00005555555546c0 <+16>:	movq   $0x70000000,-0x10(%rbp)
   0x00005555555546c8 <+24>:	mov    -0x8(%rbp),%rdx
   0x00005555555546cc <+28>:	mov    -0x10(%rbp),%rax
=> 0x00005555555546d0 <+32>:	add    %rdx,%rax

rax            0xefffffff	4026531839
rbx            0x0	0
rcx            0x0	0
rdx            0x7fffffff	2147483647
...
eflags 0x206 [ PF IF ]
(OF flag is not set).
True, I don't have the OF set. However, the 4 least significant bytes wrapped around, which is what I needed.
We can clamp the 4 most significant bytes.
For a two's complement negative numbers we can have:
ffff:ffff + 8000:0000 = 7fff:ffff with a carry flag.
The addition above overflows.
Yes.
When sign-extending to 64-bit numbers we would have:
ffff:ffff:ffff:ffff + ffff:ffff:8000:0000 = ffff:ffff:7fff:ffff with a carry flag.
The addition above does not overflow. However, the 4 least significant bytes overflowed and wrapped around as they would when using 32-bit signed numbers.
Right. Ok.
And come to think of it now, I'm wondering, whether it would be better/easier/simpler/more straight-forward, to do the 32-bit operations with 32-bit types and separate 32-bit functions and have the hardware do that for you.
This way you can save yourself all that ugly and possibly error-prone casting back and forth and have the code much more readable too.
That sounds fair. I had to explain this code a lot and it is probably not worth it. I can definitely use 32-bit variable types for the 32-bit case and drop all these castings.
The 32-bit and 64-bit functions would look identical except for the variables used to compute the effective address. Perhaps I could use a union:
union eff_addr {
#ifdef CONFIG_X86_64
	long addr64;
#endif
	int addr32;
};
And use one or the other based on the address size given by the CS.L and CS.D bits of the segment descriptor, or by address-size override prefixes.
However using the union could be less readable than having two almost identical functions.
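[For illustration, a user-space sketch of the union idea; the names and values are hypothetical, and the CONFIG_X86_64 guard is dropped so it compiles stand-alone.]

#include <stdio.h>

union eff_addr {
	long addr64;
	int addr32;
};

int main(void)
{
	union eff_addr ea;
	int addr_bytes = 4;	/* as insn->addr_bytes would report */

	if (addr_bytes == 4)
		/* 32-bit arithmetic: the sum wraps in hardware. */
		ea.addr32 = (int)(0x7fffffffu + 0x70000000u);
	else
		ea.addr64 = 0x7fffffffL + 0x70000000L;

	printf("%#x\n", (unsigned int)ea.addr32);	/* 0xefffffff */
	return 0;
}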
Thanks and BR, Ricardo
On Thu, Jul 27, 2017 at 07:04:52PM -0700, Ricardo Neri wrote:
However using the union could be less readable than having two almost identical functions.
So having some small duplication for the sake of clarity and readability is much better, if you ask me. And it's not like you're duplicating a lot of code - it is only a handful of functions.
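[A quick user-space illustration of why dedicated 32-bit types remove the need for masking; the values are the ones from the earlier examples.]

#include <stdio.h>

int main(void)
{
	unsigned int base = 0xffffffffu;	/* 32-bit register contents */
	unsigned int indx = 0x80000000u;

	/* With 32-bit types the hardware wraps the sum for us... */
	unsigned int eff32 = base + indx;

	/* ...whereas 64-bit arithmetic needs an explicit mask. */
	unsigned long eff64 = ((unsigned long)base + indx) & 0xffffffff;

	printf("%#x %#lx\n", eff32, eff64);	/* both 0x7fffffff */
	return 0;
}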
On Fri, May 05, 2017 at 11:17:14AM -0700, Ricardo Neri wrote:
@@ -697,18 +753,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
 	unsigned long linear_addr, seg_base_addr, seg_limit;
 	long eff_addr, base, indx;
-	int addr_offset, base_offset, indx_offset;
+	int addr_offset, base_offset, indx_offset, addr_bytes;
 	insn_byte_t sib;

 	insn_get_modrm(insn);
 	insn_get_sib(insn);
 	sib = insn->sib.value;
+	addr_bytes = insn->addr_bytes;

 	if (X86_MODRM_MOD(insn->modrm.value) == 3) {
 		addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
 		if (addr_offset < 0)
 			goto out_err;
-		eff_addr = regs_get_register(regs, addr_offset);
+		eff_addr = get_mem_offset(regs, addr_offset, addr_bytes);
+		if (eff_addr == -1L)
+			goto out_err;
 		seg_base_addr = insn_get_seg_base(regs, insn, addr_offset);
 		if (seg_base_addr == -1L)
 			goto out_err;
This code here is too dense, it needs spacing for better readability.
On Wed, 2017-06-07 at 17:49 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:14AM -0700, Ricardo Neri wrote:
@@ -697,18 +753,21 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 {
 	unsigned long linear_addr, seg_base_addr, seg_limit;
 	long eff_addr, base, indx;
-	int addr_offset, base_offset, indx_offset;
+	int addr_offset, base_offset, indx_offset, addr_bytes;
 	insn_byte_t sib;

 	insn_get_modrm(insn);
 	insn_get_sib(insn);
 	sib = insn->sib.value;
+	addr_bytes = insn->addr_bytes;

 	if (X86_MODRM_MOD(insn->modrm.value) == 3) {
 		addr_offset = get_reg_offset(insn, regs, REG_TYPE_RM);
 		if (addr_offset < 0)
 			goto out_err;
-		eff_addr = regs_get_register(regs, addr_offset);
+		eff_addr = get_mem_offset(regs, addr_offset, addr_bytes);
+		if (eff_addr == -1L)
+			goto out_err;
 		seg_base_addr = insn_get_seg_base(regs, insn, addr_offset);
 		if (seg_base_addr == -1L)
 			goto out_err;
This code here is too dense, it needs spacing for better readability.
I have spaced it out in my upcoming version.
Thanks and BR, Ricardo
It is possible to utilize 32-bit address encodings in virtual-8086 mode via an address-size override instruction prefix. However, the address range is still limited to [0x0000-0xffff]. If an address outside this range is computed, return an error.
Also, linear addresses in virtual-8086 mode are limited to 20 bits. Enforce this limit by truncating the most significant bits of the computed linear address.
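[For reference, a small user-space sketch of the virtual-8086 address math being enforced here; the segment and offset values are illustrative.]

#include <stdio.h>

int main(void)
{
	/* In virtual-8086 mode the linear address is
	 * (segment << 4) + offset. With the largest possible
	 * segment and offset the sum needs 21 bits... */
	unsigned long seg = 0xffff, off = 0xffff;
	unsigned long linear = (seg << 4) + off;

	printf("%#lx\n", linear);		/* 0x10ffef */

	/* ...so it is truncated to 20 bits, mirroring the 1MB
	 * wraparound of the 8086. */
	printf("%#lx\n", linear & 0xfffff);	/* 0xffef */
	return 0;
}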
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 10 ++++++++++ 1 file changed, 10 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index c7c1239..9822061 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -848,6 +848,12 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 		linear_addr &= 0xffffffff;

 	/*
+	 * Even though 32-bit address encodings are allowed in virtual-8086
+	 * mode, the address range is still limited to [0x0000-0xffff].
+	 */
+	if (v8086_mode(regs) && (linear_addr & ~0xffff))
+		goto out_err;
+
+	/*
 	 * Make sure the effective address is within the limits of the
 	 * segment. In long mode, the limit is -1L. Thus, the second part
 	 * of the check always succeeds.
@@ -857,6 +863,10 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)

 	linear_addr += seg_base_addr;

+	/* Limit linear address to 20 bits */
+	if (v8086_mode(regs))
+		linear_addr &= 0xfffff;
+
 	return (void __user *)linear_addr;
 out_err:
 	return (void __user *)-1;
Tasks running in virtual-8086 mode or in protected mode with code segment descriptors that specify 16-bit default address sizes via the D bit will use 16-bit addressing form encodings as described in the Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A Section 2.1.5. 16-bit addressing encodings differ in several ways from the 32-bit/64-bit addressing form encodings: ModRM.rm points to different registers and, in some cases, effective addresses are indicated by the addition of the values of two registers. Also, there is no support for SIB bytes. Thus, a separate function is needed to parse this form of addressing.
A couple of functions are introduced. get_reg_offset_16() obtains the offset from the base of pt_regs of the registers indicated by the ModRM byte of the address encoding. get_addr_ref_16() computes the linear address indicated by the instruction using the values of the registers given by ModRM as well as the base address of the segment.
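[As a rough illustration of the encodings these functions parse, per "Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte" of the SDM; this toy user-space decoder is not part of the patch.]

#include <stdio.h>

/* Register (pairs) selected by ModRM.rm in 16-bit addressing. */
static const char * const base16[8] = {
	"BX+SI", "BX+DI", "BP+SI", "BP+DI", "SI", "DI", "BP", "BX"
};

int main(void)
{
	unsigned char modrm = 0x02;		/* mod = 0, rm = 2 */
	unsigned int mod = (modrm >> 6) & 0x3;
	unsigned int rm = modrm & 0x7;

	if (mod == 0 && rm == 6)
		puts("disp16 (displacement-only addressing)");
	else
		printf("effective address based on %s\n", base16[rm]);
	return 0;
}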
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 9822061..928a662 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -431,6 +431,73 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs,
 }

 /**
+ * get_reg_offset_16 - Obtain offset of register indicated by instruction
+ * @insn:	Instruction structure containing ModRM and SiB bytes
+ * @regs:	Structure with register values as seen when entering kernel mode
+ * @offs1:	Offset of the first operand register
+ * @offs2:	Offset of the second operand register, if applicable.
+ *
+ * Obtain the offset, in pt_regs, of the registers indicated by the ModRM byte
+ * within insn. This function is to be used with 16-bit address encodings. The
+ * offs1 and offs2 will be written with the offset of the two registers
+ * indicated by the instruction. In cases where any of the registers is not
+ * referenced by the instruction, the value will be set to -EDOM.
+ *
+ * Return: 0 on success, -EINVAL on failure.
+ */
+static int get_reg_offset_16(struct insn *insn, struct pt_regs *regs,
+			     int *offs1, int *offs2)
+{
+	/* 16-bit addressing can use one or two registers */
+	static const int regoff1[] = {
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, bx),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, bp),
+		offsetof(struct pt_regs, bx),
+	};
+
+	static const int regoff2[] = {
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		offsetof(struct pt_regs, si),
+		offsetof(struct pt_regs, di),
+		-EDOM,
+		-EDOM,
+		-EDOM,
+		-EDOM,
+	};
+
+	if (!offs1 || !offs2)
+		return -EINVAL;
+
+	/* operand is a register, use the generic function */
+	if (X86_MODRM_MOD(insn->modrm.value) == 3) {
+		*offs1 = insn_get_modrm_rm_off(insn, regs);
+		*offs2 = -EDOM;
+		return 0;
+	}
+
+	*offs1 = regoff1[X86_MODRM_RM(insn->modrm.value)];
+	*offs2 = regoff2[X86_MODRM_RM(insn->modrm.value)];
+
+	/*
+	 * If no displacement is indicated in the mod part of the ModRM byte,
+	 * (mod part is 0) and the r/m part of the same byte is 6, no register
+	 * is used to calculate the operand address. An r/m part of 6 means
+	 * that the second register offset is already invalid.
+	 */
+	if ((X86_MODRM_MOD(insn->modrm.value) == 0) &&
+	    (X86_MODRM_RM(insn->modrm.value) == 6))
+		*offs1 = -EDOM;
+
+	return 0;
+}
+
+/**
  * get_desc() - Obtain address of segment descriptor
  * @sel:	Segment selector
  *
@@ -689,6 +756,94 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs)
 }

 /**
+ * get_addr_ref_16() - Obtain the 16-bit address referred by instruction
+ * @insn:	Instruction structure containing ModRM byte and displacement
+ * @regs:	Structure with register values as seen when entering kernel mode
+ *
+ * This function is to be used with 16-bit address encodings. Obtain the memory
+ * address referred by the instruction's ModRM bytes and displacement. Also, the
+ * segment used as base is determined by either any segment override prefixes in
+ * insn or the default segment of the registers involved in the address
+ * computation. In protected mode, segment limits are enforced.
+ *
+ * Return: linear address referenced by instruction and registers on success.
+ * -1L on failure.
+ */
+static void __user *get_addr_ref_16(struct insn *insn, struct pt_regs *regs)
+{
+	unsigned long linear_addr, seg_base_addr, seg_limit;
+	short eff_addr, addr1 = 0, addr2 = 0;
+	int addr_offset1, addr_offset2;
+	int ret;
+
+	insn_get_modrm(insn);
+	insn_get_displacement(insn);
+
+	/*
+	 * If operand is a register, the layout is the same as in
+	 * 32-bit and 64-bit addressing.
+	 */
+	if (X86_MODRM_MOD(insn->modrm.value) == 3) {
+		addr_offset1 = get_reg_offset(insn, regs, REG_TYPE_RM);
+		if (addr_offset1 < 0)
+			goto out_err;
+		eff_addr = regs_get_register(regs, addr_offset1);
+		seg_base_addr = insn_get_seg_base(regs, insn, addr_offset1);
+		if (seg_base_addr == -1L)
+			goto out_err;
+		seg_limit = get_seg_limit(regs, insn, addr_offset1);
+	} else {
+		ret = get_reg_offset_16(insn, regs, &addr_offset1,
+					&addr_offset2);
+		if (ret < 0)
+			goto out_err;
+		/*
+		 * Don't fail on invalid offset values. They might be invalid
+		 * because they cannot be used for this particular value of
+		 * the ModRM. Instead, use them in the computation only if
+		 * they contain a valid value.
+		 */
+		if (addr_offset1 != -EDOM)
+			addr1 = 0xffff & regs_get_register(regs, addr_offset1);
+		if (addr_offset2 != -EDOM)
+			addr2 = 0xffff & regs_get_register(regs, addr_offset2);
+		eff_addr = addr1 + addr2;
+		/*
+		 * The first register in the operand implies the SS or DS
+		 * segment selectors, the second register in the operand can
+		 * only imply DS. Thus, use the first register to obtain
+		 * the segment selector.
+		 */
+		seg_base_addr = insn_get_seg_base(regs, insn, addr_offset1);
+		if (seg_base_addr == -1L)
+			goto out_err;
+		seg_limit = get_seg_limit(regs, insn, addr_offset1);
+
+		eff_addr += (insn->displacement.value & 0xffff);
+	}
+
+	linear_addr = (unsigned long)(eff_addr & 0xffff);
+
+	/*
+	 * Make sure the effective address is within the limits of the
+	 * segment. In long mode, the limit is -1L. Thus, the second part
+	 * of the check always succeeds.
+	 */
+	if (linear_addr > seg_limit)
+		goto out_err;
+
+	linear_addr += seg_base_addr;
+
+	/* Limit linear address to 20 bits */
+	if (v8086_mode(regs))
+		linear_addr &= 0xfffff;
+
+	return (void __user *)linear_addr;
+out_err:
+	return (void __user *)-1;
+}
+
+/**
  * _to_signed_long() - Cast an unsigned long into signed long
  * @val		A 32-bit or 64-bit unsigned long
  * @long_bytes	The number of bytes used to represent a long number
On Fri, May 05, 2017 at 11:17:16AM -0700, Ricardo Neri wrote:
Tasks running in virtual-8086 mode or in protected mode with code segment descriptors that specify 16-bit default address sizes via the D bit will use 16-bit addressing form encodings as described in the Intel 64 and IA-32 Architecture Software Developer's Manual Volume 2A Section 2.1.5. 16-bit addressing encodings differ in several ways from the 32-bit/64-bit addressing form encodings: ModRM.rm points to different registers and, in some cases, effective addresses are indicated by the addition of the value of two registers. Also, there is no support for SIB bytes. Thus, a separate function is needed to parse this form of addressing.
A couple of functions are introduced. get_reg_offset_16() obtains the offset from the base of pt_regs of the registers indicated by the ModRM byte of the address encoding. get_addr_ref_16() computes the linear address indicated by the instructions using the value of the registers given by ModRM as well as the base address of the segment.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 9822061..928a662 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -431,6 +431,73 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, }
/**
- get_reg_offset_16 - Obtain offset of register indicated by instruction
Please end function names with parentheses.
- @insn: Instruction structure containing ModRM and SiB bytes
s/SiB/SIB/g
- @regs: Structure with register values as seen when entering kernel mode
- @offs1: Offset of the first operand register
- @offs2: Offset of the second opeand register, if applicable.
- Obtain the offset, in pt_regs, of the registers indicated by the ModRM byte
- within insn. This function is to be used with 16-bit address encodings. The
- offs1 and offs2 will be written with the offset of the two registers
- indicated by the instruction. In cases where any of the registers is not
- referenced by the instruction, the value will be set to -EDOM.
- Return: 0 on success, -EINVAL on failure.
- */
+static int get_reg_offset_16(struct insn *insn, struct pt_regs *regs,
int *offs1, int *offs2)
+{
- /* 16-bit addressing can use one or two registers */
- static const int regoff1[] = {
offsetof(struct pt_regs, bx),
offsetof(struct pt_regs, bx),
offsetof(struct pt_regs, bp),
offsetof(struct pt_regs, bp),
offsetof(struct pt_regs, si),
offsetof(struct pt_regs, di),
offsetof(struct pt_regs, bp),
offsetof(struct pt_regs, bx),
- };
- static const int regoff2[] = {
offsetof(struct pt_regs, si),
offsetof(struct pt_regs, di),
offsetof(struct pt_regs, si),
offsetof(struct pt_regs, di),
-EDOM,
-EDOM,
-EDOM,
-EDOM,
- };
You mean "Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte" in the SDM, right?
Please add a comment pointing to it here because it is not trivial to map that code to the documentation.
- if (!offs1 || !offs2)
return -EINVAL;
- /* operand is a register, use the generic function */
- if (X86_MODRM_MOD(insn->modrm.value) == 3) {
*offs1 = insn_get_modrm_rm_off(insn, regs);
*offs2 = -EDOM;
return 0;
- }
- *offs1 = regoff1[X86_MODRM_RM(insn->modrm.value)];
- *offs2 = regoff2[X86_MODRM_RM(insn->modrm.value)];
- /*
* If no displacement is indicated in the mod part of the ModRM byte,
s/"no "//
* (mod part is 0) and the r/m part of the same byte is 6, no register
* is used caculate the operand address. An r/m part of 6 means that
* the second register offset is already invalid.
*/
- if ((X86_MODRM_MOD(insn->modrm.value) == 0) &&
(X86_MODRM_RM(insn->modrm.value) == 6))
*offs1 = -EDOM;
- return 0;
+}
+/**
- get_desc() - Obtain address of segment descriptor
- @sel: Segment selector
@@ -689,6 +756,94 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) }
/**
- get_addr_ref_16() - Obtain the 16-bit address referred by instruction
- @insn: Instruction structure containing ModRM byte and displacement
- @regs: Structure with register values as seen when entering kernel mode
- This function is to be used with 16-bit address encodings. Obtain the memory
- address referred by the instruction's ModRM bytes and displacement. Also, the
- segment used as base is determined by either any segment override prefixes in
- insn or the default segment of the registers involved in the address
- computation. In protected mode, segment limits are enforced.
- Return: linear address referenced by instruction and registers on success.
- -1L on failure.
- */
+static void __user *get_addr_ref_16(struct insn *insn, struct pt_regs *regs) +{
- unsigned long linear_addr, seg_base_addr, seg_limit;
- short eff_addr, addr1 = 0, addr2 = 0;
- int addr_offset1, addr_offset2;
- int ret;
- insn_get_modrm(insn);
- insn_get_displacement(insn);
- /*
* If operand is a register, the layout is the same as in
* 32-bit and 64-bit addressing.
*/
- if (X86_MODRM_MOD(insn->modrm.value) == 3) {
addr_offset1 = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset1 < 0)
goto out_err;
<---- newline here.
eff_addr = regs_get_register(regs, addr_offset1);
seg_base_addr = insn_get_seg_base(regs, insn, addr_offset1);
if (seg_base_addr == -1L)
goto out_err;
ditto.
seg_limit = get_seg_limit(regs, insn, addr_offset1);
- } else {
ret = get_reg_offset_16(insn, regs, &addr_offset1,
&addr_offset2);
if (ret < 0)
goto out_err;
ditto.
/*
* Don't fail on invalid offset values. They might be invalid
* because they cannot be used for this particular value of
* the ModRM. Instead, use them in the computation only if
* they contain a valid value.
*/
if (addr_offset1 != -EDOM)
addr1 = 0xffff & regs_get_register(regs, addr_offset1);
if (addr_offset2 != -EDOM)
addr2 = 0xffff & regs_get_register(regs, addr_offset2);
eff_addr = addr1 + addr2;
ditto.
Space those codelines out, we want to be able to read that code again at some point :-)))
/*
* The first register is in the operand implies the SS or DS
* segment selectors, the second register in the operand can
* only imply DS. Thus, use the first register to obtain
* the segment selector.
*/
seg_base_addr = insn_get_seg_base(regs, insn, addr_offset1);
if (seg_base_addr == -1L)
goto out_err;
seg_limit = get_seg_limit(regs, insn, addr_offset1);
eff_addr += (insn->displacement.value & 0xffff);
- }
- linear_addr = (unsigned long)(eff_addr & 0xffff);
- /*
* Make sure the effective address is within the limits of the
* segment. In long mode, the limit is -1L. Thus, the second part
Long mode in a 16-bit handling function?
* of the check always succeeds.
*/
- if (linear_addr > seg_limit)
goto out_err;
- linear_addr += seg_base_addr;
- /* Limit linear address to 20 bits */
- if (v8086_mode(regs))
linear_addr &= 0xfffff;
- return (void __user *)linear_addr;
+out_err:
- return (void __user *)-1;
+}
+/**
- _to_signed_long() - Cast an unsigned long into signed long
- @val A 32-bit or 64-bit unsigned long
- @long_bytes The number of bytes used to represent a long number
-- 2.9.3
On Wed, 2017-06-07 at 18:28 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:16AM -0700, Ricardo Neri wrote:
Tasks running in virtual-8086 mode or in protected mode with code segment descriptors that specify 16-bit default address sizes via the D bit will use 16-bit addressing form encodings as described in the Intel 64 and IA-32 Architecture Software Developer's Manual Volume 2A Section 2.1.5. 16-bit addressing encodings differ in several ways from the 32-bit/64-bit addressing form encodings: ModRM.rm points to different registers and, in some cases, effective addresses are indicated by the addition of the value of two registers. Also, there is no support for SIB bytes. Thus, a separate function is needed to parse this form of addressing.
A couple of functions are introduced. get_reg_offset_16() obtains the offset from the base of pt_regs of the registers indicated by the ModRM byte of the address encoding. get_addr_ref_16() computes the linear address indicated by the instructions using the value of the registers given by ModRM as well as the base address of the segment.
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/lib/insn-eval.c | 155 +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index 9822061..928a662 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -431,6 +431,73 @@ static int get_reg_offset(struct insn *insn, struct pt_regs *regs, }
/**
- get_reg_offset_16 - Obtain offset of register indicated by instruction
Please end function names with parentheses.
I will correct.
- @insn: Instruction structure containing ModRM and SiB bytes
s/SiB/SIB/g
I will correct.
- @regs: Structure with register values as seen when entering kernel mode
- @offs1: Offset of the first operand register
- @offs2: Offset of the second opeand register, if applicable.
- Obtain the offset, in pt_regs, of the registers indicated by the ModRM byte
- within insn. This function is to be used with 16-bit address encodings. The
- offs1 and offs2 will be written with the offset of the two registers
- indicated by the instruction. In cases where any of the registers is not
- referenced by the instruction, the value will be set to -EDOM.
- Return: 0 on success, -EINVAL on failure.
- */
+static int get_reg_offset_16(struct insn *insn, struct pt_regs *regs,
int *offs1, int *offs2)
+{
- /* 16-bit addressing can use one or two registers */
- static const int regoff1[] = {
offsetof(struct pt_regs, bx),
offsetof(struct pt_regs, bx),
offsetof(struct pt_regs, bp),
offsetof(struct pt_regs, bp),
offsetof(struct pt_regs, si),
offsetof(struct pt_regs, di),
offsetof(struct pt_regs, bp),
offsetof(struct pt_regs, bx),
- };
- static const int regoff2[] = {
offsetof(struct pt_regs, si),
offsetof(struct pt_regs, di),
offsetof(struct pt_regs, si),
offsetof(struct pt_regs, di),
-EDOM,
-EDOM,
-EDOM,
-EDOM,
- };
You mean "Table 2-1. 16-Bit Addressing Forms with the ModR/M Byte" in the SDM, right?
Yes.
Please add a comment pointing to it here because it is not trivial to map that code to the documentation.
Sure, I will add a comment pointing to this table.
- if (!offs1 || !offs2)
return -EINVAL;
- /* operand is a register, use the generic function */
- if (X86_MODRM_MOD(insn->modrm.value) == 3) {
*offs1 = insn_get_modrm_rm_off(insn, regs);
*offs2 = -EDOM;
return 0;
- }
- *offs1 = regoff1[X86_MODRM_RM(insn->modrm.value)];
- *offs2 = regoff2[X86_MODRM_RM(insn->modrm.value)];
- /*
* If no displacement is indicated in the mod part of the ModRM byte,
s/"no "//
* (mod part is 0) and the r/m part of the same byte is 6, no register
* is used caculate the operand address. An r/m part of 6 means that
* the second register offset is already invalid.
Perhaps my comment was misleading. When ModRM.mod is 0, no displacement is used except for ModRM.mod = 0 and ModRM.rm 110b. In this case we have displacement-only addressing. I will reword the comment to reflect this fact.
*/
- if ((X86_MODRM_MOD(insn->modrm.value) == 0) &&
(X86_MODRM_RM(insn->modrm.value) == 6))
*offs1 = -EDOM;
- return 0;
+}
+/**
- get_desc() - Obtain address of segment descriptor
- @sel: Segment selector
@@ -689,6 +756,94 @@ int insn_get_modrm_rm_off(struct insn *insn, struct pt_regs *regs) }
/**
- get_addr_ref_16() - Obtain the 16-bit address referred by instruction
- @insn: Instruction structure containing ModRM byte and displacement
- @regs: Structure with register values as seen when entering kernel mode
- This function is to be used with 16-bit address encodings. Obtain the memory
- address referred by the instruction's ModRM bytes and displacement. Also, the
- segment used as base is determined by either any segment override prefixes in
- insn or the default segment of the registers involved in the address
- computation. In protected mode, segment limits are enforced.
- Return: linear address referenced by instruction and registers on success.
- -1L on failure.
- */
+static void __user *get_addr_ref_16(struct insn *insn, struct pt_regs *regs) +{
- unsigned long linear_addr, seg_base_addr, seg_limit;
- short eff_addr, addr1 = 0, addr2 = 0;
- int addr_offset1, addr_offset2;
- int ret;
- insn_get_modrm(insn);
- insn_get_displacement(insn);
- /*
* If operand is a register, the layout is the same as in
* 32-bit and 64-bit addressing.
*/
- if (X86_MODRM_MOD(insn->modrm.value) == 3) {
addr_offset1 = get_reg_offset(insn, regs, REG_TYPE_RM);
if (addr_offset1 < 0)
goto out_err;
<---- newline here.
Will add newline.
eff_addr = regs_get_register(regs, addr_offset1);
seg_base_addr = insn_get_seg_base(regs, insn, addr_offset1);
if (seg_base_addr == -1L)
goto out_err;
ditto.
Will add newline.
seg_limit = get_seg_limit(regs, insn, addr_offset1);
- } else {
ret = get_reg_offset_16(insn, regs, &addr_offset1,
&addr_offset2);
if (ret < 0)
goto out_err;
ditto.
Will add newline.
/*
* Don't fail on invalid offset values. They might be invalid
* because they cannot be used for this particular value of
* the ModRM. Instead, use them in the computation only if
* they contain a valid value.
*/
if (addr_offset1 != -EDOM)
addr1 = 0xffff & regs_get_register(regs, addr_offset1);
if (addr_offset2 != -EDOM)
addr2 = 0xffff & regs_get_register(regs, addr_offset2);
eff_addr = addr1 + addr2;
ditto.
Will add newline.
Space those codelines out, we want to be able to read that code again at some point :-)))
Sure! I have gone through all this code adding newlines as necessary.
/*
* The first register is in the operand implies the SS or DS
* segment selectors, the second register in the operand can
* only imply DS. Thus, use the first register to obtain
* the segment selector.
*/
seg_base_addr = insn_get_seg_base(regs, insn, addr_offset1);
if (seg_base_addr == -1L)
goto out_err;
seg_limit = get_seg_limit(regs, insn, addr_offset1);
eff_addr += (insn->displacement.value & 0xffff);
- }
- linear_addr = (unsigned long)(eff_addr & 0xffff);
- /*
* Make sure the effective address is within the limits of the
* segment. In long mode, the limit is -1L. Thus, the second part
Long mode in a 16-bit handling function?
Yes, this is not correct. However, it is true for virtual-8086 mode. I will update the comment accordingly.
Thanks and BR, Ricardo
Convert the function insn_get_addr_ref() into a wrapper function that calls the correct static address-decoding function depending on the address size. In this way, callers do not need to worry about calling the correct function, and the number of functions that need to be exposed decreases.
To this end, the function insn_get_addr_ref() used to obtain linear addresses from the 32/64-bit encodings is renamed as get_addr_ref_32_64() to reflect the type of address encodings that it handles.
Documentation is added to the new wrapper function and the documentation for the 32/64-bit address decoding function is improved.
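[A hedged sketch of how a caller might use the wrapper; this is kernel-context code and is not compilable stand-alone. The function name, the buffer, and the surrounding fault handling are assumptions, not part of this patch.]

/* Resolve the memory operand of the instruction at the faulting IP.
 * buf is assumed to hold nr_copied instruction bytes already copied
 * from user memory. */
static bool fixup_operand_addr(struct pt_regs *regs, unsigned char *buf,
			       int nr_copied)
{
	struct insn insn;
	void __user *addr;

	insn_init(&insn, buf, nr_copied, user_64bit_mode(regs));
	insn_get_length(&insn);

	/* The wrapper dispatches on insn.addr_bytes: 2 selects the
	 * 16-bit decoder, 4 or 8 the 32/64-bit decoder. */
	addr = insn_get_addr_ref(&insn, regs);
	if (addr == (void __user *)-1)
		return false;	/* could not resolve the operand address */

	/* ... use addr to write the emulated result ... */
	return true;
}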
Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Adam Buchbinder adam.buchbinder@gmail.com Cc: Colin Ian King colin.king@canonical.com Cc: Lorenzo Stoakes lstoakes@gmail.com Cc: Qiaowei Ren qiaowei.ren@intel.com Cc: Arnaldo Carvalho de Melo acme@redhat.com Cc: Masami Hiramatsu mhiramat@kernel.org Cc: Adrian Hunter adrian.hunter@intel.com Cc: Kees Cook keescook@chromium.org Cc: Thomas Garnier thgarnie@google.com Cc: Peter Zijlstra peterz@infradead.org Cc: Borislav Petkov bp@suse.de Cc: Dmitry Vyukov dvyukov@google.com Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: x86@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/lib/insn-eval.c | 48 +++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 43 insertions(+), 5 deletions(-)
diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c
index 928a662..8914884 100644
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -899,12 +899,22 @@ long get_mem_offset(struct pt_regs *regs, int reg_offset, int addr_size)
 		return -1L;
 	return offset;
 }
-/*
- * return the address being referenced be instruction
- * for rm=3 returning the content of the rm reg
- * for rm!=3 calculates the address using SIB and Disp
+
+/**
+ * get_addr_ref_32_64() - Obtain a 32/64-bit linear address
+ * @insn:	Instruction struct with ModRM and SIB bytes and displacement
+ * @regs:	Structure with register values as seen when entering kernel mode
+ *
+ * This function is to be used with 32-bit and 64-bit address encodings to
+ * obtain the effective memory address referred by the instruction's ModRM,
+ * SIB, and displacement bytes, as applicable. Also, the segment base is used
+ * to compute the linear address. In protected mode, segment limits are
+ * enforced.
+ *
+ * Return: linear address referenced by instruction and registers on success.
+ * -1L on failure.
  */
-void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
+static void __user *get_addr_ref_32_64(struct insn *insn, struct pt_regs *regs)
 {
 	unsigned long linear_addr, seg_base_addr, seg_limit;
 	long eff_addr, base, indx;
@@ -1026,3 +1036,31 @@ void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
 out_err:
 	return (void __user *)-1;
 }
+
+/**
+ * insn_get_addr_ref() - Obtain the linear address referred by instruction
+ * @insn:	Instruction structure containing ModRM byte and displacement
+ * @regs:	Structure with register values as seen when entering kernel mode
+ *
+ * Obtain the memory address referred by the instruction's ModRM bytes and
+ * displacement. Also, the segment used as base is determined by either any
+ * segment override prefixes in insn or the default segment of the registers
+ * involved in the address computation. In protected mode, segment limits
+ * are enforced.
+ *
+ * Return: linear address referenced by instruction and registers on success.
+ * -1L on failure.
+ */
+void __user *insn_get_addr_ref(struct insn *insn, struct pt_regs *regs)
+{
+	switch (insn->addr_bytes) {
+	case 2:
+		return get_addr_ref_16(insn, regs);
+	case 4:
+		/* fall through */
+	case 8:
+		return get_addr_ref_32_64(insn, regs);
+	default:
+		return (void __user *)-1;
+	}
+}
User-Mode Instruction Prevention is a security feature present in new Intel processors that, when enabled, prevents the execution of a subset of instructions if such instructions are executed in user mode (CPL > 0). Attempting to execute such instructions causes a general protection exception.
The subset of instructions comprises:
* SGDT - Store Global Descriptor Table * SIDT - Store Interrupt Descriptor Table * SLDT - Store Local Descriptor Table * SMSW - Store Machine Status Word * STR - Store Task Register
This feature is also added to the list of disabled-features to allow a cleaner handling of build-time configuration.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 +++++++- arch/x86/include/uapi/asm/processor-flags.h | 2 ++ 3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2701e5f..f1d61d2 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -289,6 +289,7 @@

 /* Intel-defined CPU features, CPUID level 0x00000007:0 (ecx), word 16 */
 #define X86_FEATURE_AVX512VBMI	(16*32+ 1) /* AVX512 Vector Bit Manipulation instructions*/
+#define X86_FEATURE_UMIP	(16*32+ 2) /* User Mode Instruction Protection */
 #define X86_FEATURE_PKU		(16*32+ 3) /* Protection Keys for Userspace */
 #define X86_FEATURE_OSPKE	(16*32+ 4) /* OS Protection Keys Enable */
 #define X86_FEATURE_AVX512_VPOPCNTDQ (16*32+14) /* POPCNT for vectors of DW/QW */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775..7adaef7 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -16,6 +16,12 @@
 # define DISABLE_MPX	(1<<(X86_FEATURE_MPX & 31))
 #endif

+#ifdef CONFIG_X86_INTEL_UMIP
+# define DISABLE_UMIP	0
+#else
+# define DISABLE_UMIP	(1<<(X86_FEATURE_UMIP & 31))
+#endif
+
 #ifdef CONFIG_X86_64
 # define DISABLE_VME		(1<<(X86_FEATURE_VME & 31))
 # define DISABLE_K6_MTRR	(1<<(X86_FEATURE_K6_MTRR & 31))
@@ -61,7 +67,7 @@
 #define DISABLED_MASK13	0
 #define DISABLED_MASK14	0
 #define DISABLED_MASK15	0
-#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57)
+#define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP)
 #define DISABLED_MASK17	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 18)

diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 567de50..d2c2af8 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -104,6 +104,8 @@
 #define X86_CR4_OSFXSR		_BITUL(X86_CR4_OSFXSR_BIT)
 #define X86_CR4_OSXMMEXCPT_BIT	10 /* enable unmasked SSE exceptions */
 #define X86_CR4_OSXMMEXCPT	_BITUL(X86_CR4_OSXMMEXCPT_BIT)
+#define X86_CR4_UMIP_BIT	11 /* enable UMIP support */
+#define X86_CR4_UMIP		_BITUL(X86_CR4_UMIP_BIT)
 #define X86_CR4_VMXE_BIT	13 /* enable VMX virtualization */
 #define X86_CR4_VMXE		_BITUL(X86_CR4_VMXE_BIT)
 #define X86_CR4_SMXE_BIT	14 /* enable safer mode (TXT) */
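[With these definitions in place, a later patch in the series can enable the feature at boot. A hedged sketch, kernel context, of what that might look like; the function name is illustrative and the exact logic of the series may differ.]

/* Enable UMIP if the CPU supports it and the kernel was built with
 * CONFIG_X86_INTEL_UMIP. Otherwise DISABLE_UMIP masks the bit off and
 * cpu_feature_enabled() compiles to false. */
static void setup_umip(struct cpuinfo_x86 *c)
{
	if (cpu_feature_enabled(X86_FEATURE_UMIP) &&
	    cpu_has(c, X86_FEATURE_UMIP))
		cr4_set_bits(X86_CR4_UMIP);
	else
		clear_cpu_cap(c, X86_FEATURE_UMIP);
}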
On 05/05/2017 20:17, Ricardo Neri wrote:
User-Mode Instruction Prevention is a security feature present in new Intel processors that, when set, prevents the execution of a subset of instructions if such instructions are executed in user mode (CPL > 0). Attempting to execute such instructions causes a general protection exception.
The subset of instructions comprises:
- SGDT - Store Global Descriptor Table
- SIDT - Store Interrupt Descriptor Table
- SLDT - Store Local Descriptor Table
- SMSW - Store Machine Status Word
- STR - Store Task Register
This feature is also added to the list of disabled-features to allow a cleaner handling of build-time configuration.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
Would it be possible to have this patch in a topic branch for KVM's consumption?
Thanks,
Paolo
On Sat, 2017-05-06 at 11:04 +0200, Paolo Bonzini wrote:
On 05/05/2017 20:17, Ricardo Neri wrote:
User-Mode Instruction Prevention is a security feature present in
new
Intel processors that, when set, prevents the execution of a subset
of
instructions if such instructions are executed in user mode (CPL >
0).
Attempting to execute such instructions causes a general protection exception.
The subset of instructions comprises:
- SGDT - Store Global Descriptor Table
- SIDT - Store Interrupt Descriptor Table
- SLDT - Store Local Descriptor Table
- SMSW - Store Machine Status Word
- STR - Store Task Register
This feature is also added to the list of disabled-features to allow a cleaner handling of build-time configuration.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
Would it be possible to have this patch in a topic branch for KVM's consumption?
I have put a branch here with this single patch:
https://github.com/ricardon/tip.git rneri/umip_for_kvm
This is based on Linux v4.11. Please let me know if this works for you or if you'd prefer it to be based on a different branch/commit/repo.
Thanks and BR, Ricardo
On Fri, May 05, 2017 at 11:17:18AM -0700, Ricardo Neri wrote:
User-Mode Instruction Prevention is a security feature present in new Intel processors that, when set, prevents the execution of a subset of instructions if such instructions are executed in user mode (CPL > 0). Attempting to execute such instructions causes a general protection exception.
The subset of instructions comprises:
- SGDT - Store Global Descriptor Table
- SIDT - Store Interrupt Descriptor Table
- SLDT - Store Local Descriptor Table
- SMSW - Store Machine Status Word
- STR - Store Task Register
This feature is also added to the list of disabled-features to allow a cleaner handling of build-time configuration.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/include/asm/disabled-features.h | 8 +++++++- arch/x86/include/uapi/asm/processor-flags.h | 2 ++ 3 files changed, 10 insertions(+), 1 deletion(-)
Reviewed-by: Borislav Petkov bp@suse.de
The feature User-Mode Instruction Prevention present in recent Intel processors prevents a group of instructions from being executed with CPL > 0. Otherwise, a general protection fault is issued.
Rather than relaying this fault to the user space (in the form of a SIGSEGV signal), the instructions protected by UMIP can be emulated to provide dummy results. This allows us to preserve the current kernel behavior and not reveal the system resources that UMIP intends to protect (the global descriptor and interrupt descriptor tables, the segment selectors of the local descriptor table and the task state, and the machine status word).
This emulation is needed because certain applications (e.g., WineHQ and DOSEMU2) rely on this subset of instructions to function.
The instructions protected by UMIP can be split in two groups. Those who return a kernel memory address (sgdt and sidt) and those who return a value (sldt, str and smsw).
For the instructions that return a kernel memory address, applications such as WineHQ rely on the result being located in the kernel memory space. The result is emulated as a hard-coded value that lies close to the top of the kernel memory. The limits for the GDT and the IDT are set to zero.
Given that sldt and str are not used in common in programs supported by WineHQ and DOSEMU2, they are not emulated.
The instruction smsw is emulated to return the value that the register CR0 has at boot time, as set in head_32.S.
Care is taken to appropriately emulate the results when segmentation is used. This is, rather than relying on USER_DS and USER_CS, the function insn_get_addr_ref() inspects the segment descriptor pointed by the registers in pt_regs. This ensures that we correctly obtain the segment base address and the address and operand sizes even if the user space application uses local descriptor table.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/include/asm/umip.h | 15 +++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/umip.c | 245 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 261 insertions(+) create mode 100644 arch/x86/include/asm/umip.h create mode 100644 arch/x86/kernel/umip.c
diff --git a/arch/x86/include/asm/umip.h b/arch/x86/include/asm/umip.h new file mode 100644 index 0000000..077b236 --- /dev/null +++ b/arch/x86/include/asm/umip.h @@ -0,0 +1,15 @@ +#ifndef _ASM_X86_UMIP_H +#define _ASM_X86_UMIP_H + +#include <linux/types.h> +#include <asm/ptrace.h> + +#ifdef CONFIG_X86_INTEL_UMIP +bool fixup_umip_exception(struct pt_regs *regs); +#else +static inline bool fixup_umip_exception(struct pt_regs *regs) +{ + return false; +} +#endif /* CONFIG_X86_INTEL_UMIP */ +#endif /* _ASM_X86_UMIP_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 4b99423..cc1b7cc 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -123,6 +123,7 @@ obj-$(CONFIG_EFI) += sysfb_efi.o obj-$(CONFIG_PERF_EVENTS) += perf_regs.o obj-$(CONFIG_TRACING) += tracepoint.o obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o +obj-$(CONFIG_X86_INTEL_UMIP) += umip.o
ifdef CONFIG_FRAME_POINTER obj-y += unwind_frame.o diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c new file mode 100644 index 0000000..c7c5795 --- /dev/null +++ b/arch/x86/kernel/umip.c @@ -0,0 +1,245 @@ +/* + * umip.c Emulation for instruction protected by the Intel User-Mode + * Instruction Prevention. The instructions are: + * sgdt + * sldt + * sidt + * str + * smsw + * + * Copyright (c) 2017, Intel Corporation. + * Ricardo Neri ricardo.neri@linux.intel.com + */ + +#include <linux/uaccess.h> +#include <asm/umip.h> +#include <asm/traps.h> +#include <asm/insn.h> +#include <asm/insn-eval.h> +#include <linux/ratelimit.h> + +/* + * == Base addresses of GDT and IDT + * Some applications to function rely finding the global descriptor table (GDT) + * and the interrupt descriptor table (IDT) in kernel memory. + * For x86_32, the selected values do not match any particular hole, but it + * suffices to provide a memory location within kernel memory. + * + * == CRO flags for SMSW + * Use the flags given when booting, as found in head_32.S + */ + +#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | \ + X86_CR0_WP | X86_CR0_AM) +#define UMIP_DUMMY_GDT_BASE 0xfffe0000 +#define UMIP_DUMMY_IDT_BASE 0xffff0000 + +enum umip_insn { + UMIP_SGDT = 0, /* opcode 0f 01 ModR/M reg 0 */ + UMIP_SIDT, /* opcode 0f 01 ModR/M reg 1 */ + UMIP_SLDT, /* opcode 0f 00 ModR/M reg 0 */ + UMIP_SMSW, /* opcode 0f 01 ModR/M reg 4 */ + UMIP_STR, /* opcode 0f 00 ModR/M reg 1 */ +}; + +/** + * __identify_insn() - Identify a UMIP-protected instruction + * @insn: Instruction structure with opcode and ModRM byte. + * + * From the instruction opcode and the reg part of the ModRM byte, identify, + * if any, a UMIP-protected instruction. + * + * Return: an enumeration of a UMIP-protected instruction; -EINVAL on failure. + */ +static int __identify_insn(struct insn *insn) +{ + /* By getting modrm we also get the opcode. */ + insn_get_modrm(insn); + + /* All the instructions of interest start with 0x0f. */ + if (insn->opcode.bytes[0] != 0xf) + return -EINVAL; + + if (insn->opcode.bytes[1] == 0x1) { + switch (X86_MODRM_REG(insn->modrm.value)) { + case 0: + return UMIP_SGDT; + case 1: + return UMIP_SIDT; + case 4: + return UMIP_SMSW; + default: + return -EINVAL; + } + } + /* SLDT AND STR are not emulated */ + return -EINVAL; +} + +/** + * __emulate_umip_insn() - Emulate UMIP instructions with dummy values + * @insn: Instruction structure with ModRM byte + * @umip_inst: Instruction to emulate + * @data: Buffer onto which the dummy values will be copied + * @data_size: Size of the emulated result + * + * Emulate an instruction protected by UMIP. The result of the emulation + * is saved in the provided buffer. The size of the results depends on both + * the instruction and type of operand (register vs memory address). Thus, + * the size of the result needs to be updated. + * + * Result: 0 if success, -EINVAL on failure to emulate + */ +static int __emulate_umip_insn(struct insn *insn, enum umip_insn umip_inst, + unsigned char *data, int *data_size) +{ + unsigned long dummy_base_addr; + unsigned short dummy_limit = 0; + unsigned int dummy_value = 0; + + switch (umip_inst) { + /* + * These two instructions return the base address and limit of the + * global and interrupt descriptor table. The base address can be + * 24-bit, 32-bit or 64-bit. Limit is always 16-bit. If the operand + * size is 16-bit the returned value of the base address is supposed + * to be a zero-extended 24-byte number. 
However, it seems that a + * 32-byte number is always returned in legacy protected mode + * irrespective of the operand size. + */ + case UMIP_SGDT: + /* fall through */ + case UMIP_SIDT: + if (umip_inst == UMIP_SGDT) + dummy_base_addr = UMIP_DUMMY_GDT_BASE; + else + dummy_base_addr = UMIP_DUMMY_IDT_BASE; + if (X86_MODRM_MOD(insn->modrm.value) == 3) { + /* SGDT and SIDT do not take register as argument. */ + return -EINVAL; + } + + memcpy(data + 2, &dummy_base_addr, sizeof(dummy_base_addr)); + memcpy(data, &dummy_limit, sizeof(dummy_limit)); + *data_size = sizeof(dummy_base_addr) + sizeof(dummy_limit); + break; + case UMIP_SMSW: + /* + * Even though CR0_STATE contain 4 bytes, the number + * of bytes to be copied in the result buffer is determined + * by whether the operand is a register or a memory location. + */ + dummy_value = CR0_STATE; + /* + * These two instructions return a 16-bit value. We return + * all zeros. This is equivalent to a null descriptor for + * str and sldt. + */ + /* SLDT and STR are not emulated */ + /* fall through */ + case UMIP_SLDT: + /* fall through */ + case UMIP_STR: + /* fall through */ + default: + return -EINVAL; + } + return 0; +} + +/** + * fixup_umip_exception() - Fixup #GP faults caused by UMIP + * @regs: Registers as saved when entering the #GP trap + * + * The instructions sgdt, sidt, str, smsw, sldt cause a general protection + * fault if with CPL > 0 (i.e., from user space). This function can be + * used to emulate the results of the aforementioned instructions with + * dummy values. Results are copied to user-space memory as indicated by + * the instruction pointed by EIP using the registers indicated in the + * instruction operands. This function also takes care of determining + * the address to which the results must be copied. + */ +bool fixup_umip_exception(struct pt_regs *regs) +{ + struct insn insn; + unsigned char buf[MAX_INSN_SIZE]; + /* 10 bytes is the maximum size of the result of UMIP instructions */ + unsigned char dummy_data[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}; + unsigned long seg_base; + int not_copied, nr_copied, reg_offset, dummy_data_size; + void __user *uaddr; + unsigned long *reg_addr; + enum umip_insn umip_inst; + struct insn_code_seg_defaults seg_defs; + + /* + * Use the segment base in case user space used a different code + * segment, either in protected (e.g., from an LDT) or virtual-8086 + * modes. In most of the cases seg_base will be zero as in USER_CS. + */ + seg_base = insn_get_seg_base(regs, &insn, + offsetof(struct pt_regs, ip)); + not_copied = copy_from_user(buf, (void __user *)(seg_base + regs->ip), + sizeof(buf)); + nr_copied = sizeof(buf) - not_copied; + /* + * The copy_from_user above could have failed if user code is protected + * by a memory protection key. Give up on emulation in such a case. + * Should we issue a page fault? + */ + if (!nr_copied) + return false; + + insn_init(&insn, buf, nr_copied, user_64bit_mode(regs)); + + /* + * Override the default operand and address sizes to what is specified + * in the code segment descriptor. The instruction decoder only sets + * the address size it to either 4 or 8 address bytes and does nothing + * for the operand bytes. This OK for most of the cases, but we could + * have special cases where, for instance, a 16-bit code segment + * descriptor is used. + * If there are overrides, the instruction decoder correctly updates + * these values, even for 16-bit defaults. 
+ */ + seg_defs = insn_get_code_seg_defaults(regs); + insn.addr_bytes = seg_defs.address_bytes; + insn.opnd_bytes = seg_defs.operand_bytes; + + if (!insn.addr_bytes || !insn.opnd_bytes) + return false; + + if (user_64bit_mode(regs)) + return false; + + insn_get_length(&insn); + if (nr_copied < insn.length) + return false; + + umip_inst = __identify_insn(&insn); + /* Check if we found an instruction protected by UMIP */ + if (umip_inst < 0) + return false; + + if (__emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size)) + return false; + + /* If operand is a register, write directly to it */ + if (X86_MODRM_MOD(insn.modrm.value) == 3) { + reg_offset = insn_get_modrm_rm_off(&insn, regs); + reg_addr = (unsigned long *)((unsigned long)regs + reg_offset); + memcpy(reg_addr, dummy_data, dummy_data_size); + } else { + uaddr = insn_get_addr_ref(&insn, regs); + /* user address could not be determined, abort emulation */ + if ((unsigned long)uaddr == -1L) + return false; + nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size); + if (nr_copied > 0) + return false; + } + + /* increase IP to let the program keep going */ + regs->ip += insn.length; + return true; +}
On Fri, May 05, 2017 at 11:17:19AM -0700, Ricardo Neri wrote:
The feature User-Mode Instruction Prevention present in recent Intel processors prevents a group of instructions from being executed with CPL > 0. Otherwise, a general protection fault is issued.
This is one of the best opening paragraphs of a commit message I've read this year! This is how you open: short, succinct, to the point, no marketing bullshit. Good!
Rather than relaying this fault to the user space (in the form of a SIGSEGV signal), the instructions protected by UMIP can be emulated to provide dummy results. This allows us to preserve the current kernel behavior and not reveal the system resources that UMIP intends to protect (the global descriptor and interrupt descriptor tables, the segment selectors of the local descriptor table and the task state, and the machine status word).
This emulation is needed because certain applications (e.g., WineHQ and DOSEMU2) rely on this subset of instructions to function.
The instructions protected by UMIP can be split in two groups. Those who
s/who/which/
return a kernel memory address (sgdt and sidt) and those who return a
ditto.
value (sldt, str and smsw).
For the instructions that return a kernel memory address, applications such as WineHQ rely on the result being located in the kernel memory space. The result is emulated as a hard-coded value that lies close to the top of the kernel memory. The limits for the GDT and the IDT are set to zero.
Nice.
Given that sldt and str are not used in common in programs supported by
You wanna say "in common programs" here? Or "not commonly used in programs" ?
WineHQ and DOSEMU2, they are not emulated.
The instruction smsw is emulated to return the value that the register CR0 has at boot time, as set in head_32.S.
Care is taken to appropriately emulate the results when segmentation is used. This is, rather than relying on USER_DS and USER_CS, the function
"That is,... "
insn_get_addr_ref() inspects the segment descriptor pointed by the registers in pt_regs. This ensures that we correctly obtain the segment base address and the address and operand sizes even if the user space application uses local descriptor table.
Btw, I could very well use all that nice explanation in umip.c too so that the high-level behavior is documented.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/umip.h | 15 +++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/umip.c | 245 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 261 insertions(+) create mode 100644 arch/x86/include/asm/umip.h create mode 100644 arch/x86/kernel/umip.c
diff --git a/arch/x86/include/asm/umip.h b/arch/x86/include/asm/umip.h new file mode 100644 index 0000000..077b236 --- /dev/null +++ b/arch/x86/include/asm/umip.h @@ -0,0 +1,15 @@ +#ifndef _ASM_X86_UMIP_H +#define _ASM_X86_UMIP_H
+#include <linux/types.h> +#include <asm/ptrace.h>
+#ifdef CONFIG_X86_INTEL_UMIP +bool fixup_umip_exception(struct pt_regs *regs); +#else +static inline bool fixup_umip_exception(struct pt_regs *regs) +{
- return false;
+}
Let's save some header lines:
static inline bool fixup_umip_exception(struct pt_regs *regs) { return false; }
those trunks take too much space as it is.
+#endif /* CONFIG_X86_INTEL_UMIP */ +#endif /* _ASM_X86_UMIP_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 4b99423..cc1b7cc 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -123,6 +123,7 @@ obj-$(CONFIG_EFI) += sysfb_efi.o obj-$(CONFIG_PERF_EVENTS) += perf_regs.o obj-$(CONFIG_TRACING) += tracepoint.o obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o +obj-$(CONFIG_X86_INTEL_UMIP) += umip.o
ifdef CONFIG_FRAME_POINTER obj-y += unwind_frame.o diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c new file mode 100644 index 0000000..c7c5795 --- /dev/null +++ b/arch/x86/kernel/umip.c @@ -0,0 +1,245 @@ +/*
- umip.c Emulation for instruction protected by the Intel User-Mode
- Instruction Prevention. The instructions are:
- sgdt
- sldt
- sidt
- str
- smsw
- Copyright (c) 2017, Intel Corporation.
- Ricardo Neri ricardo.neri@linux.intel.com
- */
+#include <linux/uaccess.h> +#include <asm/umip.h> +#include <asm/traps.h> +#include <asm/insn.h> +#include <asm/insn-eval.h> +#include <linux/ratelimit.h>
+/*
- == Base addresses of GDT and IDT
- Some applications to function rely finding the global descriptor table (GDT)
That formulation reads funny.
- and the interrupt descriptor table (IDT) in kernel memory.
- For x86_32, the selected values do not match any particular hole, but it
- suffices to provide a memory location within kernel memory.
- == CRO flags for SMSW
- Use the flags given when booting, as found in head_32.S
- */
+#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | \
X86_CR0_WP | X86_CR0_AM)
Why not pull those up in asm/processor-flags.h or so and share the definition instead of duplicating it?
+#define UMIP_DUMMY_GDT_BASE 0xfffe0000 +#define UMIP_DUMMY_IDT_BASE 0xffff0000
+enum umip_insn {
- UMIP_SGDT = 0, /* opcode 0f 01 ModR/M reg 0 */
- UMIP_SIDT, /* opcode 0f 01 ModR/M reg 1 */
- UMIP_SLDT, /* opcode 0f 00 ModR/M reg 0 */
- UMIP_SMSW, /* opcode 0f 01 ModR/M reg 4 */
- UMIP_STR, /* opcode 0f 00 ModR/M reg 1 */
Let's stick to a single spelling: ModRM.reg=0, etc.
Better yet, use the SDM format:
UMIP_SGDT = 0, /* 0F 01 /0 */ UMIP_SIDT, /* 0F 01 /1 */ ...
+};
+/**
- __identify_insn() - Identify a UMIP-protected instruction
- @insn: Instruction structure with opcode and ModRM byte.
- From the instruction opcode and the reg part of the ModRM byte, identify,
- if any, a UMIP-protected instruction.
- Return: an enumeration of a UMIP-protected instruction; -EINVAL on failure.
- */
+static int __identify_insn(struct insn *insn)
static enum umip_insn __identify_insn(...
But frankly, that enum looks pointless to me - it is used locally only and you can just as well use plain ints.
+{
- /* By getting modrm we also get the opcode. */
- insn_get_modrm(insn);
- /* All the instructions of interest start with 0x0f. */
- if (insn->opcode.bytes[0] != 0xf)
return -EINVAL;
- if (insn->opcode.bytes[1] == 0x1) {
switch (X86_MODRM_REG(insn->modrm.value)) {
case 0:
return UMIP_SGDT;
case 1:
return UMIP_SIDT;
case 4:
return UMIP_SMSW;
default:
return -EINVAL;
}
- }
- /* SLDT AND STR are not emulated */
- return -EINVAL;
+}
+/**
- __emulate_umip_insn() - Emulate UMIP instructions with dummy values
- @insn: Instruction structure with ModRM byte
- @umip_inst: Instruction to emulate
- @data: Buffer onto which the dummy values will be copied
- @data_size: Size of the emulated result
- Emulate an instruction protected by UMIP. The result of the emulation
- is saved in the provided buffer. The size of the results depends on both
- the instruction and type of operand (register vs memory address). Thus,
- the size of the result needs to be updated.
- Result: 0 if success, -EINVAL on failure to emulate
- */
+static int __emulate_umip_insn(struct insn *insn, enum umip_insn umip_inst,
unsigned char *data, int *data_size)
+{
- unsigned long dummy_base_addr;
- unsigned short dummy_limit = 0;
- unsigned int dummy_value = 0;
- switch (umip_inst) {
- /*
* These two instructions return the base address and limit of the
* global and interrupt descriptor table. The base address can be
* 24-bit, 32-bit or 64-bit. Limit is always 16-bit. If the operand
* size is 16-bit the returned value of the base address is supposed
* to be a zero-extended 24-byte number. However, it seems that a
* 32-byte number is always returned in legacy protected mode
* irrespective of the operand size.
*/
- case UMIP_SGDT:
/* fall through */
- case UMIP_SIDT:
if (umip_inst == UMIP_SGDT)
dummy_base_addr = UMIP_DUMMY_GDT_BASE;
else
dummy_base_addr = UMIP_DUMMY_IDT_BASE;
if (X86_MODRM_MOD(insn->modrm.value) == 3) {
/* SGDT and SIDT do not take register as argument. */
Comment above the if.
return -EINVAL;
}
So that check needs to go first, then the dummy_base_addr assignment.
memcpy(data + 2, &dummy_base_addr, sizeof(dummy_base_addr));
memcpy(data, &dummy_limit, sizeof(dummy_limit));
*data_size = sizeof(dummy_base_addr) + sizeof(dummy_limit);
Huh, that value will always be the same - why do you have a specific variable? It could be a define, once for 32-bit and once for 64-bit.
break;
- case UMIP_SMSW:
/*
* Even though CR0_STATE contain 4 bytes, the number
* of bytes to be copied in the result buffer is determined
* by whether the operand is a register or a memory location.
*/
dummy_value = CR0_STATE;
Something's wrong here: how does that local, write-only variable have any effect?
/*
* These two instructions return a 16-bit value. We return
* all zeros. This is equivalent to a null descriptor for
* str and sldt.
*/
/* SLDT and STR are not emulated */
/* fall through */
- case UMIP_SLDT:
/* fall through */
- case UMIP_STR:
/* fall through */
- default:
return -EINVAL;
That switch-case has a majority of fall-throughs. So make it an if-else instead.
- }
- return 0;
+}
+/**
- fixup_umip_exception() - Fixup #GP faults caused by UMIP
- @regs: Registers as saved when entering the #GP trap
- The instructions sgdt, sidt, str, smsw, sldt cause a general protection
- fault if with CPL > 0 (i.e., from user space). This function can be
- used to emulate the results of the aforementioned instructions with
- dummy values. Results are copied to user-space memory as indicated by
- the instruction pointed by EIP using the registers indicated in the
- instruction operands. This function also takes care of determining
- the address to which the results must be copied.
- */
+bool fixup_umip_exception(struct pt_regs *regs) +{
- struct insn insn;
- unsigned char buf[MAX_INSN_SIZE];
- /* 10 bytes is the maximum size of the result of UMIP instructions */
- unsigned char dummy_data[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
unsigned char dummy_data[10] = { 0 };
One 0 should be enough :)
- unsigned long seg_base;
- int not_copied, nr_copied, reg_offset, dummy_data_size;
- void __user *uaddr;
- unsigned long *reg_addr;
- enum umip_insn umip_inst;
- struct insn_code_seg_defaults seg_defs;
Please sort function local variables declaration in a reverse christmas tree order:
<type> longest_variable_name; <type> shorter_var_name; <type> even_shorter; <type> i;
- /*
* Use the segment base in case user space used a different code
* segment, either in protected (e.g., from an LDT) or virtual-8086
* modes. In most of the cases seg_base will be zero as in USER_CS.
*/
- seg_base = insn_get_seg_base(regs, &insn,
offsetof(struct pt_regs, ip));
Oh boy, where's the error handling?! That can return -1L.
- not_copied = copy_from_user(buf, (void __user *)(seg_base + regs->ip),
-1L + regs->ip is then your pwnage.
sizeof(buf));
Just let them stick out.
- nr_copied = sizeof(buf) - not_copied;
<---- newline here.
- /*
* The copy_from_user above could have failed if user code is protected
()
* by a memory protection key. Give up on emulation in such a case.
* Should we issue a page fault?
Why? AFAICT, you're in the #GP handler. Simply you return unhandled.
*/
- if (!nr_copied)
return false;
- insn_init(&insn, buf, nr_copied, user_64bit_mode(regs));
- /*
* Override the default operand and address sizes to what is specified
* in the code segment descriptor. The instruction decoder only sets
* the address size it to either 4 or 8 address bytes and does nothing
* for the operand bytes. This OK for most of the cases, but we could
* have special cases where, for instance, a 16-bit code segment
* descriptor is used.
* If there are overrides, the instruction decoder correctly updates
* these values, even for 16-bit defaults.
*/
- seg_defs = insn_get_code_seg_defaults(regs);
- insn.addr_bytes = seg_defs.address_bytes;
- insn.opnd_bytes = seg_defs.operand_bytes;
- if (!insn.addr_bytes || !insn.opnd_bytes)
return false;
- if (user_64bit_mode(regs))
return false;
- insn_get_length(&insn);
- if (nr_copied < insn.length)
return false;
- umip_inst = __identify_insn(&insn);
- /* Check if we found an instruction protected by UMIP */
Put comment above the function call.
- if (umip_inst < 0)
return false;
- if (__emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size))
return false;
- /* If operand is a register, write directly to it */
- if (X86_MODRM_MOD(insn.modrm.value) == 3) {
reg_offset = insn_get_modrm_rm_off(&insn, regs);
Grr, error handling!! That reg_offset can be -E<something>.
reg_addr = (unsigned long *)((unsigned long)regs + reg_offset);
memcpy(reg_addr, dummy_data, dummy_data_size);
- } else {
uaddr = insn_get_addr_ref(&insn, regs);
/* user address could not be determined, abort emulation */
That comment is kinda obvious. But yes, this has error handling.
if ((unsigned long)uaddr == -1L)
return false;
nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size);
if (nr_copied > 0)
return false;
- }
- /* increase IP to let the program keep going */
- regs->ip += insn.length;
- return true;
+}
On Thu, 2017-06-08 at 20:38 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:19AM -0700, Ricardo Neri wrote:
The feature User-Mode Instruction Prevention present in recent Intel processors prevents a group of instructions from being executed with CPL > 0. Otherwise, a general protection fault is issued.
This is one of the best opening paragraphs of a commit message I've read this year! This is how you open: short, succinct, to the point, no marketing bullshit. Good!
Thank you!
Rather than relaying this fault to the user space (in the form of a SIGSEGV signal), the instructions protected by UMIP can be emulated to provide dummy results. This allows us to preserve the current kernel behavior and not reveal the system resources that UMIP intends to protect (the global descriptor and interrupt descriptor tables, the segment selectors of the local descriptor table and the task state, and the machine status word).
This emulation is needed because certain applications (e.g., WineHQ and DOSEMU2) rely on this subset of instructions to function.
The instructions protected by UMIP can be split in two groups. Those who
s/who/which/
I will correct.
return a kernel memory address (sgdt and sidt) and those who return a
ditto.
I will correct here also.
value (sldt, str and smsw).
For the instructions that return a kernel memory address, applications such as WineHQ rely on the result being located in the kernel memory space. The result is emulated as a hard-coded value that lies close to the top of the kernel memory. The limits for the GDT and the IDT are set to zero.
Nice.
Given that sldt and str are not used in common in programs supported by
You wanna say "in common programs" here? Or "not commonly used in programs" ?
I will rephrase this comment.
WineHQ and DOSEMU2, they are not emulated.
The instruction smsw is emulated to return the value that the register CR0 has at boot time, as set in head_32.S.
Care is taken to appropriately emulate the results when segmentation is used. This is, rather than relying on USER_DS and USER_CS, the function
"That is,... "
I will correct it.
insn_get_addr_ref() inspects the segment descriptor pointed by the registers in pt_regs. This ensures that we correctly obtain the segment base address and the address and operand sizes even if the user space application uses local descriptor table.
Btw, I could very well use all that nice explanation in umip.c too so that the high-level behavior is documented.
Sure, I will include a high-level description in the file itself.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/include/asm/umip.h | 15 +++ arch/x86/kernel/Makefile | 1 + arch/x86/kernel/umip.c | 245 ++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 261 insertions(+) create mode 100644 arch/x86/include/asm/umip.h create mode 100644 arch/x86/kernel/umip.c
diff --git a/arch/x86/include/asm/umip.h b/arch/x86/include/asm/umip.h new file mode 100644 index 0000000..077b236 --- /dev/null +++ b/arch/x86/include/asm/umip.h @@ -0,0 +1,15 @@ +#ifndef _ASM_X86_UMIP_H +#define _ASM_X86_UMIP_H
+#include <linux/types.h> +#include <asm/ptrace.h>
+#ifdef CONFIG_X86_INTEL_UMIP +bool fixup_umip_exception(struct pt_regs *regs); +#else +static inline bool fixup_umip_exception(struct pt_regs *regs) +{
- return false;
+}
Let's save some header lines:
static inline bool fixup_umip_exception(struct pt_regs *regs) { return false; }
those trunks take too much space as it is.
I will correct.
+#endif /* CONFIG_X86_INTEL_UMIP */ +#endif /* _ASM_X86_UMIP_H */ diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile index 4b99423..cc1b7cc 100644 --- a/arch/x86/kernel/Makefile +++ b/arch/x86/kernel/Makefile @@ -123,6 +123,7 @@ obj-$(CONFIG_EFI) += sysfb_efi.o obj-$(CONFIG_PERF_EVENTS) += perf_regs.o obj-$(CONFIG_TRACING) += tracepoint.o obj-$(CONFIG_SCHED_MC_PRIO) += itmt.o +obj-$(CONFIG_X86_INTEL_UMIP) += umip.o
ifdef CONFIG_FRAME_POINTER obj-y += unwind_frame.o diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c new file mode 100644 index 0000000..c7c5795 --- /dev/null +++ b/arch/x86/kernel/umip.c @@ -0,0 +1,245 @@ +/*
- umip.c Emulation for instruction protected by the Intel User-Mode
- Instruction Prevention. The instructions are:
- sgdt
- sldt
- sidt
- str
- smsw
- Copyright (c) 2017, Intel Corporation.
- Ricardo Neri ricardo.neri@linux.intel.com
- */
+#include <linux/uaccess.h> +#include <asm/umip.h> +#include <asm/traps.h> +#include <asm/insn.h> +#include <asm/insn-eval.h> +#include <linux/ratelimit.h>
+/*
- == Base addresses of GDT and IDT
- Some applications to function rely finding the global descriptor table (GDT)
That formulation reads funny.
I will correct.
- and the interrupt descriptor table (IDT) in kernel memory.
- For x86_32, the selected values do not match any particular hole, but it
- suffices to provide a memory location within kernel memory.
- == CRO flags for SMSW
- Use the flags given when booting, as found in head_32.S
- */
+#define CR0_STATE (X86_CR0_PE | X86_CR0_MP | X86_CR0_ET | X86_CR0_NE | \
X86_CR0_WP | X86_CR0_AM)
Why not pull those up in asm/processor-flags.h or so and share the definition instead of duplicating it?
Sure, I will relocate this definition.
+#define UMIP_DUMMY_GDT_BASE 0xfffe0000 +#define UMIP_DUMMY_IDT_BASE 0xffff0000
+enum umip_insn {
- UMIP_SGDT = 0, /* opcode 0f 01 ModR/M reg 0 */
- UMIP_SIDT, /* opcode 0f 01 ModR/M reg 1 */
- UMIP_SLDT, /* opcode 0f 00 ModR/M reg 0 */
- UMIP_SMSW, /* opcode 0f 01 ModR/M reg 4 */
- UMIP_STR, /* opcode 0f 00 ModR/M reg 1 */
Let's stick to a single spelling: ModRM.reg=0, etc.
Better yet, use the SDM format:
UMIP_SGDT = 0, /* 0F 01 /0 */ UMIP_SIDT, /* 0F 01 /1 */ ...
I will update accordingly.
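For completeness, the whole enum in SDM notation would then read:

enum umip_insn {
        UMIP_SGDT = 0,  /* 0F 01 /0 */
        UMIP_SIDT,      /* 0F 01 /1 */
        UMIP_SLDT,      /* 0F 00 /0 */
        UMIP_SMSW,      /* 0F 01 /4 */
        UMIP_STR,       /* 0F 00 /1 */
};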
+};
+/**
- __identify_insn() - Identify a UMIP-protected instruction
- @insn: Instruction structure with opcode and ModRM byte.
- From the instruction opcode and the reg part of the ModRM byte, identify,
- if any, a UMIP-protected instruction.
- Return: an enumeration of a UMIP-protected instruction; -EINVAL on failure.
- */
+static int __identify_insn(struct insn *insn)
static enum umip_insn __identify_insn(...
But frankly, that enum looks pointless to me - it is used locally only and you can just as well use plain ints.
I will change to plain ints.
+{
- /* By getting modrm we also get the opcode. */
- insn_get_modrm(insn);
- /* All the instructions of interest start with 0x0f. */
- if (insn->opcode.bytes[0] != 0xf)
return -EINVAL;
- if (insn->opcode.bytes[1] == 0x1) {
switch (X86_MODRM_REG(insn->modrm.value)) {
case 0:
return UMIP_SGDT;
case 1:
return UMIP_SIDT;
case 4:
return UMIP_SMSW;
default:
return -EINVAL;
}
- }
- /* SLDT AND STR are not emulated */
- return -EINVAL;
+}
+/**
- __emulate_umip_insn() - Emulate UMIP instructions with dummy values
- @insn: Instruction structure with ModRM byte
- @umip_inst: Instruction to emulate
- @data: Buffer onto which the dummy values will be copied
- @data_size: Size of the emulated result
- Emulate an instruction protected by UMIP. The result of the emulation
- is saved in the provided buffer. The size of the results depends on both
- the instruction and type of operand (register vs memory address). Thus,
- the size of the result needs to be updated.
- Result: 0 if success, -EINVAL on failure to emulate
- */
+static int __emulate_umip_insn(struct insn *insn, enum umip_insn umip_inst,
unsigned char *data, int *data_size)
+{
- unsigned long dummy_base_addr;
- unsigned short dummy_limit = 0;
- unsigned int dummy_value = 0;
- switch (umip_inst) {
- /*
* These two instructions return the base address and limit of the
* global and interrupt descriptor table. The base address can be
* 24-bit, 32-bit or 64-bit. Limit is always 16-bit. If the operand
* size is 16-bit the returned value of the base address is supposed
* to be a zero-extended 24-byte number. However, it seems that a
* 32-byte number is always returned in legacy protected mode
* irrespective of the operand size.
*/
- case UMIP_SGDT:
/* fall through */
- case UMIP_SIDT:
if (umip_inst == UMIP_SGDT)
dummy_base_addr = UMIP_DUMMY_GDT_BASE;
else
dummy_base_addr = UMIP_DUMMY_IDT_BASE;
if (X86_MODRM_MOD(insn->modrm.value) == 3) {
/* SGDT and SIDT do not take register as argument. */
Comment above the if.
I will correct.
return -EINVAL;
}
So that check needs to go first, then the dummy_base_addr assignment.
I will rearrange.
memcpy(data + 2, &dummy_base_addr, sizeof(dummy_base_addr));
memcpy(data, &dummy_limit, sizeof(dummy_limit));
*data_size = sizeof(dummy_base_addr) + sizeof(dummy_limit);
Huh, that value will always be the same - why do you have a specific variable? It could be a define, once for 32-bit and once for 64-bit.
Sure. I will use #define's.
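Perhaps along these lines (the names are tentative):

/* Tentative names; the result is a 16-bit limit followed by the base address. */
#define UMIP_GDT_IDT_LIMIT_SIZE         2
#define UMIP_GDT_IDT_BASE_SIZE_32BIT    4
#define UMIP_GDT_IDT_BASE_SIZE_64BIT    8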
break;
- case UMIP_SMSW:
/*
* Even though CR0_STATE contain 4 bytes, the number
* of bytes to be copied in the result buffer is determined
* by whether the operand is a register or a memory location.
*/
dummy_value = CR0_STATE;
Something's wrong here: how does that local, write-only variable have any effect?
Ah yes, initially SMSW, SLDT and STR were handled equally. Since I removed support for the last two, I inadvertently removed the code that copies the result of SMSW. I will re-add it.
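Roughly like this (a sketch; sizing the copy by operand type still applies as per the comment above):

        case UMIP_SMSW:
                dummy_value = CR0_STATE;
                memcpy(data, &dummy_value, sizeof(dummy_value));
                *data_size = sizeof(dummy_value);
                break;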
/*
* These two instructions return a 16-bit value. We return
* all zeros. This is equivalent to a null descriptor for
* str and sldt.
*/
/* SLDT and STR are not emulated */
/* fall through */
- case UMIP_SLDT:
/* fall through */
- case UMIP_STR:
/* fall through */
- default:
return -EINVAL;
That switch-case has a majority of fall-throughs. So make it an if-else instead.
Sure, I will update.
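A rough shape of the if-else version, folding in the SMSW fix and the reordered register check from above (a sketch, not the final patch):

static int __emulate_umip_insn(struct insn *insn, int umip_inst,
                               unsigned char *data, int *data_size)
{
        if (umip_inst == UMIP_SGDT || umip_inst == UMIP_SIDT) {
                unsigned long dummy_base_addr;
                unsigned short dummy_limit = 0;

                /* SGDT and SIDT do not take a register as argument. */
                if (X86_MODRM_MOD(insn->modrm.value) == 3)
                        return -EINVAL;

                if (umip_inst == UMIP_SGDT)
                        dummy_base_addr = UMIP_DUMMY_GDT_BASE;
                else
                        dummy_base_addr = UMIP_DUMMY_IDT_BASE;

                memcpy(data, &dummy_limit, sizeof(dummy_limit));
                memcpy(data + 2, &dummy_base_addr, sizeof(dummy_base_addr));
                *data_size = sizeof(dummy_limit) + sizeof(dummy_base_addr);
        } else if (umip_inst == UMIP_SMSW) {
                unsigned int dummy_value = CR0_STATE;

                memcpy(data, &dummy_value, sizeof(dummy_value));
                *data_size = sizeof(dummy_value);
        } else {
                /* SLDT and STR are not emulated. */
                return -EINVAL;
        }

        return 0;
}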
- }
- return 0;
+}
+/**
- fixup_umip_exception() - Fixup #GP faults caused by UMIP
- @regs: Registers as saved when entering the #GP trap
- The instructions sgdt, sidt, str, smsw, sldt cause a general protection
- fault if with CPL > 0 (i.e., from user space). This function can be
- used to emulate the results of the aforementioned instructions with
- dummy values. Results are copied to user-space memory as indicated by
- the instruction pointed by EIP using the registers indicated in the
- instruction operands. This function also takes care of determining
- the address to which the results must be copied.
- */
+bool fixup_umip_exception(struct pt_regs *regs) +{
- struct insn insn;
- unsigned char buf[MAX_INSN_SIZE];
- /* 10 bytes is the maximum size of the result of UMIP instructions */
- unsigned char dummy_data[10] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
unsigned char dummy_data[10] = { 0 };
One 0 should be enough :)
Right. I will update.
- unsigned long seg_base;
- int not_copied, nr_copied, reg_offset, dummy_data_size;
- void __user *uaddr;
- unsigned long *reg_addr;
- enum umip_insn umip_inst;
- struct insn_code_seg_defaults seg_defs;
Please sort function local variables declaration in a reverse christmas tree order:
<type> longest_variable_name; <type> shorter_var_name; <type> even_shorter; <type> i;
I will rearrange my variables.
- /*
* Use the segment base in case user space used a different code
* segment, either in protected (e.g., from an LDT) or virtual-8086
* modes. In most of the cases seg_base will be zero as in USER_CS.
*/
- seg_base = insn_get_seg_base(regs, &insn,
offsetof(struct pt_regs, ip));
Oh boy, where's the error handling?! That can return -1L.
- not_copied = copy_from_user(buf, (void __user *)(seg_base + regs->ip),
-1L + regs->ip is then your pwnage.
I will add the error handling code.
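Something like this, assuming insn_get_seg_base() keeps signaling failure with -1L:

        seg_base = insn_get_seg_base(regs, &insn, offsetof(struct pt_regs, ip));
        if (seg_base == -1L)
                return false;

        not_copied = copy_from_user(buf, (void __user *)(seg_base + regs->ip),
                                    sizeof(buf));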
sizeof(buf));
Just let them stick out.
Sure.
- nr_copied = sizeof(buf) - not_copied;
<---- newline here.
I will add the new line.
- /*
* The copy_from_user above could have failed if user code is protected
()
* by a memory protection key. Give up on emulation in such a case.
* Should we issue a page fault?
Why? AFAICT, you're in the #GP handler. Simply you return unhandled.
If I returned unhandled, a SIGSEGV would be sent to the user space application, but siginfo would look like a #GP. However, memory protection keys cause page faults, and siginfo is filled differently.
*/
- if (!nr_copied)
return false;
- insn_init(&insn, buf, nr_copied, user_64bit_mode(regs));
- /*
* Override the default operand and address sizes to what is specified
* in the code segment descriptor. The instruction decoder only sets
* the address size it to either 4 or 8 address bytes and does nothing
* for the operand bytes. This OK for most of the cases, but we could
* have special cases where, for instance, a 16-bit code segment
* descriptor is used.
* If there are overrides, the instruction decoder correctly updates
* these values, even for 16-bit defaults.
*/
- seg_defs = insn_get_code_seg_defaults(regs);
- insn.addr_bytes = seg_defs.address_bytes;
- insn.opnd_bytes = seg_defs.operand_bytes;
- if (!insn.addr_bytes || !insn.opnd_bytes)
return false;
- if (user_64bit_mode(regs))
return false;
- insn_get_length(&insn);
- if (nr_copied < insn.length)
return false;
- umip_inst = __identify_insn(&insn);
- /* Check if we found an instruction protected by UMIP */
Put comment above the function call.
Will do.
- if (umip_inst < 0)
return false;
- if (__emulate_umip_insn(&insn, umip_inst, dummy_data, &dummy_data_size))
return false;
- /* If operand is a register, write directly to it */
- if (X86_MODRM_MOD(insn.modrm.value) == 3) {
reg_offset = insn_get_modrm_rm_off(&insn, regs);
Grr, error handling!! That reg_offset can be -E<something>.
I will add the error handling code.
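For instance:

        if (X86_MODRM_MOD(insn.modrm.value) == 3) {
                reg_offset = insn_get_modrm_rm_off(&insn, regs);
                if (reg_offset < 0)
                        return false;

                reg_addr = (unsigned long *)((unsigned long)regs + reg_offset);
                memcpy(reg_addr, dummy_data, dummy_data_size);
        }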
reg_addr = (unsigned long *)((unsigned long)regs + reg_offset);
memcpy(reg_addr, dummy_data, dummy_data_size);
- } else {
uaddr = insn_get_addr_ref(&insn, regs);
/* user address could not be determined, abort emulation */
That comment is kinda obvious. But yes, this has error handling.
OK, I will remove this comment.
Many thanks for your detailed review!
BR, Ricardo
fixup_umip_exception() will be called from do_general_protection. If the former returns false, the latter will issue a SIGSEGV with SEND_SIG_PRIV. However, when emulation is successful but the emulated result cannot be copied to user space memory, it is more accurate to issue a SIGSEGV with SEGV_MAPERR with the offending address. A new function is inspired in force_sig_info_fault is introduced to model the page fault.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/kernel/umip.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index c7c5795..ff7366a 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -148,6 +148,41 @@ static int __emulate_umip_insn(struct insn *insn, enum umip_insn umip_inst, }
/** + * __force_sig_info_umip_fault() - Force a SIGSEGV with SEGV_MAPERR + * @address: Address that caused the signal + * @regs: Register set containing the instruction pointer + * + * Force a SIGSEGV signal with SEGV_MAPERR as the error code. This function is + * intended to be used to provide a segmentation fault when the result of the + * UMIP emulation could not be copied to the user space memory. + * + * Return: none + */ +static void __force_sig_info_umip_fault(void __user *address, + struct pt_regs *regs) +{ + siginfo_t info; + struct task_struct *tsk = current; + + if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV)) { + printk_ratelimited("%s[%d] umip emulation segfault ip:%lx sp:%lx error:%x in %lx\n", + tsk->comm, task_pid_nr(tsk), regs->ip, + regs->sp, X86_PF_USER | X86_PF_WRITE, + regs->ip); + } + + tsk->thread.cr2 = (unsigned long)address; + tsk->thread.error_code = X86_PF_USER | X86_PF_WRITE; + tsk->thread.trap_nr = X86_TRAP_PF; + + info.si_signo = SIGSEGV; + info.si_errno = 0; + info.si_code = SEGV_MAPERR; + info.si_addr = address; + force_sig_info(SIGSEGV, &info, tsk); +} + +/** * fixup_umip_exception() - Fixup #GP faults caused by UMIP * @regs: Registers as saved when entering the #GP trap * @@ -235,8 +270,14 @@ bool fixup_umip_exception(struct pt_regs *regs) if ((unsigned long)uaddr == -1L) return false; nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size); - if (nr_copied > 0) - return false; + if (nr_copied > 0) { + /* + * If copy fails, send a signal and tell caller that + * fault was fixed up + */ + __force_sig_info_umip_fault(uaddr, regs); + return true; + } }
/* increase IP to let the program keep going */
On Fri, May 05, 2017 at 11:17:20AM -0700, Ricardo Neri wrote:
fixup_umip_exception() will be called from do_general_protection. If the
^ | Please end function names with parentheses. ---+
former returns false, the latter will issue a SIGSEGV with SEND_SIG_PRIV. However, when emulation is successful but the emulated result cannot be copied to user space memory, it is more accurate to issue a SIGSEGV with SEGV_MAPERR with the offending address. A new function is inspired in
That reads funny.
force_sig_info_fault is introduced to model the page fault.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/kernel/umip.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index c7c5795..ff7366a 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -148,6 +148,41 @@ static int __emulate_umip_insn(struct insn *insn, enum umip_insn umip_inst, }
/**
- __force_sig_info_umip_fault() - Force a SIGSEGV with SEGV_MAPERR
- @address: Address that caused the signal
- @regs: Register set containing the instruction pointer
- Force a SIGSEGV signal with SEGV_MAPERR as the error code. This function is
- intended to be used to provide a segmentation fault when the result of the
- UMIP emulation could not be copied to the user space memory.
- Return: none
- */
+static void __force_sig_info_umip_fault(void __user *address,
struct pt_regs *regs)
+{
- siginfo_t info;
- struct task_struct *tsk = current;
- if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV)) {
Save an indentation level:
if (!(show_unhandled_signals && unhandled_signal(tsk, SIGSEGV))) return;
printk...
printk_ratelimited("%s[%d] umip emulation segfault ip:%lx sp:%lx error:%x in %lx\n",
tsk->comm, task_pid_nr(tsk), regs->ip,
regs->sp, X86_PF_USER | X86_PF_WRITE,
regs->ip);
- }
- tsk->thread.cr2 = (unsigned long)address;
- tsk->thread.error_code = X86_PF_USER | X86_PF_WRITE;
- tsk->thread.trap_nr = X86_TRAP_PF;
- info.si_signo = SIGSEGV;
- info.si_errno = 0;
- info.si_code = SEGV_MAPERR;
- info.si_addr = address;
- force_sig_info(SIGSEGV, &info, tsk);
+}
+/**
- fixup_umip_exception() - Fixup #GP faults caused by UMIP
- @regs: Registers as saved when entering the #GP trap
@@ -235,8 +270,14 @@ bool fixup_umip_exception(struct pt_regs *regs) if ((unsigned long)uaddr == -1L) return false; nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size);
if (nr_copied > 0)
return false;
if (nr_copied > 0) {
/*
* If copy fails, send a signal and tell caller that
* fault was fixed up
Pls end sentences in the comments with a fullstop.
*/
__force_sig_info_umip_fault(uaddr, regs);
return true;
}
}
/* increase IP to let the program keep going */
On Fri, 2017-06-09 at 13:02 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:20AM -0700, Ricardo Neri wrote:
fixup_umip_exception() will be called from do_general_protection. If the
^ |
Please end function names with parentheses. ---+
former returns false, the latter will issue a SIGSEGV with SEND_SIG_PRIV. However, when emulation is successful but the emulated result cannot be copied to user space memory, it is more accurate to issue a SIGSEGV with SEGV_MAPERR with the offending address. A new function is inspired in
That reads funny.
I will correct this.
force_sig_info_fault is introduced to model the page fault.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
arch/x86/kernel/umip.c | 45 +++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c index c7c5795..ff7366a 100644 --- a/arch/x86/kernel/umip.c +++ b/arch/x86/kernel/umip.c @@ -148,6 +148,41 @@ static int __emulate_umip_insn(struct insn *insn, enum umip_insn umip_inst, }
/**
- __force_sig_info_umip_fault() - Force a SIGSEGV with SEGV_MAPERR
- @address: Address that caused the signal
- @regs: Register set containing the instruction pointer
- Force a SIGSEGV signal with SEGV_MAPERR as the error code. This function is
- intended to be used to provide a segmentation fault when the result of the
- UMIP emulation could not be copied to the user space memory.
- Return: none
- */
+static void __force_sig_info_umip_fault(void __user *address,
struct pt_regs *regs)
+{
- siginfo_t info;
- struct task_struct *tsk = current;
- if (show_unhandled_signals && unhandled_signal(tsk, SIGSEGV)) {
Save an indentation level:
if (!(show_unhandled_signals && unhandled_signal(tsk, SIGSEGV))) return;
printk...
I will implement like this.
printk_ratelimited("%s[%d] umip emulation segfault ip:%lx sp:%lx error:%x in %lx\n",
tsk->comm, task_pid_nr(tsk), regs->ip,
regs->sp, X86_PF_USER | X86_PF_WRITE,
regs->ip);
- }
- tsk->thread.cr2 = (unsigned long)address;
- tsk->thread.error_code = X86_PF_USER | X86_PF_WRITE;
- tsk->thread.trap_nr = X86_TRAP_PF;
- info.si_signo = SIGSEGV;
- info.si_errno = 0;
- info.si_code = SEGV_MAPERR;
- info.si_addr = address;
- force_sig_info(SIGSEGV, &info, tsk);
+}
+/**
- fixup_umip_exception() - Fixup #GP faults caused by UMIP
- @regs: Registers as saved when entering the #GP trap
@@ -235,8 +270,14 @@ bool fixup_umip_exception(struct pt_regs *regs) if ((unsigned long)uaddr == -1L) return false; nr_copied = copy_to_user(uaddr, dummy_data, dummy_data_size);
if (nr_copied > 0)
return false;
if (nr_copied > 0) {
/*
* If copy fails, send a signal and tell caller that
* fault was fixed up
Pls end sentences in the comments with a fullstop.
I will correct this.
Thanks and BR, Ricardo
If the User-Mode Instruction Prevention CPU feature is available and enabled, a general protection fault will be issued if the instructions sgdt, sldt, sidt, str or smsw are executed from user-mode context (CPL > 0). If the fault was caused by any of the instructions protected by UMIP, fixup_umip_exception will emulate dummy results for these instructions. If emulation is successful, the result is passed to the user space program and no SIGSEGV signal is emitted.
Please note that fixup_umip_exception also caters for the case when the fault originated while running in virtual-8086 mode.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Reviewed-by: Andy Lutomirski luto@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com --- arch/x86/kernel/traps.c | 4 ++++ 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 3995d3a..cec548d 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -65,6 +65,7 @@ #include <asm/trace/mpx.h> #include <asm/mpx.h> #include <asm/vm86.h> +#include <asm/umip.h>
#ifdef CONFIG_X86_64 #include <asm/x86_init.h> @@ -526,6 +527,9 @@ do_general_protection(struct pt_regs *regs, long error_code) RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); cond_local_irq_enable(regs);
+ if (user_mode(regs) && fixup_umip_exception(regs)) + return; + if (v8086_mode(regs)) { local_irq_enable(); handle_vm86_fault((struct kernel_vm86_regs *) regs, error_code);
On Fri, May 05, 2017 at 11:17:21AM -0700, Ricardo Neri wrote:
If the User-Mode Instruction Prevention CPU feature is available and enabled, a general protection fault will be issued if the instructions sgdt, sldt, sidt, str or smsw are executed from user-mode context (CPL > 0). If the fault was caused by any of the instructions protected by UMIP, fixup_umip_exception will emulate dummy results for these
Please end function names with parentheses.
instructions. If emulation is successful, the result is passed to the user space program and no SIGSEGV signal is emitted.
Please note that fixup_umip_exception also caters for the case when the fault originated while running in virtual-8086 mode.
Cc: Andy Lutomirski luto@kernel.org Cc: Andrew Morton akpm@linux-foundation.org Cc: H. Peter Anvin hpa@zytor.com Cc: Borislav Petkov bp@suse.de Cc: Brian Gerst brgerst@gmail.com Cc: Chen Yucong slaoub@gmail.com Cc: Chris Metcalf cmetcalf@mellanox.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Fenghua Yu fenghua.yu@intel.com Cc: Huang Rui ray.huang@amd.com Cc: Jiri Slaby jslaby@suse.cz Cc: Jonathan Corbet corbet@lwn.net Cc: Michael S. Tsirkin mst@redhat.com Cc: Paul Gortmaker paul.gortmaker@windriver.com Cc: Peter Zijlstra peterz@infradead.org Cc: Ravi V. Shankar ravi.v.shankar@intel.com Cc: Shuah Khan shuah@kernel.org Cc: Vlastimil Babka vbabka@suse.cz Cc: Tony Luck tony.luck@intel.com Cc: Paolo Bonzini pbonzini@redhat.com Cc: Liang Z. Li liang.z.li@intel.com Cc: Alexandre Julliard julliard@winehq.org Cc: Stas Sergeev stsp@list.ru Cc: x86@kernel.org Cc: linux-msdos@vger.kernel.org Reviewed-by: Andy Lutomirski luto@kernel.org Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
 arch/x86/kernel/traps.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 3995d3a..cec548d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -65,6 +65,7 @@
 #include <asm/trace/mpx.h>
 #include <asm/mpx.h>
 #include <asm/vm86.h>
+#include <asm/umip.h>

 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -526,6 +527,9 @@ do_general_protection(struct pt_regs *regs, long error_code)
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
 	cond_local_irq_enable(regs);
Almost definitely:
	if (static_cpu_has(X86_FEATURE_UMIP)) {
		if (...
+	if (user_mode(regs) && fixup_umip_exception(regs))
+		return;
We don't want to punish !UMIP machines.
I am sorry, Boris; I also missed this feedback.
On Fri, 2017-06-09 at 15:02 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:21AM -0700, Ricardo Neri wrote:
If the User-Mode Instruction Prevention CPU feature is available and enabled, a general protection fault will be issued if the instructions sgdt, sldt, sidt, str or smsw are executed from user-mode context (CPL > 0). If the fault was caused by any of the instructions protected by UMIP, fixup_umip_exception will emulate dummy results for these
Please end function names with parentheses.
I have audited my commit messages to remove all instances of this error.
instructions. If emulation is successful, the result is passed to the user space program and no SIGSEGV signal is emitted.
Please note that fixup_umip_exception also caters for the case when the fault originated while running in virtual-8086 mode.
 arch/x86/kernel/traps.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 3995d3a..cec548d 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -65,6 +65,7 @@
 #include <asm/trace/mpx.h>
 #include <asm/mpx.h>
 #include <asm/vm86.h>
+#include <asm/umip.h>

 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -526,6 +527,9 @@ do_general_protection(struct pt_regs *regs, long error_code)
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
 	cond_local_irq_enable(regs);
Almost definitely:
	if (static_cpu_has(X86_FEATURE_UMIP)) {
		if (...
I will make this update.
+	if (user_mode(regs) && fixup_umip_exception(regs))
+		return;
We don't want to punish !UMIP machines.
I will add this check.
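For reference, the hunk with the suggested guard would look roughly like this (a sketch only; the exact shape may differ in v8):

	/* Skip the UMIP fixup path entirely on CPUs without UMIP. */
	if (static_cpu_has(X86_FEATURE_UMIP)) {
		if (user_mode(regs) && fixup_umip_exception(regs))
			return;
	}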
Thanks and BR, Ricardo
User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a bit in %cr4.
It makes sense to enable UMIP at some point while booting, before user space comes up. Like SMAP and SMEP, it is not critical to have it enabled very early during boot. This is because UMIP is relevant only when there is a user space to be protected from. Given the similarities in relevance, it makes sense to enable UMIP along with SMAP and SMEP.
UMIP is enabled by default. It can be disabled by adding clearcpuid=514 to the kernel parameters (UMIP is feature bit 2 in the kernel's cpufeature word 16, hence 16 * 32 + 2 = 514).
Cc: Andy Lutomirski luto@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: H. Peter Anvin hpa@zytor.com
Cc: Borislav Petkov bp@suse.de
Cc: Brian Gerst brgerst@gmail.com
Cc: Chen Yucong slaoub@gmail.com
Cc: Chris Metcalf cmetcalf@mellanox.com
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Fenghua Yu fenghua.yu@intel.com
Cc: Huang Rui ray.huang@amd.com
Cc: Jiri Slaby jslaby@suse.cz
Cc: Jonathan Corbet corbet@lwn.net
Cc: Michael S. Tsirkin mst@redhat.com
Cc: Paul Gortmaker paul.gortmaker@windriver.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ravi V. Shankar ravi.v.shankar@intel.com
Cc: Shuah Khan shuah@kernel.org
Cc: Vlastimil Babka vbabka@suse.cz
Cc: Tony Luck tony.luck@intel.com
Cc: Paolo Bonzini pbonzini@redhat.com
Cc: Liang Z. Li liang.z.li@intel.com
Cc: Alexandre Julliard julliard@winehq.org
Cc: Stas Sergeev stsp@list.ru
Cc: x86@kernel.org
Cc: linux-msdos@vger.kernel.org
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
---
 arch/x86/Kconfig             | 10 ++++++++++
 arch/x86/kernel/cpu/common.c | 16 +++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 702002b..1b1bbeb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1745,6 +1745,16 @@ config X86_SMAP

 	  If unsure, say Y.

+config X86_INTEL_UMIP
+	def_bool y
+	depends on CPU_SUP_INTEL
+	prompt "Intel User Mode Instruction Prevention" if EXPERT
+	---help---
+	  The User Mode Instruction Prevention (UMIP) is a security
+	  feature in newer Intel processors. If enabled, a general
+	  protection fault is issued if the instructions SGDT, SLDT,
+	  SIDT, SMSW and STR are executed in user mode.
+
 config X86_INTEL_MPX
 	prompt "Intel MPX (Memory Protection Extensions)"
 	def_bool n
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8ee3211..66ebded 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -311,6 +311,19 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 	}
 }

+static __always_inline void setup_umip(struct cpuinfo_x86 *c)
+{
+	if (cpu_feature_enabled(X86_FEATURE_UMIP) &&
+	    cpu_has(c, X86_FEATURE_UMIP))
+		cr4_set_bits(X86_CR4_UMIP);
+	else
+		/*
+		 * Make sure UMIP is disabled in case it was enabled in a
+		 * previous boot (e.g., via kexec).
+		 */
+		cr4_clear_bits(X86_CR4_UMIP);
+}
+
 /*
  * Protection Keys are not available in 32-bit mode.
  */
@@ -1121,9 +1134,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	/* Disable the PN if appropriate */
 	squash_the_stupid_serial_number(c);

-	/* Set up SMEP/SMAP */
+	/* Set up SMEP/SMAP/UMIP */
 	setup_smep(c);
 	setup_smap(c);
+	setup_umip(c);

 	/*
 	 * The vendor-specific functions might have changed features.
On Fri, May 05, 2017 at 11:17:22AM -0700, Ricardo Neri wrote:
User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a bit in %cr4.

It makes sense to enable UMIP at some point while booting, before user space comes up. Like SMAP and SMEP, it is not critical to have it enabled very early during boot. This is because UMIP is relevant only when there is a user space to be protected from. Given the similarities in relevance, it makes sense to enable UMIP along with SMAP and SMEP.

UMIP is enabled by default. It can be disabled by adding clearcpuid=514 to the kernel parameters.
 arch/x86/Kconfig             | 10 ++++++++++
 arch/x86/kernel/cpu/common.c | 16 +++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 702002b..1b1bbeb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1745,6 +1745,16 @@ config X86_SMAP

 	  If unsure, say Y.

+config X86_INTEL_UMIP
+	def_bool y
That's a bit too much. It makes sense on distro kernels but how many machines out there actually have UMIP?
+	depends on CPU_SUP_INTEL
+	prompt "Intel User Mode Instruction Prevention" if EXPERT
+	---help---
+	  The User Mode Instruction Prevention (UMIP) is a security
+	  feature in newer Intel processors. If enabled, a general
+	  protection fault is issued if the instructions SGDT, SLDT,
+	  SIDT, SMSW and STR are executed in user mode.
+
 config X86_INTEL_MPX
 	prompt "Intel MPX (Memory Protection Extensions)"
 	def_bool n
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8ee3211..66ebded 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -311,6 +311,19 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 	}
 }
+static __always_inline void setup_umip(struct cpuinfo_x86 *c)
+{
+	if (cpu_feature_enabled(X86_FEATURE_UMIP) &&
+	    cpu_has(c, X86_FEATURE_UMIP))
Hmm, so if UMIP is not build-time disabled, the cpu_feature_enabled() will call static_cpu_has().
Looks like you want to call cpu_has() too because alternatives haven't run yet and static_cpu_has() will reply wrong. Please state that in a comment.
On Fri, 2017-06-09 at 18:10 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:22AM -0700, Ricardo Neri wrote:
User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a bit in %cr4.

It makes sense to enable UMIP at some point while booting, before user space comes up. Like SMAP and SMEP, it is not critical to have it enabled very early during boot. This is because UMIP is relevant only when there is a user space to be protected from. Given the similarities in relevance, it makes sense to enable UMIP along with SMAP and SMEP.

UMIP is enabled by default. It can be disabled by adding clearcpuid=514 to the kernel parameters.
 arch/x86/Kconfig             | 10 ++++++++++
 arch/x86/kernel/cpu/common.c | 16 +++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 702002b..1b1bbeb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1745,6 +1745,16 @@ config X86_SMAP

 	  If unsure, say Y.

+config X86_INTEL_UMIP
+	def_bool y
That's a bit too much. It makes sense on distro kernels but how many machines out there actually have UMIP?
So would this become a y when more machines have UMIP?
+	depends on CPU_SUP_INTEL
+	prompt "Intel User Mode Instruction Prevention" if EXPERT
+	---help---
+	  The User Mode Instruction Prevention (UMIP) is a security
+	  feature in newer Intel processors. If enabled, a general
+	  protection fault is issued if the instructions SGDT, SLDT,
+	  SIDT, SMSW and STR are executed in user mode.
+
 config X86_INTEL_MPX
 	prompt "Intel MPX (Memory Protection Extensions)"
 	def_bool n
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 8ee3211..66ebded 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -311,6 +311,19 @@ static __always_inline void setup_smap(struct cpuinfo_x86 *c)
 	}
 }
+static __always_inline void setup_umip(struct cpuinfo_x86 *c)
+{
+	if (cpu_feature_enabled(X86_FEATURE_UMIP) &&
+	    cpu_has(c, X86_FEATURE_UMIP))
Hmm, so if UMIP is not build-time disabled, the cpu_feature_enabled() will call static_cpu_has().
Looks like you want to call cpu_has() too because alternatives haven't run yet and static_cpu_has() will reply wrong. Please state that in a comment.
Why would static_cpu_has() reply wrong if alternatives are not in place? Because it uses the boot CPU data? When it calls _static_cpu_has() it would do something equivalent to
	testb test_bit, boot_cpu_data.x86_capability[bit]
I am calling cpu_has() because cpu_feature_enabled(), via static_cpu_has(), will use the boot CPU data while cpu_has() would use the local CPU data. Is this what you meant?
I can definitely add a comment with this explanation, if it makes sense.
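Perhaps something along these lines (hypothetical wording for such a comment):

	/*
	 * cpu_feature_enabled() resolves to static_cpu_has(), which checks
	 * boot_cpu_data. Also use cpu_has() so that the capability of the
	 * CPU currently being identified is honored.
	 */
	if (cpu_feature_enabled(X86_FEATURE_UMIP) &&
	    cpu_has(c, X86_FEATURE_UMIP))
		cr4_set_bits(X86_CR4_UMIP);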
Thanks and BR, Ricardo
On Tue, Jul 25, 2017 at 05:44:08PM -0700, Ricardo Neri wrote:
On Fri, 2017-06-09 at 18:10 +0200, Borislav Petkov wrote:
On Fri, May 05, 2017 at 11:17:22AM -0700, Ricardo Neri wrote:
User-Mode Instruction Prevention (UMIP) is enabled by setting/clearing a bit in %cr4.

It makes sense to enable UMIP at some point while booting, before user space comes up. Like SMAP and SMEP, it is not critical to have it enabled very early during boot. This is because UMIP is relevant only when there is a user space to be protected from. Given the similarities in relevance, it makes sense to enable UMIP along with SMAP and SMEP.

UMIP is enabled by default. It can be disabled by adding clearcpuid=514 to the kernel parameters.
...
So would this become a y when more machines have UMIP?
I guess. Stuff which proves reliable and widespread gets automatically enabled with time, in most cases. IMHO, of course.
Why would static_cpu_has() reply wrong if alternatives are not in place? Because it uses the boot CPU data? When it calls _static_cpu_has() it would do something equivalent to
Nevermind - I forgot that static_cpu_has() now drops to dynamic check before alternatives application.
Certain user space programs that run in virtual-8086 mode may utilize instructions protected by the User-Mode Instruction Prevention (UMIP) security feature present in new Intel processors: SGDT, SIDT and SMSW. In such a case, a general protection fault is issued if UMIP is enabled. When such a fault happens, the kernel traps it and emulates the results of these instructions with dummy values. The purpose of this new test is to verify whether the impacted instructions can be executed without causing such a #GP. If no #GP exceptions occur, we expect to exit virtual-8086 mode via INT3.
The instructions protected by UMIP are executed in representative use cases:
 a) displacement-only memory addressing
 b) register-indirect memory addressing
 c) results stored directly in operands
Unfortunately, it is not possible to check the results against a set of expected values because no emulation will occur in systems that do not have the UMIP feature. Instead, the results are printed for verification. A simple check is done to ensure that the results of all tests are identical.
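For readers unfamiliar with the harness: do_test() in entry_from_vm86.c enters virtual-8086 mode through the vm86 system call and decodes the exit reason. A minimal sketch of that mechanism (an illustration under assumed details, not the selftest's exact code):

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <asm/vm86.h>

	/* glibc has no wrapper for vm86(); invoke the raw syscall (x86_32 only). */
	static int enter_vm86(struct vm86plus_struct *v86)
	{
		return syscall(SYS_vm86, VM86_ENTER, v86);
	}

	static void report_exit(int ret)
	{
		/* VM86_TYPE()/VM86_ARG() decode why the task left vm86 mode. */
		if (VM86_TYPE(ret) == VM86_TRAP && VM86_ARG(ret) == 3)
			printf("[OK]\texited via INT3; no unhandled #GP\n");
		else
			printf("[INFO]\texit type %d, arg %d\n",
			       VM86_TYPE(ret), VM86_ARG(ret));
	}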
Cc: Andy Lutomirski luto@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: Borislav Petkov bp@suse.de
Cc: Brian Gerst brgerst@gmail.com
Cc: Chen Yucong slaoub@gmail.com
Cc: Chris Metcalf cmetcalf@mellanox.com
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Fenghua Yu fenghua.yu@intel.com
Cc: Huang Rui ray.huang@amd.com
Cc: Jiri Slaby jslaby@suse.cz
Cc: Jonathan Corbet corbet@lwn.net
Cc: Michael S. Tsirkin mst@redhat.com
Cc: Paul Gortmaker paul.gortmaker@windriver.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ravi V. Shankar ravi.v.shankar@intel.com
Cc: Shuah Khan shuah@kernel.org
Cc: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
---
 tools/testing/selftests/x86/entry_from_vm86.c | 73 ++++++++++++++++++++++++++-
 1 file changed, 72 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/entry_from_vm86.c b/tools/testing/selftests/x86/entry_from_vm86.c
index d075ea0..130e8ad 100644
--- a/tools/testing/selftests/x86/entry_from_vm86.c
+++ b/tools/testing/selftests/x86/entry_from_vm86.c
@@ -95,6 +95,22 @@ asm (
 	"int3\n\t"
 	"vmcode_int80:\n\t"
 	"int $0x80\n\t"
+	"vmcode_umip:\n\t"
+	/* addressing via displacements */
+	"smsw (2052)\n\t"
+	"sidt (2054)\n\t"
+	"sgdt (2060)\n\t"
+	/* addressing via registers */
+	"mov $2066, %bx\n\t"
+	"smsw (%bx)\n\t"
+	"mov $2068, %bx\n\t"
+	"sidt (%bx)\n\t"
+	"mov $2074, %bx\n\t"
+	"sgdt (%bx)\n\t"
+	/* register operands, only for smsw */
+	"smsw %ax\n\t"
+	"mov %ax, (2080)\n\t"
+	"int3\n\t"
 	".size vmcode, . - vmcode\n\t"
 	"end_vmcode:\n\t"
 	".code32\n\t"
@@ -103,7 +119,7 @@ asm (

 extern unsigned char vmcode[], end_vmcode[];
 extern unsigned char vmcode_bound[], vmcode_sysenter[], vmcode_syscall[],
-	vmcode_sti[], vmcode_int3[], vmcode_int80[];
+	vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[];

 /* Returns false if the test was skipped. */
 static bool do_test(struct vm86plus_struct *v86, unsigned long eip,
@@ -160,6 +176,58 @@ static bool do_test(struct vm86plus_struct *v86, unsigned long eip,
 	return true;
 }

+void do_umip_tests(struct vm86plus_struct *vm86, unsigned char *test_mem)
+{
+	struct table_desc {
+		unsigned short limit;
+		unsigned long base;
+	} __attribute__((packed));
+
+	/* Initialize variables with arbitrary values */
+	struct table_desc gdt1 = { .base = 0x3c3c3c3c, .limit = 0x9999 };
+	struct table_desc gdt2 = { .base = 0x1a1a1a1a, .limit = 0xaeae };
+	struct table_desc idt1 = { .base = 0x7b7b7b7b, .limit = 0xf1f1 };
+	struct table_desc idt2 = { .base = 0x89898989, .limit = 0x1313 };
+	unsigned short msw1 = 0x1414, msw2 = 0x2525, msw3 = 3737;
+
+	/* UMIP -- exit with INT3 unless kernel emulation did not trap #GP */
+	do_test(vm86, vmcode_umip - vmcode, VM86_TRAP, 3, "UMIP tests");
+
+	/* Results from displacement-only addressing */
+	msw1 = *(unsigned short *)(test_mem + 2052);
+	memcpy(&idt1, test_mem + 2054, sizeof(idt1));
+	memcpy(&gdt1, test_mem + 2060, sizeof(gdt1));
+
+	/* Results from register-indirect addressing */
+	msw2 = *(unsigned short *)(test_mem + 2066);
+	memcpy(&idt2, test_mem + 2068, sizeof(idt2));
+	memcpy(&gdt2, test_mem + 2074, sizeof(gdt2));
+
+	/* Results when using register operands */
+	msw3 = *(unsigned short *)(test_mem + 2080);
+
+	printf("[INFO]\tResult from SMSW:[0x%04x]\n", msw1);
+	printf("[INFO]\tResult from SIDT: limit[0x%04x]base[0x%08lx]\n",
+	       idt1.limit, idt1.base);
+	printf("[INFO]\tResult from SGDT: limit[0x%04x]base[0x%08lx]\n",
+	       gdt1.limit, gdt1.base);
+
+	if ((msw1 != msw2) || (msw1 != msw3))
+		printf("[FAIL]\tAll the results of SMSW should be the same.\n");
+	else
+		printf("[PASS]\tAll the results from SMSW are identical.\n");
+
+	if (memcmp(&gdt1, &gdt2, sizeof(gdt1)))
+		printf("[FAIL]\tAll the results of SGDT should be the same.\n");
+	else
+		printf("[PASS]\tAll the results from SGDT are identical.\n");
+
+	if (memcmp(&idt1, &idt2, sizeof(idt1)))
+		printf("[FAIL]\tAll the results of SIDT should be the same.\n");
+	else
+		printf("[PASS]\tAll the results from SIDT are identical.\n");
+}
+
 int main(void)
 {
 	struct vm86plus_struct v86;
@@ -218,6 +286,9 @@ int main(void)
 	v86.regs.eax = (unsigned int)-1;
 	do_test(&v86, vmcode_int80 - vmcode, VM86_INTx, 0x80, "int80");
+	/* UMIP -- should exit with INT3 unless kernel emulation failed */
+	do_umip_tests(&v86, addr);
+
 	/* Execute a null pointer */
 	v86.regs.cs = 0;
 	v86.regs.ss = 0;
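As a usage note, the modified selftest should build and run like the other 32-bit x86 selftests (vm86 exists only on x86_32); something along these lines, assuming the usual selftest layout:

	make -C tools/testing/selftests/x86
	./tools/testing/selftests/x86/entry_from_vm86_32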
The instructions STR and SLDT are not recognized when running in virtual-8086 mode and generate an invalid-opcode exception. These two instructions are protected by the Intel User-Mode Instruction Prevention (UMIP) security feature. In protected mode, if UMIP is enabled, these instructions generate a general protection fault if called from CPL > 0. Linux traps the general protection fault and emulates the results with dummy values.
These tests are added to verify that the emulation code does not emulate these two instructions but instead delivers the expected invalid-opcode exception.

The tests fall back to exiting with INT3 in case emulation does happen.
Cc: Andy Lutomirski luto@kernel.org
Cc: Andrew Morton akpm@linux-foundation.org
Cc: Borislav Petkov bp@suse.de
Cc: Brian Gerst brgerst@gmail.com
Cc: Chen Yucong slaoub@gmail.com
Cc: Chris Metcalf cmetcalf@mellanox.com
Cc: Dave Hansen dave.hansen@linux.intel.com
Cc: Fenghua Yu fenghua.yu@intel.com
Cc: Huang Rui ray.huang@amd.com
Cc: Jiri Slaby jslaby@suse.cz
Cc: Jonathan Corbet corbet@lwn.net
Cc: Michael S. Tsirkin mst@redhat.com
Cc: Paul Gortmaker paul.gortmaker@windriver.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ravi V. Shankar ravi.v.shankar@intel.com
Cc: Shuah Khan shuah@kernel.org
Cc: Vlastimil Babka vbabka@suse.cz
Signed-off-by: Ricardo Neri ricardo.neri-calderon@linux.intel.com
---
 tools/testing/selftests/x86/entry_from_vm86.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/x86/entry_from_vm86.c b/tools/testing/selftests/x86/entry_from_vm86.c
index 130e8ad..b7a0c90 100644
--- a/tools/testing/selftests/x86/entry_from_vm86.c
+++ b/tools/testing/selftests/x86/entry_from_vm86.c
@@ -111,6 +111,11 @@ asm (
 	"smsw %ax\n\t"
 	"mov %ax, (2080)\n\t"
 	"int3\n\t"
+	"vmcode_umip_str:\n\t"
+	"str %eax\n\t"
+	"vmcode_umip_sldt:\n\t"
+	"sldt %eax\n\t"
+	"int3\n\t"
 	".size vmcode, . - vmcode\n\t"
 	"end_vmcode:\n\t"
 	".code32\n\t"
@@ -119,7 +124,8 @@ asm (

 extern unsigned char vmcode[], end_vmcode[];
 extern unsigned char vmcode_bound[], vmcode_sysenter[], vmcode_syscall[],
-	vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[];
+	vmcode_sti[], vmcode_int3[], vmcode_int80[], vmcode_umip[],
+	vmcode_umip_str[], vmcode_umip_sldt[];

 /* Returns false if the test was skipped. */
 static bool do_test(struct vm86plus_struct *v86, unsigned long eip,
@@ -226,6 +232,16 @@ void do_umip_tests(struct vm86plus_struct *vm86, unsigned char *test_mem)
 		printf("[FAIL]\tAll the results of SIDT should be the same.\n");
 	else
 		printf("[PASS]\tAll the results from SIDT are identical.\n");
+
+	sethandler(SIGILL, sighandler, 0);
+	do_test(vm86, vmcode_umip_str - vmcode, VM86_SIGNAL, 0,
+		"STR instruction");
+	clearhandler(SIGILL);
+
+	sethandler(SIGILL, sighandler, 0);
+	do_test(vm86, vmcode_umip_sldt - vmcode, VM86_SIGNAL, 0,
+		"SLDT instruction");
+	clearhandler(SIGILL);
 }
int main(void)
Hi Ingo, Thomas,
On Fri, 2017-05-05 at 11:16 -0700, Ricardo Neri wrote:
This is v7 of this series. The six previous submissions can be found here [1], here [2], here[3], here[4], here[5] and here[6]. This version addresses the comments received in v6 plus improvements of the handling of exceptions unrelated to UMIP as well as corner cases in virtual-8086 mode. Please see details in the change log.
Since there have been no further comments on this version, and if this series looks good to you, could it be considered for merging into the tip tree?
The only remaining item is a cleanup patch that Borislav Petkov suggested [1]. I could work on it incrementally on top of this series.
Thanks and BR, Ricardo
Hi again Ingo, Thomas, On Wed, 2017-05-17 at 11:42 -0700, Ricardo Neri wrote:
Hi Ingo, Thomas,
On Fri, 2017-05-05 at 11:16 -0700, Ricardo Neri wrote:
This is v7 of this series. The six previous submissions can be found here [1], here [2], here[3], here[4], here[5] and here[6]. This version addresses the comments received in v6 plus improvements of the handling of exceptions unrelated to UMIP as well as corner cases in virtual-8086 mode. Please see details in the change log.
Since there have been no further comments on this version, and if this series looks good to you, could it be considered for merging into the tip tree?
The only remaining item is a cleanup patch that Borislav Petkov suggested [1]. I could work on it incrementally on top of this series.
More items have accumulated from the latest review by Borislav Petkov. These items are preparatory changes; they are mostly minimal and would not impact functionality. There have been no comments on other parts of the implementation. If I spin a v8 of the series, would it be considered sufficiently mature to be included in v4.13?
Thanks and BR, Ricardo
Thanks and BR, Ricardo