Issue
I am trying to probe the event of a mode switch (user mode to kernel mode), so I need to find which function is triggered when the transition happens.
It seems that on RISC-V the SBI is the place where this transition is handled. Where is the code that handles this on x86?
Solution
It's not that simple. In x86, there are 4 different privilege levels: 0 (operating system kernel), 1, 2, and 3 (applications). Privilege levels 1 and 2 aren't used in Linux: the kernel runs at privilege level 0 while user space code runs at privilege level 3. The current privilege level (CPL) is stored in bits 0 and 1 of the CS (code segment) register.
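To make the CS layout concrete, here is a minimal sketch that splits a segment selector into its fields. The two selector values are assumptions: they are the ones x86-64 Linux typically uses for `__KERNEL_CS` and `__USER_CS`, and the helper names are made up for illustration.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int  # descriptor table index (bits 3-15)
    table: str  # "GDT" or "LDT" (bit 2, the table indicator)
    rpl: int    # privilege level bits (bits 0-1); for CS this is the CPL


def decode_selector(value: int) -> Selector:
    """Split a 16-bit x86 segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("a segment selector is a 16-bit value")
    return Selector(
        index=value >> 3,
        table="LDT" if value & 0b100 else "GDT",
        rpl=value & 0b11,
    )


# Assumed example values: selectors typically loaded into CS on x86-64 Linux.
KERNEL_CS = 0x10  # __KERNEL_CS
USER_CS = 0x33    # __USER_CS

for name, value in (("kernel", KERNEL_CS), ("user", USER_CS)):
    sel = decode_selector(value)
    print(f"{name}: CS={value:#06x} index={sel.index} table={sel.table} CPL={sel.rpl}")
```

The low two bits are all that distinguish "running in the kernel" from "running in user space" as far as the CPU is concerned.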
There are multiple ways in which the transition from user to kernel can happen:
- Through hardware interrupts and exceptions: device interrupts, the hardware timer, page faults, general protection faults, and so on.
- Through software interrupts: the int instruction raises a software interrupt. The most common in Linux is int 0x80, which is configured to be used for system calls from user space to kernel space.
- Through specialized instructions like sysenter and syscall.
In any case, there is no actual code that does the transition: it is done by the processor itself, which switches from one privilege level to the other, and sets up segment selectors, instruction pointer, stack pointer and more according to the information that was set up by the kernel right after booting.
In the case of interrupts, the entries of the Interrupt Descriptor Table (IDT) are used. See this useful documentation page about interrupts in Linux which explains more about the IDT. If you want to get into the details, check out Chapter 5 of the Intel 64 and IA-32 architectures software developer's manual, Volume 3.
In short, each IDT entry specifies a descriptor privilege level (DPL) and a new code segment and offset. In case of software interrupts, some privilege level checks are made by the processor (one of which is CPL <= DPL) to determine whether the code that issued the interrupt has the privilege to do so. Then the processor loads the new CS from the IDT entry, with the privilege level bits set to 0, and starts executing the interrupt handler. This is how the canonical int 0x80 syscall for 32-bit x86 is made.
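As a rough illustration of the DPL check, the sketch below packs and decodes a 16-byte IDT gate in the 64-bit format and applies the CPL <= DPL rule. The handler addresses are invented, and the assumption that the 0x80 gate has DPL 3 while the page fault gate has DPL 0 reflects how Linux is commonly described as setting them up; the helper names are hypothetical.

```python
import struct
from typing import NamedTuple

# 16-byte IDT gate in 64-bit mode (little-endian): offset[15:0], segment
# selector, IST, type/attributes, offset[31:16], offset[63:32], reserved.
GATE_FORMAT = "<HHBBHII"
INTERRUPT_GATE = 0xE  # gate type for a 64-bit interrupt gate


class Gate(NamedTuple):
    selector: int  # code segment selector loaded into CS
    offset: int    # address of the handler
    dpl: int       # descriptor privilege level
    present: bool


def pack_gate(selector: int, offset: int, dpl: int) -> bytes:
    """Build the raw bytes of a present interrupt gate."""
    type_attr = 0x80 | (dpl << 5) | INTERRUPT_GATE
    return struct.pack(
        GATE_FORMAT,
        offset & 0xFFFF, selector, 0, type_attr,
        (offset >> 16) & 0xFFFF, (offset >> 32) & 0xFFFFFFFF, 0,
    )


def decode_gate(raw: bytes) -> Gate:
    """Recover selector, handler offset, DPL and present bit from raw bytes."""
    lo, selector, _ist, type_attr, mid, hi, _reserved = struct.unpack(GATE_FORMAT, raw)
    offset = lo | (mid << 16) | (hi << 32)
    return Gate(selector, offset, (type_attr >> 5) & 0b11, bool(type_attr & 0x80))


def software_int_allowed(cpl: int, gate: Gate) -> bool:
    """The check applied to the int instruction: CPL must be <= gate DPL."""
    return gate.present and cpl <= gate.dpl


# Hypothetical gates: the handler addresses are made up for the example.
syscall_gate = decode_gate(pack_gate(0x10, 0xFFFFFFFF81A00000, dpl=3))
page_fault_gate = decode_gate(pack_gate(0x10, 0xFFFFFFFF81A01000, dpl=0))

print("int 0x80 from CPL 3 allowed:", software_int_allowed(3, syscall_gate))
print("int 14 from CPL 3 allowed:", software_int_allowed(3, page_fault_gate))
```

The DPL check only applies to the int instruction. Hardware interrupts and processor-detected exceptions are delivered regardless of the gate's DPL, which is why a real page fault from user space still reaches the kernel.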
In case of specialized instructions like sysenter and syscall, the details differ, but the concept is similar: the CPU checks privileges and then retrieves the information from dedicated Model Specific Registers (MSRs) that were previously set up by the kernel after boot.
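For the syscall instruction, the relevant MSRs are STAR (segment selectors), LSTAR (64-bit entry address) and the flags mask. The sketch below decodes a STAR value the way the CPU derives selectors from it; the STAR value itself is an assumption, built from the selectors x86-64 Linux typically uses, and decode_star is a made-up helper name.

```python
MSR_STAR = 0xC0000081          # selector bases used by syscall/sysret
MSR_LSTAR = 0xC0000082         # RIP loaded by a 64-bit syscall
MSR_SYSCALL_MASK = 0xC0000084  # RFLAGS bits cleared on syscall


def decode_star(star: int) -> dict:
    """Derive the selectors the CPU loads on syscall and on 64-bit sysret."""
    syscall_base = (star >> 32) & 0xFFFF  # STAR[47:32]
    sysret_base = (star >> 48) & 0xFFFF   # STAR[63:48]
    return {
        "syscall_cs": syscall_base & 0xFFFC,    # privilege bits forced to 0
        "syscall_ss": syscall_base + 8,
        "sysret64_cs": (sysret_base + 16) | 3,  # privilege bits forced to 3
        "sysret64_ss": (sysret_base + 8) | 3,
    }


# Assumed value: 0x23 (__USER32_CS) in the high field, 0x10 (__KERNEL_CS)
# in the low field. The lower 32 bits are left at zero for the example.
example_star = (0x23 << 48) | (0x10 << 32)

for field, selector in decode_star(example_star).items():
    print(f"{field}: {selector:#06x} (privilege level {selector & 3})")
```

On a real machine these registers can only be read from privilege level 0, so the values here are illustrative rather than read from hardware.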
For system calls the result is always the same: user code switches to privilege level 0 and starts executing kernel code, ending up right at the beginning of one of the different syscall entry points defined by the kernel.
Possible syscall entry points are:
- entry_INT80_32 for 32-bit int 0x80
- entry_INT80_compat for 32-bit int 0x80 on a 64-bit kernel
- entry_SYSENTER_32 for 32-bit sysenter
- entry_SYSENTER_compat for 32-bit sysenter on a 64-bit kernel
- entry_SYSCALL_64 for 64-bit syscall
- entry_SYSCALL_compat for 32-bit syscall on a 64-bit kernel (special entry point which is not used by user code; in theory, syscall is also a valid 32-bit instruction on AMD CPUs, but Linux only uses it for 64-bit because of its weird semantics)
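To see which of these entry points a given kernel actually contains, one option is to look the symbols up in /proc/kallsyms. The following is a small sketch, assuming the usual "address type name" line format; find_entry_points is a hypothetical helper, and without root privileges the addresses are usually shown as zero.

```python
ENTRY_POINTS = {
    "entry_INT80_32": "32-bit int 0x80",
    "entry_INT80_compat": "32-bit int 0x80 on a 64-bit kernel",
    "entry_SYSENTER_32": "32-bit sysenter",
    "entry_SYSENTER_compat": "32-bit sysenter on a 64-bit kernel",
    "entry_SYSCALL_64": "64-bit syscall",
    "entry_SYSCALL_compat": "32-bit syscall on a 64-bit kernel",
}


def find_entry_points(kallsyms_text: str) -> dict:
    """Return {symbol: address} for the known entry points found in the text."""
    found = {}
    for line in kallsyms_text.splitlines():
        fields = line.split()
        # Each line is "address type name", optionally followed by "[module]".
        if len(fields) >= 3 and fields[2] in ENTRY_POINTS:
            found[fields[2]] = int(fields[0], 16)
    return found


if __name__ == "__main__":
    try:
        with open("/proc/kallsyms") as f:
            symbols = find_entry_points(f.read())
    except OSError as exc:
        print("could not read /proc/kallsyms:", exc)
    else:
        for name, address in sorted(symbols.items()):
            print(f"{address:#018x} {name}: {ENTRY_POINTS[name]}")
```

Which symbols show up depends on the kernel configuration: a 64-bit kernel would be expected to list the _64 and _compat variants rather than the _32 ones.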
There's also this nice LWN article covering this.
Answered By - Marco Bonelli Answer Checked By - Pedro (WPSolving Volunteer)