Linux syscalls
A system call is a controlled way for a program to request the kernel to perform an operation on its behalf. It solves a fundamental problem in OS architecture:
How do we safely allow unprivileged user programs to access privileged resources?
CPUs implement privilege separation through protection rings. Ring 0 (kernel mode) has full hardware access and Ring 3 (user mode) has restricted access. The initial Linux kernel (v0.01) had only 88 system calls. Today, the kernel (v6.14 as of early 2025) contains over 400 syscalls!
The lifecycle of a syscall begins in user space.
Take, for example, a basic operation like writing text to a file. A C program might use fprintf() from the standard library which internally calls write() and eventually triggers the write syscall.
On x86-64, the preferred approach is through the syscall instruction which replaced the older int 0x80 interrupt-based method used in i386. When the CPU executes a syscall instruction, it switches from user mode to kernel mode, saves the user-space state and jumps to a predefined kernel entry point.
const char msg[] = "hello world\n";
long ret;
asm volatile(
"mov $1, %%rax\n" /* syscall number for write (1) */
"mov $1, %%rdi\n" /* fd (1 = stdout) */
"mov %1, %%rsi\n" /* buffer address */
"mov %2, %%rdx\n" /* buffer length */
"syscall\n" /* invoke the syscall */
: "=&a" (ret) /* output: return value in RAX; early-clobber, since RAX is written before the inputs are read */
: "r" (msg), "r" (sizeof(msg) - 1) /* inputs */
: "rdi", "rsi", "rdx", "rcx", "r11", "memory" /* clobbered registers */
);
asm volatile(
"mov $60, %%rax\n" /* syscall number for exit (60) */
"mov $0, %%rdi\n" /* exit status (0) */
"syscall\n"
:
:
: "rax", "rdi"
);
We directly use the x86-64 syscall instruction to invoke two syscalls, write (1) and exit (60). The syscall number is placed in the RAX register and arguments are passed in specific registers according to the Linux x86-64 syscall convention: RDI, RSI, RDX, R10, R8 and R9 for the first through sixth arguments, respectively.
This differs from the standard C function calling convention on x86-64 (System V AMD64 ABI) which uses RDI, RSI, RDX, RCX, R8 and R9. Note the difference in the fourth argument register (R10 for syscalls vs. RCX for C functions). The discrepancy exists because the syscall instruction itself clobbers RCX and R11.
Syscall numbers are architecture-specific. E.g. on x86-64, the write syscall is number 1, but on ARM64, it's number 64. They are defined in the kernel's header files (see <asm/unistd.h>). The complete list of syscall numbers for each architecture is maintained in the Linux kernel source tree.
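Because the numbers differ per architecture, user code rarely hardcodes them. A minimal sketch, assuming glibc's syscall() wrapper and the SYS_write constant from <sys/syscall.h>; the helper name raw_write is made up for illustration:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/syscall.h>   /* SYS_write resolves to the right number for the target */
#include <unistd.h>        /* syscall() */

/* Write a NUL-terminated string to a file descriptor via the raw write
 * syscall. Returns the number of bytes written, or -1 with errno set. */
static long raw_write(int fd, const char *s)
{
    return syscall(SYS_write, fd, s, strlen(s));
}
```

Calling raw_write(1, "hello world\n") does the same job as the inline assembly above, but compiles unchanged on x86-64 and ARM64 because the constant is resolved at build time.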
SYSCALL invokes an OS system-call handler at privilege level 0. It does so by loading RIP from the IA32_LSTAR MSR (after saving the address of the instruction following SYSCALL into RCX). (The WRMSR instruction ensures that the IA32_LSTAR MSR always contain a canonical address.)
SYSCALL also saves RFLAGS into R11 and then masks RFLAGS using the IA32_FMASK MSR (MSR address C0000084H); specifically, the processor clears in RFLAGS every bit corresponding to a bit that is set in the IA32_FMASK MSR. — Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 2B 4-696
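Let's look at the glibc source for the write wrapper. The version below is the generic fallback stub (io/write.c), which only validates its arguments and then fails with ENOSYS. On Linux, glibc overrides it with a version under sysdeps/unix/sysv/linux/ that issues the actual syscall.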
/* Write NBYTES of BUF to FD. Return the number written, or -1. */
ssize_t
__libc_write (int fd, const void *buf, size_t nbytes)
{
if (nbytes == 0)
return 0;
if (fd < 0)
{
__set_errno (EBADF);
return -1;
}
if (buf == NULL)
{
__set_errno (EINVAL);
return -1;
}
__set_errno (ENOSYS);
return -1;
}
libc_hidden_def (__libc_write)
stub_warning (write)
weak_alias (__libc_write, __write)
libc_hidden_weak (__write)
weak_alias (__libc_write, write)
libc_hidden_weak (write)
In comparison, here's how musl libc implements the same syscall wrapper.
ssize_t write(int fd, const void *buf, size_t count)
{
return syscall_cp(SYS_write, fd, buf, count);
}
A generic __syscall function manages the calls.
hidden long __syscall_ret(unsigned long),
__syscall_cp(syscall_arg_t, syscall_arg_t, syscall_arg_t, syscall_arg_t,
syscall_arg_t, syscall_arg_t, syscall_arg_t);
#define __syscall1(n,a) __syscall1(n,__scc(a))
#define __syscall2(n,a,b) __syscall2(n,__scc(a),__scc(b))
#define __syscall3(n,a,b,c) __syscall3(n,__scc(a),__scc(b),__scc(c))
#define __syscall4(n,a,b,c,d) __syscall4(n,__scc(a),__scc(b),__scc(c),__scc(d))
#define __syscall5(n,a,b,c,d,e) __syscall5(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e))
#define __syscall6(n,a,b,c,d,e,f) __syscall6(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f))
#define __syscall7(n,a,b,c,d,e,f,g) __syscall7(n,__scc(a),__scc(b),__scc(c),__scc(d),__scc(e),__scc(f),__scc(g))
Each individual __syscall* is implemented in architecture-specific assembly, in root/arch/*/syscall_arch.h.
__syscall6 for x86-64 looks like this:
static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
unsigned long ret;
register long r10 __asm__("r10") = a4;
register long r8 __asm__("r8") = a5;
register long r9 __asm__("r9") = a6;
__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
"d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
return ret;
}
If a syscall succeeds, it returns a non-negative value (0, or a positive value such as a byte count or a resource handle like a file descriptor). If it fails, it returns a negative error code (between -1 and -4095).
The libc wrapper translates the negative value into a -1 return value and sets the global errno variable to the absolute value of the error code.
__syscall_ret checks if the return value is a small negative number (greater than -4096) and if so, sets errno and returns -1. C programs can check for errors using a simple comparison against -1 and then examine errno for details.
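The translation step can be sketched in a few lines. This is an illustration modelled on the behaviour described above, not musl's actual source; the name translate_ret is hypothetical:

```c
#include <errno.h>

/* Map a raw kernel return value to the libc convention.
 * Values in [-4095, -1] are error codes; everything else is a result. */
static long translate_ret(long raw)
{
    if ((unsigned long)raw > -4096UL) {   /* true exactly for -4095..-1 */
        errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

The unsigned comparison is a common trick: cast to unsigned, the small negative values become the largest representable numbers, so one comparison covers the whole error range.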
Each syscall corresponds to a function in the kernel. E.g. write is implemented by the sys_write function (see fs/read_write.c in the kernel source).
ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
CLASS(fd_pos, f)(fd);
ssize_t ret = -EBADF;
if (!fd_empty(f)) {
loff_t pos, *ppos = file_ppos(fd_file(f));
if (ppos) {
pos = *ppos;
ppos = &pos;
}
ret = vfs_write(fd_file(f), buf, count, ppos);
if (ret >= 0 && ppos)
fd_file(f)->f_pos = pos;
}
return ret;
}
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
return ksys_write(fd, buf, count);
}
SYSCALL_DEFINE3 is a kernel convenience macro for defining a syscall with three arguments. ksys_write validates the file descriptor, gets the file structure, performs the write operation through the virtual file system layer and returns the result.
The kernel uses macros like __user to mark pointers that come from user space and functions like copy_from_user and copy_to_user to transfer data between user and kernel space.
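The effect of this checking is visible from user space: when the kernel cannot copy from the pointer it was given, the syscall fails with EFAULT instead of crashing anything. A small sketch, assuming address 1 is unmapped (as it normally is on Linux) and using a pipe as the target; the helper name is made up:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel to write 16 bytes starting at an unmapped address.
 * Returns the errno reported by the failed call, or 0 if the call
 * unexpectedly succeeded. */
static int write_from_bad_pointer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return errno;

    errno = 0;
    long ret = syscall(SYS_write, fds[1], (const void *)1, (size_t)16);
    int err = (ret == -1) ? errno : 0;   /* save before close() can touch errno */

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A pipe is used rather than /dev/null because the kernel has to actually read the buffer before the failure shows up.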
The socketcall multiplexing syscall is used on older 32-bit systems to group all socket-related operations under a single syscall number due to limited syscall table space.
The _llseek syscall was introduced to manage file offsets larger than 32 bits on 32-bit systems; the lseek64 library function wraps it.
ioctl is a catch-all for device-specific operations. Instead of creating a new syscall for each unique device operation, it uses a command number and device-specific argument structure to provide an interface for device drivers.
int ioctl(int fd, unsigned long request, ...);
This saves syscall table space, but it creates challenges for standardization since each device can define its own set of ioctl commands.
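As a concrete instance of the command-number pattern, the sketch below uses FIONREAD, a request code that asks how many bytes are waiting to be read. It is shown on a pipe purely for convenience; the helper name is hypothetical:

```c
#include <sys/ioctl.h>   /* ioctl(), FIONREAD */
#include <unistd.h>

/* Write `len` bytes into a fresh pipe, then ask the kernel via ioctl how
 * many bytes are pending on the read end. Returns that count, or -1. */
static int pending_bytes_after_write(const char *data, size_t len)
{
    int fds[2];
    int pending = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], data, len) == (ssize_t)len) {
        /* The third argument's type depends on the request: here an int*. */
        if (ioctl(fds[0], FIONREAD, &pending) != 0)
            pending = -1;
    }
    close(fds[0]);
    close(fds[1]);
    return pending;
}
```

The untyped third argument is exactly what makes ioctl flexible and hard to standardize: nothing in the prototype says that FIONREAD expects a pointer to int.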
The syscall instruction itself has overhead due to the switch between user mode and kernel mode. To mitigate this, there are a handful of optimizations.
- vDSO (virtual dynamic shared object): A small shared library mapped into all user-space processes that contains kernel code for certain syscalls that don't actually require a mode switch, such as gettimeofday. Applications would execute this code directly in user space.
- vsyscalls: An older mechanism similar to vDSO but implemented as a fixed-address memory region. It was deprecated in favor of vDSO due to security concerns.
- Batched syscalls: Calls like readv/writev and more recent additions like io_submit for multiple ops per syscall.
- io_uring: An interface for asynchronous I/O that reduces syscall overhead by batching operations through shared memory rings (introduced in Linux 5.1, 2019).
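Of the batching options listed, writev is the simplest to try. A minimal sketch that sends two separate buffers with one syscall and reads them back through a pipe; the function name and the sample strings are made up:

```c
#include <string.h>
#include <sys/uio.h>   /* writev(), struct iovec */
#include <unistd.h>

/* Write two buffers into a pipe with a single writev() call, then read
 * the combined data into `out`. Returns bytes read, or -1 on error. */
static ssize_t gather_write_demo(char *out, size_t outlen)
{
    char part1[] = "hello, ";
    char part2[] = "world\n";
    struct iovec iov[2] = {
        { .iov_base = part1, .iov_len = strlen(part1) },
        { .iov_base = part2, .iov_len = strlen(part2) },
    };
    int fds[2];
    ssize_t n = -1;

    if (pipe(fds) != 0)
        return -1;
    /* One mode switch instead of two separate write() calls. */
    if (writev(fds[1], iov, 2) == (ssize_t)(iov[0].iov_len + iov[1].iov_len))
        n = read(fds[0], out, outlen);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```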
A syscall takes hundreds to thousands of CPU cycles, mostly due to the mode switch and associated context saving and restoring. A regular function call takes only a few cycles in comparison.
On mainstream hardware, benchmarks show that a vDSO call can be over 10 times faster than the direct syscall.
On the Intel Celeron D 341 from 2004 the a system call via the syscall instruction was about 25 times slower than a system call via the vDSO. On the Intel Core i7-4790K from 2014 it's only about 12 times slower. For me I'll use 10 times slower as rule of thumb for modern CPUs and 25 times for older CPUs. — Measurements of system call performance and overhead by Stephan Soller
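A rough way to reproduce this kind of comparison yourself is sketched below. It assumes that libc's clock_gettime goes through the vDSO on your system (typical on x86-64, but not guaranteed) and that syscall(SYS_clock_gettime, ...) forces a real kernel entry. The function names are hypothetical and the numbers vary widely by CPU, kernel and mitigation settings:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (b->tv_sec - a->tv_sec) * 1e9 + (b->tv_nsec - a->tv_nsec);
}

/* Average cost in nanoseconds of one clock_gettime call.
 * use_raw_syscall == 0: call libc, which typically uses the vDSO.
 * use_raw_syscall != 0: force a kernel entry through syscall(). */
static double avg_call_ns(int use_raw_syscall, int iters)
{
    struct timespec start, end, ts;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iters; i++) {
        if (use_raw_syscall)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts);
        else
            clock_gettime(CLOCK_MONOTONIC, &ts);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ns(&start, &end) / iters;
}
```

Comparing avg_call_ns(0, 1000000) with avg_call_ns(1, 1000000) gives a ballpark ratio; treat it as a rough indication rather than a rigorous benchmark.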
ia32_compat is a syscall compatibility layer for running 32-bit applications on 64-bit kernels; it translates differences in data type sizes and argument passing.
Now let's look at some common syscalls.
- fork, exec and wait family of syscalls for process creation. fork creates a new process by duplicating the calling process.
  pid_t fork(void)
  {
      return syscall(SYS_fork);
  }
  Internally, the kernel creates a new task structure, copies or shares resources according to flags and returns different values to the parent and child processes. Recently, the clone syscall is commonly used with flags instead of fork.
- mmap syscall maps files or devices into memory.
  void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset)
  {
      return (void *)syscall(SYS_mmap, addr, length, prot, flags, fd, offset);
  }
  malloc uses mmap for large allocations and brk/sbrk for smaller ones. Another use case of mmap is to write machine code directly to the current process and execute it.
- Aside from open, read, write and close for device I/O, sendfile allows data to be transferred directly between file descriptors without passing through user space (which is great for web servers and transferring to network sockets).
  ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count)
  {
      return syscall(SYS_sendfile, out_fd, in_fd, offset, count);
  }
- The futex (fast user-space mutex) syscall is a primitive for higher-level synchronization primitives.
  int futex(int *uaddr, int futex_op, int val, const struct timespec *timeout, int *uaddr2, int val3)
  {
      return syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
  }
  It combines user-space atomic ops with kernel invocation only when necessary, which is useful for IPC and implementing mutex locks, condition variables & semaphores.
- uname, sysinfo and various /proc file ops provide information about the system. sysctl was originally used to read and modify kernel parameters.
  int sysctl(int *name, int nlen, void *oldval, size_t *oldlenp, void *newval, size_t newlen)
  {
      struct __sysctl_args args = {
          .name = name, .nlen = nlen,
          .oldval = oldval, .oldlenp = oldlenp,
          .newval = newval, .newlen = newlen
      };
      return syscall(SYS_sysctl, &args);
  }
  More recently, the /proc/sys interface is preferred for these operations and the sysctl syscall is considered deprecated.
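Several of these calls compose naturally. The following sketch, with a made-up function name, combines mmap, fork and waitpid: the child stores a value in a shared anonymous mapping and the parent reads it after the child exits.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child writes `value` into memory shared with the parent; the parent
 * waits for the child and returns what it finds there, or -1 on error. */
static int value_via_shared_mapping(int value)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = -1;

    pid_t pid = fork();
    if (pid == 0) {              /* child: fork returned 0 */
        *shared = value;
        _exit(0);
    }

    int result = -1;
    int status;
    if (pid > 0 && waitpid(pid, &status, 0) == pid &&
        WIFEXITED(status) && WEXITSTATUS(status) == 0)
        result = *shared;        /* parent sees the child's store */

    munmap(shared, sizeof *shared);
    return result;
}
```

MAP_SHARED is what makes the store visible across the fork. With MAP_PRIVATE the child would get its own copy-on-write page and the parent would still read -1.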
There are various strategies to maintain backward compatibility.
- Adding new syscalls is the cleanest way to extend functionality without breaking existing applications. E.g. epoll_create alongside the older select and poll syscalls for I/O multiplexing.
- When a syscall needs new parameters or wider types, an extended version is created under a suffixed name, e.g. truncate and truncate64 for handling larger file sizes.
- Syscalls with flag parameters, like open with its O_* flags, can be extended by defining new flag values.
- Syscalls like socketcall and ipc group related functions under a single syscall number (i.e. multiplexing), using a subcommand parameter to specify the actual operation.
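The flag-based strategy is easy to observe with O_CLOEXEC, which was added to open in Linux 2.6.23, long after the syscall itself existed. A small sketch (the helper name is hypothetical) that opens /dev/null with and without the flag and checks the descriptor's close-on-exec state:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Open /dev/null with the given extra flags and report whether the
 * descriptor is close-on-exec: 1 if yes, 0 if no, -1 on error. */
static int opens_cloexec(int extra_flags)
{
    int fd = open("/dev/null", O_RDONLY | extra_flags);
    if (fd < 0)
        return -1;
    int fdflags = fcntl(fd, F_GETFD);
    close(fd);
    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}
```

Old binaries that never pass the new bit keep their old behaviour, which is why this kind of extension is backward compatible.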
seccomp (secure computing) allows processes to restrict which syscalls they can make. It is used in container environments.
The filter is a classic BPF program: an array of struct sock_filter instructions that the kernel runs against each syscall.
struct sock_filter filter[] = {
/* validate arch */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
/* load syscall number */
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
/* allow specific syscalls */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
/* deny all other syscalls */
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
};
struct sock_fprog prog = {
.len = sizeof(filter) / sizeof(filter[0]),
.filter = filter,
};
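/* installing a filter requires no_new_privs (or CAP_SYS_ADMIN); needs <sys/prctl.h> */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0) {
perror("prctl");
return 1;
}
/* enable seccomp filtering */
if (syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog) < 0) {
perror("seccomp");
return 1;
}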
/* from this point, only read, write and exit syscalls are allowed */
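This creates a seccomp filter that only permits the read, write and exit syscalls and kills the calling thread (via SECCOMP_RET_KILL) if it attempts any other syscall. Note that glibc's exit() and a return from main use exit_group rather than exit, so a real program would typically allow SYS_exit_group as well.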
strace traces all syscalls made by a process.
$ strace -e trace=write echo "Hello, world!"
write(1, "Hello, world!\n", 14) = 14
This traces only the write syscalls made by echo: it wrote 14 bytes to file descriptor 1 (stdout).
ltrace can show library calls as well as syscalls.
$ ltrace -S printf "Hello, world!"
SYS_brk(0) = 0x55b54de48000
SYS_access("/etc/ld.so.preload", 04) = -2
...
printf("Hello, world!") = 13
SYS_write(1, "Hello, world!", 13) = 13