diff options
author | Will Drewry <wad@chromium.org> | 2012-01-10 14:42:59 -0600 |
---|---|---|
committer | Leann Ogasawara <leann.ogasawara@canonical.com> | 2012-05-21 06:46:03 -0700 |
commit | 5b594b174e9390b6133d30b52dab4c068241a474 (patch) | |
tree | c1dd3d929ed7782cbfd13d859156b3d9aec37edb /Documentation | |
parent | 4b38a59ff304d7d0b7a1ec8c7fe01a92a4b3ed10 (diff) |
UBUNTU: SAUCE: SECCOMP: Documentation: prctl/seccomp_filter
Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 and a semi-generic
example using a macro-based code generator.
v14: - rebase/nochanges
v13: - rebase on to 88ebdda6159ffc15699f204c33feb3e431bf9bdc
v12: - comment on the ptrace_event use
- update arch support comment
- note the behavior of SECCOMP_RET_DATA when there are multiple filters
(keescook@chromium.org)
- lots of samples/ clean up incl 64-bit bpf-direct support
(markus@chromium.org)
- rebase to linux-next
v11: - overhaul return value language, updates (keescook@chromium.org)
- comment on do_exit(SIGSYS)
v10: - update for SIGSYS
- update for new seccomp_data layout
- update for ptrace option use
v9: - updated bpf-direct.c for SIGILL
v8: - add PR_SET_NO_NEW_PRIVS to the samples.
v7: - updated for all the new stuff in v7: TRAP, TRACE
- only talk about PR_SET_SECCOMP now
- fixed bad JLE32 check (coreyb@linux.vnet.ibm.com)
- adds dropper.c: a simple system call disabler
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. (luto@mit.edu)
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value (eparis@redhat.com)
- language cleanup (rdunlap@xenotime.net)
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter (rdunlap@xenotime.net)
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples (corbet@lwn.net)
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <kees@ubuntu.com>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/prctl/seccomp_filter.txt | 156 |
1 files changed, 156 insertions, 0 deletions
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt new file mode 100644 index 00000000000..4aa3e78ff53 --- /dev/null +++ b/Documentation/prctl/seccomp_filter.txt @@ -0,0 +1,156 @@ + SECure COMPuting with filters + ============================= + +Introduction +------------ + +A large number of system calls are exposed to every userland process +with many of them going unused for the entire lifetime of the process. +As system calls change and mature, bugs are found and eradicated. A +certain subset of userland applications benefit by having a reduced set +of available system calls. The resulting set reduces the total kernel +surface exposed to the application. System call filtering is meant for +use with those applications. + +Seccomp filtering provides a means for a process to specify a filter for +incoming system calls. The filter is expressed as a Berkeley Packet +Filter (BPF) program, as with socket filters, except that the data +operated on is related to the system call being made: system call +number and the system call arguments. This allows for expressive +filtering of system calls using a filter program language with a long +history of being exposed to userland and a straightforward data set. + +Additionally, BPF makes it impossible for users of seccomp to fall prey +to time-of-check-time-of-use (TOCTOU) attacks that are common in system +call interposition frameworks. BPF programs may not dereference +pointers which constrains all filters to solely evaluating the system +call arguments directly. + +What it isn't +------------- + +System call filtering isn't a sandbox. It provides a clearly defined +mechanism for minimizing the exposed kernel surface. It is meant to be +a tool for sandbox developers to use. Beyond that, policy for logical +behavior and information flow should be managed with a combination of +other system hardening techniques and, potentially, an LSM of your +choosing. Expressive, dynamic filters provide further options down this +path (avoiding pathological sizes or selecting which of the multiplexed +system calls in socketcall() is allowed, for instance) which could be +construed, incorrectly, as a more complete sandboxing solution. + +Usage +----- + +An additional seccomp mode is added and is enabled using the same +prctl(2) call as the strict seccomp. If the architecture has +CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below: + +PR_SET_SECCOMP: + Now takes an additional argument which specifies a new filter + using a BPF program. + The BPF program will be executed over struct seccomp_data + reflecting the system call number, arguments, and other + metadata. The BPF program must then return one of the + acceptable values to inform the kernel which action should be + taken. + + Usage: + prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog); + + The 'prog' argument is a pointer to a struct sock_fprog which + will contain the filter program. If the program is invalid, the + call will return -1 and set errno to EINVAL. + + Note, is_compat_task is also tracked for the @prog. This means + that once set the calling task will have all of its system calls + blocked if it switches its system call ABI. + + If fork/clone and execve are allowed by @prog, any child + processes will be constrained to the same filters and system + call ABI as the parent. + + Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or + run with CAP_SYS_ADMIN privileges in its namespace. If these are not + true, -EACCES will be returned. This requirement ensures that filter + programs cannot be applied to child processes with greater privileges + than the task that installed them. + + Additionally, if prctl(2) is allowed by the attached filter, + additional filters may be layered on which will increase evaluation + time, but allow for further decreasing the attack surface during + execution of a process. + +The above call returns 0 on success and non-zero on error. + +Return values +------------- +A seccomp filter may return any of the following values. If multiple +filters exist, the return value for the evaluation of a given system +call will always use the highest precedent value. (For example, +SECCOMP_RET_KILL will always take precedence.) + +In precedence order, they are: + +SECCOMP_RET_KILL: + Results in the task exiting immediately without executing the + system call. The exit status of the task (status & 0x7f) will + be SIGSYS, not SIGKILL. + +SECCOMP_RET_TRAP: + Results in the kernel sending a SIGSYS signal to the triggering + task without executing the system call. The kernel will + rollback the register state to just before the system call + entry such that a signal handler in the task will be able to + inspect the ucontext_t->uc_mcontext registers and emulate + system call success or failure upon return from the signal + handler. + + The SECCOMP_RET_DATA portion of the return value will be passed + as si_errno. + + SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP. + +SECCOMP_RET_ERRNO: + Results in the lower 16-bits of the return value being passed + to userland as the errno without executing the system call. + +SECCOMP_RET_TRACE: + When returned, this value will cause the kernel to attempt to + notify a ptrace()-based tracer prior to executing the system + call. If there is no tracer present, -ENOSYS is returned to + userland and the system call is not executed. + + A tracer will be notified if it requests PTRACE_O_TRACESECCOMP + using ptrace(PTRACE_SETOPTIONS). The tracer will be notified + of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of + the BPF program return value will be available to the tracer + via PTRACE_GETEVENTMSG. + +SECCOMP_RET_ALLOW: + Results in the system call being executed. + +If multiple filters exist, the return value for the evaluation of a +given system call will always use the highest precedent value. + +Precedence is only determined using the SECCOMP_RET_ACTION mask. When +multiple filters return values of the same precedence, only the +SECCOMP_RET_DATA from the most recently installed filter will be +returned. + + +Example +------- + +The samples/seccomp/ directory contains both an x86-specific example +and a more generic example of a higher level macro interface for BPF +program generation. + +Adding architecture support +----------------------- + +See arch/Kconfig for the authoritative requirements. In general, if an +architecture supports both ptrace_event and seccomp, it will be able to +support seccomp filter with minor fixup: SIGSYS support and seccomp return +value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER +to its arch-specific Kconfig. |