Hi all! We would like to use Tetragon to implement per-workload runtime security policies across a Kubernetes cluster. The goal is to establish a "fingerprint" of allowed behavior for every Kubernetes workload (Deployment, StatefulSet, DaemonSet), starting with the strict enforcement of which processes each workload is permitted to spawn.
Let's say in our cluster we have two deployments, my-deployment-1 and my-deployment-2, and we want to enforce the following policies:
```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-2"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/ls"
        - "/usr/bin/my-server-2"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
```

Let's see what Tetragon injects into the kernel today.
eBPF prog point of view
The two policies above result in the following eBPF programs being attached to the `security_bprm_creds_for_exec` function:
- `generic_kprobe_event` (from policy-1) -> other generic kprobe programs called via tail calls
- `generic_kprobe_event` (from policy-2) -> other generic kprobe programs called via tail calls
- `generic_fmodret_override` (from policy-1)
- `generic_fmodret_override` (from policy-2)
Of course, the number of programs grows linearly with the number of policies (and thus with the number of K8s workloads in our use case). As the number of policies grows, we hit the following limits:
- The first issue we face is the number of programs we can attach to the same hook: `BPF_MODIFY_RETURN` programs rely on the eBPF trampoline and are subject to the `BPF_MAX_TRAMP_LINKS` limit, which is 38 on x86 (https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138).
- Say we work around this by using kprobes + SIGKILL instead; we then hit a second limit of 128 policies. This limit is hardcoded in the Tetragon code (`__uint(max_entries, POLICY_FILTER_MAX_POLICIES);` in tetragon/bpf/process/policy_filter.h, line 23 at 47538a0), probably to bound memory usage. We could likely overcome this limit as well by making it configurable.
- The third issue, which I think we cannot overcome today, is performance overhead. The list of attached programs grows linearly with the number of policies we create: with 500 workloads in the cluster, 500 programs are attached to the same function. This could lead to a noticeable system slowdown whenever a new process is created, and the slowdown could be even more significant if we extend this behavior to other kernel subsystems (e.g., file system or network operations).
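To make the scaling concrete, here is a back-of-the-envelope model. The constants come from the kernel and Tetragon sources cited above; the two-programs-per-policy assumption comes from the program list in the previous section:

```go
package main

import "fmt"

const (
	// BPF_MAX_TRAMP_LINKS on x86, kernel v6.14 (include/linux/bpf.h).
	bpfMaxTrampLinks = 38
	// POLICY_FILTER_MAX_POLICIES, hardcoded in bpf/process/policy_filter.h.
	policyFilterMaxPolicies = 128
)

// progsOnHook models the linear growth: each policy attaches its own
// generic_kprobe_event (plus tail calls) and generic_fmodret_override
// to security_bprm_creds_for_exec. "2 per policy" is an assumption
// based on the program list above.
func progsOnHook(policies int) int {
	return 2 * policies
}

func main() {
	for _, n := range []int{10, 38, 128, 500} {
		fmt.Printf("%4d policies -> %4d progs on the hook (fmod_ret cap: %d, policy filter cap: %d)\n",
			n, progsOnHook(n), bpfMaxTrampLinks, policyFilterMaxPolicies)
	}
}
```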
eBPF maps point of view
For each of the above policies, I see roughly 50 eBPF maps loaded. Most of them have just one entry because they are probably unused, but others can consume a significant amount of memory. The reported memlock per policy is around 8 MB. The most memory-intensive maps appear to be:
```
// inner map for each loaded policy with pod selectors
721: hash  name policy_1_map  flags 0x0
	key 8B  value 1B  max_entries 32768  memlock 2624000B
	pids tetragon(63603)

// still need to check if this is really needed (?)
764: lru_hash  name socktrack_map  flags 0x0
	key 8B  value 16B  max_entries 32000  memlock 2829696B
	btf_id 947
	pids tetragon(63603)

// map used for overriding the return value
766: hash  name override_tasks  flags 0x0
	key 8B  value 4B  max_entries 32768  memlock 2624000B
	btf_id 949
	pids tetragon(63603)
```
As you may imagine, in this case as well, having 500 deployments in the cluster could lead to significant memory usage on each node (8 MB * 500 = 4 GB).
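A quick sanity check of these numbers, using the memlock values copied from the bpftool output above (the three largest maps account for most of the reported ~8 MB per policy):

```go
package main

import "fmt"

// memlock values (bytes) reported by bpftool for the three largest maps.
const (
	policyMapMemlock     = 2_624_000 // policy_1_map (inner pod-selector map)
	socktrackMemlock     = 2_829_696 // socktrack_map
	overrideTasksMemlock = 2_624_000 // override_tasks
	perPolicyMemlock     = 8 << 20   // ~8 MB total reported per policy
)

func main() {
	big3 := policyMapMemlock + socktrackMemlock + overrideTasksMemlock
	fmt.Printf("three largest maps: %.1f MB of ~8 MB per policy\n",
		float64(big3)/(1<<20))
	workloads := 500
	fmt.Printf("%d workloads -> ~%.1f GB of memlock per node\n",
		workloads, float64(workloads*perPolicyMemlock)/(1<<30))
}
```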
Summary
With this issue, we just want to highlight the scalability limitations we are currently facing. I would love your feedback on this analysis: do you see any mistakes? I'm pretty new to Tetragon, so maybe I missed something and there is a way to overcome some of the above limitations that I didn't consider.
If you confirm these are real limitations and you are interested in supporting this use case, we can maybe discuss possible ideas to address them.
Thank you for your time!