
Scaling k8s workload aware tracing policies #4191

@Andreagit97

Description


Tetragon CFP available here

Hi all! We would like to use Tetragon to implement per-workload runtime security policies across a Kubernetes cluster. The goal is to establish a "fingerprint" of allowed behavior for every Kubernetes workload (Deployment, StatefulSet, DaemonSet), starting with the strict enforcement of which processes each workload is permitted to spawn.

Let's say in our cluster we have two deployments, my-deployment-1 and my-deployment-2, and we want to enforce the following policies:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-1"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-1"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/sleep"
        - "/usr/bin/cat"
        - "/usr/bin/my-server-1"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"
---
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "policy-2"
spec:
  podSelector:
    matchLabels:
      app: "my-deployment-2"
  kprobes:
  - call: "security_bprm_creds_for_exec"
    syscall: false
    args:
    - index: 0
      type: "linux_binprm"
    selectors:
    - matchArgs:
      - index: 0
        operator: "NotEqual"
        values:
        - "/usr/bin/ls"
        - "/usr/bin/my-server-2"
      matchActions:
      - action: Override
        argError: -1
  options:
  - name: disable-kprobe-multi
    value: "1"

Let's see what Tetragon injects into the kernel today.

eBPF prog point of view

The two policies above result in the following eBPF programs being attached to the security_bprm_creds_for_exec function:

  • generic_kprobe_event (from policy-1) -> other generic kprobe called in tail call
  • generic_kprobe_event (from policy-2) -> other generic kprobe called in tail call
  • generic_fmodret_override (from policy-1)
  • generic_fmodret_override (from policy-2)
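
To make the override path concrete, below is a minimal, hypothetical sketch of what a BPF_MODIFY_RETURN (fmod_ret) program on this hook looks like. This is not Tetragon's actual code (the program name and map layout are simplified; only override_tasks corresponds to a map in the listing further down), but since each policy attaches its own program of this type, this is the program type that runs into the trampoline limit discussed below.

```c
// Hypothetical sketch, not Tetragon source: an fmod_ret program that
// overrides the return value of security_bprm_creds_for_exec for tasks
// that were previously marked for override in a hash map (conceptually,
// the kprobe program populates such a map when an Override action fires).
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 32768);
    __type(key, __u64);   /* current pid/tgid */
    __type(value, __s32); /* errno to return, e.g. -1 for argError: -1 */
} override_tasks SEC(".maps");

SEC("fmod_ret/security_bprm_creds_for_exec")
int BPF_PROG(override_exec, struct linux_binprm *bprm, int ret)
{
    __u64 id = bpf_get_current_pid_tgid();
    __s32 *err = bpf_map_lookup_elem(&override_tasks, &id);

    /* A non-zero return value makes the trampoline skip the original
     * function and use this value as its return value; returning 0 lets
     * the original function run normally. */
    return err ? *err : 0;
}

char _license[] SEC("license") = "GPL";
```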

Of course, the number of programs grows linearly with the number of policies (and therefore with the number of k8s workloads in our use case). As the number of policies grows, we hit the following limits:

  1. The first issue is the number of programs that can be attached to the same hook. With BPF_MODIFY_RETURN we are limited to 38 programs: this program type relies on the eBPF trampoline and is therefore subject to the BPF_MAX_TRAMP_LINKS limit (38 on x86), see https://elixir.bootlin.com/linux/v6.14.11/source/include/linux/bpf.h#L1138.
  2. Suppose we work around the first limit by using kprobes plus the Sigkill action instead. We then hit a second limit of 128 policies. This limit is hardcoded in the Tetragon code as
    __uint(max_entries, POLICY_FILTER_MAX_POLICIES);
    probably to keep memory usage under control (see the sketch after this list). This limit could likely be lifted by making it configurable.
  3. The third issue, which I think we cannot overcome today, is performance overhead. The list of attached programs grows linearly with the number of policies we create: with 500 workloads in the cluster, we end up with 500 programs attached to the same function. This could lead to a noticeable slowdown every time a new process is created, and the impact would be even larger if we extended this approach to other kernel subsystems (e.g., file system or network operations).
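
For limit 2, the structure behind that snippet is roughly the following. This is a sketch with approximate names, not a verbatim copy of Tetragon's source: an outer hash-of-maps whose size is fixed at load time by POLICY_FILTER_MAX_POLICIES, holding one inner map per policy with the cgroup IDs of the pods matched by its podSelector (this corresponds to the per-policy policy_1_map inner map visible in the bpftool output in the next section).

```c
// Sketch only (approximate names, not verbatim Tetragon source).
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define POLICY_FILTER_MAX_POLICIES 128 /* the hardcoded 128-policy cap */

/* Inner map template: cgroup ids of the pods selected by one policy. */
struct policy_cgroup_ids {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 32768); /* cf. the 32768-entry policy_1_map below */
    __type(key, __u64);         /* cgroup id */
    __type(value, __u8);
};

/* Outer map: policy id -> inner map of selected cgroups. Its max_entries
 * is what caps the number of pod-selector policies per node. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, POLICY_FILTER_MAX_POLICIES);
    __type(key, __u32); /* policy id */
    __array(values, struct policy_cgroup_ids);
} policy_filter_maps SEC(".maps");
```

Raising POLICY_FILTER_MAX_POLICIES (or making it configurable) mainly grows the outer map; the bulk of the memory cost comes from the per-policy inner maps, which brings us to the next section.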

eBPF maps point of view

For each of the above policies, roughly 50 eBPF maps are loaded. Most of them have just 1 entry (probably because they are unused), but others consume a significant amount of memory. The reported memlock for each policy is around 8 MB. The most memory-intensive maps appear to be:

// inner map for each loaded policy with pod selectors
721: hash  name policy_1_map  flags 0x0 
 key 8B  value 1B  max_entries 32768  memlock 2624000B
 pids tetragon(63603)

// Still need to check if this is really needed (?)
764: lru_hash  name socktrack_map  flags 0x0
 key 8B  value 16B  max_entries 32000  memlock 2829696B
 btf_id 947
 pids tetragon(63603)

// map used for overriding the return value
766: hash  name override_tasks  flags 0x0
 key 8B  value 4B  max_entries 32768  memlock 2624000B
 btf_id 949
 pids tetragon(63603)

As you may imagine, here too, having 500 deployments in the cluster could lead to significant memory usage on the node (8 MB × 500 = 4 GB).
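
For reference, the ~8 MB per-policy figure is consistent with just the three large maps listed above:

$$
2{,}624{,}000\ \text{B} + 2{,}829{,}696\ \text{B} + 2{,}624{,}000\ \text{B} = 8{,}077{,}696\ \text{B} \approx 8\ \text{MB per policy}.
$$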

Summary

With this issue, we simply want to highlight the scalability limitations we are currently facing, and I would love your feedback. Do you see any mistakes in this analysis? I'm fairly new to Tetragon, so I may have missed a way to overcome some of the limitations above.
If you confirm these are real limitations and you are interested in supporting this use case, we could discuss possible ideas to address them.

Thank you for your time!
