From 0afcd1bc0ecc396171f59ecaaefaed1d9ad12bf3 Mon Sep 17 00:00:00 2001 From: Akihiro Suda Date: Sat, 21 Jun 2025 06:51:02 +0900 Subject: [PATCH 1/2] KEP-2033: KubeletInUserNamespace: update the template Only the template is updated in this commit. The actual content will be updated in follow-up commits. Signed-off-by: Akihiro Suda --- .../README.md | 506 +++++++++++++++--- 1 file changed, 421 insertions(+), 85 deletions(-) diff --git a/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md b/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md index 94c802a5400..5624bdb203f 100644 --- a/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md +++ b/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md @@ -1,4 +1,11 @@ + -### User Stories +### User Stories (Optional) +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + Tests are present in several subproject repos and third party repos: - https://github.com/kubernetes-sigs/kind/blob/v0.17.0/.github/workflows/cgroup2.yaml#L24 - https://github.com/kubernetes/minikube/blob/v1.29.0/.github/workflows/pr.yml#L293-L410 @@ -509,6 +522,81 @@ Tests will be added to `kubernetes/test-infra` as well when the [`k8s-infra-prow is upgraded to use cgroup v2. This will probably automatically happen when [GKE bumps up their "regular" channel to Kubernetes v1.26 or later](https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config). 
+##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +##### e2e tests + + + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + ### Graduation Criteria @@ -602,9 +708,9 @@ components? What are the guarantees? Make sure this is in the test plan. Consider the following in developing a version skew strategy for this enhancement: -- Does this enhancement involve coordinating behavior in the control plane and - in the kubelet? How does an n-2 kubelet without this feature available behave - when this feature is used? +- Does this enhancement involve coordinating behavior in the control plane and nodes? +- How does an n-3 kubelet or kube-proxy without this feature available behave when this feature is used? +- How does an n-1 kube-controller-manager or kube-scheduler without this feature available behave when this feature is used? - Will any other components on the node change? For example, changes to CSI, CRI or CNI may require updating that component before the kubelet. --> @@ -619,11 +725,10 @@ Production readiness reviews are intended to ensure that features merging into Kubernetes are observable, scalable and supportable; can be safely operated in production environments, and can be disabled or rolled back in the event they cause increased failures in production. 
See more in the PRR KEP at -https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md. +https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness. -The production readiness review questionnaire must be completed for features in -v1.19 or later, but is non-blocking at this time. That is, approval is not -required in order to be in the release. +The production readiness review questionnaire must be completed and approved +for the KEP to move to `implementable` status and be included in the release. In some cases, the questions below should also have answers in `kep.yaml`. This is to enable automation to verify the presence of the review, and to reduce review @@ -634,17 +739,35 @@ The KEP must have a approver from the team. Please reach out on the [#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if you need any help or guidance. - --> ### Feature Enablement and Rollback -_This section must be completed when targeting alpha to a release._ + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [X] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `KubeletInUserNamespace` + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? Enabling `KubeletInUsernamespace` feature gate does not automatically execute kubelet in a user namespace. The user namespace has to be created by RootlessKit before running kubelet. @@ -654,67 +777,200 @@ Note that this feature gate does not support separating kubelet's user namespace node components such as CRI. All the node components must run in the same user namespace. -* **Does enabling the feature change any default behavior?** +###### Does enabling the feature change any default behavior? 
+ During Alpha, we will document what workloads will work and what will not work. -* **Can the feature be disabled once it has been enabled (i.e. can we roll back - the enablement)?** +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + N/A, as switching back rootless to rootful requires redeploying the kubelet, and vice versa. -* **What happens if we reenable the feature if it was previously rolled back?** +###### What happens if we reenable the feature if it was previously rolled back? N/A. -* **Are there any tests for feature enablement/disablement?** +###### Are there any tests for feature enablement/disablement? + + CI will run `kind` (Kubernetes in Docker) tests with Rootless Docker/Podman. Tests with a real cluster will be added later as well. ### Rollout, Upgrade and Rollback Planning -_This section must be completed when targeting beta graduation to a release._ + This section will be fulfilled when targeting beta graduation to a release. +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + ### Monitoring Requirements -_This section must be completed when targeting beta graduation to a release._ + + +###### How can an operator determine if the feature is in use by workloads? + + N/A -* **What are the SLIs (Service Level Indicators) an operator can use to determine -the health of the service?** - - [ ] Metrics - - Metric name: - - [Optional] Aggregation method: - - Components exposing the metric: - - [X] Other (treat as last resort) - - Details: Use `systemctl --user is-system-running` to verify whether the processes (RootlessKit, kubelet, kube-proxy, and CRI) are running. 
+###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: -* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?** +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + N/A -* **Are there any missing metrics that would be useful to have to improve observability -of this feature?** +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [X] Other (treat as last resort) + - Details: Use `systemctl --user is-system-running` to verify whether the processes (RootlessKit, kubelet, kube-proxy, and CRI) are running. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? -N/A, but it'd be useful to have the kubelet publish whether or not it is running rootless, as a boolean metric. + + +N/A ### Dependencies + + - Kernel: 5.2 or later is recommended. At least 4.15 or later is required. ([Reason](https://github.com/opencontainers/runc/blob/master/docs/cgroup-v2.md#host-requirements)) - Systemd: 244 or later is recommended. - CRI: containerd >= 1.4, or CRI-O >= 1.22 is required. - OCI: runc >= 1.0-rc91 is required. runc >= 1.0-rc93 is recommended. crun works, too. -_This section must be completed when targeting beta graduation to a release._ +###### Does this feature depend on any specific services running in the cluster? 
+ + -* **Does this feature depend on any specific services running in the cluster?** - [RootlessKit] - Usage description: sets up namespaces, and forwards incoming TCP & UDP packets - Impact of its outage on the feature: kubelet, kube-proxy, CRI, and all container processes will crash, and will be restarted by systemd. @@ -732,58 +988,138 @@ Both Docker and Podman use RootlessKit and slirp4netns (or VPNkit, optionally) i ### Scalability -_For alpha, this section is encouraged: reviewers should consider these questions -and attempt to answer them._ + -* **Will enabling / using this feature result in any new API calls?** +###### Will enabling / using this feature result in any new API calls? -No. - -* **Will enabling / using this feature result in introducing new API types?** + No. -* **Will enabling / using this feature result in any new calls to the cloud -provider?** +###### Will enabling / using this feature result in introducing new API types? + + No. -* **Will enabling / using this feature result in increasing size or count of -the existing API objects?** +###### Will enabling / using this feature result in any new calls to the cloud provider? + + No. -* **Will enabling / using this feature result in increasing time taken by any -operations covered by [existing SLIs/SLOs]?** +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + No. -* **Will enabling / using this feature result in non-negligible increase of -resource usage (CPU, RAM, disk, IO, ...) in any components?** +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + RootlessKit and slirp4netns may face high CPU and memory consumption. 
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + ### Troubleshooting + -_This section must be completed when targeting beta graduation to a release._ - -* **How does this feature react if the API server and/or etcd is unavailable?** +###### How does this feature react if the API server and/or etcd is unavailable? Same as traditional rootful Kubernetes. -* **What are other known failure modes?** +###### What are other known failure modes? + + Same as traditional rootful Kubernetes. +###### What steps should be taken if SLOs are not being met to determine the problem? + ## Implementation History -[ ] I/we understand the owners of the involved components may require updates to +[X] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement. -Tests are present in several subproject repos and third party repos: -- https://github.com/kubernetes-sigs/kind/blob/v0.17.0/.github/workflows/cgroup2.yaml#L24 -- https://github.com/kubernetes/minikube/blob/v1.29.0/.github/workflows/pr.yml#L293-L410 -- https://github.com/k3s-io/k3s/blob/v1.26.1+k3s1/.github/workflows/cgroup.yaml#L92-L99 -- https://github.com/rootless-containers/usernetes/blob/v20221007.0/.cirrus.yml +See [e2e tests](#e2e-tests) below. -Tests will be added to `kubernetes/test-infra` as well when the [`k8s-infra-prow-build`](https://github.com/kubernetes/k8s.io/blob/a071c4ed0823f193ee29e2f14e191be42dc1a1f0/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L78) cluster -is upgraded to use cgroup v2. -This will probably automatically happen when [GKE bumps up their "regular" channel to Kubernetes v1.26 or later](https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config). 
+Additional tests are present in several subproject repos and third party repos:
+- https://github.com/kubernetes-sigs/kind/blob/v0.29.0/.github/workflows/vm.yaml#L24
+- https://github.com/kubernetes/minikube/blob/v1.36.0/.github/workflows/pr.yml#L299-L415
+- https://github.com/k3s-io/k3s/blob/v1.33.1%2Bk3s1/.github/workflows/e2e.yaml#L56
+- https://github.com/rootless-containers/usernetes/blob/gen2-v20250501.0/.github/workflows/main.yaml
+  - Covers multi-node clusters with Flannel (VXLAN)
+  - Covers several host distributions (Ubuntu, CentOS Stream, and Fedora)
 ##### Prerequisite testing updates
@@ -550,7 +554,7 @@ This can inform certain test coverage improvements that we want to do before
 extending the production code to implement this enhancement.
 -->
-- ``: `` - ``
+N/A, as unit tests do not make sense here.
 ##### Integration tests
@@ -576,7 +580,7 @@ This can be done with:
 - a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
 -->
-- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
+N/A, as integration tests do not make sense here.
 ##### e2e tests
@@ -595,7 +599,31 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
 If e2e tests are not necessary or useful, explain why.
 -->
-- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
+`NodeConformance` tests are executed using [kubetest2-kindinv](https://github.com/rootless-containers/kubetest2-kindinv).
+
+"kindinv" stands for "Kubernetes in (Rootless) Docker in (GCE) VM".
+GCE VM is used for enabling systemd that is required by Rootless Docker to set up cgroup v2.
+
+```bash
+exec kubetest2 kindinv \
+  --boskos-location=http://boskos.test-pods.svc.cluster.local \
+  --gcp-zone=us-central1-b \
+  --instance-image=ubuntu-os-cloud/ubuntu-2204-lts \
+  --instance-type=n2-standard-4 \
+  --kind-rootless \
+  --user=rootless \
+  --build \
+  --up \
+  --down \
+  --test=ginkgo \
+  -- \
+  --focus-regex='\[NodeConformance\]' \
+  --skip-regex='\[Environment:NotInUserNS\]|\[Slow\]' \
+  --parallel=8
+```
+
+- Prow manifest: https://github.com/kubernetes/test-infra/blob/4b7824ff1cfe00c36062035ab6aea3bb6c2e6ba2/config/jobs/kubernetes/sig-testing/kubernetes-kind.yaml#L615-L678
+- Logs: https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kind-rootless
 ### Graduation Criteria
@@ -677,9 +705,7 @@ in back-to-back releases.
 - Beta: e2e tests coverage. Requires [the cgroup v2 KEP](../20191118-cgroups-v2.md ) to reach Beta or GA.
-  To move to beta, we need clarity if we intend to define two separate types of conformance suites:
-  - kubernetes clusters that can run privileged workloads
-  - kubernetes cluster that are restricted to run unprivileged workloads only
+  The tests are covered by `NodeConformance` tests (see above).
 - GA: Assuming no negative user feedback based on production experience, promote after >= 2 releases in beta. Requires [the cgroup v2 KEP](../20191118-cgroups-v2.md ) to reach GA.
@@ -715,7 +741,8 @@ enhancement:
   CRI or CNI may require updating that component before the kubelet.
 -->
-N/A
+N/A.
+This KEP only affects the internal of kubelet, and does not affect any API.
 ## Production Readiness Review Questionnaire
@@ -761,7 +788,7 @@ well as the [existing list] of feature gates.
 - [X] Feature gate (also fill in values in `kep.yaml`)
   - Feature gate name: `KubeletInUserNamespace`
-  - Components depending on the feature gate:
+  - Components depending on the feature gate: kubelet
 - [ ] Other
   - Describe the mechanism:
   - Will enabling / disabling the feature require downtime of the control
@@ -784,7 +811,8 @@ Any change of default behavior may be surprising to users or break existing
 automations, so be extremely careful here.
 -->
-During Alpha, we will document what workloads will work and what will not work.
+The limitation is same as Rootless Docker, Podman, etc.
+See .
 ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
@@ -799,11 +827,11 @@ feature.
 NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.
 -->
-N/A, as switching back rootless to rootful requires redeploying the kubelet, and vice versa.
+Yes, by turning off the feature gate.
 ###### What happens if we reenable the feature if it was previously rolled back?
-N/A.
+Nothing happens.
 ###### Are there any tests for feature enablement/disablement?
@@ -820,8 +848,7 @@ You can take a look at one potential example of such test in:
 https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
 -->
-CI will run `kind` (Kubernetes in Docker) tests with Rootless Docker/Podman.
-Tests with a real cluster will be added later as well.
+Yes. See [Test Plan](#test-plan).
 ### Rollout, Upgrade and Rollback Planning
@@ -829,8 +856,6 @@ Tests with a real cluster will be added later as well.
 This section must be completed when targeting beta to a release.
 -->
-This section will be fulfilled when targeting beta graduation to a release.
-
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
+Rollout: Rolling out requires recreating a new node instance, in a UserNS.
+Typical failures:
+- [subuids are not allocated](https://rootlesscontaine.rs/getting-started/common/subuid/)
+- [cgroup v2 delegation is not enabled](https://rootlesscontaine.rs/getting-started/common/cgroup2/)
+
+Rollback: this question is not applicable. Rolling back requires recreating a new node instance.
+
 ###### What specific metrics should inform a rollback?
+CrashLoopBackOffs
+
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+This question is not applicable. Rolling out and rolling back requires recreating a new node instance.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+No
+
 ### Monitoring Requirements
-N/A
+They can determine if a Pod is running on a node that is running in UserNS.
 ###### How can someone using this feature know that it is working for their instance?
@@ -894,8 +932,8 @@ and operation of this feature.
 Recall that end users cannot usually observe component logs or access metrics.
 -->
-- [ ] Events
-  - Event Reason:
+- [X] Events
+  - Event Reason: No CrashLoopBackOff
 - [ ] API .status
   - Condition name:
   - Other field:
@@ -919,7 +957,7 @@ These goals will help you determine what you need to measure (SLIs) in the next
 question.
 -->
-N/A
+99.9% of /health requests per day finish with 200 code
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
@@ -941,7 +979,7 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
 implementation difficulties, etc.).
 -->
-N/A
+No
 ### Dependencies
@@ -1058,6 +1096,8 @@ Think about adding additional work or introducing new steps in between
 [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
 -->
+No.
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+No
+
 ### Troubleshooting