Commit 0a9e39f

Add support for tainted nodes (#30)
Specify node taints to tolerate for the checks. Use the CHECK_TOLERATIONS environment variable in the format: "key=value:NoSchedule, key=value:NoSchedule"
1 parent 51c8cf1 commit 0a9e39f

6 files changed: +93 -13 lines changed

CONTRIBUTORS.md

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
 - [Chris Hirsch](mailto:[email protected])
+- [Lee Smith](mailto:[email protected])

README.md

Lines changed: 3 additions & 2 deletions
@@ -10,7 +10,7 @@ Once the contents of the `PVC` have been validated, the check `Job`, the init `J

 Container resource requests are set to `15 millicores` of CPU and `20Mi` units of memory and use the Alpine image `alpine:3.11` for the `Job` and a default of `1Gi` for the `PVC`. If the environment variable `CHECK_STORAGE_PVC_SIZE` is set then the value of that will be used instead of the default.

-By default, the nodes of the cluster will be discovered and only those nodes that are `untainted`, in a `Ready` state and not in the role of `master` will be used. If node(s) need to be `ignored` for whatever reason, then the environment variable `CHECK_STORAGE_IGNORED_CHECK_NODES` should be used a space or comma separated list of nodes should be supplied. If `auto-discovery` is not desired, the environment variable `CHECK_STORAGE_ALLOWED_CHECK_NODES` can be used and a space or comma separated list of nodes that should be checked needs to be supplied. If `CHECK_STORAGE_ALLOWED_CHECK_NODES` is supplied and a node in that list matches a node in the ignored (`CHECK_STORAGE_IGNORED_CHECK_NODES`) list then that node will be ignored.
+By default, the nodes of the cluster will be discovered and only those nodes that are `untainted` (or have taints that are all specified in `CHECK_TOLERATIONS`), in a `Ready` state and not in the role of `master` will be used. If node(s) need to be `ignored` for whatever reason, then the environment variable `CHECK_STORAGE_IGNORED_CHECK_NODES` should be used a space or comma separated list of nodes should be supplied. If `auto-discovery` is not desired, the environment variable `CHECK_STORAGE_ALLOWED_CHECK_NODES` can be used and a space or comma separated list of nodes that should be checked needs to be supplied. If `CHECK_STORAGE_ALLOWED_CHECK_NODES` is supplied and a node in that list matches a node in the ignored (`CHECK_STORAGE_IGNORED_CHECK_NODES`) list then that node will be ignored.

 By default, the storage check `Job` and initialize storage check `Job` will use Alpine's `alpine:3.11` image. If a different image is desired, use the environment variable `CHECK_STORAGE_IMAGE` or `CHECK_STORAGE_INIT_IMAGE` depending on which image should be changed.

@@ -33,7 +33,7 @@ This check follows the list of actions in order during the run of the check:
 1. Looks for old storage check job, storage init job, and PVC belonging to this check and cleans them up.
 2. Creates a PVC in the namespace and waits for the PVC to be ready.
 3. Creates a storage init configuration, applies it to the namespace, and waits for the storage init job to come up and initialize the PVC with known data.
-4. Determine which nodes in the cluster are going to run the storage check by auto-discovery or a list supplied nodes via the `CHECK_STORAGE_IGNORED_CHECK_NODES` and `CHECK_STORAGE_ALLOWED_CHECK_NODES` environment variables.
+4. Determine which nodes in the cluster are going to run the storage check by auto-discovery or a list supplied nodes via the `CHECK_STORAGE_IGNORED_CHECK_NODES` and `CHECK_STORAGE_ALLOWED_CHECK_NODES` environment variables. Nodes with taints will not be included unless the toleration is configured in `CHECK_TOLERATIONS`.
 5. For each node that needs a check, creates a storage check configuration, applies it to the namespace, and waits for the storage check job to start and validate the contents of storage on each desired node.
 6. Tear everything down once completed.

@@ -55,6 +55,7 @@ This check follows the list of actions in order during the run of the check:
 - `CHECK_POD_CPU_LIMIT`: Check pod deployment CPU limit value. Calculated in decimal SI units `(75 = 75m cpu)`.
 - `CHECK_POD_MEM_REQUEST`: Check pod deployment memory request value. Calculated in binary SI units `(20 * 1024^2 = 20Mi memory)`.
 - `CHECK_POD_MEM_LIMIT`: Check pod deployment memory limit value. Calculated in binary SI units `(75 * 1024^2 = 75Mi memory)`.
+- `CHECK_TOLERATIONS`: Check pod tolerations of node taints. In the format "key=value:effect,key=value:effect". By default no taints are tolerated.
 - `ADDITIONAL_ENV_VARS`: Comma separated list of `key=value` variables passed into the pod's containers.
 - `SHUTDOWN_GRACE_PERIOD`: Amount of time in seconds the shutdown will allow itself to clean up after an interrupt signal (default=`30s`).
 - `DEBUG`: Verbose debug logging.
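
The node selection rules described in the README text above can be sketched roughly as follows. This is a minimal illustration, not the check's actual code: the `node` type, `selectNodes`, `splitList`, and `allTolerated` are hypothetical names, taints are modeled as plain `key=value:effect` strings, and the sketch assumes that supplying an allowed list skips auto-discovery entirely (one reading of the README, not confirmed by the diff).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical, simplified stand-in for a cluster node.
type node struct {
	name   string
	ready  bool
	master bool
	taints []string // each formatted as "key=value:effect"
}

// splitList splits a space or comma separated list of node names.
func splitList(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool { return r == ',' || r == ' ' })
}

// allTolerated reports whether every taint appears in the tolerated set.
// A node without taints is trivially tolerated.
func allTolerated(taints []string, tolerated map[string]bool) bool {
	for _, t := range taints {
		if !tolerated[t] {
			return false
		}
	}
	return true
}

// selectNodes returns the sorted names of the nodes that should run the check.
// Assumption: a non-empty allowed list replaces auto-discovery; the ignored
// list is applied last in both cases.
func selectNodes(nodes []node, allowed, ignored string, tolerated map[string]bool) []string {
	ignoredSet := map[string]bool{}
	for _, name := range splitList(ignored) {
		ignoredSet[name] = true
	}

	var candidates []string
	if allowedNames := splitList(allowed); len(allowedNames) > 0 {
		candidates = allowedNames
	} else {
		// Auto-discovery: Ready, not a master, and no untolerated taints.
		for _, n := range nodes {
			if n.ready && !n.master && allTolerated(n.taints, tolerated) {
				candidates = append(candidates, n.name)
			}
		}
	}

	var selected []string
	for _, name := range candidates {
		if !ignoredSet[name] {
			selected = append(selected, name)
		}
	}
	sort.Strings(selected)
	return selected
}

func main() {
	nodes := []node{
		{name: "worker-1", ready: true},
		{name: "worker-2", ready: true, taints: []string{"dedicated=storage:NoSchedule"}},
		{name: "worker-3", ready: false},
		{name: "master-1", ready: true, master: true},
	}
	tolerated := map[string]bool{"dedicated=storage:NoSchedule": true}

	fmt.Println(selectNodes(nodes, "", "", nil))                 // no tolerations configured
	fmt.Println(selectNodes(nodes, "", "worker-1", tolerated))   // tainted node tolerated, worker-1 ignored
	fmt.Println(selectNodes(nodes, "worker-1, worker-3", "worker-3", nil)) // explicit list minus ignored
}
```

With no tolerations configured, the tainted `worker-2` drops out of discovery, which matches the documented default of tolerating no taints.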

cmd/storage-check/input.go

Lines changed: 42 additions & 0 deletions
@@ -19,6 +19,7 @@ import (

 	kh "github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient"
 	log "github.com/sirupsen/logrus"
+	corev1 "k8s.io/api/core/v1"
 )

 // parseDebugSettings parses debug settings and fatals on errors.
@@ -219,4 +220,45 @@ func parseInputValues() {
 		shutdownGracePeriod = duration
 		log.Infoln("Parsed SHUTDOWN_GRACE_PERIOD:", shutdownGracePeriod)
 	}
+
+	// Parse CHECK_TOLERATIONS in the format "key=value:effect,key=value:effect"
+	if len(tolerationsEnv) > 0 {
+		tolerationSpecs := strings.Split(tolerationsEnv, ",")
+		for _, spec := range tolerationSpecs {
+			parts := strings.Split(spec, ":")
+			if len(parts) != 2 {
+				log.Fatalf("Error: invalid toleration specification: %s", spec)
+			}
+
+			keyValue := parts[0]
+			effect := parts[1]
+
+			keyValueParts := strings.Split(strings.TrimSpace(keyValue), "=")
+			if len(keyValueParts) != 2 {
+				log.Fatalf("Error: invalid key-value specification: %s", keyValue)
+			}
+
+			key := keyValueParts[0]
+			value := keyValueParts[1]
+
+			var taintEffect corev1.TaintEffect
+			switch strings.TrimSpace(effect) {
+			case "NoSchedule":
+				taintEffect = corev1.TaintEffectNoSchedule
+			case "PreferNoSchedule":
+				taintEffect = corev1.TaintEffectPreferNoSchedule
+			default:
+				log.Fatalf("Error: unknown effect value: %s", effect)
+			}
+
+			toleration := corev1.Toleration{
+				Key:      key,
+				Operator: "Equal",
+				Value:    value,
+				Effect:   taintEffect,
+			}
+
+			tolerations = append(tolerations, toleration)
+		}
+	}
 }
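
The parsing added in the hunk above could also be written as a standalone function that returns an error instead of calling `log.Fatalf`, which makes it easier to exercise in isolation. The following is only a sketch under stated assumptions: the `toleration` struct is a hypothetical stand-in for `corev1.Toleration` so the example has no Kubernetes dependency, and `parseTolerations` is an illustrative name that does not appear in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// toleration is a hypothetical stand-in for corev1.Toleration.
type toleration struct {
	Key, Operator, Value, Effect string
}

// parseTolerations parses "key=value:effect,key=value:effect".
// Like the hunk above, it accepts only the NoSchedule and
// PreferNoSchedule effects and always uses the "Equal" operator.
func parseTolerations(env string) ([]toleration, error) {
	var out []toleration
	if env == "" {
		return out, nil
	}
	for _, spec := range strings.Split(env, ",") {
		parts := strings.Split(spec, ":")
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid toleration specification: %q", spec)
		}
		keyValue := strings.Split(strings.TrimSpace(parts[0]), "=")
		if len(keyValue) != 2 {
			return nil, fmt.Errorf("invalid key-value specification: %q", parts[0])
		}
		effect := strings.TrimSpace(parts[1])
		switch effect {
		case "NoSchedule", "PreferNoSchedule":
			// accepted
		default:
			return nil, fmt.Errorf("unknown effect value: %q", effect)
		}
		out = append(out, toleration{
			Key:      keyValue[0],
			Operator: "Equal",
			Value:    keyValue[1],
			Effect:   effect,
		})
	}
	return out, nil
}

func main() {
	tols, err := parseTolerations("dedicated=storage:NoSchedule, zone=a:PreferNoSchedule")
	fmt.Println(tols, err)

	// NoExecute is not in the accepted set, so this reports an error.
	_, err = parseTolerations("dedicated=storage:NoExecute")
	fmt.Println(err)
}
```

Returning an error keeps the accepted grammar visible in one place: a spec must contain exactly one `:` and exactly one `=`, so keys or values that themselves contain those characters are rejected.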

cmd/storage-check/main.go

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,7 @@ import (
 	kh "github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient"
 	"github.com/Comcast/kuberhealthy/v2/pkg/kubeClient"
 	log "github.com/sirupsen/logrus"
+	corev1 "k8s.io/api/core/v1"
 	"k8s.io/client-go/kubernetes"
 )

@@ -110,6 +111,9 @@ var (
 	shutdownGracePeriodEnv = os.Getenv("SHUTDOWN_GRACE_PERIOD")
 	shutdownGracePeriod time.Duration

+	tolerationsEnv = os.Getenv("CHECK_TOLERATIONS")
+	tolerations []corev1.Toleration
+
 	// Time object used for the check.
 	now time.Time

cmd/storage-check/run_check.go

Lines changed: 42 additions & 11 deletions
@@ -14,6 +14,7 @@ import (
 	"context"
 	"fmt"
 	"os"
+	"reflect"
 	"strings"
 	"time"

@@ -28,7 +29,6 @@ type Node struct {
 	schedulable bool
 	override bool
 	status v1.NodeStatus
-	effect v1.TaintEffect
 }

 // runStorageCheck sets up a storage PVC, a storage init and storage check and applies it to the cluster.
@@ -169,19 +169,17 @@ func runStorageCheck() {
 		node.name = n.Name
 		node.status = n.Status

-		// TODO Need to work through more logic to see if this should be configurable
 		if len(n.Spec.Taints) > 0 {
-			// By defalt, only schedule the storage checks on untained (nodes that are Ready and not masters) nodes
-			for _, t := range n.Spec.Taints {
-				log.Debugln("t.Effect=", t.Effect)
-				log.Debugln("t.Key=", t.Key)
-				log.Debugln("t.Value=", t.Value)
-				log.Infoln("Adding node ", n.Name, " which is tainted as ", t.Effect, " NOT be schduled for check")
-				node.effect = t.Effect
-				node.schedulable = false
+			// By default, only schedule the storage checks on untainted nodes
+			node.schedulable = toleratesAllTaints(tolerations, n.Spec.Taints)
+
+			status := "be"
+			if !node.schedulable {
+				status = "NOT be"
 			}
+			log.Printf("Adding node %s with taints %s to %s scheduled for check", n.Name, formatTaints(n.Spec.Taints), status)
 		} else {
-			log.Infoln("Adding untainted node ", n.Name, " to be schduled for check")
+			log.Infoln("Adding untainted node ", n.Name, " to be scheduled for check")
 			node.schedulable = true
 		}
 		checkNodes[node.name] = node
@@ -370,3 +368,36 @@ func cleanUpOrphanedResources(ctx context.Context) chan error {

 	return cleanUpChan
 }
+
+func toleratesAllTaints(tolerations []v1.Toleration, nodeTaints []v1.Taint) bool {
+	for _, nodeTaint := range nodeTaints {
+		tolerated := false
+		for _, toleration := range tolerations {
+			if reflect.DeepEqual(toleration, v1.Toleration{
+				Key:      nodeTaint.Key,
+				Value:    nodeTaint.Value,
+				Operator: v1.TolerationOpEqual,
+				Effect:   nodeTaint.Effect,
+			}) {
+				tolerated = true
+				break
+			}
+		}
+		if !tolerated {
+			return false
+		}
+	}
+	return true
+}
+
+func formatTaints(taints []v1.Taint) string {
+	var taintStrings []string
+
+	for _, taint := range taints {
+		// Format each taint as "key=value:effect"
+		taintString := fmt.Sprintf("%s=%s:%s", taint.Key, taint.Value, taint.Effect)
+		taintStrings = append(taintStrings, taintString)
+	}
+
+	return strings.Join(taintStrings, ",")
+}
cmd/storage-check/storage.go

Lines changed: 1 addition & 0 deletions
@@ -141,6 +141,7 @@ func initializeStorageConfig(jobName string, pvcName string) *batchv1.Job {
 						Name: "data",
 						VolumeSource: corev1.VolumeSource{PersistentVolumeClaim: pvc},
 					}},
+					Tolerations: tolerations,
 				},
 			},
 		}
