feat: added config sdk and refactored FQ with bug fixes #143
- Added config SDK and refactored FQ with bug fixes
- Added fake change stream client for MongoDB testing
- Added e2e tests
- Removed ini methods in configmanager
- Added missing configmanager to makefile and lint-test
- Merged upstream changes and resolved conflicts
6bae3e3 to 4e6c902
Greptile Overview
Greptile Summary
This PR introduces a new configmanager SDK for centralized TOML and environment variable configuration management, and performs a major architectural refactoring of the fault-quarantine module. The configmanager provides generic utilities (LoadTOMLConfig, GetEnvVar, ReadEnvVars) that extract previously duplicated configuration logic into a reusable library. The fault-quarantine refactoring consolidates Kubernetes operations into a new FaultQuarantineClient, extracts initialization into a dedicated initializer package, moves MongoDB event watching into a separate EventWatcher abstraction, and simplifies the reconciler by eliminating the dual annotation cache. The rule evaluation state machine is simplified from 5 states to 2 (Success/Failed), and the circuit breaker switches from callback functions to interface-based dependency injection. The changes remove 2,877 lines while adding 3,157 new lines across test infrastructure (envtest adoption), improved error handling, and better separation of concerns.
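The summary names GetEnvVar and ReadEnvVars but does not show their signatures. As a rough illustration only, a type-safe getter with a default and a validator could look like the sketch below. The names `getEnvVar`, `lookupEnv`, `parseInt`, and `positive` are hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupEnv is an indirection over os.LookupEnv so the lookup can be swapped in tests.
var lookupEnv = os.LookupEnv

// getEnvVar is a hypothetical helper, not the PR's actual GetEnvVar.
// It reads name, falls back to *def when the variable is unset, parses the raw
// string with parse, and then runs the optional validate function.
// A nil def marks the variable as required.
func getEnvVar[T any](name string, def *T, parse func(string) (T, error), validate func(T) error) (T, error) {
	var zero T

	raw, ok := lookupEnv(name)
	if !ok {
		if def == nil {
			return zero, fmt.Errorf("required environment variable %s is not set", name)
		}
		return *def, nil
	}

	val, err := parse(raw)
	if err != nil {
		return zero, fmt.Errorf("parsing %s=%q: %w", name, raw, err)
	}

	if validate != nil {
		if err := validate(val); err != nil {
			return zero, fmt.Errorf("validating %s: %w", name, err)
		}
	}

	return val, nil
}

// parseInt adapts strconv.Atoi to the parse parameter.
func parseInt(s string) (int, error) { return strconv.Atoi(s) }

// positive is an example validator that rejects zero and negative values.
func positive(v int) error {
	if v <= 0 {
		return fmt.Errorf("value %d must be positive", v)
	}
	return nil
}

func main() {
	def := 1
	workers, err := getEnvVar("EXAMPLE_WORKERS", &def, parseInt, positive)
	fmt.Println(workers, err)
}
```

Passing the default as a pointer keeps "optional with a zero-valued default" distinct from "required", which is one way to avoid the required-variable ambiguity the review raises for env.go.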
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| configmanager/go.mod | 0/5 | References non-existent Go versions (1.24.0 and 1.24.8) causing build failure |
| fault-quarantine-module/Makefile | 2/5 | TEST_SETUP_COMMANDS variable contains broken Make syntax (eval with trailing &&) |
| fault-quarantine-module/main.go | 2/5 | nodeInformer.Run() blocks forever preventing reconciler from starting on line 106 |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 2/5 | Uses klog.Fatalf for transient errors causing process crashes; defer never executes |
| fault-quarantine-module/pkg/informer/k8s_client.go | 3/5 | Missing nil checks for nodeInformer causing potential panics; hardcoded 55-char truncation |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 3/5 | Unchecked type assertions in updateTaintMaps and collectTaintsToApply risk panics |
| fault-quarantine-module/pkg/common/common.go | 3/5 | Removes 3 evaluation result states; callers must be updated to handle simplified enum |
| fault-quarantine-module/pkg/informer/node_informer.go | 3/5 | Race condition between totalNodes read and indexer query in GetNodeCounts |
| configmanager/env.go | 3/5 | Variadic parameter parsing overwrites validators/defaults silently; incorrect required-var handling |
| fault-quarantine-module/pkg/breaker/types.go | 4/5 | Type inconsistency between CircuitBreakerConfig.Percentage (int) and Config.TripPercentage (float64) |
| fault-quarantine-module/pkg/breaker/breaker.go | 4/5 | Refactors to interface-based design improving testability; removes GPU-specific language |
| configmanager/loader.go | 4/5 | Generic TOML loader lacks input validation (empty path, nil config pointer) |
| configmanager/loader_test.go | 4/5 | Tests only happy path; missing error cases for malformed TOML and missing files |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 4/5 | Comprehensive envtest coverage but contains typos in constants and hardcoded assumptions |
| store-client-sdk/pkg/storewatcher/watchStoreMock.go | 4/5 | EventsChan never closed by mock; Start method doesn't track context parameter |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 5/5 | Refactors high-complexity function into smaller helpers; renames error state for consistency |
| fault-quarantine-module/pkg/config/config.go | 5/5 | Extracts LoadTomlConfig to configmanager; keeps domain-specific type definitions local |
| Makefile | 5/5 | Adds configmanager to GO_MODULES and lint-test targets following existing patterns |
| .github/workflows/lint-test.yml | 5/5 | Adds configmanager to CI lint-test matrix |
| .github/dependabot.yml | 5/5 | Adds Dependabot configuration for configmanager Go module |
| fault-quarantine-module/Dockerfile | 5/5 | Copies configmanager directory into build context alongside other dependencies |
| configmanager/Makefile | 5/5 | Standard module Makefile with HAS_DOCKER=0 appropriate for library modules |
| configmanager/go.sum | 5/5 | Auto-generated checksums for BurntSushi/toml v1.4.0 dependency |
| fault-quarantine-module/go.sum | 5/5 | Adds controller-runtime dependencies for refactored architecture |
| fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go | 5/5 | Removes unused kubernetes.Interface parameter simplifying API |
Confidence score: 1/5
- This PR has multiple critical issues that will cause immediate build and runtime failures if merged
- Score reflects blocking bugs: non-existent Go version preventing compilation, broken Makefile syntax, blocking goroutine preventing reconciler startup, process-crashing error handling in EventWatcher, and unchecked type assertions causing panics
- Pay close attention to configmanager/go.mod (Go version), fault-quarantine-module/main.go (line 99-106 blocking issue), fault-quarantine-module/pkg/mongodb/event_watcher.go (klog.Fatalf usage), and all unchecked type assertions in reconciler.go and node_informer.go
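The blocking-informer issue flagged for main.go can be pictured with a small stand-in. The sketch below assumes a hypothetical `blockingInformer` whose `Run` blocks until its context is cancelled; it does not use client-go and is not the PR's code. The point is only that a blocking `Run` has to be launched in its own goroutine for later startup steps to be reached.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// blockingInformer is a hypothetical stand-in for the node informer:
// Run does not return until ctx is cancelled.
type blockingInformer struct{}

func (blockingInformer) Run(ctx context.Context) { <-ctx.Done() }

// startAll launches the informer and the reconciler in separate goroutines,
// so the blocking Run call cannot prevent the reconciler from starting.
func startAll(ctx context.Context, inf blockingInformer, reconcile func(context.Context)) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	wg.Add(2)
	go func() { defer wg.Done(); inf.Run(ctx) }()
	go func() { defer wg.Done(); reconcile(ctx) }()
	return wg
}

// reconcilerStarts reports whether the reconciler began running within timeout.
func reconcilerStarts(timeout time.Duration) bool {
	ctx, cancel := context.WithCancel(context.Background())
	started := make(chan struct{})

	wg := startAll(ctx, blockingInformer{}, func(ctx context.Context) {
		close(started) // signal that the reconciler is running
		<-ctx.Done()
	})

	ok := false
	select {
	case <-started:
		ok = true
	case <-time.After(timeout):
	}

	cancel()  // stop both goroutines
	wg.Wait() // wait for a clean shutdown
	return ok
}

func main() {
	fmt.Println("reconciler started:", reconcilerStarts(time.Second))
}
```

Calling `inf.Run(ctx)` directly before starting the reconciler would block in this sketch until cancellation, which is the behaviour the review describes.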
Additional Comments (2)
- fault-quarantine-module/go.mod, line 79-80 (syntax): non-standard import paths go.yaml.in/yaml/v2 and go.yaml.in/yaml/v3; should be gopkg.in/yaml.v2 and gopkg.in/yaml.v3
- fault-quarantine-module/pkg/reconciler/reconciler.go, line 559-560 (logic): Type assertions without checking boolean return values. If the map contains non-string types, this will panic.
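The unchecked type assertion concern can be illustrated with the comma-ok form. This is a minimal sketch assuming a plain `map[string]any`; the actual container used in reconciler.go is not shown in this review, and `toStringMap` is a hypothetical name.

```go
package main

import "fmt"

// toStringMap converts a loosely typed map into map[string]string.
// It uses the two-value ("comma-ok") type assertion, so a non-string value
// produces an error instead of a runtime panic.
func toStringMap(in map[string]any) (map[string]string, error) {
	out := make(map[string]string, len(in))
	for k, v := range in {
		s, ok := v.(string)
		if !ok {
			return nil, fmt.Errorf("key %q: expected string, got %T", k, v)
		}
		out[k] = s
	}
	return out, nil
}

func main() {
	good, err := toStringMap(map[string]any{"taint-key": "NoSchedule"})
	fmt.Println(good, err)

	_, err = toStringMap(map[string]any{"taint-key": 42})
	fmt.Println(err)
}
```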
43 files reviewed, 45 comments
/ok to test 4e6c902

CodeQL Analysis: Found 5 security alert(s)
Merging this branch will not change overall coverage
Greptile Overview
Greptile Summary
This review covers the entire PR from start to finish.
This PR introduces a new shared configuration management SDK in the commons package and performs a comprehensive refactoring of the fault-quarantine module. The configmanager package provides type-safe environment variable parsing and TOML file loading with validation support. The fault-quarantine module's architecture has been restructured from a buffer-queue pattern to an event-driven callback model: the reconciler now receives events from a dedicated MongoDB event watcher via callbacks rather than managing its own buffer and change-stream. Core state management was simplified by eliminating redundant caches: the node informer now provides the single source of truth for quarantined node state via a custom indexer. The K8s client logic was extracted into a dedicated FaultQuarantineClient with an interface-based design, and the circuit breaker now uses dependency injection for K8s operations. New test infrastructure using envtest replaces fake clients for more realistic integration testing. The changes consolidate nearly 200 lines of initialization logic from main.go into a new initializer package, improve metrics organization by moving all Prometheus definitions to a dedicated metrics package, and simplify error handling by reducing rule evaluation result states from five to two (RuleEvaluationSuccess and RuleEvaluationFailed).
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| fault-quarantine-module/main.go | 1/5 | Critical blocking bug: Informer.Run() blocks on line 78, preventing reconciler startup on line 85 |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 3/5 | Major refactor to callback-based event processing with bidirectional dependency on event watcher; silently ignores JSON marshal errors |
| fault-quarantine-module/pkg/breaker/breaker.go | 3/5 | Refactored to use K8s client interface; metric overwrite issue and unconditional os.Exit() on retry exhaustion |
| fault-quarantine-module/pkg/informer/k8s_client.go | 3/5 | New K8s client wrapper with nil pointer risks if SetNodeInformer not called; non-deterministic taint ordering |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 3/5 | New MongoDB change-stream watcher; nil callback panic risk and os.Exit() on errors prevent graceful recovery |
| fault-quarantine-module/pkg/informer/node_informer.go | 4/5 | Refactored to use indexer-based quarantine tracking; removes GPU-specific filtering but introduces eventual consistency model |
| fault-quarantine-module/pkg/initializer/init.go | 4/5 | Centralizes initialization logic; bidirectional dependency between reconciler and event watcher flagged |
| fault-quarantine-module/go.mod | 2/5 | Adds controller-runtime and testify dependencies but introduces invalid yaml dependency paths on lines 76-77 |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4/5 | Comprehensive e2e test suite with 40+ scenarios; potential goroutine leaks and duplicated test setup logic |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 3/5 | New envtest-based integration tests; hardcoded label value check on line 293 doesn't match test setup |
| commons/pkg/configmanager/loader.go | 5/5 | New generic TOML loader utility with clear documentation and proper error wrapping |
| commons/pkg/configmanager/env.go | 4/5 | New environment variable SDK with type-safe parsing; optional vars with empty defaults added to results map may cause ambiguity |
| fault-quarantine-module/pkg/metrics/metrics.go | 4/5 | Metrics moved from reconciler to dedicated package; Reset() in SetFaultQuarantineBreakerState could cause concurrency issues |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 4/5 | Refactored to reduce complexity by extracting helper functions; simplified error handling to binary success/fail model |
| fault-quarantine-module/pkg/breaker/types.go | 4/5 | Refactored from function-based to interface-based dependency injection; State type vs string inconsistency in interface |
| commons/go.mod | 4/5 | Module renamed from logger-sdk to commons and adds TOML dependency; invalid Go version declarations (1.24.0) |
| fault-quarantine-module/pkg/informer/k8s_client_interface.go | 4/5 | Interface moved from reconciler package; removed isUncordon parameter and added HandleManualUncordonCleanup method |
| fault-quarantine-module/pkg/config/config.go | 5/5 | Clean separation: removed LoadTomlConfig function, kept only struct definitions for TOML deserialization |
| fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go | 5/5 | Removed unused kubernetes.Interface parameter from initialization functions; clean refactoring |
| commons/pkg/configmanager/env_test.go | 5/5 | Comprehensive unit tests for environment variable handling across all supported types with edge cases |
| commons/pkg/configmanager/loader_test.go | 4/5 | Basic happy-path test for TOML loader; missing error case coverage (malformed TOML, missing files) |
| fault-quarantine-module/pkg/breaker/breaker_test.go | 4/5 | Refactored to use consolidated mock; added test for flapping node de-duplication bug fix |
| fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go | 4/5 | Refactored to use envtest instead of fake client; polling-based sync may be fragile under load |
| store-client-sdk/pkg/storewatcher/watchStoreMock.go | 4/5 | New thread-safe mock for change-stream watcher; EventsChan documentation unclear about buffer size rationale |
| .github/dependabot.yml | 5/5 | Adds Dependabot monitoring for new commons directory following existing pattern |
| fault-quarantine-module/Makefile | 4/5 | Adds TEST_SETUP_COMMANDS for envtest; eval syntax won't execute correctly in Make variable |
| commons/Makefile | 5/5 | Documentation updated to reflect expanded scope (logging + config management) |
Confidence score: 2/5
- This PR has critical blocking issues that will prevent the fault-quarantine module from functioning in production
- Score reflects one showstopper bug (main.go line 78 blocks reconciler startup), multiple nil pointer panic risks, error handling issues with os.Exit() preventing recovery, and a broken go.mod dependency specification
- Pay close attention to fault-quarantine-module/main.go (blocking bug), fault-quarantine-module/pkg/informer/k8s_client.go (nil pointer panics), fault-quarantine-module/pkg/mongodb/event_watcher.go (nil callback + process termination), fault-quarantine-module/go.mod (invalid yaml paths), and fault-quarantine-module/pkg/reconciler/reconciler.go (silently ignores marshal errors)
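One way to remove the nil-callback panic risk is to reject a nil callback at construction time, so the invocation site never needs a guard. The sketch below is an assumption-laden illustration: `event`, `eventWatcher`, `newEventWatcher`, and `errNilCallback` are hypothetical names, and no MongoDB driver is involved.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a hypothetical, simplified health event.
type event struct{ nodeName string }

// eventWatcher is a hypothetical watcher that forwards events to a callback.
type eventWatcher struct {
	processEvent func(event) error
}

var errNilCallback = errors.New("event watcher: process-event callback must not be nil")

// newEventWatcher refuses a nil callback up front, so handle can call it safely.
func newEventWatcher(cb func(event) error) (*eventWatcher, error) {
	if cb == nil {
		return nil, errNilCallback
	}
	return &eventWatcher{processEvent: cb}, nil
}

// handle forwards one event and returns the callback error to the caller
// instead of terminating the process.
func (w *eventWatcher) handle(ev event) error {
	if err := w.processEvent(ev); err != nil {
		return fmt.Errorf("processing event for node %s: %w", ev.nodeName, err)
	}
	return nil
}

func main() {
	if _, err := newEventWatcher(nil); err != nil {
		fmt.Println("rejected:", err)
	}

	w, _ := newEventWatcher(func(ev event) error {
		fmt.Println("processed", ev.nodeName)
		return nil
	})
	fmt.Println(w.handle(event{nodeName: "node-a"}))
}
```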
Additional Comments (2)
- fault-quarantine-module/pkg/breaker/breaker.go, line 284-290 (logic): when GetTotalNodes returns an error (line 286), the function immediately returns via handleGetTotalNodesError without performing retries. This means API errors bypass the retry loop entirely, and only zero-node results are retried. Is it intentional that transient API errors are not retried? Should transient API errors from GetTotalNodes also be retried, or should they immediately fail the circuit breaker check?
- fault-quarantine-module/pkg/breaker/breaker.go, line 418-430 (logic): GetTotalNodes is called again inside logRetriesExhausted (line 418), but if this call returns an error, the function logs it and exits with os.Exit(1) on line 429. However, if GetTotalNodes succeeds (err == nil), the code falls through to lines 432-439 and also calls os.Exit(1). Both branches exit the process, so the error check on line 420 is redundant: the function always exits regardless of success or failure.
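The two breaker.go comments above leave open whether API errors should be retried. As one possible answer, the sketch below retries both errors and zero-node results inside a single loop and returns a sentinel error rather than calling os.Exit. `ErrRetryExhausted` is named elsewhere in this review; `totalNodesWithRetry`, `isRetryExhausted`, and the parameter names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrRetryExhausted signals that every attempt failed; the caller decides what to do next.
var ErrRetryExhausted = errors.New("circuit breaker: retries exhausted while counting nodes")

// totalNodesWithRetry calls getTotal up to attempts times. Both API errors and
// zero-node results are treated as transient and retried after delay.
func totalNodesWithRetry(getTotal func() (int, error), attempts int, delay time.Duration) (int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		n, err := getTotal()
		switch {
		case err != nil:
			lastErr = err
		case n > 0:
			return n, nil
		default:
			lastErr = errors.New("zero nodes reported")
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return 0, fmt.Errorf("%w: last error: %v", ErrRetryExhausted, lastErr)
}

// isRetryExhausted reports whether err wraps ErrRetryExhausted.
func isRetryExhausted(err error) bool { return errors.Is(err, ErrRetryExhausted) }

func main() {
	calls := 0
	flaky := func() (int, error) {
		calls++
		if calls < 3 {
			return 0, errors.New("transient API error")
		}
		return 5, nil
	}
	n, err := totalNodesWithRetry(flaky, 3, 10*time.Millisecond)
	fmt.Println(n, err, calls)
}
```

Returning the sentinel keeps the exit decision in one place at the call site, which would also remove the redundant success and failure branches that both end in os.Exit(1).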
40 files reviewed, 32 comments
Greptile Overview
Greptile Summary
This review covers only the most recent changes made since the last review, not the entire PR. The developer has addressed multiple previous review comments by: (1) correcting syntax errors in go.mod files (go version and toolchain), (2) adding comprehensive test coverage for the TOML loader and env var config manager, (3) improving test infrastructure by expanding envtest usage in node_informer_test.go and k8s_client_test.go, (4) fixing test naming typos ('COrdonned' → 'Cordoned', 'uncordonedReasonLabelkey' → 'uncordonedReasonLabelKey'), and (5) resolving the duplicate yaml.v3 dependency conflicts. These updates directly respond to feedback about build failures, missing tests, and code quality issues flagged in earlier iterations.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| fault-quarantine-module/go.mod | 5/5 | Corrected invalid Go version (1.24.0 → 1.23.0) and toolchain (go1.24.8 → go1.23.8) per previous review feedback |
| commons/go.mod | 5/5 | Fixed toolchain version to go1.23.8 matching Go version 1.23.0, resolving build compatibility issue |
| fault-quarantine-module/go.sum | 5/5 | Cleaned up duplicate yaml.v3 entries, now consistently using v3.0.1 throughout dependencies |
| commons/pkg/configmanager/loader_test.go | 5/5 | Added three new test cases (happy path, non-existent file, invalid syntax) per previous review requesting test coverage |
| commons/pkg/configmanager/env_test.go | 5/5 | Added comprehensive test coverage for env var reading including edge cases and type validation |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 4.5/5 | Expanded envtest-based tests for taint/cordon operations, addressing request to replace mock clients with envtest |
| fault-quarantine-module/pkg/informer/node_informer.go | 4.5/5 | Refactored to remove GPU-specific filtering and add quarantine annotation indexer for efficient lookups |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4.5/5 | Fixed test naming typo ('COrdonned' → 'Cordoned') and expanded circuit breaker test coverage |
| fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go | 4.5/5 | Migrated from fake clientset to envtest for more accurate integration testing per review guidance |
Confidence score: 4.5/5
- This round of changes successfully addresses critical syntax errors and test gaps from prior reviews, significantly improving build stability and test coverage
- Score reflects resolved issues (Go version, duplicate deps, missing tests) with remaining minor concerns around test timing robustness and circular dependencies that were flagged but not yet resolved
- Pay close attention to fault-quarantine-module/main.go (blocking informer startup), fault-quarantine-module/pkg/mongodb/event_watcher.go (nil callback), and the circular reconciler-watcher dependency in pkg/initializer/init.go
Additional Comments (3)
- commons/go.mod, line 42-43 (syntax): conflicting yaml.v2 imports: line 42 uses invalid go.yaml.in/yaml/v2 path while line 43 uses correct go.yaml.in/yaml/v3. The v2 path should be gopkg.in/yaml.v2
- commons/go.mod, line 53 (syntax): duplicate gopkg.in/yaml.v3 dependency (also on line 43 as go.yaml.in/yaml/v3). This creates ambiguity and may cause import resolution issues
- fault-quarantine-module/pkg/reconciler/node_quarantine_test.go, line 1 (logic): entire test file (799 lines) was deleted, removing all test coverage for node quarantine operations (taint/untaint, cordon/uncordon, annotation handling). No replacement tests found in reconciler_e2e_test.go. This creates a critical gap in test coverage for core fault-quarantine functionality. Were these tests moved to another file, or is the coverage intentionally removed? The deleted tests covered critical edge cases (empty keys, non-existent nodes, partial removals, concurrent modifications) that are now untested
40 files reviewed, 41 comments

CodeQL Analysis: Found 5 security alert(s)
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The developer has made significant progress on the fault-quarantine (FQ) module refactoring. Recent updates include: addressing race condition issues in the circuit breaker's metric writes and taint map updates, fixing nil pointer dereferences in the event watcher and informer, resolving Go module version issues (Go 1.24.x corrected to 1.23.x), and adding envtest-based integration tests for better Kubernetes API validation. The refactoring continues to extract initialization logic, consolidate metrics into a centralized package, and introduce a config SDK for environment variable and TOML configuration management. The architecture now uses a callback-driven event processing model with a MongoDB change stream watcher, replacing the previous polling-based approach. Key structural improvements include per-node mutex locking to prevent concurrent update conflicts, circuit breaker state management via ConfigMap, and a cleaner separation of concerns between the reconciler, informer, and Kubernetes client.
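Per-node mutex locking, as mentioned in the summary above, can be sketched with a map of mutexes guarded by an outer mutex. This is an illustrative stand-in rather than the FaultQuarantineClient implementation; `nodeLocks`, `lockFor`, `withNode`, and `countConcurrently` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeLocks hands out one mutex per node name. The outer mutex only protects
// the map itself, so updates to different nodes do not serialize each other.
type nodeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNodeLocks() *nodeLocks {
	return &nodeLocks{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for node, creating it on first use.
func (n *nodeLocks) lockFor(node string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()

	l, ok := n.locks[node]
	if !ok {
		l = &sync.Mutex{}
		n.locks[node] = l
	}
	return l
}

// withNode runs fn while holding the lock for node.
func (n *nodeLocks) withNode(node string, fn func()) {
	l := n.lockFor(node)
	l.Lock()
	defer l.Unlock()
	fn()
}

// countConcurrently increments a shared counter from many goroutines,
// all targeting the same node, and returns the final value.
func countConcurrently(workers int) int {
	locks := newNodeLocks()
	counter := 0

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			locks.withNode("node-a", func() { counter++ })
		}()
	}
	wg.Wait()
	return counter
}

func main() {
	fmt.Println(countConcurrently(100))
}
```

A map like this grows with the number of distinct node names seen; whether entries need to be evicted when nodes are deleted is a design question this sketch does not address.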
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| fault-quarantine-module/main.go | 3/5 | Fixed blocking informer startup and unused server variable; POD_NAMESPACE still retrieved but not passed to initializer |
| fault-quarantine-module/pkg/breaker/breaker.go | 4/5 | Refactored to use K8sClient interface for better testability; metrics moved to centralized package |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 4/5 | Major refactoring to remove in-memory caching and use callback-driven event processing; added per-node atomic updates |
| fault-quarantine-module/pkg/informer/k8s_client.go | 4/5 | New file consolidating Kubernetes operations with per-node mutex locking and retry logic for conflict resolution |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 4/5 | New MongoDB change stream watcher with callback pattern; requires nil check for processEventCallback before invocation |
| fault-quarantine-module/pkg/initializer/init.go | 4/5 | New centralized initialization module; creates bidirectional dependency between reconciler and event watcher |
| fault-quarantine-module/pkg/metrics/metrics.go | 4/5 | Moved from reconciler package and renamed metrics to PascalCase; added circuit breaker observability |
| commons/pkg/configmanager/env.go | 4/5 | New config SDK for type-safe environment variable reading with validation and default value support |
| commons/pkg/configmanager/loader.go | 4/5 | New TOML configuration loader using generics; missing nil pointer validation for config parameter |
| fault-quarantine-module/pkg/informer/node_informer.go | 4/5 | Refactored to use cache indexing instead of internal state tracking; removed work signal channel |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 4/5 | Extracted complex logic into smaller helper functions; unified error return values |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4/5 | Comprehensive E2E tests using envtest; covers quarantine flows, circuit breaker, and conflict scenarios |
| fault-quarantine-module/go.mod | 4/5 | Added controller-runtime and testify dependencies; moved BurntSushi/toml to indirect |
| commons/go.mod | 3/5 | Renamed module from logger-sdk to commons; has conflicting yaml.v3 versions that need resolution |
| fault-quarantine-module/Makefile | 3/5 | Added envtest setup commands with incorrect eval syntax that won't execute properly |
Confidence score: 3/5
- This PR contains significant architectural improvements but still has several implementation issues that need to be addressed before merging
- Score reflects unresolved blocking bugs (nil pointer dereference in event_watcher.go line 165, blocking informer startup in main.go line 91, conflicting YAML dependencies in commons/go.mod), incomplete refactoring (POD_NAMESPACE retrieved but unused, metrics server never started), and potential race conditions (taint map updates, metric writes)
- Pay close attention to fault-quarantine-module/main.go, fault-quarantine-module/pkg/mongodb/event_watcher.go, commons/go.mod, and fault-quarantine-module/Makefile as these files contain critical bugs that will cause runtime failures or build issues
Additional Comments (4)
- fault-quarantine-module/pkg/breaker/metrics.go, line 1 (logic): entire metrics file deleted, all circuit breaker metrics removed. The SetFaultQuarantineBreakerState and SetFaultQuarantineBreakerUtilization functions called in breaker.go will now fail at runtime with undefined reference errors
- fault-quarantine-module/pkg/reconciler/reconciler_test.go, line 1 (logic): Entire unit test file deleted. The reconciler now has zero unit test coverage, only e2e tests remain. Unit tests are crucial for fast, isolated testing of edge cases, error paths, and individual functions. Verify that all critical reconciler logic is adequately covered by e2e tests, or restore/recreate unit tests. Are all previous test scenarios now covered in reconciler_e2e_test.go? What is the test coverage strategy now that unit tests are removed: do the e2e tests cover all the scenarios previously tested in unit tests, including error handling and edge cases?
- fault-quarantine-module/pkg/common/health_events_buffer.go, line 1 (logic): entire HealthEventBuffer implementation deleted. This file is referenced in the repository and removing it will break compilation. Check that all imports and usages have been migrated to a replacement implementation. What replaces HealthEventBuffer? Have all references to common.HealthEventBuffer, common.NewHealthEventBuffer, and common.HealthEventInfo been removed or updated throughout the codebase?
- commons/go.mod, line 42-53 (syntax): conflicting yaml.v3 dependencies: go.yaml.in/yaml/v2 (line 42), go.yaml.in/yaml/v3 (line 43), and gopkg.in/yaml.v3 (line 53) with different versions (v3.0.4 and v3.0.1). The go.yaml.in import paths are non-standard and should be gopkg.in/yaml.v2 and gopkg.in/yaml.v3. This will cause module resolution failures.
40 files reviewed, 16 comments
Merging this branch changes the coverage (2 decrease, 3 increase)
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. Three targeted bug fixes were applied across test infrastructure and event-watcher shutdown handling in response to previous review feedback. The changes address (1) test-environment inconsistencies in commons/pkg/configmanager/loader_test.go by replacing a hardcoded absolute path with a temp-directory-based approach; (2) silent test-setup errors in commons/pkg/configmanager/env_test.go by adding explicit error checks during test-case validation; and (3) lost error information in fault-quarantine-module/pkg/mongodb/event_watcher.go when the MongoDB change-stream terminates unexpectedly. These fixes improve test reliability, portability, and runtime error diagnostics without altering any production logic or introducing new features. All changes are isolated to test code and error-handling paths, ensuring existing functionality remains intact.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| commons/pkg/configmanager/loader_test.go | 5/5 | Replaced hardcoded absolute path with temp-directory-based non-existent file path for improved test portability and consistency |
| commons/pkg/configmanager/env_test.go | 5/5 | Added error checks (t.Fatalf) for test-setup parse failures in bounds-checking tests to prevent false positives |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 4/5 | Introduced watchErr variable to capture and return unexpected watcher termination errors instead of silently returning nil |
Confidence score: 4/5
- This PR is safe to merge with minimal risk; all changes are defensive improvements to test infrastructure and error handling.
- Score reflects well-contained fixes addressing previous review feedback, but one point deducted due to potential impact of the event-watcher error propagation change on caller behavior (callers may need to handle non-nil errors on shutdown that were previously always nil).
- Pay close attention to fault-quarantine-module/pkg/mongodb/event_watcher.go: verify that all callers of EventWatcher.Start() correctly handle the new error return semantics when the watcher terminates unexpectedly (previously this always returned nil).
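The changed return semantics can be pictured with a channel-based stand-in for the change stream. In this sketch, which does not use the MongoDB driver, `start` returns nil on context cancellation and a non-nil error when the event channel closes on its own. `start`, `errStreamClosed`, `runUntilCancelled`, and `runUntilClosed` are hypothetical names; only the `watchErr` variable name comes from the review.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errStreamClosed = errors.New("change stream closed unexpectedly")

// start consumes events until ctx is cancelled (graceful shutdown, nil error)
// or the events channel is closed by the producer (unexpected, non-nil error).
func start(ctx context.Context, events <-chan string, handle func(string)) error {
	var watchErr error
	for {
		select {
		case <-ctx.Done():
			return watchErr // still nil here: shutdown was requested
		case ev, ok := <-events:
			if !ok {
				watchErr = errStreamClosed
				return watchErr
			}
			handle(ev)
		}
	}
}

// runUntilCancelled shows the graceful path: the context is already cancelled.
func runUntilCancelled() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	return start(ctx, make(chan string), func(string) {})
}

// runUntilClosed shows the unexpected path: the producer closes the channel.
func runUntilClosed() error {
	events := make(chan string, 1)
	events <- "node-a unhealthy"
	close(events)
	return start(context.Background(), events, func(string) {})
}

func main() {
	fmt.Println("cancelled:", runUntilCancelled())
	fmt.Println("closed:", runUntilClosed())
}
```

A caller that previously ignored the return value would now need to branch on it, for example by logging and restarting the watcher when the error is non-nil.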
3 files reviewed, no comments
Merging this branch will increase overall coverage
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. This iteration addresses numerous feedback points from previous reviews, including critical bug fixes and structural improvements.
Key Changes:
- Critical bug fix in main.go: The blocking NodeInformer.Run() call (line 91) has been addressed in response to prior review feedback, though the implementation still prevents the reconciler and metrics server from starting. The informer must be launched in a goroutine before line 99.
- Enhanced test infrastructure: Added comprehensive E2E tests in reconciler_e2e_test.go (3698 lines) covering quarantine flows, circuit breaker behavior, and dry-run modes. However, three circuit breaker tests contain race conditions where node visibility checks occur after event processing starts.
- Refactored configuration management: The new commons/pkg/configmanager provides generic TOML loading and environment variable handling, consolidating previously scattered configuration logic.
- Improved Kubernetes client abstraction: The FaultQuarantineClient in k8s_client.go now uses per-node mutexes and retry logic with backoff for conflict resolution, addressing concurrency concerns from earlier reviews.
- Architectural simplification: Removed redundant caching layers (NodeInfo, HealthEventBuffer) in favor of direct informer cache access and callback-based event processing.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| fault-quarantine-module/main.go | 1/5 | Critical blocking bug: NodeInformer.Run() prevents reconciler and metrics server goroutines from starting; unused srv variable |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 3/5 | Comprehensive 3698-line E2E test suite with race condition in three circuit breaker tests where node visibility checks happen after event sends |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 4/5 | Major refactoring removes caching layers, introduces ProcessEvent callbacks, adds mutex-protected taint updates, and fixes circuit breaker error handling |
| fault-quarantine-module/pkg/informer/k8s_client.go | 4/5 | New FaultQuarantineClient with per-node locking and retry logic; potential label read issue in handleUncordon after applyLabels modifies node |
| fault-quarantine-module/pkg/breaker/breaker.go | 4/5 | Refactored to use K8sClient interface with ErrRetryExhausted for pod restart signaling; state read errors silently ignored at startup |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 4/5 | New EventWatcher encapsulates MongoDB change streams with callback processing; processEventCallback can be nil causing panic |
| fault-quarantine-module/pkg/initializer/init.go | 4/5 | Centralized initialization logic with circular dependency between reconciler and event watcher via SetEventWatcher |
| commons/pkg/configmanager/loader.go | 5/5 | Generic TOML config loader with proper error wrapping and documentation |
| fault-quarantine-module/pkg/metrics/metrics.go | 3/5 | Metrics moved to shared package with race condition in SetFaultQuarantineBreakerState where Reset() clears all label combinations |
| fault-quarantine-module/pkg/informer/node_informer.go | 4/5 | Refactored from GPU-specific to all-node tracking with custom indexers; GetNodeCounts has TOCTOU race between total and quarantined reads |
| commons/pkg/configmanager/env.go | 4/5 | Generic environment variable manager with type-safe retrieval; optional vars with empty defaults omitted from batch read results |
| fault-quarantine-module/pkg/reconciler/reconciler_test.go | 0/5 | Entire test file emptied, representing critical loss of unit test coverage for reconciler package |
| fault-quarantine-module/pkg/reconciler/node_quarantine.go | 5/5 | File deleted as part of refactoring; functionality moved to k8s_client.go with improved architecture |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 4/5 | Refactored structToInterface to reduce complexity; changed error return from RuleEvaluationErroredOut to RuleEvaluationFailed; minor double-checking of primitives |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 4/5 | Comprehensive unit tests using envtest with 8-char node names that could collide in parallel execution |
Confidence score: 1/5
- This PR contains blocking bugs that will prevent the application from starting correctly and represents a critical loss of test coverage
- Score reflects critical issues in main.go (informer blocking startup sequence), race conditions in E2E tests (node visibility checks), empty reconciler_test.go file (complete loss of unit tests), nil callback panic risk in event_watcher.go, and unsafe metric Reset() in SetFaultQuarantineBreakerState that clears all label combinations
- Pay close attention to fault-quarantine-module/main.go (must launch informer in goroutine), fault-quarantine-module/pkg/reconciler/reconciler_test.go (restore deleted tests), fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go (fix circuit breaker race conditions in lines 2386-2402 and similar patterns), fault-quarantine-module/pkg/mongodb/event_watcher.go (add nil check for processEventCallback), and fault-quarantine-module/pkg/metrics/metrics.go (fix SetFaultQuarantineBreakerState race condition)
Additional Comments (2)
- fault-quarantine-module/pkg/breaker/metrics.go, line 1: logic: entire metrics package deleted but breaker.go likely still calls SetFaultQuarantineBreakerState and SetFaultQuarantineBreakerUtilization. Check that these calls were removed or moved to a different package to avoid undefined references. Were the metrics calls moved to a centralized metrics package, or should the breaker stop reporting metrics entirely?
- commons/go.mod, line 42-43: syntax: duplicate gopkg.in/yaml.v3 entries with conflicting versions (v3.0.4 on line 43, v3.0.1 on line 53). go.mod should have only one version per module; run `go mod tidy` to resolve.
39 files reviewed, 12 comments
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The most recent updates include:
- Go version fixes: Corrected `commons/go.mod` and `fault-quarantine-module/go.mod` to use valid Go versions (go 1.23.0 and toolchain go1.23.8) instead of the non-existent 1.24.x versions that were previously causing build failures.
- YAML dependency cleanup: Fixed duplicate and malformed YAML package imports in `commons/go.mod`: removed the incorrect `go.yaml.in/yaml/*` paths and consolidated to the standard `gopkg.in/yaml.v3` import, eliminating the critical build-breaking conflicts identified in the last review.
- Mock state tracking enhancement: Updated `store-client-sdk/pkg/storewatcher/watchStoreMock.go` to properly track circuit breaker state changes in the mock: `WriteCircuitBreakerState` now updates the internal state so subsequent `ReadCircuitBreakerState` calls return the current state rather than always returning the initial value.
These changes directly address the three most critical blocking issues from the previous review: invalid Go version syntax errors, YAML dependency conflicts that would prevent module resolution, and incorrect mock behavior that could cause test failures.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| commons/go.mod | 5/5 | Fixed Go version to 1.23.0, removed malformed yaml dependencies, consolidated to standard gopkg.in/yaml.v3 |
| fault-quarantine-module/go.mod | 5/5 | Corrected Go version to 1.23.0 and toolchain to go1.23.8, fixed YAML dependency duplication |
| store-client-sdk/pkg/storewatcher/watchStoreMock.go | 5/5 | Added state tracking to WriteCircuitBreakerState so mock properly maintains circuit breaker state changes |
Confidence score: 5/5
- These changes directly fix the three critical build-breaking issues identified in the previous review
- The Go version corrections (1.23.0 instead of 1.24.0) now reference valid, stable Go releases
- YAML dependency consolidation removes the conflicting import paths that would cause module resolution failures
- Mock state tracking fix ensures test reliability by making the mock behavior match real implementations
Additional Comments (5)
- commons/go.mod, line 42-43: syntax: duplicate yaml.v2 import: line 42 uses `go.yaml.in/yaml/v2` while line 43 uses `go.yaml.in/yaml/v3`, which is non-standard. The correct import paths should be `gopkg.in/yaml.v2` and `gopkg.in/yaml.v3` respectively. Line 53 already has the correct `gopkg.in/yaml.v3` import, creating a conflict.
- commons/go.mod, line 53: syntax: duplicate yaml.v3 dependency, already declared on line 43 with a different import path (`go.yaml.in/yaml/v3` vs `gopkg.in/yaml.v3`). Run `go mod tidy` to resolve.
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 89-103: logic: function returns 'added' boolean but only checks `!exists`. If an event already exists, it's not overwritten, so updates never happen despite the 'AddOrUpdate' name. This means IsFatal, IsHealthy, ErrorCodes, Message fields in existing events never refresh even if the new event has updated values. Should existing events be updated, or should the function be renamed to AddEvent? Is it intentional that AddOrUpdateEvent never updates existing events, only adds new ones? Should the function update stored events when keys match, or should it be renamed to reflect add-only behavior?
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 170-196: style: RemoveEvent counts matches in a separate loop (lines 183-189) before deleting. This iterates the keys slice twice. Consider counting during deletion or removing the added variable if the return value isn't critical for performance.
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 198-221: style: removeAllEntitiesForCheck collects keysToRemove slice before deletion. For large maps this allocates extra memory. Consider deleting directly in the first loop if iteration-during-delete is safe in Go (it is).
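The last comment relies on Go allowing map entries to be deleted while ranging over the same map. A minimal sketch of that pattern follows; the `removeAllForCheck` name and the "check/entity" key layout are assumptions for illustration, since the real key format in health_events_annotation_map.go is not shown here.

```go
package main

import (
	"fmt"
	"strings"
)

// removeAllForCheck deletes every entry whose key starts with "<check>/",
// deleting in place while ranging over the map. Go permits this: an entry
// removed before it is reached is simply not produced by the iteration.
// The "check/entity" key layout is an assumption made for this sketch.
func removeAllForCheck(events map[string]string, check string) int {
	removed := 0
	prefix := check + "/"
	for key := range events {
		if strings.HasPrefix(key, prefix) {
			delete(events, key)
			removed++
		}
	}
	return removed
}

func main() {
	events := map[string]string{
		"gpu-xid/GPU-0": "fatal",
		"gpu-xid/GPU-1": "fatal",
		"nvlink/GPU-0":  "degraded",
	}
	n := removeAllForCheck(events, "gpu-xid")
	fmt.Println(n, len(events)) // prints "2 1"
}
```

No intermediate keysToRemove slice is needed, and the removal count falls out of the same loop.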
40 files reviewed, 11 comments
Merging this branch changes the coverage (1 decrease, 3 increase)
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR.
The latest iteration continues the fault-quarantine refactoring with significant improvements: the Makefile TEST_SETUP_COMMANDS variable was corrected to remove the problematic trailing && operator; the go.mod files now properly reference Go 1.23.0 instead of non-existent 1.24.x versions; the metrics server initialization issue in main.go was addressed by starting the server in a goroutine before the blocking informer.Run() call; the circular dependency between reconciler and eventWatcher was resolved by passing the watcher during reconciler creation rather than using a setter; and the commons/go.mod malformed import paths (go.yaml.in/yaml) were fixed to use the correct gopkg.in/yaml paths. The configuration management SDK continues to mature with improved error handling patterns, and test coverage expanded with additional envtest-based integration tests for the k8s client. The changes align with the broader refactoring goal of introducing a config SDK while consolidating Kubernetes client operations into dedicated packages with proper separation of concerns.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| fault-quarantine-module/main.go | 3/5 | Fixed metrics server startup and informer blocking issues, but still has POD_NAMESPACE unused variable |
| commons/go.mod | 2/5 | Corrected Go version but still has duplicate gopkg.in/yaml.v3 entries with conflicting versions |
| fault-quarantine-module/go.mod | 2/5 | Corrected Go version but yaml.v3 duplicates and malformed go.yaml.in path persist |
| fault-quarantine-module/Makefile | 4/5 | TEST_SETUP_COMMANDS trailing && operator still causes syntax errors when expanded |
| fault-quarantine-module/pkg/initializer/init.go | 4/5 | Improved initialization consolidation but circular dependency between reconciler and watcher remains |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 3/5 | Major refactoring improves structure but read-modify-write race conditions in annotation updates remain unaddressed |
| fault-quarantine-module/pkg/informer/k8s_client.go | 4/5 | Well-structured client abstraction with per-node locking, but uncordon reason reads from wrong source in handleUncordon |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 3/5 | Solid event watcher design but processEventCallback can be nil at Start() time causing panic risk |
| commons/pkg/configmanager/env.go | 3/5 | Useful config utilities but default values bypass validation and optional vars with empty defaults pollute results map |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4/5 | Comprehensive E2E coverage with proper blocking waits for circuit breaker tests |
Confidence score: 3/5
- This PR has made substantial progress on addressing previous review concerns but still contains several moderate-risk issues that should be resolved before merging
- Score reflects remaining issues: go.mod syntax errors (duplicate yaml.v3, malformed paths), annotation update race conditions, nil callback panic risk in event watcher, and validation bypass in config manager that could allow invalid defaults into the system
- Pay close attention to commons/go.mod and fault-quarantine-module/go.mod (resolve yaml.v3 duplicates and malformed import paths), fault-quarantine-module/pkg/mongodb/event_watcher.go (add nil check for processEventCallback), fault-quarantine-module/pkg/reconciler/reconciler.go (address read-modify-write race conditions in annotation operations), and commons/pkg/configmanager/env.go (ensure defaults go through validation)
Additional Comments (9)
- fault-quarantine-module/pkg/reconciler/reconciler_test.go, line 1: logic: The entire test file has been deleted. This removes all unit tests for the reconciler, leaving only the e2e tests. This significantly reduces test coverage and makes it harder to catch regressions during development. Were these tests moved elsewhere, or is the reconciler now only tested via e2e tests? Were the unit tests moved to another file, or is the plan to rely entirely on e2e tests for reconciler validation?
- commons/go.mod, line 42-43: logic: duplicate `gopkg.in/yaml.v3` entries with conflicting versions. Line 43 has v3.0.4 (indirect), line 53 has v3.0.1 (indirect). This will cause build failures or unexpected resolution behavior.
- commons/go.mod, line 42: syntax: malformed import path `go.yaml.in/yaml/v2` should be `gopkg.in/yaml.v2` (missing 'gop' prefix).
- fault-quarantine-module/pkg/reconciler/node_quarantine_test.go, line 1: logic: entire test file deleted; no tests remain for TaintAndCordonNodeAndSetAnnotations, UnTaintAndUnCordonNodeAndRemoveAnnotations, GetNodeAnnotations, GetNodesWithAnnotation, or error handling paths. These functions are still in production code (node_quarantine.go) but have zero test coverage now. Was this deletion intentional? The PR description mentions bug fixes and refactoring, but removing all tests for critical node operations (taint/cordon/uncordon) seems risky. Are these tests moved to a different file, or is test coverage being dropped?
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 52-72: logic: Entity and version fields are included in the key but not validated. If event or entity pointers are nil, this will panic. Add nil checks before accessing fields. Should createEventKeyForEntity handle nil event pointers, or can we assume it's always called with valid events?
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 89-103: logic: AddOrUpdateEvent only adds events that don't exist; it never updates existing events even though the function name suggests it does. If an event already exists, the function returns false but the event remains unchanged. Should existing events be updated with the new version, or is the current add-only behavior intentional?
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 252-266: style: MarshalJSON deduplicates by comparing event pointers, but if the same event data exists at different memory addresses (e.g., after unmarshal or multiple adds), this won't detect the logical duplication. Is pointer-based deduplication sufficient, or should it compare event contents? Are events guaranteed to share the same pointer when logically identical (same entities/check), or could multiple event instances with identical data exist?
- fault-quarantine-module/pkg/reconciler/node_quarantine.go, line 1: logic: Entire FaultQuarantineClient implementation (507 lines) deleted without replacement in this file. The type is still referenced throughout the codebase (pkg/initializer/init.go line 192, pkg/breaker/breaker.go, pkg/reconciler/reconciler.go line 87, tests). This will cause compilation failures across the module. Was this code moved to a different file, or is this an accidental deletion? Was FaultQuarantineClient moved to a different file in this refactoring, or should this deletion be reverted? The type is still used extensively in other files.
- fault-quarantine-module/pkg/breaker/breaker.go, line 414-426: style: logRetriesExhausted calls GetTotalNodes again after all retries failed. If this call succeeds but previous calls returned 0, it proves cache sync issue. But if this call also fails, error message on line 418 says 'failed to get node count' even though the actual issue is zero_nodes. Should the error distinguish between 'API error on final attempt' vs 'consistently returning zero'? Should the final error message distinguish between API failures and consistently-zero node counts, since both end up in the same error path?
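If the answer to the AddOrUpdateEvent question is that existing events should be refreshed, one possible shape is an upsert that still reports whether the key was new and rejects nil events up front. This is only a sketch: `Event`, `EventMap`, and the string key are hypothetical stand-ins, not the types in health_events_annotation_map.go.

```go
package main

import "fmt"

// Event is a hypothetical, trimmed-down stand-in for the stored health event.
type Event struct {
	IsFatal bool
	Message string
}

// EventMap is a hypothetical stand-in for the annotation map.
type EventMap struct {
	events map[string]*Event
}

func NewEventMap() *EventMap {
	return &EventMap{events: make(map[string]*Event)}
}

// AddOrUpdate stores ev under key, overwriting any previous value.
// It reports whether the key was newly added, and rejects nil events
// instead of panicking on a later field access.
func (m *EventMap) AddOrUpdate(key string, ev *Event) (added bool, err error) {
	if ev == nil {
		return false, fmt.Errorf("nil event for key %q", key)
	}
	_, exists := m.events[key]
	m.events[key] = ev
	return !exists, nil
}

func main() {
	m := NewEventMap()
	added, _ := m.AddOrUpdate("gpu-xid/GPU-0", &Event{Message: "first"})
	fmt.Println(added) // prints "true"
	added, _ = m.AddOrUpdate("gpu-xid/GPU-0", &Event{IsFatal: true, Message: "updated"})
	fmt.Println(added, m.events["gpu-xid/GPU-0"].Message) // prints "false updated"
}
```

If add-only behavior is intended instead, renaming the function would be the smaller change.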
41 files reviewed, 21 comments
@coderabbitai full review
Actions performed: Full review triggered.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
This pull request refactors the fault-quarantine-module to use a centralized initialization pattern with Kubernetes-backed state persistence. It adds environment variable utilities and TOML config loading to commons, introduces a new MongoDB event watcher, consolidates Kubernetes client operations, and reorganizes package structure with new initializer and metrics modules.
Sequence Diagram(s)
sequenceDiagram
participant Main
participant Initializer
participant K8sClient as FaultQuarantineClient
participant NodeInformer
participant CircuitBreaker
participant MongoDB
participant Reconciler
participant EventWatcher
Main->>Initializer: InitializeAll(ctx, params)
Initializer->>Initializer: loadEnvConfig()
Initializer->>Initializer: LoadTOMLConfig(path)
Initializer->>K8sClient: NewFaultQuarantineClient(kubeconfig)
K8sClient->>NodeInformer: Create & sync
NodeInformer-->>K8sClient: Ready
Initializer->>CircuitBreaker: NewSlidingWindowBreaker(ctx, config)
CircuitBreaker->>K8sClient: EnsureCircuitBreakerConfigMap()
K8sClient-->>CircuitBreaker: State initialized
Initializer->>Reconciler: NewReconciler(config, k8sClient, breaker)
Initializer->>EventWatcher: NewEventWatcher(...)
Initializer->>Reconciler: SetEventWatcher(watcher)
Initializer->>EventWatcher: Start(ctx)
EventWatcher->>MongoDB: Watch change stream
MongoDB-->>EventWatcher: Events
EventWatcher->>Reconciler: ProcessEvent callback
Reconciler->>NodeInformer: Evaluate rules
NodeInformer-->>Reconciler: Matching nodes
Reconciler->>K8sClient: QuarantineNodeAndSetAnnotations()
K8sClient->>K8sClient: UpdateNode (with per-node lock)
Initializer-->>Main: Components {Reconciler, EventWatcher, K8sClient, CircuitBreaker}
Estimated code review effort: 5 (Critical) | ~120 minutes
Pre-merge checks
- Failed checks (1 warning)
- Passed checks (2 passed)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
golangci-lint (2.5.0): Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.
…adkar/NVSentinel into kdabhadkar/FQ-refactorings
@CodeRabbit review
Actions performed: Review triggered.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can't be posted inline due to platform limitations.
Outside diff range comments (1)
fault-quarantine-module/pkg/breaker/breaker.go (1)
49-59: Validate window duration to prevent zero buckets.
cfg.Window < 1s yields zero buckets and negative indices later. Guard early.
     func NewSlidingWindowBreaker(ctx context.Context, cfg Config) (CircuitBreaker, error) {
    -    numBuckets := int((cfg.Window + time.Second - 1) / time.Second)
    +    if cfg.Window < time.Second {
    +        return nil, fmt.Errorf("window must be >= 1s; got %v", cfg.Window)
    +    }
    +    numBuckets := int((cfg.Window + time.Second - 1) / time.Second)
Duplicate comments (12)
fault-quarantine-module/pkg/breaker/types.go (1)
24-30: Consider using `State` type in interface methods for type safety.
The interface methods `ReadCircuitBreakerState` and `WriteCircuitBreakerState` use `string` for the state parameter/return value, while the `State` type (line 41) provides an explicit type with defined constants (`StateClosed`, `StateTripped`). Using `State` instead of `string` would provide compile-time type safety and prevent invalid state values.
Apply this diff:
     type K8sClientOperations interface {
         GetTotalNodes(ctx context.Context) (int, error)
    -    EnsureCircuitBreakerConfigMap(ctx context.Context, name, namespace string, initialStatus State) error
    -    ReadCircuitBreakerState(ctx context.Context, name, namespace string) (State, error)
    -    WriteCircuitBreakerState(ctx context.Context, name, namespace string, status State) error
    +    EnsureCircuitBreakerConfigMap(ctx context.Context, name, namespace string, initialStatus State) error
    +    ReadCircuitBreakerState(ctx context.Context, name, namespace string) (State, error)
    +    WriteCircuitBreakerState(ctx context.Context, name, namespace string, status State) error
     }
Wait, looking at the code again, the interface already uses `State` type (line 27, 28, 29). The past comment may be outdated.
fault-quarantine-module/pkg/evaluator/rule_evaluator.go (2)
231-255: Primitive cases duplicated in handleComplexType.
These cases are unreachable since structToInterface already returns primitives earlier. Safe to remove for clarity.
194-212: primitiveKinds package var immutability/style.
Read-only after init; consider localizing or documenting immutability. Not blocking.
commons/pkg/configmanager/env_test.go (1)
116-121: Assert error mentions missing variable name.
Strengthen the test by checking the error includes the var name.
Apply:
    @@
    -import (
    +import (
         "fmt"
         "math"
         "strconv"
         "testing"
    +    "strings"
     )
    @@
         t.Run("missing required returns error", func(t *testing.T) {
             _, err := GetEnvVar[int]("TEST_MISSING_REQUIRED", nil, nil)
             if err == nil {
                 t.Error("expected error for missing env var but got none")
             }
    +        if err != nil && !strings.Contains(err.Error(), "TEST_MISSING_REQUIRED") {
    +            t.Errorf("error should mention missing var name, got: %v", err)
    +        }
         })
Also applies to: 17-22
fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go (1)
174-178: Use collision-proof node names in tests.
Truncating ObjectID to 8 chars can collide in parallel runs. Prefer full hex or suffix with t.Name().
    -    nodeName := "test-node-" + primitive.NewObjectID().Hex()[:8]
    +    nodeName := "test-node-" + primitive.NewObjectID().Hex() + "-" + strings.ReplaceAll(t.Name(), "/", "_")
fault-quarantine-module/pkg/initializer/init.go (2)
87-91: Magic number: 30m informer resync/timeout is hardcoded.
Make it configurable via InitializationParams or env var to tune in different clusters.
    -    k8sClient, err := informer.NewFaultQuarantineClient(params.KubeconfigPath, params.DryRun, 30*time.Minute)
    +    k8sClient, err := informer.NewFaultQuarantineClient(params.KubeconfigPath, params.DryRun, params.InformerResync)
(Add InformerResync time.Duration to InitializationParams.)
141-144: Bidirectional wiring between Reconciler and EventWatcher.
SetEventWatcher after passing reconciler into NewEventWatcher creates a circular dependency. Consider injecting the callback or an interface instead.
fault-quarantine-module/pkg/reconciler/reconciler.go (1)
156-164: Add nil-guard for eventWatcher before Start() dereference.
Lines 156-162 dereference `r.eventWatcher` without checking nil. While the normal production flow (via `InitializeAll`) calls `SetEventWatcher` before `Start()`, alternative paths, such as direct reconciler instantiation in tests, could bypass this, causing a panic. Defensive nil checks are warranted.
    +    if r.eventWatcher == nil {
    +        return fmt.Errorf("event watcher not configured; call SetEventWatcher before Start")
    +    }
    +
         r.eventWatcher.SetProcessEventCallback(
             func(ctx context.Context, event *model.HealthEventWithStatus) *model.Status {
                 return r.ProcessEvent(ctx, event, ruleSetEvals, rulesetsConfig)
             },
         )
         if err := r.eventWatcher.Start(ctx); err != nil {
             return fmt.Errorf("event watcher failed: %w", err)
         }
fault-quarantine-module/pkg/metrics/metrics.go (1)
183-186: Avoid GaugeVec.Reset in hot paths; prefer stable series and set 0/1 for known states
Reset clears all label combinations and can be noisy under scrape; also previously raised. If single-goroutine is guaranteed, fine, but consider a safer pattern: set 1 for current state and 0 for the rest to keep time series stable. Also, fix state values to a small enum to avoid cardinality creep.
Apply this diff:
    -func SetFaultQuarantineBreakerState(state string) {
    -    FaultQuarantineBreakerState.Reset()
    -    FaultQuarantineBreakerState.WithLabelValues(state).Set(1)
    -}
    +func SetFaultQuarantineBreakerState(state string) {
    +    // Known states; keep series stable
    +    for _, s := range []string{"closed", "open", "half_open"} {
    +        val := 0.0
    +        if s == state {
    +            val = 1.0
    +        }
    +        FaultQuarantineBreakerState.WithLabelValues(s).Set(val)
    +    }
    +}
If you prefer to keep Reset, please confirm single-goroutine invocation and that you're okay with transient 0s for all states during collection.
fault-quarantine-module/pkg/informer/k8s_client.go (2)
386-393: Why 55-char truncation? K8s label value limit is 63
Unless there's a specific reason, align with 63 to avoid unnecessary loss of information.
    -    if len(uncordonReason) > 55 {
    -        uncordonReason = uncordonReason[:55]
    +    if len(uncordonReason) > 63 {
    +        uncordonReason = uncordonReason[:63]
311-318: Do not skip uncordon/cleanup when taints-to-remove don't match
Early-returning here prevents uncordon, annotation, and label cleanup if the specified taints aren't present. That contradicts expected idempotent unquarantine semantics.
Apply this diff:
    -    if len(taints) > 0 {
    -        if shouldReturn := c.removeTaints(node, taints, nodename); shouldReturn {
    -            return nil
    -        }
    -    }
    +    if len(taints) > 0 {
    +        c.removeTaints(node, taints, nodename)
    +    }
And change removeTaints to have no boolean return (see next comment).
fault-quarantine-module/pkg/mongodb/event_watcher.go (1)
163-171: Nil callback can panic; guard before invocation
If SetProcessEventCallback wasn't called, this will panic.
Apply this diff:
    -    startTime := time.Now()
    -    status := w.processEventCallback(ctx, &healthEventWithStatus)
    +    startTime := time.Now()
    +    var status *model.Status
    +    if w.processEventCallback != nil {
    +        status = w.processEventCallback(ctx, &healthEventWithStatus)
    +    } else {
    +        slog.Warn("processEventCallback not set; skipping event callback")
    +    }
Nitpick comments (17)
fault-quarantine-module/pkg/breaker/types.go (1)
32-38: Clarify `CircuitBreakerConfig` usage in this file.
The `CircuitBreakerConfig` struct is defined but its relationship to the `Config` struct and how it's used is not evident from this file alone. Consider adding a comment explaining its purpose or how it relates to `Config`, especially since both contain related circuit breaker configuration.
fault-quarantine-module/pkg/evaluator/rule_evaluator.go (2)
316-328: Map key handling can be lossy for non-string keys.
key.String() is only correct for string keys. Use string keys or fmt.Sprint fallback.
Apply:
     func handleMap(v reflect.Value) interface{} {
         if v.IsNil() {
             return nil
         }
         mapResult := make(map[string]interface{})
         for _, key := range v.MapKeys() {
    -        mapResult[key.String()] = structToInterface(v.MapIndex(key))
    +        var k string
    +        if key.Kind() == reflect.String {
    +            k = key.String()
    +        } else {
    +            k = fmt.Sprint(key.Interface())
    +        }
    +        mapResult[k] = structToInterface(v.MapIndex(key))
         }
         return mapResult
     }
339-359: Avoid double JSON round-trip when already a map.
Minor perf: if structToInterface returns map[string]interface{}, return it directly.
Apply:
     func RoundTrip(v interface{}) (map[string]interface{}, error) {
         val := reflect.ValueOf(v)
    -    obj := structToInterface(val)
    +    obj := structToInterface(val)
    +    if m, ok := obj.(map[string]interface{}); ok {
    +        return m, nil
    +    }
         b, err := json.Marshal(obj)
         if err != nil {
             return nil, fmt.Errorf("failed to marshal intermediate object: %w", err)
         }
         var j interface{}
         if err := json.Unmarshal(b, &j); err != nil {
             return nil, fmt.Errorf("failed to unmarshal JSON back to map: %w", err)
         }
         m, ok := j.(map[string]interface{})
         if !ok {
             return nil, fmt.Errorf("expected JSON object after roundtrip")
         }
         return m, nil
     }
commons/pkg/configmanager/env.go (1)
155-167: Broaden boolean parsing (optional).
Many envs use 1/0, yes/no, y/n. Consider supporting these alongside true/false.
Example:
     func parseBool(valueStr string) (bool, error) {
         valueStr = strings.ToLower(strings.TrimSpace(valueStr))
         switch valueStr {
    -    case "true":
    +    case "true", "1", "yes", "y":
             return true, nil
    -    case "false":
    +    case "false", "0", "no", "n":
             return false, nil
         default:
             return false, fmt.Errorf("invalid boolean value: %s (must be 'true' or 'false')", valueStr)
         }
     }
Update error message accordingly if you adopt the broader set.
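An alternative sketch leans on the standard library: `strconv.ParseBool` already accepts 1/0, t/f and true/false in several casings, so only the yes/no spellings need explicit cases. Note that this variant also accepts "t" and "f", which the example above does not; whether that is desirable is a judgment call for the SDK.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBool accepts yes/no and y/n explicitly and delegates everything else
// to strconv.ParseBool (1, 0, t, f, true, false after lowercasing).
// This is an illustrative variant, not the function in env.go.
func parseBool(valueStr string) (bool, error) {
	v := strings.ToLower(strings.TrimSpace(valueStr))
	switch v {
	case "yes", "y":
		return true, nil
	case "no", "n":
		return false, nil
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("invalid boolean value %q (expected true/false, 1/0, yes/no)", valueStr)
	}
	return b, nil
}

func main() {
	for _, s := range []string{" YES ", "0", "maybe"} {
		b, err := parseBool(s)
		fmt.Println(b, err)
	}
}
```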
fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go (1)
65-100: Minor: preallocate evaluators and improve error context.
- Preallocate capacity to len(rules).
- Include rule.Expression in unknown kind error for easier triage.
    -func createEvaluators(rules []config.Rule, nodeInformer *informer.NodeInformer) ([]RuleEvaluator, error) {
    -    evaluators := []RuleEvaluator{}
    +func createEvaluators(rules []config.Rule, nodeInformer *informer.NodeInformer) ([]RuleEvaluator, error) {
    +    evaluators := make([]RuleEvaluator, 0, len(rules))
     ...
    -    default:
    -        err = fmt.Errorf("unknown evaluator kind: %s", rule.Kind)
    +    default:
    +        err = fmt.Errorf("unknown evaluator kind: %s (expr=%q)", rule.Kind, rule.Expression)
fault-quarantine-module/pkg/reconciler/reconciler.go (3)
661-666: Record breaker events only after successful quarantine.
Currently AddCordonEvent is called before K8s mutations; a subsequent failure overcounts. Move it after a successful QuarantineNodeAndSetAnnotations.
    -    r.recordCordonEventInCircuitBreaker(event)
     ...
    -    err := r.k8sClient.QuarantineNodeAndSetAnnotations(
    +    err := r.k8sClient.QuarantineNodeAndSetAnnotations(
             ctx,
             event.HealthEvent.NodeName,
             taintsToBeApplied,
             isCordoned.Load(),
             annotationsMap,
             labels,
         )
         if err != nil {
             ...
             return nil
         }
    +    r.recordCordonEventInCircuitBreaker(event)
701-718: Avoid double rule evaluation on quarantined nodes.
eventMatchesAnyRule re-evaluates rules already computed in evaluateRulesets. Cache pass/fail per ruleset for the event and reuse to cut duplicate work.
496-503: Defensive type-assert on labelsMap value.
Cordoned reason concatenation assumes string; guard to avoid panic if map gets polluted.
    -    if oldReasonVal, exist := labelsMap.Load(r.cordonedReasonLabelKey); exist {
    -        oldCordonReason := oldReasonVal.(string)
    +    if oldReasonVal, exist := labelsMap.Load(r.cordonedReasonLabelKey); exist {
    +        oldCordonReason, _ := oldReasonVal.(string)
             newCordonReason = oldCordonReason + "-" + newCordonReason
         }
fault-quarantine-module/pkg/breaker/breaker.go (2)
61-63: Replace Go 1.22-only integer ranges with classic loops for portability.
Avoid `for range int` to support older Go versions.
    -    for i := range numBuckets {
    +    for i := 0; i < numBuckets; i++ {
             b.indexToNodes[i] = make(map[string]bool)
         }
     ...
    -    for range steps {
    +    for i := 0; i < steps; i++ {
     ...
    -    for i := range len(b.indexToNodes) - 1 {
    +    for i := 0; i < len(b.indexToNodes)-1; i++ {
             b.indexToNodes[i] = b.indexToNodes[i+1]
         }
Confirm Go toolchain (see script in breaker_test.go comment).
Also applies to: 108-134, 121-126
97-103: Use slice instead of map[int] for buckets-to-nodes mapping.
indexToNodes is indexed densely; a slice []map[string]bool is simpler and faster than map[int]… . Optional cleanup.
High-level:
- Change indexToNodes to `[]map[string]bool`
- Initialize with `make([]map[string]bool, numBuckets)`
- Shift with `copy(s[:], s[1:])`
Also applies to: 121-126
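The three high-level steps can be sketched as follows. The `window` type and its methods are hypothetical stand-ins for the breaker's internal state; only the slice-of-sets layout and the `copy`-based shift come from the comment above. `copy` is safe here because Go allows the source and destination slices to overlap.

```go
package main

import "fmt"

// window is a hypothetical stand-in for the breaker's bucket storage:
// one set of node names per one-second bucket, oldest first.
type window struct {
	buckets []map[string]bool
}

// newWindow allocates numBuckets empty buckets using a classic index loop.
func newWindow(numBuckets int) *window {
	w := &window{buckets: make([]map[string]bool, numBuckets)}
	for i := 0; i < numBuckets; i++ {
		w.buckets[i] = make(map[string]bool)
	}
	return w
}

// advance drops the oldest bucket, shifts the rest left by one with copy,
// and installs a fresh empty bucket at the newest position.
func (w *window) advance() {
	if len(w.buckets) == 0 {
		return
	}
	copy(w.buckets, w.buckets[1:])
	w.buckets[len(w.buckets)-1] = make(map[string]bool)
}

// distinctNodes counts unique node names across all buckets.
func (w *window) distinctNodes() int {
	seen := make(map[string]bool)
	for _, b := range w.buckets {
		for name := range b {
			seen[name] = true
		}
	}
	return len(seen)
}

func main() {
	w := newWindow(3)
	w.buckets[2]["node-a"] = true
	w.advance()
	w.buckets[2]["node-b"] = true
	fmt.Println(w.distinctNodes()) // prints "2"
	w.advance()
	w.advance() // node-a has now aged out of the 3-bucket window
	fmt.Println(w.distinctNodes()) // prints "1"
}
```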
fault-quarantine-module/pkg/mongodb/event_watcher.go (3)
104-114: Treat graceful shutdown as info, not error; avoid noisy logs on ctx cancelWhen watchEvents returns nil due to Close(ctx) on context cancellation, this logs an error and wraps a synthetic error. Distinguish ctx-driven shutdown.
Apply this diff:
- go func() { - err := w.watchEvents(ctx, watcher) - if err != nil { - slog.Error("MongoDB event watcher goroutine failed", "error", err) - watchDoneCh <- err - } else { - slog.Error("MongoDB event watcher goroutine exited unexpectedly, event processing has stopped") - watchDoneCh <- fmt.Errorf("event watcher channel closed unexpectedly") - } - }() + go func() { + err := w.watchEvents(ctx, watcher) + if err != nil { + slog.Error("MongoDB event watcher goroutine failed", "error", err) + watchDoneCh <- err + return + } + if ctx.Err() != nil { + slog.Info("MongoDB event watcher stopped due to context cancellation") + } else { + slog.Error("MongoDB event watcher exited unexpectedly; event channel closed") + watchDoneCh <- fmt.Errorf("event watcher channel closed unexpectedly") + } + }()
179-185: Silent failure to capture ObjectID skews backlog; add metric/logIf _id extraction fails, EventBacklogSize can under-report. Emit a metric or debug log to observe this.
Apply this diff:
func (w *EventWatcher) storeEventObjectID(eventBson bson.M) { if fullDoc, ok := eventBson["fullDocument"].(bson.M); ok { if objID, ok := fullDoc["_id"].(primitive.ObjectID); ok { w.lastProcessedObjectID.StoreLastProcessedObjectID(objID) + return } } + slog.Debug("Could not extract ObjectID from event fullDocument; skipping lastProcessed update") + metrics.ProcessingErrors.WithLabelValues("object_id_missing").Inc() }
197-200: Avoid stale EventBacklogSize when no last processed ID
Currently we skip updating the gauge entirely; consider explicitly setting to 0 to avoid stale values.
    - objID, ok := w.lastProcessedObjectID.LoadLastProcessedObjectID()
    - if !ok {
    -     continue
    - }
    + objID, ok := w.lastProcessedObjectID.LoadLastProcessedObjectID()
    + if !ok {
    +     metrics.EventBacklogSize.Set(0)
    +     continue
    + }
fault-quarantine-module/pkg/informer/k8s_client.go (2)
131-149: Use the same backoff policy everywhere for consistency
You defined customBackoff; use it here too for node updates.
    - return retry.OnError(retry.DefaultBackoff, errors.IsConflict, func() error {
    + return retry.OnError(customBackoff, errors.IsConflict, func() error {
251-254: Map rebuild causes non-deterministic taint ordering
Functional, but ordering becomes non-deterministic. If later logic or tests assume order, consider appending only missing taints instead of rebuilding from a map.
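One way to append only the missing taints is sketched here. It is an illustration under assumptions, not the module's code: Taint is a simplified stand-in for corev1.Taint, mergeTaints is a hypothetical helper, and matching on the full key/value/effect triple is an assumption about what counts as "the same taint".

```go
package main

import "fmt"

// Taint is a simplified, hypothetical stand-in for corev1.Taint.
type Taint struct {
	Key, Value, Effect string
}

// mergeTaints keeps existing taints in their original order and appends
// only those entries of toAdd that are not already present. The map is
// used purely for membership checks, never for iteration, so the result
// order is deterministic.
func mergeTaints(existing, toAdd []Taint) []Taint {
	seen := make(map[Taint]bool, len(existing))
	merged := make([]Taint, 0, len(existing)+len(toAdd))
	for _, t := range existing {
		seen[t] = true
		merged = append(merged, t)
	}
	for _, t := range toAdd {
		if !seen[t] {
			seen[t] = true
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	existing := []Taint{{"b", "1", "NoSchedule"}, {"a", "1", "NoSchedule"}}
	toAdd := []Taint{{"a", "1", "NoSchedule"}, {"c", "1", "NoExecute"}}
	for _, t := range mergeTaints(existing, toAdd) {
		fmt.Println(t.Key, t.Effect)
	}
}
```

Because the output never depends on map iteration order, tests can compare the resulting slice directly instead of sorting first.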
fault-quarantine-module/pkg/informer/k8s_client_test.go (1)
573-610: Add assertion that unquarantine still uncordons when taints don't match
To catch the early-return bug, start from a cordoned node and verify it's uncordoned even if taintsToRemove don't exist.
Apply this diff:
    - createTestNode(ctx, t, nodeName, nil, nil, taints, false)
    + createTestNode(ctx, t, nodeName, nil, nil, taints, true)
    @@
    - // Filter to test taint only (ignore automatic envtest taints)
    + // Node should be uncordoned even if specified taints didn't match
    + if updatedNode.Spec.Unschedulable {
    +     t.Errorf("Expected node to be uncordoned")
    + }
    + // Filter to test taint only (ignore automatic envtest taints)
fault-quarantine-module/pkg/informer/k8s_client_interface.go (1)
27-39: Update interface parameter names for consistency with implementation
The API rename verification confirms successful migration: old method names (GetTotalGpuNodes, TaintAndCordonNodeAndSetAnnotations, UnTaintAndUnCordonNodeAndRemoveAnnotations) are completely replaced with new names.
However, the interface definition uses labelMap while the implementation uses labels for the same parameters. Update the interface at lines 28 and 31 to use labels instead of labelMap for naming consistency:
- QuarantineNodeAndSetAnnotations: parameter 6 (line 28)
- UnQuarantineNodeAndRemoveAnnotations: parameter 6 (line 31)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files ignored due to path filters (7)
- commons/go.sum is excluded by !**/*.sum
- fault-quarantine-module/go.sum is excluded by !**/*.sum
- fault-remediation-module/go.sum is excluded by !**/*.sum
- health-events-analyzer/go.sum is excluded by !**/*.sum
- health-monitors/csp-health-monitor/go.sum is excluded by !**/*.sum
- janitor/go.sum is excluded by !**/*.sum
- node-drainer-module/go.sum is excluded by !**/*.sum
Files selected for processing (42)
- .github/dependabot.yml (1 hunks)
- commons/Makefile (2 hunks)
- commons/go.mod (1 hunks)
- commons/pkg/configmanager/env.go (1 hunks)
- commons/pkg/configmanager/env_test.go (1 hunks)
- commons/pkg/configmanager/loader.go (1 hunks)
- commons/pkg/configmanager/loader_test.go (1 hunks)
- fault-quarantine-module/Makefile (1 hunks)
- fault-quarantine-module/go.mod (5 hunks)
- fault-quarantine-module/main.go (3 hunks)
- fault-quarantine-module/pkg/breaker/breaker.go (10 hunks)
- fault-quarantine-module/pkg/breaker/breaker_test.go (1 hunks)
- fault-quarantine-module/pkg/breaker/metrics.go (0 hunks)
- fault-quarantine-module/pkg/breaker/types.go (3 hunks)
- fault-quarantine-module/pkg/common/common.go (0 hunks)
- fault-quarantine-module/pkg/common/health_events_buffer.go (0 hunks)
- fault-quarantine-module/pkg/config/config.go (0 hunks)
- fault-quarantine-module/pkg/evaluator/rule_evaluator.go (3 hunks)
- fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go (2 hunks)
- fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go (4 hunks)
- fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go (1 hunks)
- fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go (5 hunks)
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go (0 hunks)
- fault-quarantine-module/pkg/informer/k8s_client.go (1 hunks)
- fault-quarantine-module/pkg/informer/k8s_client_interface.go (1 hunks)
- fault-quarantine-module/pkg/informer/k8s_client_test.go (1 hunks)
- fault-quarantine-module/pkg/informer/node_informer.go (6 hunks)
- fault-quarantine-module/pkg/informer/node_informer_test.go (0 hunks)
- fault-quarantine-module/pkg/initializer/init.go (1 hunks)
- fault-quarantine-module/pkg/metrics/metrics.go (6 hunks)
- fault-quarantine-module/pkg/mongodb/event_watcher.go (1 hunks)
- fault-quarantine-module/pkg/nodeinfo/nodeinfo.go (0 hunks)
- fault-quarantine-module/pkg/nodeinfo/nodeinfo_test.go (0 hunks)
- fault-quarantine-module/pkg/reconciler/node_quarantine.go (0 hunks)
- fault-quarantine-module/pkg/reconciler/node_quarantine_test.go (0 hunks)
- fault-quarantine-module/pkg/reconciler/reconciler.go (11 hunks)
- fault-remediation-module/go.mod (3 hunks)
- health-events-analyzer/go.mod (1 hunks)
- health-monitors/csp-health-monitor/go.mod (1 hunks)
- janitor/go.mod (4 hunks)
- node-drainer-module/go.mod (4 hunks)
- store-client-sdk/pkg/storewatcher/watch_store_mock.go (1 hunks)
Files with no reviewable changes (10)
- fault-quarantine-module/pkg/common/common.go
- fault-quarantine-module/pkg/nodeinfo/nodeinfo_test.go
- fault-quarantine-module/pkg/reconciler/node_quarantine_test.go
- fault-quarantine-module/pkg/informer/node_informer_test.go
- fault-quarantine-module/pkg/config/config.go
- fault-quarantine-module/pkg/common/health_events_buffer.go
- fault-quarantine-module/pkg/nodeinfo/nodeinfo.go
- fault-quarantine-module/pkg/breaker/metrics.go
- fault-quarantine-module/pkg/reconciler/node_quarantine.go
- fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go
Additional comments (24)
node-drainer-module/go.mod (1)
50-50: All newly added indirect dependencies are legitimate and secure; no issues found.
The verification confirms:
- Security: No vulnerabilities detected for any of the six indirect dependencies.
- Necessity: All are expected transitive dependencies from direct dependencies (k8s v0.34.1, controller-runtime v0.22.3, mongo-driver v1.17.4, prometheus v1.23.2).
- Legitimacy: Testing frameworks (ginkgo/v2, gomega), observability libraries (otel/metric, otel/trace), and development tooling (pprof, x/tools) align with the module's ecosystem and PR objectives.
The go.mod changes are correct and require no further action.
health-monitors/csp-health-monitor/go.mod (1)
33-33: LGTM: Routine patch version bump.
The cloud.google.com/go patch update is appropriate for dependency hygiene.
.github/dependabot.yml (1)
62-68: LGTM: Good practice to track commons module dependencies.
Adding commons to Dependabot monitoring is consistent with its expanded role as a shared library (now including config management utilities) and maintains dependency governance across the project.
commons/go.mod (1)
8-8: LGTM: Appropriate addition for config management.
Adding BurntSushi/toml v1.5.0 aligns with the new TOML-based configuration loading utilities being introduced in commons (LoadTOMLConfig). This is a standard, well-maintained library.
commons/Makefile (1)
1-1: LGTM: Documentation updates reflect evolved module scope.
Makefile updates accurately describe commons as providing both logging and configuration management utilities rather than solely logging initialization. These are documentation-only changes with no build logic impact.
Also applies to: 43-46
health-events-analyzer/go.mod (1)
40-41: LGTM: Expected OpenTelemetry transitive dependencies.
Addition of OpenTelemetry metric and trace v1.38.0 as indirect dependencies reflects coordinated observability improvements across the project. These are pulled in by MongoDB and gRPC instrumentation.
fault-remediation-module/go.mod (1)
49-49: LGTM: Infrastructure and testing improvements reflected in indirect dependencies.
New indirect dependencies support enhanced testing (ginkgo/gomega), observability (OpenTelemetry v1.38.0), profiling (pprof), and code tooling (golang.org/x/tools v0.38.0). These are appropriate for the fault remediation module's operational requirements.
Also applies to: 57-58, 69-70, 80-80
fault-quarantine-module/go.mod (3)
14-14: Verify controller-runtime and testify production usage.
Two direct dependencies were added:
- Line 14: github.com/stretchr/testify v1.11.1
- Line 22: sigs.k8s.io/controller-runtime v0.22.3
A past review questioned whether controller-runtime should be test-only. Confirm both are used in production code and not test utilities.
Also applies to: 22-22
27-27: Investigate YAML version divergence in indirect dependencies.
Two different versions of the YAML library appear in indirect dependencies:
- Line 79: go.yaml.in/yaml/v3 v3.0.4
- Line 93: gopkg.in/yaml.v3 v3.0.1
These are the same package (gopkg.in/yaml.v3 is an alias for go.yaml.in/yaml/v3). While Go modules can handle this in some cases, confirm this is intentional and not a transitive dependency conflict. A past review flagged this and it may need resolution.
Also applies to: 79-79, 93-93
27-27: LGTM: BurntSushi/toml architectural shift to commons.
Moving BurntSushi/toml from direct to indirect dependency aligns with the refactoring that centralizes TOML configuration loading in the commons module. This reflects the intended architectural improvement.
janitor/go.mod (1)
8-8: LGTM: Coordinated, project-wide dependency modernization.
The janitor module reflects a comprehensive update across multiple ecosystems (AWS SDK v2, Google Cloud, Kubernetes, OpenTelemetry, Prometheus, gRPC/protobuf). Version choices are consistent with updates across other modules in the PR:
- AWS SDK v2: v1.39.4+ (with credentials v1.18.19, ec2/imds v1.18.11, etc.)
- Kubernetes: v0.34.1
- OpenTelemetry: v1.38.0 (metric/trace)
- Google Cloud: v0.121.1
- gRPC/protobuf: v1.76.0/v1.36.10
This appears to be part of coordinated infrastructure modernization and improved observability across the project. The versions follow standard semantic versioning and no breaking changes are evident.
Also applies to: 12-12, 16-17, 29-48, 61-74, 84-120, 139-144
store-client-sdk/pkg/storewatcher/watch_store_mock.go (2)
54-76: LGTM!
The constructor properly initializes all function hooks with safe default behaviors, and the buffered channel size (10) correctly matches the documentation at line 28.
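For readers unfamiliar with this style of test double, the shape being reviewed can be sketched roughly as follows. This is a hypothetical miniature, not the SDK's mock: fakeWatcher, StartFn and startCalls are invented names, and only the buffer size of 10 comes from the comment above. It shows default function hooks set in the constructor and a reset that drains the buffered channel without blocking.

```go
package main

import "fmt"

// fakeWatcher is a hypothetical test double, not the SDK's actual mock.
type fakeWatcher struct {
	Events     chan string
	StartFn    func() error // overridable hook
	startCalls int
}

// newFakeWatcher wires a safe default hook and a buffered event channel.
func newFakeWatcher() *fakeWatcher {
	return &fakeWatcher{
		Events:  make(chan string, 10),
		StartFn: func() error { return nil },
	}
}

// Start records the call and delegates to the hook.
func (f *fakeWatcher) Start() error {
	f.startCalls++
	return f.StartFn()
}

// Reset clears call counters, restores the default hook, and drains any
// buffered events. The select/default makes the drain non-blocking.
func (f *fakeWatcher) Reset() {
	f.startCalls = 0
	f.StartFn = func() error { return nil }
	for {
		select {
		case <-f.Events:
		default:
			return
		}
	}
}

func main() {
	w := newFakeWatcher()
	w.Events <- "e1"
	w.Events <- "e2"
	_ = w.Start()
	w.Reset()
	fmt.Println(len(w.Events), w.startCalls) // 0 0
}
```

Draining rather than replacing the channel keeps any reference a test already holds to Events valid across resets.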
147-165: LGTM!
The Reset method properly clears all state and drains the EventsChan, addressing the concern from previous reviews. The implementation is correct for test mock reset behavior.
fault-quarantine-module/pkg/informer/node_informer.go (3)
55-58: LGTM!
The new GetInformer accessor and WaitForSync method with context support are well-implemented. The context-aware waiting at line 121 allows proper cancellation handling.
Also applies to: 117-129
145-168: LGTM!
The index-based approach for tracking quarantined nodes (lines 154-165) properly synchronizes node counting and quarantine status lookup, addressing previous concerns about race conditions between total node count and quarantined node tracking.
235-238: LGTM!
The simplified handleUpdateNode implementation focusing solely on manual uncordon detection aligns well with the refactored architecture that uses index-based node tracking instead of maintaining separate state.
commons/pkg/configmanager/loader.go (1)
48-54: LGTM!
The generic TOML loader implementation is clean and correct. The documentation (lines 41-47) appropriately demonstrates post-load validation pattern, which is the right place for domain-specific validation rather than in a generic loader.
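The split between a generic loader and caller-side validation might look roughly like this. It is a sketch under stated assumptions: the signature of LoadTOMLConfig is not shown in this thread, so loadConfig, breakerConfig and validate are hypothetical names, and encoding/json stands in for the TOML decoder purely to keep the example free of third-party imports.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// loadConfig is a hypothetical generic loader. It knows nothing about the
// config's meaning; it only reads and decodes. JSON is a stand-in here;
// the PR's loader decodes TOML.
func loadConfig[T any](path string, out *T) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read config %s: %w", path, err)
	}
	if err := json.Unmarshal(data, out); err != nil {
		return fmt.Errorf("decode config %s: %w", path, err)
	}
	return nil
}

// breakerConfig is an illustrative config type owned by the caller.
type breakerConfig struct {
	Percentage int `json:"percentage"`
}

// validate holds the domain rules, kept out of the generic loader.
func (c breakerConfig) validate() error {
	if c.Percentage <= 0 || c.Percentage > 100 {
		return errors.New("percentage must be in (0, 100]")
	}
	return nil
}

// loadBreakerConfig loads first, then applies domain validation.
func loadBreakerConfig(path string) (breakerConfig, error) {
	var cfg breakerConfig
	if err := loadConfig(path, &cfg); err != nil {
		return cfg, err
	}
	return cfg, cfg.validate()
}

func main() {
	path := filepath.Join(os.TempDir(), "breaker-sketch.json")
	if err := os.WriteFile(path, []byte(`{"percentage": 150}`), 0o600); err != nil {
		panic(err)
	}
	defer os.Remove(path)

	_, err := loadBreakerConfig(path)
	fmt.Println(err) // the file decodes fine; the domain check rejects it
}
```

Keeping validation in the caller means the loader stays reusable across modules whose configs have unrelated constraints.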
fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go (1)
46-48: LGTM!
The change from RuleEvaluationErroredOut to RuleEvaluationFailed aligns with the broader refactoring to simplify evaluation statuses. The error is still returned (line 47), allowing callers to distinguish between error cases and normal failures if needed.
commons/pkg/configmanager/loader_test.go (1)
29-95: LGTM!
The test coverage is comprehensive, including success, missing file, and invalid syntax cases. All past review concerns have been addressed: field assertions are present (lines 50-60), non-existent file test added, and filepath.Join is used consistently. Tests follow Go best practices with t.Parallel() and t.TempDir().
fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go (2)
75-76: LGTM!
The test expectation changes from RuleEvaluationErroredOut to RuleEvaluationFailed correctly align with the refactored status enum in rule_set_evaluator_any.go. The tests still validate that errors are returned (expectErr: true), maintaining proper error handling verification.
Also applies to: 95-96
293-293: LGTM!
Passing nil for the second parameter in InitializeRuleSetEvaluators and createEvaluators aligns with the refactoring to decouple evaluators from direct Kubernetes client dependencies. The test coverage remains comprehensive and validates the new signature correctly.
Also applies to: 367-367
fault-quarantine-module/Makefile (1)
29-33: No issues found with TEST_SETUP_COMMANDS usage.
Verification confirms that all recipes using TEST_SETUP_COMMANDS properly append additional commands after the variable expansion. The make/go.mk recipes (lint-test and test) both follow the variable with trailing commands, and all other module Makefiles follow the same correct pattern. No standalone uses causing syntax errors exist in the codebase.
fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go (1)
26-63: LGTM: cleaner initialization surface (no K8s client).
Signatures and error aggregation look good.
fault-quarantine-module/pkg/breaker/breaker_test.go (1)
353-356: The review comment is incorrect.
The codebase declares Go 1.25 in fault-quarantine-module/go.mod, which fully supports the "range over int" syntax introduced in Go 1.22. The code for range 10 { ... } is valid and requires no changes.
Likely an incorrect or invalid review comment.
Merging this branch will increase overall coverage
Coverage by file
Changed files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.
Summary
Type of Change
Component(s) Affected
Testing
Checklist
Tests performed on dev cluster manually:
- Injected an InfoROM DCGM error on a node through the GPU health monitor (using dcgmi test --host nvidia-dcgm.gpu-operator:5555 --inject --gpuid 0 -f 84 -v 0) and verified that the node is cordoned and that the quarantineHealthEvent annotation is applied to the node for the GPU 0 entity. Also verified that the node passes through the node drainer and fault remediation pipeline and is rebooted by janitor. Verified that the node is uncordoned and the annotation removed after the reboot.
- Manually uncordoned the node after injecting the InfoROM GPU error. Checked that FQ removed the quarantineHealthEvent annotation from the node.
- Injected a GPU InfoROM error for GPU 0 first, then for GPU 1. Verified that the annotation is updated with both entities when the node is cordoned. Then inserted a healthy event for GPU 1 and verified that the GPU 1 entity is removed from the annotation but the node remains cordoned. Then inserted a healthy event for GPU 0 and verified that the annotation is removed from the node and the node is uncordoned.
Summary by CodeRabbit
New Features
Refactor
Tests