@KaivalyaMDabhadkar (Contributor) commented Oct 24, 2025

Summary

Type of Change

  • πŸ› Bug fix
  • ✨ New feature
  • πŸ’₯ Breaking change
  • πŸ“š Documentation
  • πŸ”§ Refactoring
  • πŸ”¨ Build/CI

Component(s) Affected

  • Health Monitors
  • Core Services
  • Fault Management
  • Documentation/CI
  • Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Tests performed on dev cluster manually:

  1. Injected an inforom DCGM error on a node through the GPU health monitor (using dcgmi test --host nvidia-dcgm.gpu-operator:5555 --inject --gpuid 0 -f 84 -v 0) and verified that the node is cordoned and that the quarantineHealthEvent annotation is applied to the node for the GPU 0 entity. Also verified that the node passes through the node drainer and fault remediation pipeline and is rebooted through janitor. Verified that, after the reboot, the node is uncordoned and the annotation is removed.

  2. Manually uncordoned the node after injecting the inforom GPU error. Checked that FQ removed the quarantineHealthEvent annotation from the node.

  3. Injected a GPU inforom error for GPU 0 first, then for GPU 1. Verified that the annotation is updated with both entities while the node is cordoned. Then inserted a healthy event for GPU 1 and verified that the GPU 1 entity is removed from the annotation but the node remains cordoned. Then inserted a healthy event for GPU 0 and verified that the annotation is removed from the node and the node is uncordoned.
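
The behaviour checked in test 3 can be modelled roughly as follows. This is only a minimal in-memory sketch: `quarantineState`, its methods, and the `GPU:0` style entity keys are hypothetical names chosen for illustration, not the FQ implementation or the real annotation format.

```go
package main

import (
	"fmt"
	"sort"
)

// quarantineState is a hypothetical stand-in for a node's
// quarantineHealthEvent annotation plus its cordon flag.
type quarantineState struct {
	cordoned bool
	entities map[string]struct{} // unhealthy entities, e.g. "GPU:0"
}

func newQuarantineState() *quarantineState {
	return &quarantineState{entities: make(map[string]struct{})}
}

// markUnhealthy records a failing entity and cordons the node.
func (q *quarantineState) markUnhealthy(entity string) {
	q.entities[entity] = struct{}{}
	q.cordoned = true
}

// markHealthy removes one entity; the node is uncordoned only once no
// unhealthy entities remain.
func (q *quarantineState) markHealthy(entity string) {
	delete(q.entities, entity)
	if len(q.entities) == 0 {
		q.cordoned = false
	}
}

// annotation returns the tracked entities in sorted order.
func (q *quarantineState) annotation() []string {
	out := make([]string, 0, len(q.entities))
	for e := range q.entities {
		out = append(out, e)
	}
	sort.Strings(out)
	return out
}

func main() {
	q := newQuarantineState()
	q.markUnhealthy("GPU:0")
	q.markUnhealthy("GPU:1")

	q.markHealthy("GPU:1")
	fmt.Println(q.cordoned, q.annotation()) // true [GPU:0]

	q.markHealthy("GPU:0")
	fmt.Println(q.cordoned, q.annotation()) // false []
}
```

The point of the sketch is the ordering: a healthy event for one entity shrinks the annotation, and the uncordon happens only when the last entity is cleared.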

Summary by CodeRabbit

  • New Features

    • Added environment variable configuration with type validation and default values
    • Implemented TOML configuration file loading
    • Added circuit breaker state persistence via Kubernetes ConfigMaps
    • Introduced event watcher for processing health events from MongoDB
    • Expanded metrics exposure for circuit breaker and reconciliation operations
  • Refactor

    • Centralized component initialization workflow
    • Migrated state persistence to Kubernetes-native storage
    • Reorganized metrics package with improved naming conventions
    • Streamlined node quarantine and uncordon operations
    • Enhanced Kubernetes client integration throughout
  • Tests

    • Added comprehensive Kubernetes integration tests using envtest

copy-pr-bot bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Added config SDK and refactored FQ with bug fixes
- Added fake change stream client for MongoDB testing
- Added e2e tests
- Removed ini methods in configmanager
- Added missing configmanager to makefile and lint-test
- Merged upstream changes and resolved conflicts
@KaivalyaMDabhadkar force-pushed the kdabhadkar/FQ-refactorings branch from 6bae3e3 to 4e6c902 on October 24, 2025 14:01
@KaivalyaMDabhadkar marked this pull request as ready for review October 24, 2025 14:09

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR introduces a new configmanager SDK for centralized TOML and environment variable configuration management, and performs a major architectural refactoring of the fault-quarantine module. The configmanager provides generic utilities (LoadTOMLConfig, GetEnvVar, ReadEnvVars) that extract previously duplicated configuration logic into a reusable library. The fault-quarantine refactoring consolidates Kubernetes operations into a new FaultQuarantineClient, extracts initialization into a dedicated initializer package, moves MongoDB event watching into a separate EventWatcher abstraction, and simplifies the reconciler by eliminating the dual annotation cache. The rule evaluation state machine is simplified from 5 states to 2 (Success/Failed), and the circuit breaker switches from callback functions to interface-based dependency injection. The changes remove 2,877 lines while adding 3,157 new lines across test infrastructure (envtest adoption), improved error handling, and better separation of concerns.
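
For readers unfamiliar with the pattern, a type-safe environment variable helper with a default value and validation could look roughly like the sketch below. It is not the configmanager API: `getEnv`, `lookupFunc`, `parsePositiveInt`, and the `FQ_DEMO_WORKERS` variable are all hypothetical, and the real signatures of `GetEnvVar` and `ReadEnvVars` are not shown in this conversation.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc abstracts os.LookupEnv so the helper can be exercised without
// touching the real process environment.
type lookupFunc func(name string) (string, bool)

// getEnv is a hypothetical helper: it returns def when the variable is
// unset, and otherwise the parsed (and validated) value.
func getEnv[T any](lookup lookupFunc, name string, def T, parse func(string) (T, error)) (T, error) {
	raw, ok := lookup(name)
	if !ok {
		return def, nil
	}
	v, err := parse(raw)
	if err != nil {
		var zero T
		return zero, fmt.Errorf("invalid value %q for %s: %w", raw, name, err)
	}
	return v, nil
}

// parsePositiveInt combines parsing and validation for one concrete type.
func parsePositiveInt(raw string) (int, error) {
	n, err := strconv.Atoi(raw)
	if err != nil {
		return 0, err
	}
	if n <= 0 {
		return 0, fmt.Errorf("must be positive, got %d", n)
	}
	return n, nil
}

func main() {
	// Falls back to 4 when FQ_DEMO_WORKERS is not set.
	workers, err := getEnv[int](os.LookupEnv, "FQ_DEMO_WORKERS", 4, parsePositiveInt)
	fmt.Println(workers, err)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, is a design choice made here only to keep the sketch easy to test.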

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| configmanager/go.mod | 0/5 | References non-existent Go versions (1.24.0 and 1.24.8) causing build failure |
| fault-quarantine-module/Makefile | 2/5 | TEST_SETUP_COMMANDS variable contains broken Make syntax (eval with trailing &&) |
| fault-quarantine-module/main.go | 2/5 | nodeInformer.Run() blocks forever preventing reconciler from starting on line 106 |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 2/5 | Uses klog.Fatalf for transient errors causing process crashes; defer never executes |
| fault-quarantine-module/pkg/informer/k8s_client.go | 3/5 | Missing nil checks for nodeInformer causing potential panics; hardcoded 55-char truncation |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 3/5 | Unchecked type assertions in updateTaintMaps and collectTaintsToApply risk panics |
| fault-quarantine-module/pkg/common/common.go | 3/5 | Removes 3 evaluation result states; callers must be updated to handle simplified enum |
| fault-quarantine-module/pkg/informer/node_informer.go | 3/5 | Race condition between totalNodes read and indexer query in GetNodeCounts |
| configmanager/env.go | 3/5 | Variadic parameter parsing overwrites validators/defaults silently; incorrect required-var handling |
| fault-quarantine-module/pkg/breaker/types.go | 4/5 | Type inconsistency between CircuitBreakerConfig.Percentage (int) and Config.TripPercentage (float64) |
| fault-quarantine-module/pkg/breaker/breaker.go | 4/5 | Refactors to interface-based design improving testability; removes GPU-specific language |
| configmanager/loader.go | 4/5 | Generic TOML loader lacks input validation (empty path, nil config pointer) |
| configmanager/loader_test.go | 4/5 | Tests only happy path; missing error cases for malformed TOML and missing files |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 4/5 | Comprehensive envtest coverage but contains typos in constants and hardcoded assumptions |
| store-client-sdk/pkg/storewatcher/watchStoreMock.go | 4/5 | EventsChan never closed by mock; Start method doesn't track context parameter |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 5/5 | Refactors high-complexity function into smaller helpers; renames error state for consistency |
| fault-quarantine-module/pkg/config/config.go | 5/5 | Extracts LoadTomlConfig to configmanager; keeps domain-specific type definitions local |
| Makefile | 5/5 | Adds configmanager to GO_MODULES and lint-test targets following existing patterns |
| .github/workflows/lint-test.yml | 5/5 | Adds configmanager to CI lint-test matrix |
| .github/dependabot.yml | 5/5 | Adds Dependabot configuration for configmanager Go module |
| fault-quarantine-module/Dockerfile | 5/5 | Copies configmanager directory into build context alongside other dependencies |
| configmanager/Makefile | 5/5 | Standard module Makefile with HAS_DOCKER=0 appropriate for library modules |
| configmanager/go.sum | 5/5 | Auto-generated checksums for BurntSushi/toml v1.4.0 dependency |
| fault-quarantine-module/go.sum | 5/5 | Adds controller-runtime dependencies for refactored architecture |
| fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go | 5/5 | Removes unused kubernetes.Interface parameter simplifying API |

Confidence score: 1/5

  • This PR has multiple critical issues that will cause immediate build and runtime failures if merged
  • Score reflects blocking bugs: non-existent Go version preventing compilation, broken Makefile syntax, blocking goroutine preventing reconciler startup, process-crashing error handling in EventWatcher, and unchecked type assertions causing panics
  • Pay close attention to configmanager/go.mod (Go version), fault-quarantine-module/main.go (line 99-106 blocking issue), fault-quarantine-module/pkg/mongodb/event_watcher.go (klog.Fatalf usage), and all unchecked type assertions in reconciler.go and node_informer.go
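
The blocking-startup issue flagged for main.go follows a common pattern: a call that returns only on shutdown is invoked inline, so nothing after it runs. A minimal sketch of the usual remedy is below; `blockingRun` and `startAll` are hypothetical stand-ins and do not reproduce the actual main.go or the informer API.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// blockingRun stands in for a call such as an informer's Run method: it
// returns only once ctx is cancelled.
func blockingRun(ctx context.Context) {
	<-ctx.Done()
}

// startAll runs blockingRun in its own goroutine so that reconcile is still
// reached, then cancels the context and waits for the goroutine to exit.
func startAll(reconcile func()) {
	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		blockingRun(ctx) // called inline, this would never return
	}()

	reconcile() // reachable because the blocking call is off the main path

	cancel()
	wg.Wait()
}

func main() {
	startAll(func() { fmt.Println("reconciler started") })
}
```

In a real service the reconciler would itself run until shutdown; the sketch cancels immediately only so that it terminates.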

Additional Comments (2)

  1. fault-quarantine-module/go.mod, line 79-80 (link)

    syntax: non-standard import paths go.yaml.in/yaml/v2 and go.yaml.in/yaml/v3 - should be gopkg.in/yaml.v2 and gopkg.in/yaml.v3

  2. fault-quarantine-module/pkg/reconciler/reconciler.go, line 559-560 (link)

    logic: Type assertions without checking boolean return values. If the map contains non-string types, this will panic.
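
As a generic illustration of the fix this comment points toward, the two-value form of a type assertion reports a mismatch instead of panicking. The helper name `stringValue` and the map contents below are hypothetical and are not taken from reconciler.go.

```go
package main

import "fmt"

// stringValue returns m[key] as a string. It uses the two-value form of the
// type assertion, so a non-string value yields an error rather than a panic.
func stringValue(m map[string]any, key string) (string, error) {
	raw, present := m[key]
	if !present {
		return "", fmt.Errorf("key %q not found", key)
	}
	s, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf("key %q holds %T, want string", key, raw)
	}
	return s, nil
}

func main() {
	m := map[string]any{"effect": "NoSchedule", "count": 3}

	fmt.Println(stringValue(m, "effect"))
	fmt.Println(stringValue(m, "count"))
}
```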

43 files reviewed, 45 comments

@lalitadithya (Collaborator)

/ok to test 4e6c902

@github-actions

🛡️ CodeQL Analysis

🚨 Found 5 security alert(s)

🔗 View details

@github-actions

Merging this branch will not change overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/NVIDIA/nvsentinel/configmanager | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/breaker | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/common | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/config | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/evaluator | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/informer | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/initializer | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/metrics | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/mongodb | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/nodeinfo | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/reconciler | 0.00% (ø) | |
| github.com/NVIDIA/nvsentinel/store-client-sdk/pkg/storewatcher | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| --- | --- | --- | --- | --- | --- |
| github.com/NVIDIA/nvsentinel/configmanager/env.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/configmanager/loader.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/main.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/breaker/breaker.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/breaker/metrics.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/breaker/types.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/common/common.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/common/healthEventsBuffer.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/config/config.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client_interface.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/informer/node_informer.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/initializer/init.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/metrics/metrics.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/mongodb/event_watcher.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/nodeinfo/nodeinfo.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/reconciler/node_quarantine.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/nvsentinel/store-client-sdk/pkg/storewatcher/watchStoreMock.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/nvsentinel/configmanager/env_test.go
  • github.com/NVIDIA/nvsentinel/configmanager/loader_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/breaker/breaker_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/informer/node_informer_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/nodeinfo/nodeinfo_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/reconciler/node_quarantine_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go
  • github.com/NVIDIA/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler_test.go

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers the entire PR from start to finish.

This PR introduces a new shared configuration management SDK in the commons package and performs a comprehensive refactoring of the fault-quarantine module. The configmanager package provides type-safe environment variable parsing and TOML file loading with validation support. The fault-quarantine module's architecture has been restructured from a buffer-queue pattern to an event-driven callback model: the reconciler now receives events from a dedicated MongoDB event watcher via callbacks rather than managing its own buffer and change-stream. Core state management was simplified by eliminating redundant caches: the node informer now provides the single source of truth for quarantined node state via a custom indexer. The K8s client logic was extracted into a dedicated FaultQuarantineClient with an interface-based design, and the circuit breaker now uses dependency injection for K8s operations. New test infrastructure using envtest replaces fake clients for more realistic integration testing. The changes consolidate nearly 200 lines of initialization logic from main.go into a new initializer package, improve metrics organization by moving all Prometheus definitions to a dedicated metrics package, and simplify error handling by reducing rule evaluation result states from five to two (RuleEvaluationSuccess and RuleEvaluationFailed).
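
To make the callback model concrete, a watcher that forwards events to a registered callback, and refuses to run without one, could be sketched as below. The types `healthEvent` and `eventWatcher` and the use of a plain channel in place of a MongoDB change stream are assumptions made for illustration; the real EventWatcher is not shown in this conversation.

```go
package main

import (
	"errors"
	"fmt"
)

// healthEvent is a hypothetical, simplified event payload.
type healthEvent struct {
	Node    string
	Healthy bool
}

// eventWatcher forwards events from a channel (standing in for a change
// stream) to a callback supplied by the reconciler.
type eventWatcher struct {
	events  <-chan healthEvent
	onEvent func(healthEvent) error
}

// run drains the channel and hands each event to the callback. A missing
// callback is reported as an error instead of causing a nil-function panic.
func (w *eventWatcher) run() (int, error) {
	if w.onEvent == nil {
		return 0, errors.New("event watcher: no callback registered")
	}
	processed := 0
	for ev := range w.events {
		if err := w.onEvent(ev); err != nil {
			return processed, fmt.Errorf("processing event for node %s: %w", ev.Node, err)
		}
		processed++
	}
	return processed, nil
}

func main() {
	ch := make(chan healthEvent, 2)
	ch <- healthEvent{Node: "node-a", Healthy: false}
	ch <- healthEvent{Node: "node-a", Healthy: true}
	close(ch)

	w := &eventWatcher{events: ch, onEvent: func(ev healthEvent) error {
		fmt.Printf("node=%s healthy=%t\n", ev.Node, ev.Healthy)
		return nil
	}}
	n, err := w.run()
	fmt.Println(n, err)
}
```

Returning errors from `run` leaves the decision to retry or exit with the caller, which is one way to address the graceful-recovery concern raised in the file table.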


Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| fault-quarantine-module/main.go | 1/5 | Critical blocking bug: Informer.Run() blocks on line 78, preventing reconciler startup on line 85 |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 3/5 | Major refactor to callback-based event processing with bidirectional dependency on event watcher; silently ignores JSON marshal errors |
| fault-quarantine-module/pkg/breaker/breaker.go | 3/5 | Refactored to use K8s client interface; metric overwrite issue and unconditional os.Exit() on retry exhaustion |
| fault-quarantine-module/pkg/informer/k8s_client.go | 3/5 | New K8s client wrapper with nil pointer risks if SetNodeInformer not called; non-deterministic taint ordering |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 3/5 | New MongoDB change-stream watcher; nil callback panic risk and os.Exit() on errors prevent graceful recovery |
| fault-quarantine-module/pkg/informer/node_informer.go | 4/5 | Refactored to use indexer-based quarantine tracking; removes GPU-specific filtering but introduces eventual consistency model |
| fault-quarantine-module/pkg/initializer/init.go | 4/5 | Centralizes initialization logic; bidirectional dependency between reconciler and event watcher flagged |
| fault-quarantine-module/go.mod | 2/5 | Adds controller-runtime and testify dependencies but introduces invalid yaml dependency paths on lines 76-77 |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4/5 | Comprehensive e2e test suite with 40+ scenarios; potential goroutine leaks and duplicated test setup logic |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 3/5 | New envtest-based integration tests; hardcoded label value check on line 293 doesn't match test setup |
| commons/pkg/configmanager/loader.go | 5/5 | New generic TOML loader utility with clear documentation and proper error wrapping |
| commons/pkg/configmanager/env.go | 4/5 | New environment variable SDK with type-safe parsing; optional vars with empty defaults added to results map may cause ambiguity |
| fault-quarantine-module/pkg/metrics/metrics.go | 4/5 | Metrics moved from reconciler to dedicated package; Reset() in SetFaultQuarantineBreakerState could cause concurrency issues |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 4/5 | Refactored to reduce complexity by extracting helper functions; simplified error handling to binary success/fail model |
| fault-quarantine-module/pkg/breaker/types.go | 4/5 | Refactored from function-based to interface-based dependency injection; State type vs string inconsistency in interface |
| commons/go.mod | 4/5 | Module renamed from logger-sdk to commons and adds TOML dependency; invalid Go version declarations (1.24.0) |
| fault-quarantine-module/pkg/informer/k8s_client_interface.go | 4/5 | Interface moved from reconciler package; removed isUncordon parameter and added HandleManualUncordonCleanup method |
| fault-quarantine-module/pkg/config/config.go | 5/5 | Clean separation: removed LoadTomlConfig function, kept only struct definitions for TOML deserialization |
| fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go | 5/5 | Removed unused kubernetes.Interface parameter from initialization functions, a clean refactoring |
| commons/pkg/configmanager/env_test.go | 5/5 | Comprehensive unit tests for environment variable handling across all supported types with edge cases |
| commons/pkg/configmanager/loader_test.go | 4/5 | Basic happy-path test for TOML loader; missing error case coverage (malformed TOML, missing files) |
| fault-quarantine-module/pkg/breaker/breaker_test.go | 4/5 | Refactored to use consolidated mock; added test for flapping node de-duplication bug fix |
| fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go | 4/5 | Refactored to use envtest instead of fake client; polling-based sync may be fragile under load |
| store-client-sdk/pkg/storewatcher/watchStoreMock.go | 4/5 | New thread-safe mock for change-stream watcher; EventsChan documentation unclear about buffer size rationale |
| .github/dependabot.yml | 5/5 | Adds Dependabot monitoring for new commons directory following existing pattern |
| fault-quarantine-module/Makefile | 4/5 | Adds TEST_SETUP_COMMANDS for envtest; eval syntax won't execute correctly in Make variable |
| commons/Makefile | 5/5 | Documentation updated to reflect expanded scope (logging + config management) |

Confidence score: 2/5

  • This PR has critical blocking issues that will prevent the fault-quarantine module from functioning in production
  • Score reflects one showstopper bug (main.go line 78 blocks reconciler startup), multiple nil pointer panic risks, error handling issues with os.Exit() preventing recovery, and a broken go.mod dependency specification
  • Pay close attention to fault-quarantine-module/main.go (blocking bug), fault-quarantine-module/pkg/informer/k8s_client.go (nil pointer panics), fault-quarantine-module/pkg/mongodb/event_watcher.go (nil callback + process termination), fault-quarantine-module/go.mod (invalid yaml paths), and fault-quarantine-module/pkg/reconciler/reconciler.go (silently ignores marshal errors)

Additional Comments (2)

  1. fault-quarantine-module/pkg/breaker/breaker.go, line 284-290 (link)

    logic: when GetTotalNodes returns an error (line 286), the function immediately returns via handleGetTotalNodesError without performing retries. This means API errors bypass the retry loop entirely, and only zero-node results are retried. Is it intentional that transient API errors are not retried? Should transient API errors from GetTotalNodes also be retried, or should they immediately fail the circuit breaker check?

  2. fault-quarantine-module/pkg/breaker/breaker.go, line 418-430 (link)

    logic: GetTotalNodes is called again inside logRetriesExhausted (line 418), but if this call returns an error, the function logs it and exits with os.Exit(1) on line 429. However, if GetTotalNodes succeeds (err == nil), the code falls through to lines 432-439 and also calls os.Exit(1). Both branches exit the process, so the error check on line 420 is redundant: the function always exits regardless of success or failure.
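
One possible answer to the questions raised in these two comments is to retry both transient errors and zero-node results in the same loop, and to return an error to the caller rather than exiting the process. The sketch below assumes that policy; `totalNodesWithRetry`, `errTransient`, and the `get` function are hypothetical and do not reproduce breaker.go.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a placeholder for a temporary API failure.
var errTransient = errors.New("transient API error")

// totalNodesWithRetry calls get up to attempts times. Both an error and a
// zero-node result count as a failed attempt. When every attempt fails, the
// last failure is returned to the caller instead of terminating the process.
func totalNodesWithRetry(get func() (int, error), attempts int, delay time.Duration) (int, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		n, err := get()
		switch {
		case err != nil:
			lastErr = err
		case n == 0:
			lastErr = errors.New("zero nodes reported")
		default:
			return n, nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return 0, fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

func main() {
	calls := 0
	get := func() (int, error) {
		calls++
		if calls < 3 {
			return 0, errTransient
		}
		return 12, nil
	}
	n, err := totalNodesWithRetry(get, 5, 10*time.Millisecond)
	fmt.Println(n, err, calls)
}
```

Whether a failed check should ultimately stop the process is a policy decision for the maintainers; the sketch only shows that the decision can be made by the caller.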

40 files reviewed, 32 comments

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the most recent changes made since the last review, not the entire PR. The developer has addressed multiple previous review comments by: (1) correcting syntax errors in go.mod files (go version and toolchain), (2) adding comprehensive test coverage for the TOML loader and env var config manager, (3) improving test infrastructure by expanding envtest usage in node_informer_test.go and k8s_client_test.go, (4) fixing test naming typos ('COrdonned' → 'Cordoned', 'uncordonedReasonLabelkey' → 'uncordonedReasonLabelKey'), and (5) resolving the duplicate yaml.v3 dependency conflicts. These updates directly respond to feedback about build failures, missing tests, and code quality issues flagged in earlier iterations.

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| fault-quarantine-module/go.mod | 5/5 | Corrected invalid Go version (1.24.0 → 1.23.0) and toolchain (go1.24.8 → go1.23.8) per previous review feedback |
| commons/go.mod | 5/5 | Fixed toolchain version to go1.23.8 matching Go version 1.23.0, resolving build compatibility issue |
| fault-quarantine-module/go.sum | 5/5 | Cleaned up duplicate yaml.v3 entries, now consistently using v3.0.1 throughout dependencies |
| commons/pkg/configmanager/loader_test.go | 5/5 | Added three new test cases (happy path, non-existent file, invalid syntax) per previous review requesting test coverage |
| commons/pkg/configmanager/env_test.go | 5/5 | Added comprehensive test coverage for env var reading including edge cases and type validation |
| fault-quarantine-module/pkg/informer/k8s_client_test.go | 4.5/5 | Expanded envtest-based tests for taint/cordon operations, addressing request to replace mock clients with envtest |
| fault-quarantine-module/pkg/informer/node_informer.go | 4.5/5 | Refactored to remove GPU-specific filtering and add quarantine annotation indexer for efficient lookups |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4.5/5 | Fixed test naming typo ('COrdonned' → 'Cordoned') and expanded circuit breaker test coverage |
| fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go | 4.5/5 | Migrated from fake clientset to envtest for more accurate integration testing per review guidance |

Confidence score: 4.5/5

  • This round of changes successfully addresses critical syntax errors and test gaps from prior reviews, significantly improving build stability and test coverage
  • Score reflects resolved issues (Go version, duplicate deps, missing tests) with remaining minor concerns around test timing robustness and circular dependencies that were flagged but not yet resolved
  • Pay close attention to fault-quarantine-module/main.go (blocking informer startup), fault-quarantine-module/pkg/mongodb/event_watcher.go (nil callback), and the circular reconciler↔watcher dependency in pkg/initializer/init.go

Additional Comments (3)

  1. commons/go.mod, line 42-43 (link)

    syntax: conflicting yaml.v2 imports - line 42 uses invalid go.yaml.in/yaml/v2 path while line 43 uses correct go.yaml.in/yaml/v3. The v2 path should be gopkg.in/yaml.v2

  2. commons/go.mod, line 53 (link)

    syntax: duplicate gopkg.in/yaml.v3 dependency (also on line 43 as go.yaml.in/yaml/v3). This creates ambiguity and may cause import resolution issues

  3. fault-quarantine-module/pkg/reconciler/node_quarantine_test.go, line 1 (link)

    logic: entire test file (799 lines) was deleted, removing all test coverage for node quarantine operations (taint/untaint, cordon/uncordon, annotation handling). No replacement tests found in reconciler_e2e_test.go. This creates a critical gap in test coverage for core fault-quarantine functionality. Were these tests moved to another file, or is the coverage intentionally removed? The deleted tests covered critical edge cases (empty keys, non-existent nodes, partial removals, concurrent modifications) that are now untested

40 files reviewed, 41 comments

@github-actions

🛡️ CodeQL Analysis

🚨 Found 5 security alert(s)

🔗 View details

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The developer has made significant progress on the fault-quarantine (FQ) module refactoring. Recent updates include: addressing race condition issues in the circuit breaker's metric writes and taint map updates, fixing nil pointer dereferences in the event watcher and informer, resolving Go module version issues (Go 1.24.x corrected to 1.23.x), and adding envtest-based integration tests for better Kubernetes API validation. The refactoring continues to extract initialization logic, consolidate metrics into a centralized package, and introduce a config SDK for environment variable and TOML configuration management. The architecture now uses a callback-driven event processing model with a MongoDB change stream watcher, replacing the previous polling-based approach. Key structural improvements include per-node mutex locking to prevent concurrent update conflicts, circuit breaker state management via ConfigMap, and a cleaner separation of concerns between the reconciler, informer, and Kubernetes client.
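
Per-node mutex locking, as mentioned above, is commonly built from a guarded map of mutexes keyed by node name. The following is a minimal sketch of that general technique; `nodeLocks`, `withNode`, and `concurrentUpdates` are hypothetical names, and the actual k8s_client.go may be structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeLocks hands out one mutex per node name, so updates to the same node
// are serialized while different nodes can proceed in parallel.
type nodeLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newNodeLocks() *nodeLocks {
	return &nodeLocks{locks: make(map[string]*sync.Mutex)}
}

// forNode returns the mutex for name, creating it on first use.
func (n *nodeLocks) forNode(name string) *sync.Mutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	l, ok := n.locks[name]
	if !ok {
		l = &sync.Mutex{}
		n.locks[name] = l
	}
	return l
}

// withNode runs fn while holding the lock for the named node.
func (n *nodeLocks) withNode(name string, fn func()) {
	l := n.forNode(name)
	l.Lock()
	defer l.Unlock()
	fn()
}

// concurrentUpdates increments a shared counter from many goroutines, all
// targeting the same node, and returns the final count.
func concurrentUpdates(workers int) int {
	locks := newNodeLocks()
	var wg sync.WaitGroup
	count := 0
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			locks.withNode("node-a", func() { count++ })
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(concurrentUpdates(100)) // 100
}
```

The map in this sketch grows with the number of distinct node names and is never pruned, which may or may not be acceptable depending on cluster churn.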

Important Files Changed

| Filename | Score | Overview |
| --- | --- | --- |
| fault-quarantine-module/main.go | 3/5 | Fixed blocking informer startup and unused server variable; POD_NAMESPACE still retrieved but not passed to initializer |
| fault-quarantine-module/pkg/breaker/breaker.go | 4/5 | Refactored to use K8sClient interface for better testability; metrics moved to centralized package |
| fault-quarantine-module/pkg/reconciler/reconciler.go | 4/5 | Major refactoring to remove in-memory caching and use callback-driven event processing; added per-node atomic updates |
| fault-quarantine-module/pkg/informer/k8s_client.go | 4/5 | New file consolidating Kubernetes operations with per-node mutex locking and retry logic for conflict resolution |
| fault-quarantine-module/pkg/mongodb/event_watcher.go | 4/5 | New MongoDB change stream watcher with callback pattern; requires nil check for processEventCallback before invocation |
| fault-quarantine-module/pkg/initializer/init.go | 4/5 | New centralized initialization module; creates bidirectional dependency between reconciler and event watcher |
| fault-quarantine-module/pkg/metrics/metrics.go | 4/5 | Moved from reconciler package and renamed metrics to PascalCase; added circuit breaker observability |
| commons/pkg/configmanager/env.go | 4/5 | New config SDK for type-safe environment variable reading with validation and default value support |
| commons/pkg/configmanager/loader.go | 4/5 | New TOML configuration loader using generics; missing nil pointer validation for config parameter |
| fault-quarantine-module/pkg/informer/node_informer.go | 4/5 | Refactored to use cache indexing instead of internal state tracking; removed work signal channel |
| fault-quarantine-module/pkg/evaluator/rule_evaluator.go | 4/5 | Extracted complex logic into smaller helper functions; unified error return values |
| fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go | 4/5 | Comprehensive E2E tests using envtest; covers quarantine flows, circuit breaker, and conflict scenarios |
| fault-quarantine-module/go.mod | 4/5 | Added controller-runtime and testify dependencies; moved BurntSushi/toml to indirect |
| commons/go.mod | 3/5 | Renamed module from logger-sdk to commons; has conflicting yaml.v3 versions that need resolution |
| fault-quarantine-module/Makefile | 3/5 | Added envtest setup commands with incorrect eval syntax that won't execute properly |

Confidence score: 3/5

  • This PR contains significant architectural improvements but still has several implementation issues that need to be addressed before merging
  • Score reflects unresolved blocking bugs (nil pointer dereference in event_watcher.go line 165, blocking informer startup in main.go line 91, conflicting YAML dependencies in commons/go.mod), incomplete refactoring (POD_NAMESPACE retrieved but unused, metrics server never started), and potential race conditions (taint map updates, metric writes)
  • Pay close attention to fault-quarantine-module/main.go, fault-quarantine-module/pkg/mongodb/event_watcher.go, commons/go.mod, and fault-quarantine-module/Makefile as these files contain critical bugs that will cause runtime failures or build issues

Additional Comments (4)

  1. fault-quarantine-module/pkg/breaker/metrics.go, line 1 (link)

    logic: entire metrics file deleted - all circuit breaker metrics removed. The SetFaultQuarantineBreakerState and SetFaultQuarantineBreakerUtilization functions called in breaker.go will now fail at runtime with undefined reference errors

  2. fault-quarantine-module/pkg/reconciler/reconciler_test.go, line 1 (link)

    logic: Entire unit test file deleted. The reconciler now has zero unit test coverage - only e2e tests remain. Unit tests are crucial for fast, isolated testing of edge cases, error paths, and individual functions. Verify that all critical reconciler logic is adequately covered by e2e tests, or restore/recreate unit tests. Are all previous test scenarios now covered in reconciler_e2e_test.go? What is the test coverage strategy now that unit tests are removed - do the e2e tests cover all the scenarios previously tested in unit tests, including error handling and edge cases?

  3. fault-quarantine-module/pkg/common/health_events_buffer.go, line 1 (link)

    logic: entire HealthEventBuffer implementation deleted. This file is referenced in the repository and removing it will break compilation. Check that all imports and usages have been migrated to a replacement implementation. What replaces HealthEventBuffer? Have all references to common.HealthEventBuffer, common.NewHealthEventBuffer, and common.HealthEventInfo been removed or updated throughout the codebase?

  4. commons/go.mod, line 42-53 (link)

    syntax: conflicting yaml.v3 dependencies: go.yaml.in/yaml/v2 (line 42), go.yaml.in/yaml/v3 (line 43), and gopkg.in/yaml.v3 (line 53) with different versions (v3.0.4 and v3.0.1). The go.yaml.in import paths are non-standard - should be gopkg.in/yaml.v2 and gopkg.in/yaml.v3. This will cause module resolution failures.

40 files reviewed, 16 comments

@github-actions

Merging this branch changes the coverage (2 decrease, 3 increase)

| Impacted Packages | Coverage Δ | 🤖 |
| --- | --- | --- |
| github.com/nvidia/nvsentinel/commons/pkg/configmanager | 40.23% (+40.23%) | 🌟 |
| github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker | 30.20% (+4.30%) | 👍 |
| github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer | 38.44% (-0.54%) | 👎 |
| github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler | 27.23% (+9.02%) | 👍 |
| github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher | 64.15% (-11.86%) | 💀 |

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/configmanager/env.go 40.16% (+40.16%) 254 (+254) 102 (+102) 152 (+152) 🌟
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/breaker/breaker.go 30.20% (+4.61%) 831 (-173) 251 (-6) 580 (-167) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/node_informer.go 30.84% (-8.14%) 415 (-493) 128 (-226) 287 (-267) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler.go 27.23% (+9.14%) 2126 (-1245) 579 (-31) 1547 (-1214) 👍
github.com/nvidia/nvsentinel/store-client-sdk/pkg/storewatcher/watchStoreMock.go 0.00% (ø) 47 (+47) 0 47 (+47)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/configmanager/env_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/k8s_client_test.go
  • github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. Three targeted bug fixes were applied across test infrastructure and event-watcher shutdown handling in response to previous review feedback. The changes address (1) test-environment inconsistencies in commons/pkg/configmanager/loader_test.go by replacing a hardcoded absolute path with a temp-directory-based approach; (2) silent test-setup errors in commons/pkg/configmanager/env_test.go by adding explicit error checks during test-case validation; and (3) lost error information in fault-quarantine-module/pkg/mongodb/event_watcher.go when the MongoDB change-stream terminates unexpectedly. These fixes improve test reliability, portability, and runtime error diagnostics without altering any production logic or introducing new features. All changes are isolated to test code and error-handling paths, ensuring existing functionality remains intact.

Important Files Changed

Filename Score Overview
commons/pkg/configmanager/loader_test.go 5/5 Replaced hardcoded absolute path with temp-directory-based non-existent file path for improved test portability and consistency
commons/pkg/configmanager/env_test.go 5/5 Added error checks (t.Fatalf) for test-setup parse failures in bounds-checking tests to prevent false positives
fault-quarantine-module/pkg/mongodb/event_watcher.go 4/5 Introduced watchErr variable to capture and return unexpected watcher termination errors instead of silently returning nil

Confidence score: 4/5

  • This PR is safe to merge with minimal risk; all changes are defensive improvements to test infrastructure and error handling.
  • Score reflects well-contained fixes addressing previous review feedback, but one point deducted due to potential impact of the event-watcher error propagation change on caller behavior (callers may need to handle non-nil errors on shutdown that were previously always nil).
  • Pay close attention to fault-quarantine-module/pkg/mongodb/event_watcher.go - verify that all callers of EventWatcher.Start() correctly handle the new error return semantics when the watcher terminates unexpectedly (previously this always returned nil).

3 files reviewed, no comments


@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/configmanager 40.23% (+40.23%) 🌟
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/mongodb 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/mongodb/event_watcher.go 0.00% (ø) 440 (+440) 0 440 (+440)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/commons/pkg/configmanager/env_test.go
  • github.com/nvidia/nvsentinel/commons/pkg/configmanager/loader_test.go


@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. This iteration addresses numerous feedback points from previous reviews, including critical bug fixes and structural improvements.

Key Changes:

  • Critical bug fix in main.go: The blocking NodeInformer.Run() call (line 91) has been addressed in response to prior review feedback, though the implementation still prevents the reconciler and metrics server from starting. The informer must be launched in a goroutine before line 99.
  • Enhanced test infrastructure: Added comprehensive E2E tests in reconciler_e2e_test.go (3698 lines) covering quarantine flows, circuit breaker behavior, and dry-run modes. However, three circuit breaker tests contain race conditions where node visibility checks occur after event processing starts.
  • Refactored configuration management: The new commons/pkg/configmanager provides generic TOML loading and environment variable handling, consolidating previously scattered configuration logic.
  • Improved Kubernetes client abstraction: The FaultQuarantineClient in k8s_client.go now uses per-node mutexes and retry logic with backoff for conflict resolution, addressing concurrency concerns from earlier reviews.
  • Architectural simplification: Removed redundant caching layers (NodeInfo, HealthEventBuffer) in favor of direct informer cache access and callback-based event processing.

Important Files Changed

Filename Score Overview
fault-quarantine-module/main.go 1/5 Critical blocking bug: NodeInformer.Run() prevents reconciler and metrics server goroutines from starting; unused srv variable
fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go 3/5 Comprehensive 3698-line E2E test suite with race condition in three circuit breaker tests where node visibility checks happen after event sends
fault-quarantine-module/pkg/reconciler/reconciler.go 4/5 Major refactoring removes caching layers, introduces ProcessEvent callbacks, adds mutex-protected taint updates, and fixes circuit breaker error handling
fault-quarantine-module/pkg/informer/k8s_client.go 4/5 New FaultQuarantineClient with per-node locking and retry logic; potential label read issue in handleUncordon after applyLabels modifies node
fault-quarantine-module/pkg/breaker/breaker.go 4/5 Refactored to use K8sClient interface with ErrRetryExhausted for pod restart signaling; state read errors silently ignored at startup
fault-quarantine-module/pkg/mongodb/event_watcher.go 4/5 New EventWatcher encapsulates MongoDB change streams with callback processing; processEventCallback can be nil causing panic
fault-quarantine-module/pkg/initializer/init.go 4/5 Centralized initialization logic with circular dependency between reconciler and event watcher via SetEventWatcher
commons/pkg/configmanager/loader.go 5/5 Generic TOML config loader with proper error wrapping and documentation
fault-quarantine-module/pkg/metrics/metrics.go 3/5 Metrics moved to shared package with race condition in SetFaultQuarantineBreakerState where Reset() clears all label combinations
fault-quarantine-module/pkg/informer/node_informer.go 4/5 Refactored from GPU-specific to all-node tracking with custom indexers; GetNodeCounts has TOCTOU race between total and quarantined reads
commons/pkg/configmanager/env.go 4/5 Generic environment variable manager with type-safe retrieval; optional vars with empty defaults omitted from batch read results
fault-quarantine-module/pkg/reconciler/reconciler_test.go 0/5 Entire test file emptied, representing critical loss of unit test coverage for reconciler package
fault-quarantine-module/pkg/reconciler/node_quarantine.go 5/5 File deleted as part of refactoring; functionality moved to k8s_client.go with improved architecture
fault-quarantine-module/pkg/evaluator/rule_evaluator.go 4/5 Refactored structToInterface to reduce complexity; changed error return from RuleEvaluationErroredOut to RuleEvaluationFailed; minor double-checking of primitives
fault-quarantine-module/pkg/informer/k8s_client_test.go 4/5 Comprehensive unit tests using envtest with 8-char node names that could collide in parallel execution

Confidence score: 1/5

  • This PR contains blocking bugs that will prevent the application from starting correctly and represents a critical loss of test coverage
  • Score reflects critical issues in main.go (informer blocking startup sequence), race conditions in E2E tests (node visibility checks), empty reconciler_test.go file (complete loss of unit tests), nil callback panic risk in event_watcher.go, and unsafe metric Reset() in SetFaultQuarantineBreakerState that clears all label combinations
  • Pay close attention to fault-quarantine-module/main.go (must launch informer in goroutine), fault-quarantine-module/pkg/reconciler/reconciler_test.go (restore deleted tests), fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go (fix circuit breaker race conditions in lines 2386-2402 and similar patterns), fault-quarantine-module/pkg/mongodb/event_watcher.go (add nil check for processEventCallback), and fault-quarantine-module/pkg/metrics/metrics.go (fix SetFaultQuarantineBreakerState race condition)

Additional Comments (2)

  1. fault-quarantine-module/pkg/breaker/metrics.go, line 1 (link)

    logic: entire metrics package deleted but breaker.go likely still calls SetFaultQuarantineBreakerState and SetFaultQuarantineBreakerUtilization. Check that these calls were removed or moved to a different package to avoid undefined references. Were the metrics calls moved to a centralized metrics package, or should the breaker stop reporting metrics entirely?

  2. commons/go.mod, line 42-43 (link)

    syntax: duplicate gopkg.in/yaml.v3 entries with conflicting versions (v3.0.4 on line 43, v3.0.1 on line 53). go.mod should have only one version per module - run go mod tidy to resolve

39 files reviewed, 12 comments



@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The most recent updates include:

  1. Go version fixes: Corrected commons/go.mod and fault-quarantine-module/go.mod to use valid Go versions (go 1.23.0 and toolchain go1.23.8) instead of the non-existent 1.24.x versions that were previously causing build failures.

  2. YAML dependency cleanup: Fixed duplicate and malformed YAML package imports in commons/go.mod - removed the incorrect go.yaml.in/yaml/* paths and consolidated to the standard gopkg.in/yaml.v3 import, eliminating the critical build-breaking conflicts identified in the last review.

  3. Mock state tracking enhancement: Updated store-client-sdk/pkg/storewatcher/watchStoreMock.go to properly track circuit breaker state changes in the mock - WriteCircuitBreakerState now updates the internal state so subsequent ReadCircuitBreakerState calls return the current state rather than always returning the initial value.

These changes directly address the three most critical blocking issues from the previous review: invalid Go version syntax errors, YAML dependency conflicts that would prevent module resolution, and incorrect mock behavior that could cause test failures.
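
The mock behavior described in item 3 amounts to storing what was written so that later reads observe it. A minimal sketch, assuming simplified signatures: the method names WriteCircuitBreakerState and ReadCircuitBreakerState come from the review, but the `State` type, its values, and the `mockStateStore` type here are hypothetical and may not match watchStoreMock.go.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical circuit breaker state type.
type State string

const (
	StateClosed  State = "CLOSED"
	StateTripped State = "TRIPPED"
)

// mockStateStore remembers the last written state so that reads
// reflect earlier writes instead of always returning the initial value.
type mockStateStore struct {
	mu    sync.Mutex
	state State
}

func newMockStateStore(initial State) *mockStateStore {
	return &mockStateStore{state: initial}
}

// WriteCircuitBreakerState records the new state.
func (m *mockStateStore) WriteCircuitBreakerState(s State) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.state = s
	return nil
}

// ReadCircuitBreakerState returns the most recently written state.
func (m *mockStateStore) ReadCircuitBreakerState() (State, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.state, nil
}

func main() {
	store := newMockStateStore(StateClosed)
	_ = store.WriteCircuitBreakerState(StateTripped)

	current, _ := store.ReadCircuitBreakerState()
	fmt.Println("state after write:", current)
}
```

A mock that ignores writes can let tests pass or fail for reasons unrelated to the breaker logic, which is why the review treats this as a reliability fix.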

Important Files Changed

Filename Score Overview
commons/go.mod 5/5 Fixed Go version to 1.23.0, removed malformed yaml dependencies, consolidated to standard gopkg.in/yaml.v3
fault-quarantine-module/go.mod 5/5 Corrected Go version to 1.23.0 and toolchain to go1.23.8, fixed YAML dependency duplication
store-client-sdk/pkg/storewatcher/watchStoreMock.go 5/5 Added state tracking to WriteCircuitBreakerState so mock properly maintains circuit breaker state changes

Confidence score: 5/5

  • These changes directly fix the three critical build-breaking issues identified in the previous review
  • The Go version corrections (1.23.0 instead of 1.24.0) now reference valid, stable Go releases
  • YAML dependency consolidation removes the conflicting import paths that would cause module resolution failures
  • Mock state tracking fix ensures test reliability by making the mock behavior match real implementations

Additional Comments (5)

  1. commons/go.mod, line 42-43 (link)

    syntax: duplicate yaml.v2 import - line 42 uses go.yaml.in/yaml/v2 while line 43 uses go.yaml.in/yaml/v3, which is non-standard. The correct import paths should be gopkg.in/yaml.v2 and gopkg.in/yaml.v3 respectively. Line 53 already has the correct gopkg.in/yaml.v3 import, creating a conflict

  2. commons/go.mod, line 53 (link)

    syntax: duplicate yaml.v3 dependency - already declared on line 43 with different import path (go.yaml.in/yaml/v3 vs gopkg.in/yaml.v3). Run go mod tidy to resolve

  3. fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 89-103 (link)

    logic: function returns 'added' boolean but only checks !exists. If an event already exists, it is not overwritten, so updates never happen despite the 'AddOrUpdate' name. This means the IsFatal, IsHealthy, ErrorCodes, and Message fields of existing events never refresh even if the new event has updated values. Should the function update stored events when keys match, or should it be renamed to AddEvent to reflect its add-only behavior?

  4. fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 170-196 (link)

    style: RemoveEvent counts matches in a separate loop (lines 183-189) before deleting. This iterates the keys slice twice. Consider counting during deletion or removing the added variable if the return value isn't critical for performance

  5. fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 198-221 (link)

    style: removeAllEntitiesForCheck collects keysToRemove slice before deletion. For large maps this allocates extra memory. Consider deleting directly in the first loop if iteration-during-delete is safe in Go (it is)
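
Comments 3 and 5 above can be illustrated together. The sketch below shows one possible shape for an add-or-update that really overwrites existing entries, and a single-pass removal that deletes while ranging over the map, which the Go specification permits. The types `eventKey`, `healthEvent`, and `eventMap` are hypothetical simplifications, not the structures in health_events_annotation_map.go.

```go
package main

import "fmt"

// eventKey identifies an event by check name and entity (hypothetical).
type eventKey struct {
	Check  string
	Entity string
}

// healthEvent carries the fields that should refresh on update.
type healthEvent struct {
	IsFatal bool
	Message string
}

type eventMap map[eventKey]healthEvent

// addOrUpdate stores ev under key, overwriting any existing entry.
// It reports whether the key was newly added.
func (m eventMap) addOrUpdate(key eventKey, ev healthEvent) (added bool) {
	_, exists := m[key]
	m[key] = ev
	return !exists
}

// removeAllForCheck deletes every entry for the given check in one pass
// and returns how many were removed. Deleting the current key during a
// range loop is safe in Go.
func (m eventMap) removeAllForCheck(check string) int {
	removed := 0
	for k := range m {
		if k.Check == check {
			delete(m, k)
			removed++
		}
	}
	return removed
}

func main() {
	m := eventMap{}
	gpu0 := eventKey{Check: "inforom", Entity: "GPU:0"}

	fmt.Println("added:", m.addOrUpdate(gpu0, healthEvent{IsFatal: true, Message: "first"}))
	fmt.Println("added:", m.addOrUpdate(gpu0, healthEvent{IsFatal: true, Message: "updated"}))
	fmt.Println("stored message:", m[gpu0].Message)
	fmt.Println("removed:", m.removeAllForCheck("inforom"))
}
```

Whether overwriting is the right behavior depends on the answer to the reviewer's question; if add-only is intended, renaming the function is the smaller change.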

40 files reviewed, 11 comments


@github-actions

Merging this branch changes the coverage (1 decrease, 3 increase)

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/commons/pkg/configmanager 40.23% (+40.23%) 🌟
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation 44.89% (+7.22%) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer 38.44% (-0.54%) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler 26.81% (+8.59%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/commons/pkg/configmanager/env.go 40.16% (+40.16%) 254 (+254) 102 (+102) 152 (+152) 🌟
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go 44.89% (+7.22%) 421 (-86) 189 (-2) 232 (-84) 👍
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/informer/node_informer.go 30.84% (-8.14%) 415 (-493) 128 (-226) 287 (-267) 👎
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/reconciler/reconciler.go 26.81% (+8.71%) 2197 (-1174) 589 (-21) 1608 (-1153) 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny added the enhancement New feature or request label Oct 28, 2025

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This review covers only the changes made since the last review, not the entire PR.

The latest iteration continues the fault-quarantine refactoring with significant improvements: the Makefile TEST_SETUP_COMMANDS variable was corrected to remove the problematic trailing && operator; the go.mod files now properly reference Go 1.23.0 instead of non-existent 1.24.x versions; the metrics server initialization issue in main.go was addressed by starting the server in a goroutine before the blocking informer.Run() call; the circular dependency between reconciler and eventWatcher was resolved by passing the watcher during reconciler creation rather than using a setter; and the commons/go.mod malformed import paths (go.yaml.in/yaml) were fixed to use the correct gopkg.in/yaml paths. The configuration management SDK continues to mature with improved error handling patterns, and test coverage expanded with additional envtest-based integration tests for the k8s client. The changes align with the broader refactoring goal of introducing a config SDK while consolidating Kubernetes client operations into dedicated packages with proper separation of concerns.

Important Files Changed

Filename Score Overview
fault-quarantine-module/main.go 3/5 Fixed metrics server startup and informer blocking issues, but still has POD_NAMESPACE unused variable
commons/go.mod 2/5 Corrected Go version but still has duplicate gopkg.in/yaml.v3 entries with conflicting versions
fault-quarantine-module/go.mod 2/5 Corrected Go version but yaml.v3 duplicates and malformed go.yaml.in path persist
fault-quarantine-module/Makefile 4/5 TEST_SETUP_COMMANDS trailing && operator still causes syntax errors when expanded
fault-quarantine-module/pkg/initializer/init.go 4/5 Improved initialization consolidation but circular dependency between reconciler and watcher remains
fault-quarantine-module/pkg/reconciler/reconciler.go 3/5 Major refactoring improves structure but read-modify-write race conditions in annotation updates remain unaddressed
fault-quarantine-module/pkg/informer/k8s_client.go 4/5 Well-structured client abstraction with per-node locking, but uncordon reason reads from wrong source in handleUncordon
fault-quarantine-module/pkg/mongodb/event_watcher.go 3/5 Solid event watcher design but processEventCallback can be nil at Start() time causing panic risk
commons/pkg/configmanager/env.go 3/5 Useful config utilities but default values bypass validation and optional vars with empty defaults pollute results map
fault-quarantine-module/pkg/reconciler/reconciler_e2e_test.go 4/5 Comprehensive E2E coverage with proper blocking waits for circuit breaker tests

Confidence score: 3/5

  • This PR has made substantial progress on addressing previous review concerns but still contains several moderate-risk issues that should be resolved before merging
  • Score reflects remaining issues: go.mod syntax errors (duplicate yaml.v3, malformed paths), annotation update race conditions, nil callback panic risk in event watcher, and validation bypass in config manager that could allow invalid defaults into the system
  • Pay close attention to commons/go.mod and fault-quarantine-module/go.mod (resolve yaml.v3 duplicates and malformed import paths), fault-quarantine-module/pkg/mongodb/event_watcher.go (add nil check for processEventCallback), fault-quarantine-module/pkg/reconciler/reconciler.go (address read-modify-write race conditions in annotation operations), and commons/pkg/configmanager/env.go (ensure defaults go through validation)

Additional Comments (9)

  1. fault-quarantine-module/pkg/reconciler/reconciler_test.go, line 1 (link)

    logic: The entire test file has been deleted. This removes all unit tests for the reconciler, leaving only the e2e tests. This significantly reduces test coverage and makes it harder to catch regressions during development. Were the unit tests moved to another file, or is the plan to rely entirely on e2e tests for reconciler validation?

  2. commons/go.mod, line 42-43 (link)

    logic: duplicate gopkg.in/yaml.v3 entries with conflicting versions. Line 43 has v3.0.4 (indirect), line 53 has v3.0.1 (indirect). This will cause build failures or unexpected resolution behavior

  3. commons/go.mod, line 42 (link)

    syntax: malformed import path go.yaml.in/yaml/v2 should be gopkg.in/yaml.v2 (missing 'gop' prefix)

  4. fault-quarantine-module/pkg/reconciler/node_quarantine_test.go, line 1 (link)

    logic: entire test file deleted; no tests remain for TaintAndCordonNodeAndSetAnnotations, UnTaintAndUnCordonNodeAndRemoveAnnotations, GetNodeAnnotations, GetNodesWithAnnotation, or error handling paths. These functions are still in production code (node_quarantine.go) but now have zero test coverage. The PR description mentions bug fixes and refactoring, but removing all tests for critical node operations (taint/cordon/uncordon) seems risky. Was this deletion intentional? Were these tests moved to a different file, or is test coverage being dropped?

  5. fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 52-72 (link)

    logic: Entity and version fields are included in the key but not validated. If event or entity pointers are nil, this will panic. Add nil checks before accessing fields. Should createEventKeyForEntity handle nil event pointers, or can we assume it's always called with valid events?

  6. fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 89-103 (link)

    logic: AddOrUpdateEvent only adds events that don't exist; it never updates existing events even though the function name suggests it does. If an event already exists, the function returns false but the event remains unchanged. Should existing events be updated with the new version, or is the current add-only behavior intentional?

  7. fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go, line 252-266 (link)

    style: MarshalJSON deduplicates by comparing event pointers, but if the same event data exists at different memory addresses (e.g., after unmarshal or multiple adds), this won't detect the logical duplication. Is pointer-based deduplication sufficient, or should it compare event contents? Are events guaranteed to share the same pointer when logically identical (same entities/check), or could multiple event instances with identical data exist?

  8. fault-quarantine-module/pkg/reconciler/node_quarantine.go, line 1 (link)

    logic: Entire FaultQuarantineClient implementation (507 lines) deleted without replacement in this file. The type is still referenced throughout the codebase (pkg/initializer/init.go line 192, pkg/breaker/breaker.go, pkg/reconciler/reconciler.go line 87, tests). This will cause compilation failures across the module. Was FaultQuarantineClient moved to a different file in this refactoring, or is this an accidental deletion that should be reverted?

  9. fault-quarantine-module/pkg/breaker/breaker.go, line 414-426 (link)

    style: logRetriesExhausted calls GetTotalNodes again after all retries failed. If this call succeeds but previous calls returned 0, it proves a cache sync issue. But if this call also fails, the error message on line 418 says 'failed to get node count' even though the actual issue is zero_nodes. Should the final error message distinguish between 'API error on final attempt' and 'consistently returning zero', since both end up in the same error path?

41 files reviewed, 21 comments


@lalitadithya
Collaborator

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Oct 28, 2025

✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai bot commented Oct 28, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This pull request refactors the fault-quarantine-module to use a centralized initialization pattern with Kubernetes-backed state persistence. It adds environment variable utilities and TOML config loading to commons, introduces a new MongoDB event watcher, consolidates Kubernetes client operations, and reorganizes package structure with new initializer and metrics modules.

Changes

Cohort / File(s) Summary
Commons Configuration Utilities
commons/pkg/configmanager/env.go, commons/pkg/configmanager/env_test.go
Added generic environment variable utilities: GetEnvVar[T] for type-safe env var reading with validation, EnvVarSpec for batch configuration, and ReadEnvVars for reading multiple env vars with default handling. Includes comprehensive test coverage for type parsing, bounds checking, and error scenarios.
Commons TOML Support
commons/pkg/configmanager/loader.go, commons/pkg/configmanager/loader_test.go
Introduced LoadTOMLConfig[T] function to decode TOML files using BurntSushi/toml, with generic config support and unit tests for valid/invalid/missing file cases.
Commons Build Configuration
commons/Makefile, commons/go.mod, .github/dependabot.yml
Updated Makefile title and module notes; added BurntSushi/toml v1.5.0 dependency; added dependabot entry for commons gomod updates.
Fault Quarantine Build Setup
fault-quarantine-module/Makefile
Added TEST_SETUP_COMMANDS variable to encapsulate setup-envtest installation and configuration for test environment initialization.
Fault Quarantine Dependencies
fault-quarantine-module/go.mod
Reorganized dependencies: added controller-runtime v0.22.3 and testify v1.11.1 as direct; removed local replacements for commons and store-client-sdk; added numerous indirect dependencies for Kubernetes and telemetry support.
Fault Quarantine Bootstrap
fault-quarantine-module/main.go
Refactored main.go to use new initializer.InitializeAll pattern, replacing inline config loading and component wiring with centralized initialization; updated flag parsing to include TOML config path.
Initializer Package
fault-quarantine-module/pkg/initializer/init.go
New package providing orchestrated initialization: InitializationParams, Components, EnvConfig types and InitializeAll function that bootstrap MongoDB, Kubernetes client, circuit breaker, and reconciler with unified error handling.
Metrics Consolidation
fault-quarantine-module/pkg/metrics/metrics.go
Renamed from reconciler to metrics package; renamed metrics variables from unexported to exported (e.g., totalEventsReceived → TotalEventsReceived); added circuit breaker metrics (FaultQuarantineBreakerState, FaultQuarantineBreakerUtilization, etc.); introduced setter functions for breaker state/utilization.
MongoDB Event Watcher
fault-quarantine-module/pkg/mongodb/event_watcher.go
New module providing EventWatcher with change stream processing, event pipeline support, metrics integration, and callback-based event processing. Implements LastProcessedObjectIDStore interface for state persistence.
Circuit Breaker Kubernetes Integration
fault-quarantine-module/pkg/breaker/breaker.go, fault-quarantine-module/pkg/breaker/types.go, fault-quarantine-module/pkg/breaker/breaker_test.go
Refactored breaker to use Kubernetes-backed state persistence via new K8sClientOperations interface; added ErrRetryExhausted error; updated constructor to accept context; replaced in-process state management with ConfigMap-based persistence; introduced envtest-based testing.
Breaker Metrics Removal
fault-quarantine-module/pkg/breaker/metrics.go (deleted)
Removed local metrics implementation; metrics now centralized in metrics package.
Kubernetes Client Operations
fault-quarantine-module/pkg/informer/k8s_client.go, fault-quarantine-module/pkg/informer/k8s_client_interface.go, fault-quarantine-module/pkg/informer/k8s_client_test.go
New FaultQuarantineClient implementation providing circuit breaker ConfigMap management, node taint/label/annotation operations, cordon/uncordon logic, and per-node locking for thread-safe updates; includes comprehensive test suite with envtest integration.
Node Informer Refactoring
fault-quarantine-module/pkg/informer/node_informer.go, fault-quarantine-module/pkg/informer/node_informer_test.go (deleted)
Simplified NewNodeInformer constructor, removed NodeInfoProvider interface, added indexing by quarantine annotation, introduced GetNodeCounts, GetNode, ListNodes, WaitForSync accessors; replaced callback-based architecture with index-based queries; removed test file.
Reconciler Refactoring
fault-quarantine-module/pkg/reconciler/reconciler.go, fault-quarantine-module/pkg/reconciler/node_quarantine.go (deleted), fault-quarantine-module/pkg/reconciler/node_quarantine_test.go (deleted)
Updated NewReconciler constructor to accept k8sClient and circuitBreaker; added SetLabelKeys, SetEventWatcher, state persistence methods; deleted standalone FaultQuarantineClient implementation (moved to informer package).
Rule Evaluator Updates
fault-quarantine-module/pkg/evaluator/rule_evaluator.go, fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go, fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go, fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go, fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go
Replaced RuleEvaluationErroredOut with RuleEvaluationFailed; removed Kubernetes client dependency from rule set initialization; refactored node evaluation with structured conversion helpers; updated to envtest-based testing; removed unused error constants from common.go.
Config and Buffer Removals
fault-quarantine-module/pkg/config/config.go, fault-quarantine-module/pkg/common/health_events_buffer.go, fault-quarantine-module/pkg/nodeinfo/nodeinfo.go (deleted)
Removed LoadTomlConfig from config package (now in commons); deleted HealthEventBuffer (no longer needed); deleted NodeInfo (replaced by informer-based index queries).
Health Events Annotation
fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go
Removed map-clearing behavior from UnmarshalJSON to preserve pre-existing entries during unmarshaling.
Dependency Updates
fault-remediation-module/go.mod, health-events-analyzer/go.mod, health-monitors/csp-health-monitor/go.mod, janitor/go.mod, node-drainer-module/go.mod
Added indirect dependencies for testing frameworks (ginkgo, gomega), profiling (pprof), telemetry (opentelemetry), and tools; updated existing versions across AWS SDK, Kubernetes, and Google Cloud packages.
Test Utilities
store-client-sdk/pkg/storewatcher/watch_store_mock.go
Introduced FakeChangeStreamWatcher mock for testing with configurable function hooks and call counting.

Sequence Diagram(s)

sequenceDiagram
    participant Main
    participant Initializer
    participant K8sClient as FaultQuarantineClient
    participant NodeInformer
    participant CircuitBreaker
    participant MongoDB
    participant Reconciler
    participant EventWatcher

    Main->>Initializer: InitializeAll(ctx, params)
    Initializer->>Initializer: loadEnvConfig()
    Initializer->>Initializer: LoadTOMLConfig(path)
    Initializer->>K8sClient: NewFaultQuarantineClient(kubeconfig)
    K8sClient->>NodeInformer: Create & sync
    NodeInformer-->>K8sClient: Ready
    Initializer->>CircuitBreaker: NewSlidingWindowBreaker(ctx, config)
    CircuitBreaker->>K8sClient: EnsureCircuitBreakerConfigMap()
    K8sClient-->>CircuitBreaker: State initialized
    Initializer->>Reconciler: NewReconciler(config, k8sClient, breaker)
    Initializer->>EventWatcher: NewEventWatcher(...)
    Initializer->>Reconciler: SetEventWatcher(watcher)
    Initializer->>EventWatcher: Start(ctx)
    EventWatcher->>MongoDB: Watch change stream
    MongoDB-->>EventWatcher: Events
    EventWatcher->>Reconciler: ProcessEvent callback
    Reconciler->>NodeInformer: Evaluate rules
    NodeInformer-->>Reconciler: Matching nodes
    Reconciler->>K8sClient: QuarantineNodeAndSetAnnotations()
    K8sClient->>K8sClient: UpdateNode (with per-node lock)
    Initializer-->>Main: Components {Reconciler, EventWatcher, K8sClient, CircuitBreaker}

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

  • Reconciler refactoring: Complete restructuring of initialization, state management, and component wiring with multiple interacting changes across main.go, initializer, and reconciler packages.
  • Kubernetes client migration: Moved from reconciler package to informer package with significant API surface changes (string-based to breaker.State types, new ConfigMap persistence model).
  • Circuit breaker state persistence: Changed from in-memory to Kubernetes ConfigMap-backed, affecting all state reads/writes and retry logic.
  • Node informer simplification: Removed NodeInfoProvider interface, refactored caching strategy from explicit state management to informer-based indexing with new accessor patterns.
  • Event processing pipeline: New EventWatcher component with MongoDB integration requires careful verification of event flow, state persistence, and callback semantics.
  • Test infrastructure changes: Extensive migration from fake clients to envtest-based integration testing across multiple packages; requires validation of test setup/teardown and synchronization logic.
  • Metrics consolidation: Package rename and variable re-exports require careful checking of import paths and metric registration across all consumers.
  • Dependency reorganization: Removed local module replacements and reorganized go.mod files; requires verification that all indirect dependencies are correctly resolved.

Areas requiring extra attention:

  • Concurrent node updates in FaultQuarantineClient via per-node mutex map; verify lock scope and conflict retry semantics
  • Circuit breaker state initialization and synchronization between ConfigMap and in-memory state
  • EventWatcher callback integration with Reconciler; verify event ordering and error propagation
  • NodeInformer cache sync timing in tests; ensure WaitForSync barriers are correctly placed
  • TOML config loading path integration in main.go and initializer
  • Metric name exports and Prometheus scraping endpoint changes due to package rename

Poem

🐰 A rabbit hops through refactored code,
With Kubernetes backing each node's load,
From informers synced to ConfigMaps stored,
The quarantine logic newly poured,
State persists now in clusters, not memory's door,
Spring-cleaning done, the system restored! 🌱

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The title "feat: added config sdk and refactored FQ with bug fixes" accurately describes the main changes in the pull request. The config SDK addition to the commons module with environment variable and TOML configuration utilities is clearly referenced, and the refactoring of the Fault Quarantine (FQ) module with Kubernetes-backed state management, the new initializer pattern, and various bug fixes in node quarantine handling are appropriately captured. While the FQ refactoring could be described with more specificity, the title provides sufficient clarity for a developer scanning commit history to understand that this PR introduces new configuration tooling and substantially refactors the fault quarantine component.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


@lalitadithya lalitadithya added this to the v0.3.0 milestone Oct 29, 2025
lalitadithya
lalitadithya previously approved these changes Oct 29, 2025
@lalitadithya
Collaborator

@CodeRabbit review

@coderabbitai

coderabbitai bot commented Oct 29, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
fault-quarantine-module/pkg/breaker/breaker.go (1)

49-59: Validate window duration to prevent zero buckets.

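With the ceiling division below, a zero or negative cfg.Window yields zero buckets and negative indices later; a positive sub-second window rounds up to one bucket. Guard early.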

 func NewSlidingWindowBreaker(ctx context.Context, cfg Config) (CircuitBreaker, error) {
- numBuckets := int((cfg.Window + time.Second - 1) / time.Second)
+ if cfg.Window < time.Second {
+   return nil, fmt.Errorf("window must be >= 1s; got %v", cfg.Window)
+ }
+ numBuckets := int((cfg.Window + time.Second - 1) / time.Second)
♻️ Duplicate comments (12)
fault-quarantine-module/pkg/breaker/types.go (1)

24-30: Consider using State type in interface methods for type safety.

The interface methods ReadCircuitBreakerState and WriteCircuitBreakerState use string for the state parameter/return value, while the State type (line 41) provides an explicit type with defined constants (StateClosed, StateTripped). Using State instead of string would provide compile-time type safety and prevent invalid state values.

Apply this diff:

 type K8sClientOperations interface {
 	GetTotalNodes(ctx context.Context) (int, error)
-	EnsureCircuitBreakerConfigMap(ctx context.Context, name, namespace string, initialStatus State) error
-	ReadCircuitBreakerState(ctx context.Context, name, namespace string) (State, error)
-	WriteCircuitBreakerState(ctx context.Context, name, namespace string, status State) error
+	EnsureCircuitBreakerConfigMap(ctx context.Context, name, namespace string, initialStatus State) error
+	ReadCircuitBreakerState(ctx context.Context, name, namespace string) (State, error)
+	WriteCircuitBreakerState(ctx context.Context, name, namespace string, status State) error
 }

On re-reading, the interface already uses the State type (lines 27-29), so the diff above is a no-op and this past comment is likely outdated.

fault-quarantine-module/pkg/evaluator/rule_evaluator.go (2)

231-255: Primitive cases duplicated in handleComplexType.

These cases are unreachable since structToInterface already returns primitives earlier. Safe to remove for clarity.


194-212: primitiveKinds package var immutability/style.

Read-only after init; consider localizing or documenting immutability. Not blocking.

commons/pkg/configmanager/env_test.go (1)

116-121: Assert error mentions missing variable name.

Strengthen the test by checking the error includes the var name.

Apply:

@@
-import (
+import (
 	"fmt"
 	"math"
 	"strconv"
 	"testing"
+	"strings"
 )
@@
 	t.Run("missing required returns error", func(t *testing.T) {
 		_, err := GetEnvVar[int]("TEST_MISSING_REQUIRED", nil, nil)
 		if err == nil {
 			t.Error("expected error for missing env var but got none")
 		}
+		if err != nil && !strings.Contains(err.Error(), "TEST_MISSING_REQUIRED") {
+			t.Errorf("error should mention missing var name, got: %v", err)
+		}
 	})

Also applies to: 17-22

fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go (1)

174-178: Use collision-proof node names in tests.

Truncating ObjectID to 8 chars can collide in parallel runs. Prefer full hex or suffix with t.Name().

- nodeName := "test-node-" + primitive.NewObjectID().Hex()[:8]
+ nodeName := "test-node-" + primitive.NewObjectID().Hex() + "-" + strings.ReplaceAll(t.Name(), "/", "_")
fault-quarantine-module/pkg/initializer/init.go (2)

87-91: Magic number: 30m informer resync/timeout is hardcoded.

Make it configurable via InitializationParams or env var to tune in different clusters.

- k8sClient, err := informer.NewFaultQuarantineClient(params.KubeconfigPath, params.DryRun, 30*time.Minute)
+ k8sClient, err := informer.NewFaultQuarantineClient(params.KubeconfigPath, params.DryRun, params.InformerResync)

(Add InformerResync time.Duration to InitializationParams.)


141-144: Bidirectional wiring between Reconciler and EventWatcher.

SetEventWatcher after passing reconciler into NewEventWatcher creates a circular dependency. Consider injecting the callback or an interface instead.
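
One way to read "injecting the callback" is sketched below. All types here (Event, Status, Watcher, Reconciler) are simplified, hypothetical stand-ins and not the PR's actual definitions; the point is only that the watcher depends on a function type while the reconciler holds no reference back to the watcher.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Event and Status are hypothetical stand-ins for the model types.
type Event struct{ NodeName string }
type Status string

// ProcessFunc is all the watcher needs to know about its consumer.
type ProcessFunc func(ctx context.Context, ev *Event) *Status

// Watcher is a simplified stand-in for EventWatcher; it replays a fixed slice
// of events instead of reading a change stream.
type Watcher struct {
	process ProcessFunc
	events  []Event
}

// NewWatcher rejects a nil callback up front, so no later nil check is needed.
func NewWatcher(process ProcessFunc, events []Event) (*Watcher, error) {
	if process == nil {
		return nil, errors.New("process callback must not be nil")
	}
	return &Watcher{process: process, events: events}, nil
}

// Run feeds every event to the injected callback and collects the statuses.
func (w *Watcher) Run(ctx context.Context) []Status {
	var out []Status
	for i := range w.events {
		if st := w.process(ctx, &w.events[i]); st != nil {
			out = append(out, *st)
		}
	}
	return out
}

// Reconciler is a simplified stand-in that never references the watcher.
type Reconciler struct{}

func (r *Reconciler) ProcessEvent(_ context.Context, ev *Event) *Status {
	st := Status("quarantined:" + ev.NodeName)
	return &st
}

// runDemo wires the two components in one direction only.
func runDemo() ([]Status, error) {
	r := &Reconciler{}
	w, err := NewWatcher(r.ProcessEvent, []Event{{NodeName: "node-a"}})
	if err != nil {
		return nil, err
	}
	return w.Run(context.Background()), nil
}

func main() {
	fmt.Println(runDemo())
}
```

With this shape the component that owns both objects (the initializer, in this PR's terms) would start the watcher, and a setter such as SetEventWatcher would no longer be required.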

fault-quarantine-module/pkg/reconciler/reconciler.go (1)

156-164: Add nil-guard for eventWatcher before Start() dereference.

Lines 156-162 dereference r.eventWatcher without checking nil. While the normal production flow (via InitializeAll) calls SetEventWatcher before Start(), alternative paths, such as direct reconciler instantiation in tests, could bypass this, causing a panic. Defensive nil checks are warranted.

+	if r.eventWatcher == nil {
+		return fmt.Errorf("event watcher not configured; call SetEventWatcher before Start")
+	}
+
	r.eventWatcher.SetProcessEventCallback(
		func(ctx context.Context, event *model.HealthEventWithStatus) *model.Status {
			return r.ProcessEvent(ctx, event, ruleSetEvals, rulesetsConfig)
		},
	)

	if err := r.eventWatcher.Start(ctx); err != nil {
		return fmt.Errorf("event watcher failed: %w", err)
	}
fault-quarantine-module/pkg/metrics/metrics.go (1)

183-186: Avoid GaugeVec.Reset in hot paths; prefer stable series and set 0/1 for known states

Reset clears all label combinations and can be noisy under scrape; also previously raised. If single-goroutine is guaranteed, fine, but consider a safer pattern: set 1 for current state and 0 for the rest to keep time series stable. Also, fix state values to a small enum to avoid cardinality creep.

Apply this diff:

-func SetFaultQuarantineBreakerState(state string) {
-	FaultQuarantineBreakerState.Reset()
-	FaultQuarantineBreakerState.WithLabelValues(state).Set(1)
-}
+func SetFaultQuarantineBreakerState(state string) {
+	// Known states; keep series stable
+	for _, s := range []string{"closed", "open", "half_open"} {
+		val := 0.0
+		if s == state {
+			val = 1.0
+		}
+		FaultQuarantineBreakerState.WithLabelValues(s).Set(val)
+	}
+}

If you prefer to keep Reset, please confirm single-goroutine invocation and that you're okay with transient 0s for all states during collection.

fault-quarantine-module/pkg/informer/k8s_client.go (2)

386-393: Why 55-char truncation? K8s label value limit is 63

Unless there's a specific reason, align with 63 to avoid unnecessary loss of information.

-			if len(uncordonReason) > 55 {
-				uncordonReason = uncordonReason[:55]
+			if len(uncordonReason) > 63 {
+				uncordonReason = uncordonReason[:63]
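
A plain byte cut at the limit can leave the value ending in a separator, and Kubernetes requires a non-empty label value to begin and end with an alphanumeric character. The helper below, truncateLabelValue, is a hypothetical sketch (not code from the PR) that truncates to 63 bytes and then trims trailing characters that would make the value invalid; it assumes the input otherwise consists of valid label characters.

```go
package main

import (
	"fmt"
	"strings"
)

// maxLabelValueLen is the Kubernetes limit for a label value.
const maxLabelValueLen = 63

func isAlnum(b byte) bool {
	return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
}

// truncateLabelValue cuts s to at most 63 bytes and strips trailing
// non-alphanumeric bytes so the result still ends with a valid character.
func truncateLabelValue(s string) string {
	if len(s) > maxLabelValueLen {
		s = s[:maxLabelValueLen]
	}
	for len(s) > 0 && !isAlnum(s[len(s)-1]) {
		s = s[:len(s)-1]
	}
	return s
}

// repeatA builds a string of n 'a' characters for the demo.
func repeatA(n int) string { return strings.Repeat("a", n) }

func main() {
	long := repeatA(62) + "-manual-uncordon"
	out := truncateLabelValue(long)
	fmt.Println(len(out), out)
}
```

In the demo the 63rd byte is a dash, so the result is 62 characters long rather than 63.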

311-318: Do not skip uncordon/cleanup when taints-to-remove don't match

Early-returning here prevents uncordon, annotation, and label cleanup if the specified taints aren't present. That contradicts expected idempotent unquarantine semantics.

Apply this diff:

-		if len(taints) > 0 {
-			if shouldReturn := c.removeTaints(node, taints, nodename); shouldReturn {
-				return nil
-			}
-		}
+		if len(taints) > 0 {
+			c.removeTaints(node, taints, nodename)
+		}

And change removeTaints to have no boolean return (see next comment).

fault-quarantine-module/pkg/mongodb/event_watcher.go (1)

163-171: Nil callback can panic; guard before invocation

If SetProcessEventCallback wasn't called, this will panic.

Apply this diff:

-	startTime := time.Now()
-	status := w.processEventCallback(ctx, &healthEventWithStatus)
+	startTime := time.Now()
+	var status *model.Status
+	if w.processEventCallback != nil {
+		status = w.processEventCallback(ctx, &healthEventWithStatus)
+	} else {
+		slog.Warn("processEventCallback not set; skipping event callback")
+	}
🧹 Nitpick comments (17)
fault-quarantine-module/pkg/breaker/types.go (1)

32-38: Clarify CircuitBreakerConfig usage in this file.

The CircuitBreakerConfig struct is defined but its relationship to the Config struct and how it's used is not evident from this file alone. Consider adding a comment explaining its purpose or how it relates to Config, especially since both contain related circuit breaker configuration.

fault-quarantine-module/pkg/evaluator/rule_evaluator.go (2)

316-328: Map key handling can be lossy for non-string keys.

key.String() is only correct for string keys. Use string keys or fmt.Sprint fallback.

Apply:

 func handleMap(v reflect.Value) interface{} {
   if v.IsNil() {
     return nil
   }
   mapResult := make(map[string]interface{})
   for _, key := range v.MapKeys() {
-    mapResult[key.String()] = structToInterface(v.MapIndex(key))
+    var k string
+    if key.Kind() == reflect.String {
+      k = key.String()
+    } else {
+      k = fmt.Sprint(key.Interface())
+    }
+    mapResult[k] = structToInterface(v.MapIndex(key))
   }
   return mapResult
 }
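
The lossiness is easy to reproduce in isolation: reflect.Value.String does not panic on non-string kinds but returns a placeholder derived from the type, so all keys of a map[int]string collapse to one entry. The sketch below uses hypothetical helper names (naiveKey, keyToString, flattenKeys) and is independent of the PR's structToInterface.

```go
package main

import (
	"fmt"
	"reflect"
)

// naiveKey is what the original code does: correct only for string keys.
func naiveKey(key reflect.Value) string { return key.String() }

// keyToString follows the suggested fix: use the string value when the key is
// a string, otherwise fall back to fmt.Sprint of the underlying value.
func keyToString(key reflect.Value) string {
	if key.Kind() == reflect.String {
		return key.String()
	}
	return fmt.Sprint(key.Interface())
}

// flattenKeys collects the converted keys of any map value.
func flattenKeys(m interface{}, conv func(reflect.Value) string) map[string]bool {
	out := map[string]bool{}
	for _, k := range reflect.ValueOf(m).MapKeys() {
		out[conv(k)] = true
	}
	return out
}

func main() {
	m := map[int]string{1: "a", 2: "b"}
	fmt.Println(len(flattenKeys(m, naiveKey)))    // keys collide
	fmt.Println(len(flattenKeys(m, keyToString))) // keys stay distinct
}
```

One caveat, stated as an assumption about the surrounding code: Interface panics for values obtained through unexported struct fields, so the fallback is only safe if handleMap never sees such maps.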

339-359: Avoid double JSON round-trip when already a map.

Minor perf: if structToInterface returns map[string]interface{}, return it directly.

Apply:

 func RoundTrip(v interface{}) (map[string]interface{}, error) {
   val := reflect.ValueOf(v)
-  obj := structToInterface(val)
+  obj := structToInterface(val)
+  if m, ok := obj.(map[string]interface{}); ok {
+    return m, nil
+  }
   b, err := json.Marshal(obj)
   if err != nil {
     return nil, fmt.Errorf("failed to marshal intermediate object: %w", err)
   }
   var j interface{}
   if err := json.Unmarshal(b, &j); err != nil {
     return nil, fmt.Errorf("failed to unmarshal JSON back to map: %w", err)
   }
   m, ok := j.(map[string]interface{})
   if !ok {
     return nil, fmt.Errorf("expected JSON object after roundtrip")
   }
   return m, nil
 }
commons/pkg/configmanager/env.go (1)

155-167: Broaden boolean parsing (optional).

Many envs use 1/0, yes/no, y/n. Consider supporting these alongside true/false.

Example:

 func parseBool(valueStr string) (bool, error) {
   valueStr = strings.ToLower(strings.TrimSpace(valueStr))
   switch valueStr {
-  case "true":
+  case "true", "1", "yes", "y":
     return true, nil
-  case "false":
+  case "false", "0", "no", "n":
     return false, nil
   default:
     return false, fmt.Errorf("invalid boolean value: %s (must be 'true' or 'false')", valueStr)
   }
 }

Update error message accordingly if you adopt the broader set.

fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go (1)

65-100: Minor: preallocate evaluators and improve error context.

  • Preallocate capacity to len(rules).
  • Include rule.Expression in unknown kind error for easier triage.
-func createEvaluators(rules []config.Rule, nodeInformer *informer.NodeInformer) ([]RuleEvaluator, error) {
-	evaluators := []RuleEvaluator{}
+func createEvaluators(rules []config.Rule, nodeInformer *informer.NodeInformer) ([]RuleEvaluator, error) {
+	evaluators := make([]RuleEvaluator, 0, len(rules))
 ...
-	default:
-		err = fmt.Errorf("unknown evaluator kind: %s", rule.Kind)
+	default:
+		err = fmt.Errorf("unknown evaluator kind: %s (expr=%q)", rule.Kind, rule.Expression)
fault-quarantine-module/pkg/reconciler/reconciler.go (3)

661-666: Record breaker events only after successful quarantine.

Currently AddCordonEvent is called before K8s mutations; a subsequent failure overcounts. Move it after a successful QuarantineNodeAndSetAnnotations.

- r.recordCordonEventInCircuitBreaker(event)
...
- err := r.k8sClient.QuarantineNodeAndSetAnnotations(
+ err := r.k8sClient.QuarantineNodeAndSetAnnotations(
    ctx, event.HealthEvent.NodeName, taintsToBeApplied, isCordoned.Load(), annotationsMap, labels,
  )
  if err != nil {
    ...
    return nil
  }
+ r.recordCordonEventInCircuitBreaker(event)

701-718: Avoid double rule evaluation on quarantined nodes.

eventMatchesAnyRule re-evaluates rules already computed in evaluateRulesets. Cache pass/fail per ruleset for the event and reuse to cut duplicate work.


496-503: Defensive type-assert on labelsMap value.

Cordoned reason concatenation assumes string; guard to avoid panic if map gets polluted.

- if oldReasonVal, exist := labelsMap.Load(r.cordonedReasonLabelKey); exist {
-   oldCordonReason := oldReasonVal.(string)
+ if oldReasonVal, exist := labelsMap.Load(r.cordonedReasonLabelKey); exist {
+   oldCordonReason, _ := oldReasonVal.(string)
    newCordonReason = oldCordonReason + "-" + newCordonReason
 }
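
Note that discarding the ok result still concatenates, so a non-string value would produce a reason with a leading "-". A minimal sketch that branches on ok instead is shown below; it assumes labelsMap is a sync.Map (the Load call suggests this, but the PR excerpt does not confirm it), and mergeCordonReason and demoMerge are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// mergeCordonReason appends newReason to the stored reason only when the
// stored value really is a non-empty string; otherwise it starts fresh.
func mergeCordonReason(labels *sync.Map, key, newReason string) string {
	oldVal, exists := labels.Load(key)
	if !exists {
		return newReason
	}
	oldReason, ok := oldVal.(string)
	if !ok || oldReason == "" {
		return newReason
	}
	return oldReason + "-" + newReason
}

// demoMerge stores old (if non-nil) under a fixed key and merges newReason.
func demoMerge(old interface{}, newReason string) string {
	const key = "cordoned-reason" // hypothetical label key
	var labels sync.Map
	if old != nil {
		labels.Store(key, old)
	}
	return mergeCordonReason(&labels, key, newReason)
}

func main() {
	fmt.Println(demoMerge("gpu-0-error", "gpu-1-error"))
	fmt.Println(demoMerge(42, "gpu-1-error"))
	fmt.Println(demoMerge(nil, "gpu-1-error"))
}
```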
fault-quarantine-module/pkg/breaker/breaker.go (2)

61-63: Replace Go 1.22-only integer ranges with classic loops for portability.

Avoid for range int to support older Go versions.

- for i := range numBuckets {
+ for i := 0; i < numBuckets; i++ {
    b.indexToNodes[i] = make(map[string]bool)
  }
...
- for range steps {
+ for i := 0; i < steps; i++ {
   ...
-  for i := range len(b.indexToNodes) - 1 {
+  for i := 0; i < len(b.indexToNodes)-1; i++ {
     b.indexToNodes[i] = b.indexToNodes[i+1]
   }

Confirm Go toolchain (see script in breaker_test.go comment).

Also applies to: 108-134, 121-126


97-103: Use slice instead of map[int] for buckets-to-nodes mapping.

indexToNodes is indexed densely; a slice []map[string]bool is simpler and faster than map[int]…. Optional cleanup.

High-level:

  • Change indexToNodes to []map[string]bool
  • Initialize with make([]map[string]bool, numBuckets)
  • Shift with copy(s[:], s[1:])

Also applies to: 121-126
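
The three bullets above can be sketched as follows. The window type and its methods are hypothetical and much simpler than the PR's breaker; the sketch assumes that advancing the window drops the oldest bucket and opens an empty newest one.

```go
package main

import "fmt"

// window keeps one set of node names per bucket, oldest first.
type window struct {
	buckets []map[string]bool
}

// newWindow allocates n buckets; n is clamped to 1 so advance never indexes
// an empty slice.
func newWindow(n int) *window {
	if n < 1 {
		n = 1
	}
	b := make([]map[string]bool, n)
	for i := range b {
		b[i] = make(map[string]bool)
	}
	return &window{buckets: b}
}

// advance shifts every bucket one slot toward the front with copy, which
// overwrites the oldest bucket, then installs a fresh newest bucket.
func (w *window) advance() {
	copy(w.buckets, w.buckets[1:])
	w.buckets[len(w.buckets)-1] = make(map[string]bool)
}

// add records a node in the newest bucket.
func (w *window) add(node string) {
	w.buckets[len(w.buckets)-1][node] = true
}

// distinct counts unique nodes across all buckets in the window.
func (w *window) distinct() int {
	seen := map[string]bool{}
	for _, b := range w.buckets {
		for n := range b {
			seen[n] = true
		}
	}
	return len(seen)
}

func main() {
	w := newWindow(3)
	w.add("node-a")
	w.advance()
	w.add("node-b")
	fmt.Println(w.distinct())
	w.advance()
	w.advance()
	fmt.Println(w.distinct())
}
```

copy handles the overlapping source and destination correctly, so no temporary slice is needed for the shift.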

fault-quarantine-module/pkg/mongodb/event_watcher.go (3)

104-114: Treat graceful shutdown as info, not error; avoid noisy logs on ctx cancel

When watchEvents returns nil due to Close(ctx) on context cancellation, this logs an error and wraps a synthetic error. Distinguish ctx-driven shutdown.

Apply this diff:

-	go func() {
-		err := w.watchEvents(ctx, watcher)
-		if err != nil {
-			slog.Error("MongoDB event watcher goroutine failed", "error", err)
-			watchDoneCh <- err
-		} else {
-			slog.Error("MongoDB event watcher goroutine exited unexpectedly, event processing has stopped")
-			watchDoneCh <- fmt.Errorf("event watcher channel closed unexpectedly")
-		}
-	}()
+	go func() {
+		err := w.watchEvents(ctx, watcher)
+		if err != nil {
+			slog.Error("MongoDB event watcher goroutine failed", "error", err)
+			watchDoneCh <- err
+			return
+		}
+		if ctx.Err() != nil {
+			slog.Info("MongoDB event watcher stopped due to context cancellation")
+		} else {
+			slog.Error("MongoDB event watcher exited unexpectedly; event channel closed")
+			watchDoneCh <- fmt.Errorf("event watcher channel closed unexpectedly")
+		}
+	}()

179-185: Silent failure to capture ObjectID skews backlog; add metric/log

If _id extraction fails, EventBacklogSize can under-report. Emit a metric or debug log to observe this.

Apply this diff:

 func (w *EventWatcher) storeEventObjectID(eventBson bson.M) {
 	if fullDoc, ok := eventBson["fullDocument"].(bson.M); ok {
 		if objID, ok := fullDoc["_id"].(primitive.ObjectID); ok {
 			w.lastProcessedObjectID.StoreLastProcessedObjectID(objID)
+			return
 		}
 	}
+	slog.Debug("Could not extract ObjectID from event fullDocument; skipping lastProcessed update")
+	metrics.ProcessingErrors.WithLabelValues("object_id_missing").Inc()
 }

197-200: Avoid stale EventBacklogSize when no last processed ID

Currently we skip updating the gauge entirely; consider explicitly setting to 0 to avoid stale values.

-			objID, ok := w.lastProcessedObjectID.LoadLastProcessedObjectID()
-			if !ok {
-				continue
-			}
+			objID, ok := w.lastProcessedObjectID.LoadLastProcessedObjectID()
+			if !ok {
+				metrics.EventBacklogSize.Set(0)
+				continue
+			}
fault-quarantine-module/pkg/informer/k8s_client.go (2)

131-149: Use the same backoff policy everywhere for consistency

You defined customBackoff; use it here too for node updates.

-	return retry.OnError(retry.DefaultBackoff, errors.IsConflict, func() error {
+	return retry.OnError(customBackoff, errors.IsConflict, func() error {

251-254: Map rebuild causes non-deterministic taint ordering

Functional, but ordering becomes non-deterministic. If later logic or tests assume order, consider appending only missing taints instead of rebuilding from a map.
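
A sketch of the "append only missing taints" alternative is below. Taint is a hypothetical stand-in for the Kubernetes taint type, and identity is taken to be the key plus effect pair, which is an assumption about what the PR's code treats as a duplicate.

```go
package main

import "fmt"

// Taint is a simplified, hypothetical stand-in for a node taint.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// taintID is the assumed identity of a taint: key plus effect.
type taintID struct{ Key, Effect string }

// mergeTaints keeps existing taints in their original order and appends only
// the wanted taints that are not already present, so the output order is
// deterministic for a given input.
func mergeTaints(existing, wanted []Taint) []Taint {
	seen := make(map[taintID]struct{}, len(existing)+len(wanted))
	merged := make([]Taint, 0, len(existing)+len(wanted))
	for _, group := range [][]Taint{existing, wanted} {
		for _, t := range group {
			id := taintID{t.Key, t.Effect}
			if _, dup := seen[id]; dup {
				continue
			}
			seen[id] = struct{}{}
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	existing := []Taint{{"a", "1", "NoSchedule"}, {"b", "1", "NoSchedule"}}
	wanted := []Taint{{"b", "2", "NoSchedule"}, {"c", "1", "NoExecute"}}
	for _, t := range mergeTaints(existing, wanted) {
		fmt.Println(t.Key, t.Value, t.Effect)
	}
}
```

The map is used only for membership checks and is never ranged over, which is what keeps the result order stable.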

fault-quarantine-module/pkg/informer/k8s_client_test.go (1)

573-610: Add assertion that unquarantine still uncordons when taints don't match

To catch the early-return bug, start from a cordoned node and verify it's uncordoned even if taintsToRemove don't exist.

Apply this diff:

-	createTestNode(ctx, t, nodeName, nil, nil, taints, false)
+	createTestNode(ctx, t, nodeName, nil, nil, taints, true)
@@
-	// Filter to test taint only (ignore automatic envtest taints)
+	// Node should be uncordoned even if specified taints didn't match
+	if updatedNode.Spec.Unschedulable {
+		t.Errorf("Expected node to be uncordoned")
+	}
+	// Filter to test taint only (ignore automatic envtest taints)
fault-quarantine-module/pkg/informer/k8s_client_interface.go (1)

27-39: Update interface parameter names for consistency with implementation

The API rename verification confirms successful migration: old method names (GetTotalGpuNodes, TaintAndCordonNodeAndSetAnnotations, UnTaintAndUnCordonNodeAndRemoveAnnotations) are completely replaced with new names.

However, the interface definition uses labelMap while the implementation uses labels for the same parameters. Update the interface at lines 28 and 31 to use labels instead of labelMap for naming consistency:

  • QuarantineNodeAndSetAnnotations: parameter 6 (line 28)
  • UnQuarantineNodeAndRemoveAnnotations: parameter 6 (line 31)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7c2a46 and abe7105.

⛔ Files ignored due to path filters (7)
  • commons/go.sum is excluded by !**/*.sum
  • fault-quarantine-module/go.sum is excluded by !**/*.sum
  • fault-remediation-module/go.sum is excluded by !**/*.sum
  • health-events-analyzer/go.sum is excluded by !**/*.sum
  • health-monitors/csp-health-monitor/go.sum is excluded by !**/*.sum
  • janitor/go.sum is excluded by !**/*.sum
  • node-drainer-module/go.sum is excluded by !**/*.sum
📒 Files selected for processing (42)
  • .github/dependabot.yml (1 hunks)
  • commons/Makefile (2 hunks)
  • commons/go.mod (1 hunks)
  • commons/pkg/configmanager/env.go (1 hunks)
  • commons/pkg/configmanager/env_test.go (1 hunks)
  • commons/pkg/configmanager/loader.go (1 hunks)
  • commons/pkg/configmanager/loader_test.go (1 hunks)
  • fault-quarantine-module/Makefile (1 hunks)
  • fault-quarantine-module/go.mod (5 hunks)
  • fault-quarantine-module/main.go (3 hunks)
  • fault-quarantine-module/pkg/breaker/breaker.go (10 hunks)
  • fault-quarantine-module/pkg/breaker/breaker_test.go (1 hunks)
  • fault-quarantine-module/pkg/breaker/metrics.go (0 hunks)
  • fault-quarantine-module/pkg/breaker/types.go (3 hunks)
  • fault-quarantine-module/pkg/common/common.go (0 hunks)
  • fault-quarantine-module/pkg/common/health_events_buffer.go (0 hunks)
  • fault-quarantine-module/pkg/config/config.go (0 hunks)
  • fault-quarantine-module/pkg/evaluator/rule_evaluator.go (3 hunks)
  • fault-quarantine-module/pkg/evaluator/rule_evaluator_test.go (2 hunks)
  • fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go (4 hunks)
  • fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go (1 hunks)
  • fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go (5 hunks)
  • fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go (0 hunks)
  • fault-quarantine-module/pkg/informer/k8s_client.go (1 hunks)
  • fault-quarantine-module/pkg/informer/k8s_client_interface.go (1 hunks)
  • fault-quarantine-module/pkg/informer/k8s_client_test.go (1 hunks)
  • fault-quarantine-module/pkg/informer/node_informer.go (6 hunks)
  • fault-quarantine-module/pkg/informer/node_informer_test.go (0 hunks)
  • fault-quarantine-module/pkg/initializer/init.go (1 hunks)
  • fault-quarantine-module/pkg/metrics/metrics.go (6 hunks)
  • fault-quarantine-module/pkg/mongodb/event_watcher.go (1 hunks)
  • fault-quarantine-module/pkg/nodeinfo/nodeinfo.go (0 hunks)
  • fault-quarantine-module/pkg/nodeinfo/nodeinfo_test.go (0 hunks)
  • fault-quarantine-module/pkg/reconciler/node_quarantine.go (0 hunks)
  • fault-quarantine-module/pkg/reconciler/node_quarantine_test.go (0 hunks)
  • fault-quarantine-module/pkg/reconciler/reconciler.go (11 hunks)
  • fault-remediation-module/go.mod (3 hunks)
  • health-events-analyzer/go.mod (1 hunks)
  • health-monitors/csp-health-monitor/go.mod (1 hunks)
  • janitor/go.mod (4 hunks)
  • node-drainer-module/go.mod (4 hunks)
  • store-client-sdk/pkg/storewatcher/watch_store_mock.go (1 hunks)
💤 Files with no reviewable changes (10)
  • fault-quarantine-module/pkg/common/common.go
  • fault-quarantine-module/pkg/nodeinfo/nodeinfo_test.go
  • fault-quarantine-module/pkg/reconciler/node_quarantine_test.go
  • fault-quarantine-module/pkg/informer/node_informer_test.go
  • fault-quarantine-module/pkg/config/config.go
  • fault-quarantine-module/pkg/common/health_events_buffer.go
  • fault-quarantine-module/pkg/nodeinfo/nodeinfo.go
  • fault-quarantine-module/pkg/breaker/metrics.go
  • fault-quarantine-module/pkg/reconciler/node_quarantine.go
  • fault-quarantine-module/pkg/healthEventsAnnotation/health_events_annotation_map.go
🔇 Additional comments (24)
node-drainer-module/go.mod (1)

50-50: All newly added indirect dependencies are legitimate and secure; no issues found.

The verification confirms:

  • Security: No vulnerabilities detected for any of the six indirect dependencies.
  • Necessity: All are expected transitive dependencies from direct dependencies (k8s v0.34.1, controller-runtime v0.22.3, mongo-driver v1.17.4, prometheus v1.23.2).
  • Legitimacy: Testing frameworks (ginkgo/v2, gomega), observability libraries (otel/metric, otel/trace), and development tooling (pprof, x/tools) align with the module's ecosystem and PR objectives.

The go.mod changes are correct and require no further action.

health-monitors/csp-health-monitor/go.mod (1)

33-33: LGTM: Routine patch version bump.

Cloud.google.com/go patch update is appropriate for dependency hygiene.

.github/dependabot.yml (1)

62-68: LGTM: Good practice to track commons module dependencies.

Adding commons to Dependabot monitoring is consistent with its expanded role as a shared library (now including config management utilities) and maintains dependency governance across the project.

commons/go.mod (1)

8-8: LGTM: Appropriate addition for config management.

Adding BurntSushi/toml v1.5.0 aligns with the new TOML-based configuration loading utilities being introduced in commons (LoadTOMLConfig). This is a standard, well-maintained library.

commons/Makefile (1)

1-1: LGTM: Documentation updates reflect evolved module scope.

Makefile updates accurately describe commons as providing both logging and configuration management utilities rather than solely logging initialization. These are documentation-only changes with no build logic impact.

Also applies to: 43-46

health-events-analyzer/go.mod (1)

40-41: LGTM: Expected OpenTelemetry transitive dependencies.

Addition of OpenTelemetry metric and trace v1.38.0 as indirect dependencies reflects coordinated observability improvements across the project. These are pulled in by MongoDB and gRPC instrumentation.

fault-remediation-module/go.mod (1)

49-49: LGTM: Infrastructure and testing improvements reflected in indirect dependencies.

New indirect dependencies support enhanced testing (ginkgo/gomega), observability (OpenTelemetry v1.38.0), profiling (pprof), and code tooling (golang.org/x/tools v0.38.0). These are appropriate for the fault remediation module's operational requirements.

Also applies to: 57-58, 69-70, 80-80

fault-quarantine-module/go.mod (3)

14-14: Verify controller-runtime and testify production usage.

Two direct dependencies were added:

  • Line 14: github.com/stretchr/testify v1.11.1
  • Line 22: sigs.k8s.io/controller-runtime v0.22.3

A past review questioned whether controller-runtime should be test-only. Confirm both are used in production code and not test utilities.

Also applies to: 22-22


27-27: Investigate YAML version divergence in indirect dependencies.

Two different versions of the YAML library appear in indirect dependencies:

  • Line 79: go.yaml.in/yaml/v3 v3.0.4
  • Line 93: gopkg.in/yaml.v3 v3.0.1

These are the same package (gopkg.in/yaml.v3 is an alias for go.yaml.in/yaml/v3). While Go modules can handle this in some cases, confirm this is intentional and not a transitive dependency conflict. A past review flagged this and it may need resolution.

Also applies to: 79-79, 93-93


27-27: LGTM: BurntSushi/toml architectural shift to commons.

Moving BurntSushi/toml from direct to indirect dependency aligns with the refactoring that centralizes TOML configuration loading in the commons module. This reflects the intended architectural improvement.

janitor/go.mod (1)

8-8: LGTM: Coordinated, project-wide dependency modernization.

The janitor module reflects a comprehensive update across multiple ecosystems (AWS SDK v2, Google Cloud, Kubernetes, OpenTelemetry, Prometheus, gRPC/protobuf). Version choices are consistent with updates across other modules in the PR:

  • AWS SDK v2: v1.39.4+ (with credentials v1.18.19, ec2/imds v1.18.11, etc.)
  • Kubernetes: v0.34.1
  • OpenTelemetry: v1.38.0 (metric/trace)
  • Google Cloud: v0.121.1
  • gRPC/protobuf: v1.76.0/v1.36.10

This appears to be part of coordinated infrastructure modernization and improved observability across the project. The versions follow standard semantic versioning and no breaking changes are evident.

Also applies to: 12-12, 16-17, 29-48, 61-74, 84-120, 139-144

store-client-sdk/pkg/storewatcher/watch_store_mock.go (2)

54-76: LGTM!

The constructor properly initializes all function hooks with safe default behaviors, and the buffered channel size (10) correctly matches the documentation at line 28.


147-165: LGTM!

The Reset method properly clears all state and drains the EventsChan, addressing the concern from previous reviews. The implementation is correct for test mock reset behavior.
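
Draining a buffered channel during a reset can be sketched as follows. This is an illustration of the general technique, not the PR's code; the recorder type and its fields are hypothetical.

```go
package main

import "fmt"

// recorder is a hypothetical mock holding a call counter and a buffered event channel.
type recorder struct {
	events chan string
	calls  int
}

// Reset clears recorded state and discards any queued events.
// The select with a default case makes the drain non-blocking:
// it stops as soon as the buffer is empty or the channel is closed.
func (r *recorder) Reset() {
	r.calls = 0
	for {
		select {
		case _, ok := <-r.events:
			if !ok {
				return // channel closed; nothing left to drain
			}
		default:
			return // buffer empty
		}
	}
}

func main() {
	r := &recorder{events: make(chan string, 10), calls: 2}
	r.events <- "a"
	r.events <- "b"
	r.Reset()
	fmt.Println(len(r.events), r.calls)
}
```

Without the drain, events queued by one test case could leak into the next one that reuses the mock.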

fault-quarantine-module/pkg/informer/node_informer.go (3)

55-58: LGTM!

The new GetInformer accessor and WaitForSync method with context support are well-implemented. The context-aware waiting at line 121 allows proper cancellation handling.

Also applies to: 117-129


145-168: LGTM!

The index-based approach for tracking quarantined nodes (lines 154-165) properly synchronizes node counting and quarantine status lookup, addressing previous concerns about race conditions between total node count and quarantined node tracking.


235-238: LGTM!

The simplified handleUpdateNode implementation focusing solely on manual uncordon detection aligns well with the refactored architecture that uses index-based node tracking instead of maintaining separate state.

commons/pkg/configmanager/loader.go (1)

48-54: LGTM!

The generic TOML loader implementation is clean and correct. The documentation (lines 41-47) appropriately demonstrates post-load validation pattern, which is the right place for domain-specific validation rather than in a generic loader.

fault-quarantine-module/pkg/evaluator/rule_set_evaluator_any.go (1)

46-48: LGTM!

The change from RuleEvaluationErroredOut to RuleEvaluationFailed aligns with the broader refactoring to simplify evaluation statuses. The error is still returned (line 47), allowing callers to distinguish between error cases and normal failures if needed.

commons/pkg/configmanager/loader_test.go (1)

29-95: LGTM!

The test coverage is comprehensive, including success, missing file, and invalid syntax cases. All past review concerns have been addressed: field assertions are present (lines 50-60), non-existent file test added, and filepath.Join is used consistently. Tests follow Go best practices with t.Parallel() and t.TempDir().

fault-quarantine-module/pkg/evaluator/rule_set_evaluator_test.go (2)

75-76: LGTM!

The test expectation changes from RuleEvaluationErroredOut to RuleEvaluationFailed correctly align with the refactored status enum in rule_set_evaluator_any.go. The tests still validate that errors are returned (expectErr: true), maintaining proper error handling verification.

Also applies to: 95-96


293-293: LGTM!

Passing nil for the second parameter in InitializeRuleSetEvaluators and createEvaluators aligns with the refactoring to decouple evaluators from direct Kubernetes client dependencies. The test coverage remains comprehensive and validates the new signature correctly.

Also applies to: 367-367

fault-quarantine-module/Makefile (1)

29-33: No issues found with TEST_SETUP_COMMANDS usage.

Verification confirms that all recipes using TEST_SETUP_COMMANDS properly append additional commands after the variable expansion. The make/go.mk recipes (lint-test and test) both follow the variable with trailing commands, and all other module Makefiles follow the same correct pattern. No standalone uses causing syntax errors exist in the codebase.

fault-quarantine-module/pkg/evaluator/rule_set_evaluator.go (1)

26-63: LGTM: cleaner initialization surface (no K8s client).

Signatures and error aggregation look good.

fault-quarantine-module/pkg/breaker/breaker_test.go (1)

353-356: The review comment is incorrect. The codebase declares Go 1.25 in fault-quarantine-module/go.mod, which fully supports the "range over int" syntax introduced in Go 1.22. The code for range 10 { ... } is valid and requires no changes.

Likely an incorrect or invalid review comment.

@github-actions

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator 41.95% (+16.40%) 🎉

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/nvidia/nvsentinel/fault-quarantine-module/pkg/evaluator/rule_evaluator.go 40.23% (+16.20%) 614 (-56) 247 (+86) 367 (-142) 🎉

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@KaivalyaMDabhadkar KaivalyaMDabhadkar merged commit 0b05a55 into NVIDIA:main Oct 29, 2025
41 checks passed

Labels

enhancement New feature or request

3 participants