
[K8s] Mounting changes, labels, shutdown#679

Merged
PawelPlesniak merged 7 commits into develop from mrigan/k8s_mounting_changes
Nov 13, 2025

@MRiganSUSX
Contributor

Description

Fixes several items in #618

Requires: DUNE-DAQ/druncschema#76

In short:

  • adds mounting of logs and data volumes
  • proper mounting config in json
  • fixes shutdown order
  • fixes shutdown for LCS
  • better error handling for config problems
  • proper pod labelling

Details:

-> Staged, Role-Based Shutdown:

--> _kill_impl has been rewritten to perform a sequential, blocking shutdown based on process roles.
It now terminates all pods in a specific order (applications -> segment-controllers -> root-controller -> local-connection-server) to ensure a clean stop.
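
The ordering above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `Pod` tuple, `SHUTDOWN_ORDER`, and the `stop_pod` callback are hypothetical names, the role strings are copied from the description rather than from the actual label values, and stopping pods with an unknown role last is an assumption.

```python
from collections import namedtuple
from typing import Callable, Iterable, List

# Hypothetical stand-in for a pod record; the real code works on Kubernetes objects.
Pod = namedtuple("Pod", ["name", "role"])

# Stage order taken from the PR description; the exact label values are assumed.
SHUTDOWN_ORDER = [
    "application",
    "segment-controller",
    "root-controller",
    "local-connection-server",
]


def staged_shutdown(pods: Iterable[Pod], stop_pod: Callable[[Pod], None]) -> List[str]:
    """Stop pods stage by stage, finishing each stage before starting the next."""
    pods = list(pods)
    stopped = []
    for role in SHUTDOWN_ORDER:
        for pod in (p for p in pods if p.role == role):
            stop_pod(pod)  # assumed to block until the pod is gone
            stopped.append(pod.name)
    # Assumption: pods with an unrecognised role are stopped last.
    for pod in (p for p in pods if p.role not in SHUTDOWN_ORDER):
        stop_pod(pod)
        stopped.append(pod.name)
    return stopped
```

Because `stop_pod` is expected to block, a stage only begins once every pod of the previous stage has been handled, which is the property the staged shutdown relies on.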

--> Robust Signal Handling & PID 1 Fixes:
Python Controllers: Graceful shutdown is fixed. Controllers now correctly use exec to become PID 1, allowing them to receive and handle SIGTERM signals and exit with code 0.

--> LCS: The LCS previously always exited with code 143 (128 + 15, i.e. terminated by SIGTERM). A shell trap is now added to the pod command; it catches the SIGTERM, sends a SIGKILL to the child gunicorn process, and then exits cleanly with code 0.
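
A rough sketch of how the two pod commands could be assembled, assuming both are launched through `/bin/sh -c`. The function names and the example arguments are hypothetical; only the use of `exec` and of a SIGTERM trap that sends SIGKILL to the child comes from the description above.

```python
import shlex
from typing import List


def controller_command(argv: List[str]) -> List[str]:
    """Wrap a controller command so the shell is replaced by the process itself."""
    # `exec` replaces the shell, so the controller becomes PID 1 and
    # receives SIGTERM directly when the pod is deleted.
    return ["/bin/sh", "-c", "exec " + shlex.join(argv)]


def lcs_command(argv: List[str]) -> List[str]:
    """Wrap the LCS command in a shell that traps SIGTERM and exits with 0."""
    script = (
        # On SIGTERM: kill the child, then report a clean exit.
        "trap 'kill -KILL \"$child\" 2>/dev/null; exit 0' TERM; "
        + shlex.join(argv)
        + " & child=$!; wait \"$child\""
    )
    return ["/bin/sh", "-c", script]
```

The server is started in the background and the shell blocks in `wait`, because a trapped signal interrupts `wait` and lets the trap run immediately; a foreground child would delay the trap until the child exited on its own.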

-> Dynamic & Static Volume Mounting:

--> Implements a new _get_pod_volumes_and_mounts helper to manage all pod volumes.

--> Static Mounts: Reads volumes (e.g., /nfs, /cvmfs) from the k8s.json configuration and mounts them.

--> Log Mount: Automatically creates a read-write "log-mount" for the directory specified in process_logs_path, enabling persistent logs.

--> Data Mount: The oks_parser and process_manager_driver now extract the data_path for DFApplications. This is passed to the K8sProcessManager and mounted as a read-write "data-mount".
Includes de-duplication logic to prevent conflicts if the log and data mounts are the same.
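
The mount assembly and de-duplication can be pictured roughly like this. It is a sketch under several assumptions: the static entries use hypothetical `name`, `path`, and `read_only` keys (the real k8s.json schema is defined in druncschema and not shown here), all volumes are modelled as `hostPath`, static mounts default to read-only, and duplicates are detected by comparing paths.

```python
from typing import Dict, List, Optional, Tuple


def pod_volumes_and_mounts(
    static_mounts: List[Dict[str, object]],
    logs_path: Optional[str],
    data_path: Optional[str],
) -> Tuple[List[dict], List[dict]]:
    """Return (volumes, volume_mounts) as plain dicts shaped like pod-spec entries."""
    volumes: List[dict] = []
    mounts: List[dict] = []
    seen = set()

    def add(name: str, path: str, read_only: bool) -> None:
        if path in seen:  # de-duplicate: mount each path only once
            return
        seen.add(path)
        volumes.append({"name": name, "hostPath": {"path": path}})
        mounts.append({"name": name, "mountPath": path, "readOnly": read_only})

    for entry in static_mounts:
        add(str(entry["name"]), str(entry["path"]), bool(entry.get("read_only", True)))
    if logs_path:
        add("log-mount", logs_path, read_only=False)
    if data_path:
        add("data-mount", data_path, read_only=False)
    return volumes, mounts
```

With `logs_path` and `data_path` both set to the same directory, only the "log-mount" entry is emitted, so the pod spec never contains two mounts at one path.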

-> Improved Startup Error Handling:

--> Server-Side: The K8sProcessManager now wraps config.load_kube_config() in a try...except block. If the kube-config is missing or invalid, it logs a clear CRITICAL error and exits, rather than failing silently.
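
The server-side guard might look roughly like the following. The function name, the logger name, and the exit code are assumptions, and the loader is passed in as a callable so the sketch does not depend on the `kubernetes` package being installed.

```python
import logging
from typing import Callable

log = logging.getLogger("process_manager")  # hypothetical logger name


def load_kube_config_or_exit(loader: Callable[[], None]) -> None:
    """Run the kube-config loader; on failure log at CRITICAL and exit non-zero."""
    try:
        loader()
    except Exception as exc:  # the real code may catch a narrower exception type
        log.critical("Could not load the Kubernetes configuration: %s", exc)
        raise SystemExit(1) from exc
```

In the process manager, the loader would be `kubernetes.config.load_kube_config`, for example `load_kube_config_or_exit(config.load_kube_config)`.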

--> Client-Side: The drunc-unified-shell startup loop now checks whether the server process (pm_process) is alive. If the server crashes on launch (e.g., due to a bad config), the client exits immediately with an error message.
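
A minimal sketch of such a startup loop, assuming the server handle exposes an `is_alive()` method (as `multiprocessing.Process` does) and that readiness can be polled through some callable. `wait_for_server`, `is_ready`, the timeout values, and the messages are all hypothetical.

```python
import time
from typing import Callable


def wait_for_server(
    is_alive: Callable[[], bool],
    is_ready: Callable[[], bool],
    timeout: float = 30.0,
    poll_interval: float = 0.1,
) -> None:
    """Poll until the server is ready; exit early if the server process has died."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_alive():
            # Fail fast instead of waiting for the full timeout.
            raise SystemExit("Process manager exited during startup; check its log.")
        if is_ready():
            return
        time.sleep(poll_interval)
    raise SystemExit("Timed out waiting for the process manager to become ready.")
```

A caller would pass `pm_process.is_alive` as the first argument; checking liveness before readiness means a crashed server is reported as a crash rather than as a timeout.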

-> Pod Role Labeling:

--> A new helper, _get_tree_labels, automatically adds role.drunc.daq and tree-id.drunc.daq labels to all pods at creation time.
This labeling is what enables the new staged shutdown logic.
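
For illustration, the labels and the selector used to find pods of one role could be built as below. Only the two label keys come from this PR; the function names, the example role strings, and the dotted tree-id format are assumptions. The validation follows the Kubernetes rule for label values (at most 63 characters, alphanumeric at both ends, with `-`, `_`, and `.` allowed in between).

```python
import re
from typing import Dict

ROLE_LABEL = "role.drunc.daq"
TREE_ID_LABEL = "tree-id.drunc.daq"

# Kubernetes label value syntax; an empty value is also valid.
_LABEL_VALUE = re.compile(r"([A-Za-z0-9]([A-Za-z0-9_.-]{0,61}[A-Za-z0-9])?)?")


def tree_labels(role: str, tree_id: str) -> Dict[str, str]:
    """Build the two pod labels, rejecting values Kubernetes would refuse."""
    labels = {ROLE_LABEL: role, TREE_ID_LABEL: tree_id}
    for key, value in labels.items():
        if not _LABEL_VALUE.fullmatch(value):
            raise ValueError(f"invalid value {value!r} for label {key}")
    return labels


def role_selector(role: str) -> str:
    """Label selector string that matches every pod with the given role."""
    return f"{ROLE_LABEL}={role}"
```

A shutdown stage can then list its pods with a selector such as `role_selector("application")`, which evaluates to `role.drunc.daq=application`.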


Tested:

  • custom k8s runs ✅
  • custom ssh runs ✅
  • pytest ✅
  • integtest_bundle ✅
  • tested different data file locations ✅

To test:

  • run k8s session:
    • observe no problems for normal running
    • check pod mounts (k8s dashboard, directly in pods)
    • check pod labels (k8s dashboard, directly in pods)
    • observe data files getting created
    • observe log files getting created
    • observe the proper shutdown procedure, with graceful shutdowns and exit code 0 for all pods

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (e.g. python -m pytest)
  • Pre-commit hooks run successfully (e.g. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@PawelPlesniak
Collaborator

I have run the following tests:

  • daqsystemtest_integtest_bundle on daq.fnal.gov
  • Manual tests:
    • drunc-unified-shell k8s config/daqsystemtest/example-configs.data.xml local-1x1-config pawel boot wait 2 start-run --run-number 1 wait 20 shutdown wait 2
    • drunc-unified-shell k8s config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config pawel boot wait 2 start-run --run-number 1 wait 20 shutdown wait 2
    • drunc-unified-shell k8s-CERN config/daqsystemtest/example-configs.data.xml local-1x1-config pawel boot wait 2 start-run --run-number 1 wait 20 shutdown wait 2
    • drunc-unified-shell k8s-CERN config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config pawel boot wait 2 start-run --run-number 1 wait 20 shutdown wait 2

All these tests passed as expected. Thank you @MRiganSUSX

@PawelPlesniak merged commit b787107 into develop on Nov 13, 2025
1 of 4 checks passed
@PawelPlesniak deleted the mrigan/k8s_mounting_changes branch on November 13, 2025 at 00:26
