Conversation

@jamesturner246
Contributor

@jamesturner246 jamesturner246 commented Nov 7, 2025

Description

Fixes #667.

Few things going on here:

  • New include/exclude RPCs using dedicated message types
  • Fix bug from last PR: RecomputeStatusResponse -> StatusResponse
  • Simplify child node creation during init (no more 'dummy' nodes)
  • ChildNode is now a pure abstract class: all subclasses MUST reimplement all methods explicitly. This was done to reduce cognitive load; a bit of extra verbosity is worth it to make the code both safer and easier to read.
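
A minimal sketch of the pure-abstract pattern described in the last bullet, assuming Python's standard `abc` module. The class and method names (`ChildNodeSketch`, `CompleteChild`, `IncompleteChild`) are illustrative assumptions, not the actual drunc classes or signatures. It shows that a subclass which leaves any abstract method unimplemented cannot be instantiated:

```python
from abc import ABC, abstractmethod


class ChildNodeSketch(ABC):
    """Hypothetical stand-in for a pure abstract child node."""

    @abstractmethod
    def include(self) -> str: ...

    @abstractmethod
    def exclude(self) -> str: ...

    @abstractmethod
    def terminate(self) -> None: ...


class CompleteChild(ChildNodeSketch):
    """Reimplements every method explicitly, so it can be instantiated."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.included = True

    def include(self) -> str:
        self.included = True
        return f"'{self.name}' included"

    def exclude(self) -> str:
        self.included = False
        return f"'{self.name}' excluded"

    def terminate(self) -> None:
        # Nothing to clean up in this sketch.
        return None


class IncompleteChild(ChildNodeSketch):
    """Omits exclude() and terminate(), so instantiation fails."""

    def include(self) -> str:
        return "included"


def try_instantiate_incomplete() -> str:
    """Return the name of the exception raised by the incomplete subclass."""
    try:
        IncompleteChild()
    except TypeError as exc:
        return type(exc).__name__
    return "no error"
```

The failure happens at construction time rather than when a missing method is first called, which is the safety benefit the bullet alludes to.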

Sister druncschema PR: DUNE-DAQ/druncschema#75

Unfortunately, there is some extra noise in this PR from things moving around. As usual, happy to chat and walk through it.

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@jamesturner246 jamesturner246 force-pushed the jamesturner246/msg_inc_exc branch from f2fc19c to 2f28819 Compare November 10, 2025 16:05
@jamesturner246 jamesturner246 force-pushed the jamesturner246/msg_inc_exc branch from 0660f68 to 7486ab3 Compare November 11, 2025 15:14
@jamesturner246 jamesturner246 marked this pull request as ready for review November 11, 2025 16:50
@PawelPlesniak
Collaborator

I have run the integration tests twice on this codebase. The first time failed with

[2025/11/12 07:18:48] ERROR      process_manager_driver.py:806  controller.ProcessManagerDriver:
                                 # Could not find 'root-controller' on the connectivity service.

                                 # Two possibilities:

                                 # 1. The most likely, the controller died. You can check that by looking for error like:
                                 # Process 'root-controller' (session: 'fakedata', user: 'plesniak') process exited with exit code 1).
                                 # Try running ps to see if the root-controller is still running.
                                 # You may also want to check the logs of the controller, try typing:
                                 # logs --name root-controller --how-far 1000
                                 # If that's not helping, you can restart this shell with --log-level debug, and look out for 'STDOUT' and 'STDERR'.

                                 # 2. The controller did not die, but is still setting up and has not advertised itself on the connection service.
                                 # You may be able to connect to the root-controller in a bit. Check the logs of the controller:
                                 # logs --name root-controller --grep grpc
                                 # And look for messages like:
                                 # Registering root-controller to the connectivity service at grpc://xxx.xxx.xxx.xxx:xxxxx
                                 # To find the controller address, you can look up 'root-controller_control' on http://daq.fnal.gov:30512 (you may
                                 need a SOCKS proxy from outside CERN), or use the address from the logs as above. Then just connect this shell to the
                                 controller with:
                                 # connect {controller_address}:{controller_port}>

[2025/11/12 07:18:48] WARNING    process_manager_driver.py:433  controller.ProcessManagerDriver:              Falling back to static OKS configuration
                                 for address resolution.
⠋ Trying to talk to the root controller... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━  0:00:01 0:01:00
[2025/11/12 07:19:48] INFO       shell.py:389                   unified_shell:                                Shutting down the unified_shell
[2025/11/12 07:19:48] ERROR      shell.py:428                   unified_shell:                                Could not retrieve the controller
                                 status, reason: failed to connect to all addresses; last error: UNKNOWN: ipv4:131.225.193.20:30847: Failed to connect
                                 to remote host: connect: Connection refused (111)

This address comes from a completely unrelated issue, which I suspect is caused by multiple testers on the same physical host sharing a static connectivity service address.

After a few minutes' break I reran the tests and got the following on daq.fnal.gov with NFD_DEV_251112_A9, with only this branch and the sister druncschema branch checked out:

+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ SUMMARY ++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Wed Nov 12 07:56:09 AM CST 2025
Log file is: /tmp/pytest-of-plesniak/daqsystemtest_integtest_bundle_20251112073225.log

⮕ Running minimal_system_quick_test.py ⬅
============================== 3 passed in 45.09s ==============================
⮕ Running readout_type_scan_test.py ⬅
======================== 27 passed in 418.14s (0:06:58) ========================
⮕ Running 3ru_3df_multirun_test.py ⬅
======================== 6 passed in 231.10s (0:03:51) =========================
⮕ Running small_footprint_quick_test.py ⬅
============================== 3 passed in 46.64s ==============================
⮕ Running fake_data_producer_test.py ⬅
======================== 6 passed in 227.90s (0:03:47) =========================
⮕ Running long_window_readout_test.py ⬅
============================= 4 skipped in 15.52s ==============================
⮕ Running 3ru_1df_multirun_test.py ⬅
======================== 6 passed in 228.07s (0:03:48) =========================
⮕ Running tpstream_writing_test.py ⬅
======================== 4 passed in 100.33s (0:01:40) =========================
⮕ Running example_system_test.py ⬅
======================== 6 passed in 102.51s (0:01:42) =========================

I will announce the merge shortly on # daq-release-preparation.

@PawelPlesniak
Collaborator

This code is also much cleaner now, and it highlights just how many methods still require implementation, e.g. a formal terminate process in RESTAPIChildNode. For now this does not affect the terminate RPC, because the controllers call cleanup methods through their signal handlers, but it will be useful when doing a code review.
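
One way such gaps can be made visible, sketched here with hypothetical names (this is not the actual RESTAPIChildNode code): the subclass reimplements the method explicitly but raises `NotImplementedError`, so a reviewer or a small helper can list what is still missing.

```python
from abc import ABC, abstractmethod
from typing import List


class NodeBase(ABC):
    """Hypothetical pure abstract base; not the drunc ChildNode."""

    @abstractmethod
    def get_status(self) -> str: ...

    @abstractmethod
    def terminate(self) -> None: ...


class RestNodeSketch(NodeBase):
    """Hypothetical REST-backed node with one explicit stub."""

    def get_status(self) -> str:
        return "idle"

    def terminate(self) -> None:
        # Explicit stub: the gap is visible in the class body itself.
        raise NotImplementedError("terminate is not implemented for this node type")


def unimplemented_methods(node: NodeBase, method_names: List[str]) -> List[str]:
    """Call each zero-argument method and collect those that are still stubs."""
    missing = []
    for name in method_names:
        try:
            getattr(node, name)()
        except NotImplementedError:
            missing.append(name)
    return missing
```

Calling the methods to detect stubs is only acceptable in a toy like this; in real code a plain text search for `NotImplementedError` would be the safer review aid.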

Collaborator

@PawelPlesniak PawelPlesniak left a comment

LGTM, thank you!

@PawelPlesniak
Collaborator

It is also nice to see include and exclude working on the root-controller:

drunc-unified-shell > status
                                                pawel status                                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:30006 │
│   df-controller        │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:38247 │
│     df-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:53563 │
│     dfo-01             │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:42305 │
│     tp-stream-writer   │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:56527 │
│   hsi-fake-controller  │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:38985 │
│     hsi-fake-01        │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:44465 │
│     hsi-fake-to-tc-app │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:52889 │
│   ru-controller        │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:45331 │
│     ru-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:53945 │
│   trg-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:43129 │
│     mlt                │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:36109 │
│     tc-maker-1         │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:57127 │
└────────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
[2025/11/12 15:11:27] INFO       shell_utils.py:165             utils.ShellContext:                           Current FSM status is configured. Available transitions are start, scrap. Available sequence commands are shutdown, start-run.
drunc-unified-shell > exclude --help
Usage: PROCESS_MANAGER CONFIGURATION_FILE CONFIGURATION_ID SESSION_NAME exclude 
           [OPTIONS]

Options:
  --target TEXT  The target to address
  --help         Show this message and exit.
drunc-unified-shell > exclude --target root-controller
[2025/11/12 15:16:22] INFO       commands.py:331                controller.interface:                         'root-controller' excluded
drunc-unified-shell > status
                                             pawel status                                             
┏━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name            ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller │      │ configured │ configured │ No       │ No       │ grpc://10.73.136.38:30006 │
└─────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
[2025/11/12 15:16:26] INFO       shell_utils.py:165             utils.ShellContext:                           Current FSM status is configured. Available transitions are start, scrap. Available sequence commands are shutdown, start-run.
drunc-unified-shell > include --target root-controller
[2025/11/12 15:16:31] INFO       commands.py:313                controller.interface:                         'root-controller' included
drunc-unified-shell > status
                                                pawel status                                                 
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:30006 │
│   df-controller        │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:38247 │
│     df-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:53563 │
│     dfo-01             │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:42305 │
│     tp-stream-writer   │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:56527 │
│   hsi-fake-controller  │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:38985 │
│     hsi-fake-01        │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:44465 │
│     hsi-fake-to-tc-app │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:52889 │
│   ru-controller        │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:45331 │
│     ru-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:53945 │
│   trg-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.38:43129 │
│     mlt                │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:36109 │
│     tc-maker-1         │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.38:57127 │
└────────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
[2025/11/12 15:16:36] INFO       shell_utils.py:165             utils.ShellContext:                           Current FSM status is configured. Available transitions are start, scrap. Available sequence commands are shutdown, start-run. 
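
The exclude/include round trip above could be modelled with one request type per RPC. The following is only a sketch using plain dataclasses; the real message types live in druncschema, and every name here (`IncludeRequestSketch`, `ExcludeRequestSketch`, `InclusionReplySketch`, `ControllerSketch`) is a hypothetical stand-in rather than the actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncludeRequestSketch:
    """Hypothetical dedicated request for the include RPC."""
    target: str


@dataclass(frozen=True)
class ExcludeRequestSketch:
    """Hypothetical dedicated request for the exclude RPC."""
    target: str


@dataclass(frozen=True)
class InclusionReplySketch:
    """Hypothetical reply reporting the node's new inclusion state."""
    name: str
    included: bool


class ControllerSketch:
    """Toy controller; the request's target is carried but not routed here."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.included = True

    def include(self, request: IncludeRequestSketch) -> InclusionReplySketch:
        # A dedicated type lets the handler reject the wrong message early.
        if not isinstance(request, IncludeRequestSketch):
            raise TypeError("include expects an IncludeRequestSketch")
        self.included = True
        return InclusionReplySketch(name=self.name, included=True)

    def exclude(self, request: ExcludeRequestSketch) -> InclusionReplySketch:
        if not isinstance(request, ExcludeRequestSketch):
            raise TypeError("exclude expects an ExcludeRequestSketch")
        self.included = False
        return InclusionReplySketch(name=self.name, included=False)
```

With a shared generic request, the two handlers could be called with each other's payloads without complaint; separate types make that mistake detectable.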

@PawelPlesniak
Collaborator

Hi James, an issue shows up immediately when running status. We want to see the full status table:

drunc-unified-shell > status
                                                  pawel status
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                  ┃ Info ┃ State        ┃ Substate     ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller       │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.38:30006 │
│   df-controller       │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.38:37977 │
│   hsi-fake-controller │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.38:37471 │
│   ru-controller       │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.38:36083 │
│   trg-controller      │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.38:36005 │
└───────────────────────┴──────┴──────────────┴──────────────┴──────────┴──────────┴───────────────────────────┘

@PawelPlesniak
Collaborator

Looks much better

[2025/11/13 19:53:21] INFO       commands.py:80                 unified_shell.boot:                           Booted successfully
drunc-unified-shell > status
                                              pawel status
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:30006 │
│   df-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:46591 │
│     df-01              │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:40135 │
│     dfo-01             │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:39235 │
│     tp-stream-writer   │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:48519 │
│   hsi-fake-controller  │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:33297 │
│     hsi-fake-01        │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:45133 │
│     hsi-fake-to-tc-app │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:55643 │
│   ru-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:36071 │
│     ru-01              │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:41567 │
│   trg-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:43611 │
│     mlt                │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:43249 │
│     tc-maker-1         │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:47693 │
└────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────┘
[2025/11/13 19:53:31] INFO       shell_utils.py:165             utils.ShellContext:                           Current FSM status is initial. Available transitions are conf. Available sequence commands are start-run.
drunc-unified-shell > exclude --target root-controller/df-controller
[2025/11/13 19:53:40] INFO       commands.py:331                controller.interface:
drunc-unified-shell > status
                                              pawel status
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:30006 │
│   df-controller        │      │ initial │ initial  │ No       │ No       │ grpc://10.73.136.38:46591 │
│     df-01              │      │ initial │ idle     │ No       │ No       │ rest://10.73.136.38:40135 │
│     dfo-01             │      │ initial │ idle     │ No       │ No       │ rest://10.73.136.38:39235 │
│     tp-stream-writer   │      │ initial │ idle     │ No       │ No       │ rest://10.73.136.38:48519 │
│   hsi-fake-controller  │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:33297 │
│     hsi-fake-01        │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:45133 │
│     hsi-fake-to-tc-app │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:55643 │
│   ru-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:36071 │
│     ru-01              │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:41567 │
│   trg-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.38:43611 │
│     mlt                │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:43249 │
│     tc-maker-1         │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.38:47693 │
└────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────┘
[2025/11/13 19:53:41] INFO       shell_utils.py:165             utils.ShellContext:                           Current FSM status is initial. Available transitions are conf. Available sequence commands are start-run.

There is still a stray log; I removed it in the commit.
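
The behaviour in the second status table above, where excluding `root-controller/df-controller` also marks its applications as not included, can be sketched as a recursive update on a tree. This is an illustration under assumptions only: `TreeNode`, `find`, `set_included`, and `status_rows` are hypothetical helpers, and how drunc actually resolves targets or propagates the flag is not shown in this conversation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    """Hypothetical node in the control tree."""
    name: str
    included: bool = True
    children: List["TreeNode"] = field(default_factory=list)


def find(root: TreeNode, target: str) -> TreeNode:
    """Resolve a slash-separated path such as 'root-controller/df-controller'."""
    parts = target.split("/")
    if parts[0] != root.name:
        raise KeyError(target)
    node = root
    for part in parts[1:]:
        for child in node.children:
            if child.name == part:
                node = child
                break
        else:
            # No child matched this path component.
            raise KeyError(target)
    return node


def set_included(node: TreeNode, included: bool) -> None:
    """Set the flag on a node and on everything beneath it."""
    node.included = included
    for child in node.children:
        set_included(child, included)


def status_rows(node: TreeNode, depth: int = 0) -> List[Tuple[str, str]]:
    """Flatten the tree into (indented name, 'Yes'/'No') rows."""
    rows = [("  " * depth + node.name, "Yes" if node.included else "No")]
    for child in node.children:
        rows.extend(status_rows(child, depth + 1))
    return rows
```

For example, building a small tree and calling `set_included(find(root, "root-controller/df-controller"), False)` leaves the sibling controllers untouched while flipping the whole df-controller subtree.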

@PawelPlesniak PawelPlesniak merged commit 3ec17a0 into develop Nov 13, 2025
3 checks passed
@PawelPlesniak PawelPlesniak deleted the jamesturner246/msg_inc_exc branch November 13, 2025 19:33

Development

Successfully merging this pull request may close these issues.

[Feature]: message types for include/exclude

2 participants