@PawelPlesniak
Collaborator

@PawelPlesniak PawelPlesniak commented Aug 19, 2025

Description

Requires DUNE-DAQ/druncschema#65

Addresses #249
Addresses #297
Batch mode is managed "under the hood" by the click context rather than directly by drunc. As a result, when a batched set of commands is sent (e.g. boot conf ps) and an error state is raised, the batch is not stopped. This PR adds a check so that, once an error state is reached, the remaining commands in the batch are not run.
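
For illustration only, here is a minimal sketch of the intended behaviour in plain Python. It does not use click and is not drunc's actual code: `Session`, `BatchAborted`, `run_batch`, `ok` and `to_error` are hypothetical names introduced for this example. The runner checks the error flag before each command and stops the batch once the flag is set.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Session:
    """Hypothetical stand-in for the shell's view of the FSM (not a drunc class)."""
    in_error: bool = False
    executed: list[str] = field(default_factory=list)


class BatchAborted(Exception):
    """Raised when a batch is stopped because the error state was reached."""


def run_batch(session: Session, commands: list[tuple[str, Callable[[Session], None]]]) -> None:
    """Run the commands in order, refusing to start one if the session is in error."""
    for name, action in commands:
        # Check before each command so nothing is sent to a system already in error.
        if session.in_error:
            raise BatchAborted(f"error state detected, not running '{name}'")
        action(session)
        session.executed.append(name)


def ok(session: Session) -> None:
    """A command that succeeds and leaves the state untouched."""


def to_error(session: Session) -> None:
    """A command that puts the session into the error state."""
    session.in_error = True


session = Session()
try:
    # The second command raises the error state, so the third must not run.
    run_batch(session, [("boot", ok), ("conf", to_error), ("ps", ok)])
except BatchAborted as exc:
    print(exc)
print(session.executed)
```

In this sketch the command that raises the error state still counts as executed; only the commands after it are skipped.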

Fixes #249

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future.
    (Indicate issue here: # (issue))

@PawelPlesniak PawelPlesniak requested a review from eflumerf August 29, 2025 13:58
@PawelPlesniak PawelPlesniak marked this pull request as ready for review August 29, 2025 13:59
@PawelPlesniak
Collaborator Author

Note: the to-error command is defined to test that this works as intended. It may also be useful for integration tests; I am happy to remove it if it is deemed unnecessary.

@eflumerf
Member

I tried starting minimal_system_quick_test and then running killall daq_application, and the system still proceeded with the command set.

[2025/08/29 11:02:54] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'start' on controller 'root-controller', targeting: 'root-controller'
                                                minimal status
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State      ┃ Substate        ┃ In error ┃ Included ┃ Endpoint                    ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ configured │ preparing-start │ No       │ Yes      │ grpc://172.19.176.157:44079 │
│   df-controller   │      │ configured │ configured      │ No       │ Yes      │ grpc://172.19.176.157:38141 │
│     df-01         │      │ configured │ idle            │ No       │ Yes      │ rest://172.19.176.157:53473 │
│     dfo-01        │      │ configured │ idle            │ No       │ Yes      │ rest://172.19.176.157:46253 │
│   ru-controller   │      │ configured │ configured      │ No       │ Yes      │ grpc://172.19.176.157:43983 │
│     ru-det-conn-0 │      │ configured │ idle            │ No       │ Yes      │ rest://172.19.176.157:46767 │
│   trg-controller  │      │ configured │ configured      │ No       │ Yes      │ grpc://172.19.176.157:33795 │
│     mlt           │      │ configured │ idle            │ No       │ Yes      │ rest://172.19.176.157:51479 │
└───────────────────┴──────┴────────────┴─────────────────┴──────────┴──────────┴─────────────────────────────┘
                                                 minimal status
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State      ┃ Substate          ┃ In error ┃ Included ┃ Endpoint                    ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ configured │ propagating-start │ No       │ Yes      │ grpc://172.19.176.157:44079 │
│   df-controller   │      │ configured │ propagating-start │ No       │ Yes      │ grpc://172.19.176.157:38141 │
│     df-01         │      │ ready      │ idle              │ Yes      │ Yes      │ rest://172.19.176.157:53473 │
│     dfo-01        │      │ configured │ executing_cmd     │ Yes      │ Yes      │ rest://172.19.176.157:46253 │
│   ru-controller   │      │ ready      │ ready             │ No       │ Yes      │ grpc://172.19.176.157:43983 │
│     ru-det-conn-0 │      │ ready      │ idle              │ Yes      │ Yes      │ rest://172.19.176.157:46767 │
│   trg-controller  │      │ ready      │ ready             │ No       │ Yes      │ grpc://172.19.176.157:33795 │
│     mlt           │      │ ready      │ idle              │ Yes      │ Yes      │ rest://172.19.176.157:51479 │
└───────────────────┴──────┴────────────┴───────────────────┴──────────┴──────────┴─────────────────────────────┘
Waiting for start to complete... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 0:00:01
[2025/08/29 11:03:55] ERROR      shell_utils.py:553             controller.shell_utils:                       Deadline Exceeded
[2025/08/29 11:03:55] ERROR      shell_utils.py:554             controller.shell_utils:                       The command timed out, unfortunately this means the server is in undefined state, and your best option at this stage is to terminate and boot.
[2025/08/29 11:03:55] ERROR      shell_utils.py:557             controller.shell_utils:                       Alternatively, if you are patient, you can try to wait a bit longer and send 'status' to check if the command ends up being executed (you may want to check the
                                 logs of the controller and application with the 'logs' command).
[2025/08/29 11:03:55] INFO       commands.py:53                 controller.interface:                         Command wait running for 1 seconds.
[2025/08/29 11:03:56] INFO       commands.py:55                 controller.interface:                         Command wait ran for 1 seconds.
[2025/08/29 11:03:56] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'enable_triggers' on controller 'root-controller', targeting: 'root-controller'
[2025/08/29 11:03:56] ERROR      shell_utils.py:498             controller.shell_utils:                       Command "enable_triggers" does not exist, or is not accessible right now
[2025/08/29 11:03:56] INFO       commands.py:53                 controller.interface:                         Command wait running for 20 seconds.
[2025/08/29 11:04:16] INFO       commands.py:55                 controller.interface:                         Command wait ran for 20 seconds.
[2025/08/29 11:04:16] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'disable_triggers' on controller 'root-controller', targeting: 'root-controller'
[2025/08/29 11:04:16] ERROR      shell_utils.py:498             controller.shell_utils:                       Command "disable_triggers" does not exist, or is not accessible right now
[2025/08/29 11:04:16] INFO       commands.py:53                 controller.interface:                         Command wait running for 2 seconds.
[2025/08/29 11:04:18] INFO       commands.py:55                 controller.interface:                         Command wait ran for 2 seconds.
[2025/08/29 11:04:18] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'drain_dataflow' on controller 'root-controller', targeting: 'root-controller'
[2025/08/29 11:04:18] ERROR      shell_utils.py:498             controller.shell_utils:                       Command "drain_dataflow" does not exist, or is not accessible right now
[2025/08/29 11:04:18] INFO       commands.py:53                 controller.interface:                         Command wait running for 2 seconds.
[2025/08/29 11:04:20] INFO       commands.py:55                 controller.interface:                         Command wait ran for 2 seconds.
[2025/08/29 11:04:20] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'stop_trigger_sources' on controller 'root-controller', targeting: 'root-controller'
[2025/08/29 11:04:20] ERROR      shell_utils.py:498             controller.shell_utils:                       Command "stop_trigger_sources" does not exist, or is not accessible right now
[2025/08/29 11:04:20] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'stop' on controller 'root-controller', targeting: 'root-controller'
[2025/08/29 11:04:20] ERROR      shell_utils.py:498             controller.shell_utils:                       Command "stop" does not exist, or is not accessible right now
[2025/08/29 11:04:20] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'scrap' on controller 'root-controller', targeting: 'root-controller'

@PawelPlesniak
Collaborator Author

Thanks for this. The filter used for defining the error state currently only changes the root controller, which is easily fixable. I will address this on Monday, but thank you for checking.
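
One reading of the comment above is that the error check only involves the root controller, while in the log above only the applications report "In error: Yes". As a hedged sketch of a whole-tree check (not drunc's implementation; `Node` and `nodes_in_error` are hypothetical names, and the tree is copied from the second status table in the log above):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical status-tree node; the field names are illustrative only."""
    name: str
    in_error: bool = False
    children: list[Node] = field(default_factory=list)


def nodes_in_error(node: Node) -> list[str]:
    """Return the names of all nodes in error, depth first, including `node` itself."""
    found = [node.name] if node.in_error else []
    for child in node.children:
        found.extend(nodes_in_error(child))
    return found


# Controllers report "No" while their applications report "Yes".
tree = Node("root-controller", children=[
    Node("df-controller", children=[
        Node("df-01", in_error=True),
        Node("dfo-01", in_error=True),
    ]),
    Node("ru-controller", children=[Node("ru-det-conn-0", in_error=True)]),
    Node("trg-controller", children=[Node("mlt", in_error=True)]),
])

print(tree.in_error)         # root-only check: False
print(nodes_in_error(tree))  # whole-tree check: the four applications
```

A batch-mode guard built on the whole-tree result would stop in the scenario above, whereas one that looks only at the root flag would not.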

@PawelPlesniak
Collaborator Author

This is interesting. I ran drunc in batch mode, and the error state was caught:

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config PawelIssueTesting boot conf wait 10 start --run-number 1 shutdown
...
[2025/09/10 14:54:31] INFO       commands.py:53                 controller.interface:                         Command wait running for 10 seconds.                                                                                                                                           
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'dfo-01' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                            
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'hsi-fake-01' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                       
[2025/09/10 14:54:32] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'tp-stream-writer' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                  
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'hsi-fake-to-tc-app' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'tc-maker-1' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                        
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'mlt' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                               
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'df-01' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                             
[2025/09/10 14:54:32] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'ru-01' (session: 'PawelIssueTesting', user: 'pplesnia') process exited with exit code 143                                                                             
[2025/09/10 14:54:41] INFO       commands.py:55                 controller.interface:                         Command wait ran for 10 seconds.                                                                                                                                               
[2025/09/10 14:54:41] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'start' on controller 'root-controller', targeting: 'root-controller'                                                                                       
[2025/09/10 14:54:41] CRITICAL   shell_utils.py:466             controller.shell_utils:                       False                                                                                                                                                                          
                                             PawelIssueTesting status                                             
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State      ┃ Substate        ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ configured │ preparing-start │ No       │ Yes      │ grpc://10.73.136.36:38269 │
│   df-controller        │      │ configured │ configured      │ No       │ Yes      │ grpc://10.73.136.36:43615 │
│     df-01              │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:43879 │
│     dfo-01             │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:42839 │
│     tp-stream-writer   │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:35985 │
│   hsi-fake-controller  │      │ configured │ configured      │ No       │ Yes      │ grpc://10.73.136.36:43231 │
│     hsi-fake-01        │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:35639 │
│     hsi-fake-to-tc-app │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:33367 │
│   ru-controller        │      │ configured │ configured      │ No       │ Yes      │ grpc://10.73.136.36:40569 │
│     ru-01              │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:59257 │
│   trg-controller       │      │ configured │ configured      │ No       │ Yes      │ grpc://10.73.136.36:36377 │
│     mlt                │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:49567 │
│     tc-maker-1         │      │ configured │ idle            │ Yes      │ Yes      │ rest://10.73.136.36:40613 │
                                            PawelIssueTesting status                                            
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State      ┃ Substate      ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:38269 │
│   df-controller        │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:43615 │
│     df-01              │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:43879 │
│     dfo-01             │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:42839 │
│     tp-stream-writer   │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:35985 │
│   hsi-fake-controller  │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:43231 │
│     hsi-fake-01        │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:35639 │
│     hsi-fake-to-tc-app │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:33367 │
│   ru-controller        │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:40569 │
│     ru-01              │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:59257 │
│   trg-controller       │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:36377 │
│     mlt                │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:49567 │
│     tc-maker-1         │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:40613 │
└────────────────────────┴──────┴────────────┴───────────────┴──────────┴──────────┴───────────────────────────┘
                                      Run Info                                       
┌───────────────────────┬───────────────────────────────────────────────────────────┐
│ Run number            │ 1                                                         │
│ Run type              │ TEST                                                      │
│ Start time            │ 2025-09-10 14:54:41                                       │
│ Duration              │ 0:00:00                                                   │
│ Trigger rate          │ 0.0000 Hz                                                 │
│ Data storage disabled │ False                                                     │
│ Config file           │ oksconflibs:config/daqsystemtest/example-configs.data.xml │
│ Config ID             │ local-1x1-config                                          │
└───────────────────────┴───────────────────────────────────────────────────────────┘
Waiting for start to complete... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--
                       start execution report                       
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Command execution      ┃ FSM transition ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ root-controller        │ Executed Successfully  │ Fsm Failed     │
│   ru-controller        │ Executed Successfully  │ Fsm Failed     │
│     ru-01              │ Drunc Exception Thrown │ NA             │
│   hsi-fake-controller  │ Executed Successfully  │ Fsm Failed     │
│     hsi-fake-01        │ Drunc Exception Thrown │ NA             │
│     hsi-fake-to-tc-app │ Drunc Exception Thrown │ NA             │
│   trg-controller       │ Executed Successfully  │ Fsm Failed     │
│     tc-maker-1         │ Drunc Exception Thrown │ NA             │
│     mlt                │ Drunc Exception Thrown │ NA             │
│   df-controller        │ Executed Successfully  │ Fsm Failed     │
│     dfo-01             │ Drunc Exception Thrown │ NA             │
│     tp-stream-writer   │ Drunc Exception Thrown │ NA             │
│     df-01              │ Drunc Exception Thrown │ NA             │
└────────────────────────┴────────────────────────┴────────────────┘
                                            PawelIssueTesting status                                            
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State      ┃ Substate      ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:38269 │
│   df-controller        │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:43615 │
│     df-01              │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:43879 │
│     dfo-01             │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:42839 │
│     tp-stream-writer   │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:35985 │
│   hsi-fake-controller  │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:43231 │
│     hsi-fake-01        │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:35639 │
│     hsi-fake-to-tc-app │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:33367 │
│   ru-controller        │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:40569 │
│     ru-01              │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:59257 │
│   trg-controller       │      │ ready      │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:36377 │
│     mlt                │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:49567 │
│     tc-maker-1         │      │ configured │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:40613 │
└────────────────────────┴──────┴────────────┴───────────────┴──────────┴──────────┴───────────────────────────┘
                                      Run Info                                       
┌───────────────────────┬───────────────────────────────────────────────────────────┐
│ Run number            │ 1                                                         │
│ Run type              │ TEST                                                      │
│ Start time            │ 2025-09-10 14:54:41                                       │
│ Duration              │ 0:00:00                                                   │
│ Trigger rate          │ 0.0000 Hz                                                 │
│ Data storage disabled │ False                                                     │
│ Config file           │ oksconflibs:config/daqsystemtest/example-configs.data.xml │
│ Config ID             │ local-1x1-config                                          │
└───────────────────────┴───────────────────────────────────────────────────────────┘
[2025/09/10 14:54:41] ERROR      shell_utils.py:279             utils.ShellContext:                            FSM is in error (state: "ready"                                                                                                                                               
                                 sub_state: "ready"                                                                                                                                                                                                                                          
                                 in_error: true                                                                                                                                                                                                                                              
                                 included: true                                                                                                                                                                                                                                              
                                 run_info {                                                                                                                                                                                                                                                  
                                   run_type: "TEST"                                                                                                                                                                                                                                          
                                   run_number: 1                                                                                                                                                                                                                                             
                                   run_time_at_start: 1757508881                                                                                                                                                                                                                             
                                   run_config_file: "oksconflibs:config/daqsystemtest/example-configs.data.xml"                                                                                                                                                                              
                                   run_config_name: "local-1x1-config"                                                                                                                                                                                                                       
                                 }                                                                                                                                                                                                                                                           
                                 ), not currently accepting new commands.                                                                                                                                                                                                                    
[2025/09/10 14:54:41] INFO       shell_utils.py:13              unified_shell.shell_utils:                    Running sequence: ['disable-triggers', 'drain-dataflow', 'stop-trigger-sources', 'stop', 'scrap', 'terminate']                                                                 
[2025/09/10 14:54:41] INFO       shell_utils.py:33              unified_shell.shell_utils:                    Skipping command 'disable-triggers'                                                                                                                                            
[2025/09/10 14:54:42] INFO       shell_utils.py:36              unified_shell.shell_utils:                    Running command: 'drain-dataflow'                                                                                                                                              
[2025/09/10 14:54:42] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'drain_dataflow' on controller 'root-controller', targeting: 'root-controller'                                                                              
[2025/09/10 14:54:42] CRITICAL   shell_utils.py:466             controller.shell_utils:                       True                                                                                                                                                                           
[2025/09/10 14:54:42] ERROR      shell_utils.py:469             controller.shell_utils:                       Running in batch mode, and because error state is detected, exiting.                                                                                                           
[2025/09/10 14:54:42] INFO       shell.py:252                   unified_shell:                                Retracting the session PawelIssueTesting from the connectivity service                                                                                                         
[2025/09/10 14:54:42] INFO       shell.py:375                   unified_shell:                                Exiting unified_shell                                                                                                                                                          
[2025/09/10 14:54:42] INFO       shell.py:377                   unified_shell:                                unified_shell exited successfully.   

Now investigating why MSQT (minimal_system_quick_test) did not behave the same way.

@PawelPlesniak
Collaborator Author

Running the minimal_system_quick_test, I got

[2025/09/10 15:06:27] INFO       commands.py:53                 controller.interface:                         Command wait running for 20 seconds.                                                                                                                                           
[2025/09/10 15:06:28] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 15:06:28] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 15:06:28] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 15:06:28] ERROR      sh.py:1717                     process_manager.SSH_process_manager:          Connection to localhost closed.                                                                                                                                                
                                                                                                                                                                                                                                                                                             
[2025/09/10 15:06:28] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'ru-det-conn-0' (session: 'minimal', user: 'pplesnia') process exited with exit code 143                                                                               
[2025/09/10 15:06:28] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'df-01' (session: 'minimal', user: 'pplesnia') process exited with exit code 143                                                                                       
[2025/09/10 15:06:28] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'dfo-01' (session: 'minimal', user: 'pplesnia') process exited with exit code 143                                                                                      
[2025/09/10 15:06:28] INFO       ssh_process_manager.py:228     process_manager.SSH_process_manager:          Process 'mlt' (session: 'minimal', user: 'pplesnia') process exited with exit code 143                                                                                         
[2025/09/10 15:06:47] INFO       commands.py:55                 controller.interface:                         Command wait ran for 20 seconds.                                                                                                                                               
[2025/09/10 15:06:47] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'disable_triggers' on controller 'root-controller', targeting: 'root-controller'                                                                            
[2025/09/10 15:06:47] CRITICAL   shell_utils.py:466             controller.shell_utils:                       False                                                                                                                                                                          
                                                 minimal status                                                  
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State   ┃ Substate               ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ running │ disable_triggers-ready │ No       │ Yes      │ grpc://10.73.136.36:38323 │
│   df-controller   │      │ running │ running                │ No       │ Yes      │ grpc://10.73.136.36:40523 │
│     df-01         │      │ running │ idle                   │ Yes      │ Yes      │ rest://10.73.136.36:34171 │
│     dfo-01        │      │ running │ idle                   │ Yes      │ Yes      │ rest://10.73.136.36:52553 │
│   ru-controller   │      │ running │ running                │ No       │ Yes      │ grpc://10.73.136.36:37867 │
                                             minimal status                                             
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State   ┃ Substate      ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:38323 │
│   df-controller   │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:40523 │
│     df-01         │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:34171 │
│     dfo-01        │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:52553 │
│   ru-controller   │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:37867 │
│     ru-det-conn-0 │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:51589 │
│   trg-controller  │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:34379 │
│     mlt           │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:34525 │
└───────────────────┴──────┴─────────┴───────────────┴──────────┴──────────┴───────────────────────────┘
                                                      Run Info                                                      
┌───────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ Run number            │ 101                                                                                      │
│ Run type              │ TEST                                                                                     │
│ Start time            │ 2025-09-10 15:06:25                                                                      │
│ Duration              │ 0:00:22                                                                                  │
│ Trigger rate          │ 0.0000 Hz                                                                                │
│ Data storage disabled │ False                                                                                    │
│ Config file           │ oksconflibs:/tmp/pytest-of-pplesnia/pytest-7/config0/integtest-session-resolved.data.xml │
│ Config ID             │ minimal                                                                                  │
└───────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────┘
Waiting for disable_triggers to complete... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--
               disable_triggers execution report               
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Command execution      ┃ FSM transition ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ root-controller   │ Executed Successfully  │ Fsm Failed     │
│   trg-controller  │ Executed Successfully  │ Fsm Failed     │
│     mlt           │ Drunc Exception Thrown │ NA             │
│   ru-controller   │ Executed Successfully  │ Fsm Failed     │
│     ru-det-conn-0 │ Drunc Exception Thrown │ NA             │
│   df-controller   │ Executed Successfully  │ Fsm Failed     │
│     dfo-01        │ Drunc Exception Thrown │ NA             │
│     df-01         │ Drunc Exception Thrown │ NA             │
└───────────────────┴────────────────────────┴────────────────┘
                                             minimal status                                             
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State   ┃ Substate      ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:38323 │
│   df-controller   │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:40523 │
│     df-01         │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:34171 │
│     dfo-01        │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:52553 │
│   ru-controller   │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:37867 │
│     ru-det-conn-0 │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:51589 │
│   trg-controller  │      │ ready   │ ready         │ Yes      │ Yes      │ grpc://10.73.136.36:34379 │
│     mlt           │      │ running │ executing_cmd │ Yes      │ Yes      │ rest://10.73.136.36:34525 │
└───────────────────┴──────┴─────────┴───────────────┴──────────┴──────────┴───────────────────────────┘
                                                      Run Info                                                      
┌───────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ Run number            │ 101                                                                                      │
│ Run type              │ TEST                                                                                     │
│ Start time            │ 2025-09-10 15:06:25                                                                      │
│ Duration              │ 0:00:22                                                                                  │
│ Trigger rate          │ 0.0000 Hz                                                                                │
│ Data storage disabled │ False                                                                                    │
│ Config file           │ oksconflibs:/tmp/pytest-of-pplesnia/pytest-7/config0/integtest-session-resolved.data.xml │
│ Config ID             │ minimal                                                                                  │
└───────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────┘
[2025/09/10 15:06:47] ERROR      shell_utils.py:279             utils.ShellContext:                            FSM is in error (state: "ready"                                                                                                                                               
                                 sub_state: "ready"                                                                                                                                                                                                                                          
                                 in_error: true                                                                                                                                                                                                                                              
                                 included: true                                                                                                                                                                                                                                              
                                 run_info {                                                                                                                                                                                                                                                  
                                   run_type: "TEST"                                                                                                                                                                                                                                          
                                   run_number: 101                                                                                                                                                                                                                                           
                                   run_time_at_start: 1757509585                                                                                                                                                                                                                             
                                   run_time_since_start: 22                                                                                                                                                                                                                                  
                                   run_config_file: "oksconflibs:/tmp/pytest-of-pplesnia/pytest-7/config0/integtest-session-resolved.data.xml"                                                                                                                                               
                                   run_config_name: "minimal"                                                                                                                                                                                                                                
                                 }                                                                                                                                                                                                                                                           
                                 ), not currently accepting new commands.                                                                                                                                                                                                                    
[2025/09/10 15:06:47] INFO       commands.py:53                 controller.interface:                         Command wait running for 2 seconds.                                                                                                                                            
[2025/09/10 15:06:49] INFO       commands.py:55                 controller.interface:                         Command wait ran for 2 seconds.                                                                                                                                                
[2025/09/10 15:06:49] INFO       shell_utils.py:462             controller.shell_utils:                       Running transition 'drain_dataflow' on controller 'root-controller', targeting: 'root-controller'                                                                              
[2025/09/10 15:06:49] CRITICAL   shell_utils.py:466             controller.shell_utils:                       True                                                                                                                                                                           
[2025/09/10 15:06:49] ERROR      shell_utils.py:469             controller.shell_utils:                       Running in batch mode, and because error state is detected, exiting.                                                                                                           
[2025/09/10 15:06:49] INFO       shell.py:252                   unified_shell:                                Retracting the session minimal from the connectivity service                                                                                                                   
[2025/09/10 15:06:49] WARNING    client.py:111                  utils.ConnectivityServiceClient:              Session minimal not found on the connectivity service                                                                                                                          
[2025/09/10 15:06:49] INFO       shell.py:375                   unified_shell:                                Exiting unified_shell                                                                                                                                                          
[2025/09/10 15:06:49] INFO       shell.py:377                   unified_shell:                                unified_shell exited successfully. 

This was with killall daq_application. I will try to recreate the hang on start.
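
For readers of the log above: the exit code 143 reported for each process is consistent with termination by SIGTERM, which is the default signal sent by killall. The sketch below decodes such a code using the common shell convention of 128 plus the signal number. The helper name `describe_exit_code` is hypothetical and is not part of drunc.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit code using the shell convention 128 + signal number.

    Hypothetical helper for illustration only; not part of drunc.
    """
    if code > 128:
        signum = code - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            # The number does not map to a signal known on this platform.
            return f"terminated by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_exit_code(143))  # prints "terminated by SIGTERM"
```

This only reflects the convention. Whether a given process manager reports signals this way depends on how it collects the exit status.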

@PawelPlesniak
Collaborator Author

Killing everything on boot I got

  Looking for root-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 0:00:01
⠋ Trying to talk to the root controller... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:00
                                               minimal status                                               
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State        ┃ Substate     ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.36:43819 │
│   df-controller   │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.36:41259 │
│     df-01         │      │ unknown      │ unknown      │ No       │ Yes      │                           │
│     dfo-01        │      │ unknown      │ unknown      │ No       │ Yes      │                           │
│   ru-controller   │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.36:46769 │
│     ru-det-conn-0 │      │ unknown      │ unknown      │ No       │ Yes      │                           │
│   trg-controller  │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.36:36329 │
│     mlt           │      │ unknown      │ unknown      │ No       │ Yes      │                           │
└───────────────────┴──────┴──────────────┴──────────────┴──────────┴──────────┴───────────────────────────┘
                                               minimal status                                               
┏━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name              ┃ Info ┃ State        ┃ Substate     ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller   │      │ initial      │ initial      │ Yes      │ Yes      │ grpc://10.73.136.36:43819 │
│   df-controller   │      │ initialising │ initialising │ No       │ Yes      │ grpc://10.73.136.36:41259 │
│     df-01         │      │ unknown      │ unknown      │ No       │ Yes      │                           │
│     dfo-01        │      │ initial      │ idle         │ Yes      │ Yes      │ rest://10.73.136.36:33293 │
│   ru-controller   │      │ initial      │ initial      │ Yes      │ Yes      │ grpc://10.73.136.36:46769 │
│     ru-det-conn-0 │      │ initial      │ idle         │ Yes      │ Yes      │ rest://10.73.136.36:58801 │
│   trg-controller  │      │ initial      │ initial      │ Yes      │ Yes      │ grpc://10.73.136.36:36329 │
│     mlt           │      │ initial      │ idle         │ Yes      │ Yes      │ rest://10.73.136.36:40831 │
└───────────────────┴──────┴──────────────┴──────────────┴──────────┴──────────┴───────────────────────────┘
Waiting on tree initialisation... ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   4% 0:01:07
[2025/09/10 15:08:36] ERROR      commands.py:74                 unified_shell.boot:                           Booted, but the top controller is in error                                                                                                                                     
[2025/09/10 15:08:36] ERROR      commands.py:76                 unified_shell.boot:                           Unified shell: Running in batch mode, and because error state is detected, exiting.                                                                                            
[2025/09/10 15:08:36] INFO       shell.py:252                   unified_shell:                                Retracting the session minimal from the connectivity service                                                                                                                   
[2025/09/10 15:08:36] WARNING    client.py:111                  utils.ConnectivityServiceClient:              Session minimal not found on the connectivity service                                                                                                                          
[2025/09/10 15:08:36] INFO       shell.py:375                   unified_shell:                                Exiting unified_shell                                                                                                                                                          
[2025/09/10 15:08:36] INFO       shell.py:377                   unified_shell:                                unified_shell exited successfully.   

Going through start was very quick. @eflumerf, could you try recreating the problem again?
Note: the single log entries of True or False came from a debugging statement that I have since removed. No changes have been made since you posted the error.
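
For reference, the behaviour seen in the logs (a failed transition followed by the next command detecting the error state and exiting) can be sketched as below. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeController`, `BatchAborted`, and `run_batch` are hypothetical names, and the real check lives in the shell utilities and works with the click context.

```python
from dataclasses import dataclass, field
from typing import List, Set


class BatchAborted(Exception):
    """Raised when a batched sequence stops because an error state is detected."""


@dataclass
class FakeController:
    # Hypothetical stand-in for the root controller; not drunc's real API.
    failing_transitions: Set[str] = field(default_factory=set)
    in_error: bool = False
    executed: List[str] = field(default_factory=list)

    def run_transition(self, name: str) -> None:
        # Record the attempt, then flag the error state if this transition fails.
        self.executed.append(name)
        if name in self.failing_transitions:
            self.in_error = True


def run_batch(controller: FakeController, transitions: List[str],
              batch_mode: bool = True) -> List[str]:
    """Run transitions in order; in batch mode, stop before any command
    that would be sent while the controller is in error."""
    for name in transitions:
        if batch_mode and controller.in_error:
            raise BatchAborted(
                f"error state detected before '{name}', exiting batch"
            )
        controller.run_transition(name)
    return controller.executed
```

With `failing_transitions={"disable_triggers"}` and the sequence `disable_triggers`, `drain_dataflow`, `stop`, only `disable_triggers` is attempted before `BatchAborted` is raised, which mirrors the log ordering above. With `batch_mode=False` the sketch keeps going, leaving the decision to the interactive user.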

Member

@eflumerf eflumerf left a comment

Looks good for now, more detailed error handling and recovery to be discussed

@jcfreeman2
Contributor

The PawelPlesniak/SimpleFixes branches of drunc and druncschema both represent an improvement over the status quo: twice I tried killing a process at random after starting minimal_system_quick_test.py, and both times the system no longer tried to go through all transitions. I especially liked "your best option at this stage is to terminate and boot". I will say that in one of my attempts the system spent roughly another 30 seconds trying to complete the start transition rather than throwing up its hands immediately. Failing immediately would have been nice, but that may go beyond the scope of this PR.

@PawelPlesniak PawelPlesniak merged commit 92f75b6 into develop Sep 15, 2025
1 check passed
@PawelPlesniak PawelPlesniak deleted the PawelPlesniak/SimpleFixes branch September 15, 2025 09:28

Development

Successfully merging this pull request may close these issues.

Abort command sequence
