Sample integtest that uses multiple computers at EHN1 #291

Open
bieryAtFnal wants to merge 15 commits into develop from kbiery/demo_multihost_integtest

@bieryAtFnal
Contributor

Description

In recent discussions, we talked about the possibility of having an integtest (integration/regression test) that demonstrates the use of multiple computers at EHN1. This PR provides such a test, called sample_ehn1_multihost_test.py. Some aspects of the test are not yet ideal, but I feel that it is worth gathering feedback on what is available so far, and merging it to the develop branch if no show-stoppers are identified.

This new integtest can be run on any computer, inside or outside the NP04 DAQ cluster, but it only fully works from inside the cluster. On other computers, the tests are skipped.
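As an illustration of the skip behavior, here is a minimal sketch of how a test file could decide whether it is running inside the cluster. It is not the code from this PR: the `np04-` hostname prefix check and the names `NP04_HOST_PREFIX` and `skip_reason` are assumptions made for the example.

```python
import socket

# Assumption: NP04 DAQ cluster nodes can be recognized by this hostname prefix.
NP04_HOST_PREFIX = "np04-"


def skip_reason(hostname=None):
    """Return None when the host looks like an NP04 node, else a skip message."""
    name = socket.gethostname() if hostname is None else hostname
    # Compare only the short name so fully qualified hostnames also work.
    short_name = name.split(".")[0]
    if short_name.startswith(NP04_HOST_PREFIX):
        return None
    return f"'{short_name}' is not in the NP04 DAQ cluster; skipping multi-host test"
```

In a pytest file, a non-None result could be passed to `pytest.skip(reason, allow_module_level=True)` so that the reason is reported for every skipped test.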

One of the non-ideal aspects of the current test is that several steps must be followed in the current Linux environment before the test will run successfully, even within the NP04 DAQ cluster. These steps are listed at the top of the script and are reproduced below. They are needed, among other things, to direct the pytest output to a directory on a shared (e.g. NFS) disk.

In the test system that is instantiated, DAQ processes are run on four computers in the NP04 cluster (np04-srv-021/22/28/29) plus the computer on which the test is run, if that computer is not one of srv-021/22/28/29.

Some notes on the implementation:

  • entries for the four named computers that are used in the test were added to the daqsystemtest/config/daqsystemtest/hosts.data.xml file. Beyond that, tweaks to the local-1x1-config configuration are done dynamically using config substitutions in the integtest file.
  • unneeded configuration entries for setting a CONNECTION_SERVER environment variable were removed from ccm.data.xml.
  • there are some work-arounds that are included in the test. For example, it seems that ConnectivityServer instances are not being successfully shut down in certain situations ([Bug]: ConnSvc gunicorn is not being cleaned up in certain types of integtests when run at EHN1 drunc#754), and special code was added to this test to clean those up.
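For readers unfamiliar with this kind of work-around, the sketch below shows one way stale gunicorn processes could be cleaned up on a remote host over ssh. It is an assumption-laden illustration, not the code added in this PR: the function names are hypothetical, and the choice of `pkill` is mine.

```python
import getpass
import shlex
import subprocess


def build_cleanup_command(host, user, pattern="[g]unicorn"):
    """Build an ssh command that kills the user's matching processes on host.

    The default pattern uses the bracket trick: the regex [g]unicorn matches
    'gunicorn' but not its own literal text, so the remote shell that carries
    the pattern on its command line is not matched by 'pkill -f'.
    """
    remote = f"pkill -u {shlex.quote(user)} -f {shlex.quote(pattern)}"
    return ["ssh", host, remote]


def cleanup_stale_processes(host, user=None, timeout=30):
    """Run the cleanup and return the exit code reported through ssh.

    pkill exits with 0 if at least one process was signalled and 1 if nothing
    matched; ssh itself returns 255 when the connection fails.
    """
    user = user or getpass.getuser()
    completed = subprocess.run(
        build_cleanup_command(host, user),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,  # "nothing matched" (exit code 1) is not an error here
    )
    return completed.returncode
```

Restricting the match with `-u` keeps the cleanup from touching other users' ConnectivityServer instances on a shared machine.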

Here are the comments from the top of the sample_ehn1_multihost_test.py file which include the special instructions that need to be followed:

# 29-Jan-2026, KAB: Steps to run this test:
# - Log into any np04-srv-XYZ computer and set up a software area with the
#     appropriate branch of daqsystemtest.
# - 'cd $DBT_AREA_ROOT/sourcecode/daqsystemtest/integtest'
# - 'mkdir -p $HOME/dunedaq/scratch'  # only need to do this once per user account
# - 'export PYTEST_DEBUG_TEMPROOT=$HOME/dunedaq/scratch'  # once per login/shell
# - 'pytest -s ./sample_ehn1_multihost_test.py'
#
# This test currently puts the various DAQ processes on the following computers:
# - np04-srv-021:  ru-01, ru-controller
# - np04-srv-022:  mlt, tc-maker-1, trg-controller
# - np04-srv-028:  local-connection-server, tp-stream-writer
# - np04-srv-029:  dfo-01, df-01, df-controller
# - localhost:  hsi-fake-01, hsi-fake-to-tc-app, root-controller
#
# The choice of running the tp-stream-writer on a different computer than the other
# Dataflow apps was just to show that it works.  And, running the ConnectivityServer
# on a computer other than "localhost" was also to show that it can be done.
#
# To enable the capturing of TRACE messages on all of the computers...
# - 'export TRACE_FILE=/tmp/pytest-of-${USER}/log/${USER}_dunedaq.trace'  # once per login/shell
# - 'mkdir -p /tmp/pytest-of-${USER}/log'  # only need to do this once per computer
# - 'ssh np04-srv-021 "mkdir -p /tmp/pytest-of-${USER}/log"'  # only once per user
# - 'ssh np04-srv-022 "mkdir -p /tmp/pytest-of-${USER}/log"'  # only once per user
# - 'ssh np04-srv-028 "mkdir -p /tmp/pytest-of-${USER}/log"'  # only once per user
# - 'ssh np04-srv-029 "mkdir -p /tmp/pytest-of-${USER}/log"'  # only once per user
# - 'pytest -s ./sample_ehn1_multihost_test.py'
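The manual steps above lend themselves to a small pre-flight check. The sketch below is hypothetical (the names `HOST_LAYOUT`, `preflight_problems`, and `remote_log_dir_commands` are not from the PR); it only encodes the host placement and environment variables quoted in the comments.

```python
import os

# Process placement, copied from the comments at the top of the test script.
HOST_LAYOUT = {
    "np04-srv-021": ["ru-01", "ru-controller"],
    "np04-srv-022": ["mlt", "tc-maker-1", "trg-controller"],
    "np04-srv-028": ["local-connection-server", "tp-stream-writer"],
    "np04-srv-029": ["dfo-01", "df-01", "df-controller"],
    "localhost": ["hsi-fake-01", "hsi-fake-to-tc-app", "root-controller"],
}


def preflight_problems(environ=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if environ is None else environ
    problems = []

    temproot = env.get("PYTEST_DEBUG_TEMPROOT")
    if not temproot:
        problems.append("PYTEST_DEBUG_TEMPROOT is not set")
    elif not os.path.isdir(temproot):
        problems.append(f"PYTEST_DEBUG_TEMPROOT directory is missing: {temproot}")

    # TRACE capture is optional, so only check it when TRACE_FILE is set.
    trace_file = env.get("TRACE_FILE")
    if trace_file and not os.path.isdir(os.path.dirname(trace_file)):
        problems.append(
            f"TRACE_FILE directory is missing: {os.path.dirname(trace_file)}"
        )

    return problems


def remote_log_dir_commands(user):
    """Build the 'ssh <host> mkdir -p ...' commands for the remote hosts."""
    log_dir = f"/tmp/pytest-of-{user}/log"
    return [
        ["ssh", host, f"mkdir -p {log_dir}"]
        for host in HOST_LAYOUT
        if host != "localhost"
    ]
```

Calling `preflight_problems()` before invoking pytest would report a missing scratch directory up front instead of letting the run fail later.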

Type of change

  • New feature or enhancement (non-breaking change which adds functionality)

Testing checklist

  • Minimal system quicktest passes (pytest -s minimal_system_quick_test.py)
  • Full set of integration tests pass (daqsystemtest_integtest_bundle.sh)

Further checks

  • Code is commented where needed, particularly in hard-to-understand areas

@bieryAtFnal
Contributor Author

Here is a system configuration diagram with hostnames added by hand:

MultiHostConfigEHN1_02Feb2026.pdf

… the reasons are still shown even when console output is reduced.
Contributor

@mroda88 mroda88 left a comment

I tested it with the nightly 260202. The first attempt didn't work. I got this error:

platform linux -- Python 3.10.10, pytest-8.3.3, pluggy-1.5.0
rootdir: /nfs/home/maroda/DAQ/NFD_DEV_260203_A9/sourcecode/daqsystemtest
configfile: pytest.ini
plugins: anyio-4.6.2.post1, integrationtest-3.6.0
collected 4 items                                                                                                                                                                                                   

sample_ehn1_multihost_test.py 
Integtest preconfigured config file: /nfs/home/maroda/DAQ/NFD_DEV_260203_A9/sourcecode/daqsystemtest/integtest/../config/daqsystemtest/example-configs.data.xml
Found free Kubernetes NodePort: 30566
Updated Connectivity Service 'local-connectivity-service' to use port 30566
Updated runtime environment variable 'local-env-connectivity-port' to '30566'
Successfully configured connectivity service port for session 'local-1x1-config'.
Found free Kubernetes NodePort: 31136
Updated RC Controller Service 'root-rccontroller_control' to use port 31136
Successfully configured RC controller port for session 'local-1x1-config'.
++++++++++ DRUNC Run BEGIN ++++++++++
[2026/02/03 16:47:08 UTC] INFO       shell.py:180                             drunc.unified_shell                                Setting up to use the process manager with configuration ssh-standalone and 
configuration id "local-1x1-config" from oksconflibs:/nfs/home/maroda/dunedaq/scratch/pytest-of-maroda/pytest-0/config0/integtest-session-resolved.data.xml
[2026/02/03 16:47:08 UTC] INFO       shell.py:202                             drunc.unified_shell                                Starting process manager
[2026/02/03 16:47:08 UTC] INFO       process_manager.py:108                   drunc.process_manager                              process_manager communicating through address 10.73.136.38:35365
[2026/02/03 16:47:08 UTC] INFO       shell_utils.py:685                       drunc.controller.iface.shell_utils                 Using environment variable 'DRUNC_RUN_TYPE_DEFAULT' as default value for argument 
'--run-type' of start ('TEST')
[2026/02/03 16:47:08 UTC] INFO       shell.py:538                             drunc.unified_shell                                unified_shell ready with process_manager and controller commands
[2026/02/03 16:47:08 UTC] INFO       process_manager_driver.py:96             drunc.process_manager_driver                       Booting session local-1x1-config
[2026/02/03 16:47:08 UTC] INFO       ssh_process_manager.py:340               drunc.process_manager.SSH_SHELL_process_manager    Booted 'local-connection-server' from session 'local-1x1-config' with UUID 
1d45d8ef-4301-492f-8034-cd3b92dc18b3
[2026/02/03 16:47:19 UTC] INFO       shell.py:411                             drunc.unified_shell                                Shutting down the unified_shell
[2026/02/03 16:47:19 UTC] INFO       ssh_process_manager.py:174               drunc.process_manager.SSH_SHELL_process_manager    Terminating
[2026/02/03 16:47:19 UTC] INFO       ssh_process_manager.py:177               drunc.process_manager.SSH_SHELL_process_manager    Killing all the known processes before exiting
[2026/02/03 16:47:20 UTC] INFO       ssh_process_manager.py:277               drunc.process_manager.SSH_SHELL_process_manager    Process 'local-connection-server' (session: 'local-1x1-config', user: 'maroda') 
process exited with exit code 255
[2026/02/03 16:47:20 UTC] INFO       ssh_process_lifetime_manager_shell.py:38 drunc.process_manager.SSH_SHELL_process_manager    Process 1d45d8ef-4301-492f-8034-cd3b92dc18b3 terminated
[2026/02/03 16:47:20 UTC] INFO       ssh_process_manager.py:119               drunc.process_manager.SSH_SHELL_process_manager    Killed 'local-connection-server' with UUID 1d45d8ef-4301-492f-8034-cd3b92dc18b3
[2026/02/03 16:47:20 UTC] INFO       shell_utils.py:135                       drunc.utils.ShellContext                           You will not be able to issue commands to the process_manager anymore.
[2026/02/03 16:47:20 UTC] INFO       shell_utils.py:137                       drunc.utils.ShellContext                           Process_manager driver has been deleted.
[2026/02/03 16:47:20 UTC] INFO       process_manager.py:128                   drunc.process_manager                              Shutting down the process manager server
[2026/02/03 16:47:20 UTC] INFO       shell.py:513                             drunc.unified_shell                                unified_shell exited successfully
---------- DRUNC Run END ----------

*** PLEASE NOTE: this script is cleaning up stale _gunicorn_ processes on np04-srv-028...

========================================
EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM
========================================
.FFF

===================================================================================================== FAILURES ======================================================================================================
_______________________________________________________________________ test_log_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] ________________________________________________________________________
sample_ehn1_multihost_test.py:327: in test_log_files
    assert any(
E   assert False
E    +  where False = any(<generator object test_log_files.<locals>.<genexpr> at 0x7f83bc00bed0>)
_______________________________________________________________________ test_data_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] _______________________________________________________________________
sample_ehn1_multihost_test.py:368: in test_data_files
    assert len(run_nanorc.data_files) == expected_file_count, f"Unexpected file count: Actual: {len(run_nanorc.data_files)}, Expected: {expected_file_count}"
E   AssertionError: Unexpected file count: Actual: 0, Expected: 1
E   assert 0 == 1
E    +  where 0 = len([])
E    +    where [] = <integrationtest.integrationtest_drunc.run_nanorc.<locals>.RunResult object at 0x7f83bcda62c0>.data_files
_____________________________________________________________________ test_tpstream_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] _____________________________________________________________________
sample_ehn1_multihost_test.py:429: in test_tpstream_files
    assert len(tpstream_files) == 1  # one for each run
E   assert 0 == 1
E    +  where 0 = len([])
============================================================================================== short test summary info ==============================================================================================
FAILED sample_ehn1_multihost_test.py::test_log_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] - assert False
FAILED sample_ehn1_multihost_test.py::test_data_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] - AssertionError: Unexpected file count: Actual: 0, Expected: 1
FAILED sample_ehn1_multihost_test.py::test_tpstream_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] - assert 0 == 1
=========================================================================================== 3 failed, 1 passed in 21.98s ============================================================================================

I just re-ran the pytest command and it all worked OK.

@bieryAtFnal
Contributor Author

Yes, I have also seen that "first-time" failure occasionally. Unfortunately, I haven't yet figured out what characteristic of the local computer or user account or whatever is causing it. I just tried running on several different EHN1 computers with drunc logging set to debug, and I was unable to catch the problem...

…ters in sample_ehn1_multihost_test.py so that ConnSvc startup (and other app startup) doesn't take a long time the first time that the test is run.
@bieryAtFnal
Contributor Author

@mroda88, I believe that the "first-time-startup" problem is caused by the time needed to load a new base release onto the remote computers. This mostly affects the ConnectivityServer. In any case, I have added code to sample_ehn1_multihost_test.py to pre-load the software environment on the remote computers, and this should help avoid that problem when a user initially tries to use the new test.
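To make the pre-loading idea concrete, here is a rough sketch of warming up several remote hosts in parallel before the first run. The actual command used in sample_ehn1_multihost_test.py is not shown in this conversation, so `remote_command` is left as a caller-supplied placeholder, and `REMOTE_HOSTS`, `run_over_ssh`, and `preload_hosts` are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# The four remote computers named in the PR description.
REMOTE_HOSTS = ["np04-srv-021", "np04-srv-022", "np04-srv-028", "np04-srv-029"]


def run_over_ssh(host, remote_command, timeout=300):
    """Run one command on one host and return its exit code."""
    completed = subprocess.run(
        ["ssh", host, remote_command],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return completed.returncode


def preload_hosts(hosts, remote_command, runner=run_over_ssh):
    """Run remote_command on every host concurrently; map host -> exit code.

    remote_command is an assumption of this sketch: it stands for whatever
    sets up the software environment on the remote side.
    """
    if not hosts:
        return {}
    # One worker per host so that a slow first-time load on one machine
    # does not delay the others.
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        codes = list(pool.map(lambda h: runner(h, remote_command), hosts))
    return dict(zip(hosts, codes))
```

The `runner` parameter exists so the function can be exercised without network access, for example by passing a stub that returns 0 for every host.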

@mroda88
Contributor

mroda88 commented Feb 4, 2026

Hi @bieryAtFnal , yes I suspected the same. Thanks for adding the change!
