Sample integtest that uses multiple computers at EHN1 #291
bieryAtFnal wants to merge 15 commits into develop
…env var is no longer used.
…tihost_test.py plus other minor enhancements.
…mpdir looks reasonable).
Here is a system configuration diagram with hostnames added by hand:
… the reasons are still shown even when console output is reduced.
mroda88
left a comment
I tested it with the nightly 260202. The first attempt didn't work. I got this error:
platform linux -- Python 3.10.10, pytest-8.3.3, pluggy-1.5.0
rootdir: /nfs/home/maroda/DAQ/NFD_DEV_260203_A9/sourcecode/daqsystemtest
configfile: pytest.ini
plugins: anyio-4.6.2.post1, integrationtest-3.6.0
collected 4 items
sample_ehn1_multihost_test.py
Integtest preconfigured config file: /nfs/home/maroda/DAQ/NFD_DEV_260203_A9/sourcecode/daqsystemtest/integtest/../config/daqsystemtest/example-configs.data.xml
Found free Kubernetes NodePort: 30566
Updated Connectivity Service 'local-connectivity-service' to use port 30566
Updated runtime environment variable 'local-env-connectivity-port' to '30566'
Successfully configured connectivity service port for session 'local-1x1-config'.
Found free Kubernetes NodePort: 31136
Updated RC Controller Service 'root-rccontroller_control' to use port 31136
Successfully configured RC controller port for session 'local-1x1-config'.
++++++++++ DRUNC Run BEGIN ++++++++++
[2026/02/03 16:47:08 UTC] INFO shell.py:180 drunc.unified_shell Setting up to use the process manager with configuration ssh-standalone and
configuration id "local-1x1-config" from oksconflibs:/nfs/home/maroda/dunedaq/scratch/pytest-of-maroda/pytest-0/config0/integtest-session-resolved.data.xml
[2026/02/03 16:47:08 UTC] INFO shell.py:202 drunc.unified_shell Starting process manager
[2026/02/03 16:47:08 UTC] INFO process_manager.py:108 drunc.process_manager process_manager communicating through address 10.73.136.38:35365
[2026/02/03 16:47:08 UTC] INFO shell_utils.py:685 drunc.controller.iface.shell_utils Using environment variable 'DRUNC_RUN_TYPE_DEFAULT' as default value for argument
'--run-type' of start ('TEST')
[2026/02/03 16:47:08 UTC] INFO shell.py:538 drunc.unified_shell unified_shell ready with process_manager and controller commands
[2026/02/03 16:47:08 UTC] INFO process_manager_driver.py:96 drunc.process_manager_driver Booting session local-1x1-config
[2026/02/03 16:47:08 UTC] INFO ssh_process_manager.py:340 drunc.process_manager.SSH_SHELL_process_manager Booted 'local-connection-server' from session 'local-1x1-config' with UUID
1d45d8ef-4301-492f-8034-cd3b92dc18b3
[2026/02/03 16:47:19 UTC] INFO shell.py:411 drunc.unified_shell Shutting down the unified_shell
[2026/02/03 16:47:19 UTC] INFO ssh_process_manager.py:174 drunc.process_manager.SSH_SHELL_process_manager Terminating
[2026/02/03 16:47:19 UTC] INFO ssh_process_manager.py:177 drunc.process_manager.SSH_SHELL_process_manager Killing all the known processes before exiting
[2026/02/03 16:47:20 UTC] INFO ssh_process_manager.py:277 drunc.process_manager.SSH_SHELL_process_manager Process 'local-connection-server' (session: 'local-1x1-config', user: 'maroda')
process exited with exit code 255
[2026/02/03 16:47:20 UTC] INFO ssh_process_lifetime_manager_shell.py:38 drunc.process_manager.SSH_SHELL_process_manager Process 1d45d8ef-4301-492f-8034-cd3b92dc18b3 terminated
[2026/02/03 16:47:20 UTC] INFO ssh_process_manager.py:119 drunc.process_manager.SSH_SHELL_process_manager Killed 'local-connection-server' with UUID 1d45d8ef-4301-492f-8034-cd3b92dc18b3
[2026/02/03 16:47:20 UTC] INFO shell_utils.py:135 drunc.utils.ShellContext You will not be able to issue commands to the process_manager anymore.
[2026/02/03 16:47:20 UTC] INFO shell_utils.py:137 drunc.utils.ShellContext Process_manager driver has been deleted.
[2026/02/03 16:47:20 UTC] INFO process_manager.py:128 drunc.process_manager Shutting down the process manager server
[2026/02/03 16:47:20 UTC] INFO shell.py:513 drunc.unified_shell unified_shell exited successfully
---------- DRUNC Run END ----------
*** PLEASE NOTE: this script is cleaning up stale _gunicorn_ processes on np04-srv-028...
========================================
EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM
========================================
.FFF
===================================================================================================== FAILURES ======================================================================================================
_______________________________________________________________________ test_log_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] ________________________________________________________________________
sample_ehn1_multihost_test.py:327: in test_log_files
assert any(
E assert False
E + where False = any(<generator object test_log_files.<locals>.<genexpr> at 0x7f83bc00bed0>)
_______________________________________________________________________ test_data_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] _______________________________________________________________________
sample_ehn1_multihost_test.py:368: in test_data_files
assert len(run_nanorc.data_files) == expected_file_count, f"Unexpected file count: Actual: {len(run_nanorc.data_files)}, Expected: {expected_file_count}"
E AssertionError: Unexpected file count: Actual: 0, Expected: 1
E assert 0 == 1
E + where 0 = len([])
E + where [] = <integrationtest.integrationtest_drunc.run_nanorc.<locals>.RunResult object at 0x7f83bcda62c0>.data_files
_____________________________________________________________________ test_tpstream_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] _____________________________________________________________________
sample_ehn1_multihost_test.py:429: in test_tpstream_files
assert len(tpstream_files) == 1 # one for each run
E assert 0 == 1
E + where 0 = len([])
============================================================================================== short test summary info ==============================================================================================
FAILED sample_ehn1_multihost_test.py::test_log_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] - assert False
FAILED sample_ehn1_multihost_test.py::test_data_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] - AssertionError: Unexpected file count: Actual: 0, Expected: 1
FAILED sample_ehn1_multihost_test.py::test_tpstream_files[EHN1 MultiHost 1x1 Conf-StandAloneSSH_PM-run_nanorc0] - assert 0 == 1
=========================================================================================== 3 failed, 1 passed in 21.98s ============================================================================================
I just re-ran the pytest command and it all worked OK.
Yes, I have also seen that "first-time" failure occasionally. Unfortunately, I haven't yet figured out what characteristic of the local computer or user account or whatever is causing it. I just tried running on several different EHN1 computers with
…ters in sample_ehn1_multihost_test.py so that ConnSvc startup (and other app startup) doesn't take a long time the first time that the test is run.
@mroda88, I believe that the "first-time-startup" problem is caused by the time needed to load a new base release onto the remote computers. This mostly affects the ConnectivityServer. In any case, I have added code to the
Hi @bieryAtFnal, yes I suspected the same. Thanks for adding the change!
Description
In recent discussions, we talked about the possibility of having an integtest (integration/regression test) that demonstrates the use of multiple computers at EHN1. The changes covered by this PR provide such a test. It is called sample_ehn1_multihost_test.py. There are still some aspects of this test that are not yet ideal, but I feel that it is worth gathering feedback on what is available so far, and merging what exists to the develop branch if no show-stoppers are identified.
This new integtest can be run on any computer, inside or outside the NP04 DAQ cluster, but it will only fully work from inside the NP04 DAQ cluster. On other computers, the tests are skipped.
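The skip behaviour described above could be sketched roughly as follows. This is only an illustration, not the code in sample_ehn1_multihost_test.py: the helper name `on_np04_cluster` and the assumption that cluster hostnames start with `np04-` are hypothetical.

```python
import socket

# Assumption (not from the PR): hosts inside the NP04 DAQ cluster
# have short names that start with "np04-".
NP04_PREFIX = "np04-"


def on_np04_cluster(hostname=None):
    """Return True if the given (or local) hostname looks like an NP04 DAQ host."""
    if hostname is None:
        hostname = socket.gethostname()
    # Compare only the short name so that fully-qualified names also match.
    short_name = hostname.split(".")[0].lower()
    return short_name.startswith(NP04_PREFIX)
```

In a pytest file, a module-level `pytestmark = pytest.mark.skipif(not on_np04_cluster(), reason="not inside the NP04 DAQ cluster")` would then skip every test in the module on other computers.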
One of the non-ideal aspects of the current test is that several steps need to be followed in the current Linux environment before the test will run successfully, even within the NP04 DAQ cluster. These steps are listed at the top of the script and are reproduced below. They are needed to direct the pytest output to a directory on a shared (e.g. NFS) disk, etc.
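As a rough illustration of why the output location matters, a check that a candidate pytest output directory sits on a shared disk might look like the sketch below. The function name `output_dir_looks_shared` and the choice of `/nfs` as the shared root (taken from the paths that appear in the log above) are assumptions, not part of the actual test.

```python
import os

# Assumption: shared disks are mounted under /nfs, as in the log paths above.
SHARED_ROOTS = ("/nfs",)


def output_dir_looks_shared(path, shared_roots=SHARED_ROOTS):
    """Return True if `path` lies under one of the assumed shared roots."""
    abs_path = os.path.abspath(path)
    for root in shared_roots:
        # commonpath compares whole path components, so "/nfsother" does not match "/nfs".
        if os.path.commonpath([abs_path, root]) == root:
            return True
    return False
```

A directory on a local disk, such as one under /tmp, would not be visible to the DAQ processes running on the other computers, which is the situation this kind of check is meant to catch.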
In the test system that is instantiated, DAQ processes are run on four computers in the NP04 cluster (np04-srv-021/22/28/29) plus the computer on which the test is run, if that computer is not one of srv-021/22/28/29.
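The set of computers involved can be sketched as below. The four-entry list expands the "np04-srv-021/22/28/29" shorthand from the description; the function name `hosts_in_test` is hypothetical and the real test may determine its hosts differently.

```python
import socket

# The four NP04 computers named in the description (np04-srv-021/22/28/29).
REMOTE_HOSTS = ["np04-srv-021", "np04-srv-022", "np04-srv-028", "np04-srv-029"]


def hosts_in_test(local_host=None):
    """Return the computers that would run DAQ processes for this test."""
    if local_host is None:
        local_host = socket.gethostname()
    short_name = local_host.split(".")[0]
    hosts = list(REMOTE_HOSTS)
    # The local computer is added only if it is not already one of the four.
    if short_name not in hosts:
        hosts.append(short_name)
    return hosts
```

Run from np04-srv-028, this gives four hosts; run from any other computer, it gives five.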
Some notes on the implementation:
…daqsystemtest/config/daqsystemtest/hosts.data.xml file. Beyond that, tweaks to the local-1x1-config configuration are done dynamically using config substitutions in the integtest file.
…CONNECTION_SERVER environmental variable were removed from ccm.data.xml.
ConnectivityServer instances are not being successfully shut down in certain situations ([Bug]: ConnSvc gunicorn is not being cleaned up in certain types of integtests when run at EHN1 drunc#754), and special code was added to this test to clean those up.
Here are the comments from the top of the sample_ehn1_multihost_test.py file, which include the special instructions that need to be followed:
Type of change
Testing checklist
…pytest -s minimal_system_quick_test.py)
…daqsystemtest_integtest_bundle.sh)
Further checks