[CI] migrate CI test steps to containers #310

NirWolfer · 2025-02-19T12:20:18Z

Description

Today we use benni09 static agent to run test/gtest/valgrind steps which is unscalable since it can only run one pipeline at a time, causing delays in builds that can be stuck waiting for hours

The idea is to move these steps to containers, allowing running them in parallel as well as running multiple pipelines at the same time (depending on the capacity of the k8s cluster)

What

Change test/gtest/valgrind steps to run on containers on swx-k8s-spray cluster instead of benni09

Why ?

HPCINFRA-3249

Change type

What kind of change does this PR introduce?

Check list

Code follows the style de facto guidelines of this project
Comments have been inserted in hard to understand places
Documentation has been updated (if necessary)
Test has been added (if possible)

.ci/dockerfiles/Dockerfile.ubuntu22.04

contrib/jenkins_tests/globals.sh

contrib/jenkins_tests/vg.sh

dpressle · 2025-03-05T08:34:54Z

bot:retest

dpressle

If gtest works, we can now enable concurrent builds

NirWolfer · 2025-03-06T11:44:44Z

bot:retest

dpressle · 2025-03-06T12:11:46Z

contrib/jenkins_tests/globals.sh

+
+function do_hugepages()
+{
+    if [[ -f /.dockerenv && ! $(grep -q hugetlbfs /proc/mounts) ]]; then


You should probably do a more extensive check for containterized environment like you do in gtest script
[[ -f /.dockerenv || -f /run/.containerenv || -n "${KUBERNETES_SERVICE_HOST}" ]]

dpressle · 2025-03-06T13:42:04Z

contrib/jenkins_tests/vg.sh

-#fi
-if [ ! -z "$(do_get_ip 'eth')" ]; then
-	test_ip_list="${test_ip_list} eth_ip4:$(do_get_ip 'eth')"
+if [[ -f /.dockerenv ]] || [[ -f /run/.containerenv ]] || [[ -n "${KUBERNETES_SERVICE_HOST}" ]]; then


can you improve it a little bit and reduce duplication, so instead of build the ip list string in 2 places just have variables for each ip and generate the list in one place
if
ipv4=get_ipv4...
ipv6=get_ipv6...
else
ipv4=
ipv6=
fi

test_ip_list="eth_ip4:${ipv4} eth_ip6:${ipv6}"

dpressle · 2025-03-06T13:43:39Z

contrib/jenkins_tests/gtest.sh

+else
+	gtest_opt="--addr=$(do_get_addrs 'eth' ${opt2})"
+	gtest_opt_ipv6="--addr=$(do_get_addrs 'inet6' ${opt2}) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff" # Remote - Dummy Address
+fi


can you add check if ips are empty (like we have in all other tests scripts)?

dpressle · 2025-04-17T14:12:47Z

/review

pr-review-bot-app · 2025-04-17T14:13:33Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns Sensitive information exposure: The Dockerfile in `.ci/dockerfiles/Dockerfile.ubuntu22.04` configures passwordless SSH login for the root user. This could be exploited if the container is compromised. It is recommended to use more secure authentication mechanisms or restrict SSH access.
⚡ Recommended focus areas for review Possible Issue The replacement of `t_lock_dummy_ring` with a dynamically allocated `lock_dummy` object in the `get_new_lock` function may introduce memory management issues. Ensure proper cleanup of dynamically allocated memory to avoid memory leaks. // static thread_local lock_dummy t_lock_dummy_ring; static lock_base get_new_lock(const char name, bool real_lock) { return (real_lock ? static_cast<lock_base >(multilock::create_new_lock(MULTILOCK_RECURSIVE, name)) : static_cast<lock_base >(new lock_dummy())); Test Skipping Several tests are being skipped temporarily with references to Issue #4331178. Ensure this issue is tracked and resolved to avoid long-term test coverage gaps. eval "${sudo_cmd} $timeout_exe env XLIO_SOCKETXTREME=1 GTEST_TAP=2 LD_PRELOAD=$gtest_lib $gtest_app $gtest_opt --gtest_filter=socketxtreme_poll.:socketxtreme_ring.:sock_socket.:tcp_bind.:tcp_connect.:tcp_sendto.:tcp_set_get_sockopt:udp_bind.:udp_connect.:udp_sendto.:udp_socket.* --gtest_output=xml:${WORKSPACE}/${prefix}/test-socketxtreme.xml" rc=$(($rc+$?)) # Verify XLIO EXTRA API tests IPv6 eval "${sudo_cmd} $timeout_exe env GTEST_TAP=2 LD_PRELOAD=$gtest_lib $gtest_app $gtest_opt_ipv6 --gtest_filter=xlio_:-socketxtreme_poll.:socketxtreme_ring.:xlio_send_zc. --gtest_output=xml:${WORKSPACE}/${prefix}/test-extra-ipv6.xml" rc=$(($rc+$?)) # Verify XLIO EXTRA API socketxtreme mode tests IPv6 eval "${sudo_cmd} $timeout_exe env XLIO_SOCKETXTREME=1 GTEST_TAP=2 LD_PRELOAD=$gtest_lib $gtest_app $gtest_opt_ipv6 --gtest_filter=socketxtreme_poll.:socketxtreme_ring.:sock_socket.:tcp_bind.:tcp_connect.:tcp_sendto.:tcp_set_get_sockopt:udp_bind.:udp_connect.:udp_sendto.:udp_socket.* --gtest_output=xml:${WORKSPACE}/${prefix}/test-socketxtreme-ipv6.xml" rc=$(($rc+$?)) #Skipping this test temporarily;Please see Issue #4331178. # Verify socketxtreme mode and Delegated TCP Timers tests Security Concern The Dockerfile sets up passwordless SSH login for the root user, which could pose a security risk. Consider restricting access or using a more secure authentication method. RUN apt-get update && \ apt-get install -y \ openssh-server psmisc \ && apt-get clean && rm -rf /var/lib/apt/lists/* # setup ssh server and passwordless login for root for tests flows (verifyer.pl) RUN mkdir -p /var/run/sshd ~/.ssh && \ rm -rf ~/.ssh/id_rsa* && ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa && \ cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys && \ sed -i 's\|#PermitRootLogin.\|PermitRootLogin without-password\|g' /etc/ssh/sshd_config && \ sed -i 's\|#PasswordAuthentication.\|PasswordAuthentication no\|g' /etc/ssh/sshd_config && \ echo "Host *" >> ~/.ssh/config && \ echo " StrictHostKeyChecking no" >> ~/.ssh/config && \ echo " UserKnownHostsFile /dev/null" >> ~/.ssh/config && \ echo " LogLevel ERROR" >> ~/.ssh/config

dpressle · 2025-04-17T14:14:26Z

/improve

pr-review-bot-app · 2025-04-17T14:14:44Z

PR Code Suggestions ✨

Category	Suggestion	Impact
Possible issue	Prevent memory leaks in dynamic allocation Ensure that the dynamically allocated `lock_dummy` object is properly deleted to avoid memory leaks. Consider using a smart pointer or managing the object's lifecycle explicitly. src/core/dev/ring_slave.cpp [54-59] static lock_base get_new_lock(const char name, bool real_lock) { return (real_lock ? static_cast<lock_base >(multilock::create_new_lock(MULTILOCK_RECURSIVE, name)) - : static_cast<lock_base >(new lock_dummy())); + : static_cast<lock_base *>(std::make_unique<lock_dummy>().release())); } Suggestion importance[1-10]: 8 __ Why: The suggestion addresses a potential memory leak by replacing a raw pointer with a smart pointer, which ensures proper memory management. This is a critical improvement for preventing resource leaks in the code.	Medium
Possible issue	Fix memory leak in dynamic allocation Address potential memory leaks by ensuring that the dynamically allocated `lock_dummy` object is properly managed or deleted. Use smart pointers or other mechanisms to handle the object's lifecycle. src/core/sock/sockinfo_tcp.cpp [163-169] static lock_base get_new_tcp_lock() { return ( safe_mce_sys().tcp_ctl_thread != option_tcp_ctl_thread::CTL_THREAD_DELEGATE_TCP_TIMERS ? static_cast<lock_base >(multilock::create_new_lock(MULTILOCK_RECURSIVE, "tcp_con")) - : static_cast<lock_base >(new lock_dummy())); + : static_cast<lock_base >(std::make_unique<lock_dummy>().release())); } Suggestion importance[1-10]: 8 __ Why: The suggestion resolves a potential memory leak by using a smart pointer for dynamically allocated objects. This is a significant enhancement to ensure proper memory management and prevent resource leaks.	Medium
General	Add error handling for IP retrieval Add error handling to verify that the `ip` command outputs valid addresses. If no valid address is found, log an error and exit to prevent invalid configurations. contrib/jenkins_tests/gtest.sh [58-63] if [[ -f /.dockerenv ]] \|\| [[ -f /run/.containerenv ]] \|\| [[ -n "${KUBERNETES_SERVICE_HOST}" ]]; then - gtest_opt="--addr=$(ip -f inet addr show net1 \| awk '/inet / {print $2}' \| cut -d/ -f1),$(ip -f inet addr show net2 \| awk '/inet / {print $2}' \| cut -d/ -f1)" - gtest_opt_ipv6="--addr=$(ip -f inet6 addr show net1 \| grep global \| awk '/inet6 / {print $2}' \| cut -d/ -f1),$(ip -f inet6 addr show net2 \| grep global \| awk '/inet6 / {print $2}' \| cut -d/ -f1) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff" + gtest_opt="$(ip -f inet addr show net1 \| awk '/inet / {print $2}' \| cut -d/ -f1),$(ip -f inet addr show net2 \| awk '/inet / {print $2}' \| cut -d/ -f1)" + gtest_opt_ipv6="$(ip -f inet6 addr show net1 \| grep global \| awk '/inet6 / {print $2}' \| cut -d/ -f1),$(ip -f inet6 addr show net2 \| grep global \| awk '/inet6 / {print $2}' \| cut -d/ -f1) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff" + if [[ -z "$gtest_opt" \|\| -z "$gtest_opt_ipv6" ]]; then + echo "ERROR: Failed to retrieve valid IP addresses for gtest." >&2 + exit 1 + fi else gtest_opt="--addr=$(do_get_addrs 'eth' ${opt2})" gtest_opt_ipv6="--addr=$(do_get_addrs 'inet6' ${opt2}) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff" fi Suggestion importance[1-10]: 7 __ Why: Adding error handling for IP retrieval improves the robustness of the script by ensuring invalid configurations are caught early. This is a valuable enhancement for preventing runtime issues in the test environment.	Medium
General	Improve error message clarity Enhance the error message to include troubleshooting steps or additional context, such as verifying the network interface configuration or checking container environment variables. contrib/jenkins_tests/test.sh [61-64] if [ "$test_ip_list" == "eth_ip4: eth_ip6:" ] \|\| [ -z "${test_ip_list}" ]; then echo "ERROR: Cannot get IPv4/6 address of net1 interface!" + echo "Please ensure that the network interfaces are correctly configured and accessible." + echo "Check if the container environment variables or network settings are properly set." exit 1 fi Suggestion importance[1-10]: 6 __ Why: Enhancing the error message with troubleshooting steps provides better context for debugging, improving the developer experience. While not critical, it is a helpful improvement for clarity and usability.	Low

Today we use benni09 static agent to run test/gtest/valgrind steps which is unscaleable since it can only run one pipeline at a time, causing delays in builds that can be stuck waiting for hours The idea is to move these steps to containers, allowing running them in parallel as well as running multiple pipelines at the same time (depending on the capacity of the k8s cluster) Issue: HPCINFRA-3249 Signed-off-by: NirWolfer <[email protected]>

The thread-local dummy locker in ring_slave could cause use-after-free issues during XLIO shutdown when one thread attempts to access a socket's locker that was created by a terminated thread. This occurs because the thread-local object is freed when its creator thread terminates. Replace the thread-local dummy locker with a global one to prevent this issue. To maintain data path performance, optimize the dummy lock for a different cache-line to prevent false sharing by aligning the lock on a 64-byte boundary. Signed-off-by: Tomer Cabouly <[email protected]>

This commit fixes a critical race condition in timer management for TCP sockets that was introduced in commit c73d96a. The heap corruption was caused by a race condition between the timer thread and socket destruction. Sockets could be deleted by the event handler thread while still being referenced by the timer thread in the timer collections, resulting in heap corruption when the timer thread attempted to access the deleted memory. In the original implementation, sockets were removed from timer collections and deleted asynchronously without proper synchronization with the timer processing thread. Fix: - Remove sockets from timer collections while still holding the socket lock, guaranteeing the timer thread cannot access sockets marked for deletion - Create a simplified deletion path that doesn't attempt to access timer collections again after socket cleanup Additionally, as an unrelated improvement, this patch fixes a lock leak in the early return path of sockinfo_tcp::clean_socket_obj() where a lock was acquired but not released when a socket was already marked as cleaned. The heap corruption stemmed from a fundamental architectural change that separated socket objects from their timer management without providing proper synchronization for the distributed socket lifecycle. Signed-off-by: Tomer Cabouly <[email protected]>

dpressle

How is it possible that i see vg changes didnt we already merged vg in container to vNext

NirWolfer · 2025-05-27T08:07:20Z

no idea, but since we are adding these changes one PR at a time, this PR is no longer relevant.
Im closing it

NirWolfer force-pushed the HPCINFRA-3249 branch 15 times, most recently from 6f35827 to 696827b Compare February 25, 2025 15:38

dpressle requested changes Feb 26, 2025

View reviewed changes

.ci/dockerfiles/Dockerfile.ubuntu22.04 Outdated Show resolved Hide resolved

contrib/jenkins_tests/globals.sh Outdated Show resolved Hide resolved

contrib/jenkins_tests/vg.sh Outdated Show resolved Hide resolved

NirWolfer force-pushed the HPCINFRA-3249 branch 2 times, most recently from 9e69799 to 1e16cae Compare February 26, 2025 13:39

NirWolfer marked this pull request as ready for review February 26, 2025 13:40

NirWolfer force-pushed the HPCINFRA-3249 branch from e8a7c0b to ee0fa1b Compare March 4, 2025 15:40

NirWolfer force-pushed the HPCINFRA-3249 branch from ee0fa1b to b28e820 Compare March 5, 2025 09:16

dpressle requested changes Mar 5, 2025

View reviewed changes

NirWolfer force-pushed the HPCINFRA-3249 branch 7 times, most recently from 9e0436b to 18cb753 Compare March 6, 2025 08:36

NirWolfer force-pushed the HPCINFRA-3249 branch 3 times, most recently from e52a28e to f660eef Compare March 6, 2025 10:57

NirWolfer requested a review from dpressle March 6, 2025 12:28

dpressle requested changes Mar 6, 2025

View reviewed changes

pr-review-bot-app bot added the Review effort 4/5 label Apr 17, 2025

NirWolfer force-pushed the HPCINFRA-3249 branch from 55659e4 to 980e0ef Compare April 23, 2025 09:04

NirWolfer and others added 3 commits April 23, 2025 14:00

NirWolfer force-pushed the HPCINFRA-3249 branch from 980e0ef to 78cebc7 Compare April 23, 2025 11:03

dpressle requested changes May 22, 2025

View reviewed changes

NirWolfer closed this May 27, 2025

[CI] migrate CI test steps to containers #310

[CI] migrate CI test steps to containers #310

Uh oh!

Conversation

NirWolfer commented Feb 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

What

Why ?

Change type

Check list

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dpressle commented Mar 5, 2025

Uh oh!

dpressle left a comment

Choose a reason for hiding this comment

Uh oh!

NirWolfer commented Mar 6, 2025

Uh oh!

dpressle Mar 6, 2025

Choose a reason for hiding this comment

Uh oh!

dpressle Mar 6, 2025

Choose a reason for hiding this comment

Uh oh!

dpressle Mar 6, 2025

Choose a reason for hiding this comment

Uh oh!

dpressle commented Apr 17, 2025

Uh oh!

pr-review-bot-app bot commented Apr 17, 2025

PR Reviewer Guide 🔍

Uh oh!

dpressle commented Apr 17, 2025

Uh oh!

pr-review-bot-app bot commented Apr 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Uh oh!

dpressle left a comment

Choose a reason for hiding this comment

Uh oh!

NirWolfer commented May 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NirWolfer commented Feb 19, 2025 •

edited

Loading

pr-review-bot-app bot commented Apr 17, 2025 •

edited

Loading