Skip to content

Conversation

@NirWolfer
Copy link
Contributor

@NirWolfer NirWolfer commented Feb 19, 2025

Description

Today we use benni09 static agent to run test/gtest/valgrind steps which is unscalable since it can only run one pipeline at a time, causing delays in builds that can be stuck waiting for hours

The idea is to move these steps to containers, allowing running them in parallel as well as running multiple pipelines at the same time (depending on the capacity of the k8s cluster)

What

Change test/gtest/valgrind steps to run on containers on swx-k8s-spray cluster instead of benni09

Why ?

HPCINFRA-3249

Change type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Tests
  • Other

Check list

  • Code follows the style de facto guidelines of this project
  • Comments have been inserted in hard to understand places
  • Documentation has been updated (if necessary)
  • Test has been added (if possible)

@NirWolfer NirWolfer force-pushed the HPCINFRA-3249 branch 15 times, most recently from 6f35827 to 696827b Compare February 25, 2025 15:38
@NirWolfer NirWolfer force-pushed the HPCINFRA-3249 branch 2 times, most recently from 9e69799 to 1e16cae Compare February 26, 2025 13:39
@NirWolfer NirWolfer marked this pull request as ready for review February 26, 2025 13:40
@dpressle
Copy link
Collaborator

dpressle commented Mar 5, 2025

bot:retest

Copy link
Collaborator

@dpressle dpressle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If gtest works, we can now enable concurrent builds

@NirWolfer NirWolfer force-pushed the HPCINFRA-3249 branch 7 times, most recently from 9e0436b to 18cb753 Compare March 6, 2025 08:36
@NirWolfer NirWolfer force-pushed the HPCINFRA-3249 branch 3 times, most recently from e52a28e to f660eef Compare March 6, 2025 10:57
@NirWolfer
Copy link
Contributor Author

bot:retest

@NirWolfer NirWolfer requested a review from dpressle March 6, 2025 12:28

function do_hugepages()
{
if [[ -f /.dockerenv && ! $(grep -q hugetlbfs /proc/mounts) ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should probably do a more extensive check for containterized environment like you do in gtest script
[[ -f /.dockerenv || -f /run/.containerenv || -n "${KUBERNETES_SERVICE_HOST}" ]]

#fi
if [ ! -z "$(do_get_ip 'eth')" ]; then
test_ip_list="${test_ip_list} eth_ip4:$(do_get_ip 'eth')"
if [[ -f /.dockerenv ]] || [[ -f /run/.containerenv ]] || [[ -n "${KUBERNETES_SERVICE_HOST}" ]]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you improve it a little bit and reduce duplication, so instead of build the ip list string in 2 places just have variables for each ip and generate the list in one place
if
ipv4=get_ipv4...
ipv6=get_ipv6...
else
ipv4=
ipv6=
fi

test_ip_list="eth_ip4:${ipv4} eth_ip6:${ipv6}"

else
gtest_opt="--addr=$(do_get_addrs 'eth' ${opt2})"
gtest_opt_ipv6="--addr=$(do_get_addrs 'inet6' ${opt2}) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff" # Remote - Dummy Address
fi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add check if ips are empty (like we have in all other tests scripts)?

@dpressle
Copy link
Collaborator

/review

@pr-review-bot-app
Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 Security concerns

Sensitive information exposure:
The Dockerfile in .ci/dockerfiles/Dockerfile.ubuntu22.04 configures passwordless SSH login for the root user. This could be exploited if the container is compromised. It is recommended to use more secure authentication mechanisms or restrict SSH access.

⚡ Recommended focus areas for review

Possible Issue

The replacement of t_lock_dummy_ring with a dynamically allocated lock_dummy object in the get_new_lock function may introduce memory management issues. Ensure proper cleanup of dynamically allocated memory to avoid memory leaks.

// static thread_local lock_dummy t_lock_dummy_ring;

static lock_base *get_new_lock(const char *name, bool real_lock)
{
    return (real_lock
                ? static_cast<lock_base *>(multilock::create_new_lock(MULTILOCK_RECURSIVE, name))
                : static_cast<lock_base *>(new lock_dummy()));
Test Skipping

Several tests are being skipped temporarily with references to Issue #4331178. Ensure this issue is tracked and resolved to avoid long-term test coverage gaps.

eval "${sudo_cmd} $timeout_exe env XLIO_SOCKETXTREME=1 GTEST_TAP=2 LD_PRELOAD=$gtest_lib $gtest_app $gtest_opt --gtest_filter=socketxtreme_poll.*:socketxtreme_ring.*:sock_socket.*:tcp_bind.*:tcp_connect.*:tcp_sendto.*:tcp_set_get_sockopt*:udp_bind.*:udp_connect.*:udp_sendto.*:udp_socket.* --gtest_output=xml:${WORKSPACE}/${prefix}/test-socketxtreme.xml"
rc=$(($rc+$?))

# Verify XLIO EXTRA API tests IPv6
eval "${sudo_cmd} $timeout_exe env GTEST_TAP=2 LD_PRELOAD=$gtest_lib $gtest_app $gtest_opt_ipv6 --gtest_filter=xlio_*:-socketxtreme_poll.*:socketxtreme_ring.*:xlio_send_zc.* --gtest_output=xml:${WORKSPACE}/${prefix}/test-extra-ipv6.xml"
rc=$(($rc+$?))

# Verify XLIO EXTRA API socketxtreme mode tests IPv6
eval "${sudo_cmd} $timeout_exe env XLIO_SOCKETXTREME=1 GTEST_TAP=2 LD_PRELOAD=$gtest_lib $gtest_app $gtest_opt_ipv6 --gtest_filter=socketxtreme_poll.*:socketxtreme_ring.*:sock_socket.*:tcp_bind.*:tcp_connect.*:tcp_sendto.*:tcp_set_get_sockopt*:udp_bind.*:udp_connect.*:udp_sendto.*:udp_socket.* --gtest_output=xml:${WORKSPACE}/${prefix}/test-socketxtreme-ipv6.xml"
rc=$(($rc+$?))

#Skipping this test temporarily;Please see Issue #4331178.
# Verify socketxtreme mode and Delegated TCP Timers tests
Security Concern

The Dockerfile sets up passwordless SSH login for the root user, which could pose a security risk. Consider restricting access or using a more secure authentication method.

RUN apt-get update && \
    apt-get install -y \
    openssh-server psmisc \
    && apt-get clean && rm -rf /var/lib/apt/lists/*
# setup ssh server and passwordless login for root for tests flows (verifyer.pl)
RUN mkdir -p /var/run/sshd ~/.ssh && \
    rm -rf ~/.ssh/id_rsa* && ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys && \
    sed -i 's|#PermitRootLogin.*|PermitRootLogin without-password|g' /etc/ssh/sshd_config && \
    sed -i 's|#PasswordAuthentication.*|PasswordAuthentication no|g' /etc/ssh/sshd_config && \
    echo "Host *" >> ~/.ssh/config && \
    echo "   StrictHostKeyChecking no" >> ~/.ssh/config && \
    echo "   UserKnownHostsFile /dev/null" >> ~/.ssh/config && \
    echo "   LogLevel ERROR" >> ~/.ssh/config

@dpressle
Copy link
Collaborator

/improve

@pr-review-bot-app
Copy link

pr-review-bot-app bot commented Apr 17, 2025

PR Code Suggestions ✨

CategorySuggestion                                                                                                                                    Impact
Possible issue
Prevent memory leaks in dynamic allocation

Ensure that the dynamically allocated lock_dummy object is properly deleted to avoid
memory leaks. Consider using a smart pointer or managing the object's lifecycle
explicitly.

src/core/dev/ring_slave.cpp [54-59]

 static lock_base *get_new_lock(const char *name, bool real_lock)
 {
     return (real_lock
                 ? static_cast<lock_base *>(multilock::create_new_lock(MULTILOCK_RECURSIVE, name))
-                : static_cast<lock_base *>(new lock_dummy()));
+                : static_cast<lock_base *>(std::make_unique<lock_dummy>().release()));
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion addresses a potential memory leak by replacing a raw pointer with a smart pointer, which ensures proper memory management. This is a critical improvement for preventing resource leaks in the code.

Medium
Fix memory leak in dynamic allocation

Address potential memory leaks by ensuring that the dynamically allocated lock_dummy
object is properly managed or deleted. Use smart pointers or other mechanisms to
handle the object's lifecycle.

src/core/sock/sockinfo_tcp.cpp [163-169]

 static lock_base *get_new_tcp_lock()
 {
     return (
         safe_mce_sys().tcp_ctl_thread != option_tcp_ctl_thread::CTL_THREAD_DELEGATE_TCP_TIMERS
             ? static_cast<lock_base *>(multilock::create_new_lock(MULTILOCK_RECURSIVE, "tcp_con"))
-            : static_cast<lock_base *>(new lock_dummy()));
+            : static_cast<lock_base *>(std::make_unique<lock_dummy>().release()));
 }
Suggestion importance[1-10]: 8

__

Why: The suggestion resolves a potential memory leak by using a smart pointer for dynamically allocated objects. This is a significant enhancement to ensure proper memory management and prevent resource leaks.

Medium
General
Add error handling for IP retrieval

Add error handling to verify that the ip command outputs valid addresses. If no
valid address is found, log an error and exit to prevent invalid configurations.

contrib/jenkins_tests/gtest.sh [58-63]

 if [[ -f /.dockerenv ]] || [[ -f /run/.containerenv ]] || [[ -n "${KUBERNETES_SERVICE_HOST}" ]]; then
-    gtest_opt="--addr=$(ip -f inet addr show net1 | awk '/inet / {print $2}' | cut -d/ -f1),$(ip -f inet addr show net2 | awk '/inet / {print $2}' | cut -d/ -f1)"
-    gtest_opt_ipv6="--addr=$(ip -f inet6 addr show net1 | grep global | awk '/inet6 / {print $2}' | cut -d/ -f1),$(ip -f inet6 addr show net2 | grep global | awk '/inet6 / {print $2}' | cut -d/ -f1) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"
+    gtest_opt="$(ip -f inet addr show net1 | awk '/inet / {print $2}' | cut -d/ -f1),$(ip -f inet addr show net2 | awk '/inet / {print $2}' | cut -d/ -f1)"
+    gtest_opt_ipv6="$(ip -f inet6 addr show net1 | grep global | awk '/inet6 / {print $2}' | cut -d/ -f1),$(ip -f inet6 addr show net2 | grep global | awk '/inet6 / {print $2}' | cut -d/ -f1) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"
+    if [[ -z "$gtest_opt" || -z "$gtest_opt_ipv6" ]]; then
+        echo "ERROR: Failed to retrieve valid IP addresses for gtest." >&2
+        exit 1
+    fi
 else
     gtest_opt="--addr=$(do_get_addrs 'eth' ${opt2})"
     gtest_opt_ipv6="--addr=$(do_get_addrs 'inet6' ${opt2}) -r fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff"
 fi
Suggestion importance[1-10]: 7

__

Why: Adding error handling for IP retrieval improves the robustness of the script by ensuring invalid configurations are caught early. This is a valuable enhancement for preventing runtime issues in the test environment.

Medium
Improve error message clarity

Enhance the error message to include troubleshooting steps or additional context,
such as verifying the network interface configuration or checking container
environment variables.

contrib/jenkins_tests/test.sh [61-64]

 if [ "$test_ip_list" == "eth_ip4: eth_ip6:" ] || [ -z "${test_ip_list}" ]; then
     echo "ERROR: Cannot get IPv4/6 address of net1 interface!"
+    echo "Please ensure that the network interfaces are correctly configured and accessible."
+    echo "Check if the container environment variables or network settings are properly set."
     exit 1
 fi
Suggestion importance[1-10]: 6

__

Why: Enhancing the error message with troubleshooting steps provides better context for debugging, improving the developer experience. While not critical, it is a helpful improvement for clarity and usability.

Low

NirWolfer and others added 3 commits April 23, 2025 14:00
Today we use benni09 static agent to run test/gtest/valgrind steps which
is unscaleable since it can only run one pipeline at a time, causing
delays in builds that can be stuck waiting for hours

The idea is to move these steps to containers, allowing running them in
parallel as well as running multiple pipelines at the same time
(depending on the capacity of the k8s cluster)

Issue: HPCINFRA-3249
Signed-off-by: NirWolfer <[email protected]>
The thread-local dummy locker in ring_slave could cause use-after-free
issues during XLIO shutdown when one thread attempts to access
a socket's locker that was created by a terminated thread.
This occurs because the thread-local object is freed
when its creator thread terminates.

Replace the thread-local dummy locker with a global one to prevent this
issue. To maintain data path performance, optimize the dummy lock for
a different cache-line to prevent false sharing by aligning the lock on
a 64-byte boundary.

Signed-off-by: Tomer Cabouly <[email protected]>
This commit fixes a critical race condition in timer management for TCP
sockets that was introduced in commit c73d96a.

The heap corruption was caused by a race condition between the timer
thread and socket destruction. Sockets could be deleted by the event
handler thread while still being referenced by the timer thread in the
timer collections, resulting in heap corruption when the timer thread
attempted to access the deleted memory.

In the original implementation, sockets were removed from timer
collections and deleted asynchronously without proper synchronization
with the timer processing thread.

Fix:
- Remove sockets from timer collections while still holding the socket
  lock, guaranteeing the timer thread cannot access sockets marked for
  deletion
- Create a simplified deletion path that doesn't attempt to access timer
  collections again after socket cleanup

Additionally, as an unrelated improvement, this patch fixes a lock leak
in the early return path of sockinfo_tcp::clean_socket_obj() where a
lock was acquired but not released when a socket was already marked as
cleaned.

The heap corruption stemmed from a fundamental architectural change that
separated socket objects from their timer management without providing
proper synchronization for the distributed socket lifecycle.

Signed-off-by: Tomer Cabouly <[email protected]>
Copy link
Collaborator

@dpressle dpressle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible that i see vg changes didnt we already merged vg in container to vNext

@NirWolfer
Copy link
Contributor Author

no idea, but since we are adding these changes one PR at a time, this PR is no longer relevant.
Im closing it

@NirWolfer NirWolfer closed this May 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants