-
Notifications
You must be signed in to change notification settings - Fork 29
[CI] issue: HPCINFRA-3937 Terminate Valgrind gracefully #472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3211dee to
1cf3db2
Compare
|
bot:retest |
|
Depends on #468 |
1cf3db2 to
54199bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR modifies the Valgrind test cleanup script in the CI infrastructure to replace SIGKILL (-9) with SIGINT as the initial termination signal for sockperf processes. The change enables Valgrind to execute its cleanup routines and generate complete memory leak summaries, which are essential for debugging memory issues in the XLIO library during CI validation. The script retains a three-tier escalation strategy (SIGINT with 10s wait → SIGTERM with 3s wait → SIGKILL) to ensure that unresponsive processes are eventually terminated. This modification aligns with standard best practices for graceful process termination while maintaining CI pipeline reliability.
PR Description Notes:
- Minor typo: "debuging" should be "debugging"
- Typo: "af the first try" should be "at the first try"
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| contrib/jenkins_tests/vg.sh | 5/5 | Changed initial process termination from SIGKILL to SIGINT to allow Valgrind graceful shutdown and summary generation |
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- Score reflects a simple, well-understood signal handling change in a CI script with proper fallback mechanisms already in place to prevent hanging processes
- No files require special attention
Sequence Diagram
sequenceDiagram
participant User
participant CI as CI System
participant Script as vg.sh
participant Valgrind
participant Sockperf as Sockperf Server
participant SockperfClient as Sockperf Client
User->>CI: Trigger CI Pipeline
CI->>Script: Execute vg.sh
Script->>Script: Configure hugepages
Script->>Script: Build XLIO with Valgrind support
Script->>Script: Detect network interface (net1/eth)
Script->>Script: Get IPv4/IPv6 addresses
loop For each IP (IPv4/IPv6)
loop For each test (TCP/UDP)
Script->>Valgrind: Start Valgrind with memory checks
Valgrind->>Sockperf: Launch sockperf server (sr mode)
Sockperf-->>Script: Log warmup stage message
Script->>Script: Wait for warmup (max 120s)
Script->>Valgrind: Start Valgrind for client
Valgrind->>SockperfClient: Launch sockperf client (pp mode, 10s)
SockperfClient->>Sockperf: Send test traffic
Sockperf-->>SockperfClient: Process requests
SockperfClient-->>Script: Complete test run
Script->>Sockperf: Send SIGINT signal
Script->>Script: Wait 10 seconds
alt Process still running
Script->>Sockperf: Send SIGTERM signal
Script->>Script: Wait 3 seconds
end
alt Process still running
Script->>Sockperf: Send SIGKILL signal
end
Valgrind-->>Script: Generate memory check logs
Script->>Script: Archive logs
Script->>Script: Parse ERROR SUMMARY
alt Errors found > 0
Script->>Script: Record test failure in TAP
else No errors and latency found
Script->>Script: Record test success in TAP
end
end
end
Script->>CI: Return exit code
CI-->>User: Report test results
1 file reviewed, no comments
Valgrind today is being terminated with -9 which prevents it from printing all the summaries at the end which are needed for debuging. Change -9 signal to -SIGINT af the first try of shutting down Valgrind Signed-off-by: NirWolfer <[email protected]>
54199bf to
b03556f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Changed the initial process termination signal from SIGKILL (-9) to SIGINT in the Valgrind test cleanup routine. This allows Valgrind to gracefully shutdown and print its full error summaries and leak reports, which are essential for debugging.
Key changes:
- Line 117: Changed
pkill -9topkill -SIGINTfor the first termination attempt - Removed obsolete comment about SIGINT not working (it was previously referring to the wrong signal)
- Maintains existing escalation path:
SIGINT→SIGTERM→SIGKILLwith appropriate wait times
The change follows standard Unix signal handling best practices where graceful signals are attempted before forceful termination.
Confidence Score: 5/5
- This PR is safe to merge with no risk
- The change is minimal, well-understood, and follows Unix best practices. The signal escalation path ensures that even if graceful shutdown fails, the process will still be terminated. The existing 10-second wait and fallback mechanisms prevent any potential hangs.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| contrib/jenkins_tests/vg.sh | 5/5 | Changed first termination signal from SIGKILL to SIGINT, allowing Valgrind to print summaries gracefully |
Sequence Diagram
sequenceDiagram
participant Script as vg.sh
participant Process as sockperf (Valgrind)
Note over Script,Process: After tests complete
Script->>Process: Check if process still running
alt Process is running
Script->>Process: pkill -SIGINT (graceful shutdown)
Note over Process: Valgrind catches SIGINT<br/>and prints summaries
Script->>Script: Wait 10 seconds
alt Process still running
Script->>Process: pkill -SIGTERM (terminate)
Script->>Script: Wait 3 seconds
alt Process still running
Script->>Process: pkill -SIGKILL (force kill)
Note over Process: Process terminated immediately
end
end
end
Script->>Script: Archive logs and check results
1 file reviewed, no comments
|
@galnoam please merge |
Description
Valgrind today is being terminated with -9 which prevents it from printing all the summaries at the end which are needed for debuging.
What
Change -9 signal to -SIGINT af the first try of shutting down Valgrind
Why ?
HPCINFRA-3937
Change type
What kind of change does this PR introduce?
Check list