PCP: PMDA: Introduce new PMDA for RDS #2230
This commit adds a new PMDA (Performance Metrics Domain Agent) for Reliable Datagram Sockets (RDS). It exports key metrics including connection information, socket and connection statistics, and details of the send, receive, and retransmit queues for performance analysis using Performance Co-Pilot (PCP). This PMDA is intended to aid in diagnosing network-related issues on systems using RDS over InfiniBand or TCP.
Signed-off-by: Mohith Kumar Thummaluru <[email protected]>
There is a distinct lack of QA tests, not only for the PMDA but also for the helper scripts - please add them. And if we can use a standard module or command rather than using libc directly and adding the Python ping implementation(s), that'd be better. Other comments inline.
"""
# Centralized metric mapping
METRICS = {
    # CLUSTER_CONN
General notes here apply to the metric specifications below. All of the units seem to be set to zero - seems unlikely? All of the metrics seem to be hard-coded as PM_SEM_INSTANT - are there really no counters? That'd be very unusual and, based on the names, a lot of them look like counters. There's no help text for any of the metrics, so I can't tell what they are. The vast majority seem to be specified as 32-bit values too - this is also very unusual these days, most things would be 64-bit. The metrics specified with "8k" and "1m" are duplicated - these should be specified once and have instances of 8k/1m instead of a null instance domain.
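The points above can be illustrated with a small, PCP-independent sketch of a metric table. Everything here is an assumption made for illustration: the `MetricSpec` class, the `lint` helper, and the `rds.example.*` metric names are hypothetical and do not come from the PR, and the type and semantics strings only mirror the PCP constants (`PM_TYPE_U64`, `PM_SEM_COUNTER`, `PM_SEM_INSTANT`) rather than using them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MetricSpec:
    """One metric description (hypothetical shape, not the PR's table)."""
    name: str                        # hypothetical metric name
    mtype: str                       # "U32" or "U64", mirroring PM_TYPE_*
    semantics: str                   # "counter" or "instant", mirroring PM_SEM_*
    units: str                       # e.g. "byte" or "count"; "" means unset
    oneline: str                     # one-line help text; "" means missing
    instances: Tuple[str, ...] = ()  # empty tuple = null instance domain


# Hypothetical entries in the shape the review asks for: 64-bit counters with
# units and help text, and ONE metric with 8k/1m instances instead of two.
METRICS = [
    MetricSpec("rds.example.send_bytes", "U64", "counter", "byte",
               "Total bytes sent (hypothetical metric)"),
    MetricSpec("rds.example.pool_allocs", "U64", "counter", "count",
               "Allocations per pool size (hypothetical metric)",
               instances=("8k", "1m")),
]


def lint(spec: MetricSpec) -> List[str]:
    """Return review-style complaints for a single metric spec."""
    problems = []
    if not spec.oneline:
        problems.append("missing help text")
    if not spec.units:
        problems.append("units not set")
    if spec.mtype in ("32", "U32"):
        problems.append("32-bit type; consider 64-bit")
    if spec.name.endswith(("_8k", "_1m")) and not spec.instances:
        problems.append("size encoded in name; use 8k/1m instances")
    return problems


if __name__ == "__main__":
    for spec in METRICS:
        print(spec.name, lint(spec) or "ok")
```

A check like `lint` could also run as part of the QA tests requested earlier, so that a metric added without units or help text fails early.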
print()


def main(argv):
    parser = argparse.ArgumentParser(description='python version of rds-ping utility.',
Where did this code come from? All authored for this PR (and GPL2+) or is there another author/license involved?
This is authored for this PR.
But there is a C implementation of it available - https://github.com/oracle/rds-tools/blob/rdma-vos/rds-tools-2/rds-ping.c
I think it's OK to stick with Python over C, particularly since the rest of the PMDA is Python code. How do we know this code works though? (There are no tests, not even indirect tests using the agent to run the code, nor even pylint static checks yet - let's get these in place asap.) It's a lot of code to add to the project for everyone to maintain.
Agreed. Let me work on the test scripts.
There were multiple reasons behind choosing the libc-based approach instead of relying on standard tools. We wanted to eliminate dependencies on native utilities like ping and rds-ping, especially since our use case required functionality beyond what these tools offer. Our implementation provides an enhanced version of rds-ping, which supports sending pings to multiple connections using a single socket. This socket can be bound to a specific source address and ToS (Type of Service) value, significantly reducing overhead and simplifying connection management.

Additionally, the Python implementation of getsockopt allocates a fixed 1024-byte buffer when retrieving connection information from the kernel. While this is typically sufficient for generic socket information, it may be inadequate for extracting RDS protocol-specific data. Although CPython internally invokes the C getsockopt system call, it ignores the return value, which is critical for parsing the data bytes returned by the kernel. This omission can lead to incorrect parsing or missed information. In contrast, the direct C implementation using libc respects the return value and imposes no such buffer limitation, allowing for more accurate and flexible data extraction.

Considering all these factors, we implemented the required tools directly using libc to gain better control, accuracy, and efficiency for our specific needs.
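The libc-based getsockopt approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `raw_getsockopt` and `socket_type` are hypothetical helper names, the sketch assumes Linux (where `socklen_t` is a 32-bit unsigned integer and `ctypes.CDLL(None)` exposes the libc symbols), and it queries the standard `SO_TYPE` option as a stand-in because RDS sockets are not available on most systems.

```python
import ctypes
import os
import socket
import sys

# Handle to the C library already loaded into this process (Linux assumption).
_libc = ctypes.CDLL(None, use_errno=True)
_libc.getsockopt.argtypes = [
    ctypes.c_int,                     # sockfd
    ctypes.c_int,                     # level
    ctypes.c_int,                     # optname
    ctypes.c_void_p,                  # optval buffer
    ctypes.POINTER(ctypes.c_uint32),  # optlen (socklen_t, assumed 32-bit)
]
_libc.getsockopt.restype = ctypes.c_int


def raw_getsockopt(sock, level, optname, bufsize):
    """Call getsockopt(2) through libc with a caller-chosen buffer size.

    Returns (ret, data): the syscall's own return value, plus the bytes the
    kernel wrote, trimmed to the length it reported back in optlen.
    """
    buf = ctypes.create_string_buffer(bufsize)
    optlen = ctypes.c_uint32(bufsize)
    ret = _libc.getsockopt(sock.fileno(), level, optname,
                           buf, ctypes.byref(optlen))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret, buf.raw[:optlen.value]


def socket_type(sock):
    """Example use: read SO_TYPE with a buffer larger than 1024 bytes."""
    _ret, data = raw_getsockopt(sock, socket.SOL_SOCKET, socket.SO_TYPE, 4096)
    return int.from_bytes(data, sys.byteorder)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print(socket_type(s) == int(socket.SOCK_DGRAM))
```

The helper hands back the raw return value next to the data so that a caller who needs it for parsing, as the comment above describes for RDS, has it available. The buffer size is passed in by the caller rather than fixed.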
3a918e2 to 890d850