DEVICE/API: Add tests #814
base: main
Conversation
👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
Force-pushed: 5dd69d7 → a06b812 → 6cd8c5e → 5e07f27
Force-pushed: 5d203f0 → 9124a60 → c9d565b → cc0dc83 → cccf367 → e162030
Signed-off-by: Michal Shalev <[email protected]>
Force-pushed: e162030 → 55df3ae
```cpp
}

void
copyToHost(T *host_data, size_t count) const {
```
Maybe use a common copy function and pass cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost to it.
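A minimal sketch of that refactor, assuming a hypothetical `MemBuffer` wrapper (all names here are illustrative, not the PR's actual classes); `std::memcpy` stands in for `cudaMemcpy` so the sketch is self-contained and compilable without CUDA:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical direction flag mirroring cudaMemcpyKind.
enum class CopyDir { HostToDevice, DeviceToHost };

template <typename T>
class MemBuffer {
public:
    explicit MemBuffer(size_t count) : data_(new T[count]), count_(count) {}
    ~MemBuffer() { delete[] data_; }

    // One shared copy routine; callers pass the direction instead of
    // duplicating copyToHost/copyFromHost. In the real code this would
    // call cudaMemcpy(dst, src, bytes, kind) with the matching kind.
    void copy(T *host_data, size_t count, CopyDir dir) const {
        const size_t bytes = count * sizeof(T);
        if (dir == CopyDir::DeviceToHost)
            std::memcpy(host_data, data_, bytes);  // "device" -> host
        else
            std::memcpy(data_, host_data, bytes);  // host -> "device"
    }

    // Thin wrapper kept for readability at call sites.
    void copyToHost(T *host_data, size_t count) const {
        copy(host_data, count, CopyDir::DeviceToHost);
    }

private:
    T *data_;
    size_t count_;
};
```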
```cpp
if constexpr (level == nixl_gpu_level_t::THREAD) {
    return 1;
} else if constexpr (level == nixl_gpu_level_t::WARP) {
    return 32;
```
Maybe use a macro or named constant?
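A minimal sketch of the named-constant suggestion; the enum and function names are stand-ins for the NIXL headers, and the constant name `WARP_SIZE` is an assumption, not the project's actual identifier:

```cpp
#include <cstddef>

// Stand-in for the real enum in the NIXL headers.
enum class nixl_gpu_level_t { THREAD, WARP };

// Named constant instead of the magic number 32.
constexpr size_t WARP_SIZE = 32;

template <nixl_gpu_level_t level>
constexpr size_t
threadsPerLevel() {
    if constexpr (level == nixl_gpu_level_t::THREAD) {
        return 1;
    } else if constexpr (level == nixl_gpu_level_t::WARP) {
        return WARP_SIZE;
    } else {
        return 0;  // other levels (e.g. BLOCK) not modeled in this sketch
    }
}
```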
```cpp
                req_ptr);
    break;

case NixlDeviceOperation::FULL_WRITE:
```
Maybe just WRITE?
```cpp
                req_ptr);
    break;

case NixlDeviceOperation::SIGNAL_READ: {
```
Maybe SIGNAL_POLL/SIGNAL_WAIT?
```cpp
} while (value != params.signalRead.expectedValue);

if (params.signalRead.resultPtr != nullptr) {
    *params.signalRead.resultPtr = value;
```
Why do we need resultPtr if we know this loop exits only when value == params.signalRead.expectedValue?
```cpp
}
nixlGpuWriteSignal<level>(params.signalWrite.signalAddr,
                          params.signalWrite.value);
return NIXL_SUCCESS;
```
Maybe just set status and break, and use common code at the end of the function to return the status. Do the same for the previous cases that return a status.
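A hedged sketch of that single-exit pattern; the operation and status enums here are illustrative stand-ins for the NIXL types, not the PR's actual code:

```cpp
// Illustrative stand-ins for the real NIXL enums.
enum Status { SUCCESS, ERR_INVALID_PARAM };
enum class Op { WRITE, SIGNAL_WRITE, UNKNOWN };

Status
dispatch(Op op) {
    Status status = SUCCESS;
    switch (op) {
    case Op::WRITE:
        // ... perform the write ...
        status = SUCCESS;
        break;  // no early return; break to the shared exit below
    case Op::SIGNAL_WRITE:
        // ... write the signal ...
        status = SUCCESS;
        break;
    default:
        status = ERR_INVALID_PARAM;
        break;
    }
    // One shared exit point instead of a return in every case.
    return status;
}
```

The single return makes it easier to add common teardown or logging later without touching every case.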
```cpp
namespace {

constexpr size_t maxThreadsPerBlock = 256;
```
256 or 1024?
```cpp
    return NIXL_ERR_BACKEND;
}

return NIXL_SUCCESS;
```
Suggested change:

```diff
-    return NIXL_SUCCESS;
+    const cudaError_t launch_error = cudaGetLastError();
+    if (launch_error != cudaSuccess) {
+        std::cerr << "CUDA kernel launch error: " << cudaGetErrorString(launch_error) << "\n";
+        return NIXL_ERR_BACKEND;
+    }
+    const cudaError_t sync_error = cudaDeviceSynchronize();
+    if (sync_error != cudaSuccess) {
+        std::cerr << "CUDA synchronization error: " << cudaGetErrorString(sync_error) << "\n";
+        return NIXL_ERR_BACKEND;
+    }
```
```cpp
createXferRequest(data.srcBuffers, data.dstBuffers, mem_type,
                  data.xferReq, data.gpuReqHandle);
}
```
Looks like common code... other tests use the same code for creating gpuReqHandle...
What?
Adds a comprehensive suite of tests for the Device API, covering Single Write, Partial Write, Full Write, and Signal operations.
Why?
To validate the functionality and correctness of the Device API implementation.
How?
- `test/gtest/device_api/common/` for shared kernels and utilities (`device_kernels`, `mem_buffer`, etc.).
- `test/gtest/device_api/tests/` for each operation type.