
Conversation


@michal-shalev michal-shalev commented Sep 22, 2025

What?

Adds a comprehensive suite of tests for the Device API, covering Single Write, Partial Write, Full Write, and Signal operations.

Why?

To validate the functionality and correctness of the Device API implementation.

How?

  • Introduced test/gtest/device_api/common/ for shared kernels and utilities (device_kernels, mem_buffer, etc.).
  • Added specific test files in test/gtest/device_api/tests/ for each operation type.
  • Cleaned up and replaced older tests.

@github-actions

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@michal-shalev michal-shalev marked this pull request as draft October 21, 2025 00:05

copy-pr-bot bot commented Nov 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Michal Shalev <[email protected]>
@michal-shalev michal-shalev marked this pull request as ready for review November 28, 2025 23:46
```cpp
}

void
copyToHost(T *host_data, size_t count) const {
```
Maybe use common copy function and pass cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost to it

```cpp
if constexpr (level == nixl_gpu_level_t::THREAD) {
    return 1;
} else if constexpr (level == nixl_gpu_level_t::WARP) {
    return 32;
```
Maybe use macro/const?

```cpp
                req_ptr);
    break;

case NixlDeviceOperation::FULL_WRITE:
```
Maybe just WRITE?

```cpp
                req_ptr);
    break;

case NixlDeviceOperation::SIGNAL_READ: {
```
Maybe SIGNAL_POLL/SIGNAL_WAIT?

```cpp
} while (value != params.signalRead.expectedValue);

if (params.signalRead.resultPtr != nullptr) {
    *params.signalRead.resultPtr = value;
```
Why do we need resultPtr if this function only returns once value == params.signalRead.expectedValue?

```cpp
}
nixlGpuWriteSignal<level>(params.signalWrite.signalAddr,
                          params.signalWrite.value);
return NIXL_SUCCESS;
```
Maybe just set the status and break, then use common code at the end of the function to return the status. Do the same for the previous cases that return a status.


```cpp
namespace {

constexpr size_t maxThreadsPerBlock = 256;
```
256 or 1024?

```cpp
    return NIXL_ERR_BACKEND;
}

return NIXL_SUCCESS;
```
Suggested change:

```diff
-    return NIXL_SUCCESS;
+    const cudaError_t launch_error = cudaGetLastError();
+    if (launch_error != cudaSuccess) {
+        std::cerr << "CUDA kernel launch error: " << cudaGetErrorString(launch_error) << "\n";
+        return NIXL_ERR_BACKEND;
+    }
+    const cudaError_t sync_error = cudaDeviceSynchronize();
+    if (sync_error != cudaSuccess) {
+        std::cerr << "CUDA synchronization error: " << cudaGetErrorString(sync_error) << "\n";
+        return NIXL_ERR_BACKEND;
+    }
+    return NIXL_SUCCESS;
```


```cpp
    createXferRequest(data.srcBuffers, data.dstBuffers, mem_type,
                      data.xferReq, data.gpuReqHandle);
}
```
Looks like common code... other tests use the same code for creating gpuReqHandle...
