
Conversation


@michal-shalev michal-shalev commented Sep 22, 2025

What?

Adds a comprehensive suite of tests for the Device API, covering Single Write, Partial Write, Full Write, and Signal operations.

Why?

To validate the functionality and correctness of the Device API implementation.

How?

  • Introduced test/gtest/device_api/common/ for shared kernels and utilities (device_kernels, mem_buffer, etc.).
  • Added specific test files in test/gtest/device_api/tests/ for each operation type.
  • Cleaned up and replaced older tests.

@github-actions

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@michal-shalev michal-shalev marked this pull request as draft October 21, 2025 00:05

copy-pr-bot bot commented Nov 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Michal Shalev <[email protected]>
@michal-shalev michal-shalev marked this pull request as ready for review November 28, 2025 23:46
```cpp
}

void
copyToHost(T *host_data, size_t count) const {
```
Maybe use common copy function and pass cudaMemcpyHostToDevice/cudaMemcpyDeviceToHost to it

```cpp
if constexpr (level == nixl_gpu_level_t::THREAD) {
    return 1;
} else if constexpr (level == nixl_gpu_level_t::WARP) {
    return 32;
```
Maybe use macro/const?

```cpp
                req_ptr);
    break;

case NixlDeviceOperation::FULL_WRITE:
```
Maybe just WRITE?

```cpp
                req_ptr);
    break;

case NixlDeviceOperation::SIGNAL_READ: {
```
Maybe SIGNAL_POLL/SIGNAL_WAIT?

```cpp
} while (value != params.signalRead.expectedValue);

if (params.signalRead.resultPtr != nullptr) {
    *params.signalRead.resultPtr = value;
```
Why do we need resultPtr if this function only returns once value == params.signalRead.expectedValue?

```cpp
}
nixlGpuWriteSignal<level>(params.signalWrite.signalAddr,
                          params.signalWrite.value);
return NIXL_SUCCESS;
```
Maybe just set the status and break, then use common code at the end of the function to return the status. Do the same for the previous cases that return a status.


```cpp
namespace {

constexpr size_t maxThreadsPerBlock = 256;
```
256 or 1024?

```cpp
    return NIXL_ERR_BACKEND;
}

return NIXL_SUCCESS;
```
Suggested change:

```diff
-    return NIXL_SUCCESS;
+    const cudaError_t launch_error = cudaGetLastError();
+    if (launch_error != cudaSuccess) {
+        std::cerr << "CUDA kernel launch error: " << cudaGetErrorString(launch_error) << "\n";
+        return NIXL_ERR_BACKEND;
+    }
+    const cudaError_t sync_error = cudaDeviceSynchronize();
+    if (sync_error != cudaSuccess) {
+        std::cerr << "CUDA synchronization error: " << cudaGetErrorString(sync_error) << "\n";
+        return NIXL_ERR_BACKEND;
+    }
+    return NIXL_SUCCESS;
```


```cpp
    createXferRequest(data.srcBuffers, data.dstBuffers, mem_type,
                      data.xferReq, data.gpuReqHandle);
}
```
Looks like common code... other tests use the same code for creating gpuReqHandle...
