Skip to content

Conversation

@jakirkham
Copy link
Member

Make CUDA 12.2 the new minimum. Drops logic related to CUDA 12.0 & 12.1. Thus simplifying the CUDA build and dependency logic.


Part of issue:

@jakirkham jakirkham requested a review from a team as a code owner October 28, 2025 21:41
@jakirkham jakirkham added the breaking Introduces a breaking change label Oct 28, 2025
@jakirkham jakirkham requested review from a team as code owners October 28, 2025 21:41
@jakirkham jakirkham added the improvement Improves an existing functionality label Oct 28, 2025
@jakirkham jakirkham requested a review from gforsyth October 28, 2025 21:41
# Each case has different cuda-version constraints as expressed below
should_use_cufile: ${{ x86_64 or (aarch64 and cuda_version >= "12.2") }}
# When reverting, instances of cuda_key_string can be replaced with cuda_major
cuda_key_string: ${{ cuda_version | replace(".", "_") }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want this to go back to cuda_major as the comment suggests. Delete cuda_key_string, uncomment cuda_version / cuda_major, and replace cuda_key_string with cuda_major.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am ambivalent here. On the one hand can see that point. However on the other it could be useful to call out CUDA 12.2 is a minimum here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this helps. We already set a lower bound on cuda-version. Let's stop using special-case behavior here and revert as the comment recommends.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same argument could be made for the explicit inclusion of the CUDA version in our build string currently

It is also already handled via cuda-version and the build string hash

Think it is worth discussing this point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same argument could be made for the explicit inclusion of the CUDA version in our build string currently

How? conda doesn't allow you to upload two builds that have identical versions and build strings but different dependencies. You have to include some bit of information that makes the build unique. I guess we could use PKG_HASH to distinguish the builds in the build string, but if you have to include something it might as well be more meaningful than that.

That being said, I don't look at the build string as a useful way of conveying the minimum supported CUDA version given that the cuda-version dependency is part of the package metadata as well, so my vote would be to go back to the previous status. 12 vs 13 tells me enough about the purpose of the two packages. I'm not looking for anything more than that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that argument goes like this

If cuda-version is in the variants (like conda_build_config.yaml or variants.yaml), then it is included in the string hash. So there is no need to manually add it to the string

The challenge with including it manually (as we have to date) is it means writing the build string from scratch each time. As a result we end up missing details that would have otherwise been captured for us if it was added as a variant and included in the string hash


Right so there are essentially two viewpoints we encounter. One view is the as you have articulated that the build string is not a reliable or consistent way to find this information so it shouldn't be relied on and package dependencies (like cuda-version) should be used to do that. The other view is that it can be useful to have the relevant info in one place on the relevant package to simplify the end-user workflow

Given today we already recommend users specify cuda-version, there is no real simplification offered by adding this info to the build string. So would tend to agree with the first view for our case (similar to you). And would go further, that it would be better to minimize string customization and rely on variants and the string hash

Copy link
Contributor

@vyasr vyasr Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we have never included the CUDA version in the variants directly is that we need the flexibility of building different variants in different CI jobs on different runners, whereas conda-build and rattler-build will build all the variants at once (the slowdown in end-to-end CI time that would result from serializing the builds has too big a negative impact on developers). With the switch to rattler-build we have a bit of additional flexibility because rattler-build supports providing the variant config on the CLI, but ultimately that means we'd be constructing the variants on the fly in CI jobs which leads to more divergence between local dev and CI where we have been aiming to close the gap over time. If you have a solution that allows us to specify all the information in the variants file and avoid environment and recipe context variables, I'm all for it. Otherwise I think we're stuck with some degree of manual string customization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both conda-build and rattler-build can specify variants by CLI

Think we are conflating two things:

  1. How the matrix is specified
  2. How CUDA version is specified

We need not change 1 to change 2. Am focused here on 2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have an idea of how to implement a proposed improved, feel free to push a commit with the change and I'm happy to review.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do. Though will approach this in a follow up PR so we can discuss. We can then decide if it merits broader rollout

For now have simply reverted the relevant bits here

@jakirkham jakirkham requested a review from a team as a code owner October 28, 2025 21:50
@jakirkham jakirkham force-pushed the req_cuda_12.2+ branch 2 times, most recently from 9eebbfa to 82937d0 Compare October 28, 2025 22:02
// Note: `version` is v1.7 (in CUDA 12.2) for cuFile versions prior to v1.8 (CUDA 12.3)
// because `cuFileGetVersion` did not exist until CUDA 12.3 and CUDA 12.2 is the oldest
// version we support. This trade-off is made for improved robustness.
if (version >= 1060) {
Copy link
Member Author

@jakirkham jakirkham Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA 12.2.0 shipped with cuFile 1.7.0.49

In other words it had the batch and async features all supported. Given this all of the symbols checked for below are defined and the checks could be dropped. For now have just bumped the version value to represent 1.7.0

Refs:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clean-up! This will help to simplify the logic in C++ compatibility mode manager as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the encouragement Tianyu! 🙏

Went ahead and dropped all of this logic

As of CUDA 12.2.0 all of these functions are available. So drop the
version checks around them.

Though still get the version (if possible) as it is defined here and
used throughout the codebase. If it is not possible to get the version,
then this is CUDA 12.2 (as CUDA 12.3 defines the function). So set the
version to cuFile 1.7.0, which was included in CUDA 12.2.
As of CUDA 12.2+, these functions are always true as the relevant APIs
are always included. So go ahead drop them. Also equate their calls to
`true` in relevant logic.
@jakirkham
Copy link
Member Author

jakirkham commented Nov 6, 2025

Have been seeing some unrelated flaky CI behavior in a few jobs. Here is one example

Download action repository 'aws-actions/configure-aws-credentials@00943011d9042930efac3dcd3a170e4273319bc8' (SHA:00943011d9042930efac3dcd3a170e4273319bc8)
Warning: Failed to download action 'https://api.github.com/repos/aws-actions/configure-aws-credentials/tarball/00943011d9042930efac3dcd3a170e4273319bc8'. Error: Resource temporarily unavailable (codeload.github.com:443) 
Warning: Back off 25.333 seconds before retry.
Warning: Failed to download action 'https://api.github.com/repos/aws-actions/configure-aws-credentials/tarball/00943011d9042930efac3dcd3a170e4273319bc8'. Error: Resource temporarily unavailable (codeload.github.com:443) 
Warning: Back off 18.664 seconds before retry.
Error: Resource temporarily unavailable (codeload.github.com:443)

Have restarted CI a few times to get this to clear up. Will also follow up offline

Edit: Looks like this was do to a networking issue that has since been resolved

@jakirkham
Copy link
Member Author

That being said, think this is otherwise ready for another review. So would appreciate if folks could take another look


# Check API support
try_compile(
cuFile_BATCH_API_FOUND SOURCE_FROM_CONTENT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking about these. I think we may still want to check, but instead error if it doesn't work

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the batch/stream APIs are guaranteed to exist in cuFile under CTK 12.2+, I think we can remove the check here for simplicity.

)
message(STATUS "Found cuFile Batch API: ${cuFile_BATCH_API_FOUND}")
try_compile(
cuFile_STREAM_API_FOUND SOURCE_FROM_CONTENT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and this one as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's discuss in the above thread: #860 (comment)

We can apply any changes needed to both after

Comment on lines 81 to 91
void CompatModeManager::validate_compat_mode_for_async() const
{
KVIKIO_NVTX_FUNC_RANGE();
if (_is_compat_mode_preferred_for_async && _compat_mode_requested == CompatMode::OFF) {
std::string err_msg;
if (!is_stream_api_available()) { err_msg += "Missing the cuFile stream api."; }

// When checking for availability, we also check if cuFile's config file exists. This is
// When checking for availability, we check if cuFile's config file exists. This is
// because even when the stream API is available, it doesn't work if no config file exists.
if (config_path().empty()) { err_msg += " Missing cuFile configuration file."; }
if (config_path().empty()) { err_msg += "Missing cuFile configuration file."; }

KVIKIO_FAIL(err_msg, std::runtime_error);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CompatModeManager::validate_compat_mode_for_async() const
{
KVIKIO_NVTX_FUNC_RANGE();
if (_is_compat_mode_preferred_for_async && _compat_mode_requested == CompatMode::OFF) {
std::string err_msg;
if (!is_stream_api_available()) { err_msg += "Missing the cuFile stream api."; }
// When checking for availability, we also check if cuFile's config file exists. This is
// When checking for availability, we check if cuFile's config file exists. This is
// because even when the stream API is available, it doesn't work if no config file exists.
if (config_path().empty()) { err_msg += " Missing cuFile configuration file."; }
if (config_path().empty()) { err_msg += "Missing cuFile configuration file."; }
KVIKIO_FAIL(err_msg, std::runtime_error);
}
void CompatModeManager::validate_compat_mode_for_async() const
{
KVIKIO_NVTX_FUNC_RANGE();
// cuFile stream API will not work if the config file does not exist.
if (_is_compat_mode_preferred_for_async && _compat_mode_requested == CompatMode::OFF &&
config_path().empty()) {
KVIKIO_FAIL("Missing cuFile configuration file.", std::runtime_error);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a side note, I didn't know the presence of config file is required for the stream API to function properly. This is quite interesting. I'll run some tests and see if this behavior has changed in recent cuFile API.

Comment on lines 33 to 44
#ifdef KVIKIO_CUFILE_VERSION_API_FOUND
try {
get_symbol(GetVersion, lib, KVIKIO_STRINGIFY(cuFileGetVersion));
int ver;
CUfileError_t const error = GetVersion(&ver);
if (error.err == CU_FILE_SUCCESS) { version = ver; }
} catch (std::runtime_error const&) {
version = 1070;
}
#else
version = 1070;
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be consistent with previous behavior, we don't need to set the version number here in the else branch. It will take the default value of 0 (specified in cuFileAPI's constructor). I think we could add back some of the comments related to the version checking API. So I'd suggest this change:

Suggested change
#endif
// Note: `version` is 0 for cuFile versions prior to v1.8 because `cuFileGetVersion`
// did not exist.
#ifdef KVIKIO_CUFILE_VERSION_API_FOUND
try {
get_symbol(GetVersion, lib, KVIKIO_STRINGIFY(cuFileGetVersion));
int ver;
CUfileError_t const error = GetVersion(&ver);
if (error.err == CU_FILE_SUCCESS) { version = ver; }
} catch (std::runtime_error const&) {
}
#endif

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving packaging/CI changes. I'll let @kingcrimsontianyu finalize his feedback and approve the C++ side, but it looks good to me too

Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested a couple of changes. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Introduces a breaking change improvement Improves an existing functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants