-
Notifications
You must be signed in to change notification settings - Fork 82
Require CUDA 12.2+ #860
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Require CUDA 12.2+ #860
Conversation
conda/recipes/libkvikio/recipe.yaml
Outdated
| # Each case has different cuda-version constraints as expressed below | ||
| should_use_cufile: ${{ x86_64 or (aarch64 and cuda_version >= "12.2") }} | ||
| # When reverting, instances of cuda_key_string can be replaced with cuda_major | ||
| cuda_key_string: ${{ cuda_version | replace(".", "_") }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want this to go back to cuda_major as the comment suggests. Delete cuda_key_string, uncomment cuda_version / cuda_major, and replace cuda_key_string with cuda_major.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am ambivalent here. On the one hand can see that point. However on the other it could be useful to call out CUDA 12.2 is a minimum here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this helps. We already set a lower bound on cuda-version. Let's stop using special-case behavior here and revert as the comment recommends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same argument could be made for the explicit inclusion of the CUDA version in our build string currently
It is also already handled via cuda-version and the build string hash
Think it is worth discussing this point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same argument could be made for the explicit inclusion of the CUDA version in our build string currently
How? conda doesn't allow you to upload two builds that have identical versions and build strings but different dependencies. You have to include some bit of information that makes the build unique. I guess we could use PKG_HASH to distinguish the builds in the build string, but if you have to include something it might as well be more meaningful than that.
That being said, I don't look at the build string as a useful way of conveying the minimum supported CUDA version given that the cuda-version dependency is part of the package metadata as well, so my vote would be to go back to the previous status. 12 vs 13 tells me enough about the purpose of the two packages. I'm not looking for anything more than that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that argument goes like this
If cuda-version is in the variants (like conda_build_config.yaml or variants.yaml), then it is included in the string hash. So there is no need to manually add it to the string
The challenge with including it manually (as we have to date) is it means writing the build string from scratch each time. As a result we end up missing details that would have otherwise been captured for us if it was added as a variant and included in the string hash
Right so there are essentially two viewpoints we encounter. One view is the as you have articulated that the build string is not a reliable or consistent way to find this information so it shouldn't be relied on and package dependencies (like cuda-version) should be used to do that. The other view is that it can be useful to have the relevant info in one place on the relevant package to simplify the end-user workflow
Given today we already recommend users specify cuda-version, there is no real simplification offered by adding this info to the build string. So would tend to agree with the first view for our case (similar to you). And would go further, that it would be better to minimize string customization and rely on variants and the string hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason we have never included the CUDA version in the variants directly is that we need the flexibility of building different variants in different CI jobs on different runners, whereas conda-build and rattler-build will build all the variants at once (the slowdown in end-to-end CI time that would result from serializing the builds has too big a negative impact on developers). With the switch to rattler-build we have a bit of additional flexibility because rattler-build supports providing the variant config on the CLI, but ultimately that means we'd be constructing the variants on the fly in CI jobs which leads to more divergence between local dev and CI where we have been aiming to close the gap over time. If you have a solution that allows us to specify all the information in the variants file and avoid environment and recipe context variables, I'm all for it. Otherwise I think we're stuck with some degree of manual string customization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both conda-build and rattler-build can specify variants by CLI
Think we are conflating two things:
- How the matrix is specified
- How CUDA version is specified
We need not change 1 to change 2. Am focused here on 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have an idea of how to implement a proposed improved, feel free to push a commit with the change and I'm happy to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do. Though will approach this in a follow up PR so we can discuss. We can then decide if it merits broader rollout
For now have simply reverted the relevant bits here
c23914f to
cff534c
Compare
9eebbfa to
82937d0
Compare
cpp/src/shim/cufile.cpp
Outdated
| // Note: `version` is v1.7 (in CUDA 12.2) for cuFile versions prior to v1.8 (CUDA 12.3) | ||
| // because `cuFileGetVersion` did not exist until CUDA 12.3 and CUDA 12.2 is the oldest | ||
| // version we support. This trade-off is made for improved robustness. | ||
| if (version >= 1060) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CUDA 12.2.0 shipped with cuFile 1.7.0.49
In other words it had the batch and async features all supported. Given this all of the symbols checked for below are defined and the checks could be dropped. For now have just bumped the version value to represent 1.7.0
Refs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clean-up! This will help to simplify the logic in C++ compatibility mode manager as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the encouragement Tianyu! 🙏
Went ahead and dropped all of this logic
82937d0 to
05f1d88
Compare
6ab3851 to
d1bfd7e
Compare
d1bfd7e to
c1f0d37
Compare
As of CUDA 12.2.0 all of these functions are available. So drop the version checks around them. Though still get the version (if possible) as it is defined here and used throughout the codebase. If it is not possible to get the version, then this is CUDA 12.2 (as CUDA 12.3 defines the function). So set the version to cuFile 1.7.0, which was included in CUDA 12.2.
As of CUDA 12.2+, these functions are always true as the relevant APIs are always included. So go ahead drop them. Also equate their calls to `true` in relevant logic.
c1f0d37 to
5405ae4
Compare
|
Have been seeing some unrelated flaky CI behavior in a few jobs. Here is one example Have restarted CI a few times to get this to clear up. Will also follow up offline Edit: Looks like this was do to a networking issue that has since been resolved |
|
That being said, think this is otherwise ready for another review. So would appreciate if folks could take another look |
|
|
||
| # Check API support | ||
| try_compile( | ||
| cuFile_BATCH_API_FOUND SOURCE_FROM_CONTENT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking about these. I think we may still want to check, but instead error if it doesn't work
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the batch/stream APIs are guaranteed to exist in cuFile under CTK 12.2+, I think we can remove the check here for simplicity.
| ) | ||
| message(STATUS "Found cuFile Batch API: ${cuFile_BATCH_API_FOUND}") | ||
| try_compile( | ||
| cuFile_STREAM_API_FOUND SOURCE_FROM_CONTENT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and this one as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's discuss in the above thread: #860 (comment)
We can apply any changes needed to both after
| void CompatModeManager::validate_compat_mode_for_async() const | ||
| { | ||
| KVIKIO_NVTX_FUNC_RANGE(); | ||
| if (_is_compat_mode_preferred_for_async && _compat_mode_requested == CompatMode::OFF) { | ||
| std::string err_msg; | ||
| if (!is_stream_api_available()) { err_msg += "Missing the cuFile stream api."; } | ||
|
|
||
| // When checking for availability, we also check if cuFile's config file exists. This is | ||
| // When checking for availability, we check if cuFile's config file exists. This is | ||
| // because even when the stream API is available, it doesn't work if no config file exists. | ||
| if (config_path().empty()) { err_msg += " Missing cuFile configuration file."; } | ||
| if (config_path().empty()) { err_msg += "Missing cuFile configuration file."; } | ||
|
|
||
| KVIKIO_FAIL(err_msg, std::runtime_error); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| void CompatModeManager::validate_compat_mode_for_async() const | |
| { | |
| KVIKIO_NVTX_FUNC_RANGE(); | |
| if (_is_compat_mode_preferred_for_async && _compat_mode_requested == CompatMode::OFF) { | |
| std::string err_msg; | |
| if (!is_stream_api_available()) { err_msg += "Missing the cuFile stream api."; } | |
| // When checking for availability, we also check if cuFile's config file exists. This is | |
| // When checking for availability, we check if cuFile's config file exists. This is | |
| // because even when the stream API is available, it doesn't work if no config file exists. | |
| if (config_path().empty()) { err_msg += " Missing cuFile configuration file."; } | |
| if (config_path().empty()) { err_msg += "Missing cuFile configuration file."; } | |
| KVIKIO_FAIL(err_msg, std::runtime_error); | |
| } | |
| void CompatModeManager::validate_compat_mode_for_async() const | |
| { | |
| KVIKIO_NVTX_FUNC_RANGE(); | |
| // cuFile stream API will not work if the config file does not exist. | |
| if (_is_compat_mode_preferred_for_async && _compat_mode_requested == CompatMode::OFF && | |
| config_path().empty()) { | |
| KVIKIO_FAIL("Missing cuFile configuration file.", std::runtime_error); | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a side note, I didn't know the presence of config file is required for the stream API to function properly. This is quite interesting. I'll run some tests and see if this behavior has changed in recent cuFile API.
| #ifdef KVIKIO_CUFILE_VERSION_API_FOUND | ||
| try { | ||
| get_symbol(GetVersion, lib, KVIKIO_STRINGIFY(cuFileGetVersion)); | ||
| int ver; | ||
| CUfileError_t const error = GetVersion(&ver); | ||
| if (error.err == CU_FILE_SUCCESS) { version = ver; } | ||
| } catch (std::runtime_error const&) { | ||
| version = 1070; | ||
| } | ||
| #else | ||
| version = 1070; | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with previous behavior, we don't need to set the version number here in the else branch. It will take the default value of 0 (specified in cuFileAPI's constructor). I think we could add back some of the comments related to the version checking API. So I'd suggest this change:
| #endif | |
| // Note: `version` is 0 for cuFile versions prior to v1.8 because `cuFileGetVersion` | |
| // did not exist. | |
| #ifdef KVIKIO_CUFILE_VERSION_API_FOUND | |
| try { | |
| get_symbol(GetVersion, lib, KVIKIO_STRINGIFY(cuFileGetVersion)); | |
| int ver; | |
| CUfileError_t const error = GetVersion(&ver); | |
| if (error.err == CU_FILE_SUCCESS) { version = ver; } | |
| } catch (std::runtime_error const&) { | |
| } | |
| #endif |
bdice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving packaging/CI changes. I'll let @kingcrimsontianyu finalize his feedback and approve the C++ side, but it looks good to me too
kingcrimsontianyu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested a couple of changes. Thanks!
Make CUDA 12.2 the new minimum. Drops logic related to CUDA 12.0 & 12.1. Thus simplifying the CUDA build and dependency logic.
Part of issue: