Run unit tests in test_pytorch_wheels.yml on Windows #2265
Conversation
HereThereBeDragons
left a comment
overall this already looks quite good to me. here are a couple of comments for improvement:
i wonder if we will need to extend our test skipping depending on the platform, extending it with
skip_tests/generic.py
skip_tests/generic_linux.py
skip_tests/generic_win.py
skip_tests/pytorch_2.9.py
skip_tests/pytorch_2.9_linux.py
skip_tests/pytorch_2.9_win.py
i think it is also worth considering whether we can add a comment on how torch_version looks format-wise, or maybe rename it to torch_rocm_version to clarify that it is the 2.9.0+rocm7.10a... value in the build_..pytorch_wheels.yml, as you already add some comments about it in test_pytorch_wheels.yml
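To illustrate the idea, the selection could be roughly something like this (just a sketch; the helper name and the version parsing are made up, not existing code):

import platform
from pathlib import Path

def collect_skip_files(skip_dir: Path, torch_version: str) -> list[Path]:
    # Hypothetical helper: pick whichever skip lists exist for this
    # platform and torch major.minor version, most generic first.
    suffix = "win" if platform.system() == "Windows" else "linux"
    # e.g. "2.9.0+rocm7.10a..." -> "2.9" (assumed version format)
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    candidates = [
        skip_dir / "generic.py",
        skip_dir / f"generic_{suffix}.py",
        skip_dir / f"pytorch_{major_minor}.py",
        skip_dir / f"pytorch_{major_minor}_{suffix}.py",
    ]
    return [path for path in candidates if path.exists()]

# e.g. collect_skip_files(Path("skip_tests"), "2.9.0+rocm7.10a...")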
"""Forces termination to work around https://github.com/ROCm/TheRock/issues/999."""
import signal

retcode_file = Path("exit_code.txt")
see my comment in r497 about whether we need the printing of the error code here in the first place.
if yes:
maybe rename to pytorch_pytest_exit_code.txt? and do we need to upload it somewhere to the artifacts?
Renamed to run_pytorch_tests_exit_code.txt, matching the script file name. If we add other test scripts they can (ughhh) use the same pattern.
and do we need to upload it somewhere to the artifacts?
I don't think we need to upload this exit code file. We should generate test reports and upload those. The test reports will then be authoritative for result status.
- https://docs.pytest.org/en/stable/how-to/output.html#creating-junitxml-format-files
- https://github.com/pytorch/pytorch/blob/33d4cf4fcb7f0cba6191b242dae53b48057e05b9/test/run_test.py#L1266-L1270
- https://github.com/pytorch/pytorch/blob/33d4cf4fcb7f0cba6191b242dae53b48057e05b9/test/run_test.py#L598-L605
Maybe if we do that we can add continue-on-error: true to Linux too, have a common step after the tests that checks the results in the reports, and ignore the exit code altogether.
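Roughly what I have in mind for that common check step, as a sketch (the report file name and the helper are hypothetical, nothing in the repo does this yet):

import sys
import xml.etree.ElementTree as ET

import pytest

def run_and_check(pytest_args: list[str], report_path: str = "pytest_report.xml") -> int:
    # Let pytest write a JUnit-style XML report; ignore its exit code here.
    pytest.main(pytest_args + [f"--junit-xml={report_path}"])

    # Decide pass/fail from the report contents instead of the process exit code.
    root = ET.parse(report_path).getroot()
    failures = sum(
        int(suite.get("failures", 0)) + int(suite.get("errors", 0))
        for suite in root.iter("testsuite")
    )
    print(f"Failures/errors recorded in {report_path}: {failures}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(run_and_check(sys.argv[1:]))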
well.. then we could also change the script so it does not return the pytest return code, and just use the printing of the error code we already have?
]

retcode = pytest.main(pytorch_args)
print(f"Pytest finished with return code: {retcode}")
i am already printing the return code there. maybe we do not need the extra file for windows?
We would need to capture stdout somehow (e.g. pipe the output to a file) to use this print() for return code handling.
I'm considering putting this code in a python script, but I really don't want this hack to live for long:
- name: (Windows) Read and propagate exit code
if: ${{ runner.os == 'Windows' }}
run: |
if [ -f run_pytorch_tests_exit_code.txt ]; then
EXIT_CODE=$(cat run_pytorch_tests_exit_code.txt)
echo "Exit code from file: $EXIT_CODE"
exit $EXIT_CODE
else
echo "No run_pytorch_tests_exit_code.txt found"
exit 1
fi
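If it does move into a Python helper, it would just be the same logic, something like (sketch only; only the file name matches what's above):

import sys
from pathlib import Path

def propagate_exit_code(path: Path = Path("run_pytorch_tests_exit_code.txt")) -> int:
    # Mirror the bash step above: read the recorded exit code and return it,
    # failing if the file was never written.
    if not path.is_file():
        print(f"No {path} found")
        return 1
    code = int(path.read_text().strip())
    print(f"Exit code from file: {code}")
    return code

if __name__ == "__main__":
    sys.exit(propagate_exit_code())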
we are already capturing it in the Run PyTorch tests step:
= 38 failed, 15608 passed, 24557 skipped, 75 deselected, 45 xfailed in 1354.49s (0:22:34) =
Pytest finished with return code: 1 <<<<< this line here
Writing retcode 1 to 'exit_code.txt'
GitHub Actions logs stdout, but we can't (I don't think?) access it unless we capture it ourselves somehow too?
Ah... there's an idea. We could write to GITHUB_OUTPUT instead of tracking our own custom file. I think that would be a bit too roundabout though 🤔
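For reference, that would look roughly like this from the test script (sketch; the output name is made up):

import os

def write_step_output(name: str, value: str) -> None:
    # GitHub Actions exposes a file via GITHUB_OUTPUT; appending "name=value"
    # makes the value available to later steps as steps.<id>.outputs.<name>.
    output_file = os.environ.get("GITHUB_OUTPUT")
    if not output_file:
        print(f"{name}={value}")  # not running under Actions; just log it
        return
    with open(output_file, "a", encoding="utf-8") as f:
        f.write(f"{name}={value}\n")

# e.g. write_step_output("exit_code", str(retcode)) after pytest.main() returns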
what do you mean? it is captured as part of the Run PyTorch tests step. this was an extract from the ci runner log.
E.g. https://github.com/ROCm/TheRock/actions/runs/19635001346/job/56229351773#step:11:40789
"Captured" meaning we can do something with it (e.g. have the value in an environment variable, a file, a bash variable, etc.). We can't just parse through all of stdout from a prior job step to determine if a step should pass or fail, unless I'm missing some way that steps can read stdout from prior steps.
[TBD] More complete release workflow runs for Windows and Linux?
@HereThereBeDragons @araravik-psd would you like me to trigger a full ROCm dev release for this PR to test across all pytorch versions and supported gfx families, or is spot checking with jobs like https://github.com/ROCm/TheRock/actions/runs/19586629648/job/56096766330 sufficient and then we'll see results from the next nightly release?
As this is now, the "release gating" will stop promoting packages from v2-staging to v2 for Windows once this PR is merged, until we get all test failures addressed (#2156). That is already the case for Linux nightly releases.
up to you. i would just wait for the nightlies.
just considering the runtime, you don't get signals before tomorrow anyway
Given how unstable the tests appeared when I was testing, I think I will split this into two PRs:

- The external-builds/pytorch/* changes allowing for running tests on Windows locally
- The .github/workflows/test_pytorch_wheels.yml changes that include those tests on our Windows runners

That way we can more easily revert just the workflow changes as needed while keeping support for testing locally.
HereThereBeDragons
left a comment
see discussion comments
jayhawk-commits
left a comment
looks good as a starter to get windows results
# Skip tests that hang. Perhaps related to processes not terminating
# on their own: https://github.com/ROCm/TheRock/issues/999.
Seeing more tests hang on 'nightly' than just on 'release/2.9':
https://github.com/ROCm/TheRock/actions/runs/19651083796/job/56278015582
Mon, 24 Nov 2025 22:27:09 GMT external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_nvtx PASSED [0.0007s] [ 8%]
Mon, 24 Nov 2025 22:27:09 GMT external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_out_of_memory PASSED [0.0014s] [ 8%]
Mon, 24 Nov 2025 22:27:09 GMT external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_out_of_memory_retry FAILED [0.7945s] [ 8%]
Mon, 24 Nov 2025 22:27:09 GMT external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_pinned_memory_empty_cache PASSED [0.0043s] [ 8%]
Mon, 24 Nov 2025 23:02:02 GMT external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_pinned_memory_use_background_threads [TORCH_VITAL] CUDA.used true
Mon, 24 Nov 2025 23:02:02 GMT [TORCH_VITAL] Dataloader.basic_unit_test TEST_VALUE_STRING
Mon, 24 Nov 2025 23:02:02 GMT [TORCH_VITAL] Dataloader.enabled True
Mon, 24 Nov 2025 23:02:03 GMT Error: The operation was canceled.
I'll pin that down through local testing and push another test skip before merging this.
Fixed (hopefully) by skipping two more tests - one timeout and one crash. Testing again at https://github.com/ROCm/TheRock/actions/runs/19653570950/job/56285463242 before merge.
Seeing a bunch of crashes on CI runners that I can't reproduce locally. I'll have to debug more tomorrow, can't merge this yet.
Latest is https://github.com/ROCm/TheRock/actions/runs/19654631051/job/56288648443#step:12:5886
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_hip_device_count PASSED [6.0132s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_huge_index SKIPPED [0.0007s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_index_out_of_bounds_exception_cuda SKIPPED [0.0005s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_invalid_status_for_legacy_api FAILED [0.0005s] [ 8%]
Traceback (most recent call last):
File "<string>", line 35, in <module>
File "<string>", line 22, in fork_and_check_is_pinned
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\context.py", line 337, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
reduction.dump(process_obj, to_child)
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't get local object 'fork_and_check_is_pinned.<locals>.worker'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_is_pinned_no_context FAILED [1.4856s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_lazy_init PASSED [1.4903s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_manual_seed PASSED [0.0038s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_matmul_device_mismatch PASSED [0.0012s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_matmul_memory_use PASSED [0.0143s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_max_large_axis SKIPPED [0.0005s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_mean_fp16 PASSED [0.0008s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_memory_allocation PASSED [0.2754s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_memory_stats PASSED [0.5461s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_memory_stats_of_multiple_generators_and_graphs PASSED [0.8383s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_min_max_inits PASSED [0.0020s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_multi_device_context_manager SKIPPED [0.0001s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_multi_device_stream_context_manager SKIPPED [0.0001s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_multinomial_ext PASSED [0.0034s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_multinomial_invalid_probs_cuda SKIPPED [0.0001s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_noncontiguous_pinned_memory PASSED [0.0006s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_norm_type_conversion PASSED [0.0017s] [ 8%]
[W1125 01:12:59.000000000 nvtx.cpp:75] Warning: Warning: roctracer isn't available on Windows (function operator())
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_nvtx PASSED [0.0005s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_out_of_memory PASSED [0.0012s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_out_of_memory_retry FAILED [0.7698s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_pinned_memory_empty_cache PASSED [0.0044s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_pinned_memory_with_cudaregister PASSED [0.0195s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_pinned_memory_with_cudaregister_multithread PASSED [0.0176s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_preferred_blas_library_settings PASSED [3.0720s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_prod_large PASSED [0.0018s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_randint_generation_for_large_numel PASSED [1.3064s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_randint_randomness_for_large_range PASSED [0.1446s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_random_no_reused_random_states_float32 PASSED [0.5986s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_random_no_reused_random_states_float64 PASSED [0.4658s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_record_stream PASSED [0.0513s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_record_stream_on_shifted_view PASSED [12.3244s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_reduction_gpu_memory_accessing PASSED [0.0009s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_repeat_graph_capture_cublas_workspace_memory PASSED [0.9795s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_rocm_backward_pass_guard PASSED [0.0012s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED [0.0882s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_specify_improper_device_name PASSED [0.0129s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_stream_compatibility PASSED [0.0008s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_stream_context_manager PASSED [0.0007s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_stream_event_repr PASSED [0.0006s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_streaming_backwards_callback FAILED [0.0545s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_streaming_backwards_multiple_streams PASSED [0.0103s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_streaming_backwards_sync PASSED [0.0017s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_streaming_backwards_sync_graph_root FAILED [0.0517s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_streams FAILED [0.0008s] [ 8%]
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_sum_fp16 FAILED [0.0007s] [ 8%]
Windows fatal exception: access violation
Thread 0x00001290 (most recent call first):
<no Python frame>
Thread 0x00001e44 (most recent call first):
File "B:\runner\_work\TheRock\TheRock\external-builds\pytorch\pytorch\test\test_cuda.py", line 1577 in test_tiny_half_norm_
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\unittest\case.py", line 589 in _callTestMethod
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\unittest\case.py", line 634 in run
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\torch\testing\_internal\common_utils.py", line 3484 in _run_custom
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\torch\testing\_internal\common_utils.py", line 3514 in run
File "B:\runner\_work\_tool\Python\3.12.10\x64\Lib\unittest\case.py", line 690 in __call__
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\unittest.py", line 351 in runtest
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\runner.py", line 174 in pytest_runtest_call
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_callers.py", line 121 in _multicall
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_hooks.py", line 512 in __call__
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\runner.py", line 242 in <lambda>
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\runner.py", line 341 in from_call
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\runner.py", line 241 in call_and_report
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\runner.py", line 132 in runtestprotocol
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\runner.py", line 113 in pytest_runtest_protocol
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_callers.py", line 121 in _multicall
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_hooks.py", line 512 in __call__
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\main.py", line 362 in pytest_runtestloop
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_callers.py", line 121 in _multicall
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_hooks.py", line 512 in __call__
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\main.py", line 337 in _main
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\main.py", line 283 in wrap_session
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\main.py", line 330 in pytest_cmdline_main
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_callers.py", line 121 in _multicall
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_manager.py", line 120 in _hookexec
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\pluggy\_hooks.py", line 512 in __call__
File "B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_pytest\config\__init__.py", line 175 in main
File "B:\runner\_work\TheRock\TheRock\external-builds\pytorch\run_pytorch_tests.py", line 499 in main
File "B:\runner\_work\TheRock\TheRock\external-builds\pytorch\run_pytorch_tests.py", line 531 in <module>
Exception Code: 0xC0000005
0x00007FF97A670983, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x920983 byte(s), hipHccModuleLaunchKernel() + 0x59B5F3 byte(s)
0x00007FF97A1A4315, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x454315 byte(s), hipHccModuleLaunchKernel() + 0xCEF85 byte(s)
0x00007FF97A1DEF47, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x48EF47 byte(s), hipHccModuleLaunchKernel() + 0x109BB7 byte(s)
0x00007FF97A1DDEC6, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x48DEC6 byte(s), hipHccModuleLaunchKernel() + 0x108B36 byte(s)
0x00007FF97A1DE1B4, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x48E1B4 byte(s), hipHccModuleLaunchKernel() + 0x108E24 byte(s)
0x00007FF97A1CB105, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x47B105 byte(s), hipHccModuleLaunchKernel() + 0xF5D75 byte(s)
0x00007FF97A14010F, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x3F010F byte(s), hipHccModuleLaunchKernel() + 0x6AD7F byte(s)
0x00007FF97A140231, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x3F0231 byte(s), hipHccModuleLaunchKernel() + 0x6AEA1 byte(s)
0x00007FF97A163A86, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x413A86 byte(s), hipHccModuleLaunchKernel() + 0x8E6F6 byte(s)
0x00007FF97A0FB0FF, B:\runner\_work\TheRock\TheRock\.venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll(0x00007FF979D50000) + 0x3AB0FF byte(s), hipHccModuleLaunchKernel() + 0x25D6F byte(s)
0x00007FF9B1A4E8D7, C:\Windows\System32\KERNEL32.DLL(0x00007FF9B1A20000) + 0x2E8D7 byte(s), BaseThreadInitThunk() + 0x17 byte(s)
0x00007FF9B232C53C, C:\Windows\SYSTEM32\ntdll.dll(0x00007FF9B22A0000) + 0x8C53C byte(s), RtlUserThreadStart() + 0x2C byte(s)
B:\runner\_work\_temp\211244a6-284a-4d32-9dec-bf7bac56d6e0.sh: line 1: 507 Segmentation fault python ./external-builds/pytorch/run_pytorch_tests.py
external-builds\pytorch\pytorch\test\test_cuda.py::TestCuda::test_tiny_half_norm_
Error: Process completed with exit code 139.
Warning
Because we have test failures, this PR will stop promotion from v2-staging to v2 on Windows for GPU families where we have test runners like gfx1151 and gfx110X, as is already done on Linux.
Motivation
Progress on #2258 and #1073. This changes the test_pytorch_wheels.yml workflow from only running our PyTorch smoke tests to running the full set in our run_linux_pytorch_tests.py script.

Technical Details
Due to #999, I added a force_exit_with_code() hack to run_pytorch_tests.py. Since the test process does not terminate on its own, even after all test cases complete, I kill the process with os.kill(). I tried to use nicer methods like sys.exit() and os._exit() but these were not sufficient. A consequence of this is that the exit code of the process is now always 15 (SIGTERM) on Windows, so the script now writes exit_code.txt to the current directory for the test_pytorch_wheels.yml workflow to use.
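The shape of that hack is roughly the following (simplified sketch, not the exact code in run_pytorch_tests.py):

import os
import signal
from pathlib import Path

def force_exit_with_code(retcode: int, retcode_file: Path = Path("exit_code.txt")) -> None:
    # Record the pytest return code for the workflow to read, then force
    # termination since sys.exit()/os._exit() were not sufficient (#999).
    print(f"Writing retcode {retcode} to '{retcode_file}'")
    retcode_file.write_text(str(retcode))
    # On Windows this terminates the process with exit code 15 (SIGTERM),
    # which is why the workflow reads the file instead of the exit code.
    os.kill(os.getpid(), signal.SIGTERM)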
Two test cases caused additional issues:

- test_cublas_config_nondeterministic_alert_cuda in test_torch.py
- test_graph_error in test_cuda.py

These test cases should be fixed or conditionally skipped in the upstream pytorch test files. Until then, I marked them as skipped using our new test filtering under a new "platform/windows" category.
Test Plan
- python D:/projects/TheRock/external-builds/pytorch/run_pytorch_tests.py --pytorch-dir D:/b/pytorch --amdgpu-family=gfx110X-dgpu > C:\Users\Nod-Shark16\.therock\logs\run_pytorch_tests_%date%_%time::=%.txt 2>&1
- test_torch.py only: https://github.com/ROCm/TheRock/actions/runs/19585433265/job/56093159807

Test Result
The specific set of tests running and their current results on my gfx1100 system for PyTorch 2.9 are:
- test_nn.py
- test_torch.py
- test_cuda.py
- test_unary_ufuncs.py
- test_binary_ufuncs.py
- test_autograd.py

* (Numbers might not quite add up there since I ran at a few different pytorch commits)
Submission Checklist