[CI] Widespread IVF and TSNE test failures in conda-python-tests-singlegpu jobs #7407

@csadorf

Summary

All conda-python-tests-singlegpu test jobs are experiencing widespread failures (34 tests) across two components: IVF-based nearest neighbors (segmentation faults) and TSNE sparse input with specific distance metrics (CUDA device function errors).

Failing tests/components:

  • test_nearest_neighbors.py - IVF-Flat and IVF-PQ tests (25 failures)
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input (7 failures)
  • test_pickle.py - IVF pickle tests (2 failures)

Failure observed in:

Environment

Environment independent.

Test Details

Category 1: IVF Nearest Neighbors - Segmentation Faults (27 failures)

Affected tests:

  • test_nearest_neighbors.py::test_ann_distances_metrics[ivfflat-*] (5 tests)
  • test_nearest_neighbors.py::test_ann_distances_metrics[ivfpq-*] (4 tests)
  • test_nearest_neighbors.py::test_neighborhood_predictions[ivfflat-*] (4 tests)
  • test_nearest_neighbors.py::test_neighborhood_predictions[ivfpq-*] (4 tests)
  • test_nearest_neighbors.py::test_ivfflat_pred[*] (3 tests)
  • test_nearest_neighbors.py::test_ivfpq_pred[*] (2 tests)
  • test_nearest_neighbors.py::test_knn_graph_algorithm[ivfpq]
  • test_nearest_neighbors.py::test_nearest_neighbors_rbc[*]
  • test_pickle.py::test_nearest_neighbors_pickle[ivfflat]
  • test_pickle.py::test_nearest_neighbors_pickle[ivfpq]

Error:

Fatal Python error: Segmentation fault

Current thread (most recent call first):
  File "cuml/internals/api_decorators.py", line 200 in wrapper
  File "test_nearest_neighbors.py", line 241 in test_ivfpq_pred

Multiple pytest workers (gw2-gw24) crashed during execution with memory corruption errors.
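
Because a segmentation fault takes the whole pytest worker down with it, one way to narrow down the crashing parametrization is to run each candidate in its own child interpreter and inspect the exit status. The following is a minimal sketch under stated assumptions, not part of the cuML test suite: `run_isolated` and `died_from_segfault` are hypothetical helper names, the snippet passed in is a stand-in for a real test body, and the negative return code convention applies to POSIX systems only.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=120):
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Returns (returncode, stderr). On POSIX, a child killed by a signal
    reports a negative return code equal to minus the signal number.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def died_from_segfault(returncode):
    """True if the child process was terminated by SIGSEGV (POSIX only)."""
    return returncode == -signal.SIGSEGV


if __name__ == "__main__":
    # Stand-in for a crashing test body: the child sends SIGSEGV to itself.
    snippet = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
    returncode, stderr = run_isolated(snippet)
    print("segfault:", died_from_segfault(returncode))
    print(stderr)
```

With faulthandler enabled in the child, its stderr carries the same "Fatal Python error: Segmentation fault" banner and traceback shown above, while the parent process survives and can move on to the next candidate.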

Category 2: TSNE Sparse Input - CUDA Device Function Error (7 failures)

Affected tests:

  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[euclidean-exact]
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[cityblock-fft]
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[cityblock-barnes_hut]
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[cityblock-exact]
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[l1-fft]
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[l1-barnes_hut]
  • test_tsne.py::test_tsne_distance_metrics_on_sparse_input[l1-exact]

Error:

RuntimeError: transform: failed inside CUB: cudaErrorInvalidDeviceFunction: invalid device function
  File "test_tsne.py", line 422, in test_tsne_distance_metrics_on_sparse_input
    cuml_embedding = cuml_tsne.fit_transform(data_sparse)
  File "cuml/manifold/t_sne.pyx", line 654, in cuml.manifold.t_sne.TSNE.fit

Failures occur only with sparse input. The cityblock and l1 metrics fail with all three TSNE methods (fft, barnes_hut, exact), while euclidean fails only with the exact method.
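
To make the failure pattern easier to check, the seven failing parametrizations listed above can be grouped by metric. This is a small illustrative sketch that uses only the test IDs from this report; `group_by_metric` is a hypothetical helper, and it assumes the IDs follow a `metric-method` layout.

```python
from collections import defaultdict

# Failing parametrizations copied from the list above, as "metric-method".
FAILING_IDS = [
    "euclidean-exact",
    "cityblock-fft",
    "cityblock-barnes_hut",
    "cityblock-exact",
    "l1-fft",
    "l1-barnes_hut",
    "l1-exact",
]


def group_by_metric(ids):
    """Map each metric to the sorted list of TSNE methods that failed with it."""
    grouped = defaultdict(list)
    for test_id in ids:
        # Split on the first hyphen only, so "barnes_hut" stays intact.
        metric, method = test_id.split("-", 1)
        grouped[metric].append(method)
    return {metric: sorted(methods) for metric, methods in grouped.items()}


if __name__ == "__main__":
    for metric, methods in group_by_metric(FAILING_IDS).items():
        print(f"{metric}: {', '.join(methods)}")
```

The grouping shows cityblock and l1 failing with every method, and euclidean failing only with exact.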

Labels

  • bug: Something isn't working
  • ci
  • dependency-break: Issue is related to an upstream breaking change.
