bug(services/hf): hf_dataset behavior tests timeout on main #7577

@Xuanwo

Describe the bug

core / ubuntu-latest / hf / hf_dataset failed on main in the Behavior Test workflow.

The triggering commit only changed README.md and website/static/img/architectural.png, so the failure does not appear to be caused by an HF service code change in that commit.

The job created a temporary private dataset repo successfully:

  • OPENDAL_HF_REPO_ID=opendal/test-dataset-26216207614-test-b4d1f2ca
  • OPENDAL_HF_REPO_TYPE=dataset
  • OPENDAL_TEST=hf

The behavior run then hit repeated 10s I/O timeouts against the real HF dataset backend. The failures were spread across write, delete, and list-related tests rather than concentrated in a single assertion:

will retry Write (attempt 1) after 1s because: Unexpected (temporary) at write => io operation timeout reached
Context:
   timeout: 10

will retry Delete (attempt 1) after 1s because: Unexpected (temporary) at delete => io operation timeout reached
Context:
   timeout: 10

failures:
    behavior::test_read_full
    behavior::test_batch_delete
    behavior::test_list_file_with_recursive
    behavior::test_list_dir_with_file_path

test result: FAILED. 88 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 119.18s
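
To check whether the timeouts cluster on one operation or are spread across several, the retry lines in the job log can be tallied per operation. The following is a minimal sketch, assuming the log lines keep the shape shown above; `retried_operation` and `count_timeout_retries` are hypothetical helper names and are not part of OpenDAL.

```rust
use std::collections::BTreeMap;

/// Extract the operation name from a retry log line such as
/// "will retry Write (attempt 1) after 1s because: ...".
/// Returns None for lines that do not have that shape.
fn retried_operation(line: &str) -> Option<&str> {
    let rest = line.trim().strip_prefix("will retry ")?;
    let (op, _) = rest.split_once(" (attempt ")?;
    Some(op)
}

/// Count retries that were caused by an I/O timeout, grouped by operation.
fn count_timeout_retries(log: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for line in log.lines() {
        // Only lines that carry the timeout message are of interest here.
        if !line.contains("io operation timeout reached") {
            continue;
        }
        if let Some(op) = retried_operation(line) {
            *counts.entry(op.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

fn main() {
    // Sample input copied from the log excerpt in this issue.
    let log = "\
will retry Write (attempt 1) after 1s because: Unexpected (temporary) at write => io operation timeout reached
Context:
   timeout: 10
will retry Delete (attempt 1) after 1s because: Unexpected (temporary) at delete => io operation timeout reached
";
    for (op, n) in count_timeout_retries(log) {
        println!("{op}: {n} timeout retries");
    }
}
```

Running this over the full job log would show whether Write, Delete, or another operation dominates the retries.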

Post-job cleanup also failed:

Cleanup failed: HTTP 403: {"error":"You have read access but not the required permissions for this operation"}

This suggests we should inspect both the HF dataset behavior-test setup and the token/repo-permission model. Other possible causes are backend slowness or rate limiting, or a 10s timeout that is too aggressive for HF/XET writes under this test shape.
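
To get a feel for how much wall-clock time a single stalled operation can consume, here is a rough worst-case estimate. It is a sketch under stated assumptions: the 10s timeout and the 1s first delay come from the log above, while the backoff factor of 2 and the limit of 3 retries are illustrative guesses, not values read from the failing job or from the test configuration.

```rust
use std::time::Duration;

/// Upper bound on how long one operation can block before its final error
/// surfaces, assuming every attempt runs into the per-attempt timeout.
/// `factor` and `max_retries` are hypothetical inputs for illustration.
fn worst_case_wait(
    timeout: Duration,
    first_delay: Duration,
    factor: u32,
    max_retries: u32,
) -> Duration {
    let mut total = timeout; // the initial attempt times out
    let mut delay = first_delay;
    for _ in 0..max_retries {
        total += delay + timeout; // back off, then time out again
        delay *= factor; // exponential backoff between attempts
    }
    total
}

fn main() {
    let wait = worst_case_wait(Duration::from_secs(10), Duration::from_secs(1), 2, 3);
    println!("worst case per operation: {}s", wait.as_secs());
}
```

Under these assumptions a single operation can block for 47s, so a handful of stalled writes or deletes would account for a large share of the 119.18s run.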

cc @kszucs since you authored the recent HF/XET write path and related HF fixes.

Steps to Reproduce

Run the core behavior test matrix for the HF dataset setup on Linux:

# In CI this is generated by .github/workflows/test_behavior_core.yml
# via .github/services/hf/hf_dataset/action.yml.
OPENDAL_TEST=hf \
OPENDAL_HF_REPO_TYPE=dataset \
OPENDAL_HF_REPO_ID=<temporary-private-dataset-repo> \
OPENDAL_HF_TOKEN=<token> \
RUST_TEST_THREADS=1 \
cargo test -p opendal --features services-hf,tests behavior

The observed failure happened in GitHub Actions on Ubuntu 24.04 with Rust 1.95.0.

Expected Behavior

The HF dataset behavior test should pass reliably, or fail with a clear HF/OpenDAL error that identifies the real backend/token problem. It should not fail multiple unrelated behavior tests via generic 10s I/O timeouts.
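
As an illustration of the kind of distinction meant here, the two failure messages seen in this job can be separated by their text alone. This is a minimal sketch for triaging log output; `FailureKind` and `classify` are hypothetical names, and matching on message substrings is an assumption for illustration, not how OpenDAL reports error kinds.

```rust
/// Coarse failure categories for triaging the job log.
#[derive(Debug, PartialEq)]
enum FailureKind {
    /// The generic I/O timeout seen on write and delete.
    IoTimeout,
    /// The token or repo permission problem seen in post-job cleanup.
    Permission,
    /// Anything that matches neither pattern.
    Other,
}

/// Classify a log message by the substrings observed in this issue.
fn classify(message: &str) -> FailureKind {
    if message.contains("io operation timeout reached") {
        FailureKind::IoTimeout
    } else if message.contains("HTTP 403") {
        FailureKind::Permission
    } else {
        FailureKind::Other
    }
}

fn main() {
    let samples = [
        "Unexpected (temporary) at write => io operation timeout reached",
        "Cleanup failed: HTTP 403: {\"error\":\"You have read access but not the required permissions for this operation\"}",
    ];
    for message in samples {
        println!("{:?}: {}", classify(message), message);
    }
}
```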

Additional Context

Relevant local files:

  • .github/services/hf/hf_dataset/action.yml
  • .github/actions/hf-temp-repo/setup.js
  • .github/actions/hf-temp-repo/cleanup.js
  • core/services/hf/

Related but different issue: #7367 tracked the Java blocking HF/XET segfault and was mitigated by disabling Java HF behavior tests. This issue is about the core Rust HF dataset behavior job timing out on main.
