
Implement Glushkov's NFA into cuDF #21936

Draft
lingyany-nv wants to merge 17 commits into rapidsai:main from lingyany-nv:lingyany/glushkov-nfa


@lingyany-nv lingyany-nv commented Mar 25, 2026

Description

Add bit-parallel Glushkov NFA regex engine with shared memory optimization

Implement a Glushkov NFA for regex string matching in cuDF, as a more GPU-friendly alternative to the Thompson NFA used by the current cuDF regex engine. References: (1) the Hyperscan paper, (2) the HybridSA paper, (3) the Vectorscan repo. The Glushkov engine represents NFA state as a single uint64_t bitmask (max 64 positions), so it requires zero GPU working memory per thread, whereas the Thompson NFA needs per-thread state arrays. A shared memory cache further accelerates execution by cooperatively loading read-only program data (reach masks, shift masks, exception successors) into SMEM at kernel entry.
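To make the single-bitmask idea concrete, here is a minimal host-side C++ sketch of one bit-parallel Glushkov step. It is an illustration only, not the code in this PR: the `glushkov_program` layout, `make_example_program`, `follow_of`, and `contains_match` are hypothetical names, and it uses a plain per-position follow table rather than the shift masks and exception successors mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical program layout for illustration; not the PR's actual format.
struct glushkov_program {
  std::array<std::uint64_t, 256> reach{};  // reach[c]: positions whose symbol accepts byte c
  std::vector<std::uint64_t> follow;       // follow[i]: positions allowed right after position i
  std::uint64_t first = 0;                 // positions that can begin a match
  std::uint64_t last  = 0;                 // positions that can end a match
};

// Hand-built program for the regex a(b|c)*d.
// Positions: 0 = 'a', 1 = 'b', 2 = 'c', 3 = 'd'.
inline glushkov_program make_example_program() {
  glushkov_program p;
  p.reach['a'] = 1ull << 0;
  p.reach['b'] = 1ull << 1;
  p.reach['c'] = 1ull << 2;
  p.reach['d'] = 1ull << 3;
  // After 'a', 'b', or 'c' the next position may be 'b', 'c', or 'd'; nothing follows 'd'.
  p.follow = {0b1110, 0b1110, 0b1110, 0b0000};
  p.first  = 0b0001;
  p.last   = 0b1000;
  assert(p.follow.size() <= 64 && "state must fit in one uint64_t");
  return p;
}

// Union of the follow sets of every position that is active in `state`.
inline std::uint64_t follow_of(const glushkov_program& p, std::uint64_t state) {
  std::uint64_t next = 0;
  for (std::size_t i = 0; i < p.follow.size(); ++i) {
    if ((state >> i) & 1u) { next |= p.follow[i]; }
  }
  return next;
}

// Unanchored "contains": the whole NFA state is one 64-bit word.
inline bool contains_match(const glushkov_program& p, const std::string& text) {
  std::uint64_t state = 0;
  for (unsigned char c : text) {
    // A match may start at any character, so `first` is injected on every step.
    state = (follow_of(p, state) | p.first) & p.reach[c];
    if (state & p.last) { return true; }
  }
  return false;
}
```

With this program, `contains_match(make_example_program(), "xxabcbd")` returns true and `contains_match(make_example_program(), "abc")` returns false. The only per-string working state is the single `state` word, which is the property the description relies on.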

Key changes

  • Two-phase O(n) unanchored search algorithm (glushkov.inl): Phase 1 scans forward, injecting start states at each character and recording provisional match ends. Phase 2 rescans only the match region to find the true leftmost start. Each character is processed at most twice.
  • Leftmost-first correctness via priority-kill (glushkov.inl, glushkov_regcomp.cpp): A runtime glushkov_priority_kill clears lower-priority alternative paths at accept time. A compile-time conflict detector (frontier_has_priority_conflict) conservatively falls back to Thompson when bit-index ordering cannot guarantee Thompson-compatible leftmost-first semantics.
  • Automatic fallback: Patterns with anchors (^, $, \b, \B), >64 positions, nullable top-level expressions, capture group requirements (extract, backref_re), or priority conflicts transparently fall back to Thompson NFA — no user intervention needed.
  • Shared memory DataSource abstraction (glushkov.cuh, utilities.cuh): compute_follow and compute_reach are templated over the data source (glushkov_global_source or glushkov_shmem_source), with cooperative SMEM loading in the kernel wrappers.
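The data-source templating in the last bullet can be sketched on the host as follows. This is a rough analogy under stated assumptions, not the PR's code: `program_tables`, `direct_source`, `cached_source`, `contains_match`, and `make_ab_plus` are hypothetical names, the "cache" is an ordinary copy standing in for shared memory, and the step uses a simple follow table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::size_t max_positions = 64;

// Read-only program data (hypothetical layout).
struct program_tables {
  std::array<std::uint64_t, 256> reach{};
  std::array<std::uint64_t, max_positions> follow{};
  std::uint64_t first = 0;
  std::uint64_t last  = 0;
  std::size_t num_positions = 0;
};

// Source 1: reads through a pointer to the original tables (stands in for global memory).
struct direct_source {
  const program_tables* tables;
  std::uint64_t reach(unsigned char c) const { return tables->reach[c]; }
  std::uint64_t follow(std::size_t i) const { return tables->follow[i]; }
  std::uint64_t first() const { return tables->first; }
  std::uint64_t last() const { return tables->last; }
  std::size_t num_positions() const { return tables->num_positions; }
};

// Source 2: owns a copy made once up front (stands in for a shared-memory cache
// that a thread block would fill cooperatively at kernel entry).
struct cached_source {
  program_tables cache;
  explicit cached_source(const program_tables& t) : cache(t) {}
  std::uint64_t reach(unsigned char c) const { return cache.reach[c]; }
  std::uint64_t follow(std::size_t i) const { return cache.follow[i]; }
  std::uint64_t first() const { return cache.first; }
  std::uint64_t last() const { return cache.last; }
  std::size_t num_positions() const { return cache.num_positions; }
};

// The matching loop is written once and works with either source.
template <typename Source>
bool contains_match(const Source& src, const std::string& text) {
  std::uint64_t state = 0;
  for (unsigned char c : text) {
    std::uint64_t next = src.first();  // unanchored: inject start positions every step
    for (std::size_t i = 0; i < src.num_positions(); ++i) {
      if ((state >> i) & 1u) { next |= src.follow(i); }
    }
    state = next & src.reach(c);
    if (state & src.last()) { return true; }
  }
  return false;
}

// Hand-built tables for the regex ab+ (positions: 0 = 'a', 1 = 'b').
inline program_tables make_ab_plus() {
  program_tables t;
  t.num_positions = 2;
  assert(t.num_positions <= max_positions);
  t.reach['a'] = 1ull << 0;
  t.reach['b'] = 1ull << 1;
  t.follow[0]  = 1ull << 1;  // 'b' may follow 'a'
  t.follow[1]  = 1ull << 1;  // 'b' may follow 'b'
  t.first      = 1ull << 0;
  t.last       = 1ull << 1;
  return t;
}
```

Because the source type is a template parameter, the choice between the two memory paths is resolved at compile time and the matching logic cannot drift between them.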

Limitations

  • Does not support capturing groups (e.g. extract, replace_with_backrefs)
  • Does not support zero-width assertions like BOL/EOL/BOW/NBOW
  • Supports at most 64 character-consuming positions, since the state is a uint64_t
  • Does not support lazy quantifiers
  • Rejects empty/degenerate patterns
  • Does not support nullable patterns or some ambiguous alternation patterns

When any of the above conditions is detected, the engine falls back to the current Thompson NFA.
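The fallback rule above amounts to a single eligibility check at compile time of the pattern. A minimal sketch, assuming a hypothetical `pattern_traits` summary and `choose_engine` helper (neither is taken from the PR), might look like this:

```cpp
#include <cstddef>

enum class regex_engine { glushkov, thompson };

// Hypothetical summary of a compiled pattern; field names are illustrative only.
struct pattern_traits {
  std::size_t consuming_positions = 0;    // character-consuming positions in the pattern
  bool has_zero_width_assertion = false;  // ^, $, \b, \B
  bool needs_capture_groups = false;      // e.g. extract, replace_with_backrefs
  bool has_lazy_quantifier = false;
  bool is_nullable = false;               // can match the empty string
  bool has_priority_conflict = false;     // ambiguous alternation detected conservatively
};

constexpr std::size_t max_glushkov_positions = 64;  // one bit per position in a uint64_t

// Returns the Glushkov engine only when none of the listed limitations applies.
inline regex_engine choose_engine(const pattern_traits& t) {
  const bool unsupported = t.consuming_positions == 0 ||  // empty/degenerate pattern
                           t.consuming_positions > max_glushkov_positions ||
                           t.has_zero_width_assertion || t.needs_capture_groups ||
                           t.has_lazy_quantifier || t.is_nullable || t.has_priority_conflict;
  return unsupported ? regex_engine::thompson : regex_engine::glushkov;
}
```

Keeping the decision in one predicate means callers never need to know which engine ran, which matches the "no user intervention" behaviour described in the key changes.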

Unit tests + benchmark

  • Priority-kill parity tests: Verify Glushkov matches Thompson for overlapping-prefix alternations (foo|foobar, cat|catch, a|aa) across all 5 operations (contains, count, findall, replace, split)
  • Nullable fallback parity: Confirm nullable patterns (a*, \d*, (ab)?) transparently fall back to Thompson and produce identical results
  • Mixed-engine regression test (MixedEngineReplace): Exercises multi-pattern replace where one pattern is Glushkov-backed and another falls back to Thompson
  • Spark-rapids compatibility: ~60 regex patterns from spark-rapids integration tests validated under both engines via parametrized Python tests
  • Benchmarks: 6–9 patterns per benchmark covering char classes, alternation, bounded repetition, dot wildcards, and late-failure stress patterns; state.skip() guards for Glushkov-unsupported combinations (anchors, backrefs)
  • Extended the current split_re/contains/replace_re/count benchmarks with more complex regexes, which showed a 1.01x to 6.62x speedup.
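For readers unfamiliar with the leftmost-first rule that the priority-kill parity tests check, the following brute-force reference shows the expected spans for overlapping-prefix alternations. It is a sketch for literal alternatives only, with hypothetical names (`match_span`, `leftmost_first`); it is not one of the tests in this PR and does not use either engine.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct match_span {
  std::size_t begin;
  std::size_t end;  // one past the last matched character
  bool found;
};

// Leftmost-first search over literal alternatives: the earliest start wins,
// and at that start the first alternative listed in the pattern wins.
inline match_span leftmost_first(const std::vector<std::string>& alternatives,
                                 const std::string& text) {
  for (std::size_t start = 0; start < text.size(); ++start) {
    for (const std::string& alt : alternatives) {
      // compare() only inspects the characters available from `start`,
      // so an alternative longer than the remaining text never matches.
      if (!alt.empty() && text.compare(start, alt.size(), alt) == 0) {
        return {start, start + alt.size(), true};
      }
    }
  }
  return {0, 0, false};
}
```

For example, `leftmost_first({"foo", "foobar"}, "xfoobar")` yields the span [1, 4), i.e. "foo", while swapping the alternatives to `{"foobar", "foo"}` yields [1, 7). An engine that reports the longer match for foo|foobar would fail parity with leftmost-first semantics.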

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lingyany-nv lingyany-nv requested review from a team as code owners March 25, 2026 21:09

copy-pr-bot Bot commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the libcudf, Python, CMake, and pylibcudf labels Mar 25, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Mar 25, 2026
@lingyany-nv lingyany-nv marked this pull request as draft March 25, 2026 21:24
@PointKernel PointKernel added the 2 - In Progress, feature request, and non-breaking labels Mar 25, 2026
@GregoryKimball GregoryKimball changed the title from "[draft] Implement Glushkov's NFA into cudf" to "[draft] Implement Glushkov's NFA into cuDF" Apr 8, 2026
@lingyany-nv lingyany-nv force-pushed the lingyany/glushkov-nfa branch from 3712832 to 2eaec3c Compare April 13, 2026 20:27
@lingyany-nv lingyany-nv marked this pull request as ready for review April 13, 2026 21:36
@lingyany-nv lingyany-nv changed the title from "[draft] Implement Glushkov's NFA into cuDF" to "Implement Glushkov's NFA into cuDF" Apr 13, 2026
@davidwendt davidwendt marked this pull request as draft April 13, 2026 21:42
