Skip to content

[Multi-GPU Polars] Unify streaming engine options#21930

Open
madsbk wants to merge 4 commits intorapidsai:mainfrom
madsbk:StreamingOptions
Open

[Multi-GPU Polars] Unify streaming engine options#21930
madsbk wants to merge 4 commits intorapidsai:mainfrom
madsbk:StreamingOptions

Conversation

@madsbk
Copy link
Copy Markdown
Member

@madsbk madsbk commented Mar 25, 2026

Introduces StreamingOptions, a single typed dataclass that unifies all RapidsMPF, executor, and engine configuration in one place.

SPMD, Ray, and the upcoming Dask frontend now expose a from_options classmethod:

engine = SPMDEngine.from_options(opts)
engine = RayEngine.from_options(opts)
engine = DaskEngine.from_options(opts)  # upcoming

This is the first step toward standardizing the many configuration options we currently support.

@madsbk madsbk self-assigned this Mar 25, 2026
@madsbk madsbk added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 25, 2026
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Mar 25, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Mar 25, 2026
@madsbk madsbk force-pushed the StreamingOptions branch from 872e49a to 54be0e8 Compare March 25, 2026 21:18
@madsbk madsbk marked this pull request as ready for review March 26, 2026 06:51
@madsbk madsbk requested a review from a team as a code owner March 26, 2026 06:51
@madsbk madsbk force-pushed the StreamingOptions branch from 54be0e8 to cd99655 Compare March 26, 2026 06:51
@rapidsai rapidsai deleted a comment from copy-pr-bot bot Mar 26, 2026
@rapidsai rapidsai deleted a comment from copy-pr-bot bot Mar 26, 2026
@madsbk madsbk force-pushed the StreamingOptions branch from cd99655 to c995fc1 Compare March 26, 2026 06:53
madsbk added a commit to madsbk/cudf that referenced this pull request Mar 26, 2026
This is a single commit intended to be cherry-picked onto `main`
once rapidsai#21930 has been merged.
Copy link
Copy Markdown
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you consider nesting the options under their category (e.g. streaming_options.rapidsmpf.num_streaming_threads instead of streaming_options.num_streaming_threads?

I like this because

  1. It avoids an risk of a name clash between options from different categories
  2. It simplifies usage of the options later on (you don't need the category when converting to rapidsmpf Options, for example).

----------
num_streaming_threads
Threads used to execute coroutines.
Env: ``RAPIDSMPF_NUM_STREAMING_THREADS``.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, any all environment variables should be prefixed by CUDF_POLARS_.

I'd also recommend the convention CUDF_POLARS__<CATEGORY>__<name> to avoid any possibility of a name clash between two options with the same name from different categories.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to have duplicate options like CUDF_POLARS_LOG and RAPIDSMPF_LOG? I think it might be better to accept that RapidsMPF-specific options use a different prefix?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, the environment variable would be CUDF_POLARS_RAPIDSMPF_LOG, not CUDF_POLARS_LOG.

My preference would be to prefix everything. Then there's no guessing about the environment variable names, it's always CUDF_POLARS_<name> (or CUDF_POLARS__<CATEGORY>__<NAME> if we nest things).

Special casing rapidsmpf variables to not have the prefix CUDF_POLARS prefix isn't the worst thing, just something to remember.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if the user has RAPIDSMPF_NUM_STREAMING_THREADS set in their environment, that value will be used unless they have CUDF_POLARS_RAPIDSMPF_NUM_STREAMING_THREADS set. Right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be the ultimate effect, given that (IIUC) it would eventually fall back to the rapidsmpf default, which is to read RAPIDSMPF_NUM_STREAMING_THREADS from the environment.

# ---- RapidsMPF runtime ----
num_streaming_threads: int | _Unspecified = _opt("rapidsmpf")
num_streams: int | _Unspecified = _opt("rapidsmpf")
log: str | _Unspecified = _opt("rapidsmpf")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the literal levels allowed.

Suggested change
log: str | _Unspecified = _opt("rapidsmpf")
log: Literal["NONE", "PRINT", "WARN", "INFO", "DEBUG", "TRACE"] | _Unspecified = _opt("rapidsmpf")

spill_device_limit: str | _Unspecified = _opt("rapidsmpf")
periodic_spill_check: str | _Unspecified = _opt("rapidsmpf")
# ---- Executor ----
rapidsmpf_py_executor_max_workers: int | _Unspecified = _opt("executor")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit strange to have an option with the "rapidsmpf" prefix under the executor category.

``RAPIDSMPF_*`` environment variable (if set); otherwise the rapidsmpf
C++ library applies its own built-in default.
"""
from rapidsmpf.config import Options, get_environment_variables
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to import rapidsmpf at the top of the file here.

env = get_environment_variables()

opts: dict[str, str] = {}
for f in dataclasses.fields(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also dataclasses.asdict(self), which would avoid the need for a getattr later on.

I guess we'd lose the category which is on the field rather than the fields value, but we nest the options then that'd be present (and you'd just do something like dataclasses.asdict(self.rapidsmpf)).

Comment on lines +316 to +322
known = {f.name for f in dataclasses.fields(cls)}
unknown = d.keys() - known
if unknown:
raise TypeError(
f"StreamingOptions.from_dict() got unknown field(s): "
f"{', '.join(sorted(unknown))}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not let the dataclass's __init__ handle this for us?

return cls(**kwargs)

@classmethod
def from_argparse(cls, args: argparse.Namespace) -> StreamingOptions:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be part of the public API IMO.

If we want it (presumably for our benchmarks file) then let's make it private.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking this could be generally useful. Most applications built on cudf-polars would likely want to support these kinds of command-line arguments, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm I'm not sure. In my experience (primarily deploying services on Kubernetes) most configuration is done through some kind of centralized management system that places the configuration files / environment variables before the application starts up.

If we're able to use it internally (e.g. for benchmarks) then great. But I'd caution against advertising this as a stable, public API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

4 participants