[Multi-GPU Polars] Unify streaming engine options#21930
[Multi-GPU Polars] Unify streaming engine options#21930madsbk wants to merge 4 commits intorapidsai:mainfrom
Conversation
872e49a to
54be0e8
Compare
54be0e8 to
cd99655
Compare
cd99655 to
c995fc1
Compare
This is a single commit intended to be cherry-picked onto `main` once rapidsai#21930 has been merged.
TomAugspurger
left a comment
There was a problem hiding this comment.
Did you consider nesting the options under their category (e.g. streaming_options.rapidsmpf.num_streaming_threads instead of streaming_options.num_streaming_threads?
I like this because
- It avoids an risk of a name clash between options from different categories
- It simplifies usage of the options later on (you don't need the
categorywhen converting to rapidsmpf Options, for example).
| ---------- | ||
| num_streaming_threads | ||
| Threads used to execute coroutines. | ||
| Env: ``RAPIDSMPF_NUM_STREAMING_THREADS``. |
There was a problem hiding this comment.
IMO, any all environment variables should be prefixed by CUDF_POLARS_.
I'd also recommend the convention CUDF_POLARS__<CATEGORY>__<name> to avoid any possibility of a name clash between two options with the same name from different categories.
There was a problem hiding this comment.
Do we really want to have duplicate options like CUDF_POLARS_LOG and RAPIDSMPF_LOG? I think it might be better to accept that RapidsMPF-specific options use a different prefix?
There was a problem hiding this comment.
Just to be clear, the environment variable would be CUDF_POLARS_RAPIDSMPF_LOG, not CUDF_POLARS_LOG.
My preference would be to prefix everything. Then there's no guessing about the environment variable names, it's always CUDF_POLARS_<name> (or CUDF_POLARS__<CATEGORY>__<NAME> if we nest things).
Special casing rapidsmpf variables to not have the prefix CUDF_POLARS prefix isn't the worst thing, just something to remember.
There was a problem hiding this comment.
So, if the user has RAPIDSMPF_NUM_STREAMING_THREADS set in their environment, that value will be used unless they have CUDF_POLARS_RAPIDSMPF_NUM_STREAMING_THREADS set. Right?
There was a problem hiding this comment.
I think that would be the ultimate effect, given that (IIUC) it would eventually fall back to the rapidsmpf default, which is to read RAPIDSMPF_NUM_STREAMING_THREADS from the environment.
| # ---- RapidsMPF runtime ---- | ||
| num_streaming_threads: int | _Unspecified = _opt("rapidsmpf") | ||
| num_streams: int | _Unspecified = _opt("rapidsmpf") | ||
| log: str | _Unspecified = _opt("rapidsmpf") |
There was a problem hiding this comment.
This should be the literal levels allowed.
| log: str | _Unspecified = _opt("rapidsmpf") | |
| log: Literal["NONE", "PRINT", "WARN", "INFO", "DEBUG", "TRACE"] | _Unspecified = _opt("rapidsmpf") |
| spill_device_limit: str | _Unspecified = _opt("rapidsmpf") | ||
| periodic_spill_check: str | _Unspecified = _opt("rapidsmpf") | ||
| # ---- Executor ---- | ||
| rapidsmpf_py_executor_max_workers: int | _Unspecified = _opt("executor") |
There was a problem hiding this comment.
It's a bit strange to have an option with the "rapidsmpf" prefix under the executor category.
| ``RAPIDSMPF_*`` environment variable (if set); otherwise the rapidsmpf | ||
| C++ library applies its own built-in default. | ||
| """ | ||
| from rapidsmpf.config import Options, get_environment_variables |
There was a problem hiding this comment.
We should be able to import rapidsmpf at the top of the file here.
| env = get_environment_variables() | ||
|
|
||
| opts: dict[str, str] = {} | ||
| for f in dataclasses.fields(self): |
There was a problem hiding this comment.
There's also dataclasses.asdict(self), which would avoid the need for a getattr later on.
I guess we'd lose the category which is on the field rather than the fields value, but we nest the options then that'd be present (and you'd just do something like dataclasses.asdict(self.rapidsmpf)).
| known = {f.name for f in dataclasses.fields(cls)} | ||
| unknown = d.keys() - known | ||
| if unknown: | ||
| raise TypeError( | ||
| f"StreamingOptions.from_dict() got unknown field(s): " | ||
| f"{', '.join(sorted(unknown))}" | ||
| ) |
There was a problem hiding this comment.
Why not let the dataclass's __init__ handle this for us?
| return cls(**kwargs) | ||
|
|
||
| @classmethod | ||
| def from_argparse(cls, args: argparse.Namespace) -> StreamingOptions: |
There was a problem hiding this comment.
This shouldn't be part of the public API IMO.
If we want it (presumably for our benchmarks file) then let's make it private.
There was a problem hiding this comment.
I was thinking this could be generally useful. Most applications built on cudf-polars would likely want to support these kinds of command-line arguments, right?
There was a problem hiding this comment.
Mmm I'm not sure. In my experience (primarily deploying services on Kubernetes) most configuration is done through some kind of centralized management system that places the configuration files / environment variables before the application starts up.
If we're able to use it internally (e.g. for benchmarks) then great. But I'd caution against advertising this as a stable, public API.
Introduces
StreamingOptions, a single typed dataclass that unifies all RapidsMPF, executor, and engine configuration in one place.SPMD, Ray, and the upcoming Dask frontend now expose a
from_optionsclassmethod:This is the first step toward standardizing the many configuration options we currently support.