[SPARK-55278] Introduce module and core abstraction for language-agnostic UDF worker. #55089
haiyangsun-db wants to merge 6 commits into apache:master
Conversation
dtenedor
left a comment
LGTM as a start. It conforms to the SPIP as documented so far.
We can leave it open for some time for the community to take a look and possibly comment.
2ab77f7 to fe9c5b6
cloud-fan
left a comment
Summary
Prior state and problem: Spark has no language-agnostic UDF protocol. Python UDFs use an in-process pipe model (BasePythonRunner), which doesn't generalize to other languages with separate worker processes.
Design approach: This PR establishes a new udf/worker module split into proto/ (protobuf wire format + Scala wrapper) and core/ (engine-side abstractions). The design follows a Dispatcher → Session pattern: WorkerDispatcher manages worker lifecycle and pooling, WorkerSession represents a single UDF execution, and WorkerSecurityScope partitions the worker pool by security boundary.
Key design decisions:
- `WorkerDispatcher` is a trait extending `AutoCloseable`, making implementations responsible for resource cleanup
- `WorkerSession` and `WorkerSecurityScope` are abstract classes with no methods, serving as placeholders for concrete implementations in follow-up PRs
- The proto module overrides the `protobuf-java` scope from `provided` (root POM) to `compile` so generated classes are available on the compile classpath
Implementation sketch: The module is registered in the Maven reactor (pom.xml), SBT build (SparkBuild.scala — project refs, MiMa exclusions, protobuf codegen), and CI (modules.py). All new APIs are @Experimental.
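Read together, the abstractions above might look roughly like the following sketch. The `createSession` method and all parameter names are illustrative assumptions, not taken from the PR, which keeps the API surface deliberately minimal:

```scala
// Hypothetical stand-in for the typed wrapper over the protobuf spec.
final case class WorkerSpecification(language: String)

// Security boundary partitioning the worker pool (placeholder, as in the PR).
abstract class WorkerSecurityScope

// One UDF execution; several sessions may share one worker process.
abstract class WorkerSession extends AutoCloseable {
  def close(): Unit = () // default no-op, per the review discussion below
}

// Manages worker lifecycle and pooling for a given spec.
trait WorkerDispatcher extends AutoCloseable {
  // Illustrative method: concrete dispatchers would hand out sessions here.
  def createSession(spec: WorkerSpecification,
                    scope: WorkerSecurityScope): WorkerSession
}
```

A concrete dispatcher (process-based, gRPC-based, etc.) would implement `createSession` and `close`, while callers depend only on the three abstract types.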
General comments
- The PR description says `WorkerSpec` is in `core/`, but the typed wrapper is actually `WorkerSpecification` in `proto/`. The README has the correct information; the PR description should be updated to match.
```scala
 * implementation based on the [[WorkerSpecification]].
 */
@Experimental
abstract class WorkerSession
```
WorkerSession does not extend AutoCloseable, while WorkerDispatcher does. Since a session "can carry per-execution state" and implementations "may add lifecycle hooks," callers have no standard way to release per-session resources. If AutoCloseable is added later, all callers must be updated. Consider abstract class WorkerSession extends AutoCloseable from the start, even if concrete implementations initially no-op on close().
Fixed, added AutoCloseable.
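A minimal sketch of the shape this fix suggests: `WorkerSession` extends `AutoCloseable` with a default no-op `close()`, so callers can wrap every session in `scala.util.Using` (try-with-resources) from day one. The resource body here is a placeholder:

```scala
import scala.util.Using

abstract class WorkerSession extends AutoCloseable {
  // No-op default so early concrete sessions need no extra code;
  // implementations holding real resources override this.
  def close(): Unit = ()
}

// Caller side: the session is closed even if the body throws.
val result = Using.resource(new WorkerSession {}) { session =>
  42 // placeholder for dispatching the UDF through the session
}
```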
```scala
 * Workers are only reused within the same security scope.
 */
@Experimental
abstract class WorkerSecurityScope
```
The Scaladoc says "Workers are only reused within the same security scope," which implies dispatcher implementations will compare scopes for equality. The default Object.equals (reference equality) means structurally equivalent scopes won't match, silently preventing worker reuse. Consider documenting that subclasses must override equals/hashCode, or making this a sealed trait with concrete implementations that enforce it.
Fixed, made equals/hashCode mandatory.
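One low-friction way to satisfy that contract, sketched with a hypothetical `UserScope`: model concrete scopes as case classes, which derive structural `equals`/`hashCode` from their fields automatically:

```scala
// Placeholder base type, as in the PR.
abstract class WorkerSecurityScope

// A case class gets field-based equals/hashCode for free, so two
// structurally identical scopes compare equal and allow worker reuse.
final case class UserScope(user: String) extends WorkerSecurityScope

val a: WorkerSecurityScope = UserScope("alice")
val b: WorkerSecurityScope = UserScope("alice")
assert(a == b)       // structural equality: same pool partition
assert(a.## == b.##) // consistent hash for pool lookups
```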
Hi @holdenk, this is the first PR for the language-agnostic UDF work. The main goal of this PR is to set up the module and a few key worker abstractions. More detailed pieces will follow in subsequent PRs. Please feel free to take a look.
What changes were proposed in this pull request?
This PR introduces the foundational package structure and core abstractions for the language-agnostic UDF worker framework described in SPIP SPARK-55278.
The new `udf/worker` module contains two sub-modules:
- `proto/`: Protobuf definition of `UDFWorkerSpecification` (currently a placeholder; full schema to follow), plus `WorkerSpecification`, a typed Scala wrapper around the protobuf spec.
- `core/`: Engine-side APIs (all `@Experimental`):
  - `WorkerDispatcher`: manages workers for a given spec and creates sessions. Handles pooling, reuse, and lifecycle behind the scenes. Extends `AutoCloseable`.
  - `WorkerSession`: represents one single UDF execution. Not 1-to-1 with a worker process; multiple sessions may share the same underlying worker. Extends `AutoCloseable` with a default no-op `close()` so callers can use try-with-resources from the start.
  - `WorkerSecurityScope`: identifies a security boundary for worker connection pooling. Requires subclasses to implement `equals`/`hashCode` so that structurally equivalent scopes enable worker reuse.

Build integration: `project/SparkBuild.scala` updated to register the new modules and configure unidoc exclusions (JavaUnidoc only; Scala API docs are included).

Why are the changes needed?
This is the first step toward a language-agnostic UDF protocol for Spark that enables UDF workers written in any language to communicate with the Spark engine through a well-defined specification and API boundary. The abstractions introduced here establish the core contract that concrete implementations (e.g., process-based or gRPC-based workers) will build on.
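As a hedged illustration of how a concrete implementation could build on the contract, here is a sketch of the pooling rule ("workers are only reused within the same security scope"). `PoolingDispatcher`, `Worker`, and `acquire` are invented names for the sketch, not part of the PR:

```scala
import scala.collection.mutable

// Simplified stand-ins for the PR's types.
final case class WorkerSpecification(language: String)
final case class WorkerSecurityScope(tenant: String)
final class Worker // placeholder for a real worker process handle

// Workers are cached per (spec, scope); case-class equality makes
// structurally identical keys hit the same pool entry.
final class PoolingDispatcher {
  private val pool =
    mutable.Map.empty[(WorkerSpecification, WorkerSecurityScope), Worker]
  def acquire(spec: WorkerSpecification,
              scope: WorkerSecurityScope): Worker =
    pool.getOrElseUpdate((spec, scope), new Worker)
}

val d    = new PoolingDispatcher
val spec = WorkerSpecification("python")
val w1 = d.acquire(spec, WorkerSecurityScope("tenantA"))
val w2 = d.acquire(spec, WorkerSecurityScope("tenantA")) // same scope: reused
val w3 = d.acquire(spec, WorkerSecurityScope("tenantB")) // new scope: fresh worker
```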
Why introduce a separate root-level module:
Does this PR introduce any user-facing change?
No. All new APIs are marked `@Experimental` and there are no behavioral changes to existing code.

How was this patch tested?
- `WorkerAbstractionSuite` provides a basic test placeholder.
- `build/sbt unidoc` (ScalaUnidoc succeeds; JavaUnidoc excludes udf-worker modules, consistent with how `connectCommon`/`connect`/`protobuf` modules are handled).

Was this patch authored or co-authored using generative AI tooling?
Yes.