
[SPARK-56413] Add gRPC UDF execution protocol#55657

Open
haiyangsun-db wants to merge 2 commits into apache:master from haiyangsun-db:SPARK-56413

Conversation

@haiyangsun-db
Contributor

@haiyangsun-db haiyangsun-db commented May 3, 2026

What changes were proposed in this pull request?

Adds udf_protocol.proto, the gRPC wire contract between the Spark engine and a
UDF worker process, as described in the SPIP. It sits next to the existing worker_spec.proto.

Defines a Worker service with two RPCs (sketched below):

  • Execute(stream UdfRequest) returns (stream UdfResponse) — one bidirectional
    stream per UDF execution. Lifecycle on the stream: Init → 0..N
    DataRequest / DataResponse → exactly one Finish or Cancel.
    PayloadChunk streams oversized UDF bodies.
  • Manage(WorkerRequest) returns (WorkerResponse) — unary, worker-scoped
    (heartbeat, graceful shutdown).
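
Condensed into a proto sketch from the description above (the authoritative definitions live in udf_protocol.proto):

service Worker {
  // One bidirectional stream per UDF execution. Lifecycle on the stream:
  // Init -> 0..N DataRequest / DataResponse -> exactly one Finish or Cancel.
  rpc Execute(stream UdfRequest) returns (stream UdfResponse);

  // Unary, worker-scoped control plane (heartbeat, graceful shutdown).
  rpc Manage(WorkerRequest) returns (WorkerResponse);
}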

UdfPayload carries the engine-opaque callable bytes plus a format tag,
an eval_type worker-dispatch hint, and optional input/output encoders.
Init carries data_format, schemas, session_conf, task_context, and
timezone (the first key promoted out of session_conf); a reserved field
range absorbs future promotions.
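
A sketch of the two messages from this description; session_conf = 6, timezone = 7, and input_encoder = 6 come from the review quotes below, while all other field numbers and types are assumptions for illustration:

message UdfPayload {
  bytes payload = 1;                  // engine-opaque callable bytes (number assumed)
  string format = 2;                  // format tag for [[payload]] (number assumed)
  int32 eval_type = 3;                // worker-dispatch hint (number and type assumed)
  optional bytes input_encoder = 6;   // quoted in the review below
  optional bytes output_encoder = 7;  // assumed by symmetry with input_encoder
}

message Init {
  // data_format, schemas, and task_context elided; see udf_protocol.proto.
  map<string, string> session_conf = 6;
  optional string timezone = 7;       // first key promoted out of session_conf
  // reserved 8 to 15;                // hypothetical range for future promotions
}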

Also fixes two typos in common.proto (exachanged/bidrectional).

Out of scope

No planning info on the wire (no execution-shape / cardinality enum, no
chained-UDF metadata). Both can be added additively later.

Why are the changes needed?

Spark Connect's UDF support today is Python-only and tied to a Python-specific
socket protocol. Onboarding other client languages requires a structured,
language-neutral wire contract. This PR lands the proto layer; engine and
worker implementations will follow.

Does this PR introduce any user-facing change?

No. Wire contract only; not yet wired into any end-to-end path.

How was this patch tested?

Verified the proto compiles with protoc against common.proto and
worker_spec.proto, and inspected the generated descriptor for field-number
and oneof correctness. End-to-end conformance tests will land with the
engine-side client and first worker implementation.

Was this patch authored or co-authored using generative AI tooling?

Yes

@haiyangsun-db haiyangsun-db marked this pull request as ready for review May 3, 2026 16:13
@haiyangsun-db haiyangsun-db changed the title from "[SPARK-56413] Introduce the grpc protocol for UDF execution." to "[SPARK-56413] Add gRPC UDF execution protocol" May 3, 2026
// with no branch set.
oneof control {
Init init = 1;
PayloadChunk payload = 2;
Contributor


What is the reason that payload does not require a confirmation from the worker?

Contributor Author


PayloadChunks are an addition to the Init message; we can assume they're part of the Init, used only when the payload is large. They don't require a separate response: Init + (PayloadChunk)* maps to one InitResponse.
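
A sketch of the resulting exchange (the PayloadChunk field layout below is hypothetical; only the message itself appears in the quoted oneof):

// client:  Init  PayloadChunk  PayloadChunk  ...   (chunks extend the Init)
// worker:                                          InitResponse (exactly one)
message PayloadChunk {
  bytes data = 1;  // hypothetical: next slice of the oversized UdfPayload
}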

// the engine: the engine forwards [[payload]] and [[format]]
// unchanged, and the worker decodes them per the format the client
// and worker have agreed on.
message UdfPayload {
Contributor


How about payload language? I assume that because the worker is already tied to a specific language, it does not need to know what language this UDF payload is in?

Contributor Author

@haiyangsun-db haiyangsun-db May 4, 2026


Exactly, it does not matter. The worker is already provided by the worker spec; the engine doesn't care what language the payload is in.

// a typed field number from the reserved range right after this
// block and is removed from [[session_conf]]. [[timezone]] below
// is an example of a key that has already been promoted.
map<string, string> session_conf = 6;
Contributor


Why would this and task_context above have no optional prefix?

Contributor Author


A map field cannot be optional in protobuf. We leave it as an empty map when it is not set.
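
For illustration, the protobuf rule in question (the commented-out variant is rejected by protoc):

map<string, string> session_conf = 6;             // OK: absent on the wire decodes as an empty map
// optional map<string, string> session_conf = 6; // does not compile: field labels
//                                                // are not allowed on map fields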


// (Optional) Session timezone, promoted out of [[session_conf]]
// because every eval needs it for timestamp encoding/decoding.
optional string timezone = 7;
Contributor


Is string the canonical type to represent the timezone? I am afraid all kinds of conversion errors may happen with no schema/enum enforcement.

Contributor Author


This is a convention from Spark; the timezone is a string in Spark.

message Heartbeat {}

// Acknowledgment for [[Heartbeat]].
message HeartbeatAck {}
Contributor


Suggested change:
- message HeartbeatAck {}
+ message HeartbeatResponse {}

Just for some consistency?


// (Optional) Session timezone, promoted out of [[session_conf]]
// because every eval needs it for timestamp encoding/decoding.
optional string timezone = 7;
Contributor


We should specify the exact format in which the timezone will be reported, since it's a string.
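
One way to pin this down in the field comment, assuming the value follows Spark's existing spark.sql.session.timeZone convention (a Java ZoneId, i.e. a region-based ID or a fixed offset):

// (Optional) Session timezone, promoted out of [[session_conf]].
// Format: a Java ZoneId string, either a region-based ID such as
// "America/Los_Angeles" or a fixed offset such as "+08:00".
optional string timezone = 7;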

// Packed by the client side of the protocol; opaque to the
// wire protocol. Left unset whenever the worker's built-in
// decoders are sufficient.
optional bytes input_encoder = 6;
Contributor


AFAIK there are no use cases for the custom input/output encoders at the moment. Should we maybe only add them when they are needed?

def cancel(): Unit

/** Closes this session and releases resources. */
override def close(): Unit
Contributor


I think we need to clarify the exact semantics of close/finish and cancel against the background of how we could implement calling them in Spark.

From my current understanding, finish would indicate to the UDF worker that no more input batches are to be sent. The worker would then finish processing the batches it has already received/buffered and respond with a FinishResponse. From the Spark side (e.g. in an operator), we could send multiple batches and then call finish, meaning the batches and the finish message would sit in the client- or server-side gRPC buffer. If the Spark task now gets canceled, we have to wait for the UDF worker to finish, as the finish message was already sent. In this scenario, would we like to support more eager cancellation? This also somewhat depends on the processing time per batch and the buffer size. Alternatively, we can force worker termination via system-level primitives (SIGTERM/SIGKILL).

Another concern is a potential race condition between finish and cancel calls. Cancellations are most likely going to be implemented using a taskInterruptListener on the Spark task running a UDF. The callback invoked from this listener does not necessarily share the whole context of the operator execution, so it might not know whether finish has already been called/queued. Would it make sense for the cancel/finish calls from the WorkerSession to be a no-op in this case? The session has all the state and can know whether it has already been canceled/finished.

Contributor


The same concern exists between init and cancel. If we implement cancellation via a taskInterruptListener, cancel might be called before init was called.
