
[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node#55088

Open
richardc-db wants to merge 4 commits intoapache:masterfrom
richardc-db:richardc-db/streaming_dedupe_nan_and_0s

Conversation

@richardc-db
Contributor

@richardc-db richardc-db commented Mar 30, 2026

What changes were proposed in this pull request?

As in the title: adds a Project operator before the streaming dedupe node when any of the deduplication keys are doubles/floats.

Why are the changes needed?

If two NaN values have different bit patterns, they are not deduplicated, even though both are NaN. They should be considered duplicates. The same applies to +0.0 and -0.0, which compare as equal but have different binary representations.
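To illustrate the problem outside of Spark, here is a minimal Python sketch (the bit patterns used are illustrative; the exact NaN payload produced by `float("nan")` is platform-dependent). Semantically equal floating-point values can have different raw bytes, so a state store keyed on binary equality misses the duplicate unless the key is normalized first:

```python
import struct

def bits(x: float) -> int:
    """Return the raw 64-bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Two values that are both NaN but carry different payload bits.
quiet_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]
other_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000001))[0]

assert bits(quiet_nan) != bits(other_nan)  # different bytes in the state store

# +0.0 and -0.0 compare equal but also differ bytewise.
assert 0.0 == -0.0
assert bits(0.0) != bits(-0.0)

def normalize(x: float) -> float:
    """Canonicalize NaN and -0.0 so equal values share one bit pattern."""
    if x != x:    # only NaN is unequal to itself
        return float("nan")
    if x == 0.0:  # collapses -0.0 into +0.0
        return 0.0
    return x

# After normalization, semantically equal values are bytewise equal.
assert bits(normalize(quiet_nan)) == bits(normalize(other_nan))
assert bits(normalize(-0.0)) == bits(normalize(0.0))
```

This is the same class of normalization Spark applies elsewhere (e.g. for group-by and join keys) so that binary comparisons agree with semantic equality.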

Does this PR introduce any user-facing change?

Yes - NaNs with different bit patterns are now considered duplicates and deduplicated.

How was this patch tested?

Added a unit test.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: claude 2.1.88 (Claude Code)

@HeartSaVioR HeartSaVioR changed the title [SPARK-56280][STREAMING] normalize NaN and +/-0.0 in streaming dedupe node [SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node Mar 31, 2026
@HeartSaVioR
Contributor

General comment: please follow the PR template closely - there is a clear ask about the format of the message on AI usage. Please update the PR description to fit the PR template.

Also, this way the normalization isn't just used to ensure binary equality in the state store for the dedup operator; it also normalizes the original input row, which could be considered a side effect. How does the batch query handle this today?

@richardc-db
Contributor Author

Also, this way the normalization isn't just used to ensure binary equality in the state store for the dedup operator; it also normalizes the original input row, which could be considered a side effect. How does the batch query handle this today?

@HeartSaVioR Hmm, good point. I had figured it would be OK for the normalized values to flow through the rest of the query plan, but it is indeed a side effect.

The alternative would be to run an unsafe projection directly within the streaming deduplication node itself. I believe this is what joins/aggs do: they wrap the key expressions in a "normalize" expression that isn't propagated downstream in the plan. I'll make this change; it feels safer.
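For intuition, here is a plain-Python sketch (hypothetical names, not Spark's actual API) of the operator-local approach: normalize only the key used for the state lookup, while the row emitted downstream stays untouched, so no side effect leaks into the rest of the plan:

```python
def normalize(x):
    """Canonicalize NaN and -0.0 for floats; leave other values alone."""
    if isinstance(x, float):
        if x != x:    # NaN
            return float("nan")
        if x == 0.0:  # collapses -0.0 into +0.0
            return 0.0
    return x

def streaming_dedup(rows, key_cols):
    """Toy dedup: 'seen' stands in for the state store."""
    seen = set()
    for row in rows:
        # Normalize only the key used for the state-store lookup...
        key = tuple(normalize(row[c]) for c in key_cols)
        if key not in seen:
            seen.add(key)
            yield row  # ...but emit the original, un-normalized row downstream.

rows = [{"k": -0.0, "v": 1}, {"k": 0.0, "v": 2}]
out = list(streaming_dedup(rows, ["k"]))
# The two keys collide after normalization, so only the first row survives,
# and it still carries its original -0.0 value.
assert out == [{"k": -0.0, "v": 1}]
```

Contrast this with the original approach of a separate Project before the dedup node, where the normalized values would replace the originals for every downstream operator.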

Also, I updated the AI tooling section of the PR description! Let me know if any other changes are needed!

@HeartSaVioR
Contributor

I'm generally OK as long as we are consistent with the batch query - that preserves one of the principles of Structured Streaming: the same query with a perfect watermark should produce the same result in batch and streaming. The order of rows after a shuffle matters here since it's a dedup, but beyond that, it'd be ideal to match the behavior of the batch query.

@HeartSaVioR
Contributor

Just to clarify: if you find that the batch query effectively normalizes the value and impacts the output, we can just do the same. I'd only ask you to run the batch query, check the behavior, and ensure that we either 1) stay in sync with the batch query or 2) avoid introducing the side effect. Either way is fine.

* child in a ProjectExec that normalizes NaN and -0.0 so that semantically equal values produce
* identical UnsafeRow bytes in the state store.
*
* Note that the streaming dedupe node does not support map-typed keys, although the
FYI: Maps are rejected for batch group keys as well; it's not streaming-specific behavior.

Also, typo.
