[SPARK-56280][SS] normalize NaN and +/-0.0 in streaming dedupe node#55088
richardc-db wants to merge 4 commits into apache:master
Conversation
General comment: please follow the PR template closely. There is a clear ask about the format of the message on AI usage, so please update the PR description to fit the PR template. Also, this way the normalization isn't just used to ensure binary equality in the state store for the dedup operator; it normalizes the original input row as well, which could be considered a side effect. How does the batch query handle this today?
@HeartSaVioR hmm, good point. I had figured that it would be OK to have the normalized values flow through the rest of the query plan, but it is indeed a side effect. The alternative is to run an unsafe projection directly within the streaming deduplication node itself. I believe this is what joins/aggs do: they wrap the key expressions with a "normalize" expression that isn't propagated downstream in the plan. I'll make this change; it feels safer. Also, I updated the AI tooling section of the PR! Let me know if any other changes should be made!
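For illustration, the key-normalization approach used by joins/aggregations boils down to rewriting each float/double key through a function like the one below. This is a hedged, standalone Java sketch of the semantics only; in Spark itself this is done by the `NormalizeFloatingNumbers` optimizer rule via a normalize expression, not by a helper like this.

```java
public class NormalizeFloatKeys {
    // Sketch of what a "normalize" expression does to a key value:
    // collapse every NaN bit pattern to the canonical NaN, and -0.0 to +0.0,
    // so that binary comparison of encoded key bytes matches value equality.
    static double normalize(double d) {
        if (Double.isNaN(d)) {
            return Double.NaN;   // canonical NaN, bits 0x7ff8000000000000L
        }
        if (d == 0.0d) {
            return 0.0d;         // maps -0.0 to +0.0 (they compare equal)
        }
        return d;
    }

    public static void main(String[] args) {
        // A NaN with a non-canonical bit pattern.
        double payloadNaN = Double.longBitsToDouble(0x7ff8000000000001L);

        // After normalization, both NaNs encode to identical bytes.
        System.out.println(Double.doubleToRawLongBits(normalize(payloadNaN)) ==
                           Double.doubleToRawLongBits(normalize(Double.NaN))); // true

        // After normalization, -0.0 and +0.0 encode to identical bytes.
        System.out.println(Double.doubleToRawLongBits(normalize(-0.0d)) ==
                           Double.doubleToRawLongBits(normalize(0.0d)));       // true
    }
}
```

Because the normalization happens only on the wrapped key expressions, the original input row is left untouched downstream, which avoids the side effect discussed above.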
I'm generally OK as long as we are consistent with the batch query. That would keep one of the principles of Structured Streaming: the same query, with a perfect watermark, should produce the same result in batch and streaming. The order of rows after shuffle matters here since it's a dedup, but besides that, it'd be ideal to match the behavior of the batch query.
Just to clarify: if you find that the batch query effectively normalizes the value and that this impacts the output, we can just do the same. I'd love to ask you to try running the batch query, check the behavior, and ensure that we either 1) stay in sync with the batch query or 2) avoid introducing the side effect. Either way is fine.
 * child in a ProjectExec that normalizes NaN and -0.0 so that semantically equal values produce
 * identical UnsafeRow bytes in the state store.
 *
 * Note that the streaming dedupe node does not support map-typed keys, although the
FYI: Maps are rejected for batch group keys as well; it's not streaming-specific behavior.
Also, typo.
What changes were proposed in this pull request?
As the title says: adds a project operator before the streaming dedupe node when any of the dedup keys are doubles/floats, normalizing NaN and -0.0 values in the keys.
Why are the changes needed?
If two NaN values have different bit patterns, they won't be deduplicated even though both are NaN. They should be deduplicated.
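To make the failure mode concrete, here is a standalone Java sketch (not Spark code) showing that values which compare equal as doubles can still have different raw encodings, which is exactly what a binary key comparison against state-store bytes would see:

```java
public class FloatKeyBits {
    public static void main(String[] args) {
        // -0.0 and +0.0 compare equal as doubles...
        System.out.println(-0.0d == 0.0d);                     // true

        // ...but their raw encodings differ, so a byte-for-byte key
        // comparison (as in an UnsafeRow-backed state store) sees
        // two distinct keys: sign bit set vs. all zeros.
        System.out.println(Double.doubleToRawLongBits(-0.0d) ==
                           Double.doubleToRawLongBits(0.0d));  // false

        // NaN has the analogous problem: IEEE 754 allows many NaN
        // bit patterns (any all-ones exponent with nonzero mantissa),
        // and all of them are NaN.
        double payloadNaN = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(Double.isNaN(payloadNaN));          // true
    }
}
```

Without normalization, each distinct encoding lands in the state store as a separate key, so semantically duplicate rows slip past the dedup operator.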
Does this PR introduce any user-facing change?
Yes: NaN values with different bit patterns are now considered duplicates and deduplicated.
How was this patch tested?
Added a unit test.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: claude 2.1.88 (Claude Code)