feat: added parallelization on key partitioned data #18919
Conversation
```rust
pub preserve_partition_values: bool,

/// Cached result of key_partition_exprs computation to avoid repeated work
#[allow(clippy::type_complexity)]
key_partition_exprs_cache: OnceLock<Option<Vec<Arc<dyn PhysicalExpr>>>>,
```
Caches the result of compute_key_partition_exprs(), which is expensive:
- loops through file groups and does hash set operations
- called multiple times (output_partitioning() and eq_properties())
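The caching pattern can be sketched in isolation. This is a hypothetical, simplified version (the names and types are illustrative stand-ins, not the PR's exact code): `OnceLock` ensures the expensive computation runs at most once per node, no matter how many callers ask for it.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counter only exists to demonstrate that the computation runs once.
static CALLS: AtomicUsize = AtomicUsize::new(0);

struct ExecNode {
    key_partition_exprs_cache: OnceLock<Option<Vec<String>>>,
}

impl ExecNode {
    fn new() -> Self {
        Self {
            key_partition_exprs_cache: OnceLock::new(),
        }
    }

    // Stand-in for the expensive compute_key_partition_exprs(): in the PR this
    // loops through file groups and does hash set operations.
    fn compute_key_partition_exprs(&self) -> Option<Vec<String>> {
        CALLS.fetch_add(1, Ordering::SeqCst);
        Some(vec!["a".to_string()])
    }

    // Both output_partitioning() and eq_properties() can call this; only the
    // first call pays for the computation.
    fn key_partition_exprs(&self) -> &Option<Vec<String>> {
        self.key_partition_exprs_cache
            .get_or_init(|| self.compute_key_partition_exprs())
    }
}
```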
```rust
}
Distribution::KeyPartitioned(_) => {
    // Nothing to do: treated as satisfied upstream
}
```
No-op because we can guarantee that our data is correctly distributed
```
02)--AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[nth_value(multiple_ordered_table.c,Int64(1)) ORDER BY [multiple_ordered_table.c ASC NULLS LAST]], ordering_mode=Sorted
03)----SortExec: expr=[a@0 ASC NULLS LAST], preserve_partitioning=[true]
04)------CoalesceBatchesExec: target_batch_size=8192
05)--------RepartitionExec: partitioning=Hash([a@0], 4), input_partitions=4
```
Eliminates this hash repartition because it would break ordering guarantees.
Thanks a lot for the description and companion doc, they are super useful. This work is super nice and is even crucial for distributed DataFusion. Reusing partitioning and avoiding repartitions can make a huge difference when the repartition is done on the network. The plans you posted as examples are exactly what we should be aiming for. I think I am still missing part of the point of
So my current understanding is:

Sorry for the wall of text, I am mostly trying to wrap my head around this, please correct anything I missed in here.
```rust
/// Allocate rows based on a hash of one or more expressions and the specified number of
/// partitions
Hash(Vec<Arc<dyn PhysicalExpr>>, usize),
/// Partitions that are already organized by disjoint key values for the provided expressions.
/// Rows that have the same values for these expressions are guaranteed to be in the same partition.
KeyPartitioned(Vec<Arc<dyn PhysicalExpr>>, usize),
```
My impression overall is that KeyPartitioned should not be adding anything that is not already representable with Hash. I was planning on doing a longer reasoning on this, but @fmonjalet is right on point in his comment here #18919 (comment), so I'd just +1 his comment, grab some 🍿, and see what comes out of it.
Yes, key partitioning guarantees that each distinct value of the key is fully contained within a single partition, which is a strictly stronger guarantee than hash partitioning. Another thing to note is that key partitioning can currently only originate at the file scan level, whereas hash partitioning of course has a repartitioning operator.
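The invariant being discussed can be stated as a small checker (a hypothetical sketch, not DataFusion code): under key partitioning, each distinct key value must live entirely inside one partition. Hash partitioning happens to satisfy this, but the guarantee itself is stronger than "rows with equal hashes are co-located".

```rust
use std::collections::HashMap;

// Returns true if every distinct key value appears in at most one partition.
// `partitions` holds the key values observed in each partition.
fn is_key_partitioned(partitions: &[Vec<&str>]) -> bool {
    let mut owner: HashMap<&str, usize> = HashMap::new();
    for (idx, part) in partitions.iter().enumerate() {
        for &key in part {
            if let Some(prev) = owner.insert(key, idx) {
                if prev != idx {
                    // The same key was seen in two different partitions.
                    return false;
                }
            }
        }
    }
    true
}
```

For example, Hive-style partitions `year=2023` and `year=2024` each own their key exclusively, so they satisfy the check.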
Yes, I believe you have the right idea but to be sure,
Yes, this is a noted limit of the original design. I added the comment: "best with moderate partition counts (10-100 partitions)." in the config. This stems from splitting distinct keys into their own partitions as of now. I did this to keep the first iteration relatively simple, as the PR is large. In a follow up issue, some great work would be to merge file groups into larger partitions to support higher cardinality.
It depends what "group" means. If we simply merge key partitioned data into a single partition, no: the result is still key partitioned, as each key is still fully in one place. If we are repartitioning or shuffling data, we lose key partitioning and fall back to hash.
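The distinction above can be sketched with a hypothetical helper (illustrative only, not PR code): coalescing whole partitions preserves the key-partitioned property, because every key still lives entirely in one partition, whereas row-level reshuffling generally does not.

```rust
// Merge partition `b` wholly into partition `a`. Because entire partitions
// are combined (no rows are split between outputs), any key that was fully
// contained in one input partition is still fully contained in one output
// partition.
fn merge_partitions(mut parts: Vec<Vec<String>>, a: usize, b: usize) -> Vec<Vec<String>> {
    assert!(a < b && b < parts.len());
    let absorbed = parts.remove(b);
    parts[a].extend(absorbed);
    parts
}
```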
For this first PR, yes, but I think this isn't a one-PR-fixes-all scenario. I think this comes down to how intentional the user is. Yes, key partitioned data is rarer than, say, hash partitioned data, but it is powerful enough for people to consider it. The use cases will also increase as follow up issues are resolved: higher cardinality, propagation through joins, etc.
BEFORE: DataSourceExec -> Aggregate Partial (gby: a) -> Repartition Hash(a) -> Aggregate Final (gby: a)

In some cases we also eliminate bottlenecks due to SPMs between aggregations:
I am in favor of keeping Hash and KeyPartitioned separate, as I see them as two distinct methods of partitioning. I also don't know if adding more information into Hash partitioning will eliminate complexity rather than just cause more indirection. I do like the idea of merging file groups for higher cardinality, as this was my main concern with this v1 (as noted in the comments), but chose to refrain due to complexity.
Do not apologize, this is a lot of the internal debate I was / am having, and I am glad to talk about the trade-offs. Let me know what you think 😄 CC: @gabotechs
I suggest that this PR remain limited in scope, not meant for high cardinality queries. This was my motivation for having this option set to false by default. Then submit follow up issues to address grouping files into partitions to help with higher cardinality. I just do not want to introduce too many things in this PR, and adding this seems like another substantial PR in itself.
Full Report
Issue 18777 Parallelize Key Partitioned Data.pdf
Which issue does this PR close?
Rationale for this change
Optimize aggregations on Hive-partitioned tables by eliminating unnecessary repartitioning/coalescing when grouping by partition columns. This enables parallel computation of complete results without a merge bottleneck.
What changes are included in this PR?
KeyPartitioned

Are these changes tested?
Benchmarking
For tpch it was unaffected as expected (not partitioned):
I created my own benchmark and saw these results:
These are not huge improvements, as in-memory hashing is pretty efficient, but they are consistent gains (ran many times).
These improvements will be crucial for distributed DataFusion, as network shuffles are much less efficient than in-memory repartitioning.
Are there any user-facing changes?
`listing_table_preserve_partition_values`