Releases: datafusion-contrib/datafusion-distributed
v2.0.0
What's Changed
- Expose WorkerServiceServer publicly by @gabrielkerr in #406
- Asynchronous distributed planning by @gabotechs in #383
- Factor out UUID functions into common module by @gabotechs in #400
- Refactor send plan task to a struct by @gabotechs in #401
- refactor / add benches for shuffle and transport shuffling by @gene-bordegaray in #408
- Remove batch_coalescing_below_network_boundaries rule by @EdsonPetry in #407
- Leverage PartialReduce AggregationExec mode to reduce shuffle size by @nchu05 in #396
- Introduce WorkUnit concept for streaming units of work to leaf nodes at runtime by @gabotechs in #411
- Add msg count metric for worker connection pool metrics by @gabotechs in #414
- Fix metrics loss on early stream termination by @LiaCastaneda in #415
- Buffering improvements to inter-worker communication by @gabotechs in #419
- Remove trailing whitespace from plan task display by @gene-bordegaray in #420
- fix union recomputing child properties by @gene-bordegaray in #423
- Update Readme by @Rich-T-kid in #433
- Cherry pick LocalWorkerContext by @gabotechs in #435
- Task Routing with the Task Estimator: Allow users to map tasks to URLs by @JSOD11 in #409
- speed up test execution time by transitioning to in_memory worker by @Rich-T-kid in #428
- Fix TPCDS runner explain analyze await by @gene-bordegaray in #442
- swap std for hashbrown by @Rich-T-kid in #444
- Remove artificial broadcast task limit in the build side by @gabotechs in #422
- Refactor Stage struct to be more explicit about states by @gabotechs in #424
- set up git hooks by @Rich-T-kid in #446
- Harden work unit tests by @gabotechs in #418
- Create Clippy rule disallowing std::collections::HashMap by @Rich-T-kid in #447
- feat: use tpchgen-rs for benchmarks by @clflushopt in #443
- Remove 2-pass planner (annotator + distribution) in favor of a 1-pass planner by @gabotechs in #416
- Refactor coordinator-worker channel code by @gabotechs in #425
- Refactor distributed.rs into coordinator module by @gabotechs in #426
- Add reproducer tests for ChildrenIsolatorUnionExec bug by @gabotechs in #455
- Un-ignore test by @Rich-T-kid in #452
- Fix ChildrenIsolatorUnionExec budget mismatch via weight-based proportional allocation by @gabotechs in #456
- update benchmark docs to match --path -> --dataset rename by @kentkr in #466
- push fetches into network coalesces by @gene-bordegaray in #468
- Use in-memory comms if both workers involved in an exchange happen to be the same by @gabotechs in #427
- Factor out distributed recursion by @gabotechs in #469
- Introduce DistributedLeafExec by @gabotechs in #467
- Add WorkUnitFileScanConfig for testing WorkUnits with real data by @gabotechs in #448
- Add metrics to WorkUnitFeed by @gabotechs in #453
- Collect stage-level metrics by @gabotechs in #461
- Report distributed tasks in remote benchmarks by @gabotechs in #462
- Fix early worker partition drop in RemoteWorkerConnection demux by @gabotechs in #473
- Fix ClickBench EventDate column type (UInt16 → Date32) by @gabotechs in #472
- Regenerate tpcds plans by @gabotechs in #481
- No eager buffering in network connections by @gabotechs in #477
- Rebalance files globally across the partitions produced by FileScanConfigTaskEstimator by @shehab-ali in #450
- Make DistributedTaskContext Copy by @homa31 in #475
- Benchmarks housekeeping by @gabotechs in #485
- Upgrade DF to v54 by @LiaCastaneda in #465
- Fix benchmarks build: remove nonexistent with_distributed_dynamic_task_count call by @EdsonPetry in #488
- Add broadcast support for cross and nested loop joins by @EdsonPetry in #484
- fix: await async metric delivery in metrics collection tests by @sjhddh in #490
- unskip metric test by @Rich-T-kid in #493
- Add NetworkBoundaryBuilder argument to inject_network_boundaries by @gabotechs in #463
- Add MaxGauge metric by @gabotechs in #464
- Lazily set the producer head at execution time by @gabotechs in #478
- #313: consolidate test plan helpers by @kentkr in #491
- Chunk work unit feeds by @gabotechs in #492
- Rework leaf task estimation around bytes-per-partition + per-task DistributedLeafExec visualization by @gabotechs in #496
- Improve docs and examples by @gabotechs in #494
New Contributors
- @nchu05 made their first contribution in #396
- @Rich-T-kid made their first contribution in #433
- @clflushopt made their first contribution in #443
- @kentkr made their first contribution in #466
- @shehab-ali made their first contribution in #450
- @homa31 made their first contribution in #475
- @sjhddh made their first contribution in #490
Full Changelog: v1.0.0...v2.0.0
v1.0.0
What's Changed
- Bring https://github.com/gabotechs/datafusion-distributed-experiment code by @gabotechs in #68
- Adds error serialization-deserialization by @gabotechs in #69
- Remove stage delegation in favor of planning-time stage assignation by @gabotechs in #71
- Fix rust toolchain to 1.83.0 by @gabotechs in #72
- Completed execution path + failing test by @robtandy in #74
- Fix serialization error by @LiaCastaneda in #76
- Small cleanup after #74 by @gabotechs in #75
- Fix ArrowFlightReadExec result streaming by @gabotechs in #77
- Add stage planner tests by @gabotechs in #78
- Split ArrowFlightReadExec node placement for distributed planning by @gabotechs in #79
- Update DataFusion version from 48.0.0 to 49.0.0 by @gabotechs in #82
- add doc comment for execution stage struct by @robtandy in #80
- Support user provided codecs by @gabotechs in #81
- Move all test utils to src/ and hide them behind an "integration" feature by @gabotechs in #84
- Add test comparing distributed + single node execution on TPCH data by @jayshrivastava in #83
- Execution working on all 22 TPCH queries by @robtandy in #89
- Add delta report for benchmarks by @gabotechs in #91
- Removes an extra line jump in distributed explains by @gabotechs in #95
- Create TTL map with time wheel architecture by @jayshrivastava in #96
- Fix compilation errors and warnings by @gabotechs in #102
- Introduce ConfigExtensionExt, allowing the propagation of arbitrary ConfigExtensions across network boundaries by @gabotechs in #100
- Nested Loop Joins (fixes TPCH query 22) by @robtandy in #104
- Improve SessionBuilder ergonomy and fix clippy errors by @gabotechs in #103
- Collect Left Hash Joins by @robtandy in #105
- Introduce DistributedExt trait that extends the capabilities of DataFusion's session building tools by @gabotechs in #106
- Add plan validations to TPCH tests by @gabotechs in #107
- do_get: use TTL map to store task state by @jayshrivastava in #108
- Refactor arrow_flight_read.rs and friends by @gabotechs in #109
- Add localhost_run.rs and localhost_worker.rs examples by @gabotechs in #111
- Add README.md and LICENSE.txt by @gabotechs in #114
- Fix panics in tests and un-ignore working tests by @gabotechs in #120
- Improve EXPLAIN render by @gabotechs in #121
- Bigger TPCH tests by @gabotechs in #122
- File name and folder restructure by @gabotechs in #124
- Refactor do_get.rs and adjacent files by @gabotechs in #125
- Adds in-memory example by @gabotechs in #132
- Add support for in-memory TPCH tests by @gabotechs in #129
- Comment flaky test by @gabotechs in #133
- Support --threads and --workers on TPCH benchmarks by @gabotechs in #130
- Report host stats on TPCH benchmarks by @gabotechs in #131
- Robtandy/better graphviz plans by @robtandy in #135
- changes to allow nice graphviz of single node plans too by @robtandy in #136
- metrics: add metrics module and protos by @jayshrivastava in #141
- fix bug in graphviz for determining output partitions by @robtandy in #142
- move chrono out of optional deps so project can compile by @robtandy in #143
- execution_plans: add metrics collector and re-writer by @jayshrivastava in #144
- Distributed planning overhaul by @gabotechs in #145
- Update README.md with new diagrams based on NetworkShuffleExec and NetworkCoalesceExec by @gabotechs in #153
- Do not require default datafusion features by @gabotechs in #154
- set msrv via Cargo.toml, use 2024 edition by @adriangb in #152
- remove feature flags around chrono::DateTime by @adriangb in #155
- fix: Move error.rs to protobuf by @jonathanc-n in #156
- execution_plans: add MetricsCollectingStream by @jayshrivastava in #150
- flight_service: add TrailingFlightDataStream by @jayshrivastava in #157
- fix: Incorrect weather parquet path in examples by @zuston in #165
- Generalize functions for NetworkCoalesceExec creation by @jonathanc-n in #162
- fix: Enable distributed plan for localhost_run by @zuston in #166
- Address public api weak points by @gabotechs in #158
- Add partition coalescing at the head of the plan by @gabotechs in #164
- Fix early drop stateful nodes by @gabotechs in #159
- Remove unnecessary StageExec proto serde overhead by @gabotechs in #163
- flight_service: emit metrics from ArrowFlightEndpoint by @jayshrivastava in #160
- Evolve ChannelResolver trait for requiring a FlightServiceClient instead of a tonic::BoxSyncCloneChannel by @gabotechs in #172
- update to DataFusion 50 by @adriangb in #146
- Use upstream composed extension codec by @gabotechs in #176
- Fix Dictionary Encoded Values by @cetra3 in #174
- Rework execution plan hierarchy for better interoperability by @gabotechs in #178
- Fix in-memory example by @gabotechs in #183
- Misc improvements to public API by @gabotechs in #181
- Add DistributedPlanError::NonDistributable rule and do not distribute SHOW COLUMNS by @gabotechs in #195
- implement distributed EXPLAIN ANALYZE by @jayshrivastava in #182
- Refactor distributed planner into its own folder by @gabotechs in #196
- Fix user provided UDFs encoding by @gabotechs in #200
- Add dynamic task config based on DataFusion extension options by @gabotechs in htt...