[SPARK-56174][SQL] Complete V2 file write path for DataFrame API #55091
Draft: LuciferYang wants to merge 3 commits into apache:master
Conversation
…Frame API writes and delete FallBackFileSourceV2

Key changes:
- `FileWrite`: added `partitionSchema`, `customPartitionLocations`, `dynamicPartitionOverwrite`, `isTruncate`; path creation and truncate logic; dynamic partition overwrite via `FileCommitProtocol`
- `FileTable`: `createFileWriteBuilder` with `SupportsDynamicOverwrite` and `SupportsTruncate` (sketched below); capabilities now include `TRUNCATE` and `OVERWRITE_DYNAMIC`; `fileIndex` skips file existence checks when `userSpecifiedSchema` is provided (write path)
- All file format writes (Parquet, ORC, CSV, JSON, Text, Avro) use `createFileWriteBuilder` with partition/truncate/overwrite support
- `DataFrameWriter.lookupV2Provider`: enabled `FileDataSourceV2` for non-partitioned Append and Overwrite via `df.write.save(path)`
- `DataFrameWriter.insertInto`: V1 fallback for file sources (TODO: SPARK-56175)
- `DataFrameWriter.saveAsTable`: V1 fallback for file sources (TODO: SPARK-56230, needs `StagingTableCatalog`)
- `DataSourceV2Utils.getTableProvider`: V1 fallback for file sources (TODO: SPARK-56175)
- Removed `FallBackFileSourceV2` rule
- `V2SessionCatalog.createTable`: V1 `FileFormat` data type validation
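For context, `SupportsTruncate` and `SupportsDynamicOverwrite` are the existing DSv2 `WriteBuilder` mix-ins behind the `TRUNCATE` and `OVERWRITE_DYNAMIC` capabilities. A minimal sketch of a write builder exposing them (the class name and `???` body are illustrative, not this PR's code):

```scala
import org.apache.spark.sql.connector.write.{SupportsDynamicOverwrite, SupportsTruncate, Write, WriteBuilder}

// Illustrative sketch: record which overwrite strategy the planner requested,
// roughly as described for FileWrite above.
class FileWriteBuilderSketch extends WriteBuilder
    with SupportsTruncate with SupportsDynamicOverwrite {

  private var isTruncate = false
  private var dynamicPartitionOverwrite = false

  // Static overwrite (e.g. SaveMode.Overwrite): existing data is dropped first.
  override def truncate(): WriteBuilder = {
    isTruncate = true
    this
  }

  // Dynamic partition overwrite: only partitions present in the data are replaced.
  override def overwriteDynamicPartitions(): WriteBuilder = {
    dynamicPartitionOverwrite = true
    this
  }

  override def build(): Write = {
    // A real implementation would thread these flags into the commit
    // protocol (FileCommitProtocol in this PR's description).
    ???
  }
}
```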
LuciferYang (author): #55034 can be improved. Let me go back and revise it first, then come back to update this PR.
…catalog table loading, and gate removal

Key changes:
- `FileTable` extends `SupportsPartitionManagement` with `createPartition`, `dropPartition`, `listPartitionIdentifiers`, `partitionSchema` (see the sketch below)
- Partition operations sync to the catalog metastore (best-effort)
- `V2SessionCatalog.loadTable` returns `FileTable` instead of `V1Table`, sets `catalogTable` and `useCatalogFileIndex` on `FileTable`
- `V2SessionCatalog.getDataSourceOptions` includes `storage.properties` for proper option propagation (header, ORC bloom filter, etc.)
- `V2SessionCatalog.createTable` validates data types via `FileTable`
- `FileTable.columns()` restores NOT NULL constraints from `catalogTable`
- `FileTable.partitioning()` falls back to `userSpecifiedPartitioning` or catalog partition columns
- `FileTable.fileIndex` uses `CatalogFileIndex` when the catalog has registered partitions (custom partition locations)
- `FileTable.schema` checks column name duplication for non-catalog tables only
- `DataSourceV2Utils.getTableProvider`: removed the `FileDataSourceV2` gate
- `DataFrameWriter.insertInto`: enabled V2 for file sources
- `DataFrameWriter.saveAsTable`: V1 fallback (TODO: SPARK-56230)
- `ResolveSessionCatalog`: V1 fallback for `FileTable`-backed commands (AnalyzeTable, AnalyzeColumn, TruncateTable, TruncatePartition, ShowPartitions, RecoverPartitions, AddPartitions, RenamePartitions, DropPartitions, SetTableLocation, CREATE TABLE validation, REPLACE TABLE blocking)
- `FindDataSourceTable`: streaming V1 fallback for `FileTable` (TODO: SPARK-56233)
- `DataSource.planForWritingFileFormat`: graceful V2 handling
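For reference, `SupportsPartitionManagement` is the existing DSv2 catalog interface for partitioned tables. A toy in-memory sketch of its contract (illustrative only; the real `FileTable` resolves these against the file index and metastore):

```scala
import java.util.{Collections, Map => JMap}

import scala.collection.mutable

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsPartitionManagement, TableCapability}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Toy table keeping partition metadata in memory, just to show the method shapes.
class PartitionedTableSketch extends SupportsPartitionManagement {
  private val partitions = mutable.Map.empty[InternalRow, JMap[String, String]]

  override def name(): String = "sketch"
  override def schema(): StructType =
    StructType(Seq(StructField("value", StringType), StructField("dt", StringType)))
  override def capabilities(): java.util.Set[TableCapability] = Collections.emptySet()

  override def partitionSchema(): StructType =
    StructType(Seq(StructField("dt", StringType)))

  // This PR additionally syncs create/drop to the catalog metastore (best-effort).
  override def createPartition(ident: InternalRow, props: JMap[String, String]): Unit =
    partitions.put(ident, props)

  override def dropPartition(ident: InternalRow): Boolean =
    partitions.remove(ident).isDefined

  override def replacePartitionMetadata(ident: InternalRow, props: JMap[String, String]): Unit =
    partitions.update(ident, props)

  override def loadPartitionMetadata(ident: InternalRow): JMap[String, String] =
    partitions.getOrElse(ident, Collections.emptyMap[String, String]())

  override def listPartitionIdentifiers(names: Array[String], ident: InternalRow): Array[InternalRow] =
    partitions.keys.toArray // a real implementation filters by the given name/value prefix
}
```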
### What changes were proposed in this pull request?
This PR is part of SPARK-56170. It removes the remaining V1 fallbacks in `DataFrameWriter.lookupV2Provider()` so that all save modes (ErrorIfExists, Ignore, Append, Overwrite) and partitioned writes go through the V2 file write path, and enables the ``INSERT INTO format.`path` `` syntax via V2.

#### 1. ErrorIfExists/Ignore via V2
Routes the `ErrorIfExists` and `Ignore` save modes into the V2 write path with a pre-write path-existence check matching V1 semantics (`InsertIntoHadoopFsRelationCommand`); a sketch follows the list:

- `ErrorIfExists`: fails with the `PATH_ALREADY_EXISTS` error when the target path already exists
- `Ignore`: becomes a no-op (planned as an empty `LocalRelation`) when the target path already exists
- otherwise the write is planned as `AppendData`
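A minimal sketch of the pre-write check (simplified; `checkSaveMode`, the string-typed mode, and the error message are illustrative, not this PR's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

import org.apache.spark.SparkException

sealed trait WritePlan
case object NoOp extends WritePlan   // planned as an empty LocalRelation
case object Append extends WritePlan // planned as AppendData

// V1-compatible semantics: ErrorIfExists fails fast on an existing path,
// Ignore degrades to a no-op, everything else appends.
def checkSaveMode(mode: String, path: Path, hadoopConf: Configuration): WritePlan = {
  val fs = path.getFileSystem(hadoopConf)
  val exists = fs.exists(path)
  mode.toLowerCase match {
    case "errorifexists" if exists =>
      // V1 raises the PATH_ALREADY_EXISTS error class here.
      throw new SparkException(s"[PATH_ALREADY_EXISTS] Path $path already exists.")
    case "ignore" if exists => NoOp
    case _ => Append
  }
}
```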
#### 2. Partitioned writes via V2

- `FileDataSourceV2.getTable`: extracts partition column names from the `Transform` array and sets `FileTable.userSpecifiedPartitioning`
- `FileTable.createFileWriteBuilder`: case-insensitive partition column lookup with `.copy(name = c)` to preserve the `partitionBy` argument case for directory names (see the sketch after this list)
- `FileWrite.createWriteJobDescription`: uses partition column names from `partitionSchema` (not `allColumns`) to ensure directory names match the `partitionBy` case
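The case-preserving lookup can be pictured like this (a minimal standalone sketch, assuming a schema and a user-supplied `partitionBy` list; not the PR's exact code):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Resolve partitionBy columns case-insensitively against the schema, but keep
// the user's spelling (.copy(name = c)) so directory names match partitionBy.
def partitionSchemaFor(schema: StructType, partitionBy: Seq[String]): StructType = {
  val fields = partitionBy.map { c =>
    schema.fields
      .find(_.name.equalsIgnoreCase(c))
      .map(_.copy(name = c))
      .getOrElse(throw new IllegalArgumentException(s"Partition column $c not found"))
  }
  StructType(fields)
}

// Schema declares "Year"; partitionBy("year") yields directories like year=2024.
val schema = StructType(Seq(StructField("Year", IntegerType), StructField("v", StringType)))
assert(partitionSchemaFor(schema, Seq("year")).head.name == "year")
```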
#### 3. FileWrite improvements

- Implements `RequiresDistributionAndOrdering` with an ascending sort by partition columns, ensuring `DynamicPartitionDataSingleWriter` sees contiguous partition values (sketched below)
- Matches partition columns by name (`partitionSet.contains(col.name)`) instead of by object identity
- Builds the write `description` before `setupJob` (fixes Parquet summary configuration ordering)
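`RequiresDistributionAndOrdering` is the standard DSv2 hook for this. A sketch of requesting a clustered distribution plus an ascending sort on the partition columns (the class name is illustrative):

```scala
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortDirection, SortOrder}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

// Ask Spark to cluster and sort incoming rows by the partition columns so a
// single writer sees contiguous runs of each partition value.
class SortedFileWriteSketch(partitionColumns: Seq[String]) extends Write
    with RequiresDistributionAndOrdering {

  override def requiredDistribution(): Distribution = {
    val clustering = partitionColumns.map(c => Expressions.column(c): Expression).toArray
    Distributions.clustered(clustering)
  }

  override def requiredOrdering(): Array[SortOrder] =
    partitionColumns.map { c =>
      Expressions.sort(Expressions.column(c), SortDirection.ASCENDING)
    }.toArray
}
```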
#### 4. ``INSERT INTO format.`path` `` via V2

Updates `ResolveSQLOnFile` to resolve ``INSERT INTO parquet.`/path` `` and ``SELECT ... FROM format.`path` `` via V2 when a `FileDataSourceV2` provider is available.
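For example (the path and source table here are placeholders):

```scala
// Direct-on-file SQL now planned through the V2 write path.
spark.sql("INSERT INTO parquet.`/tmp/events` SELECT * FROM updates")
spark.sql("SELECT count(*) FROM parquet.`/tmp/events`").show()
```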
#### 5. Aggregate pushdown case sensitivity fix

Fixes `AggregatePushDownUtils.isPartitionCol` and `getStructFieldForCol` to use case-insensitive matching, resolving aggregate pushdown failures when partition directory names differ in case from query column references.
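The gist of the fix (a simplified sketch, not the actual utility code):

```scala
import org.apache.spark.sql.types.StructType

// Treat a queried column as a partition column regardless of case, so a
// directory layout like YEAR=2024 still matches a query referencing `year`.
def isPartitionCol(partitionSchema: StructType, colName: String): Boolean =
  partitionSchema.fields.exists(_.name.equalsIgnoreCase(colName))
```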
#### 6. CSV-specific fixes

- `CSVTable.allowDuplicatedColumnNames = true`: allows CSV writes with duplicate column names (matching V1 `CSVFileFormat.allowDuplicatedColumnNames`)
- `CSVTable.supportsWriteDataType`: rejects `VariantType` for writes while allowing it for reads, matching V1's `supportDataType` vs `supportReadDataType` distinction (see the sketch below)
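The read/write split for `Variant` might look like this (a hedged sketch of the described behavior, not the actual `CSVTable` code):

```scala
import org.apache.spark.sql.types.{DataType, VariantType}

// Reads may produce Variant values, but CSV cannot serialize them on write.
def supportsWriteDataType(dt: DataType): Boolean = dt match {
  case _: VariantType => false
  case _ => true
}
```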
#### 7. Infrastructure fixes

- `DataSourceV2Strategy.refreshCache`: propagates data source options to the Hadoop config via `r.options`; guards against empty `rootPaths`
- `FileDataSourceV2.getTable`: reuses the cached table from `inferSchema()` when available, for correct `MetadataLogFileIndex` behavior (streaming sink output)
- `FileTable.schema`: skips the column name duplication check for formats with `allowDuplicatedColumnNames`; validates data types for non-partition columns only
### Why are the changes needed?

This completes the V2 file write path for the DataFrame API, eliminating the last V1 fallbacks in `DataFrameWriter.lookupV2Provider()`. All file source writes now go through the V2 infrastructure, enabling consistent behavior and paving the way for future V2 enhancements (streaming writes, bucketing, etc.).
### Does this PR introduce any user-facing change?

No. The behavior is functionally equivalent to V1 for all save modes and partitioned writes. Directory names now use the `partitionBy` argument case (matching V1 behavior).
### How was this patch tested?

Tests covering `ErrorIfExists` mode, `Ignore` mode, ``INSERT INTO format.`path` ``, and ``SELECT ... FROM format.`path` ``.

### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code 4.6