Describe the bug
When reading a file that was created with an older parquet writer (parquet-mr specifically) and passing back a schema obtained from ArrowReaderMetadata, the reader fails with:
ArrowError("incompatible arrow schema, expected struct got List(Field { name: \"col_15\", data_type: Struct([Field { name: \"col_16\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"col_17\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"col_18\", data_type: Struct([Field { name: \"col_19\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"col_20\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })")
To Reproduce
I've added the file in: https://github.com/apache/parquet-testing/pull/96
One-liner
Run this in datafusion-cli:
select * from 'https://github.com/apache/parquet-testing/raw/6d1dae7ac5dfb23fa1ac1fed5b77d3b919fbb5f8/data/backward_compat_nested.parquet';
Only the relevant parts
This is the reproduction extracted from DataFusion, keeping only the parts that lead to the error.
Cargo.toml:
[package]
name = "repro"
version = "0.1.0"
edition = "2024"
[dependencies]
arrow = "56.2.0"
parquet = "56.2.0"
bytes = "1.10.1"
main.rs:
use std::sync::Arc;

use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};

fn main() {
    // The file is the one added here: https://github.com/apache/parquet-testing/pull/96
    let file_path = "/private/tmp/parquet-testing/data/backward_compat_nested.parquet".to_string();
    let data = Bytes::from(std::fs::read(file_path).unwrap());

    let options = ArrowReaderOptions::new();

    // Load the metadata and take the Arrow schema the reader derives from the file.
    let reader_metadata = ArrowReaderMetadata::load(&data, options.clone()).unwrap();
    let physical_file_schema = Arc::clone(reader_metadata.schema());

    // Commenting this out will make the code work.
    let options = options.with_schema(Arc::clone(&physical_file_schema));

    // Fails with "incompatible arrow schema, expected struct got List(...)".
    ArrowReaderMetadata::try_new(Arc::clone(reader_metadata.metadata()), options).unwrap();
}
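Per the comment above, the failure goes away once the with_schema call is removed. As a sanity check, here is a minimal sketch (my addition, assuming the same crate versions and file location as above) that takes that working path further and reads the batches through ParquetRecordBatchReaderBuilder::new_with_metadata:

use bytes::Bytes;
use parquet::arrow::arrow_reader::{
    ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
};

fn main() {
    // Same file as in the repro above.
    let file_path = "/private/tmp/parquet-testing/data/backward_compat_nested.parquet".to_string();
    let data = Bytes::from(std::fs::read(file_path).unwrap());

    // Load the metadata without supplying a schema via with_schema.
    let reader_metadata = ArrowReaderMetadata::load(&data, ArrowReaderOptions::new()).unwrap();

    // Build a reader from the already-loaded metadata and read every batch.
    let reader = ParquetRecordBatchReaderBuilder::new_with_metadata(data, reader_metadata)
        .build()
        .unwrap();
    for batch in reader {
        println!("read {} rows", batch.unwrap().num_rows());
    }
}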
Expected behavior
Should not fail
Additional context
This might be a bug in DataFusion rather than in the parquet reader here: due to backward compatibility, the schema was updated to the new representation.
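To make that conversion visible, here is a small diagnostic sketch (my addition, not part of the repro; it only prints schemas) that shows the raw Parquet schema stored in the file next to the Arrow schema the reader derives from it:

use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::schema::printer::print_schema;

fn main() {
    let file_path = "/private/tmp/parquet-testing/data/backward_compat_nested.parquet".to_string();
    let data = Bytes::from(std::fs::read(file_path).unwrap());

    let reader_metadata = ArrowReaderMetadata::load(&data, ArrowReaderOptions::new()).unwrap();

    // Raw Parquet schema as written by the old parquet-mr writer.
    let mut out = Vec::new();
    print_schema(
        &mut out,
        reader_metadata
            .metadata()
            .file_metadata()
            .schema_descr()
            .root_schema(),
    );
    println!("parquet schema:\n{}", String::from_utf8(out).unwrap());

    // Arrow schema the reader derives from it (after the backward-compat handling).
    println!("arrow schema:\n{:#?}", reader_metadata.schema());
}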