Providing the same schema that was read from a backward-compatible parquet file fails: incompatible arrow schema, expected struct got List #8495

@rluvaton

Description

Describe the bug
When reading a file that was created with an older parquet writer (parquet-mr specifically) and passing a schema obtained from ArrowReaderMetadata, the read fails with:

ArrowError("incompatible arrow schema, expected struct got List(Field { name: \"col_15\", data_type: Struct([Field { name: \"col_16\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"col_17\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"col_18\", data_type: Struct([Field { name: \"col_19\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"col_20\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} })")

To Reproduce

I've added the file in:

One-liner

Run this in datafusion-cli

select * from 'https://github.com/apache/parquet-testing/raw/6d1dae7ac5dfb23fa1ac1fed5b77d3b919fbb5f8/data/backward_compat_nested.parquet';

Only the relevant parts

This is a reduced reproduction, taking from DataFusion only the relevant parts of the code path that leads to the error.

Cargo.toml:

[package]
name = "repro"
version = "0.1.0"
edition = "2024"

[dependencies]
arrow = "56.2.0"
parquet = "56.2.0"
bytes = "1.10.1"

main.rs:

use std::sync::Arc;
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};

fn main() {
    // The file is the one added here: https://github.com/apache/parquet-testing/pull/96
    let file_path = "/private/tmp/parquet-testing/data/backward_compat_nested.parquet".to_string();

    let data = Bytes::from(std::fs::read(file_path).unwrap());

    let mut options = ArrowReaderOptions::new();
    let reader_metadata = ArrowReaderMetadata::load(&data, options.clone()).unwrap();

    let physical_file_schema = Arc::clone(reader_metadata.schema());

    // Commenting this out will make the code work
    options = options
        .with_schema(Arc::clone(&physical_file_schema));

    ArrowReaderMetadata::try_new(Arc::clone(reader_metadata.metadata()), options)
        .unwrap();
}

Expected behavior
The read should succeed, matching the behavior when no schema is supplied.

Additional context
This might be a bug in DataFusion rather than in the parquet reader here: due to backward compatibility, the schema was updated to the new version:
