[Parquet] Reduce reallocations when reading StringView in parquet #9059

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I am profiling clickbench query 10 with predicate pushdown enabled as part of

samply record -- /Users/andrewlamb/Software/datafusion2/target/profiling/datafusion-cli   -f q.sql  > /dev/null  2>&1
SELECT "MobilePhoneModel", COUNT(DISTINCT "UserID") AS u FROM hits WHERE "MobilePhoneModel" <> '' GROUP BY "MobilePhoneModel" ORDER BY u DESC LIMIT 10;

While looking at the profile, I noticed that 7% of the time is spent allocating and regrowing vectors (that is, reallocating and copying).

Describe the solution you'd like
Avoid the time spent regrowing these vectors

It appears that the vectors in question are part of the ViewBuffer struct:

pub struct ViewBuffer {
    pub views: Vec<u128>,
    pub buffers: Vec<Buffer>,
}
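
To make the regrowth cost concrete, here is a minimal standalone sketch (not code from arrow-rs) that counts how often a `Vec<u128>` changes capacity while views are pushed one at a time. `count_regrowths` is a hypothetical helper written only for this illustration, and the exact number of regrowths depends on the standard library's growth policy, so the sketch only compares "some" against "none".

```rust
/// Hypothetical helper: pushes `n` dummy views into `views` and returns how
/// many times the vector's capacity changed (each change is a reallocation,
/// usually followed by a copy of the existing elements).
pub fn count_regrowths(mut views: Vec<u128>, n: usize) -> usize {
    let mut regrowths = 0;
    let mut last_capacity = views.capacity();
    for i in 0..n {
        views.push(i as u128);
        if views.capacity() != last_capacity {
            regrowths += 1;
            last_capacity = views.capacity();
        }
    }
    regrowths
}

/// Number of regrowths when starting from an empty vector.
pub fn regrowths_without_reservation(n: usize) -> usize {
    count_regrowths(Vec::new(), n)
}

/// Number of regrowths when the final size is reserved up front.
pub fn regrowths_with_reservation(n: usize) -> usize {
    count_regrowths(Vec::with_capacity(n), n)
}
```

Starting from `Vec::new()` the vector has to grow at least once, while reserving the final size up front avoids regrowth entirely for those `n` pushes.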

Describe alternatives you've considered
Since we know how many views will be in each output buffer, we could create each ViewBuffer with the correct capacity up front.

Something like

ViewBuffer::with_capacity
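
A rough sketch of what such a constructor might look like follows. This is an assumption about the shape of the API, not the actual arrow-rs implementation: the `with_capacity` signature and its `num_views` / `num_buffers` parameters are hypothetical, and `Buffer` here is a local stand-in for the real Arrow buffer type so that the example compiles on its own.

```rust
/// Local stand-in for the real Arrow `Buffer` type (assumption, so that the
/// sketch is self-contained).
#[derive(Debug, Clone, Default)]
pub struct Buffer(pub Vec<u8>);

#[derive(Debug, Default)]
pub struct ViewBuffer {
    pub views: Vec<u128>,
    pub buffers: Vec<Buffer>,
}

impl ViewBuffer {
    /// Hypothetical constructor: reserves space for `num_views` views and
    /// `num_buffers` data buffers so later pushes do not need to regrow.
    pub fn with_capacity(num_views: usize, num_buffers: usize) -> Self {
        Self {
            views: Vec::with_capacity(num_views),
            buffers: Vec::with_capacity(num_buffers),
        }
    }
}

/// Pushes `n` dummy views into a pre-sized `ViewBuffer` and reports whether
/// the backing allocation of `views` stayed at the same address throughout.
pub fn fills_without_realloc(n: usize) -> bool {
    let mut vb = ViewBuffer::with_capacity(n, 1);
    let ptr_before = vb.views.as_ptr();
    for i in 0..n {
        vb.views.push(i as u128);
    }
    ptr_before == vb.views.as_ptr()
}
```

The reader would then call something like `ViewBuffer::with_capacity(batch_size, 1)` when it starts a new output batch. How many data buffers to reserve is a separate question that this sketch does not try to answer.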

Additional context
