[Improvement]: Solve High Memory Usage in Planning Phase Caused by Massive Deleted Files #4255

@Akeron-Zhu

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

Currently, the planning phase consumes excessive memory on tables that contain a very large number of small deleted files. The root cause is the way file relationships are stored: a single DeleteFile is often associated with many DataFiles.
In the current implementation, these associations are likely stored as explicit lists or object references. When a large number of data files reference the same delete files, the memory needed to hold these references grows in proportion to the total number of DeleteFile-to-DataFile associations, with no sharing or compression. As a result, the planning index consumes far more heap memory than necessary, which can lead to Out-Of-Memory (OOM) errors and degraded performance during query planning.
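To make the growth concrete, here is a rough back-of-the-envelope sketch. The file counts and the 8-byte reference size are assumptions chosen purely for illustration, not measurements from a real table, and `explicit_reference_bytes` is a hypothetical helper.

```python
# Hypothetical figures for illustration only; not measured on a real table.
REF_BYTES = 8  # assumed size of one object reference on the heap


def explicit_reference_bytes(delete_files, data_files_per_delete, ref_bytes=REF_BYTES):
    """Bytes spent on references alone when every DeleteFile keeps
    an explicit list of the DataFiles it applies to."""
    return delete_files * data_files_per_delete * ref_bytes


# 1,000 delete files, each associated with 1,000 data files.
small = explicit_reference_bytes(1_000, 1_000)

# 10x more delete files, each associated with 10x more data files.
large = explicit_reference_bytes(10_000, 10_000)

print(small, large, large // small)
```

Under these assumptions the cost scales with the product of the two counts, so growing both by 10x multiplies the reference overhead by 100x, before counting list headers or other per-object overhead.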

How should we improve?

I propose optimizing the memory layout of the planning index by using RoaringBitmap to compress the association between DeleteFile and DataFile. Instead of storing explicit lists of file IDs or object references, each DeleteFile would hold a RoaringBitmap containing the integer IDs of its associated DataFiles. RoaringBitmap compresses sets of integers efficiently, so this should significantly reduce the memory footprint of these many-to-many relationships.
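The idea can be sketched as follows. This is a minimal illustration, not the proposed implementation: `DeleteFileIndex` and its methods are hypothetical names, it assumes each DataFile is assigned a dense integer ID on first sight, and it uses a plain Python integer as an uncompressed bitset standing in for RoaringBitmap, which would additionally compress the bits.

```python
from collections import defaultdict


class DeleteFileIndex:
    """Hypothetical planning index: delete file -> bitmap of data file IDs."""

    def __init__(self):
        self._data_file_ids = {}          # data file path -> dense integer ID
        self._data_file_paths = []        # dense integer ID -> data file path
        self._bitmaps = defaultdict(int)  # delete file path -> int used as a bitset

    def _id_for(self, data_file):
        """Return the dense ID of a data file, assigning one on first use."""
        file_id = self._data_file_ids.get(data_file)
        if file_id is None:
            file_id = len(self._data_file_paths)
            self._data_file_ids[data_file] = file_id
            self._data_file_paths.append(data_file)
        return file_id

    def associate(self, delete_file, data_file):
        """Record that delete_file applies to data_file by setting one bit."""
        self._bitmaps[delete_file] |= 1 << self._id_for(data_file)

    def is_associated(self, delete_file, data_file):
        """Check a single association without materializing any list."""
        file_id = self._data_file_ids.get(data_file)
        if file_id is None:
            return False
        return (self._bitmaps.get(delete_file, 0) >> file_id) & 1 == 1

    def data_files_for(self, delete_file):
        """Expand the bitmap back into data file paths, in ID order."""
        bits = self._bitmaps.get(delete_file, 0)
        result = []
        file_id = 0
        while bits:
            if bits & 1:
                result.append(self._data_file_paths[file_id])
            bits >>= 1
            file_id += 1
        return result


index = DeleteFileIndex()
index.associate("delete-1", "data-a")
index.associate("delete-1", "data-b")
index.associate("delete-2", "data-b")
index.associate("delete-2", "data-c")

print(index.data_files_for("delete-1"))
print(index.is_associated("delete-1", "data-c"))
```

Each data file path is stored once, and every association costs one bit rather than one reference. The sketch only shows the ID-plus-bitmap layout; a compressed bitmap is what would keep sparse or very large ID ranges small.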

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Subtasks

No response
