IndexOutOfBoundsException when loading compressed IPC format

I encountered this bug when I loaded a dataframe stored in the Arrow IPC format.

 
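```java
// Java code from the "Apache Arrow Java Cookbook"
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.message.ArrowBlock;

File file = new File("example.arrow");
try (
        BufferAllocator rootAllocator = new RootAllocator();
        FileInputStream fileInputStream = new FileInputStream(file);
        ArrowFileReader reader = new ArrowFileReader(fileInputStream.getChannel(), rootAllocator)
) {
    System.out.println("Record batches in file: " + reader.getRecordBlocks().size());
    for (ArrowBlock arrowBlock : reader.getRecordBlocks()) {
        // The exception is thrown here when the file is compressed.
        reader.loadRecordBatch(arrowBlock);
        VectorSchemaRoot vectorSchemaRootRecover = reader.getVectorSchemaRoot();
        System.out.print(vectorSchemaRootRecover.contentToTSVString());
    }
} catch (IOException e) {
    e.printStackTrace();
}
```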

Call stack:

```
Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, length: 2048 (expected: range(0, 2024))
    at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
    at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:955)
    at org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:451)
    at org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:732)
    at org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:240)
    at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:86)
    at org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:220)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:166)
    at org.apache.arrow.vector.ipc.ArrowFileReader.loadRecordBatch(ArrowFileReader.java:197)
```

The bug can be reproduced with a simple dataframe written by pandas:

```python
import pandas as pd

pd.DataFrame({'a': range(10000)}).to_feather('example.arrow')
```

By default, pandas compresses the dataframe when writing it. If compression is turned off, Java can load the dataframe. I therefore suspect that the bounds-checking code is buggy when loading a compressed file.

The compressed file can be loaded in polars, pandas and pyarrow, so it's unlikely to be a pandas bug.

 

 

**Environment**: Linux and Windows.
Apache Arrow Java version: 10.0.0, 9.0.0, 4.0.1.
Pandas 1.4.2 using pyarrow 8.0.0 (anaconda3-2022.05)
**Reporter**: [Georeth Zhou](https://issues.apache.org/jira/browse/ARROW-18198)

<sub>**Note**: *This issue was originally created as [ARROW-18198](https://issues.apache.org/jira/browse/ARROW-18198). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

IndexOutOfBoundsException when loading compressed IPC format #230

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

IndexOutOfBoundsException when loading compressed IPC format #230

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions