
[Fix](connector) Fix schema size mismatch caused by Doris internal columns #351

Open
dingyufei615 wants to merge 1 commit into apache:master from dingyufei615:issue-349

Conversation

@dingyufei615

[Fix] Fix Arrow field count mismatch with schema causing read failures

Proposed changes

Issue Number: close #349

Problem Summary:

Fixes the "DorisException: Load Doris data failed, schema size of fetch data is wrong" error that occurs when reading data from Doris 2.0+ with the Spark Doris Connector. The error is caused by Arrow returning more fields than the schema defines.

Root Cause

Doris 2.0+ includes internal system columns (such as __DORIS_DELETE_SIGN__) in the Arrow data stream for certain table types (e.g., Unique Key tables). These columns are used for Merge-on-Read implementation but should not be visible to users. The original strict validation logic fieldVectors.size() > schema.size() would throw an exception immediately, preventing normal data reading.

Solution

  1. Modified validation logic: Changed the strict > check to only throw exceptions when fieldVectors.size() < schema.size() (actual error scenario)
  2. Compatible with internal columns: Log a warning instead of throwing exception when fieldVectors.size() > schema.size()
  3. Process only user columns: In both readBatch() and convertArrowToRowBatch(), only process columns defined in schema, ignoring extra internal columns
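The three rules above can be sketched as a small standalone model. This is a simplified illustration, not the actual RowBatch.java code: plain lists stand in for Arrow field vectors and the Doris schema, and the names follow the PR description.

```java
import java.util.List;

public class SchemaCheckSketch {
    // Sketch of the relaxed validation: fail only when the Arrow batch has
    // fewer fields than the schema expects; tolerate extra internal columns.
    static String check(List<String> fieldVectors, List<String> schema) {
        if (fieldVectors.size() < schema.size()) {
            // Actual error: expected columns are missing from the Arrow data.
            return "error";
        }
        if (fieldVectors.size() > schema.size()) {
            // Doris 2.0+ internal columns (e.g. __DORIS_DELETE_SIGN__): warn only.
            return "warn";
        }
        return "ok";
    }

    public static void main(String[] args) {
        List<String> schema = List.of("id", "name");
        System.out.println(check(List.of("id", "name", "__DORIS_DELETE_SIGN__"), schema)); // warn
        System.out.println(check(List.of("id"), schema));                                  // error
        System.out.println(check(List.of("id", "name"), schema));                          // ok
    }
}
```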

Changes Made

File: spark-doris-connector-base/src/main/java/org/apache/doris/spark/client/read/RowBatch.java

  1. readBatch() method:

    • Reversed validation logic to only throw exception when fields are insufficient
    • Log warning instead of throwing exception when fields exceed schema size
    • Use schema.size() instead of fieldVectors.size() to initialize Row objects
  2. convertArrowToRowBatch() method:

    • Loop only processes schema.size() fields, ignoring extra internal columns
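The convertArrowToRowBatch() change, bounding the column loop by the schema rather than by the batch, can be illustrated with a minimal sketch. Columns are modeled as plain lists here; the method name toRows and its signature are hypothetical, chosen only to show why iterating schema.size() columns drops trailing internal columns.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnLoopSketch {
    // Sketch: copy values column-by-column into rows, iterating only over the
    // user-visible schema so trailing internal columns in the batch are ignored.
    static List<List<Object>> toRows(List<? extends List<?>> columns, int schemaSize, int rowCount) {
        List<List<Object>> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            rows.add(new ArrayList<>());            // one Row per record, sized by the schema
        }
        for (int c = 0; c < schemaSize; c++) {      // schema.size(), not columns.size()
            for (int r = 0; r < rowCount; r++) {
                rows.get(r).add(columns.get(c).get(r));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Third column stands in for __DORIS_DELETE_SIGN__ and is skipped.
        List<? extends List<?>> columns = List.of(
                List.of(1, 2), List.of("a", "b"), List.of(0, 0));
        System.out.println(toRows(columns, 2, 2)); // [[1, a], [2, b]]
    }
}
```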

Checklist (Required)

  1. Does it affect the original behavior: Yes - Fixes the issue where Doris 2.0+ data cannot be read, making the connector compatible with Arrow data streams containing internal columns
  2. Have unit tests been added: No Need - This is a fix for existing logic; existing tests cover the core functionality
  3. Has document been added or modified: No Need - This is an internal implementation fix that doesn't affect user-facing APIs
  4. Does it need to update dependencies: No - No dependency changes
  5. Are there any changes that cannot be rolled back: No - Can be safely rolled back

Further comments

Testing & Verification

This fix has been verified in the following environment:

  • Doris Version: 2.0.x
  • Spark Version: 3.3
  • Table Type: Unique Key tables (containing __DORIS_DELETE_SIGN__ internal column)

Impact Scope

  • Benefited scenarios: All read operations using Doris 2.0+ with tables containing internal system columns
  • Backward compatibility: Fully compatible with older Doris versions, no impact on existing functionality
  • Performance impact: No performance impact, only adjusted field processing logic

Related Issues

This issue has been reported in the community:

@dingyufei615 (Author)

@JNSimba

Comment on lines +148 to +150
if (fieldVectors.size() < schema.size()) {
logger.error("Arrow field size '{}' is less than data schema size '{}'.",
fieldVectors.size(), schema.size());
@JNSimba (Member), Feb 3, 2026

I remember there was a version of Doris where the schema would return delete_sign, but the data wasn't actually there. Could this change cause a problem?

@dingyufei615 (Author), Feb 3, 2026

Thank you for the review

The fix handles both scenarios safely:

Scenario 1 (Issue #349): Arrow has extra internal columns

  • fieldVectors.size() > schema.size()
  • Behavior: Log warning, continue processing
  • Fixes the reported issue

Scenario 2 (Your concern): Schema has columns missing in Arrow data

  • fieldVectors.size() < schema.size()
  • Behavior: Throw an exception immediately (lines 148-152)
  • Maintains fail-fast behavior

Code logic:

// Still throws an exception when Arrow data is missing expected columns
if (fieldVectors.size() < schema.size()) {
    throw new DorisException("Load Doris data failed, schema size of fetch data is wrong.");
}

// Only allows extra columns (Doris 2.0+ internal columns)
if (fieldVectors.size() > schema.size()) {
    logger.warn("This may be due to internal columns in Doris 2.0+...");
}

Could you share which Doris version had the issue you mentioned? I'd like to verify if it still exists and add test coverage if needed.
