GH-2123: Add chunkOverlap support to TokenTextSplitter #4054

Status: Open (1 commit, base: main)

ralla0405
Summary

This PR adds chunk overlap support to the TokenTextSplitter class, so that
consecutive chunks share tokens and context is preserved across chunk
boundaries during document processing.

Closes #2123

Changes

  • TokenTextSplitter class enhancements:

    • Added chunkOverlap field with default value of 50 tokens
    • Updated constructor and builder to support chunk overlap configuration
    • Added validation to ensure chunkOverlap < chunkSize
    • Refactored doSplit method to implement overlap logic
    • Added optimizeChunkBoundary method for sentence-aware splitting
  • Test improvements:

    • Added testChunkOverlapFunctionality to verify overlap behavior
    • Added testChunkOverlapValidation for input validation
    • Added testBoundaryOptimizationWithOverlap for sentence boundary testing
    • Added testKeepSeparatorVariations for separator handling
    • Updated existing tests to handle dynamic chunk counts with overlap
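
To make the overlap and validation items above concrete, here is a minimal, self-contained sketch of a sliding-window split over a token list. It is not the code in this PR: the class `OverlapSketch` and its `split` method are hypothetical names used for illustration, and tokens are modeled as plain integers rather than encoder output.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; not the TokenTextSplitter implementation from this PR. */
class OverlapSketch {

    /**
     * Splits tokens into windows of at most chunkSize tokens, where each
     * window after the first starts chunkOverlap tokens before the end of
     * the previous one.
     */
    static List<List<Integer>> split(List<Integer> tokens, int chunkSize, int chunkOverlap) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Same rule as the PR's validation: chunkOverlap < chunkSize.
        // Without it the step below would be zero or negative.
        if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("chunkOverlap must be >= 0 and < chunkSize");
        }
        List<List<Integer>> chunks = new ArrayList<>();
        int step = chunkSize - chunkOverlap;
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(new ArrayList<>(tokens.subList(start, end)));
            if (end == tokens.size()) {
                break; // the last window already reaches the end of the input
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> tokens = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        // chunkSize = 4, chunkOverlap = 1: each chunk repeats the previous chunk's last token
        System.out.println(split(tokens, 4, 1));
    }
}
```

Because the window advances by `chunkSize - chunkOverlap`, the number of chunks depends on the overlap, which is presumably why the existing tests had to be updated to handle dynamic chunk counts.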

Key Features

  • Configurable overlap: Allows overlapping tokens between consecutive
    chunks
  • Boundary optimization: Attempts to split at sentence boundaries when
    possible
  • Input validation: Prevents invalid overlap configurations
  • Backward compatibility: Maintains existing API with sensible defaults
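
As a rough illustration of the boundary optimization feature, the sketch below trims a decoded chunk back to its last sentence terminator when one exists far enough into the text. The class `BoundarySketch`, the method `trimToSentenceBoundary`, the `minChars` threshold, and the choice of terminator characters are all assumptions made for this example; the PR's `optimizeChunkBoundary` method may work differently.

```java
/** Illustrative only; not the optimizeChunkBoundary method from this PR. */
class BoundarySketch {

    /**
     * Returns chunkText cut just after its last '.', '?', '!' or newline,
     * provided the cut keeps at least minChars characters. Otherwise the
     * text is returned unchanged.
     */
    static String trimToSentenceBoundary(String chunkText, int minChars) {
        int last = -1;
        for (char terminator : new char[] {'.', '?', '!', '\n'}) {
            last = Math.max(last, chunkText.lastIndexOf(terminator));
        }
        if (last >= 0 && last + 1 >= minChars) {
            // Keep the terminator itself so the chunk ends on a full sentence.
            return chunkText.substring(0, last + 1);
        }
        return chunkText;
    }

    public static void main(String[] args) {
        System.out.println(trimToSentenceBoundary("First sentence. Second sen", 5));
    }
}
```

In a splitter that combines this with overlap, the text removed by the trim would have to be carried into the next chunk, so the overlap start would be computed from the trimmed end rather than from the raw window end.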

Test Coverage

All new functionality is covered by comprehensive unit tests that verify:

  • Overlap functionality works correctly
  • Input validation prevents invalid configurations
  • Boundary optimization improves chunk quality
  • Metadata handling remains consistent across chunks

Fixes GH-2123 (spring-projects#2123)

  - Add chunkOverlap field and configuration to TokenTextSplitter class
  - Implement overlap functionality in doSplit method with boundary
  optimization
  - Add optimizeChunkBoundary method for sentence-aware chunk splitting
  - Add validation to ensure chunkOverlap < chunkSize
  - Update Builder pattern with withChunkOverlap method
  - Add comprehensive test coverage for overlap functionality
  - Improve existing tests to handle dynamic chunk counts

  Signed-off-by: Seunghwan Jung <[email protected]>

@ralla0405 force-pushed the GH-2123-add-overlap-field-to-splitter branch from e584def to f073fff on August 7, 2025 10:02
Development

Successfully merging this pull request may close these issues.

Support for Text Chunking with Overlap in TokenTextSplitter