@edknv commented Aug 13, 2025

Description

This PR adds a new ingest_in_chunks method to the Ingestor class, along with an optional chunk_size parameter on the existing ingest() method.

Before

Currently, users who need to process very large datasets that don't fit on disk must write their own looping and chunking logic, for example:

import glob

all_files = glob.glob("./very_large_dataset/*.pdf")
chunks = [all_files[i:i + 500] for i in range(0, len(all_files), 500)]

all_results, all_failures = [], []
for chunk_num, chunk in enumerate(chunks):
    print(f"Processing chunk {chunk_num}")
    # The user must re-instantiate and re-configure the Ingestor in every loop
    ingestor = (
        Ingestor(message_client_hostname=hostname)
        .files(chunk)
        .extract(...)
        .embed(...)
        .vdb_upload(...)
        .save_to_disk()
    )
    # The user also has to manually aggregate results and handle failures
    results, failures = ingestor.ingest(return_failures=True)
    all_results.extend(results)
    all_failures.extend(failures)

After

This PR encapsulates this logic in the new ingest_in_chunks method, so the user configures the Ingestor once and the method handles chunking, processing, and aggregation.

# The user configures the Ingestor just once
ingestor = (
    Ingestor()
    .files("./very_large_dataset/*.pdf")
    .extract(...)
    .embed(...)
    .vdb_upload(vdb_op="milvus")
    .save_to_disk()
)

# A single, simple method call handles all chunking, processing, and aggregation
results, failures = ingestor.ingest_in_chunks(chunk_size=500, return_failures=True)
# or use the optional chunk_size parameter in the existing ingest()
results, failures = ingestor.ingest(chunk_size=500, return_failures=True)
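For reviewers who want a mental model of the chunk-and-aggregate pattern, here is a minimal, library-independent sketch. It is not the implementation in this PR: iter_chunks, run_in_chunks, and fake_process are hypothetical names used only for illustration, and the real method may differ in how it handles failures and results saved to disk.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_chunks(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def run_in_chunks(
    items: Sequence[str],
    chunk_size: int,
    process_chunk: Callable[[Sequence[str]], Tuple[List, List]],
) -> Tuple[List, List]:
    """Call `process_chunk` on each chunk and merge the (results, failures) pairs."""
    all_results: List = []
    all_failures: List = []
    for chunk in iter_chunks(items, chunk_size):
        results, failures = process_chunk(chunk)
        all_results.extend(results)
        all_failures.extend(failures)
    return all_results, all_failures


def fake_process(chunk: Sequence[str]) -> Tuple[List, List]:
    """Hypothetical stand-in for an ingestion call: any non-PDF path fails."""
    results = [f"processed:{path}" for path in chunk if path.endswith(".pdf")]
    failures = [
        (path, "unsupported file type")
        for path in chunk
        if not path.endswith(".pdf")
    ]
    return results, failures
```

Calling run_in_chunks(files, 500, fake_process) mirrors the shape of the one-line call above: the caller supplies the full file list once and receives a single merged (results, failures) pair.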

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

@edknv requested a review from a team as a code owner August 13, 2025 20:11
@edknv requested review from drobison00 and removed request for a team August 13, 2025 20:11