Skip to content

fix: handle utf-8 chunk boundary detection#583

Open
cnYui wants to merge 1 commit into
coderamp-labs:mainfrom
cnYui:codex/fix-utf8-chunk-boundary-detection
Open

fix: handle utf-8 chunk boundary detection#583
cnYui wants to merge 1 commit into
coderamp-labs:mainfrom
cnYui:codex/fix-utf8-chunk-boundary-detection

Conversation

@cnYui

@cnYui cnYui commented Jun 8, 2026

Copy link
Copy Markdown

Summary

Test plan

  • RED: python -m pytest tests/test_filesystem.py::test_content_keeps_utf8_text_when_multibyte_character_crosses_chunk_boundary -q
    • Before the fix: failed because node.content returned [Binary file].
  • GREEN: python -m pytest tests/test_filesystem.py::test_content_keeps_utf8_text_when_multibyte_character_crosses_chunk_boundary -q
    • After the fix: 1 passed.
  • python -m pytest tests/test_filesystem.py tests/test_ingestion.py -q
    • 9 passed.
  • python -m ruff check src/gitingest/utils/file_utils.py tests/test_filesystem.py
    • All checks passed!.
  • python -m ruff format --check src/gitingest/utils/file_utils.py tests/test_filesystem.py
    • 2 files already formatted.
  • python -m pytest -q -k "not bitbucket"
    • 154 passed, 7 deselected.

Note: full python -m pytest -q currently fails on three Bitbucket-hosted tests/query_parser/test_git_host_agnostic.py cases because git ls-remote https://bitbucket.org/na-dna/llm-knowledge-share HEAD prompts for credentials in this non-interactive environment (fatal: could not read Username for 'https://bitbucket.org'). The failure is outside this file-content decoding path.

@cnYui cnYui force-pushed the codex/fix-utf8-chunk-boundary-detection branch from 20cf4e7 to c57c2c4 Compare June 8, 2026 11:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

(bug): Unicode characters might break chunk decoding logic leading to [Binary File]

1 participant