Skip to content

Merge PR #9: PathFilter, file_path scoping, and progress reporting#10

Open
Codeturion wants to merge 26 commits intomasterfrom
review/pr-9
Open

Merge PR #9: PathFilter, file_path scoping, and progress reporting#10
Codeturion wants to merge 26 commits intomasterfrom
review/pr-9

Conversation

@Codeturion
Copy link
Copy Markdown
Owner

Summary

Merges #9 from @michael-howell-island with conflict resolution against #8 (C++ parser).

  • Resolve merge conflicts in db.py and server.py — both namespace and file_path params now coexist in get_class_members()
  • Align C++ parser to use BaseParser.parse_directory (consolidated os.walk with PathFilter)
  • Restore .test.ts/.spec.ts skip suffixes in TypeScript parser

What's included from #9

  • PathFilter — os.walk with dir pruning, .codesurfaceignore, --exclude CLI globs, worktree/submodule skipping
  • file_path scopingsearch, get_signature, get_class tools accept optional file_path filter
  • Progress reporting — stderr output during indexing (scanning N files..., indexing: 50%, done:)
  • Consolidated parse_directory — all parsers now use BaseParser's os.walk loop with skip_suffixes, skip_files, _should_skip_dir hooks

Test plan

  • 32/32 pytest tests pass
  • C++ parser verified on imgui (4,840 records), bullet3 (15,867 records)
  • TS parser verified on vscode (91,794 records, test files excluded)

michael-howell-island and others added 25 commits March 29, 2026 15:55
rglob() cannot skip directories mid-walk — it traverses the entire tree
(including all git worktrees) before the path_filter check runs. With 16
worktrees the indexer was doing ~17x the necessary work, causing 1000+s
startup times instead of ~2 minutes.

Switch all six walk sites (base.py, server.py x2, typescript.py,
python_parser.py, go.py, java.py) to os.walk() with in-place dirs[:]
pruning. Excluded directories (.worktrees, submodules, _SKIP_DIRS) are
now dropped before descent so os.walk never enters them.
…orktree-support

feat: file filtering, worktree exclusions, and query-time file_path scoping
Also forward on_progress in TypeScriptParser's parse_directory override.
The callback is invoked after each successful parse_file call, passing
the Path of the parsed file; skipped files and parse errors do not
trigger it.
…ertions

- Remove `print(summary, file=sys.stderr)` from `main()` after calling
  `_index_full()`, since `_index_full` already prints the done line itself.
- Strengthen `test_index_full_emits_progress_to_stderr` to also assert
  that the scanning line and 0% baseline indexing line appear in stderr.
…le level

Move on_progress callback to a finally block in all five parsers so it
fires whether parse_file succeeds or raises. Previously, unparseable files
were counted by _count_files but never triggered on_progress, causing the
startup progress display to stall below 100% on repos with problematic files.

Also move `import sys` from inside except blocks to module-level imports in
typescript.py, python_parser.py, go.py, and java.py.
- Add _DEFAULT_EXCLUDED_DIRS to PathFilter (node_modules, .git, dist,
  build, vendor, .nx, .yarn, etc.) so vendored dirs are always skipped
- Rewrite detect_languages to use os.walk with PathFilter instead of
  rglob, which crawled into node_modules and hung indefinitely
- Consolidate duplicate parse_directory from all 4 parsers into
  BaseParser with str ops instead of pathlib (saves ~3s on 30K files)
- Rewrite _count_braces_and_parens: regex strip + str.count with fast
  path for lines without strings/comments (saves ~4s on 30K files)
- Replace Path.read_text with open()+read(), path.relative_to with
  os.path.relpath, and _file_to_module with pure string ops
- Remove per-parser _SKIP_DIRS (now centralized in PathFilter)
- Remove __tests__/__mocks__/test/spec file exclusions from TS parser
  so test files are indexed for coding reference
- Add skip_suffixes, skip_files, _should_skip_dir hooks to BaseParser
  for per-parser file filtering without duplicating the walk loop

Benchmark on 34K-file TS monorepo: 22s -> 15s (32% faster)
Path('~/work/cloud').is_dir() returns False because Python's pathlib
does not expand ~ — causing the server to skip indexing entirely and
sit idle in the MCP listen loop without emitting the done: line.
These are internal planning artifacts and fork install instructions
that don't belong in the upstream PR.
Resolve conflicts between PR #8 (C++ namespace support) and PR #9
(PathFilter, file_path scoping, progress reporting). Merged both
namespace and file_path params into get_class_members.
- Remove CppParser.parse_directory override, use BaseParser's os.walk
  with _should_skip_dir for C++-specific dirs (Debug, Release, x64,
  x86, cmake-build-*)
- Remove _SKIP_DIRS from cpp.py (now centralized in PathFilter)
- Apply open()+read() and os.path.relpath to cpp.py for consistency
- Restore .test.ts/.spec.ts skip suffixes in TypeScript parser
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive path filtering system and significant performance optimizations for the indexing process. Key changes include the addition of a PathFilter class to handle default and user-defined exclusions, the integration of progress reporting during full indexing, and the refactoring of language parsers to use os.walk for faster directory traversal. The database layer was also updated to support scoped searches via a new file_path parameter. Feedback focuses on further performance improvements, such as consolidating multiple directory walks, reducing pathlib overhead in hot loops, and addressing code duplication in SQL construction. Additionally, the file counting logic for progress reporting needs to be synchronized with parser-specific exclusion rules to ensure accuracy.

Comment on lines +170 to +176
if file_path:
if file_path.endswith("/"):
conditions.append("r.file_path LIKE ?")
params.append(file_path + "%")
else:
conditions.append("(r.file_path = ? OR r.file_path LIKE ?)")
params.extend([file_path, file_path + "/%"])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic for constructing the file_path SQL condition is duplicated here and in get_class_members (lines 212-218). Consider extracting this into a shared utility function to ensure consistency and reduce maintenance overhead.

d for d in dirs
if not path_filter.is_dir_excluded_name(d)
and not self._should_skip_dir(d)
and not path_filter.is_dir_excluded(Path(os.path.join(root, d)))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating a Path object for every directory in the walk (Path(os.path.join(root, d))) introduces overhead that the class documentation explicitly aims to avoid. Furthermore, path_filter.is_dir_excluded performs I/O (checking for .git files) on every directory, which is costly. Consider using string-based checks and caching the results of the .git file detection to avoid redundant disk access.

if not self._globs:
return False
try:
rel = str(path.relative_to(self._root)).replace("\\", "/")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using path.relative_to(self._root) inside a tight loop for every file is inefficient. Since the walk already provides the current root and filename, string-based path joining and slicing would be significantly faster and more consistent with the performance goals stated in the parser base class.

- Extract _walk_files from parse_directory so _count_files can reuse
  the same filter logic (skip_suffixes, skip_files, _should_skip_dir)
- Collect mtimes in on_progress callback instead of a separate walk
- _index_full now does 2 walks (count + parse) instead of 3
@Codeturion Codeturion self-assigned this Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants