Skip to content

Various fixes for speed, flexibility, and reporting status#9

Open
michael-howell-island wants to merge 25 commits intoCodeturion:masterfrom
michael-howell-island:master
Open

Various fixes for speed, flexibility, and reporting status#9
michael-howell-island wants to merge 25 commits intoCodeturion:masterfrom
michael-howell-island:master

Conversation

@michael-howell-island
Copy link
Copy Markdown

No description provided.

rglob() cannot skip directories mid-walk — it traverses the entire tree
(including all git worktrees) before the path_filter check runs. With 16
worktrees the indexer was doing ~17x the necessary work, causing 1000+s
startup times instead of ~2 minutes.

Switch all six walk sites (base.py, server.py x2, typescript.py,
python_parser.py, go.py, java.py) to os.walk() with in-place dirs[:]
pruning. Excluded directories (.worktrees, submodules, _SKIP_DIRS) are
now dropped before descent so os.walk never enters them.
…orktree-support

feat: file filtering, worktree exclusions, and query-time file_path scoping
Also forward on_progress in TypeScriptParser's parse_directory override.
The callback is invoked after each successful parse_file call, passing
the Path of the parsed file; skipped files and parse errors do not
trigger it.
…ertions

- Remove `print(summary, file=sys.stderr)` from `main()` after calling
  `_index_full()`, since `_index_full` already prints the done line itself.
- Strengthen `test_index_full_emits_progress_to_stderr` to also assert
  that the scanning line and 0% baseline indexing line appear in stderr.
…le level

Move on_progress callback to a finally block in all five parsers so it
fires whether parse_file succeeds or raises. Previously, unparseable files
were counted by _count_files but never triggered on_progress, causing the
startup progress display to stall below 100% on repos with problematic files.

Also move `import sys` from inside except blocks to module-level imports in
typescript.py, python_parser.py, go.py, and java.py.
- Add _DEFAULT_EXCLUDED_DIRS to PathFilter (node_modules, .git, dist,
  build, vendor, .nx, .yarn, etc.) so vendored dirs are always skipped
- Rewrite detect_languages to use os.walk with PathFilter instead of
  rglob, which crawled into node_modules and hung indefinitely
- Consolidate duplicate parse_directory from all 4 parsers into
  BaseParser with str ops instead of pathlib (saves ~3s on 30K files)
- Rewrite _count_braces_and_parens: regex strip + str.count with fast
  path for lines without strings/comments (saves ~4s on 30K files)
- Replace Path.read_text with open()+read(), path.relative_to with
  os.path.relpath, and _file_to_module with pure string ops
- Remove per-parser _SKIP_DIRS (now centralized in PathFilter)
- Remove __tests__/__mocks__/test/spec file exclusions from TS parser
  so test files are indexed for coding reference
- Add skip_suffixes, skip_files, _should_skip_dir hooks to BaseParser
  for per-parser file filtering without duplicating the walk loop

Benchmark on 34K-file TS monorepo: 22s -> 15s (32% faster)
Path('~/work/cloud').is_dir() returns False because Python's pathlib
does not expand ~ — causing the server to skip indexing entirely and
sit idle in the MCP listen loop without emitting the done: line.
These are internal planning artifacts and fork install instructions
that don't belong in the upstream PR.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements path filtering, worktree support, and real-time indexing progress reporting. It introduces a PathFilter class to handle default exclusions and user-defined patterns via .codesurfaceignore or the --exclude flag, while optimizing directory traversal using os.walk. Additionally, search tools now support file_path scoping. Feedback suggests correcting an inconsistency in the README.md examples and refactoring duplicated SQL filtering logic into a shared helper function to enhance maintainability.

I am having trouble creating individual review comments. Click here to see my feedback.

README.md (143-144)

medium

The command in this example is inconsistent with the instructions provided earlier. On line 48, the documentation states to use codesurface as the command directly after installation. However, this example uses uvx with codesurface as an argument. To avoid confusion, please update this example to be consistent with the installation instructions.

      "command": "codesurface",
      "args": ["--project", "/path/to/src", "--exclude", "tests/**,**/*.generated.ts"]

src/codesurface/server.py (371-379)

medium

This logic for building the file_path filter clause and parameters is duplicated in db.py within the search and get_class_members functions. To improve maintainability and reduce code duplication, consider extracting this logic into a helper function in db.py. This helper could return the SQL clause string and the list of parameters, which can then be used in all three locations.

@Codeturion
Copy link
Copy Markdown
Owner

Thanks for this — the PathFilter and progress reporting are great additions. I merged it with conflict resolution against #8 (C++ parser) and also aligned cpp.py to use the consolidated BaseParser.parse_directory.

One question: the .test.ts/.spec.ts files were removed from the skip list — was that intentional for your workflow? I'm leaning toward keeping them excluded since test files can add noise to API lookups, but curious about your thinking there.

Codeturion added a commit that referenced this pull request Apr 2, 2026
Resolve conflicts between PR #8 (C++ namespace support) and PR #9
(PathFilter, file_path scoping, progress reporting). Merged both
namespace and file_path params into get_class_members.
@michael-howell-island
Copy link
Copy Markdown
Author

Yes, removing .test.ts/.spec.ts from the skip list was intentional. The reasoning:

  1. Tests are valuable coding reference — when using codesurface as an MCP tool, test files show real usage patterns and examples of how APIs are called. When I'm writing new tests or trying to understand how a service is used, the test files are often exactly what I need.
  2. file_path scoping handles the noise concern — this same PR added file_path filtering to search, get_signature, and get_class (commit 67fe469). So if test results are noisy for a particular query, you can scope with file_path: "src/" to exclude them at query time without losing them from the index entirely.
  3. Excluding at index time is permanent — once they're excluded from the DB, there's no way to search them without reindexing. Keeping them indexed but filterable at query time gives the user and ai the choice.

That said, I could see an argument for a config option (e.g., --exclude-tests) or a .codesurfaceignore pattern for repos where test files genuinely add too much noise. Happy to add that if you think it's worth it, but I'd default to including them.

@michael-howell-island
Copy link
Copy Markdown
Author

As a follow-up I'm thinking of adding an includeTests boolean (default false) on the search/get_class tools so test files are excluded from results by default but easily pulled in when you need them.

Merge upstream/master which adds C++ header parsing with access specifier
tracking, namespace disambiguation for get_class, and various C++ parser
bug fixes. Resolves conflicts by combining our file_path scoping with
upstream's namespace filtering in get_class_members.
Add include_tests boolean (default false) to search, get_signature, and
get_class MCP tools so test files are excluded from results by default
but easily pulled in when needed.

Test file detection covers common conventions:
- Directory patterns: __tests__/, tests/, test/
- Filename patterns: .test., .spec., _test., test_

Also adds .js/.jsx to TypeScript parser file_extensions so plain
JavaScript files are indexed alongside TypeScript.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants