Various fixes for speed, flexibility, and reporting status#9
Various fixes for speed, flexibility, and reporting status#9michael-howell-island wants to merge 25 commits intoCodeturion:masterfrom
Conversation
…r into server indexing
…embers, clean up docstring
rglob() cannot skip directories mid-walk — it traverses the entire tree (including all git worktrees) before the path_filter check runs. With 16 worktrees the indexer was doing ~17x the necessary work, causing 1000+s startup times instead of ~2 minutes. Switch all six walk sites (base.py, server.py x2, typescript.py, python_parser.py, go.py, java.py) to os.walk() with in-place dirs[:] pruning. Excluded directories (.worktrees, submodules, _SKIP_DIRS) are now dropped before descent so os.walk never enters them.
…orktree-support feat: file filtering, worktree exclusions, and query-time file_path scoping
Also forward on_progress in TypeScriptParser's parse_directory override. The callback is invoked after each successful parse_file call, passing the Path of the parsed file; skipped files and parse errors do not trigger it.
…ertions - Remove `print(summary, file=sys.stderr)` from `main()` after calling `_index_full()`, since `_index_full` already prints the done line itself. - Strengthen `test_index_full_emits_progress_to_stderr` to also assert that the scanning line and 0% baseline indexing line appear in stderr.
…le level Move on_progress callback to a finally block in all five parsers so it fires whether parse_file succeeds or raises. Previously, unparseable files were counted by _count_files but never triggered on_progress, causing the startup progress display to stall below 100% on repos with problematic files. Also move `import sys` from inside except blocks to module-level imports in typescript.py, python_parser.py, go.py, and java.py.
- Add _DEFAULT_EXCLUDED_DIRS to PathFilter (node_modules, .git, dist, build, vendor, .nx, .yarn, etc.) so vendored dirs are always skipped - Rewrite detect_languages to use os.walk with PathFilter instead of rglob, which crawled into node_modules and hung indefinitely - Consolidate duplicate parse_directory from all 4 parsers into BaseParser with str ops instead of pathlib (saves ~3s on 30K files) - Rewrite _count_braces_and_parens: regex strip + str.count with fast path for lines without strings/comments (saves ~4s on 30K files) - Replace Path.read_text with open()+read(), path.relative_to with os.path.relpath, and _file_to_module with pure string ops - Remove per-parser _SKIP_DIRS (now centralized in PathFilter) - Remove __tests__/__mocks__/test/spec file exclusions from TS parser so test files are indexed for coding reference - Add skip_suffixes, skip_files, _should_skip_dir hooks to BaseParser for per-parser file filtering without duplicating the walk loop Benchmark on 34K-file TS monorepo: 22s -> 15s (32% faster)
Path('~/work/cloud').is_dir() returns False because Python's pathlib
does not expand ~ — causing the server to skip indexing entirely and
sit idle in the MCP listen loop without emitting the done: line.
These are internal planning artifacts and fork install instructions that don't belong in the upstream PR.
There was a problem hiding this comment.
Code Review
This pull request implements path filtering, worktree support, and real-time indexing progress reporting. It introduces a PathFilter class to handle default exclusions and user-defined patterns via .codesurfaceignore or the --exclude flag, while optimizing directory traversal using os.walk. Additionally, search tools now support file_path scoping. Feedback suggests correcting an inconsistency in the README.md examples and refactoring duplicated SQL filtering logic into a shared helper function to enhance maintainability.
I am having trouble creating individual review comments. Click here to see my feedback.
README.md (143-144)
The command in this example is inconsistent with the instructions provided earlier. On line 48, the documentation states to use codesurface as the command directly after installation. However, this example uses uvx with codesurface as an argument. To avoid confusion, please update this example to be consistent with the installation instructions.
"command": "codesurface",
"args": ["--project", "/path/to/src", "--exclude", "tests/**,**/*.generated.ts"]
src/codesurface/server.py (371-379)
This logic for building the file_path filter clause and parameters is duplicated in db.py within the search and get_class_members functions. To improve maintainability and reduce code duplication, consider extracting this logic into a helper function in db.py. This helper could return the SQL clause string and the list of parameters, which can then be used in all three locations.
|
Thanks for this — the PathFilter and progress reporting are great additions. I merged it with conflict resolution against #8 (C++ parser) and also aligned cpp.py to use the consolidated BaseParser.parse_directory. One question: the .test.ts/.spec.ts files were removed from the skip list — was that intentional for your workflow? I'm leaning toward keeping them excluded since test files can add noise to API lookups, but curious about your thinking there. |
|
Yes, removing .test.ts/.spec.ts from the skip list was intentional. The reasoning:
That said, I could see an argument for a config option (e.g., --exclude-tests) or a .codesurfaceignore pattern for repos where test files genuinely add too much noise. Happy to add that if you think it's worth it, but I'd default to including them. |
|
As a follow-up I'm thinking of adding an includeTests boolean (default false) on the search/get_class tools so test files are excluded from results by default but easily pulled in when you need them. |
Merge upstream/master which adds C++ header parsing with access specifier tracking, namespace disambiguation for get_class, and various C++ parser bug fixes. Resolves conflicts by combining our file_path scoping with upstream's namespace filtering in get_class_members.
Add include_tests boolean (default false) to search, get_signature, and get_class MCP tools so test files are excluded from results by default but easily pulled in when needed. Test file detection covers common conventions: - Directory patterns: __tests__/, tests/, test/ - Filename patterns: .test., .spec., _test., test_ Also adds .js/.jsx to TypeScript parser file_extensions so plain JavaScript files are indexed alongside TypeScript.
No description provided.