fix(core): avoid workspace-wide walk in expand_outputs#35464
Open
marco2216 wants to merge 2 commits intonrwl:masterfrom
Open
fix(core): avoid workspace-wide walk in expand_outputs#35464marco2216 wants to merge 2 commits intonrwl:masterfrom
marco2216 wants to merge 2 commits intonrwl:masterfrom
Conversation
When a cached target's `outputs` array contains any glob entry (`*`, `**`), the native `expand_outputs(workspaceRoot, outputs[])` called from `NxCache.put` / `copyFilesFromCache` walked the entire workspace once per call - regardless of how deeply the glob is anchored or how few files it matches. On a 50k-file workspace this is ~55 ms per call; on production workspaces it scales to several seconds per cached task. The sibling `get_files_for_outputs` already partitions globs by their anchored root (via `partition_glob`) and walks only those subtrees, so the underlying primitive is fast. Delegate `expand_outputs`'s glob branch to `get_files_for_outputs` whenever no negated globs are present, preserving negation semantics by keeping the existing workspace-rooted walk in that path. On the same inputs this is ~27x faster for a single-glob entry and constant-time independent of workspace size. Add a regression test that exercises a glob anchored to a deep subdirectory with decoy files outside that subtree, asserting that `expand_outputs` and `get_files_for_outputs` produce the same result and that nothing outside the anchored root leaks in.
👷 Deploy request for nx-docs pending review.Visit the deploys page to approve it
|
👷 Deploy request for nx-dev pending review.Visit the deploys page to approve it
|
Author
|
Hey @FrozenPandaz would you mind having a look at this and giving me some feedback on whether this issue and solution approach makes sense? Ofc. it's not optimal that this issue remains if negated globs are there, so maybe there's a better way. Thank you! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Current Behavior
When a cached target's
outputsarray contains any entry with a glob (*,**), the nativeexpand_outputs(workspaceRoot, outputs[])(called fromNxCache.put/copyFilesFromCachevia_expandOutputsinpackages/nx/src/tasks-runner/cache.ts) walks the entire workspace once per call - regardless of the glob's anchor depth or how few files it matches. Even a glob anchored to a leaf directory containing one file pays the full cost.On a 50k-file workspace (warm OS cache, M-series Mac SSD) this is ~55 ms per call; on production workspaces with several hundred thousand tracked files / build artifacts it scales to multiple seconds per cached task. Concrete file paths and pure directory paths take a fast path inside the same Rust function (<1 ms) - the cost is specifically tied to the workspace-rooted walk triggered by glob entries.
The sibling
get_files_for_outputs(in the same module, exposed viagetFilesForOutputsBatchand already used byoutputs-tracking.ts) already does this correctly: it partitions globs by their anchored root usingpartition_globand walks only those subtrees. So the underlying primitive is already fast on the same inputs.Expected Behavior
expand_outputsshould resolve glob entries without walking the entire workspace, matching the behavior ofget_files_for_outputs. On the same inputs the two functions return the same flat list of resolved files; routing the glob branch throughget_files_for_outputsmakesexpand_outputs~27x faster on globs and constant-time independent of workspace size.Patch
packages/nx/src/native/cache/expand_outputs.rs: in the glob branch of_expand_outputs, when there are no negated globs (the common case), delegate toget_files_for_outputsinstead of building a workspace-rootednx_walker_sync. Negated globs keep the existing workspace-rooted walk so that ignore semantics (!path/to/exclude) are preserved exactly. No JS or napi-rs surface change - all native callers and the JSexpandOutputsbinding benefit transparently.Reproducer & numbers
Standalone reproducer: https://github.com/marco2216/nx-expand-outputs-glob-perf-repro —
pnpm install && bash bench.sh. Default seed is 50k synthetic files, generated locally on first run; scale up viaSEED_FILES=200000 bash bench.shto amplify the bug. nx is pinned to current latest.In a large monorepo we maintain, applying the equivalent change as a workspace-side workaround (removing globs from
outputs) drops our prebuild from 22 s to 9 s serial, and 15 s to 3.7 s parallel (~4x speedup). The reproducer is a synthetic minimum; the real-world win on production codebases is substantially larger.Correctness verification
Ran cache miss -> cache hit cycle for targets with both concrete and glob
outputsconfigs in the reproducer; both restore output files byte-correct on cache hit. The existing Rust test suite for this module (should_expand_outputs,should_handle_multiple_extensions,should_handle_multiple_outputs_with_negation,should_expand_outputs_with_symlinks_and_globs,should_handle_cache_put_and_restore_with_symlinks) was reviewed against the new code path:get_files_for_outputs, which produces the same result set on these tests.nx_walker_syncso!apps/web/.next/cachesemantics are preserved exactly.New test
Added
should_not_walk_outside_glob_anchored_rootinexpand_outputs.rsthat:src/,dist/other-app/,noise/file-{0..49}.txt) that the previous workspace-wide walker would visit,expand_outputs("dist/app/**/*.js")and asserts the result matchesget_files_for_outputson the same inputs and contains exactly the two anchored-root files.This catches both functional drift between the two functions and any future regression that reintroduces a workspace-rooted walk in the glob fast path.
Related Issue(s)
Fixes #