perf(mappers): Optimize stream map expression evaluation#3565
perf(mappers): Optimize stream map expression evaluation#3565edgarrmondragon wants to merge 2 commits into
Conversation
Reviewer's GuideOptimizes the mapper’s simpleeval-based expression evaluation by reusing a prebuilt names dict per record, precomputing static parts at init time, and caching constant expression results, thereby reducing per-field overhead in stream map transforms. Sequence diagram for optimized record transform evaluationsequenceDiagram
actor Caller
participant Mapper
participant transform_fn
participant _build_eval_names
participant _eval
participant _MapperEval as expr_evaluator
Caller->>Mapper: transform(record)
activate Mapper
Mapper->>transform_fn: call with record
activate transform_fn
transform_fn->>_build_eval_names: _build_eval_names(record)
activate _build_eval_names
_build_eval_names-->>transform_fn: names (static entries + record + aliases)
deactivate _build_eval_names
loop for each mapped_property
alt cached_value is available
transform_fn->>transform_fn: use cached_value
else cached_value is _UNSET and expression is dynamic
transform_fn->>_eval: _eval(expr, expr_parsed, record, property_name, names)
activate _eval
alt names is None
_eval->>_build_eval_names: _build_eval_names(record)
_build_eval_names-->>_eval: names
end
_eval->>_eval: update names["self"] based on property_name
_eval->>_MapperEval: set expr_evaluator.names = names
_eval->>_MapperEval: eval(expr, previously_parsed)
activate _MapperEval
_MapperEval-->>_eval: result
deactivate _MapperEval
_eval-->>transform_fn: result
deactivate _eval
transform_fn->>transform_fn: set result[property_name] = result
end
end
transform_fn-->>Mapper: transformed_record
deactivate transform_fn
Mapper-->>Caller: transformed_record
deactivate Mapper
Class diagram for mapper evaluation and constant expression cachingclassDiagram
class Mapper {
- map_config: dict
- stream_alias: str
- faker_config: dict
- stream_name: str
- expr_evaluator: _MapperEval
- fake: Faker
- _static_eval_names: dict~str, any~
- _transform_fn: callable
- _filter_fn: callable
+ __init__(map_config: dict, stream_alias: str, faker_config: dict, stream_name: str)
+ transform(record: dict) dict
+ get_filter_result(record: dict) bool
+ functions() FunctionsDict
- _build_eval_names(record: dict) dict
- _is_constant_expression(parsed_expr: ast_Expr) bool
- _eval(expr: str, expr_parsed: ast_Expr, record: dict, property_name: str, names: dict) any
- _init_functions_and_schema(stream_map: dict) tuple
}
class _MapperEval {
- functions: FunctionsDict
- names: dict
+ __init__(functions: FunctionsDict)
+ eval(expr: str, previously_parsed: ast_Expr) any
}
class FunctionsDict {
<<typealias>>
dict~str, callable_or_module_wrapper~
}
class ConstantSentinel {
<<value>>
_UNSET: object
}
class transform_fn {
<<closure>>
+ __call__(record: dict) dict
}
Mapper --> _MapperEval : uses expr_evaluator
Mapper --> FunctionsDict : returns functions
Mapper --> ConstantSentinel : uses _UNSET
Mapper o-- transform_fn : defines
transform_fn --> Mapper : calls _build_eval_names
transform_fn --> Mapper : calls _eval
_MapperEval --> FunctionsDict : configured with
File-Level Changes
Assessment against linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3565 +/- ##
=======================================
Coverage 93.73% 93.73%
=======================================
Files 73 73
Lines 5890 5890
Branches 723 723
=======================================
Hits 5521 5521
Misses 274 274
Partials 95 95
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Merging this PR will improve performance by 36.58%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ⚡ | test_bench_simple_map_transforms |
547.4 ms | 400.8 ms | +36.58% |
Comparing 3561-simpleeval-ModuleWrapper (e1f7a83) with main (33a6eb8)
0c6c5b7 to
0b576cc
Compare
Downstream in meltano/sdk#3565 (comment), we noticed a performance regression in the 1.0.5 release of `simpleeval`. The root problem seems to be that 1. there too many redundant instance checks for safe primitive types 2. `is_hashable` has try/except overhead. The fix is for 1 is to implement a fast path for simple types. The fix for 2 is to replace `is_hashable` with [`callable`](https://docs.python.org/3/library/functions.html#callable), which achieves the same purpose in this context. Signed-off-by: Edgar Ramírez Mondragón <edgarrm358@gmail.com>
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
2708023 to
558f487
Compare
This comment was marked as outdated.
This comment was marked as outdated.
6693fd7 to
33cd0d9
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
d5162d2 to
c212be3
Compare
simpleeval>=1.0.5 and use ModuleWrapper for modules exposed to stream maps expressionsc212be3 to
5b3b3b6
Compare
|
This PR now focuses entirely on performance optimizations, after the compatibility fix was implemented in I had Claude split the 4 performance optimizations into separate commits. I might open individual PRs for those, but the most urgent item of fixing compatibility while not regressing on performance has been addressed. |
ef44566 to
79ea4b3
Compare
The names dict passed to simpleeval contains entries that never change across records (config, __stream_name__, __original_stream_name__, fake). Previously these were assembled from scratch inside _eval on every single property of every record. This commit makes two related improvements: 1. Build the static portion once at __init__ time into _static_eval_names. _build_eval_names(record) copies that base dict and merges only the record-level fields. _eval now calls _build_eval_names instead of performing the inline assembly. 2. In transform_fn, call _build_eval_names(record) once per record and pass the resulting dict to every _eval call for that record via a new names kwarg. _eval only patches the "self" key (the current property's original value) before each simpleeval call. The evaluator and faker are also moved before _init_functions_and_schema so _static_eval_names can be fully populated before any transform closure is created. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduce _is_constant_expression, which inspects the parsed AST of a mapping expression. If every Name node refers to a known function (and therefore not a record field), the expression is evaluated once during _init_functions_and_schema. Its result is stored alongside the parsed entry in stream_map_parsed and reused verbatim for every subsequent record, bypassing simpleeval entirely for those fields. Also add type annotations to the mapper benchmark test fixtures. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
79ea4b3 to
e1f7a83
Compare
There was a problem hiding this comment.
Hey - I've found 2 issues, and left some high level feedback:
- In
__init__,self.fakeis initialized twice (once before building_static_eval_namesand once after_init_functions_and_schema), so_static_eval_namesmay capture a different faker instance than the one ultimately used; consider initializingself.fakeonce before building_static_eval_namesand reusing it. - The
_is_constant_expressionheuristic treats anyast.Namematching a function name as record-independent, but this breaks if a record field shadows a function name (e.g. a fieldjsonorfoowhere a functionfooalso exists), since the expression will be pre-evaluated using the function instead of the record value; consider checking for potential name shadowing or tightening the criteria for what counts as a constant expression.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `__init__`, `self.fake` is initialized twice (once before building `_static_eval_names` and once after `_init_functions_and_schema`), so `_static_eval_names` may capture a different faker instance than the one ultimately used; consider initializing `self.fake` once before building `_static_eval_names` and reusing it.
- The `_is_constant_expression` heuristic treats any `ast.Name` matching a function name as record-independent, but this breaks if a record field shadows a function name (e.g. a field `json` or `foo` where a function `foo` also exists), since the expression will be pre-evaluated using the function instead of the record value; consider checking for potential name shadowing or tightening the criteria for what counts as a constant expression.
## Individual Comments
### Comment 1
<location path="singer_sdk/mapper.py" line_range="435-444" />
<code_context>
+ names["record"] = record
+ return names
+
+ def _is_constant_expression(self, parsed_expr: ast.Expr) -> bool:
+ """Return True if the expression does not reference any record variables.
+
+ Expressions that only reference function names (from ``self.functions``) and
+ literal constants can be pre-evaluated once at init time and reused for every
+ record, avoiding repeated simpleeval overhead.
+
+ Args:
+ parsed_expr: Parsed AST node of the expression to check.
+
+ Returns:
+ True if the expression is a compile-time constant.
+ """
+ function_names = self.expr_evaluator.functions.keys()
+ return all(
+ node.id in function_names
</code_context>
<issue_to_address>
**issue (bug_risk):** The constant-expression detection assumes all registered functions are pure, which can break semantics for non-pure helpers like faker.
The current logic treats any expression whose `ast.Name` nodes are all in `self.expr_evaluator.functions` as a compile-time constant, so `_init_functions_and_schema` pre-evaluates it once and reuses the result for all records.
For non‑pure helpers (e.g. `faker.name()` or any random/time-based function exposed via `functions`), this changes behavior from “evaluate per record” to “evaluate once at init,” which is a silent and surprising semantic change.
To avoid this, consider either:
- Limiting constant folding to a small allowlist of known pure, deterministic helpers, or
- Treating function calls as non-constant by default and requiring explicit opting-in for helpers that are safe to pre-evaluate.
Otherwise, any new non-pure helper added to `functions` will accidentally become a candidate for pre-evaluation.
</issue_to_address>
### Comment 2
<location path="singer_sdk/mapper.py" line_range="672-680" />
<code_context>
raise MapExpressionError(msg) from ex
+ # Pre-evaluate expressions whose result doesn't depend on the record.
+ cached: t.Any = _UNSET
+ if self._is_constant_expression(parsed_def):
+ self.expr_evaluator.names = {}
+ cached = self.expr_evaluator.eval(
+ prop_def,
+ previously_parsed=parsed_def,
+ )
+
+ stream_map_parsed.append((prop_key, prop_def, parsed_def, cached))
+
else:
</code_context>
<issue_to_address>
**issue (bug_risk):** Pre-evaluated expression values are reused across records, which can cause shared mutable state for expressions like lists or dicts.
When `_is_constant_expression` is `True`, the evaluated result is stored in `cached` at init and reused for every record. For immutable results this is fine, but for mutable ones (e.g. `[]`, `{}`, `{"a": []}`) this changes behavior: instead of a fresh object per record, all records now share the same instance, so any later mutation affects all records. To keep the optimization without changing semantics, either restrict constant folding to known-immutable values or return a `copy.deepcopy(cached_value)` (or equivalent) per record instead of the cached object itself.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def _is_constant_expression(self, parsed_expr: ast.Expr) -> bool: | ||
| """Return True if the expression does not reference any record variables. | ||
|
|
||
| Expressions that only reference function names (from ``self.functions``) and | ||
| literal constants can be pre-evaluated once at init time and reused for every | ||
| record, avoiding repeated simpleeval overhead. | ||
|
|
||
| Args: | ||
| parsed_expr: Parsed AST node of the expression to check. | ||
|
|
There was a problem hiding this comment.
issue (bug_risk): The constant-expression detection assumes all registered functions are pure, which can break semantics for non-pure helpers like faker.
The current logic treats any expression whose ast.Name nodes are all in self.expr_evaluator.functions as a compile-time constant, so _init_functions_and_schema pre-evaluates it once and reuses the result for all records.
For non‑pure helpers (e.g. faker.name() or any random/time-based function exposed via functions), this changes behavior from “evaluate per record” to “evaluate once at init,” which is a silent and surprising semantic change.
To avoid this, consider either:
- Limiting constant folding to a small allowlist of known pure, deterministic helpers, or
- Treating function calls as non-constant by default and requiring explicit opting-in for helpers that are safe to pre-evaluate.
Otherwise, any new non-pure helper added to functions will accidentally become a candidate for pre-evaluation.
| cached: t.Any = _UNSET | ||
| if self._is_constant_expression(parsed_def): | ||
| self.expr_evaluator.names = {} | ||
| cached = self.expr_evaluator.eval( | ||
| prop_def, | ||
| previously_parsed=parsed_def, | ||
| ) | ||
|
|
||
| stream_map_parsed.append((prop_key, prop_def, parsed_def, cached)) |
There was a problem hiding this comment.
issue (bug_risk): Pre-evaluated expression values are reused across records, which can cause shared mutable state for expressions like lists or dicts.
When _is_constant_expression is True, the evaluated result is stored in cached at init and reused for every record. For immutable results this is fine, but for mutable ones (e.g. [], {}, {"a": []}) this changes behavior: instead of a fresh object per record, all records now share the same instance, so any later mutation affects all records. To keep the optimization without changing semantics, either restrict constant folding to known-immutable values or return a copy.deepcopy(cached_value) (or equivalent) per record instead of the cached object itself.
Summary
After the simpleeval compatibility fix was merged in #3595, this PR contains stream map performance optimizations — a 34% improvement on
test_bench_simple_map_transforms(547 ms → 407 ms).Optimizations
1. Pre-build static
namesdict at init timeEntries that never change across records —
config,__stream_name__,__original_stream_name__, andfake— are assembled once in__init__into_static_eval_names. Previously these were re-inserted into a fresh dict inside_evalon every single property of every single record.2. Build the
namesdict once per record, not once per fieldThe
transform_fnclosure now calls_build_eval_names(record)once and passes the result to every_evalcall for that record._evalonly patches theselfkey (the current property's original value) before each evaluation. Previously a fullrecord.copy()plus several dict assignments happened inside_evalfor every transformed property.3. Pre-evaluate constant expressions at init time
_is_constant_expressioninspects the parsed AST of each mapping expression. If allNamenodes refer to known function names (and therefore not to record variables), the expression is evaluated once during_init_functions_and_schemaand its result is cached. For every subsequent record the cached value is used directly, bypassing simpleeval entirely.Related
simpleeval1.0.5+ by implementingModuleWrapperfor modules exposed to stream maps expressions #3561_check_disallowed_itemsoverhead danthedeckie/simpleeval#176simpleeval1.0.5+ by subclassingsimpleeval.EvalWithCompoundTypes#3595 (simpleeval 1.0.5 compatibility, already merged tomain)__alias__expressions #3606 (debug-log removal, merged separately — no measured performance impact per CodSpeed)Summary by Sourcery
Optimize stream map expression evaluation in the mapper for improved performance while preserving existing behavior.
Enhancements: