perf(mappers): Optimize stream map expression evaluation#3565

Closed
edgarrmondragon wants to merge 2 commits into main from 3561-simpleeval-ModuleWrapper

Conversation

@edgarrmondragon
Collaborator

@edgarrmondragon edgarrmondragon commented Mar 13, 2026

Summary

Now that the simpleeval compatibility fix has been merged in #3595, this PR contains only stream map performance optimizations. On test_bench_simple_map_transforms they give a 34% speedup (547 ms → 407 ms, which is about 26% less run time).

Optimizations

1. Pre-build static names dict at init time

Entries that never change across records — config, __stream_name__, __original_stream_name__, and fake — are assembled once in __init__ into _static_eval_names. Previously these were re-inserted into a fresh dict inside _eval on every single property of every single record.
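
A minimal sketch of this idea, assuming a hypothetical `MapperSketch` class rather than the SDK's actual mapper. The `fake` entry is omitted so the example has no Faker dependency, and the merge order of static entries versus record fields is an assumption:

```python
from __future__ import annotations

from typing import Any


class MapperSketch:
    """Hypothetical stand-in for the mapper; not the SDK's real class."""

    def __init__(
        self,
        config: dict[str, Any],
        stream_alias: str,
        stream_name: str,
    ) -> None:
        # Assembled once: none of these entries depend on the record.
        self._static_eval_names: dict[str, Any] = {
            "config": config,
            "__stream_name__": stream_alias,
            "__original_stream_name__": stream_name,
        }

    def _build_eval_names(self, record: dict[str, Any]) -> dict[str, Any]:
        # Copy the static base, then layer the record-level entries on top.
        names = {**self._static_eval_names, **record}
        names["_"] = record
        names["record"] = record
        return names


mapper = MapperSketch({"prefix": "x"}, "users_alias", "users")
names = mapper._build_eval_names({"id": 7})
```

The static dict is never mutated per record; each call returns a fresh dict, so entries from one record cannot leak into the next.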

2. Build the names dict once per record, not once per field

The transform_fn closure now calls _build_eval_names(record) once and passes the result to every _eval call for that record. _eval only patches the self key (the current property's original value) before each evaluation. Previously a full record.copy() plus several dict assignments happened inside _eval for every transformed property.
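
The per-record reuse can be sketched roughly as follows. Here `Expr` is a hypothetical stand-in (a plain callable taking the names dict) for a parsed simpleeval expression, and setting `self` to `None` for properties missing from the record is an assumption, not confirmed behavior:

```python
from __future__ import annotations

from typing import Any, Callable

Names = dict  # names dict handed to the evaluator
Expr = Callable[[dict], Any]  # hypothetical stand-in for a parsed expression


def build_eval_names(static: dict[str, Any], record: dict[str, Any]) -> dict[str, Any]:
    """Merge static entries and record fields into one names dict."""
    names = {**static, **record}
    names["record"] = record
    return names


def transform(
    static: dict[str, Any],
    stream_map: dict[str, Expr],
    record: dict[str, Any],
) -> dict[str, Any]:
    # Built once per record, not once per mapped property.
    names = build_eval_names(static, record)
    result = dict(record)
    for prop, expr in stream_map.items():
        # Only the "self" entry differs between properties.
        names["self"] = record.get(prop)
        result[prop] = expr(names)
    return result


out = transform(
    {"config": {"suffix": "!"}},
    {
        "name": lambda n: n["self"].upper() + n["config"]["suffix"],
        "double_id": lambda n: n["id"] * 2,
    },
    {"id": 2, "name": "ab"},
)
```

With N mapped properties, this does one dict build plus N single-key updates per record instead of N full dict builds.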

3. Pre-evaluate constant expressions at init time

_is_constant_expression inspects the parsed AST of each mapping expression. If all Name nodes refer to known function names (and therefore not to record variables), the expression is evaluated once during _init_functions_and_schema and its result is cached. For every subsequent record the cached value is used directly, bypassing simpleeval entirely.
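
A standalone sketch of that check using only the standard-library `ast` module. The `functions` set below is a hypothetical example, not the SDK's actual list of exposed functions:

```python
import ast


def is_constant_expression(parsed: ast.AST, function_names: set) -> bool:
    """Return True if every Name node refers to a known function name."""
    return all(
        node.id in function_names
        for node in ast.walk(parsed)
        if isinstance(node, ast.Name)
    )


functions = {"str", "int", "md5"}  # hypothetical set of exposed function names

constant = is_constant_expression(ast.parse("str(1) + 'x'", mode="eval"), functions)
dynamic = is_constant_expression(ast.parse("str(record_id)", mode="eval"), functions)
uses_config = is_constant_expression(ast.parse("config['x']", mode="eval"), functions)
```

Here `constant` is `True`, while `dynamic` and `uses_config` are `False`, because `record_id` and `config` are names that are not in the function set.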

Related

Summary by Sourcery

Optimize stream map expression evaluation in the mapper for improved performance while preserving existing behavior.

Enhancements:

  • Precompute a static evaluation context for config and stream metadata and reuse it across record evaluations.
  • Build the expression evaluator names dictionary once per record and reuse it for all property transformations within that record.
  • Introduce detection and pre-evaluation of constant mapping expressions so their results can be reused without re-invoking the expression evaluator.
  • Extend the mapper function type aliasing to support simpleeval module wrappers alongside callables.

@sourcery-ai
Contributor

sourcery-ai Bot commented Mar 13, 2026

Reviewer's Guide

Optimizes the mapper’s simpleeval-based expression evaluation by reusing a prebuilt names dict per record, precomputing static parts at init time, and caching constant expression results, thereby reducing per-field overhead in stream map transforms.

Sequence diagram for optimized record transform evaluation

sequenceDiagram
    actor Caller
    participant Mapper
    participant transform_fn
    participant _build_eval_names
    participant _eval
    participant _MapperEval as expr_evaluator

    Caller->>Mapper: transform(record)
    activate Mapper
    Mapper->>transform_fn: call with record
    activate transform_fn

    transform_fn->>_build_eval_names: _build_eval_names(record)
    activate _build_eval_names
    _build_eval_names-->>transform_fn: names (static entries + record + aliases)
    deactivate _build_eval_names

    loop for each mapped_property
        alt cached_value is available
            transform_fn->>transform_fn: use cached_value
        else cached_value is _UNSET and expression is dynamic
            transform_fn->>_eval: _eval(expr, expr_parsed, record, property_name, names)
            activate _eval
            alt names is None
                _eval->>_build_eval_names: _build_eval_names(record)
                _build_eval_names-->>_eval: names
            end
            _eval->>_eval: update names["self"] based on property_name
            _eval->>_MapperEval: set expr_evaluator.names = names
            _eval->>_MapperEval: eval(expr, previously_parsed)
            activate _MapperEval
            _MapperEval-->>_eval: result
            deactivate _MapperEval
            _eval-->>transform_fn: result
            deactivate _eval
            transform_fn->>transform_fn: set result[property_name] = result
        end
    end

    transform_fn-->>Mapper: transformed_record
    deactivate transform_fn
    Mapper-->>Caller: transformed_record
    deactivate Mapper

Class diagram for mapper evaluation and constant expression caching

classDiagram
    class Mapper {
        - map_config: dict
        - stream_alias: str
        - faker_config: dict
        - stream_name: str
        - expr_evaluator: _MapperEval
        - fake: Faker
        - _static_eval_names: dict~str, any~
        - _transform_fn: callable
        - _filter_fn: callable
        + __init__(map_config: dict, stream_alias: str, faker_config: dict, stream_name: str)
        + transform(record: dict) dict
        + get_filter_result(record: dict) bool
        + functions() FunctionsDict
        - _build_eval_names(record: dict) dict
        - _is_constant_expression(parsed_expr: ast_Expr) bool
        - _eval(expr: str, expr_parsed: ast_Expr, record: dict, property_name: str, names: dict) any
        - _init_functions_and_schema(stream_map: dict) tuple
    }

    class _MapperEval {
        - functions: FunctionsDict
        - names: dict
        + __init__(functions: FunctionsDict)
        + eval(expr: str, previously_parsed: ast_Expr) any
    }

    class FunctionsDict {
        <<typealias>>
        dict~str, callable_or_module_wrapper~
    }

    class ConstantSentinel {
        <<value>>
        _UNSET: object
    }

    class transform_fn {
        <<closure>>
        + __call__(record: dict) dict
    }

    Mapper --> _MapperEval : uses expr_evaluator
    Mapper --> FunctionsDict : returns functions
    Mapper --> ConstantSentinel : uses _UNSET
    Mapper o-- transform_fn : defines
    transform_fn --> Mapper : calls _build_eval_names
    transform_fn --> Mapper : calls _eval
    _MapperEval --> FunctionsDict : configured with

File-Level Changes

Change: Introduce reusable types, sentinels, and evaluator instances to support optimized expression handling.
Details:
  • Add FunctionsDict type alias for the mapper functions dictionary to support both callables and simpleeval.ModuleWrapper instances.
  • Introduce a module-level _UNSET sentinel object to distinguish between unset and falsy cached values.
  • Instantiate _MapperEval once in __init__ and store it on self.expr_evaluator instead of constructing it in _init_functions_and_schema.
Files: singer_sdk/mapper.py

Change: Pre-build and reuse static and record-level evaluation context (names dict) across property evaluations.
Details:
  • Initialize a static _static_eval_names dict in __init__ containing config, stream alias/original name, and optional fake instance.
  • Add _build_eval_names(record) to merge _static_eval_names with a record and set the _ and record aliases, returning a fresh names dict per record.
  • Update _eval to optionally accept a pre-built names dict, building one via _build_eval_names only when none is supplied, and to manage the self entry per property.
Files: singer_sdk/mapper.py

Change: Pre-parse and optionally pre-evaluate constant mapping expressions during schema initialization, caching their value for reuse at transform time.
Details:
  • Extend stream_map_parsed tuples to include a cached_value slot, initialized to _UNSET for dynamic/non-expression mappings.
  • Introduce _is_constant_expression(parsed_expr) that treats expressions whose Name nodes all refer to known function names as constant with respect to records.
  • During _init_functions_and_schema, parse string expressions, detect constant expressions with _is_constant_expression, and when constant, evaluate them once with expr_evaluator and store the result in cached_value.
  • Update transform_fn to build a names dict once per record and, when iterating stream_map_parsed, short-circuit to cached_value for constant expressions or call _eval with the shared names dict for dynamic expressions.
Files: singer_sdk/mapper.py
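
The role of the `_UNSET` sentinel can be illustrated with a short sketch. The `resolve` helper is hypothetical and exists only for this example; the point is that an identity check against a unique object keeps legitimately falsy cached results (`None`, `0`, `""`) from being mistaken for "nothing cached":

```python
from typing import Any, Callable

# Unique module-level object; nothing else can be identical to it.
_UNSET = object()


def resolve(cached: Any, evaluate: Callable[[], Any]) -> Any:
    """Return the cached value if one exists, otherwise evaluate."""
    return evaluate() if cached is _UNSET else cached


from_cache_none = resolve(None, lambda: "evaluated")
from_cache_zero = resolve(0, lambda: "evaluated")
from_eval = resolve(_UNSET, lambda: "evaluated")
```

A plain truthiness test (`if cached:`) would wrongly re-evaluate the first two cases.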

Assessment against linked issues

| Issue | Objective | Addressed | Explanation |
| --- | --- | --- | --- |
| #3561 | Support simpleeval 1.0.5+ in stream map expressions by wrapping any exposed modules with simpleeval.ModuleWrapper (e.g., json module) in the mapper expression evaluation logic. | | This PR focuses on performance optimizations for stream map expression evaluation (pre-building names dicts, per-record reuse, and pre-evaluating constant expressions). While it introduces a FunctionsDict type alias that includes simpleeval.ModuleWrapper, the actual use of ModuleWrapper for modules (such as json) is either unchanged from prior code or defined elsewhere (e.g., in PR #3595). No new or changed logic here specifically implements ModuleWrapper to achieve simpleeval 1.0.5+ compatibility, which the PR body itself states was already handled in #3595. |

@codecov

codecov Bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.73%. Comparing base (33a6eb8) to head (e1f7a83).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3565   +/-   ##
=======================================
  Coverage   93.73%   93.73%           
=======================================
  Files          73       73           
  Lines        5890     5890           
  Branches      723      723           
=======================================
  Hits         5521     5521           
  Misses        274      274           
  Partials       95       95           
Flag Coverage Δ
core 82.13% <ø> (ø)
end-to-end 75.50% <ø> (ø)
optional-components 42.83% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


@codspeed-hq

codspeed-hq Bot commented Mar 13, 2026

Merging this PR will improve performance by 36.58%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 1 improved benchmark
✅ 7 untouched benchmarks

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| test_bench_simple_map_transforms | 547.4 ms | 400.8 ms | +36.58% |
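
As a worked check of the figure above: the reported percentage is consistent with measuring the saved time relative to the new (HEAD) run time. The same measurements expressed relative to the old (BASE) run time give a smaller number. The function name below is hypothetical:

```python
def improvement(base_ms: float, head_ms: float) -> tuple:
    """Return (speedup relative to HEAD, time saved relative to BASE), in percent."""
    saved = base_ms - head_ms
    return saved / head_ms * 100, saved / base_ms * 100


speedup_pct, time_saved_pct = improvement(547.4, 400.8)
print(f"{speedup_pct:.2f}% faster, {time_saved_pct:.2f}% less run time")
```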

Comparing 3561-simpleeval-ModuleWrapper (e1f7a83) with main (33a6eb8)

Open in CodSpeed

@edgarrmondragon edgarrmondragon force-pushed the 3561-simpleeval-ModuleWrapper branch from 0c6c5b7 to 0b576cc Compare March 16, 2026 01:02
edgarrmondragon added a commit to meltano/simpleeval that referenced this pull request Mar 16, 2026
Downstream in meltano/sdk#3565 (comment), we noticed a performance regression in the 1.0.5 release of `simpleeval`.

The root problem seems to be that

1. there are too many redundant instance checks for safe primitive types, and
2. `is_hashable` has try/except overhead.

The fix for 1 is to implement a fast path for simple types. The fix for 2 is to replace `is_hashable` with [`callable`](https://docs.python.org/3/library/functions.html#callable), which achieves the same purpose in this context.

Signed-off-by: Edgar Ramírez Mondragón <edgarrm358@gmail.com>
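
A rough illustration of what a "fast path for simple types" can look like in general. This is not simpleeval's code: the function names and the allowlist of types are hypothetical, chosen only to contrast a chain of `isinstance` checks with a single exact-type lookup:

```python
from typing import Any

# Hypothetical allowlist; the real set of safe types may differ.
_SAFE_PRIMITIVES = frozenset({type(None), bool, int, float, str, bytes})


def is_safe_slow(value: Any) -> bool:
    """One isinstance check per candidate type."""
    return any(
        isinstance(value, t)
        for t in (type(None), bool, int, float, str, bytes)
    )


def is_safe_fast(value: Any) -> bool:
    """Fast path: a single exact-type set lookup."""
    return type(value) in _SAFE_PRIMITIVES


class MyInt(int):
    """Subclass used to show where the two checks differ."""
```

The exact-type lookup does not match subclasses (`is_safe_fast(MyInt(1))` is `False`), so a real implementation would still need a fallback path for them.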
@edgarrmondragon edgarrmondragon marked this pull request as ready for review March 18, 2026 00:10
@edgarrmondragon edgarrmondragon requested a review from a team as a code owner March 18, 2026 00:10
sourcery-ai[bot]

This comment was marked as outdated.

@edgarrmondragon

This comment was marked as outdated.

@ReubenFrankel

This comment was marked as outdated.

@ReubenFrankel ReubenFrankel force-pushed the 3561-simpleeval-ModuleWrapper branch from 2708023 to 558f487 Compare March 18, 2026 12:29
@ReubenFrankel

This comment was marked as outdated.

@edgarrmondragon edgarrmondragon force-pushed the 3561-simpleeval-ModuleWrapper branch from 6693fd7 to 33cd0d9 Compare March 18, 2026 17:53
@edgarrmondragon

This comment was marked as outdated.

@ReubenFrankel

This comment was marked as outdated.

@edgarrmondragon edgarrmondragon marked this pull request as draft April 8, 2026 17:39
@edgarrmondragon edgarrmondragon force-pushed the 3561-simpleeval-ModuleWrapper branch from d5162d2 to c212be3 Compare April 14, 2026 22:21
@edgarrmondragon edgarrmondragon changed the title from "fix(taps): Require simpleeval>=1.0.5 and use ModuleWrapper for modules exposed to stream maps expressions" to "perf(mappers): Optimize stream map expression evaluation" Apr 14, 2026
@edgarrmondragon edgarrmondragon force-pushed the 3561-simpleeval-ModuleWrapper branch from c212be3 to 5b3b3b6 Compare April 15, 2026 20:51
@edgarrmondragon
Collaborator Author

This PR now focuses entirely on performance optimizations, after the compatibility fix was implemented in

I had Claude split the 4 performance optimizations into separate commits. I might open individual PRs for those, but the most urgent item of fixing compatibility while not regressing on performance has been addressed.

@edgarrmondragon edgarrmondragon force-pushed the 3561-simpleeval-ModuleWrapper branch 5 times, most recently from ef44566 to 79ea4b3 Compare April 16, 2026 21:34
edgarrmondragon and others added 2 commits April 16, 2026 18:12
The names dict passed to simpleeval contains entries that never change
across records (config, __stream_name__, __original_stream_name__, fake).
Previously these were assembled from scratch inside _eval on every single
property of every record.

This commit makes two related improvements:

1. Build the static portion once at __init__ time into _static_eval_names.
   _build_eval_names(record) copies that base dict and merges only the
   record-level fields.  _eval now calls _build_eval_names instead of
   performing the inline assembly.

2. In transform_fn, call _build_eval_names(record) once per record and
   pass the resulting dict to every _eval call for that record via a new
   names kwarg.  _eval only patches the "self" key (the current
   property's original value) before each simpleeval call.

The evaluator and faker are also moved before _init_functions_and_schema
so _static_eval_names can be fully populated before any transform closure
is created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Introduce _is_constant_expression, which inspects the parsed AST of a
mapping expression.  If every Name node refers to a known function (and
therefore not a record field), the expression is evaluated once during
_init_functions_and_schema.  Its result is stored alongside the parsed
entry in stream_map_parsed and reused verbatim for every subsequent
record, bypassing simpleeval entirely for those fields.

Also add type annotations to the mapper benchmark test fixtures.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@edgarrmondragon edgarrmondragon force-pushed the 3561-simpleeval-ModuleWrapper branch from 79ea4b3 to e1f7a83 Compare April 17, 2026 00:12
@edgarrmondragon edgarrmondragon marked this pull request as ready for review April 17, 2026 00:14
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • In __init__, self.fake is initialized twice (once before building _static_eval_names and once after _init_functions_and_schema), so _static_eval_names may capture a different faker instance than the one ultimately used; consider initializing self.fake once before building _static_eval_names and reusing it.
  • The _is_constant_expression heuristic treats any ast.Name matching a function name as record-independent, but this breaks if a record field shadows a function name (e.g. a field json or foo where a function foo also exists), since the expression will be pre-evaluated using the function instead of the record value; consider checking for potential name shadowing or tightening the criteria for what counts as a constant expression.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `__init__`, `self.fake` is initialized twice (once before building `_static_eval_names` and once after `_init_functions_and_schema`), so `_static_eval_names` may capture a different faker instance than the one ultimately used; consider initializing `self.fake` once before building `_static_eval_names` and reusing it.
- The `_is_constant_expression` heuristic treats any `ast.Name` matching a function name as record-independent, but this breaks if a record field shadows a function name (e.g. a field `json` or `foo` where a function `foo` also exists), since the expression will be pre-evaluated using the function instead of the record value; consider checking for potential name shadowing or tightening the criteria for what counts as a constant expression.

## Individual Comments

### Comment 1
<location path="singer_sdk/mapper.py" line_range="435-444" />
<code_context>
+        names["record"] = record
+        return names
+
+    def _is_constant_expression(self, parsed_expr: ast.Expr) -> bool:
+        """Return True if the expression does not reference any record variables.
+
+        Expressions that only reference function names (from ``self.functions``) and
+        literal constants can be pre-evaluated once at init time and reused for every
+        record, avoiding repeated simpleeval overhead.
+
+        Args:
+            parsed_expr: Parsed AST node of the expression to check.
+
+        Returns:
+            True if the expression is a compile-time constant.
+        """
+        function_names = self.expr_evaluator.functions.keys()
+        return all(
+            node.id in function_names
</code_context>
<issue_to_address>
**issue (bug_risk):** The constant-expression detection assumes all registered functions are pure, which can break semantics for non-pure helpers like faker.

The current logic treats any expression whose `ast.Name` nodes are all in `self.expr_evaluator.functions` as a compile-time constant, so `_init_functions_and_schema` pre-evaluates it once and reuses the result for all records.

For non‑pure helpers (e.g. `faker.name()` or any random/time-based function exposed via `functions`), this changes behavior from “evaluate per record” to “evaluate once at init,” which is a silent and surprising semantic change.

To avoid this, consider either:
- Limiting constant folding to a small allowlist of known pure, deterministic helpers, or
- Treating function calls as non-constant by default and requiring explicit opting-in for helpers that are safe to pre-evaluate.

Otherwise, any new non-pure helper added to `functions` will accidentally become a candidate for pre-evaluation.
</issue_to_address>

### Comment 2
<location path="singer_sdk/mapper.py" line_range="672-680" />
<code_context>
                     raise MapExpressionError(msg) from ex

+                # Pre-evaluate expressions whose result doesn't depend on the record.
+                cached: t.Any = _UNSET
+                if self._is_constant_expression(parsed_def):
+                    self.expr_evaluator.names = {}
+                    cached = self.expr_evaluator.eval(
+                        prop_def,
+                        previously_parsed=parsed_def,
+                    )
+
+                stream_map_parsed.append((prop_key, prop_def, parsed_def, cached))
+
             else:
</code_context>
<issue_to_address>
**issue (bug_risk):** Pre-evaluated expression values are reused across records, which can cause shared mutable state for expressions like lists or dicts.

When `_is_constant_expression` is `True`, the evaluated result is stored in `cached` at init and reused for every record. For immutable results this is fine, but for mutable ones (e.g. `[]`, `{}`, `{"a": []}`) this changes behavior: instead of a fresh object per record, all records now share the same instance, so any later mutation affects all records. To keep the optimization without changing semantics, either restrict constant folding to known-immutable values or return a `copy.deepcopy(cached_value)` (or equivalent) per record instead of the cached object itself.
</issue_to_address>


Comment thread singer_sdk/mapper.py
Comment on lines +435 to +444
def _is_constant_expression(self, parsed_expr: ast.Expr) -> bool:
    """Return True if the expression does not reference any record variables.

    Expressions that only reference function names (from ``self.functions``) and
    literal constants can be pre-evaluated once at init time and reused for every
    record, avoiding repeated simpleeval overhead.

    Args:
        parsed_expr: Parsed AST node of the expression to check.

Contributor

issue (bug_risk): The constant-expression detection assumes all registered functions are pure, which can break semantics for non-pure helpers like faker.

The current logic treats any expression whose ast.Name nodes are all in self.expr_evaluator.functions as a compile-time constant, so _init_functions_and_schema pre-evaluates it once and reuses the result for all records.

For non‑pure helpers (e.g. faker.name() or any random/time-based function exposed via functions), this changes behavior from “evaluate per record” to “evaluate once at init,” which is a silent and surprising semantic change.

To avoid this, consider either:

  • Limiting constant folding to a small allowlist of known pure, deterministic helpers, or
  • Treating function calls as non-constant by default and requiring explicit opting-in for helpers that are safe to pre-evaluate.

Otherwise, any new non-pure helper added to functions will accidentally become a candidate for pre-evaluation.
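
The semantic change described in this comment, and the allowlist mitigation it suggests, can be sketched as follows. `next_id` is a hypothetical non-pure helper and `PURE_FUNCTIONS` is a hypothetical allowlist; neither is part of the SDK:

```python
import ast
import itertools

_counter = itertools.count(1)


def next_id() -> int:
    """Hypothetical non-pure helper: returns a new value on every call."""
    return next(_counter)


records = [{"name": "a"}, {"name": "b"}, {"name": "c"}]

# Per-record evaluation: each record gets its own value.
per_record = [next_id() for _ in records]

# Init-time folding: one call, then the same value reused for every record.
folded_value = next_id()
folded = [folded_value for _ in records]

# Hypothetical allowlist of helpers considered safe to pre-evaluate.
PURE_FUNCTIONS = frozenset({"str", "int", "float", "len"})


def is_foldable(expr: str) -> bool:
    """Fold only when every referenced name is on the allowlist."""
    tree = ast.parse(expr, mode="eval")
    return all(
        node.id in PURE_FUNCTIONS
        for node in ast.walk(tree)
        if isinstance(node, ast.Name)
    )
```

Here `per_record` holds three distinct values while `folded` repeats a single one. With the allowlist, `is_foldable("str(1)")` is `True` and `is_foldable("next_id()")` is `False`, so a newly added helper is evaluated per record unless it is explicitly opted in.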

Comment thread singer_sdk/mapper.py
Comment on lines +672 to +680
cached: t.Any = _UNSET
if self._is_constant_expression(parsed_def):
    self.expr_evaluator.names = {}
    cached = self.expr_evaluator.eval(
        prop_def,
        previously_parsed=parsed_def,
    )

stream_map_parsed.append((prop_key, prop_def, parsed_def, cached))
Contributor

issue (bug_risk): Pre-evaluated expression values are reused across records, which can cause shared mutable state for expressions like lists or dicts.

When _is_constant_expression is True, the evaluated result is stored in cached at init and reused for every record. For immutable results this is fine, but for mutable ones (e.g. [], {}, {"a": []}) this changes behavior: instead of a fresh object per record, all records now share the same instance, so any later mutation affects all records. To keep the optimization without changing semantics, either restrict constant folding to known-immutable values or return a copy.deepcopy(cached_value) (or equivalent) per record instead of the cached object itself.
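
A small self-contained demonstration of the aliasing problem and of the `copy.deepcopy` mitigation mentioned above. The record shape and the `meta` key are made up for the example:

```python
import copy

# Result of a pre-evaluated expression such as {"tags": []}.
cached_value = {"tags": []}

# Reusing the cached object: both output records share one dict and one list.
shared = [{"meta": cached_value} for _ in range(2)]
shared[0]["meta"]["tags"].append("x")

# Copying per record: each output record gets an independent object.
cached_template = {"tags": []}
isolated = [{"meta": copy.deepcopy(cached_template)} for _ in range(2)]
isolated[0]["meta"]["tags"].append("x")
```

After this runs, `shared[1]["meta"]["tags"]` also contains `"x"`, whereas `isolated[1]["meta"]["tags"]` is still empty. The copy adds a per-record cost, so restricting folding to immutable results may preserve more of the speedup.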

@edgarrmondragon edgarrmondragon modified the milestones: v0.54, v0.55 May 12, 2026
@edgarrmondragon edgarrmondragon deleted the 3561-simpleeval-ModuleWrapper branch May 27, 2026 01:51
@edgarrmondragon edgarrmondragon removed this from the v0.55 milestone May 27, 2026

Development

Successfully merging this pull request may close these issues.

chore: Support simpleeval 1.0.5+ by implementing ModuleWrapper for modules exposed to stream maps expressions

2 participants