Pipeline state dump and load #352

NathalieCharbel · 2025-06-04T10:18:13Z

Description

This PR introduces state management capabilities to the Pipeline class, enabling:

Checkpointing pipeline execution at specific components using `pipeline.run(..., until='component_x', ...)`
Resuming pipeline execution from saved states using `pipeline.run(..., from_='component_y', ...)`
Dumping/loading pipeline state using `dump_state(pipeline_run_id)` and `load_state(state)`.

The state includes results from previous runs. This feature is particularly useful for:

Debugging long-running pipelines
Recovering from failures
Comparing component implementations with deterministic inputs
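
The checkpoint/resume idea described above can be sketched with a toy linear pipeline. This is only an illustration under stated assumptions: `ToyPipeline`, its `steps` list, and the JSON state layout are hypothetical and are not the `neo4j_graphrag` `Pipeline` implementation; only the parameter names `until` and `from_` and the method names `dump_state` / `load_state` are borrowed from the description.

```python
import json
from typing import Any, Callable, Optional


class ToyPipeline:
    """Hypothetical linear pipeline; NOT the neo4j_graphrag Pipeline class."""

    def __init__(self, steps: list[tuple[str, Callable[[Any], Any]]]) -> None:
        self.steps = steps
        self.results: dict[str, Any] = {}  # component name -> saved output

    def run(
        self,
        data: Any = None,
        until: Optional[str] = None,
        from_: Optional[str] = None,
    ) -> Any:
        names = [name for name, _ in self.steps]
        start = names.index(from_) if from_ else 0
        if start > 0:
            # Resuming: reuse the saved output of the preceding component.
            data = self.results[names[start - 1]]
        for name, func in self.steps[start:]:
            data = func(data)
            self.results[name] = data
            if name == until:
                break  # checkpoint: stop once this component has run
        return data

    def dump_state(self) -> str:
        # Assumes every component output is JSON-serializable.
        return json.dumps(self.results)

    def load_state(self, state: str) -> None:
        self.results = json.loads(state)


steps = [
    ("split", lambda text: text.split()),
    ("upper", lambda words: [w.upper() for w in words]),
    ("join", lambda words: "-".join(words)),
]

# First run stops after "upper" and dumps what it has computed so far.
first = ToyPipeline(steps)
first.run("a b c", until="upper")
saved = first.dump_state()

# A fresh pipeline loads that state and resumes at "join".
second = ToyPipeline(steps)
second.load_state(saved)
resumed = second.run(from_="join")
```

Because the resumed run reads the stored output of `upper` instead of recomputing it, the `join` component can be swapped or debugged against identical inputs, which is the "deterministic inputs" use case listed above.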

Type of Change

Complexity

Complexity: high

How Has This Been Tested?

Unit tests
E2E tests
Manual tests

Checklist

The following requirements should have been met (depending on the changes in the branch):

Documentation has been updated
Unit tests have been updated
E2E tests have been updated
Examples have been updated
New files have copyright header
CLA (https://neo4j.com/developer/cla/) has been signed
CHANGELOG.md updated if appropriate

docs/source/user_guide_pipeline.rst

src/neo4j_graphrag/experimental/pipeline/pipeline.py

tests/unit/experimental/pipeline/components.py

src/neo4j_graphrag/experimental/pipeline/stores.py

stellasia · 2025-06-11T16:18:33Z

src/neo4j_graphrag/experimental/pipeline/stores.py

+        keys_to_remove = [
+            key for key in self._data.keys() if key.startswith(run_id_prefix)
+        ]
+        for key in keys_to_remove:


So here we are removing all results from a previous run with this run_id, right?
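
A minimal sketch of the prefix-based cleanup being asked about, assuming the loop body deletes each matching key (the excerpt above is cut off before that point). `ToyResultStore`, its `add` and `clear_run` methods, and the `"<run_id>:<component>"` key layout are hypothetical and are not taken from `stores.py`.

```python
class ToyResultStore:
    """Hypothetical in-memory store; keys look like '<run_id>:<component>'."""

    def __init__(self) -> None:
        self._data: dict[str, object] = {}

    def add(self, run_id: str, component: str, value: object) -> None:
        self._data[f"{run_id}:{component}"] = value

    def clear_run(self, run_id: str) -> int:
        # Include the separator in the prefix so that clearing "run-1"
        # does not also match keys that belong to "run-10".
        run_id_prefix = f"{run_id}:"
        keys_to_remove = [
            key for key in self._data if key.startswith(run_id_prefix)
        ]
        # Collect first, then delete: a dict must not change size while
        # it is being iterated.
        for key in keys_to_remove:
            del self._data[key]
        return len(keys_to_remove)


store = ToyResultStore()
store.add("run-1", "splitter", ["a", "b"])
store.add("run-1", "embedder", [0.1, 0.2])
store.add("run-10", "splitter", ["c"])
removed = store.clear_run("run-1")
remaining = sorted(store._data)
```

Under these assumptions the answer to the question is yes: every result stored under that run_id is dropped, and results from other runs are left in place.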

stellasia · 2025-06-11T16:25:47Z

src/neo4j_graphrag/experimental/pipeline/pipeline.py

@@ -140,6 +140,7 @@ def __init__(
        }
        """
        self.missing_inputs: dict[str, list[str]] = defaultdict()
+        self._current_run_id: Optional[str] = None


This cannot be saved on the Pipeline instance, since concurrent runs will overwrite it.

I will move it back to the dump() function. I think we should keep creating a new run_id for each run, even when resuming the same pipeline, and dump the state based on the previous ones. We could keep track of the run_ids of the same pipeline in the state. This should resolve the concurrency issue, right?
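
The concern raised in this thread can be reproduced with a small asyncio sketch. Both classes below are hypothetical stand-ins, not the real `Pipeline`: one stores the run id on the instance, as in the diff above, and the other keeps it local to each `run()` call.

```python
import asyncio
import uuid
from typing import Optional


class SharedIdPipeline:
    """Stores the run id on the instance (the pattern flagged in review)."""

    def __init__(self) -> None:
        self._current_run_id: Optional[str] = None

    async def run(self) -> tuple[str, str]:
        started_with = str(uuid.uuid4())
        self._current_run_id = started_with
        await asyncio.sleep(0)  # yield control so a concurrent run can start
        # Return the id this run started with and the id the instance holds now.
        return started_with, str(self._current_run_id)


class LocalIdPipeline:
    """Keeps the run id in a local variable owned by each run() call."""

    async def run(self) -> tuple[str, str]:
        run_id = str(uuid.uuid4())
        started_with = run_id
        await asyncio.sleep(0)
        return started_with, run_id


async def main() -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    shared = SharedIdPipeline()
    local = LocalIdPipeline()
    shared_runs = await asyncio.gather(shared.run(), shared.run())
    local_runs = await asyncio.gather(local.run(), local.run())
    return list(shared_runs), list(local_runs)


shared_runs, local_runs = asyncio.run(main())
```

In the shared variant, the first run comes back from its `await` and finds the id written by the second run, so any state dumped under `self._current_run_id` would be attributed to the wrong run. Keeping the id local to the call, or passing it explicitly, avoids this.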

NathalieCharbel added 6 commits June 4, 2025 11:19

Serialize/deserialize component state

4a93761

Run pipeline until/Resume pipeline from

f940506

Remove in memory storage support for pipeline state

12d2e26

Add pipeline run_id and ability to save and load state from json file

8a00015

Add unit tests

5f0c892

Update changelog and docs

22722b8

NathalieCharbel requested a review from a team as a code owner June 4, 2025 10:18

Ruff

9a57046

stellasia reviewed Jun 9, 2025

NathalieCharbel added 4 commits June 10, 2025 15:45

Remove state management for component

d1f7389

Remove resume_from and run_until and reuse existing run interface

8bc8ac0

Add dump and load to InMemoryStore

f76e5ca

Ruff

2cc5b99

stellasia reviewed Jun 11, 2025

src/neo4j_graphrag/experimental/pipeline/stores.py (outdated, resolved)

NathalieCharbel added 3 commits June 11, 2025 16:44

Add ability to load and dump state by run_id

2a819ea

Allow orchestrator to run use a run_id from previous run

472eaf1

Refactor pipeline and validate loaded state

13254dd

stellasia reviewed Jun 11, 2025

src/neo4j_graphrag/experimental/pipeline/pipeline.py (outdated, resolved)

stellasia reviewed Jun 11, 2025

src/neo4j_graphrag/experimental/pipeline/pipeline.py (resolved)

NathalieCharbel added 6 commits June 13, 2025 12:45

Refactor pipeline run_id management

a9413d3

Ensure previous run_ids are kept in store

64bdc66

Ensure resume run with different run_ids

d020a7e

Cleanup stores

c97bb97

Ensure proper handling of previous run_ids

74e7db0

Update changelog and docs

0ca4c60

NathalieCharbel marked this pull request as draft June 16, 2025 14:22

Fix orchestrator's way of handling tasks on complete and transitions for state management

2c091c2
