simplify id_mapping by cattabiani · Pull Request #53 · openbraininstitute/brainbuilder

cattabiani · 2026-05-21T13:59:52Z

Summary

Refactored the internal id_mapping from dict[str, pd.DataFrame] to an IdMapping class wrapping dict[str, dict[str, pd.DataFrame]] (destination → source → DataFrame).

Changes

New file: brainbuilder/utils/sonata/id_mapping.py — IdMapping class with:
- add_source(dest, source, old_ids) — adds IDs with automatic shift and deduplication
- node_count(dest) — returns total node count for a destination
- write(output, parent_mapping_path=None) — serializes to id_mapping.json with lazy original_id resolution
- _resolve_original_ids() — chains through parent provenance
Removed from split_population.py:
- _get_node_id_mapping (replaced by IdMapping.add_source)
- _resolve_original_ids (moved into IdMapping.write)
- _write_mapping (moved into IdMapping.write)
- SOURCE, PARENT_IDS, ORIG_IDS, PARENT_NAME, ORIG_NAME constants (live in id_mapping.py)
JSON format: backward compatible. Multi-source populations add parent2_id/parent2_name fields (single-source unchanged).
Memory reduction: DataFrames only store new_id column (no source, no original_id). Source info encoded in dict keys; original_id resolved at write time.
No sorting: IDs stored sequentially by insertion order.
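
For readers who have not opened the diff, the structure described above can be sketched roughly as follows. This is an illustration only, not the code in brainbuilder/utils/sonata/id_mapping.py. The class name IdMappingSketch, the shift rule (new IDs of each source continue from the destination's current node count), and the deduplication rule (repeated old IDs keep their first occurrence) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

NEW_IDS = "new_id"  # column name as given in the PR description


class IdMappingSketch:
    """Hypothetical stand-in for IdMapping: destination -> source -> DataFrame."""

    def __init__(self):
        self.data: dict[str, dict[str, pd.DataFrame]] = {}

    def node_count(self, dest: str) -> int:
        """Total number of nodes already assigned to `dest`, over all sources."""
        return sum(len(df) for df in self.data.get(dest, {}).values())

    def add_source(self, dest: str, source: str, old_ids) -> pd.DataFrame:
        # Assumed deduplication: keep first occurrences, preserve insertion order.
        old_ids = pd.unique(np.asarray(old_ids))
        # Assumed shift: new IDs continue where the destination currently ends.
        shift = self.node_count(dest)
        df = pd.DataFrame(
            {NEW_IDS: np.arange(len(old_ids)) + shift},
            index=pd.Index(old_ids, name="old_id"),
        )
        self.data.setdefault(dest, {})[source] = df
        return df


mapping = IdMappingSketch()
mapping.add_source("A", "A", [10, 12, 12, 15])  # duplicate 12 is dropped
mapping.add_source("A", "B", [3, 4])            # new IDs are shifted by 3
```

In this sketch each DataFrame holds only the new_id column, and the source name lives in the dict key, which matches the memory-reduction point in the description.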

Tests

All 42 existing tests pass unchanged (integration tests validate output files are identical).
New tests/unit/test_sonata/test_id_mapping.py with 16 unit tests for the IdMapping class.

mgeplf · 2026-05-22T09:34:08Z

+    """Nested dict mapping destination_pop -> source_pop -> DataFrame(index=old_ids, columns=[new_id]).
+
+    Encapsulates the id remapping logic for subcircuit extraction:
+    - Adding sources with automatic shift computation


what is shift computation; if a new term is being introduced, best to define it

true. I added a small explanation

mgeplf · 2026-05-22T09:34:48Z

+
+    Encapsulates the id remapping logic for subcircuit extraction:
+    - Adding sources with automatic shift computation
+    - Serialization to id_mapping.json with lazy original_id resolution


"with lazy original_id resolution" is not an important detail

mgeplf · 2026-05-22T11:45:51Z

+    """
+
+    def __init__(self):
+        self.data: dict[str, dict[str, pd.DataFrame]] = {}


can we use a more descriptive name than data?

I think data is fine here honestly. It's a standard pattern for wrapper classes (dataclasses use it, pandas uses .values/.data, etc.). Since this is a dict with some additional functions (because composition is better than inheritance), I do not think it is misleading to use data. The member keeps the underlying ... data

Alternatives like mappings, populations, or entries are slightly more descriptive but also slightly misleading in different ways:

mappings: could be confused with the serialized id_mapping.json

populations: it's not just populations, it's the nested dest→source→DataFrame structure

entries: same problem as data

I don't think this is a wrapper class; it has a bunch of business logic and doesn't expose the original interface.

what about population_id_map or something similar?

we use .data everywhere in the code. It is not _data. It exposes the data. If I did not convince you, I can put pop_id_map (pop is quite used everywhere)

mgeplf · 2026-05-22T11:51:05Z

+
+        mapping = {}
+        for dest_pop, sources in self.data.items():
+            all_new_ids = []


entry[NEW_IDS] = all_new_ids = [] to save from having to do the assignment on 119, same w/ all_orig_ids and entry
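
The chained assignment suggested in this comment can be illustrated with a standalone snippet. The names entry, NEW_IDS and all_new_ids come from the comment; the values are made up for the example.

```python
NEW_IDS = "new_id"  # key name assumed for illustration

entry = {}
# Chained assignment binds both targets to the same list object...
entry[NEW_IDS] = all_new_ids = []

# ...so appending through the local name also updates the dict entry,
# and no separate `entry[NEW_IDS] = all_new_ids` is needed later.
all_new_ids.extend([0, 1, 2])
```

This only holds while the list is mutated in place. Rebinding all_new_ids to a new list later would silently break the link to entry.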

mgeplf · 2026-05-22T11:52:33Z

+                None for first-level extractions.
+
+        Returns:
+            The filename of the written mapping (relative to output).


instead of writing the file, wouldn't it be easier for testing and such to return the contents?

it also splits the concerns:

- collecting in a dict
- writing

ok, done
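
The split proposed in this thread might look roughly like the sketch below. The commit list names a to_dict method extracted from write. Everything else here is assumed for illustration, including the free-function form, the input shape, and the JSON layout, and none of it is taken from the actual implementation.

```python
import json
import tempfile
from pathlib import Path


def to_dict(new_ids_by_pop: dict[str, list[int]]) -> dict:
    """Collect the mapping into plain JSON-serializable data (no file I/O)."""
    return {pop: {"new_id": list(ids)} for pop, ids in new_ids_by_pop.items()}


def write(contents: dict, output: Path) -> str:
    """Write already-collected contents; return the filename relative to output."""
    filename = "id_mapping.json"
    with open(Path(output) / filename, "w", encoding="utf-8") as fd:
        json.dump(contents, fd, indent=2)
    return filename


# Tests can assert on `contents` directly, without touching the filesystem.
contents = to_dict({"A": [0, 1, 2]})

with tempfile.TemporaryDirectory() as tmp:
    written = write(contents, Path(tmp))
    roundtrip = json.loads((Path(tmp) / written).read_text(encoding="utf-8"))
```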

mgeplf · 2026-05-22T11:54:52Z


-    sgids_new = id_mapping[write_edge_config.src_mapping].index.to_numpy()
-    tgids_new = id_mapping[write_edge_config.dst_mapping].index.to_numpy()
+    src_concat = pd.concat(id_mapping.data[write_edge_config.src_mapping].values())


src_concat: more descriptive name, please
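
For context, the line under review concatenates the per-source DataFrames of one destination population into a single old-ID to new-ID table. A small standalone illustration follows. The sample data and the name src_old_to_new are hypothetical, and the name actually chosen in the PR is not shown in this thread.

```python
import pandas as pd

# Stand-in for id_mapping.data[write_edge_config.src_mapping]:
# one DataFrame per source, indexed by old ID, holding the new ID.
sources = {
    "A": pd.DataFrame({"new_id": [0, 1]}, index=[10, 12]),
    "B": pd.DataFrame({"new_id": [2, 3]}, index=[3, 4]),
}

# Rows are stacked in dict insertion order; the index keeps the old IDs.
src_old_to_new = pd.concat(list(sources.values()))
```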

mgeplf · 2026-05-22T11:57:04Z

    for cfg in ext_edge_configs:
        # Add source_filter to newly-externalized inputs (source = the biophysical pop name)
-        src_pop = id_mapping[cfg.src_mapping][SOURCE].iloc[-1]  # last entry is from biophysical
+        src_pop = list(id_mapping.data[cfg.src_mapping].keys())[-1]


why keys() and why -1? even the old comment doesn't help much (i.e. "last entry is from biophysical": is that an invariant?)

Yes, it is also a little fragile: here we rely on ordering to identify the biophysical source. I think I have a better solution. I'll let you check.


cattabiani added 3 commits May 21, 2026 15:58

wip: introduce id_mapping2 alongside existing id_mapping with assertions

cf9383d

migrate _write_mapping to id_mapping2, remove _resolve_original_ids

1857c55

introduce IdMapping class, move write logic there, use add_source helper

33e6766

cattabiani force-pushed the katta/id_mapping_nested_dicts branch from 346fe7f to 33e6766 Compare May 21, 2026 14:40

cattabiani added 11 commits May 21, 2026 17:00

simplify add_source: handle duplicate IDs internally, remove from_spl…

e8d6a78

…it_nodes

extract _resolve_original_ids as staticmethod in IdMapping

743778b

migrate node writing and _update_node_sets to id_mapping2

d177c5d

fix tests: update _update_node_sets tests to use IdMapping object

60aef86

full swap: remove old id_mapping, use IdMapping object everywhere

0760c17

add unit tests for IdMapping, clean up stale comments and unused cons…

fb9dec0

…tants

add IdMapping.load, multi-source JSON format (parent2_id/parent2_name)

343666c

simplify load/write: unified loop for single and multi-source entries

5941b29

import NEW_IDS from id_mapping module, remove duplicate constant

77d1209

add node_count method, remove load, flatten test style to match proje…

b97572e

…ct conventions

decouple write from circuit object: accept parent_mapping_path, remov…

1dccbf7

…e MagicMock from tests

cattabiani requested a review from mgeplf May 22, 2026 09:20

cattabiani self-assigned this May 22, 2026

cattabiani marked this pull request as ready for review May 22, 2026 09:20

Merge branch 'katta/merge-external-populations' into katta/id_mapping…

4e5bfb6

…_nested_dicts

mgeplf reviewed May 22, 2026

cattabiani added 4 commits May 22, 2026 16:13

Merge branch 'katta/merge-external-populations' into katta/id_mapping…

c87cab3

…_nested_dicts

review: clarify IdMapping docstring, extract to_dict from write, rena…

f3ede16

…me src_concat, explicit biophysical source lookup

Merge branch 'katta/merge-external-populations' into katta/id_mapping…

84bb10e

…_nested_dicts

Merge branch 'katta/merge-external-populations' into katta/id_mapping…

5426ec7

…_nested_dicts # Conflicts: # brainbuilder/utils/sonata/split_population.py
