Skip to content

simplify id_mapping#53

Open
cattabiani wants to merge 19 commits into
katta/merge-external-populationsfrom
katta/id_mapping_nested_dicts
Open

simplify id_mapping#53
cattabiani wants to merge 19 commits into
katta/merge-external-populationsfrom
katta/id_mapping_nested_dicts

Conversation

@cattabiani
Copy link
Copy Markdown
Contributor

@cattabiani cattabiani commented May 21, 2026

Summary

Refactored the internal id_mapping from dict[str, pd.DataFrame] to an IdMapping class wrapping dict[str, dict[str, pd.DataFrame]] (destination → source → DataFrame).

Changes

  • New file: brainbuilder/utils/sonata/id_mapping.pyIdMapping class with:

    • add_source(dest, source, old_ids) — adds IDs with automatic shift and deduplication
    • node_count(dest) — returns total node count for a destination
    • write(output, parent_mapping_path=None) — serializes to id_mapping.json with lazy original_id resolution
    • _resolve_original_ids() — chains through parent provenance
  • Removed from split_population.py:

    • _get_node_id_mapping (replaced by IdMapping.add_source)
    • _resolve_original_ids (moved into IdMapping.write)
    • _write_mapping (moved into IdMapping.write)
    • SOURCE, PARENT_IDS, ORIG_IDS, PARENT_NAME, ORIG_NAME constants (live in id_mapping.py)
  • JSON format: backward compatible. Multi-source populations add parent2_id/parent2_name fields (single-source unchanged).

  • Memory reduction: DataFrames only store new_id column (no source, no original_id). Source info encoded in dict keys; original_id resolved at write time.

  • No sorting: IDs stored sequentially by insertion order.

Tests

  • All 42 existing tests pass unchanged (integration tests validate output files are identical).
  • New tests/unit/test_sonata/test_id_mapping.py with 16 unit tests for the IdMapping class.

@cattabiani cattabiani force-pushed the katta/id_mapping_nested_dicts branch from 346fe7f to 33e6766 Compare May 21, 2026 14:40
@cattabiani cattabiani requested a review from mgeplf May 22, 2026 09:20
@cattabiani cattabiani self-assigned this May 22, 2026
@cattabiani cattabiani marked this pull request as ready for review May 22, 2026 09:20
Comment thread brainbuilder/utils/sonata/id_mapping.py Outdated
"""Nested dict mapping destination_pop -> source_pop -> DataFrame(index=old_ids, columns=[new_id]).

Encapsulates the id remapping logic for subcircuit extraction:
- Adding sources with automatic shift computation
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is shift computation; if a new term is being introduced, best to define it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. I added a small explanation

Comment thread brainbuilder/utils/sonata/id_mapping.py Outdated

Encapsulates the id remapping logic for subcircuit extraction:
- Adding sources with automatic shift computation
- Serialization to id_mapping.json with lazy original_id resolution
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with lazy original_id resolution is not an important detail

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

"""

def __init__(self):
self.data: dict[str, dict[str, pd.DataFrame]] = {}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use a more descriptive name than data?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think data is fine here honestly. It's a standard pattern for wrapper classes (dataclasses use it, pandas uses .values/.data, etc.). Since this is a dict with some additional functions (because composition is better than inheritance) I do not think it is missleading to use data. The member keeps the underlying ... data

Alternatives like mappings, populations, or entries are slightly more descriptive but also slightly misleading in different ways:

  • mappings: could be confused with the serialized id_mapping.json
  • populations: it's not just populations, it's the nested dest→source→DataFrame structure
  • entries: same problem as data

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a wrapper class; it has a bunch of business logic and doesn't expose the original interface.

what about population_id_map or something similar?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use .data everywhere in the code. It is not _data. It exposes the data. If I did not convince you, I can put pop_id_map (pop is quite used everywhere)

Comment thread brainbuilder/utils/sonata/id_mapping.py Outdated

mapping = {}
for dest_pop, sources in self.data.items():
all_new_ids = []
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

entry[NEW_IDS] = all_new_ids = [] to save from having to do the assignment on 119, same w/ all_orig_ids and entry

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment thread brainbuilder/utils/sonata/id_mapping.py Outdated
None for first-level extractions.

Returns:
The filename of the written mapping (relative to output).
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of writing the file, wouldn't it be easier for testing and such to return the contents?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it also splits the concerns:

  • collecting in a dict
  • wiriting
    ok, done


sgids_new = id_mapping[write_edge_config.src_mapping].index.to_numpy()
tgids_new = id_mapping[write_edge_config.dst_mapping].index.to_numpy()
src_concat = pd.concat(id_mapping.data[write_edge_config.src_mapping].values())
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src_concat more descriptive name, please

for cfg in ext_edge_configs:
# Add source_filter to newly-externalized inputs (source = the biophysical pop name)
src_pop = id_mapping[cfg.src_mapping][SOURCE].iloc[-1] # last entry is from biophysical
src_pop = list(id_mapping.data[cfg.src_mapping].keys())[-1]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why keys() and why -1? even the old comment doesn't help much (ie: last entry is from biophysical - is that an invariant?)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is also a little fragile. here we rely on ordering to identify the biophysical source. I think I have a better solution. I let you check

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants