simplify id_mapping #53
346fe7f to 33e6766
…e MagicMock from tests
    """Nested dict mapping destination_pop -> source_pop -> DataFrame(index=old_ids, columns=[new_id]).

    Encapsulates the id remapping logic for subcircuit extraction:
    - Adding sources with automatic shift computation
what is shift computation; if a new term is being introduced, best to define it
true. I added a small explanation
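One plausible reading of "shift", based on the docstring's DataFrame(index=old_ids, columns=[new_id]) layout and the PR summary's note that IDs are stored sequentially by insertion order, is the offset added to a new source's IDs so that they start after the nodes already registered for the same destination. A minimal sketch under that assumption follows. The function name add_source_sketch and the population names are hypothetical, plain dicts stand in for the DataFrames to keep the sketch dependency-free, and the deduplication shown here (within a single call) may differ from what the real add_source does.

```python
def add_source_sketch(data, dest, source, old_ids):
    """Map old_ids of `source` to new ids placed after those already in `dest`."""
    sources = data.setdefault(dest, {})
    # Shift = number of nodes already mapped into this destination.
    shift = sum(len(mapping) for mapping in sources.values())
    unique_old_ids = list(dict.fromkeys(old_ids))  # dedupe, keep first-seen order
    # Plain dict {old_id: new_id} stands in for DataFrame(index=old_ids, columns=[new_id]).
    sources[source] = {old: shift + i for i, old in enumerate(unique_old_ids)}
    return sources[source]


data = {}
first = add_source_sketch(data, "dest_pop", "source_a", [10, 42, 42, 7])
second = add_source_sketch(data, "dest_pop", "source_b", [3, 5])
print(first)   # {10: 0, 42: 1, 7: 2}
print(second)  # {3: 3, 5: 4}
```

Under this reading, the node count of a destination is just the sum of the lengths of its per-source mappings, which is also the shift applied to the next source.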
    Encapsulates the id remapping logic for subcircuit extraction:
    - Adding sources with automatic shift computation
    - Serialization to id_mapping.json with lazy original_id resolution
"with lazy original_id resolution" is not an important detail
    """

    def __init__(self):
        self.data: dict[str, dict[str, pd.DataFrame]] = {}
can we use a more descriptive name than data?
I think data is fine here honestly. It's a standard pattern for wrapper classes (dataclasses use it, pandas uses .values/.data, etc.). Since this is a dict with some additional functions (because composition is better than inheritance), I do not think it is misleading to use data. The member keeps the underlying ... data
Alternatives like mappings, populations, or entries are slightly more descriptive but also slightly misleading in different ways:
- mappings: could be confused with the serialized id_mapping.json
- populations: it's not just populations, it's the nested dest→source→DataFrame structure
- entries: same problem as data
I don't think this is a wrapper class; it has a bunch of business logic and doesn't expose the original interface.
what about population_id_map or something similar?
we use .data everywhere in the code. It is not _data. It exposes the data. If I did not convince you, I can put pop_id_map (pop is quite used everywhere)
        mapping = {}
        for dest_pop, sources in self.data.items():
            all_new_ids = []
entry[NEW_IDS] = all_new_ids = [] to save from having to do the assignment on 119, same w/ all_orig_ids and entry
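The suggestion relies on Python's chained assignment, which evaluates the right-hand side once and binds every target to the same object. A small standalone illustration follows. The value given to NEW_IDS is a placeholder, since the real constant's value is not shown in this thread.

```python
NEW_IDS = "new_id"  # placeholder value; the real constant is defined in id_mapping.py

entry = {}
entry[NEW_IDS] = all_new_ids = []  # one list object, reachable under two names

all_new_ids.extend([0, 1, 2])      # in-place mutation is visible through both names
print(entry[NEW_IDS])                 # [0, 1, 2]
print(entry[NEW_IDS] is all_new_ids)  # True
```

This only holds while the list is mutated in place (append, extend). Rebinding the local name, for example with all_new_ids = all_new_ids + [...], would silently disconnect it from entry.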
            None for first-level extractions.

        Returns:
            The filename of the written mapping (relative to output).
instead of writing the file, wouldn't it be easier for testing and such to return the contents?
it also splits the concerns:
- collecting in a dict
- writing
ok, done
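The split agreed on here (collect into a dict, then write) can be sketched as two small functions. This is an illustration only: build_mapping and write_mapping are hypothetical names, plain dicts stand in for the DataFrames, and the new_id / parent_id / parent_name layout is an assumption inferred from the constant names in the PR summary, not the actual id_mapping.json schema.

```python
import json
import tempfile
from pathlib import Path


def build_mapping(data):
    """Collect the nested dest -> source -> {old_id: new_id} structure into a plain dict."""
    mapping = {}
    for dest_pop, sources in data.items():
        entry = mapping[dest_pop] = {"new_id": [], "parent_id": [], "parent_name": []}
        for source_pop, ids in sources.items():
            for old_id, new_id in ids.items():
                entry["new_id"].append(new_id)
                entry["parent_id"].append(old_id)
                entry["parent_name"].append(source_pop)
    return mapping


def write_mapping(mapping, output_dir):
    """Write an already-built mapping to id_mapping.json and return the path."""
    path = Path(output_dir) / "id_mapping.json"
    path.write_text(json.dumps(mapping, indent=2))
    return path


data = {"dest_pop": {"source_a": {10: 0, 42: 1}, "source_b": {3: 2}}}
mapping = build_mapping(data)  # can be asserted on without touching the filesystem

with tempfile.TemporaryDirectory() as tmp:
    written = json.loads(write_mapping(mapping, tmp).read_text())
```

Tests of the collection logic can then assert directly on the returned dict, and only a thin test needs a temporary directory for the write step.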
    sgids_new = id_mapping[write_edge_config.src_mapping].index.to_numpy()
    tgids_new = id_mapping[write_edge_config.dst_mapping].index.to_numpy()
    src_concat = pd.concat(id_mapping.data[write_edge_config.src_mapping].values())
src_concat: more descriptive name, please
  for cfg in ext_edge_configs:
      # Add source_filter to newly-externalized inputs (source = the biophysical pop name)
-     src_pop = id_mapping[cfg.src_mapping][SOURCE].iloc[-1]  # last entry is from biophysical
+     src_pop = list(id_mapping.data[cfg.src_mapping].keys())[-1]
why keys() and why -1? even the old comment doesn't help much (ie: last entry is from biophysical - is that an invariant?)
yes, it is also a little fragile. here we rely on ordering to identify the biophysical source. I think I have a better solution. I'll let you check
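The fragility discussed here comes from picking the last key of a dict, which only works as long as the biophysical source is always inserted last. One way to avoid the ordering dependency is to look the source up by name and fail loudly when it is missing. The sketch below is hypothetical: the follow-up commit's actual fix is not shown in this thread, and biophysical_source and the population names are made up for illustration.

```python
def biophysical_source(sources, biophysical_pop):
    """Return the source key for the biophysical population, or raise if it is absent."""
    if biophysical_pop not in sources:
        raise KeyError(
            f"{biophysical_pop!r} is not a source of this mapping; got {list(sources)}"
        )
    return biophysical_pop


# Hypothetical per-destination sources, with the biophysical one NOT inserted last.
sources = {"nodeA": {1: 0}, "external_inputs": {5: 1}}

by_position = list(sources.keys())[-1]          # depends on insertion order
by_name = biophysical_source(sources, "nodeA")  # independent of ordering
print(by_position, by_name)  # external_inputs nodeA
```

The positional lookup silently returns the wrong population once the insertion order changes, whereas the named lookup either returns the intended key or raises.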
…me src_concat, explicit biophysical source lookup
…_nested_dicts # Conflicts: # brainbuilder/utils/sonata/split_population.py
Summary
Refactored the internal id_mapping from dict[str, pd.DataFrame] to an IdMapping class wrapping dict[str, dict[str, pd.DataFrame]] (destination → source → DataFrame).
Changes
New file: brainbuilder/utils/sonata/id_mapping.py, containing the IdMapping class with:
- add_source(dest, source, old_ids): adds IDs with automatic shift and deduplication
- node_count(dest): returns total node count for a destination
- write(output, parent_mapping_path=None): serializes to id_mapping.json with lazy original_id resolution
- _resolve_original_ids(): chains through parent provenance
Removed from split_population.py:
- _get_node_id_mapping (replaced by IdMapping.add_source)
- _resolve_original_ids (moved into IdMapping.write)
- _write_mapping (moved into IdMapping.write)
- SOURCE, PARENT_IDS, ORIG_IDS, PARENT_NAME, ORIG_NAME constants (live in id_mapping.py)
JSON format: backward compatible. Multi-source populations add parent2_id/parent2_name fields (single-source unchanged).
Memory reduction: DataFrames only store new_id column (no source, no original_id). Source info encoded in dict keys; original_id resolved at write time.
No sorting: IDs stored sequentially by insertion order.
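The idea of resolving original_id at write time by chaining through the parent's mapping can be illustrated with a small sketch. Everything in it is an assumption made for illustration: resolve_original_ids is a hypothetical free function, the parent mapping is modelled as a plain {new_id: original_id} dict rather than the contents of a parent id_mapping.json, and None is taken to mean a first-level extraction, as the write docstring suggests.

```python
def resolve_original_ids(parent_ids, parent_mapping=None):
    """Translate ids in the parent circuit into ids of the original circuit.

    parent_mapping maps the parent's new_id -> original_id; None means the
    parent is itself the original circuit (a first-level extraction).
    """
    if parent_mapping is None:
        return list(parent_ids)
    return [parent_mapping[pid] for pid in parent_ids]


# First-level extraction: original nodes 100, 205, 317 became 0, 1, 2.
level1 = {0: 100, 1: 205, 2: 317}

# Second-level extraction keeps nodes 2 and 0 of the first-level circuit.
parent_ids = [2, 0]
print(resolve_original_ids(parent_ids, level1))  # [317, 100]
print(resolve_original_ids(parent_ids))          # [2, 0]
```

Because the lookup happens only when the mapping is serialized, the in-memory structure does not need to carry an original_id column, which matches the memory-reduction point above.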
Tests
tests/unit/test_sonata/test_id_mapping.py with 16 unit tests for the IdMapping class.