Skip to content

Fix/desc construction#342

Open
GalAshkenazi1 wants to merge 5 commits into
developfrom
fix/desc_construction
Open

Fix/desc construction#342
GalAshkenazi1 wants to merge 5 commits into
developfrom
fix/desc_construction

Conversation

@GalAshkenazi1
Copy link
Copy Markdown
Contributor

Fix records= path producing 'None' descriptions + unified description construction


Main Bug Fixed

When constructing EEGDashDataset via records=, every recording's description was
the string "None". The records= path never built or forwarded a description dict to
EEGDashRaw; braindecode then stringified the missing None default.


Changes

eegdash/dataset/dataset.py

  • Bug fix: records= path now builds and passes a description dict to each
    EEGDashRaw, matching the behaviour of the query and offline paths.

  • Unified _build_description helper: All three construction paths (records=,
    offline, query) previously built descriptions independently. A shared _build_description
    method now covers all three, eliminating divergence. It:

    • Pre-fills every requested field with None (so desc["subject"] never raises KeyError)
    • Looks up fields via _find_key_in_nested_dict (recursive, case/separator-insensitive,
      handles both v1 and v2 record formats)
    • Merges participant_tsv data with configurable precedence (see below)
    • Emits a DEBUG log when a conflict is detected
  • Configurable precedence (description_precedence): New constructor parameter controls
    which source wins when the same field appears in both the record and participant_tsv:

    • "record" (default) — existing behaviour, record-level value is kept
    • "participant_tsv" — participant_tsv overwrites the record value, including None
      (documented intentional behaviour: choosing this mode means fully trusting that source)
  • All-None field warning: After construction, emits a WARNING if any
    description_fields entry is None across all recordings — surfaces typos like
    "sbject" early.

  • _normalize_records called on the full batch: Preserves deduplication correctness
    (per-record calls broke _dedupe_records).

tests/unit_tests/dataset/test_build_description.py (new file)

Test What it checks
test_build_description_precedence_conflict Default "record" mode: top-level value wins, "kept" logged
test_build_description_missing_fields_padding Absent fields → None, no KeyError
test_build_description_key_insensitivity "Subject-ID" in record maps to "subject_id" in fields
test_dataset_initialization_path_parity All three init paths produce identical description DataFrames
test_build_description_participant_tsv_precedence "participant_tsv" mode: tsv value wins; None in tsv overwrites real value
test_dataset_invalid_description_precedence Unknown description_precedence raises ValueError at construction

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

📚 Documentation Preview

📦 Download Documentation Artifact

Download the documentation-html artifact from the workflow run to view the docs locally.

💡 To enable live previews, add a SURGE_TOKEN secret to this repository. See surge.sh for setup instructions.

**base_dataset_kwargs,
)
)
for record in self.records
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for record in self.records
datasets = [
EEGDashRaw(
record,
self.cache_dir,
description=self._build_description(record, description_fields),
**base_dataset_kwargs,
)
for record in self.records
]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minimizing the diff.

record: dict[str, Any],
description_fields: list[str],
participants_row: dict[str, Any] | None = None,
) -> dict[str, Any]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is more verbose than necessary for this function, and I would move to an auxiliary place, some utilities. This way, the dataset object does not deliver this function, which is only used once.

Comment on lines +584 to +599
def _build_description(
self,
record: dict[str, Any],
description_fields: list[str],
participants_row: dict[str, Any] | None = None,
) -> dict[str, Any]:
"""Build a description dict for a single record.

Extracts values for each requested field from the record, then merges
participant data from either an explicit ``participants_row`` (offline
path, from a local ``participants.tsv``) or the embedded
``participant_tsv`` key inside the record (online paths). Fields still
absent after the merge are set to ``None`` so the schema is always
complete. When both the record and participant data carry the same
field, precedence is determined by ``self._description_precedence``; a
``debug``-level log is emitted when the values differ.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can compact much more function too.

participants_row=part,
description_fields=description_fields,
)
description = self._build_description(record, description_fields)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice simplification

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test suite, we need to do much more compact, use parametrization, and use a more pytest-style.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants