Skip to content

Loss of ATAC metadata in CellBender-filtered .h5 outputs for 10x Multiome libraries (MuData compatibility) #417

@gauravsinghrathore

Description

@gauravsinghrathore

Dear CellBender team,

First, thank you for building and maintaining this fantastic tool.

Context

We're working with paired RNA+ATAC 10x datasets and using cellbender remove-background to denoise the RNA signal. Since our pipeline is built on MuData (mudata/muon), we rely on preserving metadata across both modalities for integrated downstream analysis.

We observed that after applying CellBender to a multiome .h5 file, the filtered output loses several ATAC-specific metadata fields, which are crucial for downstream analyses (e.g. QC metrics, peak annotation, plotting).


Loading Raw Input (Before CellBender)

import warnings
import muon as mu

with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    mdata_before_cellbender = mu.read_10x_h5(
        "/mnt/nfs/CX000008_DS1/projects/gaurav/multiome/data_run/NBB4285_CAU_NBB4438_CINGLV/outs/raw_feature_bc_matrix.h5"
    )

# Ensure feature names are unique
for mod in mdata_before_cellbender.mod:
    mdata_before_cellbender.mod[mod].var_names_make_unique()

mdata_before_cellbender

Output:

MuData object with n_obs × n_vars = 735169 × 90616
  var: 'gene_ids', 'feature_types', 'genome', 'interval'
  2 modalities
    rna:  735169 × 38606
      var: 'gene_ids', 'feature_types', 'genome', 'interval'
    atac: 735169 × 52010
      var: 'gene_ids', 'feature_types', 'genome', 'interval'
      uns: 'atac', 'files'

After CellBender Filtering

The output .h5 file (written by --output) is loaded the same way:

mdata_after_cellbender = mu.read_10x_h5("/path/to/output_filtered.h5")

Output:

MuData object with n_obs × n_vars = 26390 × 90616
  var: 'gene_ids', 'feature_types', 'genome'
  2 modalities
    rna:  26390 × 38606
      var: 'gene_ids', 'feature_types', 'genome'
    atac: 26390 × 52010
      var: 'gene_ids', 'feature_types', 'genome'
      uns: (missing 'files', 'atac')

Missing:

  • var['interval'] for both RNA and ATAC
  • uns['files'], uns['atac'] for ATAC

Additional Test

We tried copying the CellBender-filtered .h5 back into the original outs/ folder and loading via mu.read_10x_h5:

test = mu.read_10x_h5(
    "/mnt/nfs/CX000008_DS1/projects/gaurav/multiome/data_run/NBB4285_CAU_NBB4438_CINGLV/outs/NBB4285_CAU_NBB4438_CINGLV_filtered.h5"
)

This version restored uns['files'] and uns['atac'], but still did not contain interval in var.


🛠️ Workaround: Metadata Restoration Function

To address this, we developed a utility function that restores critical metadata from the original MuData object:

def restore_metadata_with_sanity_check(m_before, m_after, verbose=True):
    """
    Restore ATAC + RNA metadata from pre-CellBender MuData (m_before)
    to post-CellBender MuData (m_after), with safety checks.
    """
    # 1. Check cell overlap
    obs_missing = m_after.obs_names.difference(m_before.obs_names)
    if len(obs_missing) > 0 and verbose:
        print(f"⚠️ WARNING: {len(obs_missing)} cells missing from original.")
        print(f"Examples: {list(obs_missing)[:5]}")
    elif verbose:
        print("✅ All filtered cells matched in original.")

    # 2. Check var compatibility
    for modality in ['rna', 'atac']:
        if not m_after.mod[modality].var_names.equals(m_before.mod[modality].var_names):
            raise ValueError(f"❌ {modality.upper()} features do not match between m_before and m_after.")

    # 3. Restore var['interval']
    for modality in ['rna', 'atac']:
        if 'interval' in m_before.mod[modality].var.columns:
            m_after.mod[modality].var['interval'] = m_before.mod[modality].var['interval']
            if verbose:
                print(f"✅ Restored 'interval' for {modality}.")
        else:
            print(f"⚠️ No 'interval' found in {modality}. Skipping.")

    # 4. Restore atac.uns metadata
    for field in ['files', 'atac']:
        if field in m_before.mod['atac'].uns:
            m_after.mod['atac'].uns[field] = m_before.mod['atac'].uns[field]
            if verbose:
                print(f"✅ Restored 'atac.uns[\"{field}\"]'.")
        else:
            if verbose:
                print(f"⚠️ 'atac.uns[\"{field}\"]' not found. Skipping.")

    return m_after

✅Usage:

mdata_after_cellbender = restore_metadata_with_sanity_check(
    m_before=mdata_before_cellbender,
    m_after=mdata_after_cellbender,
    verbose=True
)

This works well — but we're wondering if this step should even be necessary.


Our Questions

  1. Is this metadata loss expected when applying CellBender to multiome .h5 files?
  2. Would you recommend our restoration function, or is there a better way to retain metadata within the CellBender workflow?
  3. Would you consider preserving var['interval'] and uns fields in future versions of the tool when the input is a multi-modal file?

Thank you again for your excellent tool and support. Looking forward to your guidance!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions