-
Notifications
You must be signed in to change notification settings - Fork 76
Description
Dear CellBender team,
First, thank you for building and maintaining this fantastic tool.
Context
We're working with paired RNA+ATAC 10x datasets and using cellbender remove-background to denoise the RNA signal. Since our pipeline is built on MuData (mudata/muon), we rely on preserving metadata across both modalities for integrated downstream analysis.
We observed that after applying CellBender to a multiome .h5 file, the filtered output loses several ATAC-specific metadata fields, which are crucial for downstream analyses (e.g. QC metrics, peak annotation, plotting).
Loading Raw Input (Before CellBender)
import warnings
import muon as mu
with warnings.catch_warnings():
warnings.simplefilter("ignore", UserWarning)
mdata_before_cellbender = mu.read_10x_h5(
"/mnt/nfs/CX000008_DS1/projects/gaurav/multiome/data_run/NBB4285_CAU_NBB4438_CINGLV/outs/raw_feature_bc_matrix.h5"
)
# Ensure feature names are unique
for mod in mdata_before_cellbender.mod:
mdata_before_cellbender.mod[mod].var_names_make_unique()
mdata_before_cellbenderOutput:
MuData object with n_obs × n_vars = 735169 × 90616
var: 'gene_ids', 'feature_types', 'genome', 'interval'
2 modalities
rna: 735169 × 38606
var: 'gene_ids', 'feature_types', 'genome', 'interval'
atac: 735169 × 52010
var: 'gene_ids', 'feature_types', 'genome', 'interval'
uns: 'atac', 'files'
After CellBender Filtering
The output .h5 file (written by --output) is loaded the same way:
mdata_after_cellbender = mu.read_10x_h5("/path/to/output_filtered.h5")Output:
MuData object with n_obs × n_vars = 26390 × 90616
var: 'gene_ids', 'feature_types', 'genome'
2 modalities
rna: 26390 × 38606
var: 'gene_ids', 'feature_types', 'genome'
atac: 26390 × 52010
var: 'gene_ids', 'feature_types', 'genome'
uns: (missing 'files', 'atac')
Missing:
var['interval']for both RNA and ATACuns['files'],uns['atac']for ATAC
Additional Test
We tried copying the CellBender-filtered .h5 back into the original outs/ folder and loading via mu.read_10x_h5:
test = mu.read_10x_h5(
"/mnt/nfs/CX000008_DS1/projects/gaurav/multiome/data_run/NBB4285_CAU_NBB4438_CINGLV/outs/NBB4285_CAU_NBB4438_CINGLV_filtered.h5"
)This version restored uns['files'] and uns['atac'], but still did not contain interval in var.
🛠️ Workaround: Metadata Restoration Function
To address this, we developed a utility function that restores critical metadata from the original MuData object:
def restore_metadata_with_sanity_check(m_before, m_after, verbose=True):
"""
Restore ATAC + RNA metadata from pre-CellBender MuData (m_before)
to post-CellBender MuData (m_after), with safety checks.
"""
# 1. Check cell overlap
obs_missing = m_after.obs_names.difference(m_before.obs_names)
if len(obs_missing) > 0 and verbose:
print(f"⚠️ WARNING: {len(obs_missing)} cells missing from original.")
print(f"Examples: {list(obs_missing)[:5]}")
elif verbose:
print("✅ All filtered cells matched in original.")
# 2. Check var compatibility
for modality in ['rna', 'atac']:
if not m_after.mod[modality].var_names.equals(m_before.mod[modality].var_names):
raise ValueError(f"❌ {modality.upper()} features do not match between m_before and m_after.")
# 3. Restore var['interval']
for modality in ['rna', 'atac']:
if 'interval' in m_before.mod[modality].var.columns:
m_after.mod[modality].var['interval'] = m_before.mod[modality].var['interval']
if verbose:
print(f"✅ Restored 'interval' for {modality}.")
else:
print(f"⚠️ No 'interval' found in {modality}. Skipping.")
# 4. Restore atac.uns metadata
for field in ['files', 'atac']:
if field in m_before.mod['atac'].uns:
m_after.mod['atac'].uns[field] = m_before.mod['atac'].uns[field]
if verbose:
print(f"✅ Restored 'atac.uns[\"{field}\"]'.")
else:
if verbose:
print(f"⚠️ 'atac.uns[\"{field}\"]' not found. Skipping.")
return m_after✅Usage:
mdata_after_cellbender = restore_metadata_with_sanity_check(
m_before=mdata_before_cellbender,
m_after=mdata_after_cellbender,
verbose=True
)This works well — but we're wondering if this step should even be necessary.
Our Questions
- Is this metadata loss expected when applying CellBender to multiome
.h5files? - Would you recommend our restoration function, or is there a better way to retain metadata within the CellBender workflow?
- Would you consider preserving
var['interval']andunsfields in future versions of the tool when the input is a multi-modal file?
Thank you again for your excellent tool and support. Looking forward to your guidance!