Cyto Batch Integration

Benchmarking of batch integration algorithms for cytometry data.

Repository: openproblems-bio/task_cyto_batch_integration

Description

Cytometry is a non-sequencing single-cell profiling technique commonly used in clinical studies. It is very sensitive to batch effects, which can bias the interpretation of results. Batch integration algorithms are often used to mitigate these effects.

In this project, we are building a pipeline for reproducible and continuous benchmarking of batch integration algorithms for cytometry data. As input, methods require cleaned and normalised (using arc-sinh or logicle transformation) data with multiple batches, cell type labels, and biological subjects, with paired samples from a subject profiled across multiple batches. The batch-integrated output must be an integrated marker-by-cell matrix stored in AnnData format. All markers in the input data must be returned, regardless of whether they were integrated or not. This output is then evaluated using metrics that assess how well the batch effects were removed and how well the biological signal was preserved.
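The arc-sinh transformation mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline's preprocessing code: the function name `arcsinh_transform`, the example intensities, and the cofactor value are assumptions made for the example (suitable cofactors depend on the instrument).

```python
import math

def arcsinh_transform(values, cofactor=5.0):
    """Return asinh(x / cofactor) for each raw intensity in `values`."""
    return [math.asinh(v / cofactor) for v in values]

# Hypothetical raw intensities for a single marker.
raw = [0.0, 5.0, 500.0]
transformed = arcsinh_transform(raw, cofactor=5.0)
```

The transformation is close to linear near zero and close to logarithmic for large values, which is why it is a common choice for cytometry intensities.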

Authors & contributors

name roles
Luca Leomazzi author, maintainer
Givanna Putri author, maintainer
Robrecht Cannoodt author
Katrien Quintelier contributor
Sofie Van Gassen contributor

API

flowchart TB
  file_common_dataset("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-common-dataset'>Common Dataset</a>")
  comp_data_processor[/"<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#component-type-data-processor'>Data processor</a>"/]
  file_censored_split1("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-censored--split-1-'>Censored (split 1)</a>")
  file_censored_split2("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-censored--split-2-'>Censored (split 2)</a>")
  file_unintegrated("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-unintegrated'>Unintegrated</a>")
  comp_method[/"<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#component-type-method'>Method</a>"/]
  comp_method_split2[/"<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#component-type-method'>Method</a>"/]
  comp_control_method[/"<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#component-type-control-method'>Control Method</a>"/]
  comp_metric[/"<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#component-type-metric'>Metric</a>"/]
  file_integrated_split1("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-integrated--split-1-'>Integrated (split 1)</a>")
  file_integrated_split2("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-integrated--split-2-'>Integrated (split 2)</a>")
  file_score("<a href='https://github.com/openproblems-bio/task_cyto_batch_integration#file-format-score'>Score</a>")
  file_common_dataset---comp_data_processor
  comp_data_processor-->file_censored_split1
  comp_data_processor-->file_censored_split2
  comp_data_processor-->file_unintegrated
  file_censored_split1---comp_method
  file_censored_split2---comp_method_split2
  file_unintegrated---comp_control_method
  file_unintegrated---comp_metric
  comp_method-->file_integrated_split1
  comp_method_split2-->file_integrated_split2
  comp_control_method-->file_integrated_split1
  comp_control_method-->file_integrated_split2
  comp_metric-->file_score
  file_integrated_split1---comp_metric
  file_integrated_split2---comp_metric
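
To make the data flow in the flowchart easier to follow, the edges can be transcribed into a dependency map and sorted. This is only a reading aid under stated assumptions: the short node names are abbreviations of the flowchart IDs, and the integrated files list both the method and the control method as predecessors even though, in a given run, only one of them would produce the file.

```python
from graphlib import TopologicalSorter

# Each key depends on the nodes in its set (node -> predecessors),
# transcribed from the edges of the flowchart above.
dependencies = {
    "data_processor": {"common_dataset"},
    "censored_split1": {"data_processor"},
    "censored_split2": {"data_processor"},
    "unintegrated": {"data_processor"},
    "method": {"censored_split1"},
    "method_split2": {"censored_split2"},
    "control_method": {"unintegrated"},
    # Alternative producers: a method OR a control method writes these files.
    "integrated_split1": {"method", "control_method"},
    "integrated_split2": {"method_split2", "control_method"},
    "metric": {"unintegrated", "integrated_split1", "integrated_split2"},
    "score": {"metric"},
}

# One valid execution order, from the common dataset down to the score.
order = list(TopologicalSorter(dependencies).static_order())
```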

File format: Common Dataset

A subset of the common dataset.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/common_dataset.h5ad

Format:

AnnData object
 obs: 'cell_type', 'batch', 'sample', 'donor', 'group', 'is_control', 'split'
 var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
 layers: 'preprocessed'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'goal_batch', 'parameter_som_xdim', 'parameter_som_ydim', 'parameter_num_clusters'

Data structure:

Slot Type Description
obs["cell_type"] string Cell type information.
obs["batch"] string Batch information.
obs["sample"] string Sample ID.
obs["donor"] string Donor ID.
obs["group"] string Biological group of the donor.
obs["is_control"] integer Whether the sample the cell came from can be used as a control for batch effect correction. * 0: cannot be used as a control. * >= 1: can be used as a control. * For cells with >= 1: cells with the same value come from the same donor. Different values indicate different donors.
obs["split"] integer Which split the cell will be used in. * 0: control samples. * 1: split 1. * 2: split 2.
var["numeric_id"] integer Numeric ID associated with each marker.
var["channel"] string The channel / detector of the instrument.
var["marker"] string (Optional) The marker name associated with the channel.
var["marker_type"] string Whether the marker is a functional or lineage marker.
var["to_correct"] boolean Whether the marker will be batch corrected.
layers["preprocessed"] double Preprocessed data, e.g. already compensated and transformed, with debris and doublets removed.
uns["dataset_id"] string A unique identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["goal_batch"] integer Parameter to set the reference batch to which the batch alignment is performed. Only useful for tools that perform the batch integration “towards a goal batch”.
uns["parameter_som_xdim"] integer Parameter used to define the width of the self-organizing map (SOM) grid. Usually between 10 and 20.
uns["parameter_som_ydim"] integer Parameter used to define the height of the self-organizing map (SOM) grid. Usually between 10 and 20.
uns["parameter_num_clusters"] integer Parameter used to define the number of clusters. Set this number to be slightly higher than the number of cell types expected in the dataset.
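
The `is_control` encoding described in the table above can be read as follows: 0 marks non-control cells, and each value of 1 or more identifies one control donor. The sketch below groups control donors by the batches they appear in. It uses plain dictionaries as stand-ins for rows of `obs` so that it runs without dependencies; the function name `control_batches` and all values are hypothetical.

```python
from collections import defaultdict

def control_batches(cells):
    """Map each control donor code (is_control >= 1) to the batches it appears in."""
    batches_by_control = defaultdict(set)
    for cell in cells:
        if cell["is_control"] >= 1:
            batches_by_control[cell["is_control"]].add(cell["batch"])
    return dict(batches_by_control)

# Hypothetical per-cell obs records; values are made up for illustration.
cells = [
    {"sample": "s1", "batch": "1", "is_control": 1},
    {"sample": "s2", "batch": "2", "is_control": 1},
    {"sample": "s3", "batch": "1", "is_control": 0},
    {"sample": "s4", "batch": "2", "is_control": 2},
]
result = control_batches(cells)
```

A control donor that appears in more than one batch (code 1 in this toy input) is the kind of paired sample a control-based correction could use as an anchor.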

Component type: Data processor

A data processor.

Arguments:

Name Type Description
--input file A subset of the common dataset.
--output_censored_split1 file (Output) An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias.
--output_censored_split2 file (Output) An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias.
--output_unintegrated file (Output) The complete unintegrated dataset.
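
The censoring step can be sketched as below: only the `obs` columns listed for the censored format are kept, and cells are selected by their `split` value. Plain dictionaries stand in for rows of `obs`; the function name `censor` and the toy records are hypothetical. Passing control cells (split 0) along with split 1 is an assumption made for the example, since the source does not state how control cells are distributed across the two outputs.

```python
# obs columns that remain visible in the censored files.
CENSORED_OBS = ("batch", "sample", "is_control")

def censor(cells, keep_splits):
    """Keep cells whose `split` is in `keep_splits` and drop all other obs columns."""
    return [
        {key: cell[key] for key in CENSORED_OBS}
        for cell in cells
        if cell["split"] in keep_splits
    ]

# Hypothetical per-cell obs records from the common dataset.
cells = [
    {"cell_type": "B cell", "batch": "1", "sample": "s1", "donor": "d1",
     "group": "g1", "is_control": 1, "split": 0},
    {"cell_type": "T cell", "batch": "1", "sample": "s2", "donor": "d2",
     "group": "g1", "is_control": 0, "split": 1},
    {"cell_type": "T cell", "batch": "2", "sample": "s3", "donor": "d2",
     "group": "g1", "is_control": 0, "split": 2},
]

# Assumption: control cells (split 0) accompany each split.
censored_split1 = censor(cells, keep_splits={0, 1})
```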

File format: Censored (split 1)

An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/censored_split1.h5ad

Description:

An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias. The batch correction algorithm should not have to rely on this information to properly integrate different batches. This dataset is used as the input for the batch correction algorithm. The cells therein are identical to those in the unintegrated dataset.

Format:

AnnData object
 obs: 'batch', 'sample', 'is_control'
 var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
 layers: 'preprocessed'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["sample"] string Sample ID.
obs["is_control"] integer Whether the sample the cell came from can be used as a control for batch effect correction. * 0: cannot be used as a control. * >= 1: can be used as a control. * For cells with >= 1: cells with the same value come from the same donor. Different values indicate different donors.
var["numeric_id"] integer Numeric ID associated with each marker.
var["channel"] string The channel / detector of the instrument.
var["marker"] string (Optional) The marker name associated with the channel.
var["marker_type"] string Whether the marker is a functional or lineage marker.
var["to_correct"] boolean Whether the marker will be batch corrected.
layers["preprocessed"] double Preprocessed data, e.g. already compensated and transformed, with debris and doublets removed.
uns["dataset_id"] string A unique identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.

File format: Censored (split 2)

An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/censored_split2.h5ad

Description:

An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias. The batch correction algorithm should not have to rely on this information to properly integrate different batches. This dataset is used as the input for the batch correction algorithm. The cells therein are identical to those in the unintegrated dataset.

Format:

AnnData object
 obs: 'batch', 'sample', 'is_control'
 var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
 layers: 'preprocessed'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism'

Data structure:

Slot Type Description
obs["batch"] string Batch information.
obs["sample"] string Sample ID.
obs["is_control"] integer Whether the sample the cell came from can be used as a control for batch effect correction. * 0: cannot be used as a control. * >= 1: can be used as a control. * For cells with >= 1: cells with the same value come from the same donor. Different values indicate different donors.
var["numeric_id"] integer Numeric ID associated with each marker.
var["channel"] string The channel / detector of the instrument.
var["marker"] string (Optional) The marker name associated with the channel.
var["marker_type"] string Whether the marker is a functional or lineage marker.
var["to_correct"] boolean Whether the marker will be batch corrected.
layers["preprocessed"] double Preprocessed data, e.g. already compensated and transformed, with debris and doublets removed.
uns["dataset_id"] string A unique identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.

File format: Unintegrated

The complete unintegrated dataset.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/unintegrated.h5ad

Description:

The complete unintegrated dataset. The cells in this dataset are the same as those in the censored datasets.

Format:

AnnData object
 obs: 'cell_type', 'batch', 'sample', 'donor', 'group', 'is_control', 'split'
 var: 'numeric_id', 'channel', 'marker', 'marker_type', 'to_correct'
 layers: 'preprocessed'
 uns: 'dataset_id', 'dataset_name', 'dataset_url', 'dataset_reference', 'dataset_summary', 'dataset_description', 'dataset_organism', 'parameter_som_xdim', 'parameter_som_ydim', 'parameter_num_clusters'

Data structure:

Slot Type Description
obs["cell_type"] string Cell type information.
obs["batch"] string Batch information.
obs["sample"] string Sample ID.
obs["donor"] string Donor ID.
obs["group"] string Biological group of the donor.
obs["is_control"] integer Whether the sample the cell came from can be used as a control for batch effect correction. * 0: cannot be used as a control. * >= 1: can be used as a control. * For cells with >= 1: cells with the same value come from the same donor. Different values indicate different donors.
obs["split"] integer Which split the cell will be used in. * 0: control samples. * 1: split 1. * 2: split 2.
var["numeric_id"] integer Numeric ID associated with each marker.
var["channel"] string The channel / detector of the instrument.
var["marker"] string (Optional) The marker name associated with the channel.
var["marker_type"] string Whether the marker is a functional or lineage marker.
var["to_correct"] boolean Whether the marker will be batch corrected.
layers["preprocessed"] double Preprocessed data, e.g. already compensated and transformed, with debris and doublets removed.
uns["dataset_id"] string A unique identifier for the dataset.
uns["dataset_name"] string Nicely formatted name.
uns["dataset_url"] string (Optional) Link to the original source of the dataset.
uns["dataset_reference"] string (Optional) Bibtex reference of the paper in which the dataset was published.
uns["dataset_summary"] string Short description of the dataset.
uns["dataset_description"] string Long description of the dataset.
uns["dataset_organism"] string (Optional) The organism of the sample in the dataset.
uns["parameter_som_xdim"] integer Parameter used to define the width of the self-organizing map (SOM) grid. Usually between 10 and 20.
uns["parameter_som_ydim"] integer Parameter used to define the height of the self-organizing map (SOM) grid. Usually between 10 and 20.
uns["parameter_num_clusters"] integer Parameter used to define the number of clusters. Set this number to be slightly higher than the number of cell types expected in the dataset.

Component type: Method

A method for correcting batch effects in cytometry data.

Arguments:

Name Type Description
--input file An unintegrated dataset with certain columns (cells metadata), such as the donor information, hidden. These columns are intentionally hidden to prevent bias.
--output file (Output) Integrated dataset whose batch effects were corrected by an algorithm.
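
The contract of a method component (read the preprocessed matrix, correct the markers flagged `to_correct`, return every marker) can be illustrated with a deliberately naive correction that shifts each batch's per-marker mean to the global mean. This is a toy stand-in, not one of the benchmarked methods. Nested lists with one row per cell stand in for `layers["preprocessed"]`, a dictionary stands in for the output AnnData object, and the function name, IDs, and values are all hypothetical.

```python
from statistics import mean

def center_per_batch(matrix, batches, to_correct):
    """Toy correction: align each batch's per-marker mean to the global mean."""
    n_markers = len(to_correct)
    global_means = [mean(row[j] for row in matrix) for j in range(n_markers)]
    rows_by_batch = {}
    for i, batch in enumerate(batches):
        rows_by_batch.setdefault(batch, []).append(i)
    integrated = [list(row) for row in matrix]  # copy, so all markers are returned
    for rows in rows_by_batch.values():
        for j in range(n_markers):
            if not to_correct[j]:
                continue  # uncorrected markers are passed through unchanged
            shift = global_means[j] - mean(matrix[i][j] for i in rows)
            for i in rows:
                integrated[i][j] += shift
    return integrated

# Hypothetical input: 4 cells x 2 markers, two batches, second marker not corrected.
matrix = [[1.0, 10.0], [3.0, 10.0], [5.0, 20.0], [7.0, 20.0]]
batches = ["1", "1", "2", "2"]
to_correct = [True, False]

output = {
    "layers": {"integrated": center_per_batch(matrix, batches, to_correct)},
    "uns": {"dataset_id": "toy_dataset", "method_id": "toy_mean_centering"},
}
```

Note that the second marker is returned untouched, in line with the requirement that all input markers appear in the output.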

Component type: Control Method

Quality control methods for verifying the pipeline.

Arguments:

Name Type Description
--input_unintegrated file The complete unintegrated dataset.
--output_integrated_split1 file (Output) Integrated dataset whose batch effects were corrected by an algorithm.
--output_integrated_split2 file (Output) Integrated dataset whose batch effects were corrected by an algorithm.

Component type: Metric

A metric that scores the integrated outputs using the unintegrated dataset as reference.

Arguments:

Name Type Description
--input_unintegrated file The complete unintegrated dataset.
--input_integrated_split1 file Integrated dataset whose batch effects were corrected by an algorithm.
--input_integrated_split2 file Integrated dataset whose batch effects were corrected by an algorithm.
--output file (Output) File indicating the score of a metric.
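
As a rough illustration of what a metric component computes, the sketch below scores residual batch differences as the spread of per-batch marker means, averaged over markers (lower means less residual batch effect). This is a hypothetical toy measure, not one of the benchmark's metrics. It uses a single integrated matrix, whereas a real metric receives both integrated splits plus the unintegrated file; batch labels are assumed to come from the unintegrated `obs`, since the integrated format lists no `obs` columns.

```python
from statistics import mean

def batch_mean_gap(integrated, batches):
    """Average, over markers, of the spread (max - min) of per-batch means."""
    rows_by_batch = {}
    for row, batch in zip(integrated, batches):
        rows_by_batch.setdefault(batch, []).append(row)
    n_markers = len(integrated[0])
    gaps = []
    for j in range(n_markers):
        batch_means = [mean(r[j] for r in rows) for rows in rows_by_batch.values()]
        gaps.append(max(batch_means) - min(batch_means))
    return mean(gaps)

# Hypothetical integrated matrix (one row per cell) and matching batch labels.
integrated = [[3.0, 10.0], [5.0, 10.0], [3.0, 20.0], [5.0, 20.0]]
batches = ["1", "1", "2", "2"]
value = batch_mean_gap(integrated, batches)
```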

File format: Integrated (split 1)

Integrated dataset whose batch effects were corrected by an algorithm.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/integrated_split1.h5ad

Format:

AnnData object
 layers: 'integrated'
 uns: 'dataset_id', 'method_id', 'parameters'

Data structure:

Slot Type Description
layers["integrated"] double The integrated data as returned by a batch correction method.
uns["dataset_id"] string A unique identifier for the dataset.
uns["method_id"] string A unique identifier for the method.
uns["parameters"] object (Optional) The parameters used for the integration.
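
A method author may want to sanity-check an output against this format before writing it. The sketch below is one possible check, assuming a dictionary stand-in for the AnnData object and a matrix stored with one row per cell; the function name `check_integrated` and the toy inputs are hypothetical.

```python
REQUIRED_UNS = ("dataset_id", "method_id")  # 'parameters' is optional

def check_integrated(output, n_cells, n_markers):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    matrix = output.get("layers", {}).get("integrated")
    if matrix is None:
        problems.append("missing layers['integrated']")
    elif len(matrix) != n_cells or any(len(row) != n_markers for row in matrix):
        problems.append("layers['integrated'] does not match the input shape")
    for key in REQUIRED_UNS:
        if key not in output.get("uns", {}):
            problems.append(f"missing uns['{key}']")
    return problems

good = {
    "layers": {"integrated": [[0.1, 0.2], [0.3, 0.4]]},
    "uns": {"dataset_id": "toy_dataset", "method_id": "toy_method"},
}
# One marker was dropped and method_id is missing.
bad = {"layers": {"integrated": [[0.1], [0.3]]}, "uns": {"dataset_id": "toy_dataset"}}
```

The shape check reflects the requirement that every input marker is returned, whether or not it was corrected.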

File format: Integrated (split 2)

Integrated dataset whose batch effects were corrected by an algorithm.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/integrated_split2.h5ad

Format:

AnnData object
 layers: 'integrated'
 uns: 'dataset_id', 'method_id', 'parameters'

Data structure:

Slot Type Description
layers["integrated"] double The integrated data as returned by a batch correction method.
uns["dataset_id"] string A unique identifier for the dataset.
uns["method_id"] string A unique identifier for the method.
uns["parameters"] object (Optional) The parameters used for the integration.

File format: Score

File indicating the score of a metric.

Example file: resources_test/task_cyto_batch_integration/mouse_spleen_flow_cytometry_subset/score.h5ad

Format:

AnnData object
 uns: 'dataset_id', 'method_id', 'metric_ids', 'metric_values'

Data structure:

Slot Type Description
uns["dataset_id"] string A unique identifier for the dataset.
uns["method_id"] string A unique identifier for the batch correction method.
uns["metric_ids"] string One or more unique metric identifiers.
uns["metric_values"] double The metric values obtained. Must be the same length as ‘metric_ids’.
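
The length constraint between `metric_ids` and `metric_values` can be enforced when the score is assembled. The sketch below uses a dictionary as a stand-in for the `uns` slot of the score file; the helper name `make_score` and the IDs and values are hypothetical.

```python
def make_score(dataset_id, method_id, metric_ids, metric_values):
    """Build the uns content of a score file, enforcing one value per metric ID."""
    if len(metric_ids) != len(metric_values):
        raise ValueError("metric_ids and metric_values must have the same length")
    return {
        "uns": {
            "dataset_id": dataset_id,
            "method_id": method_id,
            "metric_ids": list(metric_ids),
            "metric_values": [float(v) for v in metric_values],
        }
    }

# Hypothetical identifiers and values, for illustration only.
score = make_score("toy_dataset", "toy_method", ["toy_metric_a", "toy_metric_b"], [0.5, 2])
```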