Commit 071e16a: Update training_tips.rst
1 parent f0f5e60

File tree: 1 file changed

docs/training_tips.rst

Lines changed: 15 additions & 7 deletions
A few tips on training models
=============================

trVAE
-----

- We recommend setting `recon_loss` to `nb` or `zinb`. These loss functions require access to raw counts, not normalized data: keep normalized, log-transformed data in `adata.X`, raw count data in `adata.raw.X`, and per-cell normalization factors in `adata.obs['scale_factors']`. These normalization factors can be obtained with `scanpy.pp.normalize_total` or other normalization methods such as `scran <https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html>`_.
- If you don't have access to count data and only have normalized data, set `recon_loss` to `mse`.
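The data layout described above can be sketched with plain numpy. The median-of-library-sizes rule below mirrors what `scanpy.pp.normalize_total` computes by default (an assumption; in practice you would call scanpy itself), and the `scale_factors` name follows the tip above:

```python
import numpy as np

# Toy raw count matrix: 4 cells x 5 genes (nb/zinb losses need integer counts).
counts = np.array([
    [3, 0, 1, 2, 4],
    [0, 1, 0, 5, 2],
    [2, 2, 2, 2, 2],
    [1, 0, 0, 0, 1],
], dtype=float)

# Per-cell library sizes.
lib_size = counts.sum(axis=1)

# Scale factors relative to the median library size, mirroring
# scanpy.pp.normalize_total with its default target_sum=None.
scale_factors = lib_size / np.median(lib_size)

# Normalized, log-transformed matrix that would go into adata.X,
# while the raw counts stay in adata.raw.X and the factors in
# adata.obs['scale_factors'].
X_norm = np.log1p(counts / scale_factors[:, None])
```
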

- trVAE relies on an extra MMD term to force further integration of datasets. The parameter `beta` (default 1) regulates the effect of the MMD term in training. Higher values of `beta` force extra mixing (and might remove biological variation if too big!), while smaller values might result in less mixing (a remaining batch effect). If you set `beta` = `0`, the model reduces to a vanilla CVAE.
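To make the role of `beta` concrete, here is a minimal sketch of a `beta`-weighted MMD penalty on latent codes from two conditions. The RBF-kernel estimator is a standard biased squared MMD, not scArches' exact implementation:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    # Biased estimate of the squared MMD between samples x and y.
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
z_batch1 = rng.normal(0.0, 1.0, size=(50, 10))  # latent codes, condition 1
z_batch2 = rng.normal(2.0, 1.0, size=(50, 10))  # latent codes, condition 2 (shifted)

beta = 1.0        # the default in the tip above; beta = 0 recovers a vanilla CVAE
recon_loss = 0.0  # placeholder for the reconstruction term
total_loss = recon_loss + beta * mmd2(z_batch1, z_batch2)
```

Raising `beta` scales this penalty up, pulling the two conditions' latent distributions together; with `beta = 0` only the reconstruction term remains.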

- It is important to use highly variable genes for training. We recommend using at least 2000 HVGs; if you have more complicated datasets with many conditions, try increasing this to 5000 or so to include enough information for the model.
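In practice you would select HVGs with `scanpy.pp.highly_variable_genes`; as an illustrative stand-in, a simple variance ranking can be sketched without scanpy:

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes = 200, 6000
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)
X[:, :50] += rng.poisson(5.0, size=(n_cells, 50))  # make the first 50 genes extra variable

n_top = 2000  # at least 2000 HVGs as recommended; ~5000 for complex datasets
gene_var = X.var(axis=0)
hvg_idx = np.argsort(gene_var)[::-1][:n_top]  # indices of the most variable genes
X_hvg = X[:, hvg_idx]                          # subset used for training
```
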

- Regarding `architecture`, always try the default first ([128, 128] with `z_dimension` = 10) and check the results. If you have more complicated data with many datasets and conditions, you can increase the depth ([128, 128, 128] or [128, 128, 128, 128]). According to our experiments, small values of `z_dimension` between 10 (the default) and 20 are good.
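To see what deepening `architecture` costs, a hypothetical dense encoder built from the layer list can be sketched in plain Python (the parameter names follow the tip above; the MLP construction is an assumption, not scArches' code):

```python
def encoder_layer_sizes(n_genes, architecture, z_dimension):
    # Layer widths of a hypothetical encoder MLP: input -> hidden layers -> latent.
    return [n_genes] + list(architecture) + [z_dimension]

def n_parameters(sizes):
    # Dense-layer parameter count: weights (a*b) plus biases (b) per layer pair.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

default = encoder_layer_sizes(2000, [128, 128], z_dimension=10)        # recommended start
deeper = encoder_layer_sizes(2000, [128, 128, 128], z_dimension=10)    # for complex data
```

Each extra 128-unit layer adds only 128 * 128 + 128 parameters, which is small next to the input layer, so deepening is cheap relative to widening.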

scVI
----

- scVI requires access to raw count data.
- scVI already comes with good default parameters; the main thing you might change is `n_hidden`, which we suggest increasing to at least 2 and at most 4-5 for more complicated datasets.

scANVI
------

- scANVI requires access to raw count data.
- Query data should either be treated as unlabelled (`Unknown`) or use the same set of cell-type labels as the reference. If the query data contains a cell-type label that is not in the reference and you use it during query training, you will get an error! We will fix this in future releases.
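The label constraint above can be checked before training starts. The helper below is hypothetical (not part of the scArches API); it simply verifies that every query label is either the unlabelled key or already known to the reference:

```python
def validate_query_labels(reference_labels, query_labels, unknown_key="Unknown"):
    # Query cells must be unlabelled (unknown_key) or use cell types already
    # present in the reference; anything else raises, mirroring the error
    # scANVI query training would produce.
    allowed = set(reference_labels) | {unknown_key}
    unseen = sorted(set(query_labels) - allowed)
    if unseen:
        raise ValueError(f"Query contains cell types absent from the reference: {unseen}")
    return True

reference = ["B cell", "T cell", "NK cell"]
ok_query = ["B cell", "Unknown", "T cell"]   # passes: all labels allowed
bad_query = ["B cell", "Mast cell"]          # fails: "Mast cell" is new
```
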

- If you want better separation of cell types, you can increase `n_epochs`. In most cases 100 epochs yield good quality, but you can increase up to 200. If cell types that should stay separate are merged, try increasing `n_epochs` and decreasing `alpha` (see the next tip).

- If you want to increase the mixing of the different batches, try increasing `alpha` when you construct the model. The maximum value of `alpha` is 1. Increasing `alpha` gives better mixing, but it is a trade-off: it might also merge some small cell types or conditions. Start with a very small value (e.g. 0.0001) and increase it stepwise (0.001 -> 0.005 -> 0.01, then 0.1 and finally 0.5).

