A few tips on training models
=============================

trVAE
 - We recommend setting `recon_loss` to `nb` or `zinb`. These loss functions require access to count data, not normalized data: you need normalized, log-transformed data in `adata.X` and raw count data in `adata.raw.X`. You also need normalization factors for each cell in `adata.obs['scale_factors']`. These factors can be obtained with `scanpy.pp.normalize_total <https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html>`_ or other normalization methods such as `scran <https://bioconductor.org/packages/devel/bioc/vignettes/scran/inst/doc/scran.html>`_.

 - If you don't have access to count data and only have normalized data, set `recon_loss` to `mse`.

 - trVAE relies on an extra MMD term to force further integration of datasets. The parameter `beta` (default = 1) regulates the effect of the MMD term during training. Higher values of `beta` force extra mixing (and might remove biological variation if too large!), while smaller values might result in less mixing (residual batch effects). If you set `beta` to `0`, the model reduces to a vanilla CVAE.

 - It is important to use highly variable genes for training. We recommend using at least 2000 HVGs; for more complicated datasets with many conditions, try increasing this to around 5000 to include enough information for the model.

 - Regarding `architecture`, always try the default first ([128, 128] with `z_dimension` = 10) and check the results. For more complicated data with many datasets and conditions, you can increase the depth ([128, 128, 128] or [128, 128, 128, 128]). According to our experiments, small values of `z_dimension` between 10 (default) and 20 work well.

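To make the expected data layout concrete, here is a minimal NumPy sketch of what the normalization above produces. The array names are illustrative, not part of the scanpy or scArches API; with `target_sum=None`, `scanpy.pp.normalize_total` scales each cell to the median library size, which is equivalent to dividing by the per-cell factors computed below.

```python
import numpy as np

# Toy count matrix: 3 cells x 4 genes (what belongs in adata.raw.X).
counts = np.array([[4.0, 0.0, 2.0, 2.0],
                   [1.0, 3.0, 0.0, 0.0],
                   [10.0, 5.0, 5.0, 0.0]])

# Per-cell library sizes, scaled so the median cell has factor 1.
# These are the values you would store in adata.obs['scale_factors'].
lib_size = counts.sum(axis=1)                    # [8, 4, 20]
scale_factors = lib_size / np.median(lib_size)   # [1.0, 0.5, 2.5]

# Normalized, log-transformed matrix (what belongs in adata.X).
normalized = np.log1p(counts / scale_factors[:, None])
```

In practice you would keep `counts` in `adata.raw.X`, store `scale_factors` in `adata.obs`, and place `normalized` in `adata.X` before constructing the model.
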
scVI
 - scVI requires access to raw count data.
 - scVI already comes with good default parameters; the only one you might change is `n_layers`, which we suggest increasing to at least 2 and at most 4-5 for more complicated datasets.

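Because scVI trains directly on counts, it is worth verifying that the matrix you pass really contains raw counts before training. This small helper is our own illustration, not part of the scVI API:

```python
import numpy as np

def looks_like_raw_counts(X: np.ndarray) -> bool:
    """Heuristic check that X holds raw counts: non-negative integers."""
    return bool(np.all(X >= 0) and np.all(np.mod(X, 1) == 0))

raw = np.array([[0.0, 3.0, 7.0], [1.0, 0.0, 2.0]])
logged = np.log1p(raw)

looks_like_raw_counts(raw)     # True: safe to pass to scVI
looks_like_raw_counts(logged)  # False: data was already transformed
```
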
scANVI
 - It requires access to raw count data.
 - If you have query data, it should either be treated as unlabelled (`Unknown`) or carry the same set of cell-type labels as the reference. If a cell-type label appears in the query data but not in the reference and you want to use it during query training, you will get an error! We will fix this in future releases.
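One way to satisfy the label constraint above is to relabel query cells before training. A minimal sketch, where the `cell_type` values and the `Unknown` token are illustrative and should match whatever unlabelled category you configured in scANVI:

```python
# Cell-type labels seen in the reference.
reference_labels = {"B cell", "T cell", "NK cell"}

# Query annotations may contain labels the reference never saw.
query_labels = ["B cell", "T cell", "Plasma cell", "T cell"]

# Mask any label absent from the reference as "Unknown" (treating all
# query cells as "Unknown" is the safe default) to avoid the error above.
masked = [lbl if lbl in reference_labels else "Unknown" for lbl in query_labels]
# masked == ["B cell", "T cell", "Unknown", "T cell"]
```
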