memory error while generating bus file with large index #16

Hi,

I have been following this tutorial on generating spliced and unspliced matrices for RNA velocity analysis: https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices

In that workflow, intronic sequences have to be included in the index to differentiate between spliced and unspliced transcripts. The kallisto index (7 GB in size) was made from the genome "BSgenome.Hsapiens.UCSC.hg38".

Then I get this output when I try to generate the initial bus file using this index:

```
kallisto bus -i ../../output/mm_cDNA_introns_97.idx \
  -o ../../output/neuron10k_velocity -x 10xv3 -t8 \
  neuron_10k_v3_S1_L002_R1_001.fastq.gz neuron_10k_v3_S1_L002_R2_001.fastq.gz \
  neuron_10k_v3_S1_L001_R1_001.fastq.gz neuron_10k_v3_S1_L001_R2_001.fastq.gz

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[index] k-mer length: 31
[index] number of targets: 1,378,373
[index] number of k-mers: 1,560,141,285
[index] number of equivalence classes: 4,890,909,195,324,358,656
terminate called after throwing an instance of 'std::length_error'

```
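
A quick plausibility check on the numbers in this log can be done with plain shell arithmetic. The sketch below only restates the two figures printed above; the variable names and the 1-byte-per-class figure are illustrative assumptions, not anything kallisto reports. It suggests the reported count is more than three billion equivalence classes per k-mer, and that its hexadecimal form has all of the low 53 bits set to zero. One possible reading, not confirmed here, is that the value is not a genuine count, in which case adding memory would not help.

```shell
#!/usr/bin/env bash
# Values copied from the kallisto log above.
ec=4890909195324358656      # reported number of equivalence classes
kmers=1560141285            # reported number of k-mers

# How many equivalence classes per k-mer does the log imply?
ratio=$(( ec / kmers ))

# Hexadecimal form of the count, zero-padded to 64 bits.
hex=$(printf '%016x' "$ec")

# Memory needed at a deliberately optimistic 1 byte per class, in GiB
# (the 1-byte figure is an illustrative assumption, not kallisto's layout).
gib=$(( ec >> 30 ))

echo "classes per k-mer: $ratio"
echo "hex: 0x$hex"
echo "GiB at 1 byte/class: $gib"
```

Comparing the last figure with the 200 GB available on the cluster gives a sense of the gap.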

Is there a way to reduce the number of equivalence classes produced and reduce memory usage while retaining the intronic sequences in the index file? I tried submitting this job to a cluster with 200 GB of memory and it still wasn't enough to complete the job.

This was the code I used to make the index: 

```
library(AnnotationHub)
ah <- AnnotationHub()
query(ah, pattern = c("Ensembl", "97", "Homo sapiens", "EnsDb"))

AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH73881
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 97 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based o...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("97", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl",
#   "Gene", "Protein", "Transcript") 
# retrieve record with 'object[["AH73881"]]' 

edb <- ah[["AH73881"]]
library(BSgenome.Hsapiens.UCSC.hg38)
library(BUSpaRse)
get_velocity_files(ah[["AH73881"]], L = 91, Genome = BSgenome.Hsapiens.UCSC.hg38, 
                   out_path = "./16_Azi_velocity", 
                   isoform_action = "separate")
# Run in a shell, not in R:
kallisto index -i ./hg_cDNA_introns_97.idx ./16_Azi_velocity/cDNA_introns.fa
```
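
One way to cross-check an index against its input is to count the records in the FASTA file and compare the result with the "number of targets" line that kallisto prints, on the assumption that each FASTA record becomes one target. The sketch below uses a hypothetical helper, `count_fasta_records`, and a tiny stand-in file so that it runs anywhere; the real check would point it at `./16_Azi_velocity/cDNA_introns.fa`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count sequence records (header lines) in a FASTA file.
count_fasta_records() {
  grep -c '^>' "$1"
}

# Tiny stand-in FASTA with two records, so the sketch is self-contained.
# Replace "$demo_fa" with ./16_Azi_velocity/cDNA_introns.fa for the real check.
demo_fa=$(mktemp)
printf '>tx1\nACGT\n>tx1-I\nGGCC\n' > "$demo_fa"

n_records=$(count_fasta_records "$demo_fa")
echo "records: $n_records"

rm -f "$demo_fa"
```

It may also be worth confirming that the path passed to `kallisto bus -i` is the same file that `kallisto index -i` wrote, since the two commands in this post use different file names (`mm_cDNA_introns_97.idx` and `hg_cDNA_introns_97.idx`).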

In the tutorial from https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices, they also generate the velocity files using a genome:

```
get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10, 
                   out_path = "./output/neuron10k_velocity", 
                   isoform_action = "separate")
```
The tutorial explains that the goal was to "build a kallisto index for cDNAs as reads are pseudoaligned to cDNAs. Here, for RNA velocity, as reads are pseudoaligned to the flanked intronic sequences in addition to the cDNAs, the flanked intronic sequences should also be part of the kallisto index."

Was I supposed to do something else? Thank you.
