Memory error while generating bus file with large index #16

@tangybat

Hi,

I have been following this tutorial on generating spliced and unspliced matrices for RNA velocity analysis: https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices

In the tutorial, intronic sequences have to be included in the index to distinguish spliced from unspliced transcripts. My kallisto index (7 GB in size) was made using the genome "BSgenome.Hsapiens.UCSC.hg38".

Then I get this output when I try to generate the initial bus file using this index:

kallisto bus -i ../../output/mm_cDNA_introns_97.idx \
  -o ../../output/neuron10k_velocity -x 10xv3 -t8 \
  neuron_10k_v3_S1_L002_R1_001.fastq.gz neuron_10k_v3_S1_L002_R2_001.fastq.gz \
  neuron_10k_v3_S1_L001_R1_001.fastq.gz neuron_10k_v3_S1_L001_R2_001.fastq.gz

[bus] Note: Strand option was not specified; setting it to --fr-stranded for specified technology
[index] k-mer length: 31
[index] number of targets: 1,378,373
[index] number of k-mers: 1,560,141,285
[index] number of equivalence classes: 4,890,909,195,324,358,656
terminate called after throwing an instance of 'std::length_error'
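
As a rough sanity check on the log above, here is a minimal sketch that compares the reported equivalence-class count with the target and k-mer counts. The bound it uses (equivalence classes should not exceed targets plus k-mers) is my own heuristic assumption, not something taken from the kallisto documentation, and the helper names `log_count` and `ec_plausibility` are made up for illustration.

```shell
#!/bin/sh
# Heuristic sanity check on kallisto's "[index]" log lines.
# Assumption (not from the kallisto docs): the equivalence-class count
# should not exceed targets + k-mers. A far larger value may hint at an
# index that was misread rather than one that is genuinely that large.

# Print the integer after the last colon of a log line, with commas removed.
log_count() {
  printf '%s\n' "$1" | sed -e 's/.*: *//' -e 's/,//g'
}

# Print "plausible" or "implausible" given the three log lines
# (targets, k-mers, equivalence classes), in that order.
ec_plausibility() {
  targets=$(log_count "$1")
  kmers=$(log_count "$2")
  ecs=$(log_count "$3")
  if [ "$ecs" -le $((targets + kmers)) ]; then
    echo "plausible"
  else
    echo "implausible"
  fi
}

# The three lines from the log in this issue.
ec_plausibility \
  "[index] number of targets: 1,378,373" \
  "[index] number of k-mers: 1,560,141,285" \
  "[index] number of equivalence classes: 4,890,909,195,324,358,656"
```

For the values in my log this prints `implausible`: the reported count is billions of times larger than the number of k-mers, so the problem may lie in how the index is being read rather than in its actual size.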

Is there a way to reduce the number of equivalence classes, and with it the memory usage, while keeping the intronic sequences in the index file? I submitted this job to a cluster with 200 GB of memory, and that still wasn't enough to complete it.

This was the code I used to make the index:

library(AnnotationHub)
ah <- AnnotationHub()
query(ah, pattern = c("Ensembl", "97", "Homo sapiens", "EnsDb"))

AnnotationHub with 1 record
# snapshotDate(): 2021-10-20
# names(): AH73881
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: EnsDb
# $rdatadateadded: 2019-05-02
# $title: Ensembl 97 EnsDb for Homo sapiens
# $description: Gene and protein annotations for Homo sapiens based o...
# $taxonomyid: 9606
# $genome: GRCh38
# $sourcetype: ensembl
# $sourceurl: http://www.ensembl.org
# $sourcesize: NA
# $tags: c("97", "AHEnsDbs", "Annotation", "EnsDb", "Ensembl",
#   "Gene", "Protein", "Transcript") 
# retrieve record with 'object[["AH73881"]]' 

edb <- ah[["AH73881"]]
library(BSgenome.Hsapiens.UCSC.hg38)
library(BUSpaRse)
# Reuse the EnsDb object retrieved above instead of fetching it again
get_velocity_files(edb, L = 91, Genome = BSgenome.Hsapiens.UCSC.hg38, 
                   out_path = "./16_Azi_velocity", 
                   isoform_action = "separate")

kallisto index -i ./hg_cDNA_introns_97.idx ./16_Azi_velocity/cDNA_introns.fa
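
Note that the index built here is named `hg_cDNA_introns_97.idx`, while the `kallisto bus` call earlier reads `mm_cDNA_introns_97.idx`; these may or may not be the same file. A minimal dry-run sketch of keeping the index path in one variable so that the two steps cannot drift apart is below. `DRY_RUN`, `run`, `build_index`, `make_bus`, and the variable names are hypothetical helpers, and the paths are copied from this post and assume a single working directory.

```shell
#!/bin/sh
# Dry-run sketch: define the index path once and reuse it for both steps.
# DRY_RUN, run(), build_index() and make_bus() are hypothetical helpers.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Paths taken from this post; adjust to the actual working directory.
INDEX=./hg_cDNA_introns_97.idx
FASTA=./16_Azi_velocity/cDNA_introns.fa
OUT=../../output/neuron10k_velocity

# Build the index from the cDNA + flanked intron FASTA.
build_index() {
  run kallisto index -i "$INDEX" "$FASTA"
}

# Generate the bus file from the same index; FASTQ files are passed as arguments.
make_bus() {
  run kallisto bus -i "$INDEX" -o "$OUT" -x 10xv3 -t 8 "$@"
}

build_index
make_bus \
  neuron_10k_v3_S1_L002_R1_001.fastq.gz neuron_10k_v3_S1_L002_R2_001.fastq.gz \
  neuron_10k_v3_S1_L001_R1_001.fastq.gz neuron_10k_v3_S1_L001_R2_001.fastq.gz
```

Setting `DRY_RUN=0` would execute the commands instead of printing them, which requires kallisto to be on the `PATH`.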

The tutorial (https://bustools.github.io/BUS_notebooks_R/velocity.html#generate_spliced_and_unspliced_matrices) also generates the velocity files using a genome:

get_velocity_files(edb, L = 91, Genome = BSgenome.Mmusculus.UCSC.mm10, 
                   out_path = "./output/neuron10k_velocity", 
                   isoform_action = "separate")

The tutorial states that the goal is to "build a kallisto index for cDNAs as reads are pseudoaligned to cDNAs. Here, for RNA velocity, as reads are pseudoaligned to the flanked intronic sequences in addition to the cDNAs, the flanked intronic sequences should also be part of the kallisto index." Was I supposed to do something else? Thank you.
