Description
I'm trying to use simpleaf to build an index for Glycine max (soybean). The genome and GTF files required some preprocessing to get them properly formatted.
I ran the following command (using simpleaf 0.16.2):
simpleaf index --output simpleaf_index --fasta ../../g_max.genome.fasta --gtf ../../g_max.longest_transcripts.gtf --rlen 91 --threads 16 --use-piscem
Which resulted in the following output:
2024-06-04T10:05:48.261414Z INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2024-06-04T10:05:50.342651Z INFO grangers::reader::gtf: Finished parsing the input file. Found 0 comments and 752330 records.
2024-06-04T10:05:51.029383Z INFO roers: Built the Grangers object for 752330 records
2024-06-04T10:05:51.237147Z WARN grangers::grangers_info: The exon_number column contains null values. Will compute the exon number from exon start position .
2024-06-04T10:05:51.527120Z WARN roers: Found missing gene_id and/or gene_name; Imputing. If both missing, will impute using transcript_id; Otherwise, will impute using the existing one.
2024-06-04T10:05:51.549542Z INFO roers: Proceed 278761 exon records from 55589 transcripts
Error: invalid base: 0067
The error message is a bit cryptic, so I don't really know what to do. I tried searching some of the Rust repositories but haven't found the source of the error message yet.
If relevant, I can provide the genome and GTF files.
EDIT:
Upon further investigation, this seems to stem from the noodles crate: https://github.com/zaeleus/noodles/blob/906f5237c68fc6b04a73010580d3c4fed2c7b66e/noodles-fasta/src/record/sequence/complement.rs#L24. However, I don't really understand what's wrong yet.
Quick Python check:
>>> bytes([67])
b'C'
So the base would be C, which should be possible to reverse complement?
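A possible follow-up check, as a minimal sketch. The check above reads 0067 as a decimal byte value, but the number could also be hexadecimal. I have not confirmed which format noodles uses when printing the error, so the hex reading is only an assumption. The snippet decodes the code both ways and tallies the characters that occur on FASTA sequence lines, which would show whether the genome contains lowercase (soft-masked) or other unexpected bases. The helper names `decode_base_code` and `count_sequence_bytes` are made up for illustration and are not part of simpleaf, roers, or noodles.

```python
from collections import Counter


def decode_base_code(code: str) -> dict:
    """Decode the number from 'invalid base: NNNN' as decimal and as hex.

    Which of the two readings is correct depends on how the error is
    formatted upstream; that is an assumption to verify, not a known fact.
    """
    return {
        "decimal": chr(int(code, 10)),
        "hex": chr(int(code, 16)),
    }


def count_sequence_bytes(lines) -> Counter:
    """Tally every character on FASTA sequence lines, skipping header lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            counts.update(line)
    return counts


print(decode_base_code("0067"))  # {'decimal': 'C', 'hex': 'g'}

# Toy input for illustration only; for the real genome, pass an open file:
#   with open("../../g_max.genome.fasta") as fh:
#       print(count_sequence_bytes(fh))
toy = [">chr1", "ACGTacgtNN", ">chr2", "GGCC"]
print(count_sequence_bytes(toy))
```

If the tally on the real genome shows lowercase a/c/g/t and the hex reading turns out to be the right one, the offending base would be a lowercase g rather than C. Uppercasing the sequence lines during preprocessing would then be one thing to try.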