BAM and SAM files are fine without a FASTA reference, but CRAM files might be an issue:
Always decodable without a reference (bases are not involved):
- Header — SQ list, RG, samples, sort order.
- CRAI index — byte offsets and coords.
- Slice/container metadata — ref_seq_id, alignment_start/span per slice.
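- Per-record fields: position, flags, MAPQ, read name, mate info, quality scores, aux tags. None of these go through reference-based compression. CIGAR is not stored as such; it is rebuilt from the record's read features and read length, which does not need the reference bases. One caveat: HTSlib by default omits the MD and NM tags when writing and regenerates them from the reference on decode, so those two tags may be unavailable without one.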
Bases depend on how the writer encoded the file. There are three cases, ordered from most to least likely to be readable without a reference:
- RR = false in the CRAM compression header: the writer disabled reference-based compression and stored bases verbatim. Full decode, no reference at all. We already parse RR (crates/seqair/src/cram/compression_header.rs:24); we just don't currently let callers run without the FASTA path.
- RR = true but the slice embeds its own reference (embedded_reference >= 0): the reference bytes ride along inside the slice. Already supported per r[impl cram.slice.embedded_ref]. No external FASTA needed.
- RR = true, no embedded reference: bases require the external FASTA. We currently error here (r[impl cram.edge.missing_reference]), as we should.
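The three cases above can be sketched as a single decision function. This is a minimal illustration, not the seqair implementation: `BaseSource`, `MissingReference`, `base_source`, and the `have_fasta` parameter are hypothetical names, and the sketch assumes that -1 marks "no embedded reference", consistent with the `embedded_reference >= 0` check above.

```rust
/// Where the bases of a slice come from. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum BaseSource {
    /// RR = false: bases are stored verbatim in the records.
    Verbatim,
    /// RR = true and the slice carries its own reference block.
    EmbeddedReference { block_content_id: i32 },
    /// RR = true, no embedded reference: the external FASTA is used.
    ExternalFasta,
}

/// Hypothetical error for the third case when no FASTA was supplied.
#[derive(Debug, PartialEq, Eq)]
pub struct MissingReference;

/// Decide how bases can be decoded for one slice.
///
/// `reference_required` is the RR flag from the compression header,
/// `embedded_reference` is the slice header field (assumed -1 when absent),
/// `have_fasta` says whether the caller supplied an external FASTA.
pub fn base_source(
    reference_required: bool,
    embedded_reference: i32,
    have_fasta: bool,
) -> Result<BaseSource, MissingReference> {
    if !reference_required {
        // Case 1: no reference-based compression at all.
        return Ok(BaseSource::Verbatim);
    }
    if embedded_reference >= 0 {
        // Case 2: the reference rides along inside the slice.
        return Ok(BaseSource::EmbeddedReference {
            block_content_id: embedded_reference,
        });
    }
    if have_fasta {
        // Case 3 with a FASTA available.
        Ok(BaseSource::ExternalFasta)
    } else {
        // Case 3 without a FASTA: bases cannot be reconstructed.
        Err(MissingReference)
    }
}
```

Only the last branch needs to fail, so in a design like this the error could be deferred to the first slice that actually hits it instead of being raised when the file is opened.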
So in practice, what works without a reference for any CRAM is every metadata field, every CIGAR, and every position. Bases also work whenever RR = false or the slice embeds its own reference.
HTSlib behaves the same way: it opens CRAM files without a reference, and reads only fail once cram_decode_slice actually needs to reconstruct sequence and finds no reference.
For pileups, we have a reference_base() that is very nice to have but could be made optional.
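One way an optional reference_base() might look is to return an `Option`, so callers without a FASTA get `None` instead of an error. This is a sketch under assumptions: `PileupColumn`, its fields, and its constructor are hypothetical and do not reflect the actual seqair types.

```rust
/// Hypothetical pileup column; not the actual seqair type.
pub struct PileupColumn<'a> {
    /// 0-based position of this column on the contig.
    pub pos: usize,
    /// 0-based contig position of the first byte in `reference`.
    region_start: usize,
    /// Reference bases for the region, present only if a FASTA was supplied.
    reference: Option<&'a [u8]>,
}

impl<'a> PileupColumn<'a> {
    pub fn new(pos: usize, region_start: usize, reference: Option<&'a [u8]>) -> Self {
        Self { pos, region_start, reference }
    }

    /// Reference base at this column, or `None` when no reference was
    /// loaded or the column lies outside the loaded region.
    pub fn reference_base(&self) -> Option<u8> {
        // `checked_sub` yields None if the column precedes the region.
        let offset = self.pos.checked_sub(self.region_start)?;
        // `get` yields None if the column is past the end of the region.
        self.reference?.get(offset).copied()
    }
}
```

With this shape, pileup code that only needs depth or read-level fields never touches the reference, while callers that do want the base handle the `None` case explicitly.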