tabix performance #1949

@pd3

Kees Albers reported offline that the VCF index performs poorly on very large data. When ChatGPT was asked to write a bgzip parser and the VCF was then queried directly using a simple binary search, that approach was about 500 times faster than tabix (0.17 seconds versus 86 seconds on the test data described here).

The test data (available offline)

  • 7.4 GB in size, containing 6807 variants for ~500K samples
  • the tabix index size is 180 bytes

Test machine

  • 8 cores, 16 GB of RAM, SSD disk

Tabix index performance

  • 86 seconds

Direct binary search

  • 0.17 seconds
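
For readers who want to see what "direct binary search" means here, the following is a minimal sketch of the idea, not the code in the attached varseek.txt. It makes several simplifying assumptions: the file is uncompressed, holds a single contig, and is sorted by POS. The function names (`data_start_offset`, `line_start_at_or_after`, `seek_first_at_or_after`, `query`) are hypothetical. On a bgzip file the same search would presumably probe compressed block boundaries rather than raw byte offsets.

```python
import io


def data_start_offset(f):
    """Return the byte offset of the first non-header line ('#' lines are header)."""
    f.seek(0)
    while True:
        offset = f.tell()
        line = f.readline()
        if not line or not line.startswith(b"#"):
            return offset


def line_start_at_or_after(f, offset, data_start):
    """Return the offset of the first line that begins at or after `offset`."""
    if offset <= data_start:
        return data_start
    f.seek(offset - 1)
    f.readline()  # consume the remainder of the line containing byte offset-1
    return f.tell()


def record_pos(line):
    """Parse the POS column (second tab-separated field) of a record."""
    return int(line.split(b"\t", 2)[1])


def seek_first_at_or_after(f, size, target):
    """Binary-search byte offsets for the first record with POS >= target.

    Assumes a single contig sorted by POS (assumption of this sketch).
    Returns the record's byte offset, or None if no such record exists.
    """
    data_start = data_start_offset(f)
    lo, hi = data_start, size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(line_start_at_or_after(f, mid, data_start))
        line = f.readline()
        if not line or record_pos(line) >= target:
            hi = mid          # answer starts at or before this probe
        else:
            lo = mid + 1      # answer starts strictly after this probe
    start = line_start_at_or_after(f, lo, data_start)
    return start if start < size else None


def query(f, size, begin, end):
    """Return all records with begin <= POS <= end as decoded strings."""
    start = seek_first_at_or_after(f, size, begin)
    if start is None:
        return []
    f.seek(start)
    hits = []
    for line in iter(f.readline, b""):
        if record_pos(line) > end:
            break
        hits.append(line.rstrip(b"\n").decode())
    return hits
```

Calling `query(f, size, begin, end)` on a file opened in binary mode, with `size` set to the file length in bytes, reads only O(log size) lines to find the start of the region and then scans forward. With records averaging around a megabyte each, as in the test data, each probe may still read a full line, but the number of probes stays small.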

varseek.txt
