Benchmark datasets
Tiered approach
Dragon is benchmarked at three scales to validate scalability from laptop to server:
Tier 1: Small (development & CI)
| Property | Value |
|---|---|
| Genomes | 500 complete E. coli / Shigella from RefSeq |
| Total sequence | ~2.5 Gbp |
| Redundancy | High (~95% ANI within species) |
| Index time | <1 hour |
| Index size | ~1.5 GB |
| Use case | Unit testing, CI, rapid iteration |
Tier 2: Medium (validation)
| Property | Value |
|---|---|
| Genomes | ~85,000 GTDB r220 representative genomes |
| Total sequence | ~250 Gbp |
| Redundancy | Medium (one genome per species) |
| Index time | ~1 hour |
| Index size | ~15 GB |
| Use case | Sensitivity/accuracy validation |
Tier 3: Large (full benchmark)
| Property | Value |
|---|---|
| Genomes | ~2.34M GenBank + RefSeq prokaryotic assemblies |
| Total sequence | ~10 Tbp |
| Redundancy | Very high (many strains per species) |
| Index time | ~12 hours |
| Index size | ~100 GB |
| Use case | Full-scale comparison with LexicMap |
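
A quick way to compare the three tiers is index bytes per indexed base. The sketch below only divides the approximate figures quoted in the tier tables; it assumes 1 GB = 10^9 bytes, and the names `TIERS` and `bytes_per_base` are illustrative, not part of Dragon.

```python
# (tier, total sequence in bp, index size in bytes), copied from the tier tables.
# Assumption: 1 GB = 1e9 bytes; all figures are approximate.
TIERS = [
    ("Tier 1", 2.5e9, 1.5e9),
    ("Tier 2", 250e9, 15e9),
    ("Tier 3", 10e12, 100e9),
]


def bytes_per_base(total_bp: float, index_bytes: float) -> float:
    """Return the index footprint per indexed base pair."""
    return index_bytes / total_bp


for name, total_bp, index_bytes in TIERS:
    print(f"{name}: {bytes_per_base(total_bp, index_bytes):.2f} bytes/bp")
```

With the quoted figures this gives roughly 0.60, 0.06, and 0.01 bytes/bp for Tiers 1 to 3. Because the inputs are rounded estimates, the ratios are only indicative.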
Query types
Gene-level queries (primary)
1,000 random subsequences (500-5,000 bp) extracted from 100 diverse genomes
Controlled mutations at 6 divergence levels: 0%, 1%, 3%, 5%, 10%, 15%
Mutation model: 70% substitutions, 20% insertions, 10% deletions
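
The gene-level query generation described above could be sketched as follows. This is a minimal illustration, not Dragon's actual generator: it assumes the divergence level is applied as an independent per-base edit probability, that inserted bases are drawn uniformly from ACGT, and that each genome is at least 500 bp long. The function names `sample_query` and `mutate` are hypothetical.

```python
import random

BASES = "ACGT"
DIVERGENCE_LEVELS = [0.0, 0.01, 0.03, 0.05, 0.10, 0.15]


def sample_query(genome: str, rng: random.Random,
                 min_len: int = 500, max_len: int = 5000) -> str:
    """Extract a random subsequence of 500-5,000 bp from a genome."""
    length = rng.randint(min_len, min(max_len, len(genome)))
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]


def mutate(seq: str, divergence: float, rng: random.Random) -> str:
    """Edit each base with probability `divergence`, choosing the edit type
    as 70% substitution, 20% insertion, 10% deletion."""
    out = []
    for base in seq:
        if rng.random() >= divergence:
            out.append(base)              # position left unchanged
            continue
        kind = rng.random()
        if kind < 0.70:                   # substitution: pick a different base
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind < 0.90:                 # insertion: keep base, add a random one
            out.append(base)
            out.append(rng.choice(BASES))
        # otherwise: deletion, so nothing is emitted for this position
    return "".join(out)
```

A fixed seed (for example `random.Random(42)`) keeps the query set reproducible across runs. Applying `mutate` once per entry in `DIVERGENCE_LEVELS` yields the six variants of each query.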
Long reads (Badread)
Simulated Oxford Nanopore reads from 50 genomes
Mean length: 5 Kbp
Identity range: 85-99%
Chimera rate: 1%
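
As a rough sketch, the long-read settings above map onto a `badread simulate` invocation like the one built below. Only the 5 kbp mean length, the 85-99% identity range, and the 1% chimera rate come from the text. The read depth, the length standard deviation, and the identity mean and standard deviation are assumptions chosen for illustration, and `badread_command` is a hypothetical helper.

```python
import shlex


def badread_command(reference: str, quantity: str = "50x") -> list[str]:
    """Build an argument list for `badread simulate` (reads go to stdout)."""
    return [
        "badread", "simulate",
        "--reference", reference,
        "--quantity", quantity,       # assumption: depth is not stated in the text
        "--length", "5000,4000",      # mean,stdev: 5 kbp mean from the text, stdev assumed
        "--identity", "92,99,3.5",    # mean,max,stdev: only the 85-99% range is from the text
        "--chimeras", "1",            # 1% chimeric reads
    ]


# Print a shell-ready command line for one (hypothetical) genome file.
print(shlex.join(badread_command("genome_001.fasta")))
```

The returned list can be passed to `subprocess.run(..., stdout=fh)` to write the FASTQ output to a file. Check the option formats against the installed Badread version before relying on them.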
Challenging scenarios
16S rRNA: highly conserved, extreme seed frequency
Plasmids: 10-200 Kbp, multi-copy
AMR genes: batch of 1,003 genes from the CARD database
HGT events: genes implanted from distant species