# Benchmark datasets

## Tiered approach

Dragon is benchmarked on three dataset tiers of increasing size to validate that it scales from a laptop to a server:

### Tier 1: Small (development & CI)

| Property | Value |
|----------|-------|
| **Genomes** | 500 complete *E. coli* / *Shigella* from RefSeq |
| **Total sequence** | ~2.5 Gbp |
| **Redundancy** | High (~95% ANI within species) |
| **Index time** | <1 hour |
| **Index size** | ~1.5 GB |
| **Use case** | Unit testing, CI, rapid iteration |

### Tier 2: Medium (validation)

| Property | Value |
|----------|-------|
| **Genomes** | ~85,000 GTDB r220 representative genomes |
| **Total sequence** | ~250 Gbp |
| **Redundancy** | Medium (one genome per species) |
| **Index time** | ~1 hour |
| **Index size** | ~15 GB |
| **Use case** | Sensitivity/accuracy validation |

### Tier 3: Large (full benchmark)

| Property | Value |
|----------|-------|
| **Genomes** | ~2.34M GenBank + RefSeq prokaryotic assemblies |
| **Total sequence** | ~10 Tbp |
| **Redundancy** | Very high (many strains per species) |
| **Index time** | ~12 hours |
| **Index size** | ~100 GB |
| **Use case** | Full-scale comparison with LexicMap |
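
To compare the tiers, it can help to express index size relative to the amount of sequence indexed. The sketch below uses only the rounded figures quoted in the three tables, so the results are rough ratios rather than measurements; the `TIERS` dictionary and the function name are illustrative and not part of Dragon.

```python
# Approximate figures copied from the tier tables above.
TIERS = {
    "Tier 1": {"sequence_gbp": 2.5, "index_gb": 1.5},
    "Tier 2": {"sequence_gbp": 250.0, "index_gb": 15.0},
    "Tier 3": {"sequence_gbp": 10_000.0, "index_gb": 100.0},  # ~10 Tbp
}


def index_gb_per_gbp(tier: str) -> float:
    """Return index size (GB) divided by total indexed sequence (Gbp)."""
    figures = TIERS[tier]
    return figures["index_gb"] / figures["sequence_gbp"]


for name in TIERS:
    print(f"{name}: {index_gb_per_gbp(name):.3f} GB of index per Gbp of sequence")
```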

## Query types

### Gene-level queries (primary)

- 1,000 random subsequences (500-5,000 bp) extracted from 100 diverse genomes
- Controlled mutations at 6 divergence levels: 0%, 1%, 3%, 5%, 10%, 15%
- Mutation model: 70% substitutions, 20% insertions, 10% deletions
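
The query construction above can be sketched in a few lines of Python. This is a minimal illustration, not Dragon's actual generator: it assumes divergence is applied as an independent per-base mutation probability, and the names `sample_query` and `mutate` are hypothetical. The 70/20/10 mix, the length range, and the six divergence levels come from the list above.

```python
import random

BASES = "ACGT"
# Mutation mix and divergence levels taken from the description above.
MUTATION_WEIGHTS = {"sub": 0.7, "ins": 0.2, "del": 0.1}
DIVERGENCE_LEVELS = (0.0, 0.01, 0.03, 0.05, 0.10, 0.15)


def sample_query(genome: str, rng: random.Random,
                 min_len: int = 500, max_len: int = 5000) -> str:
    """Extract a random subsequence; the genome must be at least min_len long."""
    length = rng.randint(min_len, min(max_len, len(genome)))
    start = rng.randint(0, len(genome) - length)
    return genome[start:start + length]


def mutate(seq: str, divergence: float, rng: random.Random) -> str:
    """Mutate each base with probability `divergence` (assumed model)."""
    kinds = list(MUTATION_WEIGHTS)
    weights = list(MUTATION_WEIGHTS.values())
    out = []
    for base in seq:
        if rng.random() >= divergence:
            out.append(base)  # base left unchanged
            continue
        kind = rng.choices(kinds, weights=weights)[0]
        if kind == "sub":
            # Substitute with a different base.
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "ins":
            # Keep the base and insert a random base after it.
            out.append(base)
            out.append(rng.choice(BASES))
        # "del": emit nothing, which drops the base.
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(42)
    # Stand-in for a real genome sequence.
    genome = "".join(rng.choice(BASES) for _ in range(20_000))
    query = sample_query(genome, rng)
    for level in DIVERGENCE_LEVELS:
        print(level, len(mutate(query, level, rng)))
```

Because each base is mutated independently, the realized divergence of a single query varies around the nominal level; a generator that needs exact divergence would instead pick a fixed number of positions to mutate.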

### Long reads (Badread)

- Simulated Oxford Nanopore reads from 50 genomes
- Mean length: 5 Kbp
- Identity range: 85-99%
- Chimera rate: 1%
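
A simulation with these settings might be launched along the following lines. This is a sketch under several assumptions: it uses the `--reference`, `--quantity`, `--length` (mean,stdev), `--identity` (mean,max,stdev), and `--chimeras` options of `badread simulate` as described in Badread's documentation, and the read depth, length standard deviation, mean identity, and identity standard deviation are placeholder values that the list above does not specify. Only the 5 Kbp mean length, the 99% maximum identity, and the 1% chimera rate come from the source.

```python
from typing import List


def badread_command(reference: str,
                    quantity: str = "10x",          # assumed depth
                    mean_length: int = 5000,        # 5 Kbp mean, from the list above
                    length_stdev: int = 4000,       # assumed
                    mean_identity: float = 92.0,    # assumed; midpoint of 85-99%
                    max_identity: float = 99.0,     # upper end of the stated range
                    identity_stdev: float = 3.0,    # assumed
                    chimera_rate: float = 1.0) -> List[str]:
    """Build an argv list for `badread simulate` with the settings above."""
    return [
        "badread", "simulate",
        "--reference", reference,
        "--quantity", quantity,
        "--length", f"{mean_length},{length_stdev}",
        "--identity", f"{mean_identity:g},{max_identity:g},{identity_stdev:g}",
        "--chimeras", f"{chimera_rate:g}",
    ]


if __name__ == "__main__":
    # "genome.fasta" is a hypothetical input path.
    print(" ".join(badread_command("genome.fasta")))
```

Badread writes reads to standard output, so when running the command (for example with `subprocess.run`), redirect stdout to a FASTQ file.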

### Challenging scenarios

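- **16S rRNA**: highly conserved across genomes, so its seeds occur at extreme frequency
- **Plasmids**: 10-200 Kbp, multi-copy
- **AMR genes**: batch of 1,003 antimicrobial resistance genes from the CARD database
- **HGT events**: simulated horizontal gene transfer, with genes implanted from distantly related species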
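
For the HGT scenario, one simple way to build a test case with known ground truth is to splice a donor gene into a host genome at a random offset and record where it landed. The sketch below assumes exactly that; the function name `implant_gene` and the random-position strategy are illustrative choices, not a description of how Dragon's benchmark constructs its cases.

```python
import random
from typing import Tuple


def implant_gene(host: str, gene: str, rng: random.Random) -> Tuple[str, int]:
    """Insert `gene` into `host` at a random position.

    Returns the chimeric sequence and the 0-based offset of the implanted
    gene, which serves as ground truth when checking alignment results.
    """
    pos = rng.randint(0, len(host))  # insertion may also occur at either end
    return host[:pos] + gene + host[pos:], pos


if __name__ == "__main__":
    rng = random.Random(7)
    # Toy sequences standing in for a host genome and a donor gene.
    host = "".join(rng.choice("ACGT") for _ in range(2_000))
    gene = "".join(rng.choice("ACGT") for _ in range(300))
    chimera, offset = implant_gene(host, gene, rng)
    print(len(chimera), offset)
```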