# Benchmark datasets ## Tiered approach Dragon is benchmarked at three scales to validate scalability from laptop to server: ### Tier 1: Small (development & CI) | Property | Value | |----------|-------| | **Genomes** | 500 complete *E. coli* / *Shigella* from RefSeq | | **Total sequence** | ~2.5 Gbp | | **Redundancy** | High (~95% ANI within species) | | **Index time** | <1 hour | | **Index size** | ~1.5 GB | | **Use case** | Unit testing, CI, rapid iteration | ### Tier 2: Medium (validation) | Property | Value | |----------|-------| | **Genomes** | ~85,000 GTDB r220 representative genomes | | **Total sequence** | ~250 Gbp | | **Redundancy** | Medium (one genome per species) | | **Index time** | ~1 hour | | **Index size** | ~15 GB | | **Use case** | Sensitivity/accuracy validation | ### Tier 3: Large (full benchmark) | Property | Value | |----------|-------| | **Genomes** | ~2.34M GenBank + RefSeq prokaryotic assemblies | | **Total sequence** | ~10 Tbp | | **Redundancy** | Very high (many strains per species) | | **Index time** | ~12 hours | | **Index size** | ~100 GB | | **Use case** | Full-scale comparison with LexicMap | ## Query types ### Gene-level queries (primary) - 1,000 random subsequences (500-5,000 bp) extracted from 100 diverse genomes - Controlled mutations at 6 divergence levels: 0%, 1%, 3%, 5%, 10%, 15% - Mutation model: 70% substitutions, 20% insertions, 10% deletions ### Long reads (Badread) - Simulated Oxford Nanopore reads from 50 genomes - Mean length: 5 Kbp - Identity range: 85-99% - Chimera rate: 1% ### Challenging scenarios - **16S rRNA**: highly conserved, extreme seed frequency - **Plasmids**: 10-200 Kbp, multi-copy - **AMR genes**: batch of 1,003 genes from CARD database - **HGT events**: genes implanted from distant species