# Reproducing benchmarks

## Quick start (synthetic data)

Run the complete benchmark pipeline with synthetic test data in under 2 minutes:

```bash
# 1. Build Dragon
cargo build --release

# 2. Generate synthetic test data (20 genomes, 50 queries)
python3 benchmark/setup_test_data.py

# 3. Run benchmarks
python3 benchmark/run_benchmark.py
```

Results are saved to:

- `benchmark/results/metrics/`: accuracy and resource TSV files
- `manuscript/figures/`: regenerated PDF and PNG figures

## Full benchmark (Snakemake)

For the full benchmark with real data and all tools:

### Prerequisites

```bash
# Install benchmark dependencies
pip install snakemake matplotlib seaborn pandas numpy

# Install comparison tools
conda install -c bioconda lexicmap minimap2 blast mmseqs2 cobs sourmash skani

# Install read simulators
conda install -c bioconda badread art
```

### Running

```bash
cd benchmark

# Dry run (show what will be executed)
snakemake --cores 8 -n

# Full run
snakemake --cores 8 --use-conda

# Regenerate figures only
snakemake --cores 1 -R plot_figures
```

### Configuration

Edit `benchmark/config.yaml` to customise:

```yaml
# Dataset paths
datasets:
  tier1:
    genome_dir: "data/tier1_genomes"

# Simulation parameters
simulation:
  gene_queries:
    num_genes: 1000
    divergence_levels: [0.0, 0.01, 0.03, 0.05, 0.10, 0.15]

# Hardware
hardware:
  default_threads: 8
```

## Pipeline structure

```
benchmark/
+-- Snakefile                 Workflow orchestration
+-- config.yaml               Configuration
+-- setup_test_data.py        Synthetic data generator
+-- run_benchmark.py          Standalone benchmark runner
+-- simulate/
|   +-- extract_genes.py      Extract random genes from genomes
|   +-- mutate_sequences.py   Introduce controlled mutations
|   +-- run_badread.sh        Long-read simulation
|   +-- run_art.sh            Short-read simulation
+-- scripts/
|   +-- compute_metrics.py    Sensitivity, precision, F1
|   +-- resource_usage.py     Parse /usr/bin/time output
+-- notebooks/
|   +-- figures.py            Generate all manuscript figures
+-- envs/
|   +-- simulation.yaml       Conda environment
+-- results/                  Output (auto-generated)
    +-- metrics/              TSV metric files
    +-- figures/              PDF/PNG plots
    +-- search/               Raw search output files
```

## Adding a new tool

1. Add tool configuration to `config.yaml` under `tools:`
2. Create a conda environment file in `envs/`
3. The Snakemake pipeline will automatically include it in indexing, searching, and evaluation

```yaml
# Example: adding a new tool
tools:
  my_new_tool:
    binary: "mytool"
    version: "1.0"
    index_cmd: "mytool index {genome_dir} {index_dir}"
    search_cmd: "mytool search {index_dir} {query} > {output}"
```
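
Before adding a tool, it can help to check that the `index_cmd` and `search_cmd` templates expand into the shell commands you expect. The sketch below is only an illustration: it assumes the `{genome_dir}`, `{index_dir}`, `{query}`, and `{output}` placeholders are filled by plain string substitution, which may not match how the Snakefile actually resolves them. The helper name `render_command` and all paths other than `data/tier1_genomes` are hypothetical.

```python
# Hypothetical sketch: preview how the {placeholder} fields of a tool entry
# could expand. The real Snakefile may use a different mechanism.


def render_command(template: str, **paths: str) -> str:
    """Fill {genome_dir}, {index_dir}, {query}, {output} in a command template.

    Raises KeyError if the template names a placeholder that was not supplied.
    """
    return template.format(**paths)


# Mirrors the example tool entry above; index and output paths are made up.
tool = {
    "binary": "mytool",
    "index_cmd": "mytool index {genome_dir} {index_dir}",
    "search_cmd": "mytool search {index_dir} {query} > {output}",
}

index_cmd = render_command(
    tool["index_cmd"],
    genome_dir="data/tier1_genomes",
    index_dir="results/index/my_new_tool",
)
search_cmd = render_command(
    tool["search_cmd"],
    index_dir="results/index/my_new_tool",
    query="queries.fa",
    output="results/search/my_new_tool.tsv",
)

print(index_cmd)
print(search_cmd)
```

A misspelled placeholder in a template (for example `{genomes_dir}`) raises `KeyError` here, which is a quick way to catch typos before a full Snakemake run.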