Reproducing benchmarks

Quick start (synthetic data)

Run the complete benchmark pipeline with synthetic test data in under 2 minutes:

# 1. Build Dragon
cargo build --release

# 2. Generate synthetic test data (20 genomes, 50 queries)
python3 benchmark/setup_test_data.py

# 3. Run benchmarks
python3 benchmark/run_benchmark.py

Results are saved to:

benchmark/results/metrics/ — accuracy and resource TSV files
manuscript/figures/ — regenerated PDF and PNG figures
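
To confirm that a run actually produced output, the metric files can be inspected with a few lines of Python. This is a minimal sketch: it assumes the files in benchmark/results/metrics/ use a .tsv extension and start with a header row, and it makes no assumption about the column names. The function name summarize_tsv_dir is illustrative and not part of the repository.

```python
import csv
from pathlib import Path


def summarize_tsv_dir(metrics_dir):
    """Return {file name: (column names, number of data rows)} for each TSV file."""
    summary = {}
    for path in sorted(Path(metrics_dir).glob("*.tsv")):
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle, delimiter="\t"))
        # Assumption: the first row is a header; an empty file has no columns.
        header = rows[0] if rows else []
        summary[path.name] = (header, max(len(rows) - 1, 0))
    return summary
```

Calling summarize_tsv_dir("benchmark/results/metrics") from the repository root returns an empty dictionary if no TSV files were written, which is a quick sign that the benchmark step did not complete.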

Full benchmark (Snakemake)

To run the full benchmark with real data and all comparison tools, install the prerequisites below, then run the Snakemake workflow.

Prerequisites

# Install benchmark dependencies
pip install snakemake matplotlib seaborn pandas numpy

# Install comparison tools
conda install -c bioconda lexicmap minimap2 blast mmseqs2 cobs sourmash skani

# Install read simulators
conda install -c bioconda badread art

Running

cd benchmark

# Dry run (show what will be executed)
snakemake --cores 8 -n

# Full run
snakemake --cores 8 --use-conda

# Regenerate figures only
snakemake --cores 1 -R plot_figures

Configuration

Edit benchmark/config.yaml to customise dataset paths, simulation parameters, and hardware settings:

# Dataset paths
datasets:
  tier1:
    genome_dir: "data/tier1_genomes"

# Simulation parameters
simulation:
  gene_queries:
    num_genes: 1000
    divergence_levels: [0.0, 0.01, 0.03, 0.05, 0.10, 0.15]

# Hardware
hardware:
  default_threads: 8
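
To illustrate what the divergence_levels setting could mean in practice, here is a minimal sketch that mutates a sequence at each configured level. It assumes that divergence is the fraction of positions replaced by a different base (substitutions only). The real simulate/mutate_sequences.py may define divergence differently or also introduce indels; the mutate function below is illustrative and not taken from the repository.

```python
import random


def mutate(sequence, divergence, rng):
    """Substitute round(len(sequence) * divergence) randomly chosen positions."""
    bases = "ACGT"
    seq = list(sequence)
    n_mutations = round(len(seq) * divergence)
    # sample() picks distinct positions, so each one is mutated exactly once.
    for pos in rng.sample(range(len(seq)), n_mutations):
        seq[pos] = rng.choice([b for b in bases if b != seq[pos]])
    return "".join(seq)


rng = random.Random(42)  # fixed seed so the example is repeatable
gene = "".join(rng.choice("ACGT") for _ in range(1000))
for level in [0.0, 0.01, 0.03, 0.05, 0.10, 0.15]:
    mutant = mutate(gene, level, rng)
    differences = sum(a != b for a, b in zip(gene, mutant))
    print(f"divergence {level:.2f}: {differences} substitutions")
```

Because every chosen position is forced to a different base, the number of observed differences equals the requested number of substitutions for each level.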

Pipeline structure

benchmark/
+-- Snakefile                    Workflow orchestration
+-- config.yaml                  Configuration
+-- setup_test_data.py           Synthetic data generator
+-- run_benchmark.py             Standalone benchmark runner
+-- simulate/
|   +-- extract_genes.py         Extract random genes from genomes
|   +-- mutate_sequences.py      Introduce controlled mutations
|   +-- run_badread.sh           Long-read simulation
|   +-- run_art.sh               Short-read simulation
+-- scripts/
|   +-- compute_metrics.py       Sensitivity, precision, F1
|   +-- resource_usage.py        Parse /usr/bin/time output
+-- notebooks/
|   +-- figures.py               Generate all manuscript figures
+-- envs/
|   +-- simulation.yaml          Conda environment
+-- results/                     Output (auto-generated)
    +-- metrics/                 TSV metric files
    +-- figures/                 PDF/PNG plots
    +-- search/                  Raw search output files
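
For reference, the three accuracy metrics listed for scripts/compute_metrics.py have standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses those textbook formulas. How the script itself decides whether a hit counts as a true positive is not described here, and the counts in the example are hypothetical.

```python
def classification_metrics(true_positives, false_positives, false_negatives):
    """Return (sensitivity, precision, f1) from raw hit counts."""
    tp, fp, fn = true_positives, false_positives, false_negatives
    # Sensitivity (recall): fraction of expected hits that were found.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    # Precision: fraction of reported hits that were correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F1: harmonic mean of sensitivity and precision.
    denom = sensitivity + precision
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    return sensitivity, precision, f1


# Hypothetical counts for one tool at one divergence level.
sens, prec, f1 = classification_metrics(
    true_positives=45, false_positives=5, false_negatives=15
)
print(f"sensitivity={sens:.3f} precision={prec:.3f} F1={f1:.3f}")
# prints: sensitivity=0.750 precision=0.900 F1=0.818
```

Returning 0.0 when a denominator is zero is a design choice made for this sketch, so that a tool reporting no hits does not raise a division error.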

Adding a new tool

1. Add the tool's configuration to config.yaml under the tools: key.
2. Create a conda environment file for the tool in envs/.
3. The Snakemake pipeline will then include the tool automatically in indexing, searching, and evaluation.

# Example: adding a new tool
tools:
  my_new_tool:
    binary: "mytool"
    version: "1.0"
    index_cmd: "mytool index {genome_dir} {index_dir}"
    search_cmd: "mytool search {index_dir} {query} > {output}"
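
Before running the pipeline, it can help to check that the command templates expand as intended. The sketch below assumes the {genome_dir}, {index_dir}, {query}, and {output} placeholders are filled by plain Python str.format substitution; how the Snakefile actually expands them is not shown in this document. The paths and the render helper are hypothetical.

```python
import shlex

# Same entry as the YAML example above, written as a Python dictionary.
tool = {
    "binary": "mytool",
    "version": "1.0",
    "index_cmd": "mytool index {genome_dir} {index_dir}",
    "search_cmd": "mytool search {index_dir} {query} > {output}",
}

# Hypothetical paths, chosen only for illustration.
paths = {
    "genome_dir": "data/tier1_genomes",
    "index_dir": "results/index/my_new_tool",
    "query": "results/queries/genes.fasta",
    "output": "results/search/my_new_tool.tsv",
}


def render(template, values):
    """Fill a command template, shell-quoting every substituted value."""
    quoted = {key: shlex.quote(value) for key, value in values.items()}
    # A placeholder with no matching value raises KeyError, which exposes typos early.
    return template.format(**quoted)


print(render(tool["index_cmd"], paths))
print(render(tool["search_cmd"], paths))
```

A misspelled placeholder such as {genome_dirs} fails immediately with a KeyError instead of producing a malformed command.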