Reproducing benchmarks
Quick start (synthetic data)
Run the complete benchmark pipeline with synthetic test data in under 2 minutes:
# 1. Build Dragon
cargo build --release
# 2. Generate synthetic test data (20 genomes, 50 queries)
python3 benchmark/setup_test_data.py
# 3. Run benchmarks
python3 benchmark/run_benchmark.py
Results are saved to:
benchmark/results/metrics/— accuracy and resource TSV filesmanuscript/figures/— regenerated PDF and PNG figures
Full benchmark (Snakemake)
For the full benchmark with real data and all tools:
Prerequisites
# Install benchmark dependencies
pip install snakemake matplotlib seaborn pandas numpy
# Install comparison tools
conda install -c bioconda lexicmap minimap2 blast mmseqs2 cobs sourmash skani
# Install read simulators
conda install -c bioconda badread art
Running
cd benchmark
# Dry run (show what will be executed)
snakemake --cores 8 -n
# Full run
snakemake --cores 8 --use-conda
# Regenerate figures only
snakemake --cores 1 -R plot_figures
Configuration
Edit benchmark/config.yaml to customise:
# Dataset paths
datasets:
tier1:
genome_dir: "data/tier1_genomes"
# Simulation parameters
simulation:
gene_queries:
num_genes: 1000
divergence_levels: [0.0, 0.01, 0.03, 0.05, 0.10, 0.15]
# Hardware
hardware:
default_threads: 8
Pipeline structure
benchmark/
+-- Snakefile Workflow orchestration
+-- config.yaml Configuration
+-- setup_test_data.py Synthetic data generator
+-- run_benchmark.py Standalone benchmark runner
+-- simulate/
| +-- extract_genes.py Extract random genes from genomes
| +-- mutate_sequences.py Introduce controlled mutations
| +-- run_badread.sh Long-read simulation
| +-- run_art.sh Short-read simulation
+-- scripts/
| +-- compute_metrics.py Sensitivity, precision, F1
| +-- resource_usage.py Parse /usr/bin/time output
+-- notebooks/
| +-- figures.py Generate all manuscript figures
+-- envs/
| +-- simulation.yaml Conda environment
+-- results/ Output (auto-generated)
+-- metrics/ TSV metric files
+-- figures/ PDF/PNG plots
+-- search/ Raw search output files
Adding a new tool
Add tool configuration to
config.yamlundertools:Create a conda environment file in
envs/The Snakemake pipeline will automatically include it in indexing, searching, and evaluation
# Example: adding a new tool
tools:
my_new_tool:
binary: "mytool"
version: "1.0"
index_cmd: "mytool index {genome_dir} {index_dir}"
search_cmd: "mytool search {index_dir} {query} > {output}"