Dragon


Tutorial: AMR gene search

This tutorial demonstrates searching antimicrobial resistance (AMR) genes against a collection of bacterial genomes.

Scenario

You have:

  • A database of 500 E. coli genomes

  • A set of AMR gene sequences from the CARD database

Goal: identify which genomes carry which resistance genes. To keep the walkthrough fast, the steps below index only the first 10 genomes.

Step 1: Download test data

# Create working directory
mkdir dragon_tutorial && cd dragon_tutorial

# Download example E. coli genomes (first 10 for this tutorial).
# In practice, download them from NCBI/RefSeq.
mkdir -p genomes
for i in $(seq 1 10); do
  # Placeholder: this writes only a FASTA header line, not a genome.
  # Replace it with a real download command before building the index.
  echo ">genome_${i}" > "genomes/genome_${i}.fa"
done

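# Download AMR genes from CARD (https://card.mcmaster.ca/download).
# The latest/data URL returns a tar.bz2 archive rather than a FASTA file, so
# extract the nucleotide FASTA (list the archive with `tar -tjf` if the name differs).
wget -O card_data.tar.bz2 https://card.mcmaster.ca/latest/data
tar -xjf card_data.tar.bz2 ./nucleotide_fasta_protein_homolog_model.fasta
mv nucleotide_fasta_protein_homolog_model.fasta card_genes.fasta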
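
Because the download loop above is only a placeholder, it is worth checking that every genome file contains sequence before indexing. The following is a minimal sketch; `fasta_stats` is a hypothetical helper written for this tutorial, not part of Dragon, and it assumes plain (uncompressed) FASTA input.

```shell
# fasta_stats FILE...: print "file<TAB>sequences<TAB>bases" for each FASTA file.
# (Hypothetical helper for this tutorial; not a Dragon command.)
fasta_stats() {
  for f in "$@"; do
    awk -v name="$f" '
      /^>/ { seqs++; next }                               # header line: one more record
      { gsub(/[ \t\r]/, ""); bases += length($0) }        # sequence line: count residues
      END { printf "%s\t%d\t%d\n", name, seqs, bases }
    ' "$f"
  done
}

# Usage (a file reporting 0 bases still holds only the placeholder header):
#   fasta_stats genomes/*.fa
```

A real E. coli assembly should report several million bases; a count of 0 means the placeholder was never replaced.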

Step 2: Build index

dragon index \
  --input genomes/ \
  --output ecoli_index/ \
  --kmer-size 31 \
  --threads 4

# Check the index
dragon info --index ecoli_index/

Expected output (exact values depend on the genomes you downloaded):

Dragon Index Information
========================
Version:         0.1.0
K-mer size:      31
Genomes:         10
Unitigs:         45230
Total bases:     48500000
Index size:      0.05 GB
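
To confirm that every input file made it into the index, the genome count can be pulled out of this report and compared with the number of FASTA files. This sketch assumes the `Field: value` layout shown above; `info_field` is a hypothetical helper, not a Dragon subcommand.

```shell
# info_field NAME: read `dragon info` text on stdin and print the value of one field.
# Assumes lines of the form "Name:   value", as in the sample output above.
info_field() {
  awk -F':' -v key="$1" '$1 == key { sub(/^[ \t]+/, "", $2); print $2 }'
}

# Usage:
#   indexed=$(dragon info --index ecoli_index/ | info_field "Genomes")
#   files=$(ls genomes/*.fa | wc -l)
#   [ "$indexed" -eq "$files" ] || echo "warning: $files files but $indexed genomes indexed"
```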

Step 3: Search AMR genes

dragon search \
  --index ecoli_index/ \
  --query card_genes.fasta \
  --output amr_hits.paf \
  --format paf \
  --threads 4 \
  --min-chain-score 100

# Count hits
echo "Total hits: $(wc -l < amr_hits.paf)"

# Genomes ranked by number of hits (PAF column 6 is the target name)
cut -f6 amr_hits.paf | sort | uniq -c | sort -rn | head
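
The hit count above can include several hits of the same gene in one genome. For the stated goal (which genomes carry which genes), a de-duplicated gene-by-genome listing is more direct. The sketch below assumes standard PAF columns (1 = query name, 6 = target name) and that each target name identifies a genome, as the command above does; both helper names are hypothetical.

```shell
TAB=$(printf '\t')

# gene_genome_pairs: read PAF on stdin, print unique "genome<TAB>gene" pairs.
gene_genome_pairs() {
  awk -F'\t' '{ print $6 "\t" $1 }' | LC_ALL=C sort -u
}

# genes_per_genome: read PAF on stdin, print "genome<TAB>distinct gene count",
# highest count first (ties ordered by genome name).
genes_per_genome() {
  gene_genome_pairs | cut -f1 | uniq -c |
    awk '{ print $2 "\t" $1 }' |
    LC_ALL=C sort -t "$TAB" -k2,2nr -k1,1
}

# Usage:
#   gene_genome_pairs < amr_hits.paf > presence.tsv
#   genes_per_genome  < amr_hits.paf | head
```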

Step 4: BLAST-tabular output

dragon search \
  --index ecoli_index/ \
  --query card_genes.fasta \
  --output amr_hits.tsv \
  --format blast6

# Filter by identity > 90%
awk -F'\t' '$3 > 90' amr_hits.tsv | head
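
Identity alone can let short, spurious matches through, so it often helps to filter on alignment length as well and then keep one best hit per gene. This sketch assumes the standard 12-column outfmt 6 layout (3 = percent identity, 4 = alignment length, 12 = bit score); the helper names and the thresholds in the usage line are illustrative, not Dragon defaults.

```shell
TAB=$(printf '\t')

# filter_hits MIN_IDENTITY MIN_LENGTH: keep rows meeting both thresholds.
filter_hits() {
  awk -F'\t' -v id="$1" -v len="$2" '$3 >= id && $4 >= len'
}

# best_hit_per_query: for each query (column 1), keep the row with the
# highest bit score (column 12).
best_hit_per_query() {
  LC_ALL=C sort -t "$TAB" -k1,1 -k12,12nr | awk -F'\t' '!seen[$1]++'
}

# Usage:
#   filter_hits 90 500 < amr_hits.tsv | best_hit_per_query > best_hits.tsv
```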

Step 5: Batch analysis

Dragon handles batch queries efficiently via parallel processing:

# Search about 1000 AMR genes with 8 threads
dragon search \
  --index ecoli_index/ \
  --query card_all_1003_genes.fasta \
  --output all_amr_hits.paf \
  --threads 8 \
  --max-ram 4.0
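
With a large query set it is easy to miss genes that returned no hits at all. The sketch below lists them by comparing FASTA headers against PAF column 1. It assumes the query name written to the PAF is the header text up to the first whitespace, which is the usual convention but is not confirmed here for Dragon; `queries_without_hits` is a hypothetical helper.

```shell
# queries_without_hits QUERIES.fasta HITS.paf
# Print the ID of every query sequence that has no line in the PAF file.
queries_without_hits() {
  awk -F'\t' '
    # First file (the PAF): remember every query name seen in column 1.
    FILENAME == ARGV[1] { hit[$1] = 1; next }
    # Second file (the FASTA): report headers whose ID was never seen.
    /^>/ {
      split(substr($0, 2), f, /[ \t]/)
      if (!(f[1] in hit)) print f[1]
    }
  ' "$2" "$1"
}

# Usage:
#   queries_without_hits card_all_1003_genes.fasta all_amr_hits.paf > no_hits.txt
```

The `FILENAME` test is used instead of the common `NR == FNR` idiom so that an empty PAF file (no hits at all) still reports every query.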

Performance notes

  • Small datasets (<100 genomes): index builds in seconds, queries in milliseconds

  • Medium datasets (~85K genomes): index builds in ~1 hour, queries in seconds

  • Large datasets (>1M genomes): index builds in hours (use GGCAT), queries in minutes

  • RAM: stays below 4 GB at query time regardless of database size


© Copyright 2026, Louise Cerdeira.
