Tutorial: AMR gene search

This tutorial demonstrates searching antimicrobial resistance (AMR) genes against a collection of bacterial genomes.

Scenario

You have:

A database of 500 E. coli genomes (this tutorial indexes only the first 10)
A set of AMR gene sequences from the CARD database
Goal: identify which genomes carry which resistance genes

Step 1: Download test data

# Create working directory
mkdir dragon_tutorial && cd dragon_tutorial

# Download example E. coli genomes (first 10 for this tutorial)
# In practice, download from NCBI/RefSeq
mkdir -p genomes
for i in $(seq 1 10); do
  # Placeholder: this writes a header-only FASTA with no sequence.
  # Replace it with the actual genome download command.
  echo ">genome_${i}" > "genomes/genome_${i}.fa"
  # ... download genome FASTA ...
done

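# Download AMR genes from CARD
# https://card.mcmaster.ca/download
# The latest/data link serves a tar.bz2 archive, not a FASTA file, so unpack it first
wget -O card_data.tar.bz2 https://card.mcmaster.ca/latest/data
mkdir -p card_data && tar -xjf card_data.tar.bz2 -C card_data
# Assumption: the nucleotide sequences are in nucleotide_fasta_protein_homolog_model.fasta;
# check the extracted file names for your CARD release
cp card_data/nucleotide_fasta_protein_homolog_model.fasta card_genes.fasta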
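
Before indexing, it can help to confirm that every FASTA file actually contains sequence, since a header-only placeholder or a failed download would otherwise go unnoticed. The following is a minimal sketch using standard shell tools only; the function names are hypothetical helpers and are not part of Dragon.

```shell
# Count FASTA records (header lines starting with '>') in one file.
count_fasta_records() {
  grep -c '^>' "$1" || true
}

# Count sequence characters (everything except header lines and line breaks).
count_fasta_bases() {
  { grep -v '^>' "$1" || true; } | tr -d '\n\r' | wc -c | tr -d ' '
}

# Print: file name, record count, base count, status for each *.fa in a directory.
check_fasta_dir() {
  for f in "$1"/*.fa; do
    [ -e "$f" ] || continue
    n=$(count_fasta_records "$f")
    b=$(count_fasta_bases "$f")
    if [ "$b" -eq 0 ]; then status="EMPTY"; else status="ok"; fi
    printf '%s\t%s\t%s\t%s\n' "$(basename "$f")" "$n" "$b" "$status"
  done
}
```

Running `check_fasta_dir genomes` lists one line per genome, and `count_fasta_records card_genes.fasta` gives the number of query genes. Any file marked EMPTY should be replaced before building the index.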

Step 2: Build index

dragon index \
  --input genomes/ \
  --output ecoli_index/ \
  --kmer-size 31 \
  --threads 4

# Check the index
dragon info --index ecoli_index/

Expected output (exact values depend on the genomes indexed):

Dragon Index Information
========================
Version:         0.1.0
K-mer size:      31
Genomes:         10
Unitigs:         45230
Total bases:     48500000
Index size:      0.05 GB

Step 3: Search AMR genes

dragon search \
  --index ecoli_index/ \
  --query card_genes.fasta \
  --output amr_hits.paf \
  --format paf \
  --threads 4 \
  --min-chain-score 100

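# Count hits (PAF has one line per alignment)
echo "Total hits: $(wc -l < amr_hits.paf)"

# Targets with the most hits (PAF column 6 is the target sequence name).
# This counts alignments, so a gene with several hits in one target is counted more than once.
cut -f6 amr_hits.paf | sort | uniq -c | sort -rn | head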
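
To answer the tutorial's actual question (which genomes carry which genes), the hits can be collapsed to distinct target/gene pairs. The sketch below relies only on the standard PAF layout (column 1 is the query name, column 6 is the target name). It assumes that the target name identifies the genome; if Dragon reports contig names instead, an extra contig-to-genome mapping step would be needed. The function names are hypothetical.

```shell
# Distinct (target, gene) pairs: one line per gene found in each target.
paf_pairs() {
  awk -F'\t' '{ print $6 "\t" $1 }' "$1" | sort -u
}

# Number of distinct genes per target, highest count first.
paf_genes_per_target() {
  paf_pairs "$1" | cut -f1 | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{ print $2 "\t" $1 }'
}
```

For example, `paf_pairs amr_hits.paf > genome_gene_pairs.tsv` writes a two-column presence list, and `paf_genes_per_target amr_hits.paf | head` ranks targets by the number of different AMR genes rather than by raw alignment count.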

Step 4: BLAST-tabular output

dragon search \
  --index ecoli_index/ \
  --query card_genes.fasta \
  --output amr_hits.tsv \
  --format blast6

# Keep hits with identity > 90% (column 3 of BLAST tabular output is percent identity)
awk -F'\t' '$3 > 90' amr_hits.tsv | head
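
Identity alone can let through short, spurious alignments, so a stricter filter is often useful. The sketch below assumes that Dragon's blast6 output follows the standard 12-column BLAST tabular layout (column 3 percent identity, column 4 alignment length, column 12 bit score); that layout is an assumption about Dragon, and the function name and default thresholds are illustrative only. It keeps the highest-scoring hit per gene/target pair.

```shell
# Usage: blast6_best_hits FILE [MIN_IDENTITY] [MIN_ALIGNMENT_LENGTH]
# Keeps hits passing both thresholds, then the best bit score per (query, subject).
blast6_best_hits() {
  awk -F'\t' -v id="${2:-90}" -v len="${3:-100}" '
    $3 + 0 >= id && $4 + 0 >= len {
      key = $1 "\t" $2
      if (!(key in best) || $12 + 0 > score[key]) {
        best[key] = $0
        score[key] = $12 + 0
      }
    }
    END { for (k in best) print best[k] }
  ' "$1" | sort -t "$(printf '\t')" -k1,1 -k2,2
}
```

For example, `blast6_best_hits amr_hits.tsv 95 200` keeps one line per gene/target pair with at least 95% identity over at least 200 aligned positions.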

Step 5: Batch analysis

Dragon processes multi-sequence query files in parallel, so a large batch needs only one command with a higher thread count:

# Search all 1003 AMR genes with 8 threads
dragon search \
  --index ecoli_index/ \
  --query card_all_1003_genes.fasta \
  --output all_amr_hits.paf \
  --threads 8 \
  --max-ram 4.0

Performance notes

Small datasets (<100 genomes): index builds in seconds, queries in milliseconds
Medium datasets (~85K genomes): index builds in ~1 hour, queries in seconds
Large datasets (>1M genomes): index builds in hours (use GGCAT), queries in minutes
RAM: stays below 4 GB at query time regardless of database size