Tutorial: AMR gene search
This tutorial demonstrates searching antimicrobial resistance (AMR) genes against a collection of bacterial genomes.
Scenario
You have:
A database of 500 E. coli genomes
A set of AMR gene sequences from the CARD database
Goal: identify which genomes carry which resistance genes
Step 1: Download test data
# Create working directory
mkdir dragon_tutorial && cd dragon_tutorial
# Download example E. coli genomes (first 10 for this tutorial)
# In practice, download from NCBI/RefSeq
mkdir genomes
for i in $(seq 1 10); do
    # Placeholder only: replace with the actual genome download command
    echo ">genome_${i}" > "genomes/genome_${i}.fa"
    # ... download genome FASTA ...
done
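# Download AMR genes from CARD (the download is a tar.bz2 archive, not a FASTA file)
# https://card.mcmaster.ca/download
wget -O card_data.tar.bz2 https://card.mcmaster.ca/latest/data
mkdir -p card_data && tar -xjf card_data.tar.bz2 -C card_data
# Use the nucleotide sequences of the protein homolog models as queries
cp card_data/nucleotide_fasta_protein_homolog_model.fasta card_genes.fasta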
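The loop in this step writes header-only placeholder files, so a quick check before indexing can catch inputs that contain no sequence. The following is a minimal sketch; `count_fasta_records`, `fasta_bases`, and `check_fasta_dir` are hypothetical helper names for illustration and are not part of Dragon.

```shell
# Count FASTA records (header lines starting with '>') in one file.
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Count sequence characters (all non-header lines) in one file.
fasta_bases() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Print one line for every *.fa file in a directory that has no sequence data.
# Prints nothing when every file contains at least one base.
check_fasta_dir() {
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue
        if [ "$(fasta_bases "$f")" -eq 0 ]; then
            echo "no sequence data: $f"
        fi
    done
}
```

For example, `check_fasta_dir genomes` lists any genome file that still holds only a placeholder header, and `count_fasta_records card_genes.fasta` reports how many query sequences will be searched.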
Step 2: Build index
dragon index \
    --input genomes/ \
    --output ecoli_index/ \
    --kmer-size 31 \
    --threads 4
# Check the index
dragon info --index ecoli_index/
Expected output:
Dragon Index Information
========================
Version: 0.1.0
K-mer size: 31
Genomes: 10
Unitigs: 45230
Total bases: 48500000
Index size: 0.05 GB
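One way to confirm that every input file reached the index is to compare the `Genomes:` line with the number of FASTA files on disk. The sketch below assumes the info output keeps the `Key: value` layout shown above and that each `.fa` file counts as one genome; both are assumptions, and `info_genome_count` and `check_genome_count` are hypothetical helpers.

```shell
# Extract the "Genomes:" value from `dragon info`-style text on stdin.
info_genome_count() {
    awk -F': *' '$1 == "Genomes" && !found { print $2; found = 1 }'
}

# Compare the indexed genome count (info text on stdin) with the number
# of *.fa files in the directory given as $1.
# Prints "ok: ..." on a match; prints a mismatch message and returns 1 otherwise.
check_genome_count() {
    indexed=$(info_genome_count)
    on_disk=$(find "$1" -maxdepth 1 -type f -name '*.fa' | wc -l | tr -d ' ')
    if [ "$indexed" = "$on_disk" ]; then
        echo "ok: $indexed genomes indexed"
    else
        echo "mismatch: index reports ${indexed:-none}, found $on_disk files"
        return 1
    fi
}
```

It could be used as `dragon info --index ecoli_index/ | check_genome_count genomes`.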
Step 3: Search AMR genes
dragon search \
    --index ecoli_index/ \
    --query card_genes.fasta \
    --output amr_hits.paf \
    --format paf \
    --threads 4 \
    --min-chain-score 100
# Count hits
echo "Total hits: $(wc -l < amr_hits.paf)"
# Genomes with the most distinct AMR genes (PAF column 1 = query, column 6 = target)
cut -f1,6 amr_hits.paf | sort -u | cut -f2 | sort | uniq -c | sort -rn | head
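The stated goal is to learn which genomes carry which genes, and a presence/absence table answers that directly. The sketch below builds one from the standard PAF columns (column 1 = query name, column 6 = target name). It assumes each target name identifies one genome, as with the `genome_1` style headers used in this tutorial; multi-contig assemblies would first need contig names mapped to genomes. `paf_presence_matrix` is a hypothetical helper, not a Dragon command.

```shell
# Build a gene-by-genome presence/absence table from PAF rows on stdin.
# Output is tab-separated: a header row of target names, then one row per
# query gene with 1 (at least one hit) or 0 (no hit) for each target.
paf_presence_matrix() {
    awk -F'\t' '
        {
            hit[$1 SUBSEP $6] = 1
            # Remember genes and genomes in order of first appearance.
            if (!($1 in gene_seen))   { gene_seen[$1] = 1;   genes[++ng] = $1 }
            if (!($6 in genome_seen)) { genome_seen[$6] = 1; genomes[++nt] = $6 }
        }
        END {
            printf "gene"
            for (j = 1; j <= nt; j++) printf "\t%s", genomes[j]
            printf "\n"
            for (i = 1; i <= ng; i++) {
                printf "%s", genes[i]
                for (j = 1; j <= nt; j++) {
                    present = ((genes[i] SUBSEP genomes[j]) in hit) ? 1 : 0
                    printf "\t%d", present
                }
                printf "\n"
            }
        }
    '
}
```

For example, `paf_presence_matrix < amr_hits.paf > amr_matrix.tsv` writes a table that opens in any spreadsheet tool. Repeated hits of the same gene in the same genome collapse to a single 1.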
Step 4: BLAST-tabular output
dragon search \
    --index ecoli_index/ \
    --query card_genes.fasta \
    --output amr_hits.tsv \
    --format blast6
# Filter by identity > 90%
awk -F'\t' '$3 > 90' amr_hits.tsv | head
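A gene can produce several overlapping hits against the same target, so it is often useful to keep only the strongest one. The sketch below assumes the `blast6` output follows the standard 12-column BLAST tabular layout (column 1 = query, column 2 = subject, column 12 = bit score); whether Dragon emits exactly these columns is an assumption to verify against its documentation. `blast6_best_hits` is a hypothetical helper.

```shell
# Keep the highest-bit-score row for each (query, subject) pair.
# Reads BLAST tabular rows on stdin and writes the retained rows to stdout.
blast6_best_hits() {
    # Sort by query, then subject, then bit score descending, so the best
    # row of each pair comes first; awk then keeps only that first row.
    sort -t "$(printf '\t')" -k1,1 -k2,2 -k12,12nr |
        awk -F'\t' '!seen[$1 FS $2]++'
}
```

It can be chained with the identity filter, for example `awk -F'\t' '$3 > 90' amr_hits.tsv | blast6_best_hits`.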
Step 5: Batch analysis
Dragon handles batch queries efficiently via parallel processing:
# Search 1000 AMR genes with 8 threads
dragon search \
    --index ecoli_index/ \
    --query card_all_1003_genes.fasta \
    --output all_amr_hits.paf \
    --threads 8 \
    --max-ram 4.0
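With a large query set it helps to know which genes produced no hit at all. The sketch below compares the query FASTA headers with the query names in the PAF output. It assumes the PAF query name equals the first whitespace-delimited token of the FASTA header, which is the usual convention but has not been confirmed for Dragon here. `queries_without_hits` is a hypothetical helper.

```shell
# List query IDs from a FASTA file ($1) that never appear as a query
# name (column 1) in a PAF file ($2). IDs are printed in FASTA order.
queries_without_hits() {
    awk -F'\t' '
        # First file argument: the PAF. Record every query name that has a hit.
        FILENAME == ARGV[1] { hit[$1] = 1; next }
        # Second file argument: the FASTA. Take the first token of each header.
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*/, "", id)
            if (!(id in hit)) print id
        }
    ' "$2" "$1"
}
```

For example, `queries_without_hits card_all_1003_genes.fasta all_amr_hits.paf | wc -l` counts the genes absent from every indexed genome at the chosen thresholds.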
Performance notes
Small datasets (<100 genomes): index builds in seconds, queries in milliseconds
Medium datasets (~85K genomes): index builds in ~1 hour, queries in seconds
Large datasets (>1M genomes): index builds in hours (use GGCAT), queries in minutes
RAM: stays below 4 GB at query time regardless of database size