# Tutorial: AMR gene search This tutorial demonstrates searching antimicrobial resistance (AMR) genes against a collection of bacterial genomes. ## Scenario You have: - A database of 500 *E. coli* genomes - A set of AMR gene sequences from the CARD database - Goal: identify which genomes carry which resistance genes ## Step 1: Download test data ```bash # Create working directory mkdir dragon_tutorial && cd dragon_tutorial # Download example E. coli genomes (first 10 for this tutorial) # In practice, download from NCBI/RefSeq mkdir genomes for i in $(seq 1 10); do # Replace with actual genome download commands echo ">genome_${i}" > genomes/genome_${i}.fa # ... download genome FASTA ... done # Download AMR genes from CARD # https://card.mcmaster.ca/download wget -O card_genes.fasta https://card.mcmaster.ca/latest/data ``` ## Step 2: Build index ```bash dragon index \ --input genomes/ \ --output ecoli_index/ \ --kmer-size 31 \ --threads 4 # Check the index dragon info --index ecoli_index/ ``` Expected output: ``` Dragon Index Information ======================== Version: 0.1.0 K-mer size: 31 Genomes: 10 Unitigs: 45230 Total bases: 48500000 Index size: 0.05 GB ``` ## Step 3: Search AMR genes ```bash dragon search \ --index ecoli_index/ \ --query card_genes.fasta \ --output amr_hits.paf \ --format paf \ --threads 4 \ --min-chain-score 100 # Count hits echo "Total hits: $(wc -l < amr_hits.paf)" # Top genomes with most AMR genes cut -f6 amr_hits.paf | sort | uniq -c | sort -rn | head ``` ## Step 4: BLAST-tabular output ```bash dragon search \ --index ecoli_index/ \ --query card_genes.fasta \ --output amr_hits.tsv \ --format blast6 # Filter by identity > 90% awk -F'\t' '$3 > 90' amr_hits.tsv | head ``` ## Step 5: Batch analysis Dragon handles batch queries efficiently via parallel processing: ```bash # Search 1000 AMR genes with 8 threads dragon search \ --index ecoli_index/ \ --query card_all_1003_genes.fasta \ --output all_amr_hits.paf \ --threads 8 \ --max-ram 4.0 ``` ## Performance notes - **Small datasets (<100 genomes)**: index builds in seconds, queries in milliseconds - **Medium datasets (~85K genomes)**: index builds in ~1 hour, queries in seconds - **Large datasets (>1M genomes)**: index builds in hours (use GGCAT), queries in minutes - **RAM**: stays below 4 GB at query time regardless of database size