# Quick start This guide walks you through indexing a small set of genomes and searching for a query sequence. ## Step 1: Prepare genome files Organise your reference genomes as individual FASTA files in a directory: ``` genomes/ genome_001.fasta genome_002.fasta genome_003.fasta ... ``` Supported extensions: `.fa`, `.fasta`, `.fna`, `.fsa` ## Step 2: Build the index ```bash dragon index \ --input genomes/ \ --output my_index/ \ --kmer-size 31 \ --threads 8 ``` This creates the following files in `my_index/`: | File | Description | | --- | --- | | `fm_index.bin` | FM-index over concatenated unitig sequences | | `colors.drgn` | Roaring-bitmap colour index (unitig → genome mapping) | | `paths.bin` | Genome path index (mmap-friendly v2 format) | | `specificity.drgn` | Per-genome private-unitig sets | | `unitigs.fa` | Unitig sequences from the de Bruijn graph (optional after build) | | `metadata.json` | Index statistics (genome count, k-mer size, total bases) | ## Step 3: Search ```bash dragon search \ --index my_index/ \ --query query_genes.fasta \ --output results.paf \ --threads 8 ``` ## Step 4: Inspect results ```bash # View PAF output head results.paf # Count hits per query cut -f1 results.paf | sort | uniq -c | sort -rn | head # View index statistics dragon info --index my_index/ ``` ## Example output PAF format (tab-separated): ``` gene_001 1500 10 1490 + genome_042 4800000 123456 124946 1450 1490 60 AS:i:2900 gene_001 1500 10 1490 + genome_108 5100000 234567 236057 1430 1490 55 AS:i:2860 ``` Columns: query name, query length, query start, query end, strand, target name, target length, target start, target end, matches, alignment length, mapping quality, tags. ## Step 5 (optional): Multi-shard search If your collection is too large for a single index, build several shards and search them as one: ```bash dragon search \ --index shard_a/ \ --shard shard_b/ \ --shard shard_c/ \ --query query_genes.fasta \ --output results.paf ``` Each shard is loaded in turn (memory-bounded) and results are merged with per-genome deduplication. ## Step 6 (optional): Cloud-native deployment Export an index as a Zarr v3 store for direct reading from S3 / GCS: ```bash dragon export-zarr -i my_index/ -o my_index.zarr/ aws s3 sync my_index.zarr/ s3://your-bucket/my_index/ # Anywhere with internet (no AWS creds needed for public buckets): pip install 'zarr>=3.0' s3fs numcodecs python scripts/zarr_demo.py s3://your-bucket/my_index ``` A pre-built 16,000-genome demo lives at `s3://dragon-zarr/saureus/b1/` (eu-west-2, public-read). See [Architecture overview](../architecture/overview.md) for details. ## Step 7 (optional): Surveillance summary For AMR-gene panels and similar epidemiological queries, ask for a per-species summary instead of raw PAF: ```bash dragon search -i my_index/ -q amr_genes.fa --format summary > prevalence.tsv ``` Or post-process an existing PAF: ```bash dragon summarize --input results.paf --format tsv > prevalence.tsv ```