# Indexing

## Overview

Dragon's index construction transforms a collection of genome FASTA files into a compact, queryable index through five stages:

1. **De Bruijn graph construction**: builds a coloured compacted de Bruijn graph (ccdBG) from all genomes
2. **Unitig encoding**: encodes unitig sequences in 2-bit packed format
3. **Colour index**: creates Roaring Bitmap mappings from unitigs to genomes
4. **FM-index**: builds a run-length FM-index over concatenated unitigs
5. **Genome path index**: records each genome's traversal through the graph

## Command

```bash
dragon index [OPTIONS] --input --output
```

### Required arguments

| Argument | Description |
|----------|-------------|
| `--input`, `-i` | Directory containing genome FASTA files |
| `--output`, `-o` | Output directory for the index |

### Optional arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--kmer-size`, `-k` | 31 | K-mer size for the de Bruijn graph. Must be odd. Range: 15-31. |
| `--threads`, `-j` | 4 | Number of threads for parallel processing |

## Input format

- One FASTA file per genome
- Supported extensions: `.fa`, `.fasta`, `.fna`, `.fsa`
- Genomes may contain multiple contigs/chromosomes
- Ambiguous bases (N, R, Y, etc.) are treated as A

## Choosing k-mer size

| k | Sensitivity | Specificity | Index size | Use case |
|---|-------------|-------------|------------|----------|
| 15 | Highest | Lowest | Larger | Highly divergent queries (>10%) |
| 21 | High | Medium | Medium | General purpose |
| 31 | Medium | Highest | Smallest | Closely related genomes (<5% divergence) |

**Default (k=31)** is recommended for most prokaryotic applications.

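
The documented constraints on `--kmer-size` (odd, between 15 and 31) are cheap to check before launching a long build. The following is a minimal Python sketch of such a pre-flight check. The helper name `validate_kmer_size` is hypothetical and is not part of Dragon; only the two constraints come from the options table above.

```python
def validate_kmer_size(k: int) -> int:
    """Return k if it meets the documented constraints, else raise ValueError.

    Hypothetical helper: Dragon's own validation logic is not shown in this
    document; only the rules (odd, 15 <= k <= 31) are taken from it.
    """
    if not 15 <= k <= 31:
        raise ValueError(f"k-mer size must be in the range 15-31, got {k}")
    if k % 2 == 0:
        raise ValueError(f"k-mer size must be odd, got {k}")
    return k


# Example: the general-purpose value from the table passes unchanged.
k = validate_kmer_size(21)
```

This document does not state why k must be odd. One general property of odd k is that no k-mer can equal its own reverse complement, which keeps canonical k-mer handling unambiguous.
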
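
The input rule that ambiguous bases are treated as A, together with the 2-bit packed unitig encoding from the overview, can be illustrated with a small Python sketch. Everything beyond those two facts is an assumption made for illustration: the bit codes (A=0, C=1, G=2, T=3), the packing order within a byte, the uppercasing of input, and the function names are not taken from Dragon.

```python
# Assumed 2-bit codes; Dragon's actual bit assignment is not documented here.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def normalise(seq: str) -> str:
    """Uppercase the sequence (an assumption) and map any non-ACGT base to 'A'."""
    return "".join(b if b in BASE_CODES else "A" for b in seq.upper())


def pack_2bit(seq: str) -> bytes:
    """Pack a sequence four bases per byte, first base in the highest two bits."""
    seq = normalise(seq)
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        shift = 6 - 2 * (i % 4)
        out[i // 4] |= BASE_CODES[base] << shift
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Decode `length` bases from data produced by pack_2bit."""
    return "".join(
        BASES[(data[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(length)
    )


packed = pack_2bit("ACGTN")            # 5 bases fit in 2 bytes
restored = unpack_2bit(packed, 5)      # the N comes back as A
```

In this layout the padding bits of the last byte are indistinguishable from A, so the sequence length has to be stored alongside the packed bytes.
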
## Index files

| File | Description | Typical size (2M genomes) |
|------|-------------|--------------------------|
| `fm_index.bin` | Run-length FM-index | 2-4 GB |
| `colors.drgn` | Roaring Bitmap colour index | 50-100 GB |
| `paths.bin` | Genome path index | 8-15 GB |
| `unitigs.fa` | Unitig FASTA sequences | 1-2 GB |
| `colors.tsv` | Unitig-to-genome mapping (text) | varies |
| `metadata.json` | Index metadata | <1 KB |

## Resource requirements

| Database scale | Build time | Build RAM | Index size |
|----------------|-----------|-----------|------------|
| 500 genomes | ~10 seconds | <1 GB | ~1.5 GB |
| 85K genomes | ~1 hour | ~8 GB | ~15 GB |
| 2.34M genomes | ~12 hours | ~64 GB | ~100 GB |

## GGCAT integration

For databases larger than ~10,000 genomes, Dragon automatically uses [GGCAT](https://github.com/algbio/ggcat) if available in `PATH`. GGCAT provides:

- 5-39x faster graph construction than alternatives
- Lower memory usage via minimiser-based partitioning
- Native Rust implementation

Without GGCAT, Dragon falls back to a built-in graph builder that constructs each unique k-mer as a minimal unitig. This works correctly but produces a less compacted graph.
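
The selection rule described above (use GGCAT for large databases when it is on `PATH`, otherwise fall back to the built-in builder) can be sketched in Python as follows. This is an illustration under assumptions, not Dragon's implementation: the executable name `ggcat`, the exact cut-off of 10,000 (the text gives only an approximate figure), the case-sensitive extension match, and all function names are hypothetical.

```python
import shutil
from pathlib import Path

# Extensions listed under "Input format"; case-sensitive matching is an assumption.
FASTA_EXTENSIONS = {".fa", ".fasta", ".fna", ".fsa"}

# Approximate cut-off from the text (~10,000 genomes); the exact value is assumed.
GGCAT_THRESHOLD = 10_000


def is_genome_file(name: str) -> bool:
    """True if the file name carries one of the supported FASTA extensions."""
    return Path(name).suffix in FASTA_EXTENSIONS


def choose_graph_builder(n_genomes: int, which=shutil.which) -> str:
    """Return 'ggcat' for large inputs when the executable is found, else 'builtin'.

    `which` is injectable so the decision can be exercised without GGCAT installed.
    The executable name 'ggcat' is an assumption.
    """
    if n_genomes > GGCAT_THRESHOLD and which("ggcat") is not None:
        return "ggcat"
    return "builtin"


# Example: count candidate genomes by extension, then pick a builder.
names = ["ecoli.fa", "kpneu.fna", "README.txt"]
n_genomes = sum(is_genome_file(n) for n in names)
builder = choose_graph_builder(n_genomes)
```

Running `shutil.which("ggcat")` before a large build is a quick way to see which path a rule like this would take on a given machine.
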