Indexing

Overview

Dragon’s index construction transforms a collection of genome FASTA files into a compact, queryable index through five stages:

1. De Bruijn graph construction: builds a coloured compacted de Bruijn graph (ccdBG) from all genomes
2. Unitig encoding: encodes unitig sequences in 2-bit packed format
3. Colour index: creates Roaring Bitmap mappings from unitigs to genomes
4. FM-index: builds a run-length FM-index over concatenated unitigs
5. Genome path index: records each genome’s traversal through the graph
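To make the unitig encoding stage more concrete, the following is a minimal sketch of 2-bit packing, in which each base occupies two bits and four bases fit in one byte. The bit assignment (A=0, C=1, G=2, T=3), the little-endian placement within each byte, and the function names are assumptions for illustration; Dragon's actual on-disk layout may differ.

```python
# Hypothetical 2-bit code table; Dragon's real bit assignment is not documented here.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        # Base i goes into byte i // 4, at bit offset 2 * (i % 4).
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )
```

Because the last byte may be only partly filled, the sequence length has to be stored alongside the packed bytes; here it is simply passed back to `unpack_2bit`.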

Command

dragon index [OPTIONS] --input <DIR> --output <DIR>

Required arguments

Argument	Description
`--input`, `-i`	Directory containing genome FASTA files
`--output`, `-o`	Output directory for the index

Optional arguments

Argument	Default	Description
`--kmer-size`, `-k`	31	K-mer size for the de Bruijn graph. Must be odd. Range: 15-31.
`--threads`, `-j`	4	Number of threads for parallel processing
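The constraints on `--kmer-size` can be checked before launching a long build. The sketch below is a hypothetical pre-flight helper, not part of Dragon itself. It also illustrates one common motivation for requiring odd k in de Bruijn graph tools: an odd-length k-mer can never equal its own reverse complement. The source does not state whether this is Dragon's reason.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def validate_kmer_size(k: int) -> int:
    """Return k if it satisfies the documented constraints (odd, 15 to 31)."""
    if not 15 <= k <= 31:
        raise ValueError(f"k-mer size {k} is outside the supported range 15-31")
    if k % 2 == 0:
        raise ValueError(f"k-mer size {k} must be odd")
    return k


def is_own_reverse_complement(kmer: str) -> bool:
    """True if the k-mer reads the same on both strands."""
    return kmer == kmer.translate(_COMPLEMENT)[::-1]
```

For example, the even-length k-mer `ACGT` is its own reverse complement, whereas no odd-length k-mer can be, because its middle base would have to equal its own complement.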

Input format

One FASTA file per genome
Supported extensions: .fa, .fasta, .fna, .fsa
Genomes may contain multiple contigs/chromosomes
Ambiguous bases (N, R, Y, etc.) are treated as A
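The input rules above can be sketched as a small pre-processing helper. This is an illustration only: the function names are hypothetical, extension matching is assumed to be exact (lowercase), and upper-casing soft-masked bases before substitution is an assumption rather than documented Dragon behaviour.

```python
from pathlib import Path

# Extensions listed in the documentation above.
FASTA_EXTENSIONS = {".fa", ".fasta", ".fna", ".fsa"}


def find_genomes(input_dir):
    """List genome FASTA files (one per genome) in a directory, sorted by path."""
    return sorted(
        p
        for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix in FASTA_EXTENSIONS
    )


def normalise_bases(seq: str) -> str:
    """Replace every base other than A, C, G, T with A."""
    return "".join(b if b in "ACGT" else "A" for b in seq.upper())
```

Running `find_genomes` on the input directory before indexing is a quick way to confirm how many genomes will be picked up.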

Choosing k-mer size

k	Sensitivity	Specificity	Index size	Use case
15	Highest	Lowest	Largest	Highly divergent queries (>10% divergence)
21	High	Medium	Medium	General purpose
31	Medium	Highest	Smallest	Closely related genomes (<5% divergence)

The default (k=31) is recommended for most prokaryotic applications.

Index files

File	Description	Typical size (2M genomes)
`fm_index.bin`	Run-length FM-index	2-4 GB
`colors.drgn`	Roaring Bitmap colour index	50-100 GB
`paths.bin`	Genome path index	8-15 GB
`unitigs.fa`	Unitig FASTA sequences	1-2 GB
`colors.tsv`	Unitig-to-genome mapping (text)	varies
`metadata.json`	Index metadata	<1 KB
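After a build, it can be useful to confirm that the output directory contains the files listed above. The helper below is a hypothetical sketch; it assumes every listed file is always written, which the source does not explicitly guarantee.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_FILES = [
    "fm_index.bin",
    "colors.drgn",
    "paths.bin",
    "unitigs.fa",
    "colors.tsv",
    "metadata.json",
]


def missing_index_files(index_dir):
    """Return the names of expected index files that are absent from index_dir."""
    root = Path(index_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty return value means all six files are present; it says nothing about whether their contents are valid.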

Resource requirements

Database scale	Build time	Build RAM	Index size
500 genomes	~10 seconds	<1 GB	~1.5 GB
85K genomes	~1 hour	~8 GB	~15 GB
2.34M genomes	~12 hours	~64 GB	~100 GB

GGCAT integration

For databases larger than ~10,000 genomes, Dragon automatically uses GGCAT if it is available on the PATH. GGCAT provides:

5-39x faster graph construction than alternatives
Lower memory usage via minimiser-based partitioning
Native Rust implementation

Without GGCAT, Dragon falls back to a built-in graph builder that emits each unique k-mer as its own minimal unitig. The resulting index is still correct, but the graph is less compacted.
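The selection logic described in this section can be approximated as follows. This is a sketch under stated assumptions, not Dragon's actual code: the executable name `ggcat`, the exact threshold of 10,000, and the use of the built-in builder for smaller databases are all assumptions inferred from the text.

```python
import shutil

# Approximate cut-off from the text ("larger than ~10,000 genomes").
GGCAT_THRESHOLD = 10_000


def choose_graph_builder(n_genomes: int, ggcat_binary: str = "ggcat") -> str:
    """Return "ggcat" for large databases when the binary is on PATH, else "builtin"."""
    if n_genomes > GGCAT_THRESHOLD and shutil.which(ggcat_binary) is not None:
        return "ggcat"
    return "builtin"
```

Calling `shutil.which("ggcat")` directly before a large build is a simple way to check whether the faster path would be available.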