Coloured compacted de Bruijn graph
What is a de Bruijn graph?
A de Bruijn graph represents a DNA sequence collection where:
Each node is a unique k-mer (substring of length k)
Each edge connects k-mers that overlap by k-1 bases
A compacted graph merges non-branching paths into unitigs (maximal linear chains)
A coloured graph annotates each unitig with the set of genomes containing it
Why use a de Bruijn graph?
For 2 million prokaryotic genomes totalling ~10 Tbp of sequence:
Without graph: store 10 Tbp of sequence = ~2.5 TB at 2 bits/base
With graph: store ~5 Gbp of unique unitig sequence = ~1.2 GB at 2 bits/base
The graph collapses shared content, achieving a ~2,000-fold reduction in sequence storage.
Construction
Dragon uses GGCAT for graph construction:
Input: N genome FASTA files
K-mer extraction: extract all k-mers (default k=31) from all genomes
Graph construction: build the de Bruijn graph, merge non-branching paths into unitigs
Colour annotation: record which genomes contain each unitig
Output: unitig FASTA file + colour mapping
GGCAT advantages
Written in Rust (same as Dragon)
5-39x faster than Bifrost
Low memory via minimiser-based partitioning
Native colour support
Fallback builder
Without GGCAT, Dragon uses a built-in builder that:
Collects all k-mers with genome IDs via a BTreeMap
Outputs each unique k-mer as a minimal unitig
Works correctly but produces less compacted graphs
Suitable for datasets up to ~10,000 genomes
Colour storage
Each unitig’s colour set (the genomes containing it) is stored as a Roaring Bitmap — a compressed integer set that efficiently handles:
Dense ranges: consecutive genome IDs (e.g., all E. coli genomes)
Sparse sets: scattered genome IDs
Fast operations: intersection, union, cardinality in O(n/64) time
The colour index is stored as a memory-mapped file with an offset table for O(1) per-unitig access.