Coloured compacted de Bruijn graph

What is a de Bruijn graph?

A de Bruijn graph represents a DNA sequence collection where:

Each node is a unique k-mer (substring of length k)
Each edge connects k-mers that overlap by k-1 bases
A compacted graph merges non-branching paths into unitigs (maximal linear chains)
A coloured graph annotates each unitig with the set of genomes containing it

Why use a de Bruijn graph?

For 2 million prokaryotic genomes totalling ~10 Tbp of sequence:

Without graph: store 10 Tbp of sequence = ~2.5 TB at 2 bits/base
With graph: store ~5 Gbp of unique unitig sequence = ~1.2 GB at 2 bits/base

The graph collapses shared content, achieving a ~2,000-fold reduction in sequence storage.

Construction

Dragon uses GGCAT for graph construction:

Input: N genome FASTA files
K-mer extraction: extract all k-mers (default k=31) from all genomes
Graph construction: build the de Bruijn graph, merge non-branching paths into unitigs
Colour annotation: record which genomes contain each unitig
Output: unitig FASTA file + colour mapping

GGCAT advantages

Written in Rust (same as Dragon)
5-39x faster than Bifrost
Low memory via minimiser-based partitioning
Native colour support

Fallback builder

Without GGCAT, Dragon uses a built-in builder that:

Collects all k-mers with genome IDs via a BTreeMap
Outputs each unique k-mer as a minimal unitig
Works correctly but produces less compacted graphs
Suitable for datasets up to ~10,000 genomes

Colour storage

Each unitig’s colour set (the genomes containing it) is stored as a Roaring Bitmap — a compressed integer set that efficiently handles:

Dense ranges: consecutive genome IDs (e.g., all E. coli genomes)
Sparse sets: scattered genome IDs
Fast operations: intersection, union, cardinality in O(n/64) time

The colour index is stored as a memory-mapped file with an offset table for O(1) per-unitig access.