Coloured compacted de Bruijn graph

What is a de Bruijn graph?

A de Bruijn graph represents a DNA sequence collection where:

  • Each node is a unique k-mer (substring of length k)

  • Each edge connects k-mers that overlap by k-1 bases

  • A compacted graph merges non-branching paths into unitigs (maximal linear chains)

  • A coloured graph annotates each unitig with the set of genomes containing it

Why use a de Bruijn graph?

For 2 million prokaryotic genomes totalling ~10 Tbp of sequence:

  • Without graph: store 10 Tbp of sequence = ~2.5 TB at 2 bits/base

  • With graph: store ~5 Gbp of unique unitig sequence = ~1.2 GB at 2 bits/base

The graph collapses shared content, achieving a ~2,000-fold reduction in sequence storage.

Construction

Dragon uses GGCAT for graph construction:

  1. Input: N genome FASTA files

  2. K-mer extraction: extract all k-mers (default k=31) from all genomes

  3. Graph construction: build the de Bruijn graph, merge non-branching paths into unitigs

  4. Colour annotation: record which genomes contain each unitig

  5. Output: unitig FASTA file + colour mapping

GGCAT advantages

  • Written in Rust (same as Dragon)

  • 5-39x faster than Bifrost

  • Low memory via minimiser-based partitioning

  • Native colour support

Fallback builder

Without GGCAT, Dragon uses a built-in builder that:

  • Collects all k-mers with genome IDs via a BTreeMap

  • Outputs each unique k-mer as a minimal unitig

  • Works correctly but produces less compacted graphs

  • Suitable for datasets up to ~10,000 genomes

Colour storage

Each unitig’s colour set (the genomes containing it) is stored as a Roaring Bitmap — a compressed integer set that efficiently handles:

  • Dense ranges: consecutive genome IDs (e.g., all E. coli genomes)

  • Sparse sets: scattered genome IDs

  • Fast operations: intersection, union, cardinality in O(n/64) time

The colour index is stored as a memory-mapped file with an offset table for O(1) per-unitig access.