Architecture overview

The redundancy problem

Millions of prokaryotic genomes share the vast majority of their sequence content. Within a single species, genomes typically share >95% average nucleotide identity (ANI). Existing tools like LexicMap, Minimap2, and BLASTn index each genome independently, duplicating shared content millions of times.

Dragon solves this by storing shared sequence once using a coloured compacted de Bruijn graph, then indexing the unique content with a compressed FM-index.

Pipeline overview

Dragon has two phases: offline index construction (expensive, run once) and online query (fast, low RAM, run many times).

Index construction

FASTA genomes
    |
    v
GGCAT (coloured compacted de Bruijn graph)
    |
    +---> unitigs.fa (2-bit encoded unitigs)
    |         |
    |         v
    |     fm_index.bin (concatenated text + suffix array)
    |
    +---> colors.drgn      (Roaring Bitmaps: unitig -> genome set)
    +---> paths.bin v2     (genome -> unitig traversal, mmap'd offset table)
    +---> specificity.drgn (per-genome private-unitig sets)
    +---> metadata.json
    |
    v
On-disk Dragon index (memory-mapped at query time)
    |
    | dragon export-zarr
    v
Zarr v3 store (chunked + Zstd, served from local disk or s3:// / gs://)

Query pipeline

Query FASTA
    |
    v
Stage 1: FM-index backward search
    |     (variable-length seed matching)
    v
Stage 2: Candidate genome filtering
    |     (Roaring-bitmap colour voting)
    v
Stage 3: ML-weighted graph-aware colinear chaining
    |     (logistic-regression seed scoring + Fenwick / O(h^2) DP)
    v
Stage 4: Containment ranking (top-N candidates by total matched bases)
    |
    v
Stage 5: Banded wavefront alignment along genome paths
    |
    v
PAF / BLAST6 / surveillance summary / GFA output

Multi-shard search (--shard) drives this pipeline once per shard, then merges results with per-genome deduplication so quota- or RAM-split indices behave like a single logical database.

Why this is efficient

Component	What it compresses	Compression factor
De Bruijn graph	Shared sequence across genomes	~2,000x (10 Tbp -> 5 Gbp)
Run-length FM-index	Repetitive BWT runs	10-100x (r/n ~ 0.01-0.1)
Roaring Bitmaps	Clustered genome ID sets	~10x vs raw bitvectors
Delta-coded paths	Similar genome traversals	5-10x within species clusters

Net result: ~50x total disk reduction vs LexicMap (100 GB vs 5.46 TB for 2.34M genomes).

Module map

src/
+-- index/                      Index construction
|   +-- dbg.rs                  GGCAT integration (with internal-builder fallback)
|   +-- unitig.rs               Unitig parsing, 2-bit encoding
|   +-- color.rs                Roaring-bitmap colour index
|   +-- ggcat_colors.rs         GGCAT binary colormap -> colors.drgn (no TSV)
|   +-- fm.rs                   FM-index construction and search
|   +-- paths.rs                Genome path index (legacy bincode loader)
|   +-- paths_v2.rs             Mmap-friendly v2 format (default for new builds)
|   +-- specificity.rs          Per-genome private-unitig sets
|   +-- auto_batch.rs           Auto-split large collections into overlay batches
|   +-- update.rs               Incremental overlay addition (dragon update / compact)
|   +-- zarr_backend.rs         Zarr v3 export + ZarrFmIndex / ZarrColorIndex readers
|   +-- metadata.rs             Index metadata (JSON)
|
+-- query/                      Query pipeline
|   +-- seed.rs                 FM-index backward search (variable-length)
|   +-- candidate.rs            Colour-based genome voting
|   +-- chain.rs                ML-weighted graph-aware chaining + containment ranking
|   +-- containment.rs          K-mer containment scoring
|   +-- direct_align.rs         Direct alignment to top candidate genomes
|   +-- align.rs                Banded wavefront alignment along genome paths
|   +-- mod.rs                  Multi-shard orchestration (search_multi_index)
|
+-- signal/                     Raw nanopore current search
|   +-- index.rs                Pore-model-driven discretisation + signal FM-index
|   +-- search.rs               Backward search over signal k-mers
|
+-- ds/                         Data structures
|   +-- elias_fano.rs           CumulativeLengthIndex (position -> unitig)
|   +-- fenwick.rs              Binary indexed tree for O(h log h) chaining
|   +-- varint.rs               LEB128 codecs (also used by paths_v2)
|
+-- io/                         Input / output
|   +-- fasta.rs                FASTA / FASTQ parser
|   +-- paf.rs                  PAF writer
|   +-- blast.rs                BLAST-tabular writer
|   +-- gfa.rs                  Graph-context GFA writer
|   +-- summary.rs              Surveillance summary writer
|
+-- util/                       Utilities
    +-- dna.rs                  2-bit DNA encoding, reverse complement
    +-- mmap.rs                 Bincode + memory-mapped helpers
    +-- colorspace.rs           SOLiD-style 2-base colour-space encoder/decoder
    +-- progress.rs             Progress bars