Architecture overview
The redundancy problem
Millions of prokaryotic genomes share the vast majority of their sequence content. Within a single species, genomes typically share >95% average nucleotide identity (ANI). Existing tools like LexicMap, Minimap2, and BLASTn index each genome independently, duplicating shared content millions of times.
Dragon solves this by storing shared sequence once using a coloured compacted de Bruijn graph, then indexing the unique content with a compressed FM-index.
Pipeline overview
Dragon has two phases: offline index construction (expensive, run once) and online query (fast, low RAM, run many times).
Index construction
FASTA genomes
|
v
GGCAT (coloured compacted de Bruijn graph)
|
+---> unitigs.fa (2-bit encoded unitigs)
| |
| v
| fm_index.bin (concatenated text + suffix array)
|
+---> colors.drgn (Roaring Bitmaps: unitig -> genome set)
+---> paths.bin v2 (genome -> unitig traversal, mmap'd offset table)
+---> specificity.drgn (per-genome private-unitig sets)
+---> metadata.json
|
v
On-disk Dragon index (memory-mapped at query time)
|
| dragon export-zarr
v
Zarr v3 store (chunked + Zstd, served from local disk or s3:// / gs://)
Query pipeline
Query FASTA
|
v
Stage 1: FM-index backward search
| (variable-length seed matching)
v
Stage 2: Candidate genome filtering
| (Roaring-bitmap colour voting)
v
Stage 3: ML-weighted graph-aware colinear chaining
| (logistic-regression seed scoring + Fenwick / O(h^2) DP)
v
Stage 4: Containment ranking (top-N candidates by total matched bases)
|
v
Stage 5: Banded wavefront alignment along genome paths
|
v
PAF / BLAST6 / surveillance summary / GFA output
Multi-shard search (--shard) drives this pipeline once per shard, then merges results with per-genome deduplication so quota- or RAM-split indices behave like a single logical database.
Why this is efficient
Component |
What it compresses |
Compression factor |
|---|---|---|
De Bruijn graph |
Shared sequence across genomes |
~2,000x (10 Tbp -> 5 Gbp) |
Run-length FM-index |
Repetitive BWT runs |
10-100x (r/n ~ 0.01-0.1) |
Roaring Bitmaps |
Clustered genome ID sets |
~10x vs raw bitvectors |
Delta-coded paths |
Similar genome traversals |
5-10x within species clusters |
Net result: ~50x total disk reduction vs LexicMap (100 GB vs 5.46 TB for 2.34M genomes).
Module map
src/
+-- index/ Index construction
| +-- dbg.rs GGCAT integration (with internal-builder fallback)
| +-- unitig.rs Unitig parsing, 2-bit encoding
| +-- color.rs Roaring-bitmap colour index
| +-- ggcat_colors.rs GGCAT binary colormap -> colors.drgn (no TSV)
| +-- fm.rs FM-index construction and search
| +-- paths.rs Genome path index (legacy bincode loader)
| +-- paths_v2.rs Mmap-friendly v2 format (default for new builds)
| +-- specificity.rs Per-genome private-unitig sets
| +-- auto_batch.rs Auto-split large collections into overlay batches
| +-- update.rs Incremental overlay addition (dragon update / compact)
| +-- zarr_backend.rs Zarr v3 export + ZarrFmIndex / ZarrColorIndex readers
| +-- metadata.rs Index metadata (JSON)
|
+-- query/ Query pipeline
| +-- seed.rs FM-index backward search (variable-length)
| +-- candidate.rs Colour-based genome voting
| +-- chain.rs ML-weighted graph-aware chaining + containment ranking
| +-- containment.rs K-mer containment scoring
| +-- direct_align.rs Direct alignment to top candidate genomes
| +-- align.rs Banded wavefront alignment along genome paths
| +-- mod.rs Multi-shard orchestration (search_multi_index)
|
+-- signal/ Raw nanopore current search
| +-- index.rs Pore-model-driven discretisation + signal FM-index
| +-- search.rs Backward search over signal k-mers
|
+-- ds/ Data structures
| +-- elias_fano.rs CumulativeLengthIndex (position -> unitig)
| +-- fenwick.rs Binary indexed tree for O(h log h) chaining
| +-- varint.rs LEB128 codecs (also used by paths_v2)
|
+-- io/ Input / output
| +-- fasta.rs FASTA / FASTQ parser
| +-- paf.rs PAF writer
| +-- blast.rs BLAST-tabular writer
| +-- gfa.rs Graph-context GFA writer
| +-- summary.rs Surveillance summary writer
|
+-- util/ Utilities
+-- dna.rs 2-bit DNA encoding, reverse complement
+-- mmap.rs Bincode + memory-mapped helpers
+-- colorspace.rs SOLiD-style 2-base colour-space encoder/decoder
+-- progress.rs Progress bars