# Architecture overview

## The redundancy problem

Millions of prokaryotic genomes share the vast majority of their sequence content. Within a single species, genomes typically share >95% average nucleotide identity (ANI). Existing tools like LexicMap, Minimap2, and BLASTn index each genome independently, duplicating shared content millions of times.

Dragon solves this by **storing shared sequence once** using a coloured compacted de Bruijn graph, then indexing the unique content with a compressed FM-index.

## Pipeline overview

Dragon has two phases: **offline index construction** (expensive, run once) and **online query** (fast, low RAM, run many times).

### Index construction

```text
FASTA genomes
    |
    v
GGCAT (coloured compacted de Bruijn graph)
    |
    +---> unitigs.fa (2-bit encoded unitigs)
    |         |
    |         v
    |     fm_index.bin (concatenated text + suffix array)
    |
    +---> colors.drgn      (Roaring Bitmaps: unitig -> genome set)
    +---> paths.bin v2     (genome -> unitig traversal, mmap'd offset table)
    +---> specificity.drgn (per-genome private-unitig sets)
    +---> metadata.json
    |
    v
On-disk Dragon index (memory-mapped at query time)
    |
    | dragon export-zarr
    v
Zarr v3 store (chunked + Zstd, served from local disk or s3:// / gs://)
```

### Query pipeline

```text
Query FASTA
    |
    v
Stage 1: FM-index backward search
    |     (variable-length seed matching)
    v
Stage 2: Candidate genome filtering
    |     (Roaring-bitmap colour voting)
    v
Stage 3: ML-weighted graph-aware colinear chaining
    |     (logistic-regression seed scoring + Fenwick / O(h^2) DP)
    v
Stage 4: Containment ranking (top-N candidates by total matched bases)
    |
    v
Stage 5: Banded wavefront alignment along genome paths
    |
    v
PAF / BLAST6 / surveillance summary / GFA output
```

Multi-shard search (`--shard`) drives this pipeline once per shard, then merges results with per-genome deduplication so quota- or RAM-split indices behave like a single logical database.

## Why this is efficient

| Component | What it compresses | Compression factor |
|-----------|-------------------|-------------------|
| De Bruijn graph | Shared sequence across genomes | ~2,000x (10 Tbp -> 5 Gbp) |
| Run-length FM-index | Repetitive BWT runs | 10-100x (r/n ~ 0.01-0.1) |
| Roaring Bitmaps | Clustered genome ID sets | ~10x vs raw bitvectors |
| Delta-coded paths | Similar genome traversals | 5-10x within species clusters |

**Net result**: ~50x total disk reduction vs LexicMap (100 GB vs 5.46 TB for 2.34M genomes).

## Module map

```text
src/
+-- index/                      Index construction
|   +-- dbg.rs                  GGCAT integration (with internal-builder fallback)
|   +-- unitig.rs               Unitig parsing, 2-bit encoding
|   +-- color.rs                Roaring-bitmap colour index
|   +-- ggcat_colors.rs         GGCAT binary colormap -> colors.drgn (no TSV)
|   +-- fm.rs                   FM-index construction and search
|   +-- paths.rs                Genome path index (legacy bincode loader)
|   +-- paths_v2.rs             Mmap-friendly v2 format (default for new builds)
|   +-- specificity.rs          Per-genome private-unitig sets
|   +-- auto_batch.rs           Auto-split large collections into overlay batches
|   +-- update.rs               Incremental overlay addition (dragon update / compact)
|   +-- zarr_backend.rs         Zarr v3 export + ZarrFmIndex / ZarrColorIndex readers
|   +-- metadata.rs             Index metadata (JSON)
|
+-- query/                      Query pipeline
|   +-- seed.rs                 FM-index backward search (variable-length)
|   +-- candidate.rs            Colour-based genome voting
|   +-- chain.rs                ML-weighted graph-aware chaining + containment ranking
|   +-- containment.rs          K-mer containment scoring
|   +-- direct_align.rs         Direct alignment to top candidate genomes
|   +-- align.rs                Banded wavefront alignment along genome paths
|   +-- mod.rs                  Multi-shard orchestration (search_multi_index)
|
+-- signal/                     Raw nanopore current search
|   +-- index.rs                Pore-model-driven discretisation + signal FM-index
|   +-- search.rs               Backward search over signal k-mers
|
+-- ds/                         Data structures
|   +-- elias_fano.rs           CumulativeLengthIndex (position -> unitig)
|   +-- fenwick.rs              Binary indexed tree for O(h log h) chaining
|   +-- varint.rs               LEB128 codecs (also used by paths_v2)
|
+-- io/                         Input / output
|   +-- fasta.rs                FASTA / FASTQ parser
|   +-- paf.rs                  PAF writer
|   +-- blast.rs                BLAST-tabular writer
|   +-- gfa.rs                  Graph-context GFA writer
|   +-- summary.rs              Surveillance summary writer
|
+-- util/                       Utilities
    +-- dna.rs                  2-bit DNA encoding, reverse complement
    +-- mmap.rs                 Bincode + memory-mapped helpers
    +-- colorspace.rs           SOLiD-style 2-base colour-space encoder/decoder
    +-- progress.rs             Progress bars
```