# Data structures ## 2-bit DNA encoding DNA bases are packed 32 per `u64` word: | Base | Encoding | |------|----------| | A | `00` | | C | `01` | | G | `10` | | T | `11` | Operations: - **Encode**: 10,000 bases in ~2 microseconds - **Decode**: 10,000 bases in ~3 microseconds - **Reverse complement**: bitwise NOT + reverse pairs - **K-mer extraction**: shift and mask operations Ambiguous bases (N, R, Y, etc.) are encoded as A (`00`). ## Roaring Bitmaps Used for the colour index (mapping unitigs to genome sets). Roaring Bitmaps partition the integer universe into blocks of 2^16, using the optimal representation per block: - **Array container**: for sparse blocks (<4,096 integers) - **Bitmap container**: for dense blocks (>4,096 integers) - **Run container**: for blocks with long consecutive runs This is ideal for genome ID sets, which tend to be clustered by species. ## Elias-Fano cumulative length index Maps FM-index text positions to unitig IDs via binary search on a sorted array of cumulative start positions. Current implementation uses a simple `Vec` with binary search; production version can upgrade to a true Elias-Fano structure for O(1) predecessor queries. ## Fenwick tree (Binary Indexed Tree) Used in colinear chaining for prefix maximum queries: - **Update**: set position i to max(current, value) in O(log n) - **Query**: find maximum in prefix [0, i] in O(log n) - **Space**: O(n) Two variants: - `FenwickMax`: prefix maximum (used in chaining) - `FenwickSum`: prefix sum (used in vote counting) ## Variable-length integers (varint) LEB128 encoding for genome path compression: | Value range | Bytes | |------------|-------| | 0-127 | 1 | | 128-16,383 | 2 | | 16,384-2,097,151 | 3 | | 2,097,152-268,435,455 | 4 | Also supports: - **Zigzag encoding**: efficient encoding of signed integers - **Delta encoding**: for sorted sequences (store differences) - **Slice encoding/decoding**: batch operations