Data structures
2-bit DNA encoding
DNA bases are packed 32 per u64 word:
Base |
Encoding |
|---|---|
A |
|
C |
|
G |
|
T |
|
Operations:
Encode: 10,000 bases in ~2 microseconds
Decode: 10,000 bases in ~3 microseconds
Reverse complement: bitwise NOT + reverse pairs
K-mer extraction: shift and mask operations
Ambiguous bases (N, R, Y, etc.) are encoded as A (00).
Roaring Bitmaps
Used for the colour index (mapping unitigs to genome sets). Roaring Bitmaps partition the integer universe into blocks of 2^16, using the optimal representation per block:
Array container: for sparse blocks (<4,096 integers)
Bitmap container: for dense blocks (>4,096 integers)
Run container: for blocks with long consecutive runs
This is ideal for genome ID sets, which tend to be clustered by species.
Elias-Fano cumulative length index
Maps FM-index text positions to unitig IDs via binary search on a sorted array of cumulative start positions. Current implementation uses a simple Vec<u64> with binary search; production version can upgrade to a true Elias-Fano structure for O(1) predecessor queries.
Fenwick tree (Binary Indexed Tree)
Used in colinear chaining for prefix maximum queries:
Update: set position i to max(current, value) in O(log n)
Query: find maximum in prefix [0, i] in O(log n)
Space: O(n)
Two variants:
FenwickMax: prefix maximum (used in chaining)FenwickSum: prefix sum (used in vote counting)
Variable-length integers (varint)
LEB128 encoding for genome path compression:
Value range |
Bytes |
|---|---|
0-127 |
1 |
128-16,383 |
2 |
16,384-2,097,151 |
3 |
2,097,152-268,435,455 |
4 |
Also supports:
Zigzag encoding: efficient encoding of signed integers
Delta encoding: for sorted sequences (store differences)
Slice encoding/decoding: batch operations