# Data structures

## 2-bit DNA encoding

Each DNA base is encoded in 2 bits, so 32 bases pack into one `u64` word:

| Base | Encoding |
|------|----------|
| A | `00` |
| C | `01` |
| G | `10` |
| T | `11` |

Operations:
- **Encode**: 10,000 bases in ~2 microseconds
- **Decode**: 10,000 bases in ~3 microseconds
- **Reverse complement**: bitwise NOT (under this encoding A/T and C/G are bitwise complements), then reverse the order of the 2-bit pairs
- **K-mer extraction**: shift and mask operations

Ambiguous bases (N, R, Y, etc.) are encoded as A (`00`), so they cannot be distinguished from A after decoding.
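The encoding and the bit-level operations above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and placing the first base in the lowest two bits of the word is an assumption, since the layout within the word is not specified here.

```rust
/// Map an ASCII base to its 2-bit code. Anything other than C, G or T
/// (including N and other ambiguity codes) falls back to A = 0b00.
fn encode_base(b: u8) -> u64 {
    match b.to_ascii_uppercase() {
        b'C' => 0b01,
        b'G' => 0b10,
        b'T' => 0b11,
        _ => 0b00,
    }
}

/// Pack up to 32 bases into one word.
/// Assumed layout: base i occupies bits 2*i and 2*i + 1.
fn pack(seq: &[u8]) -> u64 {
    assert!(seq.len() <= 32, "a u64 holds at most 32 bases");
    seq.iter()
        .enumerate()
        .fold(0u64, |word, (i, &b)| word | (encode_base(b) << (2 * i)))
}

/// Unpack the first `len` bases of a word back to ASCII.
fn unpack(word: u64, len: usize) -> Vec<u8> {
    (0..len)
        .map(|i| b"ACGT"[((word >> (2 * i)) & 0b11) as usize])
        .collect()
}

/// Reverse complement of a word holding `len` bases: NOT complements
/// every base, then the 2-bit pairs are written back in reverse order.
fn reverse_complement(word: u64, len: usize) -> u64 {
    let complemented = !word;
    let mut out = 0u64;
    for i in 0..len {
        let code = (complemented >> (2 * i)) & 0b11;
        out |= code << (2 * (len - 1 - i));
    }
    out
}

/// Extract the k-mer starting at base `pos` (pos < 32) by shift and mask.
fn kmer(word: u64, pos: usize, k: usize) -> u64 {
    let mask = if k >= 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    (word >> (2 * pos)) & mask
}
```

The loop in `reverse_complement` is written for clarity; a faster version would reverse the pairs with a fixed sequence of swap-and-mask steps.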

## Roaring Bitmaps

Used for the colour index (mapping unitigs to genome sets). Roaring Bitmaps partition the integer universe into blocks of 2^16 consecutive values and choose the most compact container for each block:

- **Array container**: for sparse blocks (up to 4,096 integers)
- **Bitmap container**: for dense blocks (more than 4,096 integers)
- **Run container**: for blocks with long consecutive runs

This is ideal for genome ID sets, which tend to be clustered by species.
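To make the block structure concrete, the sketch below splits a 32-bit ID into its block key and in-block offset and applies the array/bitmap threshold. The names `split` and `choose_container` are hypothetical and are not the API of any Roaring library; run containers are left out.

```rust
/// Split a 32-bit ID into (block key, offset within the block).
/// The high 16 bits select the block, the low 16 bits the position in it.
fn split(id: u32) -> (u16, u16) {
    ((id >> 16) as u16, (id & 0xFFFF) as u16)
}

#[derive(Debug, PartialEq)]
enum ContainerKind {
    Array,
    Bitmap,
}

/// Pick a container from the number of values stored in one block.
/// An array costs 2 bytes per value and a bitmap a fixed 8,192 bytes
/// (2^16 bits), so the two break even at 4,096 values.
fn choose_container(cardinality: usize) -> ContainerKind {
    if cardinality <= 4096 {
        ContainerKind::Array
    } else {
        ContainerKind::Bitmap
    }
}
```

Genome IDs that are close together share a block key, so a clustered set touches few blocks and each one can use whichever container is smallest.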

## Elias-Fano cumulative length index

Maps FM-index text positions to unitig IDs via binary search over a sorted array of cumulative unitig start positions. The current implementation uses a plain `Vec<u64>` with O(log n) binary search; a production version could switch to a true Elias-Fano structure, which stores the same sorted sequence in compressed form with O(1) random access and fast predecessor queries.
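The lookup on the plain sorted array is a predecessor query: find the last start position that does not exceed the text position. A minimal sketch, assuming `starts[i]` holds the text position at which unitig `i` begins (the function name `unitig_at` is hypothetical):

```rust
/// Return the unitig containing `pos` and the offset of `pos` inside it.
/// `starts` must be sorted ascending. Returns `None` if `pos` lies
/// before the first start position.
fn unitig_at(starts: &[u64], pos: u64) -> Option<(usize, u64)> {
    // Count the start positions that are <= pos (binary search).
    let count = starts.partition_point(|&s| s <= pos);
    // The predecessor is the last of those, if there is one.
    let id = count.checked_sub(1)?;
    Some((id, pos - starts[id]))
}
```

Swapping in a compressed structure later would only change how `starts` is stored, as long as it still answers the same predecessor query.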

## Fenwick tree (Binary Indexed Tree)

Used in colinear chaining for prefix maximum queries:

- **Update**: set position i to max(current, value) in O(log n)
- **Query**: find maximum in prefix [0, i] in O(log n)
- **Space**: O(n)

Two variants:
- `FenwickMax`: prefix maximum (used in chaining)
- `FenwickSum`: prefix sum (used in vote counting)
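A prefix-maximum Fenwick tree along these lines could look like the sketch below. The method names and the use of `i64` scores are assumptions for illustration; only the `FenwickMax` name comes from the list above.

```rust
/// Fenwick tree over positions 0..n supporting prefix-maximum queries.
struct FenwickMax {
    // 1-based internal array; tree[0] is unused.
    tree: Vec<i64>,
}

impl FenwickMax {
    /// Create a tree for `n` positions, all initially "empty" (i64::MIN).
    fn new(n: usize) -> Self {
        FenwickMax { tree: vec![i64::MIN; n + 1] }
    }

    /// Raise position `i` (0-based) to at least `value`.
    fn update(&mut self, i: usize, value: i64) {
        let mut idx = i + 1;
        while idx < self.tree.len() {
            self.tree[idx] = self.tree[idx].max(value);
            // Move to the next node whose range covers this position.
            idx += idx & idx.wrapping_neg();
        }
    }

    /// Maximum over positions [0, i]; i64::MIN if none were updated.
    fn query(&self, i: usize) -> i64 {
        let mut idx = (i + 1).min(self.tree.len() - 1);
        let mut best = i64::MIN;
        while idx > 0 {
            best = best.max(self.tree[idx]);
            // Strip the lowest set bit to jump to the preceding range.
            idx -= idx & idx.wrapping_neg();
        }
        best
    }
}
```

Values in this structure can only be raised, never lowered. That fits chaining, where each position only ever needs the best score seen so far.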

## Variable-length integers (varint)

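Unsigned LEB128 encoding, used for genome path compression. Each byte carries 7 payload bits plus a continuation bit, so every additional byte extends the representable range by a factor of 128: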

| Value range | Bytes |
|------------|-------|
| 0-127 | 1 |
| 128-16,383 | 2 |
| 16,384-2,097,151 | 3 |
| 2,097,152-268,435,455 | 4 |

Also supports:
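- **Zigzag encoding**: maps signed integers to unsigned ones so that values of small magnitude, positive or negative, encode in few bytes
- **Delta encoding**: for sorted sequences, stores the differences between consecutive elements instead of the values themselves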
- **Slice encoding/decoding**: batch operations
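The following sketch shows how these pieces fit together: unsigned LEB128, zigzag mapping, and delta encoding of a sorted slice. All function names are hypothetical and the error handling is deliberately minimal; it illustrates the techniques rather than the project's implementation.

```rust
/// Append `value` as unsigned LEB128: 7 payload bits per byte,
/// high bit set on every byte except the last.
fn encode_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Decode one varint from the front of `bytes`.
/// Returns the value and the number of bytes consumed, or `None`
/// if the input is truncated or longer than 10 bytes.
fn decode_varint(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None;
        }
        value |= u64::from(b & 0x7F) << shift;
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Zigzag: 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
fn zigzag_encode(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

fn zigzag_decode(z: u64) -> i64 {
    ((z >> 1) as i64) ^ -((z & 1) as i64)
}

/// Delta-encode an ascending slice: the first value as-is, then gaps.
/// The input must be sorted, otherwise the subtraction underflows.
fn delta_encode(sorted: &[u64]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    for &v in sorted {
        encode_varint(v - prev, &mut out);
        prev = v;
    }
    out
}

/// Inverse of `delta_encode`: accumulate the decoded gaps.
fn delta_decode(mut bytes: &[u8]) -> Option<Vec<u64>> {
    let mut out = Vec::new();
    let mut prev = 0u64;
    while !bytes.is_empty() {
        let (gap, used) = decode_varint(bytes)?;
        prev += gap;
        out.push(prev);
        bytes = &bytes[used..];
    }
    Some(out)
}
```

For example, the sorted values 1000, 1003, 1010 become the gaps 1000, 3, 7, which take 2 + 1 + 1 = 4 bytes instead of 2 bytes each.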