Data structures

2-bit DNA encoding

DNA bases are packed 32 per u64 word:

Base

Encoding

A

00

C

01

G

10

T

11

Operations:

  • Encode: 10,000 bases in ~2 microseconds

  • Decode: 10,000 bases in ~3 microseconds

  • Reverse complement: bitwise NOT + reverse pairs

  • K-mer extraction: shift and mask operations

Ambiguous bases (N, R, Y, etc.) are encoded as A (00).

Roaring Bitmaps

Used for the colour index (mapping unitigs to genome sets). Roaring Bitmaps partition the integer universe into blocks of 2^16, using the optimal representation per block:

  • Array container: for sparse blocks (<4,096 integers)

  • Bitmap container: for dense blocks (>4,096 integers)

  • Run container: for blocks with long consecutive runs

This is ideal for genome ID sets, which tend to be clustered by species.

Elias-Fano cumulative length index

Maps FM-index text positions to unitig IDs via binary search on a sorted array of cumulative start positions. Current implementation uses a simple Vec<u64> with binary search; production version can upgrade to a true Elias-Fano structure for O(1) predecessor queries.

Fenwick tree (Binary Indexed Tree)

Used in colinear chaining for prefix maximum queries:

  • Update: set position i to max(current, value) in O(log n)

  • Query: find maximum in prefix [0, i] in O(log n)

  • Space: O(n)

Two variants:

  • FenwickMax: prefix maximum (used in chaining)

  • FenwickSum: prefix sum (used in vote counting)

Variable-length integers (varint)

LEB128 encoding for genome path compression:

Value range

Bytes

0-127

1

128-16,383

2

16,384-2,097,151

3

2,097,152-268,435,455

4

Also supports:

  • Zigzag encoding: efficient encoding of signed integers

  • Delta encoding: for sorted sequences (store differences)

  • Slice encoding/decoding: batch operations