Run-length FM-index

Background

The FM-index is a compressed full-text index based on the Burrows-Wheeler Transform (BWT). Given a text T of length n, it supports:

Count: how many times pattern P occurs in T, in O(|P|) time
Locate: the positions of all occurrences, in O(|P| + occ) time
Space: O(n) for standard FM-index, O(r) for run-length variant

Why run-length?

The run-length FM-index (r-index) stores the BWT in run-length encoded form. The number of BWT runs r is the key parameter:

For random text: r ~ n (no compression)
For repetitive text: r << n (massive compression)

Prokaryotic genome collections are highly repetitive: related species share >95% of their sequence. When their unitig sequences are concatenated, the BWT contains long runs of identical characters.

Database	Text length n	BWT runs r	r/n	Compression
500 E. coli	~2.5 Gbp	~50M	0.02	50x
85K GTDB	~250 Gbp	~2.5B	0.01	100x
2.34M GenBank	~5 Gbp unitigs	~250M	0.05	20x

Construction

Concatenate all unitig sequences with separator characters ($)
Build the suffix array using SA-IS (linear time)
Derive the BWT from the suffix array
Run-length encode the BWT
Sample the suffix array at run boundaries

Backward search

To search for pattern P[1..m]:

Initialise suffix array interval [lo, hi) to the full range
Process P from right to left: for each character P[i], update [lo, hi) using the LF-mapping
After m steps, hi - lo = number of occurrences

Variable-length seed matching

Unlike fixed k-mer lookup, Dragon extends each backward search beyond k characters as long as the SA interval remains non-empty:

Query:  ACGTACGTACGTNNNN...
Search: ACGTACGTACGT       (12 chars, SA interval = [50, 55))
        ACGTACGTACGTN      (13 chars, SA interval = empty -> stop)
Result: 12-mer match with 5 occurrences

This produces super-maximal exact matches (SMEMs) of variable length, which are more informative than fixed-length k-mers.

Position-to-unitig mapping

FM-index positions are in the concatenated text. Dragon maps them back to unitig IDs using a cumulative length index (Elias-Fano encoded monotone sequence):

Position 12345 in text
  -> binary search on cumulative lengths
  -> unitig_id = 42, offset = 789