# Run-length FM-index ## Background The FM-index is a compressed full-text index based on the Burrows-Wheeler Transform (BWT). Given a text T of length n, it supports: - **Count**: how many times pattern P occurs in T, in O(|P|) time - **Locate**: the positions of all occurrences, in O(|P| + occ) time - **Space**: O(n) for standard FM-index, O(r) for run-length variant ## Why run-length? The **run-length FM-index** (r-index) stores the BWT in run-length encoded form. The number of BWT runs **r** is the key parameter: - For random text: r ~ n (no compression) - For repetitive text: r << n (massive compression) Prokaryotic genome collections are **highly repetitive**: related species share >95% of their sequence. When their unitig sequences are concatenated, the BWT contains long runs of identical characters. | Database | Text length n | BWT runs r | r/n | Compression | |----------|--------------|-----------|-----|-------------| | 500 E. coli | ~2.5 Gbp | ~50M | 0.02 | 50x | | 85K GTDB | ~250 Gbp | ~2.5B | 0.01 | 100x | | 2.34M GenBank | ~5 Gbp unitigs | ~250M | 0.05 | 20x | ## Construction 1. **Concatenate** all unitig sequences with separator characters (`$`) 2. **Build** the suffix array using SA-IS (linear time) 3. **Derive** the BWT from the suffix array 4. **Run-length encode** the BWT 5. **Sample** the suffix array at run boundaries ## Backward search To search for pattern P[1..m]: 1. Initialise suffix array interval [lo, hi) to the full range 2. Process P from right to left: for each character P[i], update [lo, hi) using the LF-mapping 3. After m steps, `hi - lo` = number of occurrences ## Variable-length seed matching Unlike fixed k-mer lookup, Dragon extends each backward search beyond k characters as long as the SA interval remains non-empty: ``` Query: ACGTACGTACGTNNNN... Search: ACGTACGTACGT (12 chars, SA interval = [50, 55)) ACGTACGTACGTN (13 chars, SA interval = empty -> stop) Result: 12-mer match with 5 occurrences ``` This produces **super-maximal exact matches (SMEMs)** of variable length, which are more informative than fixed-length k-mers. ## Position-to-unitig mapping FM-index positions are in the concatenated text. Dragon maps them back to unitig IDs using a **cumulative length index** (Elias-Fano encoded monotone sequence): ``` Position 12345 in text -> binary search on cumulative lengths -> unitig_id = 42, offset = 789 ```