# Graph-aware colinear chaining ## The chaining problem After seed finding, Dragon has a set of seed hits: `{(query_pos, unitig_id, offset, match_len)}`. The goal is to find the **best colinear subset** — a chain of seeds that are consistent in both query and reference order, representing a contiguous alignment. ## Why graph-aware? Traditional tools (Minimap2, LexicMap) chain seeds in **linear coordinate space** — assuming the reference is a flat string. Dragon chains seeds along the **genome's path through the de Bruijn graph**, which: 1. Naturally handles structural variation (alternative paths in the graph) 2. Respects the actual genome structure 3. Avoids spurious chains across unrelated genomic regions ## Algorithm ### Step 1: Map seeds to genome coordinates For each candidate genome, Dragon loads its path (ordered sequence of unitig IDs) and maps each seed hit to a genome-level coordinate: ``` seed at (unitig_42, offset_100) + genome path: [..., unitig_42 at genome_pos_50000, ...] = anchor at (query_pos, genome_pos_50100, match_len) ``` ### Step 2: Sort anchors by reference position Anchors are sorted by their genome coordinate, separating forward and reverse-complement hits. ### Step 3: Fenwick tree DP The colinear chaining DP finds the highest-scoring subset of anchors where each anchor starts after the previous one ends in both query and reference: ``` For each anchor a[i] (left to right by ref position): score[i] = match_len[i] + max over valid predecessors j of score[j] where: a[j].query_end <= a[i].query_start a[j].ref_end <= a[i].ref_start ``` Using a Fenwick tree (binary indexed tree) indexed by query position, this runs in **O(h log h)** time where h is the number of anchors. ### Step 4: Gap-sensitive scoring The chain score includes a gap penalty proportional to the difference between reference and query gap lengths: ``` gap_cost(a[i], a[j]) = alpha * |ref_gap - query_gap| ``` This penalises indels and structural variations proportional to their size. ## Complexity | Step | Time | Space | |------|------|-------| | Map to genome coordinates | O(h) | O(h) | | Sort | O(h log h) | O(h) | | Fenwick tree DP | O(h log h) | O(h) | | Traceback | O(chain_len) | O(chain_len) | | **Total** | **O(h log h)** | **O(h)** | For a typical gene query with h ~ 100 anchors, chaining takes <1 ms per genome.