Graph-aware colinear chaining

The chaining problem

After seed finding, Dragon has a set of seed hits: {(query_pos, unitig_id, offset, match_len)}. The goal is to find the best colinear subset — a chain of seeds that are consistent in both query and reference order, representing a contiguous alignment.

Why graph-aware?

Traditional tools (Minimap2, LexicMap) chain seeds in linear coordinate space — assuming the reference is a flat string. Dragon chains seeds along the genome’s path through the de Bruijn graph, which:

Naturally handles structural variation (alternative paths in the graph)
Respects the actual genome structure
Avoids spurious chains across unrelated genomic regions

Algorithm

Step 1: Map seeds to genome coordinates

For each candidate genome, Dragon loads its path (ordered sequence of unitig IDs) and maps each seed hit to a genome-level coordinate:

seed at (unitig_42, offset_100)
  + genome path: [..., unitig_42 at genome_pos_50000, ...]
  = anchor at (query_pos, genome_pos_50100, match_len)

Step 2: Sort anchors by reference position

Anchors are sorted by their genome coordinate, separating forward and reverse-complement hits.

Step 3: Fenwick tree DP

The colinear chaining DP finds the highest-scoring subset of anchors where each anchor starts after the previous one ends in both query and reference:

For each anchor a[i] (left to right by ref position):
    score[i] = match_len[i]  +  max over valid predecessors j of score[j]
    where: a[j].query_end <= a[i].query_start
           a[j].ref_end   <= a[i].ref_start

Using a Fenwick tree (binary indexed tree) indexed by query position, this runs in O(h log h) time where h is the number of anchors.

Step 4: Gap-sensitive scoring

The chain score includes a gap penalty proportional to the difference between reference and query gap lengths:

gap_cost(a[i], a[j]) = alpha * |ref_gap - query_gap|

This penalises indels and structural variations proportional to their size.

Complexity

Step	Time	Space
Map to genome coordinates	O(h)	O(h)
Sort	O(h log h)	O(h)
Fenwick tree DP	O(h log h)	O(h)
Traceback	O(chain_len)	O(chain_len)
Total	O(h log h)	O(h)

For a typical gene query with h ~ 100 anchors, chaining takes <1 ms per genome.