Graph-aware colinear chaining
The chaining problem
After seed finding, Dragon has a set of seed hits: {(query_pos, unitig_id, offset, match_len)}. The goal is to find the best colinear subset — a chain of seeds that are consistent in both query and reference order, representing a contiguous alignment.
Why graph-aware?
Traditional tools (Minimap2, LexicMap) chain seeds in linear coordinate space — assuming the reference is a flat string. Dragon chains seeds along the genome’s path through the de Bruijn graph, which:
Naturally handles structural variation (alternative paths in the graph)
Respects the actual genome structure
Avoids spurious chains across unrelated genomic regions
Algorithm
Step 1: Map seeds to genome coordinates
For each candidate genome, Dragon loads its path (ordered sequence of unitig IDs) and maps each seed hit to a genome-level coordinate:
seed at (unitig_42, offset_100)
+ genome path: [..., unitig_42 at genome_pos_50000, ...]
= anchor at (query_pos, genome_pos_50100, match_len)
Step 2: Sort anchors by reference position
Anchors are sorted by their genome coordinate, separating forward and reverse-complement hits.
Step 3: Fenwick tree DP
The colinear chaining DP finds the highest-scoring subset of anchors where each anchor starts after the previous one ends in both query and reference:
For each anchor a[i] (left to right by ref position):
score[i] = match_len[i] + max over valid predecessors j of score[j]
where: a[j].query_end <= a[i].query_start
a[j].ref_end <= a[i].ref_start
Using a Fenwick tree (binary indexed tree) indexed by query position, this runs in O(h log h) time where h is the number of anchors.
Step 4: Gap-sensitive scoring
The chain score includes a gap penalty proportional to the difference between reference and query gap lengths:
gap_cost(a[i], a[j]) = alpha * |ref_gap - query_gap|
This penalises indels and structural variations proportional to their size.
Complexity
Step |
Time |
Space |
|---|---|---|
Map to genome coordinates |
O(h) |
O(h) |
Sort |
O(h log h) |
O(h) |
Fenwick tree DP |
O(h log h) |
O(h) |
Traceback |
O(chain_len) |
O(chain_len) |
Total |
O(h log h) |
O(h) |
For a typical gene query with h ~ 100 anchors, chaining takes <1 ms per genome.