Graph-aware colinear chaining

The chaining problem

After seed finding, Dragon has a set of seed hits: {(query_pos, unitig_id, offset, match_len)}. The goal is to find the best colinear subset — a chain of seeds that are consistent in both query and reference order, representing a contiguous alignment.

Why graph-aware?

Traditional tools (Minimap2, LexicMap) chain seeds in linear coordinate space — assuming the reference is a flat string. Dragon chains seeds along the genome’s path through the de Bruijn graph, which:

  1. Naturally handles structural variation (alternative paths in the graph)

  2. Respects the actual genome structure

  3. Avoids spurious chains across unrelated genomic regions

Algorithm

Step 1: Map seeds to genome coordinates

For each candidate genome, Dragon loads its path (ordered sequence of unitig IDs) and maps each seed hit to a genome-level coordinate:

seed at (unitig_42, offset_100)
  + genome path: [..., unitig_42 at genome_pos_50000, ...]
  = anchor at (query_pos, genome_pos_50100, match_len)

Step 2: Sort anchors by reference position

Anchors are sorted by their genome coordinate, separating forward and reverse-complement hits.

Step 3: Fenwick tree DP

The colinear chaining DP finds the highest-scoring subset of anchors where each anchor starts after the previous one ends in both query and reference:

For each anchor a[i] (left to right by ref position):
    score[i] = match_len[i]  +  max over valid predecessors j of score[j]
    where: a[j].query_end <= a[i].query_start
           a[j].ref_end   <= a[i].ref_start

Using a Fenwick tree (binary indexed tree) indexed by query position, this runs in O(h log h) time where h is the number of anchors.

Step 4: Gap-sensitive scoring

The chain score includes a gap penalty proportional to the difference between reference and query gap lengths:

gap_cost(a[i], a[j]) = alpha * |ref_gap - query_gap|

This penalises indels and structural variations proportional to their size.

Complexity

Step

Time

Space

Map to genome coordinates

O(h)

O(h)

Sort

O(h log h)

O(h)

Fenwick tree DP

O(h log h)

O(h)

Traceback

O(chain_len)

O(chain_len)

Total

O(h log h)

O(h)

For a typical gene query with h ~ 100 anchors, chaining takes <1 ms per genome.