Searching
Overview
Dragon’s query pipeline finds sequences in the indexed database through four stages:
1. Seed finding: FM-index backward search with variable-length extension
2. Candidate filtering: colour-based voting to identify promising genomes
3. Colinear chaining: graph-aware dynamic programming to find the best alignment chain
4. Alignment: banded wavefront alignment for base-level accuracy
Command
dragon search [OPTIONS] --index <DIR> --query <FILE>
Required arguments
| Argument | Description |
|---|---|
| --index | Path to Dragon index directory |
| --query | Query FASTA or FASTQ file |
Optional arguments
| Argument | Default | Description |
|---|---|---|
|  |  | Output file path |
|  |  | Output format: |
|  | 4 | Number of parallel threads |
| --max-ram | 4.0 | Maximum RAM budget in GB |
|  | 15 | Minimum seed match length |
| --max-seed-freq | 10,000 | Skip seeds more frequent than this |
|  | 50 | Minimum chain score to report |
|  | 100 | Maximum target genomes per query |
Query types
Dragon handles various query types:
| Query type | Typical length | Recommended parameters |
|---|---|---|
| Single gene | 500-5,000 bp | defaults |
| 16S rRNA | ~1,500 bp |  |
| Plasmid | 2,000-200,000 bp | defaults |
| Long read (ONT/PacBio) | 1,000-50,000 bp |  |
| AMR gene panel | batch of 1,000+ |  |
Seed finding details
Dragon uses variable-length seed matching: rather than searching for fixed k-mers, it extends each FM-index backward search character by character until a further extension would leave the suffix array interval empty, and keeps the longest match found. This naturally produces longer, more specific seeds in conserved regions and shorter seeds in variable regions.
Seeds with suffix array interval width exceeding --max-seed-freq are discarded as too repetitive (e.g., rRNA-derived seeds that match millions of genomes).
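The extension and frequency filter described above can be sketched with a toy FM-index. This is a minimal illustration, not Dragon's implementation: the function names `build_fm_index` and `longest_seed_ending_at` are hypothetical, the index is built naively by sorting suffixes, and `min_len` and `max_freq` only mirror the roles of the minimum seed length and --max-seed-freq options.

```python
def build_fm_index(text):
    """Naive FM-index over text (fine for toy inputs only)."""
    text += "$"  # sentinel, assumed smaller than every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # C[c]: number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, b in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return sa, C, occ


def longest_seed_ending_at(query, end, fm, min_len=3, max_freq=10_000):
    """Extend a match leftwards from query[end] until the interval would empty.

    Returns (start, length, sorted text positions), or None if the seed is
    too short or too frequent.
    """
    sa, C, occ = fm
    lo, hi = 0, len(sa)  # half-open suffix array interval
    start = end
    while start > 0:
        c = query[start - 1]
        if c not in C:  # character absent from the indexed text
            break
        new_lo = C[c] + occ[c][lo]
        new_hi = C[c] + occ[c][hi]
        if new_lo >= new_hi:  # one more character would empty the interval
            break
        lo, hi = new_lo, new_hi
        start -= 1
    length = end - start
    if length < min_len or hi - lo > max_freq:
        return None
    return start, length, sorted(sa[lo:hi])


fm = build_fm_index("ACGTACGTTACG")
seed = longest_seed_ending_at("GGACGTT", 7, fm)  # (2, 5, [4])
```

In the example, the match grows to `ACGTT` and stops because `GACGTT` does not occur in the indexed text. Lowering `max_freq` below the interval width makes the function discard a seed instead of returning it.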
Candidate filtering
For each seed hit, Dragon retrieves the unitig’s Roaring Bitmap to identify which genomes contain it. Genomes accumulate “votes” proportional to the number of distinct unitig hits. Only genomes exceeding a vote threshold proceed to the chaining stage, so chaining and alignment run only on genomes with sufficient seed support.
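The voting step can be sketched as follows. This is a simplified illustration under stated assumptions: plain Python sets stand in for Roaring Bitmaps, each distinct unitig contributes one vote per genome, and the names `vote_for_genomes`, `unitig_colours`, and `vote_threshold` are hypothetical rather than part of Dragon.

```python
from collections import Counter


def vote_for_genomes(seed_unitigs, unitig_colours, vote_threshold):
    """Count one vote per distinct unitig hit for every genome containing it.

    seed_unitigs: unitig IDs hit by seeds (may contain repeats).
    unitig_colours: maps unitig ID -> set of genome IDs (bitmap stand-in).
    Returns {genome: votes} for genomes whose votes exceed the threshold.
    """
    votes = Counter()
    for unitig in set(seed_unitigs):  # repeats of a unitig count once
        votes.update(unitig_colours.get(unitig, ()))
    return {g: v for g, v in votes.items() if v > vote_threshold}


unitig_colours = {1: {0, 1, 2}, 2: {1, 2}, 3: {2}}
candidates = vote_for_genomes([1, 2, 2, 3], unitig_colours, vote_threshold=1)
# candidates == {1: 2, 2: 3}
```

Genome 0 is supported by a single unitig and is dropped, so only genomes 1 and 2 would be passed on to chaining.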
Chaining
Dragon performs colinear chaining along each candidate genome’s path through the de Bruijn graph, not in linear coordinate space. This provides two advantages:
- Seeds are chained in graph topology order, naturally handling structural rearrangements
- The Fenwick tree DP runs in O(h log h) time where h is the number of anchors
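To show where the O(h log h) bound comes from, here is a minimal sketch of colinear chaining with a max-Fenwick tree. It makes several simplifying assumptions that may differ from Dragon: each anchor is a `(query_pos, path_pos, weight)` point with a positive weight, `path_pos` is an offset along one candidate genome's path, a chain must strictly increase in both coordinates, and there are no gap or overlap penalties. The names `MaxFenwick` and `best_chain_score` are hypothetical.

```python
class MaxFenwick:
    """Fenwick tree supporting point updates and prefix-maximum queries."""

    def __init__(self, size):
        self.tree = [0] * (size + 1)  # 1-based; 0 means "no chain yet"

    def update(self, i, value):
        while i < len(self.tree):
            self.tree[i] = max(self.tree[i], value)
            i += i & -i

    def query(self, i):
        """Maximum over positions 1..i (0 when i == 0)."""
        best = 0
        while i > 0:
            best = max(best, self.tree[i])
            i -= i & -i
        return best


def best_chain_score(anchors):
    """Best total weight of a chain increasing in query_pos and path_pos."""
    # Compress path positions to ranks 1..m so they can index the tree.
    ranks = {t: r + 1 for r, t in enumerate(sorted({a[1] for a in anchors}))}
    tree = MaxFenwick(len(ranks))
    best = 0
    # Ties in query_pos are visited in descending path_pos, so anchors that
    # share a query position can never chain onto each other.
    for q, t, w in sorted(anchors, key=lambda a: (a[0], -a[1])):
        score = tree.query(ranks[t] - 1) + w  # best predecessor, O(log h)
        tree.update(ranks[t], score)
        best = max(best, score)
    return best


anchors = [(0, 0, 5), (10, 10, 5), (20, 5, 7), (30, 20, 5)]
score = best_chain_score(anchors)  # 17
```

Sorting costs O(h log h) and each anchor then needs one query and one update, both O(log h). In the example, the best chain skips the anchor at path position 10 in favour of the heavier anchor at path position 5.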
Memory management
Dragon keeps query-time RAM below --max-ram via:
- Memory-mapped index files: only active pages consume RAM
- On-demand colour loading: Roaring Bitmaps deserialised per-unitig
- Streaming query processing: queries processed one at a time
- Configurable batch size: parallel queries share the same index
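The first point, memory-mapped access, can be illustrated with Python's standard `mmap` module. This is a generic sketch and says nothing about Dragon's actual on-disk layout: the fixed-width 8-byte record format, the file name `demo.idx`, and the helper names are all hypothetical.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<Q")  # hypothetical record: one little-endian uint64


def write_records(path, values):
    """Write values as consecutive fixed-width records."""
    with open(path, "wb") as fh:
        for v in values:
            fh.write(RECORD.pack(v))


def read_record(path, index):
    """Read one record through a read-only memory map.

    The OS pages in only the part of the file that is touched, so the
    resident memory stays far below the file size for sparse access.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return RECORD.unpack_from(mm, index * RECORD.size)[0]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.idx")
    write_records(demo_path, [10, 20, 30, 40])
    third = read_record(demo_path, 2)  # 30
```

A real index would keep the mapping open across queries instead of reopening the file per lookup; the per-call mapping here only keeps the example short.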