Searching

Overview

Dragon’s query pipeline finds sequences in the indexed database through four stages:

Seed finding — FM-index backward search with variable-length extension
Candidate filtering — colour-based voting to identify promising genomes
Colinear chaining — graph-aware dynamic programming to find the best alignment chain
Alignment — banded wavefront alignment for base-level accuracy

Command

dragon search [OPTIONS] --index <DIR> --query <FILE>

Required arguments

Argument	Description
`--index`, `-i`	Path to Dragon index directory
`--query`, `-q`	Query FASTA or FASTQ file

Optional arguments

Argument	Default	Description
`--output`, `-o`	`-` (stdout)	Output file path
`--format`, `-f`	`paf`	Output format: `paf` or `blast6`
`--threads`, `-j`	4	Number of parallel threads
`--max-ram`	4.0	Maximum RAM budget in GB
`--min-seed-len`	15	Minimum seed match length
`--max-seed-freq`	10,000	Skip seeds more frequent than this
`--min-chain-score`	50	Minimum chain score to report
`--max-target-seqs`	100	Maximum target genomes per query

Query types

Dragon handles various query types:

Query type	Typical length	Recommended parameters
Single gene	500-5,000 bp	defaults
16S rRNA	~1,500 bp	`--max-seed-freq 50000` (highly conserved)
Plasmid	2,000-200,000 bp	defaults
Long read (ONT/PacBio)	1,000-50,000 bp	`--min-seed-len 12`
AMR gene panel	batch of 1,000+	`--threads 16`

Seed finding details

Dragon uses variable-length seed matching: rather than searching for fixed k-mers, it extends each FM-index backward search character by character until the suffix array interval is empty. This naturally produces longer, more specific seeds in conserved regions and shorter seeds in variable regions.

Seeds with suffix array interval width exceeding --max-seed-freq are discarded as too repetitive (e.g., rRNA-derived seeds that match millions of genomes).

Candidate filtering

For each seed hit, Dragon retrieves the unitig’s Roaring Bitmap to identify which genomes contain it. Genomes accumulate “votes” proportional to the number of distinct unitig hits. Only genomes exceeding a vote threshold proceed to the chaining stage, dramatically reducing computation.

Chaining

Dragon performs colinear chaining along each candidate genome’s path through the de Bruijn graph, not in linear coordinate space. This provides two advantages:

Seeds are chained in graph topology order, naturally handling structural rearrangements
The Fenwick tree DP runs in O(h log h) time where h is the number of anchors

Memory management

Dragon keeps query-time RAM below --max-ram via:

Memory-mapped index files — only active pages consume RAM
On-demand colour loading — Roaring Bitmaps deserialised per-unitig
Streaming query processing — queries processed one at a time
Configurable batch size — parallel queries share the same index