Searching

Overview

Dragon’s query pipeline finds sequences in the indexed database through four stages:

  1. Seed finding — FM-index backward search with variable-length extension

  2. Candidate filtering — colour-based voting to identify promising genomes

  3. Colinear chaining — graph-aware dynamic programming to find the best alignment chain

  4. Alignment — banded wavefront alignment for base-level accuracy

Command

dragon search [OPTIONS] --index <DIR> --query <FILE>

Required arguments

Argument       Description
--index, -i    Path to Dragon index directory
--query, -q    Query FASTA or FASTQ file

Optional arguments

Argument             Default      Description
--output, -o         - (stdout)   Output file path
--format, -f         paf          Output format: paf or blast6
--threads, -j        4            Number of parallel threads
--max-ram            4.0          Maximum RAM budget in GB
--min-seed-len       15           Minimum seed match length
--max-seed-freq      10000        Skip seeds more frequent than this
--min-chain-score    50           Minimum chain score to report
--max-target-seqs    100          Maximum target genomes per query

Query types

Recommended parameters depend on the kind of query:

Query type                Typical length      Recommended parameters
Single gene               500-5,000 bp        defaults
16S rRNA                  ~1,500 bp           --max-seed-freq 50000 (highly conserved)
Plasmid                   2,000-200,000 bp    defaults
Long read (ONT/PacBio)    1,000-50,000 bp     --min-seed-len 12
AMR gene panel            batch of 1,000+     --threads 16
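The recommendations above can be turned into a command line mechanically. The following is a minimal sketch that only assembles the argument list; the query-type keys and the file names are hypothetical labels chosen for illustration, while the flags and values come from the tables in this section.

```python
import shlex

# Extra flags per query type, taken from the table above.
# The dictionary keys are illustrative labels, not Dragon identifiers.
RECOMMENDED = {
    "single_gene": [],
    "16s_rrna": ["--max-seed-freq", "50000"],
    "plasmid": [],
    "long_read": ["--min-seed-len", "12"],
    "amr_panel": ["--threads", "16"],
}


def dragon_search_command(index_dir, query_file, query_type="single_gene"):
    """Build the argv list for `dragon search` with the recommended extras."""
    return [
        "dragon", "search",
        "--index", index_dir,
        "--query", query_file,
        *RECOMMENDED[query_type],
    ]


if __name__ == "__main__":
    # Paths are placeholders; pass the list to subprocess.run() to execute it.
    print(shlex.join(dragon_search_command("my_index", "16s.fasta", "16s_rrna")))
```

Building the command as a list rather than a string avoids shell-quoting problems when paths contain spaces.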

Seed finding details

Dragon uses variable-length seed matching: rather than searching for fixed k-mers, it extends each FM-index backward search one character at a time and stops when a further extension would leave the suffix array interval empty, keeping the longest match found. This naturally produces longer, more specific seeds in conserved regions and shorter seeds in variable regions.

Seeds with suffix array interval width exceeding --max-seed-freq are discarded as too repetitive (e.g., rRNA-derived seeds that match millions of genomes).
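To make the idea concrete, here is a minimal sketch of variable-length seeding over a toy FM-index. It is an illustration under simplifying assumptions, not Dragon's implementation: the index is built naively by sorting suffixes, seeds are taken greedily from the end of the query without overlap, and all function names are hypothetical. Only the two parameters min_seed_len and max_seed_freq mirror the documented options.

```python
def build_fm_index(text):
    """Naive FM-index over text + '$' (fine for a demo, not for real genomes)."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[i - 1] for i in sa]  # text[-1] is '$' for the suffix at 0
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, b in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return sa, C, occ


def longest_seed_ending_at(query, end, C, occ, n):
    """Extend backward from query[end - 1] while the SA interval is non-empty."""
    lo, hi, start = 0, n, end  # half-open interval for the empty pattern
    while start > 0:
        c = query[start - 1]
        if c not in C:
            break
        new_lo, new_hi = C[c] + occ[c][lo], C[c] + occ[c][hi]
        if new_lo >= new_hi:  # extension would match nothing: stop here
            break
        lo, hi, start = new_lo, new_hi, start - 1
    return start, lo, hi


def find_seeds(query, text, min_seed_len=15, max_seed_freq=10_000):
    """Return (query_start, length, sorted text positions) for each kept seed."""
    sa, C, occ = build_fm_index(text)
    seeds, end = [], len(query)
    while end > 0:
        start, lo, hi = longest_seed_ending_at(query, end, C, occ, len(sa))
        length = end - start
        # Keep seeds that are long enough and not too frequent.
        if length >= min_seed_len and hi - lo <= max_seed_freq:
            seeds.append((start, length, sorted(sa[lo:hi])))
        end = start if length > 0 else end - 1
    return seeds


if __name__ == "__main__":
    print(find_seeds("TTACGTAGXX", "ACGTTGCAACGTAGGC", min_seed_len=5))
```

The interval width hi - lo is the number of occurrences of the seed in the indexed text, which is why it can be compared directly against the frequency cut-off.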

Candidate filtering

For each seed hit, Dragon retrieves the unitig’s Roaring Bitmap to identify which genomes contain it. Each genome accumulates “votes” in proportion to the number of distinct unitigs hit. Only genomes exceeding a vote threshold proceed to the chaining stage, so the more expensive chaining and alignment stages run only on those candidates.
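The voting step can be sketched as follows. This is an illustration only: plain Python sets stand in for Roaring Bitmaps, the function name and the vote_threshold value are hypothetical, and each distinct unitig contributes exactly one vote per genome that contains it.

```python
from collections import Counter


def vote_candidates(seed_unitigs, unitig_colours, vote_threshold=1):
    """Return (genome, votes) pairs whose votes exceed the threshold.

    seed_unitigs   : unitig IDs hit by seeds (may contain repeats)
    unitig_colours : unitig ID -> set of genome IDs (stand-in for a bitmap)
    """
    votes = Counter()
    for unitig in set(seed_unitigs):  # count each distinct unitig once
        votes.update(unitig_colours.get(unitig, ()))
    kept = [(g, v) for g, v in votes.items() if v > vote_threshold]
    # Most-voted genomes first; ties broken by genome ID for determinism.
    return sorted(kept, key=lambda gv: (-gv[1], gv[0]))


if __name__ == "__main__":
    colours = {1: {0, 1, 2}, 2: {1, 2}, 3: {2}}
    print(vote_candidates([1, 2, 2, 3], colours))
```

In this toy input, genome 0 is hit by one unitig only and is dropped, while genomes 1 and 2 pass to the next stage.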

Chaining

Dragon performs colinear chaining along each candidate genome’s path through the de Bruijn graph, not in linear coordinate space. This provides two advantages:

  1. Seeds are chained in graph topology order, naturally handling structural rearrangements

  2. The Fenwick tree DP runs in O(h log h) time where h is the number of anchors
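The O(h log h) bound comes from replacing the inner loop of the chaining DP with a prefix-maximum query on a Fenwick tree. The sketch below shows that technique in a deliberately simplified setting: anchors are chained in linear target coordinates rather than along a graph path, the score is just the sum of anchor lengths with no gap or overlap penalty, and the class and function names are hypothetical.

```python
class MaxFenwick:
    """Fenwick tree supporting point updates and prefix-maximum queries."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def update(self, i, value):  # i is 1-based
        while i < len(self.tree):
            self.tree[i] = max(self.tree[i], value)
            i += i & -i

    def prefix_max(self, i):  # maximum over positions 1..i (0 if i == 0)
        best = 0
        while i > 0:
            best = max(best, self.tree[i])
            i -= i & -i
        return best


def best_chain_score(anchors):
    """anchors: iterable of (query_pos, target_pos, length) tuples.

    Returns the best total length of a chain that is strictly increasing
    in both query and target position.
    """
    anchors = list(anchors)
    # Compress target positions to ranks 1..m for the Fenwick tree.
    ranks = {t: r + 1 for r, t in enumerate(sorted({t for _, t, _ in anchors}))}
    fenwick = MaxFenwick(len(ranks))
    best = 0
    # Sort by query position; among equal query positions visit larger
    # targets first so anchors sharing a query position never chain together.
    for q, t, length in sorted(anchors, key=lambda a: (a[0], -a[1])):
        score = fenwick.prefix_max(ranks[t] - 1) + length
        fenwick.update(ranks[t], score)
        best = max(best, score)
    return best


if __name__ == "__main__":
    print(best_chain_score([(0, 100, 20), (30, 130, 15), (50, 90, 40), (60, 160, 10)]))
```

Each anchor costs one query and one update, both O(log h), plus the initial sort, which gives the O(h log h) total.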

Memory management

Dragon keeps query-time RAM below --max-ram via:

  • Memory-mapped index files — only active pages consume RAM

  • On-demand colour loading — Roaring Bitmaps deserialised per-unitig

  • Streaming query processing — queries processed one at a time

  • Configurable batch size — parallel queries share the same index
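The first point, memory-mapped files, can be illustrated with Python's standard mmap module. This is a generic sketch of the technique, not Dragon's index format: the file written by demo() is a stand-in, and the function names and the marker bytes are hypothetical.

```python
import mmap
import os
import tempfile


def read_record(path, offset, length):
    """Return `length` bytes at `offset` without loading the whole file."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested bytes; the OS pages in
            # just the part of the file that is touched.
            return mm[offset:offset + length]


def demo():
    """Write a 1 MiB stand-in file with a marker in the middle and read it."""
    half = 512 * 1024
    marker = b"UNITIG42"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * half + marker + b"\0" * (half - len(marker)))
        path = tmp.name
    try:
        return read_record(path, half, len(marker))
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Because the mapping is read-only and backed by the file, the operating system is free to evict untouched or cold pages, which is what lets resident memory stay well below the on-disk index size.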