GitHub - sequana/lora: Run assembler (Canu, flye, hifiasm) on a set of long read files

LORA — Long Read Assembly pipeline

Overview:	Assemble long reads (PacBio HiFi, PacBio subreads, Nanopore) into high-quality genome assemblies with optional polishing, annotation, and quality assessment.
Input:	BAM files from PacBio sequencers, or FastQ files from Nanopore or PacBio HiFi sequencers.
Output:	HTML reports with per-sample assembly statistics, coverage, BLAST identification, BUSCO scores, and optional annotation.
Status:	Production
Citation:	Cokelaer et al. (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, doi:10.21105/joss.00352. bioRxiv: https://www.biorxiv.org/content/10.64898/2026.01.06.697901v1

Installation

pip install sequana-lora

To upgrade an existing installation:

pip install sequana-lora --upgrade

Quick Start

Step 1 — prepare the working directory:

sequana_lora \
    --input-directory /path/to/reads \
    --data-type pacbio-hifi \
    --assembler flye \
    --genome-size 3m \
    --apptainer-prefix /path/to/containers

This creates a lora/ working directory containing config.yaml and a lora.sh launch script.

Step 2 — review the configuration (optional but recommended):

cd lora
cat config.yaml   # adjust parameters as needed

Step 3 — run the pipeline:

sh lora.sh

Or launch directly from step 1 with --execute (skips the review step):

sequana_lora ... --execute

To watch live progress in the terminal, add --monitor:

sequana_lora ... --execute --monitor

Required options

Three options are always required:

--assembler

Assembler to use. Choices: flye (recommended for HiFi), canu, hifiasm, unicycler, necat, pecat.

--data-type

Technology and quality of the input reads:

Value	Description
pacbio-hifi	PacBio HiFi / CCS reads (Q20+)
pacbio-raw	PacBio CLR / subreads (raw)
pacbio-corr	PacBio corrected reads
nano-hq	Nanopore Q20+ reads (e.g. R10.4 with SUP basecalling)
nano-raw	Nanopore standard reads
nano-corr	Nanopore corrected reads

--genome-size

Estimated genome size, e.g. 3m (3 Mb), 2.5g (2.5 Gb). Required by Flye; used by Canu for coverage reporting.
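The size notation can be illustrated with a small sketch that converts a string such as 3m or 2.5g into a number of base pairs. The helper name genome_size_to_bp is hypothetical, and the k suffix is an assumption (only m and g appear above); LORA and the assemblers do their own parsing, which may differ.

```python
import re

# Hypothetical helper for illustration only; not LORA's or Flye's parser.
# The "k" suffix is an assumption; the README only shows "m" and "g".
_MULTIPLIER = {"": 1, "k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def genome_size_to_bp(text: str) -> int:
    """Convert a size such as '3m' or '2.5g' into base pairs."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kmg]?)\s*", text.lower())
    if match is None:
        raise ValueError(f"unrecognised genome size: {text!r}")
    value, suffix = match.groups()
    return round(float(value) * _MULTIPLIER[suffix])


if __name__ == "__main__":
    for size in ("3m", "2.5g"):
        print(size, "=", genome_size_to_bp(size), "bp")
```

A check like this can catch a typo in --genome-size before a long assembly job is submitted.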

Common Examples

PacBio HiFi (recommended setup)

sequana_lora \
    --input-directory /data/hifi \
    --data-type pacbio-hifi \
    --assembler flye \
    --genome-size 3m \
    --apptainer-prefix /shared/containers \
    --do-coverage

Nanopore (bacteria, full quality pipeline)

Use --mode bacteria to enable sequana_coverage, prokka, busco, and checkm in one shot:

sequana_lora \
    --input-directory /data/nanopore \
    --data-type nano-hq \
    --assembler flye \
    --genome-size 3m \
    --apptainer-prefix /shared/containers \
    --mode bacteria \
    --busco-lineage bacteria \
    --checkm-rank genus \
    --checkm-name Streptococcus

PacBio subreads (with CCS construction)

If your input is raw PacBio BAM files (subreads), LORA can build CCS/HiFi reads first:

sequana_lora \
    --input-directory /data/subreads \
    --data-type pacbio-raw \
    --assembler flye \
    --genome-size 3m \
    --pacbio-build-ccs \
    --pacbio-ccs-min-passes 10 \
    --pacbio-ccs-min-rq 0.99

Or, if you have multiple BAM files per sample, provide a CSV:

sequana_lora \
    --pacbio-input-csv samples.csv \
    --data-type pacbio-raw \
    --assembler flye \
    --genome-size 3m

The CSV format is: one row per sample with columns sample,file1[,file2,...].
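As a rough sketch, such a sample sheet could be generated from a mapping of sample names to BAM files. The function name, sample names and paths below are made up for illustration, and the absence of a header row is an assumption: check the LORA documentation before relying on it.

```python
import csv
import io


def write_sample_sheet(samples: dict[str, list[str]], handle) -> None:
    """Write one row per sample: the sample name, then its BAM files."""
    writer = csv.writer(handle, lineterminator="\n")
    for name, bam_files in samples.items():
        if not bam_files:
            raise ValueError(f"sample {name!r} has no BAM file")
        # Assumption: no header row, columns are sample,file1[,file2,...]
        writer.writerow([name, *bam_files])


if __name__ == "__main__":
    # Hypothetical samples and paths.
    samples = {
        "sampleA": ["/data/subreads/A_run1.bam", "/data/subreads/A_run2.bam"],
        "sampleB": ["/data/subreads/B_run1.bam"],
    }
    buffer = io.StringIO()
    write_sample_sheet(samples, buffer)
    print(buffer.getvalue(), end="")
```

To write a real file, replace the StringIO buffer with open("samples.csv", "w", newline="").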

Optional Steps

Coverage analysis

Computes the depth and breadth of coverage for each contig using sequana_coverage. Highly recommended for checking assembly quality:

--do-coverage
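To make the two metrics concrete, here is a minimal sketch that computes them from a list of per-base depths for one contig. It assumes the common definitions (mean depth over all positions; breadth as the fraction of positions covered by at least one read) and is not the algorithm implemented by sequana_coverage, which performs a more detailed analysis.

```python
def coverage_summary(depths: list[int]) -> tuple[float, float]:
    """Return (mean depth, breadth) for one contig.

    depths holds the number of reads covering each position.
    Breadth is the fraction of positions with depth > 0.
    """
    if not depths:
        raise ValueError("contig has no positions")
    mean_depth = sum(depths) / len(depths)
    breadth = sum(1 for depth in depths if depth > 0) / len(depths)
    return mean_depth, breadth


if __name__ == "__main__":
    # Toy contig of 8 positions, the first two uncovered.
    depths = [0, 0, 10, 12, 11, 9, 10, 12]
    mean_depth, breadth = coverage_summary(depths)
    print(f"mean depth: {mean_depth:.1f}x, breadth: {breadth:.0%}")
```

A contig with good mean depth but low breadth has uncovered regions, which can point to a misassembly or a collapsed repeat.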

BUSCO completeness

Assess genome completeness against a lineage-specific marker gene set:

--busco-lineage bacteria          # auto-download bacteria lineage
--busco-lineage streptococcales   # specific clade
--busco-print-lineages            # list all available lineages

Prokka annotation

Annotate contigs (bacterial genomes):

--do-prokka

CheckM genome quality

Estimate completeness and contamination for bacterial genomes:

--checkm-rank genus --checkm-name Streptococcus

Use an invalid --checkm-name value to get a list of valid names for a given rank, e.g. --checkm-rank genus --checkm-name HELP.

Polypolish (Illumina polishing)

Polish long-read contigs with paired-end Illumina data:

--do-polypolish \
--polypolish-input-directory /data/illumina \
--polypolish-input-pattern "*.fastq.gz" \
--polypolish-input-readtag "_R[12]_"
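The read tag can be checked against your file names before launching the pipeline. The sketch below assumes the tag is interpreted as a regular expression and that the sample name is the part of the file name preceding it; both are assumptions about how the files are paired, and the file names are invented.

```python
import re
from collections import defaultdict


def pair_by_readtag(filenames, readtag=r"_R[12]_"):
    """Group file names by the text that precedes the read tag."""
    pairs = defaultdict(list)
    for name in sorted(filenames):
        match = re.search(readtag, name)
        if match is None:
            continue  # the file does not carry the read tag
        sample = name[: match.start()]
        pairs[sample].append(name)
    return dict(pairs)


if __name__ == "__main__":
    # Hypothetical Illumina file names.
    files = [
        "strepA_R2_001.fastq.gz",
        "strepA_R1_001.fastq.gz",
        "strepB_R1_001.fastq.gz",
        "strepB_R2_001.fastq.gz",
        "notes.txt",
    ]
    for sample, reads in pair_by_readtag(files).items():
        status = "ok" if len(reads) == 2 else "incomplete pair"
        print(sample, reads, status)
```

Any sample reported with fewer than two files suggests that the pattern or the read tag does not match the naming scheme of the Illumina data.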

Circularisation

Explicit circularisation with Circlator (Flye performs this automatically):

--do-circlator

BLAST identification

BLAST aligns each contig against a nucleotide database to identify the assembled sequences. The top hits appear in the HTML report.

Local BLAST

Requires a locally installed BLAST+ and a downloaded nt database (~270 GB). Fastest option with no network dependency:

--blastdb /path/to/blast/databases

Remote BLAST (NCBI)

No local database required — jobs are submitted to NCBI's BLAST servers. Enable by providing an email address:

--blast-email your@email.com

Jobs are submitted sequentially (one contig at a time) to avoid IP-level CPU throttling by NCBI. The default database is nt; use --blast-remote-db to change it:

--blast-email your@email.com --blast-remote-db refseq_genomic

Restricting the search to an organism group (strongly recommended)

Searching all of nt for a large contig is slow and prone to NCBI CPU throttling. Restrict the search by editing config.yaml after running sequana_lora. The entrez_query parameter is equivalent to filling the "Organism" box on the NCBI BLAST web form.

Option 1 — curated bacterial reference genomes (fastest, recommended for bacteria)

Use refseq_genomic as the database and restrict to bacteria. RefSeq contains only complete, curated reference genomes — a much smaller and higher-quality search space than nt:

blast:
    remote_db: 'refseq_genomic'
    entrez_query: 'Bacteria[Organism]'

Option 2 — all bacteria in nt, reference sequences only

Stay on nt but filter to RefSeq-quality entries:

blast:
    remote_db: 'nt'
    entrez_query: 'Bacteria[Organism] AND refseq[filter]'

Option 3 — single genus

Useful when the organism is known:

blast:
    remote_db: 'nt'
    entrez_query: 'Streptococcus[Organism]'

Note

If your organism is novel or not yet in RefSeq, fall back to remote_db: nt with a broad entrez_query such as Bacteria[Organism].
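For readers curious about what these two settings correspond to, the sketch below builds (without sending) a submission body for the public NCBI BLAST URL API, where the database and the organism filter travel as the DATABASE and ENTREZ_QUERY parameters. The function name is hypothetical, blastn is an assumed program, and this is not LORA's internal code, which relies on bioservices according to the changelog.

```python
from urllib.parse import parse_qs, urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_request(sequence, database="nt", entrez_query=None, email=None):
    """Return the URL-encoded body of a BLAST 'Put' (submit) request."""
    params = {
        "CMD": "Put",
        "PROGRAM": "blastn",  # assumption: nucleotide vs nucleotide search
        "DATABASE": database,
        "QUERY": sequence,
    }
    if entrez_query:
        params["ENTREZ_QUERY"] = entrez_query
    if email:
        params["EMAIL"] = email
    return urlencode(params)


if __name__ == "__main__":
    body = build_put_request(
        "ACGTACGTACGT",
        database="refseq_genomic",
        entrez_query="Bacteria[Organism]",
    )
    print(BLAST_URL)
    print(body)
    # Decoding the body shows the filter exactly as written in config.yaml.
    print(parse_qs(body)["ENTREZ_QUERY"])
```

The brackets and spaces of the query are percent-encoded in transit, so the value in config.yaml can be written exactly as it would be typed in the web form.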

NCBI API key (optional but recommended)

Register for a free NCBI API key at https://www.ncbi.nlm.nih.gov/account/ (sign in → API Key Management). It raises the rate limit from 3 to 10 requests/second and reduces CPU throttling for large queries. Add it to config.yaml:

blast:
    api_key: 'YOUR_KEY_HERE'
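The effect of the two rate limits can be pictured with a small client-side throttle: 3 requests/second means at least about 0.33 s between requests, 10 requests/second means at least 0.1 s. The Throttle class and rate_limit function below are hypothetical and only illustrate the arithmetic; they are not part of LORA, which already submits jobs sequentially.

```python
import time


def rate_limit(api_key):
    """Requests per second allowed, using the figures quoted above."""
    return 10 if api_key else 3


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(rate_limit(api_key=None))
    for contig in ("contig_1", "contig_2", "contig_3"):
        throttle.wait()  # pauses only when requests come too quickly
        print("would submit", contig)
```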

HPC / SLURM cluster

On a cluster with SLURM, pass --profile slurm:

sequana_lora \
    --input-directory /data/hifi \
    --data-type pacbio-hifi \
    --assembler flye \
    --genome-size 3m \
    --profile slurm \
    --slurm-queue fast \
    --jobs 40 \
    --apptainer-prefix /shared/containers

Per-rule memory and thread settings are controlled via the resources blocks in config.yaml.

Apptainer / Singularity (no system installs needed)

Every tool runs inside a pre-built container. Point --apptainer-prefix to a shared directory so images are downloaded once and reused across projects:

--apptainer-prefix /shared/containers

Images are downloaded automatically from Zenodo on first run. Pass extra bind mounts with --apptainer-args if your data lives outside $HOME:

--apptainer-args "-B /data:/data"

Configuration file

After running sequana_lora, a config.yaml is created in the working directory. All pipeline parameters can be tuned there. Key sections:

assembler — which assembler to use
flye / canu / hifiasm — assembler-specific options
fastp — read filtering (minimum length, etc.)
blast — BLAST settings including entrez_query and api_key
busco / prokka / checkm — optional QC tools
sequana_coverage — coverage analysis parameters
multiqc — aggregated report settings

Full reference: config.yaml

Pipeline overview

Read filtering — fastp removes reads below the minimum length threshold.
[Optional] CCS — build HiFi reads from PacBio subreads (ccs tool).
Assembly — Flye / Canu / Hifiasm / Unicycler / NECAT / PECAT.
[Optional] Circularisation — Circlator (or built into Flye).
[Optional] Polishing — Polypolish with paired-end Illumina reads.
Contig sorting — SeqKit sorts contigs by length (largest first).
Read mapping — Minimap2 maps reads back to contigs; Mosdepth computes coverage.
[Optional] Coverage analysis — sequana_coverage per contig.
Quality assessment — QUAST assembly statistics.
[Optional] BLAST — top hits per contig (local or remote NCBI).
[Optional] BUSCO — genome completeness.
[Optional] Prokka — genome annotation.
[Optional] CheckM — contamination and completeness for bacteria.
Reports — per-sample HTML report and a multi-sample summary.
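As an illustration of the contig sorting step listed above, the sketch below reads FASTA text and orders the records from longest to shortest. It is a plain Python stand-in written for this example, not SeqKit's implementation, and the contig names are invented.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def sort_by_length(records):
    """Return records ordered by sequence length, largest first."""
    return sorted(records, key=lambda record: len(record[1]), reverse=True)


if __name__ == "__main__":
    fasta = ">contig_1\nACGT\n>contig_2\nACGTACGTAC\nGG\n>contig_3\nACGTAC\n"
    for name, sequence in sort_by_length(read_fasta(fasta)):
        print(name, len(sequence))
```

Sorting largest first puts the chromosome-sized contig at the top of the report, with plasmids and small fragments after it.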

Changelog

Version	Description
1.1.0	remote BLAST via NCBI URL API (no local database needed); sequential submission to avoid IP-level CPU throttling
	entrez_query support to restrict BLAST to a taxonomic group (e.g. Bacteria[Organism], refseq_genomic), equivalent to the "Organism" box on the NCBI BLAST web form
	NCBI API key support for higher rate limits (blast.api_key)
	bioservices >= 1.16.0 required for NCBIBlastAPI
	retry logic when NCBI returns READY with empty result set
	improved HTML reports: Sequana logo in header, back-to-summary button, FASTA download link per sample, GC content in coverage table, informative amber warning box when BLAST returns no hits
	update busco container to busco_6.0.0
	fix sequana_coverage log redirection (was showing 2 s in monitor)
	updated README with BLAST, entrez_query and refseq_genomic docs
1.0.0	uniformised extension with other pipelines
	fix regression on schema file
	update sequana container to v0.16.5
	add unicycler apptainer
	add checkm module to help users choose the correct marker and name
	replace --pacbio and --nanopore with --data-type; pacbio is now decomposed into 3 sub-categories: pacbio-raw, pacbio-hifi and pacbio-corr
	add bandage if assembly graph is available
	fixed hifiasm container to use newest version
	improved report html
	make genome-size compulsory
	add fastp as preprocessing tool
	remove presets in favor of click options
	CCS defaults to hifi; pacbio presets in config set to pacbio-hifi
	blast removed from default; users must set blast DB themselves
	busco lineage downloaded from the web
	CANU preset changes: pacbio → pacbio-hifi
	CANU-correction preset changes: pacbio → pacbio-hifi
	FLYE preset changes: pacbio-raw → pacbio-hifi
	remote BLAST via NCBI URL API (no local database needed)
	entrez_query support to restrict BLAST to a taxonomic group
	NCBI API key support for higher rate limits
0.3.0	use click instead of argparse
	added multiqc / checkm / unicycler
0.2.0	add apptainers in most rules
	remove utils.smk to move rulegraph inside main pipeline
	rename lora.smk into lora.rules for consistency with other pipelines
	add checkm in the pipeline and HTML report
0.1.0	First release.
