RunningMSN/SPIDER

Sliding Primer In silico Detection of Encoded Regions (SPIDER)

SPIDER is a reference-based in silico PCR tool for detecting microbial sequences of interest in genome assemblies. It has two key functions. First, it searches for sequences of interest within your whole-genome assembly. Second, it can extract those sequences into a FASTA file for rapid genomic epidemiology analysis.

Installation

  1. Download or clone this GitHub repository
  2. SPIDER uses a conda environment to handle dependencies. Install the conda environment from the provided environment.yml file using conda env create -f environment.yml
  3. Activate the SPIDER conda environment using the command conda activate spider

SPIDER Search

To search for sequences of interest, SPIDER requires one or more query sequences and a database to search. The query sequences may be specified as a single FASTA file, a file listing the paths to multiple FASTA files, or a folder containing multiple sequences (.fasta or .fna). You can either search a pre-compiled database using a keyword or provide a custom database in FASTA format. The full list of parameters is given in the table below. To get started quickly, see the example commands below.

Examples

Searching a whole-genome assembly for virulence factors in the VFDB belonging to Staphylococcus aureus:

python spider.py -f assembly.fasta -db vfdb -s "Staphylococcus aureus"

Searching a list of genome assemblies for the sequences in the database custom_db.fasta and saving the output to the file out.txt:

python spider.py -l genome_list.txt -db custom_db.fasta -o out.txt
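
The file passed to -l is plain text with one genome path per line. A minimal sketch for generating such a file from a folder of assemblies follows; the helper name write_genome_list and the file names are illustrative assumptions and are not part of SPIDER.

```python
from pathlib import Path


def write_genome_list(assembly_dir, list_path, extensions=(".fasta", ".fna")):
    """Write the paths of all assemblies in assembly_dir to list_path, one per line.

    Hypothetical helper, not part of SPIDER. Returns the list of paths written.
    """
    paths = sorted(
        str(p)
        for p in Path(assembly_dir).iterdir()
        if p.is_file() and p.suffix in extensions
    )
    # One path per line, as the -l option expects.
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

The resulting file can then be supplied to the search, for example with python spider.py -l genome_list.txt -db custom_db.fasta -o out.txt.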

Full SPIDER Search Parameters

| Parameter | Description | Required |
|---|---|---|
| Input Options | | |
| -f, --fasta | Path to a single genome sequence. | Yes (only one of -f, -l, -d at a time) |
| -l, --list | Path to a list of genome sequences. This file is expected to contain paths to genome sequences, one per line. | Yes (only one of -f, -l, -d at a time) |
| -d, --directory | Path to a directory. SPIDER will look for any files that end in .fasta or .fna inside this directory. | Yes (only one of -f, -l, -d at a time) |
| -a, --annotation | Path to a GFF3-formatted annotation file. When included, SPIDER will compare detected amplicons to the annotations and check for overlap with any of them. This feature only works with a single FASTA input at a time. | No |
| Database Options | | |
| -db, --database | Either a keyword for a pre-compiled database or a path to a custom database in FASTA format. | Yes |
| --list_dbs | Provides a list of pre-compiled databases that can be searched. This is a stand-alone command that can be run without specifying a query and database. | No |
| -s, --search | A search term. If specified, the database will be filtered to FASTA headers that contain this term. | No |
| Output Options | | |
| -o, --output | Output file that will be generated. For SPIDER search, this will be a tab-separated-values file. If no output is specified, SPIDER will print to stdout. | No |
| Additional Search Options | | |
| --overlaps | Checks whether any of the identified sequences overlap one another. Default: False | No |
| --scan_codons | Searches for the start and stop codons nearest to the start and end of identified amplicons and checks whether they are in frame with one another. Default: False | No |
| -sl, --slide_limit | Percent length of a reference sequence that primers are allowed to slide. Default is 5 (5%). | No |
| -lt, --length | Percent length tolerance between an extracted amplicon and the reference sequence. Default is 20 (20%). This allows matches of 80-100% of the reference sequence. | No |
| -it, --identity | Percent identity tolerance between an extracted amplicon and the reference sequence. Anything above this threshold will be called positive. Default is 0 (0%). | No |
| -p, --primer_size | Length of primers for SPIDER to use. Default is 20 (20 nt). | No |
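
To make the percentage-based options concrete, the sketch below converts -sl and -lt into nucleotide counts for a reference of a given length. The function names and the rounding (integer floor) are assumptions made for illustration; SPIDER's internal calculation may differ.

```python
def max_slide(ref_length, slide_limit=5):
    """Nucleotides a primer may slide: slide_limit percent of the reference length."""
    return ref_length * slide_limit // 100


def min_amplicon_length(ref_length, length_tolerance=20):
    """Shortest accepted amplicon: (100 - length_tolerance) percent of the reference."""
    return ref_length * (100 - length_tolerance) // 100


# Example: a 1,500 nt reference with the default settings.
print(max_slide(1500))            # 75
print(min_amplicon_length(1500))  # 1200
```

Under these assumptions, the defaults let primers slide up to 75 nt on a 1,500 nt reference and accept amplicons of at least 1,200 nt.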

Database Shortcuts

SPIDER includes shortcuts to search common databases. To use a pre-compiled database, pass its keyword to the -db argument. For example, python spider.py -f assembly.fasta -db vfdb will search assembly.fasta for all virulence factors included in the Virulence Factor Database (VFDB). The -s/--search option can be used to filter the database for genes of interest. For example, searching for ExoU in VFDB can be performed with python spider.py -f assembly.fasta -db vfdb -s ExoU.
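
The effect of the -s filter can be pictured as keeping only those FASTA records whose header contains the search term. The sketch below is an illustration rather than SPIDER's own code: the function name is hypothetical, the match is assumed to be a case-sensitive substring test, and the headers and sequences are made up.

```python
def filter_fasta_by_header(fasta_text, term):
    """Return only the FASTA records whose header line contains term.

    Illustrative only: assumes a case-sensitive substring match.
    """
    kept = []
    keep_current = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new record starts; decide whether to keep it.
            keep_current = term in line
        if keep_current:
            kept.append(line)
    return "\n".join(kept)


# Made-up two-record database for demonstration.
example_db = """>example_1 ExoU [Pseudomonas aeruginosa]
ATGCATATC
>example_2 coagulase [Staphylococcus aureus]
ATGAAAAAG
"""
print(filter_fasta_by_header(example_db, "ExoU"))
```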

| Database | Keyword | Citation |
|---|---|---|
| Virulence Factor Database (VFDB) | vfdb | Chen L, Zheng D, Liu B, Yang J, Jin Q. VFDB 2016: hierarchical and refined dataset for big data analysis--10 years on. Nucleic Acids Res. 2016;44(D1):D694-D697. doi:10.1093/nar/gkv1239 |

If there are additional existing databases you would like to see added to this tool, please open an issue on GitHub.

SPIDER Extract

SPIDER can quickly extract the sequences of amplicons identified by the program. Sequences are extracted in the same orientation as your reference sequence, so sequences on the opposite strand are reverse complemented to display in the same direction. To run this command, a SPIDER search must be run first, with its output saved to a file. SPIDER will parse the output of the search and extract all sequences that the search flagged as valid.
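
For readers unfamiliar with reverse complementation, the following minimal sketch shows the operation applied to hits on the opposite strand. It illustrates the concept and is not taken from SPIDER's source; characters other than A, C, G and T (for example N) are left unchanged.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGAAC"))  # GTTCAT
```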

Examples

Extract the nucleotide sequences identified by searching the genome assembly.fasta against the custom database custom_db.fasta. Note that the first line is a SPIDER search, as described in the section above.

python spider.py -f assembly.fasta -db custom_db.fasta -o search_custom_db.txt
python spider.py -e search_custom_db.txt -o assembly_custom_db.fasta

Extract the amino acid sequences of the coagulase gene from all assemblies in the directory assemblies using the reference sequence from VFDB. Note that the first line is a SPIDER search, as described in the section above.

python spider.py -d assemblies -db vfdb -s "Staphylococcus aureus" -o coagulase_search.txt
python spider.py -e coagulase_search.txt --translate -o coagulase.fasta
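
The two-step workflow (search, then extract) can also be scripted. The sketch below builds both command lines from the flags documented in this README and runs them in order. The function names are hypothetical, and it assumes that spider.py is in the current working directory and that the spider conda environment is active.

```python
import subprocess


def search_command(directory, database, output, search_term=None):
    """Build the argument list for a SPIDER search over a directory of assemblies."""
    cmd = ["python", "spider.py", "-d", directory, "-db", database, "-o", output]
    if search_term is not None:
        cmd += ["-s", search_term]
    return cmd


def extract_command(search_output, fasta_output, translate=False):
    """Build the argument list for a SPIDER extract from a saved search output."""
    cmd = ["python", "spider.py", "-e", search_output, "-o", fasta_output]
    if translate:
        cmd.append("--translate")
    return cmd


def run_search_then_extract(directory, database, search_output, fasta_output,
                            search_term=None, translate=False):
    """Run the search, then the extract; raises CalledProcessError if either fails."""
    subprocess.run(
        search_command(directory, database, search_output, search_term), check=True
    )
    subprocess.run(
        extract_command(search_output, fasta_output, translate), check=True
    )
```

Using check=True stops the script before the extract step if the search fails, so a stale or missing search output is never parsed.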

Full SPIDER Extract Parameters

| Parameter | Description | Required |
|---|---|---|
| -e, --extract | Output of a SPIDER search for sequence(s) of interest in tab-separated-values format. Note that SPIDER assumes your sequences are still in the same location as when you performed the search. | Yes |
| -o, --output | Output file that will be generated. For SPIDER extract, this will be in FASTA format. If using the --separate option, this should be the name of a folder. Default: stdout | No |
| --translate | Translates the extracted nucleotide sequences to amino acid sequences. Note that this function assumes that the extracted sequence is in the desired reading frame. | No |
| --upstream | Number of nucleotides upstream of the desired amplicon to extract. Default: 0 (start of desired sequence) | No |
| --downstream | Number of nucleotides downstream of the desired amplicon to extract. Default: 0 (end of desired sequence) | No |
| --separate | Separates the output sequences into multiple FASTA files by target name. If using this option, the output flag is required and should be the name of a folder rather than a file. Default: False | No |
| --overwrite | Overwrites an existing output file. Default: False | No |
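
The reading-frame caveat for --translate is easiest to see with a small example. The sketch below translates a nucleotide sequence using the standard genetic code and shows how losing a single leading nucleotide changes every codon. It is a self-contained illustration, not SPIDER's translation routine, and the input sequences are made up.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate seq from its first nucleotide; trailing partial codons are dropped.

    Codons containing characters other than A, C, G, T are reported as X.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGGCCAAATAA"))  # MAK*  (in frame)
print(translate("TGGCCAAATAA"))   # WPN   (shifted by one nucleotide)
```

For this reason, a non-zero --upstream value that is not a multiple of three would be expected to shift the frame if the flanking sequence is included in the translation.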
