SPIDER is a reference-based in silico PCR tool for detecting microbial sequences of interest in genome assemblies. It has two key functions. First, it searches for sequences of interest within a whole-genome assembly. Second, it can extract those sequences into a FASTA file for quick genomic epidemiology analysis.
- Download or clone this GitHub repository
- SPIDER uses a conda environment to handle dependencies. Install the conda environment from the provided environment.yml file using
  conda env create -f environment.yml
- Activate the SPIDER conda environment using the command
  conda activate spider
To search for sequences of interest, SPIDER requires one or more query sequences and a database to search. The query sequences may be specified as a single FASTA file, a list of paths to multiple FASTA files, or a folder containing multiple sequences (.fasta or .fna). You can either search a pre-compiled database using a keyword or provide a custom database in FASTA format. The full list of parameters is available in a table below. If you just want to get going, see the example commands below.
Search a whole-genome assembly for virulence factors in the VFDB belonging to Staphylococcus aureus:
python spider.py -f assembly.fasta -db vfdb -s "Staphylococcus aureus"
Search a list of genome assemblies for a sequence in the database custom_db.fasta and save the output to the file out.txt:
python spider.py -l genome_list.txt -db custom_db.fasta -o out.txt
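The list file passed to -l is plain text with one genome path per line. As a minimal sketch, the Python helper below builds such a file from a folder of assemblies. The function name `write_genome_list` and the choice to write absolute paths are assumptions made for illustration; they are not part of SPIDER.

```python
from pathlib import Path


def write_genome_list(directory, list_path, extensions=(".fasta", ".fna")):
    """Write one genome path per line to list_path and return the paths.

    Hypothetical helper: collects files ending in .fasta or .fna, the same
    extensions the -d option is documented to look for.
    """
    paths = sorted(
        p.resolve()
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in extensions
    )
    # One path per line, as the -l/--list option expects.
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, `write_genome_list("assemblies", "genome_list.txt")` would produce a file that can then be given to `python spider.py -l genome_list.txt ...`.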
| Parameter | Description | Required |
|---|---|---|
| Input Options | ||
| -f, --fasta | Path to a single genome sequence | Yes, only one of these options at a time |
| -l, --list | Path to a list of genome sequences. This file is expected to contain paths to genome sequences, one per line. | |
| -d, --directory | Path to a directory. SPIDER will look for any files that end in .fasta or .fna inside of this directory | |
| -a, --annotation | Path to a GFF3 formatted annotation file. When included, SPIDER will compare detected amplicons to the annotations and check for overlap with any annotations. This feature only works with a single fasta input at a time. | No |
| Database Options | ||
| -db, --database | Either a keyword for a pre-compiled database, or path to a custom database in FASTA format. | Yes |
| --list_dbs | Provides a list of pre-compiled databases that can be searched. This is a stand-alone command that can be run without specifying a query and database. | No |
| -s, --search | This is a search term. If specified, the database will be filtered to FASTA headers that contain this term. | No |
| Output Options | ||
| -o, --output | Output file that will be generated. For SPIDER search, this will be a tab-separated-values file. If no output is specified, SPIDER will print to stdout. | No |
| Additional Search options | ||
| --overlaps | Checks if any of the identified sequences are overlapping one another. Default: False | No |
| --scan_codons | Searches for the start and stop codons nearest to the start and end of each identified amplicon and checks whether they are in frame with one another. Default: False | No |
| -sl, --slide_limit | Percent length of a reference sequence that primers are allowed to slide. Default is 5 (5%). | No |
| -lt, --length | Percent length tolerance between an extracted amplicon and the reference sequence. Default is 20 (20%). This allows matches of 80-100% of the reference sequence. | No |
| -it, --identity | Percent identity tolerance between an extracted amplicon and the reference sequence. Anything above this threshold will be called positive. Default is 0 (0%). | No |
| -p, --primer_size | Length of primers for SPIDER to use. Default is 20 (20nt). | No |
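To give some intuition for the --primer_size and --length options, here is a conceptual sketch of in silico PCR in general. It is not SPIDER's implementation: it assumes primers are taken from the two ends of the reference sequence, matches them exactly on the forward strand only, and ignores primer sliding and identity scoring. The function name `in_silico_pcr` is hypothetical, and only the lower length bound described in the table (80% with the default tolerance) is enforced.

```python
def in_silico_pcr(genome, reference, primer_size=20, length_tolerance=20):
    """Return the amplicon bounded by the reference's end primers, or None.

    Simplified illustration: exact, forward-strand primer matches only.
    """
    if len(reference) < 2 * primer_size:
        raise ValueError("reference is shorter than two primers")

    # Assumption: primers are the first and last primer_size bases of the reference.
    forward_primer = reference[:primer_size]
    reverse_primer = reference[-primer_size:]

    start = genome.find(forward_primer)
    if start == -1:
        return None
    # Look for the second primer downstream of the first one.
    end = genome.find(reverse_primer, start + primer_size)
    if end == -1:
        return None

    amplicon = genome[start:end + primer_size]

    # Reject amplicons much shorter than the reference (lower bound only;
    # how longer amplicons are treated is not modelled here).
    coverage = len(amplicon) / len(reference) * 100
    if coverage < 100 - length_tolerance:
        return None
    return amplicon
```

With the default tolerance of 20, an amplicon covering less than 80% of the reference length is rejected in this sketch.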
SPIDER includes shortcuts to search common databases. To use a pre-compiled database, use its keyword in the -db argument.
For example, python spider.py -f assembly.fasta -db vfdb will search assembly.fasta for all virulence factors included in the Virulence Factor Database (VFDB).
The -s/--search option can be used to filter the database for genes of interest. For example, to search for ExoU in VFDB, run
python spider.py -f assembly.fasta -db vfdb -s ExoU
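The effect of a search term can be pictured as keeping only the database records whose FASTA header contains that term. The sketch below illustrates the idea in plain Python. The helper names are hypothetical, and the case-sensitive substring match is an assumption; SPIDER's own matching rules may differ.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def filter_by_term(records, term):
    """Keep records whose header contains term (case-sensitive assumption)."""
    return [(header, seq) for header, seq in records if term in header]
```

Filtering a custom database this way before a search is one way to preview which reference sequences a given term would select.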
| Database | Keyword | Citation |
|---|---|---|
| Virulence Factor Database (VFDB) | vfdb | Chen L, Zheng D, Liu B, Yang J, Jin Q. VFDB 2016: hierarchical and refined dataset for big data analysis--10 years on. Nucleic Acids Res. 2016;44(D1):D694-D697. doi:10.1093/nar/gkv1239 |
If there are additional existing databases you would like to see added to this tool, please open an issue on GitHub.
SPIDER can quickly extract the sequences of amplicons identified by the program. Sequences are extracted in the same orientation as your reference sequence, so sequences on the opposite strand are reverse complemented to display in the same direction. To run this command, a SPIDER search must be run first and its output saved to a file. SPIDER will parse the output of the search and extract all sequences that were flagged as valid by the search.
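As a small illustration of the orientation step, the sketch below reverse complements a nucleotide sequence in Python. The function names and the "+"/"-" strand labels are assumptions for illustration and do not describe SPIDER's internal code or output format.

```python
# Translation table for complementing unambiguous bases; other characters
# (for example N) are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_like_reference(seq, strand):
    """Return seq unchanged for '+' hits and reverse complemented for '-' hits."""
    return reverse_complement(seq) if strand == "-" else seq
```

A hit found on the opposite strand reads in the same direction as the reference once it is passed through `orient_like_reference(seq, "-")`.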
Extract the nucleotide sequences found by searching the genome assembly.fasta against the custom database custom_db.fasta. Note that the first line is a SPIDER search as described in the section above.
python spider.py -f assembly.fasta -db custom_db.fasta -o search_custom_db.txt
python spider.py -e search_custom_db.txt -o assembly_custom_db.fasta
Extract the amino acid sequences of the coagulase gene from all assemblies in the directory
assemblies using the reference sequence from VFDB. Note the first line is a SPIDER search
from the above section.
python spider.py -d assemblies -db vfdb -s "Staphylococcus aureus" -o coagulase_search.txt
python spider.py -e coagulase_search.txt --translate -o coagulase.fasta
| Parameter | Description | Required |
|---|---|---|
| -e, --extract | Output of a SPIDER search for sequence(s) of interest in tab-separated-values format. Note that SPIDER assumes that your sequences are still located in their original location when you performed the search. | Yes |
| -o, --output | Output file that will be generated. For SPIDER extract, this will be in FASTA format. If using the --separate option, this should be the name of a folder. Default: stdout | No |
| --translate | Translates the extracted nucleotide sequences to amino acid sequences. Note that this function assumes that the extracted sequence is in the desired reading frame. | No |
| --upstream | Number of nucleotides upstream of the desired amplicon to extract. Default: 0 (start of desired sequence) | No |
| --downstream | Number of nucleotides downstream of the desired amplicon to extract. Default: 0 (end of desired sequence) | No |
| --separate | Separate the output sequences into multiple FASTA files by target name. If using this option, the output flag is required and should be the name of a folder rather than a file. Default: False | No |
| --overwrite | Overwrite an existing output file. Default: False | No |
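Because extraction always follows a saved search, the two steps can be chained from a script. The sketch below assembles both command lines from the flags documented in the tables above and runs them with `subprocess`. The helper names are hypothetical, and it assumes spider.py is in the current working directory with the spider conda environment active.

```python
import subprocess


def build_search_cmd(assembly_dir, database, search_out, term=None):
    """Build a SPIDER search command for a directory of assemblies."""
    cmd = ["python", "spider.py", "-d", str(assembly_dir),
           "-db", str(database), "-o", str(search_out)]
    if term:
        cmd += ["-s", term]
    return cmd


def build_extract_cmd(search_out, fasta_out, translate=False,
                      upstream=0, downstream=0):
    """Build a SPIDER extract command from a saved search output."""
    cmd = ["python", "spider.py", "-e", str(search_out), "-o", str(fasta_out)]
    if translate:
        cmd.append("--translate")
    if upstream:
        cmd += ["--upstream", str(upstream)]
    if downstream:
        cmd += ["--downstream", str(downstream)]
    return cmd


def search_then_extract(assembly_dir, database, search_out, fasta_out, **extract_opts):
    """Run the search first, then the extraction; stop if either step fails."""
    subprocess.run(build_search_cmd(assembly_dir, database, search_out), check=True)
    subprocess.run(build_extract_cmd(search_out, fasta_out, **extract_opts), check=True)
```

Passing the arguments as a list avoids shell quoting problems with search terms that contain spaces, such as "Staphylococcus aureus".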