Fast similarity search for protein sequences using ESM-2 embeddings and Approximate Nearest Neighbor (ANN) methods.
This project implements efficient protein retrieval in embedding space using ESM-2 protein language models, comparing multiple ANN algorithms (LSH, Hypercube, IVF-Flat, IVF-PQ, Neural LSH) against BLAST baseline for detecting remote homologs—proteins with low sequence identity (<30%) but conserved structure/function. For a more detailed analysis read the pdf reports!
- Embeddings: ESM-2 (t6_8M_UR50D) with mean pooling → 320-dim vectors
- ANN Methods: LSH, Hypercube LSH, IVF-Flat, IVF-PQ, Neural LSH
- Evaluation: Recall@N vs BLAST Top-N, QPS (queries/second)
- Biological Validation: UniProt/Pfam/GO annotation overlap for remote homology detection
| Dataset | File | Description | Size |
|---|---|---|---|
| SwissProt | swissprot.fasta |
Main protein database (ground truth) | ~570,000 proteins |
| Query Targets | targets.fasta |
Test queries for evaluation | 112 queries |
# 1. Generate embeddings
python protein_embed.py \
-i data/swissprot.fasta \
-o data/protein_vectors.dat
# 2. Run search & evaluation
python protein_search.py \
-d data/protein_vectors.dat \
-q data/targets.fasta \
--method all \
--run_blast --db_fasta data/swissprot.fasta| Method | Avg Recall@50 | QPS | Best For |
|---|---|---|---|
| Neural LSH | 0.35 | 1,648 | Accuracy |
| IVF-Flat | 0.33 | 2,146 | Balance |
| IVF-PQ | 0.32 | 2,839 | Speed/Memory |
| Hypercube | 0.21 | 3,291 | Maximum throughput |
- Python 3.8+
- PyTorch
- fair-esm
- NCBI BLAST+ (for ground truth)
- NumPy
- scikit-learn
- tqdm
- Ioannis Petrakis Ioannis Nikolopoulos