📚 Semantic Search and Clustering of Research Papers with Streamlit and FAISS

A Streamlit-powered platform to extract, summarize, embed, cluster, and semantically search academic research papers using modern NLP and vector search technologies. Designed for researchers, students, and engineers working with scientific literature.

🚀 Overview

Research Paper Explorer is a complete NLP pipeline and interactive web interface for working with scientific PDFs. The app allows users to upload academic papers, process them to extract meaningful content, generate semantic embeddings and summaries, and perform intelligent retrieval and clustering.

Built with:

🧠 Sentence Transformers for multilingual embeddings
📝 Transformers for summarization and classification
⚡ FAISS for efficient vector-based semantic search
🖥️ Streamlit for a clean and functional web UI
🐳 Docker for easy deployment

✅ Features

📥 Upload and process PDF-based academic research papers
🧾 Clean extraction of structured content from scientific formats
📝 Generate multilingual summaries using BART or mBART
🧠 Sentence-level and full-document embedding with transformer models
🔍 Semantic search across full papers and individual sections using FAISS
🧩 KMeans clustering of papers by similarity
🔬 Query filtering to exclude non-academic input
📁 Choice to temporarily cache or permanently save data
🖼️ Streamlit interface for browsing, searching, and organizing documents

📦 Requirements

Ensure these are included in your requirements.txt:

streamlit
transformers
sentence-transformers
faiss-cpu
pdfplumber
PyMuPDF
pdfreader
langdetect
pandas
scikit-learn
spacy
scispacy

🗂️ File Structure


.
├── main.py                      # Streamlit app entry point
├── Dockerfile                   # Container setup
├── requirements.txt             # Python dependencies
├── questions.txt                # Sample queries from three pre-indexed papers

├── .venv/                       # Local virtual environment (optional)
├── cache/                       # Temporary storage
│   ├── embeddings/              # Cached embeddings before confirmation
│   └── pdfs/                    # PDFs pending user confirmation

├── entity/                      # Permanently saved, structured paper data (JSON)
├── embeddings/                  # Persisted embeddings once saved by the user
├── pdfs/                        # Permanently stored uploaded PDFs
├── tmp/                         # FAISS index and metadata files
│   ├── faiss_global.idx
│   ├── faiss_metadata.pkl
│   ├── faiss_sections.idx
│   └── faiss_section_meta.pkl

├── utils/                       # Core functionality
│   ├── pdfExt.py                # PDF text extraction and cleanup
│   ├── summarize.py             # Summarization with Hugging Face models
│   ├── embeddingsPDF.py         # Embedding generation
│   ├── classification.py        # KMeans clustering
│   ├── query.py                 # Semantic search with FAISS
│   ├── appendFiass.py           # Append to FAISS indices
│   └── garbage_check.py         # Filter out low-quality/non-academic queries

├── MLModels/
│   └── models.py                # Model loading functions:
│                               # - get_embedding_query_model
│                               # - get_summarizer
│                               # - get_classifier

└── README.md
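
The split between `cache/` and the permanent `pdfs/` and `embeddings/` directories implies a two-stage save. A minimal sketch of that promotion step follows. It assumes a cached PDF and its embedding file share a base name; the helper name `promote_from_cache`, the `.npy` extension, and the naming convention are illustrative assumptions, not taken from the project code.

```python
import shutil
from pathlib import Path


def promote_from_cache(stem: str, root: Path = Path(".")) -> list[Path]:
    """Move a cached PDF and its embedding file into permanent storage.

    Hypothetical helper: assumes files are named <stem>.pdf and <stem>.npy.
    """
    moves = [
        (root / "cache" / "pdfs" / f"{stem}.pdf", root / "pdfs"),
        (root / "cache" / "embeddings" / f"{stem}.npy", root / "embeddings"),
    ]
    moved = []
    for src, dest_dir in moves:
        if not src.exists():
            continue  # nothing cached for this artifact; skip it
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Calling `promote_from_cache("my_paper")` from the project root would move whichever of the two cached files exist and return their new paths.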

⚙️ Setup Instructions

🔁 Option 1: Using Docker (Recommended)

docker build -t research-explorer .
docker run -p 8502:8502 research-explorer

Then go to: http://localhost:8502

💻 Option 2: Using `.venv` Locally

Activate your virtual environment:
```
source .venv/bin/activate
```
Install dependencies:
```
pip install -r requirements.txt
```
Run the app:
```
streamlit run main.py
```

🧑‍💻 Usage

1. Start the app via Docker or locally.
2. Upload a PDF paper.
3. Choose to process it. Processing:
   - Extracts and cleans structured content.
   - Summarizes major sections.
   - Embeds sections and global text.
   - Updates the FAISS indices.
4. Query the database using natural language.
5. Explore clusters or results.
6. Optionally, save the file and its embeddings to disk for future reuse.
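
The query step can be illustrated without any external dependencies. The sketch below ranks stored vectors by cosine similarity to a query vector, which is the score an exact inner-product search over L2-normalized embeddings produces. The toy 3-dimensional vectors, the titles, and the `top_k` helper are illustrative assumptions; the app itself delegates this search to FAISS.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, k=2):
    """Return (index, score) pairs for the k most similar stored vectors."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(v))))
        for i, v in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy stand-ins for paper embeddings (real ones have hundreds of dimensions).
titles = ["graph neural networks", "protein folding", "graph attention"]
vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]

for idx, score in top_k([1.0, 0.2, 0.0], vectors):
    print(f"{titles[idx]}: {score:.3f}")
```

A brute-force scan like this is linear in the number of stored vectors; a vector index such as FAISS exists to make the same lookup fast at scale.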

🧠 Code Explanation

| Module | Description |
| --- | --- |
| `pdfExt.py` | Extracts and cleans academic PDFs using `pdfplumber`, `fitz`, and `pdfreader`. |
| `summarize.py` | Generates section summaries using BART (English) or mBART (multilingual). |
| `embeddingsPDF.py` | Creates embeddings for sections and full documents using SentenceTransformer. |
| `classification.py` | Clusters documents via KMeans on global embeddings. |
| `query.py` | Performs FAISS-based semantic search for user queries. |
| `appendFiass.py` | Updates existing FAISS indices and metadata with new papers. |
| `garbage_check.py` | Filters non-academic or irrelevant queries. |
| `MLModels/models.py` | Loads and caches summarizer, embedding model, and classifier via `@st.cache_resource`. |
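
To make the clustering step concrete, here is a small, dependency-free sketch of the k-means loop: assign each vector to its nearest centroid, then recompute each centroid as the mean of its members. Since scikit-learn is listed in the requirements, `classification.py` presumably relies on a library implementation; this hand-rolled version, its deterministic initialization, and the name `kmeans` are assumptions used only to illustrate the algorithm.

```python
def kmeans(points, k, iterations=10):
    """Cluster points with plain k-means; returns one label per point.

    Illustrative only: centroids start at the first k points so the
    result is deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy 2-D "embeddings": two papers near the origin, two far away.
embeddings = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
print(kmeans(embeddings, k=2))
```

Papers that receive the same label would be shown together as one cluster in the interface.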

⚠️ Limitation

This system works best when PDFs follow a conventional academic structure with recognizable headers:

Abstract
Introduction
Background
Literature Review
Methodology
Experiments
Results
Discussion
Conclusion
Acknowledgements
References

📌 Note: If a document does not follow this structure, results may vary. Support for flexible document layouts is under development. Name your PDF file after the article's title before uploading it to the app.
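
A minimal sketch of header-based splitting shows why the expected structure matters. It assumes each header sits alone on its own line, optionally numbered; the function name `split_sections` and the matching rule are assumptions for illustration, not the logic in `pdfExt.py`.

```python
import re

HEADERS = [
    "Abstract", "Introduction", "Background", "Literature Review",
    "Methodology", "Experiments", "Results", "Discussion",
    "Conclusion", "Acknowledgements", "References",
]

# Match a known header on a line by itself, optionally numbered ("1. Introduction").
HEADER_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(re.escape(h) for h in HEADERS) + r")\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_sections(text: str) -> dict[str, str]:
    """Map each recognized header to the text that follows it."""
    matches = list(HEADER_RE.finditer(text))
    sections = {}
    for current, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        sections[current.group(1).title()] = text[current.end():end].strip()
    return sections


sample = "Abstract\nWe study X.\n1. Introduction\nX matters.\nResults\nIt works."
print(split_sections(sample))
```

With this approach, a document that contains none of the listed headers yields an empty mapping, so there is nothing to summarize or index at the section level.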

📄 Sample Queries File

A file named questions.txt is included in the root directory. It contains example queries prepared from the introduction sections of three pre-indexed academic papers. These queries are useful for testing the semantic search functionality of the system with real data.

Use this file to see what kind of responses the app provides from previously embedded documents.

🔮 Future Improvements

Support for documents with non-standard or missing headers
Built-in citation graph analysis
Topic modeling + tag suggestions
Multi-language UI
React frontend with a FastAPI backend
Improved search results ranking
Generated answers in addition to semantic search matches
Pre-deployed version on GCP with user authentication
Visualize clusters using UMAP or t-SNE
Web-based PDF viewer with highlights
CI/CD pipeline, with more to come

🙏 Acknowledgments

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📬 Contact

Feel free to reach out or connect with me:

📧 Email: adenabrehama@gmail.com
💼 LinkedIn: linkedin.com/in/aden
🎨 CodePen: codepen.io/adexoxo
