adexoxo13/pdf-semantic-search


📚 Semantic Search and Clustering of Research Papers with Streamlit and FAISS

A Streamlit-powered platform to extract, summarize, embed, cluster, and semantically search academic research papers using modern NLP and vector search technologies. Designed for researchers, students, and engineers working with scientific literature.


🚀 Overview

Research Paper Explorer is a complete NLP pipeline and interactive web interface for working with scientific PDFs. The app allows users to upload academic papers, process them to extract meaningful content, generate semantic embeddings and summaries, and perform intelligent retrieval and clustering.

Built with:

  • 🧠 Sentence Transformers for multilingual embeddings
  • 📝 Transformers for summarization and classification
  • ⚡ FAISS for efficient vector-based semantic search
  • 🖥️ Streamlit for a clean and functional web UI
  • 🐳 Docker for easy deployment

📋 Table of Contents

  1. ✅ Features
  2. 📦 Requirements
  3. 🗂️ File Structure
  4. ⚙️ Setup Instructions
  5. 🧑‍💻 Usage
  6. 🧠 Code Explanation
  7. ⚠️ Limitation
  8. 📄 Sample Queries File
  9. 🔮 Future Improvements
  10. 🙏 Acknowledgments
  11. 📄 License
  12. 📬 Contact

✅ Features

  • 📥 Upload and process PDF-based academic research papers
  • 🧾 Clean extraction of structured content from scientific formats
  • 📝 Generate multilingual summaries using BART or mBART
  • 🧠 Sentence-level and full-document embedding with transformer models
  • 🔍 Semantic search across full papers and individual sections using FAISS
  • 🧩 KMeans clustering of papers by similarity
  • 🔬 Query filtering to exclude non-academic input
  • 📁 Choice to temporarily cache or permanently save data
  • 🖼️ Streamlit interface for browsing, searching, and organizing documents
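
To make the clustering feature concrete, here is a minimal pure-Python sketch of the k-means idea. It is a stand-in for illustration only: the project itself lists scikit-learn and clusters with KMeans on global embeddings, and the `kmeans` helper, the toy 2-D vectors, and the first-k initialization below are assumptions, not the app's code.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means: returns one cluster label per point (hypothetical helper)."""
    # Deterministic start: use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy 2-D "embeddings": two papers near (0, 0) and two near (5, 5).
embeddings = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels = kmeans(embeddings, k=2)
print(labels)
```

Real sentence embeddings have hundreds of dimensions, but the assignment and update steps are the same; in practice scikit-learn's `KMeans` also handles initialization and convergence checks for you.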

📦 Requirements

Ensure these are included in your requirements.txt:

streamlit
transformers
sentence-transformers
faiss-cpu
pdfplumber
PyMuPDF
pdfreader
langdetect
pandas
scikit-learn
spacy
scispacy

🗂️ File Structure


.
├── main.py                      # Streamlit app entry point
├── Dockerfile                   # Container setup
├── requirements.txt             # Python dependencies
├── questions.txt                # Sample queries from three pre-indexed papers

├── .venv/                       # Local virtual environment (optional)
├── cache/                       # Temporary storage
│   ├── embeddings/              # Cached embeddings before confirmation
│   └── pdfs/                    # PDFs pending user confirmation

├── entity/                      # Permanently saved, structured paper data (JSON)
├── embeddings/                  # Persisted embeddings once saved by the user
├── pdfs/                        # Permanently stored uploaded PDFs
├── tmp/                         # FAISS index and metadata files
│   ├── faiss_global.idx
│   ├── faiss_metadata.pkl
│   ├── faiss_sections.idx
│   └── faiss_section_meta.pkl

├── utils/                       # Core functionality
│   ├── pdfExt.py                # PDF text extraction and cleanup
│   ├── summarize.py             # Summarization with Hugging Face models
│   ├── embeddingsPDF.py         # Embedding generation
│   ├── classification.py        # KMeans clustering
│   ├── query.py                 # Semantic search with FAISS
│   ├── appendFiass.py           # Append to FAISS indices
│   └── garbage_check.py         # Filter out low-quality/non-academic queries

├── MLModels/
│   └── models.py                # Model loading functions:
│                               # - get_embedding_query_model
│                               # - get_summarizer
│                               # - get_classifier

└── README.md


⚙️ Setup Instructions

🔁 Option 1: Using Docker (Recommended)

docker build -t research-explorer .
docker run -p 8502:8502 research-explorer

Then go to: http://localhost:8502


💻 Option 2: Using .venv Locally

  1. Create a virtual environment if you do not have one yet, then activate it:

    python -m venv .venv
    source .venv/bin/activate
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run the app:

    streamlit run main.py

🧑‍💻 Usage

  1. Start the app via Docker or locally.
  2. Upload a PDF paper.
  3. Choose to process the paper. The app then:
    • Extracts and cleans the structured content.
    • Summarizes the major sections.
    • Embeds each section and the full text.
    • Updates the FAISS indices.
  4. Query the database using natural language.
  5. Explore clusters or results.
  6. Optionally: save the file and its embeddings to disk for future reuse.
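
The optional save step can be pictured as moving files out of the `cache/` folders into their permanent counterparts. The sketch below is one possible way to do that, not the app's actual logic: the `promote_from_cache` function, the `.npy` suffix for embeddings, and the file name `my_paper` are all assumptions made for illustration. Only the folder names come from the file structure above.

```python
import shutil
import tempfile
from pathlib import Path

def promote_from_cache(stem: str, root: Path) -> list[Path]:
    """Move a cached PDF and its embeddings into the permanent folders."""
    # (cached file, permanent folder) pairs; the .npy suffix is an assumption.
    moves = [
        (root / "cache" / "pdfs" / f"{stem}.pdf", root / "pdfs"),
        (root / "cache" / "embeddings" / f"{stem}.npy", root / "embeddings"),
    ]
    saved = []
    for src, dest_dir in moves:
        if not src.exists():
            continue  # nothing cached for this artifact
        dest_dir.mkdir(parents=True, exist_ok=True)
        dest = dest_dir / src.name
        shutil.move(str(src), str(dest))
        saved.append(dest)
    return saved

# Demo in a throwaway directory so no real project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache" / "pdfs" / "my_paper.pdf"
    cached.parent.mkdir(parents=True)
    cached.write_bytes(b"%PDF-1.4 demo")
    saved = promote_from_cache("my_paper", root)
    saved_names = [p.relative_to(root).as_posix() for p in saved]
    print(saved_names)
```

Keeping unconfirmed uploads under `cache/` means a discarded paper never reaches the permanent folders.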

🧠 Code Explanation

| Module | Description |
| --- | --- |
| pdfExt.py | Extracts and cleans academic PDFs using pdfplumber, fitz (the import name of PyMuPDF), and pdfreader. |
| summarize.py | Generates section summaries using BART (English) or mBART (multilingual). |
| embeddingsPDF.py | Creates embeddings for sections and full documents using SentenceTransformer. |
| classification.py | Clusters documents via KMeans on global embeddings. |
| query.py | Performs FAISS-based semantic search for user queries. |
| appendFiass.py | Updates existing FAISS indices and metadata with new papers. |
| garbage_check.py | Filters non-academic or irrelevant queries. |
| MLModels/models.py | Loads and caches summarizer, embedding model, and classifier via @st.cache_resource. |
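
As a rough picture of what semantic search does, the sketch below ranks documents by cosine similarity with a brute-force loop. It deliberately swaps FAISS out for plain Python so it runs without dependencies. The `search` and `normalize` helpers, the toy vectors, and the file names are hypothetical, and this README does not state which FAISS index type or distance metric `query.py` uses. The parallel `metadata` list mirrors the pairing of `.idx` and `.pkl` files in `tmp/`, which is also an assumption about how the two relate.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def search(query_vec, doc_vecs, metadata, top_k=2):
    """Return the top_k (score, metadata) pairs by cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for vec, meta in zip(doc_vecs, metadata):
        d = normalize(vec)
        # Dot product of unit vectors equals cosine similarity.
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, meta))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy 3-D "embeddings"; row i of doc_vecs belongs to metadata[i].
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
metadata = ["paper_a.pdf", "paper_b.pdf", "paper_c.pdf"]
hits = search([1.0, 0.0, 0.0], doc_vecs, metadata, top_k=2)
titles = [meta for _, meta in hits]
print(titles)
```

A vector index such as FAISS answers the same nearest-neighbor question far more efficiently on large collections. Because an index returns row positions, the metadata must stay in the same order as the indexed vectors.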

⚠️ Limitation

This system works best when PDFs follow a conventional academic structure with recognizable headers:

  • Abstract
  • Introduction
  • Background
  • Literature Review
  • Methodology
  • Experiments
  • Results
  • Discussion
  • Conclusion
  • Acknowledgements
  • References

📌 Note: If a document does not follow this structure, its sections may not be detected reliably and results may vary. Support for flexible document layouts is under development. Before uploading, name your PDF file after the article's title.
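
To show why recognizable headers matter, here is a minimal sketch of header-based section splitting. It is not the logic in `pdfExt.py`: the `split_sections` function, the regular expression, and the sample text are assumptions made for illustration. Only the header list comes from this README.

```python
import re

HEADERS = [
    "Abstract", "Introduction", "Background", "Literature Review",
    "Methodology", "Experiments", "Results", "Discussion",
    "Conclusion", "Acknowledgements", "References",
]

# A header must sit alone on its line, optionally numbered ("1. Introduction").
HEADER_RE = re.compile(
    r"^[ \t]*(?:\d+\.?[ \t]+)?(" + "|".join(HEADERS) + r")[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text):
    """Map each recognized header to the text that follows it."""
    matches = list(HEADER_RE.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next recognized header or the end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).title()] = text[m.end():end].strip()
    return sections

sample = """Abstract
We study retrieval.
1. Introduction
Search matters.
Findings
This heading is not in the list, so it is not split out.
"""
sections = split_sections(sample)
print(list(sections))
```

In the sample, the unrecognized "Findings" heading is absorbed into the Introduction text, which is the kind of degradation to expect from papers with unconventional headings.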


📄 Sample Queries File

A file named questions.txt is included in the root directory. It contains example queries prepared from the introduction sections of three pre-indexed academic papers. These queries are useful for testing the semantic search functionality of the system with real data.

Use this file to see what kind of responses the app provides from previously embedded documents.


🔮 Future Improvements

  • Support for documents with non-standard or missing headers
  • Built-in citation graph analysis
  • Topic modeling + tag suggestions
  • Multi-language UI
  • React front end with a FastAPI back end
  • Improved search results ranking
  • Smarter answers that go beyond pure semantic matching
  • Pre-deployed version on GCP with user authentication
  • Visualize clusters using UMAP or t-SNE
  • Web-based PDF viewer with highlights
  • Continuous integration and delivery (CI/CD), with more to come

🙏 Acknowledgments


📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


📬 Contact

Feel free to reach out or connect with me:
