A Streamlit-powered platform to extract, summarize, embed, cluster, and semantically search academic research papers using modern NLP and vector search technologies. Designed for researchers, students, and engineers working with scientific literature.
Research Paper Explorer is a complete NLP pipeline and interactive web interface for working with scientific PDFs. The app allows users to upload academic papers, process them to extract meaningful content, generate semantic embeddings and summaries, and perform intelligent retrieval and clustering.
Built with:
- 🧠 Sentence Transformers for multilingual embeddings
- 📝 Transformers for summarization and classification
- ⚡ FAISS for efficient vector-based semantic search
- 🖥️ Streamlit for a clean and functional web UI
- 🐳 Docker for easy deployment
- ✅ Features
- 📦 Requirements
- 🗂️ File Structure
- ⚙️ Setup Instructions
- 🧑💻 Usage
- 🧠 Code Explanation
- ⚠️ Limitation
- 📄 Sample Queries File
- 🔮 Future Improvements
- 🙏 Acknowledgments
- 📄 License
- 📬 Contact
- 📥 Upload and process PDF-based academic research papers
- 🧾 Clean extraction of structured content from scientific formats
- 📝 Generate multilingual summaries using BART or mBART
- 🧠 Sentence-level and full-document embedding with transformer models
- 🔍 Semantic search across full papers and individual sections using FAISS
- 🧩 KMeans clustering of papers by similarity
- 🔬 Query filtering to exclude non-academic input
- 📁 Choice to temporarily cache or permanently save data
- 🖼️ Streamlit interface for browsing, searching, and organizing documents
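The core semantic-search idea can be illustrated with a minimal, dependency-free sketch. Cosine similarity over toy vectors stands in for FAISS, and the hard-coded vectors stand in for real Sentence Transformer embeddings; all names here are illustrative, not the app's actual code:

```python
import math

# Toy "embeddings" standing in for Sentence Transformer vectors.
# In the real app these are produced by an embedding model and
# indexed with FAISS.
corpus = {
    "paper_a_intro": [0.9, 0.1, 0.0],
    "paper_b_method": [0.1, 0.9, 0.2],
    "paper_c_results": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def search(query_vec, k=2):
    # Rank sections by cosine similarity, highest first.
    scored = sorted(corpus.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

print(search([0.8, 0.2, 0.1]))  # closest sections first
```

FAISS performs essentially this ranking, but over millions of vectors with optimized index structures instead of a Python loop.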
Ensure these are included in your requirements.txt:
streamlit
transformers
sentence-transformers
faiss-cpu
pdfplumber
PyMuPDF
pdfreader
langdetect
pandas
scikit-learn
spacy
scispacy
.
├── main.py              # Streamlit app entry point
├── Dockerfile           # Container setup
├── requirements.txt     # Python dependencies
├── questions.txt        # Sample queries from three pre-indexed papers
├── .venv/               # Local virtual environment (optional)
├── cache/               # Temporary storage
│   ├── embeddings/      # Cached embeddings before confirmation
│   └── pdfs/            # PDFs pending user confirmation
├── entity/              # Permanently saved, structured paper data (JSON)
├── embeddings/          # Persisted embeddings once saved by the user
├── pdfs/                # Permanently stored uploaded PDFs
├── tmp/                 # FAISS index and metadata files
│   ├── faiss_global.idx
│   ├── faiss_metadata.pkl
│   ├── faiss_sections.idx
│   └── faiss_section_meta.pkl
├── utils/               # Core functionality
│   ├── pdfExt.py        # PDF text extraction and cleanup
│   ├── summarize.py     # Summarization with Hugging Face models
│   ├── embeddingsPDF.py # Embedding generation
│   ├── classification.py # KMeans clustering
│   ├── query.py         # Semantic search with FAISS
│   ├── appendFiass.py   # Append to FAISS indices
│   └── garbage_check.py # Filter out low-quality/non-academic queries
├── MLModels/
│   └── models.py        # Model loading functions:
│                        #   - get_embedding_query_model
│                        #   - get_summarizer
│                        #   - get_classifier
└── README.md
docker build -t research-explorer .
docker run -p 8502:8502 research-explorer

Then go to: http://localhost:8502
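For reference, a Dockerfile for this kind of Streamlit setup typically looks like the sketch below. This is illustrative, not necessarily the repository's exact file; the port matches the `-p 8502:8502` mapping above:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8502
CMD ["streamlit", "run", "main.py", "--server.port=8502", "--server.address=0.0.0.0"]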
1. Activate your virtual environment:
   source .venv/bin/activate
2. Install dependencies:
   pip install -r requirements.txt
3. Run the app:
   streamlit run main.py
- Start the app via Docker or locally.
- Upload a PDF paper.
- Choose to process:
- Extracts and cleans structured content.
- Summarizes major sections.
- Embeds sections and global text.
- Updates the FAISS indices.
- Query the database using natural language.
- Explore clusters or results.
- Optionally: save the file and its embeddings to disk for future reuse.
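The cache-then-confirm flow in the last step can be sketched as follows. The directory names mirror the file structure above, but the function itself is an illustrative stand-in, not the app's actual code:

```python
import shutil
from pathlib import Path

def confirm_save(filename, cache_dir="cache/pdfs", final_dir="pdfs"):
    """Move a cached PDF into permanent storage once the user confirms.

    Sketch only: the real app does this for both the PDF and its
    cached embeddings.
    """
    src = Path(cache_dir) / filename
    dst_dir = Path(final_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst_dir / filename))
    return dst_dir / filename

# Example: stage a dummy file in the cache, then "confirm" it.
Path("cache/pdfs").mkdir(parents=True, exist_ok=True)
(Path("cache/pdfs") / "demo.pdf").write_bytes(b"%PDF-1.4 demo")
saved = confirm_save("demo.pdf")
print(saved)
```

Files that are never confirmed simply stay in `cache/` and can be cleared without touching permanently saved papers.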
| Module | Description |
|---|---|
| `pdfExt.py` | Extracts and cleans academic PDFs using pdfplumber, fitz, and pdfreader. |
| `summarize.py` | Generates section summaries using BART (English) or mBART (multilingual). |
| `embeddingsPDF.py` | Creates embeddings for sections and full documents using SentenceTransformer. |
| `classification.py` | Clusters documents via KMeans on global embeddings. |
| `query.py` | Performs FAISS-based semantic search for user queries. |
| `appendFiass.py` | Updates existing FAISS indices and metadata with new papers. |
| `garbage_check.py` | Filters non-academic or irrelevant queries. |
| `MLModels/models.py` | Loads and caches the summarizer, embedding model, and classifier via `@st.cache_resource`. |
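`@st.cache_resource` ensures each heavy model is loaded once per Streamlit process and reused across reruns. Outside Streamlit, `functools.lru_cache` gives the same load-once behavior, as this stand-in sketch shows (the loader name mirrors the repo's `get_summarizer`, but the body is a placeholder, not the real model-loading code):

```python
from functools import lru_cache

LOAD_COUNT = 0

@lru_cache(maxsize=None)  # stands in for @st.cache_resource
def get_summarizer():
    # Placeholder for something like
    # transformers.pipeline("summarization", model="facebook/bart-large-cnn")
    global LOAD_COUNT
    LOAD_COUNT += 1
    return object()

a = get_summarizer()
b = get_summarizer()
print(a is b, LOAD_COUNT)  # the model object is created only once
```

Without this caching, every Streamlit rerun (triggered by any widget interaction) would reload the transformer weights from disk.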
This system works best when PDFs follow a conventional academic structure with recognizable headers:
- Abstract
- Introduction
- Background
- Literature Review
- Methodology
- Experiments
- Results
- Discussion
- Conclusion
- Acknowledgements
- References
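A rough idea of how header-based splitting can work is shown below, using a regex over a subset of the headers above. The actual extraction in `pdfExt.py` is more involved; this is only a sketch of the principle:

```python
import re

HEADERS = ["Abstract", "Introduction", "Methodology", "Results",
           "Discussion", "Conclusion", "References"]
# Match a known header on its own line, case-insensitively.
HEADER_RE = re.compile(r"^(%s)\s*$" % "|".join(HEADERS), re.I | re.M)

def split_sections(text):
    """Return {header: body} for each recognized section."""
    parts = HEADER_RE.split(text)
    # parts = [preamble, header1, body1, header2, body2, ...]
    return {parts[i].title(): parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}

doc = "Title\nAbstract\nWe study X.\nIntroduction\nPrior work...\n"
print(split_sections(doc))
```

Papers whose headers are numbered ("1. Introduction"), renamed, or embedded in running text fall outside this simple pattern, which is exactly the limitation described above.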
📌 Note: If a document does not follow this structure, results may vary. Support for flexible document layouts is under development. Be sure to name your PDF file after the article's title before uploading it to the app.
A file named questions.txt is included in the root directory. It contains example queries prepared from the introduction sections of three pre-indexed academic papers. These queries are useful for testing the semantic search functionality of the system with real data.
Use this file to see what kind of responses the app provides from previously embedded documents.
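Before a query reaches the index, `garbage_check.py` screens out non-academic input. A naive heuristic stand-in is sketched below; the real module's logic may differ, and the keyword list here is purely illustrative:

```python
ACADEMIC_HINTS = {"method", "dataset", "results", "model", "study",
                  "analysis", "experiment", "approach", "evaluation"}

def looks_academic(query: str, min_words: int = 3) -> bool:
    """Crude filter: long enough and contains at least one research term."""
    words = query.lower().split()
    return (len(words) >= min_words
            and any(w.strip("?.,") in ACADEMIC_HINTS for w in words))

print(looks_academic("What dataset does the study use?"))  # True
print(looks_academic("lol hi"))                            # False
```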
- Support for documents with non-standard or missing headers
- Built-in citation graph analysis
- Topic modeling + tag suggestions
- Multi-language UI
- Migration to a React frontend with a FastAPI backend
- Improved search results ranking
- Smart, generated answers rather than semantic retrieval alone
- Pre-deployed version on GCP with user authentication
- Visualize clusters using UMAP or t-SNE
- Web-based PDF viewer with highlights
- CI/CD pipeline integration, with more to come
This project is licensed under the MIT License - see the LICENSE file for details.
Feel free to reach out or connect with me:
- 📧 Email: adenabrehama@gmail.com
- 💼 LinkedIn: linkedin.com/in/aden
- 🎨 CodePen: codepen.io/adexoxo