A powerful Retrieval-Augmented Generation (RAG) system for analyzing PDF documents with optional AI-powered conversational responses.
🚀 Try the Live Demo | No setup required!
✅ Two Query Modes
- 🔍 Offline Search - No API keys, works instantly with hybrid search (TF-IDF + LSA)
- 🤖 AI Mode - Natural language answers with Claude, OpenAI, or Gemini
✅ Production-Ready
- 📍 Exact page citations for all results
- 🎯 Smart relevance filtering
- 💾 Hybrid search (TF-IDF + LSA for accuracy)
- ⚡ Fast processing (10 seconds per PDF)
✅ Easy to Use - CLI, Python API, and Web Demo available
✅ Open Source - MIT licensed, contributions welcome!
No installation needed. Visit the live demo to test with sample PDFs:
git clone https://github.com/rushi-12320/RAG-Conversational-Agent
cd RAG-Conversational-Agent
pip install -r requirements.txt
# Search default PDF (report.pdf)
python pdf_query.py "What are the major business segments?"
# Search custom PDF
python pdf_query.py "Your question" your_file.pdf
That's it! Results appear in seconds. 🚀
Get natural language answers by adding an API key:
# Set your API key (choose one)
export ANTHROPIC_API_KEY="sk-ant-..." # Claude
export OPENAI_API_KEY="sk-proj-..." # OpenAI
export GOOGLE_API_KEY="AIza..." # Gemini
# Run interactive chat
python main.py --pdf report.pdf --model claude
Try it instantly without any setup:
Features:
- Upload your own PDFs
- Search with or without AI
- Interactive chat interface
- Mobile-friendly
.
├── pdf_query.py # ⭐ Simple PDF query tool (USE THIS!)
├── main.py # Full RAG chatbot (requires API key)
├── requirements.txt # Python dependencies
│
├── rag/
│ ├── ingestion.py # PDF extraction & chunking
│ ├── retrieval.py # Hybrid search (TF-IDF + LSA)
│ ├── llm.py # LLM interfaces (Claude, OpenAI, Gemini)
│ ├── chat.py # Chat session management
│ └── pipeline.py # RAG pipeline
│
├── report.pdf # Sample Adani board outcome document
├── earnings.pdf # Sample earnings presentation
└── README.md # This file
No API key needed. Works offline.
# Basic usage
python pdf_query.py "your question"
# Examples
python pdf_query.py "What are the major business segments?"
python pdf_query.py "What is the consolidated total income in H1-26?"
python pdf_query.py "What is the CEO's email address?"
# Use different PDF
python pdf_query.py "your question" earnings.pdf
🔍 Searching 'report.pdf' for: What are the major business segments?
======================================================================
✅ Found 3 relevant sections:
📄 Result 1 | Relevance: 45.2% | Page 6-7
----------------------------------------------------------------------
Adani Enterprises conducts business through multiple segments:
Infrastructure, Energy, Mining, Airports, Defence & Aerospace...
📄 Result 2 | Relevance: 38.1% | Page 8-9
----------------------------------------------------------------------
[Additional relevant content...]
======================================================================
For natural language answers powered by AI:
Option A: Anthropic Claude (Easiest)
- Visit https://console.anthropic.com
- Sign up → Get free trial credits
- Copy your API key
Option B: OpenAI
- Visit https://platform.openai.com/account/api-keys
- Create new key → Add billing (or use $5 free credit)
Option C: Google Gemini
- Go to https://aistudio.google.com
- Click "Get API key" → Create in Cloud Console
PowerShell:
$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
# OR
$env:OPENAI_API_KEY = "sk-proj-xxx..."
# OR
$env:GOOGLE_API_KEY = "AIza..."
Command Prompt:
set ANTHROPIC_API_KEY=sk-ant-xxx...
Linux/Mac:
export ANTHROPIC_API_KEY="sk-ant-xxx..."
# Interactive mode (ask multiple questions)
python main.py --pdf "report.pdf" --model claude
# Single question mode
python main.py --pdf "report.pdf" --model claude --question "What are business segments?"
# Use OpenAI instead
python main.py --pdf "report.pdf" --model openai --question "Your question here"
# Show debug info (retrieved chunks, scores)
python main.py --pdf "report.pdf" --model claude --debug
python pdf_query.py "question" # Search report.pdf
python pdf_query.py "question" file.pdf # Search custom PDF
python pdf_query.py # Show help
Features:
- ✅ No authentication
- ✅ Instant results (1-3 seconds)
- ✅ Works offline
- ✅ Shows page numbers
python main.py --pdf <file> [options]
Required:
--pdf <file> Path to PDF file
Optional:
--model <name> LLM: claude, openai, gemini (default: claude)
--question <text> Ask one question and exit
--top-k <n> Retrieved chunks (default: 6)
--chunk-size <n> Words per chunk (default: 400)
--chunk-overlap <n> Word overlap (default: 80)
--debug Show retrieved chunks and scores
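For reference, the flags listed above could be declared with `argparse` roughly as follows. This is a sketch of an equivalent parser under the documented names and defaults, not the actual contents of main.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented main.py flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--pdf", required=True, help="Path to PDF file")
    parser.add_argument(
        "--model",
        choices=["claude", "openai", "gemini"],
        default="claude",
        help="LLM backend",
    )
    parser.add_argument("--question", default=None, help="Ask one question and exit")
    parser.add_argument("--top-k", type=int, default=6, help="Retrieved chunks")
    parser.add_argument("--chunk-size", type=int, default=400, help="Words per chunk")
    parser.add_argument("--chunk-overlap", type=int, default=80, help="Word overlap")
    parser.add_argument("--debug", action="store_true", help="Show chunks and scores")
    return parser


# argparse turns dashes into underscores: --top-k becomes args.top_k
args = build_parser().parse_args(["--pdf", "report.pdf", "--chunk-size", "500"])
print(args.model, args.top_k, args.chunk_size, args.chunk_overlap)
# prints: claude 6 500 80
```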
Examples:
# Interactive chat
python main.py --pdf report.pdf --model claude
# Single question
python main.py --pdf report.pdf --model claude --question "What is total income?"
# Show what's being retrieved
python main.py --pdf report.pdf --model openai --debug
# Customize chunks
python main.py --pdf report.pdf --chunk-size 500 --chunk-overlap 100
python pdf_query.py "What are the major business segments discussed in the document?" report.pdf
Expected Result:
✅ Found 5 relevant sections:
- Infrastructure
- Energy & Power
- Mining
- Airports
- Defence & Aerospace
python pdf_query.py "What is the consolidated total income in H1-26?" report.pdf
Expected Result:
✅ Found 3 relevant sections:
📄 Result 1 | Relevance: 52.3% | Page 9-10
----------------------------------------------------------------------
Consolidated Total Income for H1-26: ₹44,280.69 crores
python pdf_query.py "What is the CEO's email address?" report.pdf
Expected Result:
🔍 Searching 'report.pdf' for: What is the CEO's email address?
======================================================================
❌ Not found in the document.
$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
python main.py --pdf report.pdf --model claude --question "Analyze the company's airport segment growth"
- Python 3.8 or higher
- Windows, Mac, or Linux
mkdir my-rag-project
cd my-rag-project
Download these files from the project:
- main.py
- rag/ folder (all files inside)
- pdf_query.py
- requirements.txt
# Create virtual environment
python -m venv .venv
# Activate it (PowerShell; on Linux/Mac run: source .venv/bin/activate)
.\.venv\Scripts\Activate.ps1
# Install packages
pip install -r requirements.txt
Place your PDF in the same folder, e.g., report.pdf
python pdf_query.py "Your question here"
pip install scikit-learn
pip install pdfplumber pypdf
Make sure PDF file is in the same directory:
ls # Check if PDF exists
python pdf_query.py "question" report.pdf # Specify filename
Set your API key first:
$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
Your API account quota is exhausted. Either:
- Use pdf_query.py (no API needed)
- Add billing to your API account
- Try a different LLM provider
The PDF might be scanned. Install OCR:
pip install pytesseract pdf2image
- Extract → Read text from PDF using pdfplumber
- Chunk → Split into 400-word pieces with 80-word overlap
- Index → Build TF-IDF index (scikit-learn)
- Search → Find top 5 relevant chunks using hybrid search
- Display → Show results with page numbers
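The Chunk step can be sketched as a plain sliding window over words. This is a minimal illustration assuming the documented 400-word size and 80-word overlap; `chunk_words` is a hypothetical helper, and the project's own chunker in rag/ingestion.py is described as sentence-aware, which this sketch is not.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 400, overlap: int = 80) -> List[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks
```

With 1,000 words and the defaults, this yields three chunks starting at words 0, 320, and 640, and each chunk repeats the last 80 words of the previous one.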
- Local search (steps 1-4 above)
- LLM → Send retrieved chunks to Claude/OpenAI/Gemini
- Generate → LLM synthesizes natural language answer
- Complete → Return answer with citations
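The LLM and Generate steps depend on how the retrieved chunks are packed into the prompt. The sketch below shows one way to do that with page labels so the model can cite them; `build_grounded_prompt` and the prompt wording are assumptions for illustration, not the project's actual prompt.

```python
from typing import List, Tuple

# Fallback text taken from the negative-control example in this README.
NOT_FOUND = "Not found in the document."


def build_grounded_prompt(question: str, chunks: List[Tuple[int, str]]) -> str:
    """Format (page, text) chunks into a prompt that asks for page citations."""
    context = "\n\n".join(f"[Page {page}] {text}" for page, text in chunks)
    return (
        "Answer using only the excerpts below and cite pages like [Page N].\n"
        f'If the answer is not in the excerpts, reply exactly: "{NOT_FOUND}"\n\n'
        f"Excerpts:\n{context}\n\n"
        f"Question: {question}"
    )
```

The returned string would then be passed to whichever backend was selected with `--model`.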
| Metric | Value |
|---|---|
| Speed | 1-3 seconds per query |
| Accuracy | 70-85% relevant results |
| Memory | ~100MB for typical PDFs |
| File Size | Works up to 500+ pages |
| Scenario | Use |
|---|---|
| Quick fact lookup | pdf_query.py ✅ |
| No internet access | pdf_query.py ✅ |
| Complex analysis | main.py + API key |
| Budget-conscious | pdf_query.py ✅ |
| Natural language synthesis | main.py + API key |
Q: Do I need an API key?
A: No! Use pdf_query.py for instant local search.
Q: Why are results showing low relevance?
A: The question doesn't match document content well. Try more specific keywords.
Q: Can I use my own PDF?
A: Yes! python pdf_query.py "question" your_file.pdf
Q: How do I add Claude/OpenAI?
A: Set API key → python main.py --pdf file.pdf --model claude
- ✅ Try pdf_query.py with a sample PDF
- ✅ Experiment with different question types
- ✅ Add your own PDF document
- ✅ (Optional) Set up an API key for AI responses
- ✅ Integrate with your application
- Check Troubleshooting section above
- Verify PDF is readable: python -c "import pdfplumber; pdfplumber.open('file.pdf')"
- Try example queries first to verify setup
- Check API key format and permissions
Happy querying! 🚀
We love contributions! Whether it's bug fixes, features, or documentation improvements:
- Fork the repository
- Create a branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
MIT License - see LICENSE file for details.
- Built with pdfplumber for PDF extraction
- Search powered by scikit-learn TF-IDF
- LLM integrations: Anthropic, OpenAI, Google
- Web demo with Streamlit
- ✅ PDF extraction & search
- ✅ LLM integration (Claude, OpenAI, Gemini)
- ✅ CLI interface
- ✅ Web demo (Streamlit)
- 🔄 In progress: Enhanced UI, batch processing
- 📋 Planned: Web scraping, URL support, advanced filtering
- Issues & Bugs: GitHub Issues
- Discussions: GitHub Discussions
- Author: Dilip Patel
⭐ Give us a star if you found this useful!
Made with ❤️ by the RAG team
python main.py --pdf ./doc.pdf --chunk-size 300 --chunk-overlap 60 --top-k 8
---
## Chat Commands
| Command | Action |
|---|---|
| *(any question)* | Answer grounded in the PDF |
| `clear` | Reset conversation history |
| `history` | Show recent conversation turns |
| `exit` / `quit` | End session |
---
## Acceptance Tests
Run the full test suite against the Adani Enterprises earnings PDF (or any similar PDF):
```bash
python tests/test_acceptance.py --pdf ./AEL_Earnings_Presentation_Q2-FY26.pdf --model claude
```
Tests covered:
- T1 — Grounded fact question (business segments with citations)
- T2 — Numeric question (consolidated total income, exact value or "Not found")
- T3 — Cross-section question (EBITDA drivers with citations)
- T4 — Negative control (CEO email → "Not found in the document.")
- T5 — Conversational follow-up (airport performance, 2-turn history-aware)
main.py
└── RAGPipeline (rag/pipeline.py)
├── extract_pages() ← rag/ingestion.py
│ pdfplumber → pypdf
├── chunk_pages() ← rag/ingestion.py
│ sentence-aware sliding window
├── HybridRetriever() ← rag/retrieval.py
│ TF-IDF sparse + LSA dense + RRF fusion
└── answer()
↳ build contextual query (history-aware)
↳ retrieve top-k chunks
↳ build grounded prompt with citations
↳ LLMBackend.generate()
ChatSession (rag/chat.py)
└── REPL loop with bounded history (20 turns)
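The bounded history in the chat session can be pictured with a fixed-length deque. This is a sketch assuming the 20-turn limit shown above; `BoundedHistory` and its method names are hypothetical and may not match rag/chat.py.

```python
from collections import deque
from typing import Deque, List, Tuple

Turn = Tuple[str, str]  # (question, answer)


class BoundedHistory:
    """Keeps only the most recent max_turns question/answer pairs."""

    def __init__(self, max_turns: int = 20) -> None:
        # A deque with maxlen silently drops the oldest turn when full.
        self._turns: Deque[Turn] = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self._turns.append((question, answer))

    def clear(self) -> None:
        """Matches the intent of the `clear` chat command."""
        self._turns.clear()

    def recent(self, n: int = 5) -> List[Turn]:
        """Return up to n latest turns, oldest first."""
        return list(self._turns)[-n:] if n > 0 else []

    def __len__(self) -> int:
        return len(self._turns)
```

Capping the history keeps the contextual query and the prompt from growing without limit during long sessions.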
| Flag | Default | Description |
|---|---|---|
| `--pdf` | (required) | Path to PDF |
| `--model` | `claude` | LLM backend: claude / openai / gemini |
| `--top-k` | `6` | Chunks retrieved per query |
| `--chunk-size` | `400` | Target words per chunk |
| `--chunk-overlap` | `80` | Word overlap between chunks |
| `--debug` | off | Show retrieved chunks + scores |
Hybrid retrieval combines two complementary signals:
- Sparse (TF-IDF) — exact keyword matching; great for specific terms, numbers, abbreviations
- Dense (LSA) — latent semantic analysis via SVD; captures synonyms and topic relationships
Rankings from both legs are merged using Reciprocal Rank Fusion (RRF, k=60), a rank-based fusion method that needs no score normalization and is therefore robust to score scale differences.
This runs fully offline — no embedding model downloads required.
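As an illustration of the fusion step, the sketch below applies the standard RRF formula, where each chunk scores the sum of 1 / (k + rank) over the rankings it appears in. `rrf_fuse` and the chunk ids are hypothetical; the project's own code in rag/retrieval.py may differ in detail.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[Hashable]], k: int = 60) -> List[Hashable]:
    """Merge ranked id lists; ranks start at 1 and higher fused score wins."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


sparse = ["c1", "c2", "c3"]  # e.g. TF-IDF order
dense = ["c3", "c1", "c4"]   # e.g. LSA order
print(rrf_fuse([sparse, dense]))
# prints: ['c1', 'c3', 'c2', 'c4']
```

Only ranks enter the formula, so the TF-IDF and LSA scores never have to be put on a common scale.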
- Add a vector DB: Replace `HybridRetriever` with a FAISS/Chroma/pgvector backend while keeping the same interface (`retrieve(query, top_k)`)
- Add an embedding model: Replace the LSA dense leg with `sentence-transformers` for better semantic retrieval
- Add reranking: Insert a cross-encoder reranker between retrieval and answer generation
- Confidence thresholding: Check the top fusion score; if below a threshold, return "Not found"
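Confidence thresholding could be added as a thin wrapper around any retriever that keeps the `retrieve(query, top_k)` interface. The sketch below assumes hits are returned as (chunk text, score) pairs, which is a guess about the return shape; `Retriever`, `ThresholdedRetriever`, and `min_score` are hypothetical names.

```python
from typing import List, Protocol, Tuple

Hit = Tuple[str, float]  # (chunk text, fusion score); an assumed shape


class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> List[Hit]: ...


class ThresholdedRetriever:
    """Wraps a retriever and returns no hits when the best score is too low."""

    def __init__(self, inner: Retriever, min_score: float) -> None:
        self.inner = inner
        self.min_score = min_score

    def retrieve(self, query: str, top_k: int) -> List[Hit]:
        hits = self.inner.retrieve(query, top_k)
        if not hits or max(score for _, score in hits) < self.min_score:
            return []  # the caller can then answer "Not found"
        return hits
```

Because the wrapper exposes the same method, it could sit in front of the existing hybrid retriever or a vector DB backend without changes to the rest of the pipeline. A suitable `min_score` would need to be tuned on real queries.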