📄 RAG Conversational Agent

A powerful Retrieval-Augmented Generation (RAG) system for analyzing PDF documents with optional AI-powered conversational responses.

🚀 Try the Live Demo | No setup required!

✨ Features

✅ Two Query Modes

🔍 Offline Search - No API keys, works instantly with hybrid search (TF-IDF + LSA)
🤖 AI Mode - Natural language answers with Claude, OpenAI, or Gemini

✅ Production-Ready

📍 Exact page citations for all results
🎯 Smart relevance filtering
💾 Hybrid search (TF-IDF + LSA for accuracy)
⚡ Fast processing (10 seconds per PDF)

✅ Easy to Use - CLI, Python API, and Web Demo available

✅ Open Source - MIT licensed, contributions welcome!

🎯 Quick Start (2 Minutes)

Option 1: Try the Live Demo (Easiest ⭐)

No installation needed. Visit the live demo to test with sample PDFs:

🔗 Open Live Demo

Option 2: Run Locally

1. Install Dependencies

git clone https://github.com/rushi-12320/RAG-Conversational-Agent
cd RAG-Conversational-Agent
pip install -r requirements.txt

2. Search a PDF (No API Key!)

# Search default PDF (report.pdf)
python pdf_query.py "What are the major business segments?"

# Search custom PDF
python pdf_query.py "Your question" your_file.pdf

That's it! Results appear in seconds. 🚀

3. (Optional) Enable AI Responses

Get natural language answers by adding an API key:

# Set your API key (choose one)
export ANTHROPIC_API_KEY="sk-ant-..."    # Claude
export OPENAI_API_KEY="sk-proj-..."      # OpenAI  
export GOOGLE_API_KEY="AIza..."          # Gemini

# Run interactive chat
python main.py --pdf report.pdf --model claude

🎮 Live Demo (Hosted)

Try it instantly without any setup:

👉 Streamlit Cloud Demo

Features:

Upload your own PDFs
Search with or without AI
Interactive chat interface
Mobile-friendly

Project Structure

.
├── pdf_query.py           # ⭐ Simple PDF query tool (USE THIS!)
├── main.py                # Full RAG chatbot (requires API key)
├── requirements.txt       # Python dependencies
│
├── rag/
│   ├── ingestion.py       # PDF extraction & chunking
│   ├── retrieval.py       # Hybrid search (TF-IDF + LSA)
│   ├── llm.py             # LLM interfaces (Claude, OpenAI, Gemini)
│   ├── chat.py            # Chat session management
│   └── pipeline.py        # RAG pipeline
│
├── report.pdf             # Sample Adani board outcome document
├── earnings.pdf           # Sample earnings presentation
└── README.md              # This file

How To Use

Method 1: Simple PDF Query (Recommended ⭐)

No API key needed. Works offline.

# Basic usage
python pdf_query.py "your question"

# Examples
python pdf_query.py "What are the major business segments?"
python pdf_query.py "What is the consolidated total income in H1-26?"
python pdf_query.py "What is the CEO's email address?"

# Use different PDF
python pdf_query.py "your question" earnings.pdf

Sample Output

🔍 Searching 'report.pdf' for: What are the major business segments?

======================================================================
✅ Found 3 relevant sections:

📄 Result 1 | Relevance: 45.2% | Page 6-7
----------------------------------------------------------------------
Adani Enterprises conducts business through multiple segments:
Infrastructure, Energy, Mining, Airports, Defence & Aerospace...

📄 Result 2 | Relevance: 38.1% | Page 8-9
----------------------------------------------------------------------
[Additional relevant content...]

======================================================================

Method 2: Full RAG with AI (Requires API Key)

For natural language answers powered by AI:

Step 1: Get an API Key

Option A: Anthropic Claude (Easiest)

Visit https://console.anthropic.com
Sign up → Get free trial credits
Copy your API key

Option B: OpenAI

Visit https://platform.openai.com/account/api-keys
Create new key → Add billing (or use $5 free credit)

Option C: Google Gemini

Go to https://aistudio.google.com
Click "Get API key" → Create in Cloud Console

Step 2: Set Environment Variable

PowerShell:

$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
# OR
$env:OPENAI_API_KEY = "sk-proj-xxx..."
# OR
$env:GOOGLE_API_KEY = "AIza..."

Command Prompt:

set ANTHROPIC_API_KEY=sk-ant-xxx...

Linux/Mac:

export ANTHROPIC_API_KEY="sk-ant-xxx..."

Step 3: Run RAG Agent

# Interactive mode (ask multiple questions)
python main.py --pdf "report.pdf" --model claude

# Single question mode
python main.py --pdf "report.pdf" --model claude --question "What are the business segments?"

# Use OpenAI instead
python main.py --pdf "report.pdf" --model openai --question "Your question here"

# Show debug info (retrieved chunks, scores)
python main.py --pdf "report.pdf" --model claude --debug

Command Reference

pdf_query.py (Simple - No API)

python pdf_query.py "question"           # Search report.pdf
python pdf_query.py "question" file.pdf  # Search custom PDF
python pdf_query.py                      # Show help

Features:

✅ No authentication
✅ Instant results (1-3 seconds)
✅ Works offline
✅ Shows page numbers

main.py (Full RAG - Requires API)

python main.py --pdf <file> [options]

Required:

--pdf <file>              Path to PDF file

Optional:

--model <name>            LLM: claude, openai, gemini (default: claude)
--question <text>         Ask one question and exit
--top-k <n>               Retrieved chunks (default: 6)
--chunk-size <n>          Words per chunk (default: 400)
--chunk-overlap <n>       Word overlap (default: 80)
--debug                   Show retrieved chunks and scores
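
The flags above map naturally onto `argparse`. The following is a minimal sketch of how such a parser could be declared, not the actual code in main.py: the function name `build_parser` is hypothetical, and only the flag names and defaults are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented main.py flags."""
    parser = argparse.ArgumentParser(description="Chat with a PDF (sketch)")
    parser.add_argument("--pdf", required=True, help="Path to PDF file")
    parser.add_argument("--model", choices=["claude", "openai", "gemini"],
                        default="claude", help="LLM backend")
    parser.add_argument("--question", help="Ask one question and exit")
    parser.add_argument("--top-k", type=int, default=6, help="Retrieved chunks")
    parser.add_argument("--chunk-size", type=int, default=400, help="Words per chunk")
    parser.add_argument("--chunk-overlap", type=int, default=80, help="Word overlap")
    parser.add_argument("--debug", action="store_true",
                        help="Show retrieved chunks and scores")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--pdf", "report.pdf", "--chunk-size", "500"])
print(args.model, args.top_k, args.chunk_size, args.chunk_overlap)
```

Flags that are not passed fall back to the documented defaults, so only `--pdf` has to be supplied.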

Examples:

# Interactive chat
python main.py --pdf report.pdf --model claude

# Single question
python main.py --pdf report.pdf --model claude --question "What is total income?"

# Show what's being retrieved
python main.py --pdf report.pdf --model openai --debug

# Customize chunks
python main.py --pdf report.pdf --chunk-size 500 --chunk-overlap 100

Examples

Example 1: Search Business Segments

python pdf_query.py "What are the major business segments discussed in the document?" report.pdf

Expected Result:

✅ Found 5 relevant sections:
- Infrastructure
- Energy & Power
- Mining
- Airports
- Defence & Aerospace

Example 2: Financial Query

python pdf_query.py "What is the consolidated total income in H1-26?" report.pdf

Expected Result:

✅ Found 3 relevant sections:
📄 Result 1 | Relevance: 52.3% | Page 9-10
----------------------------------------------------------------------
Consolidated Total Income for H1-26: ₹44,280.69 crores

Example 3: Not Found Handling

python pdf_query.py "What is the CEO's email address?" report.pdf

Expected Result:

🔍 Searching 'report.pdf' for: What is the CEO's email address?

======================================================================
❌ Not found in the document.

Example 4: AI-Powered Response

$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
python main.py --pdf report.pdf --model claude --question "Analyze the company's airport segment growth"

Installation Guide (Step-by-Step)

Prerequisites

Python 3.8 or higher
Windows, Mac, or Linux

Step 1: Create Project Folder

mkdir my-rag-project
cd my-rag-project

Step 2: Download Files

Download these files from the project:

main.py
rag/ folder (all files inside)
pdf_query.py
requirements.txt

Step 3: Install Python Dependencies

# Create virtual environment
python -m venv .venv

# Activate it (Windows PowerShell)
.\.venv\Scripts\Activate.ps1

# Activate it (Linux/Mac)
source .venv/bin/activate

# Install packages
pip install -r requirements.txt

Step 4: Add Your PDF

Place your PDF in the same folder, e.g., report.pdf

Step 5: Start Querying!

python pdf_query.py "Your question here"

Troubleshooting

"ModuleNotFoundError: No module named 'sklearn'"

pip install scikit-learn

"No module named 'pdfplumber'"

pip install pdfplumber pypdf

"FileNotFoundError: report.pdf not found"

Make sure the PDF file is in the directory you run the command from:

ls  # Check if PDF exists
python pdf_query.py "question" report.pdf  # Specify filename

"ANTHROPIC_API_KEY not set"

Set your API key first:

$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."

"Error code: 429 - Quota exceeded"

HTTP 429 means the provider is rejecting requests because your rate limit or account quota is exhausted. Either:

Use pdf_query.py (no API needed)
Add billing to your API account
Try a different LLM provider
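
Switching providers can be made easier by checking which API keys are actually present. A minimal sketch, assuming the three environment variable names used in this README; the helper `available_backends` is hypothetical and not part of the project.

```python
import os

# Environment variable per backend, as used elsewhere in this README.
KEY_VARS = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def available_backends(env=None):
    """Return backend names whose API key is set, in preference order."""
    env = os.environ if env is None else env
    return [name for name, var in KEY_VARS.items() if env.get(var)]


# Example with an explicit mapping instead of the real environment.
print(available_backends({"OPENAI_API_KEY": "sk-proj-xxx"}))
```

If the returned list is empty, pdf_query.py is still usable because it needs no key.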

Cannot extract text from PDF

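The PDF might be scanned, in which case it contains page images rather than extractable text. Install the OCR packages (pytesseract also needs the Tesseract engine installed on your system, and pdf2image needs Poppler):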

pip install pytesseract pdf2image

How It Works

Local Search (pdf_query.py)

Extract → Read text from PDF using pdfplumber
Chunk → Split into 400-word pieces with 80-word overlap
Index → Build TF-IDF index (scikit-learn)
Search → Find top 5 relevant chunks using hybrid search
Display → Show results with page numbers
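
The chunking step can be illustrated with a plain word-based sliding window. This is a simplified sketch using the 400-word size and 80-word overlap stated above; the function name `chunk_words` is hypothetical, and the project's own chunker in rag/ingestion.py may differ (for example, it also has to track page numbers).

```python
def chunk_words(text, chunk_size=400, overlap=80):
    """Split text into overlapping word windows (simplified sketch)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_words(sample)
print(len(pieces), [len(p.split()) for p in pieces])
```

Each window starts 320 words after the previous one, so consecutive chunks share 80 words and a sentence that straddles a boundary still appears whole in at least one chunk.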

AI Search (main.py + API Key)

Local search (steps 1-4 above)
LLM → Send retrieved chunks to Claude/OpenAI/Gemini
Generate → LLM synthesizes natural language answer
Complete → Return answer with citations
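
The hand-off to the LLM amounts to packing the retrieved chunks into a prompt that asks for cited answers. The sketch below works under stated assumptions: the chunk fields `page` and `text` and the function `build_prompt` are hypothetical, and the real prompt in rag/pipeline.py may be worded differently. The fallback sentence is the one shown in the examples in this README.

```python
def build_prompt(question, chunks):
    """Assemble a grounded prompt from retrieved chunks (illustrative only)."""
    # Prefix every chunk with its page so the model can cite it.
    context = "\n\n".join(f"[Page {c['page']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below and cite page numbers.\n"
        'If the answer is not in the context, reply "Not found in the document."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    "What are the major business segments?",
    [{"page": 6, "text": "Adani Enterprises conducts business through multiple segments..."}],
)
print(prompt)
```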

Performance

| Metric | Value |
|---|---|
| Speed | 1-3 seconds per query |
| Accuracy | 70-85% relevant results |
| Memory | ~100MB for typical PDFs |
| Document length | Works up to 500+ pages |

When To Use Each Tool

| Scenario | Use |
|---|---|
| Quick fact lookup | `pdf_query.py` ✅ |
| No internet access | `pdf_query.py` ✅ |
| Complex analysis | `main.py` + API key |
| Budget-conscious | `pdf_query.py` ✅ |
| Natural language synthesis | `main.py` + API key |

FAQs

Q: Do I need an API key?
A: No! Use pdf_query.py for instant local search.

Q: Why are results showing low relevance?
A: The question doesn't match the document's wording well. Try more specific keywords that are likely to appear in the PDF.

Q: Can I use my own PDF?
A: Yes! python pdf_query.py "question" your_file.pdf

Q: How do I add Claude/OpenAI?
A: Set API key → python main.py --pdf file.pdf --model claude

Next Steps

✅ Try pdf_query.py with a sample PDF
✅ Experiment with different question types
✅ Add your own PDF document
✅ (Optional) Set up an API key for AI responses
✅ Integrate with your application

Support & Issues

Check the Troubleshooting section above
Verify PDF is readable: python -c "import pdfplumber; pdfplumber.open('file.pdf')"
Try example queries first to verify setup
Check API key format and permissions

Happy querying! 🚀

🤝 Contributing

We love contributions! Whether it's bug fixes, features, or documentation improvements:

Fork the repository
Create a branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built with pdfplumber for PDF extraction
Search powered by scikit-learn TF-IDF
LLM integrations: Anthropic, OpenAI, Google
Web demo with Streamlit

📊 Project Status

✅ PDF extraction & search
✅ LLM integration (Claude, OpenAI, Gemini)
✅ CLI interface
✅ Web demo (Streamlit)
🔄 In progress: Enhanced UI, batch processing
📋 Planned: Web scraping, URL support, advanced filtering

📞 Contact & Support

Issues & Bugs: GitHub Issues
Discussions: GitHub Discussions
Author: Dilip Patel

⭐ Give us a star if you found this useful!

Made with ❤️ by the RAG team

# Use the OpenAI backend with a PDF at a custom path
python main.py --pdf ./doc.pdf --model openai

Tune chunking

python main.py --pdf ./doc.pdf --chunk-size 300 --chunk-overlap 60 --top-k 8


---

## Chat Commands

| Command | Action |
|---|---|
| *(any question)* | Answer grounded in the PDF |
| `clear` | Reset conversation history |
| `history` | Show recent conversation turns |
| `exit` / `quit` | End session |

---

## Acceptance Tests

Run the full test suite against the Adani Enterprises earnings PDF (or any similar PDF):

```bash
python tests/test_acceptance.py --pdf ./AEL_Earnings_Presentation_Q2-FY26.pdf --model claude
```

Tests covered:

T1 — Grounded fact question (business segments with citations)
T2 — Numeric question (consolidated total income, exact value or "Not found")
T3 — Cross-section question (EBITDA drivers with citations)
T4 — Negative control (CEO email → "Not found in the document.")
T5 — Conversational follow-up (airport performance, 2-turn history-aware)
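
Checks such as T2 and T4 reduce to simple string predicates on the agent's answer. The following is only a sketch of that idea, since the contents of the test file are not shown here: the helper names are hypothetical, and the figure is the one from the financial query example in this README.

```python
NOT_FOUND = "Not found in the document."


def passes_negative_control(answer: str) -> bool:
    """T4: the agent must decline rather than invent an email address."""
    return NOT_FOUND in answer


def passes_numeric(answer: str, expected: str) -> bool:
    """T2: accept the exact figure or an explicit 'Not found' reply."""
    return expected in answer or NOT_FOUND in answer


print(passes_negative_control("Not found in the document."))
print(passes_numeric("Consolidated Total Income for H1-26: ₹44,280.69 crores", "44,280.69"))
print(passes_numeric("Roughly 44 thousand crores", "44,280.69"))
```

The last call fails on purpose: an approximate figure is neither the exact value nor an explicit refusal.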

Architecture

main.py
  └── RAGPipeline (rag/pipeline.py)
        ├── extract_pages()   ← rag/ingestion.py
        │     pdfplumber → pypdf
        ├── chunk_pages()     ← rag/ingestion.py
        │     sentence-aware sliding window
        ├── HybridRetriever() ← rag/retrieval.py
        │     TF-IDF sparse + LSA dense + RRF fusion
        └── answer()
              ↳ build contextual query (history-aware)
              ↳ retrieve top-k chunks
              ↳ build grounded prompt with citations
              ↳ LLMBackend.generate()

ChatSession (rag/chat.py)
  └── REPL loop with bounded history (20 turns)
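
A bounded history of this kind is commonly kept in a `collections.deque` with `maxlen`. The sketch below assumes the 20-turn limit shown above; the class `History` and its `contextual_query` method are hypothetical names, and how the project actually folds history into the retrieval query is not documented here.

```python
from collections import deque


class History:
    """Keep only the most recent turns (hypothetical helper)."""

    def __init__(self, max_turns=20):
        self.turns = deque(maxlen=max_turns)  # oldest turns drop off automatically

    def add(self, question, answer):
        self.turns.append((question, answer))

    def contextual_query(self, question, last_n=2):
        """Prefix the new question with recent questions for retrieval."""
        recent = [q for q, _ in list(self.turns)[-last_n:]]
        return " ".join(recent + [question])


h = History()
for i in range(25):
    h.add(f"question {i}", f"answer {i}")
print(len(h.turns), h.turns[0][0])
print(h.contextual_query("And how did airports perform?"))
```

Prepending recent questions is one simple way to make a follow-up such as "And how did airports perform?" retrievable on its own.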

Configuration

| Flag | Default | Description |
|---|---|---|
| `--pdf` | (required) | Path to PDF |
| `--model` | `claude` | LLM backend: `claude` / `openai` / `gemini` |
| `--top-k` | `6` | Chunks retrieved per query |
| `--chunk-size` | `400` | Target words per chunk |
| `--chunk-overlap` | `80` | Word overlap between chunks |
| `--debug` | off | Show retrieved chunks + scores |

Retrieval Details

Hybrid retrieval combines two complementary signals:

Sparse (TF-IDF) — exact keyword matching; great for specific terms, numbers, abbreviations
Dense (LSA) — latent semantic analysis via SVD; captures synonyms and topic relationships

Rankings from both legs are merged using Reciprocal Rank Fusion (RRF) with the constant k=60. RRF works on ranks rather than raw scores, so it needs no score normalization and is robust to the different score scales of TF-IDF and LSA.

This runs fully offline — no embedding model downloads required.
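
Reciprocal Rank Fusion gives each chunk a score of 1 / (k + rank) in every ranking it appears in, with ranks starting at 1, and sums those scores. A minimal sketch with k=60 as stated above; the function name `rrf_fuse` and the chunk ids are hypothetical, and the implementation in rag/retrieval.py may differ in detail.

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sparse = ["c3", "c1", "c7"]  # hypothetical TF-IDF ranking, best first
dense = ["c1", "c9", "c3"]   # hypothetical LSA ranking, best first
for doc_id, score in rrf_fuse([sparse, dense]):
    print(doc_id, round(score, 5))
```

Here `c1` ends up first because it ranks near the top of both lists, while `c7` and `c9` each appear in only one list and trail behind.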

Extending

Add a vector DB: Replace HybridRetriever with a FAISS/Chroma/pgvector backend while keeping the same interface (retrieve(query, top_k))
Add an embedding model: Replace the LSA dense leg with sentence-transformers for better semantic retrieval
Add reranking: Insert a cross-encoder reranker between retrieval and answer generation
Confidence thresholding: Check the top fusion score; if below a threshold, return "Not found"
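
The shared interface and the thresholding idea can be sketched together. Everything here is an assumption made for illustration: the return type of `retrieve` (a list of `(chunk, score)` pairs, best first), the stand-in class `FixedRetriever`, the helper `retrieve_or_not_found`, and the threshold value 0.02, which would need tuning against real fusion scores.

```python
from typing import List, Protocol, Tuple


class Retriever(Protocol):
    """Interface named in the list above: retrieve(query, top_k)."""

    def retrieve(self, query: str, top_k: int) -> List[Tuple[str, float]]: ...


class FixedRetriever:
    """Stand-in backend returning canned (chunk, score) pairs."""

    def __init__(self, results):
        self.results = results

    def retrieve(self, query, top_k):
        return self.results[:top_k]


def retrieve_or_not_found(retriever: Retriever, query, top_k=6, min_score=0.02):
    """Return hits, or None when the best score is below the threshold."""
    hits = retriever.retrieve(query, top_k)
    if not hits or hits[0][1] < min_score:
        return None  # the caller replies "Not found in the document."
    return hits


weak = FixedRetriever([("unrelated text", 0.005)])
strong = FixedRetriever([("Consolidated Total Income ...", 0.03)])
print(retrieve_or_not_found(weak, "CEO email"))
print(retrieve_or_not_found(strong, "total income"))
```

Any backend that exposes the same `retrieve` method, whether FAISS, Chroma, or pgvector, could be passed to `retrieve_or_not_found` without changing the calling code.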
