📄 RAG Conversational Agent


A powerful Retrieval-Augmented Generation (RAG) system for analyzing PDF documents with optional AI-powered conversational responses.

🚀 Try the Live Demo | No setup required!

✨ Features

Two Query Modes

  • 🔍 Offline Search - No API keys, works instantly with TF-IDF + hybrid search
  • 🤖 AI Mode - Natural language answers with Claude, OpenAI, or Gemini

Production-Ready

  • 📍 Exact page citations for all results
  • 🎯 Smart relevance filtering
  • 💾 Hybrid search (TF-IDF + LSA for accuracy)
  • ⚡ Fast processing (~10 seconds to index a PDF)

  • ✅ Easy to Use - CLI, Python API, and Web Demo available
  • ✅ Open Source - MIT licensed, contributions welcome!


🎯 Quick Start (2 Minutes)

Option 1: Try the Live Demo (Easiest ⭐)

No installation needed. Visit the live demo to test with sample PDFs:

🔗 Open Live Demo

Option 2: Run Locally

1. Install Dependencies

git clone https://github.com/rushi-12320/RAG-Conversational-Agent
cd RAG-Conversational-Agent
pip install -r requirements.txt

2. Search a PDF (No API Key!)

# Search default PDF (report.pdf)
python pdf_query.py "What are the major business segments?"

# Search custom PDF
python pdf_query.py "Your question" your_file.pdf

That's it! Results appear in seconds. 🚀

3. (Optional) Enable AI Responses

Get natural language answers by adding an API key:

# Set your API key (choose one)
export ANTHROPIC_API_KEY="sk-ant-..."    # Claude
export OPENAI_API_KEY="sk-proj-..."      # OpenAI  
export GOOGLE_API_KEY="AIza..."          # Gemini

# Run interactive chat
python main.py --pdf report.pdf --model claude

🎮 Live Demo (Hosted)

Try it instantly without any setup:

👉 Streamlit Cloud Demo

Features:

  • Upload your own PDFs
  • Search with or without AI
  • Interactive chat interface
  • Mobile-friendly

Project Structure

.
├── pdf_query.py           # ⭐ Simple PDF query tool (USE THIS!)
├── main.py                # Full RAG chatbot (requires API key)
├── requirements.txt       # Python dependencies
│
├── rag/
│   ├── ingestion.py       # PDF extraction & chunking
│   ├── retrieval.py       # Hybrid search (TF-IDF + LSA)
│   ├── llm.py             # LLM interfaces (Claude, OpenAI, Gemini)
│   ├── chat.py            # Chat session management
│   └── pipeline.py        # RAG pipeline
│
├── report.pdf             # Sample Adani board outcome document
├── earnings.pdf           # Sample earnings presentation
└── README.md              # This file

How To Use

Method 1: Simple PDF Query (Recommended ⭐)

No API key needed. Works offline.

# Basic usage
python pdf_query.py "your question"

# Examples
python pdf_query.py "What are the major business segments?"
python pdf_query.py "What is the consolidated total income in H1-26?"
python pdf_query.py "What is the CEO's email address?"

# Use different PDF
python pdf_query.py "your question" earnings.pdf

Sample Output

🔍 Searching 'report.pdf' for: What are the major business segments?

======================================================================
✅ Found 3 relevant sections:

📄 Result 1 | Relevance: 45.2% | Page 6-7
----------------------------------------------------------------------
Adani Enterprises conducts business through multiple segments:
Infrastructure, Energy, Mining, Airports, Defence & Aerospace...

📄 Result 2 | Relevance: 38.1% | Page 8-9
----------------------------------------------------------------------
[Additional relevant content...]

======================================================================

Method 2: Full RAG with AI (Requires API Key)

For natural language answers powered by AI:

Step 1: Get an API Key

Option A: Anthropic Claude (Easiest)

  1. Visit https://console.anthropic.com
  2. Sign up → Get free trial credits
  3. Copy your API key

Option B: OpenAI

  1. Visit https://platform.openai.com/account/api-keys
  2. Create new key → Add billing (or use $5 free credit)

Option C: Google Gemini

  1. Go to https://aistudio.google.com
  2. Click "Get API key" → Create in Cloud Console

Step 2: Set Environment Variable

PowerShell:

$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
# OR
$env:OPENAI_API_KEY = "sk-proj-xxx..."
# OR
$env:GOOGLE_API_KEY = "AIza..."

Command Prompt:

set ANTHROPIC_API_KEY=sk-ant-xxx...

Linux/Mac:

export ANTHROPIC_API_KEY="sk-ant-xxx..."

Step 3: Run RAG Agent

# Interactive mode (ask multiple questions)
python main.py --pdf "report.pdf" --model claude

# Single question mode
python main.py --pdf "report.pdf" --model claude --question "What are business segments?"

# Use OpenAI instead
python main.py --pdf "report.pdf" --model openai --question "Your question here"

# Show debug info (retrieved chunks, scores)
python main.py --pdf "report.pdf" --model claude --debug

Command Reference

pdf_query.py (Simple - No API)

python pdf_query.py "question"           # Search report.pdf
python pdf_query.py "question" file.pdf  # Search custom PDF
python pdf_query.py                      # Show help

Features:

  • ✅ No authentication
  • ✅ Instant results (1-3 seconds)
  • ✅ Works offline
  • ✅ Shows page numbers

main.py (Full RAG - Requires API)

python main.py --pdf <file> [options]

Required:

--pdf <file>              Path to PDF file

Optional:

--model <name>            LLM: claude, openai, gemini (default: claude)
--question <text>         Ask one question and exit
--top-k <n>               Retrieved chunks (default: 6)
--chunk-size <n>          Words per chunk (default: 400)
--chunk-overlap <n>       Word overlap (default: 80)
--debug                   Show retrieved chunks and scores

Examples:

# Interactive chat
python main.py --pdf report.pdf --model claude

# Single question
python main.py --pdf report.pdf --model claude --question "What is total income?"

# Show what's being retrieved
python main.py --pdf report.pdf --model openai --debug

# Customize chunks
python main.py --pdf report.pdf --chunk-size 500 --chunk-overlap 100
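The chunking flags map onto a word-based sliding window: each chunk holds up to --chunk-size words, and consecutive chunks share --chunk-overlap words so facts spanning a boundary are not lost. A minimal sketch of the idea (chunk_words is a hypothetical helper, not the project's actual implementation):

```python
def chunk_words(text, chunk_size=400, overlap=80):
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = chunk_size - overlap  # advance this many words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# 1000 words with the defaults -> windows start at 0, 320, 640
print(len(chunk_words(" ".join(str(i) for i in range(1000)))))  # 3
```

Larger chunks give the LLM more context per hit; smaller chunks make page citations more precise.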

Examples

Example 1: Search Business Segments

python pdf_query.py "What are the major business segments discussed in the document?" report.pdf

Expected Result:

✅ Found 5 relevant sections:
- Infrastructure
- Energy & Power
- Mining
- Airports
- Defence & Aerospace

Example 2: Financial Query

python pdf_query.py "What is the consolidated total income in H1-26?" report.pdf

Expected Result:

✅ Found 3 relevant sections:
📄 Result 1 | Relevance: 52.3% | Page 9-10
----------------------------------------------------------------------
Consolidated Total Income for H1-26: ₹44,280.69 crores

Example 3: Not Found Handling

python pdf_query.py "What is the CEO's email address?" report.pdf

Expected Result:

🔍 Searching 'report.pdf' for: What is the CEO's email address?

======================================================================
❌ Not found in the document.

Example 4: AI-Powered Response

$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."
python main.py --pdf report.pdf --model claude --question "Analyze the company's airport segment growth"

Installation Guide (Step-by-Step)

Prerequisites

  • Python 3.8 or higher
  • Windows, Mac, or Linux

Step 1: Create Project Folder

mkdir my-rag-project
cd my-rag-project

Step 2: Download Files

Download these files from the project:

  • main.py
  • rag/ folder (all files inside)
  • pdf_query.py
  • requirements.txt

Step 3: Install Python Dependencies

# Create virtual environment
python -m venv .venv

# Activate it (Windows PowerShell)
.\.venv\Scripts\Activate.ps1
# On Linux/Mac: source .venv/bin/activate

# Install packages
pip install -r requirements.txt

Step 4: Add Your PDF

Place your PDF in the same folder, e.g., report.pdf

Step 5: Start Querying!

python pdf_query.py "Your question here"

Troubleshooting

"ModuleNotFoundError: No module named 'sklearn'"

pip install scikit-learn

"No module named 'pdfplumber'"

pip install pdfplumber pypdf

"FileNotFoundError: report.pdf not found"

Make sure the PDF file is in the same directory:

ls  # Check if PDF exists
python pdf_query.py "question" report.pdf  # Specify filename

"ANTHROPIC_API_KEY not set"

Set your API key first:

$env:ANTHROPIC_API_KEY = "sk-ant-xxx..."

"Error code: 429 - Quota exceeded"

Your API account quota is exhausted. Either:

  1. Use pdf_query.py (no API needed)
  2. Add billing to your API account
  3. Try a different LLM provider

Cannot extract text from PDF

The PDF might be scanned. Install OCR:

pip install pytesseract pdf2image

How It Works

Local Search (pdf_query.py)

  1. Extract → Read text from PDF using pdfplumber
  2. Chunk → Split into 400-word pieces with 80-word overlap
  3. Index → Build TF-IDF index (scikit-learn)
  4. Search → Find top 5 relevant chunks using hybrid search
  5. Display → Show results with page numbers
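Steps 3-4 can be sketched with scikit-learn's TfidfVectorizer. This shows the sparse leg only; the project's retrieval.py layers an LSA leg and rank fusion on top:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def search(chunks, query, top_k=5):
    """Rank chunks by TF-IDF cosine similarity to the query."""
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(chunks)        # index the chunks
    query_vec = vectorizer.transform([query])            # vectorize the query
    scores = cosine_similarity(query_vec, doc_matrix)[0]  # one score per chunk
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]

chunks = ["revenue grew across energy and mining segments",
          "the airports segment handled record passenger traffic",
          "board approved the dividend"]
print(search(chunks, "airport passenger growth", top_k=2))
```

Note the limitation this exposes: plain TF-IDF has no stemming, so "airport" does not match "airports"; the match above comes from "passenger". The LSA leg exists precisely to soften such vocabulary mismatches.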

AI Search (main.py + API Key)

  1. Local search (steps 1-4 above)
  2. LLM → Send retrieved chunks to Claude/OpenAI/Gemini
  3. Generate → LLM synthesizes natural language answer
  4. Complete → Return answer with citations
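Step 2 boils down to packing the retrieved chunks, with their page numbers, into a grounded prompt. A hedged sketch (the exact template and function names in rag/pipeline.py may differ):

```python
def build_prompt(question, retrieved):
    """Assemble a citation-grounded prompt from (page, text) chunks."""
    context = "\n\n".join(f"[Page {page}] {text}" for page, text in retrieved)
    return (
        "Answer using ONLY the context below and cite page numbers. "
        "If the answer is not in the context, reply 'Not found in the document.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What are the major segments?",
    [(6, "Segments: Infrastructure, Energy, Mining..."),
     (8, "Airports handled record traffic...")],
)
print(prompt)
```

Keeping the "Not found" instruction in the prompt is what makes the negative-control behavior (Example 3 above) work in AI mode as well.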

Performance

| Metric | Value |
|---|---|
| Speed | 1-3 seconds per query |
| Accuracy | 70-85% relevant results |
| Memory | ~100 MB for typical PDFs |
| File size | Works up to 500+ pages |

When To Use Each Tool

| Scenario | Use |
|---|---|
| Quick fact lookup | pdf_query.py |
| No internet access | pdf_query.py |
| Complex analysis | main.py + API key |
| Budget-conscious | pdf_query.py |
| Natural language synthesis | main.py + API key |

FAQs

Q: Do I need an API key?
A: No! Use pdf_query.py for instant local search.

Q: Why are results showing low relevance?
A: The question's wording may not match the document's vocabulary. Try more specific keywords, or terms that actually appear in the PDF.

Q: Can I use my own PDF?
A: Yes! python pdf_query.py "question" your_file.pdf

Q: How do I add Claude/OpenAI?
A: Set API key → python main.py --pdf file.pdf --model claude


Next Steps

  1. ✅ Try pdf_query.py with a sample PDF
  2. ✅ Experiment with different question types
  3. ✅ Add your own PDF document
  4. ✅ (Optional) Set up an API key for AI responses
  5. ✅ Integrate with your application

Support & Issues

  • Check Troubleshooting section above
  • Verify PDF is readable: python -c "import pdfplumber; pdfplumber.open('file.pdf')"
  • Try example queries first to verify setup
  • Check API key format and permissions

Happy querying! 🚀


🤝 Contributing

We love contributions! Whether it's bug fixes, features, or documentation improvements:

  1. Fork the repository
  2. Create a branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

See CONTRIBUTING.md for detailed guidelines.


📄 License

MIT License - see LICENSE file for details.



📊 Project Status

  • ✅ PDF extraction & search
  • ✅ LLM integration (Claude, OpenAI, Gemini)
  • ✅ CLI interface
  • ✅ Web demo (Streamlit)
  • 🔄 In progress: Enhanced UI, batch processing
  • 📋 Planned: Web scraping, URL support, advanced filtering


⭐ Give us a star if you found this useful!

Made with ❤️ by the RAG team

## Advanced Usage

Switch model:

```bash
python main.py --pdf ./doc.pdf --model openai
```

Tune chunking:

```bash
python main.py --pdf ./doc.pdf --chunk-size 300 --chunk-overlap 60 --top-k 8
```


---

## Chat Commands

| Command | Action |
|---|---|
| *(any question)* | Answer grounded in the PDF |
| `clear` | Reset conversation history |
| `history` | Show recent conversation turns |
| `exit` / `quit` | End session |

---

## Acceptance Tests

Run the full test suite against the Adani Enterprises earnings PDF (or any similar PDF):

```bash
python tests/test_acceptance.py --pdf ./AEL_Earnings_Presentation_Q2-FY26.pdf --model claude
```

Tests covered:

  1. T1 — Grounded fact question (business segments with citations)
  2. T2 — Numeric question (consolidated total income, exact value or "Not found")
  3. T3 — Cross-section question (EBITDA drivers with citations)
  4. T4 — Negative control (CEO email → "Not found in the document.")
  5. T5 — Conversational follow-up (airport performance, 2-turn history-aware)

## Architecture

```
main.py
  └── RAGPipeline (rag/pipeline.py)
        ├── extract_pages()   ← rag/ingestion.py
        │     pdfplumber → pypdf
        ├── chunk_pages()     ← rag/ingestion.py
        │     sentence-aware sliding window
        ├── HybridRetriever() ← rag/retrieval.py
        │     TF-IDF sparse + LSA dense + RRF fusion
        └── answer()
              ↳ build contextual query (history-aware)
              ↳ retrieve top-k chunks
              ↳ build grounded prompt with citations
              ↳ LLMBackend.generate()

ChatSession (rag/chat.py)
  └── REPL loop with bounded history (20 turns)
```
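The bounded 20-turn history can be modeled with collections.deque, which evicts the oldest turn automatically. This is a sketch of the idea, not the actual rag/chat.py code:

```python
from collections import deque

class History:
    """Keep only the most recent N conversation turns."""
    def __init__(self, max_turns=20):
        self.turns = deque(maxlen=max_turns)  # old turns drop off automatically

    def add(self, question, answer):
        self.turns.append((question, answer))

    def recent(self, n=5):
        return list(self.turns)[-n:]

h = History(max_turns=20)
for i in range(25):
    h.add(f"q{i}", f"a{i}")
print(len(h.turns))   # 20 — the oldest 5 turns were evicted
print(h.turns[0][0])  # q5
```

Bounding the history keeps the contextual query (and the LLM prompt) from growing without limit in long sessions.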

## Configuration

| Flag | Default | Description |
|---|---|---|
| `--pdf` | (required) | Path to PDF |
| `--model` | `claude` | LLM backend: claude / openai / gemini |
| `--top-k` | 6 | Chunks retrieved per query |
| `--chunk-size` | 400 | Target words per chunk |
| `--chunk-overlap` | 80 | Word overlap between chunks |
| `--debug` | off | Show retrieved chunks + scores |

## Retrieval Details

Hybrid retrieval combines two complementary signals:

  • Sparse (TF-IDF) — exact keyword matching; great for specific terms, numbers, abbreviations
  • Dense (LSA) — latent semantic analysis via SVD; captures synonyms and topic relationships

Rankings from both legs are merged using Reciprocal Rank Fusion (RRF, k=60), a parameter-free fusion method that's robust to score scale differences.

This runs fully offline — no embedding model downloads required.
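RRF scores each chunk as the sum of 1/(k + rank) over the ranked lists it appears in, so chunks ranked high by both legs float to the top. A minimal sketch with k=60:

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of chunk ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["c2", "c7", "c1"]  # TF-IDF leg
dense = ["c7", "c3", "c2"]   # LSA leg
print(rrf([sparse, dense]))  # ['c7', 'c2', 'c3', 'c1']
```

c7 wins because it is ranked 1st and 2nd across the two legs, beating c2's 1st-and-3rd; raw scores never need to be comparable, which is exactly why RRF tolerates the different score scales of TF-IDF and LSA.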


## Extending

  • Add a vector DB: Replace HybridRetriever with a FAISS/Chroma/pgvector backend while keeping the same interface (retrieve(query, top_k))
  • Add an embedding model: Replace the LSA dense leg with sentence-transformers for better semantic retrieval
  • Add reranking: Insert a cross-encoder reranker between retrieval and answer generation
  • Confidence thresholding: Check the top fusion score; if below a threshold, return "Not found"
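The last idea can be sketched as a thin wrapper around the retriever; the threshold value and the retriever/generate signatures here are assumptions to tune for your own corpus:

```python
NOT_FOUND = "Not found in the document."

def answer_with_threshold(query, retriever, generate, min_score=0.01):
    """Refuse to answer when the best fused retrieval score is too low."""
    results = retriever(query)  # assumed: list of (chunk, score), best first
    if not results or results[0][1] < min_score:
        return NOT_FOUND
    return generate(query, [chunk for chunk, _ in results])

# toy stand-ins for the retriever and the LLM call
weak = lambda q: [("irrelevant text", 0.002)]
print(answer_with_threshold("CEO email?", weak, lambda q, c: "..."))  # Not found in the document.
```

This gives AI mode the same "refuse rather than hallucinate" behavior that pdf_query.py shows in Example 3.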
