A production-ready Retrieval-Augmented Generation system that extracts intelligence from unstructured PDFs containing text, tables, and visual charts. Ask natural language questions and get precise, citation-backed answers evaluated for quality in real time.
A financial analyst asks: "What was the revenue trend in Q3?" The answer lives across a bar chart on page 5, a revenue table on page 6, and narrative text on page 7. Text-only search tools can't synthesize an answer across all three modalities.
This project solves that.
Upload any PDF document and the system will:
- Extract text, tables, and images from every page
- Index all content in a shared embedding space using ChromaDB (3 separate collections: text, tables, images)
- Answer natural language questions by searching across all modalities simultaneously
- Cite every claim back to its original source (page number, content type)
- Evaluate each response with RAGAS scores (Faithfulness & Answer Relevancy)
| User | Use Case |
|---|---|
| Data Analysts | Query financial reports without scanning hundreds of pages |
| Compliance Teams | Extract specific facts from regulatory filings with source proof |
| Researchers | Search across academic papers containing charts, tables, and prose |
| Engineering Teams | Build searchable knowledge bases from technical documentation |
| Any Organization | Turn static PDF archives into a queryable, citation-backed knowledge system |
A running system with:
- FastAPI backend (port 8000) — handles document ingestion, retrieval, and evaluation
- Streamlit dashboard (port 8501) — interactive UI for uploading, querying, and viewing results
- RAGAS quality scores — every answer is scored for faithfulness (0-1) and relevancy (0-1)
- Source citations — every answer links back to exact text passages, table rows, or chart descriptions
DOCUMENT INGESTION PIPELINE
┌─────────────────────────────────────────────────────────────────────┐
│ PDF Upload │
│ ├── PyMuPDF4LLM ──────► Markdown text (per page) │
│ ├── pdfplumber ───────► Tables as DataFrames │
│ └── PyMuPDF ──────────► Images saved to disk │
│ │
│ Content Processing: │
│ ├── Text ──► Hierarchical Chunker (parent 2000 / child 400) │
│ ├── Tables ──► TableSerializer (markdown + natural language) │
│ └── Images ──► GPT-4o Vision (text descriptions at ingestion) │
│ │
│ Embedding & Storage: │
│ └── OpenAI text-embedding-3-small ──► ChromaDB (3 collections) │
│ + SQLite DocStore (parent documents) │
└─────────────────────────────────────────────────────────────────────┘
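The parent/child split in the diagram above can be sketched in a few lines. This is an illustrative simplification, not the project's actual chunker (`app/services/embedding/chunker.py`): it splits by character count rather than tokens, and the function name `chunk_hierarchical` is hypothetical.

```python
# Sketch of hierarchical parent-child chunking: large parents for LLM
# context, small children (linked back to their parent) for retrieval.
# Sizes mirror the defaults above (parent 2000 / child 400), but the
# real implementation counts tokens, not characters.
import uuid


def chunk_hierarchical(text, parent_size=2000, child_size=400):
    """Return (parents, children); each child carries its parent's id."""
    parents, children = [], []
    for i in range(0, len(text), parent_size):
        parent_id = str(uuid.uuid4())
        parent_text = text[i:i + parent_size]
        parents.append({"id": parent_id, "text": parent_text})
        for j in range(0, len(parent_text), child_size):
            children.append({
                "parent_id": parent_id,  # child -> parent link for lookup
                "text": parent_text[j:j + child_size],
            })
    return parents, children
```

Only the child chunks are embedded and searched; at query time the `parent_id` link swaps each hit for its full parent document before the context reaches the LLM.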
QUERY EXECUTION PIPELINE
┌─────────────────────────────────────────────────────────────────────┐
│ User Question │
│ ├── Embed query ──► Search all 3 ChromaDB collections │
│ ├── Reciprocal Rank Fusion (merge + rerank) │
│ ├── Parent Document Lookup (SQLite ──► full context) │
│ ├── Context Assembly (numbered [Source N] references) │
│ ├── GPT-4o Generation (citation-aware prompt) │
│ └── RAGAS Evaluation (async: Faithfulness + Relevancy) │
│ │
│ Response: { answer, sources[], evaluation{} } │
└─────────────────────────────────────────────────────────────────────┘
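The fusion step above can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is not the project's retriever code; it just shows why rank-based merging sidesteps score incomparability across the three collections. The constant `k=60` is the value conventionally used in the RRF literature.

```python
# Minimal Reciprocal Rank Fusion: each document scores 1 / (k + rank)
# per list it appears in, so raw similarity scores from different
# ChromaDB collections never need to be on the same scale.
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=10):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document that ranks well in all three collections (say, a chunk matched by both its text and its table serialization) accumulates score from each list and rises to the top of the fused ranking.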
| Decision | Choice | Why |
|---|---|---|
| 3 ChromaDB collections | text, table, image separate | Independent tuning per modality |
| Unified embedding model | text-embedding-3-small for all | All modalities in same vector space |
| GPT-4o Vision at ingestion | Summarize images once | Avoids per-query vision API cost |
| Reciprocal Rank Fusion | Rank-based, not score-based | Handles score incomparability across collections |
| Parent-child chunking | Small child chunks retrieve, large parent chunks for LLM | Precise retrieval + sufficient context |
| RAGAS evaluation | LLM-as-judge (async) | Quality monitoring without ground truth |
| Layer | Technology |
|---|---|
| Orchestration | LangChain 0.3 |
| Vector Store | ChromaDB (persistent, 3 collections) |
| LLM | OpenAI GPT-4o (generation + vision) |
| Embeddings | OpenAI text-embedding-3-small |
| API | FastAPI + Uvicorn |
| Frontend | Streamlit (glassmorphism dark theme) |
| Charts | Plotly (gauge charts, trend lines) |
| Evaluation | RAGAS (Faithfulness, Answer Relevancy) |
| PDF Processing | PyMuPDF4LLM + pdfplumber |
| Database | SQLite / SQLAlchemy (docstore + eval logs) |
| Containerization | Docker + Docker Compose |
The fastest way to run the entire project. No Python setup required.
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/Document-Intelligence-Multimodal-rag-agent.git
cd Document-Intelligence-Multimodal-rag-agent
# 2. Set your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# 3. Run with Docker Compose
docker compose up --build
# The system is now running:
# - FastAPI Backend: http://localhost:8000
# - Streamlit Dashboard: http://localhost:8501
# - API Docs (Swagger): http://localhost:8000/docs

Alternatively, to run locally without Docker:

# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/Document-Intelligence-Multimodal-rag-agent.git
cd Document-Intelligence-Multimodal-rag-agent
# 2. Run the setup script
chmod +x setup.sh
./setup.sh
# 3. Set your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY
# 4. Start the backend (Terminal 1)
uvicorn app.main:app --host 0.0.0.0 --port 8000
# 5. Start the dashboard (Terminal 2)
streamlit run streamlit_app.py
# Open http://localhost:8501 in your browser

Use the sidebar in the Streamlit dashboard to upload any PDF document. The system automatically extracts:
- Text from every page (via PyMuPDF4LLM)
- Tables from every page (via pdfplumber)
- Images and converts them to text descriptions (via GPT-4o Vision)
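Tables get a dual serialization before embedding: markdown to preserve structure for the LLM, plus natural-language sentences that embed well for semantic search. A minimal sketch of the idea, using plain dicts in place of the pdfplumber DataFrames (the function name `serialize_table` is illustrative, not the project's API):

```python
# Sketch of dual table serialization: markdown keeps the row/column
# structure readable for the LLM; the sentence form gives the embedding
# model something closer to natural language to match queries against.
def serialize_table(headers, rows):
    md = "| " + " | ".join(headers) + " |\n"
    md += "|" + "---|" * len(headers) + "\n"
    for row in rows:
        md += "| " + " | ".join(str(row[h]) for h in headers) + " |\n"
    sentences = " ".join(
        "; ".join(f"{h}: {row[h]}" for h in headers) + "." for row in rows
    )
    return md, sentences
```

For a revenue table, the sentence form ("Quarter: Q3; Revenue: $5M.") sits much closer in embedding space to a question like "What was Q3 revenue?" than raw pipe-delimited cells would.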
Type natural language questions in the chat interface:
- "What was the total revenue in 2024?"
- "Summarize the tables in this document"
- "What risks are mentioned in the report?"
Each answer includes source citations traced back to original content.
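The citation mechanism works by numbering every retrieved chunk before it reaches the LLM, then returning the same numbering to the UI. A hedged sketch of that assembly step (the real code lives in `context_assembler.py`; `assemble_context` and the dict fields here are assumptions for illustration):

```python
# Sketch of numbered-source context assembly: each retrieved chunk gets
# a [Source N] label the LLM is prompted to cite, and the parallel
# citations list lets the UI resolve N back to page/content-type.
def assemble_context(chunks):
    """chunks: list of dicts with 'text', 'page', 'content_type'."""
    blocks, citations = [], []
    for n, chunk in enumerate(chunks, start=1):
        blocks.append(f"[Source {n}] {chunk['text']}")
        citations.append({
            "source": n,
            "page": chunk["page"],
            "type": chunk["content_type"],
        })
    return "\n\n".join(blocks), citations
```

When the generated answer says "revenue grew 12% [Source 2]", the dashboard can look up entry 2 and display "page 6, table" alongside the claim.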
Enable the RAGAS Eval checkbox to score every answer:
- Faithfulness (0-1): Are all claims supported by the retrieved context?
- Answer Relevancy (0-1): Does the answer address the original question?
View aggregate scores and trends on the Evaluation Dashboard tab.
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | System status + ChromaDB connectivity |
| POST | /api/v1/ingest | Upload PDF — extract, chunk, embed, store |
| POST | /api/v1/query | Ask question — retrieve, generate, cite |
| GET | /api/v1/evaluations/summary | Aggregate RAGAS metrics |
| GET | /api/v1/evaluations/{query_id} | Per-query evaluation scores |
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"question": "What was the total revenue in 2024?",
"top_k": 10,
"include_evaluation": true
  }'

curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@report.pdf"

multimodal-rag-agent/
├── app/
│ ├── main.py # FastAPI app entry point
│ ├── api/
│ │ ├── middleware.py # Error handling middleware
│ │ └── routes/
│ │ ├── health.py # GET /health
│ │ ├── ingest.py # POST /api/v1/ingest
│ │ ├── query.py # POST /api/v1/query
│ │ └── evaluate.py # GET /api/v1/evaluations/*
│ ├── core/
│ │ ├── config.py # Pydantic Settings
│ │ └── dependencies.py # Singleton resources
│ ├── models/
│ │ └── schemas.py # Request/Response models
│ ├── services/
│ │ ├── ingestion/
│ │ │ ├── pdf_processor.py # Text, table, image extraction
│ │ │ ├── table_serializer.py # DataFrame -> markdown/NL
│ │ │ ├── image_summarizer.py # GPT-4o Vision descriptions
│ │ │ └── ingestion_pipeline.py # Orchestrator
│ │ ├── embedding/
│ │ │ ├── chunker.py # Hierarchical parent-child
│ │ │ ├── embedding_service.py # OpenAI embeddings
│ │ │ └── vector_store.py # ChromaDB + SQLite docstore
│ │ ├── retrieval/
│ │ │ ├── multi_collection_retriever.py # 3-collection search + RRF
│ │ │ ├── parent_retriever.py # Child -> parent lookup
│ │ │ ├── context_assembler.py # Numbered source assembly
│ │ │ ├── generation_service.py # GPT-4o with citations
│ │ │ └── query_pipeline.py # End-to-end orchestrator
│ │ └── evaluation/
│ │ ├── ragas_evaluator.py # Faithfulness + Relevancy
│ │ └── eval_store.py # SQLite evaluation storage
│ └── utils/
│ ├── exceptions.py # Custom exception classes
│ └── logging.py # Structured logging
├── streamlit_app.py # Premium Streamlit dashboard
├── tests/ # 37 unit + API tests
│ ├── conftest.py
│ ├── unit/
│ └── api/
├── data/ # Runtime data (gitignored)
│ └── samples/ # Sample PDFs for testing
├── notebooks/ # Jupyter exploration notebooks
├── docker-compose.yml # One-command deployment
├── Dockerfile # Multi-stage production build
├── pyproject.toml # Dependencies & project config
├── setup.sh # Automated local setup
├── .env.example # Environment template
└── README.md # This file
All settings are managed via environment variables (.env file):
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| OPENAI_MODEL | gpt-4o | LLM for generation + evaluation |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model |
| CHROMA_PERSIST_DIR | ./data/chromadb | ChromaDB storage path |
| SQLITE_URL | sqlite:///./data/docstore.db | Parent document store |
| PARENT_CHUNK_SIZE | 2000 | Parent chunk size (tokens) |
| CHILD_CHUNK_SIZE | 400 | Child chunk size (tokens) |
| TOP_K_PER_COLLECTION | 5 | Results per collection |
| TOP_K_FINAL | 10 | Final fused results |
# Run all tests
pytest
# Run with coverage
pytest --cov=app --cov-report=term-missing
# Run specific test file
pytest tests/unit/test_pdf_processor.py -v

37 tests covering:
- PDF text/table/image extraction
- Hierarchical chunking (parent-child linking)
- Reciprocal Rank Fusion scoring
- Context assembly with source numbering
- Document store serialization
- API endpoint validation
The Streamlit dashboard features a premium dark theme with:
- Hero Section — Animated gradient title, feature cards, quick-start suggestion pills
- Chat Interface — Real-time Q&A with typing indicators and glass-styled source cards
- Evaluation Dashboard — Plotly gauge charts for faithfulness/relevancy, trend lines, history table
- Document Library — Visual document cards with chunk distribution bars, system status metrics
- Glassmorphism Design — Frosted glass cards, gradient borders, smooth CSS animations
MIT License. See LICENSE for details.