
Multimodal RAG Agent for Document Intelligence

A production-ready Retrieval-Augmented Generation system that extracts intelligence from unstructured PDFs containing text, tables, and visual charts. Ask natural language questions and get precise, citation-backed answers evaluated for quality in real time.



The Problem

A financial analyst asks: "What was the revenue trend in Q3?" The answer lives across a bar chart on page 5, a revenue table on page 6, and narrative text on page 7. Keyword search and text-only retrieval tools cannot synthesize an answer across all three modalities.

This project solves that.

What It Does

Upload any PDF document and the system will:

  1. Extract text, tables, and images from every page
  2. Index all content in a unified vector space using ChromaDB (3 separate collections)
  3. Answer natural language questions by searching across all modalities simultaneously
  4. Cite every claim back to its original source (page number, content type)
  5. Evaluate each response with RAGAS scores (Faithfulness & Answer Relevancy)

Who Benefits

| User | Use Case |
| --- | --- |
| Data Analysts | Query financial reports without scanning hundreds of pages |
| Compliance Teams | Extract specific facts from regulatory filings with source proof |
| Researchers | Search across academic papers containing charts, tables, and prose |
| Engineering Teams | Build searchable knowledge bases from technical documentation |
| Any Organization | Turn static PDF archives into a queryable, citation-backed knowledge system |

End Result

A running system with:

  • FastAPI backend (port 8000) — handles document ingestion, retrieval, and evaluation
  • Streamlit dashboard (port 8501) — interactive UI for uploading, querying, and viewing results
  • RAGAS quality scores — every answer is scored for faithfulness (0-1) and relevancy (0-1)
  • Source citations — every answer links back to exact text passages, table rows, or chart descriptions

Architecture

                         DOCUMENT INGESTION PIPELINE
 ┌─────────────────────────────────────────────────────────────────────┐
 │  PDF Upload                                                        │
 │    ├── PyMuPDF4LLM ──────► Markdown text (per page)                │
 │    ├── pdfplumber ───────► Tables as DataFrames                    │
 │    └── PyMuPDF ──────────► Images saved to disk                    │
 │                                                                     │
 │  Content Processing:                                                │
 │    ├── Text ──► Hierarchical Chunker (parent 2000 / child 400)     │
 │    ├── Tables ──► TableSerializer (markdown + natural language)     │
 │    └── Images ──► GPT-4o Vision (text descriptions at ingestion)   │
 │                                                                     │
 │  Embedding & Storage:                                               │
 │    └── OpenAI text-embedding-3-small ──► ChromaDB (3 collections)  │
 │        + SQLite DocStore (parent documents)                         │
 └─────────────────────────────────────────────────────────────────────┘
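The hierarchical chunking step can be illustrated with a minimal, character-based sketch. This is not the project's implementation (the real pipeline in app/services/embedding/chunker.py is token-based and built on LangChain splitters); the sizes simply mirror the parent 2000 / child 400 defaults:

```python
import uuid

def hierarchical_chunk(text: str, parent_size: int = 2000, child_size: int = 400):
    """Split text into parent chunks, then split each parent into child chunks.
    Children carry their parent's id so retrieval can later look up the full
    parent for LLM context."""
    parents, children = [], []
    for p_start in range(0, len(text), parent_size):
        parent_id = str(uuid.uuid4())
        parent_text = text[p_start:p_start + parent_size]
        parents.append({"id": parent_id, "text": parent_text})
        for c_start in range(0, len(parent_text), child_size):
            children.append({
                "parent_id": parent_id,
                "text": parent_text[c_start:c_start + child_size],
            })
    return parents, children
```

The small child chunks are what get embedded and searched; the parent id on each child is the link that the parent-document lookup in the query pipeline follows.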

                          QUERY EXECUTION PIPELINE
 ┌─────────────────────────────────────────────────────────────────────┐
 │  User Question                                                      │
 │    ├── Embed query ──► Search all 3 ChromaDB collections           │
 │    ├── Reciprocal Rank Fusion (merge + rerank)                     │
 │    ├── Parent Document Lookup (SQLite ──► full context)            │
 │    ├── Context Assembly (numbered [Source N] references)           │
 │    ├── GPT-4o Generation (citation-aware prompt)                   │
 │    └── RAGAS Evaluation (async: Faithfulness + Relevancy)          │
 │                                                                     │
 │  Response: { answer, sources[], evaluation{} }                      │
 └─────────────────────────────────────────────────────────────────────┘
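The context-assembly step above, which turns retrieved parent chunks into numbered [Source N] blocks for the citation-aware prompt, can be sketched as follows (an illustrative version, not the project's exact code):

```python
def assemble_context(chunks):
    """Format retrieved chunks as numbered [Source N] blocks so the LLM can
    cite them, and build a parallel citation map for the response payload."""
    lines, citations = [], []
    for i, chunk in enumerate(chunks, start=1):
        lines.append(
            f"[Source {i}] (page {chunk['page']}, {chunk['type']})\n{chunk['text']}"
        )
        citations.append({"source": i, "page": chunk["page"], "type": chunk["type"]})
    return "\n\n".join(lines), citations
```

The numbered context string goes into the generation prompt; the citation map is what lets the API return sources[] alongside the answer.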

Key Design Decisions

| Decision | Choice | Why |
| --- | --- | --- |
| 3 ChromaDB collections | Separate text, table, and image collections | Independent tuning per modality |
| Unified embedding model | text-embedding-3-small for all | All modalities in same vector space |
| GPT-4o Vision at ingestion | Summarize images once | Avoids per-query vision API cost |
| Reciprocal Rank Fusion | Rank-based, not score-based | Handles score incomparability across collections |
| Parent-child chunking | Small child chunks for retrieval, large parent chunks for the LLM | Precise retrieval + sufficient context |
| RAGAS evaluation | LLM-as-judge (async) | Quality monitoring without ground truth |
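Reciprocal Rank Fusion fits in a few lines because it uses only each document's rank, never its raw similarity score. Here is the standard formula with the conventional k = 60 constant, as a sketch rather than the project's exact implementation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60, top_n: int = 10):
    """Merge several best-first ranked lists of document ids. Each appearance
    of a document at rank r contributes 1 / (k + r) to its fused score, so
    incomparable raw scores across collections never matter."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document that appears near the top of several collections (say, text and table hits for the same page) outranks one that appears high in only one list, which is exactly the cross-modal behavior the query pipeline needs.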

Tech Stack

| Layer | Technology |
| --- | --- |
| Orchestration | LangChain 0.3 |
| Vector Store | ChromaDB (persistent, 3 collections) |
| LLM | OpenAI GPT-4o (generation + vision) |
| Embeddings | OpenAI text-embedding-3-small |
| API | FastAPI + Uvicorn |
| Frontend | Streamlit (glassmorphism dark theme) |
| Charts | Plotly (gauge charts, trend lines) |
| Evaluation | RAGAS (Faithfulness, Answer Relevancy) |
| PDF Processing | PyMuPDF4LLM + pdfplumber |
| Database | SQLite / SQLAlchemy (docstore + eval logs) |
| Containerization | Docker + Docker Compose |

Quick Start

Option 1: Docker (Recommended)

The fastest way to run the entire project. No Python setup required.

```bash
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/Document-Intelligence-Multimodal-rag-agent.git
cd Document-Intelligence-Multimodal-rag-agent

# 2. Set your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# 3. Run with Docker Compose
docker compose up --build

# The system is now running:
#   - FastAPI Backend:     http://localhost:8000
#   - Streamlit Dashboard: http://localhost:8501
#   - API Docs (Swagger):  http://localhost:8000/docs
```

Option 2: Local Development

```bash
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/Document-Intelligence-Multimodal-rag-agent.git
cd Document-Intelligence-Multimodal-rag-agent

# 2. Run the setup script
chmod +x setup.sh
./setup.sh

# 3. Set your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# 4. Start the backend (Terminal 1)
uvicorn app.main:app --host 0.0.0.0 --port 8000

# 5. Start the dashboard (Terminal 2)
streamlit run streamlit_app.py

# Open http://localhost:8501 in your browser
```

Usage Guide

1. Upload a PDF

Use the sidebar in the Streamlit dashboard to upload any PDF document. The system automatically extracts:

  • Text from every page (via PyMuPDF4LLM)
  • Tables from every page (via pdfplumber)
  • Images, converted to text descriptions (via GPT-4o Vision)

2. Ask Questions

Type natural language questions in the chat interface:

  • "What was the total revenue in 2024?"
  • "Summarize the tables in this document"
  • "What risks are mentioned in the report?"

Each answer includes source citations traced back to original content.

3. Evaluate Quality

Enable the RAGAS Eval checkbox to score every answer:

  • Faithfulness (0-1): Are all claims supported by the retrieved context?
  • Answer Relevancy (0-1): Does the answer address the original question?

View aggregate scores and trends on the Evaluation Dashboard tab.


API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | System status + ChromaDB connectivity |
| POST | /api/v1/ingest | Upload PDF — extract, chunk, embed, store |
| POST | /api/v1/query | Ask question — retrieve, generate, cite |
| GET | /api/v1/evaluations/summary | Aggregate RAGAS metrics |
| GET | /api/v1/evaluations/{query_id} | Per-query evaluation scores |

Example: Query via API

```bash
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What was the total revenue in 2024?",
    "top_k": 10,
    "include_evaluation": true
  }'
```
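The same request can be issued from Python with only the standard library. The helper names below (`build_query_request`, `query_rag`) are illustrative, not part of the project's API; the request body matches the curl example:

```python
import json
import urllib.request

def build_query_request(question: str, base_url: str = "http://localhost:8000",
                        top_k: int = 10, include_evaluation: bool = True):
    """Build the POST request for /api/v1/query with a JSON body."""
    body = json.dumps({
        "question": question,
        "top_k": top_k,
        "include_evaluation": include_evaluation,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query_rag(question: str, **kwargs) -> dict:
    """Send the request and parse the JSON response: answer, sources, evaluation."""
    with urllib.request.urlopen(build_query_request(question, **kwargs)) as resp:
        return json.loads(resp.read())
```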

Example: Ingest via API

```bash
curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@report.pdf"
```

Project Structure

```
multimodal-rag-agent/
├── app/
│   ├── main.py                          # FastAPI app entry point
│   ├── api/
│   │   ├── middleware.py                # Error handling middleware
│   │   └── routes/
│   │       ├── health.py               # GET /health
│   │       ├── ingest.py               # POST /api/v1/ingest
│   │       ├── query.py                # POST /api/v1/query
│   │       └── evaluate.py             # GET /api/v1/evaluations/*
│   ├── core/
│   │   ├── config.py                   # Pydantic Settings
│   │   └── dependencies.py             # Singleton resources
│   ├── models/
│   │   └── schemas.py                  # Request/Response models
│   ├── services/
│   │   ├── ingestion/
│   │   │   ├── pdf_processor.py        # Text, table, image extraction
│   │   │   ├── table_serializer.py     # DataFrame -> markdown/NL
│   │   │   ├── image_summarizer.py     # GPT-4o Vision descriptions
│   │   │   └── ingestion_pipeline.py   # Orchestrator
│   │   ├── embedding/
│   │   │   ├── chunker.py             # Hierarchical parent-child
│   │   │   ├── embedding_service.py   # OpenAI embeddings
│   │   │   └── vector_store.py        # ChromaDB + SQLite docstore
│   │   ├── retrieval/
│   │   │   ├── multi_collection_retriever.py  # 3-collection search + RRF
│   │   │   ├── parent_retriever.py            # Child -> parent lookup
│   │   │   ├── context_assembler.py           # Numbered source assembly
│   │   │   ├── generation_service.py          # GPT-4o with citations
│   │   │   └── query_pipeline.py              # End-to-end orchestrator
│   │   └── evaluation/
│   │       ├── ragas_evaluator.py     # Faithfulness + Relevancy
│   │       └── eval_store.py          # SQLite evaluation storage
│   └── utils/
│       ├── exceptions.py              # Custom exception classes
│       └── logging.py                 # Structured logging
├── streamlit_app.py                    # Premium Streamlit dashboard
├── tests/                              # 37 unit + API tests
│   ├── conftest.py
│   ├── unit/
│   └── api/
├── data/                               # Runtime data (gitignored)
│   └── samples/                        # Sample PDFs for testing
├── notebooks/                          # Jupyter exploration notebooks
├── docker-compose.yml                  # One-command deployment
├── Dockerfile                          # Multi-stage production build
├── pyproject.toml                      # Dependencies & project config
├── setup.sh                            # Automated local setup
├── .env.example                        # Environment template
└── README.md                           # This file
```

Configuration

All settings are managed via environment variables (.env file):

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| OPENAI_MODEL | gpt-4o | LLM for generation + evaluation |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model |
| CHROMA_PERSIST_DIR | ./data/chromadb | ChromaDB storage path |
| SQLITE_URL | sqlite:///./data/docstore.db | Parent document store |
| PARENT_CHUNK_SIZE | 2000 | Parent chunk size (tokens) |
| CHILD_CHUNK_SIZE | 400 | Child chunk size (tokens) |
| TOP_K_PER_COLLECTION | 5 | Results per collection |
| TOP_K_FINAL | 10 | Final fused results |
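A minimal sketch of how these variables might be resolved. The project itself uses Pydantic Settings in app/core/config.py; this stdlib-only version just mirrors the defaults in the table above, including the fail-fast behavior for the one required key:

```python
import os

# Defaults mirroring the configuration table (illustrative, not the actual
# Settings class). Values are kept as strings, as they arrive from the env.
DEFAULTS = {
    "OPENAI_MODEL": "gpt-4o",
    "EMBEDDING_MODEL": "text-embedding-3-small",
    "CHROMA_PERSIST_DIR": "./data/chromadb",
    "SQLITE_URL": "sqlite:///./data/docstore.db",
    "PARENT_CHUNK_SIZE": "2000",
    "CHILD_CHUNK_SIZE": "400",
    "TOP_K_PER_COLLECTION": "5",
    "TOP_K_FINAL": "10",
}

def load_settings(env=os.environ) -> dict:
    """Resolve each setting from the environment, falling back to its default.
    OPENAI_API_KEY has no default, so a missing key fails fast at startup."""
    if "OPENAI_API_KEY" not in env:
        raise RuntimeError("OPENAI_API_KEY is required (set it in .env)")
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}
```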

Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=term-missing

# Run specific test file
pytest tests/unit/test_pdf_processor.py -v
```

37 tests covering:

  • PDF text/table/image extraction
  • Hierarchical chunking (parent-child linking)
  • Reciprocal Rank Fusion scoring
  • Context assembly with source numbering
  • Document store serialization
  • API endpoint validation

Dashboard Preview

The Streamlit dashboard features a premium dark theme with:

  • Hero Section — Animated gradient title, feature cards, quick-start suggestion pills
  • Chat Interface — Real-time Q&A with typing indicators and glass-styled source cards
  • Evaluation Dashboard — Plotly gauge charts for faithfulness/relevancy, trend lines, history table
  • Document Library — Visual document cards with chunk distribution bars, system status metrics
  • Glassmorphism Design — Frosted glass cards, gradient borders, smooth CSS animations

License

MIT License. See LICENSE for details.

