
Multimodal RAG Agent for Document Intelligence

A production-ready Retrieval-Augmented Generation system that extracts intelligence from unstructured PDFs containing text, tables, and visual charts. Ask natural language questions and get precise, citation-backed answers evaluated for quality in real time.



The Problem

A financial analyst asks: "What was the revenue trend in Q3?" The answer lives across a bar chart on page 5, a revenue table on page 6, and narrative text on page 7. Keyword search and text-only retrieval tools cannot synthesize an answer across all three modalities.

This project solves that.

What It Does

Upload any PDF document and the system will:

  1. Extract text, tables, and images from every page
  2. Index all content in a unified vector space using ChromaDB (3 separate collections)
  3. Answer natural language questions by searching across all modalities simultaneously
  4. Cite every claim back to its original source (page number, content type)
  5. Evaluate each response with RAGAS scores (Faithfulness & Answer Relevancy)

Who Benefits

| User | Use Case |
| --- | --- |
| Data Analysts | Query financial reports without scanning hundreds of pages |
| Compliance Teams | Extract specific facts from regulatory filings with source proof |
| Researchers | Search across academic papers containing charts, tables, and prose |
| Engineering Teams | Build searchable knowledge bases from technical documentation |
| Any Organization | Turn static PDF archives into a queryable, citation-backed knowledge system |

End Result

A running system with:

  • FastAPI backend (port 8000) — handles document ingestion, retrieval, and evaluation
  • Streamlit dashboard (port 8501) — interactive UI for uploading, querying, and viewing results
  • RAGAS quality scores — every answer is scored for faithfulness (0-1) and relevancy (0-1)
  • Source citations — every answer links back to exact text passages, table rows, or chart descriptions

Architecture

                         DOCUMENT INGESTION PIPELINE
 ┌─────────────────────────────────────────────────────────────────────┐
 │  PDF Upload                                                        │
 │    ├── PyMuPDF4LLM ──────► Markdown text (per page)                │
 │    ├── pdfplumber ───────► Tables as DataFrames                    │
 │    └── PyMuPDF ──────────► Images saved to disk                    │
 │                                                                     │
 │  Content Processing:                                                │
 │    ├── Text ──► Hierarchical Chunker (parent 2000 / child 400)     │
 │    ├── Tables ──► TableSerializer (markdown + natural language)     │
 │    └── Images ──► GPT-4o Vision (text descriptions at ingestion)   │
 │                                                                     │
 │  Embedding & Storage:                                               │
 │    └── OpenAI text-embedding-3-small ──► ChromaDB (3 collections)  │
 │        + SQLite DocStore (parent documents)                         │
 └─────────────────────────────────────────────────────────────────────┘
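The hierarchical chunking step can be illustrated with a minimal, character-based sketch. This is not the project's implementation (the real pipeline in app/services/embedding/chunker.py is token-based and built on LangChain splitters); the sizes simply mirror the parent 2000 / child 400 defaults:

```python
import uuid

def hierarchical_chunk(text: str, parent_size: int = 2000, child_size: int = 400):
    """Split text into parent chunks, then split each parent into child chunks.
    Children carry their parent's id so retrieval can later look up the full
    parent for LLM context."""
    parents, children = [], []
    for p_start in range(0, len(text), parent_size):
        parent_id = str(uuid.uuid4())
        parent_text = text[p_start:p_start + parent_size]
        parents.append({"id": parent_id, "text": parent_text})
        for c_start in range(0, len(parent_text), child_size):
            children.append({
                "parent_id": parent_id,
                "text": parent_text[c_start:c_start + child_size],
            })
    return parents, children
```

The small child chunks are what get embedded and searched; the parent id on each child is the link that the parent-document lookup in the query pipeline follows.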

                          QUERY EXECUTION PIPELINE
 ┌─────────────────────────────────────────────────────────────────────┐
 │  User Question                                                      │
 │    ├── Embed query ──► Search all 3 ChromaDB collections           │
 │    ├── Reciprocal Rank Fusion (merge + rerank)                     │
 │    ├── Parent Document Lookup (SQLite ──► full context)            │
 │    ├── Context Assembly (numbered [Source N] references)           │
 │    ├── GPT-4o Generation (citation-aware prompt)                   │
 │    └── RAGAS Evaluation (async: Faithfulness + Relevancy)          │
 │                                                                     │
 │  Response: { answer, sources[], evaluation{} }                      │
 └─────────────────────────────────────────────────────────────────────┘
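The context-assembly step above, which turns retrieved parent chunks into numbered [Source N] blocks for the citation-aware prompt, can be sketched as follows (an illustrative version, not the project's exact code):

```python
def assemble_context(chunks):
    """Format retrieved chunks as numbered [Source N] blocks so the LLM can
    cite them, and build a parallel citation map for the response payload."""
    lines, citations = [], []
    for i, chunk in enumerate(chunks, start=1):
        lines.append(
            f"[Source {i}] (page {chunk['page']}, {chunk['type']})\n{chunk['text']}"
        )
        citations.append({"source": i, "page": chunk["page"], "type": chunk["type"]})
    return "\n\n".join(lines), citations
```

The numbered context string goes into the generation prompt; the citation map is what lets the API return sources[] alongside the answer.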

Key Design Decisions

| Decision | Choice | Why |
| --- | --- | --- |
| 3 ChromaDB collections | Separate text, table, and image collections | Independent tuning per modality |
| Unified embedding model | text-embedding-3-small for all | All modalities in same vector space |
| GPT-4o Vision at ingestion | Summarize images once | Avoids per-query vision API cost |
| Reciprocal Rank Fusion | Rank-based, not score-based | Handles score incomparability across collections |
| Parent-child chunking | Small child chunks for retrieval, large parent chunks for the LLM | Precise retrieval + sufficient context |
| RAGAS evaluation | LLM-as-judge (async) | Quality monitoring without ground truth |
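Reciprocal Rank Fusion fits in a few lines because it uses only each document's rank, never its raw similarity score. Here is the standard formula with the conventional k = 60 constant, as a sketch rather than the project's exact implementation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60, top_n: int = 10):
    """Merge several best-first ranked lists of document ids. Each appearance
    of a document at rank r contributes 1 / (k + r) to its fused score, so
    incomparable raw scores across collections never matter."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document that appears near the top of several collections (say, text and table hits for the same page) outranks one that appears high in only one list, which is exactly the cross-modal behavior the query pipeline needs.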

Tech Stack

| Layer | Technology |
| --- | --- |
| Orchestration | LangChain 0.3 |
| Vector Store | ChromaDB (persistent, 3 collections) |
| LLM | OpenAI GPT-4o (generation + vision) |
| Embeddings | OpenAI text-embedding-3-small |
| API | FastAPI + Uvicorn |
| Frontend | Streamlit (glassmorphism dark theme) |
| Charts | Plotly (gauge charts, trend lines) |
| Evaluation | RAGAS (Faithfulness, Answer Relevancy) |
| PDF Processing | PyMuPDF4LLM + pdfplumber |
| Database | SQLite / SQLAlchemy (docstore + eval logs) |
| Containerization | Docker + Docker Compose |

Quick Start

Option 1: Docker (Recommended)

The fastest way to run the entire project. No Python setup required.

```bash
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/Document-Intelligence-Multimodal-rag-agent.git
cd Document-Intelligence-Multimodal-rag-agent

# 2. Set your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# 3. Run with Docker Compose
docker compose up --build

# The system is now running:
#   - FastAPI Backend:     http://localhost:8000
#   - Streamlit Dashboard: http://localhost:8501
#   - API Docs (Swagger):  http://localhost:8000/docs
```

Option 2: Local Development

```bash
# 1. Clone the repository
git clone https://github.com/YOUR_USERNAME/Document-Intelligence-Multimodal-rag-agent.git
cd Document-Intelligence-Multimodal-rag-agent

# 2. Run the setup script
chmod +x setup.sh
./setup.sh

# 3. Set your OpenAI API key
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY

# 4. Start the backend (Terminal 1)
uvicorn app.main:app --host 0.0.0.0 --port 8000

# 5. Start the dashboard (Terminal 2)
streamlit run streamlit_app.py

# Open http://localhost:8501 in your browser
```

Usage Guide

1. Upload a PDF

Use the sidebar in the Streamlit dashboard to upload any PDF document. The system automatically extracts:

  • Text from every page (via PyMuPDF4LLM)
  • Tables from every page (via pdfplumber)
  • Images, converted to text descriptions (via GPT-4o Vision)

2. Ask Questions

Type natural language questions in the chat interface:

  • "What was the total revenue in 2024?"
  • "Summarize the tables in this document"
  • "What risks are mentioned in the report?"

Each answer includes source citations traced back to original content.

3. Evaluate Quality

Enable the RAGAS Eval checkbox to score every answer:

  • Faithfulness (0-1): Are all claims supported by the retrieved context?
  • Answer Relevancy (0-1): Does the answer address the original question?

View aggregate scores and trends on the Evaluation Dashboard tab.


API Reference

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | System status + ChromaDB connectivity |
| POST | /api/v1/ingest | Upload PDF — extract, chunk, embed, store |
| POST | /api/v1/query | Ask question — retrieve, generate, cite |
| GET | /api/v1/evaluations/summary | Aggregate RAGAS metrics |
| GET | /api/v1/evaluations/{query_id} | Per-query evaluation scores |

Example: Query via API

```bash
curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What was the total revenue in 2024?",
    "top_k": 10,
    "include_evaluation": true
  }'
```
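The same request can be issued from Python with only the standard library. The helper names below (`build_query_request`, `query_rag`) are illustrative, not part of the project's API; the request body matches the curl example:

```python
import json
import urllib.request

def build_query_request(question: str, base_url: str = "http://localhost:8000",
                        top_k: int = 10, include_evaluation: bool = True):
    """Build the POST request for /api/v1/query with a JSON body."""
    body = json.dumps({
        "question": question,
        "top_k": top_k,
        "include_evaluation": include_evaluation,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def query_rag(question: str, **kwargs) -> dict:
    """Send the request and parse the JSON response: answer, sources, evaluation."""
    with urllib.request.urlopen(build_query_request(question, **kwargs)) as resp:
        return json.loads(resp.read())
```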

Example: Ingest via API

```bash
curl -X POST http://localhost:8000/api/v1/ingest \
  -F "file=@report.pdf"
```

Project Structure

```
multimodal-rag-agent/
├── app/
│   ├── main.py                          # FastAPI app entry point
│   ├── api/
│   │   ├── middleware.py                # Error handling middleware
│   │   └── routes/
│   │       ├── health.py               # GET /health
│   │       ├── ingest.py               # POST /api/v1/ingest
│   │       ├── query.py                # POST /api/v1/query
│   │       └── evaluate.py             # GET /api/v1/evaluations/*
│   ├── core/
│   │   ├── config.py                   # Pydantic Settings
│   │   └── dependencies.py             # Singleton resources
│   ├── models/
│   │   └── schemas.py                  # Request/Response models
│   ├── services/
│   │   ├── ingestion/
│   │   │   ├── pdf_processor.py        # Text, table, image extraction
│   │   │   ├── table_serializer.py     # DataFrame -> markdown/NL
│   │   │   ├── image_summarizer.py     # GPT-4o Vision descriptions
│   │   │   └── ingestion_pipeline.py   # Orchestrator
│   │   ├── embedding/
│   │   │   ├── chunker.py             # Hierarchical parent-child
│   │   │   ├── embedding_service.py   # OpenAI embeddings
│   │   │   └── vector_store.py        # ChromaDB + SQLite docstore
│   │   ├── retrieval/
│   │   │   ├── multi_collection_retriever.py  # 3-collection search + RRF
│   │   │   ├── parent_retriever.py            # Child -> parent lookup
│   │   │   ├── context_assembler.py           # Numbered source assembly
│   │   │   ├── generation_service.py          # GPT-4o with citations
│   │   │   └── query_pipeline.py              # End-to-end orchestrator
│   │   └── evaluation/
│   │       ├── ragas_evaluator.py     # Faithfulness + Relevancy
│   │       └── eval_store.py          # SQLite evaluation storage
│   └── utils/
│       ├── exceptions.py              # Custom exception classes
│       └── logging.py                 # Structured logging
├── streamlit_app.py                    # Premium Streamlit dashboard
├── tests/                              # 37 unit + API tests
│   ├── conftest.py
│   ├── unit/
│   └── api/
├── data/                               # Runtime data (gitignored)
│   └── samples/                        # Sample PDFs for testing
├── notebooks/                          # Jupyter exploration notebooks
├── docker-compose.yml                  # One-command deployment
├── Dockerfile                          # Multi-stage production build
├── pyproject.toml                      # Dependencies & project config
├── setup.sh                            # Automated local setup
├── .env.example                        # Environment template
└── README.md                           # This file
```

Configuration

All settings are managed via environment variables (.env file):

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | (required) | Your OpenAI API key |
| OPENAI_MODEL | gpt-4o | LLM for generation + evaluation |
| EMBEDDING_MODEL | text-embedding-3-small | Embedding model |
| CHROMA_PERSIST_DIR | ./data/chromadb | ChromaDB storage path |
| SQLITE_URL | sqlite:///./data/docstore.db | Parent document store |
| PARENT_CHUNK_SIZE | 2000 | Parent chunk size (tokens) |
| CHILD_CHUNK_SIZE | 400 | Child chunk size (tokens) |
| TOP_K_PER_COLLECTION | 5 | Results per collection |
| TOP_K_FINAL | 10 | Final fused results |
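A minimal sketch of how these variables might be resolved. The project itself uses Pydantic Settings in app/core/config.py; this stdlib-only version just mirrors the defaults in the table above, including the fail-fast behavior for the one required key:

```python
import os

# Defaults mirroring the configuration table (illustrative, not the actual
# Settings class). Values are kept as strings, as they arrive from the env.
DEFAULTS = {
    "OPENAI_MODEL": "gpt-4o",
    "EMBEDDING_MODEL": "text-embedding-3-small",
    "CHROMA_PERSIST_DIR": "./data/chromadb",
    "SQLITE_URL": "sqlite:///./data/docstore.db",
    "PARENT_CHUNK_SIZE": "2000",
    "CHILD_CHUNK_SIZE": "400",
    "TOP_K_PER_COLLECTION": "5",
    "TOP_K_FINAL": "10",
}

def load_settings(env=os.environ) -> dict:
    """Resolve each setting from the environment, falling back to its default.
    OPENAI_API_KEY has no default, so a missing key fails fast at startup."""
    if "OPENAI_API_KEY" not in env:
        raise RuntimeError("OPENAI_API_KEY is required (set it in .env)")
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}
```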

Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=app --cov-report=term-missing

# Run specific test file
pytest tests/unit/test_pdf_processor.py -v
```

37 tests covering:

  • PDF text/table/image extraction
  • Hierarchical chunking (parent-child linking)
  • Reciprocal Rank Fusion scoring
  • Context assembly with source numbering
  • Document store serialization
  • API endpoint validation

Dashboard Preview

The Streamlit dashboard features a premium dark theme with:

  • Hero Section — Animated gradient title, feature cards, quick-start suggestion pills
  • Chat Interface — Real-time Q&A with typing indicators and glass-styled source cards
  • Evaluation Dashboard — Plotly gauge charts for faithfulness/relevancy, trend lines, history table
  • Document Library — Visual document cards with chunk distribution bars, system status metrics
  • Glassmorphism Design — Frosted glass cards, gradient borders, smooth CSS animations

License

MIT License. See LICENSE for details.

