Skip to content

AbdallahIbrahim27/RAG-Chatbot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

64 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🤖 RAG: Production-Ready RAG Chatbot System

Python Version FastAPI PostgreSQL License Status

A modular, production-ready Retrieval-Augmented Generation (RAG) system for intelligent question answering over document collections, powered by PostgreSQL and GPU-accelerated inference.

FeaturesQuick StartDocumentationArchitectureAPI Reference


📖 Table of Contents


🎯 Overview

RAG System is a minimalist yet powerful implementation of a Retrieval-Augmented Generation (RAG) system designed for building intelligent chatbots that answer questions based on your document collections. The system combines semantic search with large language models to provide accurate, context-aware responses, leveraging PostgreSQL with PGVector for efficient vector storage and Google Colab GPU infrastructure for accelerated model inference.

What is RAG?

Retrieval-Augmented Generation enhances LLM responses by:

  1. Retrieving relevant context from a knowledge base using vector similarity search
  2. Augmenting the user query with retrieved information
  3. Generating accurate answers using the enriched context via GPU-accelerated LLMs

Use Cases

  • 📚 Document Q&A: Query large document collections (PDFs, text files)
  • 🏢 Enterprise Knowledge Base: Build internal chatbots over company documentation
  • 📖 Research Assistant: Quickly find and summarize information from research papers
  • 🎓 Educational Tools: Create tutoring systems based on course materials
  • 💼 Customer Support: Automate responses using product documentation

✨ Features

Core Capabilities

  • 🔍 Hybrid Vector Search: Dual vector database support with Qdrant and PGVector (PostgreSQL)
  • 🚀 GPU-Accelerated Inference: Ollama server running on Google Colab T4 GPU for high-performance local LLM inference
  • 🌐 Secure Remote Access: ngrok tunneling for secure public endpoint exposure to Colab-hosted models
  • 🤖 Multi-LLM Support: Compatible with OpenAI GPT, Cohere, and local Ollama models (Gemma, Qwen)
  • 📄 Document Processing: Automatic chunking and processing of PDF and TXT files
  • 🗄️ PostgreSQL Backend: Robust relational database for metadata and document storage
  • 📊 PGVector Integration: Native PostgreSQL vector similarity search capabilities
  • 🌍 Multilingual: Built-in support for English and Arabic prompts
  • 🎯 Project Management: Organize documents into separate projects
  • Batch Processing: Efficient batch embedding and indexing
  • 🔌 FAST API: Clean, well-documented FastAPI endpoints

Technical Highlights

  • Modular Architecture: Clean separation of concerns (routes, controllers, models, stores)
  • Provider Abstraction: Easily swap LLM and vector DB providers via factory pattern
  • Production Ready: PostgreSQL integration, Docker support, environment-based configuration
  • Template System: Customizable prompt templates with locale support
  • Comprehensive Error Handling: Robust validation and error responses
  • Scalable Design: Supports concurrent requests and batch operations
  • Cloud GPU Integration: Leverage free Google Colab T4 GPUs for cost-effective inference

🏗️ Architecture

System Design

┌─────────────────────────────────────────────────────────────────┐
│                         Client Layer                             │
│                   (FAST API Endpoints)                           │
└────────────────────────┬────────────────────────────────────────┘
                         │
┌────────────────────────▼────────────────────────────────────────┐
│                    Business Logic Layer                          │
│              (Controllers - Orchestration)                       │
└─────┬──────────────────┬──────────────────┬─────────────────────┘
      │                  │                  │
┌─────▼─────┐   ┌────────▼────────┐   ┌────▼────────────────────────────────┐
│   Data    │   │   Persistence   │   │          External Services            │
│   Layer   │   │     Layer       │   │                                        │
│           │   │                 │   │  ┌─────────────────────────────────┐ │
│  Models   │   │   PostgreSQL    │   │  │        LLM Providers             │ │
│  (CRUD)   │   │  (Documents &   │   │  │  • OpenAI (API)                  │ │
│           │   │   Metadata)     │   │  │  • Cohere (API)                  │ │
│           │   │                 │   │  │  • Ollama (GPU-Accelerated)      │ │
│           │   │                 │   │  │    - Gemma                       │ │
└───────────┘   └─────────────────┘   │  │    - Qwen                        │ │
                                      │  └─────────────────────────────────┘ │
                                      │                                        │
                                      │  ┌─────────────────────────────────┐ │
                                      │  │     Vector Database (Dual)       │ │
                                      │  │  • Qdrant (Dedicated Vector DB)  │ │
                                      │  │  • PGVector (PostgreSQL Ext.)    │ │
                                      │  └─────────────────────────────────┘ │
                                      │                                        │
                                      │  ┌─────────────────────────────────┐ │
                                      │  │   Ollama GPU Runtime (Remote)    │ │
                                      │  │  • Google Colab (Free T4 GPU)    │ │
                                      │  │  • ngrok Tunnel (Secure Access)  │ │
                                      │  │  • FAST API Endpoint             │ │
                                      │  └─────────────────────────────────┘ │
                                      └────────────────────────────────────────┘

Component Overview

Component Responsibility Technology
API Layer HTTP request handling, validation FastAPI, Pydantic
Controllers Business logic orchestration Python
Models Database CRUD operations SQLAlchemy, Psycopg3
LLM Store LLM provider integration & routing OpenAI SDK, Cohere SDK, Ollama
Vector Store Semantic search operations (dual provider) Qdrant, PGVector
Relational DB Metadata & document management PostgreSQL 15+
Vector Extension Native PostgreSQL vector operations PGVector
Ollama Server Local LLM inference runtime (remote) Ollama (Gemma 2B/7B, Qwen 2.5)
GPU Runtime Accelerated model inference (cloud) Google Colab (Tesla T4 16GB)
Tunneling Secure public endpoint exposure ngrok (HTTPS)
Helpers Configuration, utilities Pydantic Settings

📋 Prerequisites

System Requirements

  • Python: 3.8 or later
  • Docker: 20.10+ (for database services)
  • Docker Compose: 1.29+
  • Memory: Minimum 4GB RAM recommended (8GB preferred)
  • Storage: 5GB free space (for models and data)

Cloud Infrastructure

  • Google Account: For accessing Google Colab
  • Google Colab: Free tier with T4 GPU runtime (15GB VRAM)
  • ngrok Account: Free tier for secure tunneling (optional but recommended)

API Keys Required

  • OpenAI API Key (for GPT models and embeddings) - OR
  • Cohere API Key (alternative LLM provider)
  • ngrok Auth Token (optional, for persistent tunnels)

🚀 Installation

Step 1: Local Environment Setup

Option 1: Using Conda (Recommended)

# Download and install Miniconda
# Visit: https://docs.anaconda.com/free/miniconda/#quick-command-line-install

# Create virtual environment
conda create -n mini-rag python=3.8 -y

# Activate environment
conda activate mini-rag

# Install dependencies
pip install -r requirements.txt

Option 2: Using venv

# Create virtual environment
python -m venv venv

# Activate environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Step 2: Database Services Setup

# Navigate to docker directory
cd docker

# Copy environment template
cp .env.example .env

# Update .env with your credentials (see Configuration section)
nano .env  # or use your preferred editor

# Start PostgreSQL and Qdrant services
docker compose up -d

# Verify services are running
docker compose ps

# Check PostgreSQL is accessible
docker compose exec postgres psql -U admin -d mini_rag_db -c "\l"

# Verify PGVector extension is loaded
docker compose exec postgres psql -U admin -d mini_rag_db -c "CREATE EXTENSION IF NOT EXISTS vector;"

Step 3: Google Colab Ollama Server Setup

3.1: Create Colab Notebook

  1. Visit Google Colab
  2. Create a new notebook
  3. Enable GPU Runtime:
    • Click RuntimeChange runtime type
    • Select T4 GPU under Hardware accelerator
    • Click Save

3.2: Install and Configure Ollama

Add the following cells to your Colab notebook:

# Cell 1: Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh

# Cell 2: Start Ollama server in background
import subprocess
import time

# Start Ollama server
ollama_process = subprocess.Popen(['ollama', 'serve'], 
                                   stdout=subprocess.PIPE, 
                                   stderr=subprocess.PIPE)
time.sleep(5)  # Wait for server to start
print("Ollama server started")

# Cell 3: Pull your preferred models
!ollama pull gemma:2b       # Lightweight model (1.4GB)
# OR
!ollama pull gemma:7b       # More capable model (4.8GB)
# OR
!ollama pull qwen2.5:3b     # Alternative model (2GB)

# Verify installation
!ollama list

3.3: Setup ngrok Tunnel

# Cell 4: Install ngrok
!pip install pyngrok

# Cell 5: Configure and start ngrok tunnel
from pyngrok import ngrok

# Optional: Set your ngrok auth token for persistent URLs
# Sign up at https://ngrok.com and get your token
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")  # Replace with your token

# Create tunnel to Ollama server (port 11434)
public_url = ngrok.connect(11434, "http")
print(f"\n🚀 Ollama Server Public URL: {public_url}")
print(f"\n📋 Copy this URL to your .env file as OLLAMA_BASE_URL")

# Keep the tunnel alive
import time
print("\n✅ Tunnel is active. Keep this cell running!")
try:
    while True:
        time.sleep(60)
except KeyboardInterrupt:
    print("\n🛑 Tunnel stopped")

Expected Output:

🚀 Ollama Server Public URL: https://1234-5678-9abc-def0.ngrok-free.app
📋 Copy this URL to your .env file as OLLAMA_BASE_URL
✅ Tunnel is active. Keep this cell running!

3.4: Keep Colab Session Alive

Important: Colab sessions timeout after inactivity. Use one of these methods:

Method 1: Run this cell to simulate activity

# Cell 6: Auto-click to prevent disconnect
from IPython.display import Javascript
display(Javascript('''
function ClickConnect(){
    console.log("Clicking");
    document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect, 60000)
'''))

Method 2: Use browser extension (e.g., Colab Autoclick)

Method 3: Upgrade to Colab Pro for longer sessions


⚙️ Configuration

Main Environment Setup (.env in project root)

# Copy environment template
cp .env.example .env

Configure .env File

# ============================================
# LLM Provider Settings
# ============================================
# API-based providers
OPENAI_API_KEY=sk-your-openai-api-key-here
COHERE_API_KEY=your-cohere-api-key-here  # Optional

# Ollama (GPU-accelerated on Colab)
OLLAMA_BASE_URL=https://1234-5678-9abc-def0.ngrok-free.app  # From ngrok output
OLLAMA_MODEL=gemma:2b  # Options: gemma:2b, gemma:7b, qwen2.5:3b

# Select active LLM provider
LLM_PROVIDER=ollama  # Options: openai, cohere, ollama

# Model Configuration (for API providers)
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4-turbo-preview
COHERE_EMBEDDING_MODEL=embed-english-v3.0
COHERE_CHAT_MODEL=command-r-plus

# ============================================
# Vector Database Settings
# ============================================
# Primary vector database
VECTOR_DB_PROVIDER=qdrant  # Options: qdrant, pgvector

# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_GRPC_PORT=6334
QDRANT_API_KEY=  # Optional for local deployment

# PGVector Configuration (uses PostgreSQL)
PGVECTOR_HOST=localhost
PGVECTOR_PORT=5432
PGVECTOR_DATABASE=mini_rag_db
PGVECTOR_USER=admin
PGVECTOR_PASSWORD=your-secure-password

# Vector Search Parameters
VECTOR_SIZE=1536  # Must match embedding model output
DISTANCE_METRIC=cosine  # Options: cosine, euclidean, dot

# ============================================
# PostgreSQL Settings
# ============================================
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=mini_rag_db
POSTGRES_USER=admin
POSTGRES_PASSWORD=your-secure-password
POSTGRES_SCHEMA=public

# Connection Pool Settings
POSTGRES_POOL_SIZE=10
POSTGRES_MAX_OVERFLOW=20

# ============================================
# Application Settings
# ============================================
# Document Processing
CHUNKING_SIZE=500  # Characters per chunk
CHUNKING_OVERLAP=50  # Overlap between chunks
BATCH_SIZE=100  # Chunks per batch for embedding

# Retrieval Settings
TOP_K_RESULTS=5  # Number of chunks to retrieve

# Language Settings
DEFAULT_LOCALE=en  # Options: en, ar

# API Settings
API_HOST=0.0.0.0
API_PORT=5000
API_WORKERS=4

Docker Environment (docker/.env)

# ============================================
# PostgreSQL Configuration
# ============================================
POSTGRES_USER=admin
POSTGRES_PASSWORD=your-secure-password
POSTGRES_DB=mini_rag_db

# PGVector Extension
POSTGRES_EXTENSIONS=vector

# Resource Limits
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SHARED_BUFFERS=256MB

# ============================================
# Qdrant Configuration
# ============================================
QDRANT_API_KEY=  # Optional for local deployment
QDRANT_STORAGE_PATH=/qdrant/storage

# Resource Limits
QDRANT_MAX_CONCURRENT_REQUESTS=100

Docker Compose Configuration

Update docker/docker-compose.yml to include PostgreSQL with PGVector:

version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg15
    container_name: mini-rag-postgres
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    command: postgres -c shared_buffers=${POSTGRES_SHARED_BUFFERS:-256MB}
    FASTart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
      interval: 10s
      timeout: 5s
      retries: 5

  qdrant:
    image: qdrant/qdrant:latest
    container_name: mini-rag-qdrant
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
    FASTart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:
  qdrant_data:

PGVector Initialization Script

Create docker/init-scripts/01-init-pgvector.sql:

-- Enable PGVector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create vector index table for embeddings
CREATE TABLE IF NOT EXISTS vector_embeddings (
    id SERIAL PRIMARY KEY,
    chunk_id VARCHAR(255) UNIQUE NOT NULL,
    project_id VARCHAR(255) NOT NULL,
    embedding vector(1536),  -- Adjust size based on your embedding model
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create index for fast similarity search
CREATE INDEX IF NOT EXISTS vector_embeddings_embedding_idx 
ON vector_embeddings 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Create index for project filtering
CREATE INDEX IF NOT EXISTS vector_embeddings_project_idx 
ON vector_embeddings(project_id);

-- Grant permissions
GRANT ALL PRIVILEGES ON TABLE vector_embeddings TO admin;
GRANT USAGE, SELECT ON SEQUENCE vector_embeddings_id_seq TO admin;

📚 Usage

Start the Application

Step 1: Start Docker Services

# Navigate to docker directory
cd docker

# Start PostgreSQL and Qdrant
docker compose up -d

# Verify services are healthy
docker compose ps

# Check logs if needed
docker compose logs -f postgres
docker compose logs -f qdrant

Step 2: Start Colab Ollama Server (if using local models)

  1. Open your Google Colab notebook
  2. Run all cells to start Ollama and ngrok
  3. Copy the ngrok URL to your .env file

Step 3: Start FastAPI Server

# Return to project root
cd ..

# Activate your virtual environment
conda activate mini-rag  # or source venv/bin/activate

# Start the server
uvicorn main:app --reload --host 0.0.0.0 --port 5000

# Or with custom workers for production
uvicorn main:app --host 0.0.0.0 --port 5000 --workers 4

Server will be available at: http://localhost:5000

Interactive API Documentation

  • Swagger UI: http://localhost:5000/docs
  • ReDoc: http://localhost:5000/redoc

Basic Workflow

1. Upload Documents

curl -X POST "http://localhost:5000/api/v1/data/upload/my-project" \
  -F "files=@document.pdf" \
  -F "files=@report.txt" \
  -F "files=@research_paper.pdf"

Response:

{
  "success": true,
  "project_id": "my-project",
  "uploaded_files": [
    {
      "filename": "document.pdf",
      "asset_id": "507f1f77bcf86cd799439011",
      "file_path": "assets/files/my-project/abc123_document.pdf",
      "size_kb": 245.7
    }
  ]
}

2. Process Documents (Chunking)

curl -X POST "http://localhost:5000/api/v1/data/process/my-project"

Response:

{
  "success": true,
  "project_id": "my-project",
  "chunks_created": 150,
  "documents_processed": 3,
  "processing_time": "2.3s"
}

3. Index Documents (Create Embeddings)

curl -X POST "http://localhost:5000/api/v1/nlp/index/push/my-project" \
  -H "Content-Type: application/json" \
  -d '{
    "batch_size": 100
  }'

Response:

{
  "success": true,
  "project_id": "my-project",
  "vectors_indexed": 150,
  "collection_name": "my-project",
  "vector_db": "qdrant",
  "indexing_time": "4.5s"
}

4. Ask Questions

curl -X POST "http://localhost:5000/api/v1/nlp/index/answer/my-project" \
  -H "Content-Type: application/json" \
  -d '{
    "question": "What are the key findings in the document?",
    "locale": "en",
    "temperature": 0.7,
    "max_tokens": 500
  }'

Response:

{
  "answer": "Based on the documents, the key findings include: 1) Implementation of RAG systems significantly improves response accuracy by 45%...",
  "sources": [
    {
      "chunk_id": "507f1f77bcf86cd799439011",
      "text": "RAG systems demonstrate improved performance...",
      "relevance_score": 0.89,
      "source_file": "research_paper.pdf",
      "page": 5
    }
  ],
  "metadata": {
    "model": "gemma:2b",
    "provider": "ollama",
    "tokens_used": 450,
    "processing_time": "1.8s",
    "gpu_accelerated": true
  }
}

📡 API Reference

Base Endpoints

Health Check

GET /api/v1/

Response:

{
  "message": "Welcome to Mini-RAG API",
  "version": "1.0.0",
  "status": "healthy",
  "services": {
    "postgres": "connected",
    "qdrant": "connected",
    "ollama": "connected"
  }
}

Data Management Endpoints

Upload Files

POST /api/v1/data/upload/{project_id}

Parameters:

  • project_id (path): Unique project identifier

Request Body:

  • files: List of files (multipart/form-data)

Supported Formats: .pdf, .txt

Response:

{
  "success": true,
  "project_id": "my-project",
  "uploaded_files": [
    {
      "filename": "document.pdf",
      "asset_id": "507f1f77bcf86cd799439011",
      "file_path": "assets/files/my-project/abc123_document.pdf",
      "size_kb": 245.7,
      "pages": 12
    }
  ],
  "total_files": 1,
  "total_size_mb": 0.24
}

Process Documents

POST /api/v1/data/process/{project_id}

Description: Loads documents and splits them into chunks using LangChain

Query Parameters:

  • chunk_size (optional): Override default chunk size
  • chunk_overlap (optional): Override default overlap

Response:

{
  "success": true,
  "project_id": "my-project",
  "chunks_created": 150,
  "documents_processed": 3,
  "avg_chunk_size": 485,
  "processing_time": "2.3s"
}

Get Project Info

GET /api/v1/data/project/{project_id}

Response:

{
  "project_id": "my-project",
  "total_documents": 3,
  "total_chunks": 150,
  "total_vectors": 150,
  "created_at": "2024-01-15T10:30:00Z",
  "last_updated": "2024-01-15T14:45:00Z",
  "storage": {
    "total_size_mb": 2.45,
    "vector_db": "qdrant"
  }
}

NLP & RAG Endpoints

Index Documents

POST /api/v1/nlp/index/push/{project_id}

Description: Generates embeddings and stores them in vector database

Request Body:

{
  "batch_size": 100,
  "vector_db": "qdrant"  // or "pgvector"
}

Response:

{
  "success": true,
  "project_id": "my-project",
  "vectors_indexed": 150,
  "collection_name": "my-project",
  "vector_db": "qdrant",
  "embedding_model": "text-embedding-3-small",
  "indexing_time": "4.5s",
  "batches_processed": 2
}

Get Collection Info

GET /api/v1/nlp/index/info/{project_id}

Response:

{
  "collection_name": "my-project",
  "vectors_count": 150,
  "vector_db": "qdrant",
  "config": {
    "vector_size": 1536,
    "distance": "Cosine",
    "indexed": true
  },
  "stats": {
    "total_points": 150,
    "indexed_points": 150,
    "segments_count": 1
  }
}

Semantic Search

POST /api/v1/nlp/index/search/{project_id}

Request Body:

{
  "query": "machine learning applications in healthcare",
  "top_k": 5,
  "score_threshold": 0.7,
  "vector_db": "qdrant"
}

Response:

{
  "results": [
    {
      "chunk_id": "507f1f77bcf86cd799439011",
      "text": "Machine learning has transformed healthcare through predictive diagnostics...",
      "score": 0.89,
      "metadata": {
        "source": "healthcare_research.pdf",
        "page": 5,
        "chunk_index": 23
      }
    }
  ],
  "total_results": 5,
  "search_time": "0.12s",
  "vector_db": "qdrant"
}

RAG Question Answering

POST /api/v1/nlp/index/answer/{project_id}

Request Body:

{
  "question": "How does RAG improve LLM accuracy?",
  "locale": "en",
  "temperature": 0.7,
  "max_tokens": 500,
  "top_k": 5,
  "use_gpu": true,
  "stream": false
}

Response:

{
  "answer": "RAG (Retrieval-Augmented Generation) improves LLM accuracy through several mechanisms: 1) It grounds responses in factual, retrieved context rather than relying solely on parametric memory...",
  "sources": [
    {
      "chunk_id": "507f1f77bcf86cd799439011",
      "text": "Retrieved context snippet...",
      "relevance_score": 0.89,
      "source_file": "rag_paper.pdf",
      "page": 3
    }
  ],
  "metadata": {
    "model": "gemma:2b",
    "provider": "ollama",
    "tokens_used": 450,
    "processing_time": "1.8s",
    "gpu_accelerated": true,
    "retrieval_time": "0.15s",
    "generation_time": "1.65s"
  }
}

Switch Vector Database

POST /api/v1/nlp/index/switch-vectordb/{project_id}

Request Body:

{
  "target_db": "pgvector",  // "qdrant" or "pgvector"
  "migrate_data": true
}

Response:

{
  "success": true,
  "project_id": "my-project",
  "previous_db": "qdrant",
  "current_db": "pgvector",
  "vectors_migrated": 150,
  "migration_time": "3.2s"
}

📂 Project Structure

mini-rag/
├── src/
│   ├── routes/                           # 📡 API Endpoints Layer
│   │   ├── base.py                      # Welcome & health check
│   │   ├── data.py                      # File upload & processing
│   │   ├── nlp.py                       # Indexing, search, Q&A
│   │   └── schemes/                     # Request/response schemas
│   │       ├── upload.py
│   │       ├── process.py
│   │       └── query.py
│   │
│   ├── controllers/                      # 🎮 Business Logic Layer
│   │   ├── BaseController.py            # Shared utilities
│   │   ├── ProjectController.py         # Project management
│   │   ├── DataController.py            # File validation
│   │   ├── ProcessController.py         # Document chunking
│   │   └── NLPController.py             # RAG orchestration
│   │
│   ├── models/                           # 💾 Database Layer
│   │   ├── ProjectModel.py              # Project CRUD (PostgreSQL)
│   │   ├── AssetModel.py                # File asset CRUD
│   │   ├── ChunkModel.py                # Chunk CRUD
│   │   ├── VectorModel.py               # Vector embeddings CRUD
│   │   ├── db_schemes/                  # SQLAlchemy schemas
│   │   │   ├── project.py
│   │   │   ├── asset.py
│   │   │   ├── data_chunk.py
│   │   │   └── vector_embedding.py
│   │   ├── enums/                       # Constants
│   │   │   ├── file_types.py
│   │   │   └── status.py
│   │   └── database.py                  # PostgreSQL connection
│   │
│   ├── stores/                           # 🔌 External Service Abstractions
│   │   ├── llm/                         # LLM Provider Integration
│   │   │   ├── LLMInterface.py
│   │   │   ├── LLMProviderFactory.py
│   │   │   ├── LLMEnums.py
│   │   │   ├── providers/
│   │   │   │   ├── OpenAIProvider.py
│   │   │   │   ├── CoHereProvider.py
│   │   │   │   └── OllamaProvider.py    # GPU-accelerated (Colab)
│   │   │   └── templates/               # Prompt templates
│   │   │       ├── template_parser.py
│   │   │       └── locales/
│   │   │           ├── en/
│   │   │           │   └── rag.py
│   │   │           └── ar/
│   │   │               └── rag.py
│   │   │
│   │   └── vectordb/                    # Vector Database Integration
│   │       ├── VectorDBInterface.py
│   │       ├── VectorDBProviderFactory.py
│   │       ├── VectorDBEnums.py
│   │       └── providers/
│   │           ├── QdrantDBProvider.py
│   │           └── PGVectorProvider.py   # PostgreSQL + PGVector
│   │
│   ├── helpers/                          # ⚙️ Utility Functions
│   │   ├── config.py                    # Environment config loader
│   │   ├── logger.py                    # Logging configuration
│   │   └── validators.py                # Input validation
│   │
│   └── assets/                           # 📦 File Storage
│       └── files/                       # Uploaded documents
│           └── {project_id}/
│
├── docker/                               # 🐳 Docker Configuration
│   ├── docker-compose.yml               # PostgreSQL + Qdrant services
│   ├── init-scripts/                    # Database initialization
│   │   └── 01-init-pgvector.sql
│   ├── .env.example
│   └── .env
│
├── notebooks/                            # 📓 Google Colab Notebooks
│   ├── ollama_server_setup.ipynb        # Colab GPU setup guide
│   └── model_testing.ipynb              # Model performance testing
│
├── tests/                                # 🧪 Unit & Integration Tests
│   ├── test_controllers.py
│   ├── test_vectordb.py
│   └── test_ollama_provider.py
│
├── scripts/                              # 🛠️ Utility Scripts
│   ├── migrate_vectordb.py              # Migrate between Qdrant/PGVector
│   ├── benchmark_models.py              # Compare model performance
│   └── backup_database.py               # PostgreSQL backup utility
│
├── .vscode/                              # 💻 Editor Settings
├── main.py                               # 🚀 Application entry point
├── requirements.txt                      # 📦 Python dependencies
├── .env.example                          # ⚙️ Environment template
├── .gitignore
└── README.md                             # 📖 Documentation

Component Responsibilities

Layer Components Purpose
API routes/ HTTP request handling, input validation
Business Logic controllers/ Orchestration, workflow management
Data Access models/ PostgreSQL CRUD operations
External Services stores/llm/, stores/vectordb/ LLM and vector DB integrations
Configuration helpers/ Settings management, logging
Storage assets/ File persistence
Infrastructure docker/ Database containers, init scripts
Cloud GPU notebooks/ Colab setup, model deployment

🔄 RAG Pipeline

Complete Workflow

┌─────────────────┐
│   1. UPLOAD     │  User uploads PDF/TXT files via API
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   2. STORE      │  Files saved to local storage + metadata to PostgreSQL
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  3. PROCESS     │  Documents split into chunks via LangChain
│                 │  • CharacterTextSplitter (500 chars, 50 overlap)
└────────┬────────┘  • Chunks stored in PostgreSQL
         │
         ▼
┌─────────────────┐
│   4. EMBED      │  Chunks → Vector embeddings
│                 │  • OpenAI: text-embedding-3-small (1536D)
│                 │  • Cohere: embed-english-v3.0 (1024D)
└────────┬────────┘  • Batch processing for efficiency
         │
         ▼
┌─────────────────┐
│   5. INDEX      │  Vectors stored in dual databases:
│                 │  • Qdrant: Dedicated vector search
│                 │  • PGVector: PostgreSQL native extension
└────────┬────────┘  • IVFFlat index for fast retrieval
         │
         ▼
┌─────────────────┐
│   6. QUERY      │  User submits natural language question
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  7. RETRIEVE    │  Semantic search pipeline:
│                 │  • Query → Embedding
│                 │  • Vector similarity search (cosine)
│                 │  • Top-K most relevant chunks (K=5)
└────────┬────────┘  • Score filtering (threshold=0.7)
         │
         ▼
┌─────────────────┐
│  8. AUGMENT     │  Context construction:
│                 │  • Prompt template (locale-aware)
│                 │  • System instructions
│                 │  • Retrieved chunks as context
└────────┬────────┘  • User question
         │
         ▼
┌─────────────────┐
│  9. GENERATE    │  LLM inference (GPU-accelerated):
│                 │  • Ollama on Colab T4 GPU (via ngrok)
│                 │  • Gemma 2B/7B or Qwen 2.5
│                 │  • Or OpenAI/Cohere API
└────────┬────────┘  • Temperature-controlled generation
         │
         ▼
┌─────────────────┐
│ 10. RESPONSE    │  Structured JSON response:
│                 │  • Generated answer
│                 │  • Source citations
│                 │  • Confidence scores
└─────────────────┘  • Performance metadata

Detailed Process Flow

Phase 1: Document Ingestion (Steps 1-3)

# User uploads files
POST /api/v1/data/upload/medical-research# System validates and stores files
- File validation (PDF/TXT, size limits)
- Generate unique asset IDs
- Save to: assets/files/medical-research/
- MetadataPostgreSQL (AssetModel)
↓
# Document processing triggered
POST /api/v1/data/process/medical-research# LangChain pipeline
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separator="\n"
)
chunks = splitter.split_documents(documents)
↓
# Chunks stored in PostgreSQL
- chunk_id (UUID)
- project_id
- text content
- metadata (source, page, position)
- created_at timestamp

Phase 2: Embedding & Indexing (Steps 4-5)

# Generate embeddings
POST /api/v1/nlp/index/push/medical-research
{
  "batch_size": 100,
  "vector_db": "qdrant"
}
↓
# Batch processing workflow
chunks = ChunkModel.get_by_project("medical-research")
batches = create_batches(chunks, size=100)

for batch in batches:
    # Generate embeddings (API or local)
    embeddings = llm_provider.create_embeddings([c.text for c in batch])
    
    # Store in vector DB
    if vector_db == "qdrant":
        qdrant.upsert(
            collection_name="medical-research",
            points=[
                PointStruct(
                    id=chunk.id,
                    vector=embedding,
                    payload=chunk.metadata
                )
                for chunk, embedding in zip(batch, embeddings)
            ]
        )
    elif vector_db == "pgvector":
        # PostgreSQL with PGVector extension
        INSERT INTO vector_embeddings (chunk_id, embedding, metadata)
        VALUES (%s, %s::vector, %s)
↓
# Create indexes for fast retrieval
- Qdrant: HNSW index (M=16, ef_construct=100)
- PGVector: IVFFlat index (lists=100)

Phase 3: Query & Generation (Steps 6-10)

# User query received
POST /api/v1/nlp/index/answer/medical-research
{
  "question": "What are the side effects of the treatment?",
  "locale": "en",
  "temperature": 0.7,
  "top_k": 5
}
↓
# Step 1: Query embedding
query_embedding = llm_provider.create_embedding(question)
↓
# Step 2: Vector similarity search
if vector_db == "qdrant":
    results = qdrant.search(
        collection_name="medical-research",
        query_vector=query_embedding,
        limit=5,
        score_threshold=0.7
    )
elif vector_db == "pgvector":
    SELECT chunk_id, text, metadata,
           1 - (embedding <=> %s::vector) as similarity
    FROM vector_embeddings
    WHERE project_id = 'medical-research'
    ORDER BY embedding <=> %s::vector
    LIMIT 5# Step 3: Context preparation
retrieved_chunks = [
    f"[Source {i+1}] {result.text}"
    for i, result in enumerate(results)
]
context = "\n\n".join(retrieved_chunks)
↓
# Step 4: Prompt construction (locale-aware)
from src.stores.llm.templates import get_template

template = get_template("rag", locale="en")
prompt = template.format(
    context=context,
    question=question
)
↓
# Step 5: LLM generation
if llm_provider == "ollama":
    # GPU-accelerated on Colab via ngrok
    response = requests.post(
        f"{OLLAMA_BASE_URL}/api/generate",
        json={
            "model": "gemma:2b",
            "prompt": prompt,
            "temperature": 0.7,
            "max_tokens": 500
        }
    )
    answer = response.json()["response"]
elif llm_provider == "openai":
    response = openai.ChatCompletion.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": template.system},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0.7
    )
    answer = response.choices[0].message.content# Step 6: Response formatting
return {
    "answer": answer,
    "sources": [
        {
            "chunk_id": result.id,
            "text": result.text,
            "relevance_score": result.score,
            "metadata": result.metadata
        }
        for result in results
    ],
    "metadata": {
        "model": model_name,
        "provider": provider,
        "tokens_used": token_count,
        "processing_time": elapsed_time,
        "gpu_accelerated": True if ollama else False
    }
}

Performance Characteristics

Component Latency Throughput Scalability
File Upload 100-500ms 10 files/sec Horizontal (API)
Chunking 1-5s/doc 20 docs/min CPU-bound
Embedding (API) 200-800ms 1000 req/min API rate limits
Embedding (GPU) 50-200ms 5000 req/min GPU memory
Vector Search 10-100ms 1000 req/sec Index quality
LLM (Ollama) 500-2000ms 30 req/min GPU compute
LLM (API) 1000-3000ms Rate limited Token limits

🛠️ Development

Setting Up Development Environment

# Clone repository
git clone https://github.com/yourusername/mini-rag.git
cd mini-rag

# Create development environment
conda create -n mini-rag-dev python=3.8 -y
conda activate mini-rag-dev

# Install dependencies with development tools
pip install -r requirements.txt
pip install -r requirements-dev.txt  # pytest, black, flake8, mypy

# Setup pre-commit hooks
pip install pre-commit
pre-commit install

Code Style Guidelines

  • Formatting: Use black with line length 100
  • Linting: Follow flake8 rules
  • Type Hints: Add type annotations for all public functions
  • Docstrings: Use Google-style docstrings
from typing import List, Dict, Optional
import logging

logger = logging.getLogger(__name__)

def process_document(
    file_path: str,
    chunk_size: int = 500,
    chunk_overlap: int = 50
) -> List[Dict[str, any]]:
    """
    Process a document into chunks with embeddings.

    Args:
        file_path: Absolute path to the document file
        chunk_size: Maximum characters per chunk (default: 500)
        chunk_overlap: Character overlap between chunks (default: 50)

    Returns:
        List of dictionaries containing chunk text and metadata

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If chunk_size < chunk_overlap
        
    Example:
        >>> chunks = process_document("paper.pdf", chunk_size=1000)
        >>> len(chunks)
        45
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")
    
    if chunk_size < chunk_overlap:
        raise ValueError("chunk_size must be >= chunk_overlap")
    
    logger.info(f"Processing document: {file_path}")
    # Implementation...

Running Tests

# Run all tests
pytest

# Run with coverage report
pytest --cov=src --cov-report=html tests/

# Run specific test file
pytest tests/test_nlp_controller.py

# Run with verbose output
pytest -v tests/

# Run only integration tests
pytest -m integration tests/

# Run and generate XML report for CI/CD
pytest --junitxml=test-results.xml

Testing Ollama Connection

# tests/test_ollama_provider.py
import pytest
import requests
from src.helpers.config import settings

def test_ollama_connection():
    """Test connection to Colab Ollama server via ngrok."""
    url = f"{settings.OLLAMA_BASE_URL}/api/version"
    response = requests.get(url, timeout=10)
    assert response.status_code == 200
    assert "version" in response.json()

def test_ollama_embedding():
    """Test embedding generation via Ollama."""
    from src.stores.llm.providers.OllamaProvider import OllamaProvider
    
    provider = OllamaProvider()
    embedding = provider.create_embedding("Test text")
    
    assert isinstance(embedding, list)
    assert len(embedding) > 0
    assert all(isinstance(x, float) for x in embedding)

def test_ollama_chat():
    """Test chat completion via Ollama."""
    from src.stores.llm.providers.OllamaProvider import OllamaProvider
    
    provider = OllamaProvider()
    response = provider.generate_chat_completion([
        {"role": "user", "content": "Hello, how are you?"}
    ])
    
    assert isinstance(response, str)
    assert len(response) > 0

Adding a New LLM Provider

  1. Create provider class:
# src/stores/llm/providers/CustomProvider.py
from typing import List, Dict
from ..LLMInterface import LLMInterface

class CustomProvider(LLMInterface):
    """Custom LLM provider implementation."""
    
    def __init__(self, api_key: str, base_url: str):
        self.api_key = api_key
        self.base_url = base_url
    
    def create_embedding(self, text: str) -> List[float]:
        """Generate embedding vector for text."""
        # Implementation
        pass
    
    def create_embeddings_batch(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings for multiple texts."""
        # Implementation
        pass
    
    def generate_chat_completion(
        self,
        messages: List[Dict[str, str]],
        temperature: float = 0.7,
        max_tokens: int = 500
    ) -> str:
        """Generate chat completion."""
        # Implementation
        pass
  1. Register in factory:
# src/stores/llm/LLMProviderFactory.py
from .providers.CustomProvider import CustomProvider

class LLMProviderFactory:
    @staticmethod
    def create_provider(provider_name: str):
        if provider_name == "custom":
            return CustomProvider(
                api_key=settings.CUSTOM_API_KEY,
                base_url=settings.CUSTOM_BASE_URL
            )
        # ... existing providers
  1. Update configuration:
# .env
LLM_PROVIDER=custom
CUSTOM_API_KEY=your-api-key
CUSTOM_BASE_URL=https://api.custom-llm.com

Adding a New Vector Database Provider

  1. Create provider class:
# src/stores/vectordb/providers/WeaviateProvider.py
from typing import List, Dict
from ..VectorDBInterface import VectorDBInterface

class WeaviateProvider(VectorDBInterface):
    """Weaviate vector database provider."""
    
    def __init__(self, url: str, api_key: str):
        import weaviate
        self.client = weaviate.Client(
            url=url,
            auth_client_secret=weaviate.AuthApiKey(api_key)
        )
    
    def create_collection(self, collection_name: str, vector_size: int):
        """Create new collection."""
        pass
    
    def upsert_vectors(
        self,
        collection_name: str,
        vectors: List[List[float]],
        ids: List[str],
        metadata: List[Dict]
    ):
        """Insert or update vectors."""
        pass
    
    def search(
        self,
        collection_name: str,
        query_vector: List[float],
        top_k: int = 5
    ) -> List[Dict]:
        """Semantic similarity search."""
        pass
  1. Update Docker setup (if needed):
# docker/docker-compose.yml
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
    environment:
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'your-api-key'
    volumes:
      - weaviate_data:/var/lib/weaviate

Performance Benchmarking

# scripts/benchmark_models.py
import time
from src.stores.llm.LLMProviderFactory import LLMProviderFactory

def benchmark_embedding_speed():
    """Compare embedding generation speed across providers."""
    
    test_texts = ["Sample text"] * 100
    providers = ["openai", "cohere", "ollama"]
    
    results = {}
    for provider_name in providers:
        provider = LLMProviderFactory.create_provider(provider_name)
        
        start = time.time()
        embeddings = provider.create_embeddings_batch(test_texts)
        elapsed = time.time() - start
        
        results[provider_name] = {
            "total_time": elapsed,
            "avg_time": elapsed / len(test_texts),
            "throughput": len(test_texts) / elapsed
        }
    
    return results

def benchmark_generation_quality():
    """Compare answer quality across models."""
    # Implementation
    pass

if __name__ == "__main__":
    results = benchmark_embedding_speed()
    print(results)

🐛 Troubleshooting

Common Issues

1. PostgreSQL Connection Errors

Error: psycopg.OperationalError: connection to server failed

Solutions:

# Check if PostgreSQL container is running
docker compose ps postgres

# View PostgreSQL logs
docker compose logs postgres

# Verify connection settings
docker compose exec postgres psql -U admin -d mini_rag_db -c "\conninfo"

# Test connection from host
psql -h localhost -p 5432 -U admin -d mini_rag_db

# FASTart PostgreSQL
docker compose FASTart postgres

2. PGVector Extension Issues

Error: relation "vector_embeddings" does not exist

Solutions:

# Verify PGVector extension is installed
docker compose exec postgres psql -U admin -d mini_rag_db -c "\dx"

# Manually create extension
docker compose exec postgres psql -U admin -d mini_rag_db -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Run initialization script
docker compose exec postgres psql -U admin -d mini_rag_db -f /docker-entrypoint-initdb.d/01-init-pgvector.sql

# Check table exists
docker compose exec postgres psql -U admin -d mini_rag_db -c "\dt"

Error: vector dimension mismatch

Solutions:

-- Check current vector size
SELECT column_name, data_type 
FROM information_schema.columns 
WHERE table_name = 'vector_embeddings';

-- Drop and recreate table with correct size
DROP TABLE IF EXISTS vector_embeddings;

CREATE TABLE vector_embeddings (
    id SERIAL PRIMARY KEY,
    chunk_id VARCHAR(255) UNIQUE NOT NULL,
    project_id VARCHAR(255) NOT NULL,
    embedding vector(1536),  -- Match your embedding model
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

3. Qdrant Connection Issues

Error: QdrantException: Connection refused

Solutions:

# Check Qdrant status
docker compose ps qdrant
docker compose logs qdrant

# Verify Qdrant is accessible
curl http://localhost:6333/collections

# FASTart Qdrant
docker compose FASTart qdrant

# Check Qdrant dashboard
open http://localhost:6333/dashboard

4. Ollama/Colab Connection Issues

Error: requests.exceptions.ConnectionError: Failed to establish connection

Solutions:

Check ngrok tunnel:

# In your Colab notebook
from pyngrok import ngrok

# List active tunnels
tunnels = ngrok.get_tunnels()
print(tunnels)

# FASTart tunnel if needed
ngrok.kill()
public_url = ngrok.connect(11434, "http")
print(f"New URL: {public_url}")

Verify Ollama server:

# In Colab cell
!curl http://localhost:11434/api/version

Update .env file:

# Update with new ngrok URL
OLLAMA_BASE_URL=https://new-url-from-ngrok.ngrok-free.app

Test connection from local machine:

import requests

url = "https://your-ngrok-url.ngrok-free.app/api/version"
response = requests.get(url)
print(response.json())

Error: Colab session disconnected

Solutions:

  • Run the auto-click JavaScript code (see Installation section)
  • Use Colab Pro for longer sessions (24 hours)
  • Set up automatic session FASTarter
  • Consider self-hosting Ollama on a dedicated server

5. GPU Memory Issues

Error: CUDA out of memory

Solutions:

Switch to smaller model:

# In Colab
!ollama pull gemma:2b  # Instead of gemma:7b

Clear GPU cache:

# In Colab
import torch
torch.cuda.empty_cache()

Reduce batch size:

# In .env
BATCH_SIZE=50  # Instead of 100

6. Vector Size Mismatch

Error: Vector dimension mismatch: expected 1536, got 1024

Solution: Ensure consistency between embedding model and vector DB configuration

# For OpenAI text-embedding-3-small (1536 dimensions)
VECTOR_SIZE=1536
OPENAI_EMBEDDING_MODEL=text-embedding-3-small

# For OpenAI text-embedding-3-large (3072 dimensions)
VECTOR_SIZE=3072
OPENAI_EMBEDDING_MODEL=text-embedding-3-large

# For Cohere embed-english-v3.0 (1024 dimensions)
VECTOR_SIZE=1024
COHERE_EMBEDDING_MODEL=embed-english-v3.0

Update PGVector table:

ALTER TABLE vector_embeddings 
ALTER COLUMN embedding TYPE vector(1536);  -- Match your size

Recreate Qdrant collection:

from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="my-project",
    vectors_config={"size": 1536, "distance": "Cosine"}
)

7. API Rate Limiting

Error: RateLimitError: Rate limit exceeded

Solutions:

For OpenAI:

# Implement exponential backoff
import time
from openai import RateLimitError

def create_embeddings_with_retry(texts, max_retries=3):
    for attempt in range(max_retries):
        try:
            return openai.Embedding.create(input=texts)
        except RateLimitError:
            wait_time = 2 ** attempt
            time.sleep(wait_time)
    raise Exception("Max retries exceeded")

For batch processing:

# Reduce batch size
BATCH_SIZE=20  # Instead of 100

# Add delay between batches
BATCH_DELAY_SECONDS=2

Performance Optimization

Slow Embedding Generation

Diagnosis:

import time

start = time.time()
embeddings = provider.create_embeddings_batch(chunks)
print(f"Time: {time.time() - start:.2f}s for {len(chunks)} chunks")

Solutions:

  1. Use batch processing:
# Instead of sequential
for chunk in chunks:
    embedding = provider.create_embedding(chunk.text)

# Use batching
batch_texts = [chunk.text for chunk in chunks]
embeddings = provider.create_embeddings_batch(batch_texts)
  1. Switch to local Ollama (if using API):
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=https://your-colab-ngrok-url
  1. Parallel processing:
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    return provider.create_embeddings_batch(batch)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_batch, batches)

Slow Vector Search

Diagnosis:

start = time.time()
results = vectordb.search(query_vector, top_k=5)
print(f"Search time: {time.time() - start:.2f}s")

Solutions:

  1. Optimize PGVector index:
-- Use IVFFlat for better speed/accuracy tradeoff
CREATE INDEX vector_embeddings_embedding_idx 
ON vector_embeddings 
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);  -- Increase for larger datasets

-- Vacuum and analyze
VACUUM ANALYZE vector_embeddings;
  1. Optimize Qdrant index:
client.recreate_collection(
    collection_name="project",
    vectors_config={
        "size": 1536,
        "distance": "Cosine"
    },
    hnsw_config={
        "m": 16,                # Increase for better recall
        "ef_construct": 100,    # Increase for better quality
    }
)
  1. Reduce top_k:
TOP_K_RESULTS=3  # Instead of 10
  1. Add score threshold:
results = vectordb.search(
    query_vector,
    top_k=10,
    score_threshold=0.7  # Only return highly relevant results
)

High Memory Usage

Solutions:

  1. Reduce chunk size:
CHUNKING_SIZE=300  # Instead of 500
  1. Process files individually:
for file in files:
    process_single_file(file)
    # Clear cache between files
  1. Increase Docker memory limits:
# docker-compose.yml
services:
  postgres:
    deploy:
      resources:
        limits:
          memory: 2G

📊 Performance Benchmarks

Embedding Generation (1000 chunks)

Provider Model Time (s) Throughput (chunks/s) Cost
OpenAI API text-embedding-3-small 12.5 80 $0.0001/k
Cohere API embed-english-v3.0 15.3 65 $0.0001/k
Ollama (Colab) nomic-embed-text 8.2 122 Free

Vector Search (1M vectors)

Database Index Type Search Time (ms) Memory (GB) Accuracy
Qdrant HNSW 15 2.1 0.98
PGVector IVFFlat 45 1.8 0.95

End-to-End Query (RAG)

Component Latency (ms) Notes
Query embedding 120 OpenAI API
Vector search 25 Qdrant HNSW
LLM generation 1500 Ollama Gemma 2B on T4
Total 1645 ~1.6s end-to-end

Benchmarks conducted on: Intel i7-10700K, 32GB RAM, Tesla T4 GPU (Colab)


🔒 Security Best Practices

API Keys

  • Never commit .env files to version control
  • Use environment variables for all secrets
  • Rotate API keys regularly
  • Use different keys for dev/staging/production

Database

-- Use strong passwords
POSTGRES_PASSWORD=$(openssl rand -base64 32)

-- FASTrict network access
# docker-compose.yml
services:
  postgres:
    ports:
      - "127.0.0.1:5432:5432"  # Only localhost

-- Enable SSL for production
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'

API Security

# main.py
from fastapi import FastAPI, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi_limiter import FastAPILimiter
import redis.asyncio as redis

app = FastAPI()

# CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],  # Specific origins
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)

# Rate limiting
@app.on_event("startup")
async def startup():
    redis_client = await redis.from_url("redis://localhost")
    await FastAPILimiter.init(redis_client)


⬆ Back to Top


⭐ If you find this project useful, please consider giving it a star!

Made with ❤️ by Boudy Ibrahim

About

RAG Chatbot — A production-ready RAG system built with Python. Upload documents, convert them to vector embeddings, and chat with an AI that retrieves relevant context to generate accurate, source-grounded responses. Features document ingestion, semantic search, and a modular architecture for real-world deployment

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors