A modular, production-ready Retrieval-Augmented Generation (RAG) system for intelligent question answering over document collections, powered by PostgreSQL and GPU-accelerated inference.
Features • Quick Start • Documentation • Architecture • API Reference
- Overview
- Features
- Architecture
- Prerequisites
- Installation
- Configuration
- Usage
- API Reference
- Project Structure
- RAG Pipeline
- Development
- Troubleshooting
RAG System is a minimalist yet powerful implementation of a Retrieval-Augmented Generation (RAG) system designed for building intelligent chatbots that answer questions based on your document collections. The system combines semantic search with large language models to provide accurate, context-aware responses, leveraging PostgreSQL with PGVector for efficient vector storage and Google Colab GPU infrastructure for accelerated model inference.
Retrieval-Augmented Generation enhances LLM responses by:
- Retrieving relevant context from a knowledge base using vector similarity search
- Augmenting the user query with retrieved information
- Generating accurate answers using the enriched context via GPU-accelerated LLMs
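In code, the loop looks roughly like the sketch below (illustrative only; `embed`, `vector_search`, and `llm_generate` are placeholder names, not this project's actual API):

# Sketch of the retrieve → augment → generate loop (placeholder helpers)
def answer(question: str, top_k: int = 5) -> str:
    query_vector = embed(question)                        # embed the user query
    chunks = vector_search(query_vector, top_k=top_k)     # retrieve the most similar chunks
    context = "\n\n".join(c["text"] for c in chunks)      # augment: build the context block
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)                           # generate the grounded answer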
- 📚 Document Q&A: Query large document collections (PDFs, text files)
- 🏢 Enterprise Knowledge Base: Build internal chatbots over company documentation
- 📖 Research Assistant: Quickly find and summarize information from research papers
- 🎓 Educational Tools: Create tutoring systems based on course materials
- 💼 Customer Support: Automate responses using product documentation
- 🔍 Hybrid Vector Search: Dual vector database support with Qdrant and PGVector (PostgreSQL)
- 🚀 GPU-Accelerated Inference: Ollama server running on Google Colab T4 GPU for high-performance local LLM inference
- 🌐 Secure Remote Access: ngrok tunneling for secure public endpoint exposure to Colab-hosted models
- 🤖 Multi-LLM Support: Compatible with OpenAI GPT, Cohere, and local Ollama models (Gemma, Qwen)
- 📄 Document Processing: Automatic chunking and processing of PDF and TXT files
- 🗄️ PostgreSQL Backend: Robust relational database for metadata and document storage
- 📊 PGVector Integration: Native PostgreSQL vector similarity search capabilities
- 🌍 Multilingual: Built-in support for English and Arabic prompts
- 🎯 Project Management: Organize documents into separate projects
- ⚡ Batch Processing: Efficient batch embedding and indexing
- 🔌 REST API: Clean, well-documented FastAPI endpoints
- Modular Architecture: Clean separation of concerns (routes, controllers, models, stores)
- Provider Abstraction: Easily swap LLM and vector DB providers via a factory pattern (see the sketch after this list)
- Production Ready: PostgreSQL integration, Docker support, environment-based configuration
- Template System: Customizable prompt templates with locale support
- Comprehensive Error Handling: Robust validation and error responses
- Scalable Design: Supports concurrent requests and batch operations
- Cloud GPU Integration: Leverage free Google Colab T4 GPUs for cost-effective inference
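As a rough illustration of the provider abstraction, selecting a provider from configuration might look like this (a simplified stand-in, not the exact `LLMProviderFactory` shown later in this README):

# Simplified sketch of config-driven provider selection
class ProviderFactory:
    @staticmethod
    def create(provider_name: str):
        if provider_name == "openai":
            return OpenAIProvider(api_key=settings.OPENAI_API_KEY)
        if provider_name == "ollama":
            return OllamaProvider(base_url=settings.OLLAMA_BASE_URL)
        raise ValueError(f"Unknown LLM provider: {provider_name}")

# Swapping providers is then a one-line change in .env (LLM_PROVIDER=ollama)
llm = ProviderFactory.create(settings.LLM_PROVIDER)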
┌─────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ (FAST API Endpoints) │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────────────▼────────────────────────────────────────┐
│ Business Logic Layer │
│ (Controllers - Orchestration) │
└─────┬──────────────────┬──────────────────┬─────────────────────┘
│ │ │
┌─────▼─────┐ ┌────────▼────────┐ ┌────▼────────────────────────────────┐
│ Data │ │ Persistence │ │ External Services │
│ Layer │ │ Layer │ │ │
│ │ │ │ │ ┌─────────────────────────────────┐ │
│ Models │ │ PostgreSQL │ │ │ LLM Providers │ │
│ (CRUD) │ │ (Documents & │ │ │ • OpenAI (API) │ │
│ │ │ Metadata) │ │ │ • Cohere (API) │ │
│ │ │ │ │ │ • Ollama (GPU-Accelerated) │ │
│ │ │ │ │ │ - Gemma │ │
└───────────┘ └─────────────────┘ │ │ - Qwen │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Vector Database (Dual) │ │
│ │ • Qdrant (Dedicated Vector DB) │ │
│ │ • PGVector (PostgreSQL Ext.) │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Ollama GPU Runtime (Remote) │ │
│ │ • Google Colab (Free T4 GPU) │ │
│ │ • ngrok Tunnel (Secure Access) │ │
│ │ • FAST API Endpoint │ │
│ └─────────────────────────────────┘ │
└────────────────────────────────────────┘
| Component | Responsibility | Technology |
|---|---|---|
| API Layer | HTTP request handling, validation | FastAPI, Pydantic |
| Controllers | Business logic orchestration | Python |
| Models | Database CRUD operations | SQLAlchemy, Psycopg3 |
| LLM Store | LLM provider integration & routing | OpenAI SDK, Cohere SDK, Ollama |
| Vector Store | Semantic search operations (dual provider) | Qdrant, PGVector |
| Relational DB | Metadata & document management | PostgreSQL 15+ |
| Vector Extension | Native PostgreSQL vector operations | PGVector |
| Ollama Server | Local LLM inference runtime (remote) | Ollama (Gemma 2B/7B, Qwen 2.5) |
| GPU Runtime | Accelerated model inference (cloud) | Google Colab (Tesla T4 16GB) |
| Tunneling | Secure public endpoint exposure | ngrok (HTTPS) |
| Helpers | Configuration, utilities | Pydantic Settings |
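For the Helpers row, a minimal Pydantic-Settings loader is sketched below; the field names mirror the `.env` keys from the Configuration section, but the exact contents of `helpers/config.py` may differ:

# helpers/config.py (illustrative sketch, not the exact implementation)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    LLM_PROVIDER: str = "ollama"
    OLLAMA_BASE_URL: str = "http://localhost:11434"
    VECTOR_DB_PROVIDER: str = "qdrant"
    POSTGRES_HOST: str = "localhost"
    POSTGRES_PORT: int = 5432

settings = Settings()  # values come from the environment / .env file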
- Python: 3.8 or later
- Docker: 20.10+ (for database services)
- Docker Compose: 1.29+
- Memory: Minimum 4GB RAM recommended (8GB preferred)
- Storage: 5GB free space (for models and data)
- Google Account: For accessing Google Colab
- Google Colab: Free tier with T4 GPU runtime (15GB VRAM)
- ngrok Account: Free tier for secure tunneling (optional but recommended)
- OpenAI API Key (for GPT models and embeddings), or
- Cohere API Key (alternative LLM provider)
- ngrok Auth Token (optional, for persistent tunnels)
# Download and install Miniconda
# Visit: https://docs.anaconda.com/free/miniconda/#quick-command-line-install
# Create virtual environment
conda create -n mini-rag python=3.8 -y
# Activate environment
conda activate mini-rag
# Install dependencies
pip install -r requirements.txt
# Create virtual environment
python -m venv venv
# Activate environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Navigate to docker directory
cd docker
# Copy environment template
cp .env.example .env
# Update .env with your credentials (see Configuration section)
nano .env # or use your preferred editor
# Start PostgreSQL and Qdrant services
docker compose up -d
# Verify services are running
docker compose ps
# Check PostgreSQL is accessible
docker compose exec postgres psql -U admin -d mini_rag_db -c "\l"
# Verify PGVector extension is loaded
docker compose exec postgres psql -U admin -d mini_rag_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
- Visit Google Colab
- Create a new notebook
- Enable GPU Runtime:
  - Click `Runtime` → `Change runtime type`
  - Select `T4 GPU` under Hardware accelerator
  - Click `Save`
Add the following cells to your Colab notebook:
# Cell 1: Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh
# Cell 2: Start Ollama server in background
import subprocess
import time
# Start Ollama server
ollama_process = subprocess.Popen(['ollama', 'serve'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
time.sleep(5) # Wait for server to start
print("Ollama server started")
# Cell 3: Pull your preferred models
!ollama pull gemma:2b # Lightweight model (1.4GB)
# OR
!ollama pull gemma:7b # More capable model (4.8GB)
# OR
!ollama pull qwen2.5:3b # Alternative model (2GB)
# Verify installation
!ollama list
# Cell 4: Install ngrok
!pip install pyngrok
# Cell 5: Configure and start ngrok tunnel
from pyngrok import ngrok
# Optional: Set your ngrok auth token for persistent URLs
# Sign up at https://ngrok.com and get your token
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN") # Replace with your token
# Create tunnel to Ollama server (port 11434)
public_url = ngrok.connect(11434, "http")
print(f"\n🚀 Ollama Server Public URL: {public_url}")
print(f"\n📋 Copy this URL to your .env file as OLLAMA_BASE_URL")
# Keep the tunnel alive
import time
print("\n✅ Tunnel is active. Keep this cell running!")
try:
while True:
time.sleep(60)
except KeyboardInterrupt:
    print("\n🛑 Tunnel stopped")
Expected Output:
🚀 Ollama Server Public URL: https://1234-5678-9abc-def0.ngrok-free.app
📋 Copy this URL to your .env file as OLLAMA_BASE_URL
✅ Tunnel is active. Keep this cell running!
Important: Colab sessions time out after inactivity. Use one of these methods to keep the session alive:
Method 1: Run this cell to simulate activity
# Cell 6: Auto-click to prevent disconnect
from IPython.display import Javascript
display(Javascript('''
function ClickConnect(){
console.log("Clicking");
document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect, 60000)
'''))
Method 2: Use a browser extension (e.g., Colab Autoclick)
Method 3: Upgrade to Colab Pro for longer sessions
# Copy environment template
cp .env.example .env
# ============================================
# LLM Provider Settings
# ============================================
# API-based providers
OPENAI_API_KEY=sk-your-openai-api-key-here
COHERE_API_KEY=your-cohere-api-key-here # Optional
# Ollama (GPU-accelerated on Colab)
OLLAMA_BASE_URL=https://1234-5678-9abc-def0.ngrok-free.app # From ngrok output
OLLAMA_MODEL=gemma:2b # Options: gemma:2b, gemma:7b, qwen2.5:3b
# Select active LLM provider
LLM_PROVIDER=ollama # Options: openai, cohere, ollama
# Model Configuration (for API providers)
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4-turbo-preview
COHERE_EMBEDDING_MODEL=embed-english-v3.0
COHERE_CHAT_MODEL=command-r-plus
# ============================================
# Vector Database Settings
# ============================================
# Primary vector database
VECTOR_DB_PROVIDER=qdrant # Options: qdrant, pgvector
# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_GRPC_PORT=6334
QDRANT_API_KEY= # Optional for local deployment
# PGVector Configuration (uses PostgreSQL)
PGVECTOR_HOST=localhost
PGVECTOR_PORT=5432
PGVECTOR_DATABASE=mini_rag_db
PGVECTOR_USER=admin
PGVECTOR_PASSWORD=your-secure-password
# Vector Search Parameters
VECTOR_SIZE=1536 # Must match embedding model output
DISTANCE_METRIC=cosine # Options: cosine, euclidean, dot
# ============================================
# PostgreSQL Settings
# ============================================
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=mini_rag_db
POSTGRES_USER=admin
POSTGRES_PASSWORD=your-secure-password
POSTGRES_SCHEMA=public
# Connection Pool Settings
POSTGRES_POOL_SIZE=10
POSTGRES_MAX_OVERFLOW=20
# ============================================
# Application Settings
# ============================================
# Document Processing
CHUNKING_SIZE=500 # Characters per chunk
CHUNKING_OVERLAP=50 # Overlap between chunks
BATCH_SIZE=100 # Chunks per batch for embedding
# Retrieval Settings
TOP_K_RESULTS=5 # Number of chunks to retrieve
# Language Settings
DEFAULT_LOCALE=en # Options: en, ar
# API Settings
API_HOST=0.0.0.0
API_PORT=5000
API_WORKERS=4
# ============================================
# PostgreSQL Configuration
# ============================================
POSTGRES_USER=admin
POSTGRES_PASSWORD=your-secure-password
POSTGRES_DB=mini_rag_db
# PGVector Extension
POSTGRES_EXTENSIONS=vector
# Resource Limits
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SHARED_BUFFERS=256MB
# ============================================
# Qdrant Configuration
# ============================================
QDRANT_API_KEY= # Optional for local deployment
QDRANT_STORAGE_PATH=/qdrant/storage
# Resource Limits
QDRANT_MAX_CONCURRENT_REQUESTS=100
Update `docker/docker-compose.yml` to include PostgreSQL with PGVector:
version: '3.8'
services:
postgres:
image: pgvector/pgvector:pg15
container_name: mini-rag-postgres
environment:
POSTGRES_USER: ${POSTGRES_USER}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: ${POSTGRES_DB}
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init-scripts:/docker-entrypoint-initdb.d
command: postgres -c shared_buffers=${POSTGRES_SHARED_BUFFERS:-256MB}
    restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
interval: 10s
timeout: 5s
retries: 5
qdrant:
image: qdrant/qdrant:latest
container_name: mini-rag-qdrant
ports:
- "6333:6333"
- "6334:6334"
volumes:
- qdrant_data:/qdrant/storage
environment:
QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
    restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
interval: 10s
timeout: 5s
retries: 5
volumes:
postgres_data:
  qdrant_data:
Create `docker/init-scripts/01-init-pgvector.sql`:
-- Enable PGVector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create vector index table for embeddings
CREATE TABLE IF NOT EXISTS vector_embeddings (
id SERIAL PRIMARY KEY,
chunk_id VARCHAR(255) UNIQUE NOT NULL,
project_id VARCHAR(255) NOT NULL,
embedding vector(1536), -- Adjust size based on your embedding model
metadata JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create index for fast similarity search
CREATE INDEX IF NOT EXISTS vector_embeddings_embedding_idx
ON vector_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Create index for project filtering
CREATE INDEX IF NOT EXISTS vector_embeddings_project_idx
ON vector_embeddings(project_id);
-- Grant permissions
GRANT ALL PRIVILEGES ON TABLE vector_embeddings TO admin;
GRANT USAGE, SELECT ON SEQUENCE vector_embeddings_id_seq TO admin;
# Navigate to docker directory
cd docker
# Start PostgreSQL and Qdrant
docker compose up -d
# Verify services are healthy
docker compose ps
# Check logs if needed
docker compose logs -f postgres
docker compose logs -f qdrant
- Open your Google Colab notebook
- Run all cells to start Ollama and ngrok
- Copy the ngrok URL to your `.env` file
# Return to project root
cd ..
# Activate your virtual environment
conda activate mini-rag # or source venv/bin/activate
# Start the server
uvicorn main:app --reload --host 0.0.0.0 --port 5000
# Or with custom workers for production
uvicorn main:app --host 0.0.0.0 --port 5000 --workers 4
Server will be available at: http://localhost:5000
- Swagger UI: http://localhost:5000/docs
- ReDoc: http://localhost:5000/redoc
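A quick sanity check once the server is running (the full response is documented under Base Endpoints below):

curl http://localhost:5000/api/v1/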
curl -X POST "http://localhost:5000/api/v1/data/upload/my-project" \
-F "files=@document.pdf" \
-F "files=@report.txt" \
-F "files=@research_paper.pdf"Response:
{
"success": true,
"project_id": "my-project",
"uploaded_files": [
{
"filename": "document.pdf",
"asset_id": "507f1f77bcf86cd799439011",
"file_path": "assets/files/my-project/abc123_document.pdf",
"size_kb": 245.7
}
]
}
curl -X POST "http://localhost:5000/api/v1/data/process/my-project"
Response:
{
"success": true,
"project_id": "my-project",
"chunks_created": 150,
"documents_processed": 3,
"processing_time": "2.3s"
}
curl -X POST "http://localhost:5000/api/v1/nlp/index/push/my-project" \
-H "Content-Type: application/json" \
-d '{
"batch_size": 100
}'
Response:
{
"success": true,
"project_id": "my-project",
"vectors_indexed": 150,
"collection_name": "my-project",
"vector_db": "qdrant",
"indexing_time": "4.5s"
}
curl -X POST "http://localhost:5000/api/v1/nlp/index/answer/my-project" \
-H "Content-Type: application/json" \
-d '{
"question": "What are the key findings in the document?",
"locale": "en",
"temperature": 0.7,
"max_tokens": 500
}'
Response:
{
"answer": "Based on the documents, the key findings include: 1) Implementation of RAG systems significantly improves response accuracy by 45%...",
"sources": [
{
"chunk_id": "507f1f77bcf86cd799439011",
"text": "RAG systems demonstrate improved performance...",
"relevance_score": 0.89,
"source_file": "research_paper.pdf",
"page": 5
}
],
"metadata": {
"model": "gemma:2b",
"provider": "ollama",
"tokens_used": 450,
"processing_time": "1.8s",
"gpu_accelerated": true
}
}
GET /api/v1/
Response:
{
"message": "Welcome to Mini-RAG API",
"version": "1.0.0",
"status": "healthy",
"services": {
"postgres": "connected",
"qdrant": "connected",
"ollama": "connected"
}
}
POST /api/v1/data/upload/{project_id}
Parameters:
- `project_id` (path): Unique project identifier
Request Body:
- `files`: List of files (multipart/form-data)
Supported Formats: `.pdf`, `.txt`
Response:
{
"success": true,
"project_id": "my-project",
"uploaded_files": [
{
"filename": "document.pdf",
"asset_id": "507f1f77bcf86cd799439011",
"file_path": "assets/files/my-project/abc123_document.pdf",
"size_kb": 245.7,
"pages": 12
}
],
"total_files": 1,
"total_size_mb": 0.24
}
POST /api/v1/data/process/{project_id}
Description: Loads documents and splits them into chunks using LangChain
Query Parameters:
- `chunk_size` (optional): Override default chunk size
- `chunk_overlap` (optional): Override default overlap
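For example, a request overriding both defaults (assuming they are passed as query parameters, as listed above):

curl -X POST "http://localhost:5000/api/v1/data/process/my-project?chunk_size=800&chunk_overlap=80"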
Response:
{
"success": true,
"project_id": "my-project",
"chunks_created": 150,
"documents_processed": 3,
"avg_chunk_size": 485,
"processing_time": "2.3s"
}
GET /api/v1/data/project/{project_id}
Response:
{
"project_id": "my-project",
"total_documents": 3,
"total_chunks": 150,
"total_vectors": 150,
"created_at": "2024-01-15T10:30:00Z",
"last_updated": "2024-01-15T14:45:00Z",
"storage": {
"total_size_mb": 2.45,
"vector_db": "qdrant"
}
}
POST /api/v1/nlp/index/push/{project_id}
Description: Generates embeddings and stores them in the vector database
Request Body:
{
"batch_size": 100,
"vector_db": "qdrant" // or "pgvector"
}
Response:
{
"success": true,
"project_id": "my-project",
"vectors_indexed": 150,
"collection_name": "my-project",
"vector_db": "qdrant",
"embedding_model": "text-embedding-3-small",
"indexing_time": "4.5s",
"batches_processed": 2
}
GET /api/v1/nlp/index/info/{project_id}
Response:
{
"collection_name": "my-project",
"vectors_count": 150,
"vector_db": "qdrant",
"config": {
"vector_size": 1536,
"distance": "Cosine",
"indexed": true
},
"stats": {
"total_points": 150,
"indexed_points": 150,
"segments_count": 1
}
}
POST /api/v1/nlp/index/search/{project_id}
Request Body:
{
"query": "machine learning applications in healthcare",
"top_k": 5,
"score_threshold": 0.7,
"vector_db": "qdrant"
}
Response:
{
"results": [
{
"chunk_id": "507f1f77bcf86cd799439011",
"text": "Machine learning has transformed healthcare through predictive diagnostics...",
"score": 0.89,
"metadata": {
"source": "healthcare_research.pdf",
"page": 5,
"chunk_index": 23
}
}
],
"total_results": 5,
"search_time": "0.12s",
"vector_db": "qdrant"
}
POST /api/v1/nlp/index/answer/{project_id}
Request Body:
{
"question": "How does RAG improve LLM accuracy?",
"locale": "en",
"temperature": 0.7,
"max_tokens": 500,
"top_k": 5,
"use_gpu": true,
"stream": false
}
Response:
{
"answer": "RAG (Retrieval-Augmented Generation) improves LLM accuracy through several mechanisms: 1) It grounds responses in factual, retrieved context rather than relying solely on parametric memory...",
"sources": [
{
"chunk_id": "507f1f77bcf86cd799439011",
"text": "Retrieved context snippet...",
"relevance_score": 0.89,
"source_file": "rag_paper.pdf",
"page": 3
}
],
"metadata": {
"model": "gemma:2b",
"provider": "ollama",
"tokens_used": 450,
"processing_time": "1.8s",
"gpu_accelerated": true,
"retrieval_time": "0.15s",
"generation_time": "1.65s"
}
}
POST /api/v1/nlp/index/switch-vectordb/{project_id}
Request Body:
{
"target_db": "pgvector", // "qdrant" or "pgvector"
"migrate_data": true
}
Response:
{
"success": true,
"project_id": "my-project",
"previous_db": "qdrant",
"current_db": "pgvector",
"vectors_migrated": 150,
"migration_time": "3.2s"
}
mini-rag/
├── src/
│ ├── routes/ # 📡 API Endpoints Layer
│ │ ├── base.py # Welcome & health check
│ │ ├── data.py # File upload & processing
│ │ ├── nlp.py # Indexing, search, Q&A
│ │ └── schemes/ # Request/response schemas
│ │ ├── upload.py
│ │ ├── process.py
│ │ └── query.py
│ │
│ ├── controllers/ # 🎮 Business Logic Layer
│ │ ├── BaseController.py # Shared utilities
│ │ ├── ProjectController.py # Project management
│ │ ├── DataController.py # File validation
│ │ ├── ProcessController.py # Document chunking
│ │ └── NLPController.py # RAG orchestration
│ │
│ ├── models/ # 💾 Database Layer
│ │ ├── ProjectModel.py # Project CRUD (PostgreSQL)
│ │ ├── AssetModel.py # File asset CRUD
│ │ ├── ChunkModel.py # Chunk CRUD
│ │ ├── VectorModel.py # Vector embeddings CRUD
│ │ ├── db_schemes/ # SQLAlchemy schemas
│ │ │ ├── project.py
│ │ │ ├── asset.py
│ │ │ ├── data_chunk.py
│ │ │ └── vector_embedding.py
│ │ ├── enums/ # Constants
│ │ │ ├── file_types.py
│ │ │ └── status.py
│ │ └── database.py # PostgreSQL connection
│ │
│ ├── stores/ # 🔌 External Service Abstractions
│ │ ├── llm/ # LLM Provider Integration
│ │ │ ├── LLMInterface.py
│ │ │ ├── LLMProviderFactory.py
│ │ │ ├── LLMEnums.py
│ │ │ ├── providers/
│ │ │ │ ├── OpenAIProvider.py
│ │ │ │ ├── CoHereProvider.py
│ │ │ │ └── OllamaProvider.py # GPU-accelerated (Colab)
│ │ │ └── templates/ # Prompt templates
│ │ │ ├── template_parser.py
│ │ │ └── locales/
│ │ │ ├── en/
│ │ │ │ └── rag.py
│ │ │ └── ar/
│ │ │ └── rag.py
│ │ │
│ │ └── vectordb/ # Vector Database Integration
│ │ ├── VectorDBInterface.py
│ │ ├── VectorDBProviderFactory.py
│ │ ├── VectorDBEnums.py
│ │ └── providers/
│ │ ├── QdrantDBProvider.py
│ │ └── PGVectorProvider.py # PostgreSQL + PGVector
│ │
│ ├── helpers/ # ⚙️ Utility Functions
│ │ ├── config.py # Environment config loader
│ │ ├── logger.py # Logging configuration
│ │ └── validators.py # Input validation
│ │
│ └── assets/ # 📦 File Storage
│ └── files/ # Uploaded documents
│ └── {project_id}/
│
├── docker/ # 🐳 Docker Configuration
│ ├── docker-compose.yml # PostgreSQL + Qdrant services
│ ├── init-scripts/ # Database initialization
│ │ └── 01-init-pgvector.sql
│ ├── .env.example
│ └── .env
│
├── notebooks/ # 📓 Google Colab Notebooks
│ ├── ollama_server_setup.ipynb # Colab GPU setup guide
│ └── model_testing.ipynb # Model performance testing
│
├── tests/ # 🧪 Unit & Integration Tests
│ ├── test_controllers.py
│ ├── test_vectordb.py
│ └── test_ollama_provider.py
│
├── scripts/ # 🛠️ Utility Scripts
│ ├── migrate_vectordb.py # Migrate between Qdrant/PGVector
│ ├── benchmark_models.py # Compare model performance
│ └── backup_database.py # PostgreSQL backup utility
│
├── .vscode/ # 💻 Editor Settings
├── main.py # 🚀 Application entry point
├── requirements.txt # 📦 Python dependencies
├── .env.example # ⚙️ Environment template
├── .gitignore
└── README.md # 📖 Documentation
| Layer | Components | Purpose |
|---|---|---|
| API | `routes/` | HTTP request handling, input validation |
| Business Logic | `controllers/` | Orchestration, workflow management |
| Data Access | `models/` | PostgreSQL CRUD operations |
| External Services | `stores/llm/`, `stores/vectordb/` | LLM and vector DB integrations |
| Configuration | `helpers/` | Settings management, logging |
| Storage | `assets/` | File persistence |
| Infrastructure | `docker/` | Database containers, init scripts |
| Cloud GPU | `notebooks/` | Colab setup, model deployment |
┌─────────────────┐
│ 1. UPLOAD │ User uploads PDF/TXT files via API
└────────┬────────┘
│
▼
┌─────────────────┐
│ 2. STORE │ Files saved to local storage + metadata to PostgreSQL
└────────┬────────┘
│
▼
┌─────────────────┐
│ 3. PROCESS │ Documents split into chunks via LangChain
│ │ • CharacterTextSplitter (500 chars, 50 overlap)
└────────┬────────┘ • Chunks stored in PostgreSQL
│
▼
┌─────────────────┐
│ 4. EMBED │ Chunks → Vector embeddings
│ │ • OpenAI: text-embedding-3-small (1536D)
│ │ • Cohere: embed-english-v3.0 (1024D)
└────────┬────────┘ • Batch processing for efficiency
│
▼
┌─────────────────┐
│ 5. INDEX │ Vectors stored in dual databases:
│ │ • Qdrant: Dedicated vector search
│ │ • PGVector: PostgreSQL native extension
└────────┬────────┘ • IVFFlat index for fast retrieval
│
▼
┌─────────────────┐
│ 6. QUERY │ User submits natural language question
└────────┬────────┘
│
▼
┌─────────────────┐
│ 7. RETRIEVE │ Semantic search pipeline:
│ │ • Query → Embedding
│ │ • Vector similarity search (cosine)
│ │ • Top-K most relevant chunks (K=5)
└────────┬────────┘ • Score filtering (threshold=0.7)
│
▼
┌─────────────────┐
│ 8. AUGMENT │ Context construction:
│ │ • Prompt template (locale-aware)
│ │ • System instructions
│ │ • Retrieved chunks as context
└────────┬────────┘ • User question
│
▼
┌─────────────────┐
│ 9. GENERATE │ LLM inference (GPU-accelerated):
│ │ • Ollama on Colab T4 GPU (via ngrok)
│ │ • Gemma 2B/7B or Qwen 2.5
│ │ • Or OpenAI/Cohere API
└────────┬────────┘ • Temperature-controlled generation
│
▼
┌─────────────────┐
│ 10. RESPONSE │ Structured JSON response:
│ │ • Generated answer
│ │ • Source citations
│ │ • Confidence scores
└─────────────────┘ • Performance metadata
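Seen from a client, the whole pipeline above reduces to four API calls. The sketch below is illustrative only (error handling and polling omitted) and uses the endpoints documented in the API Reference:

# End-to-end client sketch (illustrative; endpoints as documented above)
import requests

BASE = "http://localhost:5000/api/v1"
project = "my-project"

# 1-2. Upload and store a document
with open("document.pdf", "rb") as f:
    requests.post(f"{BASE}/data/upload/{project}", files={"files": f})

# 3. Split the uploaded documents into chunks
requests.post(f"{BASE}/data/process/{project}")

# 4-5. Embed and index the chunks
requests.post(f"{BASE}/nlp/index/push/{project}", json={"batch_size": 100})

# 6-10. Ask a question and print the grounded answer
resp = requests.post(
    f"{BASE}/nlp/index/answer/{project}",
    json={"question": "What are the key findings?", "locale": "en"},
)
print(resp.json()["answer"])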
# User uploads files
POST /api/v1/data/upload/medical-research
↓
# System validates and stores files
- File validation (PDF/TXT, size limits)
- Generate unique asset IDs
- Save to: assets/files/medical-research/
- Metadata → PostgreSQL (AssetModel)
↓
# Document processing triggered
POST /api/v1/data/process/medical-research
↓
# LangChain pipeline
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separator="\n"
)
chunks = splitter.split_documents(documents)
↓
# Chunks stored in PostgreSQL
- chunk_id (UUID)
- project_id
- text content
- metadata (source, page, position)
- created_at timestamp
# Generate embeddings
POST /api/v1/nlp/index/push/medical-research
{
"batch_size": 100,
"vector_db": "qdrant"
}
↓
# Batch processing workflow
chunks = ChunkModel.get_by_project("medical-research")
batches = create_batches(chunks, size=100)
for batch in batches:
# Generate embeddings (API or local)
embeddings = llm_provider.create_embeddings([c.text for c in batch])
# Store in vector DB
if vector_db == "qdrant":
qdrant.upsert(
collection_name="medical-research",
points=[
PointStruct(
id=chunk.id,
vector=embedding,
payload=chunk.metadata
)
for chunk, embedding in zip(batch, embeddings)
]
)
elif vector_db == "pgvector":
# PostgreSQL with PGVector extension
INSERT INTO vector_embeddings (chunk_id, embedding, metadata)
VALUES (%s, %s::vector, %s)
↓
# Create indexes for fast retrieval
- Qdrant: HNSW index (M=16, ef_construct=100)
- PGVector: IVFFlat index (lists=100)
# User query received
POST /api/v1/nlp/index/answer/medical-research
{
"question": "What are the side effects of the treatment?",
"locale": "en",
"temperature": 0.7,
"top_k": 5
}
↓
# Step 1: Query embedding
query_embedding = llm_provider.create_embedding(question)
↓
# Step 2: Vector similarity search
if vector_db == "qdrant":
results = qdrant.search(
collection_name="medical-research",
query_vector=query_embedding,
limit=5,
score_threshold=0.7
)
elif vector_db == "pgvector":
SELECT chunk_id, text, metadata,
1 - (embedding <=> %s::vector) as similarity
FROM vector_embeddings
WHERE project_id = 'medical-research'
ORDER BY embedding <=> %s::vector
LIMIT 5
↓
# Step 3: Context preparation
retrieved_chunks = [
f"[Source {i+1}] {result.text}"
for i, result in enumerate(results)
]
context = "\n\n".join(retrieved_chunks)
↓
# Step 4: Prompt construction (locale-aware)
from src.stores.llm.templates import get_template
template = get_template("rag", locale="en")
prompt = template.format(
context=context,
question=question
)
↓
# Step 5: LLM generation
if llm_provider == "ollama":
# GPU-accelerated on Colab via ngrok
response = requests.post(
f"{OLLAMA_BASE_URL}/api/generate",
json={
"model": "gemma:2b",
"prompt": prompt,
"temperature": 0.7,
"max_tokens": 500
}
)
answer = response.json()["response"]
elif llm_provider == "openai":
response = openai.ChatCompletion.create(
model="gpt-4-turbo-preview",
messages=[
{"role": "system", "content": template.system},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
temperature=0.7
)
answer = response.choices[0].message.content
↓
# Step 6: Response formatting
return {
"answer": answer,
"sources": [
{
"chunk_id": result.id,
"text": result.text,
"relevance_score": result.score,
"metadata": result.metadata
}
for result in results
],
"metadata": {
"model": model_name,
"provider": provider,
"tokens_used": token_count,
"processing_time": elapsed_time,
"gpu_accelerated": True if ollama else False
}
}
| Component | Latency | Throughput | Scalability |
|---|---|---|---|
| File Upload | 100-500ms | 10 files/sec | Horizontal (API) |
| Chunking | 1-5s/doc | 20 docs/min | CPU-bound |
| Embedding (API) | 200-800ms | 1000 req/min | API rate limits |
| Embedding (GPU) | 50-200ms | 5000 req/min | GPU memory |
| Vector Search | 10-100ms | 1000 req/sec | Index quality |
| LLM (Ollama) | 500-2000ms | 30 req/min | GPU compute |
| LLM (API) | 1000-3000ms | Rate limited | Token limits |
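A quick way to sanity-check the LLM rows on your own setup is to time the answer endpoint directly (a rough sketch; endpoint and payload follow the API Reference above):

# Rough end-to-end latency check for the answer endpoint
import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:5000/api/v1/nlp/index/answer/my-project",
    json={"question": "What are the key findings?", "locale": "en"},
    timeout=120,
)
elapsed = time.time() - start
print(f"End-to-end: {elapsed:.2f}s via {resp.json()['metadata']['provider']}")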
# Clone repository
git clone https://github.com/yourusername/mini-rag.git
cd mini-rag
# Create development environment
conda create -n mini-rag-dev python=3.8 -y
conda activate mini-rag-dev
# Install dependencies with development tools
pip install -r requirements.txt
pip install -r requirements-dev.txt # pytest, black, flake8, mypy
# Setup pre-commit hooks
pip install pre-commit
pre-commit install
- Formatting: Use `black` with line length 100
- Linting: Follow `flake8` rules
- Type Hints: Add type annotations for all public functions
- Docstrings: Use Google-style docstrings
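Assuming the tools come from `requirements-dev.txt`, the standards above map to commands like:

# Format, lint, and type-check per the standards above
black --line-length 100 src/
flake8 src/
mypy src/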
from typing import Any, Dict, List
import logging
import os
logger = logging.getLogger(__name__)
def process_document(
file_path: str,
chunk_size: int = 500,
chunk_overlap: int = 50
) -> List[Dict[str, Any]]:
"""
Process a document into chunks with embeddings.
Args:
file_path: Absolute path to the document file
chunk_size: Maximum characters per chunk (default: 500)
chunk_overlap: Character overlap between chunks (default: 50)
Returns:
List of dictionaries containing chunk text and metadata
Raises:
FileNotFoundError: If file doesn't exist
ValueError: If chunk_size < chunk_overlap
Example:
>>> chunks = process_document("paper.pdf", chunk_size=1000)
>>> len(chunks)
45
"""
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
if chunk_size < chunk_overlap:
raise ValueError("chunk_size must be >= chunk_overlap")
logger.info(f"Processing document: {file_path}")
    # Implementation...
# Run all tests
pytest
# Run with coverage report
pytest --cov=src --cov-report=html tests/
# Run specific test file
pytest tests/test_nlp_controller.py
# Run with verbose output
pytest -v tests/
# Run only integration tests
pytest -m integration tests/
# Run and generate XML report for CI/CD
pytest --junitxml=test-results.xml
# tests/test_ollama_provider.py
import pytest
import requests
from src.helpers.config import settings
def test_ollama_connection():
"""Test connection to Colab Ollama server via ngrok."""
url = f"{settings.OLLAMA_BASE_URL}/api/version"
response = requests.get(url, timeout=10)
assert response.status_code == 200
assert "version" in response.json()
def test_ollama_embedding():
"""Test embedding generation via Ollama."""
from src.stores.llm.providers.OllamaProvider import OllamaProvider
provider = OllamaProvider()
embedding = provider.create_embedding("Test text")
assert isinstance(embedding, list)
assert len(embedding) > 0
assert all(isinstance(x, float) for x in embedding)
def test_ollama_chat():
"""Test chat completion via Ollama."""
from src.stores.llm.providers.OllamaProvider import OllamaProvider
provider = OllamaProvider()
response = provider.generate_chat_completion([
{"role": "user", "content": "Hello, how are you?"}
])
assert isinstance(response, str)
    assert len(response) > 0
- Create provider class:
# src/stores/llm/providers/CustomProvider.py
from typing import List, Dict
from ..LLMInterface import LLMInterface
class CustomProvider(LLMInterface):
"""Custom LLM provider implementation."""
def __init__(self, api_key: str, base_url: str):
self.api_key = api_key
self.base_url = base_url
def create_embedding(self, text: str) -> List[float]:
"""Generate embedding vector for text."""
# Implementation
pass
def create_embeddings_batch(self, texts: List[str]) -> List[List[float]]:
"""Generate embeddings for multiple texts."""
# Implementation
pass
def generate_chat_completion(
self,
messages: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: int = 500
) -> str:
"""Generate chat completion."""
# Implementation
        pass
- Register in factory:
# src/stores/llm/LLMProviderFactory.py
from .providers.CustomProvider import CustomProvider
class LLMProviderFactory:
@staticmethod
def create_provider(provider_name: str):
if provider_name == "custom":
return CustomProvider(
api_key=settings.CUSTOM_API_KEY,
base_url=settings.CUSTOM_BASE_URL
)
        # ... existing providers
- Update configuration:
# .env
LLM_PROVIDER=custom
CUSTOM_API_KEY=your-api-key
CUSTOM_BASE_URL=https://api.custom-llm.com
- Create provider class:
# src/stores/vectordb/providers/WeaviateProvider.py
from typing import List, Dict
from ..VectorDBInterface import VectorDBInterface
class WeaviateProvider(VectorDBInterface):
"""Weaviate vector database provider."""
def __init__(self, url: str, api_key: str):
import weaviate
self.client = weaviate.Client(
url=url,
auth_client_secret=weaviate.AuthApiKey(api_key)
)
def create_collection(self, collection_name: str, vector_size: int):
"""Create new collection."""
pass
def upsert_vectors(
self,
collection_name: str,
vectors: List[List[float]],
ids: List[str],
metadata: List[Dict]
):
"""Insert or update vectors."""
pass
def search(
self,
collection_name: str,
query_vector: List[float],
top_k: int = 5
) -> List[Dict]:
"""Semantic similarity search."""
        pass
- Update Docker setup (if needed):
# docker/docker-compose.yml
weaviate:
image: semitechnologies/weaviate:latest
ports:
- "8080:8080"
environment:
AUTHENTICATION_APIKEY_ENABLED: 'true'
AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'your-api-key'
volumes:
  - weaviate_data:/var/lib/weaviate
# scripts/benchmark_models.py
import time
from src.stores.llm.LLMProviderFactory import LLMProviderFactory
def benchmark_embedding_speed():
"""Compare embedding generation speed across providers."""
test_texts = ["Sample text"] * 100
providers = ["openai", "cohere", "ollama"]
results = {}
for provider_name in providers:
provider = LLMProviderFactory.create_provider(provider_name)
start = time.time()
embeddings = provider.create_embeddings_batch(test_texts)
elapsed = time.time() - start
results[provider_name] = {
"total_time": elapsed,
"avg_time": elapsed / len(test_texts),
"throughput": len(test_texts) / elapsed
}
return results
def benchmark_generation_quality():
"""Compare answer quality across models."""
# Implementation
pass
if __name__ == "__main__":
results = benchmark_embedding_speed()
    print(results)
Error: psycopg.OperationalError: connection to server failed
Solutions:
# Check if PostgreSQL container is running
docker compose ps postgres
# View PostgreSQL logs
docker compose logs postgres
# Verify connection settings
docker compose exec postgres psql -U admin -d mini_rag_db -c "\conninfo"
# Test connection from host
psql -h localhost -p 5432 -U admin -d mini_rag_db
# Restart PostgreSQL
docker compose restart postgres
Error: relation "vector_embeddings" does not exist
Solutions:
# Verify PGVector extension is installed
docker compose exec postgres psql -U admin -d mini_rag_db -c "\dx"
# Manually create extension
docker compose exec postgres psql -U admin -d mini_rag_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Run initialization script
docker compose exec postgres psql -U admin -d mini_rag_db -f /docker-entrypoint-initdb.d/01-init-pgvector.sql
# Check table exists
docker compose exec postgres psql -U admin -d mini_rag_db -c "\dt"
Error: vector dimension mismatch
Solutions:
-- Check current vector size
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'vector_embeddings';
-- Drop and recreate table with correct size
DROP TABLE IF EXISTS vector_embeddings;
CREATE TABLE vector_embeddings (
id SERIAL PRIMARY KEY,
chunk_id VARCHAR(255) UNIQUE NOT NULL,
project_id VARCHAR(255) NOT NULL,
embedding vector(1536), -- Match your embedding model
metadata JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Error: QdrantException: Connection refused
Solutions:
# Check Qdrant status
docker compose ps qdrant
docker compose logs qdrant
# Verify Qdrant is accessible
curl http://localhost:6333/collections
# Restart Qdrant
docker compose restart qdrant
# Check Qdrant dashboard
open http://localhost:6333/dashboard
Error: requests.exceptions.ConnectionError: Failed to establish connection
Solutions:
Check ngrok tunnel:
# In your Colab notebook
from pyngrok import ngrok
# List active tunnels
tunnels = ngrok.get_tunnels()
print(tunnels)
# Restart tunnel if needed
ngrok.kill()
public_url = ngrok.connect(11434, "http")
print(f"New URL: {public_url}")Verify Ollama server:
# In Colab cell
!curl http://localhost:11434/api/version
Update `.env` file:
# Update with new ngrok URL
OLLAMA_BASE_URL=https://new-url-from-ngrok.ngrok-free.app
Test connection from local machine:
import requests
url = "https://your-ngrok-url.ngrok-free.app/api/version"
response = requests.get(url)
print(response.json())
Error: Colab session disconnected
Solutions:
- Run the auto-click JavaScript code (see Installation section)
- Use Colab Pro for longer sessions (24 hours)
- Set up an automatic session restarter (see the watchdog sketch below)
- Consider self-hosting Ollama on a dedicated server
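A simple local watchdog (a sketch, run on your own machine; replace the URL with the ngrok endpoint from your `.env`) can tell you as soon as the tunnel drops so you can rerun the Colab cells:

# Local watchdog: poll the Colab/ngrok Ollama endpoint and warn when it drops
import time
import requests

OLLAMA_BASE_URL = "https://your-ngrok-url.ngrok-free.app"  # same value as in .env

while True:
    try:
        requests.get(f"{OLLAMA_BASE_URL}/api/version", timeout=10).raise_for_status()
        print("Ollama tunnel OK")
    except requests.RequestException:
        print("Tunnel unreachable; rerun the Colab Ollama/ngrok cells")
    time.sleep(300)  # check every 5 minutes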
Error: CUDA out of memory
Solutions:
Switch to smaller model:
# In Colab
!ollama pull gemma:2b  # Instead of gemma:7b
Clear GPU cache:
# In Colab
import torch
torch.cuda.empty_cache()
Reduce batch size:
# In .env
BATCH_SIZE=50  # Instead of 100
Error: Vector dimension mismatch: expected 1536, got 1024
Solution: Ensure consistency between embedding model and vector DB configuration
# For OpenAI text-embedding-3-small (1536 dimensions)
VECTOR_SIZE=1536
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
# For OpenAI text-embedding-3-large (3072 dimensions)
VECTOR_SIZE=3072
OPENAI_EMBEDDING_MODEL=text-embedding-3-large
# For Cohere embed-english-v3.0 (1024 dimensions)
VECTOR_SIZE=1024
COHERE_EMBEDDING_MODEL=embed-english-v3.0
Update PGVector table:
ALTER TABLE vector_embeddings
ALTER COLUMN embedding TYPE vector(1536); -- Match your size
Recreate Qdrant collection:
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
collection_name="my-project",
vectors_config={"size": 1536, "distance": "Cosine"}
)
Error: RateLimitError: Rate limit exceeded
Solutions:
For OpenAI:
# Implement exponential backoff
import time
from openai import RateLimitError
def create_embeddings_with_retry(texts, max_retries=3):
for attempt in range(max_retries):
try:
return openai.Embedding.create(input=texts)
except RateLimitError:
wait_time = 2 ** attempt
time.sleep(wait_time)
    raise Exception("Max retries exceeded")
For batch processing:
# Reduce batch size
BATCH_SIZE=20 # Instead of 100
# Add delay between batches
BATCH_DELAY_SECONDS=2
Diagnosis:
import time
start = time.time()
embeddings = provider.create_embeddings_batch(chunks)
print(f"Time: {time.time() - start:.2f}s for {len(chunks)} chunks")Solutions:
- Use batch processing:
# Instead of sequential
for chunk in chunks:
embedding = provider.create_embedding(chunk.text)
# Use batching
batch_texts = [chunk.text for chunk in chunks]
embeddings = provider.create_embeddings_batch(batch_texts)
- Switch to local Ollama (if using API):
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=https://your-colab-ngrok-url
- Parallel processing:
from concurrent.futures import ThreadPoolExecutor
def process_batch(batch):
return provider.create_embeddings_batch(batch)
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_batch, batches)
Diagnosis:
start = time.time()
results = vectordb.search(query_vector, top_k=5)
print(f"Search time: {time.time() - start:.2f}s")Solutions:
- Optimize PGVector index:
-- Use IVFFlat for better speed/accuracy tradeoff
CREATE INDEX vector_embeddings_embedding_idx
ON vector_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- Increase for larger datasets
-- Vacuum and analyze
VACUUM ANALYZE vector_embeddings;
- Optimize Qdrant index:
client.recreate_collection(
collection_name="project",
vectors_config={
"size": 1536,
"distance": "Cosine"
},
hnsw_config={
"m": 16, # Increase for better recall
"ef_construct": 100, # Increase for better quality
}
)
- Reduce top_k:
TOP_K_RESULTS=3  # Instead of 10
- Add score threshold:
results = vectordb.search(
query_vector,
top_k=10,
score_threshold=0.7 # Only return highly relevant results
)
Solutions:
- Reduce chunk size:
CHUNKING_SIZE=300  # Instead of 500
- Process files individually:
for file in files:
process_single_file(file)
    # Clear cache between files
- Increase Docker memory limits:
# docker-compose.yml
services:
postgres:
deploy:
resources:
limits:
          memory: 2G
| Provider | Model | Time (s) | Throughput (chunks/s) | Cost |
|---|---|---|---|---|
| OpenAI API | text-embedding-3-small | 12.5 | 80 | $0.0001/k |
| Cohere API | embed-english-v3.0 | 15.3 | 65 | $0.0001/k |
| Ollama (Colab) | nomic-embed-text | 8.2 | 122 | Free |
| Database | Index Type | Search Time (ms) | Memory (GB) | Accuracy |
|---|---|---|---|---|
| Qdrant | HNSW | 15 | 2.1 | 0.98 |
| PGVector | IVFFlat | 45 | 1.8 | 0.95 |
| Component | Latency (ms) | Notes |
|---|---|---|
| Query embedding | 120 | OpenAI API |
| Vector search | 25 | Qdrant HNSW |
| LLM generation | 1500 | Ollama Gemma 2B on T4 |
| Total | 1645 | ~1.6s end-to-end |
Benchmarks conducted on: Intel i7-10700K, 32GB RAM, Tesla T4 GPU (Colab)
- Never commit `.env` files to version control
- Use environment variables for all secrets
- Rotate API keys regularly
- Use different keys for dev/staging/production
-- Use strong passwords
POSTGRES_PASSWORD=$(openssl rand -base64 32)
-- Restrict network access
# docker-compose.yml
services:
postgres:
ports:
- "127.0.0.1:5432:5432" # Only localhost
-- Enable SSL for production
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'# main.py
from fastapi import FastAPI, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi_limiter import FastAPILimiter
import redis.asyncio as redis
app = FastAPI()
# CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["https://yourdomain.com"], # Specific origins
allow_methods=["GET", "POST"],
allow_headers=["*"],
)
# Rate limiting
@app.on_event("startup")
async def startup():
    redis_client = redis.from_url("redis://localhost")
    await FastAPILimiter.init(redis_client)
Made with ❤️ by Boudy Ibrahim