A modular, production-ready Retrieval-Augmented Generation (RAG) system for intelligent question answering over document collections, powered by PostgreSQL and GPU-accelerated inference.
Features • Quick Start • Documentation • Architecture • API Reference
- Overview
- Features
- Architecture
- Prerequisites
- Installation
- Configuration
- Usage
- API Reference
- Project Structure
- RAG Pipeline
- Development
- Troubleshooting
RAG System is a minimalist yet powerful implementation of a Retrieval-Augmented Generation (RAG) system designed for building intelligent chatbots that answer questions based on your document collections. The system combines semantic search with large language models to provide accurate, context-aware responses, leveraging PostgreSQL with PGVector for efficient vector storage and Google Colab GPU infrastructure for accelerated model inference.
Retrieval-Augmented Generation enhances LLM responses by:
- Retrieving relevant context from a knowledge base using vector similarity search
- Augmenting the user query with retrieved information
- Generating accurate answers using the enriched context via GPU-accelerated LLMs
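In code, the loop looks roughly like the sketch below (illustrative only; `embed`, `vector_search`, and `llm_generate` are placeholder names, not this project's actual API):

# Sketch of the retrieve → augment → generate loop (placeholder helpers)
def answer(question: str, top_k: int = 5) -> str:
    query_vector = embed(question)                        # embed the user query
    chunks = vector_search(query_vector, top_k=top_k)     # retrieve the most similar chunks
    context = "\n\n".join(c["text"] for c in chunks)      # augment: build the context block
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)                           # generate the grounded answer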
- 📚 Document Q&A: Query large document collections (PDFs, text files)
- 🏢 Enterprise Knowledge Base: Build internal chatbots over company documentation
- 📖 Research Assistant: Quickly find and summarize information from research papers
- 🎓 Educational Tools: Create tutoring systems based on course materials
- 💼 Customer Support: Automate responses using product documentation
- 🔍 Hybrid Vector Search: Dual vector database support with Qdrant and PGVector (PostgreSQL)
- 🚀 GPU-Accelerated Inference: Ollama server running on Google Colab T4 GPU for high-performance local LLM inference
- 🌐 Secure Remote Access: ngrok tunneling for secure public endpoint exposure to Colab-hosted models
- 🤖 Multi-LLM Support: Compatible with OpenAI GPT, Cohere, and local Ollama models (Gemma, Qwen)
- 📄 Document Processing: Automatic chunking and processing of PDF and TXT files
- 🗄️ PostgreSQL Backend: Robust relational database for metadata and document storage
- 📊 PGVector Integration: Native PostgreSQL vector similarity search capabilities
- 🌍 Multilingual: Built-in support for English and Arabic prompts
- 🎯 Project Management: Organize documents into separate projects
- ⚡ Batch Processing: Efficient batch embedding and indexing
- 🔌 REST API: Clean, well-documented FastAPI endpoints
- Modular Architecture: Clean separation of concerns (routes, controllers, models, stores)
- Provider Abstraction: Easily swap LLM and vector DB providers via a factory pattern (see the sketch after this list)
- Production Ready: PostgreSQL integration, Docker support, environment-based configuration
- Template System: Customizable prompt templates with locale support
- Comprehensive Error Handling: Robust validation and error responses
- Scalable Design: Supports concurrent requests and batch operations
- Cloud GPU Integration: Leverage free Google Colab T4 GPUs for cost-effective inference
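As a rough illustration of the provider abstraction, selecting a provider from configuration might look like this (a simplified stand-in, not the exact `LLMProviderFactory` shown later in this README):

# Simplified sketch of config-driven provider selection
class ProviderFactory:
    @staticmethod
    def create(provider_name: str):
        if provider_name == "openai":
            return OpenAIProvider(api_key=settings.OPENAI_API_KEY)
        if provider_name == "ollama":
            return OllamaProvider(base_url=settings.OLLAMA_BASE_URL)
        raise ValueError(f"Unknown LLM provider: {provider_name}")

# Swapping providers is then a one-line change in .env (LLM_PROVIDER=ollama)
llm = ProviderFactory.create(settings.LLM_PROVIDER)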
┌─────────────────────────────────────────────────────────────────┐
│ Client Layer │
│ (FAST API Endpoints) │
└────────────────────────┬────────────────────────────────────────┘
│
┌────────────────────────▼────────────────────────────────────────┐
│ Business Logic Layer │
│ (Controllers - Orchestration) │
└─────┬──────────────────┬──────────────────┬─────────────────────┘
│ │ │
┌─────▼─────┐ ┌────────▼────────┐ ┌────▼────────────────────────────────┐
│ Data │ │ Persistence │ │ External Services │
│ Layer │ │ Layer │ │ │
│ │ │ │ │ ┌─────────────────────────────────┐ │
│ Models │ │ PostgreSQL │ │ │ LLM Providers │ │
│ (CRUD) │ │ (Documents & │ │ │ • OpenAI (API) │ │
│ │ │ Metadata) │ │ │ • Cohere (API) │ │
│ │ │ │ │ │ • Ollama (GPU-Accelerated) │ │
│ │ │ │ │ │ - Gemma │ │
└───────────┘ └─────────────────┘ │ │ - Qwen │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Vector Database (Dual) │ │
│ │ • Qdrant (Dedicated Vector DB) │ │
│ │ • PGVector (PostgreSQL Ext.) │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Ollama GPU Runtime (Remote) │ │
│ │ • Google Colab (Free T4 GPU) │ │
│ │ • ngrok Tunnel (Secure Access) │ │
│ │ • FAST API Endpoint │ │
│ └─────────────────────────────────┘ │
└────────────────────────────────────────┘
| Component | Responsibility | Technology |
|---|---|---|
| API Layer | HTTP request handling, validation | FastAPI, Pydantic |
| Controllers | Business logic orchestration | Python |
| Models | Database CRUD operations | SQLAlchemy, Psycopg3 |
| LLM Store | LLM provider integration & routing | OpenAI SDK, Cohere SDK, Ollama |
| Vector Store | Semantic search operations (dual provider) | Qdrant, PGVector |
| Relational DB | Metadata & document management | PostgreSQL 15+ |
| Vector Extension | Native PostgreSQL vector operations | PGVector |
| Ollama Server | Local LLM inference runtime (remote) | Ollama (Gemma 2B/7B, Qwen 2.5) |
| GPU Runtime | Accelerated model inference (cloud) | Google Colab (Tesla T4 16GB) |
| Tunneling | Secure public endpoint exposure | ngrok (HTTPS) |
| Helpers | Configuration, utilities | Pydantic Settings |
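For the Helpers row, a minimal Pydantic-Settings loader is sketched below; the field names mirror the `.env` keys from the Configuration section, but the exact contents of `helpers/config.py` may differ:

# helpers/config.py (illustrative sketch, not the exact implementation)
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    LLM_PROVIDER: str = "ollama"
    OLLAMA_BASE_URL: str = "http://localhost:11434"
    VECTOR_DB_PROVIDER: str = "qdrant"
    POSTGRES_HOST: str = "localhost"
    POSTGRES_PORT: int = 5432

settings = Settings()  # values come from the environment / .env file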
- Python: 3.8 or later
- Docker: 20.10+ (for database services)
- Docker Compose: 1.29+
- Memory: Minimum 4GB RAM recommended (8GB preferred)
- Storage: 5GB free space (for models and data)
- Google Account: For accessing Google Colab
- Google Colab: Free tier with T4 GPU runtime (15GB VRAM)
- ngrok Account: Free tier for secure tunneling (optional but recommended)
- OpenAI API Key (for GPT models and embeddings), or
- Cohere API Key (alternative LLM provider)
- ngrok Auth Token (optional, for persistent tunnels)
# Download and install Miniconda
# Visit: https://docs.anaconda.com/free/miniconda/#quick-command-line-install
# Create virtual environment
conda create -n mini-rag python=3.8 -y
# Activate environment
conda activate mini-rag
# Install dependencies
pip install -r requirements.txt
# Create virtual environment
python -m venv venv
# Activate environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Navigate to docker directory
cd docker
# Copy environment template
cp .env.example .env
# Update .env with your credentials (see Configuration section)
nano .env # or use your preferred editor
# Start PostgreSQL and Qdrant services
docker compose up -d
# Verify services are running
docker compose ps
# Check PostgreSQL is accessible
docker compose exec postgres psql -U admin -d mini_rag_db -c "\l"
# Verify PGVector extension is loaded
docker compose exec postgres psql -U admin -d mini_rag_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
- Visit Google Colab
- Create a new notebook
- Enable GPU Runtime:
  - Click `Runtime` → `Change runtime type`
  - Select `T4 GPU` under Hardware accelerator
  - Click `Save`
Add the following cells to your Colab notebook:
# Cell 1: Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh
# Cell 2: Start Ollama server in background
import subprocess
import time
# Start Ollama server
ollama_process = subprocess.Popen(['ollama', 'serve'],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
time.sleep(5) # Wait for server to start
print("Ollama server started")
# Cell 3: Pull your preferred models
!ollama pull gemma:2b # Lightweight model (1.4GB)
# OR
!ollama pull gemma:7b # More capable model (4.8GB)
# OR
!ollama pull qwen2.5:3b # Alternative model (2GB)
# Verify installation
!ollama list
# Cell 4: Install ngrok
!pip install pyngrok
# Cell 5: Configure and start ngrok tunnel
from pyngrok import ngrok
# Optional: Set your ngrok auth token for persistent URLs
# Sign up at https://ngrok.com and get your token
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN") # Replace with your token
# Create tunnel to Ollama server (port 11434)
public_url = ngrok.connect(11434, "http")
print(f"\n🚀 Ollama Server Public URL: {public_url}")
print(f"\n📋 Copy this URL to your .env file as OLLAMA_BASE_URL")
# Keep the tunnel alive
import time
print("\n✅ Tunnel is active. Keep this cell running!")
try:
while True:
time.sleep(60)
except KeyboardInterrupt:
    print("\n🛑 Tunnel stopped")
Expected Output:
🚀 Ollama Server Public URL: https://1234-5678-9abc-def0.ngrok-free.app
📋 Copy this URL to your .env file as OLLAMA_BASE_URL
✅ Tunnel is active. Keep this cell running!
Important: Colab sessions time out after inactivity. Use one of these methods to keep the session alive:
Method 1: Run this cell to simulate activity
# Cell 6: Auto-click to prevent disconnect
from IPython.display import Javascript
display(Javascript('''
function ClickConnect(){
console.log("Clicking");
document.querySelector("colab-toolbar-button#connect").click()
}
setInterval(ClickConnect, 60000)
'''))
Method 2: Use a browser extension (e.g., Colab Autoclick)
Method 3: Upgrade to Colab Pro for longer sessions
# Copy environment template
cp .env.example .env
# ============================================
# LLM Provider Settings
# ============================================
# API-based providers
OPENAI_API_KEY=sk-your-openai-api-key-here
COHERE_API_KEY=your-cohere-api-key-here # Optional
# Ollama (GPU-accelerated on Colab)
OLLAMA_BASE_URL=https://1234-5678-9abc-def0.ngrok-free.app # From ngrok output
OLLAMA_MODEL=gemma:2b # Options: gemma:2b, gemma:7b, qwen2.5:3b
# Select active LLM provider
LLM_PROVIDER=ollama # Options: openai, cohere, ollama
# Model Configuration (for API providers)
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
OPENAI_CHAT_MODEL=gpt-4-turbo-preview
COHERE_EMBEDDING_MODEL=embed-english-v3.0
COHERE_CHAT_MODEL=command-r-plus
# ============================================
# Vector Database Settings
# ============================================
# Primary vector database
VECTOR_DB_PROVIDER=qdrant # Options: qdrant, pgvector
# Qdrant Configuration
QDRANT_HOST=localhost
QDRANT_PORT=6333
QDRANT_GRPC_PORT=6334
QDRANT_API_KEY= # Optional for local deployment
# PGVector Configuration (uses PostgreSQL)
PGVECTOR_HOST=localhost
PGVECTOR_PORT=5432
PGVECTOR_DATABASE=mini_rag_db
PGVECTOR_USER=admin
PGVECTOR_PASSWORD=your-secure-password
# Vector Search Parameters
VECTOR_SIZE=1536 # Must match embedding model output
DISTANCE_METRIC=cosine # Options: cosine, euclidean, dot
# ============================================
# PostgreSQL Settings
# ============================================
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_DATABASE=mini_rag_db
POSTGRES_USER=admin
POSTGRES_PASSWORD=your-secure-password
POSTGRES_SCHEMA=public
# Connection Pool Settings
POSTGRES_POOL_SIZE=10
POSTGRES_MAX_OVERFLOW=20
# ============================================
# Application Settings
# ============================================
# Document Processing
CHUNKING_SIZE=500 # Characters per chunk
CHUNKING_OVERLAP=50 # Overlap between chunks
BATCH_SIZE=100 # Chunks per batch for embedding
# Retrieval Settings
TOP_K_RESULTS=5 # Number of chunks to retrieve
# Language Settings
DEFAULT_LOCALE=en # Options: en, ar
# API Settings
API_HOST=0.0.0.0
API_PORT=5000
API_WORKERS=4
# ============================================
# PostgreSQL Configuration
# ============================================
POSTGRES_USER=admin
POSTGRES_PASSWORD=your-secure-password
POSTGRES_DB=mini_rag_db
# PGVector Extension
POSTGRES_EXTENSIONS=vector
# Resource Limits
POSTGRES_MAX_CONNECTIONS=100
POSTGRES_SHARED_BUFFERS=256MB
# ============================================
# Qdrant Configuration
# ============================================
QDRANT_API_KEY= # Optional for local deployment
QDRANT_STORAGE_PATH=/qdrant/storage
# Resource Limits
QDRANT_MAX_CONCURRENT_REQUESTS=100
Update `docker/docker-compose.yml` to include PostgreSQL with PGVector:
version: '3.8'
services:
postgres:
image: pgvector/pgvector:pg15
container_name: mini-rag-postgres
environment:
POSTGRES_USER: ${POSTGRES_USER}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: ${POSTGRES_DB}
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init-scripts:/docker-entrypoint-initdb.d
command: postgres -c shared_buffers=${POSTGRES_SHARED_BUFFERS:-256MB}
    restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER}"]
interval: 10s
timeout: 5s
retries: 5
qdrant:
image: qdrant/qdrant:latest
container_name: mini-rag-qdrant
ports:
- "6333:6333"
- "6334:6334"
volumes:
- qdrant_data:/qdrant/storage
environment:
QDRANT__SERVICE__API_KEY: ${QDRANT_API_KEY}
    restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
interval: 10s
timeout: 5s
retries: 5
volumes:
postgres_data:
  qdrant_data:
Create `docker/init-scripts/01-init-pgvector.sql`:
-- Enable PGVector extension
CREATE EXTENSION IF NOT EXISTS vector;
-- Create vector index table for embeddings
CREATE TABLE IF NOT EXISTS vector_embeddings (
id SERIAL PRIMARY KEY,
chunk_id VARCHAR(255) UNIQUE NOT NULL,
project_id VARCHAR(255) NOT NULL,
embedding vector(1536), -- Adjust size based on your embedding model
metadata JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Create index for fast similarity search
CREATE INDEX IF NOT EXISTS vector_embeddings_embedding_idx
ON vector_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Create index for project filtering
CREATE INDEX IF NOT EXISTS vector_embeddings_project_idx
ON vector_embeddings(project_id);
-- Grant permissions
GRANT ALL PRIVILEGES ON TABLE vector_embeddings TO admin;
GRANT USAGE, SELECT ON SEQUENCE vector_embeddings_id_seq TO admin;
# Navigate to docker directory
cd docker
# Start PostgreSQL and Qdrant
docker compose up -d
# Verify services are healthy
docker compose ps
# Check logs if needed
docker compose logs -f postgres
docker compose logs -f qdrant
- Open your Google Colab notebook
- Run all cells to start Ollama and ngrok
- Copy the ngrok URL to your `.env` file
# Return to project root
cd ..
# Activate your virtual environment
conda activate mini-rag # or source venv/bin/activate
# Start the server
uvicorn main:app --reload --host 0.0.0.0 --port 5000
# Or with custom workers for production
uvicorn main:app --host 0.0.0.0 --port 5000 --workers 4
Server will be available at: http://localhost:5000
- Swagger UI: http://localhost:5000/docs
- ReDoc: http://localhost:5000/redoc
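A quick sanity check once the server is running (the full response is documented under Base Endpoints below):

curl http://localhost:5000/api/v1/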
curl -X POST "http://localhost:5000/api/v1/data/upload/my-project" \
-F "files=@document.pdf" \
-F "files=@report.txt" \
-F "files=@research_paper.pdf"Response:
{
"success": true,
"project_id": "my-project",
"uploaded_files": [
{
"filename": "document.pdf",
"asset_id": "507f1f77bcf86cd799439011",
"file_path": "assets/files/my-project/abc123_document.pdf",
"size_kb": 245.7
}
]
}
curl -X POST "http://localhost:5000/api/v1/data/process/my-project"
Response:
{
"success": true,
"project_id": "my-project",
"chunks_created": 150,
"documents_processed": 3,
"processing_time": "2.3s"
}
curl -X POST "http://localhost:5000/api/v1/nlp/index/push/my-project" \
-H "Content-Type: application/json" \
-d '{
"batch_size": 100
}'
Response:
{
"success": true,
"project_id": "my-project",
"vectors_indexed": 150,
"collection_name": "my-project",
"vector_db": "qdrant",
"indexing_time": "4.5s"
}
curl -X POST "http://localhost:5000/api/v1/nlp/index/answer/my-project" \
-H "Content-Type: application/json" \
-d '{
"question": "What are the key findings in the document?",
"locale": "en",
"temperature": 0.7,
"max_tokens": 500
}'
Response:
{
"answer": "Based on the documents, the key findings include: 1) Implementation of RAG systems significantly improves response accuracy by 45%...",
"sources": [
{
"chunk_id": "507f1f77bcf86cd799439011",
"text": "RAG systems demonstrate improved performance...",
"relevance_score": 0.89,
"source_file": "research_paper.pdf",
"page": 5
}
],
"metadata": {
"model": "gemma:2b",
"provider": "ollama",
"tokens_used": 450,
"processing_time": "1.8s",
"gpu_accelerated": true
}
}
GET /api/v1/
Response:
{
"message": "Welcome to Mini-RAG API",
"version": "1.0.0",
"status": "healthy",
"services": {
"postgres": "connected",
"qdrant": "connected",
"ollama": "connected"
}
}
POST /api/v1/data/upload/{project_id}
Parameters:
- `project_id` (path): Unique project identifier
Request Body:
- `files`: List of files (multipart/form-data)
Supported Formats: `.pdf`, `.txt`
Response:
{
"success": true,
"project_id": "my-project",
"uploaded_files": [
{
"filename": "document.pdf",
"asset_id": "507f1f77bcf86cd799439011",
"file_path": "assets/files/my-project/abc123_document.pdf",
"size_kb": 245.7,
"pages": 12
}
],
"total_files": 1,
"total_size_mb": 0.24
}
POST /api/v1/data/process/{project_id}
Description: Loads documents and splits them into chunks using LangChain
Query Parameters:
- `chunk_size` (optional): Override default chunk size
- `chunk_overlap` (optional): Override default overlap
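For example, a request overriding both defaults (assuming they are passed as query parameters, as listed above):

curl -X POST "http://localhost:5000/api/v1/data/process/my-project?chunk_size=800&chunk_overlap=80"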
Response:
{
"success": true,
"project_id": "my-project",
"chunks_created": 150,
"documents_processed": 3,
"avg_chunk_size": 485,
"processing_time": "2.3s"
}
GET /api/v1/data/project/{project_id}
Response:
{
"project_id": "my-project",
"total_documents": 3,
"total_chunks": 150,
"total_vectors": 150,
"created_at": "2024-01-15T10:30:00Z",
"last_updated": "2024-01-15T14:45:00Z",
"storage": {
"total_size_mb": 2.45,
"vector_db": "qdrant"
}
}
POST /api/v1/nlp/index/push/{project_id}
Description: Generates embeddings and stores them in the vector database
Request Body:
{
"batch_size": 100,
"vector_db": "qdrant" // or "pgvector"
}
Response:
{
"success": true,
"project_id": "my-project",
"vectors_indexed": 150,
"collection_name": "my-project",
"vector_db": "qdrant",
"embedding_model": "text-embedding-3-small",
"indexing_time": "4.5s",
"batches_processed": 2
}
GET /api/v1/nlp/index/info/{project_id}
Response:
{
"collection_name": "my-project",
"vectors_count": 150,
"vector_db": "qdrant",
"config": {
"vector_size": 1536,
"distance": "Cosine",
"indexed": true
},
"stats": {
"total_points": 150,
"indexed_points": 150,
"segments_count": 1
}
}
POST /api/v1/nlp/index/search/{project_id}
Request Body:
{
"query": "machine learning applications in healthcare",
"top_k": 5,
"score_threshold": 0.7,
"vector_db": "qdrant"
}
Response:
{
"results": [
{
"chunk_id": "507f1f77bcf86cd799439011",
"text": "Machine learning has transformed healthcare through predictive diagnostics...",
"score": 0.89,
"metadata": {
"source": "healthcare_research.pdf",
"page": 5,
"chunk_index": 23
}
}
],
"total_results": 5,
"search_time": "0.12s",
"vector_db": "qdrant"
}
POST /api/v1/nlp/index/answer/{project_id}
Request Body:
{
"question": "How does RAG improve LLM accuracy?",
"locale": "en",
"temperature": 0.7,
"max_tokens": 500,
"top_k": 5,
"use_gpu": true,
"stream": false
}
Response:
{
"answer": "RAG (Retrieval-Augmented Generation) improves LLM accuracy through several mechanisms: 1) It grounds responses in factual, retrieved context rather than relying solely on parametric memory...",
"sources": [
{
"chunk_id": "507f1f77bcf86cd799439011",
"text": "Retrieved context snippet...",
"relevance_score": 0.89,
"source_file": "rag_paper.pdf",
"page": 3
}
],
"metadata": {
"model": "gemma:2b",
"provider": "ollama",
"tokens_used": 450,
"processing_time": "1.8s",
"gpu_accelerated": true,
"retrieval_time": "0.15s",
"generation_time": "1.65s"
}
}
POST /api/v1/nlp/index/switch-vectordb/{project_id}
Request Body:
{
"target_db": "pgvector", // "qdrant" or "pgvector"
"migrate_data": true
}
Response:
{
"success": true,
"project_id": "my-project",
"previous_db": "qdrant",
"current_db": "pgvector",
"vectors_migrated": 150,
"migration_time": "3.2s"
}
mini-rag/
├── src/
│ ├── routes/ # 📡 API Endpoints Layer
│ │ ├── base.py # Welcome & health check
│ │ ├── data.py # File upload & processing
│ │ ├── nlp.py # Indexing, search, Q&A
│ │ └── schemes/ # Request/response schemas
│ │ ├── upload.py
│ │ ├── process.py
│ │ └── query.py
│ │
│ ├── controllers/ # 🎮 Business Logic Layer
│ │ ├── BaseController.py # Shared utilities
│ │ ├── ProjectController.py # Project management
│ │ ├── DataController.py # File validation
│ │ ├── ProcessController.py # Document chunking
│ │ └── NLPController.py # RAG orchestration
│ │
│ ├── models/ # 💾 Database Layer
│ │ ├── ProjectModel.py # Project CRUD (PostgreSQL)
│ │ ├── AssetModel.py # File asset CRUD
│ │ ├── ChunkModel.py # Chunk CRUD
│ │ ├── VectorModel.py # Vector embeddings CRUD
│ │ ├── db_schemes/ # SQLAlchemy schemas
│ │ │ ├── project.py
│ │ │ ├── asset.py
│ │ │ ├── data_chunk.py
│ │ │ └── vector_embedding.py
│ │ ├── enums/ # Constants
│ │ │ ├── file_types.py
│ │ │ └── status.py
│ │ └── database.py # PostgreSQL connection
│ │
│ ├── stores/ # 🔌 External Service Abstractions
│ │ ├── llm/ # LLM Provider Integration
│ │ │ ├── LLMInterface.py
│ │ │ ├── LLMProviderFactory.py
│ │ │ ├── LLMEnums.py
│ │ │ ├── providers/
│ │ │ │ ├── OpenAIProvider.py
│ │ │ │ ├── CoHereProvider.py
│ │ │ │ └── OllamaProvider.py # GPU-accelerated (Colab)
│ │ │ └── templates/ # Prompt templates
│ │ │ ├── template_parser.py
│ │ │ └── locales/
│ │ │ ├── en/
│ │ │ │ └── rag.py
│ │ │ └── ar/
│ │ │ └── rag.py
│ │ │
│ │ └── vectordb/ # Vector Database Integration
│ │ ├── VectorDBInterface.py
│ │ ├── VectorDBProviderFactory.py
│ │ ├── VectorDBEnums.py
│ │ └── providers/
│ │ ├── QdrantDBProvider.py
│ │ └── PGVectorProvider.py # PostgreSQL + PGVector
│ │
│ ├── helpers/ # ⚙️ Utility Functions
│ │ ├── config.py # Environment config loader
│ │ ├── logger.py # Logging configuration
│ │ └── validators.py # Input validation
│ │
│ └── assets/ # 📦 File Storage
│ └── files/ # Uploaded documents
│ └── {project_id}/
│
├── docker/ # 🐳 Docker Configuration
│ ├── docker-compose.yml # PostgreSQL + Qdrant services
│ ├── init-scripts/ # Database initialization
│ │ └── 01-init-pgvector.sql
│ ├── .env.example
│ └── .env
│
├── notebooks/ # 📓 Google Colab Notebooks
│ ├── ollama_server_setup.ipynb # Colab GPU setup guide
│ └── model_testing.ipynb # Model performance testing
│
├── tests/ # 🧪 Unit & Integration Tests
│ ├── test_controllers.py
│ ├── test_vectordb.py
│ └── test_ollama_provider.py
│
├── scripts/ # 🛠️ Utility Scripts
│ ├── migrate_vectordb.py # Migrate between Qdrant/PGVector
│ ├── benchmark_models.py # Compare model performance
│ └── backup_database.py # PostgreSQL backup utility
│
├── .vscode/ # 💻 Editor Settings
├── main.py # 🚀 Application entry point
├── requirements.txt # 📦 Python dependencies
├── .env.example # ⚙️ Environment template
├── .gitignore
└── README.md # 📖 Documentation
| Layer | Components | Purpose |
|---|---|---|
| API | `routes/` | HTTP request handling, input validation |
| Business Logic | `controllers/` | Orchestration, workflow management |
| Data Access | `models/` | PostgreSQL CRUD operations |
| External Services | `stores/llm/`, `stores/vectordb/` | LLM and vector DB integrations |
| Configuration | `helpers/` | Settings management, logging |
| Storage | `assets/` | File persistence |
| Infrastructure | `docker/` | Database containers, init scripts |
| Cloud GPU | `notebooks/` | Colab setup, model deployment |
┌─────────────────┐
│ 1. UPLOAD │ User uploads PDF/TXT files via API
└────────┬────────┘
│
▼
┌─────────────────┐
│ 2. STORE │ Files saved to local storage + metadata to PostgreSQL
└────────┬────────┘
│
▼
┌─────────────────┐
│ 3. PROCESS │ Documents split into chunks via LangChain
│ │ • CharacterTextSplitter (500 chars, 50 overlap)
└────────┬────────┘ • Chunks stored in PostgreSQL
│
▼
┌─────────────────┐
│ 4. EMBED │ Chunks → Vector embeddings
│ │ • OpenAI: text-embedding-3-small (1536D)
│ │ • Cohere: embed-english-v3.0 (1024D)
└────────┬────────┘ • Batch processing for efficiency
│
▼
┌─────────────────┐
│ 5. INDEX │ Vectors stored in dual databases:
│ │ • Qdrant: Dedicated vector search
│ │ • PGVector: PostgreSQL native extension
└────────┬────────┘ • IVFFlat index for fast retrieval
│
▼
┌─────────────────┐
│ 6. QUERY │ User submits natural language question
└────────┬────────┘
│
▼
┌─────────────────┐
│ 7. RETRIEVE │ Semantic search pipeline:
│ │ • Query → Embedding
│ │ • Vector similarity search (cosine)
│ │ • Top-K most relevant chunks (K=5)
└────────┬────────┘ • Score filtering (threshold=0.7)
│
▼
┌─────────────────┐
│ 8. AUGMENT │ Context construction:
│ │ • Prompt template (locale-aware)
│ │ • System instructions
│ │ • Retrieved chunks as context
└────────┬────────┘ • User question
│
▼
┌─────────────────┐
│ 9. GENERATE │ LLM inference (GPU-accelerated):
│ │ • Ollama on Colab T4 GPU (via ngrok)
│ │ • Gemma 2B/7B or Qwen 2.5
│ │ • Or OpenAI/Cohere API
└────────┬────────┘ • Temperature-controlled generation
│
▼
┌─────────────────┐
│ 10. RESPONSE │ Structured JSON response:
│ │ • Generated answer
│ │ • Source citations
│ │ • Confidence scores
└─────────────────┘ • Performance metadata
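Seen from a client, the whole pipeline above reduces to four API calls. The sketch below is illustrative only (error handling and polling omitted) and uses the endpoints documented in the API Reference:

# End-to-end client sketch (illustrative; endpoints as documented above)
import requests

BASE = "http://localhost:5000/api/v1"
project = "my-project"

# 1-2. Upload and store a document
with open("document.pdf", "rb") as f:
    requests.post(f"{BASE}/data/upload/{project}", files={"files": f})

# 3. Split the uploaded documents into chunks
requests.post(f"{BASE}/data/process/{project}")

# 4-5. Embed and index the chunks
requests.post(f"{BASE}/nlp/index/push/{project}", json={"batch_size": 100})

# 6-10. Ask a question and print the grounded answer
resp = requests.post(
    f"{BASE}/nlp/index/answer/{project}",
    json={"question": "What are the key findings?", "locale": "en"},
)
print(resp.json()["answer"])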
# User uploads files
POST /api/v1/data/upload/medical-research
↓
# System validates and stores files
- File validation (PDF/TXT, size limits)
- Generate unique asset IDs
- Save to: assets/files/medical-research/
- Metadata → PostgreSQL (AssetModel)
↓
# Document processing triggered
POST /api/v1/data/process/medical-research
↓
# LangChain pipeline
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
chunk_size=500,
chunk_overlap=50,
separator="\n"
)
chunks = splitter.split_documents(documents)
↓
# Chunks stored in PostgreSQL
- chunk_id (UUID)
- project_id
- text content
- metadata (source, page, position)
- created_at timestamp
# Generate embeddings
POST /api/v1/nlp/index/push/medical-research
{
"batch_size": 100,
"vector_db": "qdrant"
}
↓
# Batch processing workflow
chunks = ChunkModel.get_by_project("medical-research")
batches = create_batches(chunks, size=100)
for batch in batches:
# Generate embeddings (API or local)
embeddings = llm_provider.create_embeddings([c.text for c in batch])
# Store in vector DB
if vector_db == "qdrant":
qdrant.upsert(
collection_name="medical-research",
points=[
PointStruct(
id=chunk.id,
vector=embedding,
payload=chunk.metadata
)
for chunk, embedding in zip(batch, embeddings)
]
)
elif vector_db == "pgvector":
# PostgreSQL with PGVector extension
INSERT INTO vector_embeddings (chunk_id, embedding, metadata)
VALUES (%s, %s::vector, %s)
↓
# Create indexes for fast retrieval
- Qdrant: HNSW index (M=16, ef_construct=100)
- PGVector: IVFFlat index (lists=100)
# User query received
POST /api/v1/nlp/index/answer/medical-research
{
"question": "What are the side effects of the treatment?",
"locale": "en",
"temperature": 0.7,
"top_k": 5
}
↓
# Step 1: Query embedding
query_embedding = llm_provider.create_embedding(question)
↓
# Step 2: Vector similarity search
if vector_db == "qdrant":
results = qdrant.search(
collection_name="medical-research",
query_vector=query_embedding,
limit=5,
score_threshold=0.7
)
elif vector_db == "pgvector":
SELECT chunk_id, text, metadata,
1 - (embedding <=> %s::vector) as similarity
FROM vector_embeddings
WHERE project_id = 'medical-research'
ORDER BY embedding <=> %s::vector
LIMIT 5
↓
# Step 3: Context preparation
retrieved_chunks = [
f"[Source {i+1}] {result.text}"
for i, result in enumerate(results)
]
context = "\n\n".join(retrieved_chunks)
↓
# Step 4: Prompt construction (locale-aware)
from src.stores.llm.templates import get_template
template = get_template("rag", locale="en")
prompt = template.format(
context=context,
question=question
)
↓
# Step 5: LLM generation
if llm_provider == "ollama":
# GPU-accelerated on Colab via ngrok
response = requests.post(
f"{OLLAMA_BASE_URL}/api/generate",
json={
"model": "gemma:2b",
"prompt": prompt,
"temperature": 0.7,
"max_tokens": 500
}
)
answer = response.json()["response"]
elif llm_provider == "openai":
response = openai.ChatCompletion.create(
model="gpt-4-turbo-preview",
messages=[
{"role": "system", "content": template.system},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
],
temperature=0.7
)
answer = response.choices[0].message.content
↓
# Step 6: Response formatting
return {
"answer": answer,
"sources": [
{
"chunk_id": result.id,
"text": result.text,
"relevance_score": result.score,
"metadata": result.metadata
}
for result in results
],
"metadata": {
"model": model_name,
"provider": provider,
"tokens_used": token_count,
"processing_time": elapsed_time,
"gpu_accelerated": True if ollama else False
}
}
| Component | Latency | Throughput | Scalability |
|---|---|---|---|
| File Upload | 100-500ms | 10 files/sec | Horizontal (API) |
| Chunking | 1-5s/doc | 20 docs/min | CPU-bound |
| Embedding (API) | 200-800ms | 1000 req/min | API rate limits |
| Embedding (GPU) | 50-200ms | 5000 req/min | GPU memory |
| Vector Search | 10-100ms | 1000 req/sec | Index quality |
| LLM (Ollama) | 500-2000ms | 30 req/min | GPU compute |
| LLM (API) | 1000-3000ms | Rate limited | Token limits |
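A quick way to sanity-check the LLM rows on your own setup is to time the answer endpoint directly (a rough sketch; endpoint and payload follow the API Reference above):

# Rough end-to-end latency check for the answer endpoint
import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:5000/api/v1/nlp/index/answer/my-project",
    json={"question": "What are the key findings?", "locale": "en"},
    timeout=120,
)
elapsed = time.time() - start
print(f"End-to-end: {elapsed:.2f}s via {resp.json()['metadata']['provider']}")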
# Clone repository
git clone https://github.com/yourusername/mini-rag.git
cd mini-rag
# Create development environment
conda create -n mini-rag-dev python=3.8 -y
conda activate mini-rag-dev
# Install dependencies with development tools
pip install -r requirements.txt
pip install -r requirements-dev.txt # pytest, black, flake8, mypy
# Setup pre-commit hooks
pip install pre-commit
pre-commit install
- Formatting: Use `black` with line length 100
- Linting: Follow `flake8` rules
- Type Hints: Add type annotations for all public functions
- Docstrings: Use Google-style docstrings
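Assuming the tools come from `requirements-dev.txt`, the standards above map to commands like:

# Format, lint, and type-check per the standards above
black --line-length 100 src/
flake8 src/
mypy src/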
from typing import Any, Dict, List
import logging
import os
logger = logging.getLogger(__name__)
def process_document(
file_path: str,
chunk_size: int = 500,
chunk_overlap: int = 50
) -> List[Dict[str, Any]]:
"""
Process a document into chunks with embeddings.
Args:
file_path: Absolute path to the document file
chunk_size: Maximum characters per chunk (default: 500)
chunk_overlap: Character overlap between chunks (default: 50)
Returns:
List of dictionaries containing chunk text and metadata
Raises:
FileNotFoundError: If file doesn't exist
ValueError: If chunk_size < chunk_overlap
Example:
>>> chunks = process_document("paper.pdf", chunk_size=1000)
>>> len(chunks)
45
"""
if not os.path.exists(file_path):
raise FileNotFoundError(f"File not found: {file_path}")
if chunk_size < chunk_overlap:
raise ValueError("chunk_size must be >= chunk_overlap")
logger.info(f"Processing document: {file_path}")
    # Implementation...
# Run all tests
pytest
# Run with coverage report
pytest --cov=src --cov-report=html tests/
# Run specific test file
pytest tests/test_nlp_controller.py
# Run with verbose output
pytest -v tests/
# Run only integration tests
pytest -m integration tests/
# Run and generate XML report for CI/CD
pytest --junitxml=test-results.xml
# tests/test_ollama_provider.py
import pytest
import requests
from src.helpers.config import settings
def test_ollama_connection():
"""Test connection to Colab Ollama server via ngrok."""
url = f"{settings.OLLAMA_BASE_URL}/api/version"
response = requests.get(url, timeout=10)
assert response.status_code == 200
assert "version" in response.json()
def test_ollama_embedding():
"""Test embedding generation via Ollama."""
from src.stores.llm.providers.OllamaProvider import OllamaProvider
provider = OllamaProvider()
embedding = provider.create_embedding("Test text")
assert isinstance(embedding, list)
assert len(embedding) > 0
assert all(isinstance(x, float) for x in embedding)
def test_ollama_chat():
"""Test chat completion via Ollama."""
from src.stores.llm.providers.OllamaProvider import OllamaProvider
provider = OllamaProvider()
response = provider.generate_chat_completion([
{"role": "user", "content": "Hello, how are you?"}
])
assert isinstance(response, str)
    assert len(response) > 0
- Create provider class:
# src/stores/llm/providers/CustomProvider.py
from typing import List, Dict
from ..LLMInterface import LLMInterface
class CustomProvider(LLMInterface):
"""Custom LLM provider implementation."""
def __init__(self, api_key: str, base_url: str):
self.api_key = api_key
self.base_url = base_url
def create_embedding(self, text: str) -> List[float]:
"""Generate embedding vector for text."""
# Implementation
pass
def create_embeddings_batch(self, texts: List[str]) -> List[List[float]]:
"""Generate embeddings for multiple texts."""
# Implementation
pass
def generate_chat_completion(
self,
messages: List[Dict[str, str]],
temperature: float = 0.7,
max_tokens: int = 500
) -> str:
"""Generate chat completion."""
# Implementation
        pass
- Register in factory:
# src/stores/llm/LLMProviderFactory.py
from .providers.CustomProvider import CustomProvider
class LLMProviderFactory:
@staticmethod
def create_provider(provider_name: str):
if provider_name == "custom":
return CustomProvider(
api_key=settings.CUSTOM_API_KEY,
base_url=settings.CUSTOM_BASE_URL
)
        # ... existing providers
- Update configuration:
# .env
LLM_PROVIDER=custom
CUSTOM_API_KEY=your-api-key
CUSTOM_BASE_URL=https://api.custom-llm.com
- Create provider class:
# src/stores/vectordb/providers/WeaviateProvider.py
from typing import List, Dict
from ..VectorDBInterface import VectorDBInterface
class WeaviateProvider(VectorDBInterface):
"""Weaviate vector database provider."""
def __init__(self, url: str, api_key: str):
import weaviate
self.client = weaviate.Client(
url=url,
auth_client_secret=weaviate.AuthApiKey(api_key)
)
def create_collection(self, collection_name: str, vector_size: int):
"""Create new collection."""
pass
def upsert_vectors(
self,
collection_name: str,
vectors: List[List[float]],
ids: List[str],
metadata: List[Dict]
):
"""Insert or update vectors."""
pass
def search(
self,
collection_name: str,
query_vector: List[float],
top_k: int = 5
) -> List[Dict]:
"""Semantic similarity search."""
        pass
- Update Docker setup (if needed):
# docker/docker-compose.yml
weaviate:
image: semitechnologies/weaviate:latest
ports:
- "8080:8080"
environment:
AUTHENTICATION_APIKEY_ENABLED: 'true'
AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'your-api-key'
volumes:
  - weaviate_data:/var/lib/weaviate
# scripts/benchmark_models.py
import time
from src.stores.llm.LLMProviderFactory import LLMProviderFactory
def benchmark_embedding_speed():
"""Compare embedding generation speed across providers."""
test_texts = ["Sample text"] * 100
providers = ["openai", "cohere", "ollama"]
results = {}
for provider_name in providers:
provider = LLMProviderFactory.create_provider(provider_name)
start = time.time()
embeddings = provider.create_embeddings_batch(test_texts)
elapsed = time.time() - start
results[provider_name] = {
"total_time": elapsed,
"avg_time": elapsed / len(test_texts),
"throughput": len(test_texts) / elapsed
}
return results
def benchmark_generation_quality():
"""Compare answer quality across models."""
# Implementation
pass
if __name__ == "__main__":
results = benchmark_embedding_speed()
    print(results)
Error: psycopg.OperationalError: connection to server failed
Solutions:
# Check if PostgreSQL container is running
docker compose ps postgres
# View PostgreSQL logs
docker compose logs postgres
# Verify connection settings
docker compose exec postgres psql -U admin -d mini_rag_db -c "\conninfo"
# Test connection from host
psql -h localhost -p 5432 -U admin -d mini_rag_db
# Restart PostgreSQL
docker compose restart postgres
Error: relation "vector_embeddings" does not exist
Solutions:
# Verify PGVector extension is installed
docker compose exec postgres psql -U admin -d mini_rag_db -c "\dx"
# Manually create extension
docker compose exec postgres psql -U admin -d mini_rag_db -c "CREATE EXTENSION IF NOT EXISTS vector;"
# Run initialization script
docker compose exec postgres psql -U admin -d mini_rag_db -f /docker-entrypoint-initdb.d/01-init-pgvector.sql
# Check table exists
docker compose exec postgres psql -U admin -d mini_rag_db -c "\dt"
Error: vector dimension mismatch
Solutions:
-- Check current vector size
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'vector_embeddings';
-- Drop and recreate table with correct size
DROP TABLE IF EXISTS vector_embeddings;
CREATE TABLE vector_embeddings (
id SERIAL PRIMARY KEY,
chunk_id VARCHAR(255) UNIQUE NOT NULL,
project_id VARCHAR(255) NOT NULL,
embedding vector(1536), -- Match your embedding model
metadata JSONB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Error: QdrantException: Connection refused
Solutions:
# Check Qdrant status
docker compose ps qdrant
docker compose logs qdrant
# Verify Qdrant is accessible
curl http://localhost:6333/collections
# Restart Qdrant
docker compose restart qdrant
# Check Qdrant dashboard
open http://localhost:6333/dashboard
Error: requests.exceptions.ConnectionError: Failed to establish connection
Solutions:
Check ngrok tunnel:
# In your Colab notebook
from pyngrok import ngrok
# List active tunnels
tunnels = ngrok.get_tunnels()
print(tunnels)
# Restart tunnel if needed
ngrok.kill()
public_url = ngrok.connect(11434, "http")
print(f"New URL: {public_url}")Verify Ollama server:
# In Colab cell
!curl http://localhost:11434/api/version
Update `.env` file:
# Update with new ngrok URL
OLLAMA_BASE_URL=https://new-url-from-ngrok.ngrok-free.app
Test connection from local machine:
import requests
url = "https://your-ngrok-url.ngrok-free.app/api/version"
response = requests.get(url)
print(response.json())
Error: Colab session disconnected
Solutions:
- Run the auto-click JavaScript code (see Installation section)
- Use Colab Pro for longer sessions (24 hours)
- Set up an automatic session restarter (see the watchdog sketch below)
- Consider self-hosting Ollama on a dedicated server
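A simple local watchdog (a sketch, run on your own machine; replace the URL with the ngrok endpoint from your `.env`) can tell you as soon as the tunnel drops so you can rerun the Colab cells:

# Local watchdog: poll the Colab/ngrok Ollama endpoint and warn when it drops
import time
import requests

OLLAMA_BASE_URL = "https://your-ngrok-url.ngrok-free.app"  # same value as in .env

while True:
    try:
        requests.get(f"{OLLAMA_BASE_URL}/api/version", timeout=10).raise_for_status()
        print("Ollama tunnel OK")
    except requests.RequestException:
        print("Tunnel unreachable; rerun the Colab Ollama/ngrok cells")
    time.sleep(300)  # check every 5 minutes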
Error: CUDA out of memory
Solutions:
Switch to smaller model:
# In Colab
!ollama pull gemma:2b  # Instead of gemma:7b
Clear GPU cache:
# In Colab
import torch
torch.cuda.empty_cache()
Reduce batch size:
# In .env
BATCH_SIZE=50  # Instead of 100
Error: Vector dimension mismatch: expected 1536, got 1024
Solution: Ensure consistency between embedding model and vector DB configuration
# For OpenAI text-embedding-3-small (1536 dimensions)
VECTOR_SIZE=1536
OPENAI_EMBEDDING_MODEL=text-embedding-3-small
# For OpenAI text-embedding-3-large (3072 dimensions)
VECTOR_SIZE=3072
OPENAI_EMBEDDING_MODEL=text-embedding-3-large
# For Cohere embed-english-v3.0 (1024 dimensions)
VECTOR_SIZE=1024
COHERE_EMBEDDING_MODEL=embed-english-v3.0
Update PGVector table:
ALTER TABLE vector_embeddings
ALTER COLUMN embedding TYPE vector(1536); -- Match your size
Recreate Qdrant collection:
from qdrant_client import QdrantClient
client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
collection_name="my-project",
vectors_config={"size": 1536, "distance": "Cosine"}
)
Error: RateLimitError: Rate limit exceeded
Solutions:
For OpenAI:
# Implement exponential backoff
import time
from openai import RateLimitError
def create_embeddings_with_retry(texts, max_retries=3):
for attempt in range(max_retries):
try:
return openai.Embedding.create(input=texts)
except RateLimitError:
wait_time = 2 ** attempt
time.sleep(wait_time)
    raise Exception("Max retries exceeded")
For batch processing:
# Reduce batch size
BATCH_SIZE=20 # Instead of 100
# Add delay between batches
BATCH_DELAY_SECONDS=2
Diagnosis:
import time
start = time.time()
embeddings = provider.create_embeddings_batch(chunks)
print(f"Time: {time.time() - start:.2f}s for {len(chunks)} chunks")Solutions:
- Use batch processing:
# Instead of sequential
for chunk in chunks:
embedding = provider.create_embedding(chunk.text)
# Use batching
batch_texts = [chunk.text for chunk in chunks]
embeddings = provider.create_embeddings_batch(batch_texts)
- Switch to local Ollama (if using API):
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=https://your-colab-ngrok-url
- Parallel processing:
from concurrent.futures import ThreadPoolExecutor
def process_batch(batch):
return provider.create_embeddings_batch(batch)
with ThreadPoolExecutor(max_workers=4) as executor:
    results = executor.map(process_batch, batches)
Diagnosis:
start = time.time()
results = vectordb.search(query_vector, top_k=5)
print(f"Search time: {time.time() - start:.2f}s")Solutions:
- Optimize PGVector index:
-- Use IVFFlat for better speed/accuracy tradeoff
CREATE INDEX vector_embeddings_embedding_idx
ON vector_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100); -- Increase for larger datasets
-- Vacuum and analyze
VACUUM ANALYZE vector_embeddings;
- Optimize Qdrant index:
client.recreate_collection(
collection_name="project",
vectors_config={
"size": 1536,
"distance": "Cosine"
},
hnsw_config={
"m": 16, # Increase for better recall
"ef_construct": 100, # Increase for better quality
}
)
- Reduce top_k:
TOP_K_RESULTS=3  # Instead of 10
- Add score threshold:
results = vectordb.search(
query_vector,
top_k=10,
score_threshold=0.7 # Only return highly relevant results
)
Solutions:
- Reduce chunk size:
CHUNKING_SIZE=300  # Instead of 500
- Process files individually:
for file in files:
process_single_file(file)
    # Clear cache between files
- Increase Docker memory limits:
# docker-compose.yml
services:
postgres:
deploy:
resources:
limits:
          memory: 2G
| Provider | Model | Time (s) | Throughput (chunks/s) | Cost |
|---|---|---|---|---|
| OpenAI API | text-embedding-3-small | 12.5 | 80 | $0.0001/k |
| Cohere API | embed-english-v3.0 | 15.3 | 65 | $0.0001/k |
| Ollama (Colab) | nomic-embed-text | 8.2 | 122 | Free |
| Database | Index Type | Search Time (ms) | Memory (GB) | Accuracy |
|---|---|---|---|---|
| Qdrant | HNSW | 15 | 2.1 | 0.98 |
| PGVector | IVFFlat | 45 | 1.8 | 0.95 |
| Component | Latency (ms) | Notes |
|---|---|---|
| Query embedding | 120 | OpenAI API |
| Vector search | 25 | Qdrant HNSW |
| LLM generation | 1500 | Ollama Gemma 2B on T4 |
| Total | 1645 | ~1.6s end-to-end |
Benchmarks conducted on: Intel i7-10700K, 32GB RAM, Tesla T4 GPU (Colab)
- Never commit `.env` files to version control
- Use environment variables for all secrets
- Rotate API keys regularly
- Use different keys for dev/staging/production
-- Use strong passwords
POSTGRES_PASSWORD=$(openssl rand -base64 32)
-- Restrict network access
# docker-compose.yml
services:
postgres:
ports:
- "127.0.0.1:5432:5432" # Only localhost
-- Enable SSL for production
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'# main.py
from fastapi import FastAPI, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi_limiter import FastAPILimiter
import redis.asyncio as redis
app = FastAPI()
# CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["https://yourdomain.com"], # Specific origins
allow_methods=["GET", "POST"],
allow_headers=["*"],
)
# Rate limiting
@app.on_event("startup")
async def startup():
    redis_client = redis.from_url("redis://localhost")
    await FastAPILimiter.init(redis_client)
Made with ❤️ by Boudy Ibrahim