A simple Docker-based environment for exploring data analytics and AI tools. Includes basic data processing, storage, and LLM capabilities - all containerized for easy experimentation.
```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Jupyter   │   │   Phoenix   │   │   Ollama    │   │    Trino    │
│  Notebook   │   │ AI Observ.  │   │ LLM Server  │   │   Engine    │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
       │                 │                 │                 │
       └─────────────────┼─────────────────┼─────────────────┘
                         │                 │
                  ┌─────────────┐   ┌─────────────┐
                  │    Spark    │   │    Hive     │
                  │   Cluster   │   │  Metastore  │
                  └─────────────┘   └─────────────┘
                         │                 │
                         └────────┬────────┘
                                  │
                  ┌─────────────┐   ┌─────────────┐
                  │    MinIO    │   │   Qdrant    │
                  │  (S3 API)   │   │  Vector DB  │
                  └─────────────┘   └─────────────┘
```
- MinIO: S3-compatible storage for files
- Apache Spark: Basic data processing capabilities
- Hive Metastore: Simple metadata management
- Trino: SQL query interface
- Ollama: Local LLM server (gemma3:4b model)
- Phoenix: Basic AI monitoring
- Qdrant: Vector database for AI experiments
- Jupyter: Notebook environment with common libraries
- PostgreSQL: Database backend
- NVIDIA Docker: GPU support for AI tools
- Docker with Docker Compose
- NVIDIA GPU (Required):
  - NVIDIA GPU with 4GB+ VRAM for gemma3:4b (the current default)
  - Lower-VRAM option available: gemma3:1b
  - NVIDIA Container Runtime pre-configured for Docker GPU access
  - The platform is optimized for GPU acceleration and requires NVIDIA hardware
- Minimum System Requirements:
  - 8GB+ RAM recommended (12GB+ for optimal performance)
  - 50GB+ free disk space for models and data
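Before starting the platform, it is worth confirming that Docker can actually see the GPU. A minimal smoke test (the `ubuntu` image is just an illustrative base; the NVIDIA Container Runtime injects `nvidia-smi` into the container):

```bash
# Driver and GPU visible on the host?
nvidia-smi

# GPU visible from inside a container via the NVIDIA runtime?
docker run --rm --gpus all ubuntu nvidia-smi
```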
```bash
# This script will:
# 1. Build all custom Docker images (if not already built)
# 2. Start all services
# 3. Setup MinIO storage
# 4. Pull the gemma3:4b LLM model
./start-platform.sh
```

Or run the same steps manually:

```bash
# 1. Build custom Docker images
docker build -t datalab-playground/jupyter ./jupyter
docker build -t datalab-playground/spark ./spark
docker build -t datalab-playground/trino ./trino
docker build -t datalab-playground/hive-metastore ./hive-metastore
# 2. Start all services
docker-compose up -d
# 3. Pull LLM model
docker exec ollama ollama pull gemma3:4b
```

Simple steps to explore the tools:
- Start: Run `./start-platform.sh`
- Open Jupyter: Go to http://localhost:8888 (password: 123456)
- Try the demos: Open `data_lab_playground.ipynb` for basic examples
- Experiment with RAG: Try `rag_vector_demo.ipynb` for vector database examples
- Explore UIs: Check out the Qdrant dashboard, Phoenix monitoring, etc.
| Service | URL | Credentials | Description |
|---|---|---|---|
| Jupyter Notebook | http://localhost:8888 | password: 123456 | Interactive AI/ML environment |
| Phoenix AI Observability | http://localhost:6006 | None | AI model monitoring & traces |
| Ollama LLM API | http://localhost:11434 | None | Local LLM inference endpoint |
| Qdrant Vector Database | http://localhost:6333 | None | Vector storage & similarity search |
| Qdrant Web Dashboard | http://localhost:6333/dashboard | None | Vector database management UI |
| Trino Web UI | http://localhost:8080 | None | SQL query interface |
| Spark Master UI | http://localhost:8081 | None | Spark cluster monitoring |
| MinIO Console | http://localhost:9001 | minioadmin/minioadmin123 | S3 storage management |
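Beyond the web UI, Trino can be queried programmatically from a notebook with the preinstalled `trino` client. A minimal sketch; the `user`, `catalog`, and `schema` values below are illustrative and should match whatever is configured in the `./trino` image:

```python
import trino

# Connect to the Trino coordinator inside the Docker network
conn = trino.dbapi.connect(
    host="trino",
    port=8080,
    user="jupyter",   # Trino requires a user name; any value works without auth
    catalog="hive",   # illustrative; adjust to your configured catalogs
    schema="default",
)

cur = conn.cursor()
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```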
- Default Model: gemma3:4b (~4GB VRAM)
- Any Ollama Model: models of different sizes are available from the Ollama Library for various GPU capabilities
- Requirements: NVIDIA GPU with Docker runtime
- API: http://localhost:11434
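As a quick smoke test, you can call the generate endpoint directly from the host (assuming gemma3:4b has already been pulled):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Say hello in one short sentence.",
  "stream": false
}'
```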
- Purpose: Store embeddings for RAG experiments
- Web UI: http://localhost:6333/dashboard
- API: http://localhost:6333
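Qdrant collections must be created before vectors are upserted into them. A minimal sketch with `qdrant-client`; the 768-dimension size is an assumption matching nomic-embed-text embeddings as used in the RAG example below:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant = QdrantClient(host="qdrant", port=6333)

# Create a collection sized for nomic-embed-text embeddings (768 dims, assumed)
qdrant.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```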
- Purpose: Basic AI operation tracing
- Web UI: http://localhost:6006
- GenAI Environment: pre-installed packages
  - Core Libraries: pandas, numpy, matplotlib, seaborn, plotly, scikit-learn
  - AI/ML Stack: LangChain ecosystem, transformers, torch, sentence-transformers
  - Vector & Database: qdrant-client, trino, sqlalchemy, boto3, s3fs
  - Document Processing: pypdf2, python-docx, beautifulsoup4, tiktoken
  - Observability: arize-phoenix, opentelemetry, openinference instrumentation
  - Development Tools: ipywidgets, tqdm, rich, typer
- Default Kernel: GenAI Analytics (Python 3.12)
- Access: http://localhost:8888 (password: 123456)
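Since boto3 is preinstalled, notebooks can talk to MinIO directly through its S3 API. A minimal connectivity check using the stack's MinIO credentials:

```python
import boto3

# Point boto3 at the in-network MinIO endpoint instead of AWS
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin123",
)

# List buckets to verify connectivity
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```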
```python
import requests

# Chat with the local LLM via the Ollama generate endpoint
def chat_with_ollama(prompt, model="gemma3:4b"):
    response = requests.post(
        'http://ollama:11434/api/generate',
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
    )
    return response.json()['response']

# Example usage
result = chat_with_ollama("Explain data analytics in simple terms")
print(result)
```

Combine Ollama embeddings with Qdrant for a simple RAG loop:

```python
from qdrant_client import QdrantClient
import ollama

# Connect to the vector database
qdrant = QdrantClient(host="qdrant", port=6333)

# Generate embeddings using Ollama
# (requires the nomic-embed-text model: docker exec ollama ollama pull nomic-embed-text)
def get_embedding(text):
    response = ollama.embeddings(
        model="nomic-embed-text",
        prompt=text
    )
    return response["embedding"]

# Store document chunks in the vector database
# (the collection must exist first; see the Qdrant section above)
def store_document(text_chunks, collection_name="knowledge_base"):
    for i, chunk in enumerate(text_chunks):
        embedding = get_embedding(chunk)
        qdrant.upsert(
            collection_name=collection_name,
            points=[{
                "id": i,
                "vector": embedding,
                "payload": {"text": chunk}
            }]
        )

# Retrieve relevant context for a question
def rag_query(question, collection_name="knowledge_base"):
    question_embedding = get_embedding(question)
    results = qdrant.search(
        collection_name=collection_name,
        query_vector=question_embedding,
        limit=3
    )
    context = "\n".join([hit.payload["text"] for hit in results])

    # Ground the LLM answer in the retrieved context
    prompt = f"Context: {context}\n\nQuestion: {question}\nAnswer:"
    return chat_with_ollama(prompt)
```
To trace AI calls with Phoenix:

```python
import phoenix as px
from openinference.instrumentation.langchain import LangChainInstrumentor
from phoenix.otel import register

# Configure Phoenix tracing (in the GenAI DEV kernel);
# auto_instrument=True activates installed instrumentors such as LangChain's
tracer_provider = register(
    project_name="data-analytics",
    endpoint="http://phoenix:4317",
    auto_instrument=True,
)
```

Process data with Spark, reading from and writing to MinIO:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("DataLab-Playground") \
    .master("spark://spark-master:7077") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()
# Note: path-style addressing is typically required for MinIO's S3 API

# Process data and prepare it for AI workloads
df = spark.read.parquet("s3a://warehouse/data/")
df.write.mode("overwrite").parquet("s3a://warehouse/processed/ai_training_data")
```

- MinIO: minioadmin/minioadmin123
- Spark: spark://spark-master:7077
- Ollama Models: stored in a persistent volume
```bash
# Check the status of all services
docker-compose ps

# View specific service logs
docker logs phoenix
docker logs ollama

# Restart services
docker-compose restart ollama phoenix
```

Manage Ollama models:

```bash
# List available Ollama models
docker exec ollama ollama list

# Pull additional LLM models
docker exec ollama ollama pull llama3.2

# Pull additional embedding models (all-minilm is the Ollama tag for all-MiniLM-L6-v2)
docker exec ollama ollama pull all-minilm
```

Services start automatically in the right order:
- Storage & databases (PostgreSQL, MinIO)
- Data processing (Spark, Trino, Hive)
- AI services (Ollama, Phoenix)
- Jupyter notebooks
```bash
# Check if the GPU is detected inside the Ollama container
docker exec ollama nvidia-smi

# Verify Docker GPU support
docker info | grep nvidia
```

```bash
# Check if services are running
docker-compose ps

# View service logs
docker logs ollama
docker logs phoenix
```

Found a bug or have an idea? Feel free to:
- Open GitHub issues for problems or suggestions
- Submit pull requests for improvements
- Add example notebooks or documentation
Built with AI Assistance
This project was developed with GitHub Copilot (powered by Claude Sonnet 4), demonstrating the power of human-AI partnership in creating comprehensive data platforms.