A hands-on, project-based guide to Machine Learning Operations built specifically for DevOps, Platform, and SRE engineers.
No ML background required. Every concept is explained through DevOps analogies you already understand.
If you are completely new to MLOps, read our DevOps to MLOps guide first.
- Who This Is For
- What We Build
- Prerequisites
- Phase 1: Local Dev & Pipelines
- Phase 2: Enterprise Orchestration for ML
- Learning Path
- Tech Stack
- Recommended Reading
- License
Most MLOps resources are written for data scientists learning infrastructure. This repo flips that.
You do not need to become a data scientist. But just like understanding how a Java application is built makes you a better DevOps engineer, understanding how an ML model is built, trained, and served makes you effective at operating ML workloads in production.
| Track | What You Learn |
|---|---|
| 🤖 Traditional ML | Train, serve, automate, and monitor a real ML model on Kubernetes |
| 🧠 Foundational Models | Serve LLMs in production using vLLM, TGI, and Ollama |
| ⚙️ LLM-Powered DevOps | Monitor K8s clusters, build RAG pipelines and agents with LLMs |
Everything runs on Kubernetes, Docker, and tools you already use.
| Skill | Level |
|---|---|
| Linux CLI | Intermediate |
| Docker | Intermediate |
| Kubernetes | Intermediate |
| AWS | Basic to Intermediate |
| Python | Basic (read and run scripts) |
| Git | Intermediate |
No ML experience needed. That is what this repo teaches.
Goal: Build the full ML foundation on your local machine — from raw data to a trained, tested model.
Use case throughout: Employee attrition prediction for a large organisation (~500,000 employees). One problem, end to end, keeping the focus on infrastructure and operations rather than data science theory.
| Step | Title | Guide |
|---|---|---|
| 1 | Project Dataset Pipeline | Read the Guide |
| 2 | Data Preparation Stages | Read the Guide |
| 3 | Training & Building the Prediction Model | Read the Guide |
| 4 | From Model to Live API with KServe | Read the Guide |
Code: phase-1-local-dev/
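To give a flavour of what the Phase 1 training step looks like, here is a minimal, self-contained scikit-learn sketch. The feature names and synthetic data are illustrative assumptions, not the repo's actual attrition dataset; see the guides above for the real pipeline.

```python
# Minimal sketch of a Phase 1-style training step with scikit-learn.
# The features and synthetic labels below are illustrative assumptions,
# not the repo's real attrition dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical features: tenure (years), salary band, weekly overtime hours.
X = np.column_stack([
    rng.uniform(0, 20, n),   # tenure_years
    rng.integers(1, 6, n),   # salary_band
    rng.uniform(0, 15, n),   # overtime_hours
])
# Toy label: attrition more likely with low tenure and high overtime.
y = ((X[:, 0] < 3) & (X[:, 2] > 8)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.3f}")
```

From an operations standpoint, the interesting part is everything around this script: versioning the data it consumes, tracking the run, packaging the trained model, and serving it behind an API, which is exactly what the four steps above cover.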
Goal: Replace local, manual ML workflows with production-grade orchestration. Versioned data, automated pipelines, experiment tracking, and scalable training.
| Step | Title | Guide |
|---|---|---|
| 1 | Data Versioning Fundamentals | Read the Guide |
| 2 | Data Version Control (DVC) with AWS S3 | Read the Guide |
| 3 | Data Versioning using Airflow on Kubernetes | Read the Guide |
| 4 | A Detailed Look into Feature Stores | 🔜 Coming Next |
Code: phase-2-enterprise-setup/
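The core idea behind DVC is Git-style content addressing: each tracked data file is identified by a hash of its contents, a small pointer file goes into Git, and the data itself lives in remote storage such as S3. Here is a stdlib-only sketch of that idea; the helper names and pointer format are illustrative, not DVC's actual API or `.dvc` file schema.

```python
# Stdlib-only sketch of the content-addressing idea behind DVC.
# DVC records an MD5 of each tracked file in a small pointer file that
# Git versions, while the data itself is pushed to a remote (e.g. S3).
# Function names and the JSON pointer format here are illustrative.
import hashlib
import json
from pathlib import Path

def hash_file(path: Path) -> str:
    """Return the MD5 digest of a file's contents, read in chunks."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

def write_pointer(data_path: Path) -> Path:
    """Write a tiny JSON pointer file next to the data, like a .dvc file."""
    pointer = data_path.with_suffix(data_path.suffix + ".dvc.json")
    pointer.write_text(json.dumps({
        "path": data_path.name,
        "md5": hash_file(data_path),
        "size": data_path.stat().st_size,
    }, indent=2))
    return pointer

# Demo: "version" a toy dataset, change it, and watch the hash change.
data = Path("employees.csv")
data.write_text("id,tenure,attrition\n1,2.5,1\n")
v1 = json.loads(write_pointer(data).read_text())["md5"]

data.write_text("id,tenure,attrition\n1,2.5,1\n2,8.0,0\n")
v2 = json.loads(write_pointer(data).read_text())["md5"]

print("v1:", v1)
print("v2:", v2)
print("data changed:", v1 != v2)
```

With real DVC, `dvc add` produces the pointer file and `dvc push` uploads the content-addressed copy to the configured S3 remote, so Git only ever versions the small pointer, never the data itself.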
| Phase | Track | Title | Status |
|---|---|---|---|
| 1 | 🤖 Traditional ML | Local Dev & Pipelines | ✅ Done |
| 1 | 🤖 Traditional ML | K8s Deploy & Model Serving | ✅ Done |
| 2 | 🤖 Traditional ML | Enterprise Orchestration | 🔄 In Progress |
| 3 | 🤖 Traditional ML | Monitor & Observe | 🔜 Planned |
| 4 | 🧠 Foundational Models | Foundational Models | 🔜 Planned |
| 5 | 🧠 Foundational Models | LLM Serving & Scaling | 🔜 Planned |
| 6 | ⚙️ LLM-Powered DevOps | LLM-Powered DevOps | 🔜 Planned |
| 7 | ⚙️ LLM-Powered DevOps | Emerging AI Ops | 🔜 Planned |
Here is the tech stack you will be using in this setup.
| Category | Tools |
|---|---|
| Data Pipeline | Python, Airflow |
| Model Training | scikit-learn |
| API / Serving | FastAPI, Flask, Docker, KServe |
| ML Orchestration | Kubeflow Pipelines, MLflow |
| Monitoring | Prometheus, Grafana, Evidently AI |
| Infrastructure | Kubernetes, Helm, GitHub Actions |
- Ray: open-source distributed computing framework for Python and AI workloads.
- rtk: high-performance CLI proxy that reduces LLM token consumption.
- CML: CI/CD for machine learning projects.
Dual licensed:
- Code (scripts, configs, manifests) — Apache 2.0
- Content (README, guides, docs) — All Rights Reserved
For commercial licensing: contact@devopscube.com