Comprehensive, concept-to-code walkthrough of the KNN algorithm for both classification and regression: theory, intuition, math, helper utilities, notebook experimentation, and a roadmap for extending to a full reusable implementation.
- Motivation & Goals
- What Is KNN? (High-Level Intuition)
- Key Concepts (Supervised / Instance-Based / Non-Parametric)
- When to Use KNN / When Not To
- Mathematical Foundations
- Distance Metrics (Focus: L2 / Euclidean)
- Choosing K (Bias-Variance & Practical Heuristics)
- Classification vs Regression Behavior
- Handling Ties, Weights & Edge Cases
- Data Preparation (Scaling, Cleaning, Dimensionality)
- Complexity & Performance Considerations
- Repository Structure
- Installation & Environment Setup
- Quick Start Examples (From-Scratch Logic & Helper Utilities)
- Typical Workflow (End-to-End)
- Evaluation Metrics (Classification & Regression)
- Visualizations & Exploratory Analysis
- Extending the Project (Roadmap)
- Common Pitfalls & Debugging Checklist
- Reference Reading & Further Study
- License
This tutorial aims to give you BOTH the theory (why KNN works, its trade-offs) and the practice (how to implement core pieces, experiment in a notebook, evaluate, and optimize). It is intentionally incremental so you can:
- Start with conceptual foundations.
- Use helper utilities (distance calculation, majority/mean aggregation).
- Prototype classification & regression in a notebook.
- Later add a reusable KNN class (see roadmap).
KNN predicts the label (class or numeric value) of a query point by looking at the K most similar (closest) points in the training set. No parameters are "fit" in advance; prediction is deferred until you query the model. It is sometimes called a memory-based, lazy, or instance-based method.
Linked docs in docs/ expand each topic:
- Supervised Learning (supervised-learning.md): labeled examples drive learning.
- Classification / Regression (classification-problem.md, regression-problem.md).
- Instance-Based (instance-base.md): store all examples; compare at query time.
- Non-Parametric (non-parametric.md): flexibility grows with data volume.
Use KNN when:
- Decision boundary is irregular / hard to parametrize.
- Dataset is small-to-medium (memory fits; latency acceptable).
- You want a strong baseline quickly.
Avoid or be cautious when:
- Very large datasets → prediction cost of O(N * d) per query.
- High-dimensional feature spaces (curse of dimensionality).
- Features on different scales (must scale/normalize first).
- Many irrelevant/noisy features (distance metric degrades).
Given a training set T = {(xᵢ, yᵢ)} for i = 1..N and a query point x*:
- Compute the distance d(x*, xᵢ) for all i.
- Select the K indices with the smallest distances: N_K.
- Classification: predict mode({ yᵢ | i ∈ N_K }).
- Regression: predict the average (or distance-weighted average) of { yᵢ | i ∈ N_K }.
A runnable Python version of this procedure:
from collections import Counter

def predict(x_query, X_train, y_train, K, distance_fn, task='classification', weight=False, eps=1e-9):
    # Pair each training point's distance to the query with its target value
    distances = sorted(
        ((distance_fn(x_query, x), y) for x, y in zip(X_train, y_train)),
        key=lambda pair: pair[0],
    )
    neighbors = distances[:K]  # the K closest (distance, target) pairs
    if task == 'classification':
        # Majority vote over the neighbor labels
        return Counter(label for _, label in neighbors).most_common(1)[0][0]
    if weight:
        # Inverse-distance weighted mean; eps guards against division by zero
        weights = [1.0 / (d + eps) for d, _ in neighbors]
        return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)
    return sum(y for _, y in neighbors) / K  # simple mean
The Euclidean (L2) distance is most common (see L2-distance.md). Other options:
- Manhattan (L1)
- Minkowski (generalized p-norm)
- Cosine distance (if direction matters more than magnitude)
- Hamming distance (categorical / binary vectors)
Metric choice changes which points count as neighbors, so experiment (a side-by-side sketch follows below).
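For a quick, side-by-side look, here are minimal NumPy versions of these metrics; these helpers are illustrative only and are not part of the repository's HelperFunctions:
import numpy as np

def euclidean(a, b):       # L2: straight-line distance
    return float(np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2)))

def manhattan(a, b):       # L1: sum of absolute coordinate differences
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b))))

def minkowski(a, b, p=3):  # generalized p-norm (p=1 gives L1, p=2 gives L2)
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1 / p))

def cosine_distance(a, b): # 1 - cosine similarity; compares direction, not magnitude
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def hamming(a, b):         # fraction of positions that differ (categorical / binary vectors)
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a != b))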
Trade-off:
- Small K → low bias, high variance (sensitive to noise).
- Large K → smoother predictions, higher bias (can underfit).
Heuristics:
- Try odd K for binary classification (avoid ties).
- Start with sqrt(N) as a coarse upper bound to explore.
- Use a cross-validation grid (e.g., K in {1, 3, 5, 7, 11, 15, …}); a minimal grid sketch follows this list.
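As a sketch of such a grid, using scikit-learn's KNeighborsClassifier and cross_val_score; the synthetic dataset is only a stand-in for your own scaled features and labels:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; replace with your own scaled features/labels
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

scores = {}
for k in (1, 3, 5, 7, 11, 15):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy

best_k = max(scores, key=scores.get)
print('CV accuracy by K:', {k: round(v, 3) for k, v in scores.items()})
print('Best K:', best_k)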
- Classification: majority vote; optionally add distance weighting (e.g., weight = 1 / (d + ε)); see the sketch after this list.
- Regression: mean vs weighted mean; handle zero distance (an exact duplicate) by immediately returning that label/value or assigning it infinite weight (see the helper function approach).
Edge considerations:
- Ties in classification: break by (a) smallest cumulative distance, (b) deterministic ordering, or (c) reducing K by 1 recursively.
- Distance zero: an identical point was encountered; return its target directly.
- Missing values: impute before KNN (mean / median / KNN-imputer) else distance invalid.
- Feature scaling: always scale numerical features (e.g., StandardScaler / MinMaxScaler) to prevent dominance.
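Putting the weighting, the zero-distance shortcut, and the cumulative-distance tie-break together in one illustrative helper (the name weighted_vote is hypothetical and not part of HelperFunctions.py):
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """neighbors: list of (distance, label) pairs for the K nearest points."""
    # Zero-distance shortcut: an exact duplicate decides the prediction outright
    for d, label in neighbors:
        if d == 0:
            return label
    # Accumulate inverse-distance weight and cumulative distance per label
    weight_sum = defaultdict(float)
    dist_sum = defaultdict(float)
    for d, label in neighbors:
        weight_sum[label] += 1.0 / (d + eps)
        dist_sum[label] += d
    best = max(weight_sum.values())
    tied = [lbl for lbl, w in weight_sum.items() if w == best]
    # Tie-break: among equally weighted labels, prefer the smallest cumulative distance
    return min(tied, key=lambda lbl: dist_sum[lbl])

print(weighted_vote([(0.5, 'A'), (1.2, 'B'), (2.0, 'A')]))  # -> 'A'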
Recommended pipeline (sketched in code after this list):
- Clean anomalies / duplicates.
- Encode categorical variables (one-hot, ordinal if appropriate).
- Scale numeric features.
- Optionally reduce dimensionality (PCA) if d is large.
- Split into train / validation / test.
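One possible way to wire these steps together with scikit-learn; this is a sketch, and the column names and toy DataFrame are made up for illustration:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column layout
numeric_cols = ['age', 'income']
categorical_cols = ['city']

df = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'income': [40_000, 52_000, 88_000, 95_000, 61_000, 45_000],
    'city': ['A', 'B', 'A', 'C', 'B', 'A'],
    'label': [0, 0, 1, 1, 1, 0],
})

preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),                             # scale numeric features
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),   # encode categoricals
])

model = Pipeline([
    ('prep', preprocess),
    ('pca', PCA(n_components=2)),                 # optional dimensionality reduction
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='label'), df['label'], test_size=0.33, random_state=0)
model.fit(X_train, y_train)
print('Test accuracy:', model.score(X_test, y_test))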
- Training: O(1) (just store data).
- Prediction (naïve): O(N * d) per query.
- Memory: O(N * d).
Acceleration options (future roadmap):
- KD-Tree / Ball Tree (works best for moderate dimensions, roughly d < 30); see the timing sketch after this list.
- Locality Sensitive Hashing (approximate for high-d sparse data).
- Annoy / FAISS for large-scale approximate neighbor search.
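For a feel of the difference, scikit-learn's NearestNeighbors exposes brute-force and KD-tree search behind the same interface; the sizes in this timing sketch are arbitrary:
import time
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 8))       # N = 20k training points, d = 8
queries = rng.normal(size=(100, 8))    # 100 query points

for algorithm in ('brute', 'kd_tree'):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algorithm).fit(X)
    start = time.perf_counter()
    nn.kneighbors(queries)             # distances/indices of the 5 nearest points
    print(algorithm, f'{time.perf_counter() - start:.4f}s')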
├── docs/                      # Theory & concept explanations
│   ├── KNN.md
│   ├── L2-distance.md
│   ├── instance-base.md
│   ├── non-parametric.md
│   ├── classification-problem.md
│   ├── regression-problem.md
│   └── supervised-learning.md
├── notebook/
│   ├── practice.ipynb         # Interactive experimentation space
│   └── utils/
│       ├── HelperFunctions.py # Plotting, L2 distance, frequency & prediction helpers
│       └── __init__.py
├── resource/
│   └── tutorial.pdf           # Supplemental reading (overview)
├── LICENSE
└── README.md
Minimal dependencies (implied by helper code & typical usage):
- Python >= 3.10
- numpy
- pandas
- matplotlib
Create & activate environment (examples):
# (Windows PowerShell) create venv
python -m venv .venv
./.venv/Scripts/Activate.ps1
pip install --upgrade pip
pip install numpy pandas matplotlib scikit-learn
Optional (for experiments):
- seaborn (nicer visuals)
- jupyter / ipykernel
pip install seaborn jupyter ipykernel
python -m ipykernel install --user --name knn-tutorial
from notebook.utils.HelperFunctions import HelperFunctions
p1 = [1, 2, 3]
p2 = [4, 6, 8]
dist = HelperFunctions.calculateL2Distance(p1, p2)
print(dist)
Majority-vote classification from a DataFrame of neighbor labels:
import pandas as pd
from notebook.utils.HelperFunctions import HelperFunctions
df = pd.DataFrame({
'label': ['A','B','A','C','A','B']
})
pred = HelperFunctions.predictionClassification(df, 'label')
print('Predicted label:', pred)
For regression, assume a DataFrame of neighbor targets & distances:
import pandas as pd
from notebook.utils.HelperFunctions import HelperFunctions
neighbors = pd.DataFrame({
'value': [10.0, 12.0, 15.0],
'distance': [0.5, 1.2, 2.0]
})
estimate = HelperFunctions.predictionRegression(neighbors, target='value', distance='distance', weight=True)
print('Weighted estimate:', estimate)
A compact from-scratch classifier (no helpers required):
import math
import numpy as np
from collections import Counter
def l2(a, b):
    return math.dist(a, b)

def knn_predict(x_query, X_train, y_train, K=5):
    distances = [(l2(x_query, X_train[i]), y_train[i]) for i in range(len(X_train))]
    distances.sort(key=lambda t: t[0])
    topk = [label for (_, label) in distances[:K]]
    return Counter(topk).most_common(1)[0][0]
X_train = np.array([[1,2],[2,3],[3,3],[6,5],[7,8]])
y_train = np.array(['A','A','B','B','B'])
print(knn_predict([2.5,2.5], X_train, y_train, K=3))
A typical end-to-end workflow (sketched in code after this list):
- Load dataset (CSV or synthetic generation).
- Split train/validation/test.
- Scale numeric features (StandardScaler / MinMax).
- Loop over candidate K values; evaluate.
- Select best K; retrain (store training set + chosen hyperparams).
- Evaluate on test set; record metrics.
- Visualize decision boundaries (2D) or performance curves.
- Package as a reusable class (future step).
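A compact sketch of that workflow on synthetic data with scikit-learn; swap in your own dataset, and note that decision-boundary plotting and packaging are left out here:
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# 1) Load (here: synthetic), 2) split, 3) scale
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = scaler.transform(X_train), scaler.transform(X_val), scaler.transform(X_test)

# 4) Loop over candidate K values on the validation split
val_scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
              for k in (1, 3, 5, 7, 11, 15)}
best_k = max(val_scores, key=val_scores.get)

# 5) Refit on train+validation with the chosen K, 6) evaluate on the held-out test set
final = KNeighborsClassifier(n_neighbors=best_k).fit(scaler.transform(X_trainval), y_trainval)
print('Best K:', best_k, 'Test accuracy:', accuracy_score(y_test, final.predict(X_test)))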
Classification (both metric groups are computed in the worked example after these lists):
- Accuracy (baseline)
- Precision / Recall / F1 (esp. for imbalance)
- Confusion Matrix
- ROC-AUC (probability-style adaptation via distance weighting / inverse aggregated scores)
Regression:
- MAE, MSE / RMSE
- RΒ²
- Median Absolute Error (robust to outliers)
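A worked example of these metrics on toy predictions, using scikit-learn's metrics module:
import numpy as np
from sklearn import metrics

# Classification metrics on toy predictions
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])
print('Accuracy:', metrics.accuracy_score(y_true, y_pred))
print('F1:', metrics.f1_score(y_true, y_pred))
print('Confusion matrix:\n', metrics.confusion_matrix(y_true, y_pred))

# Regression metrics on toy predictions
r_true = np.array([10.0, 12.0, 15.0, 9.0])
r_pred = np.array([11.0, 11.5, 14.0, 10.5])
print('MAE:', metrics.mean_absolute_error(r_true, r_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(r_true, r_pred)))
print('R^2:', metrics.r2_score(r_true, r_pred))
print('Median AE:', metrics.median_absolute_error(r_true, r_pred))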
Use matplotlib (and optionally seaborn) to:
- Scatter plot colored by class.
- Plot effect of K vs validation accuracy (elbow view).
- Distance distribution histogram for query points.
- Decision boundary grid (for 2D synthetic datasets).
HelperFunctions.drawScatterPlot provides a light starting point; a K-vs-accuracy elbow sketch follows below.
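For example, an elbow plot of cross-validated accuracy against K on a synthetic blob dataset (a sketch; any scaled dataset works):
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, cluster_std=2.0, random_state=0)
ks = list(range(1, 26, 2))
accuracy = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean() for k in ks]

plt.plot(ks, accuracy, marker='o')
plt.xlabel('K (number of neighbors)')
plt.ylabel('Mean CV accuracy')
plt.title('Elbow view: accuracy vs K')
plt.show()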
Planned / Suggested Enhancements:
- Add a `KNNClassifier` and `KNNRegressor` class (API similar to scikit-learn: `fit`, `predict`); a minimal skeleton is sketched after this list.
- Support multiple distance metrics with a strategy pattern.
- Implement KD-Tree / Ball Tree fallback based on dimension threshold.
- Add unit tests (pytest) for helper utilities & future classes.
- Add benchmarking script (timing naive vs tree-based search).
- Integrate synthetic dataset generators (circles, moons, blobs).
- Provide decision boundary visualization utility.
- Add cross-validation helper.
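As a starting point for the first roadmap item, a minimal brute-force KNNClassifier skeleton might look like this (an illustration only, not the repository's implementation):
import numpy as np
from collections import Counter

class KNNClassifier:
    """Minimal scikit-learn-style KNN classifier (brute-force, Euclidean)."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Lazy learner: just memorize the training data
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        preds = []
        for x in X:
            dists = np.linalg.norm(self.X_ - x, axis=1)          # distance to every stored point
            nearest = self.y_[np.argsort(dists)[:self.n_neighbors]]
            preds.append(Counter(nearest).most_common(1)[0][0])  # majority vote
        return np.array(preds)

model = KNNClassifier(n_neighbors=3).fit([[0, 0], [0, 1], [5, 5], [6, 5]], ['A', 'A', 'B', 'B'])
print(model.predict([[0.2, 0.1], [5.5, 5.2]]))  # expected: ['A' 'B']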
- Forgetting to scale features → distorted neighbor ranking (demonstrated below).
- Using too large a K → overly smooth predictions / majority-class dominance.
- Imbalanced classes → evaluate with F1, not only accuracy.
- High-dimensional sparse vectors → consider approximate methods.
- Duplicated rows with contradictory labels → clean the data.
- Performance bottleneck → pre-compute structures or batch queries.
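A tiny demonstration of the first pitfall, with made-up age/income values: the dollar-scale feature swamps the distance until both columns are standardized.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features: [age in years, income in dollars]
X = np.array([[25, 40_000], [26, 90_000], [60, 41_000]], dtype=float)
query = np.array([24, 42_000], dtype=float)

raw = np.linalg.norm(X - query, axis=1)
print('Unscaled distances:', raw.round(1))   # income dominates: the 60-year-old looks "closest"

scaler = StandardScaler().fit(X)
scaled = np.linalg.norm(scaler.transform(X) - scaler.transform(query.reshape(1, -1)), axis=1)
print('Scaled distances:', scaled.round(2))  # the 25-year-old with a similar income is now nearest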
Concept docs inside docs/ are fully self-contained. For deeper dives:
- Pattern Recognition & Machine Learning – C. M. Bishop (chapter on non-parametric methods)
- The Elements of Statistical Learning – Hastie, Tibshirani, Friedman (KNN & kernel methods)
- scikit-learn user guide (Nearest Neighbors section)
- Research approximate nearest neighbor libraries: FAISS, Annoy, HNSW.
This project is distributed under the terms of the MIT License. See LICENSE for details.
Feel free to fork & open a pull request adding:
- A formal KNN implementation
- More distance metrics
- Performance benchmarks
- Additional instructional notebooks
This repository provides conceptual foundations, helper utilities, and a sandbox (practice.ipynb) to learn and extend the K-Nearest Neighbors algorithm. Build from here toward a robust, tested mini-library and experiment with optimizations as datasets grow.
K-Nearest Neighbors remains one of the most intuitive algorithms in machine learning: powerful as a baseline, and instructive for learning core concepts like distance, feature scaling, and bias-variance trade-offs. This tutorial equips you with:
- A solid theoretical grounding (supervised / non-parametric / instance-based context).
- Practical utility functions (distance, voting, weighting) to bootstrap experiments.
- A structured workflow for evaluation, tuning, and visualization.
- A forward-looking roadmap to transform learning artifacts into a reusable mini-framework.
Suggested next steps for you:
- Implement a `KNNClassifier` class with `fit`, `predict`, and optional `predict_proba` (distance-weighted vote proportions).
- Add a `KNNRegressor` counterpart with optional kernel weighting (e.g., a Gaussian kernel over distances).
- Introduce hyperparameter search (grid / randomized) with cross-validation wrappers.
- Benchmark naive vs tree-based neighbor search on synthetic datasets of increasing size.
- Package your implementation (e.g., `pip install -e .`) and publish as a learning artifact.
If you build upon this, consider documenting performance trade-offs and adding reproducible experiment scripts. Your future self (and contributors) will thank you.
Happy learning & exploring; may your nearest neighbors always be informative!