Transforming Raw Data into Machine Learning Ready Data
Data preprocessing is one of the most critical steps in Machine Learning. Real-world data is often incomplete, inconsistent, noisy, or unstructured. Before training any machine learning model, it is essential to clean, transform, and organize the data into a suitable format.
This repository provides a beginner-friendly guide to Numerical & Categorical Data Preprocessing, covering practical techniques and real-world examples to better understand how raw data is transformed into machine-learning-ready data.
✔️ Introduction to Data Preprocessing
✔️ Importance of Clean Data in Machine Learning
✔️ Handling Missing Values
✔️ Feature Scaling Techniques
✔️ Normalization vs Standardization
✔️ Outlier Detection & Treatment
✔️ Label Encoding
✔️ One-Hot Encoding
✔️ Ordinal Encoding
✔️ Handling Rare/Unknown Categories
✔️ Data Cleaning Techniques
✔️ Feature Transformation Methods
✔️ Real-world Use Cases
✔️ Common Mistakes to Avoid
- Introduction to Data Preprocessing
- Importance of Clean Data in Machine Learning
- Numerical Data Preprocessing
  - Handling Missing Values
  - Feature Scaling
  - Normalization vs Standardization
  - Outlier Detection & Treatment
- Categorical Data Preprocessing
  - Label Encoding
  - One-Hot Encoding
  - Ordinal Encoding
  - Handling Rare/Unknown Categories
- Data Cleaning Techniques
- Feature Transformation Techniques
- Real-world Examples / Use Cases
- Advantages of Proper Preprocessing
- Common Mistakes to Avoid
- Conclusion & Key Learnings
Numerical data includes values such as age, salary, marks, temperature, and price. Although machine learning algorithms understand numbers naturally, these features often require preprocessing.
Missing values are one of the most common problems in datasets. They can occur because of human error, incomplete forms, or technical failures.
Common techniques:
- Removing missing rows
- Replacing with Mean / Median / Mode
- Predicting missing values
- Interpolation methods
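The techniques above can be sketched with pandas. This is a minimal illustration, assuming a small made-up dataset (the column names and values are hypothetical); it covers row removal, median/mode replacement, and linear interpolation, and leaves model-based prediction of missing values aside.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
    "temperature": [20.0, np.nan, 24.0, 26.0],
})

# 1. Remove every row that contains at least one missing value.
dropped = df.dropna()

# 2. Replace missing numbers with the median and missing labels with the mode.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# 3. Linear interpolation, which suits ordered data such as a time series.
imputed["temperature"] = imputed["temperature"].interpolate(method="linear")

print(dropped)
print(imputed)
```

Dropping rows is the simplest option but discards data (here, half of the rows), so replacement or interpolation is often preferable when missing values are frequent.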
Features may have different ranges. For example:
- Age → 18 to 60
- Salary → 20,000 to 10,00,000
Feature scaling ensures that no feature dominates simply because of larger numerical values.
Normalization
- Rescales values between 0 and 1
- Best for bounded data ranges
Standardization
- Mean becomes 0
- Standard deviation becomes 1
- Useful for many machine learning algorithms
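Both rescaling methods are short formulas: normalization computes (x - min) / (max - min), and standardization computes (x - mean) / standard deviation. The sketch below applies them with NumPy, assuming a few made-up age and salary values inside the ranges mentioned earlier; the helper names `min_max_normalize` and `standardize` are hypothetical.

```python
import numpy as np

# Illustrative values only, chosen inside the example ranges.
age = np.array([18.0, 30.0, 45.0, 60.0])
salary = np.array([20_000.0, 150_000.0, 400_000.0, 1_000_000.0])

def min_max_normalize(x):
    """Rescale values to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Shift to mean 0 and scale to standard deviation 1 (z-scores)."""
    return (x - x.mean()) / x.std()

age_norm = min_max_normalize(age)
salary_norm = min_max_normalize(salary)
salary_std = standardize(salary)

print(age_norm)     # both features now share the 0 to 1 range
print(salary_norm)
print(salary_std)
```

Both helpers divide by zero on a constant column, so a real pipeline would need to handle that case. scikit-learn offers the same operations as `MinMaxScaler` and `StandardScaler`.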
Outliers are extreme values that differ significantly from normal observations.
Common methods:
- IQR Method
- Z-Score Method
- Capping / Clipping
- Removal (if incorrect)
- Log Transformations
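A minimal NumPy sketch of the detection and treatment methods listed above, assuming a small made-up sample with one extreme value. The 1.5 × IQR fences are the usual convention; the z-score threshold of 2.5 is an assumption chosen for this tiny sample, because a single extreme value inflates the standard deviation (3 is a more common cut-off on larger datasets).

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 95.0])

# IQR method: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: flag points far from the mean in standard-deviation units.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2.5]  # assumed threshold, see note above

# Capping / clipping: pull extreme values back to the IQR fences.
capped = np.clip(values, lower, upper)

# Log transformation: compress large values instead of removing them.
logged = np.log1p(values)

print(iqr_outliers, z_outliers)
print(capped)
```

Capping and log transformation keep every row, which is usually safer than removal unless the extreme value is known to be an error.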
Categorical data contains labels or groups such as city, gender, product type, or education level.
Since most machine learning algorithms operate on numbers and cannot use text labels directly, categories must be converted into numbers.
Label encoding assigns a numerical value to each category.
Example:
Red = 0
Blue = 1
Green = 2
One-hot encoding creates a separate binary column for each category.
Example:
| Color | Red | Blue | Green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |
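The Red/Blue/Green examples above can be reproduced with pandas. This is a small sketch, assuming a made-up `Color` column; the explicit `label_map` mirrors the label encoding example, and `pd.get_dummies` produces the one-hot columns.

```python
import pandas as pd

# Hypothetical column of colour labels.
colors = pd.Series(["Red", "Blue", "Green", "Blue"], name="Color")

# Label encoding with the explicit mapping from the example.
label_map = {"Red": 0, "Blue": 1, "Green": 2}
labels = colors.map(label_map)

# One-hot encoding: one binary column per category.
# get_dummies sorts columns alphabetically, so reorder to match the table.
one_hot = pd.get_dummies(colors, dtype=int)[["Red", "Blue", "Green"]]

print(labels.tolist())
print(one_hot)
```

The label codes imply an order (Red < Blue < Green) that the colours do not actually have, which is why one-hot encoding is usually the safer choice for unordered categories.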
Ordinal encoding is used when categories have a meaningful order.
Example:
Low = 1
Medium = 2
High = 3
Rare categories may negatively affect model performance.
Solutions:
- Group into "Other" category
- Merge infrequent labels
- Handle unknown categories properly
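The ordinal mapping and the rare-category solutions above can be sketched with pandas. The city names and the `min_count` threshold are assumptions made for illustration; the idea is that categories seen fewer than `min_count` times in the training data, and categories never seen at all, both fall back to "Other".

```python
import pandas as pd

# Ordinal encoding: the mapping preserves the meaningful order.
levels = pd.Series(["Low", "High", "Medium", "Low"])
order = {"Low": 1, "Medium": 2, "High": 3}
levels_encoded = levels.map(order)

# Rare categories: group labels seen fewer than min_count times into "Other".
train_city = pd.Series(["Delhi", "Delhi", "Pune", "Pune", "Pune", "Agra", "Kochi"])
min_count = 2  # hypothetical threshold
counts = train_city.value_counts()
frequent = set(counts[counts >= min_count].index)
train_grouped = train_city.where(train_city.isin(frequent), "Other")

# Unknown categories: labels unseen during training also map to "Other".
new_city = pd.Series(["Pune", "Mumbai"])
new_grouped = new_city.where(new_city.isin(frequent), "Other")

print(levels_encoded.tolist())
print(train_grouped.tolist())
print(new_grouped.tolist())
```

The set of frequent categories is learned from the training data only and then reused for new data, so the encoding stays consistent at prediction time.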
Before model training, datasets often require cleaning.
Common techniques include:
- Removing duplicate records
- Fixing inconsistent formatting
- Standardizing text values
- Handling missing values
- Removing irrelevant or noisy data
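A short pandas sketch of several of these cleaning steps, assuming a made-up table in which `row_id` is treated as an irrelevant column. The order matters here: text is standardized first so that records differing only in case or whitespace are recognised as duplicates.

```python
import pandas as pd

# Hypothetical raw records with inconsistent formatting and duplicates.
raw = pd.DataFrame({
    "name": [" Asha ", "asha", "Ravi", "Ravi"],
    "city": ["delhi", "Delhi", "PUNE ", "PUNE "],
    "row_id": [101, 102, 103, 103],
})

clean = raw.copy()

# Fix inconsistent formatting: trim whitespace and use one capitalisation style.
for col in ["name", "city"]:
    clean[col] = clean[col].str.strip().str.title()

# Remove a column assumed to be irrelevant for modelling.
clean = clean.drop(columns=["row_id"])

# Remove duplicate records (after standardizing, so near-duplicates match).
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean)
```

Running `drop_duplicates` on the raw table removes only the one exact duplicate, while the cleaned version also merges the two spellings of the first record.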
Feature transformation improves data quality and distribution.
Popular methods:
- Log Transformation
- Square Root Transformation
- Box-Cox Transformation
- Binning
- Polynomial Features
These techniques help improve model performance and reduce skewness.
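Most of these transformations are one-liners in NumPy and pandas. The sketch below assumes made-up income and age values, and the bin edges and labels are hypothetical. Box-Cox is left out of the code; SciPy provides it as `scipy.stats.boxcox`, which requires strictly positive input.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature.
income = np.array([0.0, 9.0, 99.0, 999.0, 9999.0])

# Log transformation: log(1 + x) is defined at zero and compresses large values.
log_income = np.log1p(income)

# Square root transformation: a milder compression than the log.
sqrt_income = np.sqrt(income)

# Binning: turn a numeric feature into ordered groups (edges are assumptions).
ages = pd.Series([5, 17, 18, 42, 70])
age_group = pd.cut(ages, bins=[0, 17, 64, 120], labels=["child", "adult", "senior"])

# Polynomial features: add x squared next to x.
x = np.array([1.0, 2.0, 3.0])
poly = np.column_stack([x, x ** 2])

print(log_income)
print(age_group.tolist())
print(poly)
```

After the log transformation the income values span roughly 0 to 9.2 instead of 0 to 9999, which is the reduction in skewness described above.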
Typical preprocessing tasks in real-world projects include:
- Missing values handling
- Location encoding
- Feature scaling
- Contract type encoding
- Monthly charge normalization
- Outlier detection in medical values
- Missing patient data treatment
Proper preprocessing helps:
✔️ Improve model accuracy
✔️ Reduce noise and inconsistencies
✔️ Improve generalization on unseen data
✔️ Reduce bias in predictions
✔️ Enhance overall model performance
Common mistakes to avoid:
❌ Ignoring missing values
❌ Removing outliers blindly
❌ Applying Label Encoding to unordered (nominal) categories
❌ Data leakage (fitting preprocessing steps on the full dataset before the train-test split)
❌ Overcomplicating features unnecessarily
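The data leakage mistake is the easiest one to make silently, so here is a minimal sketch of the safe order of operations, assuming synthetic data and a simple 80/20 split: split first, compute the scaling statistics on the training rows only, then apply those same statistics to the test rows.

```python
import numpy as np

# Synthetic data for illustration: 100 rows, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))

# 1. Split first, before any statistics are computed.
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:80], idx[80:]
X_train, X_test = X[train_idx], X[test_idx]

# 2. Fit the preprocessing on the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# 3. Apply the training statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # close to 0 by construction
print(X_test_scaled.mean(axis=0))   # near 0, but not exactly
```

With scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_test)`; calling `fit` on the full dataset would let information from the test rows leak into training.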
I have also published a detailed Medium article explaining every concept in a beginner-friendly and practical way.
Read here:
From Raw to Refined: The Ultimate Guide to Data Preprocessing in Machine Learning
Feel free to connect with me and explore more Machine Learning & Data Science projects.
LinkedIn: https://www.linkedin.com/posts/saisourav-panigrahi_machinelearning-datascience-datapreprocessing-share-7466568282523865089-QjBZ/?utm_source=share&utm_medium=member_desktop&rcm=ACoAAGNwn4QBGwfbhY2KqFQgojIO099iwSyR5OQ
GitHub: https://github.com/SaiSourav2004/data-preprocessing-in-machine-learning
Medium: https://medium.com/@panigrahisaisourav
⭐ If you found this repository helpful, consider giving it a Star!