This project is about building a machine learning model to detect intrusions (suspicious or malicious activity) using a Kaggle dataset. The dataset combines network-level features with user behavior, making it possible to train a model that can separate normal activity from attacks.
- Source: Kaggle – Cybersecurity Intrusion Detection Dataset
- Network features: packet size, protocol type (TCP, UDP, ICMP), encryption used.
- User behavior features: number of login attempts, failed logins, session duration, unusual access times, IP reputation score, browser type.
- Target variable: `attack_detected` (binary flag: 1 = attack, 0 = safe).
- Dropped irrelevant or incomplete columns (`session_id`, `encryption_used`).
- Handled missing values by removing columns with null data.
- Converted categorical features (e.g., protocol type, browser) into numerical form using one-hot encoding.
- Split the dataset into training (75%) and testing (25%) sets.
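
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the small synthetic DataFrame stands in for the Kaggle CSV, and every column name other than `session_id`, `encryption_used`, and `attack_detected` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the Kaggle CSV; most column names are assumed.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "session_id": [f"SID_{i:05d}" for i in range(n)],
    "network_packet_size": rng.integers(64, 1500, size=n),
    "protocol_type": rng.choice(["TCP", "UDP", "ICMP"], size=n),
    "login_attempts": rng.integers(1, 10, size=n),
    "failed_logins": rng.integers(0, 5, size=n),
    "session_duration": rng.uniform(1, 3600, size=n),
    "encryption_used": ["AES", None] * (n // 2),  # contains nulls
    "browser_type": rng.choice(["Chrome", "Firefox", "Edge"], size=n),
    "attack_detected": rng.integers(0, 2, size=n),
})

# Drop the identifier column and the incomplete column.
df = df.drop(columns=["session_id", "encryption_used"])

# Remove any remaining columns that contain null values.
df = df.dropna(axis=1)

# Separate the target, then one-hot encode the categorical features.
y = df["attack_detected"]
X = pd.get_dummies(
    df.drop(columns=["attack_detected"]),
    columns=["protocol_type", "browser_type"],
)

# 75% training / 25% testing split (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```

Dropping whole columns that contain nulls is simple but discards information; it works here because only a small number of columns are affected.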
- Random Forest Classifier was used as the main model.
- GridSearchCV was applied to try different hyperparameter combinations and find the best fit.
- Best parameters found:
  - `n_estimators = 200`
  - `max_depth = 20`
  - `min_samples_split = 5`
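
A grid search of this kind might look like the sketch below. The parameter grid is hypothetical: it contains the reported best values, but the other candidates, the fold count, and the synthetic data from `make_classification` are assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the preprocessed intrusion dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hypothetical grid: includes the reported best values, but the other
# candidates are assumptions, not the notebook's actual grid.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
}

# Try every combination with 3-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best model:", round(search.best_score_, 3))
```

On synthetic data the selected parameters will not necessarily match the values reported above; the point is only to show the search pattern.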
- Cross-validation accuracy: ~89%
- Test set accuracy: ~89%
The cross-validation and test accuracies are nearly identical, which suggests that the tuned Random Forest model generalizes to unseen data rather than overfitting the training set.
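
One way to compare cross-validation accuracy with test accuracy is sketched below, using the best parameters reported above. The data is synthetic and the fold count and random seeds are assumptions, so the printed numbers will not match the ~89% figures.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data stands in for the preprocessed intrusion dataset.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Random Forest configured with the reported best parameters.
model = RandomForestClassifier(
    n_estimators=200, max_depth=20, min_samples_split=5, random_state=0
)

# Cross-validation accuracy, estimated on the training set only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
cv_accuracy = cv_scores.mean()

# Accuracy on the held-out test set after fitting on all training data.
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"CV accuracy:   {cv_accuracy:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
print(f"Gap:           {abs(cv_accuracy - test_accuracy):.3f}")
```

A small gap between the two numbers is the signal to look for; a test accuracy far below the cross-validation accuracy would point to overfitting or data leakage.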
- Clone this repository.
- Open the Jupyter Notebook file (`ml-intrusion-detection.ipynb`).
- Run the notebook step by step to:
  - Load and preprocess the dataset.
  - Train the Random Forest model with GridSearchCV.
  - Evaluate accuracy on the test set.
- Network and user behavior features together provide strong signals for intrusion detection.
- Good preprocessing (handling missing data, encoding categorical values) is just as important as the choice of model.
- Hyperparameter tuning makes a big difference: accuracy improved significantly after optimization.