This project uses machine learning to classify malware into families. The idea is simple: take a large set of malware samples, extract features with YARA (a tool that scans files for patterns described by rules), and train models to predict which family a sample belongs to.
Data
- Around 160,000 malware samples were used.
- Features came from YARA rules, describing things like file sections, data directory sizes, and debug paths.
- A smaller set of about 2,400 samples had labels with family names.
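As a rough illustration of how nested per-sample attributes (sections, data directories, debug paths) can turn into a wide table of raw features, here is a minimal sketch. The record layout and every field name in `sample` are hypothetical and are not taken from the project's notebook; only the kinds of attributes come from the description above.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into one level with dotted keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, prefix=f"{name}.{i}."))
                else:
                    flat[f"{name}.{i}"] = item
        else:
            flat[name] = value
    return flat


# Hypothetical per-sample record; field names are illustrative only.
sample = {
    "sha256": "abc123",
    "sections": [
        {"name": ".text", "raw_size": 4096},
        {"name": ".rsrc", "raw_size": 512},
    ],
    "data_directories": {"import": {"size": 200}, "debug": {"size": 28}},
    "pdb_path": "C:\\build\\miner.pdb",
}

row = flatten(sample)
for column, value in row.items():
    print(column, "=", value)
```

Each section and directory entry becomes its own column, which is one way a dataset can end up with thousands of raw features before any selection is applied.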
Features
- Started with thousands of raw features after processing.
- Cut this down to 271 important ones that carried most of the signal.
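The source does not say how the 271 features were chosen. One common approach, shown here purely as an assumption, is to rank features by Random Forest importance and keep the top N. The sketch below uses scikit-learn on synthetic data; `N_KEEP` and the dataset shape are placeholders, not the project's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the labeled feature matrix (not the project's data).
X, y = make_classification(
    n_samples=600,
    n_features=200,
    n_informative=15,
    n_classes=4,
    random_state=0,
)

# The project kept 271 features; a smaller number suits this toy dataset.
N_KEEP = 20

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank features by impurity-based importance, highest first,
# then keep the top N_KEEP column indices in their original order.
ranking = np.argsort(forest.feature_importances_)[::-1]
keep = np.sort(ranking[:N_KEEP])

X_reduced = X[:, keep]
print(X.shape, "->", X_reduced.shape)
```

Impurity-based importances can favor high-cardinality features, so permutation importance or a simple variance filter may be worth comparing if the selected set looks odd.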
Models
- Random Forest: after tuning, reached about 75% accuracy on the test set.
- XGBoost: gave similar overall accuracy, but behaved a bit differently on some families.
- K-Means: worked without labels and still managed to group CoinMiner samples into their own cluster.
Key takeaways
- Supervised models (Random Forest, XGBoost) do a good job on known threats.
- Unsupervised clustering (K-Means) can uncover hidden structures and even point to new malware families.
- The quality of the features really matters here. Feature engineering and selection made a huge difference.
How to run
- Clone the repo.
- Open Malware_analysis.ipynb in Jupyter.
- Run the cells to go through preprocessing, training, and evaluation.
Results
- Random Forest hit ~80% accuracy after tuning.
- XGBoost had very similar performance.
- K-Means found natural clusters, with one lining up closely to the CoinMiner family.
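The two kinds of result above (held-out accuracy for a supervised model, and clusters compared against known families) can be sketched as follows. This is a minimal illustration on synthetic, well-separated data using scikit-learn, not the notebook's code: the family names other than CoinMiner, the cluster count, and all hyperparameters are assumptions, and the scores it prints say nothing about the project's real numbers. XGBoost is left out to keep dependencies small.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Only "CoinMiner" appears in the source; the other names are placeholders.
FAMILIES = ["CoinMiner", "FamilyB", "FamilyC"]

# Synthetic stand-in: three well-separated groups in a 10-feature space.
centers = [[0.0] * 10, [8.0] * 10, [-8.0] * 10]
X, y = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

# Supervised: hold out a stratified test set and score a Random Forest on it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
accuracy = accuracy_score(y_test, forest.predict(X_test))
print(f"Random Forest test accuracy: {accuracy:.2f}")

# Unsupervised: cluster without labels, then use labels only for inspection.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

purity = {}
for cluster_id in range(3):
    members = Counter(
        FAMILIES[label] for label, c in zip(y, clusters) if c == cluster_id
    )
    family, count = members.most_common(1)[0]
    # Purity: share of the cluster taken up by its most common family.
    purity[cluster_id] = count / sum(members.values())
    print(f"cluster {cluster_id}: mostly {family} (purity {purity[cluster_id]:.2f})")
```

A cluster with high purity for one family is the kind of evidence behind a statement like "one cluster lines up closely with CoinMiner". On real data the number of clusters would need to be chosen, for example with an elbow plot or silhouette scores.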
This project shows how machine learning can help in malware research. Known families can be classified with good accuracy, and at the same time, unsupervised methods reveal patterns that might not be obvious. Together, they give a stronger view of what’s hiding in large malware datasets.