Malware Classification with Machine Learning

This project is about using machine learning to classify malware into families. The idea is simple: take a huge set of malware samples, extract useful features with YARA rules (a tool that scans files for patterns), and train models to see if they can tell which family a piece of malware belongs to.

What’s Inside

Data
- Around 160,000 malware samples were used.
- Features came from YARA rules, describing things like file sections, data directory sizes, and debug paths.
- A smaller set of about 2,400 samples had labels with family names.
Features
- Started with thousands of raw features after processing.
- Cut this down to 271 important ones that carried most of the signal.
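
The README does not say how the 271 features were chosen. One common approach, shown here as a minimal sketch and not as the notebook's actual method, is to rank features by an importance score (for example, one taken from a tree model) and keep the smallest set that covers most of the total. The function name, feature names, and scores below are all hypothetical.

```python
def select_by_cumulative_importance(importances, coverage=0.95):
    """Return the top-ranked feature names whose importances sum to
    at least `coverage` of the total importance."""
    total = sum(importances.values())
    if total <= 0:
        return []
    # Rank features from most to least important.
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for name, score in ranked:
        selected.append(name)
        running += score
        if running / total >= coverage:
            break
    return selected


# Hypothetical importance scores; real ones would come from a fitted model.
scores = {
    "section_count": 0.40,
    "debug_path_len": 0.30,
    "import_dir_size": 0.20,
    "rsrc_entropy": 0.07,
    "tls_dir_size": 0.03,
}
kept = select_by_cumulative_importance(scores, coverage=0.85)
print(kept)
```

Choosing a coverage threshold instead of a fixed count lets the data decide how many features survive; a fixed top-k cut would work just as well if a specific feature budget is required.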
Models
- Random Forest: after tuning, reached about 75% accuracy on the test set.
- XGBoost: gave similar overall accuracy, though its predictions differed from Random Forest's on some families.
- K-Means: worked without labels and still managed to group CoinMiner samples into their own cluster.
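
Two classifiers with similar overall accuracy can still behave differently per family. The sketch below illustrates this with toy labels; it is not output from the notebook. `FamilyA` and `FamilyB` are placeholder names, and the prediction lists are invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted family matches the true family."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_family_recall(y_true, y_pred):
    """For each true family, the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {family: hits[family] / totals[family] for family in totals}


# Toy data (hypothetical): true families and two models' predictions.
y_true = ["CoinMiner", "CoinMiner", "FamilyA", "FamilyA", "FamilyB", "FamilyB"]
rf_pred = ["CoinMiner", "CoinMiner", "FamilyA", "FamilyB", "FamilyB", "FamilyA"]
xgb_pred = ["CoinMiner", "FamilyA", "FamilyA", "FamilyA", "FamilyB", "FamilyA"]

rf_recall = per_family_recall(y_true, rf_pred)
xgb_recall = per_family_recall(y_true, xgb_pred)

print("accuracy:", accuracy(y_true, rf_pred), accuracy(y_true, xgb_pred))
print("RF recall: ", rf_recall)
print("XGB recall:", xgb_recall)
```

In this toy case both models score the same accuracy, yet one is stronger on CoinMiner and the other on FamilyA, which is why a per-family breakdown is worth reporting alongside the headline number.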

Why It’s Interesting

Supervised models (Random Forest, XGBoost) classify known malware families with good accuracy.
Unsupervised clustering (K-Means) can uncover hidden structures and even point to new malware families.
The quality of the features really matters here. Feature engineering and selection made a huge difference.

How to Run

Clone the repo.
Open Malware_analysis.ipynb in Jupyter.
Run the cells to go through preprocessing, training, and evaluation.

⚠️ Note: The raw malware samples and extracted CSVs are not included here for safety reasons. The notebook assumes you already have them.

Results

Random Forest hit ~80% accuracy after tuning.
XGBoost had very similar performance.
K-Means found natural clusters, with one lining up closely to the CoinMiner family.
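
One way to check whether a cluster "lines up" with a family is to cross-tabulate cluster assignments against the labeled subset and look at each cluster's dominant family. The following is a minimal sketch under assumed inputs: the cluster IDs and the `FamilyA`/`FamilyB` labels are invented, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_ids, families):
    """Count how many samples of each family fall into each cluster."""
    table = defaultdict(Counter)
    for cid, fam in zip(cluster_ids, families):
        table[cid][fam] += 1
    return table


def dominant_family(table):
    """Map each cluster to (most common family, share of the cluster)."""
    summary = {}
    for cid, counts in table.items():
        fam, n = counts.most_common(1)[0]
        summary[cid] = (fam, n / sum(counts.values()))
    return summary


# Toy data (hypothetical): cluster IDs from K-Means, labels from the labeled subset.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
families = [
    "CoinMiner", "CoinMiner", "CoinMiner",
    "FamilyA", "FamilyA", "FamilyA", "FamilyB",
    "FamilyB", "FamilyA", "FamilyB",
]

summary = dominant_family(cluster_composition(cluster_ids, families))
for cid in sorted(summary):
    fam, share = summary[cid]
    print(f"cluster {cid}: {fam} ({share:.0%} of cluster)")
```

A cluster whose dominant share is close to 100% corresponds cleanly to one family, while mixed clusters suggest the features do not separate those families well.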

Conclusion

This project shows how machine learning can help in malware research. Known families can be classified with good accuracy, and at the same time, unsupervised methods reveal patterns that might not be obvious. Together, they give a stronger view of what’s hiding in large malware datasets.
