heyitsazar/ml-malware-analysis

Malware Classification with Machine Learning

This project uses machine learning to classify malware into families. The idea is simple: take a large set of malware samples, extract features with YARA (a tool that scans files for patterns defined in rules), and train models to predict which family each sample belongs to.

What’s Inside

  • Data

    • Around 160,000 malware samples were used.
    • Features came from YARA rules, describing things like file sections, data directory sizes, and debug paths.
    • A smaller set of about 2,400 samples carried family labels.
  • Features

    • Processing produced thousands of raw features.
    • Selection cut these down to the 271 that carried most of the signal.
  • Models

    • Random Forest: after tuning, reached about 75% accuracy on the test set.
    • XGBoost: reached similar overall accuracy, though its results differed from Random Forest's on some families.
    • K-Means: worked without labels and still managed to group CoinMiner samples into their own cluster.
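
The feature reduction and Random Forest steps above might look roughly like the following. This is a minimal sketch on synthetic data, assuming scikit-learn and NumPy are installed. The README does not say how the 271 features were chosen or which hyperparameters were tuned, so the importance-based ranking, `make_classification` data, and `N_KEEP` value here are illustrative assumptions, not the notebook's actual method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled subset (the real YARA-derived CSVs are not included).
X, y = make_classification(
    n_samples=600, n_features=200, n_informative=20,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: rank features by importance, using a forest fit on the training split only.
ranker = RandomForestClassifier(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)
N_KEEP = 30  # hypothetical value; the project kept 271 features
top_idx = np.argsort(ranker.feature_importances_)[::-1][:N_KEEP]

# Step 2: retrain on the reduced feature set and score on the held-out split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, top_idx], y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test[:, top_idx]))
print(f"kept {len(top_idx)} features, test accuracy {accuracy:.2f}")
```

Ranking on the training split only keeps the test set out of the selection step, so the reported accuracy is not inflated by leakage. An XGBoost model could be dropped into step 2 in the same way.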

Why It’s Interesting

  • Supervised models (Random Forest, XGBoost) classify known families reasonably well.
  • Unsupervised clustering (K-Means) can uncover hidden structure and may even point to new malware families.
  • The quality of the features really matters here. Feature engineering and selection made a huge difference.
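
As a rough illustration of the unsupervised side, the sketch below clusters unlabeled feature vectors with K-Means. It assumes scikit-learn, and it uses synthetic blobs in place of the real features. The cluster count, the scaling step, and all variable names are assumptions for the example, since the README does not state the settings the notebook used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic, unlabeled stand-in for the YARA-derived feature matrix.
X, _ = make_blobs(n_samples=300, centers=3, n_features=10,
                  cluster_std=1.0, random_state=0)

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=3 is a hypothetical choice; in practice it would be tuned.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Cluster sizes give a first look at how the samples were grouped.
sizes = np.bincount(labels, minlength=3)
print("cluster sizes:", sizes)
```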

How to Run

  1. Clone the repo.
  2. Open Malware_analysis.ipynb in Jupyter.
  3. Run the cells to go through preprocessing, training, and evaluation.

⚠️ Note: The raw malware samples and extracted CSVs are not included here for safety reasons. The notebook assumes you already have them.

Results

  • Random Forest hit ~80% accuracy after tuning.
  • XGBoost had very similar performance.
  • K-Means found natural clusters, with one lining up closely to the CoinMiner family.
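
One way to check whether a cluster "lines up" with a family is to cross-tabulate cluster assignments against the known labels and look at each cluster's dominant family. The sketch below uses only the standard library. The toy assignments, the `cluster_purity` helper, and the `FamilyA`/`FamilyB` placeholder names are made up for illustration and are not results from this project.

```python
from collections import Counter, defaultdict

def cluster_purity(clusters, families):
    """Map each cluster id to (dominant family, share of the cluster it covers)."""
    by_cluster = defaultdict(Counter)
    for cluster_id, family in zip(clusters, families):
        by_cluster[cluster_id][family] += 1
    summary = {}
    for cluster_id, counts in by_cluster.items():
        family, count = counts.most_common(1)[0]
        summary[cluster_id] = (family, count / sum(counts.values()))
    return summary

# Toy data: hypothetical cluster ids and family labels for ten samples.
clusters = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
families = ["CoinMiner", "CoinMiner", "CoinMiner", "FamilyA",
            "FamilyA", "FamilyB", "FamilyA",
            "FamilyB", "FamilyB", "CoinMiner"]

summary = cluster_purity(clusters, families)
for cluster_id in sorted(summary):
    family, share = summary[cluster_id]
    print(f"cluster {cluster_id}: mostly {family} ({share:.0%})")
```

A cluster whose dominant family covers most of its members, as cluster 0 does for CoinMiner in this toy data, is the kind of alignment the result above describes.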

Conclusion

This project shows how machine learning can help in malware research. Known families can be classified with good accuracy, and at the same time, unsupervised methods reveal patterns that might not be obvious. Together, they give a stronger view of what’s hiding in large malware datasets.
