Variational Autoencoder for Anomaly Detection

This repository contains a PyTorch implementation of a Variational Autoencoder (VAE) for anomaly detection in tabular datasets. It includes scripts for training the VAE model and for running inference to detect anomalies.

Table of Contents

  • Overview
  • Features
  • Project Structure
  • Installation
  • Usage
  • How It Works
  • Customization
  • Results
  • Contributing
  • License

Overview

The project implements a Variational Autoencoder in PyTorch with the following capabilities:

  • Training: Supports both initial training and resuming from checkpoints, with a custom progress bar that gives detailed epoch- and batch-level feedback.
  • Inference: Provides functionality to load a pretrained model, preprocess input CSV files, compute anomaly scores, and output latent representations and reconstruction errors.
  • Evaluation and Visualization: Evaluates model performance by calculating reconstruction and anomaly scores, and visualizes training progress and anomaly distributions using Matplotlib.
  • Anomaly Explanation: Generates explanations for detected anomalies by identifying the top contributing features.

Features

  • Deep Learning with PyTorch: Utilizes PyTorch modules and data loaders.
  • Preprocessing: Uses StandardScaler from scikit-learn for data normalization.
  • Checkpointing: Saves model checkpoints and tracks training progress.
  • Visualization: Generates plots for loss curves and anomaly score distributions.
  • Anomaly Detection: Combines reconstruction error and KL divergence for anomaly scoring.

Project Structure

├── load_model.py         # Script to load a pretrained VAE, run inference on a test CSV file, and output results
├── VA_E.py               # Script containing training routines, advanced progress bar, evaluation, and visualization functions
├── training_set.csv      # (Example) CSV file for training data (not included; user provided)
├── validation_set.csv    # (Example) CSV file for validation data (not included; user provided)
├── test_set.csv          # (Example) CSV file for testing data (not included; user provided)
└── vae_hif_detection_model.pt  # Saved model checkpoint output after training

Note: Update or add CSV files and additional resources as required to run the project.

Installation

Using Conda Environment

It is recommended to use Conda to manage the project dependencies. Follow these steps to create the environment:

  1. Install Conda: If you haven't already, download and install Anaconda or Miniconda.

  2. Create the environment: Open your terminal, navigate to your project directory, and run:

    conda create -n vae_env python=3.8
  3. Activate the environment:

    conda activate vae_env
  4. Install dependencies: Install the required packages using conda and pip:

    conda install pytorch torchvision -c pytorch
    conda install pandas numpy scikit-learn matplotlib tqdm

    If any package is not available via conda, use pip:

    pip install package_name

Alternatively, you can create an environment.yml file with the following contents and run conda env create -f environment.yml:

name: vae_env
channels:
  - pytorch
  - defaults
dependencies:
  - python=3.8
  - pytorch
  - torchvision
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - tqdm

Usage

Training the VAE

The VA_E.py script contains the full training pipeline along with advanced progress tracking. It supports both initial training and resuming training from a checkpoint.

To start training, run:

python VA_E.py

During training, the script will display:

  • Device information (CPU or GPU)
  • Real-time training and validation loss updates per batch and per epoch
  • Checkpoint saving (e.g., vae_hif_detection_model.pt)

Running Inference

The load_model.py script allows you to load a pretrained VAE model, process a CSV dataset, generate latent vectors, reconstruct the input, and compute anomaly scores.

To run inference:

  1. Place your pretrained model (e.g., vae_hif_detection_model.pt) and the CSV file (e.g., test_set.csv) in the repository folder.

  2. Adjust the model and CSV paths in load_model.py if needed.

  3. Run the script:

    python load_model.py

The script prints:

  • The number of detected anomalies (if a threshold is set)
  • Shapes of the reconstructed data and latent vectors
  • Mean anomaly score across the dataset

If an output path is provided, the results (including anomaly scores and latent dimensions) are saved to a CSV file (e.g., vae_results.csv).

How It Works

Code Overview

The repository includes two primary scripts:

1. VA_E.py

  • Data Loading and Preprocessing:
    This script reads training, validation, and test CSV files. It then normalizes the data using StandardScaler and converts it into PyTorch tensors for creating datasets and loaders.
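    This step can be sketched as follows. Synthetic NumPy arrays stand in for the user-provided CSV files, and the column count and batch size are assumptions, not values from the repository:

    ```python
    import numpy as np
    import torch
    from sklearn.preprocessing import StandardScaler
    from torch.utils.data import DataLoader, TensorDataset

    # Synthetic stand-in for pd.read_csv("training_set.csv").values
    rng = np.random.default_rng(0)
    train_np = rng.normal(size=(256, 8)).astype(np.float32)
    val_np = rng.normal(size=(64, 8)).astype(np.float32)

    # Fit the scaler on training data only; reuse it for validation/test
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train_np)
    val_scaled = scaler.transform(val_np)

    # Wrap the normalized arrays as tensors and build data loaders
    train_loader = DataLoader(
        TensorDataset(torch.from_numpy(train_scaled).float()),
        batch_size=32, shuffle=True,
    )
    val_loader = DataLoader(
        TensorDataset(torch.from_numpy(val_scaled).float()),
        batch_size=32,
    )
    ```

    Fitting the scaler on the training split alone avoids leaking validation/test statistics into the normalization.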

  • Model Definition:
    The VAE model is defined in the VariationalAutoencoder class, where the encoder creates a latent representation (with mean and log variance), and the decoder attempts to reconstruct the input. The reparameterization trick is implemented to enable backpropagation through the latent space.
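    A minimal sketch of such a class is shown below; the layer sizes and activations are illustrative assumptions, not the repository's actual architecture. The key pieces are the mean/log-variance heads and the reparameterization trick:

    ```python
    import torch
    import torch.nn as nn

    class VariationalAutoencoder(nn.Module):
        """Minimal VAE: encoder -> (mu, logvar) -> reparameterize -> decoder."""

        def __init__(self, input_dim, hidden_dim=64, latent_dim=8):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            self.fc_mu = nn.Linear(hidden_dim, latent_dim)
            self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, input_dim),
            )

        def reparameterize(self, mu, logvar):
            # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu/logvar
            std = torch.exp(0.5 * logvar)
            return mu + std * torch.randn_like(std)

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.fc_mu(h), self.fc_logvar(h)
            z = self.reparameterize(mu, logvar)
            return self.decoder(z), mu, logvar

    # Smoke test: one forward pass on a small random batch
    model = VariationalAutoencoder(input_dim=8)
    recon, mu, logvar = model(torch.randn(4, 8))
    ```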

  • Training Pipeline:
    The training process includes:

    • Forward propagation through the encoder and decoder.
    • Calculation of the reconstruction loss (MSE) and the Kullback–Leibler divergence (to regularize the latent space).
    • Backpropagation and parameter updates using an Adam optimizer.
    • Checkpointing to save the best performing model based on validation loss.
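    The steps above can be condensed into a sketch like the following. `TinyVAE` is a hypothetical stand-in; the real model, hyperparameters, and validation loop live in VA_E.py:

    ```python
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Compact stand-in VAE (the real model lives in VA_E.py)
    class TinyVAE(nn.Module):
        def __init__(self, d=8, h=16, z=4):
            super().__init__()
            self.enc = nn.Linear(d, h)
            self.mu, self.logvar = nn.Linear(h, z), nn.Linear(h, z)
            self.dec = nn.Sequential(nn.Linear(z, h), nn.ReLU(), nn.Linear(h, d))

        def forward(self, x):
            h = torch.relu(self.enc(x))
            mu, logvar = self.mu(h), self.logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            return self.dec(z), mu, logvar

    def vae_loss(x, recon, mu, logvar, beta=1.0):
        mse = nn.functional.mse_loss(recon, x, reduction="mean")
        # KL(q(z|x) || N(0, I)), averaged over the batch
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return mse + beta * kl

    model = TinyVAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.randn(128, 8)
    best_val = float("inf")
    for epoch in range(5):
        opt.zero_grad()
        loss = vae_loss(x, *model(x))
        loss.backward()
        opt.step()
        val_loss = loss.item()  # real code evaluates a separate validation set
        if val_loss < best_val:  # checkpoint the best model so far
            best_val = val_loss
            # torch.save({"model_state": model.state_dict()}, "vae_hif_detection_model.pt")
    ```
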
  • Advanced Progress Bar:
    A custom progress bar is implemented to display detailed information about training progress for each epoch and batch. This includes ETA calculation, batch loss, and overall training progress.

  • Resume Training:
    The code also supports resuming training from a saved checkpoint. When the resume option is enabled, the training continues from the saved state.
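    The resume mechanism can be sketched as below. The checkpoint keys (`model_state`, `optimizer_state`, `epoch`) are assumptions about the saved dictionary's layout, and a plain linear layer stands in for the VAE:

    ```python
    import os
    import tempfile
    import torch
    import torch.nn as nn

    model = nn.Linear(8, 8)  # stands in for the VAE
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Save a checkpoint, as the training loop would at the end of an epoch
    path = os.path.join(tempfile.gettempdir(), "checkpoint_demo.pt")
    torch.save({"model_state": model.state_dict(),
                "optimizer_state": opt.state_dict(),
                "epoch": 12}, path)

    # Resume: restore weights, optimizer state, and the epoch counter
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    opt.load_state_dict(ckpt["optimizer_state"])
    start_epoch = ckpt["epoch"] + 1
    ```

    Restoring the optimizer state as well as the weights matters: Adam keeps per-parameter moment estimates that training should continue from.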

  • Evaluation and Visualization:
    After training, the script computes anomaly scores based on reconstruction errors and KL divergence. It also visualizes:

    • Training and validation loss curves.
    • Distribution of anomaly scores and detection thresholds.
    • Detailed scatter plots to show the anomalies detected in the test set.
  • Anomaly Explanation:
    The script can provide insights by highlighting the top contributing features to the anomaly score for selected samples.
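    One common way to produce such an explanation, sketched here with hypothetical feature names, is to rank features by their per-feature squared reconstruction error:

    ```python
    import numpy as np

    feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical column names
    x = np.array([0.1, 2.0, -0.3, 0.5])       # original (normalized) sample
    x_recon = np.array([0.2, -1.0, -0.2, 0.4])  # model reconstruction

    # Each feature's contribution to the reconstruction error
    contrib = (x - x_recon) ** 2
    top_k = 2
    top_idx = np.argsort(contrib)[::-1][:top_k]
    explanation = [(feature_names[i], float(contrib[i])) for i in top_idx]
    # 'f1' dominates here: its error is (2.0 - (-1.0))^2 = 9.0
    ```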

2. load_model.py

  • Model Loading:
    This script loads a pretrained VAE model checkpoint and adjusts the model architecture based on the saved state.

  • Inference Pipeline:
    It reads a CSV file, applies the same normalization as in training, and uses the model to reconstruct the input data. It computes the reconstruction error and KL divergence to produce an overall anomaly score.
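    The scoring step can be sketched as a per-sample combination of the two terms. The exact weighting in load_model.py may differ; `beta` here is an assumed scaling factor:

    ```python
    import torch

    # Per-sample anomaly score: reconstruction error plus KL divergence
    def anomaly_scores(x, recon, mu, logvar, beta=1.0):
        recon_err = torch.mean((x - recon) ** 2, dim=1)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        return recon_err + beta * kl

    # With a perfect reconstruction and a standard-normal posterior
    # (mu = 0, logvar = 0), both terms vanish and the score is zero.
    x = torch.zeros(3, 4)
    recon = torch.zeros(3, 4)
    mu = torch.zeros(3, 2)
    logvar = torch.zeros(3, 2)
    scores = anomaly_scores(x, recon, mu, logvar)
    ```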

  • Result Saving:
    If an output path is provided, the script saves the results (including anomaly scores and latent vector dimensions) into a CSV file. It also prints key statistics like the percentage of anomalies detected.

Working Flow

  1. Initialization:
    Load data, create data loaders, and initialize the VAE model.

  2. Training:
    The training loop updates model weights, evaluates on validation data, and uses advanced progress reporting.

  3. Checkpointing:
    The best performing model (with the lowest validation loss) is saved and used for inference.

  4. Inference and Evaluation:
    The pretrained model is loaded to calculate anomaly scores. Anomalies are determined based on a predefined threshold computed from the training scores.

  5. Visualization and Explanation:
    Results are visualized using plots and further analyzed to explain anomalies by determining the feature-wise contribution to reconstruction errors.
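Step 4, determining anomalies from a threshold computed on training scores, can be sketched as follows. The percentile choice and the exponential stand-in distribution are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for the anomaly scores computed on the training set
train_scores = rng.exponential(scale=1.0, size=1000)

# A common choice: flag anything above the 95th percentile of training scores
threshold = np.percentile(train_scores, 95)
test_scores = np.array([0.1, 0.5, threshold + 1.0])
is_anomaly = test_scores > threshold
```

Because the threshold comes from the training distribution, roughly 5% of in-distribution samples are expected to be flagged; the percentile trades off false positives against sensitivity.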

Customization

  • Hyperparameters:
    Adjust parameters such as hidden_dim, latent_dim, dropout_rate, learning rate, and beta factor (scaling the KL divergence) directly in the scripts.

  • Data Preprocessing:
    Modify the CSV loading or normalization in load_model.py or VA_E.py to suit your dataset.

  • Model Architecture:
    Change the layers or activation functions in the VariationalAutoencoder class to experiment with different architectures.

Results

The project produces several outputs:

  • Model Checkpoints:
    Saved as vae_hif_detection_model.pt during training.

  • CSV Results:
    Includes anomaly scores, latent representations, and reconstructed outputs.

  • Visualizations:
    Generated plots (e.g., vae_hif_detection_results.png) illustrate loss curves and the anomaly score distribution.

Contributing

Contributions are welcome! Please create issues or submit pull requests for improvements, bug fixes, or additional features.

License

This project is licensed under the MIT License. See the LICENSE file for details.
