Log Classification System

A multi-method log classification system that categorizes log messages from different sources using regex patterns, BERT embeddings, and LLM-based classification.

Overview

This system classifies log messages from various sources (ModernCRM, BillingSystem, AnalyticsEngine, ModernHR, LegacyCRM, etc.) into categories such as:

User Action
System Notification
Error
Workflow Error
Deprecation Warning
Unclassified

The classification strategy varies based on the log source:

LegacyCRM logs are classified using an LLM (Groq's deepseek-r1-distill-llama-70b)
Other logs are first attempted with regex patterns
If regex fails, a BERT-based classifier is used as a fallback

System Architecture

The system consists of:

A FastAPI server for handling classification requests
Multiple classification processors:
- Regex-based classifier for pattern matching
- BERT-based classifier using sentence transformers and a pre-trained logistic regression model
- LLM-based classifier using Groq's API
A main classifier that orchestrates the different classification methods

Installation

Clone the repository:

git clone https://github.com/hussin-sobhy/log_classification_system
cd log_classification_system

Install dependencies:

pip install -r requirements.txt

Environment Setup

Create a .env file in the project root with your Groq API key:

GROQ_API_KEY=your-groq-api-key

You can obtain a Groq API key by signing up at https://console.groq.com/.

Usage

Running the API Server

Start the FastAPI server from the project root directory:

uvicorn app.server:app --reload

The server will be available at http://localhost:8000.

API Endpoints

POST /classify/

Accepts a CSV file with 'source' and 'log_message' columns and returns the same data with an additional 'target_label' column containing the classification results.

Example using curl:

curl -X POST -F "file=@data/test.csv" http://localhost:8000/classify/ -o classified_logs.csv

Command Line Classification

You can also classify logs directly from the command line:

python -c "from classifications.classifier import classify_csv; classify_csv('data/test.csv')"

This will process the logs in data/test.csv and save the results to data/output.csv.

Testing

Use the provided test data:

python -c "from classifications.classifier import classify_csv; classify_csv('data/test.csv')"

Check the output file at data/output.csv to see the classification results.

You can also test individual processors:

Test regex processor:

python -m classifications.processor_regex

Test BERT processor:

python -m classifications.processor_bert

Test LLM processor:

python -m classifications.processor_llm

Project Structure

/log_classification_system/
├── .env                  # Environment variables (contains Groq API key)
├── .gitignore
├── requirements.txt      # Dependencies
├── app/
│   └── server.py         # FastAPI server for log classification
├── classifications/
│   ├── classifier.py     # Main classifier that orchestrates different methods
│   ├── processor_bert.py # BERT-based classification
│   ├── processor_llm.py  # LLM-based classification using Groq
│   └── processor_regex.py # Regex-based classification
├── data/
│   ├── output.csv        # Output file with classification results
│   ├── synthetic_logs.csv
│   └── test.csv          # Test data
├── saved_models/
│   └── log_classifier_logistic.joblib # Pre-trained model
└── training/
    └── training.ipynb    # Jupyter notebook for model training

Classification Methods

Regex Classification

Uses regular expression patterns to match common log formats and categorize them. This is the fastest method but limited to predefined patterns.

BERT Classification

Uses the all-MiniLM-L6-v2 sentence transformer to convert log messages into embeddings, which are then classified using a pre-trained logistic regression model. This method is more flexible than regex but requires more computational resources.

LLM Classification

Uses Groq's deepseek-r1-distill-llama-70b model to classify logs from LegacyCRM. This method is specifically designed for complex logs that require deeper understanding and context.

Training

The BERT classifier uses a pre-trained logistic regression model. To retrain this model:

Open and run the Jupyter notebook:

jupyter notebook training/training.ipynb

Follow the instructions in the notebook to train and save a new model.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Log Classification System

Overview

System Architecture

Installation

Environment Setup

Usage

Running the API Server

API Endpoints

POST /classify/

Command Line Classification

Testing

Project Structure

Classification Methods

Regex Classification

BERT Classification

LLM Classification

Training

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
app		app
classifications		classifications
data		data
notebooks		notebooks
saved_models		saved_models
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Log Classification System

Overview

System Architecture

Installation

Environment Setup

Usage

Running the API Server

API Endpoints

POST /classify/

Command Line Classification

Testing

Project Structure

Classification Methods

Regex Classification

BERT Classification

LLM Classification

Training

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages