This repository contains a machine learning system that classifies job titles into predefined Persona Segments. The system combines keyword-based rules with a machine learning model built using Python and scikit-learn.
When classifying a job title, the system prioritizes assignments based on this order:
- GenAI
- Engineering
- Product
- Cyber Security
- Trust & Safety
- Legal & Compliance
- Executive
If a job title could fall into multiple categories, the highest-priority category is selected.
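The tie-breaking logic can be sketched as follows (an illustrative helper, not the repository's actual code; `resolve_segment` is a hypothetical name):

```python
# Persona segments in priority order, as listed above.
PRIORITY_ORDER = [
    "GenAI",
    "Engineering",
    "Product",
    "Cyber Security",
    "Trust & Safety",
    "Legal & Compliance",
    "Executive",
]

def resolve_segment(candidates):
    """Return the highest-priority segment among the candidate matches."""
    for segment in PRIORITY_ORDER:
        if segment in candidates:
            return segment
    return "Not Classified"
```

For example, a title that plausibly matches both Executive and GenAI (such as "Chief AI Officer") resolves to GenAI.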
```bash
git clone <repo-url>
cd <repo-name>
```

Ensure you have Python installed (version 3.7 or later recommended; tested with 3.8-3.11). Then install the required dependencies:

```bash
pip install -r requirements.txt
```

Check that all required files and directories are in place:

```bash
make check
```

```bash
# Run the complete pipeline (train + predict)
make all

# Or run steps individually:
make train    # Train the model
make predict  # Run predictions
```

To classify job titles, follow these steps:
Your input file must contain two columns (case-insensitive):
| Column Name | Description |
|---|---|
| Record ID | A unique identifier (e.g., from HubSpot) |
| Job Title | The job title to be classified |
Important: Save the file with UTF-8 encoding to support international characters.
Example `data/input.csv`:

```csv
Record ID,Job Title
37462838462827,AI Engineer
82736482736473,Senior Software Developer
ProductLead123,Product Lead
```

Execute the following command:

```bash
make predict
```

The output file will be `tagged_personas.csv`, containing:
| Column Name | Description |
|---|---|
| Record ID | Same as input |
| Job Title | Standardized job title (if applicable) |
| Persona Segment | Assigned category ("Not Classified" if confidence < 50%) |
| Confidence Score | Model confidence (0-100, all scores rounded to nearest 5) |
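The rounding and thresholding behavior can be sketched like this (illustrative only; `to_confidence_score` is a hypothetical name, not the repository's API):

```python
def to_confidence_score(probability, threshold=50):
    """Convert a model probability (0.0-1.0) to a 0-100 score rounded
    to the nearest 5, and report whether it clears the threshold."""
    score = int(round(probability * 100 / 5) * 5)
    return score, score >= threshold
```

A probability of 0.87 becomes a score of 85 and is assigned; 0.42 becomes 40 and is marked "Not Classified".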
Example `tagged_personas.csv` output:

```csv
Record ID,Job Title,Persona Segment,Confidence Score
37462838462827,AI Engineer,GenAI,90
82736482736473,Senior Software Developer,Engineering,85
ProductLead123,Product Lead,Product,80
```

Your training data must contain two columns (case-insensitive):
| Column Name | Description |
|---|---|
| Job Title | The job title text |
| Persona Segment | Correct category label |
Requirements:
- Minimum 10 total samples
- At least 2 different persona segments
- Save with UTF-8 encoding
- For best results, include 10+ samples per persona
Example `data/training_data.csv`:

```csv
Job Title,Persona Segment
Sr. Product Manager,Product
Lead AI Researcher,GenAI
VP of Legal,Legal & Compliance
Senior Software Engineer,Engineering
Trust & Safety Specialist,Trust & Safety
Security Architect,Cyber Security
CTO,Executive
```

Execute the following command:

```bash
make train
```

The training process provides detailed metrics:
```text
=== Data Quality Report ===
Persona Segment Distribution:
  Engineering: 523 samples (32.1%)
  Product: 387 samples (23.8%)
  GenAI: 201 samples (12.3%)
  ...

=== Model Evaluation ===
Classification Report:
              precision    recall  f1-score   support
 Engineering       0.92      0.89      0.90       105
     Product       0.88      0.91      0.89        78
...

Cross-validation scores (5-fold):
Mean accuracy: 0.876 (+/- 0.042)
Test set accuracy: 0.883

✅ Model training completed successfully!
```
The system also saves metadata to `model/model_metadata.txt` with training details.
For precise control, you can define keyword rules that take priority over ML predictions.
| Column Name | Description |
|---|---|
| Keyword | Text to match (case-insensitive) |
| Rule | Either `contains` or `equals` |
| Persona Segment | Segment to assign |
| Exclude Keyword | Optional: exclude if this text is present |
Example `data/keyword_matching.csv`:

```csv
Keyword,Rule,Persona Segment,Exclude Keyword
chief executive,contains,Executive,
ai,contains,GenAI,
engineer,contains,Engineering,sales
product manager,equals,Product,
```

Notes:
- Keyword matches receive 100% confidence and override ML predictions
- Keywords are matched against standardized job titles (after title standardization is applied)
- Invalid rule types are skipped with a warning
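The rule semantics above might be implemented roughly like this (a sketch, not the repository's actual code; the rule tuple layout is an assumption):

```python
def apply_keyword_rules(title, rules):
    """Return the segment of the first matching rule, or None.

    Each rule is (keyword, rule_type, segment, exclude_keyword).
    Matching is case-insensitive; if the exclude keyword is present
    in the title, that rule is skipped.
    """
    t = title.lower()
    for keyword, rule_type, segment, exclude in rules:
        if exclude and exclude.lower() in t:
            continue  # exclusion suppresses this rule only
        if rule_type == "contains" and keyword.lower() in t:
            return segment
        if rule_type == "equals" and t == keyword.lower():
            return segment
        # Any other rule_type is treated as invalid and skipped
        # (the real pipeline logs a warning).
    return None
```

With the example rules above, "Sales Engineer" matches no rule (the `engineer` rule is suppressed by its `sales` exclusion) and falls through to the ML model.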
Standardize job title variations before classification.
| Column Name | Description |
|---|---|
| Reference | Original/variant job title |
| Standardization | Standardized form |
Example `data/title_reference.csv`:

```csv
Reference,Standardization
Sr. PM,Senior Product Manager
ML Eng,Machine Learning Engineer
VP T&S,Vice President of Trust & Safety
CEO,Chief Executive Officer
eng,Engineer
```

Note: Standardization is case-insensitive and applied during both training and prediction.
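The case-insensitive lookup can be sketched as follows (`lookup_standard_title` is a hypothetical helper, not the module's actual API):

```python
def lookup_standard_title(title, reference):
    """Map a title to its standardized form, case-insensitively.
    Titles with no reference entry pass through unchanged."""
    lookup = {ref.lower(): std for ref, std in reference.items()}
    return lookup.get(title.strip().lower(), title)
```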
The system applies a multi-layered classification approach in the following order:

1. **Title Standardization** (if `title_reference.csv` exists)
   - Applied first to normalize job title variations
   - Case-insensitive lookup
   - Consistent with the training pipeline to prevent data leakage
2. **Keyword Matching** (if `keyword_matching.csv` exists)
   - Applied to standardized job titles
   - Receives 100% confidence and overrides ML predictions
   - Case-insensitive matching
3. **ML Classification**
   - Uses TF-IDF features with n-grams (1-3), max 5000 features, min_df=2
   - English stop words removed automatically
   - Logistic Regression with max_iter=1000, class_weight='balanced'
   - Generates probability scores for each persona segment
   - Priority enforcement applied when model confidence < 70%
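With the parameters listed above, the model can be assembled along these lines (a minimal scikit-learn sketch with toy data, not the repository's training script):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; the real pipeline reads data/training_data.csv.
titles = [
    "software engineer", "senior software engineer", "backend software engineer",
    "product manager", "senior product manager", "group product manager",
]
labels = ["Engineering"] * 3 + ["Product"] * 3

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), max_features=5000,
                    min_df=2, stop_words="english"),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
model.fit(titles, labels)

# predict_proba yields the per-segment probabilities that become
# the confidence scores in the output file.
probs = model.predict_proba(["staff software engineer"])[0]
print(dict(zip(model.classes_, probs.round(2))))
```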
- Confidence Threshold: 50% (predictions below this are marked as "Not Classified")
- Priority Enforcement: Applied when model confidence < 70%
- Fuzzy Matching: Available but disabled by default
- Max Title Length: 500 characters (longer titles are truncated)
- Duplicate Record IDs: By default, keeps first occurrence
- Character Encoding: UTF-8 for all CSV files
You can override default settings using environment variables:
```bash
# Set confidence threshold to 60%
export PC_CONFIDENCE_THRESHOLD=60

# Change duplicate handling to keep last occurrence
export PC_DUPLICATE_HANDLING=keep_last

# Adjust priority threshold
export PC_PRIORITY_THRESHOLD=0.8

# Change maximum title length
export PC_MAX_TITLE_LENGTH=300

# Run prediction with custom settings
make predict
```

Available environment variables:

- `PC_CONFIDENCE_THRESHOLD`: Minimum confidence score for assignment, 0-100 (default: 50)
- `PC_DUPLICATE_HANDLING`: How to handle duplicate Record IDs (default: keep_first)
  - `keep_first`: Keeps first occurrence, removes duplicates
  - `keep_last`: Keeps last occurrence, removes duplicates
  - `keep_all`: Keeps all duplicates (may result in multiple rows with the same ID)
- `PC_PRIORITY_THRESHOLD`: Confidence threshold for priority enforcement, 0.0-1.0 (default: 0.7)
- `PC_SIMILARITY_RANGE`: Range for considering similar probabilities, 0.0-1.0 (default: 0.1)
- `PC_MAX_TITLE_LENGTH`: Maximum job title length in characters, 10-10000 (default: 500)
- `PC_TEST_SIZE`: Train/test split ratio, 0.1-0.5 (default: 0.2)
- `PC_MAX_FEATURES`: TF-IDF maximum features, minimum 100 (default: 5000)
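Reading these overrides could look roughly like this (a sketch; the repository's actual parsing and validation may differ, and `read_config` is a hypothetical name):

```python
import os

def read_config(env=None):
    """Read PC_* overrides, falling back to defaults and clamping
    numeric values into their documented ranges."""
    if env is None:
        env = os.environ

    def as_float(name, default, lo, hi):
        try:
            value = float(env.get(name, default))
        except ValueError:
            return float(default)  # ignore malformed overrides
        return min(max(value, lo), hi)

    return {
        "confidence_threshold": as_float("PC_CONFIDENCE_THRESHOLD", 50, 0, 100),
        "priority_threshold": as_float("PC_PRIORITY_THRESHOLD", 0.7, 0.0, 1.0),
        "max_title_length": int(as_float("PC_MAX_TITLE_LENGTH", 500, 10, 10000)),
        "duplicate_handling": env.get("PC_DUPLICATE_HANDLING", "keep_first"),
    }
```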
```bash
make help      # Show all available commands
make train     # Train the model
make predict   # Run predictions
make all       # Run full pipeline (train + predict)
make check     # Verify system setup
make validate  # Validate format of input files
make clean     # Remove generated files
make retrain   # Force model retraining
make test      # Quick test using training data
```

```text
project-root/
├── model/
│   ├── persona_classifier.pkl   # Trained model (generated)
│   └── model_metadata.txt       # Training metadata (generated)
├── data/
│   ├── input.csv                # Input file for predictions
│   ├── training_data.csv        # Training data
│   ├── keyword_matching.csv     # Optional: keyword rules
│   └── title_reference.csv      # Optional: title standardization
├── scripts/
│   ├── predict.py               # Prediction script
│   ├── train_model.py           # Training script
│   └── title_standardizer.py    # Standardization module
├── tagged_personas.csv          # Output file (generated)
├── Makefile                     # Build automation
├── requirements.txt             # Python dependencies
├── README.md                    # This file
└── LICENSE                      # License file
```
The system provides comprehensive performance metrics during training:
- Classification Report: Precision, recall, and F1-score per persona
- Cross-validation: 5-fold CV with mean accuracy and standard deviation
- Test Set Accuracy: Hold-out test performance
- Data Quality Checks: Class distribution, imbalance warnings, duplicate detection
| Issue | Cause | Solution |
|---|---|---|
| ❌ Model file not found | Model hasn't been trained | Run `make train` first |
| ❌ Training file not found | Missing training data | Ensure `data/training_data.csv` exists |
| ❌ Input file not found | Missing input data | Create `data/input.csv` with required columns |
| ❌ Failed to load model file | Corrupted model file | Delete the model file and retrain with `make train` |
| ❌ Insufficient training data | Too few training samples | Need at least 10 samples total |
| ❌ Training data contains only one persona | Single class in training | Add samples from at least one other persona |
| Invalid persona segments in training data | Typo in persona names | Check spelling matches valid personas exactly |
| Found X rows with duplicate Record IDs | Non-unique identifiers | Review input file; by default keeps first occurrence (configurable via `PC_DUPLICATE_HANDLING`) |
| Cannot use stratified split | Too few samples in some classes | Add more training examples (need 10+ per persona) |
| Skipping cross-validation | Very small training set | Add more training data for reliable validation |
| Low confidence scores | Insufficient training data | Add more diverse examples to training data |
| Wrong classifications | Model needs retraining | Update training data and run `make retrain` |
| High class imbalance detected | Uneven persona distribution | Add more samples for underrepresented personas |
| Unicode/encoding errors | Non-UTF-8 characters in CSV | Ensure all CSV files are saved with UTF-8 encoding |
Run `make validate` to check your input files for common issues:
- File existence and readability
- Column names and counts
- Number of rows in each file
- Basic data format validation
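A minimal version of the column check could look like this (illustrative only; `make validate` performs more thorough checks, and `check_columns` is a hypothetical name):

```python
import csv

def check_columns(path, required=("record id", "job title")):
    """Return the required columns missing from a CSV header,
    comparing case-insensitively. An empty list means the check passes."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    present = {col.strip().lower() for col in header}
    return [col for col in required if col not in present]
```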
The system provides detailed logging during execution:
- INFO: Normal operations and statistics
- WARNING: Potential issues (e.g., missing files, imbalanced data)
- ERROR: Fatal errors that stop execution
Enable approximate title matching in `title_standardizer.py`:

```python
standardized_title = standardize_title(title, fuzzy=True, similarity_threshold=0.8)
```

After training, check `model/model_metadata.txt` for:
- Training date and time
- Dataset statistics
- Model parameters
- Performance metrics
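One way to implement the fuzzy matching option described above is ratio-based similarity from the standard library's `difflib` (a sketch under that assumption; the module's actual algorithm may differ, and `fuzzy_standardize` is a hypothetical name):

```python
from difflib import SequenceMatcher

def fuzzy_standardize(title, reference, similarity_threshold=0.8):
    """Map a title to the standardization of its closest reference
    entry, if the similarity ratio clears the threshold; otherwise
    return the title unchanged."""
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best = max(reference, key=lambda ref: ratio(title, ref), default=None)
    if best is not None and ratio(title, best) >= similarity_threshold:
        return reference[best]
    return title
```

For example, "Sr PM" (missing the period) would still map to "Senior Product Manager" at the default 0.8 threshold.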
This project is licensed under the MIT License.