A web application that estimates whether text content was written by a human or generated by an AI model. The system combines three detection techniques: stylometric analysis, perplexity scoring, and machine learning classification.
- Combined Analysis: Integrates all methods for highest accuracy
- Stylometric Analysis: Analyzes writing style patterns and linguistic features
- Perplexity Analysis: Measures text complexity and predictability
- ML Classification: Machine learning-based detection using trained models
- Real-time Processing: Instant analysis with detailed results
- Confidence Scoring: Probability-based predictions with confidence levels
- Feature Breakdown: Detailed analysis of text characteristics
- Visual Analytics: Interactive charts and graphs for result visualization
- Text Files (.txt): Direct text analysis
- Word Documents (.docx): Extract and analyze content
- PDF Files (.pdf): Extract text from PDF documents
- Direct Input: Paste text directly for analysis
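The supported formats above can be reduced to plain text with a small dispatcher. This is a minimal sketch, not the application's actual loader: the function name `extract_text` is hypothetical, and it assumes `python-docx` and a PyPDF2 release that provides `PdfReader` (2.x or later) are installed for the .docx and .pdf branches.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Return the plain text of a .txt, .docx, or .pdf file (hypothetical helper)."""
    suffix = Path(path).suffix.lower()

    if suffix == ".txt":
        # Replace undecodable bytes instead of failing on mixed encodings.
        return Path(path).read_text(encoding="utf-8", errors="replace")

    if suffix == ".docx":
        import docx  # provided by the python-docx package

        document = docx.Document(path)
        return "\n".join(paragraph.text for paragraph in document.paragraphs)

    if suffix == ".pdf":
        from PyPDF2 import PdfReader  # assumes PyPDF2 2.x or later

        reader = PdfReader(path)
        # extract_text() may return None or "" for image-only pages.
        return "\n".join((page.extract_text() or "") for page in reader.pages)

    raise ValueError(f"Unsupported file type: {suffix or path}")
```

The third-party imports are deferred to the branches that need them, so plain-text analysis still works if a document library is missing.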
- Responsive Design: Works on desktop and mobile devices
- Interactive Charts: Plotly-powered visualizations
- Real-time Updates: Live analysis with progress indicators
- Professional UI: Clean, modern interface with intuitive navigation
- Clone the repository:
    git clone https://github.com/yourusername/ai-content-detector-pro.git
    cd ai-content-detector-pro
- Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
    pip install -r requirements.txt
- Start the application:
    streamlit run app.py
- Build image
    docker build -t ai-content-detector .
- Run container
    docker run --rm -p 8501:8501 ai-content-detector
- Open app
    http://localhost:8501
The container already includes all Python and NLTK dependencies, so anyone running it only needs Docker installed.
- Open your browser and navigate to the provided URL (typically http://localhost:8501)
- Choose your analysis method:
  - Combined Analysis (Recommended): Uses all methods for best accuracy
  - Stylometric Analysis: Focus on writing style patterns
  - Perplexity Analysis: Analyze text complexity
  - ML Classification: Machine learning-based detection
- Input your content:
  - Upload a document (.txt, .docx, .pdf)
  - Or paste text directly into the input area
- Analyze and view results:
  - Get probability scores for human vs AI origin
  - View detailed feature breakdowns
  - Explore interactive visualizations
Analyzes writing style characteristics including:
- Vocabulary Richness: Measures diversity of word usage
- Sentence Length Distribution: Analyzes sentence structure patterns
- Word Frequency Analysis: Identifies repetitive patterns
- Punctuation Usage: Examines punctuation patterns
- Capitalization Patterns: Analyzes capitalization frequency
- Word Length Variance: Measures variation in word lengths
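To make the characteristics listed above concrete, here is a minimal sketch that computes one simple statistic for each. The function name, the feature keys, and the regex-based tokenization are illustrative assumptions; the application itself uses NLTK and may define these features differently.

```python
import re
import statistics
from collections import Counter


def stylometric_features(text: str) -> dict:
    """Compute simple style statistics (illustrative, not the app's exact features)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    lowered = [w.lower() for w in words]
    sentence_lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    word_lengths = [len(w) for w in words]
    most_common_count = Counter(lowered).most_common(1)[0][1]

    return {
        # Type-token ratio: unique words divided by total words.
        "vocabulary_richness": len(set(lowered)) / len(lowered),
        # Sentence length distribution, summarized by mean and spread.
        "mean_sentence_length": statistics.mean(sentence_lengths),
        "sentence_length_stdev": statistics.pstdev(sentence_lengths),
        # Share of the text taken by its single most frequent word.
        "top_word_share": most_common_count / len(lowered),
        # Punctuation marks per word.
        "punctuation_per_word": len(re.findall(r"[.,;:!?]", text)) / len(words),
        # Fraction of words that start with a capital letter.
        "capitalized_word_ratio": sum(w[0].isupper() for w in words) / len(words),
        # Population variance of word lengths in characters.
        "word_length_variance": statistics.pvariance(word_lengths),
    }
```

Calling `stylometric_features("The cat sat. The dog ran fast!")` returns a dictionary of floats that a downstream scorer could compare against typical human and AI ranges.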
Measures how "surprised" a language model is by the text:
- Higher Perplexity: Suggests human-written content (more unpredictable)
- Lower Perplexity: Suggests AI-generated content (more predictable)
- Statistical Modeling: Uses probability distributions to assess text complexity
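Perplexity is the exponential of the average negative log-probability a model assigns to each token. The sketch below illustrates the idea with a unigram model and add-one smoothing; this is a deliberately simple stand-in, and the function names and the choice of a unigram model are assumptions rather than a description of the application's scoring.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text: str, reference_corpus: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model
    estimated from `reference_corpus` (illustrative stand-in model)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no tokens")

    ref_counts = Counter(tokenize(reference_corpus))
    total = sum(ref_counts.values())
    # Vocabulary covers the reference words plus any new words in the text,
    # so the smoothed probabilities sum to 1 over that vocabulary.
    vocab_size = len(set(ref_counts) | set(tokens))

    log_prob = 0.0
    for token in tokens:
        probability = (ref_counts[token] + 1) / (total + vocab_size)
        log_prob += math.log(probability)

    # exp of the average negative log-probability per token.
    return math.exp(-log_prob / len(tokens))
```

With the reference corpus "the cat sat on the mat", the text "the cat" scores lower perplexity than "purple elephant", because the model has seen its words before.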
Uses trained models to classify content:
- TF-IDF Vectorization: Converts text to numerical features
- Random Forest Classifier: Ensemble learning for robust predictions
- Synthetic Training Data: Generated human and AI text samples
- Probability Scoring: Provides confidence levels for predictions
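The pipeline described above could be assembled in scikit-learn roughly as follows. This is a sketch under stated assumptions: the sample texts are placeholders invented for illustration, the label convention (0 = human, 1 = AI) and the hyperparameters are assumptions, and the application's own training data and settings may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder training samples (hypothetical). Labels: 0 = human, 1 = AI.
texts = [
    "honestly i wasn't sure what to make of it, but hey, it worked out",
    "we got lost twice and my phone died, classic road trip stuff",
    "ugh, the meeting ran long again and nobody took notes",
    "In conclusion, it is important to consider multiple perspectives.",
    "Furthermore, this approach offers several key benefits and advantages.",
    "Overall, the results demonstrate the importance of careful planning.",
]
labels = [0, 0, 0, 1, 1, 1]

# TF-IDF turns each text into a numeric vector; the forest classifies it.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

# predict_proba returns one column per class, ordered as in model.classes_.
proba = model.predict_proba(["Moreover, it is important to note the benefits."])[0]
human_probability, ai_probability = proba
print(f"human={human_probability:.2f} ai={ai_probability:.2f}")
```

Six samples are far too few for a usable classifier; the point is only to show how vectorization, training, and probability scoring fit together.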
Integrates all three methods for maximum accuracy:
- Weighted Combination: Balances different analysis methods
- Cross-Validation: Reduces false positives and negatives
- Robust Detection: Handles sophisticated AI-generated text
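One common way to realize a weighted combination is a weighted average of the per-method AI probabilities. The sketch below assumes each method outputs a probability in [0, 1]; the function name, method keys, and weights are hypothetical and are not taken from the application.

```python
def combine_scores(scores: dict, weights: dict) -> dict:
    """Weighted average of per-method AI probabilities (hypothetical helper)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same methods")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    # Normalizing by the total lets callers pass weights that do not sum to 1.
    ai_probability = sum(scores[m] * weights[m] for m in scores) / total_weight
    return {"ai": ai_probability, "human": 1.0 - ai_probability}


# Example with assumed weights: the ML classifier counts slightly more.
result = combine_scores(
    scores={"stylometric": 0.6, "perplexity": 0.8, "ml": 0.7},
    weights={"stylometric": 0.3, "perplexity": 0.3, "ml": 0.4},
)
print(result)
```

A disagreement between methods pulls the combined score toward 0.5, which is one way a combined analysis can express mixed signals.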
- High Confidence (>70%): Strong indication of content origin
- Medium Confidence (50-70%): Mixed signals, consider additional context
- Low Confidence (<50%): Uncertain results, manual review recommended
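The three bands above map directly onto a small lookup function. This sketch uses a hypothetical function name and assumes that the boundary values 50% and 70% both fall in the medium band, as the ranges above imply.

```python
def confidence_level(confidence_percent: float) -> str:
    """Map a confidence percentage to the High/Medium/Low bands listed above."""
    if not 0 <= confidence_percent <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if confidence_percent > 70:
        return "High"
    if confidence_percent >= 50:
        return "Medium"
    return "Low"
```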
- Pie Charts: Show probability distribution
- Gauge Charts: Display perplexity scores
- Bar Charts: Feature breakdown analysis
- Color Coding: Green for human, red for AI indicators
- Human-Written Probability: Percentage indicating human authorship
- AI-Generated Probability: Percentage indicating AI generation
- Confidence Level: Overall reliability of the analysis
- Feature Scores: Individual characteristic measurements
- Frontend: Streamlit web application
- Backend: Python-based analysis engine
- ML Pipeline: Scikit-learn for classification
- Text Processing: NLTK for natural language processing
- Visualization: Plotly for interactive charts
- streamlit: Web application framework
- numpy: Numerical computing
- pandas: Data manipulation
- scikit-learn: Machine learning algorithms
- plotly: Interactive visualizations
- nltk: Natural language processing
- python-docx: Word document processing
- PyPDF2: PDF text extraction
The system automatically trains a machine learning model using:
- Synthetic Human Text: Generated samples with natural variations
- Synthetic AI Text: Generated samples with AI-like patterns
- Feature Engineering: TF-IDF vectorization of text
- Model Persistence: Saves trained models for reuse
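Training on first run and reusing the saved model afterwards follows a load-or-train pattern. The sketch below uses the standard library's `pickle`; the function name, the `train_fn` callback, and the file path are hypothetical, and the application may persist its models with a different mechanism.

```python
import pickle
from pathlib import Path


def load_or_train(model_path: str, train_fn):
    """Return the saved model if one exists; otherwise train, save, and return it.

    `train_fn` is a hypothetical zero-argument callable that builds the model.
    """
    path = Path(model_path)
    if path.exists():
        # Only unpickle files you created yourself: pickle can execute code.
        with path.open("rb") as f:
            return pickle.load(f)

    model = train_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)
    return model
```

The first call pays the training cost and later calls only read the file, which matches the faster subsequent runs the list describes.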
- Detection Method: Choose analysis approach
- Detailed Explanation: Toggle detailed feature breakdown
- Feature Breakdown: Show individual feature scores
- Automatic Training: Models train on first run
- Model Persistence: Trained models are saved locally
- Model Status: Real-time model availability indicators
- Combined Analysis: Highest accuracy across all methods
- Cross-Validation: Robust against different text types
- False Positive Reduction: Minimizes incorrect AI detections
- Real-time Analysis: Instant results for most text lengths
- Optimized Processing: Efficient algorithms for large documents
- Caching: Model persistence for faster subsequent runs
We welcome contributions! Here's how you can help:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- Enhanced ML Models: Better training data and algorithms
- Additional Features: More detection methods
- UI Improvements: Better user experience
- Performance Optimization: Faster processing
- Documentation: More detailed guides
This project is licensed under the MIT License - see the LICENSE file for details.
- Streamlit: For the excellent web application framework
- Scikit-learn: For machine learning capabilities
- NLTK: For natural language processing tools
- Plotly: For interactive visualizations
- Open Source Community: For the libraries that make this possible
If you encounter any issues or have questions:
- GitHub Issues: Report bugs and feature requests
- Documentation: Check this README for usage instructions
- Community: Join discussions in the repository
AI Content Detector Pro - Advanced AI-generated content detection using multiple analysis techniques