A comprehensive educational framework for analyzing PDF files for malicious content
This is an EDUCATIONAL PROJECT ONLY
This framework is designed for:
- ✅ Security researchers learning about PDF malware analysis
- ✅ Educational purposes in academic environments
- ✅ Training and skill development
- ✅ Understanding PDF structure and malware techniques
NOT INTENDED FOR:
- ❌ Production use without proper security review
- ❌ Analyzing malware in production environments
- ❌ Bypassing security controls
- ❌ Any illegal or malicious activities
All API keys in this project are placeholders (<YOUR_API_KEY_HERE>, <API_KEY_PLACEHOLDER>, etc.). These are NOT valid API keys. You must:
- Register for your own API keys from respective services
- Replace all placeholder text with your actual API keys
- Never commit real API keys to version control
- Use environment variables for production deployments
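As a minimal sketch of the environment-variable approach listed above, the helper below reads a key from the environment and rejects the placeholder strings. The function name `load_api_key` and the variable name `VIRUSTOTAL_API_KEY` are illustrative assumptions, not part of this project's code.

```python
import os
from typing import Optional

# Placeholder strings quoted in the warning above; a real key never looks like these.
PLACEHOLDER_MARKERS = ("<YOUR_API_KEY_HERE>", "<API_KEY_PLACEHOLDER>")


def load_api_key(env_var: str) -> Optional[str]:
    """Return the key stored in env_var, or None if it is unset, empty, or a placeholder.

    Hypothetical helper for illustration only.
    """
    value = os.environ.get(env_var, "").strip()
    if not value or value in PLACEHOLDER_MARKERS:
        return None
    return value


if __name__ == "__main__":
    # VIRUSTOTAL_API_KEY is an assumed variable name; use whatever your deployment defines.
    key = load_api_key("VIRUSTOTAL_API_KEY")
    print("VirusTotal key configured" if key else "VirusTotal key missing")
```

Returning `None` instead of raising lets the caller decide whether a missing key should disable an integration or stop the run.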
- Features
- Project Structure
- Installation
- Quick Start
- Usage Examples
- API Integration
- Configuration
- Testing
- Contributing
- License
- Disclaimer
- Static Analysis: Extract metadata, hashes, and basic file properties
- JavaScript Detection: Identify and deobfuscate malicious JavaScript
- Stream Analysis: Analyze encoded and compressed streams
- Embedded File Detection: Find and analyze embedded files
- Structure Analysis: Detect structural anomalies and suspicious objects
- Risk Scoring: Intelligent risk assessment based on multiple factors
- Multiple Output Formats: JSON, HTML, and visual reports
- Batch Processing: Analyze multiple files simultaneously
- Directory Monitoring: Watch folders for new PDFs
- Web Interface: User-friendly web UI for analysis
- Extensible Architecture: Easy to add new analyzers
- VirusTotal - File hash lookup (requires API key)
- URLScan.io - URL analysis (requires API key)
- Hybrid Analysis - Sandbox analysis (requires API key)
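Both the static analysis feature and a hash-based lookup depend on file digests. The sketch below shows one way to compute them with the standard library only. The function name `compute_hashes` is an assumption for illustration and may not match what `basic_analyzer.py` actually does.

```python
import hashlib
from pathlib import Path
from typing import Dict


def compute_hashes(path: str, chunk_size: int = 65536) -> Dict[str, str]:
    """Compute MD5, SHA-1, and SHA-256 hex digests of a file in a single pass.

    Hypothetical helper: reads the file in chunks so large samples are not
    loaded into memory all at once.
    """
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```

The SHA-256 value is the one most commonly submitted to hash lookup services, since MD5 and SHA-1 are kept mainly for compatibility with older indicators.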
pdf-malware-analyzer/
├── src/
│   ├── core/                      # Core components
│   │   ├── base_analyzer.py
│   │   ├── pdf_parser.py
│   │   └── risk_scorer.py
│   ├── analyzers/                 # Analysis modules
│   │   ├── basic_analyzer.py
│   │   ├── metadata_analyzer.py
│   │   ├── javascript_analyzer.py
│   │   ├── stream_analyzer.py
│   │   ├── structure_analyzer.py
│   │   └── embedded_file_analyzer.py
│   ├── deobfuscators/             # Code deobfuscation
│   │   └── js_deobfuscator.py
│   ├── threat_intel/              # API integrations (placeholders)
│   │   ├── virustotal.py
│   │   ├── urlscan.py
│   │   └── hybrid_analysis.py
│   ├── reporters/                 # Output generation
│   │   ├── json_reporter.py
│   │   └── html_reporter.py
│   └── utils/                     # Utilities
│       ├── file_utils.py
│       ├── logger.py
│       └── config_loader.py
├── tests/                         # Test suite
│   └── test_samples/              # Test PDF generators
├── scripts/                       # Utility scripts
│   ├── batch_analyze.py
│   └── monitor_directory.py
├── web_interface/                 # Flask web application
│   ├── app.py
│   └── templates/
├── config.yaml                    # Configuration file
├── requirements.txt               # Python dependencies
└── README.md                      # This file
- Python 3.8 or higher
- pip package manager
- Git (optional)
git clone https://github.com/yourusername/pdf-malware-analyzer.git
cd pdf-malware-analyzer

On Windows:
python -m venv venv
venv\Scripts\activate

On Linux/Mac:
python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
pip install -e .

Edit config.yaml and replace all placeholder API keys with your actual keys:
# In config.yaml - REPLACE ALL PLACEHOLDERS
threat_intel:
  virus_total:
    enabled: false  # Set to true to enable
    api_key: "YOUR_ACTUAL_VIRUSTOTAL_API_KEY"  # Replace placeholder
  urlscan:
    enabled: false  # Set to true to enable
    api_key: "YOUR_ACTUAL_URLSCAN_API_KEY"  # Replace placeholder
  hybrid_analysis:
    enabled: false  # Set to true to enable
    api_key: "YOUR_ACTUAL_HYBRID_ANALYSIS_API_KEY"  # Replace placeholder
    secret: "YOUR_ACTUAL_HYBRID_ANALYSIS_SECRET"  # Replace placeholder
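Because a service that is enabled but still carries a placeholder key would fail at request time, it can help to validate the configuration up front. The following is a rough sketch that assumes the `threat_intel` section has already been parsed into a plain dictionary (for example with PyYAML's `yaml.safe_load`). The function names are hypothetical and are not taken from `config_loader.py`.

```python
from typing import Any, Dict, List


def looks_like_placeholder(value: Any) -> bool:
    """Treat missing, empty, or template-style values as unconfigured."""
    if value is None:
        return True
    text = str(value).strip()
    # "YOUR_" matches the config.yaml template; "<" matches <YOUR_API_KEY_HERE> style markers.
    return not text or text.startswith(("YOUR_", "<"))


def find_unconfigured_services(threat_intel: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the names of enabled services whose credentials are still placeholders."""
    problems = []
    for name, settings in threat_intel.items():
        if not settings.get("enabled", False):
            continue  # Disabled services are never called, so placeholders are harmless.
        credentials = [settings.get("api_key")]
        if "secret" in settings:  # Only some services define a secret.
            credentials.append(settings["secret"])
        if any(looks_like_placeholder(value) for value in credentials):
            problems.append(name)
    return problems
```

Checking only enabled services keeps the default configuration, where every integration is disabled, free of warnings.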