This is a complete, production-ready Apify Actor that scrapes websites using HTTrack.
```
.
├── Dockerfile              # Docker configuration with HTTrack
├── requirements.txt        # Python dependencies
├── src/
│   ├── __init__.py         # Package initialization
│   ├── __main__.py         # Entry point
│   └── main.py             # Main Actor logic
├── .actor/
│   ├── actor.json          # Actor metadata
│   ├── input_schema.json   # Input configuration schema
│   ├── output_schema.json  # Output schema
│   ├── dataset_schema.json # Dataset display schema
│   └── INPUT_EXAMPLE.json  # Example input
├── .dockerignore           # Files to exclude from Docker
├── README.md               # Actor documentation
├── website_scraper.py      # Standalone Python script
└── README_SCRAPER.md       # Standalone script docs
```
Login to Apify:

```shell
apify login
```

Deploy the Actor:

```shell
apify push
```

Run on Apify Console:

- Go to https://console.apify.com/
- Find your Actor
- Configure the input and run

Install the Apify CLI:

```shell
npm install -g apify-cli
```

Run locally:

```shell
apify run
```

Check the output:

```shell
cd apify_storage/key_value_stores/default/
ls *.zip
```
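Beyond listing the files, you can inspect an archive programmatically. A minimal sketch; the commented path is illustrative only, so substitute the ZIP name your run actually produced:

```python
import zipfile

def list_archive(path):
    """Return the file names stored in a ZIP archive."""
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()

# Hypothetical example path; use the name from your own run:
# list_archive("apify_storage/key_value_stores/default/example.com_20241205_130000.zip")
```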
The `website_scraper.py` script can also be used independently:

```shell
cd ~
python3 website_scraper.py https://example.com --non-interactive
```

Minimal input:

```json
{
  "url": "https://example.com"
}
```

Uses all default settings (depth=2, stay on domain, download all content).
A fully configured input:

```json
{
  "url": "https://example.com",
  "depth": 3,
  "stayOnDomain": true,
  "connections": 8,
  "maxRate": 1000,
  "maxSize": 500,
  "maxTime": 600,
  "getImages": true,
  "getVideos": false,
  "followRobots": true,
  "outputName": "my_backup"
}
```

Statistics for each scrape:
```json
{
  "url": "https://example.com",
  "outputName": "example.com_20241205_130000",
  "zipFile": "example.com_20241205_130000.zip",
  "fileCount": 156,
  "totalSize": 5242880,
  "zipSize": 2621440,
  "compressionRatio": 50.0,
  "timestamp": "2024-12-05T13:00:00.000Z",
  "status": "success"
}
```

A complete ZIP archive of the website is also produced.
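The statistics above can be derived from the mirrored directory and its ZIP archive. A minimal sketch; the `compressionRatio` formula (zip size as a percentage of the original) is an assumption inferred from the sample values, not confirmed Actor code:

```python
import os
import zipfile

def collect_stats(site_dir, zip_path):
    """Compute fileCount, totalSize, zipSize and compressionRatio
    for a mirrored site directory and its ZIP archive."""
    file_count = 0
    total_size = 0
    for root, _dirs, files in os.walk(site_dir):
        for name in files:
            file_count += 1
            total_size += os.path.getsize(os.path.join(root, name))
    zip_size = os.path.getsize(zip_path)
    # Assumed formula: ZIP size as a percentage of the uncompressed total.
    ratio = round(zip_size / total_size * 100, 1) if total_size else 0.0
    return {
        "fileCount": file_count,
        "totalSize": total_size,
        "zipSize": zip_size,
        "compressionRatio": ratio,
    }
```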
Download via API:

```shell
curl "https://api.apify.com/v2/key-value-stores/{storeId}/keys/(unknown).zip" > website.zip
```

The Dockerfile:
- ✅ Installs HTTrack (system package)
- ✅ Installs Python dependencies
- ✅ Copies Actor source code
- ✅ Sets up proper permissions
- ✅ Configures environment
Key sections:
- Root user section: Installs HTTrack
- myuser section: Installs Python packages and copies code
- Environment: Sets PATH and HTTrack variables
Flow:
- Input Validation - Checks URL is provided
- Configuration - Loads config with defaults
- HTTrack Check - Verifies HTTrack is installed
- Scraping - Runs HTTrack with configured parameters
- ZIP Creation - Compresses all downloaded files
- Storage - Saves ZIP to Key-Value Store
- Dataset - Pushes statistics to Dataset
- Cleanup - Removes temporary files (if enabled)
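The scraping step maps the validated input onto HTTrack command-line flags. A minimal sketch of that mapping, based on HTTrack's documented options (`-r` depth, `-c` connections, `-A` max rate in bytes/s, `-s` robots policy, `-%e` external depth); treat the exact flag choices as an assumption, not the Actor's verbatim code:

```python
def build_httrack_command(url, output_dir, config):
    """Build an HTTrack argument list from Actor input fields.
    Flag mapping is a sketch of HTTrack's option set."""
    cmd = [
        "httrack", url,
        "-O", output_dir,
        f"-r{config.get('depth', 2)}",        # mirror depth
        f"-c{config.get('connections', 4)}",  # simultaneous connections
        # maxRate is assumed to be KB/s in the input; -A takes bytes/s
        f"-A{config.get('maxRate', 1000) * 1024}",
    ]
    # -s2 = always obey robots.txt, -s0 = ignore it
    cmd.append("-s2" if config.get("followRobots", True) else "-s0")
    if config.get("stayOnDomain", True):
        cmd.append("-%e0")  # external depth 0: stay on the start domain
    return cmd
```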
Defines the Actor's input form in Apify Console:
- Required fields (url)
- Optional fields with defaults
- Field types and validation
- Descriptions and examples
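A representative fragment of such a schema (field names come from the examples above; titles, descriptions, and defaults here are illustrative, not the Actor's actual file):

```json
{
  "title": "HTTrack Website Scraper Input",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "url": {
      "title": "Start URL",
      "type": "string",
      "description": "Website to mirror",
      "editor": "textfield"
    },
    "depth": {
      "title": "Mirror depth",
      "type": "integer",
      "default": 2
    },
    "followRobots": {
      "title": "Respect robots.txt",
      "type": "boolean",
      "default": true
    }
  },
  "required": ["url"]
}
```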
Defines output quick links:
- ZIP file in Key-Value Store
- Dataset with statistics
Defines how data is displayed in Apify Console:
- Overview view: Key metrics
- Details view: Full configuration
Back up your own website:

```json
{
  "url": "https://mywebsite.com",
  "depth": 5,
  "stayOnDomain": true,
  "followRobots": true
}
```

Light crawl of a third-party site:

```json
{
  "url": "https://competitor.com",
  "depth": 2,
  "getVideos": false,
  "maxSize": 100
}
```

Archive an old site in full:

```json
{
  "url": "https://old-site.com",
  "depth": 10,
  "externalDepth": 1,
  "maxTime": 3600
}
```

Mirror a documentation site:

```json
{
  "url": "https://docs.example.com",
  "depth": 3,
  "stayOnDomain": true,
  "getImages": true,
  "getVideos": false
}
```

In Apify Console:
- Go to Run detail
- Click "Log" tab
- See real-time progress
Locally:

```shell
apify run
# Logs appear in the terminal
```

Issue: "HTTrack is not installed"
- Solution: Docker image should have HTTrack pre-installed
- Check:
  ```shell
  docker run <image> httrack --version
  ```
Issue: "Failed to scrape website"
- Solution: Check logs for HTTrack errors
- Try: Reduce depth, enable followRobots, increase timeout
Issue: "ZIP file too large"
- Solution: Disable videos, reduce depth, set maxSize
✅ Always have permission to scrape websites
✅ Respect robots.txt (enabled by default)
✅ Use rate limiting to avoid overloading servers
✅ Check Terms of Service before scraping
✅ Don't scrape personal data without consent
Recommended settings:
- connections: 2-8 (4 is a safe default)
- maxRate: 500-1000 KB/s for polite scraping
- followRobots: true (always)
Fast scrape:

```json
{
  "connections": 8,
  "getVideos": false,
  "depth": 2
}
```

Balanced:

```json
{
  "connections": 4,
  "maxRate": 1000,
  "depth": 3
}
```

Polite (low server impact):

```json
{
  "connections": 2,
  "maxRate": 500,
  "depth": 2,
  "timeout": 60
}
```

Before deploying:
- Test locally with `apify run`
- Verify HTTrack is installed in the Dockerfile
- Check input_schema.json has all fields
- Test with sample URLs
- Review logs for errors
- Check ZIP files are created
- Verify dataset output
- Update actor.json metadata
- Set appropriate timeout (default: 3600s)
- Add README.md with examples
Deploy command:

```shell
apify push
```

In the Dockerfile:

```dockerfile
RUN apt-get update && apt-get install -y httrack=<version>
```

In requirements.txt:

```
apify~=2.0.0
```
Check Actor runs for:
- Average duration
- Memory usage
- Success rate
- Common errors
Apify Documentation: https://docs.apify.com/
HTTrack Documentation: https://www.httrack.com/
Actor Console: https://console.apify.com/
This Actor provides:
- ✅ Complete website downloads with HTTrack
- ✅ ZIP archive output
- ✅ Configurable parameters (15+ options)
- ✅ Progress tracking and statistics
- ✅ Apify platform integration
- ✅ Production-ready Docker container
Perfect for website backups, archiving, and offline browsing!