A comprehensive Python script to scrape websites using HTTrack and create ZIP archives of the downloaded content.
- ✅ Interactive configuration with sensible defaults
- ✅ Non-interactive mode for automation
- ✅ Automatic ZIP archive creation
- ✅ Configuration saving/loading
- ✅ Detailed progress tracking
- ✅ Error handling and validation
- ✅ Customizable download parameters
- ✅ Support for depth control, rate limiting, and content filtering
## Prerequisites

HTTrack must be installed:

```bash
# Ubuntu/Debian
sudo apt-get install httrack

# Fedora/RHEL
sudo dnf install httrack

# macOS
brew install httrack
```
Python 3.6+ is also required (usually pre-installed on Linux).

Make the script executable:

```bash
chmod +x website_scraper.py
```

(Optional) Create a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```
## Usage

Basic interactive run:

```bash
python website_scraper.py https://example.com
```

This will:
- Prompt you for configuration options
- Scrape the website
- Create a ZIP archive
- Keep both the scraped directory and ZIP file
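The archive step can be sketched with the standard library's `shutil.make_archive` (a minimal illustration; `create_zip` here is a hypothetical helper, and the script's actual implementation may differ):

```python
import shutil
from pathlib import Path

def create_zip(source_dir):
    """Zip the scraped directory next to itself, keeping both."""
    src = Path(source_dir)
    # make_archive appends ".zip" to the base name it is given
    return shutil.make_archive(str(src), "zip", root_dir=src)

# Tiny demo directory standing in for a scraped site:
demo = Path("demo_site")
demo.mkdir(exist_ok=True)
(demo / "index.html").write_text("<html></html>")
zip_path = create_zip(demo)
```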
Non-interactive mode:

```bash
python website_scraper.py https://example.com --non-interactive
```

Cleanup mode:

```bash
python website_scraper.py https://example.com --cleanup
```

This removes the scraped directory after creating the ZIP file.
Custom output name:

```bash
python website_scraper.py https://example.com --output my_website
```

Flags can be combined:

```bash
python website_scraper.py https://example.com -n -c -o my_site
```

## Configuration

When running in interactive mode, you'll be prompted for:
- Mirror depth: How many links deep to follow (default: 2)
- Stay on domain: Whether to only download from the same domain (default: yes)
- Max download rate: Maximum KB/s (0 = unlimited)
- Max total size: Maximum total size in MB (0 = unlimited)
- Max time: Maximum scraping time in seconds (0 = unlimited)
- Simultaneous connections: Number of parallel downloads (default: 4)
- Retries: Number of retry attempts on error (default: 2)
- Timeout: Connection timeout in seconds (default: 30)
- Download images: Whether to download images (default: yes)
- Download videos: Whether to download videos (default: yes)
- Download audio: Whether to download audio files (default: yes)
- Follow robots.txt: Respect website's robots.txt (default: yes)
- Accept cookies: Allow cookies during scraping (default: yes)
- Parse JavaScript: Parse JavaScript for links (default: yes)
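Each prompt presumably falls back to its default on empty input. A minimal sketch of that pattern (the `ask` helper is illustrative, not the script's actual function):

```python
def ask(prompt, default, cast=str, reader=input):
    """Prompt with a default; empty input keeps the default."""
    raw = reader(f"{prompt} [{default}]: ").strip()
    return cast(raw) if raw else default

# Simulated answers instead of real stdin, for illustration:
depth = ask("Mirror depth", 2, int, reader=lambda p: "")        # user hits Enter
rate = ask("Max download rate (KB/s)", 0, int, reader=lambda p: "512")
```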
The script saves your configuration to `scraper_config.json` for reuse. You can edit this file directly:

```json
{
  "depth": 2,
  "max_rate": 0,
  "connections": 4,
  "stay_on_domain": true,
  "get_images": true,
  "follow_robots": true
}
```

## Output Structure

```
scraped_websites/
├── example.com_20241205_120000/
│   ├── index.html
│   ├── assets/
│   ├── images/
│   ├── scrape_config.json
│   └── hts-log.txt
└── example.com_20241205_120000.zip
```
Each scraped site includes:
- All downloaded files with original structure
- `scrape_config.json`: Configuration used for this scrape
- `hts-log.txt`: HTTrack's detailed log file
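The save/load cycle could be approximated as follows (a sketch assuming the JSON layout shown earlier; function names are illustrative, not the script's exact API):

```python
import json
from pathlib import Path

CONFIG_FILE = Path("scraper_config.json")

def save_config(config):
    """Persist settings as pretty-printed JSON."""
    CONFIG_FILE.write_text(json.dumps(config, indent=2))

def load_config(defaults):
    """Merge saved settings over built-in defaults."""
    if CONFIG_FILE.exists():
        return {**defaults, **json.loads(CONFIG_FILE.read_text())}
    return dict(defaults)

save_config({"depth": 5})
cfg = load_config({"depth": 2, "connections": 4})  # saved depth wins
```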
## Examples

```bash
# Scrape all posts, stay on domain
python website_scraper.py https://myblog.com

# Deep scrape for documentation
python website_scraper.py https://docs.example.com --non-interactive
```

Then edit `scraper_config.json`:

```json
{
  "depth": 5,
  "stay_on_domain": true,
  "get_images": true,
  "get_videos": false
}
```

```bash
python website_scraper.py https://createathon.co -n -c
```

To scrape several sites in one go, use a shell script:

```bash
#!/bin/bash
# scrape_multiple.sh
sites=(
  "https://site1.com"
  "https://site2.com"
  "https://site3.com"
)
for site in "${sites[@]}"; do
  python website_scraper.py "$site" --non-interactive --cleanup
done
```

## Troubleshooting

```
✗ HTTrack is not installed!
```
Solution: Install HTTrack using your package manager (see Prerequisites).
```
Permission denied: 'scraped_websites'
```
Solution: Run with appropriate permissions or change the output directory:

```bash
mkdir -p ~/scraped_websites
python website_scraper.py https://example.com
```

If scraping takes too long, adjust the configuration:
- Reduce `depth` (e.g., from 5 to 2)
- Set a `max_time` limit
- Disable video/image download if not needed
- Increase `connections` for faster parallel downloads
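For example, a configuration trimmed for speed (illustrative values, not recommendations) might be:

```json
{
  "depth": 2,
  "max_time": 600,
  "get_videos": false,
  "connections": 8
}
```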
If the server blocks or throttles the scrape:
- Enable `follow_robots` to respect robots.txt
- Reduce `connections` to be more polite
- Set `max_rate` to limit bandwidth usage
- Increase `timeout` if connections are slow
If the download is too large:
- Disable video/image downloads
- Reduce `depth`
- Set a `max_size` limit
- Filter specific file types
Edit the script to add custom filters:

```python
# In the build_httrack_command method, add:
cmd.extend(["+*.html", "+*.css", "+*.js"])   # Only these types
cmd.extend(["-*/admin/*", "-*/wp-admin/*"])  # Exclude admin paths
```

To scrape a list of URLs, create a `urls.txt` file:

```
https://site1.com
https://site2.com
https://site3.com
```
Then run:

```bash
while read url; do
    python website_scraper.py "$url" -n -c
    sleep 60  # Wait 60 seconds between scrapes
done < urls.txt
```

You can also use the scraper from your own Python code:

```python
from website_scraper import WebsiteScraper

scraper = WebsiteScraper()
config = scraper.load_config()
output_dir = scraper.scrape_website("https://example.com", config)
zip_path = scraper.create_zip(output_dir)

# Do something with the ZIP file
upload_to_s3(zip_path)
```

## Ethical Considerations

- Respect `robots.txt` files
- Don't overload servers with too many connections
- Check the website's Terms of Service
- Don't scrape copyrighted content without permission
- Be a good internet citizen
## Command-Line Reference

```
usage: website_scraper.py [-h] [-n] [-c] [-o OUTPUT] url

positional arguments:
  url                   URL of the website to scrape

optional arguments:
  -h, --help            Show help message and exit
  -n, --non-interactive Use default configuration without prompts
  -c, --cleanup         Remove source directory after creating ZIP
  -o OUTPUT, --output OUTPUT
                        Custom output name for the scraped content
```
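This option set maps naturally onto `argparse`; a sketch of how such a parser might be built (illustrative, not necessarily the script's exact code):

```python
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="website_scraper.py")
    p.add_argument("url", help="URL of the website to scrape")
    p.add_argument("-n", "--non-interactive", action="store_true",
                   help="Use default configuration without prompts")
    p.add_argument("-c", "--cleanup", action="store_true",
                   help="Remove source directory after creating ZIP")
    p.add_argument("-o", "--output",
                   help="Custom output name for the scraped content")
    return p

# Parse a sample command line instead of sys.argv, for illustration:
args = build_parser().parse_args(["https://example.com", "-n", "-o", "my_site"])
```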
This script is provided as-is for educational and legitimate use cases.
For issues or questions:
- Check HTTrack documentation: https://www.httrack.com/
- Verify HTTrack is properly installed
- Check file permissions and disk space
- Review the generated `hts-log.txt` for detailed errors