Website Scraper Using HTTrack

A comprehensive Python script to scrape websites using HTTrack and create ZIP archives of the downloaded content.

Features

  • ✅ Interactive configuration with sensible defaults
  • ✅ Non-interactive mode for automation
  • ✅ Automatic ZIP archive creation
  • ✅ Configuration saving/loading
  • ✅ Detailed progress tracking
  • ✅ Error handling and validation
  • ✅ Customizable download parameters
  • ✅ Support for depth control, rate limiting, and content filtering

Prerequisites

  1. HTTrack must be installed:

    # Ubuntu/Debian
    sudo apt-get install httrack
    
    # Fedora/RHEL
    sudo dnf install httrack
    
    # macOS
    brew install httrack
  2. Python 3.6+ (usually pre-installed on Linux)
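
To sanity-check both prerequisites before a run, here is a minimal standard-library check (this snippet is not part of the script; it only verifies your environment):

import shutil
import sys

# The scraper shells out to the httrack binary, so it must be on PATH
if shutil.which("httrack") is None:
    sys.exit("HTTrack is not installed or not on PATH")

# The script targets Python 3.6+
if sys.version_info < (3, 6):
    sys.exit("Python 3.6+ is required")

print("Prerequisites OK")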

Installation

  1. Make the script executable:

    chmod +x website_scraper.py
  2. (Optional) Create a virtual environment:

    python3 -m venv venv
    source venv/bin/activate

Usage

Basic Usage (Interactive Mode)

python website_scraper.py https://example.com

This will:

  1. Prompt you for configuration options
  2. Scrape the website
  3. Create a ZIP archive
  4. Keep both the scraped directory and ZIP file

Non-Interactive Mode (Use Defaults)

python website_scraper.py https://example.com --non-interactive

Cleanup After Scraping

python website_scraper.py https://example.com --cleanup

This removes the scraped directory after creating the ZIP file.

Custom Output Name

python website_scraper.py https://example.com --output my_website

Combined Options

python website_scraper.py https://example.com -n -c -o my_site
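
(Here -n, -c, and -o are the short forms of --non-interactive, --cleanup, and --output; see the Command-Line Options Reference at the end.)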

Configuration Options

When running in interactive mode, you'll be prompted for:

Basic Settings

  • Mirror depth: How many links deep to follow (default: 2)
  • Stay on domain: Whether to only download from the same domain (default: yes)

Download Limits

  • Max download rate: Maximum KB/s (0 = unlimited)
  • Max total size: Maximum total size in MB (0 = unlimited)
  • Max time: Maximum scraping time in seconds (0 = unlimited)

Connection Settings

  • Simultaneous connections: Number of parallel downloads (default: 4)
  • Retries: Number of retry attempts on error (default: 2)
  • Timeout: Connection timeout in seconds (default: 30)

Content Settings

  • Download images: Whether to download images (default: yes)
  • Download videos: Whether to download videos (default: yes)
  • Download audio: Whether to download audio files (default: yes)

Advanced Settings

  • Follow robots.txt: Respect the website's robots.txt (default: yes)
  • Accept cookies: Allow cookies during scraping (default: yes)
  • Parse JavaScript: Parse JavaScript for links (default: yes)

Configuration File

The script saves your configuration to scraper_config.json for reuse. You can edit this file directly:

{
  "depth": 2,
  "max_rate": 0,
  "connections": 4,
  "stay_on_domain": true,
  "get_images": true,
  "follow_robots": true
}
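
Because the file is plain JSON, it can also be updated from a script; a minimal sketch using only the standard library, assuming the keys shown above:

import json
from pathlib import Path

config_path = Path("scraper_config.json")

# Load the saved configuration (keys as in the sample above)
config = json.loads(config_path.read_text())

# Deepen the mirror and cap the bandwidth, then save for the next run
config["depth"] = 3
config["max_rate"] = 500  # KB/s, per the Download Limits prompt
config_path.write_text(json.dumps(config, indent=2))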

Output Structure

scraped_websites/
├── example.com_20241205_120000/
│   ├── index.html
│   ├── assets/
│   ├── images/
│   ├── scrape_config.json
│   └── hts-log.txt
└── example.com_20241205_120000.zip

Each scraped site includes:

  • All downloaded files with original structure
  • scrape_config.json: Configuration used for this scrape
  • hts-log.txt: HTTrack's detailed log file
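
To spot-check an archive after a run, the standard-library zipfile module is enough; a minimal sketch (the path follows the timestamped pattern shown above):

import zipfile

archive = "scraped_websites/example.com_20241205_120000.zip"

with zipfile.ZipFile(archive) as zf:
    bad = zf.testzip()  # returns the first corrupt member name, or None
    if bad is not None:
        print(f"Corrupt member: {bad}")
    else:
        print(f"{len(zf.namelist())} files, archive OK")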

Examples

Scrape a Blog

# Scrape all posts, stay on domain
python website_scraper.py https://myblog.com

Scrape Documentation Site

# Deep scrape for documentation
python website_scraper.py https://docs.example.com --non-interactive

Then edit scraper_config.json and rerun:

{
  "depth": 5,
  "stay_on_domain": true,
  "get_images": true,
  "get_videos": false
}

Quick Scrape (No Prompts)

python website_scraper.py https://createathon.co -n -c

Scrape Multiple Sites

#!/bin/bash
# scrape_multiple.sh

sites=(
  "https://site1.com"
  "https://site2.com"
  "https://site3.com"
)

for site in "${sites[@]}"; do
  python website_scraper.py "$site" --non-interactive --cleanup
done

Troubleshooting

HTTrack Not Found

✗ HTTrack is not installed!

Solution: Install HTTrack using your package manager (see Prerequisites).

Permission Denied

Permission denied: 'scraped_websites'

Solution: Run the script from a directory you can write to, or fix permissions on the output directory:

cd ~
python website_scraper.py https://example.com

Scraping Takes Too Long

Solution: Adjust configuration:

  • Reduce depth (e.g., from 5 to 2)
  • Set max_time limit
  • Disable video/image download if not needed
  • Increase connections for faster parallel downloads
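
For example, a scraper_config.json tuned for a quicker run might look like this (the max_time and get_videos key names follow the options referenced in this README; check your saved config for the exact spelling):

{
  "depth": 2,
  "max_time": 600,
  "get_videos": false,
  "connections": 8
}

The same edit-and-rerun pattern works for the politeness settings (max_rate, follow_robots) covered in the next section.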

Website Blocks Scraping

Solution:

  • Enable follow_robots to respect robots.txt
  • Reduce connections to be more polite
  • Set max_rate to limit bandwidth usage
  • Increase timeout if connections are slow

Large ZIP Files

Solution:

  • Disable video/image downloads
  • Reduce depth
  • Set max_size limit
  • Filter specific file types

Advanced Usage

Custom HTTrack Filters

Edit the script to add custom filters:

# In the build_httrack_command method, add:
cmd.extend(["+*.html", "+*.css", "+*.js"])   # include these file types
cmd.extend(["-*/admin/*", "-*/wp-admin/*"])  # exclude admin paths

Batch Processing

Create a urls.txt file:

https://site1.com
https://site2.com
https://site3.com

Then run:

while read -r url; do
  python website_scraper.py "$url" -n -c
  sleep 60  # Wait 60 seconds between scrapes
done < urls.txt
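
The same loop in Python, if you want per-URL error handling (a sketch that only invokes the command line documented above):

import subprocess
import time
from pathlib import Path

# One URL per line, blank lines skipped
urls = [u.strip() for u in Path("urls.txt").read_text().splitlines() if u.strip()]

for url in urls:
    # -n: use defaults without prompts, -c: remove the directory after zipping
    result = subprocess.run(["python", "website_scraper.py", url, "-n", "-c"])
    if result.returncode != 0:
        print(f"Scrape failed for {url} (exit {result.returncode})")
    time.sleep(60)  # wait 60 seconds between scrapes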

Integration with Other Tools

from website_scraper import WebsiteScraper

scraper = WebsiteScraper()
config = scraper.load_config()
output_dir = scraper.scrape_website("https://example.com", config)
zip_path = scraper.create_zip(output_dir)

# Hand the ZIP off to your own code; upload_to_s3 is a placeholder
upload_to_s3(zip_path)

Legal and Ethical Considerations

⚠️ Important: Always ensure you have permission to scrape websites.

  • Respect robots.txt files
  • Don't overload servers with too many connections
  • Check website Terms of Service
  • Don't scrape copyrighted content without permission
  • Be a good internet citizen

Command-Line Options Reference

usage: website_scraper.py [-h] [-n] [-c] [-o OUTPUT] url

positional arguments:
  url                   URL of the website to scrape

optional arguments:
  -h, --help            Show help message and exit
  -n, --non-interactive Use default configuration without prompts
  -c, --cleanup         Remove source directory after creating ZIP
  -o OUTPUT, --output OUTPUT
                        Custom output name for the scraped content

License

This script is provided as-is for educational and legitimate use cases.

Support

For issues or questions:

  1. Check HTTrack documentation: https://www.httrack.com/
  2. Verify HTTrack is properly installed
  3. Check file permissions and disk space
  4. Review the generated hts-log.txt for detailed errors