High-performance, distributed web crawler and search indexer built in Rust.
Scrapix aims to be an internet-scale web crawler capable of:
- Global Internet Indexing - Crawl billions of pages
- Targeted Site Crawling - Index specific websites/documentation
- Real-time Information Retrieval - On-demand crawling
```
┌─────────────────────────────────────────────────────────────────┐
│                            API Layer                            │
│                          (scrapix-api)                          │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│   Frontier    │     │    Crawler    │     │    Content    │
│   Service     │     │    Workers    │     │    Workers    │
└───────────────┘     └───────────────┘     └───────────────┘
        │                     │                     │
        └─────────────────────┼─────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                           Data Layer                            │
│   Redpanda  │  RocksDB  │  Meilisearch  │  DragonflyDB  │  S3   │
└─────────────────────────────────────────────────────────────────┘
```
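To make the flow concrete, here is an illustrative sketch of the message shapes the services above might exchange over Redpanda (hypothetical types for exposition; not the actual scrapix-queue API): the frontier schedules URLs for crawler workers, which fetch pages and emit results for content workers to extract and index.

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical: a URL the frontier service has scheduled for a crawler worker.
#[derive(Serialize, Deserialize)]
struct FetchTask {
    job_id: String,
    url: String,
    depth: u32,
}

/// Hypothetical: what a crawler worker emits after a fetch. Content workers
/// consume the body for extraction and indexing; discovered links flow back
/// to the frontier for dedup and scheduling.
#[derive(Serialize, Deserialize)]
struct FetchedPage {
    job_id: String,
    url: String,
    http_status: u16,
    body: String,
    discovered_links: Vec<String>,
}
```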
| Component | Technology |
|---|---|
| Language | Rust |
| Message Queue | Redpanda (Kafka-compatible) |
| Search | Meilisearch |
| Local State | RocksDB |
| Cache | DragonflyDB (Redis-compatible) |
| Object Storage | S3/MinIO/RustFS |
- Rust 1.75+
- Docker & Docker Compose
```bash
# Start all infrastructure (Redpanda, Meilisearch, DragonflyDB)
docker compose -f docker-compose.yml -f docker-compose.dev.yml up -d

# Build all binaries
cargo build --release
```

In separate terminals:

```bash
# Terminal 1: API Server
KAFKA_BROKERS=localhost:19092 \
MEILISEARCH_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=masterKey \
cargo run --release --bin scrapix-api
# Terminal 2: Frontier Service
KAFKA_BROKERS=localhost:19092 \
cargo run --release --bin scrapix-frontier-service
# Terminal 3: Crawler Worker
KAFKA_BROKERS=localhost:19092 \
cargo run --release --bin scrapix-worker-crawler
# Terminal 4: Content Worker
KAFKA_BROKERS=localhost:19092 \
MEILISEARCH_URL=http://localhost:7700 \
MEILISEARCH_API_KEY=masterKey \
cargo run --release --bin scrapix-worker-content
```

Then submit a crawl:

```bash
# Using the CLI
cargo run --bin scrapix -- crawl -f examples/simple-crawl.json
# Or using curl
curl -X POST http://localhost:8080/crawl \
-H "Content-Type: application/json" \
  -d @examples/simple-crawl.json
```

Run everything in Docker:

```bash
# Build and start all services
docker compose up -d --build
# View logs
docker compose logs -f
# Stop all services
docker compose down
```

Services will be available at:
- API: http://localhost:8080
- Meilisearch: http://localhost:7700
- Redpanda Console: http://localhost:8090
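For a quick end-to-end check, here is a hedged client sketch (hypothetical code using `reqwest` with its `json` feature, `tokio`, and `serde_json`; not part of Scrapix) that starts a crawl through the API documented below and polls the job until it finishes:

```rust
use serde_json::{json, Value};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    // POST /crawl returns { "job_id": "..." } (see the endpoint reference below).
    let resp: Value = client
        .post("http://localhost:8080/crawl")
        .json(&json!({
            "start_urls": ["https://example.com"],
            "index_uid": "my-index",
            "max_depth": 5
        }))
        .send()
        .await?
        .json()
        .await?;
    let job_id = resp["job_id"].as_str().unwrap_or_default().to_owned();

    // Poll GET /job/{job_id}/status until the job leaves the "running" state.
    loop {
        let status: Value = client
            .get(format!("http://localhost:8080/job/{job_id}/status"))
            .send()
            .await?
            .json()
            .await?;
        println!("{status}");
        if status["status"] != "running" {
            break;
        }
        tokio::time::sleep(Duration::from_secs(2)).await;
    }
    Ok(())
}
```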
```
POST /crawl
Content-Type: application/json
{
"start_urls": ["https://example.com"],
"index_uid": "my-index",
"max_depth": 5,
"features": {
"markdown": { "enabled": true }
}
}
# Response
{ "job_id": "550e8400-e29b-41d4-a716-446655440000" }POST /crawl/sync
Content-Type: application/json
# Same body as above; waits for completion
```

```
GET /job/{job_id}/status
# Response
{
"job_id": "...",
"status": "running",
"pages_crawled": 150,
"pages_indexed": 145
}
```

```
GET /job/{job_id}/events
Accept: text/event-stream
# Returns server-sent events with crawl progress
```

Connect to `/ws` for multi-job subscriptions or `/ws/job/{job_id}` for a single job.
Client Messages:

```
{"type": "subscribe", "job_id": "..."} // Subscribe to job
{"type": "unsubscribe", "job_id": "..."} // Unsubscribe
{"type": "get_status", "job_id": "..."} // Request status
{"type": "ping"} // KeepaliveServer Messages:
{"type": "event", "job_id": "...", "event": {...}} // Job event
{"type": "status", "job_id": "...", "status": {...}} // Status response
{"type": "subscribed", "job_id": "..."} // Confirmed
{"type": "pong", "timestamp": 1234567890} // Keepalive responseExample (JavaScript):
const ws = new WebSocket('ws://localhost:8080/ws/job/' + jobId);
ws.onmessage = (e) => console.log(JSON.parse(e.data));
```

Other endpoints:

- `DELETE /job/{job_id}` - Cancel a job
- `GET /jobs?limit=10&offset=0` - List recent jobs
- `GET /health` - Health check

```bash
# Start a crawl job
scrapix crawl -f config.json
scrapix crawl --start-url https://example.com --index-uid my-index
# Start a sync crawl (wait for completion)
scrapix crawl -f config.json --sync
# Check job status
scrapix status <job_id>
scrapix status <job_id> --watch # Poll continuously
# Stream job events
scrapix events <job_id>
# List recent jobs
scrapix jobs --limit 20
# Cancel a job
scrapix cancel <job_id>
```

See `examples/` for configuration examples:
- `simple-crawl.json` - Basic HTTP crawl
- `documentation-site.json` - Documentation with custom selectors
- `ecommerce-products.json` - Product catalog with schema extraction
- `ai-enrichment.json` - AI-powered content enrichment
- `with-proxy.json` - Crawling with proxy rotation

A full configuration example:
```json
{
"start_urls": ["https://example.com"],
"index_uid": "my-index",
"crawler_type": "http",
"max_depth": 10,
"max_pages": 1000,
"url_patterns": {
"include": ["https://example.com/**"],
"exclude": ["**/private/**"],
"index_only": ["**/docs/**"]
},
"sitemap": {
"enabled": true,
"urls": ["https://example.com/sitemap.xml"]
},
"concurrency": {
"max_concurrent_requests": 20,
"browser_pool_size": 5
},
"rate_limit": {
"requests_per_second": 10,
"per_domain_delay_ms": 100,
"respect_robots_txt": true
},
"proxy": {
"urls": ["http://proxy:8080"],
"rotation": "round_robin"
},
"features": {
"metadata": { "enabled": true },
"markdown": { "enabled": true },
"block_split": { "enabled": true },
"schema": {
"enabled": true,
"only_types": ["Product", "Article"]
},
"custom_selectors": {
"enabled": true,
"selectors": {
"title": "h1",
"price": ".product-price"
}
},
"ai_extraction": {
"enabled": true,
"prompt": "Extract key information...",
"model": "gpt-4"
},
"ai_summary": { "enabled": true },
"embeddings": {
"enabled": true,
"model": "text-embedding-3-small"
}
},
"meilisearch": {
"url": "http://localhost:7700",
"api_key": "masterKey",
"batch_size": 100
},
"webhooks": [{
"url": "https://hooks.example.com/crawl",
"events": ["crawl_completed"],
"enabled": true
}]
}
```

```bash
# Apply local overlay
kubectl apply -k deploy/kubernetes/overlays/local
# Access services
kubectl port-forward -n scrapix svc/scrapix-api 8080:8080
kubectl port-forward -n scrapix svc/meilisearch 7700:7700
```

```bash
# Apply production overlay
kubectl apply -k deploy/kubernetes/overlays/prod
```

| Variable | Description | Default |
|---|---|---|
| `HOST` | API server host | 0.0.0.0 |
| `PORT` | API server port | 8080 |
| `KAFKA_BROKERS` | Kafka/Redpanda brokers | localhost:9092 |
| `KAFKA_GROUP_ID` | Consumer group ID | Service-specific |
| `MEILISEARCH_URL` | Meilisearch URL | http://localhost:7700 |
| `MEILISEARCH_API_KEY` | Meilisearch API key | - |
| `REDIS_URL` | Redis/DragonflyDB URL | redis://localhost:6379 |
| `RUST_LOG` | Log level | info |
| `CONCURRENCY` | Crawler concurrency | 10 |
| `USER_AGENT` | HTTP User-Agent | Scrapix default |
| `REQUEST_TIMEOUT` | Request timeout (seconds) | 30 |
| `MAX_RETRIES` | Max retry attempts | 3 |
| `MAX_DEPTH` | Max crawl depth | 100 |
| `RESPECT_ROBOTS` | Respect robots.txt | true |
| `OPENAI_API_KEY` | OpenAI API key (for AI features) | - |
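For illustration only, a worker might resolve these variables with `std::env` along these lines (a sketch using the defaults from the table; not Scrapix's actual configuration loader):

```rust
use std::env;

// Fall back to the documented default when the variable is unset.
fn kafka_brokers() -> String {
    env::var("KAFKA_BROKERS").unwrap_or_else(|_| "localhost:9092".into())
}

// Parse numeric and boolean settings, falling back on any parse error.
fn concurrency() -> usize {
    env::var("CONCURRENCY")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(10)
}

fn respect_robots() -> bool {
    env::var("RESPECT_ROBOTS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(true)
}
```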
Scrapix includes advanced near-duplicate detection using locality-sensitive hashing:
Fast content fingerprinting using weighted token hashing:
```rust
use scrapix_frontier::SimHash;
let simhash = SimHash::new();
let hash1 = simhash.hash(content1);
let hash2 = simhash.hash(content2);
// Hamming distance < 10 indicates near-duplicate
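// Illustrative note: the distance is simply the number of differing bits,
// i.e. (hash1 ^ hash2).count_ones() on a typical 64-bit fingerprint.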
let distance = SimHash::hamming_distance(hash1, hash2);
```

Accurate similarity estimation using multiple hash functions:

```rust
use scrapix_frontier::MinHash;
let minhash = MinHash::new(128); // 128 hash functions
let sig1 = minhash.signature(content1);
let sig2 = minhash.signature(content2);
// Returns similarity estimate 0.0-1.0
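// Illustrative note (an assumption about the scheme, not crate docs): the
// estimate is the fraction of the 128 signature positions that agree.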
let similarity = MinHash::jaccard_similarity(&sig1, &sig2);
```

Combines both methods with LSH buckets for efficient detection:

```rust
use scrapix_frontier::{NearDuplicateDetector, NearDuplicateConfig};
let detector = NearDuplicateDetector::new(NearDuplicateConfig {
use_simhash: true,
simhash_threshold: 10, // Max Hamming distance
use_minhash: true,
minhash_threshold: 0.8, // Min Jaccard similarity
..Default::default()
});
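// With LSH bucketing, check_and_add only compares against documents whose
// signatures share a bucket, rather than scanning every page seen so far.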
// Returns Some(canonical_url) if near-duplicate found
if let Some(original) = detector.check_and_add(url, content) {
println!("Duplicate of: {}", original);
}
```

Scrapix includes a complete monitoring stack with Prometheus and Grafana.

```bash
# Start monitoring services
cd deploy/monitoring
docker compose up -d
# Access dashboards
# Grafana: http://localhost:3000 (admin/admin)
# Prometheus: http://localhost:9090
# Alertmanager: http://localhost:9093
```

The `scrapix-telemetry` crate exports metrics:
| Metric | Type | Description |
|---|---|---|
| `scrapix_pages_crawled_total` | Counter | Total pages crawled |
| `scrapix_pages_indexed_total` | Counter | Total pages indexed |
| `scrapix_crawl_errors_total` | Counter | Crawl errors by type |
| `scrapix_crawl_latency_seconds` | Histogram | Page fetch latency |
| `scrapix_index_latency_seconds` | Histogram | Indexing latency |
| `scrapix_queue_depth` | Gauge | URLs pending in queue |
| `scrapix_active_crawls` | Gauge | Currently active crawls |
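As a reference point, here is a minimal sketch of registering and updating metrics with these names and types using the `prometheus` crate directly (an assumption for illustration; `scrapix-telemetry`'s actual API is not shown here):

```rust
use prometheus::{Encoder, Histogram, HistogramOpts, IntCounter, IntGauge, Registry, TextEncoder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();

    let pages_crawled = IntCounter::new("scrapix_pages_crawled_total", "Total pages crawled")?;
    let crawl_latency = Histogram::with_opts(HistogramOpts::new(
        "scrapix_crawl_latency_seconds",
        "Page fetch latency",
    ))?;
    let queue_depth = IntGauge::new("scrapix_queue_depth", "URLs pending in queue")?;

    registry.register(Box::new(pages_crawled.clone()))?;
    registry.register(Box::new(crawl_latency.clone()))?;
    registry.register(Box::new(queue_depth.clone()))?;

    // Record one fetch: time it, count it, and update the queue gauge.
    let timer = crawl_latency.start_timer();
    // ... fetch a page here ...
    timer.observe_duration();
    pages_crawled.inc();
    queue_depth.set(42);

    // Render the Prometheus text format (normally served at /metrics).
    let mut buf = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buf)?;
    print!("{}", String::from_utf8(buf)?);
    Ok(())
}
```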
Pre-configured dashboards in deploy/monitoring/grafana/dashboards/:
- Scrapix Overview - Crawl rates, error rates, latency percentiles
- Job Performance - Per-job metrics and progress tracking
- System Health - Resource usage, queue depths, worker status
Configured alerts in deploy/monitoring/prometheus/alerts.yml:
- ScrapixHighErrorRate - Error rate > 10% for 5 minutes
- ScrapixSlowCrawling - p99 latency > 30s for 10 minutes
- ScrapixQueueBacklog - Queue depth > 100k for 15 minutes
- ScrapixWorkerDown - Worker not responding
For large-scale analytics, Scrapix supports ClickHouse:
```rust
use scrapix_storage::{ClickHouseClient, ClickHouseConfig};
let client = ClickHouseClient::new(ClickHouseConfig {
url: "http://localhost:8123".to_string(),
database: "scrapix".to_string(),
..Default::default()
}).await?;
// Query domain statistics
let stats = client.get_domain_stats("example.com", Some(30)).await?;
// Query hourly aggregates
let hourly = client.get_hourly_stats(24).await?;
```

Two event tables back these queries:

- `crawl_events` - Every page fetch with status, latency, size
- `content_events` - Processed content with word counts, language
```
scrapix/
├── crates/                        # Library crates
│   ├── scrapix-core/              # Core types and traits
│   ├── scrapix-frontier/          # URL frontier with dedup
│   ├── scrapix-crawler/           # HTTP/browser fetching
│   ├── scrapix-parser/            # HTML parsing
│   ├── scrapix-extractor/         # Feature extraction
│   ├── scrapix-ai/                # AI enrichment
│   ├── scrapix-storage/           # Storage backends
│   ├── scrapix-queue/             # Message queue
│   └── scrapix-telemetry/         # Observability
│
├── bins/                          # Binary crates
│   ├── scrapix-api/               # REST API server
│   ├── scrapix-worker-crawler/    # Crawler worker
│   ├── scrapix-worker-content/    # Content processor
│   ├── scrapix-frontier-service/  # Frontier service
│   └── scrapix-cli/               # CLI tool
│
├── deploy/                        # Deployment configs
│   ├── kubernetes/                # K8s manifests
│   │   ├── base/                  # Base resources
│   │   └── overlays/              # Environment overrides
│   │       ├── local/             # Local development
│   │       └── prod/              # Production
│   └── monitoring/                # Prometheus/Grafana stack
│       ├── docker-compose.yml
│       ├── prometheus/            # Prometheus config + alerts
│       └── grafana/               # Dashboards + datasources
│
├── tests/                         # Integration tests
├── examples/                      # Example configurations
├── ARCHITECTURE.md                # Detailed architecture docs
└── docker-compose.yml             # Docker Compose stack
```
- [Architecture](ARCHITECTURE.md) - System design and tech decisions
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MIT