Name	Name	Last commit message	Last commit date
parent directory ..
data	data
src	src
tests/fixtures	tests/fixtures
Cargo.toml	Cargo.toml
README.md	README.md
stats.md	stats.md

v3 — YC Scraper & Parser

Pure Rust pipeline. Scrapes every Y Combinator company page via spider.cloud, parses markdown into structured data, stores it in SQLite.

5,723 companies. 11 minutes to scrape. 8 seconds to parse. Zero errors.

How It Works

YC Sitemap ──→ URL Queue ──→ spider.cloud ──→ Markdown ──→ 3-Pass Parser ──→ SQLite
  (XML)         5,723 urls    10 concurrent    streaming     Lexer            9 tables
                              retry+backoff    to DB         Clusterer        68 MB
                              11m 23s                        Extractors
                                                             8.2s / 18 cores

The 3-Pass Parser

The parser turns raw markdown into structured database rows in three stages:

Pass 1 — Lexer (blocks.rs). Every line becomes a typed block:

"### Stripe"                    → Heading { level: 3, text: "Stripe" }
"[Fintech](/companies/industry/Fintech)" → TagLink { tag: "Fintech" }
"Patrick Collison"              → Person { name, title, bio, links }
"Founded:2009"                  → MetaField { key: "Founded", value: "2009" }
"Active"                        → StatusLine("Active")

Person detection uses word-count heuristics (<=6 words), bare social link patterns ([](url)), and title keywords ("Founder", "CEO", "CTO"). Compiles 8 regex patterns once via LazyLock and shares them across threads.

Pass 2 — Clusterer (sections.rs). Groups blocks into named sections by structural transitions:

header → description → founders → news → jobs → footer_meta → launches

Transition detection: ### heading starts description, 3+ consecutive MetaFields start footer_meta, first Person block starts founders, external link followed by a date starts news, /jobs/ URL starts jobs.

Pass 3 — Extractors (extract/*.rs). One extractor per section type. Each pulls structured fields into DB rows — company info, founders with LinkedIn/Twitter, news with dates, job listings, meeting/scheduling links across 18 platforms.

The Scraper

Async tokio runtime with semaphore-bounded concurrency (10 concurrent). Each result streams to SQLite via mpsc channel the moment it arrives — no batch buffering. Retry with exponential backoff (2s → 4s → 8s) on 429/5xx errors.

Processing

Rayon par_iter in chunks of 500. On 18 cores, processes 5,583 pages in 8.2 seconds (~681 pages/sec). Wall time 8.2s vs CPU time 1m52s = ~14x parallelism efficiency.

Commands

export SPIDER_API_KEY="..."

cargo run -- init              # Fetch YC sitemap → URL queue
cargo run -- scrape            # Scrape all unvisited (streams to DB)
cargo run -- scrape -n 50      # Scrape 50 pages
cargo run -- process           # Parse all unprocessed markdown
cargo run -- run               # Scrape + process in one pipeline
cargo run -- run -n 100        # Pipeline 100 pages
cargo run -- overview          # Company table
cargo run -- overview --status Active --batch "Winter 2024" -n 20
cargo run -- stats             # Pipeline progress counters

Schema

Table	What	Rows
`pages`	URL queue	5,723
`page_data`	Raw markdown + HTTP status + latency	5,723
`company_sections`	Parsed sections per company	5,723
`companies`	Structured company data	5,723
`founders`	Name, title, bio, LinkedIn, Twitter	11,286
`news`	Articles with publication dates	4,241
`company_jobs`	Job listings with location/salary	3,470
`company_links`	External links with domain classification	42,779
`meeting_links`	Calendly, Cal.com, Motion, HubSpot links	441

Dependencies

Crate	Purpose
`spider-client`	spider.cloud API client
`tokio`	Async runtime for concurrent scraping
`rayon`	Data-parallel parsing across cores
`clap`	CLI argument parsing with derive macros
`rusqlite`	SQLite with bundled `libsqlite3`
`reqwest`	HTTP client (sitemap fetch)
`quick-xml`	XML parsing for YC sitemap
`regex`	Pattern matching in lexer + extractors
`serde` / `serde_json`	JSON deserialization of spider.cloud responses
`tracing`	Structured logging
`indicatif`	Progress bars for scrape/process
`chrono`	Date parsing
`anyhow`	Error handling

Project Structure

src/
├── main.rs                 CLI + pipeline orchestration + Rayon processing
├── db.rs                   Schema (9 tables), all queries, transactional writes
├── sitemap.rs              Sitemap fetch + XML parse + URL filtering
├── scraper.rs              spider.cloud client, mpsc streaming, retry/backoff
└── parser/
    ├── blocks.rs           Pass 1: line lexer (8 LazyLock regex, person detection)
    ├── sections.rs         Pass 2: structural section clustering
    └── extract/
        ├── mod.rs          Pass 3: orchestrator
        ├── company.rs      Name, batch, status, homepage, social links
        ├── founders.rs     Name, title, bio, LinkedIn, Twitter
        ├── news.rs         Articles with dates
        ├── jobs.rs         Listings with location/salary
        ├── links.rs        External links + domain classification
        └── meetings.rs     18 scheduling platform domains

Performance

Phase	Wall Time	CPU Time	Throughput
Scrape	11m 23s	—	8.2 pages/sec (10 concurrent)
Process	8.2s	1m 52s	681 pages/sec (18 cores, 14x parallel)

See stats.md for a full breakdown of the dataset.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

v3 — YC Scraper & Parser

How It Works

The 3-Pass Parser

The Scraper

Processing

Commands

Schema

Dependencies

Project Structure

Performance

FilesExpand file tree

v3

Directory actions

More options

Directory actions

More options

Latest commit

History

v3

Folders and files

parent directory

README.md

v3 — YC Scraper & Parser

How It Works

The 3-Pass Parser

The Scraper

Processing

Commands

Schema

Dependencies

Project Structure

Performance