This repository contains a Python-based web scraper that automates the extraction of business and professional data from Apollo.io. The scraper uses SeleniumBase (with Undetected Chrome) for browser automation and, instead of scraping HTML, intercepts responses from Apollo's API endpoints directly, making data capture faster and more reliable.
- Login Automation: Automatically logs in to Apollo.io with your credentials
- API Response Interception: Captures data directly from `https://app.apollo.io/api/v1/mixed_people/search` instead of scraping HTML
- Data Extraction: Extracts comprehensive contact information including:
  - Full Name, First Name, Last Name
  - Email Address
  - Company Name
  - Job Title
  - Location (City, State)
  - LinkedIn, Twitter, GitHub URLs
  - Phone Number
- Multi-Page Scraping: Configurable maximum number of pages to scrape via `config.json`
- Incremental Data Saving: Saves data after each page to prevent data loss
- Email & Phone Unlocking: Automatically unlocks email addresses and phone numbers for contacts
- Flexible Output Formats: Save data as JSON, CSV, or both (configurable)
- Anti-Detection: Uses SeleniumBase's Undetected Chrome mode to avoid bot detection
- Captcha Handling: Automatic captcha solving with 2Captcha integration and a manual fallback
- Configuration-Based: All settings are stored in `config.json` for easy management
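To illustrate the extraction step: once a `mixed_people/search` response body has been captured, a helper along these lines can map it onto the fields listed above. The function name and payload shape assumed here (a top-level `"people"` list with flat keys) are illustrative, not Apollo's documented schema — adjust to the actual response.

```python
def extract_contacts(api_response, page):
    """Map one intercepted /api/v1/mixed_people/search response onto
    the contact fields listed above. The assumed payload shape is an
    illustration; missing fields fall back to "NA"."""
    contacts = []
    for person in api_response.get("people", []):
        org = person.get("organization") or {}
        # Build "City, State", skipping whichever part is missing.
        location = ", ".join(
            part for part in (person.get("city"), person.get("state")) if part
        )
        contacts.append({
            "name": person.get("name", "NA"),
            "first_name": person.get("first_name", "NA"),
            "last_name": person.get("last_name", "NA"),
            "email": person.get("email", "NA"),
            "company": org.get("name", "NA"),
            "job_title": person.get("title", "NA"),
            "location": location or "NA",
            "linkedin_url": person.get("linkedin_url", "NA"),
            "twitter_url": person.get("twitter_url", "NA"),
            "github_url": person.get("github_url", "NA"),
            "phone_number": person.get("phone", "NA"),
            "page": page,
        })
    return contacts
```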
1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/apollo-scraper.git
   cd apollo-scraper
   ```

2. Create a virtual environment (recommended):

   ```bash
   python -m venv venv
   # On Windows
   venv\Scripts\activate
   # On Linux/Mac
   source venv/bin/activate
   ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   pip install seleniumbase  # SeleniumBase is required but not in requirements.txt
   ```

4. Configure the scraper:
   - Copy `config.json.example` to `config.json`:

     ```bash
     cp config.json.example config.json
     ```

   - Update `config.json` with your credentials:

     ```json
     {
       "credentials": {
         "email": "your-email@example.com",
         "password": "your-password",
         "two_captcha_api_key": "your-2captcha-api-key"
       },
       "urls": {
         "login_url": "https://app.apollo.io/#/login?locale=en",
         "saved_link_list": "your-apollo-saved-list-url"
       },
       "scraping": {
         "max_pages": 10,
         "output_format": "both"
       }
     }
     ```

   Note: The `two_captcha_api_key` is optional. If it is not provided, the scraper will prompt for manual captcha solving when needed.
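Loading and validating this file takes only a few lines. The helper below is a sketch (the function name is an assumption); the required sections and defaults mirror the example config above:

```python
import json

def load_config(path="config.json"):
    """Load the scraper configuration and check for the sections the
    example config defines. Defaults match the values documented in
    this README (max_pages: 10, output_format: "both")."""
    with open(path) as f:
        cfg = json.load(f)
    for section in ("credentials", "urls", "scraping"):
        if section not in cfg:
            raise KeyError(f"config.json is missing required section: {section}")
    cfg["scraping"].setdefault("max_pages", 10)
    cfg["scraping"].setdefault("output_format", "both")
    return cfg
```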
1. Ensure `config.json` is properly configured with your Apollo.io credentials and target URL.

2. Run the scraper:

   ```bash
   python main.py
   ```

3. The script will:
   - Log in to Apollo.io
   - Navigate to your saved list
   - Wait for each page to load and intercept the API response
   - Extract contact data from the API response
   - Automatically unlock email addresses and phone numbers for each contact
   - Save data incrementally to the output file(s) after each page
   - Continue to the next page until `max_pages` is reached or no more pages are available
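The incremental-save step above can be sketched as follows (the helper name is an assumption): after each page, newly extracted contacts are merged into the existing output file, so an interrupted run loses at most the current page.

```python
import json
import os

def save_incremental(contacts, path="apollo_data.json"):
    """Append one page of contacts to the JSON output file, creating
    it on the first call. Returns the total number of contacts saved
    so far. Sketch of the incremental-save behavior described above."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = json.load(f)
    existing.extend(contacts)
    with open(path, "w") as f:
        json.dump(existing, f, indent=2)
    return len(existing)
```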
The scraper can save data in multiple formats based on the `output_format` setting in `config.json`:

- `"json"`: Saves data to `apollo_data.json` only
- `"csv"`: Saves data to `apollo_data.csv` only (a JSON file is used temporarily, then deleted)
- `"both"`: Saves data to both `apollo_data.json` and `apollo_data.csv` (default)
Each contact entry contains:
```json
{
  "name": "John Doe",
  "first_name": "John",
  "last_name": "Doe",
  "email": "john.doe@example.com",
  "company": "Example Corp",
  "job_title": "Software Engineer",
  "location": "San Francisco, CA",
  "linkedin_url": "https://linkedin.com/in/johndoe",
  "twitter_url": "NA",
  "github_url": "NA",
  "phone_number": "+1234567890",
  "page": 1
}
```

All configuration is managed through `config.json`:
- Credentials: Email, password, and 2Captcha API key
- URLs: Login URL and target saved list URL
- Selectors: CSS selectors and XPaths for page elements
- Timeouts: Page load and default timeouts
- Scraping:
  - `max_pages`: Maximum number of pages to scrape (default: 10)
  - `output_format`: Output file format, `"json"`, `"csv"`, or `"both"` (default: `"both"`)
- Python 3.7+
- Chrome browser installed
- SeleniumBase (with Undetected Chrome support)
- BeautifulSoup4
- Valid Apollo.io account credentials
- The scraper uses Undetected Chrome mode to avoid detection
- Data is saved incrementally to prevent loss if the script is interrupted
- If automatic captcha solving fails, the script will prompt for manual solving
- The script waits for API responses rather than scraping HTML, making it more reliable