Simple Scraper API

Overview

The Simple Scraper API is a modular, extensible FastAPI application for scraping job listings (currently from Indeed.com) and saving results to CSV and Supabase. The architecture supports easy addition of new job board scrapers via a registry system.

Features

  • Scrapes job listings from Indeed.com (extensible to other boards)
  • Collects job title, company name, location, salary, benefits, description, employment type
  • Saves data as CSV and (optionally) uploads to Supabase
  • Extensible: add new scrapers by subclassing and registering
  • Modular codebase with clear separation of scraping, data handling, and database logic

Project Structure

SimpleScraper-API/
│
├── main.py                  # FastAPI entry point, uses scraper registry
├── scraper_registry.py      # Registry for all scraper classes
├── requirements.txt         # Dependencies
├── README.md                # Project documentation
├── Dockerfile               # Docker build file
├── docker-compose.yml       # Docker Compose config
├── .env                     # Environment variables
├── scrapers/                # Scraper classes for each job board
│   ├── __init__.py
│   ├── base_scraper.py      # Abstract base class for all scrapers
│   ├── indeed_scraper.py    # Indeed scraper implementation
├── utils/                   # Utility modules
│   ├── __init__.py
│   ├── csv_handler.py       # Utility for saving CSV files
│   ├── driver_utils.py      # Selenium driver setup utility
│   ├── scrape_utils.py      # Reusable scraping helpers
│   └── supabase_utils.py    # Supabase upload utility

Technologies Used

  • Backend Framework: FastAPI
  • Web Scraping: Selenium, BeautifulSoup
  • Data Processing: Pandas
  • Database: Supabase
  • Other Tools: WebDriver Manager, Uvicorn

Prerequisites

  • Python 3.10+
  • Google Chrome and ChromeDriver (managed automatically by WebDriver Manager)
  • A Supabase account with a database table set up:
    • Table Name: Job_listing
    • Columns:
      • POSITION
      • COMPANY NAME
      • LOCATION
      • SALARY
      • JOB LINK
      • BENEFITS
      • DESCRIPTION
      • EMPLOYMENT TYPE
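
The scraped fields map onto those columns one-to-one. A minimal sketch of the row shape (the build_job_record helper is illustrative; the real upload logic lives in utils/supabase_utils.py):

```python
# Illustrative sketch of the row shape the Job_listing table expects.
# The build_job_record helper is hypothetical, not the project's code.

def build_job_record(position, company, location, salary,
                     job_link, benefits, description, employment_type):
    """Map scraped fields onto the Job_listing column names."""
    return {
        "POSITION": position,
        "COMPANY NAME": company,
        "LOCATION": location,
        "SALARY": salary,
        "JOB LINK": job_link,
        "BENEFITS": benefits,
        "DESCRIPTION": description,
        "EMPLOYMENT TYPE": employment_type,
    }

# With the supabase-py client, such a record could be inserted roughly as:
#   from supabase import create_client
#   client = create_client(SUPER_BASE_URL, SUPER_BASE_KEY)
#   client.table("Job_listing").insert(record).execute()
```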

Setup and Installation

  1. Clone the repository:
git clone https://github.com/your-repo/job_scraper.git
cd job_scraper
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure Supabase (optional):
  • Edit .env or set environment variables for SUPER_BASE_URL and SUPER_BASE_KEY.
  • You can also pass credentials directly to upload_to_supabase in utils/supabase_utils.py.
  4. Run the API:
uvicorn main:app --reload
  5. Access the API docs: open http://127.0.0.1:8000/docs in your browser (FastAPI serves its interactive documentation there).

API Endpoints

/search_jobs (POST)

  • Parameters:
    • job_title (string, required): The job title to search for
    • location (string, required): The job location
  • Description: Scrapes jobs using all registered scrapers, saves results to CSV, and uploads to Supabase
  • Response:
    {
      "results": [
        {
          "scraper": "IndeedJobScraper",
          "uploaded_data": [ ... ]
        },
        {
          "scraper": "OtherScraper",
          "uploaded_data": [ ... ]
        }
      ]
    }
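
Once the server is running, the endpoint can be exercised from Python. A minimal stdlib-only sketch; it assumes job_title and location are sent as query parameters (adjust if your main.py reads them from a JSON body):

```python
import json
import urllib.parse
import urllib.request

# Local Uvicorn default; change host/port to match your deployment.
API_URL = "http://127.0.0.1:8000/search_jobs"

def search_jobs(job_title: str, location: str) -> dict:
    """POST to /search_jobs, assuming the parameters are query parameters."""
    query = urllib.parse.urlencode({"job_title": job_title, "location": location})
    request = urllib.request.Request(f"{API_URL}?{query}", method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (requires the API to be running):
#   data = search_jobs("Data Analyst", "Remote")
#   for entry in data["results"]:
#       print(entry["scraper"], len(entry["uploaded_data"]))
```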

Extending: Adding a New Scraper

To add support for another job board:

  1. Create a new class that inherits from BaseJobScraper and implements get_job_links and extract_job_details.
  2. Add an instance of your new scraper to the SCRAPER_REGISTRY list in scraper_registry.py.
  3. Your scraper will automatically be used by the API.

Example:

# my_scraper.py
from scrapers.base_scraper import BaseJobScraper

class MyJobBoardScraper(BaseJobScraper):
    def get_job_links(self, job_title, location):
        # ...implementation: return a list of job posting URLs...
        raise NotImplementedError

    def extract_job_details(self, job_links, file_path):
        # ...implementation: scrape each link and save rows to file_path...
        raise NotImplementedError

# scraper_registry.py
from scrapers.indeed_scraper import IndeedJobScraper
from my_scraper import MyJobBoardScraper

SCRAPER_REGISTRY = [IndeedJobScraper(), MyJobBoardScraper()]
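
Behind the scenes, the API can simply loop over SCRAPER_REGISTRY. The sketch below illustrates that pattern with a self-contained dummy scraper; run_all_scrapers is a hypothetical stand-in for the logic in main.py, not the project's exact code:

```python
from abc import ABC, abstractmethod

# Mirrors the abstract interface described above.
class BaseJobScraper(ABC):
    @abstractmethod
    def get_job_links(self, job_title, location): ...

    @abstractmethod
    def extract_job_details(self, job_links, file_path): ...

# Self-contained dummy scraper standing in for IndeedJobScraper etc.
class DemoScraper(BaseJobScraper):
    def get_job_links(self, job_title, location):
        return [f"https://example.com/{job_title}/{location}/1"]

    def extract_job_details(self, job_links, file_path):
        return [{"POSITION": "Demo", "JOB LINK": link} for link in job_links]

SCRAPER_REGISTRY = [DemoScraper()]

def run_all_scrapers(job_title, location, file_path="output.csv"):
    """Run every registered scraper and collect results in the API's shape."""
    results = []
    for scraper in SCRAPER_REGISTRY:
        links = scraper.get_job_links(job_title, location)
        jobs = scraper.extract_job_details(links, file_path)
        results.append({"scraper": type(scraper).__name__, "uploaded_data": jobs})
    return {"results": results}
```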

CSV Output

The job data is saved locally as output.csv with the following structure:

POSITION        COMPANY NAME    LOCATION        SALARY      JOB LINK        BENEFITS        DESCRIPTION     EMPLOYMENT TYPE
Data Analyst    TechCorp        Remote, USA     $80,000     https://job-link.com/1  Health, 401k    Job details     Full-time
Junior Analyst  BizSolutions    New York, USA   $50,000     https://job-link.com/2  None            Job details     Part-time
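
To work with the saved file programmatically, any CSV reader will do. A minimal sketch using the standard library's csv module (a StringIO sample stands in for output.csv here; pandas.read_csv works just as well):

```python
import csv
import io

# A sample with the same columns output.csv uses; in practice,
# replace the StringIO with open("output.csv", newline="").
sample = io.StringIO(
    "POSITION,COMPANY NAME,LOCATION,SALARY,JOB LINK,BENEFITS,DESCRIPTION,EMPLOYMENT TYPE\n"
    'Data Analyst,TechCorp,"Remote, USA","$80,000",https://job-link.com/1,'
    '"Health, 401k",Job details,Full-time\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]["POSITION"])  # Data Analyst
```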

Error Handling

  • If no job links are found, the API returns:
{
  "detail": "No job links found."
}
  • If scraping fails or data extraction is incomplete:
{
  "detail": "No job data could be extracted."
}
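
In FastAPI, responses like these are typically produced by raising HTTPException with the matching detail string. The decision logic can be sketched as follows (the validate_scrape helper is illustrative, not the project's exact code):

```python
# Sketch of the error-handling decision logic. In main.py this would
# typically be expressed by raising fastapi.HTTPException(status_code=404,
# detail=...); the validate_scrape helper below is hypothetical.

def validate_scrape(job_links, job_data):
    """Return the error payload the API documents, or None when scraping succeeded."""
    if not job_links:
        return {"detail": "No job links found."}
    if not job_data:
        return {"detail": "No job data could be extracted."}
    return None  # no error; proceed with the normal results response
```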

Deployment

Local Deployment:

Run the app locally with Uvicorn:

uvicorn main:app --host 0.0.0.0 --port 8000

Docker Deployment:

The repository includes a Dockerfile with the following content:

FROM python:3.10-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run the Docker image:

docker build -t job_scraper .
docker run -p 8000:8000 job_scraper
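
The project structure also lists a docker-compose.yml. A minimal sketch of what such a file might contain (the service name and env wiring are assumptions, not the repository's actual file):

```yaml
# Hypothetical docker-compose.yml sketch; service name and env_file
# wiring are assumptions, not the repository's actual configuration.
services:
  scraper:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
```

With this in place, `docker compose up --build` replaces the two manual docker commands above.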

Future Improvements

  • Add support for other job boards (e.g., LinkedIn, Glassdoor) by creating new scraper classes
  • Implement user authentication for secure access
  • Schedule automated scraping tasks using a job scheduler like Celery
  • Optimize scraping logic to handle large-scale data efficiently

Author

Joe - JoeHardey@proton.me

License

This project is licensed under the MIT License.
