Simple Scraper API

Overview

The Simple Scraper API is a modular, extensible FastAPI application for scraping job listings (currently from Indeed.com) and saving results to CSV and Supabase. The architecture supports easy addition of new job board scrapers via a registry system.

Features

  • Scrapes job listings from Indeed.com (extensible to other boards)
  • Collects job title, company name, location, salary, benefits, description, employment type
  • Saves data as CSV and (optionally) uploads to Supabase
  • Extensible: add new scrapers by subclassing and registering
  • Modular codebase with clear separation of scraping, data handling, and database logic

Project Structure

SimpleScraper-API/
│
├── main.py                  # FastAPI entry point, uses scraper registry
├── scraper_registry.py      # Registry for all scraper classes
├── requirements.txt         # Dependencies
├── README.md                # Project documentation
├── Dockerfile               # Docker build file
├── docker-compose.yml       # Docker Compose config
├── .env                     # Environment variables
├── scrapers/                # Scraper classes for each job board
│   ├── __init__.py
│   ├── base_scraper.py      # Abstract base class for all scrapers
│   ├── indeed_scraper.py    # Indeed scraper implementation
├── utils/                   # Utility modules
│   ├── __init__.py
│   ├── csv_handler.py       # Utility for saving CSV files
│   ├── driver_utils.py      # Selenium driver setup utility
│   ├── scrape_utils.py      # Reusable scraping helpers
│   └── supabase_utils.py    # Supabase upload utility

Technologies Used

  • Backend Framework: FastAPI
  • Web Scraping: Selenium, BeautifulSoup
  • Data Processing: Pandas
  • Database: Supabase
  • Other Tools: WebDriver Manager, Uvicorn

Prerequisites

  • Python 3.10+
  • Google Chrome and ChromeDriver (managed automatically by WebDriver Manager)
  • A Supabase account with a database table set up:
    • Table Name: Job_listing
    • Columns:
      • POSITION
      • COMPANY NAME
      • LOCATION
      • SALARY
      • JOB LINK
      • BENEFITS
      • DESCRIPTION
      • EMPLOYMENT TYPE
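
The scraped fields map onto those columns one-to-one. A minimal sketch of the row shape (the build_job_record helper is illustrative; the real upload logic lives in utils/supabase_utils.py):

```python
# Illustrative sketch of the row shape the Job_listing table expects.
# The build_job_record helper is hypothetical, not the project's code.

def build_job_record(position, company, location, salary,
                     job_link, benefits, description, employment_type):
    """Map scraped fields onto the Job_listing column names."""
    return {
        "POSITION": position,
        "COMPANY NAME": company,
        "LOCATION": location,
        "SALARY": salary,
        "JOB LINK": job_link,
        "BENEFITS": benefits,
        "DESCRIPTION": description,
        "EMPLOYMENT TYPE": employment_type,
    }

# With the supabase-py client, such a record could be inserted roughly as:
#   from supabase import create_client
#   client = create_client(SUPER_BASE_URL, SUPER_BASE_KEY)
#   client.table("Job_listing").insert(record).execute()
```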

Setup and Installation

  1. Clone the repository:
git clone https://github.com/your-repo/job_scraper.git
cd job_scraper
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure Supabase (optional):
  • Edit .env or set environment variables for SUPER_BASE_URL and SUPER_BASE_KEY.
  • You can also pass credentials directly to upload_to_supabase in utils/supabase_utils.py.
  4. Run the API:
uvicorn main:app --reload
  5. Access the API docs: open http://127.0.0.1:8000/docs in your browser (FastAPI serves its interactive documentation there).

API Endpoints

/search_jobs (POST)

  • Parameters:
    • job_title (string, required): The job title to search for
    • location (string, required): The job location
  • Description: Scrapes jobs using all registered scrapers, saves results to CSV, and uploads to Supabase
  • Response:
    {
      "results": [
        {
          "scraper": "IndeedJobScraper",
          "uploaded_data": [ ... ]
        },
        {
          "scraper": "OtherScraper",
          "uploaded_data": [ ... ]
        }
      ]
    }
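
Once the server is running, the endpoint can be exercised from Python. A minimal stdlib-only sketch; it assumes job_title and location are sent as query parameters (adjust if your main.py reads them from a JSON body):

```python
import json
import urllib.parse
import urllib.request

# Local Uvicorn default; change host/port to match your deployment.
API_URL = "http://127.0.0.1:8000/search_jobs"

def search_jobs(job_title: str, location: str) -> dict:
    """POST to /search_jobs, assuming the parameters are query parameters."""
    query = urllib.parse.urlencode({"job_title": job_title, "location": location})
    request = urllib.request.Request(f"{API_URL}?{query}", method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example (requires the API to be running):
#   data = search_jobs("Data Analyst", "Remote")
#   for entry in data["results"]:
#       print(entry["scraper"], len(entry["uploaded_data"]))
```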

Extending: Adding a New Scraper

To add support for another job board:

  1. Create a new class that inherits from BaseJobScraper and implements get_job_links and extract_job_details.
  2. Add an instance of your new scraper to the SCRAPER_REGISTRY list in scraper_registry.py.
  3. Your scraper will automatically be used by the API.

Example:

# my_scraper.py
from scrapers.base_scraper import BaseJobScraper

class MyJobBoardScraper(BaseJobScraper):
    def get_job_links(self, job_title, location):
        # ...implementation: return a list of job posting URLs...
        raise NotImplementedError

    def extract_job_details(self, job_links, file_path):
        # ...implementation: scrape each link and save rows to file_path...
        raise NotImplementedError

# scraper_registry.py
from scrapers.indeed_scraper import IndeedJobScraper
from my_scraper import MyJobBoardScraper

SCRAPER_REGISTRY = [IndeedJobScraper(), MyJobBoardScraper()]
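
Behind the scenes, the API can simply loop over SCRAPER_REGISTRY. The sketch below illustrates that pattern with a self-contained dummy scraper; run_all_scrapers is a hypothetical stand-in for the logic in main.py, not the project's exact code:

```python
from abc import ABC, abstractmethod

# Mirrors the abstract interface described above.
class BaseJobScraper(ABC):
    @abstractmethod
    def get_job_links(self, job_title, location): ...

    @abstractmethod
    def extract_job_details(self, job_links, file_path): ...

# Self-contained dummy scraper standing in for IndeedJobScraper etc.
class DemoScraper(BaseJobScraper):
    def get_job_links(self, job_title, location):
        return [f"https://example.com/{job_title}/{location}/1"]

    def extract_job_details(self, job_links, file_path):
        return [{"POSITION": "Demo", "JOB LINK": link} for link in job_links]

SCRAPER_REGISTRY = [DemoScraper()]

def run_all_scrapers(job_title, location, file_path="output.csv"):
    """Run every registered scraper and collect results in the API's shape."""
    results = []
    for scraper in SCRAPER_REGISTRY:
        links = scraper.get_job_links(job_title, location)
        jobs = scraper.extract_job_details(links, file_path)
        results.append({"scraper": type(scraper).__name__, "uploaded_data": jobs})
    return {"results": results}
```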

CSV Output

The job data is saved locally as output.csv with the following structure:

POSITION        COMPANY NAME    LOCATION        SALARY      JOB LINK        BENEFITS        DESCRIPTION     EMPLOYMENT TYPE
Data Analyst    TechCorp        Remote, USA     $80,000     https://job-link.com/1  Health, 401k    Job details     Full-time
Junior Analyst  BizSolutions    New York, USA   $50,000     https://job-link.com/2  None            Job details     Part-time
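
To work with the saved file programmatically, any CSV reader will do. A minimal sketch using the standard library's csv module (a StringIO sample stands in for output.csv here; pandas.read_csv works just as well):

```python
import csv
import io

# A sample with the same columns output.csv uses; in practice,
# replace the StringIO with open("output.csv", newline="").
sample = io.StringIO(
    "POSITION,COMPANY NAME,LOCATION,SALARY,JOB LINK,BENEFITS,DESCRIPTION,EMPLOYMENT TYPE\n"
    'Data Analyst,TechCorp,"Remote, USA","$80,000",https://job-link.com/1,'
    '"Health, 401k",Job details,Full-time\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]["POSITION"])  # Data Analyst
```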

Error Handling

  • If no job links are found, the API returns:
{
  "detail": "No job links found."
}
  • If scraping fails or data extraction is incomplete:
{
  "detail": "No job data could be extracted."
}
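
In FastAPI, responses like these are typically produced by raising HTTPException with the matching detail string. The decision logic can be sketched as follows (the validate_scrape helper is illustrative, not the project's exact code):

```python
# Sketch of the error-handling decision logic. In main.py this would
# typically be expressed by raising fastapi.HTTPException(status_code=404,
# detail=...); the validate_scrape helper below is hypothetical.

def validate_scrape(job_links, job_data):
    """Return the error payload the API documents, or None when scraping succeeded."""
    if not job_links:
        return {"detail": "No job links found."}
    if not job_data:
        return {"detail": "No job data could be extracted."}
    return None  # no error; proceed with the normal results response
```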

Deployment

Local Deployment:

Run the app locally with Uvicorn:

uvicorn main:app --host 0.0.0.0 --port 8000

Docker Deployment:

The repository includes a Dockerfile with the following content:

FROM python:3.10-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build and run the Docker image:

docker build -t job_scraper .
docker run -p 8000:8000 job_scraper
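
The project structure also lists a docker-compose.yml. A minimal sketch of what such a file might contain (the service name and env wiring are assumptions, not the repository's actual file):

```yaml
# Hypothetical docker-compose.yml sketch; service name and env_file
# wiring are assumptions, not the repository's actual configuration.
services:
  scraper:
    build: .
    ports:
      - "8000:8000"
    env_file:
      - .env
```

With this in place, `docker compose up --build` replaces the two manual docker commands above.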

Future Improvements

  • Add support for other job boards (e.g., LinkedIn, Glassdoor) by creating new scraper classes
  • Implement user authentication for secure access
  • Schedule automated scraping tasks using a job scheduler like Celery
  • Optimize scraping logic to handle large-scale data efficiently

Author

Joe - JoeHardey@proton.me

License

This project is licensed under the MIT License.
