Distributed Scraping

Welcome to the Distributed Scrapy Project! This project demonstrates a scalable, distributed web scraping solution built on Scrapy, a Python framework for extracting data from websites. It uses Docker for containerization, PostgreSQL for task queue management, and AWS ECS for container orchestration.

Overview

At Stackadoc, we specialize in implementing machine learning solutions across various domains. The accuracy of our predictions relies heavily on the quality and quantity of our data, which we extract from the internet using Scrapy spiders. As our data gathering needs grew, we hit the limits of scaling a single machine vertically. To overcome this, we adopted a distributed architecture that scales our scraping tasks horizontally, which shortened data collection times and made the system more robust.

Key Components

Python Application: Core business logic written in Python, emphasizing clean code and modularity.

Poetry: Dependency management and packaging made easy with Poetry, ensuring consistent environments and straightforward dependency resolution.

Docker: Containerization with Docker, so that development, testing, and deployment behave consistently across environments.

Terraform: Infrastructure as Code (IaC) for provisioning and managing the cloud infrastructure the project runs on.

SQLAlchemy: Database access and manipulation using SQLAlchemy, providing a high-level ORM and direct SQL access for efficient data handling.

PostgreSQL: The relational database of choice, used here for task queue management and known for its reliability, rich feature set, and performance.
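
To make the idea of a database-backed task queue more concrete, here is a minimal sketch of how workers could claim tasks from a shared table. It is not the project's actual schema: the `tasks` table, its columns, and the function names are hypothetical, and SQLite from the Python standard library stands in for PostgreSQL so the snippet runs without a server. On PostgreSQL, the claim step is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED` so that concurrent workers neither block on nor double-claim the same row.

```python
import sqlite3


def create_queue(conn, urls):
    """Create a hypothetical task table and enqueue one pending task per URL."""
    conn.execute(
        "CREATE TABLE tasks ("
        "id INTEGER PRIMARY KEY, "
        "url TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.executemany("INSERT INTO tasks (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


def claim_next_task(conn):
    """Mark the oldest pending task as in progress and return (id, url).

    Returns None when no pending task is left.
    """
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, url FROM tasks "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE tasks SET status = 'in_progress' WHERE id = ?", (row[0],)
        )
        return row


def complete_task(conn, task_id):
    """Mark a claimed task as done."""
    with conn:
        conn.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```

In this sketch, each scraping worker would call `claim_next_task` in a loop, run the spider for the returned URL, and then call `complete_task`. Adding workers increases throughput without any coordination beyond the database itself.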

Getting Started

Prerequisites

Docker
Python 3.10 or newer
Poetry
Terraform
Access to a PostgreSQL server (either locally or a hosted instance)

Setup

Clone the repository:

git clone https://github.com/yourusername/ProjectName.git
cd ProjectName

Install dependencies with Poetry:

poetry install

Environment Variables

Copy .env.example to .env and fill in your PostgreSQL credentials and any other required environment variables.
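
As a rough illustration of how an application might turn these settings into a database connection URL, consider the sketch below. The variable names (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`) and the helper `build_database_url` are assumptions made for the example; use whatever names .env.example actually defines.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment-style settings.

    The variable names are hypothetical; host and port fall back to
    common local defaults when they are not set.
    """
    user = env["POSTGRES_USER"]
    # Percent-encode the password so characters such as "@" or "/" do not
    # break the URL structure.
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    database = env["POSTGRES_DB"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"
```

Failing with a `KeyError` on a missing required variable is deliberate in this sketch: a misconfigured worker stops at startup rather than midway through a crawl.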

Running Locally

Using Docker Compose, you can spin up the application and the required databases for local development:

docker-compose up --build

Database Migration

To create or migrate your database schema, run:

poetry run alembic upgrade head

Deploy with Terraform

To provision your infrastructure on the cloud, navigate to the infrastructure directory:

cd infrastructure

Initialize Terraform:

terraform init

Apply the configuration (you may first need to configure your cloud provider credentials):

terraform apply

Contributing

Your contributions are welcome! Whether it's improving the code, fixing bugs, or enhancing documentation, we value your help. Please feel free to fork the repository, make your changes, and submit a pull request.

License

ProjectName is released under a GNU license; see the LICENSE file for the exact terms.

Acknowledgments

This work is a result of collaborative efforts from the Stackadoc team, aimed at pushing the boundaries of data collection for machine learning. We hope that sharing our journey and solution will benefit others facing similar scaling challenges.

Happy Scraping!
