Modular Reddit data collection framework
Scrape subreddits, posts, and users into clean structured JSON.
A modular Reddit scraping pipeline designed for data collection, analytics, and research workflows.
The project gathers structured data about:
- 📚 Subreddits
- 📝 Posts
- 👤 Users
and exports everything as clean JSON datasets ready for:
- databases
- machine learning pipelines
- analytics
- data exploration
No manual scraping steps required.
- Modular scraper architecture
- Structured JSON output
- Automated scraping workflow
- MongoDB import helpers
- Large dataset handling utilities
- Environment-based configuration
| Entity | Data |
|---|---|
| Subreddits | metadata & statistics |
| Posts | content, scores, engagement |
| Users | profile & activity info |
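For illustration, a single scraped post record might look like the following. The field names here are hypothetical; the actual schema is defined by the scraper modules and may differ:

```python
import json

# Hypothetical shape of one scraped post record; the real field
# names are determined by posts.py and may differ.
post = {
    "id": "t3_abc123",
    "subreddit": "python",
    "title": "Example post title",
    "author": "some_user",
    "score": 1532,
    "num_comments": 87,
    "created_utc": 1700000000,
}

print(json.dumps(post, indent=2))
```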
```
run.py
│
├── subreddits.py
├── posts.py
└── users.py
        ↓
  JSON datasets
        ↓
(optional) MongoDB import
```
Each scraper is independent and reusable.
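The flow above can be sketched as a minimal orchestrator. The `scrape_*` functions below are stand-ins for the real modules (`subreddits.py`, `posts.py`, `users.py`), and the file layout is an assumption for illustration:

```python
import json
import tempfile
from pathlib import Path

# Stand-ins for the real scraper modules; each returns a list of dicts.
def scrape_subreddits():
    return [{"name": "python", "subscribers": 1_000_000}]

def scrape_posts():
    return [{"id": "t3_abc123", "subreddit": "python", "score": 42}]

def scrape_users():
    return [{"username": "some_user", "karma": 9000}]

def run_pipeline(out_dir: Path) -> list[Path]:
    """Run each scraper independently and export its results as JSON."""
    stages = {
        "subreddits": scrape_subreddits,
        "posts": scrape_posts,
        "users": scrape_users,
    }
    written = []
    for name, scraper in stages.items():
        path = out_dir / f"{name}.json"
        path.write_text(json.dumps(scraper(), indent=2))
        written.append(path)
    return written

out_dir = Path(tempfile.mkdtemp())
for path in run_pipeline(out_dir):
    print(path.name)
```

Because each stage is just a function returning records, any scraper can be imported and run on its own, which is what makes the modules independently reusable.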
```bash
git clone https://github.com/glowfi/reddit-scraper
cd reddit-scraper
python -m venv env
source env/bin/activate    # Linux / macOS
# env\Scripts\activate     # Windows
pip install -r requirements.txt
```

Edit `env-sample` and rename it to `.env`:
```ini
username=<RedditUsername>
password=<RedditPassword>
client_id=<ClientID>
client_secret=<ClientSecret>
TOTAL_SUBREDDITS_PER_TOPICS=6
SUBREDDIT_SORT_FILTER="hot"
POSTS_PER_SUBREDDIT=10
POSTS_SORT_FILTER="new"
```

Create Reddit API credentials here:
👉 https://www.reddit.com/prefs/apps
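The settings above can be read at runtime roughly like this. This is a sketch using `os.environ` with hypothetical defaults; the project may instead use a dotenv loader:

```python
import os

# Simulate values normally loaded from .env (hypothetical defaults).
os.environ.setdefault("TOTAL_SUBREDDITS_PER_TOPICS", "6")
os.environ.setdefault("POSTS_PER_SUBREDDIT", "10")
os.environ.setdefault("SUBREDDIT_SORT_FILTER", "hot")

config = {
    # Numeric limits arrive as strings and must be cast.
    "subreddits_per_topic": int(os.environ["TOTAL_SUBREDDITS_PER_TOPICS"]),
    "posts_per_subreddit": int(os.environ["POSTS_PER_SUBREDDIT"]),
    "subreddit_sort": os.environ["SUBREDDIT_SORT_FILTER"],
}
print(config)
```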
```bash
python run.py
```

Pipeline execution:
- Scrape subreddits
- Scrape posts
- Scrape users
- Export JSON datasets
- Optional dataset splitting
JSON files are large (16–25 MB). Download them instead of viewing them in the browser.
Sample: https://files.catbox.moe/r7a7um.json
Sample: https://files.catbox.moe/5cf2xw.json
Sample: https://files.catbox.moe/yp506n.json
```
reddit-scraper/
├── subreddits.py
├── posts.py
├── users.py
├── run.py
├── utils/
│   ├── split.py
│   └── import_data_to_mongodb.sh
└── output/
```
| Tool | Purpose |
|---|---|
| `run.py` | Executes the full scraping pipeline |
| `utils/split.py` | Splits large JSON datasets |
| `utils/import_data_to_mongodb.sh` | Bulk imports datasets into MongoDB |
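The exact interface of `utils/split.py` isn't shown here, but splitting a large JSON array into fixed-size chunks can be sketched as follows (function name and chunk naming scheme are assumptions):

```python
import json
import tempfile
from pathlib import Path

def split_json_array(src: Path, chunk_size: int) -> list[Path]:
    """Split a JSON array file into numbered files of at most chunk_size records."""
    records = json.loads(src.read_text())
    paths = []
    for i in range(0, len(records), chunk_size):
        part = src.with_name(f"{src.stem}_{i // chunk_size}.json")
        part.write_text(json.dumps(records[i:i + chunk_size]))
        paths.append(part)
    return paths

# Demo on a small synthetic dataset: 25 records split into chunks of 10.
src = Path(tempfile.mkdtemp()) / "posts.json"
src.write_text(json.dumps([{"id": n} for n in range(25)]))
parts = split_json_array(src, chunk_size=10)
print([p.name for p in parts])  # → ['posts_0.json', 'posts_1.json', 'posts_2.json']
```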
After scraping:

```bash
./utils/import_data_to_mongodb.sh
```

Ensure MongoDB is running beforehand.
- Reddit API rate limits apply
- Scraping speed depends on network/API limits
- Designed for research & data workflows
- Respect Reddit API terms of service
Contributions, improvements, and issue reports are welcome.
Small focused PRs are preferred.
GPL-3.0


