Reddit API Policy Change (November 2025): Reddit now requires pre-approval to create new API apps. If you already have API credentials, they still work normally. New users must apply at Reddit's API Request Form (expect 2-4 weeks for approval). Alternatively, you can use the GDPR Export Mode to create an index of your saved content without any API credentials. See Getting API Credentials for full details.
Reddit Stash is a Python script that backs up your Reddit saved, posted, and upvoted posts and comments to Dropbox, AWS S3, or your local machine. Using GitHub Actions, it runs every 3 hours during peak hours and twice during off-peak hours, automating the archival of your Reddit data after a simple setup.
Looking for search & AI chat? Check out reddit-stash-insights — a companion tool that adds semantic search and RAG chat to your Reddit archive.
When Reddit Stash runs successfully, your saved content is organized by subreddit in a clean folder structure and stored as markdown files:
reddit/
├── r_AskReddit/
│ ├── POST_abcd123.md # Your posted content
│ ├── COMMENT_efgh456.md # Your comments
│ ├── SAVED_POST_xyz789.md # Posts you saved
│ └── SAVED_COMMENT_def012.md # Comments you saved
├── r_ProgrammerHumor/
│ ├── UPVOTE_POST_ghi345.md # Posts you upvoted
│ ├── UPVOTE_COMMENT_mno901.md # Comments you upvoted
│ └── GDPR_POST_jkl678.md # From GDPR export (if enabled)
├── gdpr_data/ # GDPR CSV files (if processing enabled)
│ ├── saved_posts.csv
│ └── saved_comments.csv
└── file_log.json # Tracks processed items
Each post and comment is formatted with:
- Original title and content
- Author information
- Post/comment URL
- Timestamp
- Subreddit details
- Any images or links from the original post
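As a rough sketch of how one of these markdown files might be assembled (illustrative only; `build_markdown` and its field names are hypothetical stand-ins, not the script's actual API):

```python
from datetime import datetime, timezone

def build_markdown(item: dict) -> str:
    """Assemble one archive file in the layout listed above.

    `item` is a plain dict standing in for a Reddit submission;
    the script's real field names may differ.
    """
    created = datetime.fromtimestamp(item["created_utc"], tz=timezone.utc)
    lines = [
        f"# {item['title']}",
        "",
        f"- Author: u/{item['author']}",
        f"- Subreddit: r/{item['subreddit']}",
        f"- URL: https://reddit.com{item['permalink']}",
        f"- Created: {created.isoformat()}",
        "",
        item["body"],  # original text, including any inline links
    ]
    return "\n".join(lines)

doc = build_markdown({
    "title": "Example post",
    "author": "someone",
    "subreddit": "AskReddit",
    "permalink": "/r/AskReddit/comments/abcd123/example/",
    "created_utc": 1700000000,
    "body": "Post text with any images or links preserved as markdown.",
})
print(doc.splitlines()[0])  # # Example post
```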
When Reddit Stash completes processing, you'll receive detailed storage information:
Processing completed. 150 items processed, 25 items skipped.
Markdown file storage: 12.45 MB
Media file storage: 89.32 MB
Total combined storage: 101.77 MB
This gives you clear visibility into:
- Text content storage: How much space your saved posts and comments use
- Media storage: How much space downloaded images and videos use
- Total storage: Combined space used for your complete Reddit archive
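A minimal sketch of how such a summary can be computed by walking the archive directory (the media extension list is an assumption; the script's real accounting may differ):

```python
import tempfile
from pathlib import Path

# Assumed set of media extensions for illustration.
MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".webm", ".webp"}

def storage_summary(root) -> dict:
    """Sum file sizes under `root`, split into markdown vs. media."""
    totals = {"markdown": 0, "media": 0}
    for p in Path(root).rglob("*"):
        if not p.is_file():
            continue
        if p.suffix.lower() == ".md":
            totals["markdown"] += p.stat().st_size
        elif p.suffix.lower() in MEDIA_EXTS:
            totals["media"] += p.stat().st_size
    return totals

# Demo on a throwaway directory: one 100-byte .md, one 300-byte .png
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "POST_abc.md").write_text("x" * 100)
    (Path(tmp) / "img.png").write_bytes(b"\0" * 300)
    totals = storage_summary(tmp)

print(totals)  # {'markdown': 100, 'media': 300}
```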
- What You Get
- How It Works
- Quick Start
- Key Features
- Why Use Reddit Stash
- Setup
- Configuration
- Docker Environment Variables
- Alternative Scheduling: External Cron Setup
- Important Notes
- File Organization and Utilities
- Frequently Asked Questions
- Troubleshooting
- Security Considerations
- Contributing
- Acknowledgement
- Project Status
- License
graph LR
A[Reddit API] -->|Fetch Content| B[Reddit Stash Script]
B -->|Save as Markdown| C[Local Storage]
B -->|Check Settings| D{Save Type}
D -->|SAVED| E[Saved Posts/Comments]
D -->|ACTIVITY| F[User Posts/Comments]
D -->|UPVOTED| G[Upvoted Content]
D -->|ALL| H[All Content Types]
C -->|Optional| I[Cloud Upload]
I -->|Dropbox or S3| K[Cloud Storage]
J[GDPR Export] -->|Optional| B
1. Data Collection:
- The script connects to Reddit's API to fetch your saved, posted, or upvoted content
- Optionally, it can process your GDPR export data for a complete history
2. Processing & Organization:
- Content is processed based on your settings (SAVED, ACTIVITY, UPVOTED, or ALL)
- Files are organized by subreddit in a clean folder structure
- A log file tracks all processed items to avoid duplicates
3. Storage Options:
- Local storage: Content is saved as markdown files on your machine
- Cloud storage: Optional integration with Dropbox or AWS S3 for backup
4. Deployment Methods:
- GitHub Actions: Fully automated with scheduled runs and cloud storage integration (Dropbox or S3)
- Local Installation: Run manually or schedule with cron jobs on your machine
- Docker: Run in a containerized environment with optional volume mounts
The script is designed to be flexible, allowing you to choose how you collect, process, and store your Reddit content.
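The duplicate-avoidance step can be sketched like this (the real `file_log.json` schema is not documented here, so a simple JSON array of item IDs is assumed for illustration):

```python
import json
from pathlib import Path

def load_log(log_path: Path) -> set:
    """Return the set of already-processed item IDs (assumed schema)."""
    if log_path.exists():
        return set(json.loads(log_path.read_text()))
    return set()

def should_process(item_id: str, processed: set) -> bool:
    """Skip anything the log already records."""
    return item_id not in processed

# Demo: pretend "abcd123" was archived on a previous run.
processed = load_log(Path("file_log.json")) | {"abcd123"}
print(should_process("abcd123", processed))  # False: skip, already archived
print(should_process("xyz789", processed))
```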
For those who want to get up and running quickly, here's a streamlined process:
- Fork this repository.
- Set up the required secrets in your GitHub repository:
- From Reddit: `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USERNAME`, `REDDIT_PASSWORD`
- For Dropbox storage: `DROPBOX_APP_KEY`, `DROPBOX_APP_SECRET`, `DROPBOX_REFRESH_TOKEN`
- For S3 storage: `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_S3_BUCKET`, `STORAGE_PROVIDER` (set to `s3`)
- Manually trigger the workflow from the Actions tab.
- Clone the repository:
git clone https://github.com/YOUR_USERNAME/reddit-stash.git
cd reddit-stash
Replace `YOUR_USERNAME` with your GitHub username (or `rhnfzl` if using the original repository).
- Install dependencies:
pip install -r requirements.txt
- Set up your environment variables and run:
python reddit_stash.py
- Build the Docker image:
docker build -t reddit-stash .
- Run with your environment variables:
docker run -it \
  -e REDDIT_CLIENT_ID=your_client_id \
  -e REDDIT_CLIENT_SECRET=your_client_secret \
  -e REDDIT_USERNAME=your_username \
  -e REDDIT_PASSWORD=your_password \
  -v $(pwd)/reddit:/app/reddit \
  reddit-stash
Add cloud storage env vars as needed: Dropbox (`DROPBOX_APP_KEY`, `DROPBOX_APP_SECRET`, `DROPBOX_REFRESH_TOKEN`) or S3 (`STORAGE_PROVIDER=s3`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_S3_BUCKET`). See Setting Up Dropbox App or Setting Up AWS S3 for details.
For detailed setup instructions, continue reading the Setup section.
| Feature | GitHub Actions | Local Installation | Docker |
|---|---|---|---|
| Ease of Setup | ⭐⭐⭐ (Easiest) | ⭐⭐ | ⭐⭐ |
| Automation | ✅ Runs on schedule | ✅ Manual control or cron | ✅ Built-in scheduling support |
| Requirements | GitHub account | Python 3.10-3.12 | Docker |
| Data Storage | Dropbox or S3 | Local, Dropbox, or S3 | Local, Dropbox, or S3 |
| Maintenance | Minimal | More hands-on | Low to Medium |
| Privacy | Credentials in GitHub secrets | Credentials on local machine | Credentials in container |
| Best For | Set & forget users | Power users with customization needs | Containerized environments & flexible scheduling |
- 🤖 Automated Reddit Backup: Automatically retrieves your saved posts and comments from Reddit, and can also back up your own posts and comments if configured.
- 🔄 Flexible Storage Options: Choose what to archive (all activity or only saved items) via `settings.ini`.
- 📦 Cloud Storage Integration: Sync your archive to Dropbox or AWS S3 (with Glacier support for low-cost archival).
- 📝 Markdown Support: Saves the content as markdown files.
- 🔍 File Deduplication: Uses intelligent file existence checking to avoid re-downloading content.
- ⏱️ Rate Limit Management: Implements dynamic sleep timers to respect Reddit's API rate limits.
- 🔒 GDPR Data Processing: Optional processing of Reddit's GDPR export data.
- 🖼️ Enhanced Media Downloads: Download images, videos, and other media with dramatically improved success rates (~80% vs previous ~10%), featuring intelligent fallback systems and modern web compatibility.
- 🔄 Content Recovery System: 4-provider cascade for failed downloads (Wayback Machine, PullPush.io, Reddit Previews, Reveddit) with SQLite caching and automatic retry across runs.
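The rate-limit handling idea can be sketched as follows (a pacing heuristic for illustration, not the script's exact implementation; `remaining` and `reset_in` stand for the request budget and window-reset values Reddit's API reports):

```python
def dynamic_sleep(remaining: int, reset_in: float, floor: float = 0.0) -> float:
    """Delay before the next API call, pacing the remaining request
    budget evenly across the time left in the rate-limit window."""
    if remaining <= 0:
        return max(reset_in, floor)  # budget exhausted: wait out the window
    return max(reset_in / remaining, floor)

# 60 requests left and the window resets in 120 s -> one call every 2 s
print(dynamic_sleep(remaining=60, reset_in=120))  # 2.0
```

Pacing like this spreads requests across the window instead of bursting and then stalling at the limit.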
Reddit Stash was designed with specific use cases in mind:
Reddit only shows your most recent 1000 saved posts. With Reddit Stash, you can save everything and go beyond this limitation.
Many users save technical posts, tutorials, or valuable discussions on Reddit. Reddit Stash helps you build a searchable archive of this knowledge.
Reddit posts and comments can be deleted by users or moderation. Reddit Stash preserves this content in your personal archive.
All of your saved posts are available locally in markdown format, making them easily accessible even without an internet connection.
Since content is saved in markdown, you can easily import it into note-taking systems like Obsidian, Notion, or any markdown-compatible tool.
Beyond text, Reddit Stash can download and preserve images, videos, and other media from posts, ensuring you have complete archives even if external hosting services go offline.
- ✅ Python 3.10-3.12 (Python 3.12 recommended for best performance)
- 🔑 Reddit API credentials
- 📊 A cloud storage account (optional): Dropbox with API token, or an AWS account with S3 access
Before proceeding with any installation method, ensure that you have set the Reddit environment variables. Follow the Reddit API guide to create a Reddit app and obtain the necessary credentials.
Note: Cloud storage is optional. To use Dropbox, see the Dropbox App setup. To use AWS S3 instead, see Setting Up AWS S3. The GitHub Actions workflow runs the script every 3 hours during peak hours (6:00-21:00 UTC) and twice during off-peak hours (23:00 and 3:00 UTC), syncing files to your configured cloud storage. The workflow is defined in .github/workflows/reddit_scraper.yml.
- Fork this repository.
- Set Up Secrets:
- Go to your forked repository's Settings > Secrets and variables > Actions > Click on New repository secret.
- Add the following secrets individually:
- `REDDIT_CLIENT_ID`
- `REDDIT_CLIENT_SECRET`
- `REDDIT_USERNAME`
- `REDDIT_PASSWORD`
- For Dropbox cloud storage (optional; see Setting Up Dropbox App): `DROPBOX_APP_KEY`, `DROPBOX_APP_SECRET`, `DROPBOX_REFRESH_TOKEN`
- For S3 cloud storage (optional; see Setting Up AWS S3): `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_S3_BUCKET`, `STORAGE_PROVIDER=s3`
- For enhanced media downloads (optional; Imgur API registration is permanently closed): `IMGUR_CLIENT_ID`, `IMGUR_CLIENT_SECRET` (only if you already have an existing Imgur application)
- Enter the respective secret values without any quotes.
- Manually Trigger the Workflow:
- Go to the Actions tab > Select Reddit Stash Scraper from the list on the left > Click Run workflow > Select the `main` branch > Click the green Run workflow button.
- The workflow will then be triggered, and you can monitor its progress in the Actions tab. Upon successful completion, you should see the reddit folder in your configured cloud storage (Dropbox or S3) or in the Actions artifacts.
- The workflow runs automatically on a schedule:
- Every 3 hours during peak hours (6:00-21:00 UTC)
- Twice during off-peak hours (23:00 and 3:00 UTC)
- You can adjust these times in the workflow file to match your timezone if needed.
- Additional Workflows: The repository includes automated workflows for maintenance and testing:
  - `python-compatibility.yml`: Tests compatibility across Python versions 3.10-3.12
- Clone this repository:
git clone https://github.com/YOUR_USERNAME/reddit-stash.git
cd reddit-stash
Replace `YOUR_USERNAME` with your GitHub username (or `rhnfzl` if using the original repository).
- Install the required Python packages:
pip install -r requirements.txt
- Set up cloud storage (optional). Choose one:
- Dropbox: Follow Setting Up Dropbox App
- AWS S3: Follow Setting Up AWS S3
- Local only: Skip this step if you only want to save files locally on your system.
- Edit the `settings.ini` file; see the Configuration section below for the available options.
- Set Environment Variables (optional but preferred):
For macOS and Linux:
export REDDIT_CLIENT_ID='your_client_id'
export REDDIT_CLIENT_SECRET='your_client_secret'
export REDDIT_USERNAME='your_username'
export REDDIT_PASSWORD='your_password'
# Optional: Dropbox cloud storage
export DROPBOX_APP_KEY='your_dropbox_app_key'
export DROPBOX_APP_SECRET='your_dropbox_app_secret'
export DROPBOX_REFRESH_TOKEN='your_dropbox_refresh_token'
# Optional: AWS S3 cloud storage (instead of Dropbox)
export AWS_ACCESS_KEY_ID='your_access_key'
export AWS_SECRET_ACCESS_KEY='your_secret_key'
export AWS_S3_BUCKET='your-bucket-name'
export STORAGE_PROVIDER='s3'
# Optional, for enhanced Imgur downloading (if you have existing API access)
export IMGUR_CLIENT_ID='your_imgur_client_id'
export IMGUR_CLIENT_SECRET='your_imgur_client_secret'
For Windows:
set REDDIT_CLIENT_ID=your_client_id
set REDDIT_CLIENT_SECRET=your_client_secret
set REDDIT_USERNAME=your_username
set REDDIT_PASSWORD=your_password
REM Optional: Dropbox cloud storage
set DROPBOX_APP_KEY=your_dropbox_app_key
set DROPBOX_APP_SECRET=your_dropbox_app_secret
set DROPBOX_REFRESH_TOKEN=your_dropbox_refresh_token
REM Optional: AWS S3 cloud storage (instead of Dropbox)
set AWS_ACCESS_KEY_ID=your_access_key
set AWS_SECRET_ACCESS_KEY=your_secret_key
set AWS_S3_BUCKET=your-bucket-name
set STORAGE_PROVIDER=s3
REM Optional, for enhanced Imgur downloading (if you have existing API access)
set IMGUR_CLIENT_ID=your_imgur_client_id
set IMGUR_CLIENT_SECRET=your_imgur_client_secret
Note: do not quote the values on Windows cmd; `set` stores the quotes literally.
You can verify the setup with:
echo $REDDIT_CLIENT_ID
echo $REDDIT_CLIENT_SECRET
echo $REDDIT_USERNAME
echo $REDDIT_PASSWORD
# If using Dropbox:
echo $DROPBOX_APP_KEY
echo $DROPBOX_APP_SECRET
echo $DROPBOX_REFRESH_TOKEN
# If using S3:
echo $AWS_ACCESS_KEY_ID
echo $AWS_SECRET_ACCESS_KEY
echo $AWS_S3_BUCKET
echo $STORAGE_PROVIDER
# If using Imgur:
echo $IMGUR_CLIENT_ID
echo $IMGUR_CLIENT_SECRET
- Usage:
- First-time setup:
python reddit_stash.py
To upload to cloud storage (optional):
python storage_utils.py --upload    # Works with Dropbox or S3
python dropbox_utils.py --upload    # Dropbox-only (legacy)
- Subsequent runs, as per your convenience:
  - Download from cloud storage (optional):
    python storage_utils.py --download    # Works with Dropbox or S3
  - Process Reddit saved items:
    python reddit_stash.py
  - Upload to cloud storage (optional):
    python storage_utils.py --upload
🐳 Pre-built Images Available! No build required - pull and run directly from GitHub Container Registry.
📁 Important: File Storage Location
When running via Docker, files are downloaded to /app/reddit/ inside the container. Without a volume mount, these files are lost when the container stops.
To persist your downloaded Reddit content, you must mount a volume:
- Named Volume (recommended): `-v reddit-data:/app/reddit` - Docker manages storage; data persists across container restarts
- Bind Mount (direct access): `-v $(pwd)/reddit:/app/reddit` - Files stored directly on your host machine at `./reddit/`
- NAS Path: `-v /volume1/reddit-stash:/app/reddit` - Store on your NAS at a specific location
Pull pre-built multi-platform images (AMD64/ARM64) from GitHub Container Registry:
# Pull the latest stable image
docker pull ghcr.io/rhnfzl/reddit-stash:latest
# Run with your credentials
docker run -d \
--name reddit-stash \
-v reddit-data:/app/reddit \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
ghcr.io/rhnfzl/reddit-stash:latest
Available Image Tags:
| Tag | Description | Use Case |
|---|---|---|
| `latest` | Latest stable from main branch | Production deployments |
| `develop` | Development version | Testing new features |
| `py3.10-latest`, `py3.11-latest`, `py3.12-latest` | Python-specific versions | Specific Python requirements |
| `v1.0.0` | Semantic version tags | Version pinning |
| `sha-abc123` | Commit-specific builds | Reproducible deployments |
Platform Support:
- ✅ AMD64 (x86_64) - Standard x86 systems
- ✅ ARM64 - Raspberry Pi, ARM-based NAS devices
NAS/HomeLab Compatibility:
- Synology DSM (Container Manager)
- QNAP (Container Station)
- TrueNAS SCALE
- unRAID (Community Applications)
- OpenMediaVault (Docker plugin)
- Proxmox (LXC/Docker)
- Portainer, Yacht, CasaOS, Dockge
Periodic execution with pre-built image:
docker run -d \
--name reddit-stash \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e SCHEDULE_MODE='periodic' \
-e SCHEDULE_INTERVAL='7200' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
Docker Compose is the recommended deployment method for NAS and HomeLab environments. It provides easier management, configuration persistence, and works seamlessly with Portainer, Yacht, CasaOS, and other GUI tools.
1. Create docker-compose.yml
Basic Setup (Local-only with Periodic Execution):
version: '3.8'
services:
reddit-stash:
image: ghcr.io/rhnfzl/reddit-stash:latest
container_name: reddit-stash
restart: unless-stopped
environment:
- REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID}
- REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET}
- REDDIT_USERNAME=${REDDIT_USERNAME}
- REDDIT_PASSWORD=${REDDIT_PASSWORD}
- SCHEDULE_MODE=periodic
- SCHEDULE_INTERVAL=7200
volumes:
- reddit-data:/app/reddit
volumes:
reddit-data:
2. Create .env file
Create a .env file in the same directory as your docker-compose.yml:
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USERNAME=your_reddit_username
REDDIT_PASSWORD=your_reddit_password
3. Run with Docker Compose
# Start the service
docker-compose up -d
# View logs
docker-compose logs -f
# Stop the service
docker-compose down
# Update to latest image
docker-compose pull && docker-compose up -d
With Dropbox Sync:
version: '3.8'
services:
reddit-stash:
image: ghcr.io/rhnfzl/reddit-stash:latest
container_name: reddit-stash
restart: unless-stopped
environment:
- REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID}
- REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET}
- REDDIT_USERNAME=${REDDIT_USERNAME}
- REDDIT_PASSWORD=${REDDIT_PASSWORD}
- DROPBOX_APP_KEY=${DROPBOX_APP_KEY}
- DROPBOX_APP_SECRET=${DROPBOX_APP_SECRET}
- DROPBOX_REFRESH_TOKEN=${DROPBOX_REFRESH_TOKEN}
- SCHEDULE_MODE=periodic
- SCHEDULE_INTERVAL=7200
volumes:
- reddit-data:/app/reddit
volumes:
reddit-data:
Add to your .env:
DROPBOX_APP_KEY=your_dropbox_app_key
DROPBOX_APP_SECRET=your_dropbox_app_secret
DROPBOX_REFRESH_TOKEN=your_dropbox_refresh_token
With AWS S3 Sync:
version: '3.8'
services:
reddit-stash:
image: ghcr.io/rhnfzl/reddit-stash:latest
container_name: reddit-stash
restart: unless-stopped
environment:
- REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID}
- REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET}
- REDDIT_USERNAME=${REDDIT_USERNAME}
- REDDIT_PASSWORD=${REDDIT_PASSWORD}
- STORAGE_PROVIDER=s3
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_S3_BUCKET=${AWS_S3_BUCKET}
- AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION:-us-east-1}
- SCHEDULE_MODE=periodic
- SCHEDULE_INTERVAL=7200
volumes:
- reddit-data:/app/reddit
volumes:
reddit-data:
Add to your .env:
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_S3_BUCKET=your-bucket-name
AWS_DEFAULT_REGION=us-east-1
With Imgur API (for Better Rate Limits):
version: '3.8'
services:
reddit-stash:
image: ghcr.io/rhnfzl/reddit-stash:latest
container_name: reddit-stash
restart: unless-stopped
environment:
- REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID}
- REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET}
- REDDIT_USERNAME=${REDDIT_USERNAME}
- REDDIT_PASSWORD=${REDDIT_PASSWORD}
- IMGUR_CLIENT_ID=${IMGUR_CLIENT_ID}
- SCHEDULE_MODE=periodic
- SCHEDULE_INTERVAL=7200
volumes:
- reddit-data:/app/reddit
volumes:
reddit-data:
Add to your .env:
IMGUR_CLIENT_ID=your_imgur_client_id
Full-Featured Example:
version: '3.8'
services:
reddit-stash:
image: ghcr.io/rhnfzl/reddit-stash:latest
container_name: reddit-stash
restart: unless-stopped
environment:
# Reddit credentials
- REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID}
- REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET}
- REDDIT_USERNAME=${REDDIT_USERNAME}
- REDDIT_PASSWORD=${REDDIT_PASSWORD}
# Cloud storage — choose Dropbox OR S3 (both optional)
- DROPBOX_APP_KEY=${DROPBOX_APP_KEY}
- DROPBOX_APP_SECRET=${DROPBOX_APP_SECRET}
- DROPBOX_REFRESH_TOKEN=${DROPBOX_REFRESH_TOKEN}
# - STORAGE_PROVIDER=s3 # Uncomment for S3
# - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
# - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
# - AWS_S3_BUCKET=${AWS_S3_BUCKET}
# Imgur API (optional - for better rate limits)
- IMGUR_CLIENT_ID=${IMGUR_CLIENT_ID}
# Scheduling
- SCHEDULE_MODE=periodic
- SCHEDULE_INTERVAL=7200
volumes:
- reddit-data:/app/reddit
# Optional: mount custom settings.ini
# - ./settings.ini:/app/settings.ini:ro
deploy:
resources:
limits:
cpus: '1.0'
memory: 512M
volumes:
reddit-data:
Using with Portainer:
- Go to Stacks → Add stack
- Name your stack (e.g., `reddit-stash`)
- Paste one of the compose examples above
- Scroll down to Environment variables and add your credentials:
  - `REDDIT_CLIENT_ID` = your_client_id
  - `REDDIT_CLIENT_SECRET` = your_client_secret
  - `REDDIT_USERNAME` = your_username
  - `REDDIT_PASSWORD` = your_password
  - (Add others as needed)
- Click Deploy the stack
Bind Mounts for Specific Paths:
If you prefer to use a specific directory instead of a named volume, use bind mounts:
volumes:
# Named volume (recommended)
- reddit-data:/app/reddit
# OR bind mount to specific path
# Synology:
# - /volume1/docker/reddit-stash:/app/reddit
# unRAID:
# - /mnt/user/appdata/reddit-stash:/app/reddit
# Generic:
# - ./reddit:/app/reddit
With Dropbox Sync (Single Run):
docker run --rm \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e DROPBOX_APP_KEY='your_dropbox_key' \
-e DROPBOX_APP_SECRET='your_dropbox_secret' \
-e DROPBOX_REFRESH_TOKEN='your_dropbox_token' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
With Dropbox Sync (Periodic):
docker run -d \
--name reddit-stash \
--restart unless-stopped \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e DROPBOX_APP_KEY='your_dropbox_key' \
-e DROPBOX_APP_SECRET='your_dropbox_secret' \
-e DROPBOX_REFRESH_TOKEN='your_dropbox_token' \
-e SCHEDULE_MODE='periodic' \
-e SCHEDULE_INTERVAL='7200' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
With AWS S3 Sync (Periodic):
docker run -d \
--name reddit-stash \
--restart unless-stopped \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e STORAGE_PROVIDER='s3' \
-e AWS_ACCESS_KEY_ID='your_access_key' \
-e AWS_SECRET_ACCESS_KEY='your_secret_key' \
-e AWS_S3_BUCKET='your-bucket-name' \
-e SCHEDULE_MODE='periodic' \
-e SCHEDULE_INTERVAL='7200' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
With Imgur API for Better Rate Limits:
docker run -d \
--name reddit-stash \
--restart unless-stopped \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e IMGUR_CLIENT_ID='your_imgur_client_id' \
-e SCHEDULE_MODE='periodic' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
With 2FA Authentication:
If you use Reddit's Two-Factor Authentication, append your 6-digit code to your password:
docker run -d \
--name reddit-stash \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password:123456' \
-e SCHEDULE_MODE='periodic' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
With Custom settings.ini:
docker run -d \
--name reddit-stash \
--restart unless-stopped \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e SCHEDULE_MODE='periodic' \
-v reddit-data:/app/reddit \
-v /path/to/your/settings.ini:/app/settings.ini:ro \
ghcr.io/rhnfzl/reddit-stash:latest
Custom Interval (e.g., every hour):
docker run -d \
--name reddit-stash \
--restart unless-stopped \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e SCHEDULE_MODE='periodic' \
-e SCHEDULE_INTERVAL='3600' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
Cloud Storage Upload Only (Unified CLI):
Upload local content to your configured cloud provider without running the main Reddit scraper:
# Dropbox
docker run --rm \
-e DROPBOX_APP_KEY='your_dropbox_key' \
-e DROPBOX_APP_SECRET='your_dropbox_secret' \
-e DROPBOX_REFRESH_TOKEN='your_dropbox_token' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest \
storage_utils.py --upload
# S3
docker run --rm \
-e STORAGE_PROVIDER='s3' \
-e AWS_ACCESS_KEY_ID='your_access_key' \
-e AWS_SECRET_ACCESS_KEY='your_secret_key' \
-e AWS_S3_BUCKET='your-bucket-name' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest \
storage_utils.py --upload
Cloud Storage Download Only:
Download content from your cloud provider to local storage:
# Dropbox
docker run --rm \
-e DROPBOX_APP_KEY='your_dropbox_key' \
-e DROPBOX_APP_SECRET='your_dropbox_secret' \
-e DROPBOX_REFRESH_TOKEN='your_dropbox_token' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest \
storage_utils.py --download
# S3
docker run --rm \
-e STORAGE_PROVIDER='s3' \
-e AWS_ACCESS_KEY_ID='your_access_key' \
-e AWS_SECRET_ACCESS_KEY='your_secret_key' \
-e AWS_S3_BUCKET='your-bucket-name' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest \
storage_utils.py --download
Legacy note: `dropbox_utils.py --upload` and `dropbox_utils.py --download` still work for Dropbox-only use.
Named Volumes (Recommended for NAS):
Named volumes are portable across different NAS systems and don't require specific path configurations:
-v reddit-data:/app/reddit
View volume location:
docker volume inspect reddit-data
Bind Mounts (Specific Paths):
Use bind mounts when you need data in a specific directory:
# Synology DSM
-v /volume1/docker/reddit-stash:/app/reddit
# QNAP
-v /share/Container/reddit-stash:/app/reddit
# TrueNAS SCALE
-v /mnt/tank/apps/reddit-stash:/app/reddit
# unRAID
-v /mnt/user/appdata/reddit-stash:/app/reddit
# Generic Linux/macOS
-v /path/to/reddit:/app/reddit
-v $(pwd)/reddit:/app/reddit
Volume Permissions:
The container runs as UID 1000. If you encounter permission errors with bind mounts:
sudo chown -R 1000:1000 /path/to/reddit
Resource Limits for NAS Devices:
Limit CPU and memory usage to prevent overloading your NAS:
Docker CLI:
docker run -d \
--name reddit-stash \
--restart unless-stopped \
--memory="512m" \
--memory-swap="1g" \
--cpus="1.0" \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e SCHEDULE_MODE='periodic' \
-v reddit-data:/app/reddit \
ghcr.io/rhnfzl/reddit-stash:latest
Docker Compose:
services:
reddit-stash:
# ... other config ...
deploy:
resources:
limits:
cpus: '1.0'
memory: 512M
reservations:
cpus: '0.5'
memory: 256M
Monitoring & Logs:
# View real-time logs
docker logs -f reddit-stash
# View last 100 lines
docker logs --tail 100 reddit-stash
# View logs since specific time
docker logs --since 1h reddit-stash
# Monitor resource usage
docker stats reddit-stash
Updating the Container:
# Stop and remove old container
docker stop reddit-stash
docker rm reddit-stash
# Pull latest image
docker pull ghcr.io/rhnfzl/reddit-stash:latest
# Recreate with same command or docker-compose up -d
Or with Docker Compose:
docker-compose pull
docker-compose up -d
For detailed step-by-step guides specific to your NAS platform, see docs/DOCKER_DEPLOYMENT.md:
- Synology DSM - Container Manager setup
- QNAP - Container Station configuration
- TrueNAS SCALE - Apps deployment
- unRAID - Docker tab and Community Applications
- OpenMediaVault - Docker plugin setup
- Proxmox - LXC container with Docker
- Portainer - Stack deployment
If you prefer to build the Docker image yourself:
1. Clone and Build:
# Clone the repository
git clone https://github.com/rhnfzl/reddit-stash.git
cd reddit-stash
# Build with default Python 3.12
docker build -t reddit-stash:local .
# Or build with specific Python version
docker build --build-arg PYTHON_VERSION=3.11 -t reddit-stash:py3.11 .
docker build --build-arg PYTHON_VERSION=3.10 -t reddit-stash:py3.10 .
2. Run the Locally Built Image:
After building, use your local image tag (reddit-stash:local) instead of the GitHub Container Registry image (ghcr.io/rhnfzl/reddit-stash:latest) in any of the examples from Option 1 above.
Example:
docker run -d \
--name reddit-stash \
--restart unless-stopped \
-e REDDIT_CLIENT_ID='your_client_id' \
-e REDDIT_CLIENT_SECRET='your_client_secret' \
-e REDDIT_USERNAME='your_username' \
-e REDDIT_PASSWORD='your_password' \
-e SCHEDULE_MODE='periodic' \
-v reddit-data:/app/reddit \
reddit-stash:local
For all other usage scenarios (Dropbox/S3 sync, Imgur API, docker-compose, special operations, etc.), refer to the examples in Option 1 above, simply replacing the image name.
- Python Support: Build supports Python 3.10, 3.11, and 3.12 (3.12 is default)
- Security: The container runs as a non-root user for security
- Data Persistence: Data is persisted through a volume mount (e.g., `-v $(pwd)/reddit:/app/reddit`) to your local machine
- Runtime Configuration: Environment variables must be provided at runtime
- Flexibility: The container supports running different scripts (main script, storage operations)
- Interactive Mode: Use the `-it` flags for interactive operation with output visible in your terminal
- Shell Special Characters: Always use single quotes around environment variable values to prevent shell interpretation of special characters (!, &, $, etc.)
- Execution Modes:
  - Single execution (`SCHEDULE_MODE=once`, default): Runs once and exits
  - Periodic execution (`SCHEDULE_MODE=periodic`): Runs continuously on schedule
- Scheduling Options:
- Default interval: 2 hours (7200 seconds)
  - Custom interval: Set `SCHEDULE_INTERVAL` to any value ≥ 60 seconds
  - Graceful shutdown: Responds to SIGTERM/SIGINT for clean container stops
- Two Main Storage Modes:
- Local-only: Just Reddit credentials, saves to mounted volume
- Cloud sync: Dropbox or S3 credentials for automatic cloud backup
- Detached Mode: You can also run in detached mode with `-d` if you prefer:
docker run -d \
  -e REDDIT_CLIENT_ID='your_client_id' \
  [other environment variables] \
  -v $(pwd)/reddit:/app/reddit \
  reddit-stash
- Logging: Logs are available through Docker's logging system when running in detached mode:
docker logs <container_id>
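The periodic mode described above (interval floor of 60 seconds, graceful SIGTERM/SIGINT shutdown) can be sketched like this; it is illustrative only, and the container's actual entrypoint may be implemented differently:

```python
import signal
import time

MIN_INTERVAL = 60  # documented floor for SCHEDULE_INTERVAL

class Scheduler:
    """Run a job every `interval` seconds until SIGTERM/SIGINT arrives."""

    def __init__(self, interval: int):
        self.interval = max(int(interval), MIN_INTERVAL)
        self.running = True
        signal.signal(signal.SIGTERM, self._stop)
        signal.signal(signal.SIGINT, self._stop)

    def _stop(self, signum, frame):
        self.running = False  # lets the loop below exit cleanly

    def run(self, job):
        while self.running:
            job()
            # Sleep in 1-second slices so a stop signal ends the wait promptly.
            for _ in range(self.interval):
                if not self.running:
                    break
                time.sleep(1)

sched = Scheduler(interval=10)  # below the floor, so clamped up to 60
print(sched.interval)  # 60
```

Sleeping in short slices is what makes `docker stop` fast: the loop notices the flag within a second instead of finishing a full multi-hour sleep.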
If you've forked this repository to make custom modifications, follow these setup steps:
Docker works the same way with your fork - just build locally:
# Clone your fork
git clone https://github.com/YOUR_USERNAME/reddit-stash.git
cd reddit-stash
# Build your customized version
docker build -t reddit-stash .
# Run your custom build (with Dropbox)
docker run -it \
-e REDDIT_CLIENT_ID=your_client_id \
-e REDDIT_CLIENT_SECRET=your_client_secret \
-e REDDIT_USERNAME=your_username \
-e REDDIT_PASSWORD=your_password \
-e DROPBOX_APP_KEY=your_dropbox_key \
-e DROPBOX_APP_SECRET=your_dropbox_secret \
-e DROPBOX_REFRESH_TOKEN=your_dropbox_token \
-v $(pwd)/reddit:/app/reddit \
reddit-stash
# Or with S3
docker run -it \
-e REDDIT_CLIENT_ID=your_client_id \
-e REDDIT_CLIENT_SECRET=your_client_secret \
-e REDDIT_USERNAME=your_username \
-e REDDIT_PASSWORD=your_password \
-e STORAGE_PROVIDER=s3 \
-e AWS_ACCESS_KEY_ID=your_access_key \
-e AWS_SECRET_ACCESS_KEY=your_secret_key \
-e AWS_S3_BUCKET=your-bucket-name \
-v $(pwd)/reddit:/app/reddit \
reddit-stashReplace YOUR_USERNAME with your GitHub username.
After completing your chosen installation method, verify that everything is working correctly:
- Repository forked successfully
- All required secrets added to repository settings
- Workflow manually triggered at least once
- Workflow completes without errors (check Actions tab)
- Reddit folder appears in your cloud storage (Dropbox account or S3 bucket)
- Content files are present and readable
- Python 3.10-3.12 installed and working (3.12 recommended)
- Repository cloned successfully
- Dependencies installed via `pip install -r requirements.txt`
- Environment variables set correctly
- Script runs without errors
- Content saved to specified directory
- (Optional) Content uploaded to cloud storage (Dropbox or S3) if configured
- Docker installed and daemon running
- Image built successfully
- Container runs without errors
- Content appears in mounted volume
- (Optional) Content uploaded to cloud storage (Dropbox or S3) if configured
The settings.ini file in the root directory of the project allows you to configure how Reddit Stash operates. This comprehensive configuration system controls everything from basic saving behavior to advanced media processing and content recovery.
Jump to any section or browse the complete settings index:
| Section | Settings Count | Purpose | Jump To |
|---|---|---|---|
| [Settings] | 8 | Core behavior (save paths, types, file checking) | ↓ View |
| [Configuration] | 4 | Reddit API credentials (use env vars!) | ↓ View |
| [Media] | 12 | Media download controls (images, videos, albums) | ↓ View |
| [Imgur] | 3 | Imgur API configuration (optional) | ↓ View |
| [Recovery] | 9 | Content recovery system (4-provider cascade) | ↓ View |
| [Retry] | 7 | Retry queue management (exponential backoff) | ↓ View |
| [Storage] | 5 | Cloud storage backend (Dropbox, S3) | ↓ View |
| Total | 48 settings | Complete system configuration | ↓ Settings Index |
Quick Tips:
- 🔒 Security First: Use environment variables for credentials, not settings.ini
- ⚡ Performance: `check_type=LOG`, `max_concurrent_downloads=3-5`
- 💾 Storage: Configure `max_image_size`, `max_video_size`, `max_daily_storage_mb`
- 🔄 Recovery: Enable all 4 providers for best deleted content recovery
- 📖 Full Docs: Each setting includes type, defaults, examples, and trade-offs
[Settings]
save_directory = reddit/ # Local directory for saved content
dropbox_directory = /reddit # Dropbox folder path
save_type = ALL # Content types to save
check_type = LOG # File existence checking method
unsave_after_download = false # Auto-unsave after downloading
process_gdpr = false # Process GDPR export data
process_api = true # Use Reddit API for content
ignore_tls_errors = false # Bypass SSL certificate validation (use with caution)

- `save_directory` - Where to save downloaded content locally
  - Type: String (directory path)
  - Default: `reddit/`
  - Valid Values:
    - Relative paths: `reddit/`, `./backup/`, `../archives/reddit/`
    - Absolute paths: `/home/user/reddit-archive/`, `C:\Users\Name\Documents\reddit\`
- Behavior:
- Creates directory if it doesn't exist
- Relative paths are relative to script location
- Subdirectories created automatically for each subreddit
- Examples:
    save_directory = reddit/              # Default, creates ./reddit/
    save_directory = /var/backups/reddit/ # Absolute Unix path
    save_directory = D:\Reddit\           # Absolute Windows path
- Considerations:
- Ensure sufficient disk space (typical usage: 100MB-10GB+ depending on media)
- Write permissions required for the directory
- Avoid network drives for better performance
-
- `dropbox_directory` - Dropbox cloud storage path
  - Type: String (Dropbox path)
  - Default: `/reddit`
  - Valid Values: Any valid Dropbox path starting with `/`
  - Format Rules:
    - Must start with `/` (Dropbox root)
    - Forward slashes only (even on Windows)
    - Case-sensitive on Dropbox side
- Examples:
    dropbox_directory = /reddit              # Root level folder
    dropbox_directory = /Backups/reddit      # Nested folder
    dropbox_directory = /Archives/2024/reddit # Deep nesting
- Notes:
- Folder created automatically if it doesn't exist
    - Used as the Dropbox folder path or the S3 key prefix (with leading `/` stripped for S3)
    - Syncs file_log.json and all content
-
- `save_type` - What content to download from Reddit
  - Type: String (enum)
  - Default: `ALL`
  - Valid Values: `ALL`, `SAVED`, `ACTIVITY`, `UPVOTED` (case-insensitive)
  - Detailed Options:

    | Value | What It Downloads | File Prefixes | Use Case |
    |---|---|---|---|
    | ALL | Everything: your posts, comments, saved items, upvoted items | POST_, COMMENT_, SAVED_POST_, SAVED_COMMENT_, UPVOTE_POST_, UPVOTE_COMMENT_ | Complete Reddit archive |
    | SAVED | Only items you saved | SAVED_POST_, SAVED_COMMENT_ | Personal knowledge base |
    | ACTIVITY | Only your posts & comments | POST_, COMMENT_ | Your Reddit contributions |
    | UPVOTED | Only upvoted items | UPVOTE_POST_, UPVOTE_COMMENT_ | Track interests |

  - Performance Comparison (approximate, 1000 items each):
    - ALL: ~15-30 minutes (processes 4 categories)
    - SAVED: ~5-10 minutes (1 category)
    - ACTIVITY: ~5-10 minutes (1 category)
    - UPVOTED: ~5-10 minutes (1 category)
  - API Rate Limit Impact: Reddit limits to ~100 requests/minute, so `ALL` uses 4x more requests
  - Examples:
    save_type = ALL   # Everything
    save_type = SAVED # Just saved items
    save_type = all   # Case doesn't matter
-
-
- `check_type` - How to detect already-downloaded files
  - Type: String (enum)
  - Default: `LOG`
  - Valid Values: `LOG`, `DIR` (case-insensitive)
  - Detailed Comparison:

    | Feature | LOG | DIR |
    |---|---|---|
    | Speed | ⚡ Very fast (in-memory check) | 🐢 Slower (filesystem scan) |
    | Accuracy | Requires intact file_log.json | Always accurate |
    | Recovery | Fails if log corrupted/deleted | Self-healing |
    | Scalability | Excellent (10,000+ files) | Degrades with size |
    | Best For | GitHub Actions, automated runs | Local use, manual management |

  - LOG Mode Details:
    - Uses file_log.json in save_directory
    - Tracks: filename, timestamp, subreddit, content type
    - JSON structure: `{"file_id-subreddit-type-category": {...}}`
    - Loads entire log into memory (typically <10MB for 10,000 items)
    - Risk: If log file deleted, treats all files as new (duplicates)
    - Fix: If log corrupted, switch to `DIR` temporarily to rebuild
  - DIR Mode Details:
    - Scans filesystem for matching files
    - Checks filename patterns: POST_*.md, SAVED_POST_*.md, etc.
    - Builds in-memory index during startup
    - Performance: ~1 second per 1,000 files
    - Benefits: No dependency on log file, always correct
    - Drawback: Slow for large archives (10,000+ files = ~10s startup)
  - Examples:
    check_type = LOG # Fast, recommended for automation
    check_type = DIR # Thorough, recommended for local
  - When to Switch:
    - LOG → DIR: Log file corrupted, seeing duplicate downloads
    - DIR → LOG: Archive stable, want faster processing
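The two check modes boil down to a dictionary lookup versus a filesystem scan. A minimal sketch of that contrast — the function names here are invented for illustration and are not the script's actual internals:

```python
from pathlib import Path

def already_have_log(log: dict, key: str) -> bool:
    """LOG mode: O(1) membership test against the in-memory file log."""
    return key in log

def already_have_dir(save_dir: Path, filename: str) -> bool:
    """DIR mode: ask the filesystem directly (slower, but self-healing)."""
    return any(save_dir.rglob(filename))
```

This is why LOG mode scales to tens of thousands of files while DIR mode's startup cost grows with the archive: the former checks a hash table, the latter walks every subreddit folder.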
-
-
- `unsave_after_download` - Automatically unsave items after downloading
  - Type: Boolean
  - Default: `false`
  - Valid Values: true, false, yes, no, on, off, 1, 0 (case-insensitive)
  - ⚠️ CRITICAL WARNING: THIS IS PERMANENT AND IRREVERSIBLE!
  - How It Works:
    1. Downloads post/comment to local file
    2. Verifies download successful
    3. Unsaves item from Reddit (removes from your saved list)
    4. Waits 0.5 seconds (rate limit protection)
    5. Continues to next item
  - Use Case - Breaking the 1000-Item Limit:
    - Reddit shows max 1000 saved items in your list
    - Older items beyond 1000 are hidden (but still saved on Reddit)
    - By unsaving downloaded items, older items "bubble up" into view
    - Run script multiple times to access progressively older saves
  - Workflow Example:
    Run 1: Download 1000 most recent, unsave them → Access items 1001-2000
    Run 2: Download next 1000, unsave them → Access items 2001-3000
    Run 3: Download next 1000, unsave them → Access items 3001-4000
    Continue until no more items found
  - Safety Recommendations:
    - First Run: Keep `false`, verify downloads work correctly
    - Backup: Ensure file_log.json is backed up (Dropbox/S3/Git)
    - Test: Try with `save_type = UPVOTED` first (less critical)
    - Enable: Set to `true` only when confident
    - Monitor: Check logs for "unsave failed" messages
  - Error Handling:
    - If unsave fails (API error), script continues (doesn't stop)
    - Failed unsaves logged but don't retry automatically
    - Item remains saved on Reddit, can be re-downloaded (will skip if in log)
  - Examples:
    unsave_after_download = false # Safe default
    unsave_after_download = true  # Enable with caution
    unsave_after_download = yes   # Also valid
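The download-then-unsave loop can be sketched as follows. This is a hedged illustration of the documented behavior, not the script's actual code: `download_item` is a hypothetical stand-in for the saving logic, and `item.unsave()` is the PRAW call on a saved submission or comment:

```python
import time

def archive_and_unsave(items, download_item, delay=0.5):
    """Download each item, then unsave it. Unsave failures are collected,
    never fatal, matching the documented error handling."""
    failures = []
    for item in items:
        if not download_item(item):   # only unsave verified downloads
            continue
        try:
            item.unsave()             # permanent: removes from your saved list
        except Exception as exc:      # API error: record it and keep going
            failures.append((item, exc))
        time.sleep(delay)             # rate-limit protection between unsaves
    return failures
```

Note the ordering: the unsave only happens after the download reports success, which is what makes the "bubble up" workflow safe to repeat.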
-
- `process_gdpr` - Process Reddit GDPR export files
  - Type: Boolean
  - Default: `false`
  - Valid Values: true, false, yes, no, on, off, 1, 0
  - Purpose: Access complete Reddit history including items beyond 1000-item API limits
  - Requirements:
    - GDPR export CSV files in `{save_directory}/gdpr_data/`
    - Expected files: saved_posts.csv, saved_comments.csv
  - How It Works:
    1. Reads post/comment IDs from CSV files
    2. Fetches full content via Reddit API (one API call per item)
    3. Saves with GDPR_POST_ or GDPR_COMMENT_ prefix
    4. Respects rate limits (same 100 req/min as normal processing)
  - GDPR Export Process:
    1. Visit https://www.reddit.com/settings/data-request
    2. Request data (takes 2-30 days to process)
    3. Download ZIP file when ready
    4. Extract saved_posts.csv and saved_comments.csv
    5. Place in `{save_directory}/gdpr_data/`
  - Performance Impact:
    - Processes AFTER regular API content
    - Each item requires separate API call
    - 1000 GDPR items ≈ 10-15 additional minutes
    - Deduplication: Items already in log are skipped (no duplicates)
- Examples:
    process_gdpr = false # Default, skip GDPR processing
    process_gdpr = true  # Process GDPR files if present
  - Common Issues:
    - "GDPR directory not found" → Create `{save_directory}/gdpr_data/`
    - "No CSV files found" → Verify files named exactly saved_posts.csv
    - Deleted content → GDPR has IDs but content may be deleted (404 errors normal)
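The CSV-ingestion step above can be sketched like this. It is a hedged illustration, not the script's real parser; the `id` column name is an assumption based on Reddit's export format, and `read_gdpr_ids` is an invented helper name:

```python
import csv
from pathlib import Path

def read_gdpr_ids(gdpr_dir: Path) -> list[str]:
    """Collect item IDs from the GDPR export CSVs.
    Assumes each CSV has an 'id' column (as in Reddit's export)."""
    ids = []
    for name in ("saved_posts.csv", "saved_comments.csv"):
        path = gdpr_dir / name
        if not path.exists():          # missing file: skip it, don't fail
            continue
        with path.open(newline="", encoding="utf-8") as fh:
            ids.extend(row["id"] for row in csv.DictReader(fh))
    return ids
```

Each ID collected here still costs one Reddit API call to fetch the full content, which is where the 10-15 extra minutes per 1000 items come from.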
-
- `process_api` - Fetch content from Reddit API
  - Type: Boolean
  - Default: `true`
  - Valid Values: true, false, yes, no, on, off, 1, 0
  - Purpose: Control whether to fetch current content from Reddit API
  - Use Cases:

    | process_api | process_gdpr | Behavior |
    |---|---|---|
    | true | false | Normal mode: Fetch current saved/posted/upvoted content |
    | true | true | Complete mode: Current content + GDPR history |
    | false | true | GDPR-only: Only process export files (no API calls for current) |
    | false | false | ⚠️ Does nothing (no content fetched) |

  - When to Set `false`:
    - You only want to process GDPR export files
    - Testing GDPR processing without affecting API rate limits
    - Already ran API processing, want to add GDPR data only
  - Examples:
    process_api = true  # Normal operation
    process_api = false # Skip API, only process GDPR
-
-
- `ignore_tls_errors` - Bypass SSL certificate verification
  - Type: Boolean
  - Default: `false`
  - Valid Values: true, false, yes, no, on, off, 1, 0
  - ⚠️ SECURITY WARNING: Setting to `true` reduces security significantly!
  - What It Does:
    - `false` (default): Validates SSL certificates, rejects expired/invalid/self-signed certs
    - `true`: Accepts ANY certificate, including expired, self-signed, or invalid
- Security Implications:
- Vulnerable to man-in-the-middle attacks
- No guarantee you're downloading from legitimate source
- Should NEVER be used with sensitive content
- Legitimate Use Cases (rare):
- Archiving content from sites with expired certificates
- Corporate networks with self-signed proxy certificates
- Historical content preservation where security less critical
- What It Affects:
- Only affects media downloads (images, videos)
- Does NOT affect Reddit API (always uses valid TLS)
- Third-party hosting: Imgur, Gfycat, etc.
- Warnings Generated:
    - Script prints "⚠️ TLS verification disabled" on startup
    - Configuration validator shows security warning
- Examples:
    ignore_tls_errors = false # Secure default (recommended)
    ignore_tls_errors = true  # Insecure, use only if absolutely necessary
- Better Alternatives:
- Use content recovery system (often has archived copies)
- Manually download problematic images
- Report invalid certificates to site owners
[Configuration]
client_id = None # Reddit app client ID
client_secret = None # Reddit app client secret
username = None # Reddit username
password = None # Reddit password (or password:2FA_code)

🔒 USE ENVIRONMENT VARIABLES, NOT settings.ini FOR CREDENTIALS!
Unless you absolutely know what you're doing and have a specific reason to store credentials in settings.ini (like an isolated, encrypted system), always use environment variables instead.
Why Environment Variables Are Safer:
| Risk | settings.ini | Environment Variables |
|---|---|---|
| Accidental Git Commit | ❌ VERY HIGH - Credentials exposed publicly | ✅ Safe - Not in version control |
| File Sharing | ❌ HIGH - Shared accidentally with settings | ✅ Safe - Not in shared files |
| Backup Exposure | ❌ MEDIUM - In all backups/archives | ✅ Safe - Not in file backups |
| Log Exposure | ❌ MEDIUM - May appear in logs | ✅ Safer - Separate from logs |
| Access Control | ❌ File permissions only | ✅ Better - OS-level protection |
Real-World Danger Example:
You: Edit settings.ini with credentials
You: Test locally, everything works
You: git add . && git commit -m "Update settings"
You: git push
Result: 🚨 YOUR CREDENTIALS ARE NOW PUBLIC ON GITHUB 🚨
Anyone can: Access your Reddit account, download your data, post as you
-
- `client_id` - Reddit application client ID
  - Type: String or `None`
  - Default: `None`
  - ⚠️ SECURITY: Use environment variable `REDDIT_CLIENT_ID` instead
  - When to Use settings.ini:
- Isolated test environment (not connected to internet)
- Encrypted filesystem only you access
- Single-use throwaway credentials
- Never Use settings.ini If:
- Repository is public or will be shared
- File will be backed up to cloud
- Multiple people access the system
- You're not 100% sure about security implications
- Environment Variable Method (RECOMMENDED):
    # Linux/macOS
    export REDDIT_CLIENT_ID='your_client_id_here'
    # Windows
    set REDDIT_CLIENT_ID=your_client_id_here
- settings.ini Method (NOT RECOMMENDED):
client_id = your_client_id_here # ⚠️ DANGEROUS
-
- `client_secret` - Reddit application client secret
  - Type: String or `None`
  - Default: `None`
  - ⚠️ CRITICAL SECURITY: This is your API password! Use `REDDIT_CLIENT_SECRET` env var
  - Exposure Risk: CRITICAL - Full API access
- If Compromised: Attacker can use your Reddit app credentials
- NEVER commit this to version control
- Environment Variable (REQUIRED for any shared/public code):
export REDDIT_CLIENT_SECRET='your_secret_here'
- Type: String or
-
- `username` - Your Reddit username
  - Type: String or `None`
  - Default: `None`
  - Security Level: Medium (username is public anyway, but indicates which account)
- Environment Variable (RECOMMENDED):
export REDDIT_USERNAME='your_username'
- Note: Less critical than password, but still recommended to use env var
- Type: String or
-
- `password` - Your Reddit account password
  - Type: String or `None`
  - Default: `None`
  - ⚠️ MAXIMUM SECURITY RISK: YOUR REDDIT PASSWORD!
  - Exposure = Account Takeover: Someone with this can log into your Reddit account
- 2FA Format: If you have two-factor authentication:
your_password:123456 - NEVER EVER put this in settings.ini unless isolated system
- Environment Variable (ABSOLUTELY REQUIRED):
export REDDIT_PASSWORD='your_password' # With 2FA: export REDDIT_PASSWORD='your_password:123456'
- Type: String or
Credential resolution order:
1. Environment Variables (checked first) ← USE THIS
2. settings.ini values (fallback) ← AVOID THIS

If an environment variable exists, the settings.ini value is completely ignored.
✅ SAFE - Keep credentials as None:
[Configuration]
client_id = None
client_secret = None
username = None
password = None

❌ UNSAFE - Credentials in file:
[Configuration]
client_id = abc123def456 # ⚠️ WILL BE EXPOSED
client_secret = secret_key_here # ⚠️ CRITICAL RISK
username = your_username # ⚠️ ACCOUNT IDENTIFIER
password = your_password # ⚠️ ACCOUNT TAKEOVER RISK

Only use settings.ini for credentials if ALL of these are true:
- ✅ System is completely isolated (no internet, air-gapped)
- ✅ Filesystem is encrypted
- ✅ Only you have access (no shared users)
- ✅ File will never be committed to version control
- ✅ File will never be backed up to cloud
- ✅ File will never be shared with anyone
- ✅ You fully understand the security implications
- ✅ You're using test/throwaway credentials only
If you're unsure about even ONE of these, use environment variables!
# Check if credentials are in settings.ini
grep -E "client_id|client_secret|username|password" settings.ini
# Safe output (all None):
client_id = None
client_secret = None
username = None
password = None
# Unsafe output (has values):
client_id = abc123 # ⚠️ FIX THIS

- Add settings.ini to .gitignore (even with None values, for safety)
- Use .env files for local development (also in .gitignore)
- Rotate credentials if you suspect exposure
- Use GitHub Secrets for GitHub Actions (already encrypted)
- Never screenshot settings with credentials
- Check git history if you accidentally committed credentials
For detailed instructions on setting up environment variables properly, see Setting Up Reddit Environment Variables.
[Media]
# Global media download controls
download_enabled = true # Master switch for all media downloads
download_images = true # Enable image downloads
download_videos = true # Enable video downloads
download_audio = true # Enable audio downloads
# Image processing settings
thumbnail_size = 800 # Thumbnail dimensions in pixels
max_image_size = 5242880 # Max image file size (5MB in bytes)
create_thumbnails = true # Generate thumbnails for large images
# Video settings
video_quality = high # Video quality preference
max_video_size = 209715200 # Max video file size (200MB in bytes)
# Album settings
download_albums = true # Process image albums/galleries
max_album_images = 50 # Max images per album (0 = unlimited)
# Performance and resource limits
max_concurrent_downloads = 3 # Parallel download streams
download_timeout = 30 # Download timeout in seconds
max_daily_storage_mb = 1024 # Daily storage limit in MB

Global Controls:
-
- `download_enabled` - Master switch for all media downloads
  - Type: Boolean
  - Default: `true`
  - Valid Values: true, false, yes, no, on, off, 1, 0
  - What It Does:
    - `true`: Enables media download system, attempts to download images/videos/audio
    - `false`: Completely disables media downloads, only saves text content (markdown)
- Performance Impact:
- Enabled: Adds 5-30 minutes per 1000 posts (depends on media count)
- Disabled: ~80% faster processing, minimal storage usage
- Storage Impact:
- Enabled: 50MB-5GB+ depending on content (images are heavy!)
- Disabled: <50MB for 10,000 text-only posts
- Examples:
    download_enabled = true  # Enable media downloads
    download_enabled = false # Text-only archiving
-
download_images- Control image downloads- Type: Boolean
- Default:
true - Valid Values:
true,false,yes,no,on,off,1,0 - Requires:
download_enabled = true - What It Downloads:
- Direct image links (i.redd.it, i.imgur.com)
- Reddit galleries
- External images (direct URLs)
- Success Rate: ~80% with modern web compatibility
- Examples:
download_images = true # Download all images download_images = false # Skip images
-
download_videos- Control video downloads- Type: Boolean
- Default:
true - Valid Values:
true,false,yes,no,on,off,1,0 - Requires:
download_enabled = true - What It Downloads:
- v.redd.it videos (with audio merging via ffmpeg if available)
- Direct video links (.mp4, .webm, .mov)
- Embedded videos from supported hosts
- Dependencies:
- ffmpeg recommended for v.redd.it audio merging
- Without ffmpeg: video-only (no audio track)
- Storage Warning: Videos are large! Single video can be 50-500MB
- Examples:
download_videos = true # Download videos download_videos = false # Skip videos (save storage)
-
download_audio- Control audio downloads- Type: Boolean
- Default:
true - Valid Values:
true,false,yes,no,on,off,1,0 - Requires:
download_enabled = true - What It Downloads:
- Audio-only posts (podcasts, music)
- Audio tracks from videos (when separated)
- Note: Rarely used by Reddit, most audio is embedded in video
- Examples:
download_audio = true # Download audio files download_audio = false # Skip audio
Image Processing:
-
thumbnail_size- Maximum thumbnail dimension in pixels- Type: Integer (positive)
- Default:
800 - Valid Range: 100-4096 (recommended: 300-1200)
- What It Means:
- Thumbnails are resized to fit within this width × height box
- Aspect ratio is preserved
- If image is 1920×1080 and thumbnail_size=800: → 800×450
- Storage Impact Per Image:
- 300px: ~50KB (very small)
- 800px: ~150KB (balanced)
- 1200px: ~300KB (high quality)
- Use Cases:
- 300-500: Quick previews, storage-constrained
- 800-1000: Balanced quality/size (recommended)
- 1200-2000: High quality, display on large screens
- Examples:
    thumbnail_size = 400  # Small previews
    thumbnail_size = 800  # Default, good balance
    thumbnail_size = 1200 # High quality
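The aspect-ratio-preserving fit described above (1920×1080 → 800×450 at the default size) reduces to one scaling computation. A minimal sketch — the helper name is invented, and the real script presumably delegates the resize itself to an imaging library:

```python
def thumb_dims(width: int, height: int, box: int = 800) -> tuple[int, int]:
    """Fit (width, height) inside a box×box square, preserving aspect ratio;
    images already smaller than the box are never upscaled."""
    scale = min(box / width, box / height, 1.0)
    return round(width * scale), round(height * scale)
```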
-
max_image_size- Maximum image file size in bytes- Type: Integer (positive)
- Default:
5242880(5MB) - Valid Range: 100,000 (100KB) to 52,428,800 (50MB)
- What It Does:
- Images larger than this are SKIPPED (not downloaded)
- Not resized - full skip to save bandwidth/storage
- Applies BEFORE download (uses Content-Length header if available)
- Size Conversion Guide:
1 MB = 1048576 bytes 2 MB = 2097152 bytes 5 MB = 5242880 bytes (default) 10 MB = 10485760 bytes 20 MB = 20971520 bytes 50 MB = 52428800 bytes - Trade-offs:
- Small limit (1-2MB): Saves storage, may miss high-res images
- Medium limit (5-10MB): Balanced, gets most images
- Large limit (20-50MB): Complete archive, heavy storage
- Examples:
    max_image_size = 1048576  # 1MB limit (minimal)
    max_image_size = 5242880  # 5MB limit (default)
    max_image_size = 20971520 # 20MB limit (comprehensive)
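The "applies BEFORE download" behavior hinges on the Content-Length response header. A hedged sketch of that decision (headers passed as a plain dict; the function name is illustrative):

```python
def within_size_limit(headers: dict, max_bytes: int) -> bool:
    """Decide from response headers whether to download at all.
    If the server sends no Content-Length, stay permissive and
    verify the size after download instead."""
    length = headers.get("Content-Length")
    if length is None:
        return True
    return int(length) <= max_bytes
```

The permissive fallback matters in practice: some image hosts stream responses without a Content-Length, so a strict pre-check would wrongly skip them.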
-
create_thumbnails- Generate thumbnail versions of images- Type: Boolean
- Default:
true - Valid Values:
true,false,yes,no,on,off,1,0 - What It Does:
true: Creates smaller preview version alongside originalfalse: Only saves original full-size image
- File Naming:
- Original:
abc123_media.jpg - Thumbnail:
abc123_media_thumb.jpg
- Original:
- Storage Cost: +10-20% additional space
- Benefits:
- Fast loading in file browsers
- Preview without opening large files
- Useful for browsing archives offline
- Processing Cost: +2-5 seconds per image
- Examples:
create_thumbnails = true # Generate previews create_thumbnails = false # Original only
Video Processing:
-
video_quality- Download quality preference for videos-
Type: String (enum)
-
Default:
high -
Valid Values:
high,low(case-insensitive) -
What It Means:
Quality Resolution Bitrate File Size (per min) Use Case high720p-1080p+ High 10-50MB/min Best quality, storage available low360p-480p Lower 3-10MB/min Save bandwidth/storage -
How It Works:
- v.redd.it: Selects from available quality levels
- YouTube: Requests specific quality (if available)
- Gfycat: Downloads HD vs SD version
-
Important: Not all sources provide multiple qualities
-
Examples:
video_quality = high # Best quality (default) video_quality = low # Save storage video_quality = HIGH # Case doesn't matter
-
-
max_video_size- Maximum video file size in bytes- Type: Integer (positive)
- Default:
209715200(200MB) - Valid Range: 1,048,576 (1MB) to 1,073,741,824 (1GB)
- What It Does: Videos larger than this limit are SKIPPED
- Size Conversion Guide:
10 MB = 10485760 bytes 50 MB = 52428800 bytes 100 MB = 104857600 bytes 200 MB = 209715200 bytes (default) 500 MB = 524288000 bytes 1 GB = 1073741824 bytes - Typical Video Sizes:
- Short clip (30s): 5-20MB
- Medium video (2-5 min): 20-100MB
- Long video (10+ min): 100-500MB
- HD long video: 500MB-2GB
- Recommendations By Use Case:
- Storage-constrained: 50-100MB
- Balanced: 200-300MB (default range)
- Complete archive: 500MB-1GB
- Examples:
max_video_size = 52428800 # 50MB (minimal) max_video_size = 209715200 # 200MB (default) max_video_size = 524288000 # 500MB (comprehensive)
Album Handling:
-
download_albums- Process multi-image posts (albums/galleries)- Type: Boolean
- Default:
true - Valid Values:
true,false,yes,no,on,off,1,0 - What It Handles:
- Imgur albums (imgur.com/a/XXXXX)
- Reddit galleries (multiple images in one post)
- Other multi-image services
- File Naming:
{post_id}_media_001.jpg,{post_id}_media_002.jpg, etc. - Storage Impact: Albums can be 2-100+ images each
- Examples:
download_albums = true # Download all album images download_albums = false # Skip albums entirely
-
max_album_images- Limit images per album- Type: Integer (non-negative)
- Default:
50 - Valid Values:
0(unlimited) or positive integer (1-1000+) - What It Does:
0: Download ALL images in album (no limit)Positive number: Download only first N images, skip rest
- Why Limit?:
- Some albums have 100+ images (massive storage)
- Prevent single post from consuming too much space
- Faster processing
- Behavior Example:
- Album has 80 images
max_album_images = 50: Download first 50, skip remaining 30max_album_images = 0: Download all 80
- Recommendations:
- Conservative: 20-30 images
- Balanced: 50 images (default)
- Complete: 0 (unlimited, use with caution)
- Examples:
max_album_images = 10 # First 10 only max_album_images = 50 # First 50 (default) max_album_images = 0 # No limit
Performance Controls:
-
max_concurrent_downloads- Parallel download streams-
Type: Integer (positive)
-
Default:
3 -
Valid Range: 1-20 (recommended: 1-10)
-
What It Does: Number of media files downloaded simultaneously
-
Trade-offs:
Value Speed CPU Usage Memory Usage Network Load Best For 1 Slow Low Low Light Slow connections, low-power devices 3 Moderate Medium Medium Moderate Default, balanced 5-7 Fast High High Heavy Fast connections, powerful machines 10+ Fastest Very High Very High Very Heavy Server environments, very fast connections -
Limiting Factors:
- GitHub Actions: 3-5 recommended (shared resources)
- Home internet: Based on bandwidth (3-5 typical)
- Fast connection: 5-10
-
Warning: Too high can trigger rate limits!
-
Examples:
    max_concurrent_downloads = 1 # One at a time (safest)
    max_concurrent_downloads = 3 # Default (balanced)
    max_concurrent_downloads = 5 # Fast (requires good connection)
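Bounded parallelism like this is typically a thread pool capped at the configured width. A minimal sketch under that assumption (`fetch` is a hypothetical stand-in for the per-file download routine; this is not the script's actual code):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, fetch, max_concurrent=3):
    """Run fetch(url) with at most max_concurrent parallel workers;
    per-URL exceptions are collected instead of aborting the batch."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Raising `max_concurrent` only widens the pool; it does nothing about per-host rate limits, which is why values above 5-7 can trade speed for 429 errors.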
-
-
download_timeout- Per-file download timeout in seconds- Type: Integer (positive)
- Default:
30 - Valid Range: 5-600 (recommended: 15-120)
- What It Does: Max time to wait for single file download
- What Happens on Timeout:
- Download cancelled
- Item added to retry queue
- Script continues to next file
- Recommendations By File Type:
- Images only: 15-30 seconds
- Mixed (images + small videos): 30-60 seconds (default range)
- Large videos: 60-300 seconds
- Network Speed Considerations:
- Slow connection (<1 Mbps): 60-120 seconds
- Medium (1-10 Mbps): 30-60 seconds
- Fast (10+ Mbps): 15-30 seconds
- Examples:
download_timeout = 15 # Fast timeout (risk more failures) download_timeout = 30 # Default (balanced) download_timeout = 120 # Patient (for large files/slow networks)
-
max_daily_storage_mb- Daily storage consumption limit in megabytes- Type: Integer (positive)
- Default:
1024(1GB) - Valid Range: 10-100,000+ (10MB to 100GB+)
- What It Does:
- Tracks total storage used for media per run
- Stops downloading media when limit reached
- Text content (markdown) still saved
- Resets on next script run (not calendar day)
- Why Use It:
- Prevent unexpected storage exhaustion
- Control cloud storage costs (Dropbox/S3)
- GitHub Actions storage limits
- Predictable resource usage
- Size Planning Guide:
100 MB: ~500-1000 images OR ~2-5 short videos 500 MB: ~2500-5000 images OR ~10-25 videos 1 GB: ~5000-10000 images OR ~20-50 videos (default) 5 GB: ~25000+ images OR 100+ videos 10 GB: Complete large archive - Recommendations By Use Case:
- Testing: 100-200MB
- GitHub Actions (free): 500-1000MB
- Home backup: 2000-5000MB (2-5GB)
- Complete archive: 10000+ MB (10GB+)
- Examples:
    max_daily_storage_mb = 100  # Testing/minimal
    max_daily_storage_mb = 1024 # 1GB default
    max_daily_storage_mb = 5120 # 5GB comprehensive
    max_daily_storage_mb = 0    # No limit (use with caution!)
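The per-run budget described above is essentially a running counter checked before each media download. A hedged sketch (class name invented; assumes, per the examples above, that a limit of 0 means unlimited):

```python
class StorageBudget:
    """Per-run media budget: refuse further media once the cap is hit
    (limit_mb = 0 means unlimited). Markdown text is unaffected."""
    def __init__(self, limit_mb: int):
        self.limit_bytes = limit_mb * 1024 * 1024
        self.used = 0

    def try_reserve(self, size_bytes: int) -> bool:
        if self.limit_bytes and self.used + size_bytes > self.limit_bytes:
            return False        # over budget: skip this media file
        self.used += size_bytes
        return True
```

Because the counter lives in memory, it naturally "resets on next script run" rather than per calendar day, exactly as documented.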
[Imgur]
# Optional: Comma-separated client IDs for rate limit rotation
client_ids = None # Multiple Imgur client IDs
client_secrets = None # Corresponding client secrets
recover_deleted = true # Attempt recovery of deleted content

- `client_ids` - Imgur application client IDs for API access
Type: String (comma-separated list in settings.ini, single value in env var)
-
Default:
None -
Valid Values:
None: No Imgur API access (uses fallback methods)- Single ID:
abc123def456 - Multiple IDs (settings.ini only):
id1,id2,id3(for rate limit rotation)
-
Format Rules:
- No spaces around commas:
id1,id2,id3✅ - With spaces (invalid):
id1, id2, id3❌ - Each ID is alphanumeric, typically 15 characters
- No spaces around commas:
-
What It Does:
- Enables official Imgur API access
- Higher rate limits (12,500 requests/day per app)
- Album support and metadata
- Multiple IDs (settings.ini only) rotate to avoid single-app rate limits
-
Without API Access:
- Falls back to direct HTTP downloads
- Lower success rate (~30-40% vs ~70-80% with API)
- Frequent 429 rate limit errors (expected and normal)
- No album support
-
Configuration Methods:
Method Single ID Multiple IDs Best For Environment Variable ✅ Yes ❌ No Most users, single app settings.ini ✅ Yes ✅ Yes Advanced: rate limit rotation -
How to Get (only if you registered before May 2024):
- Go to https://imgur.com/account/settings/apps
- Select your application
- Copy the "Client ID" value
-
Examples:
# settings.ini - Multiple IDs supported client_ids = None # No API (most users) client_ids = abc123def456 # Single app client_ids = abc123,def456,ghi789 # Multiple apps (rotation) ⚠️ settings.ini only
# Environment variable - Single ID only export IMGUR_CLIENT_ID='abc123def456' # Single app only
-
⚠️ Important: Multiple client ID rotation is only supported in settings.ini, not via environment variables. If you need rotation across multiple Imgur apps, you must use settings.ini configuration.
-
-
client_secrets- Imgur application client secrets-
Type: String (comma-separated list in settings.ini, single value in env var)
-
Default:
None -
Valid Values:
None: No Imgur API access- Comma-separated secrets matching
client_idsorder (settings.ini only)
-
MUST MATCH
client_ids:- If
client_idshas 3 IDs,client_secretsmust have 3 secrets - Order matters:
client_ids[0]pairs withclient_secrets[0]
- If
-
Format Rules:
- Same as client_ids: no spaces
- Each secret is alphanumeric, typically 40 characters
- Keep these SECRET (never commit to version control!)
-
Security Warning:
- These are sensitive credentials
- Environment variables recommended for single app
- Never share or expose publicly
-
Configuration Methods:
Method Single Secret Multiple Secrets Recommended Environment Variable ✅ Yes ❌ No ✅ Most secure settings.ini ✅ Yes ✅ Yes ⚠️ Only if multiple apps -
Examples:
# settings.ini - Multiple secrets supported client_secrets = None # No API client_secrets = abcdef1234567890abcdef1234567890abcdef12 # Single client_secrets = secret1_40chars,secret2_40chars,secret3_40chars # Multiple ⚠️ settings.ini only
# Environment variable - Single secret only (RECOMMENDED) export IMGUR_CLIENT_SECRET='abcdef1234567890abcdef1234567890abcdef12'
-
Validation: Script checks that count matches
client_ids -
⚠️ Note: Multiple client secrets are only supported in settings.ini. For single app (most users), use environment variable for better security.
-
-
recover_deleted- Attempt recovery of deleted/unavailable Imgur content- Type: Boolean
- Default:
true - Valid Values:
true,false,yes,no,on,off,1,0 - What It Does:
true: When Imgur returns 404, triggers content recovery cascadefalse: Skip recovery, treat as permanent failure
- Recovery Process:
- Imgur returns 404 (not found) or 429 (rate limited)
- System tries Wayback Machine for archived copy
- Falls back to Reddit preview URLs
- Checks other recovery providers
- Caches result (success or failure) to avoid re-trying
- Success Rates (for deleted Imgur content):
- Recent deletions (<1 month): ~40-60% recovery
- Older deletions (1-6 months): ~20-40% recovery
- Very old (>6 months): ~10-20% recovery
- Popular images: Higher success (more likely archived)
- Performance Impact:
- Adds 5-15 seconds per failed Imgur image
- Only activates on failures (no cost for successful downloads)
- Results cached (subsequent failures instant)
- When to Disable:
- You don't care about deleted content
- Want faster processing (skip recovery attempts)
- Already know most Imgur links are dead
- Examples:
    recover_deleted = true  # Try to recover (default)
    recover_deleted = false # Skip recovery, faster
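The recovery process above is a classic cascade with result caching. A minimal sketch — the provider callables here are generic stand-ins for the Wayback/PullPush/preview/Reveddit lookups, not the script's real interfaces:

```python
def recover(url, providers, cache):
    """Try each recovery provider in order, caching the outcome
    (including failures) so dead links are not re-probed."""
    if url in cache:
        return cache[url]
    result = None
    for provider in providers:   # e.g. Wayback, PullPush, previews, Reveddit
        try:
            result = provider(url)
        except Exception:
            result = None        # provider error: fall through to the next
        if result:
            break
    cache[url] = result          # negative results are cached too
    return result
```

Caching failures is what keeps the 5-15 second recovery cost a one-time penalty per dead link rather than a recurring one.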
```ini
[Recovery]
# Recovery providers (4-provider cascade)
use_wayback_machine = true        # Internet Archive Wayback Machine
use_pushshift_api = true          # PullPush.io (Pushshift successor)
use_reddit_previews = true        # Reddit's preview/thumbnail system
use_reveddit_api = true           # Reveddit deleted content recovery

# Performance settings
timeout_seconds = 10              # Per-provider timeout
cache_duration_hours = 24         # Cache recovery results

# Cache management
max_cache_entries = 10000         # Maximum cached recovery results
max_cache_size_mb = 100           # Cache size limit in MB
cleanup_interval_minutes = 60     # Cache cleanup frequency
enable_background_cleanup = true  # Automatic cache maintenance
```

Reddit Stash includes a sophisticated 4-provider cascade system that attempts to recover deleted, removed, or unavailable content.
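The first-success-wins control flow of the cascade can be sketched in a few lines of Python. The provider functions below are stand-in stubs (the real module and function names in Reddit Stash may differ); only the ordering and early-exit behavior are the point here:

```python
def try_wayback(url):
    return None  # stub: the real provider queries web.archive.org

def try_pullpush(url):
    return None  # stub: the real provider queries PullPush.io

def try_reddit_preview(url):
    return "https://preview.redd.it/example.jpg"  # stub: pretend Reddit cached a preview

def try_reveddit(url):
    return None  # stub: the real provider queries Reveddit

# Providers are tried in order; the cascade stops at the first success.
PROVIDERS = [try_wayback, try_pullpush, try_reddit_preview, try_reveddit]

def recover(url):
    for provider in PROVIDERS:
        result = provider(url)  # the real system enforces timeout_seconds here
        if result is not None:
            return result       # first hit wins; remaining providers are skipped
    return None                 # every provider failed; cache this as "not found"

print(recover("https://i.imgur.com/deadbeef.jpg"))
```

Because the loop returns as soon as any provider answers, a fast provider early in the list (like Reddit previews) can shield the slower ones from ever being called.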
- `use_wayback_machine` - Use the Internet Archive Wayback Machine
  - Type: Boolean
  - Default: `true`
  - Valid Values: `true`, `false`, `yes`, `no`, `on`, `off`, `1`, `0`
  - What It Does: Queries archived web snapshots going back to 1996
  - Best For: Popular content, older deletions, historical preservation
  - Success Rate: 60-80% for popular content, 20-40% for obscure content
  - Coverage: Billions of web pages, extensive image archives
  - Rate Limit: 60 requests/minute (respectful, non-blocking)
  - Response Time: 2-10 seconds average
  - Examples:
    ```ini
    use_wayback_machine = true   # Enable (recommended)
    use_wayback_machine = false  # Disable to save time
    ```
- `use_pushshift_api` - Use PullPush.io (Pushshift successor)
  - Type: Boolean
  - Default: `true`
  - Valid Values: `true`, `false`, `yes`, `no`, `on`, `off`, `1`, `0`
  - What It Does: Queries a Reddit-specific archive of posts and comments
  - Best For: Deleted/removed Reddit text content, metadata
  - Success Rate: 40-70% for Reddit content (higher for older content)
  - Coverage: Reddit posts/comments from 2005-present
  - Rate Limit: 12 requests/minute (conservative, respects the soft limit of 15)
  - Response Time: 1-3 seconds average
  - Note: Sometimes slow or down (community-run service)
  - Examples:
    ```ini
    use_pushshift_api = true   # Enable (recommended for Reddit content)
    use_pushshift_api = false  # Disable if service unavailable
    ```
- `use_reddit_previews` - Use Reddit's preview/thumbnail system
  - Type: Boolean
  - Default: `true`
  - Valid Values: `true`, `false`, `yes`, `no`, `on`, `off`, `1`, `0`
  - What It Does: Uses Reddit's own cached preview images
  - Best For: Recent posts with images, when the original host is down
  - Success Rate: 20-50% (only works if Reddit generated a preview)
  - Coverage: Images from posts where Reddit created thumbnails
  - Quality: Usually lower resolution (preview quality, not original)
  - Rate Limit: 30 requests/minute
  - Response Time: <1 second (very fast)
  - Limitations:
    - Only works for posts Reddit previewed
    - Lower quality than originals
    - May not work for very old posts
  - Examples:
    ```ini
    use_reddit_previews = true   # Enable (fast fallback)
    use_reddit_previews = false  # Disable if quality matters
    ```
- `use_reveddit_api` - Use Reveddit deleted content recovery
  - Type: Boolean
  - Default: `true`
  - Valid Values: `true`, `false`, `yes`, `no`, `on`, `off`, `1`, `0`
  - What It Does: Queries a specialized service for recovering deleted Reddit content
  - Best For: Recently deleted posts/comments (within days/weeks)
  - Success Rate: 30-60% for recent deletions, lower for older
  - Coverage: Reddit posts and comments deleted by users or moderators
  - Rate Limit: 20 requests/minute
  - Response Time: 2-5 seconds average
  - Note: Most effective for recent deletions (<30 days)
  - Examples:
    ```ini
    use_reveddit_api = true   # Enable (good for recent deletions)
    use_reveddit_api = false  # Disable to save time
    ```
- `timeout_seconds` - Maximum wait time per provider attempt
  - Type: Integer (positive)
  - Default: `10`
  - Valid Range: 3-120 seconds (recommended: 5-30)
  - What It Does: Maximum time to wait for each recovery provider to respond
  - Behavior on Timeout:
    - Provider attempt is cancelled
    - Cascade moves to the next provider
    - Failure is logged but doesn't stop processing
  - Trade-offs:

    | Timeout | Success Rate | Speed | Best For |
    |---|---|---|---|
    | 5s | Lower (~50%) | Fast | Quick pass, fast networks |
    | 10s | Good (~70%) | Moderate | Default, balanced |
    | 20-30s | Higher (~85%) | Slow | Thorough recovery, slow networks |

  - Cascade Example (timeout=10s, 4 providers):
    1. Wayback: try for 10s → success/fail → next
    2. PullPush: try for 10s → success/fail → next
    3. Reddit Preview: try for 10s → success/fail → next
    4. Reveddit: try for 10s → success/fail → give up
    - Total: 0-40 seconds per item (stops at first success)
  - Examples:
    ```ini
    timeout_seconds = 5   # Fast, may miss some content
    timeout_seconds = 10  # Default, balanced
    timeout_seconds = 30  # Thorough, slow
    ```
- `cache_duration_hours` - How long to cache recovery results
  - Type: Integer (positive)
  - Default: `24`
  - Valid Range: 1-720 hours (1 hour to 30 days)
  - What It Does: Stores recovery results (success/failure) to avoid retrying
  - What Gets Cached:
    - Successful recoveries: URL → recovered content location
    - Failed attempts: URL → "not found" (avoids retrying the same failure)
  - Cache Database: `.recovery_cache.db` in the save directory (SQLite)
  - Why Cache?:
    - Avoids re-querying the same URL across runs
    - Respects provider rate limits
    - Speeds up subsequent runs (instant cache hits)
    - Reduces network usage
  - Duration Recommendations:

    | Duration | Use Case | Behavior |
    |---|---|---|
    | 1-6 hours | Testing, rapidly changing content | Short-term cache |
    | 24-48 hours | Normal use, daily runs | Default, balanced |
    | 168 hours (1 week) | Weekly runs, stable content | Longer persistence |
    | 720 hours (30 days) | Monthly runs, archival | Maximum persistence |

  - Auto-Cleanup: Expired entries are automatically removed based on `cleanup_interval_minutes`
  - Examples:
    ```ini
    cache_duration_hours = 6    # 6 hours (short-term)
    cache_duration_hours = 24   # 24 hours (default)
    cache_duration_hours = 168  # 1 week (long-term)
    ```
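The caching behavior described above amounts to a TTL lookup table keyed by URL. Here is an illustrative sketch using an in-memory SQLite database; the real cache lives at `.recovery_cache.db`, and its actual schema is an assumption here:

```python
import sqlite3
import time

# In-memory DB for illustration; the real cache is a file in the save directory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, result TEXT, cached_at REAL)"
)

def cache_put(url, result):
    # Both successes and failures are stored, so dead links aren't re-queried.
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)", (url, result, time.time())
    )

def cache_get(url, duration_hours=24):
    row = conn.execute(
        "SELECT result, cached_at FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row and time.time() - row[1] < duration_hours * 3600:
        return row[0]  # fresh hit (recovered location or "not_found")
    return None        # miss or expired: providers must be queried again

cache_put("https://i.imgur.com/gone.jpg", "not_found")
print(cache_get("https://i.imgur.com/gone.jpg"))
```

Note that an expired entry behaves exactly like a miss: the providers get asked again and the fresh answer overwrites the stale row.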
- `max_cache_entries` - Maximum number of cached results
  - Type: Integer (positive)
  - Default: `10000`
  - Valid Range: 100-1,000,000+
  - What It Does: Limits the total number of entries in the cache database
  - Cleanup Behavior:
    - When the limit is reached, expired entries are removed first, then the oldest (FIFO)
    - Cleanup is triggered automatically
  - Storage Per Entry: ~200-500 bytes average
  - Total Storage Examples:
    - 1,000 entries = ~0.5 MB
    - 10,000 entries = ~5 MB (default)
    - 100,000 entries = ~50 MB
    - 1,000,000 entries = ~500 MB
  - Recommendations:
    - Small archive (<1,000 posts): 1,000-5,000 entries
    - Medium archive (1,000-10,000): 10,000-50,000 entries (default range)
    - Large archive (10,000+): 50,000-500,000 entries
  - Examples:
    ```ini
    max_cache_entries = 1000    # Minimal cache
    max_cache_entries = 10000   # Default
    max_cache_entries = 100000  # Large cache
    ```
- `max_cache_size_mb` - Cache size limit in megabytes
  - Type: Integer (positive)
  - Default: `100`
  - Valid Range: 1-10,000 MB (1 MB to 10 GB)
  - What It Does: Limits the total disk space used by the cache database
  - Cleanup Trigger: When the cache file exceeds this size
  - Relationship with `max_cache_entries`:
    - Both limits are enforced independently
    - Whichever limit is reached first triggers cleanup
    - Typically the entries limit hits first
  - Recommendations:
    - 10 MB: Testing, minimal cache
    - 100 MB: Default, sufficient for most users
    - 500 MB: Large archives, lots of recovery attempts
    - 1000 MB: Very large archives, maximum persistence
  - Examples:
    ```ini
    max_cache_size_mb = 10   # Minimal
    max_cache_size_mb = 100  # Default
    max_cache_size_mb = 500  # Large
    ```
- `cleanup_interval_minutes` - How often to run cache cleanup
  - Type: Integer (positive)
  - Default: `60`
  - Valid Range: 5-1440 minutes (5 minutes to 24 hours)
  - What It Does: Controls automatic cleanup of expired cache entries
  - What Gets Cleaned:
    - Entries older than `cache_duration_hours`
    - Excess entries beyond `max_cache_entries`
    - Oldest entries, if the file exceeds `max_cache_size_mb`
  - Trigger Timing: Based on wall-clock time
  - Performance Impact: Minimal (<1 second per cleanup)
  - Recommendations:
    - 15-30 min: Frequent runs, tight control
    - 60 min: Default, balanced (hourly cleanup)
    - 120-240 min: Infrequent runs, less overhead
    - 1440 min: Once per day (very infrequent runs)
  - Examples:
    ```ini
    cleanup_interval_minutes = 30    # Every 30 minutes
    cleanup_interval_minutes = 60    # Every hour (default)
    cleanup_interval_minutes = 1440  # Daily
    ```
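The cleanup pass applies the limits in a fixed order: expired entries go first, then the oldest entries until the count is back under the cap. A sketch of that logic (the real implementation operates on the SQLite cache, not a Python list):

```python
import time

def cleanup(entries, max_entries=10_000, duration_hours=24, now=None):
    """Drop expired entries first, then the oldest until under the entry limit."""
    now = now or time.time()
    # 1. Remove entries older than cache_duration_hours
    fresh = [e for e in entries if now - e["cached_at"] < duration_hours * 3600]
    # 2. If still over max_cache_entries, keep only the newest
    fresh.sort(key=lambda e: e["cached_at"])  # oldest first
    if len(fresh) > max_entries:
        fresh = fresh[len(fresh) - max_entries:]
    return fresh

now = time.time()
expired = {"url": "a", "cached_at": now - 48 * 3600}  # 48h old, past the 24h TTL
recent = {"url": "b", "cached_at": now}
print([e["url"] for e in cleanup([expired, recent], now=now)])  # → ['b']
```

The size limit (`max_cache_size_mb`) would be enforced the same way: sort oldest-first and drop until the file fits.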
- `enable_background_cleanup` - Automatic cache maintenance
  - Type: Boolean
  - Default: `true`
  - Valid Values: `true`, `false`, `yes`, `no`, `on`, `off`, `1`, `0`
  - What It Does:
    - `true`: Runs cleanup automatically based on `cleanup_interval_minutes`
    - `false`: Only cleans up when limits are exceeded (manual mode)
  - Background Mode (`true`):
    - Periodic automatic cleanup
    - Prevents cache bloat
    - Recommended for most users
  - Manual Mode (`false`):
    - Only cleans when forced (size/entry limits hit)
    - Slightly less overhead
    - Cache may grow larger before cleanup
    - Use if you want maximum cache retention
  - Examples:
    ```ini
    enable_background_cleanup = true   # Automatic (recommended)
    enable_background_cleanup = false  # Manual only
    ```
```ini
[Retry]
# Retry behavior
max_retries = 5                 # Maximum retry attempts per item
base_retry_delay_high = 5       # Base delay for high priority (seconds)
base_retry_delay_medium = 10    # Base delay for medium priority (seconds)
base_retry_delay_low = 15       # Base delay for low priority (seconds)

# Exponential backoff
exponential_base_delay = 60     # Base delay for exponential backoff
max_retry_delay = 86400         # Maximum delay (24 hours in seconds)

# Dead letter queue
dead_letter_threshold_days = 7  # Days before moving to dead letter queue
```

The retry system ensures failed downloads are automatically retried across multiple runs with intelligent priority-based backoff strategies. Failed items are queued in a persistent SQLite database (`.retry_queue.db`) and retried on subsequent script runs.
- `max_retries` - Maximum retry attempts before giving up
  - Type: Integer (positive)
  - Default: `5`
  - Valid Range: 1-50 (recommended: 3-10)
  - What It Does: How many times to retry a failed download before moving it to the dead letter queue
  - Retry Counter:
    - Attempt 1: Initial download (not a retry)
    - Attempts 2-6: Actual retries (if max_retries=5)
    - After attempt 6: Move to dead letter queue
  - What Triggers Retries:
    - Network timeouts
    - HTTP errors (403, 429, 500, 502, 503, 504)
    - Temporary service unavailability
    - Rate limit errors
  - What Doesn't Retry:
    - 404 Not Found (permanent, triggers content recovery instead)
    - Invalid URLs
    - File too large (exceeds size limits)
  - Recommendations By Use Case:
    - 3 retries: Quick processing, accept some failures
    - 5 retries: Default, balanced persistence
    - 10 retries: Maximum persistence, thorough recovery
    - 20+ retries: Extreme cases, very unstable networks
  - Examples:
    ```ini
    max_retries = 3   # Quick, fewer attempts
    max_retries = 5   # Default, balanced
    max_retries = 10  # Persistent, thorough
    ```
Failed items are assigned priorities that determine their retry delays:
- Priority Assignment (automatic):
  - High Priority: Small files (<1MB), recently saved posts, first-time failures
  - Medium Priority: Medium files (1-10MB), standard content
  - Low Priority: Large files (>10MB), repeated failures, low-value content
- `base_retry_delay_high` - Base delay for high-priority items (seconds)
  - Type: Integer (positive)
  - Default: `5`
  - Valid Range: 1-300 seconds
  - When Used: Small, recent, important content
  - Examples:
    ```ini
    base_retry_delay_high = 1   # Retry almost immediately
    base_retry_delay_high = 5   # Default, 5 second delay
    base_retry_delay_high = 30  # More patient
    ```
- `base_retry_delay_medium` - Base delay for medium-priority items (seconds)
  - Type: Integer (positive)
  - Default: `10`
  - Valid Range: 5-600 seconds
  - When Used: Standard content, typical failures
  - Examples:
    ```ini
    base_retry_delay_medium = 5   # Quick retry
    base_retry_delay_medium = 10  # Default
    base_retry_delay_medium = 60  # Patient retry
    ```
- `base_retry_delay_low` - Base delay for low-priority items (seconds)
  - Type: Integer (positive)
  - Default: `15`
  - Valid Range: 10-1800 seconds
  - When Used: Large files, repeated failures
  - Examples:
    ```ini
    base_retry_delay_low = 10   # Relatively quick
    base_retry_delay_low = 15   # Default
    base_retry_delay_low = 120  # Very patient (2 minutes)
    ```
Priority Delay Example:
High priority item (attempt 1): Wait 5 seconds
Medium priority item (attempt 1): Wait 10 seconds
Low priority item (attempt 1): Wait 15 seconds
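The priority rules above map directly onto a small classifier. The exact size and failure-count thresholds below are assumptions inferred from this documentation, not the project's actual code:

```python
def assign_priority(size_bytes, prior_failures):
    """Assumed version of the automatic priority assignment described above."""
    if size_bytes < 1_000_000 and prior_failures == 0:
        return "high"    # small file, first-time failure
    if size_bytes > 10_000_000 or prior_failures > 1:
        return "low"     # large file or repeated failures
    return "medium"      # everything in between

# The three base_retry_delay_* defaults from the [Retry] section.
BASE_DELAY = {"high": 5, "medium": 10, "low": 15}

p = assign_priority(500_000, 0)     # small first-time failure
print(p, BASE_DELAY[p])             # → high 5
```

A small file that keeps failing drifts downward in priority, so stubborn items wait longer between attempts.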
After the base delay, subsequent retries use exponential backoff to avoid hammering failing services.
- `exponential_base_delay` - Base delay for exponential backoff calculation
  - Type: Integer (positive)
  - Default: `60`
  - Valid Range: 10-3600 seconds (10 seconds to 1 hour)
  - Formula: `delay = exponential_base_delay × 2^(attempt_number - 1)`
  - Backoff Examples (base=60):
    - Attempt 1: 60 × 2^0 = 60 seconds (1 minute)
    - Attempt 2: 60 × 2^1 = 120 seconds (2 minutes)
    - Attempt 3: 60 × 2^2 = 240 seconds (4 minutes)
    - Attempt 4: 60 × 2^3 = 480 seconds (8 minutes)
    - Attempt 5: 60 × 2^4 = 960 seconds (16 minutes)
  - Purpose: Gradual backoff reduces load on failing services and increases the chance of success
  - Recommendations:
    - 30s: Quick backoff, impatient
    - 60s: Default, balanced (1 minute base)
    - 300s: Slow backoff, very patient (5 minute base)
  - Examples:
    ```ini
    exponential_base_delay = 30   # Quick backoff
    exponential_base_delay = 60   # Default (1 min)
    exponential_base_delay = 300  # Slow backoff (5 min)
    ```
- `max_retry_delay` - Maximum delay between retries (cap)
  - Type: Integer (positive)
  - Default: `86400` (24 hours)
  - Valid Range: 60-604800 seconds (1 minute to 7 days)
  - What It Does: Caps exponential backoff to prevent extremely long delays
  - Without a Cap: Attempt 12 with base=60 would be 122,880 seconds (over 34 hours!)
  - With the Cap (86400): Any delay over 24 hours is capped at 24 hours
  - Delay Capping Example (base=60, max=86400):
    - Attempt 1: 60s (1 min)
    - Attempt 2: 120s (2 min)
    - Attempt 3: 240s (4 min)
    - Attempt 4: 480s (8 min)
    - Attempt 5: 960s (16 min)
    - Attempt 6: 1,920s (32 min)
    - Attempt 7: 3,840s (64 min)
    - Attempt 8: 7,680s (128 min ≈ 2 hours)
    - Attempt 9: 15,360s (256 min ≈ 4 hours)
    - Attempt 10: 30,720s (512 min ≈ 8.5 hours)
    - Attempt 11: 61,440s (≈ 17 hours)
    - Attempt 12: 122,880s → CAPPED at 86,400s (24 hours)
  - Recommendations:
    - 3600s (1 hour): Quick turnaround, frequent runs
    - 86400s (24 hours): Default, daily runs
    - 604800s (7 days): Weekly runs, maximum patience
  - Examples:
    ```ini
    max_retry_delay = 3600    # 1 hour max
    max_retry_delay = 86400   # 24 hours (default)
    max_retry_delay = 604800  # 7 days max
    ```
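The capped backoff formula can be expressed in one line; this is a sketch of the calculation described above, not the project's actual code:

```python
def retry_delay(attempt, base=60, max_delay=86400):
    """delay = exponential_base_delay * 2^(attempt - 1), capped at max_retry_delay."""
    return min(base * 2 ** (attempt - 1), max_delay)

for attempt in (1, 5, 10, 12):
    print(f"attempt {attempt}: {retry_delay(attempt)}s")
# attempt 1: 60s, attempt 5: 960s, attempt 10: 30720s, attempt 12: capped at 86400s
```

With the defaults, the cap only kicks in from attempt 12 onward; earlier attempts are governed purely by the doubling schedule.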
Items that exceed retry limits are moved to a "dead letter queue" for manual review or permanent archiving.
- `dead_letter_threshold_days` - Days before giving up permanently
  - Type: Integer (positive)
  - Default: `7`
  - Valid Range: 1-365 days (1 day to 1 year)
  - What It Does: After this many days of retrying, the item moves to the dead letter queue
  - Dead Letter Queue Behavior:
    - Items are marked as "permanently failed"
    - No longer retried automatically
    - Kept in the database for manual review
    - Can be manually cleared or re-queued
  - Calculation:
    - Based on first_failure_timestamp, not retry count
    - Example: Item first fails on Jan 1, threshold=7 days → moves to DLQ on Jan 8
  - What Happens After DLQ:
    - Item is logged as a permanent failure
    - Visible in retry queue status
    - Manual intervention required to retry
    - Can be cleared to reduce database size
  - Recommendations:
    - 1-2 days: Quick cleanup, aggressive pruning
    - 7 days: Default, balanced (1 week)
    - 30 days: Patient, thorough recovery attempts
    - 365 days: Maximum persistence (1 year)
  - Examples:
    ```ini
    dead_letter_threshold_days = 1   # Move to DLQ after 1 day
    dead_letter_threshold_days = 7   # Default (1 week)
    dead_letter_threshold_days = 30  # Patient (1 month)
    ```
Dead Letter Queue Management:
- View DLQ items: Inspect `.retry_queue.db` with a SQLite browser
- Clear DLQ: Delete the database file (a fresh one is created on the next run)
- Re-queue items: Manual SQL UPDATE statements
- Monitor: Check logs for "moved to dead letter queue" messages
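For manual review, the queue can be queried like any SQLite database. The table and column names below are assumptions for illustration, not the real `.retry_queue.db` schema — inspect the actual schema first with `.schema` in the `sqlite3` shell:

```python
import sqlite3

# In-memory stand-in; point sqlite3.connect() at .retry_queue.db for the real queue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE retry_queue (url TEXT, status TEXT, retry_count INTEGER)")
conn.execute(
    "INSERT INTO retry_queue VALUES ('https://i.imgur.com/x.jpg', 'dead_letter', 6)"
)

# View dead-letter items
rows = conn.execute(
    "SELECT url, retry_count FROM retry_queue WHERE status = 'dead_letter'"
).fetchall()
print(rows)

# Re-queue everything in the DLQ for another round of attempts
conn.execute(
    "UPDATE retry_queue SET status = 'pending', retry_count = 0 "
    "WHERE status = 'dead_letter'"
)
```

Back up the database file before running UPDATE statements against the real queue.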
Complete Retry Flow Example (max_retries=5, threshold=7 days):
Day 1, Run 1: Download fails → Queue (attempt 1/5, high priority, 5s delay)
Day 1, Run 2: Retry fails → Queue (attempt 2/5, 60s backoff)
Day 2, Run 1: Retry fails → Queue (attempt 3/5, 120s backoff)
Day 3, Run 1: Retry fails → Queue (attempt 4/5, 240s backoff)
Day 4, Run 1: Retry fails → Queue (attempt 5/5, 480s backoff)
Day 5, Run 1: Final retry fails → Max retries exceeded, keep in queue
Day 8: Item in queue for >7 days → Move to dead letter queue
```ini
[Storage]
# Provider: none, dropbox, s3
provider = none                 # Which cloud storage to use

# S3 settings (ignored when provider != s3)
s3_bucket = None                # Your S3 bucket name
s3_region = None                # AWS region (e.g., us-east-1)
s3_storage_class = STANDARD_IA  # S3 storage class for content files
s3_endpoint_url = None          # Custom endpoint (MinIO, LocalStack)
```

The storage system lets you sync your Reddit archive to a cloud provider. It currently supports Dropbox (original) and AWS S3 (new). You can use either one, or neither (local-only).
- `provider` - Which cloud storage backend to use
  - Type: String (enum)
  - Default: `none`
  - Valid Values: `none`, `dropbox`, `s3`
  - Behavior:

    | Value | What It Does | Requirements |
    |---|---|---|
    | `none` | No cloud sync, files stay local | Nothing |
    | `dropbox` | Sync to Dropbox (original behavior) | Dropbox app credentials |
    | `s3` | Sync to AWS S3 bucket | AWS credentials + S3 bucket |

  - Environment Variable Override: `STORAGE_PROVIDER`
  - Examples:
    ```ini
    provider = none     # Local-only (default)
    provider = dropbox  # Use Dropbox
    provider = s3       # Use AWS S3
    ```
- `s3_bucket` - Name of your S3 bucket
  - Type: String
  - Default: `None`
  - Required: Yes, when `provider = s3`
  - Rules:
    - Bucket must already exist (not auto-created)
    - Must follow S3 naming rules: lowercase, 3-63 characters, no underscores
    - Must be accessible with your AWS credentials
  - Environment Variable Override: `AWS_S3_BUCKET`
  - Examples:
    ```ini
    s3_bucket = my-reddit-archive
    s3_bucket = backups-2024-reddit
    ```
- `s3_region` - AWS region where your bucket is located
  - Type: String
  - Default: `None` (uses the AWS SDK default, typically from `~/.aws/config`)
  - Common Values: `us-east-1`, `us-west-2`, `eu-west-1`, `eu-central-1`, `ap-southeast-1`
  - Environment Variable Override: `AWS_DEFAULT_REGION`
  - Examples:
    ```ini
    s3_region = us-east-1       # US East (Virginia) - cheapest
    s3_region = eu-central-1    # Europe (Frankfurt)
    s3_region = ap-southeast-1  # Asia Pacific (Singapore)
    ```
- `s3_storage_class` - Storage class for content files (posts, comments, media)
  - Type: String (enum)
  - Default: `STANDARD_IA`
  - Environment Variable Override: `S3_STORAGE_CLASS`
  - Available Classes:

    | Storage Class | Monthly Cost/GB | Min Duration | Best For | Retrieval |
    |---|---|---|---|---|
    | `STANDARD` | ~$0.023 | None | Frequent access | Free |
    | `STANDARD_IA` | ~$0.0125 | 30 days | Infrequent access (recommended) | $0.01/GB |
    | `ONEZONE_IA` | ~$0.01 | 30 days | Non-critical, infrequent | $0.01/GB |
    | `INTELLIGENT_TIERING` | ~$0.023-$0.004 | None | Unpredictable access | Varies |
    | `GLACIER_IR` | ~$0.004 | 90 days | Rare access, minutes retrieval | $0.03/GB |
    | `GLACIER` | ~$0.0036 | 90 days | Archive, hours retrieval | $0.03/GB |
    | `DEEP_ARCHIVE` | ~$0.00099 | 180 days | Long-term, 12hr retrieval | $0.02/GB |

  - Important Notes:
    - `file_log.json` is always stored as `STANDARD` regardless of this setting (it's read every run)
    - Glacier classes (GLACIER_IR, GLACIER, DEEP_ARCHIVE) have minimum storage duration charges. If you re-upload a file before the minimum period, you pay for the full period. Reddit Stash automatically skips uploads when the file hasn't changed to avoid this
    - All files are encrypted with SSE-S3 (AES-256) at no extra cost
  - Recommendations:
    - Just starting out: `STANDARD` (no surprises)
    - Regular use, save money: `STANDARD_IA` (recommended default)
    - Budget-conscious: `ONEZONE_IA` (slightly less durability)
    - Long-term archive: `GLACIER_IR` (cheap, but 90-day minimum)
    - Deep cold storage: `DEEP_ARCHIVE` (cheapest, 180-day minimum)
  - Examples:
    ```ini
    s3_storage_class = STANDARD_IA  # Best balance of cost and access
    s3_storage_class = STANDARD     # Safest, no retrieval fees
    s3_storage_class = GLACIER_IR   # Cheapest for archives you rarely read
    ```
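To see what the class choice means for a given archive size, a back-of-envelope estimate using the approximate per-GB prices from the table above (ballpark figures; check current AWS pricing and add retrieval fees before relying on them):

```python
# Approximate monthly per-GB prices from the table above (not authoritative).
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}

def monthly_cost(size_gb, storage_class="STANDARD_IA"):
    """Storage-only estimate; ignores requests, retrieval, and minimum-duration charges."""
    return round(size_gb * PRICE_PER_GB[storage_class], 4)

# A 10 GB archive across the classes:
for cls in PRICE_PER_GB:
    print(f"{cls}: ${monthly_cost(10, cls)}/month")
```

For a typical archive of ~100 MB (as in the storage summary at the top of this README), the monthly difference between classes is fractions of a cent, so `STANDARD_IA` is a comfortable default.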
- `s3_endpoint_url` - Custom S3-compatible endpoint URL
  - Type: String (URL)
  - Default: `None` (uses official AWS S3)
  - When to Use: Only for S3-compatible services like MinIO, Wasabi, Backblaze B2, or LocalStack for testing
  - Environment Variable Override: `S3_ENDPOINT_URL`
  - SSL: Automatically disabled for non-`amazonaws.com` endpoints unless the URL starts with `https://`
  - Examples:
    ```ini
    s3_endpoint_url = None                       # Default: AWS S3
    s3_endpoint_url = http://localhost:4566      # LocalStack (testing)
    s3_endpoint_url = http://192.168.1.100:9000  # Self-hosted MinIO
    s3_endpoint_url = https://s3.wasabisys.com   # Wasabi
    ```
Quick alphabetical reference of all settings with links to detailed documentation:

- `base_retry_delay_high` (Integer, default: 5) - High-priority retry delay | → Retry Section
- `base_retry_delay_low` (Integer, default: 15) - Low-priority retry delay | → Retry Section
- `base_retry_delay_medium` (Integer, default: 10) - Medium-priority retry delay | → Retry Section
- `cache_duration_hours` (Integer, default: 24) - Cache recovery results duration | → Recovery Section
- `check_type` (String, default: LOG) - File existence checking method | → Settings Section
- `cleanup_interval_minutes` (Integer, default: 60) - Cache cleanup frequency | → Recovery Section
- `client_id` (String, default: None) - Reddit API client ID | → Configuration Section
- `client_ids` (String, default: None) - Imgur client IDs (multiple in settings.ini only) | → Imgur Section
- `client_secret` (String, default: None) - Reddit API client secret | → Configuration Section
- `client_secrets` (String, default: None) - Imgur client secrets (multiple in settings.ini only) | → Imgur Section
- `create_thumbnails` (Boolean, default: true) - Generate thumbnail versions | → Media Section
- `dead_letter_threshold_days` (Integer, default: 7) - Days before moving to DLQ | → Retry Section
- `download_albums` (Boolean, default: true) - Process multi-image posts | → Media Section
- `download_audio` (Boolean, default: true) - Control audio downloads | → Media Section
- `download_enabled` (Boolean, default: true) - Master media download switch | → Media Section
- `download_images` (Boolean, default: true) - Control image downloads | → Media Section
- `download_timeout` (Integer, default: 30) - Per-file download timeout | → Media Section
- `download_videos` (Boolean, default: true) - Control video downloads | → Media Section
- `dropbox_directory` (String, default: /reddit) - Cloud storage path (Dropbox folder / S3 key prefix) | → Settings Section
- `enable_background_cleanup` (Boolean, default: true) - Automatic cache maintenance | → Recovery Section
- `exponential_base_delay` (Integer, default: 60) - Exponential backoff base delay | → Retry Section
- `ignore_tls_errors` (Boolean, default: false) - Bypass SSL certificate validation | → Settings Section
- `max_album_images` (Integer, default: 50) - Limit images per album | → Media Section
- `max_cache_entries` (Integer, default: 10000) - Maximum cached recovery results | → Recovery Section
- `max_cache_size_mb` (Integer, default: 100) - Cache size limit in MB | → Recovery Section
- `max_concurrent_downloads` (Integer, default: 3) - Parallel download streams | → Media Section
- `max_daily_storage_mb` (Integer, default: 1024) - Daily storage limit in MB | → Media Section
- `max_image_size` (Integer, default: 5242880) - Max image file size in bytes | → Media Section
- `max_retries` (Integer, default: 5) - Maximum retry attempts | → Retry Section
- `max_retry_delay` (Integer, default: 86400) - Maximum retry delay cap | → Retry Section
- `max_video_size` (Integer, default: 209715200) - Max video file size in bytes | → Media Section
- `password` (String, default: None) - Reddit account password | → Configuration Section
- `process_api` (Boolean, default: true) - Fetch content from Reddit API | → Settings Section
- `process_gdpr` (Boolean, default: false) - Process GDPR export files | → Settings Section
- `provider` (String, default: none) - Cloud storage backend | → Storage Section
- `recover_deleted` (Boolean, default: true) - Attempt Imgur content recovery | → Imgur Section
- `s3_bucket` (String, default: None) - S3 bucket name | → Storage Section
- `s3_endpoint_url` (String, default: None) - Custom S3-compatible endpoint | → Storage Section
- `s3_region` (String, default: None) - AWS region | → Storage Section
- `s3_storage_class` (String, default: STANDARD_IA) - S3 storage class | → Storage Section
- `save_directory` (String, default: reddit/) - Local save directory | → Settings Section
- `save_type` (String, default: ALL) - What content to download | → Settings Section
- `thumbnail_size` (Integer, default: 800) - Thumbnail dimensions in pixels | → Media Section
- `timeout_seconds` (Integer, default: 10) - Per-provider recovery timeout | → Recovery Section
- `unsave_after_download` (Boolean, default: false) - Auto-unsave after download | → Settings Section
- `use_pushshift_api` (Boolean, default: true) - Use PullPush.io provider | → Recovery Section
- `use_reddit_previews` (Boolean, default: true) - Use Reddit preview system | → Recovery Section
- `use_reveddit_api` (Boolean, default: true) - Use Reveddit provider | → Recovery Section
- `use_wayback_machine` (Boolean, default: true) - Use Internet Archive | → Recovery Section
- `username` (String, default: None) - Reddit username | → Configuration Section
- `video_quality` (String, default: high) - Video quality preference | → Media Section
💡 Tip: Use your browser's search (Ctrl+F / Cmd+F) to quickly find a specific setting in the detailed sections above.
For Maximum Speed:
```ini
check_type = LOG
max_concurrent_downloads = 5
download_timeout = 15
```

For Maximum Reliability:
```ini
check_type = DIR
max_concurrent_downloads = 1
download_timeout = 60
```

Minimal Storage:
```ini
download_images = true
download_videos = false
create_thumbnails = false
max_image_size = 1048576  # 1MB
```

Complete Archive:
```ini
download_albums = true
max_video_size = 524288000  # 500MB
max_album_images = 0  # unlimited
```

- Never commit settings.ini with credentials to version control
- Use environment variables for production deployments
- Keep `ignore_tls_errors = false` unless absolutely necessary
- Regularly rotate API credentials
Note: Environment variables always take precedence over settings.ini values for credentials.
Reddit Stash includes an advanced media download system that can download images, videos, and other media from Reddit posts and comments. Through modern web technologies and intelligent retry mechanisms, this system achieves ~80% success rates for media downloads - a dramatic improvement over basic HTTP methods (~10% success).
What this means for you:
- Most images and videos from your saved posts will actually be downloaded and preserved
- Better compatibility with modern hosting services and anti-bot protection
- Automatic recovery from temporary failures and rate limiting
- Separate tracking shows you exactly how much media content was successfully saved
What this means for you:
- Most users: Reddit Stash works great without Imgur API - you'll get occasional Imgur download failures, which is normal
- Lucky few: If you already had an Imgur application before 2024, you can use it for better Imgur downloads
This is the normal experience - don't worry about trying to get Imgur credentials:
✅ What works perfectly:
- Reddit-hosted images and videos (i.redd.it, v.redd.it)
- Most other image hosting services
- All text content and metadata

⚠️ What to expect:
- Some Imgur images fail with "429 rate limit" errors
- This is expected and normal - not something you need to fix
If you already have an Imgur application from before 2024:
- Find your credentials at https://imgur.com/account/settings/apps
- Set environment variables:
  ```shell
  export IMGUR_CLIENT_ID='your_imgur_client_id'
  export IMGUR_CLIENT_SECRET='your_imgur_client_secret'
  ```
- Enjoy enhanced features: Better reliability, album support, fewer rate limits
For enhanced media downloading, you can optionally set these environment variables:
# Imgur API (if you have existing credentials)
export IMGUR_CLIENT_ID='your_imgur_client_id'
export IMGUR_CLIENT_SECRET='your_imgur_client_secret'
# These improve download reliability and enable advanced features
# Leave unset if you don't have Imgur API access - basic downloads will still work

Since November 2025, Reddit's Responsible Builder Policy requires pre-approval before you can create new API apps. Self-service app creation at `reddit.com/prefs/apps` is no longer available for new users.
If you already have API credentials (created before November 2025): Your existing client_id and client_secret continue to work. No action needed.
If you need new API credentials:
- Apply at Reddit's API Request Form
- In your application, mention:
- This is for personal data backup (non-commercial)
- Low request volume (personal use only)
- You only access your own account data
- No AI training or commercial use
- Wait for approval (typically 2-4 weeks)
- Once approved, follow the Setting Up Reddit Environment Variables instructions below
Don't have API credentials? See GDPR-Only Mode below for an alternative that works without any API access.
If you can't obtain API credentials, you can still create a structured index of your saved Reddit content using Reddit's GDPR data export:
- Request your Reddit data at https://www.reddit.com/settings/data-request (takes 2-30 days)
- Download and extract the ZIP file
- Place `saved_posts.csv` and `saved_comments.csv` in `{save_directory}/gdpr_data/`
- Configure `settings.ini`:
  ```ini
  [Settings]
  process_api = false
  process_gdpr = true
  ```
- Run the script:
  ```shell
  python reddit_stash.py
  ```
In CSV-only mode, the script creates markdown files with Reddit links for each saved item, organized by subreddit. This gives you a searchable index of your saved content. If you later obtain API credentials, you can re-run with `process_api = true` to fetch full content for each item.
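The index-building step can be sketched in a few lines. The markdown layout and the CSV field names (`id`, `permalink`, `subreddit`) below are assumptions for illustration; in practice the rows would come from `csv.DictReader` over `saved_posts.csv`, and the real script's output format may differ:

```python
import pathlib

save_dir = pathlib.Path("reddit")
# Stand-in for csv.DictReader(open("gdpr_data/saved_posts.csv"))
rows = [
    {"id": "abc123", "subreddit": "AskReddit",
     "permalink": "https://www.reddit.com/r/AskReddit/comments/abc123/"},
]

for row in rows:
    folder = save_dir / f"r_{row['subreddit']}"  # one folder per subreddit
    folder.mkdir(parents=True, exist_ok=True)
    md_file = folder / f"GDPR_POST_{row['id']}.md"
    md_file.write_text(
        f"# Saved post {row['id']}\n\n[View on Reddit]({row['permalink']})\n"
    )
    print(md_file)
```

The result mirrors the folder layout shown at the top of this README: one `r_<subreddit>` directory per community, with one markdown stub per saved item.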
Note: Since November 2025, you must first obtain approval through the Getting API Credentials process above before you can create a Reddit app.
- Create a Reddit app at https://www.reddit.com/prefs/apps or https://old.reddit.com/prefs/apps/
- Set up the name, select `script`, and provide the `redirect_uri` as per the PRAW docs.
- Copy the provided `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` as shown in the following screenshot:
- `REDDIT_USERNAME` is your Reddit username
- `REDDIT_PASSWORD` is your Reddit password
Important for Two-Factor Authentication (TFA):
If your Reddit account has TFA enabled, you must provide your password and TFA code together, separated by a colon (`:`), e.g.:
`REDDIT_PASSWORD='your_password:123456'`
where `123456` is your current TFA code. Alternatively, you can disable TFA for the account so prawcore can authenticate with the password alone.
If neither is done, prawcore authentication will fail.
Keep these credentials for the setup.
- Go to Dropbox Developer App.
- Click on Create app.
- Select `Scoped access` and choose `Full Dropbox` or `App folder` for the access type.
- Give a name to your app and click `Create app`.
- In the `Permissions` tab, ensure the following are checked under `Files and folders`:
- Your `DROPBOX_APP_KEY` and `DROPBOX_APP_SECRET` are on the settings page of the app you created.
- To get the `DROPBOX_REFRESH_TOKEN`, follow these steps:
Replace <DROPBOX_APP_KEY> with the DROPBOX_APP_KEY you got in the previous step in the Authorization URL below:
https://www.dropbox.com/oauth2/authorize?client_id=<DROPBOX_APP_KEY>&token_access_type=offline&response_type=code
Paste the URL into your browser and complete the code flow on the Authorization URL. You will receive an <AUTHORIZATION_CODE> at the end; save it, you will need it later.
Go to Postman and create a new POST request with the configuration below:

- Add Request URL: https://api.dropboxapi.com/oauth2/token
- Click on the Authorization tab → Type = Basic Auth → Username = `<DROPBOX_APP_KEY>`, Password = `<DROPBOX_APP_SECRET>` (refer to this answer for the cURL `-u` option)
- Body → select "x-www-form-urlencoded"
| Key | Value |
|---|---|
| code | <AUTHORIZATION_CODE> |
| grant_type | authorization_code |
After you send the request, you will receive a JSON payload containing a refresh_token:
{
"access_token": "sl.****************",
"token_type": "bearer",
"expires_in": 14400,
"refresh_token": "*********************",
"scope": <SCOPES>,
"uid": "**********",
"account_id": "***********************"
}
Add/export the above `refresh_token` as `DROPBOX_REFRESH_TOKEN` in your environment. For more information about the setup, visit the OAuth Guide.
- Credits for the above DROPBOX_REFRESH_TOKEN solution: https://stackoverflow.com/a/71794390/12983596
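The Postman flow above can also be assembled from Python. This sketch builds the same request but only prints it rather than sending it; the placeholders are yours to fill in, and the final call would be made with e.g. `requests.post()`:

```python
import base64

APP_KEY = "<DROPBOX_APP_KEY>"
APP_SECRET = "<DROPBOX_APP_SECRET>"
AUTH_CODE = "<AUTHORIZATION_CODE>"

url = "https://api.dropboxapi.com/oauth2/token"
# Basic auth is base64("key:secret") — exactly what Postman builds from the
# Username/Password fields in the Authorization tab.
auth_header = "Basic " + base64.b64encode(f"{APP_KEY}:{APP_SECRET}".encode()).decode()
form = {"code": AUTH_CODE, "grant_type": "authorization_code"}

print(url)
print(auth_header.startswith("Basic "), form["grant_type"])
# To send it:  requests.post(url, data=form, headers={"Authorization": auth_header})
# The JSON response contains the "refresh_token" shown in the payload above.
```

This is a one-time exchange: the authorization code is single-use, but the refresh token it returns keeps working until revoked.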
AWS S3 provides a cost-effective, highly durable alternative to Dropbox for storing your Reddit archive. Here's a step-by-step guide to set it up:
If you don't already have one:
- Go to aws.amazon.com and click Create an AWS Account
- Follow the signup process (requires email, phone number, and a payment method)
- AWS has a generous Free Tier that includes 5 GB of S3 Standard storage for 12 months
- Go to the S3 Console
- Click Create bucket
- Choose a globally unique bucket name (e.g., `my-reddit-archive-2024`)
  - Must be lowercase, 3-63 characters
  - Only letters, numbers, and hyphens
- Select an AWS Region close to you (e.g., `us-east-1` for US East, `eu-central-1` for Europe)
- Click Create bucket
- Go to IAM Console
- Click Users → Create user
- Enter a name like
reddit-stash-bot - Click Next → Attach policies directly
- Search for and select AmazonS3FullAccess (or create a more restrictive policy for just your bucket)
- Click Create user
- Select the user → Security credentials tab → Create access key
- Choose Application running outside AWS → Create access key
- Save both values (you won't be able to see the secret key again):
  - Access key ID → this becomes AWS_ACCESS_KEY_ID
  - Secret access key → this becomes AWS_SECRET_ACCESS_KEY
Security tip: For GitHub Actions, you can use OIDC federation instead of static keys for enhanced security (no long-lived secrets).
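If you'd rather not grant AmazonS3FullAccess, a bucket-scoped policy along these lines should suffice. The bucket name is the example from above, and the exact action list is an assumption about what the script needs (list, upload, download, delete):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-reddit-archive-2024"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-reddit-archive-2024/*"
    }
  ]
}
```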
Option A: Environment Variables (recommended for GitHub Actions and Docker)
export STORAGE_PROVIDER=s3
export AWS_S3_BUCKET=my-reddit-archive-2024
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AKIA...your_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
# Optional: storage class (default: STANDARD_IA)
export S3_STORAGE_CLASS=STANDARD_IA

Option B: settings.ini (for local installation)
[Storage]
provider = s3
s3_bucket = my-reddit-archive-2024
s3_region = us-east-1
s3_storage_class = STANDARD_IA

And set your AWS credentials as environment variables (don't put them in settings.ini):
export AWS_ACCESS_KEY_ID=AKIA...your_key
export AWS_SECRET_ACCESS_KEY=your_secret_key

Install the S3 dependencies:

pip install -r requirements-s3.txt

This installs boto3, the official AWS SDK for Python.
Run a download (even on a fresh bucket, this verifies connectivity):
python storage_utils.py --download

You should see:
-- S3 connected: s3://my-reddit-archive-2024 (STANDARD_IA) --
Now process your Reddit data and upload to S3:
python reddit_stash.py # Process Reddit content
python storage_utils.py --upload   # Upload to S3

For GitHub Actions, add these secrets to your repository (Settings → Secrets and variables → Actions):
| Secret Name | Value |
|---|---|
| AWS_ACCESS_KEY_ID | Your IAM user access key |
| AWS_SECRET_ACCESS_KEY | Your IAM user secret key |
| AWS_DEFAULT_REGION | Your bucket region (e.g., us-east-1) |
| AWS_S3_BUCKET | Your bucket name |
Also add these secrets (same Secrets tab):
| Secret Name | Value |
|---|---|
| STORAGE_PROVIDER | s3 |
| S3_STORAGE_CLASS | STANDARD_IA (optional, this is the default) |
The workflow will automatically use S3 instead of Dropbox.
For a typical Reddit archive (~500 MB of markdown + media):
| Storage Class | Monthly Cost | Annual Cost |
|---|---|---|
| STANDARD | ~$0.012 | ~$0.14 |
| STANDARD_IA | ~$0.006 | ~$0.08 |
| GLACIER_IR | ~$0.002 | ~$0.02 |
Plus minimal costs for PUT/GET requests (~$0.005/1000 requests). Most users will pay less than $0.10/month.
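The table is just archive size times the per-GB price. As a sanity check, here is the arithmetic as a small sketch; the per-GB prices are approximate us-east-1 figures and will drift over time (assumptions, not quotes):

```python
# Approximate per-GB monthly prices (us-east-1); assumptions, not quotes.
PRICE_PER_GB = {"STANDARD": 0.023, "STANDARD_IA": 0.0125, "GLACIER_IR": 0.004}

def monthly_storage_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only cost; request charges (~$0.005/1000) are extra."""
    return size_gb * PRICE_PER_GB[storage_class]
```

For the ~500 MB archive above, monthly_storage_cost(0.5, "STANDARD_IA") gives roughly the $0.006/month shown in the table.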
If you're switching from Dropbox to S3 (or vice versa), Reddit Stash includes a built-in migration tool that copies all your data between providers.
- Dry-run first — shows what would be transferred without making changes
- Execute — downloads everything from source, uploads to target
1. Ensure both providers are configured with credentials (see setup guides above)

2. Preview the migration (dry-run, no data moved):

   python storage_utils.py --migrate --source dropbox --target s3

   You'll see:

   Migration plan: Dropbox -> AWS S3
   Files: 1,234
   Total size: 456.78 MB

   To execute this migration, add --execute:
   python storage_utils.py --migrate --source dropbox --target s3 --execute

3. Execute the migration:

   python storage_utils.py --migrate --source dropbox --target s3 --execute

4. Update your configuration to use the new provider:

   [Storage]
   provider = s3
- Migration copies all files including file_log.json, so your deduplication state is preserved
- Existing files on the target are not deleted; migration only adds files
- For large archives, migration may take a while (all files pass through your machine)
- You can migrate in either direction: Dropbox → S3, or S3 → Dropbox
When using Docker, you can control the scheduling behavior using these additional environment variables:
- SCHEDULE_MODE: Controls execution mode
  - once (default): Run the script once and exit
  - periodic: Run the script continuously on a schedule
- SCHEDULE_INTERVAL: Time between executions in periodic mode
  - Default: 7200 (2 hours)
  - Minimum: 60 (1 minute)
  - Units: seconds
  - Examples: 3600 = 1 hour, 1800 = 30 minutes, 14400 = 4 hours
Run once (default behavior):
docker run -it [...other env vars...] reddit-stash

Run every 2 hours:

docker run -it -e SCHEDULE_MODE=periodic [...other env vars...] reddit-stash

Run every 30 minutes:

docker run -it -e SCHEDULE_MODE=periodic -e SCHEDULE_INTERVAL=1800 [...other env vars...] reddit-stash

Stopping periodic execution:
Use Ctrl+C or send SIGTERM to the container for graceful shutdown.
While Docker's built-in scheduling (SCHEDULE_MODE=periodic) is convenient, some users prefer external scheduling for more control. Here's how to set up cron jobs to run Docker containers on a schedule.
Use External Cron when:
- You want system-level control over scheduling
- You need complex scheduling patterns (weekdays only, multiple times per day, etc.)
- You prefer containers to start/stop rather than run continuously
- You want to integrate with existing cron workflows
- You need different schedules for different operations (main script vs cloud storage uploads)
Use Built-in Scheduling when:
- You want simple, consistent intervals
- You prefer minimal setup
- You want the container to run continuously
- You don't need complex scheduling patterns
1. Edit your crontab:
crontab -e

2. Add cron jobs for different schedules:
Every 2 hours (Reddit data fetch + cloud sync):
0 */2 * * * docker run --rm --env-file /home/user/.reddit-stash.env -v /home/user/reddit-data:/app/reddit reddit-stash >> /var/log/reddit-stash.log 2>&1

Daily at 9 AM:
0 9 * * * docker run --rm -e REDDIT_CLIENT_ID='your_client_id' -e REDDIT_CLIENT_SECRET='your_client_secret' -e REDDIT_USERNAME='your_username' -e REDDIT_PASSWORD='your_password' -v /home/user/reddit-data:/app/reddit reddit-stash >> /var/log/reddit-stash.log 2>&1

Weekdays only, every 3 hours during work hours (9 AM - 6 PM):
0 9,12,15,18 * * 1-5 docker run --rm -e REDDIT_CLIENT_ID='your_client_id' -e REDDIT_CLIENT_SECRET='your_client_secret' -e REDDIT_USERNAME='your_username' -e REDDIT_PASSWORD='your_password' -v /home/user/reddit-data:/app/reddit reddit-stash >> /var/log/reddit-stash.log 2>&1

Separate cloud storage upload job (runs 30 minutes after main job):
# Dropbox
30 */2 * * * docker run --rm -e DROPBOX_APP_KEY='your_dropbox_key' -e DROPBOX_APP_SECRET='your_dropbox_secret' -e DROPBOX_REFRESH_TOKEN='your_dropbox_token' -v /home/user/reddit-data:/app/reddit reddit-stash storage_utils.py --upload >> /var/log/reddit-stash-upload.log 2>&1
# S3
30 */2 * * * docker run --rm -e STORAGE_PROVIDER='s3' -e AWS_ACCESS_KEY_ID='your_key' -e AWS_SECRET_ACCESS_KEY='your_secret' -e AWS_S3_BUCKET='your-bucket' -v /home/user/reddit-data:/app/reddit reddit-stash storage_utils.py --upload >> /var/log/reddit-stash-upload.log 2>&1

1. Open Task Scheduler (search "Task Scheduler" in the Start menu)
2. Create Basic Task:
- Name: "Reddit Stash"
- Trigger: Daily/Weekly/Custom
- Action: "Start a program"
- Program: docker
- Arguments: run --rm -e REDDIT_CLIENT_ID=your_client_id -e REDDIT_CLIENT_SECRET=your_client_secret -e REDDIT_USERNAME=your_username -e REDDIT_PASSWORD=your_password -v C:\reddit-data:/app/reddit reddit-stash
3. Advanced Options:
- Run whether user is logged on or not
- Run with highest privileges
- Configure for your Windows version
Option 1: Create an environment file
Create /home/user/.reddit-stash.env:
REDDIT_CLIENT_ID=your_client_id
REDDIT_CLIENT_SECRET=your_client_secret
REDDIT_USERNAME=your_username
REDDIT_PASSWORD=your_password
# For Dropbox:
DROPBOX_APP_KEY=your_dropbox_key
DROPBOX_APP_SECRET=your_dropbox_secret
DROPBOX_REFRESH_TOKEN=your_dropbox_token
# For S3 (instead of Dropbox):
# STORAGE_PROVIDER=s3
# AWS_ACCESS_KEY_ID=your_access_key
# AWS_SECRET_ACCESS_KEY=your_secret_key
# AWS_S3_BUCKET=your-bucket-name

Then use in cron:
0 */2 * * * docker run --rm --env-file /home/user/.reddit-stash.env -v /home/user/reddit-data:/app/reddit reddit-stash >> /var/log/reddit-stash.log 2>&1

Option 2: Create a shell script
Create /home/user/run-reddit-stash.sh:
#!/bin/bash
docker run --rm \
--env-file /home/user/.reddit-stash.env \
-v /home/user/reddit-data:/app/reddit \
reddit-stash

Uses the .reddit-stash.env file from Option 1 above (supports Dropbox or S3 credentials).
Make it executable and add to cron:
chmod +x /home/user/run-reddit-stash.sh

Cron entry:
0 */2 * * * /home/user/run-reddit-stash.sh >> /var/log/reddit-stash.log 2>&1

Create docker-compose.yml:
version: '3.8'
services:
  reddit-stash:
    image: reddit-stash
    environment:
      - REDDIT_CLIENT_ID=${REDDIT_CLIENT_ID}
      - REDDIT_CLIENT_SECRET=${REDDIT_CLIENT_SECRET}
      - REDDIT_USERNAME=${REDDIT_USERNAME}
      - REDDIT_PASSWORD=${REDDIT_PASSWORD}
      # Cloud storage: choose Dropbox OR S3
      - DROPBOX_APP_KEY=${DROPBOX_APP_KEY}
      - DROPBOX_APP_SECRET=${DROPBOX_APP_SECRET}
      - DROPBOX_REFRESH_TOKEN=${DROPBOX_REFRESH_TOKEN}
      # - STORAGE_PROVIDER=s3
      # - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      # - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      # - AWS_S3_BUCKET=${AWS_S3_BUCKET}
    volumes:
      - ./reddit:/app/reddit
    profiles:
      - manual

Run via cron:
0 */2 * * * cd /path/to/reddit-stash && docker-compose --profile manual run --rm reddit-stash >> /var/log/reddit-stash.log 2>&1

View cron logs:
# Ubuntu/Debian
tail -f /var/log/syslog | grep CRON
# CentOS/RHEL
tail -f /var/log/cron
# Your application logs
tail -f /var/log/reddit-stash.log

Monitor container execution:
# Check if containers are running
docker ps
# View recent container logs
docker logs $(docker ps -lq)
# Monitor Docker events
docker events --filter container=reddit-stash

Common Issues:
- Path issues: Cron has a limited PATH. Use full paths:
  0 */2 * * * /usr/bin/docker run --rm [...] reddit-stash
- Environment variables: Cron doesn't inherit your shell environment. Use --env-file or set PATH in your crontab:
  PATH=/usr/local/bin:/usr/bin:/bin
  0 */2 * * * docker run --rm [...] reddit-stash
- Permissions: Ensure your user can run Docker without sudo:
  sudo usermod -aG docker $USER  # Log out and back in
- Testing cron jobs: Test your command manually first:
  # Run the exact command from your crontab
  docker run --rm -e REDDIT_CLIENT_ID='...' [...] reddit-stash
Using --rm flag:
- Automatically removes containers after execution
- Prevents accumulation of stopped containers
- Essential for cron-based scheduling
Memory limits:
docker run --rm --memory=512m -e [...] reddit-stash

CPU limits:
docker run --rm --cpus=0.5 -e [...] reddit-stash

Cleanup old images periodically:
# Add to weekly cron
0 0 * * 0 docker image prune -f >> /var/log/docker-cleanup.log 2>&1unsave_after_download in settings.ini). This feature can be used to cycle through older saved posts beyond Reddit's 1000-item limit.
- The script downloads and saves a post/comment
- If successful, it attempts to unsave the item
- A small delay is added between unsave operations to respect Reddit's rate limits
- Error handling ensures that failed unsaves don't stop the script
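The steps above can be sketched as a small function. This is illustrative, not the actual Reddit Stash code; it assumes each item exposes an unsave() method, as PRAW submissions and comments do:

```python
import time

def unsave_items(items, delay=1.0, sleep=time.sleep):
    """Unsave already-downloaded items, continuing past failures."""
    failed = []
    for item in items:
        try:
            item.unsave()          # irreversible on Reddit's side
        except Exception:
            failed.append(item)    # keep going; one failure doesn't stop the run
        sleep(delay)               # small pause to respect rate limits
    return failed
```

Returning the failed items lets a later run retry them instead of aborting mid-batch.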
- This process is irreversible - Once items are unsaved, they cannot be automatically restored to your saved items list
- Create backups first - Always ensure you have a backup of your saved items before enabling this feature
- Use with caution - It's recommended to first run the script without unsaving to verify everything works as expected
- Rate Limiting - The script includes built-in delays to avoid hitting Reddit's API limits
- Error Recovery - If an unsave operation fails, the script will continue processing other items
- Set unsave_after_download = true in your settings.ini file
- Run the script as normal
- The script will now unsave items after successfully downloading them
- Run the script multiple times to gradually access older saved items
- First run: keep unsave_after_download = false and verify all content downloads correctly
- Create a backup of your downloaded content
- Enable unsaving by setting unsave_after_download = true
- Run the script multiple times to access progressively older content
The script can process Reddit's GDPR data export to access your complete saved post history. When API credentials are available, it uses PRAW to fetch full content for each saved item. When API credentials are not available (e.g., due to the November 2025 API policy change), it runs in CSV-only mode, creating a structured index of your saved content with Reddit links. See GDPR-Only Mode for setup without API credentials.
1. Request your Reddit data:
   - Go to https://www.reddit.com/settings/data-request
   - Request your data (processing may take several days)
   - Download the ZIP file when ready
2. Extract and place the CSV files:
   - Inside your save directory (from settings.ini), create a gdpr_data folder
   - Example structure:

     reddit/                  # Your save directory
     ├── gdpr_data/           # GDPR data directory
     │   ├── saved_posts.csv
     │   └── saved_comments.csv
     ├── subreddit1/          # Regular saved content
     └── file_log.json
3. Enable GDPR processing:

   [Settings]
   process_gdpr = true
4. Run the script:

   python reddit_stash.py
- Uses PRAW's built-in rate limiting
- Processes both submissions and comments
- Maintains consistent file naming with "GDPR_" prefix
- Integrates with existing file logging system
- Handles API errors and retries gracefully
- GDPR processing runs after regular API processing
- Each item requires a separate API call to fetch full content
- Rate limits are shared with regular API processing
- Large exports may take significant time to process
- Duplicate items are automatically skipped via file logging
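In CSV-only mode, the export files can be read with Python's standard csv module. A minimal sketch, assuming the export uses id and permalink columns (check the header row of your own export; parse_saved_csv is a hypothetical helper, not part of the codebase):

```python
import csv
import io

def parse_saved_csv(text: str):
    """Return (id, permalink) pairs from a GDPR saved_posts.csv export.
    Column names are assumptions about Reddit's export format."""
    return [(row["id"], row["permalink"]) for row in csv.DictReader(io.StringIO(text))]
```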
Reddit Stash organizes content by subreddit with a clear file naming convention:
- Your Posts: POST_[post_id].md
- Your Comments: COMMENT_[comment_id].md
- Saved Posts: SAVED_POST_[post_id].md
- Saved Comments: SAVED_COMMENT_[comment_id].md
- Upvoted Posts: UPVOTE_POST_[post_id].md
- Upvoted Comments: UPVOTE_COMMENT_[comment_id].md
- GDPR Posts: GDPR_POST_[post_id].md
- GDPR Comments: GDPR_COMMENT_[comment_id].md
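The convention above can be expressed as a small lookup table; make_filename is a hypothetical helper for illustration, not a function from the codebase:

```python
# Prefix table mirroring the naming convention listed above.
PREFIXES = {
    ("post", "yours"): "POST",
    ("comment", "yours"): "COMMENT",
    ("post", "saved"): "SAVED_POST",
    ("comment", "saved"): "SAVED_COMMENT",
    ("post", "upvoted"): "UPVOTE_POST",
    ("comment", "upvoted"): "UPVOTE_COMMENT",
    ("post", "gdpr"): "GDPR_POST",
    ("comment", "gdpr"): "GDPR_COMMENT",
}

def make_filename(kind: str, source: str, item_id: str) -> str:
    return f"{PREFIXES[(kind, source)]}_{item_id}.md"
```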
The system includes several utility modules:
- file_operations.py: Handles all file saving and organization logic
- save_utils.py: Contains the core content formatting functions
- gdpr_processor.py: Processes the GDPR data export
- time_utilities.py: Manages rate limiting and API request timing
- log_utils.py: Tracks processed files to avoid duplicates
Q: Why would I want to back up my Reddit content?
A: Reddit only allows you to access your most recent 1000 saved items. This tool lets you preserve everything beyond that limit and ensures you have a backup even if content is removed from Reddit.
Q: How often does the automated backup run?
A: If you use the GitHub Actions setup, it runs on a schedule:
- Every 3 hours during peak hours (6:00-21:00 UTC)
- Twice during off-peak hours (23:00 and 3:00 UTC)
Q: Can I run this without GitHub Actions?
A: Yes, you can run it locally on your machine or set up the Docker container version. The README provides instructions for both options.
Q: Does this access private/NSFW subreddits I've saved content from?
A: Yes, as long as you're logged in with your own Reddit credentials, the script can access any content you've saved, including from private or NSFW subreddits.
Q: How can I verify the script is working correctly?
A: Check your specified save directory for the backed-up files. They should be organized by subreddit with clear naming conventions.
Q: Will this impact my Reddit account in any way?
A: No, unless you enable the unsave_after_download option. This script only reads your data by default; it doesn't modify anything on Reddit unless that specific option is enabled.
Q: What happens if the script encounters rate limits?
A: The script has built-in dynamic sleep timers to respect Reddit's API rate limits. It will automatically pause and retry when necessary.
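A common shape for such dynamic pauses is capped exponential backoff: wait longer after each consecutive rate-limit hit, up to a maximum. This is an illustrative pattern only, not Reddit Stash's actual timer logic:

```python
def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Seconds to wait after `attempt` consecutive rate-limit hits."""
    return min(cap, base ** attempt)
```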
When using Reddit Stash, keep these security considerations in mind:
- Never share your Reddit API credentials, Dropbox tokens, or AWS access keys with others
- When using GitHub Actions, your credentials are stored as encrypted secrets
- For local installations, consider using environment variables instead of hardcoding credentials in the settings file
- Regularly rotate your API keys and tokens, especially if you suspect they may have been compromised
- Reddit Stash downloads and stores all content from saved posts, including links and images
- Be aware that this may include sensitive or private information if you've saved such content
- Consider where you're storing the backed-up content and who has access to that location
- Cloud storage encryption (Dropbox encryption or S3 SSE-S3) provides some protection, but for highly sensitive data, consider additional encryption
- The GitHub Actions workflow runs in GitHub's cloud environment
- While GitHub has strong security measures, be aware that your Reddit content is processed in this environment
- The workflow has access to your repository secrets and the content being processed
- For maximum security, consider running the script locally on a trusted machine
- Content is stored in plain text markdown files
- If storing content locally, ensure your device has appropriate security measures (encryption, access controls)
- If you back up your local storage to other services, be mindful of where your Reddit content might end up
Feel free to open issues or submit pull requests if you have any improvements or bug fixes.
- This project was inspired by reddit-saved-saver.
Have an idea for improving Reddit Stash? Feel free to suggest it in the issues or contribute a pull request!
✅ Recently Implemented:
- Content Recovery System - 4-provider cascade for failed downloads (Wayback Machine, PullPush.io, Reddit Previews, Reveddit) with SQLite caching and automatic retry across runs
- Advanced Media Download System - Modern web compatibility with HTTP/2 support and browser impersonation
- Comprehensive Rate Limiting - Multi-layer rate limiting with provider-specific limits and intelligent backoff
🔮 Planned Enhancements:
- Improve error handling for edge cases
- Add support for additional cloud storage providers (Google Drive, OneDrive)
- Create a simple web interface for configuration
- Enhanced Media Processing
- Video compression and format conversion options
- Parallel downloads with queue management
- Selective downloads by file size/type with user-defined rules
- Download progress tracking and statistics
- Additional Recovery Providers
- Archive.today integration
- Library of Congress web archive
- Custom recovery provider plugins
This project is licensed under the MIT License - see the LICENSE file for details.





