Intelligent media library management - LSH-based deduplication with blazing-fast processing
LSH deduplication • WASM performance • Parallel processing • Metadata extraction
Intelligently curate, organize, and deduplicate your digital photo and video collection. Built with performance, scalability, and robustness in mind using modern TypeScript, WebAssembly, and native libraries.
The Problem:
Managing large media libraries:
- Scattered files, no organization ❌
- Duplicate accumulation (storage waste) ❌
- Manual organization (time-consuming) ❌
- Simple hash matching (misses similar files) ❌
The Solution:
Media Curator:
- Auto-organize by date/camera/location ✅
- LSH-based visual similarity detection ✅
- Customizable folder structure ✅
- Perceptual hashing (finds variants) ✅
Result: Organized, deduplicated media library with intelligent visual similarity detection.
| Feature | Traditional Tools | Media Curator |
|---|---|---|
| Deduplication | ❌ Hash-based only | ✅ LSH perceptual hashing |
| Similarity Detection | ❌ Exact matches | ✅ Visual similarity (variants) |
| Performance | | ✅ Parallel processing (workerpool) |
| Metadata | | ✅ SQLite (millions of files) |
| Caching | ❌ Re-process every run | ✅ LMDB pause/resume |
| Calculations | | ✅ WebAssembly (Hamming distance) |
- Native Libraries - Sharp (libvips) + FFmpeg for maximum speed
- WebAssembly - Optimized Hamming distance calculations
- Worker Pools - Parallel pHash processing (multi-core)
- SQLite - Fast metadata queries for millions of files
- LMDB Cache - Persistent intermediate results
Smart Folder Structure:
- Date-based - EXIF date (falls back to file date)
- Camera model - Group by device
- Geolocation - GPS-tagged photos
- File type - Separate images/videos
- Custom formats - Flexible placeholder system
Example Format String:
# Organize: Year > Month > Type > Filename
--format "{D.YYYY}/{D.MMMM}/{TYPE}/{NAME}_{RND}{EXT}"
# Result: 2023/April/Image/IMG_1234_a1b2c3d4.jpg

Beyond Simple Hashing:
- Perceptual Hashing (pHash) - Detects visually similar files
- LSH (Locality-Sensitive Hashing) - Efficient similarity search
- Configurable Thresholds - Control sensitivity
- Multi-format Support - Images and videos
Detects:
- Exact duplicates (same hash)
- Resized versions
- Minor edits (crop, filter, compression)
- Different formats (JPG vs PNG)
- Video frame similarity
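
To make the similarity thresholds more concrete, here is a minimal sketch of comparing two perceptual hashes by Hamming distance. The project performs this step in WebAssembly; the plain TypeScript below is only an illustration. The function names and the formula `1 - differingBits / totalBits` are assumptions, not the tool's actual code.

```typescript
// Hypothetical helper: counts the bits that differ between two equal-length hashes.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error("Hashes must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i]; // bits that differ in this byte
    while (diff !== 0) {
      diff &= diff - 1; // clear the lowest set bit
      distance++;
    }
  }
  return distance;
}

// Assumed definition: similarity = 1 - (differing bits / total bits).
function similarity(a: Uint8Array, b: Uint8Array): number {
  const totalBits = a.length * 8;
  if (totalBits === 0) {
    throw new Error("Hashes must not be empty");
  }
  return 1 - hammingDistance(a, b) / totalBits;
}

// Two toy 16-bit hashes that differ in a single bit.
const original = new Uint8Array([0b10110010, 0b01101100]);
const resized = new Uint8Array([0b10110011, 0b01101100]);

const score = similarity(original, resized); // 1 - 1/16
const isDuplicate = score >= 0.93; // compare against a chosen threshold
console.log(score, isDuplicate);
```

Under this definition, a higher threshold tolerates fewer differing bits, so fewer files are flagged as duplicates.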
Database-Centric Design:
- SQLite - Metadata + LSH hashes for millions of files
- LMDB - Fast key-value cache for intermediate results
- Low Memory - No need to load entire library into RAM
- Pause/Resume - Cache enables quick restarts
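
The idea behind LSH candidate lookup can be sketched with a simple banding scheme: split each hash into bands, use every band as a bucket key, and treat files that share at least one bucket as candidates for a full Hamming comparison. This is an in-memory illustration only. The `LshIndex` class, the band count, and the hex encoding are assumptions, and the real tool stores its LSH data in SQLite.

```typescript
// Split a hex-encoded hash into fixed-width bands; each band becomes a bucket key.
function bandKeys(hashHex: string, bandCount: number): string[] {
  if (bandCount <= 0 || hashHex.length % bandCount !== 0) {
    throw new Error("Hash length must be divisible by the band count");
  }
  const width = hashHex.length / bandCount;
  const keys: string[] = [];
  for (let band = 0; band < bandCount; band++) {
    // Prefix with the band index so equal chunks at different positions stay separate.
    keys.push(`${band}:${hashHex.slice(band * width, (band + 1) * width)}`);
  }
  return keys;
}

// Hypothetical in-memory index (a stand-in for the SQLite-backed storage).
class LshIndex {
  private readonly buckets = new Map<string, Set<string>>();
  private readonly bandCount: number;

  constructor(bandCount: number) {
    this.bandCount = bandCount;
  }

  add(fileId: string, hashHex: string): void {
    for (const key of bandKeys(hashHex, this.bandCount)) {
      let bucket = this.buckets.get(key);
      if (bucket === undefined) {
        bucket = new Set<string>();
        this.buckets.set(key, bucket);
      }
      bucket.add(fileId);
    }
  }

  // Files sharing at least one band with the query hash.
  candidates(hashHex: string): Set<string> {
    const found = new Set<string>();
    for (const key of bandKeys(hashHex, this.bandCount)) {
      const bucket = this.buckets.get(key);
      if (bucket !== undefined) {
        bucket.forEach((id) => found.add(id));
      }
    }
    return found;
  }
}

const index = new LshIndex(4);
index.add("a.jpg", "ffee0011aabb2233");
index.add("b.jpg", "ffee0011aabb2234"); // only the last band differs
index.add("c.jpg", "0123456789abcdef"); // shares no band with a.jpg

const matches = index.candidates("ffee0011aabb2233");
console.log(Array.from(matches));
```

Only the returned candidates need an exact Hamming distance check, which is what keeps the comparison count manageable for large libraries.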
Performance Optimization:
- Workerpool - Parallel pHash computation
- WASM - Fast Hamming distance (AssemblyScript)
- Batch Processing - Efficient file handling
- Concurrent Workers - Customizable worker count
# Install via Bun
bun install --global @sylphlab/media-curator
# Install via npm
npm install --global @sylphlab/media-curator
# Verify installation
media-curator --help

- Node.js ≥18.0.0 or Bun ≥0.5.0
- FFmpeg - For video processing
- ExifTool - For metadata extraction (optional, bundled)
Install FFmpeg:
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html

Organize photos from one directory to another:
media-curator /media/photos /library/organized

Organize and separate duplicates:
media-curator /media/photos /library/organized \
-d /library/duplicates \
-e /library/errors

Organize by year and month:
media-curator /media/photos /library/organized \
--format "{D.YYYY}/{D.MM}/{NAME}{EXT}"

Complete workflow with all options:
media-curator /media/photos /media/downloads /library/organized \
-d /library/duplicates \
-e /library/errors \
--move \
--resolution 128 \
--image-similarity-threshold 0.95 \
--video-similarity-threshold 0.90 \
--format "{D.YYYY}/{D.MMMM}/{TYPE}/{NAME}_{RND}{EXT}" \
--concurrency 8 \
--verbose

| Argument | Required | Description |
|---|---|---|
| `<source...>` | ✅ | Source directories or files (multiple allowed) |
| `<destination>` | ✅ | Destination directory for organized files |

| Option | Default | Description |
|---|---|---|
| `-d, --duplicate <path>` | None | Directory for duplicate files |
| `-e, --error <path>` | None | Directory for files with processing errors |
| `-m, --move` | `false` | Move files instead of copying |
| `-v, --verbose` | `false` | Enable detailed logging |

| Option | Default | Description |
|---|---|---|
| `--image-similarity-threshold <n>` | `0.99` | Image similarity threshold (0-1) |
| `--video-similarity-threshold <n>` | `0.93` | Video similarity threshold (0-1) |
| `--image-video-similarity-threshold <n>` | `0.93` | Cross-type similarity threshold |
| `-r, --resolution <n>` | `64` | pHash resolution (higher = more accurate) |

| Option | Default | Description |
|---|---|---|
| `-c, --concurrency <n>` | CPU cores - 1 | Number of worker processes |
| `--max-chunk-size <n>` | 2MB | Maximum file processing chunk size |

| Option | Default | Description |
|---|---|---|
| `--target-fps <n>` | `2` | Target FPS for video frame extraction |
| `--min-frames <n>` | `5` | Minimum frames to extract |
| `--max-scene-frames <n>` | `100` | Maximum frames per scene |
| `--scene-change-threshold <n>` | `0.01` | Scene change detection threshold |
| `-w, --window-size <n>` | `5` | Frame clustering window size |
| `-p, --step-size <n>` | `1` | Frame clustering step size |

| Option | Default | Description |
|---|---|---|
| `-F, --format <string>` | (see below) | Destination path format string |
| `--debug <path>` | None | Directory for debug reports |
Prefixes:
- `I.` - Image Date (from EXIF)
- `F.` - File Creation Date
- `D.` - Mixed Date (prefers EXIF, falls back to file)
Patterns:
| Placeholder | Example | Description |
|---|---|---|
| `{?.YYYY}` | 2023 | 4-digit year |
| `{?.YY}` | 23 | 2-digit year |
| `{?.MMMM}` | January | Full month name |
| `{?.MMM}` | Jan | Short month name |
| `{?.MM}` | 01 | Month (zero-padded) |
| `{?.M}` | 1 | Month (no padding) |
| `{?.DD}` | 05 | Day (zero-padded) |
| `{?.D}` | 5 | Day (no padding) |
| `{?.DDDD}` | Sunday | Full weekday name |
| `{?.DDD}` | Sun | Short weekday name |
| `{?.HH}` | 14 | 24-hour (zero-padded) |
| `{?.hh}` | 02 | 12-hour (zero-padded) |
| `{?.mm}` | 08 | Minute (zero-padded) |
| `{?.ss}` | 09 | Second (zero-padded) |
| `{?.a}` | am | Lowercase am/pm |
| `{?.A}` | AM | Uppercase AM/PM |
| `{?.WW}` | 01 | Week number (01-53) |

| Placeholder | Example | Description |
|---|---|---|
| `{NAME}` | IMG_1234 | Original filename (no extension) |
| `{NAME.L}` | img_1234 | Lowercase filename |
| `{NAME.U}` | IMG_1234 | Uppercase filename |
| `{EXT}` | .jpg | File extension (with dot) |
| `{RND}` | a1b2c3d4 | Random 8-char hex (prevents collisions) |

| Placeholder | Example | Description |
|---|---|---|
| `{GEO}` | 34.05_-118.24 | GPS coordinates (if available) |
| `{CAM}` | iPhone 14 Pro | Camera model (if available) |
| `{TYPE}` | Image or Video | File type |

| Placeholder | Values | Description |
|---|---|---|
| `{HAS.GEO}` | GeoTagged or NoGeo | Has GPS data? |
| `{HAS.CAM}` | WithCamera or NoCamera | Has camera metadata? |
| `{HAS.DATE}` | Dated or NoDate | Has EXIF date? |
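
As a rough illustration of how such placeholders might be expanded, the sketch below handles a small subset of them: `NAME`, `NAME.L`, `NAME.U`, `EXT`, `TYPE`, `CAM`, `HAS.CAM`, `HAS.DATE`, and the numeric year/month/day patterns. The `MediaInfo` shape and `expandFormat` function are hypothetical, dates are read in UTC to keep the output deterministic, and unknown placeholders are left untouched. The tool's real behavior may differ on all of these points.

```typescript
// Hypothetical shape for the values a format string can draw on.
interface MediaInfo {
  name: string; // filename without extension
  ext: string; // extension including the dot
  type: "Image" | "Video";
  exifDate?: Date; // source for the I. prefix
  fileDate: Date; // source for the F. prefix
  camera?: string;
}

const pad2 = (n: number): string => (n < 10 ? `0${n}` : String(n));

// Numeric date patterns only; other patterns are outside this sketch.
function datePart(date: Date, pattern: string): string | undefined {
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1;
  const day = date.getUTCDate();
  switch (pattern) {
    case "YYYY": return String(year);
    case "YY": return pad2(year % 100);
    case "MM": return pad2(month);
    case "M": return String(month);
    case "DD": return pad2(day);
    case "D": return String(day);
    default: return undefined;
  }
}

function expandFormat(format: string, info: MediaInfo): string {
  const dates = new Map<string, Date | undefined>([
    ["I", info.exifDate],
    ["F", info.fileDate],
    ["D", info.exifDate ?? info.fileDate], // mixed: prefer EXIF, fall back to file date
  ]);
  const simple = new Map<string, string | undefined>([
    ["NAME", info.name],
    ["NAME.L", info.name.toLowerCase()],
    ["NAME.U", info.name.toUpperCase()],
    ["EXT", info.ext],
    ["TYPE", info.type],
    ["CAM", info.camera],
    ["HAS.CAM", info.camera ? "WithCamera" : "NoCamera"],
    ["HAS.DATE", info.exifDate ? "Dated" : "NoDate"],
  ]);
  return format.replace(/\{([^}]+)\}/g, (match: string, token: string): string => {
    const direct = simple.get(token);
    if (direct !== undefined) return direct;
    const [prefix, pattern] = token.split(".");
    const date = prefix === undefined ? undefined : dates.get(prefix);
    if (date !== undefined && pattern !== undefined) {
      return datePart(date, pattern) ?? match;
    }
    return match; // placeholder not covered here: leave it as written
  });
}

const photo: MediaInfo = {
  name: "IMG_1234",
  ext: ".jpg",
  type: "Image",
  exifDate: new Date(Date.UTC(2023, 3, 15)),
  fileDate: new Date(Date.UTC(2024, 0, 2)),
};

const organizedPath = expandFormat("{D.YYYY}/{D.MM}/{TYPE}/{NAME.L}{EXT}", photo);
console.log(organizedPath); // → 2023/04/Image/img_1234.jpg
```

When `exifDate` is missing, the `D.` prefix in this sketch resolves to `fileDate`, mirroring the "prefers EXIF, falls back to file" rule described above.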
# By year and month
"{D.YYYY}/{D.MM}/{NAME}{EXT}"
# → 2023/04/IMG_1234.jpg
# By camera model
"{CAM}/{D.YYYY}/{NAME}{EXT}"
# → iPhone 14 Pro/2023/IMG_1234.jpg
# With geolocation
"{HAS.GEO}/{GEO}/{D.YYYY}-{D.MM}/{NAME}{EXT}"
# → GeoTagged/34.05_-118.24/2023-04/IMG_1234.jpg
# Prevent collisions
"{D.YYYY}/{D.MMMM}/{TYPE}/{NAME}_{RND}{EXT}"
# → 2023/April/Image/IMG_1234_a1b2c3d4.jpg

Test organization without moving files:
media-curator /media/photos /library/organized \
--debug /tmp/curator_debug \
--format "{D.YYYY}-{D.MM}/{TYPE}/{NAME}{EXT}"

No files are moved/copied, but a report is generated showing:
- What would happen
- Potential duplicates
- Metadata extraction results
For archival use cases, raise the hash resolution and lower the similarity thresholds so that more near-duplicates are caught:
media-curator /archive_source /library/organized \
-d /library/duplicates \
--move \
--resolution 128 \
--image-similarity-threshold 0.95 \
--video-similarity-threshold 0.90 \
--verbose

Group photos by camera model:
media-curator /camera_roll /library/by_camera \
--format "{HAS.CAM}/{CAM}/{D.YYYY}-{D.MM}/{NAME}_{RND}{EXT}" \
--verbose

Result:
WithCamera/iPhone 14 Pro/2023-10/IMG_001_abc123ef.jpg
NoCamera/Unknown/2024-01/video_clip_xyz98765.mp4
Using shell globbing to filter (in bash, recursive `**` requires `shopt -s globstar`):
# Only JPG files
media-curator /media/photos/**/*.jpg /library/organized_jpgs
# Only MP4 videos
media-curator /media/videos/**/*.mp4 /library/organized_mp4s

Utilize all CPU cores with higher resolution:
media-curator /massive_library /organized \
-d /duplicates \
--move \
--concurrency 16 \
--resolution 128 \
--target-fps 4 \
--verbose

| Component | Technology | Purpose |
|---|---|---|
| Language | TypeScript | Type-safe development |
| Runtime | Node.js / Bun | Execution environment |
| Image Processing | Sharp (libvips) | Fast image operations |
| Video Processing | FFmpeg | Video frame extraction |
| Metadata | ExifTool | EXIF/GPS extraction |
| Database | SQLite (better-sqlite3) | Metadata + LSH storage |
| Cache | LMDB | Fast key-value cache |
| Optimization | WebAssembly (AssemblyScript) | Hamming distance |
| Concurrency | workerpool | Parallel processing |
| Error Handling | neverthrow | Result types |
┌─────────────────────────────────────────────────────────┐
│ 1. Discovery │
│ • Scan source directories │
│ • Collect file paths │
└─────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 2. Gatherer (Parallel Processing) │
│ • Extract metadata (EXIF, GPS, camera) │
│ • Generate pHash (via workerpool) │
│ • Store in SQLite + LMDB cache │
│ • Tools: Sharp, FFmpeg, ExifTool, WASM │
└─────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Deduplicator (LSH-based) │
│ • Query SQLite for similarity candidates │
│ • Calculate Hamming distance (WASM) │
│ • Group duplicate sets │
│ • Identify unique files │
└─────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 4. Transfer │
│ • Organize files by format string │
│ • Move/copy unique files to destination │
│ • Move duplicates to duplicate directory │
│ • Generate debug reports │
└─────────────────────────────────────────────────────────┘
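
The "Group duplicate sets" step in stage 3 can be pictured as merging pairwise matches into connected groups. The sketch below uses a simple union-find over file IDs. The `groupDuplicates` function and its inputs are hypothetical and stand in for whatever grouping the Deduplicator actually performs.

```typescript
// Merge pairwise "similar" matches into duplicate groups with union-find.
function groupDuplicates(
  ids: string[],
  similarPairs: Array<[string, string]>,
): string[][] {
  const parent = new Map<string, string>();
  for (const id of ids) parent.set(id, id);

  // Follow parent links until reaching a root (an id that is its own parent).
  const find = (id: string): string => {
    let root = id;
    while (parent.get(root) !== root) {
      const next = parent.get(root);
      if (next === undefined) throw new Error(`Unknown file id: ${root}`);
      root = next;
    }
    return root;
  };

  for (const [a, b] of similarPairs) {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA !== rootB) parent.set(rootB, rootA); // union the two sets
  }

  // Collect members under their root, preserving input order.
  const groups = new Map<string, string[]>();
  for (const id of ids) {
    const root = find(id);
    const members = groups.get(root) ?? [];
    members.push(id);
    groups.set(root, members);
  }
  return Array.from(groups.values());
}

const duplicateGroups = groupDuplicates(
  ["a.jpg", "a_small.jpg", "a_edit.png", "b.jpg"],
  [
    ["a.jpg", "a_small.jpg"],
    ["a_small.jpg", "a_edit.png"],
  ],
);
console.log(duplicateGroups);
// → [["a.jpg", "a_small.jpg", "a_edit.png"], ["b.jpg"]]
```

Groups with a single member correspond to unique files; larger groups are duplicate sets from which one representative would be kept.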
Key Design Principles:
- Functional Programming - Pure functions, immutability, composition
- Manual Dependency Injection - Testable, maintainable architecture
- Result Types - Explicit error handling via neverthrow
- Minimal Dependencies - Prefer built-in APIs
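
To show what explicit error handling with Result types looks like in practice, here is a small hand-rolled `Result` type with an `andThen` combinator. It deliberately does not use the neverthrow API, so the sketch stays self-contained. The type, the helpers, and the threshold-parsing stages are all assumptions made for illustration.

```typescript
// Minimal stand-in for a Result type (not the neverthrow implementation).
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function ok<T>(value: T): Result<T, never> {
  return { ok: true, value };
}

function err<E>(error: E): Result<never, E> {
  return { ok: false, error };
}

// Run the next stage only if the previous one succeeded; otherwise pass the error on.
function andThen<T, U, E>(
  result: Result<T, E>,
  next: (value: T) => Result<U, E>,
): Result<U, E> {
  return result.ok ? next(result.value) : result;
}

// Hypothetical stages: parse a CLI threshold, then validate its range.
function parseThreshold(raw: string): Result<number, string> {
  const value = Number(raw);
  return Number.isNaN(value) ? err(`Not a number: ${raw}`) : ok(value);
}

function checkRange(value: number): Result<number, string> {
  return value >= 0 && value <= 1 ? ok(value) : err(`Out of range: ${value}`);
}

const good = andThen(parseThreshold("0.95"), checkRange);
const bad = andThen(parseThreshold("1.5"), checkRange);
console.log(good, bad);
```

Because failures are ordinary return values, each pipeline stage can be a pure function, and callers are forced by the type checker to handle the error branch.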
- Vacation photos - Organize by date and location
- Family events - Group by camera (different devices)
- Digital cleanup - Remove duplicate photos from phone backups
- Archival - High-sensitivity deduplication before long-term storage
- Client sessions - Organize by date and camera model
- Event coverage - Deduplicate similar shots (burst mode)
- Portfolio management - Find and remove similar images
- Backup deduplication - Clean up redundant backups
- Video library - Organize by date and metadata
- Duplicate detection - Find visually similar video clips
- Frame-based similarity - Detect re-encoded videos
- Storage optimization - Remove duplicate/similar videos
# Clone repository
git clone https://github.com/SylphxAI/media-curator.git
cd media-curator
# Install dependencies
bun install
# Build
bun run build

# Lint
bun run lint
# Format
bun run format
# Type check
bun run typecheck
# Test
bun test
# Test with coverage
bun run test:cov
# Validate all
bun run validate

# Development mode
bun run start
# Production build
bun run build
bun run start:node

- 100% coverage enforced via CI
- Unit tests for all core modules
- Integration tests for pipeline stages
- Mock-based testing for external tools
- ESLint - Strict rules for consistency
- Prettier - Automated formatting
- TypeScript - Strict mode type safety
- Husky - Pre-commit hooks
Tested with:
- Large libraries (10,000+ files)
- Mixed photo/video collections
- Multiple source directories
- Various file formats
Optimizations:
- Worker pool parallelism
- WASM-accelerated calculations
- SQLite indexing for fast queries
- LMDB caching for pause/resume
✅ Completed
- LSH-based perceptual hashing
- SQLite metadata storage
- LMDB caching
- WebAssembly optimization
- Parallel processing (workerpool)
- FFmpeg + Sharp integration
- Customizable format strings
- CLI progress indicators
🚀 Planned
- Performance benchmarks (quantified metrics)
- Web UI for visual duplicate review
- Cloud storage integration (S3, Google Photos)
- Machine learning-based similarity (neural embeddings)
- Incremental indexing (watch mode)
- Face detection grouping
- Advanced filtering options
Contributions are welcome! Please read CONTRIBUTING.md for guidelines.
Development Guidelines:
- Open an issue - Discuss changes before implementing
- Fork the repository
- Create a feature branch - `git checkout -b feature/my-feature`
- Follow code standards - Run `bun run validate`
- Write tests - Maintain 100% coverage
- Submit a pull request
Note: Some tests may fail under bun test due to complex mocking. See memory-bank/progress.md for details.
- 🐛 Bug Reports
- 💬 Discussions
Show Your Support: ⭐ Star • 👀 Watch • 🐛 Report bugs • 💡 Suggest features • 🔀 Contribute
MIT © Sylphx
Built with:
- Sharp - High-performance image processing (libvips)
- FFmpeg - Video frame extraction
- SQLite - Metadata storage (better-sqlite3)
- LMDB - Fast key-value cache
- ExifTool - Metadata extraction
- WebAssembly - Optimized calculations
- TypeScript - Type safety
Special thanks to the open source community ❤️
- ARCHITECTURE.md - Detailed architecture documentation
- CONTRIBUTING.md - Contribution guidelines
- memory-bank/progress.md - Development progress
Organize. Deduplicate. Optimize.
Intelligent media library management with visual similarity detection
sylphx.com • @SylphxAI • hi@sylphx.com