OCR - Digitalize Scans

Reads PDFs and images (JPG and PNG by default) and writes the recognized text to text files.

Dependencies

sudo apt install tesseract-ocr tesseract-ocr-deu
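To confirm that the system dependency above is actually on the PATH before running the tool, a small check like the following can help. This is only a sketch and is not part of digitize.py; the function name `missing_tools` is made up for illustration.

```python
import shutil


def missing_tools(tools=("tesseract",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

Running `tesseract --list-langs` afterwards should list `deu` if the German language data from the tesseract-ocr-deu package was installed.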

Install the Python environment:

poetry install

Usage

Convert PDFs and images in the current directory to text files:

poetry run ./digitize.py .

See digitize.py -h for more options.

Example:

poetry run ./digitize.py --exclude DSC IMAG foto picture photo book -r -- ~/sync/private/
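The example above can be pictured as a file-selection step followed by OCR. The sketch below shows only the selection step, under several assumptions that are not confirmed by this README: that `-r` means a recursive search, that each `--exclude` value is matched as a substring of the file name, and that the default extensions are `.pdf`, `.jpg`, and `.png`. The function `collect_inputs` is hypothetical and is not taken from digitize.py.

```python
from pathlib import Path

# Assumed default extensions, based on "jpg, png by default" plus PDFs.
DEFAULT_EXTENSIONS = (".pdf", ".jpg", ".png")


def collect_inputs(root, exclude=(), recursive=False, extensions=DEFAULT_EXTENSIONS):
    """Return the files under `root` that would be candidates for OCR.

    Assumption: an exclude pattern matches if it occurs anywhere
    in the file name (case-sensitive substring match).
    """
    root = Path(root).expanduser()
    candidates = root.rglob("*") if recursive else root.glob("*")
    selected = []
    for path in candidates:
        if not path.is_file():
            continue  # skip directories
        if path.suffix.lower() not in extensions:
            continue  # not a PDF or supported image
        if any(pattern in path.name for pattern in exclude):
            continue  # file name contains an excluded pattern
        selected.append(path)
    return sorted(selected)
```

With these assumptions, `collect_inputs("~/sync/private/", exclude=["DSC", "IMAG"], recursive=True)` would skip camera files such as `DSC0001.jpg` while still picking up scanned documents.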

The generated files match the pattern *_ocr.txt; you may want to exclude them from syncing.
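To see which generated files a sync exclusion would cover, the pattern can be checked with a few lines of Python. This is a minimal sketch that relies only on the `*_ocr.txt` pattern stated above; the helper name `generated_outputs` is made up for illustration.

```python
from pathlib import Path


def generated_outputs(root, recursive=True):
    """Return all files under `root` whose names match *_ocr.txt."""
    root = Path(root).expanduser()
    pattern = "*_ocr.txt"
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


if __name__ == "__main__":
    for path in generated_outputs("."):
        print(path)
```

The same glob pattern, `*_ocr.txt`, is what would typically go into the ignore list of a sync tool.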