OCR - Digitize Scans

Reads PDFs and images (JPG and PNG by default) and writes the recognized text to text files.

Dependencies (Tesseract plus its German language data):

sudo apt install tesseract-ocr tesseract-ocr-deu
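To confirm the dependency is actually reachable before running the tool, a small check along these lines may help. It is only a sketch: `missing_tools` is a hypothetical helper, not part of digitize.py, and it only verifies that the binary is on PATH, not that the German language data is installed.

```python
import shutil


def missing_tools(tools=("tesseract",)):
    """Return the names from `tools` that cannot be found on PATH.

    Hypothetical helper for illustration; digitize.py may do its own checks.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```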

Install the Python environment:

poetry install

Convert PDFs and images to text files in the current directory:

poetry run ./digitize.py .

See digitize.py -h for more options.

Example:

poetry run ./digitize.py --exclude DSC IMAG foto picture photo book -r -- ~/sync/private/
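One plausible reading of the example above is: walk the directory recursively and skip every file whose name contains one of the listed words. The sketch below illustrates that selection logic under those assumptions. `find_scans` is a hypothetical name, and the substring matching and the suffix set (.pdf, .jpg, .png) are guesses based on this README, not taken from digitize.py; check digitize.py -h for the real behavior.

```python
from pathlib import Path

# Assumed defaults, based on "PDFs and images (JPG and PNG by default)".
DEFAULT_SUFFIXES = {".pdf", ".jpg", ".png"}


def find_scans(root, exclude=(), recursive=False, suffixes=DEFAULT_SUFFIXES):
    """Return a sorted list of candidate files under `root`.

    Assumption: a file is skipped when its name contains any exclude word.
    """
    root = Path(root)
    # rglob descends into subdirectories, glob stays in `root` itself.
    paths = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p
        for p in paths
        if p.is_file()
        and p.suffix.lower() in suffixes
        and not any(word in p.name for word in exclude)
    )
```

Under these assumptions, `find_scans(Path.home() / "sync/private", exclude=["DSC", "IMAG"], recursive=True)` would list the scans while leaving out typical camera file names.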

You may want to exclude the generated files, which match the pattern *_ocr.txt, from syncing.
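If the sync tool has no ignore-pattern feature, the generated files can also be separated out in a script. A minimal sketch using only the standard library; `split_generated` is a hypothetical helper and not part of digitize.py:

```python
from fnmatch import fnmatch

# Pattern of the generated files, as named in this README.
OCR_PATTERN = "*_ocr.txt"


def split_generated(names, pattern=OCR_PATTERN):
    """Split file names into (originals, generated OCR outputs)."""
    originals = [n for n in names if not fnmatch(n, pattern)]
    generated = [n for n in names if fnmatch(n, pattern)]
    return originals, generated
```

Only the first list would then be handed to the sync step.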