
Add OCR text verification to prevent false positive completions#46

Merged
maxi07 merged 15 commits into main from copilot/fix-41
May 24, 2026

Conversation

Contributor

Copilot AI commented Aug 28, 2025

  • Understand current extract_text usage and callers
  • Add max_pages (default 10) and max_chars (default 50,000) parameters to extract_text()
  • Add tests for max_pages and max_chars limits
  • All 15 tests pass
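
A minimal sketch of how page and character caps like these could work, assuming each page object exposes an `extract_text()` method (as pypdf page objects do). The names `extract_text_limited`, `PageLike`, and `FakePage` are hypothetical and are not taken from this PR:

```python
from typing import Iterable, Optional, Protocol


class PageLike(Protocol):
    """Anything with an extract_text() method, e.g. a pypdf page (assumption)."""

    def extract_text(self) -> Optional[str]: ...


def extract_text_limited(
    pages: Iterable[PageLike],
    max_pages: int = 10,
    max_chars: int = 50_000,
) -> str:
    """Join page texts, stopping after max_pages pages or max_chars characters."""
    parts = []
    total_chars = 0
    for index, page in enumerate(pages):
        if index >= max_pages:
            break
        # Guard against a None return so the join below never fails.
        page_text = page.extract_text() or ""
        parts.append(page_text)
        total_chars += len(page_text)
        if total_chars >= max_chars:
            break
    # The last page read may overshoot the cap, so trim the joined result.
    return "\n".join(parts)[:max_chars]


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for illustration."""

    def __init__(self, text: Optional[str]) -> None:
        self._text = text

    def extract_text(self) -> Optional[str]:
        return self._text
```

Capping both dimensions keeps verification cheap on large scans: the caller only needs to know whether any text exists, not to read the whole document.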

Contributor Author

Copilot AI commented Aug 28, 2025

@maxi07 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add verification for OCR after our step" to "Add OCR text verification to prevent false positive completions" Aug 28, 2025
Copilot AI requested a review from maxi07 August 28, 2025 14:28
@maxi07 maxi07 marked this pull request as ready for review March 23, 2026 17:09
Owner

maxi07 commented Mar 23, 2026

@copilot please add tests that will test this new functionality and also make sure that the web UI reflects a failing OCR status (OCR status gets updated properly in the DB AND the UI)

Contributor Author

Copilot AI commented Mar 23, 2026

@copilot please add tests that will test this new functionality and also make sure that the web UI reflects a failing OCR status (OCR status gets updated properly in the DB AND the UI)

Done in commit 7aa489c. Here's what was added:

Tests (tests/test_ocr_verification.py): Added 9 new unit tests in TestStartProcessing that mock all external dependencies and cover every outcome of start_processing:

  • OCR success with text → COMPLETED
  • OCR success with no/whitespace-only text → FAILED
  • OCR success but output file missing → OUTPUT_ERROR
  • Non-zero exit code → FAILED
  • UnsupportedImageFormatError → UNSUPPORTED
  • DpiError → DPI_ERROR
  • Forwarded to upload queue on success
  • DB final update includes both file_status and ocr_status
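
The non-exception outcomes in this list amount to a small decision table. The following is a rough sketch of that mapping, not the actual `start_processing` code: the function `classify_ocr_result` and the enum values are hypothetical, and the real `OCRStatus` in scansynclib may be defined differently. The two exception cases (UNSUPPORTED and DPI_ERROR) are left out.

```python
from enum import Enum
from typing import Optional


class OCRStatus(Enum):
    # Illustrative subset; values are placeholders, not the library's own.
    COMPLETED = "completed"
    FAILED = "failed"
    OUTPUT_ERROR = "output_error"


def classify_ocr_result(
    exit_code: int,
    output_exists: bool,
    extracted_text: Optional[str],
) -> OCRStatus:
    """Map an OCR run's observable results to a status."""
    if exit_code != 0:
        return OCRStatus.FAILED
    if not output_exists:
        return OCRStatus.OUTPUT_ERROR
    # Exit code 0 alone is not proof of success: require non-whitespace text.
    if (extracted_text or "").strip():
        return OCRStatus.COMPLETED
    return OCRStatus.FAILED
```

For example, `classify_ocr_result(0, True, " \n\t ")` yields `OCRStatus.FAILED`, which is the false positive this PR is meant to catch.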

DB: Added an ocr_status column migration to scanneddata in sqlite_wrapper.py. The OCR service now writes the status name (e.g. "COMPLETED", "FAILED", "DPI_ERROR") to this column at the end of each OCR job.
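
One common way to write such a migration so that it is safe to run on every startup is to check the table's columns first. This is a sketch under assumptions: the helper name `ensure_ocr_status_column`, the `TEXT` column type, and the demo table schema are hypothetical and may not match `sqlite_wrapper.py`.

```python
import sqlite3


def ensure_ocr_status_column(conn: sqlite3.Connection) -> bool:
    """Add scanneddata.ocr_status if it is missing; return True when added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scanneddata)")}
    if "ocr_status" in columns:
        return False
    conn.execute("ALTER TABLE scanneddata ADD COLUMN ocr_status TEXT")
    conn.commit()
    return True


def demo() -> str:
    """Run the migration on a throwaway in-memory table and store a status name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scanneddata (id INTEGER PRIMARY KEY, file_status TEXT)"
    )
    ensure_ocr_status_column(conn)
    conn.execute(
        "INSERT INTO scanneddata (file_status, ocr_status) VALUES (?, ?)",
        ("Completed", "DPI_ERROR"),
    )
    return conn.execute("SELECT ocr_status FROM scanneddata").fetchone()[0]
```

Rows that existed before the migration get `NULL` in the new column, so readers of `ocr_status` should treat a missing value as "unknown" rather than as a failure.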

UI: The dashboard card now shows a ⚠️ warning line with a human-readable OCR error message when OCR encountered a problem (e.g. "OCR: No text found", "OCR: Image DPI too low"). This updates both on initial page load and live via SSE when a document is processed.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • redis
    • Triggering command: /usr/bin/python python -m pytest tests/ -v --ignore=tests/test_homepage.py --ignore=tests/test_advanced_settings_web.py ****** (dns block)
    • Triggering command: /usr/bin/python python -m pytest tests/test_settings.py (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

@maxi07 maxi07 linked an issue May 24, 2026 that may be closed by this pull request
Contributor

Copilot AI left a comment

Pull request overview

Adds an OCR verification step so the pipeline no longer treats “OCRmyPDF exit code 0” as sufficient proof of extracted text, persists the resulting ocr_status to SQLite, and surfaces OCR failures in the dashboard (including via live SSE updates).

Changes:

  • Verify OCR output by extracting text from the generated OCR PDF and setting OCRStatus accordingly.
  • Persist ocr_status in scanneddata via SQLite migration + service DB updates.
  • Expose ocr_status to the web UI (SSE payload + dashboard card warning display) and add unit tests covering OCR outcome scenarios.
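
To illustrate the SSE side, here is a hedged sketch of serializing a status update as a server-sent event. The `data: ...` line followed by a blank line is the standard SSE framing; the function `format_sse_event`, the `id` field, and the two-member enum are assumptions for illustration and do not reflect the real payload in `web_service/src/main.py`, which carries more fields.

```python
import json
from enum import Enum
from typing import Optional


class OCRStatus(Enum):
    # Illustrative subset only.
    COMPLETED = 1
    FAILED = 2


def format_sse_event(item_id: int, ocr_status: Optional[OCRStatus]) -> str:
    """Serialize one update as a server-sent event frame."""
    payload = {
        "id": item_id,
        # Send the enum name (e.g. "FAILED"), or null when no status is set.
        "ocr_status": ocr_status.name if ocr_status else None,
    }
    # An SSE frame is one or more "data:" lines terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"
```

Sending the status name rather than a numeric value lets the dashboard script compare against readable strings such as "FAILED" without sharing the enum definition.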

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| ocr_service/main.py | Adds post-OCR text verification, records ocr_status to DB, and adds a __main__ guard. |
| scansynclib/scansynclib/sqlite_wrapper.py | Adds migration to create ocr_status column in scanneddata. |
| web_service/src/main.py | Includes ocr_status in SSE payload updates. |
| web_service/src/static/js/dashboard.js | Displays OCR warning indicator/text on cards and updates it live. |
| web_service/src/routes/dashboard.py | Adjusts dashboard load log level. |
| tests/test_ocr_verification.py | Adds tests for text extraction behavior and start_processing OCR status outcomes. |

Comment thread ocr_service/main.py Outdated

# Verify that the OCR file actually contains text
if os.path.exists(item.ocr_file):
    extracted_text = extract_text(item.ocr_file).strip()
Comment thread web_service/src/main.py Outdated
currently_uploading=item.current_uploading,
current_upload_target=item.current_upload_target,
badges=badges, # Add the generated badges
ocr_status=item.ocr_status.name if item.ocr_status else None,
Comment on lines +93 to +98
module_patches = {
    'ocrmypdf': mock_ocrmypdf,
    'pika': mock_pika,
    'scansynclib.settings': mock_settings_mod,
    'scansynclib.sqlite_wrapper': MagicMock(),
}
Comment thread tests/test_ocr_verification.py Outdated
assert result == ""

@patch('scansynclib.helpers.PdfReader')
def test_extract_text_strips_whitespace(self, mock_pdf_reader):
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread ocr_service/main.py
Comment on lines +49 to +56
if os.path.exists(item.ocr_file):
    extracted_text = extract_text(item.ocr_file).strip()
    if extracted_text:
        logger.info(f"OCR verification successful: extracted {len(extracted_text)} characters from {item.filename}")
        item.ocr_status = OCRStatus.COMPLETED
    else:
        logger.warning(f"OCR verification failed: no text found in OCR output file {item.ocr_file}")
        item.ocr_status = OCRStatus.FAILED
Comment on lines +81 to +99
mock_pika = MagicMock()
mock_pika.exceptions = MagicMock()
mock_pika.exceptions.AMQPConnectionError = Exception

mock_settings_mod = MagicMock()
mock_settings_mod.settings = MagicMock()
mock_settings_mod.settings.file_naming = MagicMock()
mock_settings_mod.settings.file_naming.ollama_server_url = None
mock_settings_mod.settings.file_naming.ollama_server_port = None
mock_settings_mod.settings.file_naming.ollama_model = None
mock_settings_mod.settings.file_naming.openai_api_key = None

module_patches = {
    'ocrmypdf': mock_ocrmypdf,
    'pika': mock_pika,
    'scansynclib.settings': mock_settings_mod,
    'scansynclib.sqlite_wrapper': MagicMock(),
}

Comment thread tests/test_ocr_verification.py Outdated
Comment on lines +36 to +49
def test_extract_text_strips_whitespace(self, mock_pdf_reader):
    """Test that extract_text properly handles text with whitespace."""
    from scansynclib.helpers import extract_text

    # Mock the PDF reader to return text with whitespace
    mock_page = Mock()
    mock_page.extract_text.return_value = " \n\t Some text \n\t "
    mock_reader = Mock()
    mock_reader.pages = [mock_page]
    mock_pdf_reader.return_value = mock_reader

    result = extract_text("dummy_path.pdf")
    assert result == " \n\t Some text \n\t "  # Should return raw text, not stripped

…s, fix test mock and naming

Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/5361cf46-e4fa-46bf-a66b-c4283fbf20f6

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
Copilot finished work on behalf of maxi07 May 24, 2026 09:25
Copilot AI requested a review from maxi07 May 24, 2026 09:25
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

def _sigterm(signum, frame):
    self._shutdown()
    if callable(prev):
        prev(signum, frame)
Comment thread web_service/entrypoint.sh Outdated
Comment on lines +17 to +24
echo "Starting app with Gunicorn..."
gunicorn --worker-class gevent --bind 0.0.0.0:5001 src.main:app
fi
EOF No newline at end of file
exec gunicorn \
    --user appuser \
    --group appuser \
    --worker-class gevent \
    --bind 0.0.0.0:5001 \
    --graceful-timeout 30 \
    src.main:app
Comment thread ocr_service/main.py
Comment on lines +48 to +56
# Verify that the OCR file actually contains text
if os.path.exists(item.ocr_file):
    extracted_text = (extract_text(item.ocr_file) or "").strip()
    if extracted_text:
        logger.info(f"OCR verification successful: extracted {len(extracted_text)} characters from {item.filename}")
        item.ocr_status = OCRStatus.COMPLETED
    else:
        logger.warning(f"OCR verification failed: no text found in OCR output file {item.ocr_file}")
        item.ocr_status = OCRStatus.FAILED
Comment thread scansynclib/scansynclib/helpers.py Outdated
Comment on lines +280 to +285
        parts.append(page_text)
        total_chars += len(page_text)
        if total_chars >= max_chars:
            break
    result = "\n".join(parts)
    return result[:max_chars]
Comment on lines +188 to +196
const icon = document.createElement('i');
icon.className = 'bi bi-exclamation-triangle-fill text-warning';
const small = document.createElement('small');
small.textContent = ocrStatusText;
ocrStatusSpan.appendChild(document.createElement('br'));
ocrStatusSpan.appendChild(icon);
ocrStatusSpan.appendChild(document.createTextNode(' '));
ocrStatusSpan.appendChild(small);
ocrStatusSpan.appendChild(document.createElement('br'));
Contributor Author

Copilot AI commented May 24, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • redis
    • Triggering command: /usr/bin/python python -m pytest tests/test_ocr_verification.py tests/test_settings.py -q (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comment thread web_service/src/static/js/dashboard.js Outdated
Comment thread web_service/entrypoint.sh
Comment on lines +9 to +13
# Use `exec` so the final process becomes PID 1 and receives SIGTERM directly
# from Docker. This enables graceful shutdown (Gunicorn workers finish in-flight
# requests, dev server stops cleanly) instead of being killed by a dying `su`
# session.
if [ "$FLASK_ENV" = "development" ]; then
Comment thread web_service/entrypoint.sh
Comment on lines +13 to 17
if [ "$FLASK_ENV" = "development" ]; then
echo "Starting Flask development server..."
flask run --host=0.0.0.0 --port=5001 --reload --debug
exec su appuser -c "exec flask run --host=0.0.0.0 --port=5001 --reload --debug"
else
echo "Starting app with Gunicorn..."
Comment thread scansynclib/scansynclib/settings.py Outdated
Comment thread tests/test_ocr_verification.py Outdated
maxi07 and others added 5 commits May 24, 2026 12:04
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@maxi07 maxi07 merged commit ec8b8f8 into main May 24, 2026
4 checks passed
@maxi07 maxi07 deleted the copilot/fix-41 branch May 24, 2026 20:39
Development

Successfully merging this pull request may close these issues.

Add verification for OCR after our step

3 participants