Add OCR text verification to prevent false positive completions by Copilot · Pull Request #46 · maxi07/ScanSync

Copilot · 2025-08-28T14:18:17Z

Understand current extract_text usage and callers
Add max_pages (default 10) and max_chars (default 50,000) parameters to extract_text()
Add tests for max_pages and max_chars limits
All 15 tests pass
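
The limits in the checklist above can be sketched roughly as follows. This is an illustration, not the code from the PR: `extract_text_limited` and `FakePage` are hypothetical names, and the function takes any iterable of page objects exposing `extract_text()` so that it runs without a PDF library installed.

```python
from itertools import islice
from typing import Iterable, Optional, Protocol


class Page(Protocol):
    """Anything with an extract_text() method (assumed page interface)."""

    def extract_text(self) -> Optional[str]: ...


def extract_text_limited(
    pages: Iterable[Page], max_pages: int = 10, max_chars: int = 50_000
) -> str:
    """Join page texts, reading at most max_pages pages and max_chars characters."""
    parts = []
    total_chars = 0
    for page in islice(pages, max_pages):
        page_text = page.extract_text() or ""  # treat a missing result as empty
        parts.append(page_text)
        total_chars += len(page_text)
        if total_chars >= max_chars:
            break  # stop reading further pages once the budget is used up
    return "\n".join(parts)[:max_chars]


class FakePage:
    """Hypothetical stand-in for a PDF page, used only for the example."""

    def __init__(self, text: Optional[str]) -> None:
        self._text = text

    def extract_text(self) -> Optional[str]:
        return self._text
```

Capping both values bounds the verification cost on large scans: the check only needs to know whether any text exists, not to read the whole document.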

Copilot · 2025-08-28T14:18:19Z

@maxi07 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

maxi07 · 2026-03-23T17:10:21Z

@copilot please add tests for this new functionality and also make sure that the web UI reflects a failing OCR status (OCR status gets updated properly in the DB AND the UI)

Copilot · 2026-03-23T17:26:39Z

@copilot please add tests for this new functionality and also make sure that the web UI reflects a failing OCR status (OCR status gets updated properly in the DB AND the UI)

Done in commit 7aa489c. Here's what was added:

Tests (tests/test_ocr_verification.py): Added 9 new unit tests in TestStartProcessing that mock all external dependencies and cover every outcome of start_processing:

OCR success with text → COMPLETED
OCR success with no/whitespace-only text → FAILED
OCR success but output file missing → OUTPUT_ERROR
Non-zero exit code → FAILED
UnsupportedImageFormatError → UNSUPPORTED
DpiError → DPI_ERROR
Forwarded to upload queue on success
DB final update includes both file_status and ocr_status
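
The outcome-to-status mapping listed above can be summarized in a small decision function. This is a minimal sketch under assumptions: `run_and_classify` and its callable parameters are hypothetical, the enum only mirrors the status names mentioned in this comment, and the two exception classes are local stand-ins for the OCR library's errors of the same name.

```python
from enum import Enum, auto
from typing import Callable, Optional


class OCRStatus(Enum):
    """Status names taken from the comment above; the values are arbitrary."""

    COMPLETED = auto()
    FAILED = auto()
    OUTPUT_ERROR = auto()
    UNSUPPORTED = auto()
    DPI_ERROR = auto()


class UnsupportedImageFormatError(Exception):
    """Local stand-in for the OCR library's exception of the same name."""


class DpiError(Exception):
    """Local stand-in for the OCR library's exception of the same name."""


def run_and_classify(
    run_ocr: Callable[[], int],
    output_exists: Callable[[], bool],
    read_text: Callable[[], Optional[str]],
) -> OCRStatus:
    """Run OCR via the given callables and map the outcome to a status."""
    try:
        exit_code = run_ocr()
    except UnsupportedImageFormatError:
        return OCRStatus.UNSUPPORTED
    except DpiError:
        return OCRStatus.DPI_ERROR
    if exit_code != 0:
        return OCRStatus.FAILED
    if not output_exists():
        return OCRStatus.OUTPUT_ERROR
    # Exit code 0 alone is not enough: the output must contain real text.
    text = (read_text() or "").strip()
    return OCRStatus.COMPLETED if text else OCRStatus.FAILED
```

Passing the three steps in as callables keeps the decision logic testable without the OCR engine, the file system, or a message queue.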

DB: Added an ocr_status column migration to scanneddata in sqlite_wrapper.py. The OCR service now writes the status name (e.g. "COMPLETED", "FAILED", "DPI_ERROR") to this column at the end of each OCR job.
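
A column migration of this kind is often written as an idempotent check followed by `ALTER TABLE`. The sketch below shows that pattern with the standard `sqlite3` module; it is not the code in `sqlite_wrapper.py`. The helper names, the `TEXT` column type, and the `id` and `file_status` columns of the example table are assumptions.

```python
import sqlite3


def ensure_ocr_status_column(conn: sqlite3.Connection) -> bool:
    """Add scanneddata.ocr_status if it is missing. Returns True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(scanneddata)")}
    if "ocr_status" in columns:
        return False
    conn.execute("ALTER TABLE scanneddata ADD COLUMN ocr_status TEXT")
    conn.commit()
    return True


def set_ocr_status(conn: sqlite3.Connection, item_id: int, status_name: str) -> None:
    """Store the status name (for example "DPI_ERROR") for one row."""
    conn.execute(
        "UPDATE scanneddata SET ocr_status = ? WHERE id = ?",
        (status_name, item_id),
    )
    conn.commit()


# Example setup with an in-memory database and an assumed minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scanneddata (id INTEGER PRIMARY KEY, file_status TEXT)")
conn.execute("INSERT INTO scanneddata (id, file_status) VALUES (1, 'Pending')")
```

Because the function checks for the column first, it is safe to call on every service start, including against databases that were already migrated.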

UI: The dashboard card now shows a ⚠️ warning line with a human-readable OCR error message when OCR encountered a problem (e.g. "OCR: No text found", "OCR: Image DPI too low"). This updates both on initial page load and live via SSE when a document is processed.
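
To illustrate how a status name could reach the dashboard, the sketch below builds a server-sent event carrying the status and a warning text. It is written in Python for consistency with the rest of the PR, although the actual card rendering lives in `dashboard.js`. The function names, the payload keys, and the mapping from `FAILED` to "OCR: No text found" are assumptions; only the two message strings come from the comment above.

```python
import json
from typing import Optional

# Assumed mapping; the PR may associate messages with statuses differently.
OCR_WARNINGS = {
    "FAILED": "OCR: No text found",
    "DPI_ERROR": "OCR: Image DPI too low",
}


def ocr_warning_text(ocr_status: Optional[str]) -> Optional[str]:
    """Return a human-readable warning, or None when nothing should be shown."""
    if ocr_status is None or ocr_status == "COMPLETED":
        return None
    # Fall back to the raw status name for statuses without a custom message.
    return OCR_WARNINGS.get(ocr_status, f"OCR: {ocr_status}")


def sse_event(item_id: int, ocr_status: Optional[str]) -> str:
    """Format one server-sent event: a 'data:' line followed by a blank line."""
    payload = {
        "id": item_id,
        "ocr_status": ocr_status,
        "ocr_warning": ocr_warning_text(ocr_status),
    }
    return f"data: {json.dumps(payload)}\n\n"
```

Sending `None` for documents without a status lets the client hide the warning line instead of rendering an empty one.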

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

redis
- Triggering command: /usr/bin/python python -m pytest tests/ -v --ignore=tests/test_homepage.py --ignore=tests/test_advanced_settings_web.py ****** (dns block)
- Triggering command: /usr/bin/python python -m pytest tests/test_settings.py (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Copilot

Pull request overview

Adds an OCR verification step so the pipeline no longer treats “OCRmyPDF exit code 0” as sufficient proof of extracted text, persists the resulting ocr_status to SQLite, and surfaces OCR failures in the dashboard (including via live SSE updates).

Changes:

Verify OCR output by extracting text from the generated OCR PDF and setting OCRStatus accordingly.
Persist ocr_status in scanneddata via SQLite migration + service DB updates.
Expose ocr_status to the web UI (SSE payload + dashboard card warning display) and add unit tests covering OCR outcome scenarios.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `ocr_service/main.py` | Adds post-OCR text verification, records `ocr_status` to DB, and adds a `__main__` guard. |
| `scansynclib/scansynclib/sqlite_wrapper.py` | Adds migration to create `ocr_status` column in `scanneddata`. |
| `web_service/src/main.py` | Includes `ocr_status` in SSE payload updates. |
| `web_service/src/static/js/dashboard.js` | Displays OCR warning indicator/text on cards and updates it live. |
| `web_service/src/routes/dashboard.py` | Adjusts dashboard load log level. |
| `tests/test_ocr_verification.py` | Adds tests for text extraction behavior and `start_processing` OCR status outcomes. |

+
+            # Verify that the OCR file actually contains text
+            if os.path.exists(item.ocr_file):
+                extracted_text = extract_text(item.ocr_file).strip()


                currently_uploading=item.current_uploading,
                current_upload_target=item.current_upload_target,
                badges=badges,  # Add the generated badges
+                ocr_status=item.ocr_status.name if item.ocr_status else None,


+    module_patches = {
+        'ocrmypdf': mock_ocrmypdf,
+        'pika': mock_pika,
+        'scansynclib.settings': mock_settings_mod,
+        'scansynclib.sqlite_wrapper': MagicMock(),
+    }


+        assert result == ""
+
+    @patch('scansynclib.helpers.PdfReader')
+    def test_extract_text_strips_whitespace(self, mock_pdf_reader):


Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

+            if os.path.exists(item.ocr_file):
+                extracted_text = extract_text(item.ocr_file).strip()
+                if extracted_text:
+                    logger.info(f"OCR verification successful: extracted {len(extracted_text)} characters from {item.filename}")
+                    item.ocr_status = OCRStatus.COMPLETED
+                else:
+                    logger.warning(f"OCR verification failed: no text found in OCR output file {item.ocr_file}")
+                    item.ocr_status = OCRStatus.FAILED


+    mock_pika = MagicMock()
+    mock_pika.exceptions = MagicMock()
+    mock_pika.exceptions.AMQPConnectionError = Exception
+
+    mock_settings_mod = MagicMock()
+    mock_settings_mod.settings = MagicMock()
+    mock_settings_mod.settings.file_naming = MagicMock()
+    mock_settings_mod.settings.file_naming.ollama_server_url = None
+    mock_settings_mod.settings.file_naming.ollama_server_port = None
+    mock_settings_mod.settings.file_naming.ollama_model = None
+    mock_settings_mod.settings.file_naming.openai_api_key = None
+
+    module_patches = {
+        'ocrmypdf': mock_ocrmypdf,
+        'pika': mock_pika,
+        'scansynclib.settings': mock_settings_mod,
+        'scansynclib.sqlite_wrapper': MagicMock(),
+    }
+
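
A dictionary such as `module_patches` is commonly applied with `unittest.mock.patch.dict` on `sys.modules`, so that imports resolve to the mocks while the patch is active. The sketch below shows that pattern; whether the PR applies it this way is an assumption, and `import_with_mocks` is a hypothetical helper.

```python
import importlib
import sys
from unittest.mock import MagicMock, patch

# Mock module standing in for the message-queue client.
mock_pika = MagicMock()
mock_pika.exceptions.AMQPConnectionError = Exception  # must be a real exception class

module_patches = {"pika": mock_pika}


def import_with_mocks(name: str):
    """Import a module while the entries in module_patches shadow sys.modules."""
    with patch.dict(sys.modules, module_patches):
        # Imports check sys.modules first, so the mock is returned here.
        return importlib.import_module(name)
```

Assigning a real exception class matters because code under test may use it in an `except` clause, which rejects a plain `MagicMock` attribute at runtime.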


+    def test_extract_text_strips_whitespace(self, mock_pdf_reader):
+        """Test that extract_text properly handles text with whitespace."""
+        from scansynclib.helpers import extract_text
+
+        # Mock the PDF reader to return text with whitespace
+        mock_page = Mock()
+        mock_page.extract_text.return_value = "  \n\t  Some text  \n\t  "
+        mock_reader = Mock()
+        mock_reader.pages = [mock_page]
+        mock_pdf_reader.return_value = mock_reader
+
+        result = extract_text("dummy_path.pdf")
+        assert result == "  \n\t  Some text  \n\t  "  # Should return raw text, not stripped
+


Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

+                def _sigterm(signum, frame):
+                    self._shutdown()
+                    if callable(prev):
+                        prev(signum, frame)
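
The handler fragment above chains to whatever SIGTERM handler was registered before it. A self-contained sketch of that pattern follows; `make_chained_handler` and `install_sigterm_handler` are hypothetical names, and the `shutdown` callable stands in for the `self._shutdown` method in the diff.

```python
import signal
from typing import Any, Callable


def make_chained_handler(shutdown: Callable[[], None], prev: Any) -> Callable:
    """Build a handler that runs shutdown, then the previous handler if any."""

    def _sigterm(signum, frame):
        shutdown()
        # signal.SIG_DFL and signal.SIG_IGN are not callable, so they are skipped.
        if callable(prev):
            prev(signum, frame)

    return _sigterm


def install_sigterm_handler(shutdown: Callable[[], None]) -> Callable:
    """Register the chained handler for SIGTERM (call from the main thread)."""
    prev = signal.getsignal(signal.SIGTERM)
    handler = make_chained_handler(shutdown, prev)
    signal.signal(signal.SIGTERM, handler)
    return handler
```

Separating handler construction from registration lets the chaining logic be exercised in tests by calling the handler directly, without sending a real signal.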


    echo "Starting app with Gunicorn..."
-    gunicorn --worker-class gevent --bind 0.0.0.0:5001 src.main:app
-fi
-EOF
+    exec gunicorn \
+        --user appuser \
+        --group appuser \
+        --worker-class gevent \
+        --bind 0.0.0.0:5001 \
+        --graceful-timeout 30 \
+        src.main:app
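
The same Gunicorn options can also be kept in a Python configuration file rather than on the command line. The fragment below is a sketch of such a file, not part of the PR: the file name `gunicorn.conf.py` is an assumption, while the setting names are Gunicorn's documented equivalents of the flags above.

```python
# gunicorn.conf.py (hypothetical file mirroring the command-line flags above)
bind = "0.0.0.0:5001"      # --bind
worker_class = "gevent"    # --worker-class
graceful_timeout = 30      # --graceful-timeout, in seconds
user = "appuser"           # --user
group = "appuser"          # --group
```

With such a file in place, the entrypoint could be reduced to `exec gunicorn -c gunicorn.conf.py src.main:app`, keeping `exec` so that Gunicorn still receives SIGTERM directly.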


+            # Verify that the OCR file actually contains text
+            if os.path.exists(item.ocr_file):
+                extracted_text = (extract_text(item.ocr_file) or "").strip()
+                if extracted_text:
+                    logger.info(f"OCR verification successful: extracted {len(extracted_text)} characters from {item.filename}")
+                    item.ocr_status = OCRStatus.COMPLETED
+                else:
+                    logger.warning(f"OCR verification failed: no text found in OCR output file {item.ocr_file}")
+                    item.ocr_status = OCRStatus.FAILED


+                parts.append(page_text)
+                total_chars += len(page_text)
+                if total_chars >= max_chars:
+                    break
+        result = "\n".join(parts)
+        return result[:max_chars]


+                    const icon = document.createElement('i');
+                    icon.className = 'bi bi-exclamation-triangle-fill text-warning';
+                    const small = document.createElement('small');
+                    small.textContent = ocrStatusText;
+                    ocrStatusSpan.appendChild(document.createElement('br'));
+                    ocrStatusSpan.appendChild(icon);
+                    ocrStatusSpan.appendChild(document.createTextNode(' '));
+                    ocrStatusSpan.appendChild(small);
+                    ocrStatusSpan.appendChild(document.createElement('br'));


Copilot · 2026-05-24T09:47:37Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

redis
- Triggering command: /usr/bin/python python -m pytest tests/test_ocr_verification.py tests/test_settings.py -q (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

+# Use `exec` so the final process becomes PID 1 and receives SIGTERM directly
+# from Docker. This enables graceful shutdown (Gunicorn workers finish in-flight
+# requests, dev server stops cleanly) instead of being killed by a dying `su`
+# session.
+if [ "$FLASK_ENV" = "development" ]; then


+if [ "$FLASK_ENV" = "development" ]; then
    echo "Starting Flask development server..."
-    flask run --host=0.0.0.0 --port=5001 --reload --debug
+    exec su appuser -c "exec flask run --host=0.0.0.0 --port=5001 --reload --debug"
 else
    echo "Starting app with Gunicorn..."


Initial plan

6702ca4

Copilot AI assigned Copilot and maxi07 Aug 28, 2025

Copilot started work on behalf of maxi07 August 28, 2025 14:18 View session

Add OCR verification to check for text extraction after OCR completion

e439182

Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add verification for OCR after our step~~ Add OCR text verification to prevent false positive completions Aug 28, 2025

Copilot AI requested a review from maxi07 August 28, 2025 14:28

Copilot finished work on behalf of maxi07 August 28, 2025 14:28

maxi07 marked this pull request as ready for review March 23, 2026 17:09

Copilot started work on behalf of maxi07 March 23, 2026 17:10 View session

Add OCR status persistence, UI display, and comprehensive start_proce…

7aa489c

…ssing tests Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/a2c6fc77-605f-48ae-b6b2-579df2e70e6c

Copilot finished work on behalf of maxi07 March 23, 2026 17:27

maxi07 linked an issue May 24, 2026 that may be closed by this pull request

Add verification for OCR after our step #41

Closed

maxi07 added 2 commits May 24, 2026 10:17

fix log message

d66d651

fix styling of error message

ea4623f

maxi07 approved these changes May 24, 2026

maxi07 requested a review from Copilot May 24, 2026 08:56

Copilot started reviewing on behalf of maxi07 May 24, 2026 08:56 View session

Copilot AI reviewed May 24, 2026

fix pytests

15c8c13

maxi07 requested a review from Copilot May 24, 2026 09:19

Copilot started reviewing on behalf of maxi07 May 24, 2026 09:19 View session

Copilot AI reviewed May 24, 2026

Copilot started work on behalf of maxi07 May 24, 2026 09:23 View session

Address review comments: handle None text, defensive ocr_status acces…

afe2b97

…s, fix test mock and naming Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/5361cf46-e4fa-46bf-a66b-c4283fbf20f6 Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>

Copilot finished work on behalf of maxi07 May 24, 2026 09:25

Copilot AI requested a review from maxi07 May 24, 2026 09:25

Copilot started work on behalf of maxi07 May 24, 2026 09:27 View session

Limit extract_text with max_pages=10 and max_chars=50000 defaults

7c04f60

Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/da1eb885-be5a-4b50-bc86-55d37ca3a9b4 Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>

Copilot finished work on behalf of maxi07 May 24, 2026 09:30

maxi07 requested a review from Copilot May 24, 2026 09:31

Copilot started reviewing on behalf of maxi07 May 24, 2026 09:31 View session

Copilot AI reviewed May 24, 2026

Copilot started work on behalf of maxi07 May 24, 2026 09:43 View session

Fix review-thread issues in OCR verification, shutdown handling, and …

cf34429

…web runtime Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/f01f1d65-f421-45f2-b0aa-6639bf67b715 Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>

Copilot finished work on behalf of maxi07 May 24, 2026 09:47

maxi07 requested a review from Copilot May 24, 2026 09:58

Copilot started reviewing on behalf of maxi07 May 24, 2026 09:58 View session

Copilot AI reviewed May 24, 2026

maxi07 and others added 5 commits May 24, 2026 12:04

Potential fix for pull request finding

1803c7f

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

93cbc64

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

ea6b567

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

fix tests

c8c9da9

fix ocr message

414abe8

maxi07 approved these changes May 24, 2026

Merge branch 'main' into copilot/fix-41

1b306b6

maxi07 merged commit ec8b8f8 into main May 24, 2026
4 checks passed

maxi07 deleted the copilot/fix-41 branch May 24, 2026 20:39

maxi07 merged 15 commits into main from copilot/fix-41

3 participants

Conversation

Copilot AI commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI commented Aug 28, 2025

Uh oh!

maxi07 commented Mar 23, 2026

Uh oh!

Copilot AI commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI commented May 24, 2026

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented Aug 28, 2025 •

edited

Loading

Copilot AI commented Mar 23, 2026 •

edited

Loading