Add OCR text verification to prevent false positive completions#46
Conversation
|
@maxi07 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs. I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review. |
Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
|
@copilot please add tests that will test this new functionality and also make sure that the web ui reflects failing our status (OCR status gets updated properly in the db AND the ui) |
…ssing tests Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com> Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/a2c6fc77-605f-48ae-b6b2-579df2e70e6c
Done in commit Tests (
DB: Added an UI: The dashboard card now shows a Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
There was a problem hiding this comment.
Pull request overview
Adds an OCR verification step so the pipeline no longer treats “OCRmyPDF exit code 0” as sufficient proof of extracted text, persists the resulting ocr_status to SQLite, and surfaces OCR failures in the dashboard (including via live SSE updates).
Changes:
- Verify OCR output by extracting text from the generated OCR PDF and setting
OCRStatusaccordingly. - Persist
ocr_statusinscanneddatavia SQLite migration + service DB updates. - Expose
ocr_statusto the web UI (SSE payload + dashboard card warning display) and add unit tests covering OCR outcome scenarios.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
ocr_service/main.py |
Adds post-OCR text verification, records ocr_status to DB, and adds a __main__ guard. |
scansynclib/scansynclib/sqlite_wrapper.py |
Adds migration to create ocr_status column in scanneddata. |
web_service/src/main.py |
Includes ocr_status in SSE payload updates. |
web_service/src/static/js/dashboard.js |
Displays OCR warning indicator/text on cards and updates it live. |
web_service/src/routes/dashboard.py |
Adjusts dashboard load log level. |
tests/test_ocr_verification.py |
Adds tests for text extraction behavior and start_processing OCR status outcomes. |
|
|
||
| # Verify that the OCR file actually contains text | ||
| if os.path.exists(item.ocr_file): | ||
| extracted_text = extract_text(item.ocr_file).strip() |
| currently_uploading=item.current_uploading, | ||
| current_upload_target=item.current_upload_target, | ||
| badges=badges, # Add the generated badges | ||
| ocr_status=item.ocr_status.name if item.ocr_status else None, |
| module_patches = { | ||
| 'ocrmypdf': mock_ocrmypdf, | ||
| 'pika': mock_pika, | ||
| 'scansynclib.settings': mock_settings_mod, | ||
| 'scansynclib.sqlite_wrapper': MagicMock(), | ||
| } |
| assert result == "" | ||
|
|
||
| @patch('scansynclib.helpers.PdfReader') | ||
| def test_extract_text_strips_whitespace(self, mock_pdf_reader): |
| if os.path.exists(item.ocr_file): | ||
| extracted_text = extract_text(item.ocr_file).strip() | ||
| if extracted_text: | ||
| logger.info(f"OCR verification successful: extracted {len(extracted_text)} characters from {item.filename}") | ||
| item.ocr_status = OCRStatus.COMPLETED | ||
| else: | ||
| logger.warning(f"OCR verification failed: no text found in OCR output file {item.ocr_file}") | ||
| item.ocr_status = OCRStatus.FAILED |
| mock_pika = MagicMock() | ||
| mock_pika.exceptions = MagicMock() | ||
| mock_pika.exceptions.AMQPConnectionError = Exception | ||
|
|
||
| mock_settings_mod = MagicMock() | ||
| mock_settings_mod.settings = MagicMock() | ||
| mock_settings_mod.settings.file_naming = MagicMock() | ||
| mock_settings_mod.settings.file_naming.ollama_server_url = None | ||
| mock_settings_mod.settings.file_naming.ollama_server_port = None | ||
| mock_settings_mod.settings.file_naming.ollama_model = None | ||
| mock_settings_mod.settings.file_naming.openai_api_key = None | ||
|
|
||
| module_patches = { | ||
| 'ocrmypdf': mock_ocrmypdf, | ||
| 'pika': mock_pika, | ||
| 'scansynclib.settings': mock_settings_mod, | ||
| 'scansynclib.sqlite_wrapper': MagicMock(), | ||
| } | ||
|
|
| def test_extract_text_strips_whitespace(self, mock_pdf_reader): | ||
| """Test that extract_text properly handles text with whitespace.""" | ||
| from scansynclib.helpers import extract_text | ||
|
|
||
| # Mock the PDF reader to return text with whitespace | ||
| mock_page = Mock() | ||
| mock_page.extract_text.return_value = " \n\t Some text \n\t " | ||
| mock_reader = Mock() | ||
| mock_reader.pages = [mock_page] | ||
| mock_pdf_reader.return_value = mock_reader | ||
|
|
||
| result = extract_text("dummy_path.pdf") | ||
| assert result == " \n\t Some text \n\t " # Should return raw text, not stripped | ||
|
|
…s, fix test mock and naming Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/5361cf46-e4fa-46bf-a66b-c4283fbf20f6 Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/da1eb885-be5a-4b50-bc86-55d37ca3a9b4 Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
| def _sigterm(signum, frame): | ||
| self._shutdown() | ||
| if callable(prev): | ||
| prev(signum, frame) |
| echo "Starting app with Gunicorn..." | ||
| gunicorn --worker-class gevent --bind 0.0.0.0:5001 src.main:app | ||
| fi | ||
| EOF No newline at end of file | ||
| exec gunicorn \ | ||
| --user appuser \ | ||
| --group appuser \ | ||
| --worker-class gevent \ | ||
| --bind 0.0.0.0:5001 \ | ||
| --graceful-timeout 30 \ | ||
| src.main:app |
| # Verify that the OCR file actually contains text | ||
| if os.path.exists(item.ocr_file): | ||
| extracted_text = (extract_text(item.ocr_file) or "").strip() | ||
| if extracted_text: | ||
| logger.info(f"OCR verification successful: extracted {len(extracted_text)} characters from {item.filename}") | ||
| item.ocr_status = OCRStatus.COMPLETED | ||
| else: | ||
| logger.warning(f"OCR verification failed: no text found in OCR output file {item.ocr_file}") | ||
| item.ocr_status = OCRStatus.FAILED |
| parts.append(page_text) | ||
| total_chars += len(page_text) | ||
| if total_chars >= max_chars: | ||
| break | ||
| result = "\n".join(parts) | ||
| return result[:max_chars] |
| const icon = document.createElement('i'); | ||
| icon.className = 'bi bi-exclamation-triangle-fill text-warning'; | ||
| const small = document.createElement('small'); | ||
| small.textContent = ocrStatusText; | ||
| ocrStatusSpan.appendChild(document.createElement('br')); | ||
| ocrStatusSpan.appendChild(icon); | ||
| ocrStatusSpan.appendChild(document.createTextNode(' ')); | ||
| ocrStatusSpan.appendChild(small); | ||
| ocrStatusSpan.appendChild(document.createElement('br')); |
…web runtime Agent-Logs-Url: https://github.com/maxi07/ScanSync/sessions/f01f1d65-f421-45f2-b0aa-6639bf67b715 Co-authored-by: maxi07 <7480270+maxi07@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
| # Use `exec` so the final process becomes PID 1 and receives SIGTERM directly | ||
| # from Docker. This enables graceful shutdown (Gunicorn workers finish in-flight | ||
| # requests, dev server stops cleanly) instead of being killed by a dying `su` | ||
| # session. | ||
| if [ "$FLASK_ENV" = "development" ]; then |
| if [ "$FLASK_ENV" = "development" ]; then | ||
| echo "Starting Flask development server..." | ||
| flask run --host=0.0.0.0 --port=5001 --reload --debug | ||
| exec su appuser -c "exec flask run --host=0.0.0.0 --port=5001 --reload --debug" | ||
| else | ||
| echo "Starting app with Gunicorn..." |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
extract_textusage and callersmax_pages(default 10) andmax_chars(default 50,000) parameters toextract_text()✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.