feat: integrate PaddleOCR as document loaders + enhance chat/index UX#814

Open
cin-niko wants to merge 20 commits into main from feat/dev

Conversation

@cin-niko cin-niko commented Mar 7, 2026

Description

Adds PaddleOCR Vision-Language (VL) and PP-Structure v3 document loaders to the kotaemon library. The loaders use PaddleOCR models for OCR and layout analysis, so text, tables, and figures can be extracted from PDFs and images.
We've updated the Dockerfile to support GPU-based PaddleOCR.

Type of change

  • New features (non-breaking change).
  • Bug fix (non-breaking change).
  • Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

  • I have performed a self-review of my code.
  • I have added thorough tests if it is a core feature.
  • There is a reference to the original bug report and related work.
  • I have commented on my code, particularly in hard-to-understand areas.
  • The feature is well documented.

@cin-niko cin-niko changed the title feat: integrate PaddleOCR features + enhance chat/index UX feat: integrate PaddleOCR document loaders + enhance chat/index UX Mar 7, 2026
@cin-niko cin-niko changed the title feat: integrate PaddleOCR document loaders + enhance chat/index UX feat: integrate PaddleOCR as document loaders + enhance chat/index UX Mar 7, 2026
@cin-niko cin-niko requested a review from Copilot March 7, 2026 06:04

Copilot AI left a comment

Pull request overview

This PR adds PaddleOCR (PPStructureV3 and Vision-Language) as document loaders in the kotaemon library and enhances the chat interface's @mention UX for file tagging and web search. It also includes several ancillary improvements: a chunk type filter in the file index UI, SQL LIKE injection protection, a LightRAG settings inheritance fix, and documentation typo corrections.

Changes:

  • Introduces PPStructureV3Reader and PaddleOCRVLReader document loaders with a shared PaddleOCRResult base adapter, integrated into the indexing pipeline as new reader mode options.
  • Reworks the chat @mention system: @WebSearch replaces @web, mentions are bolded for display, and the Tribute.js autocomplete is enhanced with lookup/search improvements and delete-key handling.
  • Adds chunk type filtering (text/table/image/thumbnail) in the file index UI, SQL LIKE metacharacter escaping, and inherits parent settings in LightRAG's get_user_settings.
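The SQL LIKE metacharacter escaping mentioned in the last bullet can be sketched roughly as follows. This is a minimal illustration using the standard `sqlite3` module, not the code from this PR; the `files` table, the `name` column, and both helper names are hypothetical.

```python
import sqlite3


def escape_like(term: str, escape_char: str = "\\") -> str:
    """Escape LIKE wildcards so the term is matched literally."""
    # Escape the escape character first, then the two LIKE wildcards.
    return (
        term.replace(escape_char, escape_char * 2)
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )


def search_names(conn: sqlite3.Connection, term: str) -> list[str]:
    # Hypothetical table and column names, for illustration only.
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        "SELECT name FROM files WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("report_2024.pdf",), ("report-2024.pdf",), ("100%_done.txt",)],
)

underscore_hits = search_names(conn, "report_")
percent_hits = search_names(conn, "100%")
```

Without the escaping step, `report_` would also match `report-2024.pdf`, because `_` is a single-character wildcard in LIKE.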

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| libs/kotaemon/kotaemon/loaders/paddleocr_loader/adapter.py | New base dataclass adapter for PaddleOCR results |
| libs/kotaemon/kotaemon/loaders/paddleocr_loader/ppstructure_v3_loader.py | New PPStructureV3 reader and result adapter |
| libs/kotaemon/kotaemon/loaders/paddleocr_loader/paddleocr_vl_loader.py | New PaddleOCR VL reader and result adapter |
| libs/kotaemon/kotaemon/loaders/paddleocr_loader/__init__.py | Package exports for new loaders |
| libs/kotaemon/kotaemon/loaders/__init__.py | Top-level loader exports updated |
| libs/kotaemon/kotaemon/indices/ingests/files.py | Instantiates paddle readers with configurable device |
| libs/ktem/ktem/index/file/pipelines.py | Adds paddle-struct and paddle-vl reader modes |
| libs/ktem/ktem/utils/conversation.py | New format_mentions_for_display, updated get_file_names_regex |
| libs/ktem/ktem/utils/commands.py | Changes WEB_SEARCH_COMMAND from "web" to "WebSearch" |
| libs/ktem/ktem/utils/__init__.py | Exports format_mentions_for_display |
| libs/ktem/ktem/pages/chat/__init__.py | Display formatting for mentions, pipeline input stripping, dropdown input fix |
| libs/ktem/ktem/pages/chat/chat_panel.py | Updated placeholder text |
| libs/ktem/ktem/index/file/ui.py | Enhanced JS autocomplete, chunk type filter, SQL LIKE escaping |
| libs/ktem/ktem/index/file/graph/lightrag_pipelines.py | Inherits parent settings, returns them on import error |
| libs/kotaemon/tests/test_paddleocr_loader.py | Unit tests for new PaddleOCR loaders |
| libs/kotaemon/tests/conftest.py | Adds skip_when_paddleocr_not_installed |
| libs/kotaemon/pyproject.toml | New optional dependency extras for docling, paddleocr, lightrag |
| libs/ktem/ktem/assets/md/usage.md | Typo fix: "Your" → "You" |
| docs/usage.md | Same typo fix |
| README.md | Same typo fix |

Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/__init__.py
Comment thread libs/ktem/ktem/pages/chat/__init__.py Outdated
Comment thread libs/ktem/ktem/pages/chat/__init__.py Outdated
Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/ppstructure_v3_loader.py Outdated
@cin-niko cin-niko requested a review from Copilot March 8, 2026 05:54
@cin-niko cin-niko added the enhancement New feature or request label Mar 8, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.


Comment thread libs/ktem/ktem/pages/chat/__init__.py
Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/adapter.py Outdated
Comment thread libs/ktem/ktem/index/file/ui.py Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.


Comment thread libs/ktem/ktem/index/file/ui.py
@Bobholamovic Bobholamovic left a comment

Hi, I've left some comments. Please take a look when you have time.

Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/paddleocr_vl_loader.py Outdated
Comment thread Dockerfile Outdated
Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/ppstructure_v3_loader.py Outdated
Comment on lines +13 to +51
TEXT_LABELS: set[str] = {
    "text",
    "paragraph_title",
    "doc_title",
    "abstract",
    "content",
    "footnote",
    "reference",
    "reference_content",
    "aside_text",
    "algorithm",
}

TABLE_LABELS: set[str] = {"table"}

IMAGE_LABELS: set[str] = {
    "image",
    "chart",
}

FORMULA_LABELS: set[str] = {
    "formula",
    "display_formula",
    "inline_formula",
}

# Labels to ignore (not useful for RAG)
IGNORE_LABELS: set[str] = {
    "footer",
    "footer_image",
    "formula_number",
    "figure_title",
    "figure_table_chart_title",
    "header",
    "header_image",
    "number",
    "seal",
    "vision_footnote",
}

@Bobholamovic Thanks for your review.
Could you help re-check these keys? I couldn't find the official docs for them, so I got them from https://github.com/PaddlePaddle/PaddleX/blob/e0c509eef1b333e3a57545b04a47f7f701fadfb1/paddlex/configs/pipelines/PaddleOCR-VL.yaml


Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (2)

libs/ktem/ktem/reasoning/react.py:290

  • This code mutates the MCP server config stored in mcp_manager by doing config.pop("enabled_tools", None). Because entry["config"] is a shared dict from mcp_manager.info(), subsequent uses can lose enabled_tools and change tool filtering behavior. Use a shallow copy of the config (or read enabled_tools without popping) before passing it into tool construction.
                entry = mcp_manager.get(server_name)
                if entry:
                    config = entry["config"]
                    enabled_tools = config.pop("enabled_tools", None)
                    mcp_tools = create_tools_from_config(config, enabled_tools)

libs/ktem/ktem/reasoning/rewoo.py:416

  • This code mutates the MCP server config stored in mcp_manager by doing config.pop("enabled_tools", None). Because entry["config"] is a shared dict from mcp_manager.info(), subsequent uses can lose enabled_tools and change tool filtering behavior. Use a shallow copy of the config (or read enabled_tools without popping) before passing it into tool construction.
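The shallow-copy fix suggested in both comments might look like the sketch below. `create_tools_from_config` here is a hypothetical stand-in with an assumed signature and made-up tool names, not the real helper from the codebase; `build_tools` and `shared_entry` are likewise illustrative.

```python
from typing import Any, Optional


def create_tools_from_config(
    config: dict[str, Any], enabled_tools: Optional[list[str]]
) -> list[str]:
    """Hypothetical stand-in: returns the tool names that would be built."""
    available = ["search", "fetch", "summarize"]  # made-up tool names
    if enabled_tools is None:
        return available
    return [name for name in available if name in enabled_tools]


def build_tools(entry: dict[str, Any]) -> list[str]:
    # Shallow-copy so pop() only affects the local dict, not the shared one.
    config = dict(entry["config"])
    enabled_tools = config.pop("enabled_tools", None)
    return create_tools_from_config(config, enabled_tools)


# Simulates a config dict that is shared across calls.
shared_entry = {"config": {"command": "example-server", "enabled_tools": ["search"]}}
first = build_tools(shared_entry)
second = build_tools(shared_entry)
```

With the copy in place, the second call still sees `enabled_tools`, so the tool filter stays the same across calls.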

  # check if regen mode is active
  if chat_input_text:
-     chat_history = chat_history + [(chat_input_text, None)]
+     chat_history = chat_history + [(display_chat_input_text, None)]
Copilot AI Apr 3, 2026

display_chat_input_text (with bold markdown mentions) is being appended into chat_history, but chat_fn() later uses chat_history[-1][0] as the actual chat_input sent to the reasoning pipeline. This means the pipeline receives the formatted string (including **...**) and also reintroduces mention tags that were meant to be stripped before inference. Consider keeping chat_history as the cleaned, model-facing text (after removing mentions/URLs), and apply mention formatting only at render time (or keep a separate display-only history).

Suggested change
- chat_history = chat_history + [(display_chat_input_text, None)]
+ chat_history = chat_history + [(chat_input_text, None)]
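The "separate display-only history" alternative could be sketched as below. The mention pattern is copied from the PR's `format_mentions_for_display` snippet; `strip_mentions`, `format_for_display`, and `submit` are hypothetical names, and the real chat handler likely carries more state than this.

```python
import re

# Pattern as it appears in the PR's format_mentions_for_display snippet.
MENTION_PATTERN = r'(?:(?<=\s)|^)@(?:"[^"]+"|WebSearch)'

History = list[tuple[str, None]]


def strip_mentions(text: str) -> str:
    """Model-facing text: drop mentions and collapse leftover whitespace."""
    return " ".join(re.sub(MENTION_PATTERN, "", text).split())


def format_for_display(text: str) -> str:
    """Display text: bold each mention (simplified; no normalization)."""
    return re.sub(MENTION_PATTERN, lambda m: f"**{m.group(0)}**", text)


def submit(raw: str, model_history: History, display_history: History):
    # Keep the two histories separate so formatting never reaches the pipeline.
    model_history = model_history + [(strip_mentions(raw), None)]
    display_history = display_history + [(format_for_display(raw), None)]
    return model_history, display_history


model, display = submit('@"report.pdf" summarize this', [], [])
model, display = submit("find news @WebSearch", model, display)
```

Here `model[-1][0]` is what a `chat_fn()`-style consumer would read, while `display` is only ever rendered.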

Comment on lines +11 to +14
def format_mentions_for_display(input_str: str) -> str:
    """Normalize and bold @ mentions for chat display."""
    mention_pattern = r'(?:(?<=\s)|^)@(?:"[^"]+"|WebSearch)'

Copilot AI Apr 3, 2026

WebSearch is hardcoded in both mention regexes. Since the command string is defined in ktem.utils.commands.WEB_SEARCH_COMMAND, duplicating it here risks the parser drifting from the actual command value. Consider building the regex from the constant (or accepting a generic @<command> pattern and validating against allowed commands separately).
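One way to build the regex from the constant is sketched below. The constant is redefined locally so the snippet runs on its own (its value is assumed from the PR summary), and `build_mention_pattern` is a hypothetical helper name.

```python
import re

# Assumed to mirror ktem.utils.commands.WEB_SEARCH_COMMAND per the PR summary.
WEB_SEARCH_COMMAND = "WebSearch"


def build_mention_pattern(command: str = WEB_SEARCH_COMMAND) -> re.Pattern:
    # re.escape guards against command names containing regex metacharacters.
    return re.compile(rf'(?:(?<=\s)|^)@(?:"[^"]+"|{re.escape(command)})')


pattern = build_mention_pattern()
found = pattern.findall('ask @WebSearch about @"a b.pdf"')
not_a_mention = pattern.findall("mail me@WebSearch")
renamed = build_mention_pattern("web").findall("@web hi")
```

If the command is renamed again, only the constant changes and the parser follows it.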

Comment on lines +18 to +23
        mention = _normalize_mention(raw_mention)
        if not mention:
            return raw_match
        return f"**@{mention}**"

    return re.sub(mention_pattern, _replace, input_str)
Copilot AI Apr 3, 2026

format_mentions_for_display() injects markdown (**@{mention}**) without escaping the mention text. If a file name contains markdown metacharacters (e.g. *, _, backticks), it can break formatting and potentially affect rendering. Consider escaping markdown in mention before wrapping it, or rendering mentions using a safer mechanism than raw markdown concatenation.
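A minimal sketch of escaping the mention before bolding it follows. It assumes the chat renderer honors CommonMark-style backslash escapes, which is not confirmed in this thread; the character set and both helper names are illustrative.

```python
import re

# Inline-markup characters to neutralize; an assumed, non-exhaustive set.
_MD_SPECIAL = re.compile(r"([\\`*_\[\]<>~])")


def escape_markdown(text: str) -> str:
    """Prefix each special character with a backslash."""
    return _MD_SPECIAL.sub(r"\\\1", text)


def bold_mention(mention: str) -> str:
    # Escape first, then wrap, so the wrapper's own ** stays intact.
    return f"**@{escape_markdown(mention)}**"


escaped = escape_markdown("a*b_c")
bolded = bold_mention("my_file*.pdf")
untouched = escape_markdown("plain.pdf")
```

Escaping before wrapping keeps a file name such as `my_file*.pdf` from opening or closing emphasis inside the bold span.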

Comment on lines +78 to 79
input_box.kotaTribute = tribute;
tribute.detach(input_box);
Copilot AI Apr 3, 2026

input_box.kotaTribute is overwritten before detaching the previous Tribute instance. As written, tribute.detach(input_box) is called on the new instance, which won’t remove handlers/DOM artifacts from any previously attached instance and can lead to duplicate menus / leaks across updates. Detach input_box.kotaTribute (if present) before assigning the new instance.

Suggested change
- input_box.kotaTribute = tribute;
- tribute.detach(input_box);
+ var previousTribute = input_box.kotaTribute;
+ if (previousTribute) {
+   previousTribute.detach(input_box);
+ }
+ input_box.kotaTribute = tribute;

Comment on lines +176 to +186
elif label in self.table_labels:
    table_content = self._clean_table_html(content)
    tables.append(
        Document(
            text=table_content,
            metadata={
                "type": "table",
                "table_origin": table_content,
                **base_metadata,
            },
        )
Copilot AI Apr 3, 2026

table_content is stored and later rendered as raw HTML (via Markdown->HTML passthrough and gr.HTML) without sanitization. Since OCR output can contain arbitrary text (including HTML/script tags), this can become an XSS vector when viewing chunks/evidence in the UI. Consider sanitizing the HTML (allowlist tags/attrs) or converting tables to a safe format (e.g. markdown with escaped cell text) before storing/rendering.
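An allowlist sanitizer along these lines could be sketched with only the standard library. The allowed tags and attributes are assumptions about what a table chunk needs, and `TableSanitizer` and `sanitize_table_html` are hypothetical names; a maintained sanitizer library would likely be a sturdier choice in production.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlist: table structure only, plus numeric span attributes.
ALLOWED_TAGS = {"table", "thead", "tbody", "tr", "th", "td"}
ALLOWED_ATTRS = {"colspan", "rowspan"}


class TableSanitizer(HTMLParser):
    """Keep table structure, escape text, drop every other tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            # Only digit-valued span attributes survive.
            kept = "".join(
                f' {name}="{value}"'
                for name, value in attrs
                if name in ALLOWED_ATTRS and (value or "").isdigit()
            )
            self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(escape(data))  # re-escape decoded text


def sanitize_table_html(raw: str) -> str:
    parser = TableSanitizer()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


dirty = '<table><tr><td onclick="x()">a<script>alert(1)</script><b>b</b></td></tr></table>'
clean = sanitize_table_html(dirty)
spanned = sanitize_table_html('<td colspan="2" style="x">5 &amp; 6</td>')
```

Sanitizing once before the value is stored in both `text` and `table_origin` would cover every later rendering path.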

3 participants