feat: integrate PaddleOCR as document loaders + enhance chat/index UX by cin-niko · Pull Request #814 · Cinnamon/kotaemon

cin-niko · 2026-03-07T06:02:50Z

Description

Adds PaddleOCR Vision-Language (VL) and PP-StructureV3 document loaders to the kotaemon library. These loaders run OCR and layout analysis on PDFs and images and extract text, tables, and figures from them.
We've updated the Dockerfile to support GPU-based PaddleOCR.

Type of change

New features (non-breaking change).
Bug fix (non-breaking change).
Breaking change (fix or feature that would cause existing functionality not to work as expected).

Checklist

I have performed a self-review of my code.
I have added thorough tests if it is a core feature.
There is a reference to the original bug report and related work.
I have commented on my code, particularly in hard-to-understand areas.
The feature is well documented.

Copilot

Pull request overview

This PR adds PaddleOCR (PPStructureV3 and Vision-Language) as document loaders in the kotaemon library and enhances the chat interface's @mention UX for file tagging and web search. It also includes several ancillary improvements: a chunk type filter in the file index UI, SQL LIKE injection protection, a LightRAG settings inheritance fix, and documentation typo corrections.

Changes:

Introduces PPStructureV3Reader and PaddleOCRVLReader document loaders with a shared PaddleOCRResult base adapter, integrated into the indexing pipeline as new reader mode options.
Reworks the chat @mention system: @WebSearch replaces @web, mentions are bolded for display, and the Tribute.js autocomplete is enhanced with lookup/search improvements and delete-key handling.
Adds chunk type filtering (text/table/image/thumbnail) in the file index UI, SQL LIKE metacharacter escaping, and inherits parent settings in LightRAG's get_user_settings.
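
The SQL LIKE metacharacter escaping mentioned above can be sketched as follows. This is a minimal illustration, not the code from this PR: `escape_like`, `find_names`, and the `files` table are hypothetical names, and SQLite is assumed only to keep the example self-contained.

```python
import sqlite3


def escape_like(term: str, escape_char: str = "\\") -> str:
    """Escape LIKE wildcards so user input is matched literally."""
    # Escape the escape character first, then the two LIKE wildcards.
    return (
        term.replace(escape_char, escape_char * 2)
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )


def find_names(conn: sqlite3.Connection, term: str) -> list:
    """Return file names containing `term` as a literal substring."""
    pattern = f"%{escape_like(term)}%"
    # The ESCAPE clause tells the database which character marks a literal
    # wildcard; the term itself is still passed as a bound parameter.
    rows = conn.execute(
        "SELECT name FROM files WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]
```

Without the escaping step, a search term such as `100%` would behave as a wildcard and match names that merely start with `100`.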

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/adapter.py` | New base dataclass adapter for PaddleOCR results |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/ppstructure_v3_loader.py` | New PPStructureV3 reader and result adapter |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/paddleocr_vl_loader.py` | New PaddleOCR VL reader and result adapter |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/__init__.py` | Package exports for new loaders |
| `libs/kotaemon/kotaemon/loaders/__init__.py` | Top-level loader exports updated |
| `libs/kotaemon/kotaemon/indices/ingests/files.py` | Instantiates paddle readers with configurable device |
| `libs/ktem/ktem/index/file/pipelines.py` | Adds paddle-struct and paddle-vl reader modes |
| `libs/ktem/ktem/utils/conversation.py` | New `format_mentions_for_display`, updated `get_file_names_regex` |
| `libs/ktem/ktem/utils/commands.py` | Changes `WEB_SEARCH_COMMAND` from `"web"` to `"WebSearch"` |
| `libs/ktem/ktem/utils/__init__.py` | Exports `format_mentions_for_display` |
| `libs/ktem/ktem/pages/chat/__init__.py` | Display formatting for mentions, pipeline input stripping, dropdown input fix |
| `libs/ktem/ktem/pages/chat/chat_panel.py` | Updated placeholder text |
| `libs/ktem/ktem/index/file/ui.py` | Enhanced JS autocomplete, chunk type filter, SQL LIKE escaping |
| `libs/ktem/ktem/index/file/graph/lightrag_pipelines.py` | Inherits parent settings, returns them on import error |
| `libs/kotaemon/tests/test_paddleocr_loader.py` | Unit tests for new PaddleOCR loaders |
| `libs/kotaemon/tests/conftest.py` | Adds `skip_when_paddleocr_not_installed` |
| `libs/kotaemon/pyproject.toml` | New optional dependency extras for docling, paddleocr, lightrag |
| `libs/ktem/ktem/assets/md/usage.md` | Typo fix: "Your" → "You" |
| `docs/usage.md` | Same typo fix |
| `README.md` | Same typo fix |

Copilot

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Bobholamovic

Hi, I've left some comments. Please take a look when you have time.

cin-niko · 2026-03-10T13:26:34Z

+TEXT_LABELS: set[str] = {
+    "text",
+    "paragraph_title",
+    "doc_title",
+    "abstract",
+    "content",
+    "footnote",
+    "reference",
+    "reference_content",
+    "aside_text",
+    "algorithm",
+}
+
+TABLE_LABELS: set[str] = {"table"}
+
+IMAGE_LABELS: set[str] = {
+    "image",
+    "chart",
+}
+
+FORMULA_LABELS: set[str] = {
+    "formula",
+    "display_formula",
+    "inline_formula",
+}
+
+# Labels to ignore (not useful for RAG)
+IGNORE_LABELS: set[str] = {
+    "footer",
+    "footer_image",
+    "formula_number",
+    "figure_title",
+    "figure_table_chart_title",
+    "header",
+    "header_image",
+    "number",
+    "seal",
+    "vision_footnote",
+}


@Bobholamovic Thanks for your review.
Could you help me double-check these keys? I couldn't find official documentation for them, so I took them from https://github.com/PaddlePaddle/PaddleX/blob/e0c509eef1b333e3a57545b04a47f7f701fadfb1/paddlex/configs/pipelines/PaddleOCR-VL.yaml
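
For readers following the label discussion, here is a rough sketch of how label sets like the ones in the diff could route a layout block to a chunk type. It is not the adapter code from this PR: `classify_label` is a hypothetical helper, the sets are abbreviated copies of the ones quoted above, and returning `"unknown"` for unlisted labels is an assumption made for the example.

```python
from typing import Optional

# Abbreviated copies of the label sets from the diff above.
TEXT_LABELS = {"text", "paragraph_title", "doc_title", "abstract"}
TABLE_LABELS = {"table"}
IMAGE_LABELS = {"image", "chart"}
FORMULA_LABELS = {"formula", "display_formula", "inline_formula"}
IGNORE_LABELS = {"header", "footer", "number", "seal"}

# Checked in order; the first set containing the label wins.
_CATEGORIES = (
    ("text", TEXT_LABELS),
    ("table", TABLE_LABELS),
    ("image", IMAGE_LABELS),
    ("formula", FORMULA_LABELS),
)


def classify_label(label: str) -> Optional[str]:
    """Map a layout label to a chunk type, or None if it should be skipped."""
    if label in IGNORE_LABELS:
        return None
    for chunk_type, labels in _CATEGORIES:
        if label in labels:
            return chunk_type
    # Assumption: surface unlisted labels instead of silently dropping them,
    # so that a renamed or new upstream label is noticed.
    return "unknown"
```

Keeping an explicit `"unknown"` result makes it easier to spot keys that differ from what the pipeline actually emits, which is the concern raised in the comment above.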

Copilot

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (2)

libs/ktem/ktem/reasoning/react.py:290

This code mutates the MCP server config stored in mcp_manager by doing config.pop("enabled_tools", None). Because entry["config"] is a shared dict from mcp_manager.info(), subsequent uses can lose enabled_tools and change tool filtering behavior. Use a shallow copy of the config (or read enabled_tools without popping) before passing it into tool construction.

                entry = mcp_manager.get(server_name)
                if entry:
                    config = entry["config"]
                    enabled_tools = config.pop("enabled_tools", None)
                    mcp_tools = create_tools_from_config(config, enabled_tools)

libs/ktem/ktem/reasoning/rewoo.py:416

This code mutates the MCP server config stored in mcp_manager by doing config.pop("enabled_tools", None). Because entry["config"] is a shared dict from mcp_manager.info(), subsequent uses can lose enabled_tools and change tool filtering behavior. Use a shallow copy of the config (or read enabled_tools without popping) before passing it into tool construction.
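
The shallow-copy fix suggested in these two comments might look like the sketch below. `split_enabled_tools` is a hypothetical helper name, and the config is assumed to be a plain `dict`, as the snippet above suggests.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_enabled_tools(
    config: Dict[str, Any],
) -> Tuple[Dict[str, Any], Optional[List[str]]]:
    """Return (config without 'enabled_tools', enabled_tools).

    The input dict is left untouched, so a config shared with a manager
    object keeps its 'enabled_tools' entry for later callers.
    """
    tool_config = dict(config)  # shallow copy; only the top level is popped
    enabled_tools = tool_config.pop("enabled_tools", None)
    return tool_config, enabled_tools
```

A caller would then pass `tool_config` and `enabled_tools` to the tool constructor instead of popping from the shared entry.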

Copilot · 2026-04-03T11:38:43Z

        # check if regen mode is active
        if chat_input_text:
-            chat_history = chat_history + [(chat_input_text, None)]
+            chat_history = chat_history + [(display_chat_input_text, None)]


display_chat_input_text (with bold markdown mentions) is being appended into chat_history, but chat_fn() later uses chat_history[-1][0] as the actual chat_input sent to the reasoning pipeline. This means the pipeline receives the formatted string (including **...**) and also reintroduces mention tags that were meant to be stripped before inference. Consider keeping chat_history as the cleaned, model-facing text (after removing mentions/URLs), and apply mention formatting only at render time (or keep a separate display-only history).

Suggested change

-            chat_history = chat_history + [(display_chat_input_text, None)]
+            chat_history = chat_history + [(chat_input_text, None)]
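
One way to realize the "separate display-only history" idea from this comment is to store both forms of each turn. The sketch below is an illustration under assumptions, not the PR's code: `Turn`, `make_turn`, `strip_mentions`, and `bold_mentions` are hypothetical names, and the mention regex is copied from the `format_mentions_for_display` snippet in this thread.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Same mention shape as in the PR snippet: @"file name" or @WebSearch.
MENTION_PATTERN = re.compile(r'(?:(?<=\s)|^)@(?:"[^"]+"|WebSearch)')


@dataclass
class Turn:
    model_text: str  # what the reasoning pipeline receives
    display_text: str  # what the chat UI renders
    answer: Optional[str] = None


def strip_mentions(text: str) -> str:
    """Remove mention tags and collapse the leftover whitespace."""
    return " ".join(MENTION_PATTERN.sub("", text).split())


def bold_mentions(text: str) -> str:
    """Wrap each mention in markdown bold for display only."""
    return MENTION_PATTERN.sub(lambda m: f"**{m.group(0)}**", text)


def make_turn(raw_input: str) -> Turn:
    """Build one history entry that keeps model and display text apart."""
    return Turn(
        model_text=strip_mentions(raw_input),
        display_text=bold_mentions(raw_input),
    )
```

With this split, the pipeline reads `model_text` and never sees `**...**` markup, while the renderer reads `display_text`.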

Copilot · 2026-04-03T11:38:44Z

+def format_mentions_for_display(input_str: str) -> str:
+    """Normalize and bold @ mentions for chat display."""
+    mention_pattern = r'(?:(?<=\s)|^)@(?:"[^"]+"|WebSearch)'
+


WebSearch is hardcoded in both mention regexes. Since the command string is defined in ktem.utils.commands.WEB_SEARCH_COMMAND, duplicating it here risks the parser drifting from the actual command value. Consider building the regex from the constant (or accepting a generic @<command> pattern and validating against allowed commands separately).
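
Building the regex from the constant, as suggested here, could be sketched like this. `build_mention_pattern` is a hypothetical helper, and the local `WEB_SEARCH_COMMAND` stands in for the constant in `ktem.utils.commands` so that the example runs on its own.

```python
import re
from typing import Iterable, Pattern

# Stand-in for ktem.utils.commands.WEB_SEARCH_COMMAND (assumed value).
WEB_SEARCH_COMMAND = "WebSearch"


def build_mention_pattern(commands: Iterable[str]) -> Pattern[str]:
    """Compile a mention regex for quoted file names plus the given commands."""
    # re.escape keeps a command containing regex metacharacters literal.
    alternatives = "|".join(re.escape(command) for command in commands)
    return re.compile(rf'(?:(?<=\s)|^)@(?:"[^"]+"|(?:{alternatives}))')


MENTION_PATTERN = build_mention_pattern([WEB_SEARCH_COMMAND])
```

If the command string changes again, as it did from `"web"` to `"WebSearch"` in this PR, the pattern follows the constant without a second edit.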

Copilot · 2026-04-03T11:38:44Z

+        mention = _normalize_mention(raw_mention)
+        if not mention:
+            return raw_match
+        return f"**@{mention}**"
+
+    return re.sub(mention_pattern, _replace, input_str)


format_mentions_for_display() injects markdown (**@{mention}**) without escaping the mention text. If a file name contains markdown metacharacters (e.g. *, _, backticks), it can break formatting and potentially affect rendering. Consider escaping markdown in mention before wrapping it, or rendering mentions using a safer mechanism than raw markdown concatenation.
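
Escaping the mention before wrapping it in bold could look roughly like the sketch below. `escape_markdown` and `bold_mention` are hypothetical helpers, and the set of escaped characters is an assumption that covers common inline markers rather than every markdown construct.

```python
import re

# Inline markdown markers that could break out of the surrounding bold span.
_MARKDOWN_SPECIALS = re.compile(r"([\\`*_\[\]~])")


def escape_markdown(text: str) -> str:
    """Backslash-escape inline markdown metacharacters in `text`."""
    return _MARKDOWN_SPECIALS.sub(r"\\\1", text)


def bold_mention(mention: str) -> str:
    """Render a mention in bold with its own markdown characters neutralized."""
    return f"**@{escape_markdown(mention)}**"
```

A file named `my_*file*.pdf` then renders with its literal underscore and asterisks instead of toggling emphasis inside the mention.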

Copilot · 2026-04-03T11:38:44Z

+    input_box.kotaTribute = tribute;
    tribute.detach(input_box);


input_box.kotaTribute is overwritten before detaching the previous Tribute instance. As written, tribute.detach(input_box) is called on the new instance, which won’t remove handlers/DOM artifacts from any previously attached instance and can lead to duplicate menus / leaks across updates. Detach input_box.kotaTribute (if present) before assigning the new instance.

Suggested change

-    input_box.kotaTribute = tribute;
-    tribute.detach(input_box);
+    var previousTribute = input_box.kotaTribute;
+    if (previousTribute) {
+      previousTribute.detach(input_box);
+    }
+    input_box.kotaTribute = tribute;

Copilot · 2026-04-03T11:38:45Z

+            elif label in self.table_labels:
+                table_content = self._clean_table_html(content)
+                tables.append(
+                    Document(
+                        text=table_content,
+                        metadata={
+                            "type": "table",
+                            "table_origin": table_content,
+                            **base_metadata,
+                        },
+                    )


table_content is stored and later rendered as raw HTML (via Markdown->HTML passthrough and gr.HTML) without sanitization. Since OCR output can contain arbitrary text (including HTML/script tags), this can become an XSS vector when viewing chunks/evidence in the UI. Consider sanitizing the HTML (allowlist tags/attrs) or converting tables to a safe format (e.g. markdown with escaped cell text) before storing/rendering.
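
An allowlist sanitizer of the kind this comment proposes can be sketched with the standard library alone. This is an illustrative sketch, not the PR's `_clean_table_html`: `sanitize_table_html` is a hypothetical function, and the allowed tags and attributes are assumptions; a maintained sanitizer library may be a better fit in production.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"table", "thead", "tbody", "tr", "th", "td"}
ALLOWED_ATTRS = {"colspan", "rowspan"}  # kept only when purely numeric
DROP_CONTENT_TAGS = {"script", "style"}  # tag and its content are discarded


class _TableSanitizer(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are dropped, their text is kept
        kept = "".join(
            f' {name}="{value}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS and value is not None and value.isdigit()
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth and tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(escape(data))  # re-escape decoded cell text


def sanitize_table_html(raw_html: str) -> str:
    """Keep table structure only; drop scripts, handlers, and other tags."""
    parser = _TableSanitizer()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)
```

Applying a function like this before the table text is stored would mean both `text` and `table_origin` carry only structural table markup.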

cin-niko changed the title ~~feat: integrate PaddleOCR features + enhance chat/index UX~~ feat: integrate PaddleOCR document loaders + enhance chat/index UX Mar 7, 2026

cin-niko changed the title ~~feat: integrate PaddleOCR document loaders + enhance chat/index UX~~ feat: integrate PaddleOCR as document loaders + enhance chat/index UX Mar 7, 2026

cin-niko requested a review from Copilot March 7, 2026 06:04

Copilot started reviewing on behalf of cin-niko March 7, 2026 06:05

Copilot AI reviewed Mar 7, 2026

cin-niko requested a review from Copilot March 8, 2026 05:54

Copilot started reviewing on behalf of cin-niko March 8, 2026 05:55

cin-niko added the enhancement New feature or request label Mar 8, 2026

Copilot AI reviewed Mar 8, 2026

Comment thread libs/ktem/ktem/pages/chat/__init__.py

Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/adapter.py Outdated

Comment thread libs/ktem/ktem/index/file/ui.py Outdated

cin-niko requested a review from Copilot March 8, 2026 06:10

Copilot started reviewing on behalf of cin-niko March 8, 2026 06:10

Copilot AI reviewed Mar 8, 2026

Comment thread libs/ktem/ktem/index/file/ui.py

Bobholamovic reviewed Mar 10, 2026

Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/paddleocr_vl_loader.py Outdated

Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/paddleocr_vl_loader.py

Comment thread Dockerfile Outdated

Bobholamovic reviewed Mar 10, 2026

Comment thread libs/kotaemon/kotaemon/loaders/paddleocr_loader/ppstructure_v3_loader.py Outdated

cin-niko commented Mar 10, 2026

cin-niko added 15 commits April 3, 2026 11:28

docs: typo

2c6bd0c

fix: enhance mention @ (file, websearch) in chat

1b514a5

fix: shorter placeholder guide in chat

0735a54

fix: enhance UI (file filter, chunks preview, lightrag reader settings)

224b4e5

feat: feat: integrate paddleocr (PaddleOCR-VL, PP-StructureV3)

4de32ca

fix: import

cca9acc

fix: support both urls and mention files when chatting

d2f77e1

feat: restore the test connection panel in resources

af4cf3f

fix: paddleocr output adapter

dc745f4

tests: update

6776d24

fix: parse image block content

71f2c88

fix: docker gpu stage for paddleocr

0b3f733

tests: update unit tests

6832248

tests: add skip_when_paddleocr_not_installed

8907a41

fix: reset chunk preview filter when re-selecting file

1b9cb14

cin-niko and others added 5 commits April 3, 2026 11:28

fix: paddleocr[all] -> paddleocr[doc-parser]

4849c96

fix: allow configure paddleocr device with env

0e88320

feat: parameterize CUDA version in Dockerfile for paddlepaddle installation

cd43ec7

docs: add integration guides for PaddleOCR and Docling

aaac893

fix: updating mcp server must reflect in the tool choice UI

598f1e1

cin-niko force-pushed the feat/dev branch from f6bd92f to 598f1e1 Compare April 3, 2026 11:28

cin-niko requested review from cin-albert, Copilot and phv2312 April 3, 2026 11:28

Copilot started reviewing on behalf of cin-niko April 3, 2026 11:29

Copilot AI reviewed Apr 3, 2026