Pull request overview
This PR adds PaddleOCR (PPStructureV3 and Vision-Language) as document loaders in the kotaemon library and enhances the chat interface's @mention UX for file tagging and web search. It also includes several ancillary improvements: a chunk type filter in the file index UI, SQL LIKE injection protection, a LightRAG settings inheritance fix, and documentation typo corrections.
Changes:
- Introduces `PPStructureV3Reader` and `PaddleOCRVLReader` document loaders with a shared `PaddleOCRResult` base adapter, integrated into the indexing pipeline as new reader mode options.
- Reworks the chat `@mention` system: `@WebSearch` replaces `@web`, mentions are bolded for display, and the Tribute.js autocomplete is enhanced with lookup/search improvements and delete-key handling.
- Adds chunk type filtering (text/table/image/thumbnail) in the file index UI and SQL LIKE metacharacter escaping, and makes LightRAG's `get_user_settings` inherit parent settings.
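For readers unfamiliar with the SQL LIKE escaping mentioned in the last bullet, here is a minimal sketch of the general technique, not the code from this PR. The `files` table, the `escape_like` and `find_files` helpers, and the choice of backslash as the escape character are all assumptions made for illustration; it uses the standard-library `sqlite3` module only so that it runs on its own.

```python
import sqlite3


def escape_like(term: str, escape_char: str = "\\") -> str:
    """Escape LIKE metacharacters so user input is matched literally."""
    # Escape the escape character first, then the two LIKE wildcards.
    return (
        term.replace(escape_char, escape_char * 2)
        .replace("%", escape_char + "%")
        .replace("_", escape_char + "_")
    )


def find_files(conn: sqlite3.Connection, term: str) -> list[str]:
    """Substring search on a hypothetical `files` table."""
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        # The ESCAPE clause tells the database which character was used above.
        "SELECT name FROM files WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data for the usage example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?)",
    [("report_2024.pdf",), ("report-2024.pdf",), ("100%_done.txt",)],
)
```

Without the escaping step, a search for `report_` would also match `report-2024.pdf`, because `_` is a single-character wildcard in LIKE patterns.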
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/adapter.py` | New base dataclass adapter for PaddleOCR results |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/ppstructure_v3_loader.py` | New PPStructureV3 reader and result adapter |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/paddleocr_vl_loader.py` | New PaddleOCR VL reader and result adapter |
| `libs/kotaemon/kotaemon/loaders/paddleocr_loader/__init__.py` | Package exports for new loaders |
| `libs/kotaemon/kotaemon/loaders/__init__.py` | Top-level loader exports updated |
| `libs/kotaemon/kotaemon/indices/ingests/files.py` | Instantiates paddle readers with configurable device |
| `libs/ktem/ktem/index/file/pipelines.py` | Adds `paddle-struct` and `paddle-vl` reader modes |
| `libs/ktem/ktem/utils/conversation.py` | New `format_mentions_for_display`, updated `get_file_names_regex` |
| `libs/ktem/ktem/utils/commands.py` | Changes `WEB_SEARCH_COMMAND` from "web" to "WebSearch" |
| `libs/ktem/ktem/utils/__init__.py` | Exports `format_mentions_for_display` |
| `libs/ktem/ktem/pages/chat/__init__.py` | Display formatting for mentions, pipeline input stripping, dropdown input fix |
| `libs/ktem/ktem/pages/chat/chat_panel.py` | Updated placeholder text |
| `libs/ktem/ktem/index/file/ui.py` | Enhanced JS autocomplete, chunk type filter, SQL LIKE escaping |
| `libs/ktem/ktem/index/file/graph/lightrag_pipelines.py` | Inherits parent settings, returns them on import error |
| `libs/kotaemon/tests/test_paddleocr_loader.py` | Unit tests for new PaddleOCR loaders |
| `libs/kotaemon/tests/conftest.py` | Adds `skip_when_paddleocr_not_installed` |
| `libs/kotaemon/pyproject.toml` | New optional dependency extras for docling, paddleocr, lightrag |
| `libs/ktem/ktem/assets/md/usage.md` | Typo fix: "Your" → "You" |
| `docs/usage.md` | Same typo fix |
| `README.md` | Same typo fix |
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.
Bobholamovic left a comment
Hi, I've left some comments. Please take a look when you have time.
TEXT_LABELS: set[str] = {
    "text",
    "paragraph_title",
    "doc_title",
    "abstract",
    "content",
    "footnote",
    "reference",
    "reference_content",
    "aside_text",
    "algorithm",
}

TABLE_LABELS: set[str] = {"table"}

IMAGE_LABELS: set[str] = {
    "image",
    "chart",
}

FORMULA_LABELS: set[str] = {
    "formula",
    "display_formula",
    "inline_formula",
}

# Labels to ignore (not useful for RAG)
IGNORE_LABELS: set[str] = {
    "footer",
    "footer_image",
    "formula_number",
    "figure_title",
    "figure_table_chart_title",
    "header",
    "header_image",
    "number",
    "seal",
    "vision_footnote",
}
@Bobholamovic Thanks for your review.
Could you help re-check these keys? I couldn't find the official docs for them, so I got them from https://github.com/PaddlePaddle/PaddleX/blob/e0c509eef1b333e3a57545b04a47f7f701fadfb1/paddlex/configs/pipelines/PaddleOCR-VL.yaml
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
libs/ktem/ktem/reasoning/react.py:290
- This code mutates the MCP server config stored in `mcp_manager` by doing `config.pop("enabled_tools", None)`. Because `entry["config"]` is a shared dict from `mcp_manager.info()`, subsequent uses can lose `enabled_tools` and change tool filtering behavior. Use a shallow copy of the config (or read `enabled_tools` without popping) before passing it into tool construction.
entry = mcp_manager.get(server_name)
if entry:
    config = entry["config"]
    enabled_tools = config.pop("enabled_tools", None)
    mcp_tools = create_tools_from_config(config, enabled_tools)
libs/ktem/ktem/reasoning/rewoo.py:416
- This code mutates the MCP server config stored in `mcp_manager` by doing `config.pop("enabled_tools", None)`. Because `entry["config"]` is a shared dict from `mcp_manager.info()`, subsequent uses can lose `enabled_tools` and change tool filtering behavior. Use a shallow copy of the config (or read `enabled_tools` without popping) before passing it into tool construction.
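The shallow-copy fix suggested for both files could look roughly like the sketch below. The helper name `split_enabled_tools` and the sample config are hypothetical and not part of the PR; the point is only that `pop` runs on a copy, so the dict held by the manager stays intact.

```python
from typing import Any, Optional


def split_enabled_tools(
    config: dict[str, Any],
) -> tuple[dict[str, Any], Optional[list[str]]]:
    """Return (config without enabled_tools, enabled_tools) without mutating the input."""
    local_config = dict(config)  # shallow copy: top-level keys are independent
    enabled_tools = local_config.pop("enabled_tools", None)
    return local_config, enabled_tools


# Hypothetical shared config, standing in for entry["config"].
shared = {"command": "uvx", "enabled_tools": ["search"]}
local_config, enabled_tools = split_enabled_tools(shared)
```

Note that a shallow copy still shares nested values, so this is sufficient only as long as the downstream code does not mutate nested objects in place.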
  # check if regen mode is active
  if chat_input_text:
-     chat_history = chat_history + [(chat_input_text, None)]
+     chat_history = chat_history + [(display_chat_input_text, None)]
display_chat_input_text (with bold markdown mentions) is being appended into chat_history, but chat_fn() later uses chat_history[-1][0] as the actual chat_input sent to the reasoning pipeline. This means the pipeline receives the formatted string (including **...**) and also reintroduces mention tags that were meant to be stripped before inference. Consider keeping chat_history as the cleaned, model-facing text (after removing mentions/URLs), and apply mention formatting only at render time (or keep a separate display-only history).
Suggested change:
-     chat_history = chat_history + [(display_chat_input_text, None)]
+     chat_history = chat_history + [(chat_input_text, None)]
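One way to picture the "separate display-only history" option is the sketch below. The `UserTurn` dataclass and `make_turn` helper are hypothetical names, not part of ktem; only the mention regex is taken from the PR. Each turn carries both the text the pipeline should receive and the text the UI should render, so neither can leak into the other.

```python
import re
from dataclasses import dataclass

# Mention regex as it appears in the PR's format_mentions_for_display.
MENTION_RE = re.compile(r'(?:(?<=\s)|^)@(?:"[^"]+"|WebSearch)')


@dataclass(frozen=True)
class UserTurn:
    model_text: str    # what the reasoning pipeline receives
    display_text: str  # what the chat UI renders


def make_turn(raw_input: str) -> UserTurn:
    """Derive both representations from the raw user input."""
    # Model-facing text: drop mentions, then collapse leftover whitespace.
    model_text = " ".join(MENTION_RE.sub("", raw_input).split())
    # Display text: keep mentions, wrapped in markdown bold.
    display_text = MENTION_RE.sub(lambda m: f"**{m.group(0)}**", raw_input)
    return UserTurn(model_text=model_text, display_text=display_text)


turn = make_turn('@"report.pdf" summarize section 2')
```

With this split, `chat_fn()` would read `model_text` while the chatbot component renders `display_text`; how that maps onto the existing `chat_history` tuples is left open here.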
def format_mentions_for_display(input_str: str) -> str:
    """Normalize and bold @ mentions for chat display."""
    mention_pattern = r'(?:(?<=\s)|^)@(?:"[^"]+"|WebSearch)'
WebSearch is hardcoded in both mention regexes. Since the command string is defined in ktem.utils.commands.WEB_SEARCH_COMMAND, duplicating it here risks the parser drifting from the actual command value. Consider building the regex from the constant (or accepting a generic @<command> pattern and validating against allowed commands separately).
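A sketch of building the regex from the constant, under the assumption that the command is a plain string: `WEB_SEARCH_COMMAND` is redefined locally here as a stand-in for `ktem.utils.commands.WEB_SEARCH_COMMAND`, and `build_mention_pattern` is a hypothetical helper name.

```python
import re

# Stand-in for ktem.utils.commands.WEB_SEARCH_COMMAND (assumed to be a plain string).
WEB_SEARCH_COMMAND = "WebSearch"


def build_mention_pattern(command: str) -> re.Pattern[str]:
    """Compile the mention regex with the command name injected."""
    # re.escape keeps the pattern correct even if the command ever
    # contains regex metacharacters such as "." or "+".
    return re.compile(
        r'(?:(?<=\s)|^)@(?:"[^"]+"|' + re.escape(command) + r")"
    )


MENTION_PATTERN = build_mention_pattern(WEB_SEARCH_COMMAND)
```

If the command is renamed again, only the constant changes and the parser follows automatically.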
        mention = _normalize_mention(raw_mention)
        if not mention:
            return raw_match
        return f"**@{mention}**"

    return re.sub(mention_pattern, _replace, input_str)
format_mentions_for_display() injects markdown (**@{mention}**) without escaping the mention text. If a file name contains markdown metacharacters (e.g. *, _, backticks), it can break formatting and potentially affect rendering. Consider escaping markdown in mention before wrapping it, or rendering mentions using a safer mechanism than raw markdown concatenation.
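A minimal sketch of escaping the mention before wrapping it in bold. The helper names and the exact set of escaped characters are assumptions; which characters a renderer treats as special depends on the markdown implementation in use, so the set below covers only the common inline-formatting triggers.

```python
import re

# Characters that commonly trigger inline markdown formatting (assumed set).
_MD_SPECIAL = re.compile(r"([\\`*_\[\]<>~])")


def escape_markdown(text: str) -> str:
    """Backslash-escape inline markdown metacharacters."""
    return _MD_SPECIAL.sub(r"\\\1", text)


def bold_mention(mention: str) -> str:
    """Wrap an already-normalized mention in bold, escaping its content first."""
    return f"**@{escape_markdown(mention)}**"
```

For example, a file named `my_file*v2*.pdf` would render with its asterisks and underscore shown literally instead of toggling emphasis inside the bold span.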
input_box.kotaTribute = tribute;
tribute.detach(input_box);
input_box.kotaTribute is overwritten before detaching the previous Tribute instance. As written, tribute.detach(input_box) is called on the new instance, which won’t remove handlers/DOM artifacts from any previously attached instance and can lead to duplicate menus / leaks across updates. Detach input_box.kotaTribute (if present) before assigning the new instance.
Suggested change:
- input_box.kotaTribute = tribute;
- tribute.detach(input_box);
+ var previousTribute = input_box.kotaTribute;
+ if (previousTribute) {
+   previousTribute.detach(input_box);
+ }
+ input_box.kotaTribute = tribute;
elif label in self.table_labels:
    table_content = self._clean_table_html(content)
    tables.append(
        Document(
            text=table_content,
            metadata={
                "type": "table",
                "table_origin": table_content,
                **base_metadata,
            },
        )
table_content is stored and later rendered as raw HTML (via Markdown->HTML passthrough and gr.HTML) without sanitization. Since OCR output can contain arbitrary text (including HTML/script tags), this can become an XSS vector when viewing chunks/evidence in the UI. Consider sanitizing the HTML (allowlist tags/attrs) or converting tables to a safe format (e.g. markdown with escaped cell text) before storing/rendering.
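To illustrate the allowlist idea, here is a rough sketch using only the standard library. The tag and attribute allowlists and the `sanitize_table_html` name are assumptions, and this is not what `_clean_table_html` currently does; a maintained sanitizer library would likely be a safer choice in production.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists: just enough to keep table structure.
ALLOWED_TAGS = {"table", "thead", "tbody", "tr", "th", "td"}
ALLOWED_ATTRS = {"colspan", "rowspan"}


class _TableSanitizer(HTMLParser):
    """Re-emit only allowlisted tags/attributes; escape all text content."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself (script, img, a, ...)
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            # Only numeric span attributes survive; event handlers are dropped.
            if name in ALLOWED_ATTRS and value is not None and value.isdigit()
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text, including the body of dropped tags, is emitted as escaped text.
        self.parts.append(escape(data))


def sanitize_table_html(raw_html: str) -> str:
    parser = _TableSanitizer()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.parts)
```

Applying such a step before storing `table_content` (or right before rendering) would keep the table layout while turning any injected markup into inert text.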
Description
Introduces support for PaddleOCR Vision-Language (VL) and PP-Structure v3 document loaders in the `kotaemon` library. PaddleOCR-based document parsing enables OCR and layout-aware extraction of text, tables, and figures from PDFs and images. We've also updated the Dockerfile to support GPU-based PaddleOCR.