- Add context mode switching (web_browsing/local_file_processing) after PDF download
- Improve pdf_extract_images to return relative paths and better error messages
- Enhance ocr_image_to_text with smart path searching in common directories
- Update system prompt to guide VLLM on context modes and image extraction strategies
- Fix issue where VLLM couldn't find images when only a filename was provided
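The path handling named in this list could look roughly like the sketch below. It is not the code from this pull request: `resolve_image_path`, `to_relative`, and the `base_dir` parameter are hypothetical names, and the search order is an assumption. Only the subdirectory names (extracted_images/, images/, output_images/) and the idea of paths relative to artifacts/ come from the update notes in this conversation.

```python
from pathlib import Path
from typing import Optional, Sequence

# Subdirectory names are taken from the update notes; the order is an assumption.
COMMON_IMAGE_DIRS = ("extracted_images", "images", "output_images")


def resolve_image_path(
    image_path: str,
    base_dir: Path,
    subdirs: Sequence[str] = COMMON_IMAGE_DIRS,
) -> Optional[Path]:
    """Return an existing file for image_path, or None if nothing matches."""
    candidate = Path(image_path)
    if not candidate.is_absolute():
        candidate = base_dir / candidate
    if candidate.is_file():
        return candidate
    # Fall back to the bare file name inside each common subdirectory.
    for sub in subdirs:
        fallback = base_dir / sub / Path(image_path).name
        if fallback.is_file():
            return fallback
    return None


def to_relative(path: Path, base_dir: Path) -> str:
    """Express path relative to base_dir (hypothetical helper for the extractor's return value)."""
    return path.resolve().relative_to(base_dir.resolve()).as_posix()
```

With this shape, a bare file name such as page_4_img_1.png and the relative form extracted_images/page_4_img_1.png resolve to the same file, which is the failure case the last bullet describes.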
"Find the paper Attention is all you need, summarize the content, save all the images in the paper, then interpret the first image by explaining the process it shows"
3ab1d42 to fb3762d
*Requires installing the Tesseract offline OCR tool; see https://blog.csdn.net/showgea/article/details/82656515. Demo command (it exercises every tool; enter it in separate turns within the same session):
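Update notes
Main changes
1. Add context-aware PDF handling
Background:
Previously the system had a key problem: when the VLLM needed to process a locally downloaded PDF file (for example, to extract text or images), it still based its decisions on web page screenshots and DOM information, so it could not tell that the context had switched from "web browsing" to "local file processing".
Solution:
The Agent class gains a context mode tracking mechanism that distinguishes two working modes:
- web_browsing: the normal web browsing mode, which acts on screenshots and the DOM
- local_file_processing: the local file processing mode, which handles downloaded PDF files
- … local_file_processing mode
- … (pdf_extract_text, pdf_extract_images, ocr_image_to_text)
2. Improve OCR image path handling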
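Problem:
When calling the OCR tool, the VLLM supplied only a file name (e.g. "page_4_img_1.png"), but the image was actually saved in a subdirectory (e.g. extracted_images/page_4_img_1.png), so the file could not be found.
Solution:
- pdf_extract_images function: returns paths relative to the artifacts/ directory instead of absolute paths
- ocr_image_to_text function: if the file is not found at the given path, it automatically searches these common subdirectories:
  - extracted_images/
  - images/
  - output_images/
3. Improve error messages for PDF image extraction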
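Problem:
When the VLLM extracted images from page 1 only and page 1 contained none, the tool returned a bare "no images found" error. The VLLM then assumed the whole PDF had no images and abandoned the task.
Solution:
- Error message of pdf_extract_images when the specified page has no images:
  - … (page_num=2, page_num=3)
  - … the page_num parameter to extract images from all pages
4. Remove automatic processing logic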
Change:
- … (the _auto_process_pdf method)
Technical details
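Modified files
vision_llm_web_agent/agent_controller.py
- context_mode and downloaded_pdf_files attributes
- execute_tool method: switches the context mode after a PDF downloads successfully
- execute_round method: adjusts the state information according to the context mode
vision_llm_web_agent/vllm_client.py
- summarize_text method (used to generate text summaries)
- plan_next_action: decides, based on the context mode, whether to provide a screenshot
vision_llm_web_agent/tools/file_operations.py
- pdf_extract_images function: returns relative paths and more detailed error messages
- ocr_image_to_text function: supports smart path searching
Results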
After the change, the system can:
Workflow example
- … local_file_processing mode
- … tools such as pdf_extract_text and pdf_extract_images to process local files
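As a rough illustration of this workflow, the mode switch could be tracked as in the sketch below. It is a minimal sketch under stated assumptions, not the implementation in agent_controller.py: the `AgentContext` class, the `download_pdf` tool name, and the `success` and `file_path` result keys are hypothetical. Only the two mode names and the `context_mode` and `downloaded_pdf_files` attributes come from the notes above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

WEB_BROWSING = "web_browsing"
LOCAL_FILE_PROCESSING = "local_file_processing"


@dataclass
class AgentContext:
    """Hypothetical stand-in for the context-tracking state on the Agent class."""

    context_mode: str = WEB_BROWSING
    downloaded_pdf_files: List[str] = field(default_factory=list)

    def record_tool_result(self, tool_name: str, result: Dict[str, Any]) -> None:
        # Assumption: the download tool is named "download_pdf" and reports
        # "success" and "file_path" keys; the real names may differ.
        if tool_name == "download_pdf" and result.get("success"):
            self.downloaded_pdf_files.append(result["file_path"])
            self.context_mode = LOCAL_FILE_PROCESSING

    def needs_screenshot(self) -> bool:
        # Screenshots and DOM state only matter while browsing the web.
        return self.context_mode == WEB_BROWSING
```

A planner could call `needs_screenshot()` before each round and skip the screenshot once a PDF has been downloaded, so the model reasons about the local file rather than the last web page.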