[GSoC 2026] Project #1 Build a GUI Agent with local LLM/VLM and OpenVINO #34765
Hi Ethan (@openvino-dev-samples) and Zhuo (@zhuo-yoyowz),
My name is Harsh Dinodia. I've recently been working in the openvino.genai repository on the Whisper C-API prerequisite (PR #3513). I really enjoyed the challenge of implementing the explicit API design for word-level timestamps and the quick iteration on the shared-pointer logic.
Now that I’ve got a solid handle on the core C++ bindings, I’d like to open a discussion regarding Project #1 (GUI Agent). I've been analyzing the reference architectures (UI-TARS/MobileAgent) and wanted to share my initial implementation strategy for feedback before I finalize my formal proposal.
**Proposed Technical Direction**
**Unified Memory & Hardware Strategy**
While I am developing locally on an 11th Gen i5 (8 GB RAM), my primary target for the agent is the AIPC Cloud (32 GB RAM / 18 GB unified memory). To suit the Lunar Lake architecture, I intend to use a single multimodal model (Llama-3.2-Vision, INT4) to minimize KV-cache overhead and keep the model entirely resident in NPU memory.
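To make the memory argument concrete, here is a rough back-of-the-envelope KV-cache estimate. The dimensions below are illustrative placeholders, not the actual Llama-3.2-Vision configuration; the point is only that a single decoder's cache at FP16 fits comfortably inside an 18 GB unified-memory budget, whereas two separate models would each pay this cost:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate decoder KV-cache size: K and V tensors per layer,
    each seq_len x num_kv_heads x head_dim elements (FP16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed, illustrative dimensions (NOT the official model config):
cache = kv_cache_bytes(num_layers=40, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{cache / 1024**3:.2f} GiB")  # prints 1.25 GiB
```

With grouped-query attention (few KV heads) the cache stays around a gigabyte even at an 8k context, which is why keeping a single resident model looks feasible on this target.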
**Low-Latency Perception Pipeline**
To avoid the typical UI freeze during screen capture, I plan a C++ backend built on the Windows Desktop Duplication API (DXGI), feeding captured frames directly into the VisionPipeline. I am also looking into running the Set-of-Mark (SoM) preprocessing pass directly on the NPU to identify interactable elements before the main VLM reasoning step.
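The SoM labeling step itself is simple once element detection has produced bounding boxes. As a minimal sketch (Python for brevity; the function name and box format are my own, not an existing API), the mark assignment just needs a deterministic ordering so the numeric labels the VLM refers to stay stable across frames with the same layout:

```python
def assign_marks(boxes):
    """Assign numeric Set-of-Mark labels to detected UI element boxes.

    boxes: list of (x, y, w, h) tuples in screen pixels.
    Returns a dict mapping mark id -> box, ordered top-to-bottom then
    left-to-right, so the same layout always yields the same labels.
    """
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i + 1: box for i, box in enumerate(ordered)}

# Three detected elements: two on the top row, one lower down.
marks = assign_marks([(300, 50, 80, 24), (10, 50, 80, 24), (10, 200, 120, 32)])
# mark 1 is the left-most box on the top row.
```

The mapping produced here is what the prompt builder would embed ("click element [2]"), and the overlay renderer would draw the same ids onto the frame before it reaches the VLM.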
**Hybrid Grounding (SoM + UIA)**
Pure visual grounding can struggle with high-DPI scaling and dynamic UI shifts. I propose a hybrid approach that augments raw pixel perception with Windows UI Automation (UIA) metadata, which should keep the agent robust across complex professional productivity apps.
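The hybrid step can be sketched as an IoU match between visual detections and UIA element rects: when a visual box overlaps a UIA element strongly enough, the agent snaps to the UIA rect (which is DPI-correct and carries a semantic name); otherwise it falls back to raw pixels. The schema below is hypothetical, just to show the merge logic:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) in screen pixels.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ground(visual_boxes, uia_elements, threshold=0.5):
    """Snap each visually detected box to the best-overlapping UIA element
    when the IoU clears the threshold; otherwise keep the raw pixel box."""
    grounded = []
    for vb in visual_boxes:
        best = max(uia_elements, key=lambda e: iou(vb, e["rect"]), default=None)
        if best and iou(vb, best["rect"]) >= threshold:
            grounded.append({"rect": best["rect"], "name": best["name"], "source": "uia"})
        else:
            grounded.append({"rect": vb, "name": None, "source": "visual"})
    return grounded
```

The `source` field also gives the agent a confidence signal: UIA-grounded targets can be clicked directly, while visual-only targets might warrant a verification screenshot after the action.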
**Native Action Bridge**
Building on my recent work with the OpenVINO C bindings, I will implement the execution layer as a native C++ module using the Windows SendInput API. Bypassing Python wrappers such as pyautogui should achieve sub-millisecond latency between the model's decision and the hardware event.
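Before anything reaches SendInput, the model's emitted action needs validation. Here is the contract I have in mind, sketched in Python for readability (the JSON schema and `parse_action` name are assumptions of mine, not an existing interface); the real gatekeeper would sit in the C++ layer so the native input path never sees malformed model output:

```python
import json

ALLOWED = {"click", "double_click", "type", "scroll"}

def parse_action(raw):
    """Validate one model-emitted action (assumed JSON schema) before it is
    forwarded to the native input layer. Rejecting bad output here keeps
    the SendInput path free of error handling for model quirks."""
    action = json.loads(raw)
    kind = action.get("action")
    if kind not in ALLOWED:
        raise ValueError(f"unsupported action: {kind!r}")
    if kind in {"click", "double_click", "scroll"}:
        x, y = action.get("x"), action.get("y")
        if not (isinstance(x, int) and isinstance(y, int) and x >= 0 and y >= 0):
            raise ValueError("coordinates must be non-negative integers")
    return action

parse_action('{"action": "click", "x": 412, "y": 96}')
```

An allow-list of action kinds also doubles as a safety boundary: the agent can only ever emit the input primitives we explicitly chose to support.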
**Questions**
1. For the NPU 4000 target, is there a preferred INT4 quantization profile for the VisionPipeline, or should I explore custom NNCF configurations?
2. Regarding the 350-hour scope: does the team prefer a pure visual-first approach, or is the hybrid UIA/SoM integration considered the desired standard for this project?
3. Is there existing support within the GenAI library for asynchronous tensor inputs from D3D11 surfaces, or should I implement a custom buffer bridge in the C++ layer?