[GSoC 2026] Project #1 Build a GUI Agent with local LLM/VLM and OpenVINO #34765
Hi Ethan (@openvino-dev-samples) and Zhuo (@zhuo-yoyowz),
My name is Harsh Dinodia. I've recently been working in the openvino.genai repository on the Whisper C-API prerequisite (PR #3513). I really enjoyed the challenge of implementing the explicit API design for word-level timestamps and the quick iteration on the shared-pointer logic.
Now that I’ve got a solid handle on the core C++ bindings, I’d like to open a discussion regarding Project #1 (GUI Agent). I've been analyzing the reference architectures (UI-TARS/MobileAgent) and wanted to share my initial implementation strategy for feedback before I finalize my formal proposal.
**Proposed Technical Direction**
**Unified Memory & Hardware Strategy**
While I am developing locally on an 11th Gen i5 (8 GB RAM), my primary target for the agent is the AIPC Cloud (32 GB RAM / 18 GB unified memory). To suit the Lunar Lake architecture, I intend to use a single multimodal model (Llama-3.2-Vision, INT4) to minimize KV-cache overhead and keep the model entirely resident in NPU memory.
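To make the memory argument concrete, here is a rough back-of-the-envelope KV-cache estimate. The dimensions below are illustrative placeholders, not the actual Llama-3.2-Vision configuration; the point is only that a single decoder's cache at FP16 fits comfortably inside an 18 GB unified-memory budget, whereas two separate models would each pay this cost:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate decoder KV-cache size: K and V tensors per layer,
    each seq_len x num_kv_heads x head_dim elements (FP16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed, illustrative dimensions (NOT the official model config):
cache = kv_cache_bytes(num_layers=40, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{cache / 1024**3:.2f} GiB")  # prints 1.25 GiB
```

With grouped-query attention (few KV heads) the cache stays around a gigabyte even at an 8k context, which is why keeping a single resident model looks feasible on this target.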
**Low-Latency Perception Pipeline**
To avoid the typical UI freeze during screen capture, I plan a C++ backend built on the Windows Desktop Duplication API (DXGI), feeding captured frames directly into the VisionPipeline. I am also looking into running the Set-of-Mark (SoM) preprocessing pass directly on the NPU to identify interactable elements before the main VLM reasoning step.
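The SoM labeling step itself is simple once element detection has produced bounding boxes. As a minimal sketch (Python for brevity; the function name and box format are my own, not an existing API), the mark assignment just needs a deterministic ordering so the numeric labels the VLM refers to stay stable across frames with the same layout:

```python
def assign_marks(boxes):
    """Assign numeric Set-of-Mark labels to detected UI element boxes.

    boxes: list of (x, y, w, h) tuples in screen pixels.
    Returns a dict mapping mark id -> box, ordered top-to-bottom then
    left-to-right, so the same layout always yields the same labels.
    """
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i + 1: box for i, box in enumerate(ordered)}

# Three detected elements: two on the top row, one lower down.
marks = assign_marks([(300, 50, 80, 24), (10, 50, 80, 24), (10, 200, 120, 32)])
# mark 1 is the left-most box on the top row.
```

The mapping produced here is what the prompt builder would embed ("click element [2]"), and the overlay renderer would draw the same ids onto the frame before it reaches the VLM.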
**Hybrid Grounding (SoM + UIA)**
Pure visual grounding can struggle with high-DPI scaling and dynamic UI shifts. I propose a hybrid approach that augments raw pixel perception with Windows UI Automation (UIA) metadata, which should keep the agent robust across complex professional productivity apps.
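The hybrid step can be sketched as an IoU match between visual detections and UIA element rects: when a visual box overlaps a UIA element strongly enough, the agent snaps to the UIA rect (which is DPI-correct and carries a semantic name); otherwise it falls back to raw pixels. The schema below is hypothetical, just to show the merge logic:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) in screen pixels.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ground(visual_boxes, uia_elements, threshold=0.5):
    """Snap each visually detected box to the best-overlapping UIA element
    when the IoU clears the threshold; otherwise keep the raw pixel box."""
    grounded = []
    for vb in visual_boxes:
        best = max(uia_elements, key=lambda e: iou(vb, e["rect"]), default=None)
        if best and iou(vb, best["rect"]) >= threshold:
            grounded.append({"rect": best["rect"], "name": best["name"], "source": "uia"})
        else:
            grounded.append({"rect": vb, "name": None, "source": "visual"})
    return grounded
```

The `source` field also gives the agent a confidence signal: UIA-grounded targets can be clicked directly, while visual-only targets might warrant a verification screenshot after the action.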
**Native Action Bridge**
Building on my recent work with the OpenVINO C bindings, I will implement the execution layer as a native C++ module using the Windows SendInput API. Bypassing Python wrappers such as pyautogui should achieve sub-millisecond latency between the model's decision and the hardware event.
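Before anything reaches SendInput, the model's emitted action needs validation. Here is the contract I have in mind, sketched in Python for readability (the JSON schema and `parse_action` name are assumptions of mine, not an existing interface); the real gatekeeper would sit in the C++ layer so the native input path never sees malformed model output:

```python
import json

ALLOWED = {"click", "double_click", "type", "scroll"}

def parse_action(raw):
    """Validate one model-emitted action (assumed JSON schema) before it is
    forwarded to the native input layer. Rejecting bad output here keeps
    the SendInput path free of error handling for model quirks."""
    action = json.loads(raw)
    kind = action.get("action")
    if kind not in ALLOWED:
        raise ValueError(f"unsupported action: {kind!r}")
    if kind in {"click", "double_click", "scroll"}:
        x, y = action.get("x"), action.get("y")
        if not (isinstance(x, int) and isinstance(y, int) and x >= 0 and y >= 0):
            raise ValueError("coordinates must be non-negative integers")
    return action

parse_action('{"action": "click", "x": 412, "y": 96}')
```

An allow-list of action kinds also doubles as a safety boundary: the agent can only ever emit the input primitives we explicitly chose to support.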
**Questions**
1. For the NPU 4000 target, is there a preferred INT4 quantization profile for the VisionPipeline, or should I explore custom NNCF configurations?
2. Regarding the 350-hour scope: does the team prefer a pure visual-first approach, or is the hybrid UIA/SoM integration considered the desired standard for this project?
3. Is there existing support within the GenAI library for asynchronous tensor inputs from D3D11 surfaces, or should I implement a custom buffer bridge in the C++ layer?