feat(desktop): macOS Accessibility API text capture as second channel to Gemini (NAN-707)#367
Draft
… to Gemini (NAN-707)
Adds a second input channel to Gemini Live that captures the actual text of
the frontmost macOS window via the Accessibility API and sends it inline
with each 1 FPS image frame. The model now sees the screenshot AND the
ground-truth text as one moment, dramatically reducing hallucinated text on
code, terminals, dense UIs, and small fonts where vision OCR struggles.
Architecture:
- Swift sidecar (desktop/native/ax-capture) speaks JSON-line stdio to the
Electron main process. AXIsProcessTrusted-gated, returns flat list of
{role,text,frame,app} for the frontmost window.
- Main process module (desktop/src/main/ax-capture.ts) manages sidecar
lifecycle with restart backoff, 250ms timeout per call, and exposes
ax:capture / ax:permission / ax:openSettings IPC handlers.
- Renderer screen-capture loop calls AX in parallel with each JPEG frame,
formats to a compact role:text block (capped at 8KB), and sends it
alongside the image via the existing realtime WebSocket.
- Relay forwards as a sibling realtimeInput.text immediately after each
realtimeInput.video — Gemini Live has no combined part, but adjacent
sends preserve temporal alignment. Other adapters (OpenAI, xAI) just
ignore the new field via the optional sendAxText interface method.
- Per-frame ax_text persisted to timings.json so the in-house tracer can
A/B image-only vs image+text accuracy.
- Permission UX: pre-flight probe on screen-share start; if not granted,
confirm dialog opens System Settings → Privacy → Accessibility. Capture
continues vision-only either way.
Tests:
- 5 main-process formatter tests
- 4 sidecar protocol integration tests against the real Swift binary
- 4 renderer formatter tests
- 4 relay frame.append → adapter wiring tests
- 3 MediaCapture ax_text persistence tests
- All 113 desktop + 79 relay-server tests pass; both packages typecheck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
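A minimal sketch of the "compact role:text block (capped at 8KB)" step described above. The `AxNode` shape, the `formatAxBlock` name, and the exact `role: text` line layout are assumptions for illustration, not the PR's actual formatter.

```typescript
// Hypothetical shape of one element returned by the sidecar. Field names
// follow the {role,text,frame,app} list in the description; types are assumed.
interface AxNode {
  role: string;
  text: string;
}

const MAX_AX_BYTES = 8 * 1024; // 8KB cap, counted in UTF-8 bytes

// Join nodes as "role: text" lines and stop before the byte cap is exceeded.
// Cutting on line boundaries means a multi-byte character is never split.
function formatAxBlock(nodes: AxNode[], maxBytes: number = MAX_AX_BYTES): string {
  const encoder = new TextEncoder();
  const lines: string[] = [];
  let used = 0;
  for (const node of nodes) {
    const text = node.text.trim();
    if (text.length === 0) continue; // skip elements with no visible text
    const line = `${node.role}: ${text}`;
    // +1 accounts for the newline that joins this line to the previous one.
    const cost = encoder.encode(line).length + (lines.length > 0 ? 1 : 0);
    if (used + cost > maxBytes) break;
    lines.push(line);
    used += cost;
  }
  return lines.join("\n");
}
```

Dropping whole trailing lines rather than slicing mid-line keeps every surviving entry intact, at the cost of using slightly less than the full budget.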
Critical:
- Replace `as!` force-casts on AX results with CFGetTypeID guards. A buggy
app returning a non-AXUIElement / non-AXValue would have crashed the
sidecar; we now bail with ax_failed / skip the frame instead.
Should-fix:
- Lazy-start the sidecar on first capture/permission call instead of at
app launch. Users who never screen-share pay nothing for the feature.
- Honor stdin.write backpressure: if the kernel pipe buffer is full (wedged
sidecar), drop the request immediately rather than letting Node buffer
unbounded data internally.
- Verify lipo / chmod exit codes in build-ax-capture.mjs and assert the
output binary exists before declaring success. A silent build failure
would have shipped a broken or missing universal binary.
- Lower Swift Package macOS minimum from .v12 to .v11 to match Electron 41's
floor; bumping forced unnecessary OS-version gating on the smallest piece
of the bundle.
- Cross-reference both copies of the format helper in their own comments so
future readers know to keep them in lockstep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
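One possible reading of the backpressure fix, as a hedged sketch: once `write()` reports a full buffer, new requests are dropped until the stream emits `drain`. `SidecarWriter` and `WritableLike` are hypothetical names; the only behavior relied on is that a Node writable stream's `write()` returns `false` when its internal buffer is over the high-water mark.

```typescript
// Minimal structural type so the sketch needs no imports. A Node writable
// stream such as a child process's stdin is expected to fit this shape.
interface WritableLike {
  write(chunk: string): boolean;
  once(event: "drain", listener: () => void): unknown;
}

class SidecarWriter {
  private blocked = false;
  private readonly stdin: WritableLike;

  constructor(stdin: WritableLike) {
    this.stdin = stdin;
  }

  // Returns true if the request was handed to the pipe, false if dropped.
  send(request: object): boolean {
    if (this.blocked) return false; // buffer still full: drop, do not queue
    const ok = this.stdin.write(JSON.stringify(request) + "\n");
    if (!ok) {
      // write() returned false: this request is already buffered, but
      // refuse further ones until the stream signals "drain".
      this.blocked = true;
      this.stdin.once("drain", () => {
        this.blocked = false;
      });
    }
    return true;
  }
}
```

Because callers already apply a per-call timeout, a dropped request simply resolves as a skipped frame rather than piling up behind a wedged sidecar.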
Review summary
Codex skipped; Gemini findings
Critical (fixed in 79026a4):
Should-fix (fixed in 79026a4):
Should-fix (deferred with rationale):
Nits (skipped):
All 113 desktop + 79 relay-server tests still pass after the fixes; both packages typecheck.
Critical:
- AX capture is now gated to display sources only. When sharing a single
window, AX text comes from the *frontmost* window which can diverge from
the captured pixels (window sources track the share even when not
focused). Window-source shares ship vision-only until we wire CGWindowID
alignment in a follow-up.
Should-fix:
- ScreenCapture sets a stopped flag and rechecks it after the AX await, so
a frame can no longer land after stop().
- Relay sanitizes axText at the trust boundary: rejects non-strings,
re-caps at 8KB UTF-8. The renderer cap is not a trust boundary.
- Sidecar restart counter only resets after 30s of stable runtime. A crash
loop can no longer forever-retry at the smallest backoff.
- shuttingDown flag prevents the exit handler from resurrecting the sidecar
after app quit kills it.
- Distribution builds (yarn dist:mac sets AX_REQUIRE_UNIVERSAL=1) now
hard-fail when the x86_64 toolchain is missing instead of silently
shipping arm64-only. Dev builds still degrade gracefully.
- Stripped NAN-707 references from source comments; they belong in the PR
body, not the codebase.
Nits:
- Type-literal semicolons swapped for commas in three call sites.
- Dropped unused REQUEST_TIMEOUT_REASON constant and _reason arg.
- Sidecar fallback-window failure now reports werr.rawValue (the actual
failed call) instead of err.rawValue.
Tests:
- 2 new relay sanitizer tests (non-string rejection, 8KB cap).
- All 113 desktop + 81 relay-server tests pass; both packages typecheck.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
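The relay-side sanitizer mentioned above (reject non-strings, re-cap at 8KB UTF-8) could look roughly like this. `sanitizeAxText` is a hypothetical name, and truncating on a character boundary is an assumption about how the re-cap behaves.

```typescript
const AX_TEXT_MAX_BYTES = 8 * 1024;

// Returns a safe string, or undefined when the field should be ignored.
function sanitizeAxText(
  value: unknown,
  maxBytes: number = AX_TEXT_MAX_BYTES,
): string | undefined {
  if (typeof value !== "string") return undefined; // reject non-strings
  const bytes = new TextEncoder().encode(value);
  if (bytes.length <= maxBytes) return value;
  // Back up so the cut never lands inside a multi-byte UTF-8 sequence:
  // continuation bytes all have the bit pattern 10xxxxxx.
  let end = maxBytes;
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Counting bytes rather than string length matters here because a renderer-side cap in characters could still exceed 8KB once non-ASCII text is encoded.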
Codex review on this branch (latest commit ca9b7f2)
GitHub doesn't let me post a request-changes review on my own PR, so this is a comment instead. Reviewed by Codex CLI (codex-cli 0.128.0), not by the original author model.
Critical (resolved)
Should-fix (resolved)
Nits (resolved)
Test results after fixes
- New `yarn build:ax-capture` runs the Swift compile.
- `yarn dev` now invokes it before Electron starts, so a fresh checkout
works without remembering an out-of-band node command.
- Build is idempotent: skips when no source has changed since the last
output (under 1s), full compile (~4s) only when Package.swift or Sources/
are newer than the binary. AX_FORCE_REBUILD=1 forces it.
- dist:mac path is unchanged (still chained through build-services.mjs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
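The idempotency rule above reduces to a small mtime comparison. This is a sketch only: `needsRebuild` is a hypothetical helper, and it assumes the caller has already collected modification times (for example via `fs.statSync(path).mtimeMs`) for Package.swift, the files under Sources/, and the output binary.

```typescript
// Decide whether the Swift compile can be skipped.
// sourceMtimesMs: modification times of Package.swift and files in Sources/.
// binaryMtimeMs: modification time of the built binary, undefined if missing.
// force: true when AX_FORCE_REBUILD=1 is set.
function needsRebuild(
  sourceMtimesMs: number[],
  binaryMtimeMs: number | undefined,
  force: boolean,
): boolean {
  if (force) return true; // explicit override always rebuilds
  if (binaryMtimeMs === undefined) return true; // nothing built yet
  // Rebuild only when at least one source is newer than the binary.
  return sourceMtimesMs.some((mtime) => mtime > binaryMtimeMs);
}
```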
Why
Gemini hallucinates text from screen captures regularly — code, terminal output, dense tables, and small UI labels are the worst offenders even at our 1536px / 0.85q JPEG. Adding a parallel channel that sends the actual on-screen text via macOS Accessibility API alongside each image gives the model both the picture and the ground-truth text temporally aligned, so it stops guessing on the parts that vision can't read reliably.
Closes NAN-707.
Architecture (decisions table)
ApplicationServices.framework); sidecar is what serious mac-AX tools (Raycast/Rewind) do: easier to iterate on and notarize than a node-gyp native module.
`{role, text, frame, app}`
`realtimeInput.text` immediately after each `realtimeInput.video`
`ax_text` field on each `videoFrames[]` entry in the per-turn `timings.json`
Test plan
- `frame.append` → `sendFrame` + `sendAxText` routing, including legacy-adapter compatibility
- `ax_text` persistence and 8KB truncation
- `ax_text` and Gemini's transcripts show fewer hallucinations
- `yarn dist:mac` smoke test that the sidecar lands in `Contents/Resources/bin/ax-capture` and is signed + notarized
Implementation notes
- `FrameAppendEvent` gets an optional `axText?: string`. Old desktop builds simply omit it; the field is additive.
- `ProviderAdapter.sendAxText` is optional. OpenAI and xAI adapters don't implement it; only Gemini sees the channel, and that's correct since it's the only adapter accepting video.
- `"video"` in Gemini's send-upstream queue so it shares oldest-drop discipline with its paired image. At 1 FPS, drift between image and AX text after a rotation is bounded to one second.
- Unlike `sendFrame`, `sendAxText` does NOT pet the watchdog. The "are you still there?" prompt should still fire correctly during silent screen sharing.
- The sidecar lands in `Contents/Resources/bin/ax-capture` and gets signed with the app's Developer ID via electron-builder's afterSign hook. macOS attributes the AX request to the parent app bundle, so the user adds VoiceClaw to `Privacy & Security → Accessibility`, not the sidecar binary directly. This needs verification on a real `dist:mac` build. If macOS shows the sidecar as a separate entry, we'll need to either relaunch it via the parent's process tree differently or add explicit per-binary entitlements.
- (`formatAxText` in main, `formatAxTextRenderer` in renderer); relay re-caps defensively to 8 KB on persist. UTF-8 byte-counted, not chars.
- `desktop/scripts/build-services.mjs` now runs `build-ax-capture.mjs` after the relay/openclaw bundles, producing the universal binary into `desktop/resources/bin/ax-capture`. `electron-builder.yml`'s existing `extraResources: from: resources/` picks it up automatically.
- `.build/` and `.swiftpm/` directories added.

🤖 Generated with Claude Code
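To illustrate how an optional `sendAxText` keeps legacy adapters compatible, here is a hedged sketch of the `frame.append` routing. Only `FrameAppendEvent`, `axText`, `ProviderAdapter`, `sendFrame`, and `sendAxText` come from the notes above; the `image` field, the method signatures, and `routeFrame` are assumptions for illustration.

```typescript
interface FrameAppendEvent {
  image: string; // assumed field name for the JPEG payload
  axText?: string; // additive: old desktop builds omit it
}

interface ProviderAdapter {
  sendFrame(image: string): void;
  // Optional: adapters that do not implement it never see the AX channel.
  sendAxText?(text: string): void;
}

// Forward the image first, then the sibling text, so the two sends stay
// adjacent for adapters that support both.
function routeFrame(adapter: ProviderAdapter, event: FrameAppendEvent): void {
  adapter.sendFrame(event.image);
  if (event.axText !== undefined) {
    adapter.sendAxText?.(event.axText); // no-op when the method is absent
  }
}
```

Optional chaining on the method call means no adapter needs a stub implementation, and an old client that omits `axText` takes exactly the same path as before.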