
feat(desktop): macOS Accessibility API text capture as second channel to Gemini (NAN-707) #367

Draft
yagudaev wants to merge 4 commits into main from michael/nan-707-ax-text-capture

@yagudaev
Owner

@yagudaev yagudaev commented May 1, 2026

Why

Gemini hallucinates text from screen captures regularly — code, terminal output, dense tables, and small UI labels are the worst offenders even at our 1536px / 0.85q JPEG. Adding a parallel channel that sends the actual on-screen text via macOS Accessibility API alongside each image gives the model both the picture and the ground-truth text temporally aligned, so it stops guessing on the parts that vision can't read reliably.

Closes NAN-707.

Architecture (decisions table)

| Decision | Choice | Why |
| --- | --- | --- |
| Capture mechanism | Swift sidecar binary, JSON-line stdio | AX is C-only (ApplicationServices.framework); a sidecar is what serious mac-AX tools (Raycast/Rewind) use, and it is easier to iterate on and notarize than a node-gyp native module. |
| Scope per capture | Frontmost window only | Bounds payload size; matches user attention. |
| Tree shape | Flat list of {role, text, frame, app} | Easier for the model to consume; tree depth rarely adds signal for text. |
| Cadence | Captured fresh per image frame (1 FPS) | Keeps image + text temporally aligned without separate sync logic. |
| Delivery to Gemini | Sibling realtimeInput.text immediately after each realtimeInput.video | Gemini Live has no combined video+text part, but adjacent sends in the same WS tick are treated as one moment. |
| Tracing | ax_text field on each videoFrames[] entry in the per-turn timings.json | Lets us A/B image-only vs image+text accuracy on text-heavy interfaces. |
| Permission UX | Pre-flight probe on screen-share start; confirm dialog opens System Settings deeplink if denied | Mirrors existing screen-recording flow; capture continues vision-only either way (graceful degradation). |
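
The "Delivery to Gemini" row can be pictured as two back-to-back sends per frame. This is a minimal sketch, not the relay's actual code: `sendFrameWithAxText`, the `send` callback, and the exact message envelopes are assumptions made for illustration. Only the `realtimeInput.video` and `realtimeInput.text` field names come from the table above.

```typescript
// Hypothetical upstream message shapes. Only the realtimeInput.video and
// realtimeInput.text names are taken from the PR; the rest is assumed.
type VideoMessage = { realtimeInput: { video: { mimeType: string, data: string } } }
type TextMessage = { realtimeInput: { text: string } }
type UpstreamMessage = VideoMessage | TextMessage

// Sends the image first, then the AX text as an adjacent sibling message.
// When no AX text is available the frame goes out vision-only.
function sendFrameWithAxText(
  send: (msg: UpstreamMessage) => void,
  jpegBase64: string,
  axText?: string,
): void {
  send({ realtimeInput: { video: { mimeType: "image/jpeg", data: jpegBase64 } } })
  if (axText !== undefined && axText.length > 0) {
    send({ realtimeInput: { text: axText } })
  }
}
```

Keeping both sends inside one synchronous function means nothing else can be interleaved between the image and its text on the same connection.
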
Test plan
  • Swift sidecar builds as universal binary (arm64 + x86_64)
  • 4 sidecar protocol integration tests (ping / permission / capture / bad JSON) pass against the real binary
  • 5 main-process formatter tests + 4 renderer formatter tests
  • 4 relay session tests for frame.append → sendFrame + sendAxText routing, including legacy-adapter compatibility
  • 3 MediaCapture tests for ax_text persistence and 8KB truncation
  • All 113 desktop tests + 79 relay-server tests + typecheck both packages
  • Docs page renders into the Starlight build (33 pages built, new sidebar entry visible)
  • Manual: run a session against text-heavy UIs (VS Code, dense web table, terminal) — verify tracer rows now contain ax_text and Gemini's transcripts show fewer hallucinations
  • Manual: revoke Accessibility permission, start a share, verify the confirm dialog appears and capture continues vision-only
  • Manual: yarn dist:mac smoke test that the sidecar lands in Contents/Resources/bin/ax-capture and is signed + notarized
Implementation notes
  • Wire format: FrameAppendEvent gets an optional axText?: string. Old desktop builds simply omit it; the field is additive.
  • Adapter interface: ProviderAdapter.sendAxText is optional. OpenAI and xAI adapters don't implement it. Only Gemini sees the channel, which is correct since it is the only adapter that accepts video.
  • Reconnect queue: AX text is classified as "video" in Gemini's send-upstream queue so it shares oldest-drop discipline with its paired image. At 1 FPS, drift between image and AX text after a rotation is bounded to one second.
  • Watchdog: like sendFrame, sendAxText does NOT pet the watchdog. The "are you still there?" prompt should still fire correctly during silent screen sharing.
  • Permission inheritance: the sidecar is bundled inside Contents/Resources/bin/ax-capture and gets signed with the app's Developer ID via electron-builder's afterSign hook. macOS attributes the AX request to the parent app bundle, so the user adds VoiceClaw to Privacy & Security → Accessibility, not the sidecar binary directly. This needs verification on a real dist:mac build. If macOS shows the sidecar as a separate entry, we'll either need to relaunch it via the parent's process tree differently or add explicit per-binary entitlements.
  • Truncation: client-side cap is 8 KB at format time (formatAxText in main, formatAxTextRenderer in renderer); relay re-caps defensively to 8 KB on persist. UTF-8 byte-counted, not chars.
  • Build pipeline: desktop/scripts/build-services.mjs now runs build-ax-capture.mjs after the relay/openclaw bundles, producing the universal binary into desktop/resources/bin/ax-capture. electron-builder.yml's existing extraResources: from: resources/ picks it up automatically.
  • Gitignore: built binary + Swift .build/ and .swiftpm/ directories added.
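
The truncation note distinguishes UTF-8 bytes from characters, which matters because non-ASCII text takes more bytes than `String.length` suggests. Below is a rough sketch of a formatter that respects a byte budget. `AxNode`, `utf8Length`, and `formatAxTextSketch` are hypothetical names, and stopping at a whole-line boundary is an assumption of this example rather than the behavior of the real `formatAxText`.

```typescript
// Hypothetical node shape, mirroring the flat {role, text, frame, app} list.
type AxNode = {
  role: string,
  text: string,
  frame: { x: number, y: number, width: number, height: number },
  app: string,
}

const MAX_AX_TEXT_BYTES = 8 * 1024
const encoder = new TextEncoder()

// Length in UTF-8 bytes, which can exceed String.length for non-ASCII text.
function utf8Length(s: string): number {
  return encoder.encode(s).length
}

// Emits one "role: text" line per non-empty node and stops before the byte
// cap would be exceeded. Cutting at a line boundary is an assumption here.
function formatAxTextSketch(nodes: AxNode[], maxBytes = MAX_AX_TEXT_BYTES): string {
  const lines: string[] = []
  let used = 0
  for (const node of nodes) {
    const text = node.text.trim()
    if (text.length === 0) continue
    const line = `${node.role}: ${text}`
    const cost = utf8Length(line) + (lines.length > 0 ? 1 : 0) // +1 for the "\n" joiner
    if (used + cost > maxBytes) break
    lines.push(line)
    used += cost
  }
  return lines.join("\n")
}
```
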

🤖 Generated with Claude Code

… to Gemini (NAN-707)

Adds a second input channel to Gemini Live that captures the actual text of
the frontmost macOS window via the Accessibility API and sends it inline
with each 1 FPS image frame. The model now sees the screenshot AND the
ground-truth text as one moment, dramatically reducing hallucinated text on
code, terminals, dense UIs, and small fonts where vision OCR struggles.

Architecture:
- Swift sidecar (desktop/native/ax-capture) speaks JSON-line stdio to the
  Electron main process. AXIsProcessTrusted-gated, returns flat list of
  {role,text,frame,app} for the frontmost window.
- Main process module (desktop/src/main/ax-capture.ts) manages sidecar
  lifecycle with restart backoff, 250ms timeout per call, and exposes
  ax:capture / ax:permission / ax:openSettings IPC handlers.
- Renderer screen-capture loop calls AX in parallel with each JPEG frame,
  formats to a compact role:text block (capped at 8KB), and sends it
  alongside the image via the existing realtime WebSocket.
- Relay forwards as a sibling realtimeInput.text immediately after each
  realtimeInput.video — Gemini Live has no combined part, but adjacent
  sends preserve temporal alignment. Other adapters (OpenAI, xAI) just
  ignore the new field via the optional sendAxText interface method.
- Per-frame ax_text persisted to timings.json so the in-house tracer can
  A/B image-only vs image+text accuracy.
- Permission UX: pre-flight probe on screen-share start; if not granted,
  confirm dialog opens System Settings → Privacy → Accessibility. Capture
  continues vision-only either way.
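
To make the JSON-line stdio exchange with a per-call timeout concrete, here is a small sketch of the host side. It is not the code in desktop/src/main/ax-capture.ts: `JsonLineClient`, the `id` / `method` / `result` / `error` field names, and the injected `write` callback are assumptions for illustration. The 250 ms default is taken from the commit message above.

```typescript
type Pending = {
  resolve: (value: unknown) => void,
  reject: (err: Error) => void,
  timer: ReturnType<typeof setTimeout>,
}

class JsonLineClient {
  private nextId = 1
  private buffer = ""
  private pending = new Map<number, Pending>()
  private write: (line: string) => void
  private timeoutMs: number

  constructor(write: (line: string) => void, timeoutMs = 250) {
    this.write = write
    this.timeoutMs = timeoutMs
  }

  // Writes one JSON line and resolves when a reply with the same id arrives.
  request(method: string): Promise<unknown> {
    const id = this.nextId++
    return new Promise<unknown>((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(id)
        reject(new Error(`${method} timed out after ${this.timeoutMs}ms`))
      }, this.timeoutMs)
      this.pending.set(id, { resolve, reject, timer })
      this.write(JSON.stringify({ id, method }) + "\n")
    })
  }

  // Feed raw stdout chunks; a chunk may hold a partial line or several lines.
  handleStdout(chunk: string): void {
    this.buffer += chunk
    let newline = this.buffer.indexOf("\n")
    while (newline !== -1) {
      const line = this.buffer.slice(0, newline)
      this.buffer = this.buffer.slice(newline + 1)
      this.handleLine(line)
      newline = this.buffer.indexOf("\n")
    }
  }

  private handleLine(line: string): void {
    let parsed: unknown
    try {
      parsed = JSON.parse(line)
    } catch {
      return // ignore malformed lines rather than crash the host
    }
    if (typeof parsed !== "object" || parsed === null) return
    const msg = parsed as { id?: unknown, result?: unknown, error?: unknown }
    if (typeof msg.id !== "number") return
    const entry = this.pending.get(msg.id)
    if (!entry) return
    clearTimeout(entry.timer)
    this.pending.delete(msg.id)
    if (typeof msg.error === "string") entry.reject(new Error(msg.error))
    else entry.resolve(msg.result)
  }
}
```

In a real main process the `write` callback would wrap the child's stdin and `handleStdout` would be wired to its stdout `data` events. A request that times out is simply rejected, which lets the caller fall back to a vision-only frame.
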

Tests:
- 5 main-process formatter tests
- 4 sidecar protocol integration tests against the real Swift binary
- 4 renderer formatter tests
- 4 relay frame.append → adapter wiring tests
- 3 MediaCapture ax_text persistence tests
- All 113 desktop + 79 relay-server tests pass; both packages typecheck

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| voiceclaw | Ready | Preview, Comment | May 1, 2026 11:15pm |


Critical:
- Replace `as!` force-casts on AX results with CFGetTypeID guards. A buggy
  app returning a non-AXUIElement / non-AXValue would have crashed the
  sidecar; we now bail with ax_failed / skip the frame instead.

Should-fix:
- Lazy-start the sidecar on first capture/permission call instead of at
  app launch. Users who never screen-share pay nothing for the feature.
- Honor stdin.write backpressure: if the kernel pipe buffer is full
  (wedged sidecar), drop the request immediately rather than letting Node
  buffer unbounded data internally.
- Verify lipo / chmod exit codes in build-ax-capture.mjs and assert the
  output binary exists before declaring success. A silent build failure
  would have shipped a broken or missing universal binary.
- Lower Swift Package macOS minimum from .v12 to .v11 to match Electron 41's
  floor; bumping forced unnecessary OS-version gating on the smallest piece
  of the bundle.
- Cross-reference both copies of the format helper in their own comments so
  future readers know to keep them in lockstep.
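
The backpressure item can be sketched as a thin gate around the sidecar's stdin. This is an illustration under assumptions: `DroppingWriter` and `WritableLike` are hypothetical names, and the policy shown (accept the write that first reports pressure, then drop until `drain`) is one reasonable reading of the fix, not necessarily what the commit does. It relies only on the documented Node stream behavior that `write()` returns `false` once the internal buffer passes its high-water mark and that `drain` fires when it empties.

```typescript
// Minimal slice of a Node Writable that this sketch relies on.
type WritableLike = {
  write(chunk: string): boolean,
  once(event: "drain", listener: () => void): unknown,
}

class DroppingWriter {
  private blocked = false
  private stream: WritableLike

  constructor(stream: WritableLike) {
    this.stream = stream
  }

  // Returns true if the line was handed to the stream, false if it was dropped.
  trySend(line: string): boolean {
    if (this.blocked) return false
    const hasRoom = this.stream.write(line)
    if (!hasRoom) {
      // write() still accepted this chunk, but the buffer is over its
      // high-water mark: refuse further writes until "drain" fires.
      this.blocked = true
      this.stream.once("drain", () => {
        this.blocked = false
      })
    }
    return true
  }
}
```

A caller that gets `false` back can skip AX text for that frame, so a wedged sidecar costs at most one buffered request instead of an ever-growing queue.
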

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yagudaev
Owner Author

yagudaev commented May 1, 2026

Review summary

Codex skipped — codex CLI flagged the worktree as untrusted and refused without --skip-git-repo-check. Skill bug, not a code issue. Falling back to Gemini-only.

Gemini findings

Critical (fixed in 79026a4):

  • Force-casts (as!) on AX results in three Swift sites would crash the sidecar if a buggy app returned a non-AXUIElement / non-AXValue. Replaced with CFGetTypeID guards that bail with ax_failed / skip the frame.

Should-fix (fixed in 79026a4):

  • Sidecar now lazy-starts on first capture/permission call, not at app launch.
  • stdin.write backpressure honored — drop request when the pipe buffer is full instead of buffering unbounded data in Node.
  • Build script now verifies lipo / chmod exit codes and asserts the output binary exists.
  • Swift Package.swift minimum lowered from .v12 to .v11 to match Electron 41's floor.

Should-fix (deferred with rationale):

  • Duplicated formatter (main + renderer): The two copies are 18 lines of pure string manipulation, both unit-tested separately. Consolidating would require a third bundle target or a preload-mediated path; not worth the bundling complexity for the deduplication. Added cross-reference comments in both copies.

Nits (skipped):

  • Redundant as [Any] casts in Swift readableText loop — keeping for clarity.
  • Unused _reason arg in failAllPending — kept for symmetry with the resolve-callback pattern; trivial.
  • window.confirm for permission dialog — acceptable for v1; can upgrade to a custom modal once we have a banner system.

All 113 desktop + 79 relay-server tests still pass after the fixes; both packages typecheck.

Critical:
- AX capture is now gated to display sources only. When sharing a single
  window, AX text comes from the *frontmost* window which can diverge from
  the captured pixels (window sources track the share even when not
  focused). Window-source shares ship vision-only until we wire CGWindowID
  alignment in a follow-up.

Should-fix:
- ScreenCapture sets a stopped flag and rechecks it after the AX await, so
  a frame can no longer land after stop().
- Relay sanitizes axText at the trust boundary: rejects non-strings,
  re-caps at 8KB UTF-8. The renderer cap is not a trust boundary.
- Sidecar restart counter only resets after 30s of stable runtime. A crash
  loop can no longer forever-retry at the smallest backoff.
- shuttingDown flag prevents the exit handler from resurrecting the sidecar
  after app quit kills it.
- Distribution builds (yarn dist:mac sets AX_REQUIRE_UNIVERSAL=1) now hard-
  fail when the x86_64 toolchain is missing instead of silently shipping
  arm64-only. Dev builds still degrade gracefully.
- Stripped NAN-707 references from source comments — they belong in the
  PR body, not the codebase.
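
The stopped-flag fix follows a common pattern: re-check cancellation after every `await`, because `stop()` can run while the awaited call is in flight. A minimal sketch, assuming hypothetical names (`ScreenCaptureSketch`, `captureAx`, `onFrame`) that stand in for the real renderer code:

```typescript
type Frame = { image: string, axText: string | undefined }

class ScreenCaptureSketch {
  private stopped = false
  private captureAx: () => Promise<string | undefined>
  private onFrame: (frame: Frame) => void

  constructor(
    captureAx: () => Promise<string | undefined>,
    onFrame: (frame: Frame) => void,
  ) {
    this.captureAx = captureAx
    this.onFrame = onFrame
  }

  async captureFrame(image: string): Promise<void> {
    if (this.stopped) return
    const axText = await this.captureAx()
    // stop() may have run while the AX call was in flight, so check again
    // before delivering the frame.
    if (this.stopped) return
    this.onFrame({ image, axText })
  }

  stop(): void {
    this.stopped = true
  }
}
```
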

Nits:
- Type-literal semicolons swapped for commas in three call sites.
- Dropped unused REQUEST_TIMEOUT_REASON constant and _reason arg.
- Sidecar fallback-window failure now reports werr.rawValue (the actual
  failed call) instead of err.rawValue.

Tests:
- 2 new relay sanitizer tests (non-string rejection, 8KB cap).
- All 113 desktop + 81 relay-server tests pass; both packages typecheck.
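
For readers who want to see what a trust-boundary check of this kind might look like, here is a hedged sketch. `sanitizeAxTextSketch` is a hypothetical stand-in for the relay's `sanitizeAxText`, and the exact cut strategy (backing up to a UTF-8 character boundary) is an assumption; the commit only states that non-strings are rejected and the value is re-capped at 8 KB of UTF-8.

```typescript
const MAX_AX_TEXT_BYTES = 8 * 1024

// Returns a safe string, or undefined when the input should be ignored.
function sanitizeAxTextSketch(
  value: unknown,
  maxBytes = MAX_AX_TEXT_BYTES,
): string | undefined {
  if (typeof value !== "string" || value.length === 0) return undefined
  const bytes = new TextEncoder().encode(value)
  if (bytes.length <= maxBytes) return value
  // Back up past UTF-8 continuation bytes (0b10xxxxxx) so the cut never
  // lands inside a multi-byte character.
  let end = maxBytes
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--
  return new TextDecoder().decode(bytes.subarray(0, end))
}
```

Returning `undefined` for bad input lets the caller treat the frame as vision-only instead of failing the whole event.
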

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yagudaev
Owner Author

yagudaev commented May 1, 2026

Codex review on this branch (latest commit ca9b7f2)

GitHub doesn't let me post a request-changes review on my own PR, so this is a comment instead. Reviewed by Codex CLI (codex-cli 0.128.0), not by the original author model.

Critical (resolved)

  • AX capture leaked text from non-shared windows — capture read NSWorkspace.shared.frontmostApplication, so sharing window A while focused on window B sent B's text alongside A's screenshot.
    • Fix: AX is now gated to display-source shares only (sourceId.startsWith('screen:')). For window shares, screen capture continues vision-only until we wire CGWindowID-aware AX traversal.

Should-fix (resolved)

  • captureFrame race after stop() — the AX await widened the window where a frame could land after the user ended sharing. Fix: stopped flag set in stop() and re-checked after the await before calling onFrame.
  • Server-side axText validation missing — relay forwarded event.axText to Gemini without checking type or size. Fix: sanitizeAxText in session.ts rejects non-strings and re-caps at 8 KB UTF-8. Two new tests cover both cases.
  • Restart-backoff counter reset on every spawn — a crash-looping sidecar would forever-retry at the smallest delay. Fix: counter only resets after RESTART_STABLE_RUNTIME_MS (30 s) of stable runtime.
  • Quit + scheduleRestart race: before-quit killed the sidecar; the exit handler then scheduled a restart. Fix: shuttingDown flag short-circuits both ensureSidecar() / startSidecar() and scheduleRestart().
  • Universal-binary silent fallback — x86_64 build failures dropped to arm64-only without error, even on dist:mac. Fix: AX_REQUIRE_UNIVERSAL=1 set in dist:mac; build aborts on x86_64 failure under that mode. Dev builds still degrade gracefully.
  • NAN-707 referenced in source comments — per CLAUDE.md, ticket references belong in PR bodies, not the codebase. Fix: removed from relay-server/src/types.ts and relay-server/src/session.ts.
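
The restart-backoff and quit-race fixes combine naturally into one small state machine. The sketch below is illustrative only: `RestartBackoff`, the 500 ms base delay, and the 8 s ceiling are assumptions, and the clock is passed in so the logic is easy to test. `RESTART_STABLE_RUNTIME_MS` and the `shuttingDown` flag are the names used in the findings above.

```typescript
const RESTART_STABLE_RUNTIME_MS = 30_000

class RestartBackoff {
  private attempts = 0
  private startedAt = 0
  private shuttingDown = false
  private baseDelayMs: number
  private maxDelayMs: number

  // Base and max delays are hypothetical defaults for this sketch.
  constructor(baseDelayMs = 500, maxDelayMs = 8_000) {
    this.baseDelayMs = baseDelayMs
    this.maxDelayMs = maxDelayMs
  }

  onSpawn(nowMs: number): void {
    this.startedAt = nowMs
  }

  // Returns the delay before the next restart, or null if no restart
  // should be scheduled because the app is quitting.
  onExit(nowMs: number): number | null {
    if (this.shuttingDown) return null
    // Only a run that stayed up long enough earns a reset; a crash loop
    // keeps climbing the backoff ladder.
    if (nowMs - this.startedAt >= RESTART_STABLE_RUNTIME_MS) this.attempts = 0
    const delay = Math.min(this.baseDelayMs * 2 ** this.attempts, this.maxDelayMs)
    this.attempts++
    return delay
  }

  shutdown(): void {
    this.shuttingDown = true
  }
}
```
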

Nits (resolved)

  • Type-literal semicolons swapped for commas (style: no semicolons).
  • Dropped unused REQUEST_TIMEOUT_REASON constant and _reason arg.
  • Sidecar fallback-window failure now reports werr.rawValue (the actual failed AX call) instead of err.rawValue.

Test results after fixes

  • 113/113 desktop tests
  • 81/81 relay-server tests (+2 new sanitizer tests)
  • Both packages typecheck clean.

- New `yarn build:ax-capture` runs the Swift compile.
- `yarn dev` now invokes it before Electron starts, so a fresh checkout
  works without remembering an out-of-band node command.
- Build is idempotent — skips when no source has changed since the last
  output (under 1s), full compile (~4s) only when Package.swift or
  Sources/ are newer than the binary. AX_FORCE_REBUILD=1 forces it.
- dist:mac path is unchanged (still chained through build-services.mjs).
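
The idempotency rule described above reduces to a pure decision over modification times. A minimal sketch, assuming a hypothetical `needsRebuild` helper and `BuildInputs` shape; in a real script the timestamps would typically come from `fs.statSync(path).mtimeMs` and the force flag from `AX_FORCE_REBUILD`.

```typescript
type BuildInputs = {
  outputMtimeMs: number | null, // null when the binary does not exist yet
  sourceMtimesMs: number[],     // Package.swift plus every file under Sources/
  forceRebuild: boolean,        // for example, AX_FORCE_REBUILD=1
}

// True when the Swift compile should run; false when the existing binary
// is at least as new as every source file.
function needsRebuild(inputs: BuildInputs): boolean {
  if (inputs.forceRebuild) return true
  if (inputs.outputMtimeMs === null) return true
  const output = inputs.outputMtimeMs
  return inputs.sourceMtimesMs.some((mtime) => mtime > output)
}
```
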

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>