world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera) by salmanmkc · Pull Request #268 · google/xrblocks

salmanmkc · 2026-05-10T11:00:59Z

been wanting a voice + vision companion in xr for a while. start it, talk to it, it sees what you see and can drop markers on stuff.

needed two sdk bits to make it work so they're in here too:

world.streamScene(prompt, opts) - opens a gemini live session w/ the camera streaming at N fps, tool calls come back via callbacks
world.lookingAt() - whatever the reticle's on, or null

demo's in demos/world_companion. there's a placeLabel tool with three styles (dot, arrow, pulse) so it can pick how to highlight - arrow if you ask it to find something, pulse for tiny stuff, dot otherwise. it uses world.objects.runDetection so markers stick to the real object via depth, not to wherever your head was when the tool fired (on the desktop simulator it will still do the latter though). small spatial panel for start/stop/clear so it's actually usable once you're in immersive.

you can ask for several things in one go ("label the couch, tv, and coffee table") and the tool takes an items[] array so they all get placed in a single call, each with its own style. labels billboard back at the camera so they stay readable when you walk around.

detector labels and what you say don't always match — television vs tv, pendant light vs floor lamp, picture vs painting. there's a small synonym table for the obvious cases, then a fallback to gemini's embedContent api with cosine similarity to match by meaning and dedupe markers across rephrasings. the embed cache is per-page so it's basically free after the first call.

tests in World.test.ts cover the tool wiring + start/stop paths.

Try launching the demo and saying, for example, "place an arrow on my water bottle".

I have some thoughts on updating the state of objects it has already seen, but that's for later; for now this seems ok.

Afaik Gemini will be able to talk and see screens in Android XR, but this allows interaction with the real world through Gemini Live.

I will see if I can get a demo recorded for this. This is open to lots of feedback though, since this is just a very rough version.

Edit, here's a demo! https://youtu.be/-5s_aV6eV_A

I may have to just add key input in again but will double check later when I'm home

world.streamScene(prompt, opts) opens a Gemini Live session and runs a periodic camera-frame loop into it, with auto-dispatch of agentic tools and auto-playback of model audio via CoreSound. Returns a {stop, isActive} handle. Throws cleanly when AI / Live capability / device camera are missing instead of failing deep in the SDK.

world.lookingAt(controllerId?) is sugar over User.getReticleTarget so demos can stay on the world.* namespace.

World now takes registry as a Script dependency so the new primitives can resolve AI / XRDeviceCamera / CoreSound / User without callers wiring it through every method.

11 tests covering missing-AI, non-Live AI, missing-camera, the frame loop, text+audio routing, onAudio override, tool dispatch, unknown tool, and onToolCall intercept.
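For readers who want the shape of that primitive without opening the diff, here is a rough sketch of a periodic frame loop that returns a {stop, isActive} handle and throws up front when a dependency is missing. `startFrameLoop`, `FrameSource`, and `FrameSink` are hypothetical stand-ins for the camera and the Live session, and `isActive` being a method is an assumption; this is not the code in this PR.

```typescript
// Hypothetical stand-ins for the device camera and the Live session.
interface FrameSource {
  captureFrame(): string;
}
interface FrameSink {
  sendFrame(frame: string): void;
}

interface StreamHandle {
  stop(): void;
  isActive(): boolean;
}

function startFrameLoop(
  source: FrameSource | undefined,
  sink: FrameSink | undefined,
  fps: number,
): StreamHandle {
  // Fail early with a readable message instead of deep inside the loop.
  if (!sink) throw new Error('sketch: no Live session available');
  if (!source) throw new Error('sketch: no device camera available');
  if (!(fps > 0)) throw new Error('sketch: fps must be positive');
  const camera: FrameSource = source;
  const live: FrameSink = sink;

  let active = true;
  // Push one frame every 1000 / fps milliseconds until stopped.
  const timer = setInterval(() => {
    if (active) live.sendFrame(camera.captureFrame());
  }, 1000 / fps);

  return {
    stop() {
      active = false;
      clearInterval(timer);
    },
    isActive: () => active,
  };
}
```

A caller would hold on to the handle and call `stop()` from the panel's stop button.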

A small single-file demo that wires xb.core.world.streamScene to a Live session with two demo-local tools: placeLabel drops a marker in front of the camera, and lookCloser reports what the user's reticle is aimed at via xb.core.world.lookingAt. Mirrors the world_ask UI pattern (floating bottom panel, transcript, start/stop) so users have a complete reference for the new primitive without leaving the demos directory.

Switch placeLabel from live reticle sampling to world.objects.runDetection so labels anchor to actual detected objects in world space, not wherever the user was looking when the tool fired. Also render a Troika text label above the marker, not just a bare sphere. Add a SpatialPanel with start/stop/clear controls so the demo is usable in immersive mode, not just from the flat web overlay.

placeLabel now takes a style param so the model can pick how to highlight something: dot for casual noting, arrow for 'point this out for me', pulse for small or hard-to-spot things. Arrow gently bobs, pulse expands and fades on a 1.5s loop.
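The two animations can be written as pure functions of time, which makes them easy to check outside a render loop. This is only a sketch: the 1.5s period comes from the commit message, while the scale/opacity ranges, the bob amplitude, and the bob rate are assumed values, and `pulseState` / `arrowBobOffset` are hypothetical names.

```typescript
const PULSE_PERIOD_S = 1.5; // period quoted in the commit message

// Pulse: scale grows and opacity fades across each cycle, then restarts.
function pulseState(tSeconds: number): { scale: number; opacity: number } {
  const phase = (tSeconds % PULSE_PERIOD_S) / PULSE_PERIOD_S; // 0..1
  return { scale: 1 + phase, opacity: 1 - phase }; // assumed ranges
}

// Arrow: small vertical sine bob; amplitude and rate are assumed values.
function arrowBobOffset(tSeconds: number, amplitudeM = 0.02, hz = 1): number {
  return amplitudeM * Math.sin(2 * Math.PI * hz * tSeconds);
}
```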

Default enableDepth() leaves updateFullResolutionGeometry off, so the depth mesh snapshot used by object detection is too sparse to raycast against. Markers were landing near the camera instead of on the actual detected object. Copy the depth flags the gemini_xrobject demo uses.

salmanmkc · 2026-05-10T14:07:09Z

Turns out I hit the rate limit of 20 object detections per day when I checked the logs. For some reason I thought it was broken.

ObjectDetector now switches targetDevice to 'quest' when the Oculus browser is detected, instead of always falling back to galaxyxr params. Adds QuestCameraParams.ts with approximate Quest 3 passthrough intrinsics (fx/fy ~800 at 1280x720, ~77° HFOV from the cropped getUserMedia stream) and an offset for the RGB camera relative to the right XR eye. These are estimates - WebXR doesn't expose the real values - and may need per-device tweaks. Also swaps the detection debug image dump from auto-downloading PNGs (unusable on Quest browser) to a console-log preview that shows the image inline, and adds a few extra logs in world_companion to help see what placeLabel is actually receiving from the detector.

Quest 3 passthrough cameras are physically angled downward; labels were landing too high above table-surface objects. Apply a -0.26 rad pitch in the right-camera pose so unprojected detections line up with what the user actually sees.
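As a sanity check on the numbers in the two commits above, here is a small pinhole-camera sketch. It assumes the principal point sits at the image centre (the commit does not say), uses a camera space with x right, y up, and -z forward, and applies the pitch to the ray rather than to the camera pose as the commit does. All names are hypothetical.

```typescript
type Vec3 = [number, number, number];

interface Intrinsics {
  fx: number;
  fy: number;
  cx: number;
  cy: number;
}

// Approximate values from the commit message; cx/cy at image centre is assumed.
const QUEST3_APPROX: Intrinsics = { fx: 800, fy: 800, cx: 640, cy: 360 };

// Horizontal field of view implied by a focal length and image width.
function hfovDegrees(fx: number, width: number): number {
  return (2 * Math.atan(width / (2 * fx)) * 180) / Math.PI;
}

// Pixel -> unit ray in camera space.
function pixelToRay(u: number, v: number, k: Intrinsics): Vec3 {
  const x = (u - k.cx) / k.fx;
  const y = -(v - k.cy) / k.fy; // image rows grow downward
  const len = Math.hypot(x, y, 1);
  return [x / len, y / len, -1 / len];
}

// Rotate a ray about the camera x-axis; a negative pitch tilts it downward.
function pitchRay(ray: Vec3, pitchRad: number): Vec3 {
  const [x, y, z] = ray;
  const c = Math.cos(pitchRad);
  const s = Math.sin(pitchRad);
  return [x, c * y - s * z, s * y + c * z];
}
```

With fx = 800 and a 1280 px wide image, `hfovDegrees` gives roughly 77.3°, which agrees with the ~77° figure quoted above.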

Floating world labels were getting cut up by the passthrough depth mesh - letters disappearing where the mesh triangles passed in front of them. Disable depthTest/depthWrite on the troika text and bump renderOrder so labels always draw on top.

Gemini sometimes calls placeLabel multiple times for what's clearly the same physical thing (e.g. "laptop" then "macbook"), and unprojection drift puts the two markers a few cm apart - so the user sees the label twice. Match by text first, then fall back to a 2m proximity check, and update the existing marker in place instead of stacking a new one.

When the Gemini Live websocket drops (1011 internal error) and reconnects, it replays its tool-call context, which fires placeLabel again with the same items. Cache the last call key for 2s and short-circuit the duplicate so we don't redo detection or stack new markers on top of the existing ones.
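A minimal sketch of that replay guard, assuming the call key is just a serialisation of the requested items. `makeReplayGuard` and `ReplayItem` are hypothetical names, and the clock is injectable so the 2s window can be exercised without waiting.

```typescript
const REPLAY_WINDOW_MS = 2000; // the 2s window from the commit message

interface ReplayItem {
  text: string;
  style?: string;
}

function makeReplayGuard(now: () => number = Date.now) {
  let lastKey = '';
  let lastTime = -Infinity;
  // Returns true when this call repeats the previous one inside the window.
  return function isReplay(items: ReplayItem[]): boolean {
    const key = JSON.stringify(
      items.map((item) => [item.text.toLowerCase(), item.style ?? '']),
    );
    const t = now();
    if (key === lastKey && t - lastTime < REPLAY_WINDOW_MS) return true;
    lastKey = key;
    lastTime = t;
    return false;
  };
}
```

The tool handler would call `isReplay(items)` first and return early on `true`, before any detection runs.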

Was useful while debugging Quest calibration and dedup behaviour but just noise in the console for everyone else. Error paths keep their console.warn.

salmanmkc · 2026-05-11T22:30:58Z

Hi Salman,

Do you have an Android XR device to try?

The arrow and the depth query of the object seem mismatched. Nels has an amazing demo here (https://xrblocks.github.io/docs/samples/Gemini-XRObject/) that uses average depth to place the marker via a long pinch gesture. (Yes, we need a panel to prompt the user what to do in this demo as well.)

Prompt the user with what to ask on top of the panel: E.g., try speaking "place an arrow on my laptop". See the Gemini Icebreakers demo: https://xrblocks.github.io/docs/samples/Gemini-Icebreakers/

In XR, only the microphone, stop, and delete buttons were shown. I did see some 2D UI with the transcription after exiting XR. Is this by design?

Ah no I don't have an Android XR device unfortunately, can't order one in the UK 😢 hoping that will change tomorrow

ruofeidu · 2026-05-11T22:33:59Z

I double checked the demo: the arrows were placed at the same distance regardless of how far away I was holding an object on Android XR. Maybe double check https://xrblocks.github.io/docs/samples/Gemini-XRObject/ for existing APIs.

I'll convert to draft now.

salmanmkc · 2026-05-11T22:43:55Z

+                  anchored: true,
+                });
+              } else {
+                placeMarker(fallbackPosition(i), item.text, itemStyle);


@ruofeidu I think you're hitting the fallback here? Are you rate limited by chance or something? Is it just for arrows? Since this happens when it fails to detect or if rate limited normally.

salmanmkc · 2026-05-11T22:47:50Z

With https://xrblocks.github.io/docs/samples/Gemini-XRObject/ in mind, I'm not sure this demo adds new capabilities to XR Blocks, but please correct me if I'm wrong. There is indeed a misalignment with Galaxy XR and Quest, and I hope we can reach a similar outcome eventually.

yeah good question, the overlap is real but i think the pitch is different:

gemini-xrobject is one-shot "tell me about this thing" — long-pinch → detect → tap → ask
world_companion is the opposite, gemini live is always listening + seeing through the camera and it decides when to mark something. so you can just say "find my keys" or "what's that thing on the shelf" without pinching first and it drops a marker mid-convo. the marker styles (arrow / pulse / dot) are there so the model picks how to highlight, not always-a-sphere

the new sdk bits (world.streamScene + world.lookingAt) were added to make that loop possible, opening a live session with the camera streaming + tool calls coming back is what enables the continuous thing. gemini-xrobject doesn't need either since it's one-shot.

ruofeidu · 2026-05-12T01:17:10Z

Great explanation, feel free to add a README to highlight the difference :)

I think you can safely use our simulator to debug it... once working in the simulator it is likely it works in Android XR!

The arrow doesn't really point to the object now.

- depth-raycast fallback when detector misses (replaces fixed -1.2m offset)
- token-overlap match so 'framed art' lands on detected 'painting'
- reject anchor matches further than 8m so distant detections don't fly off-screen
- batchKey on placeMarker so distinct items in one placeLabel call don't dedupe each other
- billboard text via lookAt(camera) so labels rotate as you walk around them
- clear leftover labels at session start
- system prompt: only label when explicitly asked
- showDebugVisualizations off so detector doesn't render extra markers

Previously, when the detector didn't return a matching object the label would still be dropped via a depth-raycast fallback, which placed it in a random spot in front of the user. Now those items are skipped and the tool returns anchored:false / reason:not_found so Gemini can tell the user it can't see the requested object.

Explains the demo and how it differs from Gemini-XRObject, per PR discussion.

Demo crashes with 'toast.show is not a function' when controller is connected because xrblocks-gamepad-toast custom element isn't registered. The SDK's SimulatorInterface assumes the addons bundle has been imported to side-effect-register that element.

Live API sometimes returns property names wrapped in extra literal quotes (e.g. '"style"' instead of style), so item.style ends up undefined and every label falls back to the default. Normalize by stripping leading/trailing quotes from all keys at the tool entry.
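A sketch of that key normalisation, assuming the fix only needs to touch the top-level keys of each tool argument object. `normalizeKeys` is a hypothetical name, not necessarily what the demo calls it.

```typescript
// Strip leading/trailing quote characters from every key, keeping values as-is.
function normalizeKeys(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[key.replace(/^["']+|["']+$/g, '')] = value;
  }
  return out;
}
```

For example, `normalizeKeys({'"style"': 'arrow', text: 'lamp'})` yields an object whose `style` is `'arrow'`, so the requested style is no longer lost.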

Detector and user often use different words for the same object: the detector says 'sofa' when the user says 'couch'. The token-overlap fallback doesn't catch these because they share no tokens. Add a small synonym table and try each expansion against the detected labels.
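Roughly, the expansion step could look like the sketch below. The synonym pairs are the ones mentioned in this PR, but the table contents, the tokeniser, and the function bodies are assumptions rather than the demo's actual code.

```typescript
// Assumed table; the real one in the demo may differ.
const SYNONYMS = new Map<string, string[]>([
  ['couch', ['sofa']],
  ['tv', ['television']],
  ['painting', ['picture']],
]);

function tokens(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Requested label plus one variant per synonym, e.g. "couch" -> ["couch", "sofa"].
function expand(text: string): string[] {
  const base = tokens(text);
  if (base.length === 0) return []; // empty text never matches anything
  const out = [base.join(' ')];
  base.forEach((tok, i) => {
    for (const syn of SYNONYMS.get(tok) ?? []) {
      out.push([...base.slice(0, i), syn, ...base.slice(i + 1)].join(' '));
    }
  });
  return out;
}

// First detected label sharing at least one token with any expansion.
function findMatch(requested: string, detected: string[]): string | null {
  for (const candidate of expand(requested)) {
    const want = new Set(tokens(candidate));
    const hit = detected.find((d) => tokens(d).some((t) => want.has(t)));
    if (hit !== undefined) return hit;
  }
  return null;
}
```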

When a detected match is more than 8m from the camera the centroid projection is unreliable and we used to drop the placement. Instead, shoot a ray through the match direction against the depth mesh and snap the marker to the surface we actually see. Falls back to dropping if no depth hit.

Two issues with the old dedupe path:

1. The 2m proximity fallback would clobber a chair label when a light-switch label landed ~1.5m away. Distinct objects routinely sit within 2m of each other, so this fallback is too aggressive. Drop it: text similarity is enough.
2. When the new style differs from the existing one (e.g. arrow replacing dot), the old path kept the existing geometry and only updated text/position. Now we remove the existing marker and fall through to fresh-marker creation with the correct geometry.
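The resulting behaviour can be sketched as a pure function over a marker list. This simplifies the text match to a case-insensitive equality check (the demo matches by text similarity), and `Marker` / `upsertMarker` are hypothetical names.

```typescript
type Style = 'dot' | 'arrow' | 'pulse';

interface Marker {
  text: string;
  style: Style;
  position: [number, number, number];
}

function upsertMarker(markers: Marker[], next: Marker): Marker[] {
  const norm = (s: string) => s.trim().toLowerCase();
  // Match by text only; there is deliberately no proximity fallback.
  const existing = markers.find((m) => norm(m.text) === norm(next.text));
  if (!existing) return [...markers, next];
  if (existing.style !== next.style) {
    // Style changed: drop the old marker and create a fresh one.
    return [...markers.filter((m) => m !== existing), next];
  }
  // Same style: keep the marker and only move it.
  return markers.map((m) =>
    m === existing ? { ...m, position: next.position } : m,
  );
}
```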

Gemini Live often narrates 'placing a dot on the sofa now' but never actually invokes placeLabel. Add a system-prompt rule that ties the narration to the tool call so it can't promise a placement and forget to do it.

Even with the same-turn rule the model still likes to narrate the result first ('I've placed a dot on the lamp') and then call the tool, which causes mismatches when the tool returns no placements. Make the ordering explicit: call placeLabel first, then describe results based on what came back in the placed array.

salmanmkc · 2026-05-12T07:43:10Z

wondering if I should use the gemini embedContent API since that would help with similarity matching. a vector db is overkill, and rolling my own local one wouldn't be a good idea either

The follow-up sentence I added was making gemini hyper-eager to call placeLabel a second time on its own — re-placing the same set of items 30s after the original call with slightly different wording. Keep the same-turn rule, drop the result-narration rule.

Gemini Live occasionally serialises tool-call array entries as JSON strings instead of objects, which made normItem iterate the chars of the string and produce a garbled object whose .text was undefined. That undefined text passed straight through expand(), which returned [''], and ''.includes('') matched the first detected object — so 'Picture frame' kept getting placed on 'coffee table' and Gemini retried in a loop. Try JSON.parse on string items, drop anything that doesn't end up as a proper object, and bail out of expand/findMatch when the text is empty so we don't pretend an empty string is a match.
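A sketch of that input hardening, under the assumption that each item only needs a non-empty `text` and an optional string `style`. `coerceItems` and `CleanItem` are hypothetical names, not the demo's normItem.

```typescript
interface CleanItem {
  text: string;
  style?: string;
}

function coerceItems(raw: unknown): CleanItem[] {
  const list: unknown[] = Array.isArray(raw) ? raw : [];
  const out: CleanItem[] = [];
  for (let entry of list) {
    // Entries sometimes arrive as JSON strings; try to recover the object.
    if (typeof entry === 'string') {
      try {
        entry = JSON.parse(entry);
      } catch {
        continue; // not JSON: drop it
      }
    }
    if (typeof entry !== 'object' || entry === null || Array.isArray(entry)) {
      continue; // anything that is not a plain object is dropped
    }
    const { text, style } = entry as { text?: unknown; style?: unknown };
    // Empty or missing text must never reach the matcher.
    if (typeof text !== 'string' || text.trim() === '') continue;
    const item: CleanItem = { text: text.trim() };
    if (typeof style === 'string') item.style = style;
    out.push(item);
  }
  return out;
}
```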

Token overlap and the synonym table miss obvious cases like Television vs TV, light vs lighting fixture, or pendant light vs floor lamp, so Gemini ends up either retrying or piling duplicate markers on the same object. Add a small embedding helper that calls Gemini's embedContent (gemini-embedding-001) with a per-page cache, warm the cache once per placeLabel call for the requested + detected + already-placed labels, and fall back to cosine similarity above 0.7 in two places: when findMatch can't find a token-overlap candidate, and when placeMarker can't find an existing marker by text-includes. Cache reads are sync via simSync, so the dedupe path stays non-async. If the embed call fails or the AI client isn't ready, we just return null and behaviour is identical to before this change.
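The shape of that helper might look like the sketch below: an async warm-up fills a per-page cache, and matching then reads the cache synchronously. The `embed` callback is a hypothetical stand-in for the embedContent call, and apart from `simSync` and the 0.7 threshold (both taken from the commit message) the names are invented for illustration.

```typescript
const SIM_THRESHOLD = 0.7; // threshold quoted in the commit message
const cache = new Map<string, number[]>();

// Hypothetical embedder: resolves to a vector, or null on failure.
type Embedder = (text: string) => Promise<number[] | null>;

// Embed each label at most once; failures simply leave the cache empty.
async function warm(labels: string[], embed: Embedder): Promise<void> {
  for (const label of labels) {
    const key = label.toLowerCase();
    if (cache.has(key)) continue;
    const vector = await embed(label).catch(() => null);
    if (vector) cache.set(key, vector);
  }
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Sync lookup: null when either label has not been embedded yet.
function simSync(a: string, b: string): number | null {
  const va = cache.get(a.toLowerCase());
  const vb = cache.get(b.toLowerCase());
  return va && vb ? cosine(va, vb) : null;
}

// Best candidate above the threshold, or null so callers keep the old behaviour.
function bestByMeaning(requested: string, candidates: string[]): string | null {
  let best: string | null = null;
  let bestScore = SIM_THRESHOLD;
  for (const candidate of candidates) {
    const score = simSync(requested, candidate);
    if (score !== null && score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Keeping the lookup synchronous means the dedupe path does not need to become async; only the warm-up awaits the network.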

salmanmkc · 2026-05-12T08:50:28Z

Nice, I think it's in a final state now and ready for review again. I've added Gemini embeddings so similar words now match based on vector similarity. It's done in the cloud rather than locally, but since we're already making API calls I don't think that's a problem!

dli7319

I'm not quite sure if world.streamScene is something that belongs in the SDK directly or in addons/. @ruofeidu was this in your plans?

We already have https://github.com/google/xrblocks/blob/main/src/addons/ai/GeminiManager.ts which is very similar and doesn't abstract everything into a single function call.

dli7319 · 2026-05-12T23:22:32Z

        .slice(0, 19)
        .replace('T', '_')
        .replace(/:/g, '-');
-      const link = document.createElement('a');


Can you revert this?

yes will do

dli7319 · 2026-05-12T23:24:33Z

+   * @throws If no AI is registered, the active model isn't Live-capable, or
+   *     no XRDeviceCamera is registered.
+   */
+  async streamScene(


Can we refactor this into its own file, e.g. world/GeminiStreaming.ts?

I guess this depends on whether it ends up in the SDK or in addons. Thoughts, @ruofeidu?

salmanmkc added 11 commits May 9, 2026 13:33

Apply prettier formatting to world.streamScene + tests

7c29eed

Use IconButton + Orbiter for world_companion XR panel

2a5fa3c

Fix duplicate startBtn/stopBtn/clearBtn declarations

9953571

Shrink world_companion XR panel and push it further

a78a2ee

Flatten world_companion XR panel layout

aa63f1e

Head-lock world_companion panel to camera

b90acc9

salmanmkc marked this pull request as draft May 10, 2026 11:39

salmanmkc added 7 commits May 10, 2026 13:35

Attach world_companion panel to user rig instead of bare camera

92e84b3

Let placeLabel mark multiple objects in one call

9ada520

Accept single text/style alongside items[] in placeLabel

ccfd954

Log placeLabel detection diagnostics

e04efb9

Stop placing labels on wrong objects when no name match

825867b

Fall back to in-front-of-user placement when detection misses

9109a67

Prefer any detected object over in-front fallback

d162207

Merge branch 'main' into feat/world-stream-scene

e987544

salmanmkc marked this pull request as ready for review May 11, 2026 06:44

salmanmkc added 6 commits May 11, 2026 07:47

world_companion: drop placeLabel/placeMarker debug logging

ea50d91

salmanmkc force-pushed the feat/world-stream-scene branch from c6512b3 to ea50d91 Compare May 11, 2026 06:47

salmanmkc changed the title ~~Add world.streamScene + world.lookingAt primitives, plus world_companion demo~~ world.streamScene + world.lookingAt + world_companion demo (now with Quest 3 camera support) May 11, 2026

ruofeidu added the demo New demo for XR Blocks demonstrating novel interactivity or perception features. label May 11, 2026

ruofeidu marked this pull request as draft May 11, 2026 22:34

salmanmkc commented May 11, 2026

salmanmkc force-pushed the feat/world-stream-scene branch from 24778b5 to 0f196de Compare May 12, 2026 06:51

salmanmkc added 10 commits May 12, 2026 07:52

Merge remote-tracking branch 'origin/main' into feat/world-stream-scene

9ac484a

world_companion: add README

d5551bd

world_companion: nudge gemini to call placeLabel in same turn

4aae642

salmanmkc added 4 commits May 12, 2026 08:51

world_companion: README — multi-item items[] and embedding dedupe

b3171a1

salmanmkc marked this pull request as ready for review May 12, 2026 08:50

salmanmkc changed the title ~~world.streamScene + world.lookingAt + world_companion demo (now with Quest 3 camera support)~~ world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera) May 12, 2026

Merge branch 'main' into feat/world-stream-scene

5df7cbd

dli7319 reviewed May 12, 2026

Merge branch 'main' into feat/world-stream-scene

b0c05e6
