world.streamScene + world.lookingAt + world_companion demo (multi-item placeLabel, embedding dedupe, Quest 3 camera) #268
world.streamScene(prompt, opts) opens a Gemini Live session and runs a
periodic camera-frame loop into it, with auto-dispatch of agentic tools and
auto-playback of model audio via CoreSound. Returns a {stop, isActive}
handle. Throws cleanly when AI / Live capability / device camera are
missing instead of failing deep in the SDK.
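A minimal sketch of the shape described above: an up-front capability check that throws early, plus a periodic loop that returns a {stop, isActive} handle. This is not the SDK's implementation. The names requireCapabilities and startFrameLoop, the capability object, and isActive being a function rather than a property are all assumptions made for illustration.

```typescript
// Hypothetical sketch, not the xrblocks implementation.
type StreamHandle = { stop: () => void; isActive: () => boolean };

interface Capabilities {
  ai?: unknown; // assumed: the registered AI client, if any
  live?: boolean; // assumed: whether the active model supports Live
  camera?: unknown; // assumed: the registered device camera, if any
}

// Fail early with a readable message instead of deep inside a later call.
function requireCapabilities(caps: Capabilities): void {
  if (!caps.ai) throw new Error('streamScene: no AI registered');
  if (!caps.live) throw new Error('streamScene: active model is not Live-capable');
  if (!caps.camera) throw new Error('streamScene: no device camera registered');
}

// Calls sendFrame every intervalMs until stop() is called.
function startFrameLoop(sendFrame: () => void, intervalMs: number): StreamHandle {
  let active = true;
  const timer = setInterval(() => {
    if (active) sendFrame();
  }, intervalMs);
  return {
    stop(): void {
      if (!active) return; // stop() is idempotent
      active = false;
      clearInterval(timer);
    },
    isActive: (): boolean => active,
  };
}
```

Checking capabilities before starting the timer means a misconfigured page never starts sending frames at all.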
world.lookingAt(controllerId?) is sugar over User.getReticleTarget so demos
can stay on the world.* namespace.
World now takes registry as a Script dependency so the new primitives can
resolve AI / XRDeviceCamera / CoreSound / User without callers wiring it
through every method.
11 tests covering missing-AI, non-Live AI, missing-camera, the frame loop,
text+audio routing, onAudio override, tool dispatch, unknown tool, and
onToolCall intercept.
A small single-file demo that wires xb.core.world.streamScene to a Live session with two demo-local tools: placeLabel drops a marker in front of the camera, and lookCloser reports what the user's reticle is aimed at via xb.core.world.lookingAt. Mirrors the world_ask UI pattern (floating bottom panel, transcript, start/stop) so users have a complete reference for the new primitive without leaving the demos directory.
Switch placeLabel from live reticle sampling to world.objects.runDetection so labels anchor to actual detected objects in world space, not wherever the user was looking when the tool fired. Also render a Troika text label above the marker, not just a bare sphere. Add a SpatialPanel with start/stop/clear controls so the demo is usable in immersive mode, not just from the flat web overlay.
placeLabel now takes a style param so the model can pick how to highlight something: dot for casual noting, arrow for 'point this out for me', pulse for small or hard-to-spot things. Arrow gently bobs, pulse expands and fades on a 1.5s loop.
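As a rough illustration of the two animations, the functions below compute per-frame values from elapsed time. Only the 1.5s pulse period comes from the commit message; the scale range, the bob amplitude, and the bob period are assumptions, and the function names are hypothetical.

```typescript
// Hypothetical sketch: time-based animation values for the pulse and arrow styles.

// Pulse: expands and fades over one loop, then restarts.
function pulseState(tSec: number, periodSec = 1.5): { scale: number; opacity: number } {
  const phase = (tSec % periodSec) / periodSec; // 0 at loop start, approaches 1 at loop end
  return {
    scale: 1 + phase, // assumed: grows from 1x to 2x
    opacity: 1 - phase, // fades from opaque to transparent
  };
}

// Arrow: gentle vertical bob. Amplitude (metres) and period are assumed values.
function arrowBobOffset(tSec: number, amplitudeM = 0.03, periodSec = 2): number {
  return amplitudeM * Math.sin((2 * Math.PI * tSec) / periodSec);
}
```

Driving both from elapsed time rather than a frame counter keeps the loop length the same regardless of frame rate.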
Default enableDepth() leaves updateFullResolutionGeometry off, so the depth mesh snapshot used by object detection is too sparse to raycast against. Markers were landing near the camera instead of on the actual detected object. Copy the depth flags the gemini_xrobject demo uses.
Turns out I hit the rate limit of 20 object detections per day. I only saw it when I checked the logs; for some reason I thought it was broken.
ObjectDetector now switches targetDevice to 'quest' when the Oculus browser is detected, instead of always falling back to galaxyxr params. Adds QuestCameraParams.ts with approximate Quest 3 passthrough intrinsics (fx/fy ~800 at 1280x720, ~77° HFOV from the cropped getUserMedia stream) and an offset for the RGB camera relative to the right XR eye. These are estimates - WebXR doesn't expose the real values - and may need per-device tweaks. Also swaps the detection debug image dump from auto-downloading PNGs (unusable on Quest browser) to a console-log preview that shows the image inline, and adds a few extra logs in world_companion to help see what placeLabel is actually receiving from the detector.
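The quoted numbers are consistent with a pinhole camera model, which can be checked with a short sketch. The fx/fy and resolution values are the estimates from the commit message; placing the principal point (cx, cy) at the image centre and the camera looking down -Z with +Y up are assumptions, and the names below are hypothetical rather than taken from QuestCameraParams.ts.

```typescript
// Hypothetical sketch of a pinhole model using the estimated intrinsics.
interface Intrinsics { fx: number; fy: number; cx: number; cy: number; }

// fx/fy from the commit message; cx/cy at the image centre is an assumption.
const QUEST3_ESTIMATE: Intrinsics = { fx: 800, fy: 800, cx: 640, cy: 360 };

// Horizontal field of view in degrees for a pinhole camera.
function horizontalFovDeg(fx: number, widthPx: number): number {
  return (2 * Math.atan(widthPx / (2 * fx)) * 180) / Math.PI;
}

// Unit ray through pixel (u, v) in camera space (camera looks down -Z, +Y is up).
function pixelToRay(u: number, v: number, k: Intrinsics): [number, number, number] {
  const x = (u - k.cx) / k.fx;
  const y = -(v - k.cy) / k.fy; // image rows grow downward, camera Y points up
  const z = -1;
  const len = Math.hypot(x, y, z);
  return [x / len, y / len, z / len];
}
```

With fx = 800 and a 1280 px wide image, horizontalFovDeg gives 2·atan(640/800), about 77.3°, which matches the ~77° HFOV quoted above.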
Quest 3 passthrough cameras are physically angled downward; labels were landing too high above table-surface objects. Apply a -0.26 rad pitch in the right-camera pose so unprojected detections line up with what the user actually sees.
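To see what a -0.26 rad pitch does to an unprojected ray, here is a small sketch that rotates a direction about the camera's X axis. It assumes a right-handed frame with the camera looking down -Z and +Y up (the three.js convention); the function name is hypothetical and the actual pose code in the PR may apply the rotation differently.

```typescript
// Hypothetical sketch: rotate a direction about the X axis by a pitch angle.
type Vec3 = [number, number, number];

function pitchRay(dir: Vec3, pitchRad: number): Vec3 {
  const [x, y, z] = dir;
  const c = Math.cos(pitchRad);
  const s = Math.sin(pitchRad);
  // Standard right-handed rotation about X.
  return [x, y * c - z * s, y * s + z * c];
}

const FORWARD: Vec3 = [0, 0, -1];
// -0.26 rad is roughly -15 degrees: the forward ray gains a negative Y component,
// so unprojected detections land lower than they would with an unpitched camera.
const tilted = pitchRay(FORWARD, -0.26);
```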
Floating world labels were getting cut up by the passthrough depth mesh - letters disappearing where the mesh triangles passed in front of them. Disable depthTest/depthWrite on the troika text and bump renderOrder so labels always draw on top.
Gemini sometimes calls placeLabel multiple times for what's clearly the same physical thing (e.g. "laptop" then "macbook"), and unprojection drift puts the two markers a few cm apart - so the user sees the label twice. Match by text first, then fall back to a 2m proximity check, and update the existing marker in place instead of stacking a new one.
When the Gemini Live websocket drops (1011 internal error) and reconnects, it replays its tool-call context, which fires placeLabel again with the same items. Cache the last call key for 2s and short-circuit the duplicate so we don't redo detection or stack new markers on top of the existing ones.
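One way the 2s replay guard could look is sketched below. How the demo actually builds its call key is not shown in this thread, so callKey and createReplayGuard are hypothetical, as is the choice not to refresh the timestamp when a duplicate is seen.

```typescript
// Hypothetical sketch: suppress an identical tool call replayed within a short window.
interface LabelRequest { text: string; style?: string; }

// Build a stable key from the requested items (assumed: text + style identify a call).
function callKey(items: LabelRequest[]): string {
  return JSON.stringify(items.map((i) => [i.text.trim().toLowerCase(), i.style ?? '']));
}

function createReplayGuard(windowMs = 2000): (key: string, nowMs?: number) => boolean {
  let lastKey: string | null = null;
  let lastTime = -Infinity;
  // Returns true when the call should be skipped as a duplicate.
  return (key: string, nowMs: number = Date.now()): boolean => {
    if (key === lastKey && nowMs - lastTime < windowMs) return true;
    lastKey = key; // remember only calls that were actually handled
    lastTime = nowMs;
    return false;
  };
}
```

Because the timestamp is only recorded for handled calls, the cache expires 2s after the original call even if replays keep arriving.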
Was useful while debugging Quest calibration and dedup behaviour but just noise in the console for everyone else. Error paths keep their console.warn.
c6512b3 to
ea50d91
Ah no, I don't have an Android XR device unfortunately, can't order one in the UK 😢 hoping that will change tomorrow
I double-checked the demo: the arrows were placed at the same distance regardless of how far away I'm holding an object on Android XR. Maybe double-check https://xrblocks.github.io/docs/samples/Gemini-XRObject/ for existing APIs. I'll convert to draft now.
    anchored: true,
  });
} else {
  placeMarker(fallbackPosition(i), item.text, itemStyle);
@ruofeidu I think you're hitting the fallback here? Are you rate limited by chance? Is it just for arrows? This normally happens when detection fails or when the request is rate limited.
yeah good question, the overlap is real but i think the pitch is different:
the new sdk bits (world.streamScene + world.lookingAt) were added to make that loop possible: opening a live session with the camera streaming + tool calls coming back is what enables the continuous thing. gemini-xrobject doesn't need either since it's one-shot.
- depth-raycast fallback when detector misses (replaces fixed -1.2m offset)
- token-overlap match so 'framed art' lands on detected 'painting'
- reject anchor matches further than 8m so distant detections don't fly off-screen
- batchKey on placeMarker so distinct items in one placeLabel call don't dedupe each other
- billboard text via lookAt(camera) so labels rotate as you walk around them
- clear leftover labels at session start
- system prompt: only label when explicitly asked
- showDebugVisualizations off so detector doesn't render extra markers
24778b5 to
0f196de
Previously, when the detector didn't return a matching object the label would still be dropped via a depth-raycast fallback, which placed it in a random spot in front of the user. Now those items are skipped and the tool returns anchored:false / reason:not_found so Gemini can tell the user it can't see the requested object.
Explains the demo and how it differs from Gemini-XRObject, per PR discussion.
Demo crashes with 'toast.show is not a function' when controller is connected because xrblocks-gamepad-toast custom element isn't registered. The SDK's SimulatorInterface assumes the addons bundle has been imported to side-effect-register that element.
Live API sometimes returns property names wrapped in extra literal quotes (e.g. '"style"' instead of style), so item.style ends up undefined and every label falls back to the default. Normalize by stripping leading/trailing quotes from all keys at the tool entry.
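The key normalisation can be sketched in a few lines. The function name is hypothetical, and this version handles only one level of keys; whether the demo also recurses into nested items is not stated in the commit message.

```typescript
// Hypothetical sketch: strip stray leading/trailing quote characters from object keys.
function stripQuotedKeys(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    // '"style"' becomes 'style'; keys without stray quotes pass through unchanged.
    out[key.replace(/^["']+|["']+$/g, '')] = value;
  }
  return out;
}
```

Running this once at the tool entry point means the rest of the handler can read item.style without caring how the key was serialised.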
Detector and user often use different words for the same object: the detector says 'sofa' when the user says 'couch'. The token-overlap fallback doesn't catch these because they share no tokens. Add a small synonym table and try each expansion against the detected labels.
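A sketch of how a synonym table might feed the matcher follows. The table entries here are illustrative and are not the demo's actual list, and the bodies of expand and findMatch are assumptions (the names appear elsewhere in this thread but their implementations are not shown).

```typescript
// Hypothetical sketch: expand a requested label through a synonym table,
// then try each expansion against the detector's labels.
const SYNONYMS: string[][] = [
  ['sofa', 'couch'],
  ['tv', 'television'],
  ['picture', 'painting'],
];

function expand(text: string): string[] {
  const t = text.trim().toLowerCase();
  if (!t) return []; // never treat an empty string as a candidate
  const out = new Set<string>([t]);
  for (const group of SYNONYMS) {
    if (group.includes(t)) group.forEach((s) => out.add(s));
  }
  return Array.from(out);
}

// Returns the first detected label that contains any expansion, or null.
function findMatch(requested: string, detected: string[]): string | null {
  for (const term of expand(requested)) {
    const hit = detected.find((d) => d.toLowerCase().includes(term));
    if (hit !== undefined) return hit;
  }
  return null;
}
```

For example, findMatch('couch', ['coffee table', 'sofa']) tries 'couch' first, finds nothing, then matches 'sofa' through the table.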
When a detected match is more than 8m from the camera the centroid projection is unreliable and we used to drop the placement. Instead, shoot a ray through the match direction against the depth mesh and snap the marker to the surface we actually see. Falls back to dropping if no depth hit.
Two issues with the old dedupe path:
1. The 2m proximity fallback would clobber a chair label when a light-switch label landed ~1.5m away. Distinct objects routinely sit within 2m of each other, so this fallback is too aggressive. Drop it; text similarity is enough.
2. When the new style differs from the existing one (e.g. arrow replacing dot), the old path kept the existing geometry and only updated text/position. Now we remove the existing marker and fall through to fresh-marker creation with the correct geometry.
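The two fixes above can be sketched as a pure function over a marker list. The Marker shape and the names sameText and upsertMarker are hypothetical; the real demo mutates scene objects rather than returning arrays.

```typescript
// Hypothetical sketch: dedupe by text only, and rebuild the marker when the style changes.
type Style = 'dot' | 'arrow' | 'pulse';
interface Marker { text: string; style: Style; position: [number, number, number]; }

const norm = (s: string): string => s.trim().toLowerCase();

// Text-includes match in either direction; empty text never matches.
function sameText(a: string, b: string): boolean {
  const x = norm(a);
  const y = norm(b);
  if (!x || !y) return false;
  return x.includes(y) || y.includes(x);
}

function upsertMarker(markers: Marker[], next: Marker): Marker[] {
  const idx = markers.findIndex((m) => sameText(m.text, next.text));
  if (idx === -1) return [...markers, next]; // no proximity fallback: nearby objects stay distinct
  if (markers[idx].style !== next.style) {
    // Style changed: drop the old marker and create a fresh one with the right geometry.
    return [...markers.filter((_, i) => i !== idx), next];
  }
  // Same style: update text and position in place.
  return markers.map((m, i) => (i === idx ? { ...m, text: next.text, position: next.position } : m));
}
```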
Gemini Live often narrates 'placing a dot on the sofa now' but never actually invokes placeLabel. Add a system-prompt rule that ties the narration to the tool call so it can't promise a placement and forget to do it.
Even with the same-turn rule the model still likes to narrate the
result first ('I've placed a dot on the lamp') and then call the tool,
which causes mismatches when the tool returns no placements. Make the
ordering explicit: call placeLabel first, then describe results based
on what came back in the placed array.
Wondering if I should use the Gemini embedContent API, as that would help with similarity matching. A vector DB is overkill, and it wouldn't be good to build my own local one.
The follow-up sentence I added was making gemini hyper-eager to call placeLabel a second time on its own — re-placing the same set of items 30s after the original call with slightly different wording. Keep the same-turn rule, drop the result-narration rule.
Gemini Live occasionally serialises tool-call array entries as JSON
strings instead of objects, which made normItem iterate the chars of
the string and produce a garbled object whose .text was undefined.
That undefined text passed straight through expand(), which returned
[''], and ''.includes('') matched the first detected object — so
'Picture frame' kept getting placed on 'coffee table' and Gemini
retried in a loop.
Try JSON.parse on string items, drop anything that doesn't end up as a
proper object, and bail out of expand/findMatch when the text is
empty so we don't pretend an empty string is a match.
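The failure chain and the fix can be illustrated with a short sketch. The LabelItem shape and normalizeItems are hypothetical stand-ins for the demo's normItem path; the empty-string behaviour of String.prototype.includes is standard JavaScript.

```typescript
// Hypothetical sketch: accept tool-call items that arrive as objects or as JSON strings.
interface LabelItem { text: string; style?: string; }

function normalizeItems(raw: unknown[]): LabelItem[] {
  const out: LabelItem[] = [];
  for (let entry of raw) {
    if (typeof entry === 'string') {
      try {
        entry = JSON.parse(entry); // string-serialised item
      } catch {
        continue; // not JSON: drop it
      }
    }
    if (entry === null || typeof entry !== 'object' || Array.isArray(entry)) continue;
    const { text, style } = entry as { text?: unknown; style?: unknown };
    if (typeof text !== 'string' || text.trim() === '') continue; // no empty labels
    out.push({ text: text.trim(), style: typeof style === 'string' ? style : undefined });
  }
  return out;
}

// Why empty text is dangerous: every string "includes" the empty string.
const emptyMatchesAnything = 'coffee table'.includes(''); // true
```

Dropping malformed entries at the boundary keeps the matcher from ever seeing an empty label, which is what caused the first detected object to win every time.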
Token overlap and the synonym table miss obvious cases like Television vs TV, light vs lighting fixture, or pendant light vs floor lamp, so Gemini ends up either retrying or piling duplicate markers on the same object. Add a small embedding helper that calls Gemini's embedContent (gemini-embedding-001) with a per-page cache, warm the cache once per placeLabel call for the requested + detected + already-placed labels, and fall back to cosine similarity above 0.7 in two places: when findMatch can't find a token-overlap candidate, and when placeMarker can't find an existing marker by text-includes. Cache reads are sync via simSync, so the dedupe path stays non-async. If the embed call fails or the AI client isn't ready, we just return null and behaviour is identical to before this change.
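A sketch of the synchronous cache-backed similarity check follows. The cache is filled here with made-up toy vectors rather than real gemini-embedding-001 output, and the body of simSync plus the helper isSameObject are assumptions about how such a lookup could work, not the demo's code.

```typescript
// Hypothetical sketch: cosine similarity over a pre-warmed embedding cache.
const embeddingCache = new Map<string, number[]>();

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  if (na === 0 || nb === 0) return 0;
  return dot / Math.sqrt(na * nb);
}

// Synchronous read: returns null when either label has not been embedded yet,
// so callers fall back to their previous behaviour.
function simSync(a: string, b: string): number | null {
  const va = embeddingCache.get(a.trim().toLowerCase());
  const vb = embeddingCache.get(b.trim().toLowerCase());
  if (!va || !vb) return null;
  return cosine(va, vb);
}

function isSameObject(a: string, b: string, threshold = 0.7): boolean {
  const s = simSync(a, b);
  return s !== null && s > threshold;
}
```

Keeping the network call in a separate warm-up step and the comparison synchronous is what lets the dedupe path stay non-async, as described above.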
Nice, I think it's at a finalized state now and ready for review again. I've added Gemini embeddings so similar words match based on vector similarity. It's done in the cloud, not locally, but since we're already making API calls I think that's no problem!
dli7319 left a comment
I'm not quite sure if world.streamScene is something that belongs in the SDK directly or in addons/. @ruofeidu was this in your plans?
We already have https://github.com/google/xrblocks/blob/main/src/addons/ai/GeminiManager.ts which is very similar and doesn't abstract everything into a single function call.
  .slice(0, 19)
  .replace('T', '_')
  .replace(/:/g, '-');
const link = document.createElement('a');
 * @throws If no AI is registered, the active model isn't Live-capable, or
 * no XRDeviceCamera is registered.
 */
async streamScene(
Can we refactor this into its own file, e.g. world/GeminiStreaming.ts?
I guess this depends on whether it's in the SDK or addons. Thoughts, @ruofeidu?

been wanting a voice + vision companion in xr for a while. start it, talk to it, it sees what you see and can drop markers on stuff.
needed two sdk bits to make it work so they're in here too:
- world.streamScene(prompt, opts) - opens a gemini live session w/ the camera streaming at N fps, tool calls come back via callbacks
- world.lookingAt() - whatever the reticle's on, or null
demo's in demos/world_companion. there's a placeLabel tool with three styles (dot, arrow, pulse) so it can pick how to highlight - arrow if you ask it to find something, pulse for tiny stuff, dot otherwise. uses world.objects.runDetection so markers stick to the real object via depth, not to wherever your head was when the tool fired (it will do this on the desktop simulator though). small spatial panel for start/stop/clear so it's actually usable once you're in immersive.
you can ask for several things in one go ("label the couch, tv, and coffee table") and the tool takes an items[] array so they all get placed in a single call, each with its own style. labels billboard back at the camera so they stay readable when you walk around.
detector labels and what you say don't always match: television vs tv, pendant light vs floor lamp, picture vs painting. there's a small synonym table for the obvious cases, then a fallback to gemini's embedContent api with cosine similarity to match by meaning and dedupe markers across rephrasings. the embed cache is per-page so it's basically free after the first call.
tests in World.test.ts cover the tool wiring + start/stop paths.
Try launching the demo and say, for example, "place an arrow on my water bottle".
I have some thoughts on updating the state of objects it has already seen, which I'll come back to later; for now this seems ok.
Gemini will be able to talk and see screens afaik in Android XR, however this will allow interaction in the real world + gemini live.
I will see if I can get a demo recorded for this. This is open to lots of feedback though, since this is just a very rough version.
Edit, here's a demo! https://youtu.be/-5s_aV6eV_A
I may have to just add key input in again but will double check later when I'm home