A voice-controlled, multi-agent AI system that sees, understands, plans, and acts on your Android device in real-time.
AURA is a production-grade Android UI automation backend. A user speaks a natural language command — AURA captures the screen, parses the UI tree, plans a series of atomic actions, executes gestures on the real device via Android Accessibility API, and speaks a natural response back.
Example: "Open Spotify and play my liked songs" → AURA opens the app, locates the Liked Songs button visually, taps it, starts playback, and says "Done — your liked songs are playing."
Built as a submission to the Gemini Live Agent Challenge, AURA integrates Google ADK, Gemini Live bidirectional audio+vision, and Cloud Run deployment.
- Architecture
- The 9 Agents
- LangGraph Orchestration
- Tri-Provider Model Architecture
- Perception Pipeline
- Reactive Step Generator
- Google Cloud Integration
- Android Companion App
- WebSocket Endpoints
- REST API
- Installation
- Configuration
- Running
- Cloud Run Deployment
- Safety & Policies
┌─────────────────────────────────────────────────────────────────────┐
│ USER SPEAKS COMMAND │
│ (Android companion app, mic button) │
└─────────────────────────┬───────────────────────────────────────────┘
│ WebSocket /ws/live or /ws/audio
▼
┌─────────────────────────────────────────────────────────────────────┐
│ FastAPI + LangGraph │
│ │
│ Audio → [STT: Groq Whisper] → transcript │
│ ↓ │
│ [Commander Agent] → IntentObject (action, target, confidence) │
│ ↓ │
│ [Safety: Llama Prompt Guard 2 86M] │
│ ↓ │
│ Smart Router ─────────────────────────────────────────────────┐ │
│ NO_UI actions → Coordinator (skip perception) │ │
│ Complex/multi-step → Coordinator │ │
│ UI actions → Perception → Coordinator │ │
│ │ │
│ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ COORDINATOR (perceive→decide→act→verify) │ │ │
│ │ │ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌───────┐ ┌──────────┐ │ │ │
│ │ │Perceiver│ → │ Planner │ → │ Actor │ → │Verifier │ │ │ │
│ │ │UI+Vision│ │ Phases │ │ Zero- │ │Post-state│ │ │ │
│ │ │ SoM │ │Reactive │ │ LLM │ │ capture │ │ │ │
│ │ └─────────┘ └─────────┘ └───────┘ └──────────┘ │ │ │
│ │ ↑ │ │ │ │
│ │ └────── 5-step Retry ───────┘ │ │ │
│ │ Ladder + Replan │ │ │
│ └──────────────────────────────────────────────────────────┘ │ │
│ ↓ │ │
│ [Responder Agent] → natural language reply │ │
│ ↓ │
│ [TTS: Edge-TTS en-US-AriaNeural] → WAV audio │
└─────────────────────────────────────────────────────────────────────┘
│
▼
Android device executes gesture
via Accessibility API (no root)
Request lifecycle:
- Audio arrives over WebSocket → STT transcription via Groq Whisper Large v3 Turbo
- Intent parsed by `CommanderAgent` (rule-based + LLM fallback)
- Safety screened by Llama Prompt Guard 2 86M
- Smart routing based on action type and complexity
- Coordinator drives 9 agents through the perceive→decide→act→verify loop
- Gesture executed via Android Accessibility API after OPA Rego policy check
- Response spoken via Edge-TTS (Microsoft, no API key required)
All agents are single-responsibility — located in agents/.
| Agent | File | Responsibility |
|---|---|---|
| Commander | `commander.py` | Parses voice transcript → structured IntentObject using rule-based classifier (fast) with LLM fallback |
| Planner | `planner_agent.py` | Decomposes goal into skeleton phases + ordered atomic Subgoal list (max 12 words each) |
| Perceiver | `perceiver_agent.py` | Captures screenshot + UI tree → ScreenState with Set-of-Marks annotations; never returns raw pixel coordinates |
| Actor | `actor_agent.py` | Zero-LLM deterministic gesture executor (tap, type, scroll, swipe, open_app, etc.) |
| Responder | `responder.py` | Generates natural TTS-ready conversational responses via LLM; strips markdown for clean speech |
| Verifier | `verifier_agent.py` | Waits for UI to settle post-gesture, captures post-action state, detects error dialogs |
| Validator | `validator.py` | Fast rule-based intent validation — no LLM calls, zero latency |
| ScreenVLM | `visual_locator.py` | Tri-layer visual perception: UI tree → YOLOv8 CV → VLM selection from numbered Set-of-Marks elements |
| Coordinator | `coordinator.py` | Orchestrates the other 8 agents through perceive→decide→act→verify with 5-step retry ladder and adaptive replanning |
The VLM never returns raw pixel coordinates. It only selects among numbered Set-of-Marks elements detected by YOLOv8. This invariant must never be broken.
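To make the invariant concrete, here is a minimal sketch of index-only selection (the `SomElement` and `parse_som_selection` names are illustrative, not the actual `visual_locator.py` API): the VLM reply is accepted only if it is a bare index into the YOLOv8 detections, and coordinates stay server-side.

```python
# Illustrative sketch only — not the actual visual_locator.py implementation.
# The VLM may only answer with a Set-of-Marks index, which is validated
# against the YOLOv8 detections before any tap is issued.
from dataclasses import dataclass

@dataclass
class SomElement:
    index: int                            # number drawn on the annotated screenshot
    bbox: tuple[int, int, int, int]       # kept server-side, never handed to the VLM as a target

def parse_som_selection(vlm_reply: str, elements: list[SomElement]) -> SomElement:
    """Accept only a bare element index; reject anything that looks like coordinates."""
    reply = vlm_reply.strip()
    if not reply.isdigit():
        raise ValueError(f"VLM must return an element index, got: {reply!r}")
    by_index = {e.index: e for e in elements}
    idx = int(reply)
    if idx not in by_index:
        raise ValueError(f"Index {idx} is not one of the {len(elements)} detected elements")
    return by_index[idx]                  # the Actor resolves this to a tap point internally
```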
File: aura_graph/graph.py
The graph is a StateGraph(TaskState) with 15 nodes, compiled via compile_aura_graph() at server startup.
__start__
│
├─ audio → [stt] → [parse_intent]
└─ text/streaming → [parse_intent]
│
┌───────────────┼───────────────────┐
│ │ │
[coordinator] [perception]→[coordinator] [speak]
│ │
└───────────────────────────────────┘
│
[speak] → END
Conditional routing (edges.py):
- `NO_UI` actions (open_app, scroll, system) → coordinator directly (skip perception)
- Multi-step commands (contains "and", "then") → coordinator
- Conversational intents → speak directly
- UI actions → perception → coordinator
- Low confidence / general_interaction → coordinator for full planning
Retry loop (within coordinator):
perceive → decide → act → verify
↑ │
└── 5-step retry ladder ───┘
1. SAME_ACTION (retry exact)
2. ALTERNATE_SELECTOR (different element)
3. SCROLL_AND_RETRY (scroll to find)
4. VISION_FALLBACK (VLM coordinate mode)
5. ABORT → replan (max 3 replans before giving up)
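A schematic sketch of the ladder as code (the `act`, `verify`, and `replan` callables stand in for the Actor, Verifier, and Planner; this is not the Coordinator's actual implementation):

```python
# Hedged sketch of the 5-step retry ladder described above — illustrative names only.
from enum import Enum, auto

class RetryStrategy(Enum):
    SAME_ACTION = auto()
    ALTERNATE_SELECTOR = auto()
    SCROLL_AND_RETRY = auto()
    VISION_FALLBACK = auto()
    ABORT = auto()

LADDER = list(RetryStrategy)
MAX_REPLANS = 3

async def execute_with_retries(subgoal, act, verify, replan) -> bool:
    """Escalate through the ladder; ABORT triggers a replan (max 3) before giving up."""
    replans = 0
    while replans <= MAX_REPLANS:
        for strategy in LADDER:
            if strategy is RetryStrategy.ABORT:
                break                              # ladder exhausted → replan
            await act(subgoal, strategy)           # execute the gesture with this strategy
            if await verify(subgoal):              # post-state capture confirms success
                return True
        replans += 1
        subgoal = await replan(subgoal)            # adaptive replanning
    return False
```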
State (aura_graph/state.py): TaskState TypedDict with ~40 fields, custom LangGraph reducers:
- `error_message` — accumulates (joins multiple errors with `;`)
- `status` — last-writer-wins
- `current_step` — takes maximum value (concurrent-safe)
- `end_time` — first-writer-wins (preserves actual completion time)
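In LangGraph, such per-field reducers are declared with `typing.Annotated`; a minimal sketch of how the four reducers above can be expressed (the real `aura_graph/state.py` has ~40 fields and its own implementations):

```python
# Sketch of custom LangGraph reducers — illustrative, not the actual state.py.
from typing import Annotated, Optional, TypedDict

def join_errors(current: str, update: str) -> str:
    # error_message accumulates: multiple errors are joined with ";"
    return f"{current};{update}" if current and update else (update or current)

def last_writer_wins(current, update):
    return update if update is not None else current

def take_maximum(current: int, update: int) -> int:
    # current_step keeps the highest value seen, safe under concurrent node updates
    return max(current or 0, update or 0)

def first_writer_wins(current, update):
    # end_time keeps the first value written (the actual completion time)
    return current if current is not None else update

class TaskState(TypedDict, total=False):
    error_message: Annotated[str, join_errors]
    status: Annotated[str, last_writer_wins]
    current_step: Annotated[int, take_maximum]
    end_time: Annotated[Optional[float], first_writer_wins]
    # ... ~36 more fields in the real aura_graph/state.py
```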
AURA uses a tri-provider strategy: Groq (primary, speed), Gemini (fallback, quality), NVIDIA NIM (optional, scale).
| Task | Provider | Model | Notes |
|---|---|---|---|
| Intent parsing | Groq | llama-3.1-8b-instant | 560 T/s, <300 tokens |
| Low-confidence intent | Groq | llama-3.3-70b-versatile | fallback |
| Planning / reasoning | Groq | meta-llama/llama-4-scout-17b-16e-instruct | 16 experts |
| Planning fallback | Gemini | gemini-2.5-flash | |
| Response generation | Groq | meta-llama/llama-4-scout-17b-16e-instruct | |
| Intent classification | OpenRouter | z-ai/glm-4.5-air:free | free tier |
| Intent classif. fallback | OpenRouter | meta-llama/llama-3.3-70b-instruct:free | |
| Intent classif. fallback 2 | Groq | llama-3.3-70b-versatile | |
| Safety screening | Groq | meta-llama/llama-prompt-guard-2-86m | specialized |
| ADK root agent | Google ADK | gemini-2.5-flash | |
| Gemini Live bidi | Gemini | gemini-2.0-flash-live-001 | |
| Task | Provider | Model |
|---|---|---|
| UI analysis / screen understanding | Groq | meta-llama/llama-4-scout-17b-16e-instruct |
| Visual element selection (SoM) | Groq | meta-llama/llama-4-scout-17b-16e-instruct |
| VLM fallback | Gemini | gemini-2.5-flash |
| Service | Provider | Model / Voice |
|---|---|---|
| Speech-to-Text | Groq | whisper-large-v3-turbo (16 kHz PCM mono) |
| Text-to-Speech | Edge-TTS (Microsoft) | en-US-AriaNeural (default, no API key) |
| Gemini Live voice | Gemini | Charon (configurable) |
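For reference, generating speech with the `edge-tts` package and the default voice takes only a few lines (generic library usage, not AURA's `services/tts.py`):

```python
# Minimal edge-tts example using the default AURA voice (en-US-AriaNeural).
# Generic library usage — not AURA's services/tts.py implementation.
import asyncio
import edge_tts

async def speak(text: str, out_path: str = "reply.mp3") -> None:
    communicate = edge_tts.Communicate(text, voice="en-US-AriaNeural")
    await communicate.save(out_path)   # no API key required

asyncio.run(speak("Done — your liked songs are playing."))
```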
Files: services/perception_controller.py, perception/omniparser_detector.py, perception/vlm_selector.py
Three-layer hybrid — each layer only runs if the previous is insufficient:
Layer 1: Android Accessibility UI Tree
→ Package name, activity, all interactive elements
→ Fast (~50 ms), but misses visual-only elements
Layer 2: YOLOv8 OmniParser (CV Detection)
→ Detects clickable elements in screenshot
→ Draws numbered Set-of-Marks boxes on image
→ Pixel-accurate without returning coordinates
Layer 3: VLM Selection (ScreenVLM)
→ Receives annotated screenshot with numbered elements
→ Returns index of target element (never raw coordinates)
→ Falls back to Gemini 2.5 Flash if Groq fails
Perception modalities (configurable via DEFAULT_PERCEPTION_MODALITY):
- `hybrid` — UI tree + vision (default, most reliable)
- `ui_tree` — fast path for settings-style apps
- `vision` — screenshot-only for canvas/WebView apps
- `auto` — controller selects based on app type
Caching: Screenshots cached for 2 s (configurable), invalidated after 1 gesture.
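A small sketch of that caching behaviour (illustrative, not the `perception_controller.py` code):

```python
# Sketch of the screenshot cache described above: 2 s TTL, invalidated after a gesture.
import time
from typing import Optional

class ScreenshotCache:
    def __init__(self, ttl_seconds: float = 2.0):
        self.ttl = ttl_seconds
        self._frame: Optional[bytes] = None
        self._taken_at: float = 0.0

    def get(self) -> Optional[bytes]:
        if self._frame and (time.monotonic() - self._taken_at) < self.ttl:
            return self._frame          # fresh enough: skip a device round-trip
        return None

    def put(self, frame: bytes) -> None:
        self._frame = frame
        self._taken_at = time.monotonic()

    def invalidate(self) -> None:
        # called right after a gesture, since the screen has likely changed
        self._frame = None
```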
File: services/reactive_step_generator.py
Instead of committing to a full upfront plan (which breaks when screens deviate), AURA generates one concrete action at a time grounded in the live screen.
async def generate_next_step(
goal, # what user wants to accomplish
screen_context, # current screen description
step_history, # what was done so far
screenshot_b64, # actual screenshot bytes
ui_hints, # UI tree labels
) -> Subgoal: # ONE next action

The planner creates skeleton phases (e.g., "Open Spotify", "Navigate to Liked Songs", "Start Playback"). For each phase, the reactive generator asks a VLM: "given the current screen and this phase goal, what is the single next UI action?" — grounding every decision in real screen state.
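Conceptually, driving one phase then looks like the sketch below; `perceive`, `act`, and `phase_is_done` are illustrative stand-ins for the Perceiver, Actor, and Verifier, and the `ScreenState` attribute names are assumed:

```python
# Hedged sketch of how the Coordinator could drive one phase step by step.
async def drive_phase(phase_goal: str, perceive, act, phase_is_done, max_steps: int = 10) -> bool:
    step_history: list = []
    for _ in range(max_steps):
        screen = await perceive()                       # fresh ScreenState + annotated screenshot
        if await phase_is_done(phase_goal, screen):     # Verifier-style completion check
            return True
        subgoal = await generate_next_step(
            goal=phase_goal,
            screen_context=screen.description,          # attribute names assumed for this sketch
            step_history=step_history,
            screenshot_b64=screen.screenshot_b64,
            ui_hints=screen.ui_hints,
        )
        await act(subgoal)                              # Actor executes exactly one gesture
        step_history.append(subgoal)
    return False                                        # phase did not complete within max_steps
```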
┌─────────────────────────────────────────────────────────┐
│ Google Cloud │
│ │
│ ┌──────────────────┐ ┌───────────────────────────┐ │
│ │ Cloud Run │ │ Cloud Storage │ │
│ │ aura-backend │───▶│ aura-execution-logs/ │ │
│ │ (this server) │ │ logs/{task_id}.html │ │
│ └────────┬─────────┘ └───────────────────────────┘ │
│ │ │
│ ┌────────▼─────────┐ │
│ │ Google ADK │ │
│ │ root_agent │ │
│ │ gemini-2.5-flash│ │
│ │ + FunctionTool │ │
│ │ execute_aura_ │ │
│ │ task() │ │
│ └────────┬─────────┘ │
│ │ │
│ ┌────────▼─────────┐ │
│ │ Gemini Live │ │
│ │ /ws/live │ │
│ │ gemini-2.0- │ │
│ │ flash-live-001 │ │
│ │ Bidi audio+vis │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────┘
root_agent = Agent(
name="AURA",
model="gemini-2.5-flash",
tools=[aura_tool], # FunctionTool wrapping execute_aura_task_from_text()
)

Lazy initialization — set_compiled_graph(app) must be called from main.py lifespan before any tool invocation. The tool returns success, response, steps_taken, and execution_log_url.
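A simplified sketch of that startup wiring (assuming `set_compiled_graph` receives the compiled LangGraph app; the real `main.py` lifespan does more):

```python
# Simplified sketch of the lazy-initialization wiring — not the full main.py.
from contextlib import asynccontextmanager
from fastapi import FastAPI

from aura_graph.graph import compile_aura_graph
from adk_agent import set_compiled_graph

@asynccontextmanager
async def lifespan(app: FastAPI):
    compiled = compile_aura_graph()      # compile the StateGraph once at startup
    set_compiled_graph(compiled)         # make it available to execute_aura_task_from_text()
    yield                                # server runs

app = FastAPI(lifespan=lifespan)
```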
Enabled when GEMINI_LIVE_ENABLED=true. Registers /ws/live in main.py.
Features:
- Full Voice Activity Detection (`prefix_padding_ms=160`, `silence_duration_ms=650`, high start/end sensitivity)
- Barge-in support (`START_OF_ACTIVITY_INTERRUPTS`)
- Thinking content filter — strips `**Bold**` reasoning headers from model output
- Transcript accumulation across sub-turns until `turn_complete`
- Non-blocking live request queue (audio + screenshot frames)
Message protocol (Android ↔ server):
| Direction | Type | Payload |
|---|---|---|
| Android → Server | `audio_chunk` | PCM 16 kHz mono int16, base64 |
| Android → Server | `screenshot` | JPEG base64 |
| Android → Server | `text_command` | plain text fallback |
| Server → Android | `audio_response` | PCM 24 kHz mono int16, base64 |
| Server → Android | `transcript` | incremental + final text |
| Server → Android | `task_progress` | "executing" or "idle" |
After every task, the HTML execution log is uploaded to Cloud Storage:
log_url = await upload_log_to_gcs_async(log_path, task_id)
# → gs://aura-execution-logs/logs/{task_id}.html (public URL)

Non-fatal: failures are logged as warnings only. Disabled by default (GCS_LOGS_ENABLED=false).
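A sketch of what such a non-fatal upload helper can look like with `google-cloud-storage` (illustrative; the real `gcs_log_uploader.py` may differ):

```python
# Sketch of a non-fatal async GCS upload — illustrative, not gcs_log_uploader.py.
import asyncio, logging
from google.cloud import storage

logger = logging.getLogger(__name__)

async def upload_log_to_gcs_async(log_path: str, task_id: str,
                                  bucket_name: str = "aura-execution-logs") -> str | None:
    def _upload() -> str:
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(f"logs/{task_id}.html")
        blob.upload_from_filename(log_path, content_type="text/html")
        return f"https://storage.googleapis.com/{bucket_name}/logs/{task_id}.html"
    try:
        return await asyncio.to_thread(_upload)    # keep the event loop free
    except Exception as exc:                       # non-fatal: warn, never fail the task
        logger.warning("GCS log upload failed: %s", exc)
        return None
```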
gcloud run deploy aura-backend \
--source . \
--region us-central1 \
--allow-unauthenticated \
--memory 2Gi \
--cpu 2 \
--timeout 3600 \
--set-secrets="GOOGLE_API_KEY=...,GROQ_API_KEY=...,GEMINI_API_KEY=..."

The server reads $PORT from Cloud Run automatically via Pydantic Settings. YOLOv8 is pre-warmed at Docker build time to eliminate cold-start latency.
Located in UI/. Kotlin + Jetpack Compose.
Handles continuous voice capture and WebSocket communication:
companion object {
const val SAMPLE_RATE = 16000 // 16 kHz
const val AUDIO_FORMAT = PCM_16BIT // int16 mono
const val CHUNK_MS = 100 // 100 ms chunks
const val SCREENSHOT_INTERVAL_MS = 3000 // every 3 s
const val UI_TREE_INTERVAL_MS = 5000 // every 5 s
const val PING_INTERVAL_MS = 25_000 // keepalive
}

- Continuous listen mode — no push-to-talk button required
- Audio chunks encoded as base64 and sent as WebSocket frames
- Screenshots and UI tree sent on independent timers
- Auto-silence detection after 8 s of post-response inactivity
Manages conversation state as StateFlow:
- `ConversationState` — current phase, session ID, connection status
- `List<ConversationMessage>` — full message history
- `List<AgentOutput>` — per-agent status updates (for debug overlay)
| Endpoint | Direction | Purpose | Notes |
|---|---|---|---|
| `ws://host/ws/audio` | Bidi | Legacy voice command streaming | Android app dependency — path must not change |
| `ws://host/ws/device` | Bidi | Device screenshot + UI tree polling | Android app dependency — path must not change |
| `ws://host/ws/live` | Bidi | Gemini Live bidi audio+vision | Requires GEMINI_LIVE_ENABLED=true |
| `ws://host/api/v1/tasks/ws` | Server→Client | Live task event streaming | For demo dashboard |
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check (legacy) |
| `/api/v1/health` | GET | Health check (versioned) |
| `/api/v1/graph/execute` | POST | Execute task from text |
| `/api/v1/device/screenshot` | GET | Live screenshot |
| `/demo` | GET | Judge dashboard (live screenshot, recent commands, GCS log links) |
| `/docs` | GET | OpenAPI docs (development only) |
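A hedged example of calling the execute endpoint from a script; the JSON field name (`text`) and the `X-API-Key` header are assumptions, not the documented request schema:

```python
# Illustrative client call — request/response field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/api/v1/graph/execute",
    json={"text": "Open Spotify and play my liked songs"},
    headers={"X-API-Key": "your-secret-key"},   # only if REQUIRE_API_KEY=true
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```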
- Python 3.11+
- Android device with USB debugging + Accessibility Service enabled
- `adb` in PATH
- Groq API key (required), Gemini API key (required), others optional
git clone <repo>
cd aura-live
# Install dependencies (note: filename has a space)
pip install -r "requirements copy.txt"
# Copy and fill in your API keys
cp .env.example .env

All settings flow through config/settings.py (Pydantic BaseSettings). Never read os.environ directly.
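A minimal sketch of that pattern with `pydantic-settings` (field names beyond those in the .env example below are illustrative):

```python
# Sketch of the Pydantic Settings pattern used by config/settings.py — illustrative.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    groq_api_key: str
    gemini_api_key: str
    host: str = "0.0.0.0"
    port: int = 8000             # Cloud Run's injected $PORT is picked up automatically
    require_api_key: bool = True

settings = Settings()            # import this everywhere instead of reading os.environ
```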
Create a .env file:
# ── Required ──────────────────────────────────────
GROQ_API_KEY=gsk_...
GEMINI_API_KEY=AIza...
GOOGLE_API_KEY=AIza... # same key, needed by google-genai SDK
# ── Providers (defaults shown) ────────────────────
DEFAULT_LLM_PROVIDER=groq
DEFAULT_VLM_PROVIDER=groq
DEFAULT_STT_PROVIDER=groq
DEFAULT_TTS_PROVIDER=edge-tts
PLANNING_PROVIDER=groq
# ── Models (defaults shown) ───────────────────────
DEFAULT_LLM_MODEL=llama-3.1-8b-instant
DEFAULT_VLM_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
PLANNING_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
DEFAULT_STT_MODEL=whisper-large-v3-turbo
DEFAULT_TTS_MODEL=en-US-AriaNeural
# ── Perception ────────────────────────────────────
DEFAULT_PERCEPTION_MODALITY=hybrid # ui_tree | hybrid | vision | auto
PERCEPTION_CACHE_ENABLED=true
PERCEPTION_CACHE_TTL=2.0
# ── Google Cloud (for Gemini Live + GCS logs) ─────
GOOGLE_CLOUD_PROJECT=your-gcp-project
GOOGLE_CLOUD_REGION=us-central1
GCS_LOGS_BUCKET=aura-execution-logs
GCS_LOGS_ENABLED=false # set true to upload HTML logs
GEMINI_LIVE_ENABLED=false # set true to enable /ws/live
GEMINI_LIVE_MODEL=gemini-2.0-flash-live-001
GEMINI_LIVE_VOICE=Charon # Aoede | Charon | Fenrir | Kore | Puck | ...
# ── Optional providers ────────────────────────────
NVIDIA_API_KEY=...
OPENROUTER_API_KEY=...
# ── LangGraph limits ──────────────────────────────
GRAPH_RECURSION_LIMIT=100
GRAPH_TIMEOUT_SECONDS=120.0
# ── Server ────────────────────────────────────────
HOST=0.0.0.0
PORT=8000
ENVIRONMENT=development
LOG_LEVEL=DEBUG
RELOAD=true
# ── Security ──────────────────────────────────────
REQUIRE_API_KEY=true
DEVICE_API_KEY=your-secret-key

# Start the server
python main.py
# → http://0.0.0.0:8000
# → Docs: http://localhost:8000/docs
# → Health: GET http://localhost:8000/health
# → Demo dashboard: http://localhost:8000/demo

python scripts/test_commander_live.py # test intent parsing live
python scripts/test_sensitive_blocking.py # test OPA policy blocking
python scripts/view_ui_tree.py # inspect device UI tree
python scripts/dead_code_scanner.py # scan for unused code

pytest tests/
pytest tests/test_foo.py::test_bar # single test

# Build + deploy from source
gcloud run deploy aura-backend \
--source . \
--region us-central1 \
--allow-unauthenticated \
--memory 2Gi \
--cpu 2 \
--timeout 3600 \
--set-secrets="GOOGLE_API_KEY=projects/.../secrets/GOOGLE_API_KEY/versions/latest,GROQ_API_KEY=...,GEMINI_API_KEY=..."
# Verify
curl https://aura-backend-xxx-uc.a.run.app/health

The Dockerfile pre-warms YOLOv8 at build time so the first real VLM call has no model-load latency. The server reads $PORT automatically from Cloud Run's injected environment variable.
Dual-layer safety:
- Llama Prompt Guard 2 86M (`services/prompt_guard.py`) — screens every voice input before intent parsing. Blocks jailbreaks, prompt injections, and harmful commands. Fail-safe: allows on API error.
- OPA Rego Policies (`policies/`, `services/policy_engine.py`) — gates every single gesture execution. Policies check:
  - Action type (send message, make purchase, delete data require confirmation)
  - Device state (locked screen, accessibility disabled)
  - Target app context
Both layers are fail-safe (allow on error) so transient API failures don't block legitimate commands.
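The shared fail-safe pattern can be pictured as a thin wrapper around each check (sketch only, not the actual `prompt_guard.py` / `policy_engine.py` code):

```python
# Sketch of the fail-safe (allow-on-error) pattern shared by both safety layers.
import logging

logger = logging.getLogger(__name__)

async def fail_safe_allow(check, *args, **kwargs) -> bool:
    """Run a safety check; if the check itself errors, allow and log a warning."""
    try:
        return await check(*args, **kwargs)     # True = allowed, False = blocked
    except Exception as exc:
        logger.warning("Safety check failed open: %s", exc)
        return True                             # transient API failure must not block the user
```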
- VLM never returns pixel coordinates — only selects from numbered SoM elements
- 5-stage retry ladder runs before any replanning
- Every gesture passes through OPA policy check in `gesture_executor.py`
- All new actions must be registered in `config/action_types.py` `ACTION_REGISTRY`
- All service functions must be `async def`
- All API keys go through `config/settings.py` — never `os.environ` directly
- 9 agents stay single-responsibility — no merging or scope creep
- `/ws/audio` and `/ws/device` paths must not change (Android app dependency)
aura-live/
├── main.py # FastAPI app + lifespan
├── adk_agent.py # Google ADK root_agent (gemini-2.5-flash)
├── adk_streaming_server.py # Gemini Live /ws/live handler
├── gcs_log_uploader.py # Cloud Storage HTML log upload
├── Dockerfile # Cloud Run deployment
├── requirements copy.txt # Python dependencies
│
├── agents/ # The 9 single-responsibility agents
│ ├── commander.py # Intent parsing
│ ├── planner_agent.py # Goal decomposition
│ ├── perceiver_agent.py # Screen capture + SoM
│ ├── coordinator.py # Multi-agent orchestrator
│ ├── actor_agent.py # Zero-LLM gesture execution
│ ├── responder.py # Natural language responses
│ ├── validator.py # Rule-based validation
│ ├── verifier_agent.py # Post-action verification
│ └── visual_locator.py # ScreenVLM (SoM selection)
│
├── aura_graph/ # LangGraph state machine
│ ├── graph.py # Graph assembly + entry points
│ ├── state.py # TaskState TypedDict (~40 fields)
│ ├── edges.py # Conditional routing functions
│ ├── core_nodes.py # Node implementations
│ └── nodes/ # Specialized nodes
│ ├── perception_node.py
│ ├── coordinator_node.py
│ ├── validate_outcome_node.py
│ ├── retry_router_node.py
│ ├── decompose_goal_node.py
│ └── next_subgoal_node.py
│
├── config/
│ ├── settings.py # Pydantic Settings (all env vars)
│ └── action_types.py # ACTION_REGISTRY
│
├── services/
│ ├── perception_controller.py # Tri-layer perception orchestration
│ ├── reactive_step_generator.py # Per-screen action generation
│ ├── gesture_executor.py # Gesture execution + strategy selection
│ ├── llm.py # Unified LLM interface (Groq/Gemini/NVIDIA)
│ ├── vlm.py # Unified VLM interface
│ ├── stt.py # Groq Whisper STT
│ ├── tts.py # Edge-TTS (Microsoft)
│ ├── prompt_guard.py # Llama Prompt Guard 2 safety screening
│ ├── policy_engine.py # OPA Rego policy gateway
│ └── command_logger.py # HTML execution log builder
│
├── perception/
│ ├── perception_pipeline.py # YOLOv8 + SoM pipeline
│ ├── omniparser_detector.py # YOLOv8 UI element detection
│ └── vlm_selector.py # VLM-based element selection
│
├── api/
│ ├── demo.py # /demo judge dashboard
│ ├── graph.py # /api/v1/graph/execute
│ ├── health.py # /health endpoints
│ └── tasks.py # /api/v1/tasks/ws streaming
│
├── api_handlers/
│ └── websocket_router.py # /ws/audio and /ws/device handlers
│
├── policies/ # OPA Rego policy files
├── prompts/ # LLM prompt templates
└── UI/ # Android companion app (Kotlin)
└── app/src/main/java/com/aura/aura_ui/
├── conversation/ConversationViewModel.kt
└── voice/GeminiLiveController.kt
MIT — see LICENSE.
Built for the Gemini Live Agent Challenge · Powered by Google ADK, Gemini Live, Groq, LangGraph, and Android Accessibility API