A TypeScript-based AI voice agent for insurance support use cases. It runs as a CLI app and supports English and German.
- MacBook Pro with M1 Pro and 16 GB RAM
- macOS Tahoe 26.3
- Apple Silicon arm64
- Accepts voice input from microphone or audio file
- Runs speech to text
- Detects intent with LLM and returns structured JSON
- Generates spoken response with text to speech
- Maintains session memory and customer context
- Saves session summaries for cross session continuity
- policy_enquiry
- report_claim
- schedule_appointment
- general_conversation
- unknown fallback
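The intent step returns structured JSON with an "unknown" fallback. As a minimal sketch of what parsing such output could look like (the field names `confidence` and `entities` and the helper `parseIntent` are illustrative assumptions, not the project's actual schema):

```typescript
// Hypothetical shape of the structured intent result. The project's real
// schema in src/ may differ; only the intent labels come from the README.
type Intent =
  | "policy_enquiry"
  | "report_claim"
  | "schedule_appointment"
  | "general_conversation"
  | "unknown";

interface IntentResult {
  intent: Intent;
  confidence: number;               // assumed 0..1 score from the LLM
  entities: Record<string, string>; // assumed slot values, e.g. a policy number
}

// Parse raw LLM output, falling back to "unknown" when it is not valid JSON
// or is missing required fields.
function parseIntent(raw: string): IntentResult {
  try {
    const parsed = JSON.parse(raw) as Partial<IntentResult>;
    if (parsed.intent && parsed.confidence !== undefined) {
      return { entities: {}, ...parsed } as IntentResult;
    }
  } catch {
    // malformed LLM output falls through to the fallback below
  }
  return { intent: "unknown", confidence: 0, entities: {} };
}
```

A fallback like this keeps one malformed LLM reply from crashing the turn; the agent can still answer via the general_conversation path.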
- Node.js + TypeScript
- STT: nodejs-whisper with whisper.cpp
- LLM: Ollama (default, local) or Gemini API (optional)
- TTS: macOS say
- Audio tools: sox and ffmpeg
Install Apple Command Line Tools first (required for native builds).

```bash
xcode-select --install
```

If they are already installed, you can verify with:

```bash
xcode-select -p
```

Install system tools on macOS.

```bash
brew install ffmpeg sox
```

Install npm packages.

```bash
npm install
```

If you switch to the Gemini provider, set your API key.

```bash
export GEMINI_API_KEY="your_api_key_here"
```

Model and provider settings are in src/utils/config.ts. The default configuration uses Ollama (local).

```ts
llm: { provider: "local" }
ollama: { model: "llama3.2:3b" }
```

Run Ollama locally and pull a model.

```bash
brew install ollama
ollama serve
ollama pull llama3.2:3b
```

For lower latency on limited hardware, you can use faster Ollama models (for example llama3.2:1b) instead of heavier models like mistral:7b.
```bash
npm run build
npm run start
```

Note: the nodejs-whisper library logs some output to the console by default, and this cannot be disabled directly.

```bash
npm run dev
```

`npm run dev` runs with development logging enabled.
When the app starts, it asks for a phone number. If the phone number is valid, it uses a phone-based user key; if no phone number is provided, it uses an anonymous UUID key.
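The phone-vs-anonymous key logic could be sketched as follows (the validation rule of 7-15 digits with an optional leading `+`, and the `phone:`/`anon:` prefixes, are assumptions for illustration):

```typescript
import { randomUUID } from "node:crypto";

// Illustrative sketch of deriving a user key from an optional phone number.
// The validation regex and key prefixes are assumptions, not the project's code.
function deriveUserKey(phone: string | undefined): string {
  const trimmed = (phone ?? "").trim();
  if (/^\+?\d{7,15}$/.test(trimmed)) {
    return `phone:${trimmed}`; // stable key: same caller maps to the same memory
  }
  return `anon:${randomUUID()}`; // fresh key each run when no phone is given
}
```

A stable phone-based key is what makes cross-session continuity possible, since the saved summaries can be looked up under the same key on the next call.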
Commands
- `r` Record microphone audio and process it
- `f` Process audio file from the audio folder
- `l <en|de>` Switch language and voice
- `q` End session and save memory plus summary
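The command loop above can be sketched as a simple dispatcher (handler behavior is condensed to status strings here; the `Session` shape is an assumption):

```typescript
// Hypothetical sketch of the CLI command dispatch described above.
// Real handlers would record audio, run the pipeline, etc.
type Language = "en" | "de";

interface Session {
  language: Language;
  active: boolean;
}

function handleCommand(session: Session, input: string): string {
  const [cmd, arg] = input.trim().split(/\s+/);
  switch (cmd) {
    case "r":
      return "recording from microphone";
    case "f":
      return "processing file from audio folder";
    case "l":
      if (arg === "en" || arg === "de") {
        session.language = arg;
        return `language switched to ${arg}`;
      }
      return "usage: l <en|de>";
    case "q":
      session.active = false; // triggers memory + summary save on exit
      return "session ended, memory saved";
    default:
      return "unknown command";
  }
}
```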
To reduce noise in normal runs, logging is split into two levels.
- Always shown in both `npm run start` and `npm run dev`:
  - Agent listening status
  - STT transcribed text
  - Assistant/LLM response text
- Only shown in `npm run dev`:
  - Internal debug logs (intent detection payloads, timing diagnostics, session debug details)
The app logs timing in milliseconds for key steps and writes per-turn benchmark rows to CSV.
- CSV output file: `data/benchmarks/latency.csv`
- Columns: `timestamp, sessionId, userKey, llmProvider, llmModel, inputSource, sttMs, llmMs, ttsMs, totalMs, status`
- STT timer
- LLM timer
- TTS timer
| Metric | Value |
|---|---|
| Source | data/benchmarks/latency.csv |
| Sample size | 39 interactions |
| Status split | 38 ok, 1 error |
| Avg sttMs | 1811 |
| Avg llmMs | 3051 |
| Avg ttsMs | 7023 |
| Avg totalMs | 11888 |
| Main bottleneck | ttsMs (~59% of average total latency) |
| Provider | Model | n | Avg llmMs | Avg totalMs |
|---|---|---|---|---|
| ollama | llama3.2:3b | 24 | 2864 | 11359 |
| ollama | llama3.2:1b | 3 | 3025 | 13983 |
| ollama | mistral:7b | 5 | 7032 | 15365 |
| gemini | gemini-3.1-flash-lite-preview | 7 | 860 | 10318 |
IMPORTANT:
`ttsMs` includes the full time for macOS `say` to finish speaking the complete sentence, and `totalMs` also includes that same full TTS playback duration.
NOTE: When using local models (for example via Ollama and local Whisper execution), end-to-end latency is highly dependent on local model size, hardware, and current system load.
NOTE: The first inference in each fresh app run is typically slower due to model/runtime loading (cold start). Subsequent turns are usually faster once models are warm.
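One common way to hide the cold-start cost is an optional warm-up call at startup, sending a throwaway prompt so the model is loaded before the first real turn. A sketch against Ollama's default local endpoint (the helper names are assumptions; this is not part of the project's code):

```typescript
// Build the request for a tiny throwaway generation against Ollama's
// local HTTP API (default port 11434).
function warmupRequest(model: string): { url: string; body: string } {
  return {
    url: "http://localhost:11434/api/generate",
    body: JSON.stringify({ model, prompt: "ok", stream: false }),
  };
}

// Fire the warm-up once at startup; the reply is ignored, since loading
// the model into memory is the only point.
async function warmup(model: string): Promise<void> {
  const { url, body } = warmupRequest(model);
  await fetch(url, { method: "POST", body });
}
```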
- Stores conversation history in current session
- Recalls customer name in later turns
- Clears runtime session on quit
- Persists session logs in data/memory
- Persists summarized context in data/summaries
- This project is designed for an Apple Silicon workflow
- macOS say is used for low latency local TTS
- Local LLM quality depends on RAM and model size
- docs/TECHNICAL_DOCUMENTATION.md
- docs/AI_USAGE.md