sqllm — Local Text-to-SQL CLI Agent

sqllm is a Go CLI that translates natural-language questions into SQL by combining:

  • profile-scoped runtime configuration,
  • local schema metadata and embeddings,
  • semantic table retrieval,
  • an LLM tool-calling loop,
  • and a built-in evaluation harness with baseline mode comparison.

Key differentiator: this is not a single-prompt text-to-SQL script that dumps an entire schema into one prompt. It is a retrieval-first system that incrementally builds context from a local metadata store before asking the model for SQL.


1. Overview

sqllm helps developers and analysts query PostgreSQL or SQLite without remembering table details.

Each profile maps to one database and has its own metadata cache at:

  • ~/.sqllm/metadata/<db-url-hash>.db

That keeps environments isolated (for example, dev vs prod) and avoids cross-database schema contamination.


2. Problem Statement

Text-to-SQL in real systems is hard for reasons beyond prompt writing:

  1. Natural language is ambiguous (user terms rarely match table names exactly).
  2. Real schemas are large and relationship-heavy (full-schema prompting is expensive and noisy).
  3. Join inference is difficult without explicit structural reasoning.
  4. LLM outputs can be syntactically valid but semantically wrong.
  5. Executing generated SQL safely requires guardrails.

sqllm addresses part of this by building and maintaining local schema metadata + embeddings, then retrieving relevant context for the model rather than prompting with the full schema every time.


3. Key Features

Profile-Based Runtime Configuration

  • Multiple profiles under ~/.sqllm/profiles/<name>.yaml
  • Active profile tracked in ~/.sqllm/config.yaml
  • Commands:
    • sqllm init
    • sqllm profile list
    • sqllm profile use
    • sqllm profile current
    • sqllm profile update

Incremental Schema Scanning

  • Per-table schema hashing (columns + foreign keys)
  • Detects new/updated/unchanged/deleted tables
  • Updates only changed/new entries in metadata store
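The change detection can be pictured as a stable digest per table. A minimal sketch, assuming a hypothetical TableSchema shape (the real types live in internal/db/extractor.go and may differ):

package metadata // sketch only; not the project's actual package layout

import (
    "crypto/sha256"
    "encoding/hex"
    "sort"
    "strings"
)

// TableSchema is a hypothetical shape used for illustration.
type TableSchema struct {
    Name        string
    Columns     []string // e.g. "id INTEGER NOT NULL"
    ForeignKeys []string // e.g. "user_id -> users.id"
}

// SchemaHash digests columns + foreign keys so a rescan can classify a
// table as new/updated/unchanged by comparing against the stored hash.
func SchemaHash(t TableSchema) string {
    cols := append([]string(nil), t.Columns...)
    fks := append([]string(nil), t.ForeignKeys...)
    sort.Strings(cols) // sort so cosmetic ordering changes do not look like schema changes
    sort.Strings(fks)
    sum := sha256.Sum256([]byte(strings.Join(cols, "\n") + "\x00" + strings.Join(fks, "\n")))
    return hex.EncodeToString(sum[:])
}

Only tables whose digest differs from the stored one are rewritten; deleted tables are those present in the store but absent from the scan.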

Per-Database Metadata Isolation

  • Metadata path derived from db-url hash via MetadataDBPath
  • Store location: ~/.sqllm/metadata/<db-url-hash>.db
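The isolation comes down to hashing the db-url into the file name. A sketch of the idea behind MetadataDBPath (the real function in internal/config/metadata.go may derive the hash differently):

package config // sketch; see internal/config/metadata.go for the real version

import (
    "crypto/sha256"
    "encoding/hex"
    "os"
    "path/filepath"
)

// metadataDBPath maps a database URL to its own metadata store so two
// profiles pointing at different databases never share schema or embeddings.
func metadataDBPath(dbURL string) (string, error) {
    home, err := os.UserHomeDir()
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256([]byte(dbURL))
    name := hex.EncodeToString(sum[:8]) + ".db" // truncated hash is a sketch choice
    return filepath.Join(home, ".sqllm", "metadata", name), nil
}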

LLM Provider Abstraction

  • Ollama client (chat + embeddings)
  • OpenRouter client (chat + embeddings)
  • Intent generation is available through the same provider factory path used by ask/enrich
  • Provider selected through runtime profile + llm.NewClient factory

Semantic Retrieval + Tool-Calling Agent Loop

  • Tools exposed to the model:
    • find_relevant_tables
    • fetch_table_by_name
    • search_by_label
  • Max 5 agent iterations
  • Context caps:
    • up to 4 full-schema tables
    • up to 8 total tables
  • Retrieval fallback chain:
    • sqlite-vec cosine distance query when available
    • in-memory cosine similarity fallback (sketched after this list)
    • lower-threshold + keyword fallback paths
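When the sqlite-vec extension is unavailable, retrieval degrades to scoring embeddings in Go. A minimal sketch of that fallback (Scored and Rank are illustrative names):

package retrieval // sketch of the in-memory fallback path

import (
    "math"
    "sort"
)

// Cosine returns the cosine similarity of two equal-length vectors.
func Cosine(a, b []float32) float64 {
    if len(a) != len(b) {
        return 0
    }
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type Scored struct {
    Table string
    Score float64
}

// Rank scores every stored table embedding against the query embedding;
// lowering threshold corresponds to the lower-threshold fallback path.
func Rank(query []float32, tables map[string][]float32, threshold float64) []Scored {
    var out []Scored
    for name, emb := range tables {
        if s := Cosine(query, emb); s >= threshold {
            out = append(out, Scored{Table: name, Score: s})
        }
    }
    sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
    return out
}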

Intent Enrichment and Interactive Metadata Editing

  • sqllm enrich populates intent only for tables that are missing it
  • list/search interactive flows support editing:
    • table label
    • per-column semantic hints
    • table intent
  • Intent edits trigger embedding regeneration

Optional Safe Execution

  • sqllm ask --run executes generated SQL with safeguards:
    • SELECT-only check
    • forbidden keyword check
    • EXPLAIN validation before execution
    • self-correcting SQL repair loop when EXPLAIN fails (bounded attempts)
    • tabular output, NULL handling, and 100-row cap

Evaluation Harness (Mode Comparison)

  • sqllm eval runs dataset-based text-to-SQL evaluation on a generated e-commerce benchmark DB
  • Evaluates three generation modes on the same cases:
    • full_agent
    • single_shot
    • no_agent
  • Reports per-mode metrics:
    • execution success rate
    • execution accuracy (EX)
    • retrieval precision/recall (when retrieval is enabled)
    • average latency
    • average iteration count
  • Eval setup now includes deterministic metadata enrichment:
    • hardcoded high-quality table intents
    • embedding generation for every eval table intent
    • persistence into the same metadata store path used by normal retrieval flow

Reliability Improvements

  • LLM generation calls use retry with exponential backoff and jitter for transient failures
  • Transient detection handles rate limits (429), HTTP 5xx, timeouts, and connection resets
  • Non-transient errors fail fast

4. Architecture & System Design

High-Level Architecture

CLI Commands -> Runtime Resolver -> (LLM Client + Schema Store) -> Target DB

Core Components

Component | Responsibility | Key Files
CLI Root | Cobra command registration | internal/cli/root.go
Init/Profile | Profile lifecycle and switching | internal/cli/init.go, internal/cli/profile.go
Config Storage | Global active profile + profile files | internal/config/config.go
Metadata Path | Hash-based metadata DB path | internal/config/metadata.go
Runtime Resolver | Profile -> validated runtime settings | internal/runtime/runtime.go
Scan Pipeline | Schema extraction and incremental persistence | internal/cli/scan.go, internal/db/extractor.go
Metadata Store | SQLite persistence and retrieval APIs | internal/db/sqlite_store.go
LLM Factory | Provider-agnostic client creation | internal/llm/factory.go
Ask Orchestration | Mode-aware pipeline, agent loop, tool dispatch | internal/cli/ask.go, internal/cli/ask_pipeline.go, internal/cli/ask_helpers.go
Eval Pipeline | Eval DB setup, dataset execution, mode metrics | internal/cli/eval.go, internal/cli/eval_runner.go, internal/cli/eval_db.go
Service Layer | Table operations for list/search/edit | internal/service/

5. Tech Stack

  • Language: Go
  • CLI: Cobra, Promptui
  • Databases: PostgreSQL (lib/pq), SQLite (mattn/go-sqlite3)
  • Local metadata: SQLite + optional sqlite-vec
  • LLM providers: Ollama and OpenRouter
  • Serialization/config: YAML (gopkg.in/yaml.v3)
  • Logging: slog + tint

6. LLM and Agent Workflow (Implementation Detail)

This section reflects current implementation behavior.

Prompt Structure

For ask, each iteration sends:

  1. system prompt (tool usage and SQL output constraints),
  2. user query,
  3. retrieved schema context (full + partial table views),
  4. optional system note from previous retrieval failures.

Tool-Calling Loop

Flow per iteration:

  1. Generate model response with tool definitions.
  2. If tool calls are present, execute them and enrich context.
  3. If no tool calls, treat content as final SQL.
  4. Stop when SQL is produced or max iterations (5) is reached.

Current safeguards in loop:

  • Duplicate tool call detection within one ask execution.
  • Unsupported tool name -> command error.
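Put together, the loop and its safeguards reduce to roughly this shape (the types and callbacks here are illustrative, not the actual internal API):

package agent // illustrative sketch of the ask loop

import (
    "errors"
    "fmt"
)

const maxIterations = 5

type ToolCall struct{ Name, Args string }

type Response struct {
    Content   string
    ToolCalls []ToolCall
}

// askLoop mirrors the steps above: generate, dispatch tool calls to grow
// context, or accept content as final SQL, bounded by maxIterations.
func askLoop(generate func(ctx string) (Response, error), runTool func(ToolCall) (string, error)) (string, error) {
    ctx := "" // retrieved schema context, grown each turn
    seen := map[string]bool{}
    for i := 0; i < maxIterations; i++ {
        resp, err := generate(ctx)
        if err != nil {
            return "", err
        }
        if len(resp.ToolCalls) == 0 {
            return resp.Content, nil // no tool calls: content is the final SQL
        }
        for _, tc := range resp.ToolCalls {
            key := tc.Name + "|" + tc.Args
            if seen[key] {
                continue // duplicate tool call within this ask execution
            }
            seen[key] = true
            out, err := runTool(tc) // unsupported tool names surface as errors here
            if err != nil {
                return "", fmt.Errorf("tool %s: %w", tc.Name, err)
            }
            ctx += "\n" + out
        }
    }
    return "", errors.New("no SQL produced within the iteration cap")
}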

Hallucination / Invalid SQL Handling

Current state:

  • Ask pipeline always attempts bounded EXPLAIN-driven SQL repair before optional execution.
  • With --run, SQL also gets SELECT-only checks before final execution.
  • There is still no semantic SQL correctness validator.

Retry / Failure Handling

Current state:

  • Generate calls use provider-aware retry with exponential backoff and jitter.
  • Retries are limited to transient classes of failures (429/5xx/network timeout/reset).
  • Retry behavior is currently focused on generation calls, not a full circuit-breaker strategy.
  • Failures return user-facing command errors.
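A sketch of that policy (the transient classifier is simplified here; the real one also inspects provider-specific responses):

package llm // sketch of the generation retry path

import (
    "errors"
    "math/rand"
    "net"
    "time"
)

// transient reports whether a failure is worth retrying: rate limits,
// server errors, and network timeouts. Everything else fails fast.
func transient(err error, status int) bool {
    if status == 429 || status >= 500 {
        return true
    }
    var ne net.Error
    return errors.As(err, &ne) && ne.Timeout()
}

// withRetry wraps a generation call with exponential backoff plus jitter.
func withRetry(attempts int, call func() (string, int, error)) (string, error) {
    backoff := 500 * time.Millisecond
    for i := 0; ; i++ {
        out, status, err := call()
        if err == nil {
            return out, nil
        }
        if i == attempts-1 || !transient(err, status) {
            return "", err // exhausted or non-transient: fail fast
        }
        time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff)))) // jitter on top of the base delay
        backoff *= 2
    }
}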

Context Window Strategy

Current strategy is heuristic and bounded:

  • 4 full-schema tables max
  • 8 total tables max
  • initial retrieval + iterative augmentation
  • optional fallback retrieval paths when relevance is low

This controls prompt size but does not guarantee optimal context selection for every query.
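In effect the caps are a small admission policy applied while merging retrieved tables into the prompt. A minimal sketch (names are illustrative):

package contextcap // sketch of the context cap policy

const (
    maxFullTables  = 4 // tables sent with full schema
    maxTotalTables = 8 // full + partial tables combined
)

// admit reports whether another retrieved table may enter the prompt,
// and whether it still qualifies for a full-schema slot.
func admit(fullCount, totalCount int) (full, ok bool) {
    if totalCount >= maxTotalTables {
        return false, false // context is full; the table is dropped
    }
    return fullCount < maxFullTables, true
}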


7. Safety and Execution Model

ask --run adds execution safety gates:

  1. Query must start with SELECT (case-insensitive, trimmed).
  2. Query is blocked if it contains forbidden write/DDL keywords.
  3. EXPLAIN runs before actual query execution:
    • SQLite: EXPLAIN QUERY PLAN
    • PostgreSQL: EXPLAIN
  4. Execution output is tabular, includes NULL rendering, capped at 100 rows.
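Gates 1-3 condense to roughly the following (the keyword list is abbreviated, and the real checks in the ask pipeline may differ):

package safety // condensed sketch of the --run gates

import (
    "database/sql"
    "fmt"
    "strings"
)

// Abbreviated; the real list also covers DDL and other write keywords.
// Substring matching is intentionally naive here, as with any keyword guard.
var forbidden = []string{"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE"}

// guard applies the SELECT-only and forbidden-keyword checks, then
// dry-runs the query with the dialect's EXPLAIN form before execution.
func guard(db *sql.DB, dialect, query string) error {
    q := strings.TrimSpace(query)
    upper := strings.ToUpper(q)
    if !strings.HasPrefix(upper, "SELECT") {
        return fmt.Errorf("only SELECT queries may be executed")
    }
    for _, kw := range forbidden {
        if strings.Contains(upper, kw) {
            return fmt.Errorf("forbidden keyword: %s", kw)
        }
    }
    explain := "EXPLAIN "
    if dialect == "sqlite" {
        explain = "EXPLAIN QUERY PLAN "
    }
    _, err := db.Exec(explain + q) // validate the plan before running the query itself
    return err
}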

Important: this is a practical guardrail layer, not a full SQL security analysis engine.


8. Real-World Use Case

Typical developer workflow:

  1. Create one profile per environment/database:
    • app_dev, app_staging, analytics_prod.
  2. Run scan after schema changes.
  3. Run enrich to fill missing semantic intents.
  4. Use ask for day-to-day query drafting.
  5. Use ask --run for read-only exploration when safe execution is desired.
  6. Switch environment context with sqllm profile use <profile>.

Practical benefit:

  • Faster query authoring when schema is large or unfamiliar.
  • Lower chance of selecting the wrong tables than manual guessing.
  • Safer read-only execution path than copy-pasting generated SQL directly into prod sessions.

9. Setup & Usage

Build

go build -o sqllm ./cmd/sqllm

Initialize Profile

sqllm init

Interactive prompts collect:

  • profile name
  • database URL
  • provider (ollama/openrouter)
  • provider-specific input
    • ollama.base_url for ollama
    • openrouter.api_key for openrouter
  • models.intent
  • models.sql
  • models.embedding
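A resulting profile file might look like the following (field names are illustrative; check the YAML that init actually writes):

# ~/.sqllm/profiles/app_dev.yaml (illustrative layout)
name: app_dev
db_url: postgres://user:pass@localhost:5432/app
provider: ollama
ollama:
  base_url: http://localhost:11434
models:
  intent: llama3.1
  sql: qwen2.5-coder
  embedding: nomic-embed-text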

Core Commands

sqllm scan
sqllm enrich
sqllm ask "how many orders last month?"
sqllm ask --run "top 10 customers by spend"
sqllm list --missing
sqllm search users

Profile Commands

sqllm profile list
sqllm profile current
sqllm profile use mydb
sqllm profile update mydb

Editor

  • EDITOR env var is used for metadata editing.
  • If EDITOR is not set:
    • Windows defaults to notepad
    • non-Windows defaults to vim
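The fallback is equivalent to this sketch:

package edit // sketch of the editor fallback

import (
    "os"
    "runtime"
)

// resolveEditor honors $EDITOR, then falls back per platform.
func resolveEditor() string {
    if ed := os.Getenv("EDITOR"); ed != "" {
        return ed
    }
    if runtime.GOOS == "windows" {
        return "notepad"
    }
    return "vim"
}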

10. Limitations (Current State)

  1. Eval harness is available, but benchmark quality is still early and not yet representative of production workloads.
  2. No hallucination detector beyond execution-time guards in --run.
  3. SQL repair is EXPLAIN-driven and bounded; it can still fail on deeper semantic issues.
  4. Retry strategy is heuristic and local (no global rate-limit coordination/circuit breaker).
  5. Current EX metric is execution-result based and can under-represent semantic equivalence in some SQL variants.
  6. Join-path inference via schema graph is not integrated into ask flow.
  7. export/import commands are stubs.
  8. Evaluation currently runs on one synthetic e-commerce dataset; broader domain benchmarks are still needed.
  9. Query safety checks are keyword/shape based, not AST-level enforcement.
  10. Tool-call dedup uses simple in-process keys; no persistence across runs.

11. Performance Notes (Evidence-Based)

What is implemented:

  • Incremental scan avoids rewriting unchanged table metadata.
  • Context caps limit prompt growth.
  • Embeddings are persisted in the metadata store and reused.
  • Dataset-driven eval command with per-mode comparison metrics (execution, EX, retrieval P/R, latency, iterations).

What is not yet available:

  • No cross-domain benchmark suite beyond the current e-commerce eval dataset.
  • No longitudinal trend tracking across commits/runs.

Interpretation: baseline quantification now exists, but broader and deeper measurement coverage is still needed.


12. Future Improvements

  1. Expand evaluation datasets across additional domains and harder query patterns.
  2. Extend retry policy to include intent/embedding paths and smarter provider-specific classification.
  3. Improve SQL repair with richer repair prompts and semantic verification beyond EXPLAIN.
  4. Integrate schema graph for join-path guidance in ask orchestration.
  5. Add retrieval observability (per-query retrieved tables and scores).
  6. Add token/cost accounting for cloud provider usage.
  7. Add optional query/result caching with invalidation strategy.
  8. Implement export/import commands.
  9. Add safer SQL validation layer (parser/AST or allowlist policy engine).
  10. Add profile-level secrets handling improvements beyond plain YAML.

13. Design Decisions & Trade-offs

  1. Retrieval-first context building instead of full-schema prompting
     Chosen: ask starts with semantic retrieval and only sends selected tables into the prompt.
     Why: reduces prompt size and avoids flooding the model with irrelevant schema.
     Trade-off: retrieval misses can hide needed tables, so downstream SQL quality depends on retrieval quality.

  2. Tool-calling agent loop instead of single-shot SQL generation
     Chosen: bounded iterative loop (max 5 turns) with find_relevant_tables, fetch_table_by_name, and search_by_label.
     Why: lets the model request additional context incrementally before final SQL generation.
     Trade-off: higher latency and complexity versus one-shot prompting, with failure if SQL is not produced within the iteration cap.

  3. sqlite-vec path with in-memory cosine fallback
     Chosen: use sqlite-vec when available; fall back to Go-based cosine similarity when the extension is unavailable.
     Why: keeps semantic retrieval performant on supported setups while preserving portability.
     Trade-off: the fallback path is less efficient and can behave differently under large metadata volumes.

  4. Per-database metadata isolation by hashed DB URL
     Chosen: the metadata store path is derived as ~/.sqllm/metadata/<db-url-hash>.db.
     Why: prevents schema/embedding leakage across environments and profiles.
     Trade-off: creates multiple local metadata files and requires separate scan/enrich maintenance per database.

  5. Heuristic context limits (4 full tables, 8 total tables)
     Chosen: hard caps on full and partial schema context in ask orchestration.
     Why: controls token usage and keeps prompts within practical model limits.
     Trade-off: relevant long-tail tables can be excluded in complex multi-join questions.

  6. Lightweight execution safety checks instead of full SQL parsing
     Chosen: SELECT-only + forbidden-keyword checks + EXPLAIN pre-check in --run mode.
     Why: provides practical safety with low implementation overhead.
     Trade-off: checks are syntactic/heuristic, not AST-level or policy-complete enforcement.


14. Project Structure

text2sql/
├── cmd/sqllm/                # CLI entrypoint
├── internal/
│   ├── cli/                  # Commands (ask, scan, enrich, list, search, profile, init)
│   ├── config/               # Active profile + profile storage
│   ├── runtime/              # Profile -> resolved runtime settings
│   ├── db/                   # Extractor + SQLite metadata store
│   ├── llm/                  # Provider interface + factory + clients
│   ├── service/              # Table service layer
│   └── graph/                # Schema graph utilities (not yet in ask flow)
├── pkg/logger/               # Logger setup
└── README.md

Performance Benchmark

We evaluate the Text2SQL system across three modes:

  • full_agent: iterative agent with retrieval and execution feedback
  • single_shot: one-pass generation with retrieval
  • no_agent: baseline LLM without retrieval or iteration

Summary Metrics

Metric | full_agent | single_shot | no_agent
Total Queries | 40 | 40 | 40
Execution Success | 29 | 10 | 30
Execution Success Rate | 72.50% | 25.00% | 75.00%
EX Correct Count | 26 | 10 | 25
EX Evaluated Count | 40 | 40 | 40
EX Accuracy | 65.00% | 25.00% | 62.50%
Retrieval Evaluated Queries | 40 | 40 | 0
Average Precision (Retrieval) | 0.409 | 0.118 | 0.000
Average Recall (Retrieval) | 0.725 | 0.250 | 0.000
Average Latency (ms) | 5765.77 | 1897.28 | 5739.18
Average Iteration Count | 2.70 | 1.00 | 1.00

Key Results

  • full_agent improves EX accuracy by +40.00 points over single_shot (65.00% vs 25.00%).
  • full_agent improves retrieval quality over single_shot:
    • precision: 0.409 vs 0.118 (~3.47x)
    • recall: 0.725 vs 0.250 (~2.90x)
  • no_agent is close to full_agent on top-line execution/EX (75.00% and 62.50%), while full_agent remains slightly higher on EX (65.00%).

Observations

1. full_agent now delivers strong semantic correctness

  • EX at 65.00% indicates the agent loop is no longer only improving syntax/executability; it is frequently producing semantically correct SQL.
  • The higher iteration count (2.70) is consistent with multi-step retrieval and refinement before finalizing SQL.

2. Retrieval materially helps versus one-pass retrieval mode

  • Compared with single_shot, full_agent has much higher retrieval precision/recall (0.409/0.725 vs 0.118/0.250).
  • Reasoning: iterative tool usage improves schema grounding and reduces under-context errors that commonly affect one-pass generation.

3. Why no_agent stays competitive

  • no_agent achieves 75.00% execution success and 62.50% EX, close to full_agent.
  • Reasoning: with clearer benchmark question wording, direct generation can often succeed without retrieval overhead.
  • This also suggests the current dataset includes many queries where schema lookup is helpful but not always essential.

4. Latency profile reflects control-loop overhead

  • single_shot is fastest (1897.28 ms) because it avoids iterative calls.
  • full_agent and no_agent are similar in average latency (~5.7s), with full_agent paying for extra iterations and tool orchestration.

Tradeoffs

Aspect | full_agent | single_shot | no_agent
Accuracy | Best EX (65.00%) | Lowest EX (25.00%) | Near-best EX (62.50%)
Latency | Higher (5765.77 ms, iterative) | Lowest (1897.28 ms) | Similar to full_agent (5739.18 ms)
Robustness | Strong via retrieval + iterative correction | Weakest under schema ambiguity | Strong on this dataset, no retrieval
Simplicity | Highest complexity | Simplest runtime path | Simple generation path

Takeaways

  • full_agent is the most reliable choice for semantic correctness across this benchmark (highest EX).
  • Retrieval quality gains are substantial and explain much of the gap between full_agent and single_shot.
  • no_agent remains a viable baseline for clear/straightforward prompts, but lacks retrieval diagnostics and controllability.
  • There is still a practical latency tradeoff: iterative grounding improves quality but costs time.

Future Work

  • Improve retrieval precision further to reduce extra context (better ranking and stronger filtering).
  • Add query-plan-aware reranking to select among multiple valid SQL candidates.
  • Optimize iteration policy (early-stop conditions and smarter tool call budgeting).
  • Expand benchmarks with harder multi-join and long-tail schema cases to stress no_agent vs retrieval-first modes.

License

MIT — see LICENSE.
