sqllm is a Go CLI that translates natural-language questions into SQL by combining:
- profile-scoped runtime configuration,
- local schema metadata and embeddings,
- semantic table retrieval,
- an LLM tool-calling loop,
- and a built-in evaluation harness with baseline mode comparison.
Key differentiator: this is not a single-prompt text-to-SQL script that dumps an entire schema into one prompt. It is a retrieval-first system that incrementally builds context from a local metadata store before asking the model for SQL.
sqllm helps developers and analysts query PostgreSQL or SQLite without remembering table details.
Each profile maps to one database and has its own metadata cache at:
- ~/.sqllm/metadata/.db
That keeps environments isolated (for example, dev vs prod) and avoids cross-database schema contamination.
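The isolation above comes down to mapping each database URL to its own metadata file. A minimal sketch of that idea, assuming a SHA-256 hash truncated to a short hex prefix (the real logic is `MetadataDBPath` in internal/config/metadata.go, and its exact hash and truncation choices may differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// metadataDBPath maps a database URL to a per-database metadata file.
// The hash function and 12-char truncation are assumptions for this
// sketch; see MetadataDBPath in internal/config/metadata.go.
func metadataDBPath(home, dbURL string) string {
	sum := sha256.Sum256([]byte(dbURL))
	short := hex.EncodeToString(sum[:])[:12] // short, stable identifier
	return filepath.Join(home, ".sqllm", "metadata", short+".db")
}

func main() {
	fmt.Println(metadataDBPath("/home/dev", "postgres://localhost/app_dev"))
	fmt.Println(metadataDBPath("/home/dev", "postgres://localhost/app_prod"))
}
```

Because the path is a pure function of the URL, two profiles pointing at the same database share one metadata store, while different databases can never collide.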
Text-to-SQL in real systems is hard for reasons beyond prompt writing:
- Natural language is ambiguous (user terms rarely match table names exactly).
- Real schemas are large and relationship-heavy (full-schema prompting is expensive and noisy).
- Join inference is difficult without explicit structural reasoning.
- LLM outputs can be syntactically valid but semantically wrong.
- Executing generated SQL safely requires guardrails.
sqllm addresses part of this by building and maintaining local schema metadata + embeddings, then retrieving relevant context for the model rather than prompting with the full schema every time.
- Multiple profiles under ~/.sqllm/profiles/.yaml
- Active profile tracked in ~/.sqllm/config.yaml
- Commands:
- sqllm init
- sqllm profile list
- sqllm profile use
- sqllm profile current
- sqllm profile update
- Per-table schema hashing (columns + foreign keys)
- Detects new/updated/unchanged/deleted tables
- Updates only changed/new entries in metadata store
- Metadata path derived from db-url hash via MetadataDBPath
- Store location: ~/.sqllm/metadata/.db
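The change-detection idea behind incremental scanning can be sketched as hashing a canonical serialization of each table's columns and foreign keys, then comparing against the stored digest. The struct shapes and the order-independent sorting below are illustrative assumptions, not the extractor's actual types:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Column and ForeignKey are illustrative shapes, not the project's types.
type Column struct{ Name, Type string }
type ForeignKey struct{ Column, RefTable, RefColumn string }

// schemaHash builds a canonical string from columns + foreign keys and
// hashes it, so any structural change yields a different digest.
func schemaHash(cols []Column, fks []ForeignKey) string {
	parts := make([]string, 0, len(cols)+len(fks))
	for _, c := range cols {
		parts = append(parts, "col:"+c.Name+":"+c.Type)
	}
	for _, fk := range fks {
		parts = append(parts, "fk:"+fk.Column+"->"+fk.RefTable+"."+fk.RefColumn)
	}
	sort.Strings(parts) // assumption: column reordering alone is not a change
	sum := sha256.Sum256([]byte(strings.Join(parts, "\n")))
	return hex.EncodeToString(sum[:])
}

func main() {
	before := schemaHash([]Column{{"id", "bigint"}}, nil)
	after := schemaHash([]Column{{"id", "bigint"}, {"email", "text"}}, nil)
	fmt.Println(before != after) // adding a column changes the hash
}
```

A table whose stored hash matches the freshly computed one is skipped; mismatches mark it updated, and missing rows on either side mark it new or deleted.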
- Ollama client (chat + embeddings)
- OpenRouter client (chat + embeddings)
- Intent generation is available through the same provider factory path used by ask/enrich
- Provider selected through runtime profile + llm.NewClient factory
- Tools exposed to the model:
- find_relevant_tables
- fetch_table_by_name
- search_by_label
- Max 5 agent iterations
- Context caps:
- up to 4 full-schema tables
- up to 8 total tables
- Retrieval fallback chain:
- sqlite-vec cosine distance query when available
- in-memory cosine similarity fallback
- lower-threshold + keyword fallback paths
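When sqlite-vec is unavailable, the in-memory fallback is essentially cosine similarity over the stored embedding vectors. A self-contained sketch (the store types, thresholds, and top-k value here are assumptions):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type scored struct {
	table string
	score float64
}

// rankTables is the fallback path: score every table embedding against
// the query embedding in memory and return the top-k table names.
func rankTables(query []float32, tables map[string][]float32, k int) []string {
	all := make([]scored, 0, len(tables))
	for name, emb := range tables {
		all = append(all, scored{name, cosine(query, emb)})
	}
	sort.Slice(all, func(i, j int) bool { return all[i].score > all[j].score })
	if k > len(all) {
		k = len(all)
	}
	out := make([]string, k)
	for i := 0; i < k; i++ {
		out[i] = all[i].table
	}
	return out
}

func main() {
	tables := map[string][]float32{
		"orders":    {1, 0},
		"customers": {0, 1},
	}
	fmt.Println(rankTables([]float32{0.9, 0.1}, tables, 1)) // [orders]
}
```

This O(n) scan over all embeddings is why the fallback can behave differently from sqlite-vec under large metadata volumes.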
- sqllm enrich populates intent only for tables missing intent
- list/search interactive flows support editing:
- table label
- per-column semantic hints
- table intent
- Intent edits trigger embedding regeneration
- sqllm ask --run executes generated SQL with safeguards:
- SELECT-only check
- forbidden keyword check
- EXPLAIN validation before execution
- self-correcting SQL repair loop when EXPLAIN fails (bounded attempts)
- tabular output, NULL handling, and 100-row cap
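The SELECT-only and forbidden-keyword gates reduce to a small pre-execution guard. A sketch, assuming an illustrative denylist (the actual keyword set in internal/cli may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// forbidden is an illustrative denylist; the real list may differ.
var forbidden = []string{"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE", "CREATE", "GRANT"}

// checkReadOnly rejects anything that is not a trimmed, case-insensitive
// SELECT, or that contains a write/DDL keyword anywhere in the statement.
func checkReadOnly(sql string) error {
	trimmed := strings.TrimSpace(sql)
	upper := strings.ToUpper(trimmed)
	if !strings.HasPrefix(upper, "SELECT") {
		return fmt.Errorf("only SELECT statements are allowed")
	}
	for _, kw := range forbidden {
		if strings.Contains(upper, kw) {
			return fmt.Errorf("forbidden keyword %q in query", kw)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkReadOnly("select id from users")) // <nil>
	fmt.Println(checkReadOnly("DROP TABLE users"))     // only SELECT statements are allowed
}
```

Note the substring check can false-positive (e.g. a column named `created_at` contains `CREATE`), which is exactly the keyword/shape-based limitation acknowledged in the limitations section.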
- sqllm eval runs dataset-based text-to-SQL evaluation on a generated e-commerce benchmark DB
- Evaluates three generation modes on the same cases:
- full_agent
- single_shot
- no_agent
- Reports per-mode metrics:
- execution success rate
- execution accuracy (EX)
- retrieval precision/recall (when retrieval is enabled)
- average latency
- average iteration count
- Eval setup now includes deterministic metadata enrichment:
- hardcoded high-quality table intents
- embedding generation for every eval table intent
- persistence into the same metadata store path used by normal retrieval flow
- LLM generation calls use retry with exponential backoff and jitter for transient failures
- Transient detection handles rate limits (429), HTTP 5xx, timeouts, and connection resets
- Non-transient errors fail fast
CLI Commands -> Runtime Resolver -> (LLM Client + Schema Store) -> Target DB
| Component | Responsibility | Key Files |
|---|---|---|
| CLI Root | Cobra command registration | internal/cli/root.go |
| Init/Profile | Profile lifecycle and switching | internal/cli/init.go, internal/cli/profile.go |
| Config Storage | Global active profile + profile files | internal/config/config.go |
| Metadata Path | Hash-based metadata DB path | internal/config/metadata.go |
| Runtime Resolver | Profile -> validated runtime settings | internal/runtime/runtime.go |
| Scan Pipeline | Schema extraction and incremental persistence | internal/cli/scan.go, internal/db/extractor.go |
| Metadata Store | SQLite persistence and retrieval APIs | internal/db/sqlite_store.go |
| LLM Factory | Provider-agnostic client creation | internal/llm/factory.go |
| Ask Orchestration | Mode-aware pipeline, agent loop, tool dispatch | internal/cli/ask.go, internal/cli/ask_pipeline.go, internal/cli/ask_helpers.go |
| Eval Pipeline | Eval DB setup, dataset execution, mode metrics | internal/cli/eval.go, internal/cli/eval_runner.go, internal/cli/eval_db.go |
| Service Layer | Table operations for list/search/edit | internal/service/ |
- Language: Go
- CLI: Cobra, Promptui
- Databases: PostgreSQL (lib/pq), SQLite (mattn/go-sqlite3)
- Local metadata: SQLite + optional sqlite-vec
- LLM providers: Ollama and OpenRouter
- Serialization/config: YAML (gopkg.in/yaml.v3)
- Logging: slog + tint
This section reflects current implementation behavior.
For ask, each iteration sends:
- system prompt (tool usage and SQL output constraints),
- user query,
- retrieved schema context (full + partial table views),
- optional system note from previous retrieval failures.
Flow per iteration:
- Generate model response with tool definitions.
- If tool calls are present, execute them and enrich context.
- If no tool calls, treat content as final SQL.
- Stop when SQL is produced or max iterations (5) is reached.
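The per-iteration flow above reduces to a bounded loop. This skeleton uses hypothetical `generate`/`runTool` interfaces and a simplified response type; the real orchestration lives in internal/cli/ask_pipeline.go:

```go
package main

import (
	"errors"
	"fmt"
)

// response is a hypothetical model reply: tool calls, or final SQL content.
type response struct {
	toolCalls []string
	content   string
}

const maxIterations = 5

// runAskLoop drives the bounded agent loop: execute tool calls while the
// model requests them, otherwise treat the content as the final SQL.
func runAskLoop(generate func(ctx []string) response, runTool func(name string) string) (string, error) {
	ctx := []string{"initial retrieval context"}
	for i := 0; i < maxIterations; i++ {
		resp := generate(ctx)
		if len(resp.toolCalls) == 0 {
			return resp.content, nil // no tool calls: final SQL
		}
		for _, call := range resp.toolCalls {
			ctx = append(ctx, runTool(call)) // enrich context for next turn
		}
	}
	return "", errors.New("no SQL produced within iteration cap")
}

func main() {
	turn := 0
	sql, err := runAskLoop(
		func(ctx []string) response {
			turn++
			if turn == 1 {
				return response{toolCalls: []string{"find_relevant_tables"}}
			}
			return response{content: "SELECT count(*) FROM orders"}
		},
		func(name string) string { return "schema for " + name },
	)
	fmt.Println(sql, err)
}
```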
Current safeguards in loop:
- Duplicate tool call detection within one ask execution.
- Unsupported tool name -> command error.
Current state of SQL validation and repair:
- Ask pipeline always attempts bounded EXPLAIN-driven SQL repair before optional execution.
- With --run, SQL also gets SELECT-only checks before final execution.
- There is still no semantic SQL correctness validator.
Current state of LLM retry and error handling:
- Generate calls use provider-aware retry with exponential backoff and jitter.
- Retries are limited to transient classes of failures (429/5xx/network timeout/reset).
- Retry behavior is currently focused on generation calls, not a full circuit-breaker strategy.
- Failures return user-facing command errors.
Current strategy is heuristic and bounded:
- 4 full-schema tables max
- 8 total tables max
- initial retrieval + iterative augmentation
- optional fallback retrieval paths when relevance is low
This controls prompt size but does not guarantee optimal context selection for every query.
ask --run adds execution safety gates:
- Query must start with SELECT (case-insensitive, trimmed).
- Query is blocked if it contains forbidden write/DDL keywords.
- EXPLAIN runs before actual query execution:
- SQLite: EXPLAIN QUERY PLAN
- PostgreSQL: EXPLAIN
- Execution output is tabular, includes NULL rendering, capped at 100 rows.
Important: this is a practical guardrail layer, not a full SQL security analysis engine.
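The EXPLAIN pre-check differs only in the statement prefix per dialect. A sketch of that dispatch; the driver names match the listed libraries (mattn/go-sqlite3, lib/pq), but the helper itself is illustrative:

```go
package main

import "fmt"

// explainStatement wraps a query in the dialect-appropriate EXPLAIN form:
// SQLite uses EXPLAIN QUERY PLAN, PostgreSQL plain EXPLAIN.
func explainStatement(driver, query string) (string, error) {
	switch driver {
	case "sqlite3":
		return "EXPLAIN QUERY PLAN " + query, nil
	case "postgres":
		return "EXPLAIN " + query, nil
	default:
		return "", fmt.Errorf("unsupported driver %q", driver)
	}
}

func main() {
	s, _ := explainStatement("sqlite3", "SELECT 1")
	fmt.Println(s) // EXPLAIN QUERY PLAN SELECT 1
}
```

Either form makes the target database parse and plan the query without executing it, which is what lets the repair loop catch invalid SQL before any rows are read.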
Typical developer workflow:
- Create one profile per environment/database:
- app_dev, app_staging, analytics_prod.
- Run scan after schema changes.
- Run enrich to fill missing semantic intents.
- Use ask for day-to-day query drafting.
- Use ask --run for read-only exploration when safe execution is desired.
- Switch environment context with sqllm profile use .
Practical benefit:
- Faster query authoring when schema is large or unfamiliar.
- Lower chance of selecting the wrong tables than manual guessing.
- Safer read-only execution path than copy-pasting generated SQL directly into prod sessions.
go build -o sqllm ./cmd/sqllm
sqllm init
Interactive prompts collect:
- profile name
- database URL
- provider (ollama/openrouter)
- provider-specific input
- ollama.base_url for ollama
- openrouter.api_key for openrouter
- models.intent
- models.sql
- models.embedding
sqllm scan
sqllm enrich
sqllm ask "how many orders last month?"
sqllm ask --run "top 10 customers by spend"
sqllm list --missing
sqllm search users
sqllm profile list
sqllm profile current
sqllm profile use mydb
sqllm profile update mydb
- EDITOR env var is used for metadata editing.
- If EDITOR is not set:
- Windows defaults to notepad
- non-Windows defaults to vim
- Eval harness is available, but benchmark quality is still early and not yet representative of production workloads.
- No hallucination detector beyond execution-time guards in --run.
- SQL repair is EXPLAIN-driven and bounded; it can still fail on deeper semantic issues.
- Retry strategy is heuristic and local (no global rate-limit coordination/circuit breaker).
- Current EX metric is execution-result based and can under-represent semantic equivalence in some SQL variants.
- Join-path inference via schema graph is not integrated into ask flow.
- export/import commands are stubs.
- Evaluation currently runs on one synthetic e-commerce dataset; broader domain benchmarks are still needed.
- Query safety checks are keyword/shape based, not AST-level enforcement.
- Tool-call dedup uses simple in-process keys; no persistence across runs.
What is implemented:
- Incremental scan avoids rewriting unchanged table metadata.
- Context caps limit prompt growth.
- Embeddings are persisted in the metadata store and reused.
- Dataset-driven eval command with per-mode comparison metrics (execution, EX, retrieval P/R, latency, iterations).
What is not yet available:
- No cross-domain benchmark suite beyond the current e-commerce eval dataset.
- No longitudinal trend tracking across commits/runs.
Interpretation: baseline quantification now exists, but broader and deeper measurement coverage is still needed.
- Expand evaluation datasets across additional domains and harder query patterns.
- Extend retry policy to include intent/embedding paths and smarter provider-specific classification.
- Improve SQL repair with richer repair prompts and semantic verification beyond EXPLAIN.
- Integrate schema graph for join-path guidance in ask orchestration.
- Add retrieval observability (per-query retrieved tables and scores).
- Add token/cost accounting for cloud provider usage.
- Add optional query/result caching with invalidation strategy.
- Implement export/import commands.
- Add safer SQL validation layer (parser/AST or allowlist policy engine).
- Add profile-level secrets handling improvements beyond plain YAML.
- Retrieval-first context building instead of full-schema prompting. Chosen: ask starts with semantic retrieval and only sends selected tables into the prompt. Why: reduces prompt size and avoids flooding the model with irrelevant schema. Trade-off: retrieval misses can hide needed tables, so downstream SQL quality depends on retrieval quality.
- Tool-calling agent loop instead of single-shot SQL generation. Chosen: bounded iterative loop (max 5 turns) with find_relevant_tables, fetch_table_by_name, and search_by_label. Why: lets the model request additional context incrementally before final SQL generation. Trade-off: higher latency and complexity versus one-shot prompting, with failure if SQL is not produced within the iteration cap.
- sqlite-vec path with in-memory cosine fallback. Chosen: use sqlite-vec when available; fall back to Go-based cosine similarity when the extension is unavailable. Why: keeps semantic retrieval performant on supported setups while preserving portability. Trade-off: the fallback path is less efficient and can behave differently under large metadata volumes.
- Per-database metadata isolation by hashed DB URL. Chosen: metadata store path is derived as ~/.sqllm/metadata/.db. Why: prevents schema/embedding leakage across environments and profiles. Trade-off: creates multiple local metadata files and requires separate scan/enrich maintenance per database.
- Heuristic context limits (4 full tables, 8 total tables). Chosen: hard caps on full and partial schema context in ask orchestration. Why: controls token usage and keeps prompts within practical model limits. Trade-off: relevant long-tail tables can be excluded in complex multi-join questions.
- Lightweight execution safety checks instead of full SQL parsing. Chosen: SELECT-only + forbidden-keyword checks + EXPLAIN pre-check in --run mode. Why: provides practical safety with low implementation overhead. Trade-off: checks are syntactic/heuristic, not AST-level or policy-complete enforcement.
text2sql/
├── cmd/sqllm/ # CLI entrypoint
├── internal/
│ ├── cli/ # Commands (ask, scan, enrich, list, search, profile, init)
│ ├── config/ # Active profile + profile storage
│ ├── runtime/ # Profile -> resolved runtime settings
│ ├── db/ # Extractor + SQLite metadata store
│ ├── llm/ # Provider interface + factory + clients
│ ├── service/ # Table service layer
│ └── graph/ # Schema graph utilities (not yet in ask flow)
├── pkg/logger/ # Logger setup
└── README.md
We evaluate the Text2SQL system across three modes:
- `full_agent`: iterative agent with retrieval and execution feedback
- `single_shot`: one-pass generation with retrieval
- `no_agent`: baseline LLM without retrieval or iteration
| Metric | full_agent | single_shot | no_agent |
|---|---|---|---|
| Total Queries | 40 | 40 | 40 |
| Execution Success | 29 | 10 | 30 |
| Execution Success Rate | 72.50% | 25.00% | 75.00% |
| EX Correct Count | 26 | 10 | 25 |
| EX Evaluated Count | 40 | 40 | 40 |
| EX Accuracy | 65.00% | 25.00% | 62.50% |
| Retrieval Evaluated Queries | 40 | 40 | 0 |
| Average Precision (Retrieval) | 0.409 | 0.118 | 0.000 |
| Average Recall (Retrieval) | 0.725 | 0.250 | 0.000 |
| Average Latency (ms) | 5765.77 | 1897.28 | 5739.18 |
| Average Iteration Count | 2.70 | 1.00 | 1.00 |
- `full_agent` improves EX accuracy by +40.00 points over `single_shot` (65.00% vs 25.00%).
- `full_agent` improves retrieval quality over `single_shot`:
  - precision: 0.409 vs 0.118 (~3.47x)
  - recall: 0.725 vs 0.250 (~2.90x)
- `no_agent` is close to `full_agent` on top-line execution/EX (75.00% and 62.50%), while `full_agent` remains slightly higher on EX (65.00%).
- EX at 65.00% indicates the agent loop is no longer only improving syntax/executability; it is frequently producing semantically correct SQL.
- The higher iteration count (2.70) is consistent with multi-step retrieval and refinement before finalizing SQL.
- Compared with `single_shot`, `full_agent` has much higher retrieval precision/recall (0.409/0.725 vs 0.118/0.250).
- Reasoning: iterative tool usage improves schema grounding and reduces under-context errors that commonly affect one-pass generation.
- `no_agent` achieves 75.00% execution success and 62.50% EX, close to `full_agent`.
- Reasoning: with clearer benchmark question wording, direct generation can often succeed without retrieval overhead.
- This also suggests the current dataset includes many queries where schema lookup is helpful but not always essential.
- `single_shot` is fastest (1897.28 ms) because it avoids iterative calls.
- `full_agent` and `no_agent` are similar in average latency (~5.7 s), with `full_agent` paying for extra iterations and tool orchestration.
| Aspect | full_agent | single_shot | no_agent |
|---|---|---|---|
| Accuracy | Best EX (65.00%) | Lowest EX (25.00%) | Near-best EX (62.50%) |
| Latency | Higher (5765.77 ms, iterative) | Lowest (1897.28 ms) | Similar to full_agent (5739.18 ms) |
| Robustness | Strong via retrieval + iterative correction | Weakest under schema ambiguity | Strong on this dataset, no retrieval |
| Simplicity | Highest complexity | Simplest runtime path | Simple generation path |
- `full_agent` is the most reliable choice for semantic correctness across this benchmark (highest EX).
- Retrieval quality gains are substantial and explain much of the gap between `full_agent` and `single_shot`.
- `no_agent` remains a viable baseline for clear/straightforward prompts, but lacks retrieval diagnostics and controllability.
- There is still a practical latency tradeoff: iterative grounding improves quality but costs time.
- Improve retrieval precision further to reduce extra context (better ranking and stronger filtering).
- Add query-plan-aware reranking to select among multiple valid SQL candidates.
- Optimize iteration policy (early-stop conditions and smarter tool call budgeting).
- Expand benchmarks with harder multi-join and long-tail schema cases to stress `no_agent` vs retrieval-first modes.
MIT — see LICENSE.