sqllm — Local Text-to-SQL CLI Agent

sqllm is a Go CLI that translates natural-language questions into SQL by combining:

  • profile-scoped runtime configuration,
  • local schema metadata and embeddings,
  • semantic table retrieval,
  • an LLM tool-calling loop,
  • and a built-in evaluation harness with baseline mode comparison.

Key differentiator: this is not a single-prompt text-to-SQL script that dumps an entire schema into one prompt. It is a retrieval-first system that incrementally builds context from a local metadata store before asking the model for SQL.


1. Overview

sqllm helps developers and analysts query PostgreSQL or SQLite without remembering table details.

Each profile maps to one database and has its own metadata cache at:

  • ~/.sqllm/metadata/<db-url-hash>.db

That keeps environments isolated (for example, dev vs prod) and avoids cross-database schema contamination.


2. Problem Statement

Text-to-SQL in real systems is hard for reasons beyond prompt writing:

  1. Natural language is ambiguous (user terms rarely match table names exactly).
  2. Real schemas are large and relationship-heavy (full-schema prompting is expensive and noisy).
  3. Join inference is difficult without explicit structural reasoning.
  4. LLM outputs can be syntactically valid but semantically wrong.
  5. Executing generated SQL safely requires guardrails.

sqllm addresses part of this by building and maintaining local schema metadata + embeddings, then retrieving relevant context for the model rather than prompting with the full schema every time.


3. Key Features

Profile-Based Runtime Configuration

  • Multiple profiles under ~/.sqllm/profiles/<name>.yaml
  • Active profile tracked in ~/.sqllm/config.yaml
  • Commands:
    • sqllm init
    • sqllm profile list
    • sqllm profile use
    • sqllm profile current
    • sqllm profile update

Incremental Schema Scanning

  • Per-table schema hashing (columns + foreign keys)
  • Detects new/updated/unchanged/deleted tables
  • Updates only changed/new entries in metadata store
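The change detection can be pictured as a stable digest per table. A minimal sketch, assuming a hypothetical TableSchema shape (the real types live in internal/db/extractor.go and may differ):

package metadata // sketch only; not the project's actual package layout

import (
    "crypto/sha256"
    "encoding/hex"
    "sort"
    "strings"
)

// TableSchema is a hypothetical shape used for illustration.
type TableSchema struct {
    Name        string
    Columns     []string // e.g. "id INTEGER NOT NULL"
    ForeignKeys []string // e.g. "user_id -> users.id"
}

// SchemaHash digests columns + foreign keys so a rescan can classify a
// table as new/updated/unchanged by comparing against the stored hash.
func SchemaHash(t TableSchema) string {
    cols := append([]string(nil), t.Columns...)
    fks := append([]string(nil), t.ForeignKeys...)
    sort.Strings(cols) // sort so cosmetic ordering changes do not look like schema changes
    sort.Strings(fks)
    sum := sha256.Sum256([]byte(strings.Join(cols, "\n") + "\x00" + strings.Join(fks, "\n")))
    return hex.EncodeToString(sum[:])
}

Only tables whose digest differs from the stored one are rewritten; deleted tables are those present in the store but absent from the scan.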

Per-Database Metadata Isolation

  • Metadata path derived from db-url hash via MetadataDBPath
  • Store location: ~/.sqllm/metadata/<db-url-hash>.db
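The isolation comes down to hashing the db-url into the file name. A sketch of the idea behind MetadataDBPath (the real function in internal/config/metadata.go may derive the hash differently):

package config // sketch; see internal/config/metadata.go for the real version

import (
    "crypto/sha256"
    "encoding/hex"
    "os"
    "path/filepath"
)

// metadataDBPath maps a database URL to its own metadata store so two
// profiles pointing at different databases never share schema or embeddings.
func metadataDBPath(dbURL string) (string, error) {
    home, err := os.UserHomeDir()
    if err != nil {
        return "", err
    }
    sum := sha256.Sum256([]byte(dbURL))
    name := hex.EncodeToString(sum[:8]) + ".db" // truncated hash is a sketch choice
    return filepath.Join(home, ".sqllm", "metadata", name), nil
}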

LLM Provider Abstraction

  • Ollama client (chat + embeddings)
  • OpenRouter client (chat + embeddings)
  • Intent generation is available through the same provider factory path used by ask/enrich
  • Provider selected through runtime profile + llm.NewClient factory

Semantic Retrieval + Tool-Calling Agent Loop

  • Tools exposed to the model:
    • find_relevant_tables
    • fetch_table_by_name
    • search_by_label
  • Max 5 agent iterations
  • Context caps:
    • up to 4 full-schema tables
    • up to 8 total tables
  • Retrieval fallback chain:
    • sqlite-vec cosine distance query when available
    • in-memory cosine similarity fallback (sketched after this list)
    • lower-threshold + keyword fallback paths
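When the sqlite-vec extension is unavailable, retrieval degrades to scoring embeddings in Go. A minimal sketch of that fallback (Scored and Rank are illustrative names):

package retrieval // sketch of the in-memory fallback path

import (
    "math"
    "sort"
)

// Cosine returns the cosine similarity of two equal-length vectors.
func Cosine(a, b []float32) float64 {
    if len(a) != len(b) {
        return 0
    }
    var dot, na, nb float64
    for i := range a {
        dot += float64(a[i]) * float64(b[i])
        na += float64(a[i]) * float64(a[i])
        nb += float64(b[i]) * float64(b[i])
    }
    if na == 0 || nb == 0 {
        return 0
    }
    return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type Scored struct {
    Table string
    Score float64
}

// Rank scores every stored table embedding against the query embedding;
// lowering threshold corresponds to the lower-threshold fallback path.
func Rank(query []float32, tables map[string][]float32, threshold float64) []Scored {
    var out []Scored
    for name, emb := range tables {
        if s := Cosine(query, emb); s >= threshold {
            out = append(out, Scored{Table: name, Score: s})
        }
    }
    sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
    return out
}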

Intent Enrichment and Interactive Metadata Editing

  • sqllm enrich populates intent only for tables that are missing it
  • list/search interactive flows support editing:
    • table label
    • per-column semantic hints
    • table intent
  • Intent edits trigger embedding regeneration

Optional Safe Execution

  • sqllm ask --run executes generated SQL with safeguards:
    • SELECT-only check
    • forbidden keyword check
    • EXPLAIN validation before execution
    • self-correcting SQL repair loop when EXPLAIN fails (bounded attempts)
    • tabular output, NULL handling, and 100-row cap

Evaluation Harness (Mode Comparison)

  • sqllm eval runs dataset-based text-to-SQL evaluation on a generated e-commerce benchmark DB
  • Evaluates three generation modes on the same cases:
    • full_agent
    • single_shot
    • no_agent
  • Reports per-mode metrics:
    • execution success rate
    • execution accuracy (EX)
    • retrieval precision/recall (when retrieval is enabled)
    • average latency
    • average iteration count
  • Eval setup now includes deterministic metadata enrichment:
    • hardcoded high-quality table intents
    • embedding generation for every eval table intent
    • persistence into the same metadata store path used by normal retrieval flow

Reliability Improvements

  • LLM generation calls use retry with exponential backoff and jitter for transient failures
  • Transient detection handles rate limits (429), HTTP 5xx, timeouts, and connection resets
  • Non-transient errors fail fast

4. Architecture & System Design

High-Level Architecture

CLI Commands -> Runtime Resolver -> (LLM Client + Schema Store) -> Target DB

Core Components

Component | Responsibility | Key Files
CLI Root | Cobra command registration | internal/cli/root.go
Init/Profile | Profile lifecycle and switching | internal/cli/init.go, internal/cli/profile.go
Config Storage | Global active profile + profile files | internal/config/config.go
Metadata Path | Hash-based metadata DB path | internal/config/metadata.go
Runtime Resolver | Profile -> validated runtime settings | internal/runtime/runtime.go
Scan Pipeline | Schema extraction and incremental persistence | internal/cli/scan.go, internal/db/extractor.go
Metadata Store | SQLite persistence and retrieval APIs | internal/db/sqlite_store.go
LLM Factory | Provider-agnostic client creation | internal/llm/factory.go
Ask Orchestration | Mode-aware pipeline, agent loop, tool dispatch | internal/cli/ask.go, internal/cli/ask_pipeline.go, internal/cli/ask_helpers.go
Eval Pipeline | Eval DB setup, dataset execution, mode metrics | internal/cli/eval.go, internal/cli/eval_runner.go, internal/cli/eval_db.go
Service Layer | Table operations for list/search/edit | internal/service/

5. Tech Stack

  • Language: Go
  • CLI: Cobra, Promptui
  • Databases: PostgreSQL (lib/pq), SQLite (mattn/go-sqlite3)
  • Local metadata: SQLite + optional sqlite-vec
  • LLM providers: Ollama and OpenRouter
  • Serialization/config: YAML (gopkg.in/yaml.v3)
  • Logging: slog + tint

6. LLM and Agent Workflow (Implementation Detail)

This section reflects current implementation behavior.

Prompt Structure

For ask, each iteration sends:

  1. system prompt (tool usage and SQL output constraints),
  2. user query,
  3. retrieved schema context (full + partial table views),
  4. optional system note from previous retrieval failures.

Tool-Calling Loop

Flow per iteration:

  1. Generate model response with tool definitions.
  2. If tool calls are present, execute them and enrich context.
  3. If no tool calls, treat content as final SQL.
  4. Stop when SQL is produced or max iterations (5) is reached.

Current safeguards in loop:

  • Duplicate tool call detection within one ask execution.
  • Unsupported tool name -> command error.
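Put together, the loop and its safeguards reduce to roughly this shape (the types and callbacks here are illustrative, not the actual internal API):

package agent // illustrative sketch of the ask loop

import (
    "errors"
    "fmt"
)

const maxIterations = 5

type ToolCall struct{ Name, Args string }

type Response struct {
    Content   string
    ToolCalls []ToolCall
}

// askLoop mirrors the steps above: generate, dispatch tool calls to grow
// context, or accept content as final SQL, bounded by maxIterations.
func askLoop(generate func(ctx string) (Response, error), runTool func(ToolCall) (string, error)) (string, error) {
    ctx := "" // retrieved schema context, grown each turn
    seen := map[string]bool{}
    for i := 0; i < maxIterations; i++ {
        resp, err := generate(ctx)
        if err != nil {
            return "", err
        }
        if len(resp.ToolCalls) == 0 {
            return resp.Content, nil // no tool calls: content is the final SQL
        }
        for _, tc := range resp.ToolCalls {
            key := tc.Name + "|" + tc.Args
            if seen[key] {
                continue // duplicate tool call within this ask execution
            }
            seen[key] = true
            out, err := runTool(tc) // unsupported tool names surface as errors here
            if err != nil {
                return "", fmt.Errorf("tool %s: %w", tc.Name, err)
            }
            ctx += "\n" + out
        }
    }
    return "", errors.New("no SQL produced within the iteration cap")
}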

Hallucination / Invalid SQL Handling

Current state:

  • Ask pipeline always attempts bounded EXPLAIN-driven SQL repair before optional execution.
  • With --run, SQL also gets SELECT-only checks before final execution.
  • There is still no semantic SQL correctness validator.

Retry / Failure Handling

Current state:

  • Generate calls use provider-aware retry with exponential backoff and jitter.
  • Retries are limited to transient classes of failures (429/5xx/network timeout/reset).
  • Retry behavior is currently focused on generation calls, not a full circuit-breaker strategy.
  • Failures return user-facing command errors.
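A sketch of that policy (the transient classifier is simplified here; the real one also inspects provider-specific responses):

package llm // sketch of the generation retry path

import (
    "errors"
    "math/rand"
    "net"
    "time"
)

// transient reports whether a failure is worth retrying: rate limits,
// server errors, and network timeouts. Everything else fails fast.
func transient(err error, status int) bool {
    if status == 429 || status >= 500 {
        return true
    }
    var ne net.Error
    return errors.As(err, &ne) && ne.Timeout()
}

// withRetry wraps a generation call with exponential backoff plus jitter.
func withRetry(attempts int, call func() (string, int, error)) (string, error) {
    backoff := 500 * time.Millisecond
    for i := 0; ; i++ {
        out, status, err := call()
        if err == nil {
            return out, nil
        }
        if i == attempts-1 || !transient(err, status) {
            return "", err // exhausted or non-transient: fail fast
        }
        time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff)))) // jitter on top of the base delay
        backoff *= 2
    }
}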

Context Window Strategy

Current strategy is heuristic and bounded:

  • 4 full-schema tables max
  • 8 total tables max
  • initial retrieval + iterative augmentation
  • optional fallback retrieval paths when relevance is low

This controls prompt size but does not guarantee optimal context selection for every query.
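In effect the caps are a small admission policy applied while merging retrieved tables into the prompt. A minimal sketch (names are illustrative):

package contextcap // sketch of the context cap policy

const (
    maxFullTables  = 4 // tables sent with full schema
    maxTotalTables = 8 // full + partial tables combined
)

// admit reports whether another retrieved table may enter the prompt,
// and whether it still qualifies for a full-schema slot.
func admit(fullCount, totalCount int) (full, ok bool) {
    if totalCount >= maxTotalTables {
        return false, false // context is full; the table is dropped
    }
    return fullCount < maxFullTables, true
}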


7. Safety and Execution Model

ask --run adds execution safety gates:

  1. Query must start with SELECT (case-insensitive, trimmed).
  2. Query is blocked if it contains forbidden write/DDL keywords.
  3. EXPLAIN runs before actual query execution:
    • SQLite: EXPLAIN QUERY PLAN
    • PostgreSQL: EXPLAIN
  4. Execution output is tabular, includes NULL rendering, capped at 100 rows.
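Gates 1-3 condense to roughly the following (the keyword list is abbreviated, and the real checks in the ask pipeline may differ):

package safety // condensed sketch of the --run gates

import (
    "database/sql"
    "fmt"
    "strings"
)

// Abbreviated; the real list also covers DDL and other write keywords.
// Substring matching is intentionally naive here, as with any keyword guard.
var forbidden = []string{"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE"}

// guard applies the SELECT-only and forbidden-keyword checks, then
// dry-runs the query with the dialect's EXPLAIN form before execution.
func guard(db *sql.DB, dialect, query string) error {
    q := strings.TrimSpace(query)
    upper := strings.ToUpper(q)
    if !strings.HasPrefix(upper, "SELECT") {
        return fmt.Errorf("only SELECT queries may be executed")
    }
    for _, kw := range forbidden {
        if strings.Contains(upper, kw) {
            return fmt.Errorf("forbidden keyword: %s", kw)
        }
    }
    explain := "EXPLAIN "
    if dialect == "sqlite" {
        explain = "EXPLAIN QUERY PLAN "
    }
    _, err := db.Exec(explain + q) // validate the plan before running the query itself
    return err
}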

Important: this is a practical guardrail layer, not a full SQL security analysis engine.


8. Real-World Use Case

Typical developer workflow:

  1. Create one profile per environment/database:
    • app_dev, app_staging, analytics_prod.
  2. Run scan after schema changes.
  3. Run enrich to fill missing semantic intents.
  4. Use ask for day-to-day query drafting.
  5. Use ask --run for read-only exploration when safe execution is desired.
  6. Switch environment context with sqllm profile use <profile>.

Practical benefit:

  • Faster query authoring when schema is large or unfamiliar.
  • Lower chance of selecting the wrong tables than manual guessing.
  • Safer read-only execution path than copy-pasting generated SQL directly into prod sessions.

9. Setup & Usage

Build

go build -o sqllm ./cmd/sqllm

Initialize Profile

sqllm init

Interactive prompts collect:

  • profile name
  • database URL
  • provider (ollama/openrouter)
  • provider-specific input
    • ollama.base_url for ollama
    • openrouter.api_key for openrouter
  • models.intent
  • models.sql
  • models.embedding
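A resulting profile file might look like the following (field names are illustrative; check the YAML that init actually writes):

# ~/.sqllm/profiles/app_dev.yaml (illustrative layout)
name: app_dev
db_url: postgres://user:pass@localhost:5432/app
provider: ollama
ollama:
  base_url: http://localhost:11434
models:
  intent: llama3.1
  sql: qwen2.5-coder
  embedding: nomic-embed-text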

Core Commands

sqllm scan
sqllm enrich
sqllm ask "how many orders last month?"
sqllm ask --run "top 10 customers by spend"
sqllm list --missing
sqllm search users

Profile Commands

sqllm profile list
sqllm profile current
sqllm profile use mydb
sqllm profile update mydb

Editor

  • EDITOR env var is used for metadata editing.
  • If EDITOR is not set:
    • Windows defaults to notepad
    • non-Windows defaults to vim
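The fallback is equivalent to this sketch:

package edit // sketch of the editor fallback

import (
    "os"
    "runtime"
)

// resolveEditor honors $EDITOR, then falls back per platform.
func resolveEditor() string {
    if ed := os.Getenv("EDITOR"); ed != "" {
        return ed
    }
    if runtime.GOOS == "windows" {
        return "notepad"
    }
    return "vim"
}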

10. Limitations (Current State)

  1. Eval harness is available, but benchmark quality is still early and not yet representative of production workloads.
  2. No hallucination detector beyond execution-time guards in --run.
  3. SQL repair is EXPLAIN-driven and bounded; it can still fail on deeper semantic issues.
  4. Retry strategy is heuristic and local (no global rate-limit coordination/circuit breaker).
  5. Current EX metric is execution-result based and can under-represent semantic equivalence in some SQL variants.
  6. Join-path inference via schema graph is not integrated into ask flow.
  7. export/import commands are stubs.
  8. Evaluation currently runs on one synthetic e-commerce dataset; broader domain benchmarks are still needed.
  9. Query safety checks are keyword/shape based, not AST-level enforcement.
  10. Tool-call dedup uses simple in-process keys; no persistence across runs.

11. Performance Notes (Evidence-Based)

What is implemented:

  • Incremental scan avoids rewriting unchanged table metadata.
  • Context caps limit prompt growth.
  • Embeddings are persisted in the metadata store and reused.
  • Dataset-driven eval command with per-mode comparison metrics (execution, EX, retrieval P/R, latency, iterations).

What is not yet available:

  • No cross-domain benchmark suite beyond the current e-commerce eval dataset.
  • No longitudinal trend tracking across commits/runs.

Interpretation: baseline quantification now exists, but broader and deeper measurement coverage is still needed.


12. Future Improvements

  1. Expand evaluation datasets across additional domains and harder query patterns.
  2. Extend retry policy to include intent/embedding paths and smarter provider-specific classification.
  3. Improve SQL repair with richer repair prompts and semantic verification beyond EXPLAIN.
  4. Integrate schema graph for join-path guidance in ask orchestration.
  5. Add retrieval observability (per-query retrieved tables and scores).
  6. Add token/cost accounting for cloud provider usage.
  7. Add optional query/result caching with invalidation strategy.
  8. Implement export/import commands.
  9. Add safer SQL validation layer (parser/AST or allowlist policy engine).
  10. Add profile-level secrets handling improvements beyond plain YAML.

13. Design Decisions & Trade-offs

  1. Retrieval-first context building instead of full-schema prompting
     Chosen: ask starts with semantic retrieval and only sends selected tables into the prompt.
     Why: reduces prompt size and avoids flooding the model with irrelevant schema.
     Trade-off: retrieval misses can hide needed tables, so downstream SQL quality depends on retrieval quality.

  2. Tool-calling agent loop instead of single-shot SQL generation
     Chosen: bounded iterative loop (max 5 turns) with find_relevant_tables, fetch_table_by_name, and search_by_label.
     Why: lets the model request additional context incrementally before final SQL generation.
     Trade-off: higher latency and complexity versus one-shot prompting, with failure if SQL is not produced within the iteration cap.

  3. sqlite-vec path with in-memory cosine fallback
     Chosen: use sqlite-vec when available; fall back to Go-based cosine similarity when the extension is unavailable.
     Why: keeps semantic retrieval performant on supported setups while preserving portability.
     Trade-off: the fallback path is less efficient and can behave differently under large metadata volumes.

  4. Per-database metadata isolation by hashed DB URL
     Chosen: the metadata store path is derived as ~/.sqllm/metadata/<db-url-hash>.db.
     Why: prevents schema/embedding leakage across environments and profiles.
     Trade-off: creates multiple local metadata files and requires separate scan/enrich maintenance per database.

  5. Heuristic context limits (4 full tables, 8 total tables)
     Chosen: hard caps on full and partial schema context in ask orchestration.
     Why: controls token usage and keeps prompts within practical model limits.
     Trade-off: relevant long-tail tables can be excluded in complex multi-join questions.

  6. Lightweight execution safety checks instead of full SQL parsing
     Chosen: SELECT-only + forbidden-keyword checks + EXPLAIN pre-check in --run mode.
     Why: provides practical safety with low implementation overhead.
     Trade-off: checks are syntactic/heuristic, not AST-level or policy-complete enforcement.


14. Project Structure

text2sql/
├── cmd/sqllm/                # CLI entrypoint
├── internal/
│   ├── cli/                  # Commands (ask, scan, enrich, list, search, profile, init)
│   ├── config/               # Active profile + profile storage
│   ├── runtime/              # Profile -> resolved runtime settings
│   ├── db/                   # Extractor + SQLite metadata store
│   ├── llm/                  # Provider interface + factory + clients
│   ├── service/              # Table service layer
│   └── graph/                # Schema graph utilities (not yet in ask flow)
├── pkg/logger/               # Logger setup
└── README.md

Performance Benchmark

We evaluate the Text2SQL system across three modes:

  • full_agent: iterative agent with retrieval and execution feedback
  • single_shot: one-pass generation with retrieval
  • no_agent: baseline LLM without retrieval or iteration

Summary Metrics

Metric | full_agent | single_shot | no_agent
Total Queries | 40 | 40 | 40
Execution Success | 29 | 10 | 30
Execution Success Rate | 72.50% | 25.00% | 75.00%
EX Correct Count | 26 | 10 | 25
EX Evaluated Count | 40 | 40 | 40
EX Accuracy | 65.00% | 25.00% | 62.50%
Retrieval Evaluated Queries | 40 | 40 | 0
Average Precision (Retrieval) | 0.409 | 0.118 | 0.000
Average Recall (Retrieval) | 0.725 | 0.250 | 0.000
Average Latency (ms) | 5765.77 | 1897.28 | 5739.18
Average Iteration Count | 2.70 | 1.00 | 1.00

Key Results

  • full_agent improves EX accuracy by +40.00 points over single_shot (65.00% vs 25.00%).
  • full_agent improves retrieval quality over single_shot:
    • precision: 0.409 vs 0.118 (~3.47x)
    • recall: 0.725 vs 0.250 (~2.90x)
  • no_agent is close to full_agent on top-line execution/EX (75.00% and 62.50%), while full_agent remains slightly higher on EX (65.00%).

Observations

1. full_agent now delivers strong semantic correctness

  • EX at 65.00% indicates the agent loop is no longer only improving syntax/executability; it is frequently producing semantically correct SQL.
  • The higher iteration count (2.70) is consistent with multi-step retrieval and refinement before finalizing SQL.

2. Retrieval materially helps versus one-pass retrieval mode

  • Compared with single_shot, full_agent has much higher retrieval precision/recall (0.409/0.725 vs 0.118/0.250).
  • Reasoning: iterative tool usage improves schema grounding and reduces under-context errors that commonly affect one-pass generation.

3. Why no_agent stays competitive

  • no_agent achieves 75.00% execution success and 62.50% EX, close to full_agent.
  • Reasoning: with clearer benchmark question wording, direct generation can often succeed without retrieval overhead.
  • This also suggests the current dataset includes many queries where schema lookup is helpful but not always essential.

4. Latency profile reflects control-loop overhead

  • single_shot is fastest (1897.28 ms) because it avoids iterative calls.
  • full_agent and no_agent are similar in average latency (~5.7s), with full_agent paying for extra iterations and tool orchestration.

Tradeoffs

Aspect | full_agent | single_shot | no_agent
Accuracy | Best EX (65.00%) | Lowest EX (25.00%) | Near-best EX (62.50%)
Latency | Higher (5765.77 ms, iterative) | Lowest (1897.28 ms) | Similar to full_agent (5739.18 ms)
Robustness | Strong via retrieval + iterative correction | Weakest under schema ambiguity | Strong on this dataset, no retrieval
Simplicity | Highest complexity | Simplest runtime path | Simple generation path

Takeaways

  • full_agent is the most reliable choice for semantic correctness across this benchmark (highest EX).
  • Retrieval quality gains are substantial and explain much of the gap between full_agent and single_shot.
  • no_agent remains a viable baseline for clear/straightforward prompts, but lacks retrieval diagnostics and controllability.
  • There is still a practical latency tradeoff: iterative grounding improves quality but costs time.

Future Work

  • Improve retrieval precision further to reduce extra context (better ranking and stronger filtering).
  • Add query-plan-aware reranking to select among multiple valid SQL candidates.
  • Optimize iteration policy (early-stop conditions and smarter tool call budgeting).
  • Expand benchmarks with harder multi-join and long-tail schema cases to stress no_agent vs retrieval-first modes.

License

MIT — see LICENSE.
