# langextract-rs

A Rust port of Google's langextract — a tool for extracting structured information from unstructured text using LLMs.

Every extraction is mapped to its exact location in the source text, enabling traceability and verification. Works with OpenAI-compatible APIs and local models (Ollama, vLLM, llama.cpp).

## Features
- Source grounding — maps extractions to character-level positions in the original text
- Schema-based or example-based extraction — define fields with a YAML schema or provide few-shot examples
- Auto schema suggestion — LLM-powered schema generation from sample files
- File cataloging — batch-process folders into searchable JSONL metadata catalogs
- Document redaction — replace sensitive entities with anonymous placeholders (with recovery via rehydration)
- Privacy mode — `--no-text` flag excludes source text from output
- Local LLM support — works with any OpenAI-compatible endpoint
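The source-grounding guarantee is mechanically checkable: slicing the input at `char_interval` must reproduce `extraction_text` exactly. A minimal sketch in Python, using the output format shown later in this README:

```python
import json

# Example output in the format shown later in this README.
result = json.loads("""
{
  "extractions": [
    {
      "extraction_class": "patient",
      "extraction_text": "John Smith",
      "char_interval": { "start": 0, "end": 10 },
      "alignment_status": "exact"
    }
  ]
}
""")

source = "John Smith was diagnosed with type 2 diabetes and hypertension."

# Grounding contract: the char_interval slice of the source
# must equal the extracted text.
for ex in result["extractions"]:
    span = ex["char_interval"]
    assert source[span["start"]:span["end"]] == ex["extraction_text"]
```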
## Installation

```sh
cargo install --path .
```

Or build from source:

```sh
git clone https://github.com/xuy/langextract-rs.git
cd langextract-rs
cargo build --release
# Binary at target/release/langextract
```

## Quick start

Create a config file:

```yaml
# config.yaml
model: gpt-4o-mini
api_key: ${OPENAI_API_KEY}
temperature: 0.0
prompt_description: |
  Extract patient names and medical conditions from the clinical note.
examples:
  - text: "John Smith was diagnosed with type 2 diabetes and hypertension."
    extractions:
      - extraction_class: patient
        extraction_text: John Smith
      - extraction_class: condition
        extraction_text: type 2 diabetes
      - extraction_class: condition
        extraction_text: hypertension
```

For local models, set `base_url`:

```yaml
model: llama3
api_key: not-needed
base_url: http://localhost:11434/v1
```

Run the extraction:

```sh
langextract extract --config config.yaml --input document.txt
```

Output is JSON with character-level alignment:
```json
{
  "extractions": [
    {
      "extraction_class": "patient",
      "extraction_text": "John Smith",
      "char_interval": { "start": 0, "end": 10 },
      "alignment_status": "exact"
    }
  ]
}
```

## Commands

### extract

Extract structured data from a file or folder.
```sh
# Single file (example-based)
langextract extract --config config.yaml --input file.txt

# Single file (schema-based)
langextract extract --config api-config.yaml --schema schema.yaml --input file.txt

# Entire folder
langextract extract --config config.yaml --folder ./documents --extensions txt,md

# Privacy mode (omit source text from output)
langextract extract --config config.yaml --input file.txt --no-text
```

### suggest-schema

Auto-generate an extraction schema from sample files.
```sh
langextract suggest-schema \
  --config api-config.yaml \
  --folder ./samples \
  --purpose "organize invoices by date and vendor"
```

### catalog

Build a searchable metadata catalog from a folder of documents.
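The catalog is plain JSONL (one JSON object per document), so it is easy to filter with standard tools once built. A sketch of reading such a catalog in Python (the field names here are illustrative assumptions, not the tool's actual output):

```python
import json

# Illustrative catalog.jsonl contents: one JSON object per line.
# Field names are assumptions for this sketch.
lines = [
    '{"file": "inv-001.txt", "vendor": "Acme", "amount": 120.5}',
    '{"file": "inv-002.txt", "vendor": "Globex", "amount": 80.0}',
]

# Parse each line independently, as with any JSONL file.
entries = [json.loads(line) for line in lines]

# Filter: all files from a given vendor.
acme = [e["file"] for e in entries if e["vendor"] == "Acme"]
assert acme == ["inv-001.txt"]
```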
```sh
langextract catalog \
  --config api-config.yaml \
  --schema file-organizer-schema.yaml \
  --folder ./documents \
  --output catalog.jsonl
```

### redact

Anonymize sensitive entities in a document.
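Conceptually, redaction swaps each detected entity for a stable placeholder and records the mapping in an entity map; rehydration applies that mapping in reverse. A minimal sketch of the round trip (the placeholder style and map format are assumptions, not the tool's actual `entity_map.json` format):

```python
# Illustrative redact/rehydrate round trip.
text = "John Smith was diagnosed with type 2 diabetes."

# Hypothetical entity map: placeholder -> original value.
entity_map = {
    "[PATIENT_1]": "John Smith",
    "[CONDITION_1]": "type 2 diabetes",
}

# Redact: replace originals with placeholders.
redacted = text
for placeholder, original in entity_map.items():
    redacted = redacted.replace(original, placeholder)

# Rehydrate: apply the map in reverse to restore the original.
restored = redacted
for placeholder, original in entity_map.items():
    restored = restored.replace(placeholder, original)

assert redacted == "[PATIENT_1] was diagnosed with [CONDITION_1]."
assert restored == text
```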
```sh
langextract redact \
  --config api-config.yaml \
  --preset legal \
  --input contract.txt \
  --map entity_map.json \
  --output redacted.txt
```

Built-in presets: `legal`, `medical`, `finance`. Custom presets are supported via `--preset-file`.

### rehydrate

Restore original values from a redacted document.
```sh
langextract rehydrate \
  --map entity_map.json \
  --input redacted.txt \
  --output restored.txt
```

## Configuration

| Field | Required | Description |
|---|---|---|
| `model` | Yes | Model name (e.g., `gpt-4o-mini`, `llama3`) |
| `api_key` | Yes | API key or `${ENV_VAR}` reference |
| `base_url` | No | Endpoint URL (for local models) |
| `temperature` | No | Sampling temperature (default: model default) |
| `prompt_description` | Yes | Extraction instructions for the LLM |
| `examples` | Yes | Few-shot examples (can be empty if using `--schema`) |
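The `${ENV_VAR}` syntax keeps secrets out of the config file. A sketch of how such references are typically resolved (illustrative; the tool's actual parsing may differ):

```python
import os
import re

def resolve_env_refs(value: str) -> str:
    """Replace ${VAR} references with environment values (sketch)."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

# Example: the config value ${OPENAI_API_KEY} picks up the env var.
os.environ["OPENAI_API_KEY"] = "sk-example"
assert resolve_env_refs("${OPENAI_API_KEY}") == "sk-example"
```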
## Schema format

```yaml
name: invoice_fields
purpose: Extract key invoice information
fields:
  - name: vendor
    description: Company or person issuing the invoice
    field_type: string
  - name: amount
    description: Total amount due
    field_type: number
  - name: line_items
    description: List of items or services
    field_type: list
    required: false
  - name: due_date
    description: Payment due date (YYYY-MM-DD)
    field_type: date
    required: false
```

Supported field types: `string`, `number`, `date`, `list`.
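A schema like the one above can also serve as a lightweight validator for extracted records. A sketch of the idea (illustrative only; the tool's own validation may differ):

```python
# Type checks for the four supported field types. The "date" check
# here is a simplistic YYYY-MM-DD shape test for the sketch.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)),
    "date": lambda v: isinstance(v, str) and len(v.split("-")) == 3,
    "list": lambda v: isinstance(v, list),
}

fields = [
    {"name": "vendor", "field_type": "string", "required": True},
    {"name": "amount", "field_type": "number", "required": True},
    {"name": "due_date", "field_type": "date", "required": False},
]

def validate(record, fields):
    """Check a record against the schema's field list (sketch)."""
    for f in fields:
        if f["name"] not in record:
            if f.get("required"):
                return False  # missing required field
            continue
        if not CHECKS[f["field_type"]](record[f["name"]]):
            return False  # wrong type
    return True

assert validate({"vendor": "Acme", "amount": 120.5}, fields)
assert not validate({"amount": "oops"}, fields)  # vendor missing
```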
## Differences from langextract

This is a standalone CLI tool, not a library. Key differences:

- Single binary — no Python runtime or dependencies needed
- CLI-first — designed as a command-line tool rather than a Python API
- Redaction built in — includes `redact` and `rehydrate` commands for document anonymization
- No visualization — does not generate HTML visualization files (outputs JSON/JSONL)

## License

Apache-2.0 — same as the original project.