# langextract-rs

A Rust port of Google's langextract — a tool for extracting structured information from unstructured text using LLMs.

Every extraction is mapped to its exact location in the source text, enabling traceability and verification. Works with OpenAI-compatible APIs and local models (Ollama, vLLM, llama.cpp).

## Features
- Source grounding — maps extractions to character-level positions in the original text
- Schema-based or example-based extraction — define fields with a YAML schema or provide few-shot examples
- Auto schema suggestion — LLM-powered schema generation from sample files
- File cataloging — batch-process folders into searchable JSONL metadata catalogs
- Document redaction — replace sensitive entities with anonymous placeholders (with recovery via rehydration)
- Privacy mode — `--no-text` flag excludes source text from output
- Local LLM support — works with any OpenAI-compatible endpoint
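The source-grounding guarantee is mechanically checkable: slicing the input at `char_interval` must reproduce `extraction_text` exactly. A minimal sketch in Python, using the output format shown later in this README:

```python
import json

# Example output in the format shown later in this README.
result = json.loads("""
{
  "extractions": [
    {
      "extraction_class": "patient",
      "extraction_text": "John Smith",
      "char_interval": { "start": 0, "end": 10 },
      "alignment_status": "exact"
    }
  ]
}
""")

source = "John Smith was diagnosed with type 2 diabetes and hypertension."

# Grounding contract: the char_interval slice of the source
# must equal the extracted text.
for ex in result["extractions"]:
    span = ex["char_interval"]
    assert source[span["start"]:span["end"]] == ex["extraction_text"]
```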
## Installation

```sh
cargo install --path .
```

Or build from source:

```sh
git clone https://github.com/xuy/langextract-rs.git
cd langextract-rs
cargo build --release
# Binary at target/release/langextract
```

## Quick start

Create a config file:

```yaml
# config.yaml
model: gpt-4o-mini
api_key: ${OPENAI_API_KEY}
temperature: 0.0
prompt_description: |
  Extract patient names and medical conditions from the clinical note.
examples:
  - text: "John Smith was diagnosed with type 2 diabetes and hypertension."
    extractions:
      - extraction_class: patient
        extraction_text: John Smith
      - extraction_class: condition
        extraction_text: type 2 diabetes
      - extraction_class: condition
        extraction_text: hypertension
```

For local models, set `base_url`:

```yaml
model: llama3
api_key: not-needed
base_url: http://localhost:11434/v1
```

Run the extraction:

```sh
langextract extract --config config.yaml --input document.txt
```

Output is JSON with character-level alignment:
```json
{
  "extractions": [
    {
      "extraction_class": "patient",
      "extraction_text": "John Smith",
      "char_interval": { "start": 0, "end": 10 },
      "alignment_status": "exact"
    }
  ]
}
```

## Commands

### extract

Extract structured data from a file or folder.
```sh
# Single file (example-based)
langextract extract --config config.yaml --input file.txt

# Single file (schema-based)
langextract extract --config api-config.yaml --schema schema.yaml --input file.txt

# Entire folder
langextract extract --config config.yaml --folder ./documents --extensions txt,md

# Privacy mode (omit source text from output)
langextract extract --config config.yaml --input file.txt --no-text
```

### suggest-schema

Auto-generate an extraction schema from sample files.
```sh
langextract suggest-schema \
  --config api-config.yaml \
  --folder ./samples \
  --purpose "organize invoices by date and vendor"
```

### catalog

Build a searchable metadata catalog from a folder of documents.
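The catalog is plain JSONL (one JSON object per document), so it is easy to filter with standard tools once built. A sketch of reading such a catalog in Python (the field names here are illustrative assumptions, not the tool's actual output):

```python
import json

# Illustrative catalog.jsonl contents: one JSON object per line.
# Field names are assumptions for this sketch.
lines = [
    '{"file": "inv-001.txt", "vendor": "Acme", "amount": 120.5}',
    '{"file": "inv-002.txt", "vendor": "Globex", "amount": 80.0}',
]

# Parse each line independently, as with any JSONL file.
entries = [json.loads(line) for line in lines]

# Filter: all files from a given vendor.
acme = [e["file"] for e in entries if e["vendor"] == "Acme"]
assert acme == ["inv-001.txt"]
```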
```sh
langextract catalog \
  --config api-config.yaml \
  --schema file-organizer-schema.yaml \
  --folder ./documents \
  --output catalog.jsonl
```

### redact

Anonymize sensitive entities in a document.
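Conceptually, redaction swaps each detected entity for a stable placeholder and records the mapping in an entity map; rehydration applies that mapping in reverse. A minimal sketch of the round trip (the placeholder style and map format are assumptions, not the tool's actual `entity_map.json` format):

```python
# Illustrative redact/rehydrate round trip.
text = "John Smith was diagnosed with type 2 diabetes."

# Hypothetical entity map: placeholder -> original value.
entity_map = {
    "[PATIENT_1]": "John Smith",
    "[CONDITION_1]": "type 2 diabetes",
}

# Redact: replace originals with placeholders.
redacted = text
for placeholder, original in entity_map.items():
    redacted = redacted.replace(original, placeholder)

# Rehydrate: apply the map in reverse to restore the original.
restored = redacted
for placeholder, original in entity_map.items():
    restored = restored.replace(placeholder, original)

assert redacted == "[PATIENT_1] was diagnosed with [CONDITION_1]."
assert restored == text
```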
```sh
langextract redact \
  --config api-config.yaml \
  --preset legal \
  --input contract.txt \
  --map entity_map.json \
  --output redacted.txt
```

Built-in presets: `legal`, `medical`, `finance`. Custom presets are supported via `--preset-file`.

### rehydrate

Restore original values from a redacted document.
```sh
langextract rehydrate \
  --map entity_map.json \
  --input redacted.txt \
  --output restored.txt
```

## Configuration

| Field | Required | Description |
|---|---|---|
| `model` | Yes | Model name (e.g., `gpt-4o-mini`, `llama3`) |
| `api_key` | Yes | API key or `${ENV_VAR}` reference |
| `base_url` | No | Endpoint URL (for local models) |
| `temperature` | No | Sampling temperature (default: model default) |
| `prompt_description` | Yes | Extraction instructions for the LLM |
| `examples` | Yes | Few-shot examples (can be empty if using `--schema`) |
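The `${ENV_VAR}` syntax keeps secrets out of the config file. A sketch of how such references are typically resolved (illustrative; the tool's actual parsing may differ):

```python
import os
import re

def resolve_env_refs(value: str) -> str:
    """Replace ${VAR} references with environment values (sketch)."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda m: os.environ.get(m.group(1), ""),
        value,
    )

# Example: the config value ${OPENAI_API_KEY} picks up the env var.
os.environ["OPENAI_API_KEY"] = "sk-example"
assert resolve_env_refs("${OPENAI_API_KEY}") == "sk-example"
```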
## Schema format

```yaml
name: invoice_fields
purpose: Extract key invoice information
fields:
  - name: vendor
    description: Company or person issuing the invoice
    field_type: string
  - name: amount
    description: Total amount due
    field_type: number
  - name: line_items
    description: List of items or services
    field_type: list
    required: false
  - name: due_date
    description: Payment due date (YYYY-MM-DD)
    field_type: date
    required: false
```

Supported field types: `string`, `number`, `date`, `list`.
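A schema like the one above can also serve as a lightweight validator for extracted records. A sketch of the idea (illustrative only; the tool's own validation may differ):

```python
# Type checks for the four supported field types. The "date" check
# here is a simplistic YYYY-MM-DD shape test for the sketch.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)),
    "date": lambda v: isinstance(v, str) and len(v.split("-")) == 3,
    "list": lambda v: isinstance(v, list),
}

fields = [
    {"name": "vendor", "field_type": "string", "required": True},
    {"name": "amount", "field_type": "number", "required": True},
    {"name": "due_date", "field_type": "date", "required": False},
]

def validate(record, fields):
    """Check a record against the schema's field list (sketch)."""
    for f in fields:
        if f["name"] not in record:
            if f.get("required"):
                return False  # missing required field
            continue
        if not CHECKS[f["field_type"]](record[f["name"]]):
            return False  # wrong type
    return True

assert validate({"vendor": "Acme", "amount": 120.5}, fields)
assert not validate({"amount": "oops"}, fields)  # vendor missing
```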
## Differences from langextract

This is a standalone CLI tool, not a library. Key differences:

- Single binary — no Python runtime or dependencies needed
- CLI-first — designed as a command-line tool rather than a Python API
- Redaction built in — includes `redact` and `rehydrate` commands for document anonymization
- No visualization — does not generate HTML visualization files (outputs JSON/JSONL)

## License

Apache-2.0 — same as the original project.