langextract-rs

A Rust port of Google's langextract — a tool for extracting structured information from unstructured text using LLMs.

Every extraction is mapped to its exact location in the source text, enabling traceability and verification. Works with OpenAI-compatible APIs and local models (Ollama, vLLM, llama.cpp).

Features

  • Source grounding — maps extractions to character-level positions in the original text
  • Schema-based or example-based extraction — define fields with a YAML schema or provide few-shot examples
  • Auto schema suggestion — LLM-powered schema generation from sample files
  • File cataloging — batch-process folders into searchable JSONL metadata catalogs
  • Document redaction — replace sensitive entities with anonymous placeholders (with recovery via rehydration)
  • Privacy mode — the --no-text flag excludes source text from output
  • Local LLM support — works with any OpenAI-compatible endpoint

Installation

From a local clone

cargo install --path .

Build from source

git clone https://github.com/xuy/langextract-rs.git
cd langextract-rs
cargo build --release
# Binary at target/release/langextract

Quick start

1. Create a config file

# config.yaml
model: gpt-4o-mini
api_key: ${OPENAI_API_KEY}
temperature: 0.0

prompt_description: |
  Extract patient names and medical conditions from the clinical note.

examples:
  - text: "John Smith was diagnosed with type 2 diabetes and hypertension."
    extractions:
      - extraction_class: patient
        extraction_text: John Smith
      - extraction_class: condition
        extraction_text: type 2 diabetes
      - extraction_class: condition
        extraction_text: hypertension

For local models, set base_url:

model: llama3
api_key: not-needed
base_url: http://localhost:11434/v1

2. Run extraction

langextract extract --config config.yaml --input document.txt

Output is JSON with character-level alignment:

{
  "extractions": [
    {
      "extraction_class": "patient",
      "extraction_text": "John Smith",
      "char_interval": { "start": 0, "end": 10 },
      "alignment_status": "exact"
    }
  ]
}
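Source grounding means each char_interval can be checked directly against the input text. A minimal Python sketch using the record shown above: slicing the source by the interval must reproduce extraction_text exactly.

```python
import json

source = "John Smith was diagnosed with type 2 diabetes and hypertension."

# Output record copied from the example above.
record = json.loads("""
{
  "extractions": [
    {
      "extraction_class": "patient",
      "extraction_text": "John Smith",
      "char_interval": { "start": 0, "end": 10 },
      "alignment_status": "exact"
    }
  ]
}
""")

# Grounding check: the character interval must slice the source text
# back to exactly the extracted string.
for ext in record["extractions"]:
    iv = ext["char_interval"]
    assert source[iv["start"]:iv["end"]] == ext["extraction_text"]

print("all extractions grounded")  # prints: all extractions grounded
```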

Commands

extract

Extract structured data from a file or folder.

# Single file (example-based)
langextract extract --config config.yaml --input file.txt

# Single file (schema-based)
langextract extract --config api-config.yaml --schema schema.yaml --input file.txt

# Entire folder
langextract extract --config config.yaml --folder ./documents --extensions txt,md

# Privacy mode (omit source text from output)
langextract extract --config config.yaml --input file.txt --no-text

suggest-schema

Auto-generate an extraction schema from sample files.

langextract suggest-schema \
  --config api-config.yaml \
  --folder ./samples \
  --purpose "organize invoices by date and vendor"

catalog

Build a searchable metadata catalog from a folder of documents.

langextract catalog \
  --config api-config.yaml \
  --schema file-organizer-schema.yaml \
  --folder ./documents \
  --output catalog.jsonl
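Because the catalog is JSONL (one JSON object per line), it can be filtered with a few lines of any scripting language. A Python sketch; the field names here (path, vendor) are illustrative, since the actual fields depend on your schema.

```python
import io
import json

# Stand-in for open("catalog.jsonl"); field names are hypothetical.
catalog = io.StringIO(
    '{"path": "documents/inv-001.pdf", "vendor": "Acme"}\n'
    '{"path": "documents/inv-002.pdf", "vendor": "Globex"}\n'
)

# Parse each line independently and keep records matching a predicate.
matches = [
    rec for line in catalog
    if (rec := json.loads(line)).get("vendor") == "Acme"
]

print([m["path"] for m in matches])  # ['documents/inv-001.pdf']
```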

redact

Anonymize sensitive entities in a document.

langextract redact \
  --config api-config.yaml \
  --preset legal \
  --input contract.txt \
  --map entity_map.json \
  --output redacted.txt

Built-in presets: legal, medical, finance. Custom presets are supported via --preset-file.

rehydrate

Restore original values from a redacted document.

langextract rehydrate \
  --map entity_map.json \
  --input redacted.txt \
  --output restored.txt
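Conceptually, rehydration is a reverse substitution driven by the entity map. An illustrative Python sketch only: the real layout of entity_map.json is produced by the redact command and may differ from the flat placeholder-to-original mapping assumed here.

```python
# Hypothetical entity map: placeholder -> original value.
entity_map = {"[PERSON_1]": "John Smith", "[ORG_1]": "Acme Corp"}

redacted = "[PERSON_1] signed the agreement with [ORG_1]."

# Substitute each placeholder back with its recorded original.
restored = redacted
for placeholder, original in entity_map.items():
    restored = restored.replace(placeholder, original)

print(restored)  # John Smith signed the agreement with Acme Corp.
```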

Configuration

API config (config.yaml)

Field               Required  Description
model               Yes       Model name (e.g., gpt-4o-mini, llama3)
api_key             Yes       API key or ${ENV_VAR} reference
base_url            No        Endpoint URL (for local models)
temperature         No        Sampling temperature (defaults to the model's default)
prompt_description  Yes       Extraction instructions for the LLM
examples            Yes       Few-shot examples (can be empty if using --schema)

Extraction schema (schema.yaml)

name: invoice_fields
purpose: Extract key invoice information
fields:
  - name: vendor
    description: Company or person issuing the invoice
    field_type: string
  - name: amount
    description: Total amount due
    field_type: number
  - name: line_items
    description: List of items or services
    field_type: list
    required: false
  - name: due_date
    description: Payment due date (YYYY-MM-DD)
    field_type: date
    required: false

Supported field types: string, number, date, list.
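The four field types map naturally onto JSON values. A minimal Python sketch of how extracted values could be checked against them; the tool's own validation rules may be stricter than this.

```python
import datetime

def matches_type(value, field_type):
    """Check a value against one of the schema field types."""
    if field_type == "string":
        return isinstance(value, str)
    if field_type == "number":
        # Exclude bool, which is a subclass of int in Python.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if field_type == "date":
        # Dates are expected in ISO format, e.g. YYYY-MM-DD.
        try:
            datetime.date.fromisoformat(value)
            return True
        except (TypeError, ValueError):
            return False
    if field_type == "list":
        return isinstance(value, list)
    return False

assert matches_type("Acme Corp", "string")
assert matches_type(1250.00, "number")
assert matches_type("2024-07-01", "date")
assert not matches_type("soon", "date")
assert matches_type(["consulting", "hosting"], "list")
```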

Differences from the Python version

This is a standalone CLI tool, not a library. Key differences:

  • Single binary — no Python runtime or dependencies needed
  • CLI-first — designed as a command-line tool rather than a Python API
  • Redaction built in — includes redact and rehydrate commands for document anonymization
  • No visualization — does not generate HTML visualization files (outputs JSON/JSONL)

License

Apache-2.0 — same as the original project.
