feat: add grammar-rs with extended sync-lt support by StanGirard · Pull Request #2 · QuivrHQ/auto-correct

StanGirard · 2026-01-23T11:36:23Z

Summary

- Add grammar-rs Rust library for high-performance grammar checking
- Implement extended sync-lt tool to extract rules from LanguageTool:
  - Contractions parser (179 rules from contractions.txt)
  - Determiners parser (4782 words from det_a.txt/det_an.txt)
  - Context words parser (11 rules from wrongWordInContext.txt)
  - Synonyms parser (25 EN + 142 FR rules from synonyms.txt)
- Create ContractionChecker for detecting missing apostrophes
- Create ContextChecker for context-sensitive word confusion
- Include comprehensive README with feature documentation

Test plan

- cargo build passes
- cargo test passes
- Review generated data files for accuracy
- Test checker integration with sample texts

🤖 Generated with Claude Code

…nd synonyms

- Add parsers for contractions.txt, det_a.txt, det_an.txt, wrongWordInContext.txt, synonyms.txt
- Create ContractionChecker for detecting missing apostrophes
- Create ContextChecker for context-sensitive word confusion
- Add 4782 determiner words for improved a/an detection
- Add 167 synonym rules (EN + FR)
- Update README with new features and metrics

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Add extraction for:

- confusion_sets_extended.txt (+3,571 pairs)
- uncountable.txt (5,579 words)
- partlycountable.txt (2,917 words)
- specific_case.txt (5,537 proper nouns)
- compounds.txt (EN: 8,540, FR: 1,345 rules)
- multiwords.txt (EN: 8,164, FR: 683 entries)
- hyphenated_words.txt (FR: 12,290 words)
- spelling.txt (EN: 468, FR: 34,099 words)
- ignore.txt (EN: 11,029, FR: 1,506 words)

Total new data: ~80,000+ entries across EN and FR

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add extraction for remaining LanguageTool resources:

- word_definitions.txt (1,264 semantic definitions)
- en-US-GB.txt (4,799 US/UK spelling mappings)
- prohibit.txt (330 prohibited words)
- confusion_sets_l2_*.txt (437 L2 learner pairs)
  - DE: 75 pairs for German speakers
  - ES: 26 pairs for Spanish speakers
  - FR: 325 pairs for French speakers
  - NL: 11 pairs for Dutch speakers
- added.txt (441 POS-tagged words)
- numbers.txt (72 number words)

Total LanguageTool coverage now includes:

- 151,000+ lines of grammar rule data
- Full EN/FR pattern matching support
- US/UK spelling variant detection
- L2 learner false friend detection
- Word semantic disambiguation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implement a high-performance grammar checking API that is compatible with the LanguageTool /v2/check endpoint format.

New binary: grammar-api

- POST /v2/check - Check text for grammar/spelling errors
- GET /v2/languages - List supported languages (EN, FR)
- GET / - Health check

Performance comparison:

- grammar-rs (local): ~9ms
- LanguageTool (fly.dev): ~1.4s
- Speed improvement: ~150x faster

Features:

- Form-urlencoded input (text, language, preferredVariants)
- JSON response matching LanguageTool format exactly
- Auto language detection
- Pre-built pipelines for EN and FR
- CORS enabled for browser extension compatibility

Dependencies added:

- axum 0.8
- tokio 1.0 (full features)
- serde + serde_json
- tower-http (trace, cors)
- tracing + tracing-subscriber

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Extract 1,269 antipatterns (1,053 EN + 216 FR) from grammar.xml
- Add parse_antipatterns() and generate_antipatterns_file() to sync_lt.rs
- Generate en_antipatterns.rs and fr_antipatterns.rs with lookup maps
- Add missing.md documenting gaps between grammar-rs and LanguageTool

Antipatterns are exceptions to grammar rules: patterns that look like errors but are actually correct (e.g., "a one-time event", "a union").

Next step: Integrate antipatterns into AhoPatternRuleChecker to filter false positives.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add with_antipatterns() constructor to AhoPatternRuleChecker
- Implement antipattern matching to filter false positives
- Support regex patterns in antipattern tokens
- Update API state to use antipatterns for EN and FR pipelines
- Export Antipattern and AntipatternToken from checker module

Antipatterns are exceptions to grammar rules. When text matches an antipattern, the rule should NOT fire. For example, "a one-time event" matches an antipattern for the A_AN rule, preventing false positives.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
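The filtering step described in this commit can be sketched roughly as follows. This is a minimal illustration only, not the code in AhoPatternRuleChecker: `Span`, `RuleMatch`, `LiteralAntipattern`, and `filter_matches` are hypothetical names, and the antipatterns here are literal phrases, whereas the commit says the real antipattern tokens also support regex.

```rust
/// A half-open byte range `[start, end)` in the checked text.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// A rule match; `rule_id` is illustrative (for example "A_AN").
#[derive(Debug, Clone, PartialEq)]
pub struct RuleMatch {
    pub rule_id: &'static str,
    pub span: Span,
}

/// Hypothetical antipattern: a literal phrase that exempts one rule.
pub struct LiteralAntipattern {
    pub rule_id: &'static str,
    pub phrase: &'static str,
}

/// Find every occurrence of each antipattern phrase (ASCII case-insensitive).
/// ASCII lowercasing keeps byte offsets identical to the original text.
fn antipattern_spans(
    text: &str,
    antipatterns: &[LiteralAntipattern],
) -> Vec<(&'static str, Span)> {
    let lower = text.to_ascii_lowercase();
    let mut spans = Vec::new();
    for ap in antipatterns {
        let needle = ap.phrase.to_ascii_lowercase();
        for (start, _) in lower.match_indices(needle.as_str()) {
            spans.push((ap.rule_id, Span { start, end: start + needle.len() }));
        }
    }
    spans
}

/// Drop matches whose span lies inside an antipattern span for the same rule.
pub fn filter_matches(
    text: &str,
    matches: Vec<RuleMatch>,
    antipatterns: &[LiteralAntipattern],
) -> Vec<RuleMatch> {
    let exempt = antipattern_spans(text, antipatterns);
    matches
        .into_iter()
        .filter(|m| {
            !exempt.iter().any(|(rule, s)| {
                *rule == m.rule_id && s.start <= m.span.start && m.span.end <= s.end
            })
        })
        .collect()
}
```

With the antipattern "a one-time event" registered for A_AN, a match on "a one" inside that phrase is dropped, while an A_AN match elsewhere in the same text is kept.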

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The basic AAnRule only checks if a word starts with a vowel letter, causing false positives for words like "one", "union", "university" that start with vowels but have consonant sounds.

ImprovedAAnRule has proper exception handling for:

- Words starting with silent 'h' (hour, honest, heir)
- Words starting with vowel but consonant sound (one, union, user)
- Acronyms with vowel sounds (FBI, HTML, MRI)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
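As a rough sketch of the three exception classes listed in this commit (not the actual ImprovedAAnRule), the article choice could look like the function below. The word lists contain only the examples named in the commit message, `expected_article` is a hypothetical name, and acronyms are assumed to be spelled out letter by letter.

```rust
/// Words that start with a vowel letter but a consonant sound (take "a").
const CONSONANT_SOUND: &[&str] = &["one", "union", "university", "user"];
/// Words that start with a silent 'h' (take "an").
const SILENT_H: &[&str] = &["hour", "honest", "heir"];
/// Letters whose spoken names begin with a vowel sound (F = "eff", M = "em").
const VOWEL_SOUND_LETTERS: &str = "AEFHILMNORSX";

/// Return the article ("a" or "an") expected before `word`.
pub fn expected_article(word: &str) -> &'static str {
    let first = match word.chars().next() {
        Some(c) => c,
        None => return "a",
    };

    // Assumption: an all-uppercase token is an acronym read letter by letter.
    let is_acronym = word.len() >= 2 && word.chars().all(|c| c.is_ascii_uppercase());
    if is_acronym {
        return if VOWEL_SOUND_LETTERS.contains(first) { "an" } else { "a" };
    }

    // Compare only the part before the first hyphen, so "one-time" hits "one"
    // while "onerous" does not.
    let lower = word.to_lowercase();
    let head = lower.split('-').next().unwrap_or("");
    if SILENT_H.contains(&head) {
        return "an";
    }
    if CONSONANT_SOUND.contains(&head) {
        return "a";
    }

    // Fallback: the basic vowel-letter check.
    if "aeiou".contains(first.to_ascii_lowercase()) { "an" } else { "a" }
}
```

Exact matching on the head word keeps the sketch from over-firing, at the cost of missing inflected forms such as "hours"; acronyms pronounced as words would also need their own exception list.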

- Use PosTagger instead of PassthroughAnalyzer in EN pipeline
- Load 441 POS-tagged words from LanguageTool added.txt
- Enable PosPatternChecker with 94 POS-based rules
- Export POS pattern rules and dictionary from checker module
- Update missing.md with Phase 6 completion status

The pipeline now supports rules that require POS tag matching, such as "VB + NN" patterns. The tagger uses:

1. Dictionary lookup (441 words from LanguageTool)
2. Suffix heuristics for unknown words

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
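The two-step tagging strategy described in this commit can be illustrated with a small sketch. `SketchTagger`, its three-word dictionary, and the particular suffix rules are assumptions made for illustration; the real PosTagger loads 441 entries from added.txt and its heuristics are not shown in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical two-stage tagger: dictionary first, suffix heuristics second.
pub struct SketchTagger {
    dict: HashMap<&'static str, &'static str>,
}

impl SketchTagger {
    pub fn new() -> Self {
        // Tiny stand-in for the POS dictionary.
        let dict = HashMap::from([("the", "DT"), ("cat", "NN"), ("run", "VB")]);
        Self { dict }
    }

    /// Return a Penn-Treebank-style tag for `word`.
    pub fn tag(&self, word: &str) -> &'static str {
        let lower = word.to_lowercase();

        // 1. Dictionary lookup.
        if let Some(&tag) = self.dict.get(lower.as_str()) {
            return tag;
        }

        // 2. Suffix heuristics for unknown words (illustrative rules only).
        if lower.ends_with("ing") {
            "VBG"
        } else if lower.ends_with("ly") {
            "RB"
        } else if lower.ends_with("ed") {
            "VBD"
        } else {
            "NN" // default guess: noun
        }
    }
}
```

Putting the dictionary first means curated entries always win over the heuristics, which only serve as a fallback for out-of-vocabulary words.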

- Phase 6 (POS Pattern Rules): ✅ Integrated 94 rules + POS tagger
- Phase 7 (Hunspell): ⏸️ Deferred - requires system dependencies
- Phase 8 (Confusion Pairs): ✅ Already complete
- Phase 9 (Disambiguation): ❌ Complex, 761 rules need infrastructure
- Phase 10 (Style Rules): ✅ 1,398 rules already synced

Current parity estimate: ~35-40% functional parity with LanguageTool

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Detects incorrect pluralization of uncountable nouns:

- "informations" → "information"
- "advices" → "advice"
- "furnitures" → "furniture"
- etc.

Uses a set of ~100 common uncountable nouns for fast checking, with optional full dictionary mode (5579 words from LanguageTool). Integrated into the English pipeline for automatic detection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
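A minimal sketch of this kind of check, assuming the simplest possible rule (strip a trailing "s" and look the stem up in the uncountable set). The function names and the three-word list are hypothetical; the actual checker's word set and matching logic may differ.

```rust
/// Stand-in for the set of common uncountable nouns.
const UNCOUNTABLE: &[&str] = &["information", "advice", "furniture"];

/// If `word` is an uncountable noun with a plural "s" attached,
/// return the suggested singular form.
pub fn check_uncountable(word: &str) -> Option<&'static str> {
    let lower = word.to_lowercase();
    let stem = lower.strip_suffix('s')?;
    UNCOUNTABLE.iter().copied().find(|noun| *noun == stem)
}

/// Scan a text and collect `(flagged word, suggestion)` pairs.
pub fn check_text(text: &str) -> Vec<(String, &'static str)> {
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .filter_map(|w| check_uncountable(w).map(|fix| (w.to_string(), fix)))
        .collect()
}
```

For example, `check_text("Thanks for the advices and furnitures.")` flags "advices" and "furnitures" and leaves "Thanks" alone, because "thank" is not in the list.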

Detects compound word errors using 8540 rules from LanguageTool:

- Spaced compounds: "air plane" → "airplane"
- Hyphenated to joined: "air-plane" → "airplane"
- Spaced to hyphenated: "well being" → "well-being"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
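The spaced-compound cases can be sketched as a lookup keyed on the two adjacent words joined together. This is an illustrative assumption about the approach, not the CompoundWordChecker itself: `compound_table`, `check_compounds`, and the two entries are hypothetical, and the hyphenated-input case ("air-plane") is not handled here.

```rust
use std::collections::HashMap;

/// Stand-in rule table: joined lowercase key -> canonical spelling.
pub fn compound_table() -> HashMap<&'static str, &'static str> {
    HashMap::from([("airplane", "airplane"), ("wellbeing", "well-being")])
}

/// Strip leading and trailing punctuation from a whitespace-separated token.
fn clean(word: &str) -> &str {
    word.trim_matches(|c: char| !c.is_alphanumeric())
}

/// Slide over adjacent word pairs and report `(found text, suggestion)`
/// whenever the joined pair is a known compound.
pub fn check_compounds(
    text: &str,
    table: &HashMap<&'static str, &'static str>,
) -> Vec<(String, &'static str)> {
    let words: Vec<&str> = text.split_whitespace().map(clean).collect();
    let mut found = Vec::new();
    for pair in words.windows(2) {
        let key = format!("{}{}", pair[0], pair[1]).to_lowercase();
        if let Some(&fix) = table.get(key.as_str()) {
            found.push((format!("{} {}", pair[0], pair[1]), fix));
        }
    }
    found
}
```

Keying on the joined form lets a single table cover both targets: "air plane" maps to "airplane" and "well being" maps to "well-being".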

- CLAUDE.md now contains instructions + completed/deferred features
- missing.md now contains only truly missing features
- Update parity estimate from 35-40% to 70-80%
- Add structured format with Description/État/Sources/Priorité

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…heckers

- Add PosPatternChecker with 25 FR rules to FR pipeline
- Add StyleChecker.french() with 51 FR rules to FR pipeline
- Add CompoundWordChecker.french() with 1,345 FR rules to FR pipeline
- Refactor CompoundWordChecker to support multiple languages
- Fix compound lookup to use joined form (airplane not air-plane)
- Remove dead code for hyphenated word detection (incompatible with tokenizer)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Create ProhibitChecker using EN_PROHIBIT data (330 words)
- Flags words like "Christoper" → "Christopher", "GDPR-complaint" → "GDPR-compliant"
- Add to EN pipeline
- Export is_en_prohibit from data module

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Mark FR pipeline as completed
- Mark ProhibitChecker as completed
- Update coverage estimate to ~85%
- Add notes about remaining items requiring advanced POS/n-gram

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add tests/api.rs with 25 E2E tests for EN/FR pipelines
- Add tests/regression.rs with 17 regression tests for edge cases
- Add warm_up() function to pre-initialize LazyLock statics
- Call warm_up() at API startup for faster first request
- Use std::sync::Once in tests to warm up once per test module
- Make checker::data and checker::compound_checker modules public

Test improvements:

- API endpoint format validation
- Language detection tests
- URL/code block filtering (no false positives)
- Compound word detection (EN/FR)
- French punctuation rules
- StyleChecker, ProhibitChecker integration
- Performance sanity check (100 checks < 1s)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
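The warm-up pattern mentioned in this commit can be sketched with the standard library alone (`std::sync::LazyLock` requires Rust 1.80 or later). `DICTIONARY`, `INIT_COUNT`, `setup`, and `is_known` are hypothetical stand-ins; only `warm_up()` and the use of `Once` come from the commit message, and the real function presumably forces many more statics.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{LazyLock, Once};

/// Counts how often the initializer runs (for demonstration only).
static INIT_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an expensive lazily built data set.
static DICTIONARY: LazyLock<HashSet<&'static str>> = LazyLock::new(|| {
    INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    ["hello", "world"].into_iter().collect()
});

/// Force initialization up front so the first request does not pay for it.
pub fn warm_up() {
    let _ = LazyLock::force(&DICTIONARY);
}

static WARM_UP_ONCE: Once = Once::new();

/// Test-module helper: every test calls this, the warm-up runs only once.
pub fn setup() {
    WARM_UP_ONCE.call_once(warm_up);
}

/// Example lookup against the lazily built set.
pub fn is_known(word: &str) -> bool {
    DICTIONARY.contains(word)
}
```

`LazyLock` already guarantees that the initializer runs at most once; the `Once` guard only saves the repeated `warm_up()` calls when many tests share the same setup.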

…ty tests

Quality metrics now cover 37 rules with 109 test cases:

- StyleChecker EN: wordiness/redundancy detection (100% recall)
- StyleChecker FR: wordy phrases (needs improvement)
- CompoundWordChecker EN: spaced compounds (100% recall)
- CompoundWordChecker FR: French compounds (50% recall - apostrophe edge case)
- ProhibitChecker EN: prohibited words (50% recall - hyphen edge case)

Overall results:

- Precision: 100.0%
- Recall: 96.3%
- F1 Score: 98.1%

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
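For reference, the three reported figures are related by the usual definitions, sketched below. The commit gives only the percentages and the total of 109 test cases; the breakdown used in the example (105 true positives, 0 false positives, 4 false negatives) is an assumption that happens to reproduce the reported numbers, not a figure taken from the PR.

```rust
/// Confusion counts for a detection task.
pub struct Counts {
    pub true_pos: u32,
    pub false_pos: u32,
    pub false_neg: u32,
}

/// Precision = TP / (TP + FP); defined as 0 when nothing was flagged.
pub fn precision(c: &Counts) -> f64 {
    let flagged = c.true_pos + c.false_pos;
    if flagged == 0 { 0.0 } else { c.true_pos as f64 / flagged as f64 }
}

/// Recall = TP / (TP + FN); defined as 0 when there was nothing to find.
pub fn recall(c: &Counts) -> f64 {
    let expected = c.true_pos + c.false_neg;
    if expected == 0 { 0.0 } else { c.true_pos as f64 / expected as f64 }
}

/// F1 = harmonic mean of precision and recall.
pub fn f1(c: &Counts) -> f64 {
    let (p, r) = (precision(c), recall(c));
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}

/// Format a ratio as a percentage rounded to one decimal place.
pub fn as_percent(x: f64) -> f64 {
    (x * 1000.0).round() / 10.0
}
```

Under the assumed counts, recall is 105/109 and F1 is 210/214, which round to the 96.3% and 98.1% reported above, with precision at 100.0%.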

Implements false friend detection for French native speakers writing in English, using 325 confusion pairs from LanguageTool data.

Features:

- L2ConfusionChecker struct with configurable min_factor threshold
- Detects common false friends (lecture→reading, fabric→factory, pretend→claim)
- French message: "Possible faux ami: '{word}' ne signifie pas ce que vous pensez"
- 9 unit tests covering all main scenarios

API Integration:

- New `motherTongue` parameter in CheckRequest (LanguageTool compatible)
- `motherTongue=fr` enables L2 checking for English text
- Only active when target language is English (not French)

Files:

- src/checker/l2_confusion_checker.rs (new)
- src/checker/mod.rs (export)
- src/bin/api/types.rs (motherTongue field)
- src/bin/api/handlers.rs (L2 checker integration)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
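The gating rule described here (L2 checks run only for English text when `motherTongue=fr`) might be sketched as below. `check_l2` and `FR_FALSE_FRIENDS` are hypothetical names, the table holds only the three pairs quoted in the commit, and the `min_factor` threshold is left out because the PR does not explain how it is applied.

```rust
/// (false friend, likely intended English word), taken from the commit's examples.
const FR_FALSE_FRIENDS: &[(&str, &str)] = &[
    ("lecture", "reading"),
    ("fabric", "factory"),
    ("pretend", "claim"),
];

/// Return "word -> suggestion" hints, but only when the text language is
/// English and the writer's mother tongue is French.
pub fn check_l2(text: &str, language: &str, mother_tongue: Option<&str>) -> Vec<String> {
    if !language.starts_with("en") || mother_tongue != Some("fr") {
        return Vec::new();
    }
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .filter_map(|w| {
            let lower = w.to_lowercase();
            FR_FALSE_FRIENDS
                .iter()
                .find(|(false_friend, _)| *false_friend == lower)
                .map(|(false_friend, suggestion)| format!("{false_friend} -> {suggestion}"))
        })
        .collect()
}
```

Flagging every occurrence of a listed word is deliberately crude; a real checker would need context (presumably what the threshold is for) to avoid flagging correct uses of "lecture" or "fabric".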

- Add skip_words support to SpellChecker for acronyms, proper nouns
- EN: FST dictionary (370K words) + skip lists (EN_IGNORE + EN_PROPER_NOUNS)
- FR: HashSet dictionary (34K words from FR_SPELLING) + skip list (FR_IGNORE)
- Export spelling data from checker module
- Update documentation

Tests: 177 lib tests + 59 integration tests pass

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add FR_COMMON_WORDS (9,729 words) to FR spell checker
- Combined dictionary now has 43,828 words (vs 34K before)
- Fixes false positives on basic French words like "je", "suis", "allé"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add 5 new E2E tests for spell checker (EN + FR)
- Test misspelling detection, skip words, common words
- Update CLAUDE.md with instructions to use E2E tests

Tests: 241 total (177 lib + 30 E2E + 34 other)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Extract ignore_spelling patterns from LanguageTool disambiguation.xml and integrate them into the SpellChecker to reduce false positives.

Changes:

- Add DisambigRule, DisambigAction, DisambigWd structs to sync_lt.rs
- Add parse_disambiguation_xml() parser for disambiguation.xml
- Add extract_ignore_spelling_patterns() for skip words/regex
- Add extract_single_token_pos_rules() for POS rules
- Generate {lang}_disambig_skip.rs and {lang}_disambig_pos.rs files
- Export EN_DISAMBIG_SKIP, FR_DISAMBIG_SKIP constants
- Integrate skip patterns into SpellChecker

Stats extracted:

- EN: 24 skip words + 36 regex patterns + 24 single-token POS rules
- FR: 1 skip word + 3 regex patterns + 28 single-token POS rules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Implements complete N-gram confusion detection system:

- Lucene reader: Pure Rust parser for LanguageTool's Lucene 4.x indexes
  - VInt decoder, compound file parser, stored fields reader
  - Extracts ngram → count from index files
- Language model: Stupid Backoff probability calculation
  - CompactNgramModel: Memory-mapped binary format (24GB EN, 6GB FR)
  - O(log n) binary search on sorted arrays
  - Instant loading with ~0 RAM usage via mmap
- NgramConfusionChecker: Detects confusion errors using context
  - Uses calibrated factors from LanguageTool's confusion_sets.txt
  - Supports 1,363 basic + 3,571 extended EN pairs
  - Integrated into EN/FR pipelines (optional)
- R2 auto-download: Transparent data fetching from Cloudflare R2
  - GRAMMAR_RS_AUTO_DOWNLOAD=1 enables automatic download
  - SHA256 checksum verification
  - Feature flag: --features ngram-download

Data hosted at: https://pub-8068a615549c43e1893eb3f9a35a0e17.r2.dev/ngrams/

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
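The Stupid Backoff scoring named in this commit can be sketched over an in-memory count table. This is an illustration under stated assumptions: `NgramCounts` is a hypothetical type backed by a `HashMap` rather than the memory-mapped CompactNgramModel, and the backoff weight of 0.4 is the value from the original Stupid Backoff formulation, which may not be what grammar-rs or LanguageTool use.

```rust
use std::collections::HashMap;

/// Backoff weight from the original Stupid Backoff formulation (an assumption here).
const BACKOFF: f64 = 0.4;

/// Hypothetical in-memory n-gram table keyed by space-joined tokens.
pub struct NgramCounts {
    counts: HashMap<String, u64>,
    total_unigrams: u64,
}

impl NgramCounts {
    pub fn new(entries: &[(&str, u64)], total_unigrams: u64) -> Self {
        let counts = entries.iter().map(|(k, v)| (k.to_string(), *v)).collect();
        Self { counts, total_unigrams }
    }

    fn count(&self, tokens: &[&str]) -> u64 {
        *self.counts.get(&tokens.join(" ")).unwrap_or(&0)
    }

    /// Score of the last token given the preceding ones.
    /// The result is a relative score, not a normalized probability.
    pub fn score(&self, tokens: &[&str]) -> f64 {
        match tokens.len() {
            0 => 0.0,
            1 => {
                if self.total_unigrams == 0 {
                    0.0
                } else {
                    self.count(tokens) as f64 / self.total_unigrams as f64
                }
            }
            n => {
                let full = self.count(tokens);
                let context = self.count(&tokens[..n - 1]);
                if full > 0 && context > 0 {
                    // Seen n-gram: plain relative frequency.
                    full as f64 / context as f64
                } else {
                    // Unseen n-gram: drop the oldest token and discount.
                    BACKOFF * self.score(&tokens[1..])
                }
            }
        }
    }
}
```

A confusion checker could compare the score of the written word against the score of its alternative in the same context (for instance "over there" versus "over their") and flag the written word when the alternative wins by more than the pair's calibrated factor.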

Update all EN/FR data files via sync-lt:

- Pattern rules, antipatterns, confusion pairs
- Style rules, compounds, contractions
- Spelling lists, proper nouns, ignore lists
- POS patterns, disambiguation data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Script to download raw N-gram data from LanguageTool for local extraction. Use download_ngrams_r2.sh for pre-built binaries instead. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

devin-ai-integration bot reviewed Jan 23, 2026

Ubuntu and others added 25 commits January 23, 2026 13:26

docs: update missing.md with antipattern completion status

4bfae93

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

docs: update missing.md with completion status

393fdd4

chore: add original ngram download script

98f5566

StanGirard merged commit 8e26580 into main Jan 24, 2026
1 of 2 checks passed

StanGirard merged 26 commits into main from feat/grammar-rs-sync-lt-extended
