feat: add grammar-rs with extended sync-lt support #2
Merged
StanGirard merged 26 commits into main on Jan 24, 2026
…nd synonyms
- Add parsers for contractions.txt, det_a.txt, det_an.txt, wrongWordInContext.txt, synonyms.txt
- Create ContractionChecker for detecting missing apostrophes
- Create ContextChecker for context-sensitive word confusion
- Add 4782 determiner words for improved a/an detection
- Add 167 synonym rules (EN + FR)
- Update README with new features and metrics
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add extraction for:
- confusion_sets_extended.txt (+3,571 pairs)
- uncountable.txt (5,579 words)
- partlycountable.txt (2,917 words)
- specific_case.txt (5,537 proper nouns)
- compounds.txt (EN: 8,540, FR: 1,345 rules)
- multiwords.txt (EN: 8,164, FR: 683 entries)
- hyphenated_words.txt (FR: 12,290 words)
- spelling.txt (EN: 468, FR: 34,099 words)
- ignore.txt (EN: 11,029, FR: 1,506 words)
Total new data: ~80,000+ entries across EN and FR
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add extraction for remaining LanguageTool resources:
- word_definitions.txt (1,264 semantic definitions)
- en-US-GB.txt (4,799 US/UK spelling mappings)
- prohibit.txt (330 prohibited words)
- confusion_sets_l2_*.txt (437 L2 learner pairs)
  - DE: 75 pairs for German speakers
  - ES: 26 pairs for Spanish speakers
  - FR: 325 pairs for French speakers
  - NL: 11 pairs for Dutch speakers
- added.txt (441 POS-tagged words)
- numbers.txt (72 number words)
Total LanguageTool coverage now includes:
- 151,000+ lines of grammar rule data
- Full EN/FR pattern matching support
- US/UK spelling variant detection
- L2 learner false friend detection
- Word semantic disambiguation
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement a high-performance grammar checking API that is compatible with the LanguageTool /v2/check endpoint format.
New binary: grammar-api
- POST /v2/check - Check text for grammar/spelling errors
- GET /v2/languages - List supported languages (EN, FR)
- GET / - Health check
Performance comparison:
- grammar-rs (local): ~9ms
- LanguageTool (fly.dev): ~1.4s
- Speed improvement: ~150x faster
Features:
- Form-urlencoded input (text, language, preferredVariants)
- JSON response matching LanguageTool format exactly
- Auto language detection
- Pre-built pipelines for EN and FR
- CORS enabled for browser extension compatibility
Dependencies added:
- axum 0.8
- tokio 1.0 (full features)
- serde + serde_json
- tower-http (trace, cors)
- tracing + tracing-subscriber
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
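For readers who want to call the endpoint by hand, here is a minimal sketch of how a form-urlencoded body for `POST /v2/check` could be built with only the standard library. The helper names `form_encode_value` and `check_request_body` are hypothetical and not part of grammar-api; only the `text` and `language` field names come from the commit message above.

```rust
/// Percent-encode one value for an application/x-www-form-urlencoded body.
/// Unreserved bytes pass through, spaces become '+', everything else is %XX.
fn form_encode_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'*' | b'-' | b'.' | b'_' => {
                out.push(byte as char)
            }
            b' ' => out.push('+'),
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Build a request body with the `text` and `language` form fields.
/// (Hypothetical helper; the real client code may look different.)
fn check_request_body(text: &str, language: &str) -> String {
    format!(
        "text={}&language={}",
        form_encode_value(text),
        form_encode_value(language)
    )
}
```

The resulting string would be sent with the `Content-Type: application/x-www-form-urlencoded` header by whatever HTTP client is in use.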
- Extract 1,269 antipatterns (1,053 EN + 216 FR) from grammar.xml
- Add parse_antipatterns() and generate_antipatterns_file() to sync_lt.rs
- Generate en_antipatterns.rs and fr_antipatterns.rs with lookup maps
- Add missing.md documenting gaps between grammar-rs and LanguageTool
Antipatterns are exceptions to grammar rules - patterns that look like errors but are actually correct (e.g., "a one-time event", "a union").
Next step: Integrate antipatterns into AhoPatternRuleChecker to filter false positives.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add with_antipatterns() constructor to AhoPatternRuleChecker
- Implement antipattern matching to filter false positives
- Support regex patterns in antipattern tokens
- Update API state to use antipatterns for EN and FR pipelines
- Export Antipattern and AntipatternToken from checker module
Antipatterns are exceptions to grammar rules. When text matches an antipattern, the rule should NOT fire. For example, "a one-time event" matches an antipattern for the A_AN rule, preventing false positives.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
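The filtering idea can be sketched roughly as follows: a rule match is dropped when an antipattern for the same rule overlaps it. This is an illustration only. `RuleMatch`, `SketchAntipattern`, `tokenize`, and `filter_matches` are hypothetical names, tokens are compared as plain lowercase strings (no regex support), and the real `Antipattern` / `AntipatternToken` types may be shaped differently.

```rust
/// A rule hit, expressed as a half-open range of token indices.
#[derive(Debug, Clone, PartialEq)]
struct RuleMatch {
    rule_id: &'static str,
    start: usize,
    end: usize,
}

/// Hypothetical antipattern: a token sequence that exempts one rule.
struct SketchAntipattern {
    rule_id: &'static str,
    tokens: &'static [&'static str],
}

/// Lowercase whitespace tokenizer that trims surrounding punctuation.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace()
        .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric() && c != '-').to_lowercase())
        .filter(|t| !t.is_empty())
        .collect()
}

/// Token ranges where the antipattern occurs in the text.
fn antipattern_spans(tokens: &[String], ap: &SketchAntipattern) -> Vec<(usize, usize)> {
    let n = ap.tokens.len();
    if n == 0 || tokens.len() < n {
        return Vec::new();
    }
    (0..=tokens.len() - n)
        .filter(|&i| tokens[i..i + n].iter().zip(ap.tokens).all(|(a, b)| a.as_str() == *b))
        .map(|i| (i, i + n))
        .collect()
}

/// Keep only matches that no antipattern of the same rule overlaps.
fn filter_matches(
    tokens: &[String],
    matches: Vec<RuleMatch>,
    antipatterns: &[SketchAntipattern],
) -> Vec<RuleMatch> {
    matches
        .into_iter()
        .filter(|m| {
            !antipatterns
                .iter()
                .filter(|ap| ap.rule_id == m.rule_id)
                .flat_map(|ap| antipattern_spans(tokens, ap))
                .any(|(s, e)| s < m.end && m.start < e)
        })
        .collect()
}
```

With an `A_AN` antipattern for "a one-time event", a match on "a one-time" in "It was a one-time event." would be suppressed, while a match on "a apple" elsewhere would pass through unchanged.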
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The basic AAnRule only checks if a word starts with a vowel letter, causing false positives for words like "one", "union", "university" that start with vowels but have consonant sounds.
ImprovedAAnRule has proper exception handling for:
- Words starting with silent 'h' (hour, honest, heir)
- Words starting with vowel but consonant sound (one, union, user)
- Acronyms with vowel sounds (FBI, HTML, MRI)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
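A minimal sketch of the three exception classes, assuming tiny hard-coded word lists and exact (not prefix) matching. `takes_an` and `check_article` are hypothetical names; the actual ImprovedAAnRule is likely backed by the much larger det_a.txt / det_an.txt data. Acronyms are treated here as all-uppercase words read letter by letter, which is a simplification.

```rust
/// Decide whether a word takes "an" (true) or "a" (false).
fn takes_an(word: &str) -> bool {
    // Illustrative samples only; real lists would be far longer.
    const SILENT_H: &[&str] = &["hour", "honest", "heir"];
    const CONSONANT_SOUND: &[&str] = &["one", "union", "user", "university"];
    // Letters whose spoken names start with a vowel sound (F = "eff", M = "em", ...).
    const VOWEL_NAMED_LETTERS: &str = "AEFHILMNORSX";

    let first = match word.chars().next() {
        Some(c) => c,
        None => return false,
    };
    // Assumption: an all-uppercase word is an acronym pronounced letter by letter.
    let is_acronym = word.len() > 1 && word.chars().all(|c| c.is_ascii_uppercase());
    if is_acronym {
        return VOWEL_NAMED_LETTERS.contains(first);
    }
    let lower = word.to_lowercase();
    if SILENT_H.contains(&lower.as_str()) {
        return true;
    }
    if CONSONANT_SOUND.contains(&lower.as_str()) {
        return false;
    }
    // Fallback: the basic vowel-letter check.
    matches!(first.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u')
}

/// Return the expected article if the written one is wrong, else None.
fn check_article(article: &str, next_word: &str) -> Option<&'static str> {
    let expected = if takes_an(next_word) { "an" } else { "a" };
    if article.eq_ignore_ascii_case(expected) {
        None
    } else {
        Some(expected)
    }
}
```

The ordering matters: the exception lists are consulted before the vowel-letter fallback, so "a union" is accepted and "a hour" is flagged.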
- Use PosTagger instead of PassthroughAnalyzer in EN pipeline
- Load 441 POS-tagged words from LanguageTool added.txt
- Enable PosPatternChecker with 94 POS-based rules
- Export POS pattern rules and dictionary from checker module
- Update missing.md with Phase 6 completion status
The pipeline now supports rules that require POS tag matching, such as "VB + NN" patterns. The tagger uses:
1. Dictionary lookup (441 words from LanguageTool)
2. Suffix heuristics for unknown words
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
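The two-stage lookup can be sketched as below. `SketchTagger` is a hypothetical stand-in for PosTagger, and both the Penn-style tags and the specific suffix rules are assumptions chosen for illustration; the commit does not say which suffixes the real tagger uses.

```rust
use std::collections::HashMap;

/// Hypothetical two-stage tagger: dictionary first, suffix heuristics second.
struct SketchTagger {
    dict: HashMap<&'static str, &'static str>,
}

impl SketchTagger {
    fn new(entries: &[(&'static str, &'static str)]) -> Self {
        Self { dict: entries.iter().copied().collect() }
    }

    fn tag(&self, word: &str) -> &'static str {
        let lower = word.to_lowercase();
        // 1. Dictionary lookup.
        if let Some(&tag) = self.dict.get(lower.as_str()) {
            return tag;
        }
        // 2. Suffix heuristics for unknown words (assumed rules).
        if lower.ends_with("ly") {
            "RB" // adverb
        } else if lower.ends_with("ing") {
            "VBG" // gerund / present participle
        } else if lower.ends_with("ed") {
            "VBD" // past tense
        } else if lower.ends_with('s') && !lower.ends_with("ss") {
            "NNS" // plural noun
        } else {
            "NN" // default: singular noun
        }
    }
}
```

A dictionary entry always wins over a heuristic, so a word such as "bring" can be tagged correctly by listing it even though its suffix looks like a gerund.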
- Phase 6 (POS Pattern Rules): ✅ Integrated 94 rules + POS tagger
- Phase 7 (Hunspell): ⏸️ Deferred - requires system dependencies
- Phase 8 (Confusion Pairs): ✅ Already complete
- Phase 9 (Disambiguation): ❌ Complex, 761 rules need infrastructure
- Phase 10 (Style Rules): ✅ 1,398 rules already synced
Current parity estimate: ~35-40% functional parity with LanguageTool
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Detects incorrect pluralization of uncountable nouns:
- "informations" → "information"
- "advices" → "advice"
- "furnitures" → "furniture"
- etc.
Uses a set of ~100 common uncountable nouns for fast checking, with optional full dictionary mode (5579 words from LanguageTool).
Integrated into the English pipeline for automatic detection.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
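A minimal sketch of the check, assuming a three-word sample list and that a pluralization is simply the noun plus a trailing "s". `uncountable_singular` is a hypothetical name; the real checker uses the ~100-word set or the full 5,579-word list.

```rust
/// If `word` is an uncountable noun with a plural "s" attached,
/// return the correct singular form; otherwise return None.
fn uncountable_singular(word: &str) -> Option<&'static str> {
    // Illustrative sample; stands in for the real uncountable-noun data.
    const UNCOUNTABLE: &[&str] = &["information", "advice", "furniture"];

    let lower = word.to_lowercase();
    let stem = lower.strip_suffix('s')?;
    UNCOUNTABLE.iter().copied().find(|&noun| noun == stem)
}
```

Because the lookup is on the stem, ordinary plurals such as "chairs" fall through without a match.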
Detects compound word errors using 8540 rules from LanguageTool:
- Spaced compounds: "air plane" → "airplane"
- Hyphenated to joined: "air-plane" → "airplane"
- Spaced to hyphenated: "well being" → "well-being"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
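One way to cover all three cases with a single lookup is to key the rules by the joined form (spaces and hyphens removed) and store the correct spelling as the value. The sketch below assumes that layout; `suggest_compound` and the map shape are hypothetical, not the CompoundWordChecker API.

```rust
use std::collections::HashMap;

/// Look up a written form (e.g. "air plane", "air-plane", "well being")
/// and return the correct compound if it differs from what was written.
/// `rules` maps the joined lowercase form to the correct spelling.
fn suggest_compound(
    rules: &HashMap<&'static str, &'static str>,
    written: &str,
) -> Option<&'static str> {
    // Normalize: drop spaces and hyphens, lowercase the rest.
    let key: String = written
        .chars()
        .filter(|c| !c.is_whitespace() && *c != '-')
        .flat_map(char::to_lowercase)
        .collect();
    let correct = *rules.get(key.as_str())?;
    if correct.eq_ignore_ascii_case(written) {
        None // already written correctly
    } else {
        Some(correct)
    }
}
```

A caller would pass each adjacent token pair joined by a space, plus each hyphenated token, and report a match whenever a suggestion comes back.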
- CLAUDE.md now contains instructions + completed/deferred features
- missing.md now contains only truly missing features
- Update parity estimate from 35-40% to 70-80%
- Add structured format with Description/État/Sources/Priorité
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…heckers
- Add PosPatternChecker with 25 FR rules to FR pipeline
- Add StyleChecker.french() with 51 FR rules to FR pipeline
- Add CompoundWordChecker.french() with 1,345 FR rules to FR pipeline
- Refactor CompoundWordChecker to support multiple languages
- Fix compound lookup to use joined form (airplane not air-plane)
- Remove dead code for hyphenated word detection (incompatible with tokenizer)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create ProhibitChecker using EN_PROHIBIT data (330 words)
- Flags words like "Christoper" → "Christopher", "GDPR-complaint" → "GDPR-compliant"
- Add to EN pipeline
- Export is_en_prohibit from data module
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark FR pipeline as completed
- Mark ProhibitChecker as completed
- Update coverage estimate to ~85%
- Add notes about remaining items requiring advanced POS/n-gram
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tests/api.rs with 25 E2E tests for EN/FR pipelines
- Add tests/regression.rs with 17 regression tests for edge cases
- Add warm_up() function to pre-initialize LazyLock statics
- Call warm_up() at API startup for faster first request
- Use std::sync::Once in tests to warm up once per test module
- Make checker::data and checker::compound_checker modules public
Test improvements:
- API endpoint format validation
- Language detection tests
- URL/code block filtering (no false positives)
- Compound word detection (EN/FR)
- French punctuation rules
- StyleChecker, ProhibitChecker integration
- Performance sanity check (100 checks < 1s)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
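The warm-up pattern described above can be sketched with standard-library types only (`std::sync::LazyLock` needs Rust 1.80 or later). The static `SKETCH_DICTIONARY` and the helper `warm_up_once` are hypothetical placeholders; the real `warm_up()` presumably forces the crate's actual data statics.

```rust
use std::collections::HashSet;
use std::sync::{LazyLock, Once};

/// Stand-in for a large lazily built data table.
static SKETCH_DICTIONARY: LazyLock<HashSet<&'static str>> =
    LazyLock::new(|| ["information", "advice", "furniture"].into_iter().collect());

/// Force every lazy static so the first real request does not pay the cost.
pub fn warm_up() {
    let _ = LazyLock::force(&SKETCH_DICTIONARY);
}

static WARM_UP_ONCE: Once = Once::new();

/// Test helper: run the warm-up at most once per process,
/// no matter how many tests call it.
pub fn warm_up_once() {
    WARM_UP_ONCE.call_once(warm_up);
}
```

`LazyLock` already guarantees single initialization; the `Once` wrapper only avoids repeating the walk over all statics when many tests call the helper.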
…ty tests
Quality metrics now cover 37 rules with 109 test cases:
- StyleChecker EN: wordiness/redundancy detection (100% recall)
- StyleChecker FR: wordy phrases (needs improvement)
- CompoundWordChecker EN: spaced compounds (100% recall)
- CompoundWordChecker FR: French compounds (50% recall - apostrophe edge case)
- ProhibitChecker EN: prohibited words (50% recall - hyphen edge case)
Overall results:
- Precision: 100.0%
- Recall: 96.3%
- F1 Score: 98.1%
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
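The reported F1 score is consistent with the other two figures: F1 is the harmonic mean of precision and recall, so 2 × 1.000 × 0.963 / (1.000 + 0.963) ≈ 0.981. A small helper (hypothetical name `f1_score`) makes the relationship explicit:

```rust
/// Harmonic mean of precision and recall, both given as fractions in [0, 1].
fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        return 0.0; // avoid dividing by zero when nothing was detected
    }
    2.0 * precision * recall / (precision + recall)
}
```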
Implements false friend detection for French native speakers writing in English,
using 325 confusion pairs from LanguageTool data.
Features:
- L2ConfusionChecker struct with configurable min_factor threshold
- Detects common false friends (lecture→reading, fabric→factory, pretend→claim)
- French message: "Possible faux ami: '{word}' ne signifie pas ce que vous pensez"
- 9 unit tests covering all main scenarios
API Integration:
- New `motherTongue` parameter in CheckRequest (LanguageTool compatible)
- `motherTongue=fr` enables L2 checking for English text
- Only active when target language is English (not French)
Files:
- src/checker/l2_confusion_checker.rs (new)
- src/checker/mod.rs (export)
- src/bin/api/types.rs (motherTongue field)
- src/bin/api/handlers.rs (L2 checker integration)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
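A rough sketch of the gating logic, assuming a three-pair sample table taken from the examples above. `l2_false_friends` and `SKETCH_PAIRS` are hypothetical names, and the `min_factor` threshold of the real L2ConfusionChecker is left out, so this version flags every occurrence regardless of context.

```rust
/// Sample (English word, likely intended meaning) pairs for French speakers.
const SKETCH_PAIRS: &[(&str, &str)] = &[
    ("lecture", "reading"),
    ("fabric", "factory"),
    ("pretend", "claim"),
];

/// Return (word as written, suggested alternative) for each false friend.
/// Only active for English text when the mother tongue is French.
fn l2_false_friends(
    language: &str,
    mother_tongue: Option<&str>,
    text: &str,
) -> Vec<(String, &'static str)> {
    let english = language.to_lowercase().starts_with("en");
    let french_speaker = mother_tongue.map_or(false, |m| m.to_lowercase().starts_with("fr"));
    if !(english && french_speaker) {
        return Vec::new();
    }
    text.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .filter_map(|w| {
            let lower = w.to_lowercase();
            SKETCH_PAIRS
                .iter()
                .find(|(word, _)| *word == lower.as_str())
                .map(|&(_, suggestion)| (w.to_string(), suggestion))
        })
        .collect()
}
```

Omitting `motherTongue`, or checking French text, yields no L2 matches, which mirrors the "only active when target language is English" rule.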
- Add skip_words support to SpellChecker for acronyms, proper nouns
- EN: FST dictionary (370K words) + skip lists (EN_IGNORE + EN_PROPER_NOUNS)
- FR: HashSet dictionary (34K words from FR_SPELLING) + skip list (FR_IGNORE)
- Export spelling data from checker module
- Update documentation
Tests: 177 lib tests + 59 integration tests pass
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
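The effect of a skip list can be sketched with a HashSet-backed checker. `SketchSpellChecker` is hypothetical and much simpler than the real SpellChecker (no FST, no suggestions); the assumption here is that skip words are matched case-sensitively while dictionary words are matched in lowercase.

```rust
use std::collections::HashSet;

/// Hypothetical spell checker: a lowercase dictionary plus a
/// case-sensitive skip list for acronyms and proper nouns.
struct SketchSpellChecker {
    dictionary: HashSet<String>,
    skip_words: HashSet<String>,
}

impl SketchSpellChecker {
    fn new(words: &[&str]) -> Self {
        Self {
            dictionary: words.iter().map(|w| w.to_lowercase()).collect(),
            skip_words: HashSet::new(),
        }
    }

    /// Builder-style method that adds words never to be flagged.
    fn with_skip_words(mut self, words: &[&str]) -> Self {
        self.skip_words.extend(words.iter().map(|w| w.to_string()));
        self
    }

    /// Return the words that are neither skipped nor in the dictionary.
    fn misspelled<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split(|c: char| !c.is_alphabetic() && c != '\'')
            .filter(|w| !w.is_empty())
            .filter(|w| !self.skip_words.contains(*w))
            .filter(|w| !self.dictionary.contains(&w.to_lowercase()))
            .collect()
    }
}
```

Checking the skip list before the dictionary keeps entries such as "NASA" from being reported even though they will never appear in a general word list.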
- Add FR_COMMON_WORDS (9,729 words) to FR spell checker
- Combined dictionary now has 43,828 words (vs 34K before)
- Fixes false positives on basic French words like "je", "suis", "allé"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 5 new E2E tests for spell checker (EN + FR)
- Test misspelling detection, skip words, common words
- Update CLAUDE.md with instructions to use E2E tests
Tests: 241 total (177 lib + 30 E2E + 34 other)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract ignore_spelling patterns from LanguageTool disambiguation.xml
and integrate them into the SpellChecker to reduce false positives.
Changes:
- Add DisambigRule, DisambigAction, DisambigWd structs to sync_lt.rs
- Add parse_disambiguation_xml() parser for disambiguation.xml
- Add extract_ignore_spelling_patterns() for skip words/regex
- Add extract_single_token_pos_rules() for POS rules
- Generate {lang}_disambig_skip.rs and {lang}_disambig_pos.rs files
- Export EN_DISAMBIG_SKIP, FR_DISAMBIG_SKIP constants
- Integrate skip patterns into SpellChecker
Stats extracted:
- EN: 24 skip words + 36 regex patterns + 24 single-token POS rules
- FR: 1 skip word + 3 regex patterns + 28 single-token POS rules
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements complete N-gram confusion detection system:
- Lucene reader: Pure Rust parser for LanguageTool's Lucene 4.x indexes
  - VInt decoder, compound file parser, stored fields reader
  - Extracts ngram → count from index files
- Language model: Stupid Backoff probability calculation
- CompactNgramModel: Memory-mapped binary format (24GB EN, 6GB FR)
  - O(log n) binary search on sorted arrays
  - Instant loading with ~0 RAM usage via mmap
- NgramConfusionChecker: Detects confusion errors using context
  - Uses calibrated factors from LanguageTool's confusion_sets.txt
  - Supports 1,363 basic + 3,571 extended EN pairs
  - Integrated into EN/FR pipelines (optional)
- R2 auto-download: Transparent data fetching from Cloudflare R2
  - GRAMMAR_RS_AUTO_DOWNLOAD=1 enables automatic download
  - SHA256 checksum verification
  - Feature flag: --features ngram-download
Data hosted at: https://pub-8068a615549c43e1893eb3f9a35a0e17.r2.dev/ngrams/
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
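To illustrate how Stupid Backoff and the sorted-array lookup fit together, here is a small in-memory sketch. `SketchNgramModel` is hypothetical: it keeps a sorted `Vec` instead of a memory-mapped file, and the backoff constant 0.4 is the value commonly used with Stupid Backoff, assumed here rather than taken from the PR.

```rust
/// Backoff multiplier (assumed; 0.4 is the usual Stupid Backoff default).
const BACKOFF: f64 = 0.4;

/// Hypothetical in-memory model: (ngram, count) pairs sorted by ngram.
struct SketchNgramModel {
    entries: Vec<(String, u64)>,
    total_unigrams: u64,
}

impl SketchNgramModel {
    fn new(mut entries: Vec<(String, u64)>) -> Self {
        entries.sort_by(|a, b| a.0.cmp(&b.0));
        // Unigrams are the entries without a space in the key.
        let total_unigrams: u64 = entries
            .iter()
            .filter(|(k, _)| !k.contains(' '))
            .map(|(_, c)| *c)
            .sum();
        Self { entries, total_unigrams }
    }

    /// O(log n) count lookup via binary search on the sorted array.
    fn count(&self, ngram: &str) -> u64 {
        self.entries
            .binary_search_by(|(k, _)| k.as_str().cmp(ngram))
            .map(|i| self.entries[i].1)
            .unwrap_or(0)
    }

    /// Stupid Backoff score of the last word given the preceding words:
    /// relative frequency if the ngram was seen, otherwise BACKOFF times
    /// the score with the oldest context word dropped.
    fn score(&self, words: &[&str]) -> f64 {
        if words.is_empty() || self.total_unigrams == 0 {
            return 0.0;
        }
        if words.len() == 1 {
            return self.count(words[0]) as f64 / self.total_unigrams as f64;
        }
        let full = self.count(&words.join(" "));
        let context = self.count(&words[..words.len() - 1].join(" "));
        if full > 0 && context > 0 {
            full as f64 / context as f64
        } else {
            BACKOFF * self.score(&words[1..])
        }
    }
}
```

A confusion checker built on this would compare the score of the written word against the score of its confusable alternative in the same context and report a match when the alternative wins by more than the calibrated factor for that pair.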
Update all EN/FR data files via sync-lt:
- Pattern rules, antipatterns, confusion pairs
- Style rules, compounds, contractions
- Spelling lists, proper nouns, ignore lists
- POS patterns, disambiguation data
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Script to download raw N-gram data from LanguageTool for local extraction.
Use download_ngrams_r2.sh for pre-built binaries instead.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
- sync-lt tool to extract rules from LanguageTool:
  - contractions.txt
  - det_a.txt / det_an.txt
  - wrongWordInContext.txt
  - synonyms.txt
- ContractionChecker for detecting missing apostrophes
- ContextChecker for context-sensitive word confusion
Test plan
- cargo build passes
- cargo test passes
🤖 Generated with Claude Code