feat: add grammar-rs with extended sync-lt support#2

Merged
StanGirard merged 26 commits into main from feat/grammar-rs-sync-lt-extended
Jan 24, 2026

@StanGirard StanGirard commented Jan 23, 2026

Summary

  • Add grammar-rs Rust library for high-performance grammar checking
  • Implement extended sync-lt tool to extract rules from LanguageTool:
    • Contractions parser (179 rules from contractions.txt)
    • Determiners parser (4,782 words from det_a.txt/det_an.txt)
    • Context words parser (11 rules from wrongWordInContext.txt)
    • Synonyms parser (25 EN + 142 FR rules from synonyms.txt)
  • Create ContractionChecker for detecting missing apostrophes
  • Create ContextChecker for context-sensitive word confusion
  • Include comprehensive README with feature documentation
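
As a rough illustration of what the ContractionChecker bullet describes, the sketch below looks up apostrophe-less words in a small table. The table entries, the function names, and the return shape are assumptions made for this example; the real checker is driven by the 179 rules extracted from contractions.txt and may work differently.

```rust
use std::collections::HashMap;

/// Hypothetical sample of the generated table (the PR extracts 179 rules).
fn contraction_table() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("dont", "don't"),
        ("doesnt", "doesn't"),
        ("isnt", "isn't"),
        ("youre", "you're"),
    ])
}

/// Returns (word as written, suggested contraction) for each hit.
fn find_missing_apostrophes(text: &str) -> Vec<(String, String)> {
    let table = contraction_table();
    text.split_whitespace()
        // Drop surrounding punctuation, then normalize case for the lookup.
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .filter_map(|w| table.get(w.as_str()).map(|fix| (w.clone(), fix.to_string())))
        .collect()
}
```

Calling `find_missing_apostrophes("I dont think so")` would report `dont` with the suggestion `don't`. Ambiguous forms such as "were" or "wont" are deliberately left out of the sample table, since a plain lookup cannot tell them apart from valid words.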

Test plan

  • cargo build passes
  • cargo test passes
  • Review generated data files for accuracy
  • Test checker integration with sample texts

🤖 Generated with Claude Code



…nd synonyms

- Add parsers for contractions.txt, det_a.txt, det_an.txt, wrongWordInContext.txt, synonyms.txt
- Create ContractionChecker for detecting missing apostrophes
- Create ContextChecker for context-sensitive word confusion
- Add 4,782 determiner words for improved a/an detection
- Add 167 synonym rules (EN + FR)
- Update README with new features and metrics
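
To make "context-sensitive word confusion" concrete, here is a minimal sketch of one way such a check could work: a word is flagged when the sentence also contains words that point to its confusable partner. The `ContextRule` struct, its fields, and the sample rule are hypothetical; the actual ContextChecker and the wrongWordInContext.txt format may differ.

```rust
/// Hypothetical rule shape, not the format used by wrongWordInContext.txt.
struct ContextRule {
    word: &'static str,                   // word as typed
    alternative: &'static str,            // word that may have been intended
    alt_context: &'static [&'static str], // words signalling the alternative's sense
}

/// Returns (typed word, suggested alternative) for every rule that fires.
fn suggest_in_context(text: &str, rules: &[ContextRule]) -> Vec<(&'static str, &'static str)> {
    // Lowercase the words and strip surrounding punctuation.
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .collect();
    rules
        .iter()
        .filter(|r| {
            let has_word = words.iter().any(|w| w.as_str() == r.word);
            let has_context = words
                .iter()
                .any(|w| r.alt_context.iter().any(|c| *c == w.as_str()));
            has_word && has_context
        })
        .map(|r| (r.word, r.alternative))
        .collect()
}
```

With a rule for "heroin" versus "heroine" whose context words include "novel", the sentence "The heroin of the novel is brave." would produce the suggestion "heroine", while a sentence without any context word stays unflagged.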

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.


Ubuntu and others added 25 commits January 23, 2026 13:26
Add extraction for:
- confusion_sets_extended.txt (+3,571 pairs)
- uncountable.txt (5,579 words)
- partlycountable.txt (2,917 words)
- specific_case.txt (5,537 proper nouns)
- compounds.txt (EN: 8,540, FR: 1,345 rules)
- multiwords.txt (EN: 8,164, FR: 683 entries)
- hyphenated_words.txt (FR: 12,290 words)
- spelling.txt (EN: 468, FR: 34,099 words)
- ignore.txt (EN: 11,029, FR: 1,506 words)

Total new data: ~80,000+ entries across EN and FR

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add extraction for remaining LanguageTool resources:
- word_definitions.txt (1,264 semantic definitions)
- en-US-GB.txt (4,799 US/UK spelling mappings)
- prohibit.txt (330 prohibited words)
- confusion_sets_l2_*.txt (437 L2 learner pairs)
  - DE: 75 pairs for German speakers
  - ES: 26 pairs for Spanish speakers
  - FR: 325 pairs for French speakers
  - NL: 11 pairs for Dutch speakers
- added.txt (441 POS-tagged words)
- numbers.txt (72 number words)

Total LanguageTool coverage now includes:
- 151,000+ lines of grammar rule data
- Full EN/FR pattern matching support
- US/UK spelling variant detection
- L2 learner false friend detection
- Word semantic disambiguation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement a high-performance grammar checking API that is compatible
with the LanguageTool /v2/check endpoint format.

New binary: grammar-api
- POST /v2/check - Check text for grammar/spelling errors
- GET /v2/languages - List supported languages (EN, FR)
- GET / - Health check

Performance comparison:
- grammar-rs (local): ~9ms
- LanguageTool (fly.dev): ~1.4s
- Speed improvement: ~150x faster

Features:
- Form-urlencoded input (text, language, preferredVariants)
- JSON response matching LanguageTool format exactly
- Auto language detection
- Pre-built pipelines for EN and FR
- CORS enabled for browser extension compatibility

Dependencies added:
- axum 0.8
- tokio 1.0 (full features)
- serde + serde_json
- tower-http (trace, cors)
- tracing + tracing-subscriber

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Extract 1,269 antipatterns (1,053 EN + 216 FR) from grammar.xml
- Add parse_antipatterns() and generate_antipatterns_file() to sync_lt.rs
- Generate en_antipatterns.rs and fr_antipatterns.rs with lookup maps
- Add missing.md documenting gaps between grammar-rs and LanguageTool

Antipatterns are exceptions to grammar rules - patterns that look like
errors but are actually correct (e.g., "a one-time event", "a union").

Next step: Integrate antipatterns into AhoPatternRuleChecker to filter
false positives.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add with_antipatterns() constructor to AhoPatternRuleChecker
- Implement antipattern matching to filter false positives
- Support regex patterns in antipattern tokens
- Update API state to use antipatterns for EN and FR pipelines
- Export Antipattern and AntipatternToken from checker module

Antipatterns are exceptions to grammar rules. When text matches an
antipattern, the rule should NOT fire. For example, "a one-time event"
matches an antipattern for the A_AN rule, preventing false positives.
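
The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in AhoPatternRuleChecker: `RuleMatch`, the literal-only `Antipattern`, and both function names are hypothetical, and the regex token support mentioned in the commit is omitted.

```rust
/// A rule hit on the token stream (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct RuleMatch {
    rule_id: &'static str,
    token_index: usize, // index of the flagged token
}

/// Hypothetical antipattern with literal tokens only.
struct Antipattern {
    rule_id: &'static str,
    tokens: Vec<&'static str>,
}

/// True when the antipattern occurs in `tokens` and overlaps the flagged token.
fn covered_by(tokens: &[&str], m: &RuleMatch, ap: &Antipattern) -> bool {
    if ap.rule_id != m.rule_id || ap.tokens.is_empty() || ap.tokens.len() > tokens.len() {
        return false;
    }
    // Slide the antipattern over the token stream.
    (0..=tokens.len() - ap.tokens.len()).any(|start| {
        let end = start + ap.tokens.len();
        (start..end).contains(&m.token_index)
            && tokens[start..end]
                .iter()
                .zip(&ap.tokens)
                .all(|(t, p)| t.eq_ignore_ascii_case(p))
    })
}

/// Keeps only the matches that no antipattern covers.
fn filter_matches(tokens: &[&str], matches: Vec<RuleMatch>, aps: &[Antipattern]) -> Vec<RuleMatch> {
    matches
        .into_iter()
        .filter(|m| !aps.iter().any(|ap| covered_by(tokens, m, ap)))
        .collect()
}
```

For the tokens `["a", "one-time", "event"]`, an A_AN hit on token 0 is dropped when an antipattern `["a", "one-time"]` is registered for A_AN, while the same kind of hit in `["a", "apple"]` survives.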

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The basic AAnRule only checks if a word starts with a vowel letter,
causing false positives for words like "one", "union", "university"
that start with vowels but have consonant sounds.

ImprovedAAnRule has proper exception handling for:
- Words starting with silent 'h' (hour, honest, heir)
- Words starting with vowel but consonant sound (one, union, user)
- Acronyms with vowel sounds (FBI, HTML, MRI)
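
A minimal sketch of the exception handling listed above, assuming tiny hard-coded word lists in place of the det_a.txt/det_an.txt data. The function name and the acronym heuristic (all-uppercase words are read letter by letter) are assumptions for illustration, not ImprovedAAnRule itself.

```rust
/// Returns the article ("a" or "an") expected before `next_word`.
fn expected_article(next_word: &str) -> &'static str {
    // Hypothetical sample lists; the PR loads 4,782 words from LanguageTool data.
    const SILENT_H: [&str; 3] = ["hour", "honest", "heir"];
    const CONSONANT_SOUND: [&str; 4] = ["one", "union", "user", "university"];
    // Letters whose spoken *name* starts with a vowel sound ("eff", "aitch", "em", ...).
    const VOWEL_NAMED_LETTERS: &str = "AEFHILMNORSX";

    let first = match next_word.chars().next() {
        Some(c) => c,
        None => return "a",
    };
    // Assumption: all-uppercase words are acronyms spelled out letter by letter.
    let is_acronym = next_word.len() > 1 && next_word.chars().all(|c| c.is_ascii_uppercase());
    if is_acronym {
        return if VOWEL_NAMED_LETTERS.contains(first) { "an" } else { "a" };
    }
    let lower = next_word.to_lowercase();
    if SILENT_H.iter().any(|w| *w == lower) {
        return "an";
    }
    if CONSONANT_SOUND.iter().any(|w| *w == lower) {
        return "a";
    }
    // Fallback: the naive vowel-letter test that the basic rule relies on.
    if "aeiou".contains(first.to_ascii_lowercase()) { "an" } else { "a" }
}
```

One known limitation of this sketch: acronyms pronounced as words (for example "NASA") would need their own exception list.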

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use PosTagger instead of PassthroughAnalyzer in EN pipeline
- Load 441 POS-tagged words from LanguageTool added.txt
- Enable PosPatternChecker with 94 POS-based rules
- Export POS pattern rules and dictionary from checker module
- Update missing.md with Phase 6 completion status

The pipeline now supports rules that require POS tag matching,
such as "VB + NN" patterns. The tagger uses:
1. Dictionary lookup (441 words from LanguageTool)
2. Suffix heuristics for unknown words
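
The two-step lookup can be sketched roughly like this. The suffix table and the Penn Treebank style tags are assumptions chosen for the example; the heuristics in the actual PosTagger are not shown in this PR text.

```rust
use std::collections::HashMap;

/// Tags a word: dictionary first, then (hypothetical) suffix heuristics.
fn tag(word: &str, dict: &HashMap<&str, &'static str>) -> &'static str {
    let lower = word.to_lowercase();
    // Step 1: dictionary lookup (the PR loads 441 words from added.txt).
    if let Some(&tag) = dict.get(lower.as_str()) {
        return tag;
    }
    // Step 2: suffix heuristics for unknown words; order matters ("ness" before "s").
    if lower.ends_with("ing") {
        "VBG"
    } else if lower.ends_with("ly") {
        "RB"
    } else if lower.ends_with("ed") {
        "VBD"
    } else if lower.ends_with("tion") || lower.ends_with("ness") {
        "NN"
    } else if lower.ends_with('s') {
        "NNS"
    } else {
        "NN" // default guess for anything else
    }
}
```

A dictionary entry always wins over the suffix guess, so irregular words can be corrected by adding them to the dictionary rather than by special-casing the heuristics.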

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 6 (POS Pattern Rules): ✅ Integrated 94 rules + POS tagger
- Phase 7 (Hunspell): ⏸️ Deferred - requires system dependencies
- Phase 8 (Confusion Pairs): ✅ Already complete
- Phase 9 (Disambiguation): ❌ Complex, 761 rules need infrastructure
- Phase 10 (Style Rules): ✅ 1,398 rules already synced

Current estimate: ~35-40% functional parity with LanguageTool

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Detects incorrect pluralization of uncountable nouns:
- "informations" → "information"
- "advices" → "advice"
- "furnitures" → "furniture"
- etc.

Uses a set of ~100 common uncountable nouns for fast checking,
with optional full dictionary mode (5,579 words from LanguageTool).

Integrated into the English pipeline for automatic detection.
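
A minimal sketch of the idea, assuming the simplest possible pluralization test (strip a trailing "s" and look the stem up in the set). The function name and the three-word sample set are hypothetical; the real checker may handle more plural forms.

```rust
use std::collections::HashSet;

/// Returns the suggested singular if `word` is a pluralized uncountable noun.
fn check_uncountable_plural(word: &str, uncountable: &HashSet<&str>) -> Option<String> {
    let lower = word.to_lowercase();
    // Only words ending in "s" can be a naive plural; `?` returns None otherwise.
    let singular = lower.strip_suffix('s')?;
    if uncountable.contains(singular) {
        Some(singular.to_string())
    } else {
        None
    }
}
```

With a set containing "information", "advice", and "furniture", the word "informations" yields the suggestion "information", while "cars" and "advice" produce no suggestion.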

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Detects compound word errors using 8,540 rules from LanguageTool:
- Spaced compounds: "air plane" → "airplane"
- Hyphenated to joined: "air-plane" → "airplane"
- Spaced to hyphenated: "well being" → "well-being"
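
The spaced-compound case can be sketched as a lookup over adjacent word pairs. The map keyed by the joined form is an assumption for this example (a later commit in this PR mentions looking up the joined form), and the function name and sample rules are hypothetical.

```rust
use std::collections::HashMap;

/// Returns (text as written, suggested form) for each spaced compound found.
/// `rules` maps the joined, lowercase form to the correct spelling.
fn find_spaced_compounds(text: &str, rules: &HashMap<&str, &str>) -> Vec<(String, String)> {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut hits = Vec::new();
    // Look at every pair of adjacent words and join them for the lookup.
    for pair in words.windows(2) {
        let key = format!("{}{}", pair[0], pair[1]).to_lowercase();
        if let Some(&correct) = rules.get(key.as_str()) {
            hits.push((format!("{} {}", pair[0], pair[1]), correct.to_string()));
        }
    }
    hits
}
```

Given rules for "airplane" and "wellbeing" (mapped to "well-being"), the text "The air plane improved my well being" yields two suggestions. Punctuation attached to words is not handled in this sketch.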

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- CLAUDE.md now contains instructions + completed/deferred features
- missing.md now contains only truly missing features
- Update parity estimate from 35-40% to 70-80%
- Add structured format with Description/État/Sources/Priorité

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…heckers

- Add PosPatternChecker with 25 FR rules to FR pipeline
- Add StyleChecker.french() with 51 FR rules to FR pipeline
- Add CompoundWordChecker.french() with 1,345 FR rules to FR pipeline
- Refactor CompoundWordChecker to support multiple languages
- Fix compound lookup to use joined form (airplane not air-plane)
- Remove dead code for hyphenated word detection (incompatible with tokenizer)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create ProhibitChecker using EN_PROHIBIT data (330 words)
- Flags words like "Christoper" → "Christopher", "GDPR-complaint" → "GDPR-compliant"
- Add to EN pipeline
- Export is_en_prohibit from data module

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Mark FR pipeline as completed
- Mark ProhibitChecker as completed
- Update coverage estimate to ~85%
- Add notes about remaining items requiring advanced POS/n-gram

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tests/api.rs with 25 E2E tests for EN/FR pipelines
- Add tests/regression.rs with 17 regression tests for edge cases
- Add warm_up() function to pre-initialize LazyLock statics
- Call warm_up() at API startup for faster first request
- Use std::sync::Once in tests to warm up once per test module
- Make checker::data and checker::compound_checker modules public
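
The once-per-module warm-up in tests can be sketched with `std::sync::Once`. Here `warm_up()` is a hypothetical stand-in that only counts its invocations; the crate's real function is described as pre-initializing LazyLock statics, and `ensure_warm` and the counter are names invented for this example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Once;

static WARM_UP: Once = Once::new();
static WARM_UP_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for the crate's warm_up(); it only counts calls here.
fn warm_up() {
    WARM_UP_RUNS.fetch_add(1, Ordering::SeqCst);
}

/// Safe to call from every test; the expensive work runs only on the first call.
fn ensure_warm() {
    WARM_UP.call_once(warm_up);
}
```

Each test would call `ensure_warm()` as its first line. `Once` also blocks concurrent callers until the first initialization finishes, which matters because Rust runs tests in parallel threads by default.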

Test improvements:
- API endpoint format validation
- Language detection tests
- URL/code block filtering (no false positives)
- Compound word detection (EN/FR)
- French punctuation rules
- StyleChecker, ProhibitChecker integration
- Performance sanity check (100 checks < 1s)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ty tests

Quality metrics now cover 37 rules with 109 test cases:
- StyleChecker EN: wordiness/redundancy detection (100% recall)
- StyleChecker FR: wordy phrases (needs improvement)
- CompoundWordChecker EN: spaced compounds (100% recall)
- CompoundWordChecker FR: French compounds (50% recall - apostrophe edge case)
- ProhibitChecker EN: prohibited words (50% recall - hyphen edge case)

Overall results:
- Precision: 100.0%
- Recall: 96.3%
- F1 Score: 98.1%
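
As a sanity check on how the three figures relate, the sketch below computes F1 as the harmonic mean of precision and recall. The helper names are assumptions, and the underlying true/false positive counts are not given in the commit, so only the reported percentages are used.

```rust
/// Safe ratio that returns 0.0 for an empty denominator.
fn ratio(num: u32, den: u32) -> f64 {
    if den == 0 { 0.0 } else { num as f64 / den as f64 }
}

/// Precision = TP / (TP + FP).
fn precision(tp: u32, fp: u32) -> f64 {
    ratio(tp, tp + fp)
}

/// Recall = TP / (TP + FN).
fn recall(tp: u32, fn_count: u32) -> f64 {
    ratio(tp, tp + fn_count)
}

/// F1 = harmonic mean of precision and recall.
fn f1_score(p: f64, r: f64) -> f64 {
    if p + r == 0.0 { 0.0 } else { 2.0 * p * r / (p + r) }
}
```

Plugging in the reported values, `f1_score(1.0, 0.963)` evaluates to roughly 0.981, which agrees with the 98.1% stated above.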

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements false friend detection for French native speakers writing in English,
using 325 confusion pairs from LanguageTool data.

Features:
- L2ConfusionChecker struct with configurable min_factor threshold
- Detects common false friends (lecture→reading, fabric→factory, pretend→claim)
- French message: "Possible faux ami: '{word}' ne signifie pas ce que vous pensez"
- 9 unit tests covering all main scenarios

API Integration:
- New `motherTongue` parameter in CheckRequest (LanguageTool compatible)
- `motherTongue=fr` enables L2 checking for English text
- Only active when target language is English (not French)
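
The activation condition described in this list could be expressed as a small predicate. This is a sketch under assumptions: the struct below mirrors only the two relevant request fields, and the field names, the prefix matching on language codes, and the function name are hypothetical rather than taken from handlers.rs.

```rust
/// Hypothetical subset of the request; only the fields relevant here.
struct CheckRequest {
    language: String,              // target language, e.g. "en-US"
    mother_tongue: Option<String>, // the `motherTongue` form field
}

/// L2 false-friend checking runs only for English text written by French speakers.
fn l2_check_enabled(req: &CheckRequest) -> bool {
    let target_is_english = req.language.to_lowercase().starts_with("en");
    let native_is_french = req
        .mother_tongue
        .as_deref()
        .map(|m| m.to_lowercase().starts_with("fr"))
        .unwrap_or(false);
    target_is_english && native_is_french
}
```

A request with `language=en-US` and `motherTongue=fr` enables the check; omitting `motherTongue`, or checking French text, leaves it off.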

Files:
- src/checker/l2_confusion_checker.rs (new)
- src/checker/mod.rs (export)
- src/bin/api/types.rs (motherTongue field)
- src/bin/api/handlers.rs (L2 checker integration)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add skip_words support to SpellChecker for acronyms, proper nouns
- EN: FST dictionary (370K words) + skip lists (EN_IGNORE + EN_PROPER_NOUNS)
- FR: HashSet dictionary (34K words from FR_SPELLING) + skip list (FR_IGNORE)
- Export spelling data from checker module
- Update documentation

Tests: 177 lib tests + 59 integration tests pass

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add FR_COMMON_WORDS (9,729 words) to FR spell checker
- Combined dictionary now has 43,828 words (vs 34K before)
- Fixes false positives on basic French words like "je", "suis", "allé"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add 5 new E2E tests for spell checker (EN + FR)
- Test misspelling detection, skip words, common words
- Update CLAUDE.md with instructions to use E2E tests

Tests: 241 total (177 lib + 30 E2E + 34 other)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Extract ignore_spelling patterns from LanguageTool disambiguation.xml
and integrate them into the SpellChecker to reduce false positives.

Changes:
- Add DisambigRule, DisambigAction, DisambigWd structs to sync_lt.rs
- Add parse_disambiguation_xml() parser for disambiguation.xml
- Add extract_ignore_spelling_patterns() for skip words/regex
- Add extract_single_token_pos_rules() for POS rules
- Generate {lang}_disambig_skip.rs and {lang}_disambig_pos.rs files
- Export EN_DISAMBIG_SKIP, FR_DISAMBIG_SKIP constants
- Integrate skip patterns into SpellChecker

Stats extracted:
- EN: 24 skip words + 36 regex patterns + 24 single-token POS rules
- FR: 1 skip word + 3 regex patterns + 28 single-token POS rules

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements complete N-gram confusion detection system:

- Lucene reader: Pure Rust parser for LanguageTool's Lucene 4.x indexes
  - VInt decoder, compound file parser, stored fields reader
  - Extracts ngram → count from index files

- Language model: Stupid Backoff probability calculation
  - CompactNgramModel: Memory-mapped binary format (24GB EN, 6GB FR)
  - O(log n) binary search on sorted arrays
  - Near-instant loading via mmap: the OS pages data in on demand, so little RAM is used up front

- NgramConfusionChecker: Detects confusion errors using context
  - Uses calibrated factors from LanguageTool's confusion_sets.txt
  - Supports 1,363 basic + 3,571 extended EN pairs
  - Integrated into EN/FR pipelines (optional)

- R2 auto-download: Transparent data fetching from Cloudflare R2
  - GRAMMAR_RS_AUTO_DOWNLOAD=1 enables automatic download
  - SHA256 checksum verification
  - Feature flag: --features ngram-download
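
For readers unfamiliar with the Stupid Backoff scoring named in the language model bullet, here is a minimal in-memory sketch. It assumes a `HashMap` of space-joined n-grams instead of the memory-mapped CompactNgramModel, and a backoff factor of 0.4, the value commonly quoted for Stupid Backoff; the factor used by grammar-rs is not stated in this PR. The struct and method names are hypothetical.

```rust
use std::collections::HashMap;

/// Backoff factor; 0.4 is an assumption, not a value confirmed by the PR.
const ALPHA: f64 = 0.4;

/// Hypothetical in-memory stand-in for the n-gram index.
struct NgramCounts {
    counts: HashMap<String, u64>, // key: space-joined n-gram
    total_unigrams: u64,
}

impl NgramCounts {
    fn count(&self, words: &[&str]) -> u64 {
        *self.counts.get(&words.join(" ")).unwrap_or(&0)
    }

    /// Score of the last word of `ngram` given the words before it.
    fn score(&self, ngram: &[&str]) -> f64 {
        match ngram.len() {
            0 => 0.0,
            1 => {
                if self.total_unigrams == 0 {
                    0.0
                } else {
                    self.count(ngram) as f64 / self.total_unigrams as f64
                }
            }
            n => {
                let full = self.count(ngram);
                let context = self.count(&ngram[..n - 1]);
                if full > 0 && context > 0 {
                    // Seen n-gram: relative frequency given its context.
                    full as f64 / context as f64
                } else {
                    // Unseen: drop the first word and discount by ALPHA.
                    ALPHA * self.score(&ngram[1..])
                }
            }
        }
    }
}
```

The scores are not normalized probabilities, which is what keeps the method cheap: a confusion checker only needs to compare the score of the written word against the score of its alternative in the same context.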

Data hosted at: https://pub-8068a615549c43e1893eb3f9a35a0e17.r2.dev/ngrams/
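
The VInt decoder mentioned for the Lucene reader refers to Lucene's variable-length integer encoding: each byte carries seven data bits, low-order group first, and a set high bit means another byte follows. The sketch below decodes one 32-bit value; the function name and the `Option` return shape are assumptions, not the reader's actual API.

```rust
/// Decodes one VInt from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or None if the
/// input ends early or runs past five bytes.
fn read_vint(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    // A 32-bit value needs at most five 7-bit groups.
    for (i, &b) in bytes.iter().enumerate().take(5) {
        value |= ((b & 0x7F) as u32) << (7 * i);
        if b & 0x80 == 0 {
            // High bit clear: this was the last byte.
            return Some((value, i + 1));
        }
    }
    None
}
```

Returning the consumed length lets a caller advance through a buffer of back-to-back VInts without tracking the encoding width separately.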

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update all EN/FR data files via sync-lt:
- Pattern rules, antipatterns, confusion pairs
- Style rules, compounds, contractions
- Spelling lists, proper nouns, ignore lists
- POS patterns, disambiguation data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Script to download raw N-gram data from LanguageTool for local extraction.
Use download_ngrams_r2.sh for pre-built binaries instead.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@StanGirard StanGirard merged commit 8e26580 into main Jan 24, 2026
1 of 2 checks passed