[WIP] Use nuspell for spelling correction by veloman-yunkan · Pull Request #1246 · kiwix/libkiwix

veloman-yunkan · 2025-11-21T14:37:48Z

Fixes openzim/libzim#1014
Fixes openzim/libzim#1012

This is the initial version of using nuspell for spelling correction, which has yet to be tuned.

Note that libnuspell must be available as a dependency.

Xapian-based code for spelling correction is not deleted.

kelson42 · 2025-11-22T17:51:05Z

@veloman-yunkan It is very good news to see this draft.

Does it implement openzim/libzim#1012 as well?

It is important we validate the results with @gremid before merging and releasing a new libkiwix.

veloman-yunkan · 2025-11-24T12:00:40Z

> Does it implement openzim/libzim#1012 as well?

Yes.

> It is important we validate the results with @gremid before merging and releasing a new libkiwix.

Sure

This is a temporary change to facilitate playing with different affix rules.

veloman-yunkan · 2025-11-24T14:31:47Z

@kelson42 @gremid

For DWDS data Nuspell doesn't work well out of the box or using the Hunspell spellchecking configuration for German. However, I believe that it can be made to work satisfactorily with additional rules tailored to the language-learning use case. I am providing a test package that can be used to experiment with nuspell-based spelling correction on Linux (see the README inside).

kelson42 · 2025-11-25T07:10:34Z

> For DWDS data Nuspell doesn't work well out of the box or using the Hunspell spellchecking configuration for German.

I am very worried about this assessment. Considering the importance of the German language in general and DWDS in particular, I take it as a hint that Nuspell is not the proper tooling. We cannot afford to tune the tool for all the main languages.

gremid · 2025-11-25T10:45:52Z

For what it is worth, I also doubt that a spell-checking library is the right approach. It certainly has some functional overlap when matching entries in a given dictionary, but additional features that cater to misspellings beyond what is in such a dictionary, i.e. language-specific support for word formation, affix stripping etc., are not of much use when dealing with a finite set of potential matches.

If I were left to my (JVM-related) devices, I would use the fuzzy search functionality as provided by Apache Lucene or something comparable based on an n-gram index. But I do not know the C/C++ ecosystem for such libraries.
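To make the n-gram idea concrete, here is a minimal C++ sketch of a trigram index over a fixed set of titles. It is not Lucene's implementation and not code from this PR; the class name `TrigramIndex`, the space padding, the case-sensitive byte-wise comparison, and the `minShared` threshold are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical trigram index: maps each 3-byte gram to the titles containing it.
// Assumption: titles are compared byte-wise and case-sensitively; real UTF-8
// titles would need decoding and normalization first.
class TrigramIndex {
public:
    void add(const std::string& title) {
        const std::size_t id = titles_.size();
        titles_.push_back(title);
        for (const auto& gram : trigrams(title)) {
            auto& ids = postings_[gram];
            // A gram may repeat within one title; record the title only once.
            if (ids.empty() || ids.back() != id) ids.push_back(id);
        }
    }

    // Returns, in insertion order, the titles sharing at least minShared
    // distinct trigrams with the query.
    std::vector<std::string> candidates(const std::string& query,
                                        std::size_t minShared) const {
        assert(minShared > 0);
        auto grams = trigrams(query);
        std::sort(grams.begin(), grams.end());
        grams.erase(std::unique(grams.begin(), grams.end()), grams.end());

        std::unordered_map<std::size_t, std::size_t> shared;  // title id -> count
        for (const auto& gram : grams) {
            const auto it = postings_.find(gram);
            if (it == postings_.end()) continue;
            for (const std::size_t id : it->second) ++shared[id];
        }

        std::vector<std::string> result;
        for (std::size_t id = 0; id < titles_.size(); ++id) {
            const auto it = shared.find(id);
            if (it != shared.end() && it->second >= minShared) {
                result.push_back(titles_[id]);
            }
        }
        return result;
    }

private:
    // Pads the word with one space on each side so that word boundaries
    // contribute their own grams.
    static std::vector<std::string> trigrams(const std::string& word) {
        const std::string padded = " " + word + " ";
        std::vector<std::string> grams;
        for (std::size_t i = 0; i + 3 <= padded.size(); ++i) {
            grams.push_back(padded.substr(i, 3));
        }
        return grams;
    }

    std::vector<std::string> titles_;
    std::map<std::string, std::vector<std::size_t>> postings_;
};
```

With the titles "Stuhl", "Tisch" and "Schule" indexed, the query "Schtuhl" shares three trigrams with "Stuhl" and two with "Schule". An index like this only narrows the candidate set cheaply; the surviving candidates would still have to be ranked by some similarity measure.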

veloman-yunkan · 2025-11-27T12:55:07Z

> For what it is worth, I also doubt that a spell-checking library is the right approach.

@gremid I can agree with your opinion if it pertains solely to the DWDS application. I don't know much about it, but as far as I understand it targets learners of German. Mistakes made in short exercises by novices in a language can be quite different from those made by native speakers in longer texts in casual writing. In the latter case misspellings are mostly typos (a result of fast typing), whereas in the former case mistakes originate from a poor command of the language and can vary depending on the native language of the learner. For example, a (literate) native German speaker would hardly ever spell Stuhl as Schtuhl, but such a misspelling can be expected from a new learner struggling with the phonetics of the German digraphs st and sp at the beginning of syllables.

However, as far as spellchecking functionality is concerned in the context of libzim, I don't see why a general-purpose spellchecking solution should be inferior to a custom one.

> considering the importance of the German language in general and DWDS in particular, I take it as a hint that Nuspell is not the proper tooling. We cannot afford to tune the tool for all the main languages.

@kelson42 We need to define why we need spellchecking in libkiwix (or, later, in libzim). With nuspell we can achieve general-purpose spellchecking while at the same time giving client applications the possibility of special-purpose adjustments.

> It certainly has some functional overlap when matching entries in a given dictionary, but additional features that cater to misspellings beyond what is in such a dictionary, i.e. language-specific support for word formation, affix stripping etc., are not of much use when dealing with a finite set of potential matches.

Lest there be any confusion, let me make it clear that nuspell in my prototype implementation is used exclusively as a spellchecking engine without any access to the language dictionaries. The "dictionary" data for it is created from the titles of the ZIM articles.

gremid · 2025-11-27T18:23:20Z

> @gremid I can agree with your opinion if it pertains solely to the DWDS application. I don't know much about it, but as far as I understand it targets learners of German. Mistakes made in short exercises by novices in a language can be quite different from those made by native speakers in longer texts in casual writing. In the latter case misspellings are mostly typos (a result of fast typing), whereas in the former case mistakes originate from a poor command of the language and can vary depending on the native language of the learner. For example, a (literate) native German speaker would hardly ever spell Stuhl as Schtuhl, but such a misspelling can be expected from a new learner struggling with the phonetics of the German digraphs st and sp at the beginning of syllables.

I did not express myself very clearly. My point was that Stuhl vs. Schtuhl has a Levenshtein distance of 2 and should therefore be matched by any suggestion technique covering that distance. It is independent of the origin of the misspelling, i.e., which user cohort is more prone to such misspellings.

As documented in https://github.com/gremid/xapian-spelling-suggestions/blob/main/test_spelling_suggestion.py#L12-L13 for some typical misspellings in our test dataset, covering a Levenshtein distance of 3 is needed for all tests to pass. But even with an acceptable distance of 2 we would cover a substantial subset in a language-independent way. This is why I would not want to tune Nuspell by manually adding rules, but would instead be happy with a technique that generally allows lookups for any match within a defined distance (2 or 3 are up for debate).
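The distance-bounded lookup described here can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not code from the PR or from the linked test repository: the helper name `suggestWithin` is hypothetical, strings are compared byte-wise (real UTF-8 titles would need decoding to code points first), and the linear scan over all titles is the simplest possible strategy rather than an efficient one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Classic dynamic-programming Levenshtein distance (insert, delete,
// substitute, each with cost 1), keeping only two rows of the table.
// Assumption: inputs are treated as byte sequences.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Hypothetical helper: every title within maxDistance edits of the query,
// found by a plain linear scan.
std::vector<std::string> suggestWithin(const std::vector<std::string>& titles,
                                       const std::string& query,
                                       std::size_t maxDistance) {
    std::vector<std::string> result;
    for (const auto& title : titles) {
        if (levenshtein(query, title) <= maxDistance) result.push_back(title);
    }
    return result;
}

// Usage example from the thread: "Schtuhl" is two edits away from "Stuhl".
void example() {
    const std::vector<std::string> titles = {"Stuhl", "Stahl", "Schule"};
    const auto hits = suggestWithin(titles, "Schtuhl", 2);
    assert(hits.size() == 1 && hits[0] == "Stuhl");
}
```

Raising `maxDistance` from 2 to 3 widens the match set (here it would also admit "Stahl" and "Schule"), which is the trade-off behind the "2 or 3" question.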

Use nuspell for spelling correction

88d8f27

User affix file can be used for spelling correction

68c9702

veloman-yunkan wants to merge 2 commits into main from nuspell
