[WIP] Use nuspell for spelling correction #1246
This is the initial version of using nuspell for spelling correction, which has yet to be tuned. Note that libnuspell must be available as a dependency. The Xapian-based code for spelling correction is not deleted.
@veloman-yunkan It is very good news to see this draft. Does it implement openzim/libzim#1012 as well? It is important that we validate the results with @gremid before merging and releasing a new libkiwix.
Yes.
Sure
This is a temporary change to facilitate playing with different affix rules.
For DWDS data, Nuspell doesn't work well either out of the box or with the Hunspell spellchecking configuration for German. However, I believe that it can be made to work satisfactorily with additional rules tailored to the language-learning use case. I am providing a test package that can be used to experiment with nuspell-based spelling correction on Linux (see the README inside).
I am very worried about this assessment. Considering the importance of the German language in general and DWDS in particular, I take it as a hint that Nuspell is not the proper tooling. We cannot afford to tune the tool for all the main languages.
For what it is worth, I also doubt that a spell-checking library is the right approach. It certainly has some functional overlap when matching entries in a given dictionary, but additional features that cater to misspellings beyond what is in such a dictionary, i.e. language-specific support for word formation, affix stripping etc., are not of much use when dealing with a finite set of potential matches. If I were left to my (JVM-related) devices, I would use the fuzzy search functionality as provided by Apache Lucene or something comparable based on an n-gram index. But I do not know the C/C++ ecosystem for such libraries.
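To make the n-gram idea concrete, here is a minimal sketch in Python. It is not how Lucene implements fuzzy search; it is a plain character-trigram index with Jaccard similarity, and the class name `TrigramIndex`, the padding scheme, and the `min_similarity` threshold of 0.3 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def trigrams(word: str) -> set[str]:
    """Return the set of character trigrams of a case-folded, padded word."""
    padded = f"  {word.casefold()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    """Hypothetical in-memory trigram index over a finite list of titles."""

    def __init__(self, titles):
        self._titles = list(titles)
        self._postings = defaultdict(set)  # trigram -> ids of titles containing it
        for idx, title in enumerate(self._titles):
            for gram in trigrams(title):
                self._postings[gram].add(idx)

    def search(self, query, min_similarity=0.3):
        """Return titles whose trigram Jaccard similarity to the query is high enough."""
        query_grams = trigrams(query)
        # Only titles sharing at least one trigram with the query are candidates.
        candidate_ids = set()
        for gram in query_grams:
            candidate_ids |= self._postings.get(gram, set())
        scored = []
        for idx in candidate_ids:
            title_grams = trigrams(self._titles[idx])
            similarity = len(query_grams & title_grams) / len(query_grams | title_grams)
            if similarity >= min_similarity:
                scored.append((similarity, self._titles[idx]))
        # Best match first; ties are broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [title for _, title in scored]
```

With the toy title list `["Stuhl", "Stahl", "Schule", "Tisch"]`, `TrigramIndex(titles).search("Schtuhl")` returns `["Stuhl"]`. The index only narrows down candidates cheaply; a real setup would probably re-rank them with an edit-distance check.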
@gremid I can agree with your opinion if it pertains solely to the DWDS application. I don't know much about it, but as far as I understand it targets learners of German. Mistakes made in short exercises by novices in a language can be quite different from those made by native speakers in longer texts in casual writing. In the latter case misspellings are mostly typos (a result of fast typing), whereas in the former case mistakes originate from poor command of the language and can vary depending on the native language of the learner. For example, a (literate) native German speaker would hardly ever spell Stuhl as Schtuhl, but such a misspelling can be expected from a new learner struggling with the phonetics of the German digraphs st & sp at the beginning of syllables. However, as far as spellchecking functionality in the context of libzim is concerned, I don't see why a general-purpose spellchecking solution should be inferior to a custom one.
@kelson42 We need to define why we need spellchecking in libkiwix (or, later, in libzim). With nuspell we can achieve general-purpose spellchecking while at the same time providing the possibility of special-purpose adjustments by client applications.
Lest there be any confusion, let me make it clear that nuspell in my prototype implementation is used exclusively as a spellchecking engine without any access to the language dictionaries. The "dictionary" data for it is created from the titles of the ZIM articles.
I did not express myself very clearly. My point was that Stuhl vs. Schtuhl has a Levenshtein distance of 2 and should therefore be matched by any suggestion technique covering that distance. It is independent of the origin of the misspelling, i.e., which user cohort is more prone to such misspellings. As documented in https://github.com/gremid/xapian-spelling-suggestions/blob/main/test_spelling_suggestion.py#L12-L13 for some typical misspellings in our test dataset, covering a Levenshtein distance of 3 is needed for all tests to pass. But even with an acceptable distance of 2 we would cover a substantial subset in a language-independent way. This is why I would not want to tune Nuspell by manually adding rules, but would instead be happy with a technique that generally allows lookups for any match within a defined distance (2 or 3 are up for debate).
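For illustration, a distance-bounded lookup over a finite title list could be sketched as follows. This is a brute-force Python sketch, not the code from the linked repository or from this PR; the function names `levenshtein` and `suggest`, the case folding, and the sample titles are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def suggest(query: str, titles: list[str], max_distance: int = 2) -> list[str]:
    """Return all titles within max_distance of the query, closest first."""
    folded_query = query.casefold()
    scored = []
    for title in titles:
        distance = levenshtein(folded_query, title.casefold())
        if distance <= max_distance:
            scored.append((distance, title))
    scored.sort()  # by distance, then alphabetically
    return [title for _, title in scored]
```

Here `levenshtein("Stuhl", "Schtuhl")` is 2 (insert `c` and `h`), and `suggest("Schtuhl", ["Stuhl", "Stahl", "Schule", "Tisch"])` returns `["Stuhl"]`. Scanning every title is linear in the size of the title set, so for a large ZIM file this would presumably need an index or automaton in front of it.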
Fixes openzim/libzim#1014
Fixes openzim/libzim#1012