What feature would you like to request?
Hi, I’m exploring how to get the Qdrant/bm25 model to tokenize and vectorize Thai text correctly.
I noticed that the model cache directory contains multiple language-specific text files:
(see image below)
I’d like to add a custom Thai corpus based on this word list:
https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt
However, simply adding a thai.txt file to the model cache directory has no effect: the file is not picked up at runtime. What is the correct way to extend or register a new language corpus for BM25? Are additional configuration or rebuilding steps required?
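For context, here is a minimal sketch of the kind of dictionary-based segmentation I have in mind. Thai is written without spaces between words, so a word list such as words_th.txt would be used to split the text rather than only to filter it. The sketch is a plain greedy longest-match tokenizer over a tiny hard-coded word set standing in for the real list. The function name `segment_thai` and the sample vocabulary are hypothetical, and this is not a description of how Qdrant/bm25 tokenizes today.

```python
def segment_thai(text, vocabulary):
    """Split text into tokens by greedy longest match against a word set.

    Hypothetical helper for illustration only; it is not part of any library.
    Characters that match no vocabulary entry are emitted one at a time.
    """
    # Longest entry bounds how far ahead we need to look at each position.
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Whitespace separates tokens but is never a token itself.
        if text[i].isspace():
            i += 1
            continue
        match = None
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                match = candidate
                break
        if match is None:
            match = text[i]  # unknown character: fall back to a single char
        tokens.append(match)
        i += len(match)
    return tokens


# Tiny stand-in for words_th.txt (assumed format: one word per line).
vocabulary = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}
tokens = segment_thai("ฉัน" + "รัก" + "ภาษาไทย", vocabulary)
print(tokens)
```

If nothing like this can be plugged in, one workaround I could imagine (an assumption on my side, not something I have verified) is to pre-segment the text this way and join the tokens with spaces before passing it to the model.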
Is there any additional information you would like to provide?
No response