[Feature]: Add custom token corpus for other languages for bm25

### What feature would you like to request?

Hi, I’m exploring how to get Qdrant/bm25 to properly tokenize and vectorize Thai text.

I noticed that the model cache directory contains multiple language-specific text files:

(see image below)

I’d like to add a custom Thai corpus based on this word list:
https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt

However, simply adding a thai.txt file to the model cache directory does not work: the file is not picked up on the fly. Could you clarify the correct way to extend or register a new language corpus for BM25, and whether additional configuration or rebuilding steps are required?
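To make the request concrete, here is a minimal sketch of what I mean by tokenizing with a word list. It is plain Python and does not use or describe how Qdrant/bm25 works internally. The function names (`load_words`, `segment`) and the tiny `SAMPLE_WORDS` set are hypothetical and for illustration only; in practice the set would be loaded from words_th.txt. Thai is written without spaces between words, so the sketch uses greedy longest-match segmentation against the dictionary.

```python
from typing import Iterable, List, Set


def load_words(lines: Iterable[str]) -> Set[str]:
    """Build a dictionary from one-word-per-line text (e.g. an open file)."""
    return {line.strip() for line in lines if line.strip()}


def segment(text: str, words: Set[str]) -> List[str]:
    """Greedy longest-match segmentation against a word list.

    Characters that start no dictionary word are emitted as single tokens.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest candidate first, shrinking until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in words:
                break
        else:
            piece = text[i]  # no dictionary match at this position
        tokens.append(piece)
        i += len(piece)
    return tokens


# Hypothetical sample dictionary; a real one would come from words_th.txt:
#   with open("words_th.txt", encoding="utf-8") as f:
#       words = load_words(f)
SAMPLE_WORDS = load_words(["ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"])
tokens = segment("ฉันรักภาษาไทย", SAMPLE_WORDS)
```

Greedy longest-match is only one simple strategy and can mis-segment ambiguous text. What I am asking is whether BM25 offers a supported hook where this kind of language-specific tokenization (or a ready-made tokenizer) can be plugged in.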

<img width="413" height="607" alt="Image" src="https://github.com/user-attachments/assets/ee6dae57-132d-415e-9e99-4e0985677fb8" />

### Is there any additional information you would like to provide?

_No response_

[Feature]: Add custom token corpus for other languages for bm25 #594