[Feature]: Add custom token corpus for other languages for bm25 #594

@Nevermetyou65

What feature would you like to request?

Hi, I’m exploring how to enable Qdrant/bm25 to properly tokenize and vectorize the Thai language.

I noticed that the model cache directory contains multiple language-specific text files:

(see image below)

I’d like to add a custom Thai corpus based on this word list:
https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt

However, simply adding a thai.txt file to the model cache directory does not work: the file is not picked up on the fly. Could you clarify the correct way to extend or register a new language corpus for BM25, and whether additional configuration or a rebuild step is required?
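
To make the request concrete, here is a minimal sketch of the dictionary-based tokenization that a word list like words_th.txt would enable, since Thai text has no spaces between words. This is plain Python and not part of the Qdrant/bm25 implementation; `load_word_list` and `longest_match_tokenize` are hypothetical names, and the tiny inline list only stands in for the real file.

```python
def load_word_list(lines):
    """Build a set of dictionary words from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def longest_match_tokenize(text, words):
    """Greedy longest-match segmentation against a word set.

    Whitespace is skipped; characters that start no dictionary word
    are emitted as single-character tokens.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        match = text[i]  # fallback: a single character
        # Try the longest candidate first, down to length 2.
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in words:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Inline stand-in for the real word list file (assumption for illustration).
words = load_word_list(["ภาษา", "ไทย", "ภาษาไทย", "สวัสดี"])
tokens = longest_match_tokenize("สวัสดี ภาษาไทย", words)
```

With the real file, the word set could be built by passing an open file handle to `load_word_list`. Greedy longest-match is only the simplest option; it can mis-segment ambiguous strings, so a proper Thai tokenizer may be preferable.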

[Image: screenshot of the model cache directory]

Is there any additional information you would like to provide?

No response
