Reading SentencePieceVocab from text file #51

@MikaelCall

I've created a SentencePiece model using Python, which produces a .model and a .vocab file. It is not possible to create a SentencePieceVocab from the latter, since the Python tooling does not seem to write it as protobuf but rather as a plain text file with one tab-separated piece and score per line. Here's an excerpt of my file:

<unk>	0
<s>	0
</s>	0
▁	-2.29038
s	-3.10405
l	-3.41047

I didn't find an option in the Python code for writing the vocab file as protobuf, so I wrote a parser for the text format. Unless I'm mistaken or did something wrong, would you like that code as a PR? I.e. something like:

impl SentencePieceVocab {
    ...
    
    pub fn from_vocab_txt_file(path: &str) -> Result<SentencePieceVocab, TokenizerError> { 
        ... 
    }
}

in rust-tokenizers/main/src/vocab/sentence_piece_vocab.rs
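
To make the proposal concrete, here is a minimal, standalone sketch of how such a parser could look. It is not the crate's implementation: `TxtVocab` and `parse_vocab_txt` are hypothetical names, errors are plain `String`s rather than `TokenizerError`, and it assumes that a piece's id is its position in the file and that the second column is the piece's score, as the excerpt above suggests.

```rust
use std::collections::HashMap;
use std::fs;

/// Hypothetical container for illustration; NOT the crate's SentencePieceVocab.
#[derive(Debug, Default)]
pub struct TxtVocab {
    /// piece -> id (assumption: id is the position among non-empty lines)
    pub ids: HashMap<String, i64>,
    /// id -> score taken from the second column
    pub scores: Vec<f32>,
}

/// Parse the contents of a plain-text `.vocab` file: one `piece<TAB>score` per line.
pub fn parse_vocab_txt(text: &str) -> Result<TxtVocab, String> {
    let mut vocab = TxtVocab::default();
    for (line_no, line) in text.lines().enumerate() {
        if line.is_empty() {
            continue; // tolerate blank lines, e.g. a trailing newline
        }
        // Split on the LAST tab so the score is always the final field.
        let (piece, score) = line
            .rsplit_once('\t')
            .ok_or_else(|| format!("line {}: missing tab separator", line_no + 1))?;
        let score: f32 = score
            .trim()
            .parse()
            .map_err(|e| format!("line {}: invalid score: {}", line_no + 1, e))?;
        let id = vocab.scores.len() as i64;
        vocab.ids.insert(piece.to_string(), id);
        vocab.scores.push(score);
    }
    Ok(vocab)
}

/// Read a `.vocab` text file from disk and parse it.
pub fn from_vocab_txt_file(path: &str) -> Result<TxtVocab, String> {
    let text = fs::read_to_string(path).map_err(|e| format!("{}: {}", path, e))?;
    parse_vocab_txt(&text)
}
```

With the excerpt above as input, `<unk>` would map to id 0 and `▁` to id 3. A real PR would presumably populate the existing `SentencePieceVocab` fields and return `TokenizerError` instead.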
