The current implementation for mapping special tokens to their ids selects from the vocab file every entry that contains the string "token". This broke when new ordinary words containing "token" appeared in the vocab. However, for (at least non-SentencePiece) tokenizers in Hugging Face transformers, there are already two attributes for exactly this:
- `tokenizer.all_special_tokens`
- `tokenizer.all_special_ids`
Let's test these and replace our implementation with the officially supported attributes.
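For illustration only (this is not the repo's exact code, and the vocab fragment and ids are made up), substring-matching a vocab on "token" over-matches as soon as ordinary words containing that string show up:

```python
# Hypothetical vocab fragment; ids are invented for the example.
vocab = {"<pad>": 0, "<unk>": 1, "tokenize": 2, "tokens": 3}

# The fragile approach described above: treat anything containing
# the string "token" as a special token.
special = {tok: i for tok, i in vocab.items() if "token" in tok}
print(special)  # {'tokenize': 2, 'tokens': 3} -- plain words, not special tokens
```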
The current implementation starts at `def map_special_tokens_to_ids(` in
tftokenizers/tftokenizers/utils.py (line 8 in 14dc752).
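A minimal sketch of what the replacement could look like. The function name comes from the snippet above, but its full signature is truncated there, so the single `tokenizer` parameter and the dict return type are assumptions; the printed ids are illustrative for `bert-base-uncased`:

```python
from transformers import AutoTokenizer


def map_special_tokens_to_ids(tokenizer) -> dict:
    """Map each special token to its id using the tokenizer's official
    attributes instead of substring-matching the vocab."""
    # all_special_ids is derived from all_special_tokens, so the two
    # lists line up index-for-index and can be zipped directly.
    return dict(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(map_special_tokens_to_ids(tokenizer))
# e.g. {'[UNK]': 100, '[SEP]': 102, '[PAD]': 0, '[CLS]': 101, '[MASK]': 103}
```

With this approach, a vocab word like "tokenize" no longer matters, because only the tokenizer's registered special tokens are consulted.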