The current implementation for mapping special tokens to their ids selects from the vocab file every entry that contains the string "token". This broke when new ordinary words containing "token" appeared in the vocab. However, for (at least non-SentencePiece) tokenizers in Hugging Face transformers, there are already two attributes for exactly this:
- `tokenizer.all_special_tokens`
- `tokenizer.all_special_ids`
Let's test these and replace our implementation with the officially supported attributes.
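For illustration only (this is not the repo's exact code, and the vocab fragment and ids are made up), substring-matching a vocab on "token" over-matches as soon as ordinary words containing that string show up:

```python
# Hypothetical vocab fragment; ids are invented for the example.
vocab = {"<pad>": 0, "<unk>": 1, "tokenize": 2, "tokens": 3}

# The fragile approach described above: treat anything containing
# the string "token" as a special token.
special = {tok: i for tok, i in vocab.items() if "token" in tok}
print(special)  # {'tokenize': 2, 'tokens': 3} -- plain words, not special tokens
```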
The current implementation starts at `def map_special_tokens_to_ids(` in
tftokenizers/tftokenizers/utils.py (line 8 in 14dc752).
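A minimal sketch of what the replacement could look like. The function name comes from the snippet above, but its full signature is truncated there, so the single `tokenizer` parameter and the dict return type are assumptions; the printed ids are illustrative for `bert-base-uncased`:

```python
from transformers import AutoTokenizer


def map_special_tokens_to_ids(tokenizer) -> dict:
    """Map each special token to its id using the tokenizer's official
    attributes instead of substring-matching the vocab."""
    # all_special_ids is derived from all_special_tokens, so the two
    # lists line up index-for-index and can be zipped directly.
    return dict(zip(tokenizer.all_special_tokens, tokenizer.all_special_ids))


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(map_special_tokens_to_ids(tokenizer))
# e.g. {'[UNK]': 100, '[SEP]': 102, '[PAD]': 0, '[CLS]': 101, '[MASK]': 103}
```

With this approach, a vocab word like "tokenize" no longer matters, because only the tokenizer's registered special tokens are consulted.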