Hi,
I tried to get the byte probabilities for Gemma-2 (after downloading the model from HuggingFace) and I was getting this error:
Traceback (most recent call last):
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/backend/tokenization/bytes.py", line 52, in get_byte_vocab
    check_byte_decoder(tokenizer, byte_decoder)
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/backend/tokenization/bytes.py", line 128, in check_byte_decoder
    _check_byte_decoder_has_all_bytes(tokenizer, byte_decoder)
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/backend/tokenization/bytes.py", line 151, in _check_byte_decoder_has_all_bytes
    raise ByteDecoderError(
genlm.backend.tokenization.bytes.ByteDecoderError: Byte decoder is missing bytes: […]
…
…
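To see which bytes are affected, a check along these lines can help. It is a minimal sketch, not genlm's actual implementation: `missing_bytes` is a hypothetical helper, and it assumes the byte decoder is a plain mapping whose values are integer byte values (0 to 255), as in GPT-2 style tokenizers. The toy decoder below stands in for the real one.

```python
def missing_bytes(byte_decoder):
    """Return the sorted byte values (0-255) that never appear as a decoder value.

    `byte_decoder` is assumed to map token characters to integer byte values.
    """
    return sorted(set(range(256)) - set(byte_decoder.values()))


# Toy decoder covering only printable ASCII (32..126); not a real tokenizer's table.
toy_decoder = {chr(b): b for b in range(32, 127)}
missing = missing_bytes(toy_decoder)
```

An empty result would mean every byte is reachable; any other result lists the byte values the decoder cannot produce.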
From the codebase, it looked like the SentencePiece tokenizer was not being initialised correctly. I tried to work around this by adding the following lines:
import sentencepiece as spm

LLM = load_model_by_name(<GEMMA-2>)
sp = spm.SentencePieceProcessor()
sp.load(<GEMMA_2>)
LLM.tokenizer.sp = sp
However, this only led to a different exception:
Traceback (most recent call last):
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/byte_lm/beam.py", line 86, in initial
    async_trie = AsyncTokenByteTrie.from_vocab(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/trie.py", line 509, in from_vocab
    trie = TokenByteTrie(decode=vocab, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/trie.py", line 54, in __init__
    self._build_trie(atomic_tokens or [])
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/trie.py", line 84, in _build_trie
    raise ValueError(f"Duplicate word in vocabulary: {word}")
ValueError: Duplicate word in vocabulary: b'\n'
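To narrow down which tokens collide, a check like the following can be run on the byte vocabulary before the trie is built. This is only a sketch under stated assumptions: `find_duplicate_tokens` is a hypothetical helper (not part of genlm), and it assumes the vocabulary is available as a list of `bytes` objects indexed by token id. The toy vocabulary is illustrative and is not the Gemma-2 vocabulary.

```python
from collections import defaultdict


def find_duplicate_tokens(vocab):
    """Map each byte string that occurs more than once to the token ids that share it."""
    ids_by_bytes = defaultdict(list)
    for token_id, token_bytes in enumerate(vocab):
        ids_by_bytes[token_bytes].append(token_id)
    # Keep only byte strings claimed by two or more token ids.
    return {b: ids for b, ids in ids_by_bytes.items() if len(ids) > 1}


# Toy vocabulary in which two token ids decode to the same newline byte.
toy_vocab = [b"a", b"\n", b"b", b"\n"]
duplicates = find_duplicate_tokens(toy_vocab)
```

Knowing the colliding token ids makes it possible to look up the original token strings and see why they decode to the same bytes.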
I previously obtained byte probabilities from Llama-3.2 3B with the same code, and it worked fine.
Is there some fix/workaround for this? Please do let me know if so. Thanks!