
Unable to generate byte probabilities for Gemma 2 2B IT #13

@avyavkumar

Hi,

I tried to get the byte probabilities for Gemma-2 (after downloading the model from Hugging Face), but I got the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/backend/tokenization/bytes.py", line 52, in get_byte_vocab
    check_byte_decoder(tokenizer, byte_decoder)
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/backend/tokenization/bytes.py", line 128, in check_byte_decoder
    _check_byte_decoder_has_all_bytes(tokenizer, byte_decoder)
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/backend/tokenization/bytes.py", line 151, in _check_byte_decoder_has_all_bytes
    raise ByteDecoderError(
genlm.backend.tokenization.bytes.ByteDecoderError: Byte decoder is missing bytes: […]
…
…
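
In case it helps with reproduction, a quick check along these lines shows whether the tokenizer exposes a GPT-2-style byte_decoder at all and, if so, which of the 256 byte values it covers. This is my own sketch rather than genlm code, and both the model id (the Hugging Face checkpoint for Gemma 2 2B IT) and the byte_decoder lookup are assumptions about what check_byte_decoder inspects:

from transformers import AutoTokenizer

# Assumed model id for Gemma 2 2B IT.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# Inspect a GPT-2-style `byte_decoder` attribute, if one exists, and report
# which of the 256 byte values it does not cover.
byte_decoder = getattr(tokenizer, "byte_decoder", None)
if byte_decoder is None:
    print("tokenizer has no byte_decoder attribute")
else:
    missing = sorted(set(range(256)) - set(byte_decoder.values()))
    print("missing byte values:", missing)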

From the codebase, it looked like the SentencePiece tokenizer was not being initialised correctly, so I attempted to work around this by adding the following lines:

LLM = load_model_by_name(<GEMMA-2>)
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load(<GEMMA_2>)
LLM.tokenizer.sp = sp

However, I then hit another exception:

Traceback (most recent call last):
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/byte_lm/beam.py", line 86, in initial
    async_trie = AsyncTokenByteTrie.from_vocab(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/trie.py", line 509, in from_vocab
    trie = TokenByteTrie(decode=vocab, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/trie.py", line 54, in __init__
    self._build_trie(atomic_tokens or [])
  File "/opt/conda/envs/<ENV>/lib/python3.12/site-packages/genlm/bytes/trie.py", line 84, in _build_trie
    raise ValueError(f"Duplicate word in vocabulary: {word}")
ValueError: Duplicate word in vocabulary: b'\n'
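
This error reads to me as two different vocabulary entries mapping to the same byte string (perhaps a literal newline piece alongside the byte-fallback piece <0x0A>, though I have not confirmed which entries collide). A rough way to see which entries collapse to the same bytes is sketched below; it only approximates how genlm builds its byte vocabulary (it decodes ids one at a time via transformers, and the model id is again an assumption):

from collections import defaultdict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# Group vocabulary entries by the byte string each one decodes to; any group
# with more than one entry is a candidate for the "Duplicate word" error.
by_bytes = defaultdict(list)
for token, token_id in tokenizer.get_vocab().items():
    decoded = tokenizer.decode([token_id]).encode("utf-8")
    by_bytes[decoded].append(token)

for decoded, tokens in by_bytes.items():
    if len(tokens) > 1:
        print(decoded, tokens)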

I have previously obtained byte probabilities from Llama-3.2 3B, where the same code worked fine.

Is there a fix or workaround for this? Please let me know if so. Thanks!
