
CLIPTokenizer uses 10**30 as model_max_length #45538

@D1-3105

Description


System Info

transformers==5.5.4
python3.12

vs

transformers==4.57.6
python3.12

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

This script:

from transformers import CLIPTokenizer

# Load strictly from the local cache (no network access):
tok1 = CLIPTokenizer.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", subfolder="tokenizer", local_files_only=True)

# Load with network access allowed (downloads the files if missing):
tok2 = CLIPTokenizer.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", subfolder="tokenizer")
print(tok1.model_max_length, tok2.model_max_length)
assert tok1.model_max_length == tok2.model_max_length
  1. Empty your ~/.cache/huggingface/hub cache folder.
  2. Run the script with transformers 4.57.6.
  3. Observe the expected behavior: loading fails with an exception because the files are not on disk.
  4. Empty your ~/.cache/huggingface/hub cache folder again.
  5. Run the script with transformers 5.5.4.
  6. Observe that a stub tokenizer is loaded without any exception being raised.
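Until this is fixed, callers can guard against the silent stub themselves. The sketch below is a hypothetical workaround, not part of the transformers API: the sentinel value int(1e30) is an assumption inferred from the 1000000000000000019884624838656 printed for the stub tokenizer, and the helper name is made up.

```python
# Hypothetical guard against the silent-stub behavior: transformers appears
# to fall back to int(1e30) as model_max_length when no real limit is
# configured. The sentinel value is an assumption inferred from the output
# shown in this issue.
SENTINEL_MAX_LENGTH = int(1e30)

def check_max_length(tokenizer):
    """Raise if the tokenizer carries the sentinel (i.e. unusable) limit."""
    if tokenizer.model_max_length >= SENTINEL_MAX_LENGTH:
        raise ValueError(
            f"model_max_length is the sentinel {tokenizer.model_max_length}; "
            "the tokenizer config was probably not loaded correctly"
        )
    return tokenizer
```

With transformers 4.57.6, a correctly loaded CLIP tokenizer reports model_max_length == 77 and passes the guard; the 5.5.4 stub trips it.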

Expected behavior

The script behaves differently depending on the transformers version:

  • + transformers==4.57.6:
Traceback (most recent call last):
  File "/home/oleg.konin/mlapi-glm/test.py", line 3, in <module>
    tok1 = CLIPTokenizer.from_pretrained("black-forest-labs/FLUX.1-Canny-dev", subfolder="tokenizer", local_files_only=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oleg.konin/mlapi-glm/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2113, in from_pretrained
    return cls._from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/oleg.konin/mlapi-glm/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 2359, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/oleg.konin/mlapi-glm/.venv/lib/python3.12/site-packages/transformers/models/clip/tokenization_clip.py", line 306, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
  • + transformers==5.5.4:
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████| 705/705 [00:00<00:00, 4.32MB/s]
vocab.json: 1.06MB [00:00, 20.6MB/s]
merges.txt: 525kB [00:00, 7.86MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████| 588/588 [00:00<00:00, 6.32MB/s]
1000000000000000019884624838656 77
Traceback (most recent call last):
  File "/home/oleg.konin/mlapi-glm/test.py", line 7, in <module>
    assert tok1.model_max_length == tok2.model_max_length
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

In the second case, an exception should be thrown; silently returning a stub makes it look as though the model is already present on disk.
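A side note on the number itself: the printed value is not exactly 10**30 but int(1e30), i.e. the nearest IEEE 754 double to 10**30 converted to an integer, which explains the odd trailing digits:

```python
# The stub's model_max_length equals int(1e30). The float literal 1e30 is
# the double closest to 10**30, not 10**30 itself, hence the trailing
# ...19884624838656 digits in the printed value.
sentinel = int(1e30)
print(sentinel)            # 1000000000000000019884624838656
print(sentinel == 10**30)  # False: doubles cannot represent 10**30 exactly
```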
