"sink_cache" fails on Transformers 5.12 due to outdated "Cache.__init__" API

#1
by lavrenko - opened

Summary

transformers-community/sink_cache currently fails before generation starts on transformers==5.12.0.

The failure happens when the custom SinkCache class is instantiated. The implementation subclasses transformers.Cache and calls super().__init__() without arguments, but in current Transformers 5.x this raises:

ValueError: You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache.

This appears to be a compatibility issue with the newer Transformers cache API. The model card says this implementation matches the SinkCache class present in transformers<4.53.0, while the current custom-generate path is being used with Transformers 5.12.
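
As a minimal sketch, the version boundary from the model card could be checked up front before calling the custom generate method. The helper names `version_tuple` and `matches_model_card` are hypothetical and not part of the repository or of Transformers; the 4.53.0 threshold is taken from the model card statement above.

```python
def version_tuple(version: str) -> tuple:
    """Parse the first three numeric components, e.g. '5.12.0' -> (5, 12, 0)."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    # Pad short versions such as '4.53' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def matches_model_card(version: str) -> bool:
    # Assumption: the implementation targets transformers<4.53.0, per the model card.
    return version_tuple(version) < (4, 53, 0)


print(matches_model_card("4.52.4"))
print(matches_model_card("5.12.0"))
```

In practice one would pass `transformers.__version__` to the helper and fail early with a clear message instead of hitting the ValueError.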

Environment

transformers: 5.12.0
torch: 2.11.0+cu128
cuda: True
model: Qwen/Qwen3-0.6B
custom_generate: transformers-community/sink_cache

Minimal reproduction

import torch
import transformers

from transformers import AutoTokenizer, AutoModelForCausalLM

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("cuda:", torch.cuda.is_available())

model_id = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer(
    "Tell me a concise story about a cat who discovers a hidden garden.",
    return_tensors="pt",
).to(model.device)

out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=48,
    return_dict_in_generate=True,
    custom_generate="transformers-community/sink_cache",
    trust_remote_code=True,
    window_length=256,
)

Actual behavior

Generation fails before any SinkCache decoding begins:

ValueError: You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache.

Traceback points to:

custom_generate/generate.py in generate(...)
    past_key_values = SinkCache(window_length=window_length, num_sink_tokens=num_sink_tokens)

custom_generate/generate.py in SinkCache.__init__(...)
    super().__init__()

transformers/cache_utils.py in Cache.__init__(...)
    raise ValueError(
        "You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache."
    )

Expected behavior

The custom generation method should instantiate SinkCache successfully and run generation. For max_new_tokens < window_length, the generated sequence should match the default DynamicCache generation, as described in the model card’s sanity check.

Additional related issue

There is another Transformers 5.x compatibility problem in the unsupported-argument sanity check. The code reads fields such as return_legacy_cache with getattr(...) and no default value. When a GenerationConfig object lacks the attribute, this raises:

AttributeError: 'GenerationConfig' object has no attribute 'return_legacy_cache'

So the unsupported-argument check should probably use safe defaults, for example:

generation_value = getattr(generation_config, arg, None)
default_model_value = getattr(default_model_generation_config, arg, None)
default_global_value = getattr(default_global_generation_config, arg, None)
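
A minimal sketch of the difference, using a hypothetical stand-in object rather than the real GenerationConfig (the names `FakeConfig` and `read_arg` are illustrative only and do not exist in the repository):

```python
class FakeConfig:
    """Hypothetical stand-in for a config object that lacks some attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


def read_arg(config, arg):
    # getattr with a default returns None instead of raising AttributeError.
    return getattr(config, arg, None)


config = FakeConfig(max_new_tokens=48)

print(read_arg(config, "max_new_tokens"))
print(read_arg(config, "return_legacy_cache"))

try:
    getattr(config, "return_legacy_cache")  # no default: raises
except AttributeError as err:
    print("AttributeError:", err)
```

With the default in place, a missing attribute compares as None on all three config objects, so the check can treat it as "not set" instead of crashing.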

Possible fix direction

One possible compatibility patch is to initialize the base Cache with an empty layer list on newer Transformers, while keeping backward compatibility with older Transformers:

class SinkCache(Cache):
    def __init__(self, window_length: int, num_sink_tokens: int) -> None:
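        # Newer Transformers require exactly one of `layers` or
        # `layer_class_to_replicate`. Older versions take no arguments and
        # reject the keyword with TypeError, so fall back to the bare call.
        try:
            super().__init__(layers=[])
        except TypeError:
            super().__init__()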

        self.key_cache = []
        self.value_cache = []
        self.window_length = window_length
        self.num_sink_tokens = num_sink_tokens
        self.cos_sin_rerotation_cache = {}
        self._cos_cache = None
        self._sin_cache = None
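
The try/except fallback can be exercised without Transformers installed. The sketch below uses two hypothetical stand-in base classes, `NewStyleBase` and `OldStyleBase`, that only imitate the two constructor signatures described in this report; they are assumptions for illustration, not the real `Cache` implementations.

```python
class NewStyleBase:
    """Stand-in: requires exactly one of `layers` or `layer_class_to_replicate`."""

    def __init__(self, layers=None, layer_class_to_replicate=None):
        if (layers is None) == (layer_class_to_replicate is None):
            raise ValueError(
                "You should provide exactly one of `layers` or `layer_class_to_replicate`."
            )
        self.layers = layers


class OldStyleBase:
    """Stand-in: accepts no arguments at all."""

    def __init__(self):
        pass


def make_cache_class(base):
    """Build a subclass of `base` that uses the compatibility fallback."""

    class CompatCache(base):
        def __init__(self, window_length, num_sink_tokens):
            try:
                super().__init__(layers=[])  # new-style signature
            except TypeError:
                super().__init__()  # old-style signature
            self.window_length = window_length
            self.num_sink_tokens = num_sink_tokens

    return CompatCache


new_cache = make_cache_class(NewStyleBase)(window_length=256, num_sink_tokens=4)
old_cache = make_cache_class(OldStyleBase)(window_length=256, num_sink_tokens=4)
print(new_cache.layers, hasattr(old_cache, "layers"))
```

This only shows that construction succeeds under both signatures. It says nothing about whether the rest of the cache logic works with the newer layer-based API.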

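It may also be worth exposing the window length under both get_max_length() and get_max_cache_shape(), in case current cache APIs expect one or the other:

# Methods to add inside SinkCache
def get_max_length(self):
    return self.window_length

def get_max_cache_shape(self):
    return self.get_max_length()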

I have not confirmed this full patch yet; it is only a suggested direction. The reproducible issue is the current hard crash when instantiating SinkCache under Transformers 5.12.

Transformers Community org

It is fine imo. SinkCache has no use case with modern-era LLMs that use sliding layers, so we can just keep it as is. Users will have to pin an old transformers version if they want to use it.

RaushanTurganbay changed discussion status to closed
