"sink_cache" fails on Transformers 5.12 due to outdated "Cache.__init__" API
Summary
transformers-community/sink_cache currently fails before generation starts on transformers==5.12.0.
The failure happens when the custom SinkCache class is instantiated. The implementation subclasses transformers.Cache and calls super().__init__() without arguments, but in current Transformers 5.x this raises:
ValueError: You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache.
This appears to be a compatibility issue with the newer Transformers cache API. The model card says this implementation matches the SinkCache class present in transformers<4.53.0, while the current custom-generate path is being used with Transformers 5.12.
Environment
transformers: 5.12.0
torch: 2.11.0+cu128
cuda: True
model: Qwen/Qwen3-0.6B
custom_generate: transformers-community/sink_cache
Minimal reproduction
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("cuda:", torch.cuda.is_available())

model_id = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer(
    "Tell me a concise story about a cat who discovers a hidden garden.",
    return_tensors="pt",
).to(model.device)

out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=48,
    return_dict_in_generate=True,
    custom_generate="transformers-community/sink_cache",
    trust_remote_code=True,
    window_length=256,
)
Actual behavior
Generation fails before any SinkCache decoding begins:
ValueError: You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache.
Traceback points to:
custom_generate/generate.py in generate(...)
    past_key_values = SinkCache(window_length=window_length, num_sink_tokens=num_sink_tokens)
custom_generate/generate.py in SinkCache.__init__(...)
    super().__init__()
transformers/cache_utils.py in Cache.__init__(...)
    raise ValueError(
        "You should provide exactly one of `layers` or `layer_class_to_replicate` to initialize a Cache."
    )
Expected behavior
The custom generation method should instantiate SinkCache successfully and run generation. For max_new_tokens < window_length, the generated sequence should match the default DynamicCache generation, as described in the model card’s sanity check.
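The sanity check described above could be sketched roughly as follows. The helper name `matches_default_cache` is hypothetical and not part of the repository; it only reuses the `generate` arguments from the reproduction script and assumes the returned object exposes `.sequences` (which is the case when `return_dict_in_generate=True`).

```python
def matches_default_cache(model, inputs, max_new_tokens=48, window_length=256):
    """Return True if sink-cache generation equals default-cache generation.

    Hypothetical helper: `model` and `inputs` are expected to be the objects
    built in the reproduction script (a causal LM and its tokenized prompt).
    """
    # Arguments shared by both runs; greedy decoding keeps them deterministic.
    common = dict(
        do_sample=False,
        max_new_tokens=max_new_tokens,
        return_dict_in_generate=True,
    )

    # Baseline run with the default cache.
    baseline = model.generate(**inputs, **common)

    # Same prompt through the custom generation method from the Hub.
    sink = model.generate(
        **inputs,
        **common,
        custom_generate="transformers-community/sink_cache",
        trust_remote_code=True,
        window_length=window_length,
    )

    # Compare token ids as plain Python lists.
    return baseline.sequences.tolist() == sink.sequences.tolist()
```

With the objects from the reproduction script, `matches_default_cache(model, inputs)` would be expected to return `True` once the instantiation crash is fixed; on Transformers 5.12 it currently raises the `ValueError` shown above instead.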
Additional related issue
There is another Transformers 5.x compatibility problem in the unsupported-argument sanity check. The code checks fields such as return_legacy_cache via direct getattr(...) on GenerationConfig. With some GenerationConfig objects, this can raise:
AttributeError: 'GenerationConfig' object has no attribute 'return_legacy_cache'
So the unsupported-argument check should probably use safe defaults, for example:
generation_value = getattr(generation_config, arg, None)
default_model_value = getattr(default_model_generation_config, arg, None)
default_global_value = getattr(default_global_generation_config, arg, None)
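To illustrate the difference, here is a minimal sketch using a stand-in config class rather than the real `GenerationConfig`. `FakeGenerationConfig` and `find_overridden_args` are hypothetical names, and the comparison logic is an assumption about what the check is meant to do, not the repository's actual code.

```python
class FakeGenerationConfig:
    """Hypothetical stand-in; NOT the real transformers.GenerationConfig."""

    def __init__(self, **kwargs):
        # Only the attributes passed in exist on the instance.
        for name, value in kwargs.items():
            setattr(self, name, value)


def find_overridden_args(generation_config, default_config, unsupported_args):
    """List unsupported args that are set to a non-default value.

    Assumption: an argument counts as "used" when it is present, not None,
    and differs from the default config's value.
    """
    overridden = []
    for arg in unsupported_args:
        # getattr with a default never raises AttributeError.
        value = getattr(generation_config, arg, None)
        default = getattr(default_config, arg, None)
        if value is not None and value != default:
            overridden.append(arg)
    return overridden
```

With this shape, a config that lacks `return_legacy_cache` entirely is treated the same as one where it is `None`, so the check no longer crashes on configs that dropped the field.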
Possible fix direction
One possible compatibility patch is to initialize the base Cache with an empty layer list on newer Transformers, while keeping backward compatibility with older Transformers:
class SinkCache(Cache):
    def __init__(self, window_length: int, num_sink_tokens: int) -> None:
        try:
            super().__init__(layers=[])
        except TypeError:
            super().__init__()
        self.key_cache = []
        self.value_cache = []
        self.window_length = window_length
        self.num_sink_tokens = num_sink_tokens
        self.cos_sin_rerotation_cache = {}
        self._cos_cache = None
        self._sin_cache = None
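The try/except fallback can be exercised without Transformers installed by using two stand-in base classes. `LegacyCache`, `NewCache`, and `make_sink_cache_class` are hypothetical names for this sketch; `NewCache` only mimics the error message quoted above and is not the real `transformers.Cache`.

```python
class LegacyCache:
    """Hypothetical stand-in for an older base whose __init__ takes no arguments."""

    def __init__(self) -> None:
        pass


class NewCache:
    """Hypothetical stand-in for a newer base that requires exactly one argument."""

    def __init__(self, layers=None, layer_class_to_replicate=None) -> None:
        # Reject both "neither given" and "both given".
        if (layers is None) == (layer_class_to_replicate is None):
            raise ValueError(
                "You should provide exactly one of `layers` or "
                "`layer_class_to_replicate` to initialize a Cache."
            )
        self.layers = layers if layers is not None else []


def make_sink_cache_class(base):
    """Build a SinkCache-like subclass on top of the given base class."""

    class SinkCache(base):
        def __init__(self, window_length: int, num_sink_tokens: int) -> None:
            try:
                # Newer-style base: pass an empty layer list.
                super().__init__(layers=[])
            except TypeError:
                # Older-style base: `layers` is an unexpected keyword.
                super().__init__()
            self.window_length = window_length
            self.num_sink_tokens = num_sink_tokens

    return SinkCache
```

Both `make_sink_cache_class(LegacyCache)(256, 4)` and `make_sink_cache_class(NewCache)(256, 4)` construct without error in this sketch (the value 4 is only illustrative). Whether the rest of the real cache API then works with an empty `layers` list is a separate question that this sketch does not answer.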
It may also be worth adding a get_max_length() alias if current cache APIs expect it:
def get_max_length(self):
    return self.window_length

def get_max_cache_shape(self):
    return self.get_max_length()
I have not confirmed this full patch yet; it is only a suggested direction. The reproducible issue is the current hard crash when instantiating SinkCache under Transformers 5.12.
It is fine imo. SinkCache has no use case with modern-era LLMs that use sliding layers, so we can just keep it as is. Users will have to pin an old transformers version if they want to use it.