{"id": "issue_46144", "type": "issue", "number": 46144, "title": "RoFormer attention implementation does not use attention interface", "state": "open", "author": "ir2718", "labels": [], "created_at": "2026-05-21T13:26:50Z", "updated_at": "2026-05-21T20:47:37Z", "url": "https://github.com/huggingface/transformers/issues/46144", "text": "ISSUE #46144: RoFormer attention implementation does not use attention interface\nState: open | Labels: \nAuthor: ir2718 | Created: 2026-05-21T13:26:50Z\n\nHi,\n\nI want to use RoFormer with a custom attention implementation. However, the current code relies on an eager implementation without using the attention interface: https://github.com/huggingface/transformers/blob/5206626c48710e69fef3eadfba077cada99f37bb/src/transformers/models/roformer/modeling_roformer.py#L196-L211 \n\nThe fix is simple and I would like to create a PR for it.\n\n--- Comment by HamzaDogann at 2026-05-21T15:32:59Z ---\nHi! I'd love to work on this if you haven't started the PR yet. Let me know and I'll get started!\n\n--- Comment by HamzaDogann at 2026-05-21T20:47:37Z ---\nHello, I have opened a Pull Request to address this issue. The implementation has been fully validated against the existing test suite, successfully passing all 82 tests. RoFormer is now fully compatible with the ALL_ATTENTION_FUNCTIONS interface. Please let me know if any further adjustments are needed."}
{"id": "issue_46143", "type": "issue", "number": 46143, "title": "`kwargs` not passed through methods of RoFormer models", "state": "open", "author": "ir2718", "labels": [], "created_at": "2026-05-21T13:18:46Z", "updated_at": "2026-05-21T13:18:46Z", "url": "https://github.com/huggingface/transformers/issues/46143", "text": "ISSUE #46143: `kwargs` not passed through methods of RoFormer models\nState: open | Labels: \nAuthor: ir2718 | Created: 2026-05-21T13:18:46Z\n\nHi,\n\nwhen working with RoFormer models, I've noticed the `kwargs` option is not handled correctly. Most classes take in `kwargs` but do not pass them further into the model. For example: https://github.com/huggingface/transformers/blob/5206626c48710e69fef3eadfba077cada99f37bb/src/transformers/models/roformer/modeling_roformer.py#L737-L747 This is very annoying since I want to implement a custom attention and send needed inputs through `**kwargs`.\n\nThe fix is trivial and I would be happy to make a PR for this if the members agree."}
	{"id": "issue_46139", "type": "issue", "number": 46139, "title": "Discussion: optional RankSEG-style decoding for Transformers semantic segmentation post-processing", "state": "open", "author": "Leev1s", "labels": [], "created_at": "2026-05-21T10:36:05Z", "updated_at": "2026-05-21T13:12:39Z", "url": "https://github.com/huggingface/transformers/issues/46139", "text": "ISSUE #46139: Discussion: optional RankSEG-style decoding for Transformers semantic segmentation post-processing\nState: open \| Labels: \nAuthor: Leev1s \| Created: 2026-05-21T10:36:05Z\n\n[![RankSEG](https://img.shields.io/badge/RankSEG-GitHub-blue?logo=github)](https://github.com/rankseg/rankseg) [![PyPI](https://badge.fury.io/py/rankseg.svg)](https://pypi.org/project/rankseg/) [![Docs](https://readthedocs.org/projects/rankseg/badge/?version=latest)](https://rankseg.readthedocs.io/en/latest/) [![Transformers docs](https://img.shields.io/badge/docs-Transformers%20integration-brightgreen)](https://rankseg.readthedocs.io/en/latest/integrations_transformers.html) [![Notebook](https://img.shields.io/badge/notebook-Transformers-orange)](https://github.com/rankseg/rankseg/blob/main/notebooks/rankseg_with_transformers.ipynb) [![JMLR 2023](https://img.shields.io/badge/JMLR-2023-black)](https://www.jmlr.org/papers/v24/22-0712.html) [![NeurIPS 2025](https://img.shields.io/badge/NeurIPS-2025-black)](https://openreview.net/forum?id=4tRMm1JJhw)\n\nHi Transformers maintainers,\n\nI wanted to share a small downstream experiment around RankSEG-style decoding for semantic segmentation. 
The short version is: if a Transformers processor can expose resized semantic class probabilities before the final `argmax`, then users can try metric-aware post-processing methods such as RankSEG without changing the model, checkpoint, or preprocessing pipeline.\n\nThis is related to https://github.com/huggingface/transformers/issues/37715, where the discussion is about making the final `argmax` optional and allowing users to access resized class probability maps. I do not want to assume that RankSEG itself belongs in Transformers, but I think it is a useful concrete example of why probability-level semantic segmentation outputs can matter.\n\n## What I Tried\n\nRankSEG is a training-free segmentation decoding method. It takes per-class probability maps and returns a hard segmentation mask optimized for an overlap-style metric such as Dice or IoU. The relevant papers are [RankSEG, JMLR 2023](https://www.jmlr.org/papers/v24/22-0712.html) and [RankSEG-RMA, NeurIPS 2025](https://openreview.net/forum?id=4tRMm1JJhw). There is also a [RankSEG repository](https://github.com/rankseg/rankseg), a [PyPI package](https://pypi.org/project/rankseg/), and a [Transformers integration tutorial](https://rankseg.readthedocs.io/en/latest/integrations_transformers.html).\n\nThe experiment used the usual Transformers inference path first:\n\n```python\ninputs = processor(images=image, return_tensors=\"pt\")\noutputs = model(**inputs)\n```\n\nThen I compared three post-processing choices using the same `outputs`:\n\n```python\n# 1. Baseline: standard SegFormer / Transformers argmax-style decoding\nupsampled_logits = torch.nn.functional.interpolate(\n outputs.logits,\n size=target_size,\n mode=\"bilinear\",\n align_corners=False,\n)\nbaseline = upsampled_logits.argmax(dim=1)[0]\n\n# 2. 
RankSEG optimized for Dice\nrankseg_dice = rankseg_transformers.postprocess(\n outputs,\n model=model,\n target_sizes=target_sizes,\n rankseg_kwargs={\"metric\": \"dice\", \"solver\": \"RMA\", \"output_mode\": \"multiclass\"},\n)\n\n# 3. RankSEG optimized for IoU\nrankseg_iou = rankseg_transformers.postprocess(\n outputs,\n model=model,\n target_sizes=target_sizes,\n rankseg_kwargs={\"metric\": \"iou\", \"solver\": \"RMA\", \"output_mode\": \"multiclass\"},\n)\n```\n\nThe helper above is already implemented outside Transformers in RankSEG's current compatibility layer: [documentation](https://rankseg.readthedocs.io/en/latest/integrations_transformers.html), [source code](https://github.com/rankseg/rankseg/blob/main/rankseg/integration/transformers.py), [example script](https://github.com/rankseg/rankseg/blob/main/examples/transformers_rankseg.py), and [notebook](https://github.com/rankseg/rankseg/blob/main/notebooks/rankseg_with_transformers.ipynb). The same notebook can be opened in [Colab](https://colab.research.google.com/github/rankseg/rankseg/blob/main/notebooks/rankseg_with_transformers.ipynb).\n\n## Small Cityscapes Check\n\nI used `tanganke/cityscapes` only as a lightweight local check because it has a convenient `segmentation_19` ground-truth column. This is not an official Cityscapes benchmark. 
It is a small smoke test over the first 100 validation images, using samplewise macro Dice and IoU over non-empty classes.\n\n\| Model \| Method \| Mean Dice \| Dice delta \| Mean IoU \| IoU delta \|\n\| --- \| --- \| ---: \| ---: \| ---: \| ---: \|\n\| `nvidia/segformer-b0-finetuned-cityscapes-512-1024` \| Transformers argmax \| 0.4608 \| - \| 0.3898 \| - \|\n\| `nvidia/segformer-b0-finetuned-cityscapes-512-1024` \| RankSEG, `metric=\"dice\"` \| 0.4810 \| +0.0202 \| 0.4045 \| +0.0147 \|\n\| `nvidia/segformer-b0-finetuned-cityscapes-512-1024` \| RankSEG, `metric=\"iou\"` \| 0.4813 \| +0.0205 \| 0.4051 \| +0.0153 \|\n\| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` \| Transformers argmax \| 0.4743 \| - \| 0.4015 \| - \|\n\| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` \| RankSEG, `metric=\"dice\"` \| 0.4903 \| +0.0160 \| 0.4128 \| +0.0113 \|\n\| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` \| RankSEG, `metric=\"iou\"` \| 0.4907 \| +0.0164 \| 0.4134 \| +0.0118 \|\n\nThe result is modest, but it is consistent with the intended use case: the model is unchanged, and only the final decoding step changes.\n\n## Visual Examples\n\nEach image below uses the same layout: baseline `argmax` on the top left, RankSEG optimized for Dice on the top right, ground-truth overlay on the bottom left, and RankSEG optimized for IoU on the bottom right.\n\n<table>\n <tr>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/7wXl/rank_01_sample_0053_ddice_0092_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/7wXl/rank_01_sample_0053_ddice_0092_d.png\" alt=\"SegFormer-B0 Cityscapes example 1\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B0, example 1</sub>\n </td>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/aSr0/rank_02_sample_0029_ddice_0067_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/aSr0/rank_02_sample_0029_ddice_0067_d.png\" 
alt=\"SegFormer-B0 Cityscapes example 2\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B0, example 2</sub>\n </td>\n </tr>\n <tr>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/s6kI/rank_01_sample_0072_ddice_0084_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/s6kI/rank_01_sample_0072_ddice_0084_d.png\" alt=\"SegFormer-B1 Cityscapes example 1\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B1, example 1</sub>\n </td>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/Iyn9/rank_02_sample_0003_ddice_0081_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/Iyn9/rank_02_sample_0003_ddice_0081_d.png\" alt=\"SegFormer-B1 Cityscapes example 2\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B1, example 2</sub>\n </td>\n </tr>\n</table>\n\n## Why This Relates to Transformers Post-Processing\n\nFor simple semantic segmentation heads, restoring probabilities may look like resizing logits and applying softmax. For other model families, the post-processing path can involve class-query logits, mask logits, null classes, model-specific resizing conventions, or processor-owned logic. That is why a probability-returning option inside the existing Transformers post-processing API would be useful: the model-family-specific restoration would stay in the official processor path, while downstream methods could consume the restored probabilities.\n\nHard segmentation maps could remain the default behavior. The probability path would simply make the intermediate semantic distribution available for downstream decoding, calibration, uncertainty estimation, or metric-aware post-processing such as RankSEG.\n\n## Closing\n\nI understand that adding or changing post-processing APIs has maintenance costs, especially in a library used across many model families. I am not asking maintainers to adopt RankSEG directly. 
I mainly wanted to share a concrete downstream use case showing why resized semantic probability maps could be useful to users who want to experiment beyond `argmax`.\n\nI would also like to thank @statmlben and @ZixunWang, the RankSEG maintainers and authors of the recent RankSEG-RMA work, for developing and maintaining the RankSEG project that made this small Transformers experiment possible.\n\nIf maintainers think this direction is worth exploring, I would be happy to adapt the experiment to a preferred model family, test against a proposed API, or help write documentation/examples in the style that fits Transformers.\n\n\n--- Comment by Rocketknight1 at 2026-05-21T12:51:52Z ---\nYeah, the lack of exposed probability maps is surprising! I'll try to push this internally\n\n--- Comment by Leev1s at 2026-05-21T13:12:38Z ---\nThanks a lot @Rocketknight1, I really appreciate it!"}
{"id": "issue_46133", "type": "issue", "number": 46133, "title": "Add TIPSv2 (Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment) by Google DeepMind", "state": "open", "author": "farrosalferro", "labels": ["New model"], "created_at": "2026-05-21T00:17:16Z", "updated_at": "2026-05-21T13:30:53Z", "url": "https://github.com/huggingface/transformers/issues/46133", "text": "ISSUE #46133: Add TIPSv2 (Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment) by Google DeepMind\nState: open | Labels: New model\nAuthor: farrosalferro | Created: 2026-05-21T00:17:16Z\n\n### Model description\n\nTIPSv2 is a vision-language encoder that addresses the problem of dense alignment between image regions and text through modification in the pretraining recipe (iBOT++, multi-granularity synthetic captions, and memory-saving head-only EMA scheme). Across 9 tasks and 20 datasets the resulting models set new state-of-the-art results on zero-shot semantic segmentation while generally matching or beating recent encoders like SigLIP2, DINOv3, and Perception Encoder on global and dense tasks.\n\nThe model family comes with four sizes (B, L, SO400m, and G) with a variant that comes with DPT head. All of the models are already available in [HuggingFace](https://huggingface.co/collections/google/tipsv2). The code is licensed under Apache 2.0 and the weights have CC-BY 4.0.\n\nI would be glad if I can contribute implementing this model to HuggingFace's Transformer, including model implementation, weight conversion, tests and docs. Happy to coordinate with maintainers about the implementation and integration.\n\nThank you.\n\n@NielsRogge @molbap @Rocketknight1 \n\n### Open source status\n\n- [x] The model implementation is available\n- [x] The model weights are available\n\n### Provide useful links for the implementation\n\n* [Official GitHub Repository](https://github.com/google-deepmind/tips)\n* [Weights](https://huggingface.co/collections/google/tipsv2)\n\n--- Comment by Rocketknight1 at 2026-05-21T13:30:53Z ---\ncc @zucchini-nlp as well for VLMs!"}
	{"id": "issue_46132", "type": "issue", "number": 46132, "title": "AttentionInterface.register changes behavior of registered function", "state": "closed", "author": "pjc15111", "labels": ["bug"], "created_at": "2026-05-20T23:24:31Z", "updated_at": "2026-05-21T14:44:37Z", "url": "https://github.com/huggingface/transformers/issues/46132", "text": "ISSUE #46132: AttentionInterface.register changes behavior of registered function\nState: closed \| Labels: bug\nAuthor: pjc15111 \| Created: 2026-05-20T23:24:31Z\n\n### System Info\n\n`transformers env` fails with `NameError: name 'CompletionCreateParamsStreaming' is not defined`\n\nI am running Ubuntu 25.10, Python 3.13.7, and pytorch 2.11.0\n\n### Who can help?\n\n@Cyrilvallez \n\n### Information\n\n- [ ] The official example scripts\n- [x] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [ ] My own task or dataset (give details below)\n\n### Reproduction\n\nIn the following code, model1 (using `attn_implementation=\"sdpa\")` produces plausible prose, while model2 (using `attn_implementation=\"reregistered_sdpa\"`) produces text somewhere between nonsense and gibberish.\n\nIt is necessary to use the `cache_implementation=\"static\"`, but the behavior seems consistent with various models.\n\n```\nfrom transformers import AutoModelForCausalLM, AttentionInterface, pipeline\nfrom transformers.integrations.sdpa_attention import sdpa_attention_forward\n\nmodel1 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", attn_implementation=\"sdpa\")\npipeline1 = pipeline(task=\"text-generation\", model=model1, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline1(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward)\nmodel2 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", 
attn_implementation=\"reregistered_sdpa\")\npipeline2 = pipeline(task=\"text-generation\", model=model2, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline2(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n```\n\n### Expected behavior\n\n`sdpa_attention_forward` should behave the same whether it is called through the pre-registered name of \"sdpa\" or is re-registered with a new name.\n\n--- Comment by Abineshabee at 2026-05-21T08:35:34Z ---\nHi! I investigated this in relation to #40362.\n\nI ran a 3-way comparison using `sshleifer/tiny-gpt2`:\n1. Normal `attn_implementation=\"sdpa\"`\n2. Re-registered sdpa without `AttentionMaskInterface` registration\n3. Re-registered sdpa with `AttentionMaskInterface` registration\n\nAll three produced different outputs, meaning simply adding the mask registration (the fix from #40362) does not fully reproduce the built-in `\"sdpa\"` behavior — at least on this tiny model.\n\nHowever, `tiny-gpt2` may be too small/random to draw firm conclusions. Could the original author confirm whether adding `AttentionMaskInterface.register(...)` alongside `AttentionInterface.register(...)` fixes the issue with Llama + `cache_implementation=\"static\"`?\n\nIf the mask registration alone does not fix it, the `cache_implementation=\"static\"` interaction may be a separate or deeper bug worth investigating independently from #40362.\n\n--- Comment by pjc15111 at 2026-05-21T13:58:33Z ---\n@Abineshabee You are absolutely right that I needed to register an AttentionMaskInterface, which is clearly documented.\n\nHowever, this does not fix the problem. 
I changed the example code to:\n\n```\nfrom transformers import AutoModelForCausalLM, AttentionInterface, AttentionMaskInterface, pipeline\nfrom transformers.integrations.sdpa_attention import sdpa_attention_forward\nfrom transformers.masking_utils import sdpa_mask\n\nmodel1 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", attn_implementation=\"sdpa\")\npipeline1 = pipeline(task=\"text-generation\", model=model1, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline1(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n\nmy_new_sdpa = sdpa_attention_forward\nAttentionMaskInterface.register(\"reregistred_sdpa\", sdpa_mask)\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward)\nmodel2 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", attn_implementation=\"reregistered_sdpa\")\npipeline2 = pipeline(task=\"text-generation\", model=model2, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline2(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n```\n\nI still get sensible output from the first model and garbage from the second.\n\n```\nLoading weights: 100%\|█████████████████████████████████████████████████████████████████████████████████\| 146/146 [00:00<00:00, 10794.06it/s]\n[transformers] Passing `generation_config` together with generation-related arguments=({'cache_implementation'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.\n[transformers] Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n[transformers] Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. 
(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.\n[{'generated_text': 'It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled in his breast pocket, pressed the button which sent the electric clock back the four minutes to its appointed quarters. Oh, it was a fine day for murder.\\nHe was a member of the Party, a Member of the Party, one of the inner circle, a trusted servant of Big Brother, a Party member. And he was a man with his own desires and his own needs and his own private ambitions. He wanted to be rich, to be important, to be in charge. But he could never have the life he wanted. He could never have it all. And he knew it. He had learned this long ago.\\nHe had learned it from the way his mother had looked at him when she had said to him, \"My boy, you are nothing. You are nobody. You are a dog. You are a worm. You are a nobody. You are a nobody.\"\\nAnd he had learned it from the way his Uncle Joe had looked at him when he had said to him, \"My boy, you are nothing. You are nobody. You are a dog. You are a worm. You are a nobody. You are a nobody.\"\\nAnd he had learned it from the way his Aunt Jeanie had looked at him'}]\nLoading weights: 100%\|██████████████████████████████████████████████████████████████████████████████████\| 146/146 [00:00<00:00, 9365.01it/s]\n[transformers] Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n[transformers] Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. 
`max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n[{'generated_text': 'It was a bright cold day in April, and the clocks were striking thirteen. Winston Churchill had a large chocolate telescofficially, thuh, come the fella, thank you! npt in the round your manad, 2, 1- 1-1 1, 1, the in the sky. 19111; 369551; 7\\nW61; 5.\\n11. 9 5. 3712; 1n; 3624. 7n 111\\nYou 19.\\nItal1; 21; 5,1; 1. 9; 5; 21; 4-1; 21- 5; 4- 9- 5. 9; 6-1; 2- 5. 4; 4- 4- 4; 5- 4- 5; 4- 4- 3- 2- 3, 3- 2-\\nIt; 2- 1; 2-1; 2\\n5 4. 1 4. 1 4. 1\\nYou 5 3 1 5 1 5 1'}]\n```\n\n--- Comment by Abineshabee at 2026-05-21T14:15:09Z ---\nThanks for testing! I noticed there may be a small typo in the registration:\n\n```python\nAttentionMaskInterface.register(\"reregistred_sdpa\", sdpa_mask) # <- \"reregistred\" (missing 'e')\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward) # <- \"reregistered\" (correct)\n```\n\nThe mask is being registered under a different name than the attention function, so `attention_mask=None` may still be passed. Could you try with matching names:\n\n```python\nAttentionMaskInterface.register(\"reregistered_sdpa\", sdpa_mask) # names must match\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward)\n```\n\nIf the outputs are still wrong after fixing the typo, then this is definitely a deeper bug beyond #40362.\n\n--- Comment by pjc15111 at 2026-05-21T14:44:37Z ---\n@Abineshabee That seems to fix it."}
	{"id": "issue_46129", "type": "issue", "number": 46129, "title": "[deepseek_v4] conversion_mapping doesn't cover mtp.* paths — MTP keys silently random-init even after _keys_to_ignore is empty", "state": "open", "author": "pasta-paul", "labels": [], "created_at": "2026-05-20T21:39:47Z", "updated_at": "2026-05-21T11:32:35Z", "url": "https://github.com/huggingface/transformers/issues/46129", "text": "ISSUE #46129: [deepseek_v4] conversion_mapping doesn't cover mtp.* paths — MTP keys silently random-init even after _keys_to_ignore is empty\nState: open \| Labels: \nAuthor: pasta-paul \| Created: 2026-05-20T21:39:47Z\n\n## Summary\n\n`transformers.conversion_mapping.get_checkpoint_conversion_mapping(\"deepseek_v4\")` returns 41 `WeightRenaming` entries that rename upstream-internal naming to HF naming (`attn.` → `self_attn.`, `ffn.` → `mlp.`, `attn_norm.` → `input_layernorm.`, `attn.wq_a.` → `self_attn.q_a_proj.`, `attn.attn_sink` → `self_attn.sinks`, etc.).\n\nEntries 6–38 are anchored at `^layers\\.(\\d+)\\.` — they only fire on main-layer keys. None cover `mtp.\\d+.` paths.\n\nCombined with the existing `_keys_to_ignore_on_load_unexpected = [r\"(^\|\\.)mtp\\..\"]` regex on `DeepseekV4PreTrainedModel` (filed separately as huggingface/transformers#46127), `mtp.` keys never reach the model at all. Even after that regex is dropped* (as #46127 does), the MTP keys arrive in upstream form (`mtp.0.attn.wq_a.weight`) — but the MTP submodules expect HF naming (`mtp.0.self_attn.q_a_proj.weight`). The keys are then flagged \"unexpected\", the submodules remain \"uninitialized\", and `_initialize_weights` falls through to `_init_weights` → `init.normal_` random-initializes the MTP block.\n\n## Symptom\n\nThe model loads \"successfully\" (no errors, no warnings about missing keys after the regex is dropped), `model.mtp[0]` exists with the right structure, `from_pretrained` returns. 
But `model.mtp[0].self_attn.q_a_proj.weight` is random Gaussian, not the value in the safetensors file. Silent corruption of the MTP draft head. Any downstream calibration / quantization / inference using `model.mtp` produces garbage.\n\n## Repro\n\n```python\n# (assumes huggingface/transformers#46127 is applied — DeepseekV4NextNPredictor\n# exists, _keys_to_ignore_on_load_unexpected = [])\nfrom transformers import AutoModelForCausalLM\nmodel = AutoModelForCausalLM.from_pretrained(\"<DSv4-Flash BF16 with mtp.* keys>\")\n\n# Compare loaded vs source\nimport safetensors.torch as st\nfrom pathlib import Path\nloaded_w = model.model.mtp[0].self_attn.q_a_proj.weight\nfor shard in sorted(Path(\"<path>\").glob(\"model-.safetensors\")):\n with st.safe_open(shard, framework=\"pt\") as f:\n if \"mtp.0.attn.wq_a.weight\" in f.keys():\n source_w = f.get_tensor(\"mtp.0.attn.wq_a.weight\")\n break\n\ndiff = (loaded_w.cpu().float() - source_w.cpu().float()).abs().max().item()\nprint(f\"max_diff = {diff}\")\n# Without conversion mapping for mtp.: diff ≈ random Gaussian range (e.g. 0.1+)\n# With the mtp.* mapping extension: diff ≈ 0\n```\n\n## Proposed fix\n\nAdd 33 `mtp.\\d+.` equivalents mirroring the existing `^layers\\.(\\d+)\\.` entries to `_checkpoint_conversion_mapping` for the `deepseek_v4` architecture. The 6 model-level entries (`embed.`, `head.`, `norm.`, `hc_head_`) do NOT need to be mirrored — MTP doesn't have its own copy of those (it shares `embed_tokens` and `lm_head` with the main model).\n\nSpecifically, for each of these patterns, add a parallel entry anchored at `^mtp\\.(\\d+)\\.`:\n\n```\n^layers\\.(\\d+)\\.attn_norm\\. → layers.\\1.input_layernorm.\n^layers\\.(\\d+)\\.ffn_norm\\. 
→ layers.\\1.post_attention_layernorm.\n^layers\\.(\\d+)\\.hc_attn_fn$ → layers.\\1.attn_hc.fn\n^layers\\.(\\d+)\\.hc_attn_base$ → layers.\\1.attn_hc.base\n^layers\\.(\\d+)\\.hc_attn_scale$ → layers.\\1.attn_hc.scale\n^layers\\.(\\d+)\\.hc_ffn_fn$ → layers.\\1.ffn_hc.fn\n^layers\\.(\\d+)\\.hc_ffn_base$ → layers.\\1.ffn_hc.base\n^layers\\.(\\d+)\\.hc_ffn_scale$ → layers.\\1.ffn_hc.scale\n^layers\\.(\\d+)\\.attn\\. → layers.\\1.self_attn.\n^layers\\.(\\d+)\\.ffn\\. → layers.\\1.mlp.\n^layers\\.(\\d+)\\.self_attn\\.attn_sink$ → layers.\\1.self_attn.sinks\n^layers\\.(\\d+)\\.self_attn\\.(.?)\\.wq_a\\. → layers.\\1.self_attn.\\2.q_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.?)\\.wq_b\\. → layers.\\1.self_attn.\\2.q_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.?)\\.wkv\\. → layers.\\1.self_attn.\\2.kv_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.?)\\.wgate\\. → layers.\\1.self_attn.\\2.gate_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.?)\\.wo_a\\. → layers.\\1.self_attn.\\2.o_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.?)\\.wo_b\\. → layers.\\1.self_attn.\\2.o_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.wq_a\\. → layers.\\1.self_attn.q_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.wq_b\\. → layers.\\1.self_attn.q_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.wkv\\. → layers.\\1.self_attn.kv_proj.\n^layers\\.(\\d+)\\.self_attn\\.wo_a\\. → layers.\\1.self_attn.o_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.wo_b\\. → layers.\\1.self_attn.o_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.q_norm\\. → layers.\\1.self_attn.q_a_norm.\n^layers\\.(\\d+)\\.mlp\\.gate\\.bias$ → layers.\\1.mlp.gate.e_score_correction_bias\n^layers\\.(\\d+)\\.mlp\\.shared_experts\\.w1\\. → layers.\\1.mlp.shared_experts.gate_proj.\n^layers\\.(\\d+)\\.mlp\\.shared_experts\\.w2\\. → layers.\\1.mlp.shared_experts.down_proj.\n^layers\\.(\\d+)\\.mlp\\.shared_experts\\.w3\\. 
→ layers.\\1.mlp.shared_experts.up_proj.\n```\n\nThe entries at indexes 17–22 (compressor/indexer renames) only need to mirror if MTP can be configured with `compressed_sparse_attention` or `heavily_compressed_attention` layer_type. For DSv4-Flash, MTP uses `sliding_attention` (compressor = None — see #46127 discussion), so those 6 entries don't need to mirror, but mirroring them is harmless (the regex just won't match anything).\n\n## Runtime workaround for downstream users\n\nUntil upstream lands, here's the runtime mirror:\n\n```python\nfrom transformers.conversion_mapping import (\n get_checkpoint_conversion_mapping,\n register_checkpoint_conversion_mapping,\n)\nexisting = get_checkpoint_conversion_mapping(\"deepseek_v4\")\nadded = []\nfor entry in existing:\n sp = getattr(entry, \"source_patterns\", None)\n tp = getattr(entry, \"target_patterns\", None)\n if sp is None or tp is None:\n continue\n sp_list = sp if isinstance(sp, (list, tuple)) else [sp]\n tp_list = tp if isinstance(tp, (list, tuple)) else [tp]\n new_sp, new_tp = [], []\n for s, t in zip(sp_list, tp_list):\n if isinstance(s, str) and s.startswith(r\"^layers\\.(\\d+)\\.\"):\n new_sp.append(s.replace(r\"^layers\\.(\\d+)\\.\", r\"^mtp\\.(\\d+)\\.\", 1))\n new_tp.append(t.replace(\"layers.\\\\1.\", \"mtp.\\\\1.\", 1))\n if new_sp:\n added.append(type(entry)(\n source_patterns=new_sp if len(new_sp) > 1 else new_sp[0],\n target_patterns=new_tp if len(new_tp) > 1 else new_tp[0],\n ))\nregister_checkpoint_conversion_mapping(\n \"deepseek_v4\", list(existing) + added, overwrite=True)\n```\n\n## Detection — value-verification assertion\n\nA 50-line fixture that catches this regression class (and the related layer_type bug at #46127) by comparing a loaded MTP tensor to its source:\n\n```python\nimport safetensors.torch as st\nfrom pathlib import Path\n\nloaded_w = model.model.mtp[0].self_attn.q_a_proj.weight\nsource_w = None\nfor shard in sorted(Path(model_path).glob(\"model-.safetensors\")):\n with 
st.safe_open(shard, framework=\"pt\") as f:\n if \"mtp.0.attn.wq_a.weight\" in f.keys():\n source_w = f.get_tensor(\"mtp.0.attn.wq_a.weight\")\n break\nassert source_w is not None\ndiff = (loaded_w.cpu().float() - source_w.cpu().float()).abs().max().item()\nassert diff < 1e-4, f\"MTP weight mismatch: {diff} (silent random-init?)\"\n```\n\nThis belongs as a test under `tests/models/deepseek_v4/` paired with #46127.\n\n## Related\n\n- #46127 — adds `DeepseekV4NextNPredictor` class + `Model.mtp` ModuleList + `sliding_attention` layer_type for MTP. The class shim PR. This issue is the companion* — even with the class shim, the conversion mapping needs to be extended for MTP keys to actually load into the new submodules.\n- vllm-project/llm-compressor#2735 — calibration-side rollup of both issues.\n- vllm-project/llm-compressor#2739 — companion mapping extension PR (for the `ARCH_TO_2D_MAPPINGS` that lives on llm-compressor's side).\n\n\n--- Comment by Rocketknight1 at 2026-05-21T11:32:35Z ---\ncc @arthurzucker for DeepSeek V4"}
	{"id": "issue_46123", "type": "issue", "number": 46123, "title": "MaskGenerationPipeline: is_last never True on final partial batch, silently dropping results", "state": "open", "author": "J3r3myPerera", "labels": ["bug"], "created_at": "2026-05-20T18:25:00Z", "updated_at": "2026-05-21T08:03:17Z", "url": "https://github.com/huggingface/transformers/issues/46123", "text": "ISSUE #46123: MaskGenerationPipeline: is_last never True on final partial batch, silently dropping results\nState: open \| Labels: bug\nAuthor: J3r3myPerera \| Created: 2026-05-20T18:25:00Z\n\n### System Info\n\ntransformers version: current main\nAffected file: src/transformers/pipelines/mask_generation.py\n\n### Who can help?\n\n_No response_\n\n### Information\n\n- [ ] The official example scripts\n- [x] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [ ] My own task or dataset (give details below)\n\n### Description\n\nI was looking into the MaskGenerationPipeline and noticed that when you set a points_per_batch value that doesn't divide evenly into the total number of grid points, the pipeline quietly drops the results from the last batch — no error, no warning, just missing masks.\n\nThe root cause is this line in preprocess:\n`is_last = i == n_points - points_per_batch`\n\neg: n_points=100, points_per_batch=64. The loop runs at i=0 and i=64. At i=64, the check asks 64 == 100-64 which is 64 == 36 — always False. So the final batch never gets flagged as the last one.\n\nThe pipeline's PipelinePackIterator relies on this is_last flag to know when to stop accumulating results. 
When it never sees is_last=True, it calls next() on an already-finished generator, hits StopIteration, and exits — leaving the last batch's masks on the floor.\n\nWith SAM's default point grid, n_points is rarely a round multiple of the default points_per_batch=64, so this silently affects most real-world usage.\n\n### Reproduction\n\n```python\nfrom transformers import pipeline\nfrom PIL import Image\nimport requests\n\nimage = Image.open(requests.get(\"http://images.cocodataset.org/val2017/000000039769.jpg\", stream=True).raw)\n\ngenerator = pipeline(\"mask-generation\", model=\"facebook/sam-vit-base\")\n\n# points_per_batch=50 causes n_points % points_per_batch != 0 for typical grids\noutputs_partial = generator(image, points_per_batch=50)\noutputs_full = generator(image, points_per_batch=None) # all at once, no batching\n\n# outputs_partial[\"masks\"] will have fewer masks than outputs_full[\"masks\"]\nprint(len(outputs_partial[\"masks\"]), \"vs\", len(outputs_full[\"masks\"]))\n``` \n\n### Expected behavior\n\nAll generated masks should be returned regardless of whether n_points is a multiple of points_per_batch.\n\n--- Comment by ADiTyaRaj8969 at 2026-05-21T04:34:18Z ---\nHey @J3r3myPerera, reproduced this locally on current main. Root cause is exactly what you described — `is_last = i == n_points - points_per_batch` only fires when `n_points` is divisible by `points_per_batch`, so for most real grids `PipelinePackIterator` never sees `is_last=True` and drops the final accumulator on `StopIteration`.\n\nIf you're not already on it, I'd like to take this. Plan is:\n\n- one-line predicate change in `src/transformers/pipelines/mask_generation.py::preprocess`: `is_last = i + points_per_batch >= n_points`. 
This is true exactly once per loop (on the iteration whose slice reaches or runs past the end of `grid_points`) and works whether or not the division is exact.\n- fast offline regression test in `tests/pipelines/test_pipelines_mask_generation.py` that exercises `preprocess` directly with a mocked `image_processor` and `model`. Covers `(n_points, points_per_batch)` pairs `(100, 64)`, `(100, 50)`, `(1024, 50)`, `(7, 3)`, `(5, 5)`, `(4, 8)`. For each it asserts: number of yielded batches equals `ceil(n_points / points_per_batch)`, the final batch has `is_last=True`, and every earlier batch has `is_last=False`.\n\nConfirmed no existing open PR for this with `gh pr list --repo huggingface/transformers --state open --search \"46123 in:body\"`. Happy to hand back if you'd rather take it yourself, or wait for a maintainer to pick.\n\n--- Comment by J3r3myPerera at 2026-05-21T04:43:37Z ---\nHi @ADiTyaRaj8969, I am already on this and found the same one-line fix. I have the fix ready locally."}
	{"id": "issue_46121", "type": "issue", "number": 46121, "title": "`convert_rope_params_to_dict` raises `TypeError` when `ignore_keys_at_rope_validation` is a JSON-loaded list", "state": "open", "author": "Charly21r", "labels": ["bug"], "created_at": "2026-05-20T15:30:41Z", "updated_at": "2026-05-21T11:18:15Z", "url": "https://github.com/huggingface/transformers/issues/46121", "text": "ISSUE #46121: `convert_rope_params_to_dict` raises `TypeError` when `ignore_keys_at_rope_validation` is a JSON-loaded list\nState: open \| Labels: bug\nAuthor: Charly21r \| Created: 2026-05-20T15:30:41Z\n\n### System Info\n\n- transformers: 5.8.1\n- Python: 3.14.0\n- OS: macOS (reproduced locally; same class of failure reported with vLLM + Qwen3.5 merged HF checkpoints on Linux eval jobs)\n- Model family: Qwen3_5TextConfig / Qwen3.5\n\n### Who can help?\n\n@ArthurZucker @Cyrilvallez \n\n### Information\n\n- [ ] The official example scripts\n- [x] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [x] My own task or dataset (give details below)\n\n\n### Description\n\n`RotaryEmbeddingConfigMixin.convert_rope_params_to_dict()` raises a `TypeError` when `ignore_keys_at_rope_validation` is a list (e.g. deserialized from JSON in a `config.json`) because the union with `{\"partial_rotary_factor\"}` is performed without normalizing the operand to a set first:\n\n```\nTypeError: unsupported operand type(s) for \|: 'list' and 'set'\n```\n\nIn 5.8.1, `modeling_rope_utils.py` line 722:\n\n```python\nself.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\n```\n\n`ignore_keys_at_rope_validation` is a class attribute on `RotaryEmbeddingConfigMixin` (`set()` by default; configs like `Qwen3_5TextConfig` set it to `{\"mrope_section\", \"mrope_interleaved\"}`). 
When a `config.json` contains this field as a JSON array, `from_dict` / `__init__` sets it as an instance attribute (list), which shadows the class-level set. The next call to `convert_rope_params_to_dict` then evaluates `list \| set` and crashes.\n\n\n### Reproduction\n\n```python\nimport transformers\nfrom transformers import Qwen3_5TextConfig\n\nprint(f\"transformers version: {transformers.__version__}\") # 5.8.1\n\ncfg = Qwen3_5TextConfig.from_dict({\n \"model_type\": \"qwen3_5_text\",\n \"vocab_size\": 100,\n \"hidden_size\": 64,\n \"num_hidden_layers\": 2,\n \"num_attention_heads\": 2,\n \"num_key_value_heads\": 2,\n \"ignore_keys_at_rope_validation\": [\"mrope_section\", \"mrope_interleaved\"], # list, as from JSON\n \"partial_rotary_factor\": 0.25,\n \"rope_parameters\": {\"rope_type\": \"default\", \"rope_theta\": 10_000_000},\n})\n\nprint(type(cfg.ignore_keys_at_rope_validation).__name__) # 'list' (shadows class-level set)\n\ncfg.convert_rope_params_to_dict(partial_rotary_factor=0.25)\n```\n\nTraceback:\n\n```text\n File \".../transformers/modeling_rope_utils.py\", line 722, in convert_rope_params_to_dict\n self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~\nTypeError: unsupported operand type(s) for \|: 'list' and 'set'\n```\n\n### Expected behavior\n\n`convert_rope_params_to_dict` should accept list or set (or any iterable of strings) and normalize before union.\n\n\n### Actual behavior\n\n`TypeError: unsupported operand type(s) for \|: 'list' and 'set'` whenever `ignore_keys_at_rope_validation` is a list (as JSON forces) and `partial_rotary_factor` is not `None`.\n\n\n\n### Downstream impact (vLLM / merged checkpoints)\n\nWe hit this in production when serving merged Qwen3.5 Hugging Face checkpoints with vLLM. LoRA merge / export tools (e.g. 
ms-swift) write `ignore_keys_at_rope_validation` into `config.json`:\n\n```json\n\"ignore_keys_at_rope_validation\": [\"mrope_section\", \"mrope_interleaved\"]\n```\n\nJSON has no set type, so the field becomes a list at load time. During serving stack startup, config / RoPE initialization can hit `convert_rope_params_to_dict` with that list-typed instance attribute → `TypeError` → the server never becomes healthy and batch eval jobs fail before any inference.\n\nThis is a Transformers robustness issue (callers should not crash on list input); vLLM is the serving stack where we observe it. Current workarounds: strip `ignore_keys_at_rope_validation` from checkpoint JSON before `vllm serve`, or monkey-patch `modeling_rope_utils.py` to wrap with `set()` before `\|`.\n\n### Suggested fix\n\nCoerce to `set` before every `\|` union involving `ignore_keys_at_rope_validation`. The single-line change in `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` covers the reported path.\n\n\nI'm happy to open a PR against `main` that:\n\n1. Adds the `set(...)` coercion on the union line in `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` (and any other union sites I find).\n2. Adds a regression test covering JSON-list input for `ignore_keys_at_rope_validation` — both via direct call and via `from_dict` round-trip — so this can't silently regress again across refactors.\n\nJust let me know if you'd prefer a different shape (e.g. normalizing in `__setattr__`, or hardening `validate_rope` directly) and I'll match that.\n\n\n--- Comment by he-yufeng at 2026-05-20T15:52:11Z ---\nI rechecked current `main` and this path looks fixed there already: `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` now accepts `ignore_keys_at_rope_validation` separately and normalizes it with `set(...)` before adding `partial_rotary_factor`.\n\nSo the crash looks reproducible on 5.8.1, but I don't see the same `list \| set` path on `main` anymore. 
Could you retry against current `main` / the next nightly to confirm whether this only needs a release, or whether there is a second path still producing a list-valued ignore set?\r\n\n\n--- Comment by Charly21r at 2026-05-20T17:08:27Z ---\nJust ran the repro against pip install git+https://github.com/huggingface/transformers@main (resolves to 5.8.0.dev0, HEAD 52b82b2 as of 2026-05-20). Same TypeError at modeling_rope_utils.py:722. Full output:\n\n```\ntransformers version: 5.8.0.dev0\nAfter from_dict: type(cfg.ignore_keys_at_rope_validation) = list, value = ['mrope_section', 'mrope_interleaved']\n...\nFile \".../transformers/modeling_rope_utils.py\", line 722, in convert_rope_params_to_dict\n self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\nTypeError: unsupported operand type(s) for \|: 'list' and 'set'\n```\n\nSo main is still affected. Happy to push the PR (set() coercion + regression test) whenever you confirm the shape you want.\n\n--- Comment by Rocketknight1 at 2026-05-21T11:18:15Z ---\n@Charly21r yeah, this seems real. The set coercion line seems like the right fix, if you want to make that PR."}
	{"id": "issue_46097", "type": "issue", "number": 46097, "title": "Path Traversal in Sharded Checkpoint Loader via Unsanitized `weight_map` Entries in `.index.json`", "state": "open", "author": "karnakarreddi", "labels": ["bug"], "created_at": "2026-05-20T05:12:29Z", "updated_at": "2026-05-21T05:00:05Z", "url": "https://github.com/huggingface/transformers/issues/46097", "text": "ISSUE #46097: Path Traversal in Sharded Checkpoint Loader via Unsanitized `weight_map` Entries in `.index.json`\nState: open \| Labels: bug\nAuthor: karnakarreddi \| Created: 2026-05-20T05:12:29Z\n\n### System Info\n\nDetails\nThe vulnerable code is in get_checkpoint_shard_files in hub.py. When loading a sharded checkpoint from a local directory, the function reads an index JSON file and extracts shard filenames from the weight_map field without any validation:\n\nwith open(index_filename) as f:\n index = json.loads(f.read())\n\nshard_filenames = sorted(set(index[\"weight_map\"].values()))\nThese filenames are then joined directly to the model directory path:\n\nif os.path.isdir(pretrained_model_name_or_path):\n shard_filenames = [os.path.join(pretrained_model_name_or_path, subfolder, f) for f in shard_filenames]\n return shard_filenames, sharded_metadata\nThere is no check for:\n\nPath traversal sequences (..)\nAbsolute path prefixes (/)\nSymbolic links\nWhether the resolved paths remain within the model directory\nThe returned file paths are passed back to the caller (_get_resolved_checkpoint_files in modeling_utils.py), which uses them to load model weights — effectively enabling reads of arbitrary files the process has access to.\n\nWhy existing guards are insufficient\nThe caller _get_resolved_checkpoint_files only validates that the index file itself exists on disk (via os.path.isfile on the .safetensors.index.json path). It does not inspect or sanitize the contents of the index file before passing them to get_checkpoint_shard_files. 
An attacker-controlled directory needs only contain a valid index JSON file to satisfy this check.\n\nThe cached_files function (called for non-local/Hub models) does include file existence checks, but the local directory branch in get_checkpoint_shard_files returns immediately after os.path.join — cached_files is never reached for local paths.\n\n\nhttps://github.com/huggingface/transformers/blob/ba06e3fbdf355c363ac067ebcda210017e90a852/src/transformers/utils/hub.py#L836 \n\n### Who can help?\n\n@Cyrilvallez \n\n### Information\n\n- [ ] The official example scripts\n- [ ] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [ ] My own task or dataset (give details below)\n\n### Reproduction\n\nPoC\nStep 1: Create a malicious model directory\nmkdir -p /tmp/malicious_model\nStep 2: Create a crafted index file\nWrite the following to /tmp/malicious_model/model.safetensors.index.json:\n\n{\n \"metadata\": {\n \"total_size\": 1000\n },\n \"weight_map\": {\n \"model.layer.weight\": \"../../etc/passwd\",\n \"model.embed.weight\": \"../../etc/hostname\"\n }\n}\nStep 3: Trigger the vulnerability\nfrom transformers import AutoModel\n\n The loading pipeline will:\n 1. Find model.safetensors.index.json in the local directory\n 2. Set is_sharded = True\n 3. Call get_checkpoint_shard_files, which will return:\n [\"/tmp/malicious_model/../../etc/passwd\",\n \"/tmp/malicious_model/../../etc/hostname\"]\n These resolve to /etc/passwd and /etc/hostname\nmodel = AutoModel.from_pretrained(\"/tmp/malicious_model\")\n\n\n### Expected behavior\n\nObserve the result\nget_checkpoint_shard_files returns the traversed paths without error. The downstream model loading code will attempt to open and read these files as tensor data. 
While the files may fail to deserialize as valid safetensors, the file contents are accessed by the process, and depending on error handling, logging, or exception messages, data may be exposed.\n\nA more targeted attack could point shard paths at:\n\nOther users' cached model files in ~/.cache/huggingface/\nAPI tokens stored in ~/.cache/huggingface/token\nApplication configuration or secrets files\nAny file readable by the process\n\nVulnerability type: Arbitrary file read via path traversal (CWE-20 / CWE-22)\n\nWho is affected:\n\nAny user or automated system that loads models from untrusted local directories using from_pretrained or any code path that invokes get_checkpoint_shard_files.\nML pipelines and platforms that accept user-uploaded model directories (e.g., evaluation platforms, model hosting services, shared compute environments).\nDevelopers who download and load models from sources outside the Hugging Face Hub without additional validation.\nAttack prerequisites:\n\nThe attacker must be able to provide a local directory (or a directory downloaded/extracted from an untrusted source) that the victim passes to from_pretrained.\nNo authentication or special privileges are required beyond the ability to place files on the filesystem.\nRecommended fix:\nSanitize all filenames extracted from the weight_map before constructing paths:\n\nReject any filename containing .. 
components or absolute path prefixes.\nAfter joining paths, validate that the resolved path (via os.path.realpath) remains within the expected model directory.\nConsider rejecting filenames with path separator characters entirely, since shard files should be flat names like model-00001-of-00003.safetensors.\nExample fix:\n\nimport os\n\nif os.path.isdir(pretrained_model_name_or_path):\n base_dir = os.path.realpath(os.path.join(pretrained_model_name_or_path, subfolder))\n safe_paths = []\n for f in shard_filenames:\n full_path = os.path.realpath(os.path.join(base_dir, f))\n if not full_path.startswith(base_dir + os.sep):\n raise ValueError(\n f\"Shard filename '{f}' in the checkpoint index resolves outside \"\n f\"the model directory. This may indicate a malicious index file.\"\n )\n safe_paths.append(full_path)\n return safe_paths, sharded_metadata\n\n\n\n--- Comment by Rocketknight1 at 2026-05-20T11:03:25Z ---\nHmn, this doesn't seem like a serious bug, right? The attacker would have to induce the user to load a malicious model, and the only result would be that it tries to read a file on the user's local system and fails because that file is not safetensors format. Even in the unlikely event that sensitive data is contained in the error message, the attacker would have no access to it because it's entirely local to the user machine.\n\n--- Comment by karnakarreddi at 2026-05-20T11:35:33Z ---\nMy main concern is that weight_map entries are treated as trusted filesystem paths without any boundary validation. Even if safetensors parsing fails later, the loader still resolves and opens attacker-controlled paths outside the model directory. so i kept severity as medium though..\n\n--- Comment by matdou at 2026-05-20T11:40:13Z ---\nThe \"entirely local\" argument assumes single-user deployments. Any service that calls from_pretrained on a user-supplied path, so evaluation APIs, CI pipelines, etc. is exposed. 
\nThe fix is a handful of lines of path validation (negligible). Seems worth merging.\n\n--- Comment by karnakarreddi at 2026-05-20T11:44:40Z ---\nThanks, that matches my concern as well. I can also put together a small PR with the boundary validation if that would help move this forward.\n\n\n--- Comment by karnakarreddi at 2026-05-21T05:00:05Z ---\nhttps://github.com/huggingface/transformers/pull/46134 I created a small PR. @Rocketknight1 Please have a look whenever you get some time. "}
	{"id": "issue_46095", "type": "issue", "number": 46095, "title": "[deepseekv4]Does Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?", "state": "open", "author": "young-creator", "labels": ["Feature request"], "created_at": "2026-05-20T04:39:29Z", "updated_at": "2026-05-21T18:46:46Z", "url": "https://github.com/huggingface/transformers/issues/46095", "text": "ISSUE #46095: [deepseekv4]Does Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?\nState: open \| Labels: Feature request\nAuthor: young-creator \| Created: 2026-05-20T04:39:29Z\n\n### Feature request\n\nFor [deepseekv4], the weight names provided in the Hugging Face DeepSeek-V4-Flash weights seem not to match the Transformers weight names. Does Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?\n\n<img width=\"3098\" height=\"1744\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/63eed220-9a7f-48dd-83dc-328b7b1ea22c\" />\n\n<img width=\"1792\" height=\"524\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/9d477055-40ad-4517-bef0-b5bdc5bba08f\" />\n\n### Motivation\n\nDoes Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?\n\n\n### Your contribution\n\nIf not, can we submit a PR?\n\n--- Comment by ArjunSrivastava1 at 2026-05-20T09:34:37Z ---\ni think u r maybe referring to some sort of docs for the conversion of weights? 
in which case yeah, there is\ni just was fixing those up a while ago, found it like that\n\nfeel free to read em up here: [link](https://moon-ci-docs.huggingface.co/docs/transformers/pr_45892/en/weightconverter)\n\nif somethings still missing, then lmk, if its all fine and found what u were looking for, then lmk that too \n\n--- Comment by Rocketknight1 at 2026-05-20T11:11:05Z ---\nYeah, I don't think there's a bug here! This is likely a case of dynamic weight renaming.\n\n--- Comment by BiggHeadd at 2026-05-20T13:30:39Z ---\nI found this branch [DeepSeek-V4-Flash-Base-support] have complete dynamic weight renaming.\n\n--- Comment by young-creator at 2026-05-21T03:33:04Z ---\n> i think u r maybe referring to some sort of docs for the conversion of weights? in which case yeah, there is i just was fixing those up a while ago, found it like that\n> \n> feel free to read em up here: [link](https://moon-ci-docs.huggingface.co/docs/transformers/pr_45892/en/weightconverter)\n> \n> if somethings still missing, then lmk, if its all fine and found what u were looking for, then lmk that too\n\nI do need the weight mapping, as I plan to finetune DeepSeek-V4 using the Transformers modeling code. \n\n--- Comment by Prachi-kushwaha at 2026-05-21T18:46:46Z ---\n> > i think u r maybe referring to some sort of docs for the conversion of weights? in which case yeah, there is i just was fixing those up a while ago, found it like that\n> > feel free to read em up here: [link](https://moon-ci-docs.huggingface.co/docs/transformers/pr_45892/en/weightconverter)\n> > if somethings still missing, then lmk, if its all fine and found what u were looking for, then lmk that too\n> \n> I do need the weight mapping, as I plan to finetune DeepSeek-V4 using the Transformers modeling code.\n\nyou can use LoRa/QLoRa for fine-tuning you don't need weight mapping in this scenario"}
	{"id": "pr_46148", "type": "pr", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46148: [Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export\nState: open \| Merged: False\nAuthor: yuvrajsharma9981 \| Base: main\nLabels: \nCreated: 2026-05-21T22:27:49Z\n\nHi,\n\n\\`torch.export.export\\` fails on Qwen3Next-family models with \\`GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0, 1)\\`. The crash traces to \\`Qwen3NextModel._update_linear_attn_mask\\`:\n\n\\`\\`\\`python\nif (past_key_values is not None and past_key_values.has_previous_state()) or (\n attention_mask is not None and torch.all(attention_mask == 1)\n):\n linear_attn_mask = None\n\\`\\`\\`\n\n\\`torch.all(attention_mask == 1)\\` produces a 0-dim bool tensor, and Python's \\`if\\` does an implicit \\`.item()\\` on it — an unbacked symbolic int the exporter can't resolve. 
Net effect: any user wanting an AOT package (\\`torch._inductor.aoti_compile_and_package\\` → \\`.pt2\\`) for any model in this family is blocked at the export step.\n\nI tripped on this trying to AOT compile Qwen3.5 for fast serving — the eager forward works, the export step crashes.\n\n## Scope\n\nFix lands at the modular source-of-truth, so the same patch propagates to all four models that inherit \\`Qwen3NextModel._update_linear_attn_mask\\`:\n\n- Qwen3Next (direct)\n- Qwen3.5 (\\`Qwen3_5TextModel(Qwen3NextModel)\\`)\n- Qwen3.5-MoE (\\`Qwen3_5MoeTextModel\\` via the same lineage)\n- OLMo Hybrid (\\`OlmoHybridModel(Qwen3NextModel)\\`)\n\n## Fix\n\nSmallest behavior-preserving thing I could come up with: keep the eager-mode fast-path identical, and skip the data-dependent branch only when \\`torch.compiler.is_compiling()\\` is true. The downstream linear-attention layer treats an all-1s mask as a cheap no-op, so the exported graph runs correctly for the no-padding case that the eager path was short-circuiting.\n\n\\`\\`\\`python\ndef _update_linear_attn_mask(self, attention_mask, past_key_values):\n linear_attn_mask = attention_mask\n if past_key_values is not None and past_key_values.has_previous_state():\n return None\n if torch.compiler.is_compiling():\n return linear_attn_mask\n if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n\\`\\`\\`\n\nTwo notes on the ordering:\n\n1. The cached-forward check stays first so users exporting a decode-step graph still get the cached-skip optimization baked into the resulting graph — that branch is already export-compatible (Python object state, not a tensor \\`.item()\\`).\n2. 
\\`torch.compiler.is_compiling()\\` is the public PyTorch idiom for \"behave differently under trace\"; runtime behavior for everyone not exporting is byte-identical to before.\n\n## Reproducer\n\nFails on v5.9.0 + torch 2.11:\n\n\\`\\`\\`python\nimport torch\nfrom transformers import AutoModelForCausalLM\n\nm = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen3.5-4B\", torch_dtype=torch.bfloat16)\nm.eval()\n\nclass W(torch.nn.Module):\n def __init__(s, m): super().__init__(); s.m = m\n def forward(s, ids, mask):\n return s.m(input_ids=ids, attention_mask=mask).logits\n\nids = torch.ones(2, 128, dtype=torch.long)\nmask = torch.ones(2, 128, dtype=torch.long)\n\ntorch.export.export(W(m), (ids, mask), dynamic_shapes={\n \"ids\": {0: torch.export.Dim.AUTO, 1: torch.export.Dim.AUTO},\n \"mask\": {0: torch.export.Dim.AUTO, 1: torch.export.Dim.AUTO},\n})\n\\`\\`\\`\n\nAfter the change, verified locally with a forked-source install:\n- eager runtime with all-1s mask still returns \\`None\\` (existing optimization preserved)\n- \\`torch.export.export(...)\\` succeeds and traces a clean graph\n\n## Commits\n\n- First commit edited the generated \\`modeling_qwen3_5.py\\` directly — CI correctly flagged this via \\`check_repository_consistency\\`.\n- Second commit moves the fix to \\`modular_qwen3_next.py\\` (the source-of-truth) and regenerates the affected \\`modeling_.py\\` files via \\`make fix-repo\\`. Both checks pass locally now.\n\nHappy to add tests under \\`tests/models/qwen3_next/\\` (and the inheriting models) if that's the preferred shape — held off pending guidance on existing export-compat coverage conventions.\n\nThanks!\n\n--- Comment by github-actions[bot] at 2026-05-21T22:40:52Z ---\n[For maintainers]* Suggested jobs to run (before merge)\n\nrun-slow: olmo_hybrid, qwen3_5, qwen3_5_moe, qwen3_next"}
	{"id": "pr_46148_file_src_transformers_models_olmo_hybrid_modeling_olmo_hybrid.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/olmo_hybrid/modeling_olmo_hybrid.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/olmo_hybrid/modeling_olmo_hybrid.py\nStatus: modified \| +12 -4\n\n@@ -1032,12 +1032,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_5_modeling_qwen3_5.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_5/modeling_qwen3_5.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_5/modeling_qwen3_5.py\nStatus: modified \| +12 -4\n\n@@ -1238,12 +1238,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_5_moe_modeling_qwen3_5_moe.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py\nStatus: modified \| +12 -4\n\n@@ -1358,12 +1358,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_next_modeling_qwen3_next.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_next/modeling_qwen3_next.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_next/modeling_qwen3_next.py\nStatus: modified \| +12 -4\n\n@@ -993,12 +993,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_next_modular_qwen3_next.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_next/modular_qwen3_next.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_next/modular_qwen3_next.py\nStatus: modified \| +12 -4\n\n@@ -749,12 +749,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46147", "type": "pr", "number": 46147, "title": "Use attention interface in RoFormerSelfAttention", "state": "open", "author": "HamzaDogann", "labels": [], "created_at": "2026-05-21T20:30:41Z", "updated_at": "2026-05-21T21:12:02Z", "url": "https://github.com/huggingface/transformers/pull/46147", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46147: Use attention interface in RoFormerSelfAttention\nState: open \| Merged: False\nAuthor: HamzaDogann \| Base: main\nLabels: \nCreated: 2026-05-21T20:30:41Z\n\nRoFormer's self-attention was using a hardcoded eager implementation,\r\nmaking it impossible to use alternative attention backends.\r\n\r\nThis PR replaces the hardcoded computation with `ALL_ATTENTION_FUNCTIONS`\r\ndispatch and adds a local `eager_attention_forward` as the default fallback,\r\npreserving existing behavior while enabling `flash_attention_2`, `sdpa`,\r\nand custom attention implementations via `_attn_implementation`.\r\n\r\nCloses #46144\n\n--- Comment by github-actions[bot] at 2026-05-21T21:12:01Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: roformer\n\n--- Comment by Copilot at 2026-05-21T20:36:48Z ---\n`flash_attention_forward` falls back to `module.is_causal` when `is_causal` is not provided. `RoFormerSelfAttention` does not define `is_causal` and the interface call doesn’t pass it, so using `_attn_implementation=\"flash_attention_*\"` will raise an `AttributeError` (and cross-attn would be incorrectly treated as causal if `is_causal` were set globally). Pass an explicit `is_causal` value per call (likely `False` here, since RoFormer builds a bidirectional mask) and forward `output_attentions` so backend wrappers can warn/behave consistently.\n\n--- Comment by Copilot at 2026-05-21T20:36:48Z ---\n`eager_attention_forward` is identical to the shared implementation used across the repo (e.g. 
BERT/ALBERT) but is missing the `# Copied from transformers.models.bert.modeling_bert.eager_attention_forward` marker. Adding the marker helps `make fixup` keep this function in sync with upstream changes.\n\n--- Comment by Copilot at 2026-05-21T20:36:49Z ---\nThis change introduces a new attention-backend dispatch path for RoFormer via `ALL_ATTENTION_FUNCTIONS`. There are existing RoFormer modeling tests, but none cover non-eager dispatch or custom attention registration; adding a small test (e.g., setting `config._attn_implementation=\"sdpa\"` or registering a dummy backend key) would help prevent regressions."}
	{"id": "pr_46147_file_src_transformers_models_roformer_modeling_roformer.py", "type": "pr_diff", "number": 46147, "title": "Use attention interface in RoFormerSelfAttention", "state": "open", "author": "HamzaDogann", "labels": [], "created_at": "2026-05-21T20:30:41Z", "updated_at": "2026-05-21T21:12:02Z", "url": "https://github.com/huggingface/transformers/pull/46147", "merged": false, "base_branch": "main", "filename": "src/transformers/models/roformer/modeling_roformer.py", "additions": 56, "deletions": 22, "text": "PR #46147 — file change: src/transformers/models/roformer/modeling_roformer.py\nStatus: modified \| +56 -22\n\n@@ -13,7 +13,6 @@\n # limitations under the License.\n \"\"\"PyTorch RoFormer model.\"\"\"\n \n-import math\n from collections.abc import Callable\n \n import numpy as np\n@@ -36,15 +35,45 @@\n SequenceClassifierOutput,\n TokenClassifierOutput,\n )\n-from ...modeling_utils import PreTrainedModel\n+from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel\n+from ...processing_utils import Unpack\n from ...pytorch_utils import apply_chunking_to_forward\n-from ...utils import auto_docstring, logging\n+from ...utils import TransformersKwargs, auto_docstring, logging\n from .configuration_roformer import RoFormerConfig\n \n \n logger = logging.get_logger(__name__)\n \n \n+# Copied from transformers.models.bert.modeling_bert.eager_attention_forward\n+def eager_attention_forward(\n+ module: nn.Module,\n+ query: torch.Tensor,\n+ key: torch.Tensor,\n+ value: torch.Tensor,\n+ attention_mask: torch.Tensor \| None,\n+ scaling: float \| None = None,\n+ dropout: float = 0.0,\n+ **kwargs: Unpack[TransformersKwargs],\n+):\n+ if scaling is None:\n+ scaling = query.size(-1) ** -0.5\n+\n+ # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n+ attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling\n+\n+ if attention_mask is not None:\n+ attn_weights = attn_weights + attention_mask\n+\n+ attn_weights = 
nn.functional.softmax(attn_weights, dim=-1)\n+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)\n+\n+ attn_output = torch.matmul(attn_weights, value)\n+ attn_output = attn_output.transpose(1, 2).contiguous()\n+\n+ return attn_output, attn_weights\n+\n+\n # Copied from transformers.models.marian.modeling_marian.MarianSinusoidalPositionalEmbedding with Marian->RoFormer\n class RoFormerSinusoidalPositionalEmbedding(nn.Embedding):\n \"\"\"This module produces sinusoidal positional embeddings of any length.\"\"\"\n@@ -121,9 +150,11 @@ def __init__(self, config, layer_idx=None):\n f\"heads ({config.num_attention_heads})\"\n )\n \n+ self.config = config\n self.num_attention_heads = config.num_attention_heads\n self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n self.all_head_size = self.num_attention_heads * self.attention_head_size\n+ self.scaling = self.attention_head_size**-0.5\n \n self.query = nn.Linear(config.hidden_size, self.all_head_size)\n self.key = nn.Linear(config.hidden_size, self.all_head_size)\n@@ -193,26 +224,25 @@ def forward(\n if is_cross_attention and isinstance(past_key_values, EncoderDecoderCache):\n past_key_values.is_updated[self.layer_idx] = True\n \n- # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n- attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n-\n- attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n- if attention_mask is not None:\n- # Apply the attention mask is (precomputed for all layers in RoFormerModel forward() function)\n- attention_scores = attention_scores + attention_mask\n-\n- # Normalize the attention scores to probabilities.\n- attention_probs = nn.functional.softmax(attention_scores, dim=-1)\n-\n- # This is actually dropping out entire tokens to attend to, which might\n- # seem a bit unusual, but is taken from the original Transformer paper.\n- attention_probs = 
self.dropout(attention_probs)\n-\n- context_layer = torch.matmul(attention_probs, value_layer)\n+ attention_interface: Callable = ALL_ATTENTION_FUNCTIONS.get_interface(\n+ self.config._attn_implementation, eager_attention_forward\n+ )\n \n- context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n- new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n- context_layer = context_layer.view(new_context_layer_shape)\n+ context_layer, attention_probs = attention_interface(\n+ self,\n+ query_layer,\n+ key_layer,\n+ value_layer,\n+ attention_mask,\n+ dropout=0.0 if not self.training else self.dropout.p,\n+ scaling=self.scaling,\n+ # RoFormer precomputes a (bidirectional) mask in `RoFormerModel`, so the backend\n+ # must not apply an additional causal mask.\n+ is_causal=False,\n+ output_attentions=output_attentions,\n+ **kwargs,\n+ )\n+ context_layer = context_layer.reshape(*input_shape, -1).contiguous()\n \n return context_layer, attention_probs\n \n@@ -617,6 +647,10 @@ class RoFormerPreTrainedModel(PreTrainedModel):\n config: RoFormerConfig\n base_model_prefix = \"roformer\"\n supports_gradient_checkpointing = True\n+ _supports_flash_attn = True\n+ _supports_sdpa = True\n+ _supports_flex_attn = True\n+ _supports_attention_backend = True\n \n @torch.no_grad()\n def _init_weights(self, module):"}
	{"id": "pr_46146", "type": "pr", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46146: Added cosmos3 model and bugfixed Qwen3-VL\nState: open \| Merged: False\nAuthor: MaciejBalaNV \| Base: main\nLabels: \nCreated: 2026-05-21T19:47:46Z\n\n# What does this PR do?\r\n\r\n<!--\r\nCongratulations! You've made it this far! You're not quite done yet though.\r\n\r\nOnce merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.\r\n\r\nThen, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.\r\n\r\nOnce you're done, someone will review your PR shortly (see the section \"Who can review?\" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.\r\n-->\r\n\r\n<!-- Remove if not applicable -->\r\n\r\nThis PR adds a support for Cosmos3 Reasoner model (not released yet). It's a Mixture Of Transformers model, where we have a Generator and a Reasoner tower in a unified checkpoint. The Reasoner tower has Qwen3-VL architecture, so we can directly reuse it. 
However, we need extra code to handle the checkpoint mapping, since the final checkpoint will be in a unified Reasoner+Generator diffusers format.\r\n\r\nAdditionally, this PR fixes one issue which currently is present on top of tree - when using latest vllm and latest transformers build from source, even basic `vllm serve Qwen/Qwen3-VL-8B-Instruct` fails during dummy run. This root-cause of the bug is this commit: `ba06e3fbdf355c363ac067ebcda210017e90a852`, reverting it also fixes Qwen-VL.\r\n\r\n## Code Agent Policy\r\n\r\nThe Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by\r\ncode agents. We are currently bottlenecked by our ability to review and respond to them. As a result, \r\nwe ask that new users do not submit pure code agent PRs at this time. \r\nYou may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous \"OpenClaw\"-like agents\r\nnot to open any PRs or issues for the moment.\r\n\r\nPRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this\r\nrepeatedly or maliciously. \r\n\r\nThis is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, \r\nthis policy is likely to be updated regularly in the near future. For more information, please read [`CONTRIBUTING.md`](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md).\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? 
Please add a link\r\n to it if that's the case.\r\n- [x] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [x] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\nAnyone in the community is free to review the PR once the tests have passed. Feel free to tag\r\nmembers/contributors who may be interested in your PR.\r\n\r\n@yonigozlan for a vision model review\r\n\r\n<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @\r\n\r\n If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.\r\n Please tag fewer than 3 people.\r\n\r\nModels:\r\n\r\n- text models: @ArthurZucker @Cyrilvallez\r\n- vision models: @yonigozlan @molbap\r\n- audio models: @eustlb @ebezzam @vasqu\r\n- multimodal models: @zucchini-nlp\r\n- graph models: @clefourrier\r\n\r\nLibrary:\r\n\r\n- generate: @zucchini-nlp (visual-language models) or @gante (all others)\r\n- continuous batching: @remi-or @ArthurZucker @McPatate\r\n- pipelines: @Rocketknight1\r\n- tokenizers: @ArthurZucker and @itazap\r\n- trainer: @SunMarc\r\n- attention: @vasqu @ArthurZucker @CyrilVallez\r\n- model loading (from pretrained, etc): @CyrilVallez\r\n- distributed: @3outeille @ArthurZucker\r\n- CIs: @ydshieh\r\n\r\nIntegrations:\r\n\r\n- ray/raytune: @richardliaw, @amogkam\r\n- Big Model Inference: @SunMarc\r\n- quantization: @SunMarc\r\n- kernels: @drbh\r\n- peft: @BenjaminBossan @githubnemo\r\n\r\nDevices/Backends:\r\n\r\n- AMD ROCm: @ivarflakstad\r\n- Intel XPU: @IlyasMoutawwakil\r\n- Ascend NPU: @ivarflakstad \r\n\r\nDocumentation: @stevhliu\r\n\r\nResearch projects are not maintained and should be taken as is.\r\n\r\n -->\r\n\n\n--- Comment by 
github-actions[bot] at 2026-05-21T19:49:01Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: auto, cosmos3\n\n--- Comment by github-actions[bot] at 2026-05-21T20:05:30Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46146&sha=61cd69"}
	{"id": "pr_46146_file_docs_source_en__toctree.yml", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "docs/source/en/_toctree.yml", "additions": 2, "deletions": 0, "text": "PR #46146 — file change: docs/source/en/_toctree.yml\nStatus: modified \| +2 -0\n\n@@ -1208,6 +1208,8 @@\n title: ColPali\n - local: model_doc/colqwen2\n title: ColQwen2\n+ - local: model_doc/cosmos3\n+ title: Cosmos3 Omni\n - local: model_doc/data2vec\n title: Data2Vec\n - local: model_doc/deepseek_vl"}
	{"id": "pr_46146_file_docs_source_en_model_doc_cosmos3.md", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "docs/source/en/model_doc/cosmos3.md", "additions": 89, "deletions": 0, "text": "PR #46146 — file change: docs/source/en/model_doc/cosmos3.md\nStatus: added \| +89 -0\n\n@@ -0,0 +1,89 @@\n+<!--Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+\n+Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with\n+the License. You may obtain a copy of the License at\n+\n+http://www.apache.org/licenses/LICENSE-2.0\n+\n+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on\n+an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\n+specific language governing permissions and limitations under the License.\n+\n+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\n+rendered properly in your Markdown viewer.\n+\n+-->\n+\n+<div style=\"float: right;\">\n+ <div class=\"flex flex-wrap space-x-1\">\n+<img alt=\"FlashAttention\" src=\"https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat\">\n+<img alt=\"SDPA\" src=\"https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white\"> </div>\n+</div>\n+\n+# Cosmos3 Omni\n+\n+[Cosmos3](https://huggingface.co/nvidia/Cosmos3-Nano) is a mixture-of-transformers (MoT) Vision Foundation Model from NVIDIA, composed of a Reasoner tower and a Generator tower. 
The two towers share the same input embedding and visual encoder but use disjoint MoT experts for understanding vs. generation, plus cross-modal adapters (`llm2vae`, `llm2sound`, `llm2action`, etc.) that connect the language model to image / audio / action heads.\n+\n+The transformers integration loads only the Reasoner tower from a unified Cosmos3 checkpoint. The Reasoner is architecturally identical to [Qwen3-VL](./qwen3_vl) — `Cosmos3ForConditionalGeneration` is a thin subclass of `Qwen3VLForConditionalGeneration`. Loading is driven by two transformations registered automatically when `model_type` is `\"cosmos3_omni\"`:\n+\n+1. The checkpoint's flat namespaces are re-targeted to Qwen3-VL's nested layout: `model.<>` → `model.language_model.<>` and `blocks.` / `merger.` / `patch_embed.` / `pos_embed.` / `deepstack_merger_list.` → `model.visual.<>`.\n+2. Generator / sound / action parameters (`_moe_gen`, `llm2vae`, `vae2llm`, `time_embedder`, `llm2sound`, `sound2llm`, `sound_modality_embed`, `llm2action`, `action2llm`, `action_modality_embed`) are skipped on load.\n+\n+## Usage\n+\n+```python\n+import torch\n+from transformers import AutoProcessor, Cosmos3ForConditionalGeneration\n+\n+model = Cosmos3ForConditionalGeneration.from_pretrained(\n+ \"nvidia/Cosmos3-Nano\",\n+ dtype=torch.float16,\n+ device_map=\"auto\",\n+ attn_implementation=\"sdpa\",\n+)\n+processor = AutoProcessor.from_pretrained(\"nvidia/Cosmos3-Nano\")\n+\n+conversation = [\n+ {\n+ \"role\": \"user\",\n+ \"content\": [\n+ {\"type\": \"image\", \"image\": \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg\"},\n+ {\"type\": \"text\", \"text\": \"Caption the image in detail.\"},\n+ ],\n+ },\n+]\n+\n+inputs = processor.apply_chat_template(\n+ conversation,\n+ tokenize=True,\n+ add_generation_prompt=True,\n+ return_dict=True,\n+ return_tensors=\"pt\",\n+).to(model.device)\n+\n+generated_ids = model.generate(**inputs, max_new_tokens=512)\n+output = 
processor.batch_decode(\n+ [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],\n+ skip_special_tokens=True,\n+ clean_up_tokenization_spaces=False,\n+)\n+print(output[0])\n+```\n+\n+## Cosmos3Config\n+\n+[[autodoc]] Cosmos3Config\n+\n+## Cosmos3Model\n+\n+[[autodoc]] Cosmos3Model\n+ - forward\n+ - get_video_features\n+ - get_image_features\n+\n+## Cosmos3ForConditionalGeneration\n+\n+[[autodoc]] Cosmos3ForConditionalGeneration\n+ - forward\n+ - get_video_features\n+ - get_image_features"}
	{"id": "pr_46146_file_src_transformers_conversion_mapping.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/conversion_mapping.py", "additions": 14, "deletions": 0, "text": "PR #46146 — file change: src/transformers/conversion_mapping.py\nStatus: modified \| +14 -0\n\n@@ -580,6 +580,20 @@ def _build_checkpoint_conversion_mapping():\n operations=[Transpose(1, 2, check_dims=True)],\n ),\n ],\n+ \"cosmos3_omni\": [\n+ # Cosmos3 unified checkpoints store the Reasoner LLM under a flat `model.` namespace\n+ # (no `language_model.` nesting) and the ViT under flat `blocks.` / `merger.` /\n+ # `patch_embed.` / `pos_embed.` / `deepstack_merger_list.`. Re-target both to the\n+ # nested Qwen3-VL layout (`model.language_model.` and `model.visual.`).\n+ WeightRenaming(\n+ source_patterns=r\"^model\\.(?!language_model\\.)(.+)$\",\n+ target_patterns=r\"model.language_model.\\1\",\n+ ),\n+ WeightRenaming(\n+ source_patterns=r\"^(blocks\\.\|merger\\.\|patch_embed\\.\|pos_embed\\.\|deepstack_merger_list\\.)(.*)$\",\n+ target_patterns=r\"model.visual.\\1\\2\",\n+ ),\n+ ],\n \"phimoe\": [\n WeightRenaming(\".block_sparse_moe.\", \".mlp.\"),\n WeightRenaming(\".gate.weight\", \".router.weight\"),"}
	{"id": "pr_46146_file_src_transformers_models___init__.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/__init__.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/__init__.py\nStatus: modified \| +1 -0\n\n@@ -78,6 +78,7 @@\n from .convbert import *\n from .convnext import *\n from .convnextv2 import *\n+ from .cosmos3 import *\n from .cpm import *\n from .cpmant import *\n from .csm import *"}
	{"id": "pr_46146_file_src_transformers_models_auto_auto_mappings.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/auto_mappings.py", "additions": 2, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/auto_mappings.py\nStatus: modified \| +2 -0\n\n@@ -105,6 +105,7 @@\n (\"convbert\", \"ConvBertConfig\"),\n (\"convnext\", \"ConvNextConfig\"),\n (\"convnextv2\", \"ConvNextV2Config\"),\n+ (\"cosmos3_omni\", \"Cosmos3Config\"),\n (\"cpmant\", \"CpmAntConfig\"),\n (\"csm\", \"CsmConfig\"),\n (\"csm_depth_decoder_model\", \"CsmDepthDecoderConfig\"),\n@@ -688,6 +689,7 @@\n (\"clip_vision_model\", \"clip\"),\n (\"clipseg_text_model\", \"clipseg\"),\n (\"clipseg_vision_model\", \"clipseg\"),\n+ (\"cosmos3_omni\", \"cosmos3\"),\n (\"clvp_decoder\", \"clvp\"),\n (\"clvp_encoder\", \"clvp\"),\n (\"csm_depth_decoder_model\", \"csm\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_image_processing_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/image_processing_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/image_processing_auto.py\nStatus: modified \| +1 -0\n\n@@ -74,6 +74,7 @@\n (\"colpali\", {\"torchvision\": \"SiglipImageProcessor\", \"pil\": \"SiglipImageProcessorPil\"}),\n (\"colqwen2\", {\"torchvision\": \"Qwen2VLImageProcessor\", \"pil\": \"Qwen2VLImageProcessorPil\"}),\n (\"convnextv2\", {\"torchvision\": \"ConvNextImageProcessor\", \"pil\": \"ConvNextImageProcessorPil\"}),\n+ (\"cosmos3_omni\", {\"torchvision\": \"Qwen2VLImageProcessor\", \"pil\": \"Qwen2VLImageProcessorPil\"}),\n (\"cvt\", {\"torchvision\": \"ConvNextImageProcessor\", \"pil\": \"ConvNextImageProcessorPil\"}),\n (\"data2vec-vision\", {\"torchvision\": \"BeitImageProcessor\", \"pil\": \"BeitImageProcessorPil\"}),\n (\"deimv2\", {\"torchvision\": \"RTDetrImageProcessor\", \"pil\": \"RTDetrImageProcessorPil\"}),"}
	{"id": "pr_46146_file_src_transformers_models_auto_modeling_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/modeling_auto.py", "additions": 2, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/modeling_auto.py\nStatus: modified \| +2 -0\n\n@@ -97,6 +97,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):\n (\"convbert\", \"ConvBertModel\"),\n (\"convnext\", \"ConvNextModel\"),\n (\"convnextv2\", \"ConvNextV2Model\"),\n+ (\"cosmos3_omni\", \"Cosmos3Model\"),\n (\"cpmant\", \"CpmAntModel\"),\n (\"csm\", \"CsmForConditionalGeneration\"),\n (\"ctrl\", \"CTRLModel\"),\n@@ -995,6 +996,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):\n (\"blip-2\", \"Blip2ForConditionalGeneration\"),\n (\"chameleon\", \"ChameleonForConditionalGeneration\"),\n (\"cohere2_vision\", \"Cohere2VisionForConditionalGeneration\"),\n+ (\"cosmos3_omni\", \"Cosmos3ForConditionalGeneration\"),\n (\"deepseek_vl\", \"DeepseekVLForConditionalGeneration\"),\n (\"deepseek_vl_hybrid\", \"DeepseekVLHybridForConditionalGeneration\"),\n (\"emu3\", \"Emu3ForConditionalGeneration\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_processing_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/processing_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/processing_auto.py\nStatus: modified \| +1 -0\n\n@@ -69,6 +69,7 @@\n (\"colmodernvbert\", \"ColModernVBertProcessor\"),\n (\"colpali\", \"ColPaliProcessor\"),\n (\"colqwen2\", \"ColQwen2Processor\"),\n+ (\"cosmos3_omni\", \"Qwen3VLProcessor\"),\n (\"deepseek_vl\", \"DeepseekVLProcessor\"),\n (\"deepseek_vl_hybrid\", \"DeepseekVLHybridProcessor\"),\n (\"dia\", \"DiaProcessor\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_tokenization_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/tokenization_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/tokenization_auto.py\nStatus: modified \| +1 -0\n\n@@ -99,6 +99,7 @@\n (\"cohere2\", \"CohereTokenizer\" if is_tokenizers_available() else None),\n (\"colqwen2\", \"Qwen2Tokenizer\" if is_tokenizers_available() else None),\n (\"convbert\", \"BertTokenizer\" if is_tokenizers_available() else None),\n+ (\"cosmos3_omni\", \"Qwen2Tokenizer\" if is_tokenizers_available() else None),\n (\"cpm\", \"CpmTokenizer\" if is_tokenizers_available() else None),\n (\"cpmant\", \"CpmAntTokenizer\"),\n (\"ctrl\", \"CTRLTokenizer\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_video_processing_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/video_processing_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/video_processing_auto.py\nStatus: modified \| +1 -0\n\n@@ -54,6 +54,7 @@\n # Merge non-standard mapping names with auto-inferred `VIDEO_PROCESSOR_MAPPING_NAMES`\n MISSING_VIDEO_PROCESSOR_MAPPING_NAMES = OrderedDict(\n [\n+ (\"cosmos3_omni\", \"Qwen3VLVideoProcessor\"),\n (\"exaone4_5\", \"Qwen2VLVideoProcessor\"),\n (\"instructblip\", \"InstructBlipVideoVideoProcessor\"),\n (\"pe_audio_video\", \"PeVideoVideoProcessor\"),"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3___init__.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/__init__.py", "additions": 27, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/__init__.py\nStatus: added \| +27 -0\n\n@@ -0,0 +1,27 @@\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+from typing import TYPE_CHECKING\n+\n+from ...utils import _LazyModule\n+from ...utils.import_utils import define_import_structure\n+\n+\n+if TYPE_CHECKING:\n+ from .configuration_cosmos3 import *\n+ from .modeling_cosmos3 import *\n+else:\n+ import sys\n+\n+ _file = globals()[\"__file__\"]\n+ sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3_configuration_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/configuration_cosmos3.py", "additions": 40, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/configuration_cosmos3.py\nStatus: added \| +40 -0\n\n@@ -0,0 +1,40 @@\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# This file was automatically generated from src/transformers/models/cosmos3/modular_cosmos3.py.\n+# Do NOT edit this file manually as any edits will be overwritten by the generation of\n+# the file from the modular. If any change should be done, please apply the change to the\n+# modular_cosmos3.py file directly. One of our CI enforces this.\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. 
All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+from huggingface_hub.dataclasses import strict\n+\n+from ...utils import auto_docstring\n+from ..qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig\n+\n+\n+@auto_docstring(checkpoint=\"nvidia/Cosmos3-Nano\")\n+@strict\n+class Cosmos3Config(Qwen3VLConfig):\n+ r\"\"\"\n+ Configuration for the [Cosmos3](https://huggingface.co/nvidia/Cosmos3-Nano) Reasoner tower.\n+\n+ The Reasoner tower is architecturally identical to Qwen3-VL, so this config inherits all\n+ fields from [`Qwen3VLConfig`] and only changes `model_type` so that conversion mappings\n+ and key-renaming rules dispatch correctly when loading a unified Cosmos3 checkpoint.\n+ \"\"\"\n+\n+ model_type = \"cosmos3_omni\"\n+\n+\n+__all__ = [\"Cosmos3Config\"]"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3_modeling_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/modeling_cosmos3.py", "additions": 62, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/modeling_cosmos3.py\nStatus: added \| +62 -0\n\n@@ -0,0 +1,62 @@\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# This file was automatically generated from src/transformers/models/cosmos3/modular_cosmos3.py.\n+# Do NOT edit this file manually as any edits will be overwritten by the generation of\n+# the file from the modular. If any change should be done, please apply the change to the\n+# modular_cosmos3.py file directly. One of our CI enforces this.\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. 
All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\"\"\"Cosmos3 model — loads the Reasoner tower of a Cosmos3 MoT checkpoint into Qwen3-VL.\"\"\"\n+\n+from ..qwen3_vl.modeling_qwen3_vl import Qwen3VLForConditionalGeneration, Qwen3VLModel\n+from .configuration_cosmos3 import Cosmos3Config\n+\n+\n+_COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS = [\n+ r\"_moe_gen\",\n+ r\"^llm2vae\\.\",\n+ r\"^vae2llm\\.\",\n+ r\"^time_embedder\\.\",\n+ r\"^llm2sound\\.\",\n+ r\"^sound2llm\\.\",\n+ r\"^sound_modality_embed$\",\n+ r\"^llm2action\\.\",\n+ r\"^action2llm\\.\",\n+ r\"^action_modality_embed$\",\n+]\n+\n+\n+class Cosmos3Model(Qwen3VLModel):\n+ config: Cosmos3Config\n+\n+ # Base-model loading from a unified Cosmos3 checkpoint drops the Generator tower,\n+ # cross-modal adapters, and the causal-LM head.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS + [\n+ r\"^lm_head\\.weight$\"\n+ ]\n+\n+\n+class Cosmos3ForConditionalGeneration(Qwen3VLForConditionalGeneration):\n+ config: Cosmos3Config\n+\n+ # The unified Cosmos3 checkpoint stores both the Reasoner tower (loaded here) and the\n+ # Generator tower / cross-modal adapters (dropped). These patterns silence the\n+ # \"unexpected keys\" warning for parameters that belong to the dropped components.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS\n+\n+\n+__all__ = [\n+ \"Cosmos3ForConditionalGeneration\",\n+ \"Cosmos3Model\",\n+]"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3_modular_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/modular_cosmos3.py", "additions": 77, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/modular_cosmos3.py\nStatus: added \| +77 -0\n\n@@ -0,0 +1,77 @@\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\"\"\"Cosmos3 model — loads the Reasoner tower of a Cosmos3 MoT checkpoint into Qwen3-VL.\"\"\"\n+\n+from huggingface_hub.dataclasses import strict\n+\n+from ...utils import auto_docstring\n+from ..qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig\n+from ..qwen3_vl.modeling_qwen3_vl import Qwen3VLForConditionalGeneration, Qwen3VLModel\n+\n+\n+@auto_docstring(checkpoint=\"nvidia/Cosmos3-Nano\")\n+@strict\n+class Cosmos3Config(Qwen3VLConfig):\n+ r\"\"\"\n+ Configuration for the [Cosmos3](https://huggingface.co/nvidia/Cosmos3-Nano) Reasoner tower.\n+\n+ The Reasoner tower is architecturally identical to Qwen3-VL, so this config inherits all\n+ fields from [`Qwen3VLConfig`] and only changes 
`model_type` so that conversion mappings\n+ and key-renaming rules dispatch correctly when loading a unified Cosmos3 checkpoint.\n+ \"\"\"\n+\n+ model_type = \"cosmos3_omni\"\n+\n+\n+_COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS = [\n+ # Generator (image / video diffusion) MoT expert + cross-modal projections\n+ r\"_moe_gen\",\n+ r\"^llm2vae\\.\",\n+ r\"^vae2llm\\.\",\n+ r\"^time_embedder\\.\",\n+ # Sound tower\n+ r\"^llm2sound\\.\",\n+ r\"^sound2llm\\.\",\n+ r\"^sound_modality_embed$\",\n+ # Action tower\n+ r\"^llm2action\\.\",\n+ r\"^action2llm\\.\",\n+ r\"^action_modality_embed$\",\n+]\n+\n+\n+class Cosmos3Model(Qwen3VLModel):\n+ config: Cosmos3Config\n+\n+ # Base-model loading from a unified Cosmos3 checkpoint drops the Generator tower,\n+ # cross-modal adapters, and the causal-LM head.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS + [\n+ r\"^lm_head\\.weight$\"\n+ ]\n+\n+\n+class Cosmos3ForConditionalGeneration(Qwen3VLForConditionalGeneration):\n+ config: Cosmos3Config\n+\n+ # The unified Cosmos3 checkpoint stores both the Reasoner tower (loaded here) and the\n+ # Generator tower / cross-modal adapters (dropped). These patterns silence the\n+ # \"unexpected keys\" warning for parameters that belong to the dropped components.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS\n+\n+\n+__all__ = [\n+ \"Cosmos3Config\",\n+ \"Cosmos3ForConditionalGeneration\",\n+ \"Cosmos3Model\",\n+]"}
	{"id": "pr_46146_file_src_transformers_processing_utils.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/processing_utils.py", "additions": 10, "deletions": 1, "text": "PR #46146 — file change: src/transformers/processing_utils.py\nStatus: modified \| +10 -1\n\n@@ -876,7 +876,16 @@ def get_text_with_replacements(\n expanded_sample.append(text[batch_idx][last:start])\n \n mm_type = m.lastgroup\n- replacement_text = next(replacements_iters[mm_type])\n+ replacement_text = next(replacements_iters[mm_type], None)\n+ if replacement_text is None:\n+ # No replacement available for this modality — leave the\n+ # placeholder in place so the tokenizer can still encode it\n+ # as a special token. This happens during text-only passes\n+ # (e.g. vLLM's dummy profiling) where the prompt contains\n+ # placeholders but no mm data is provided.\n+ expanded_sample.append(m.group())\n+ last = end\n+ continue\n replacement_offsets.append(\n {\n \"type\": mm_type,"}
	{"id": "pr_46146_file_src_transformers_utils_auto_docstring.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/utils/auto_docstring.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/utils/auto_docstring.py\nStatus: modified \| +1 -0\n\n@@ -74,6 +74,7 @@\n \"x-clip\": \"XCLIPConfig\",\n \"kosmos2\": \"Kosmos2Config\",\n \"kosmos2-5\": \"Kosmos2_5Config\",\n+ \"cosmos3\": \"Cosmos3Config\",\n \"donut\": \"DonutSwinConfig\",\n \"esmfold\": \"EsmConfig\",\n \"parakeet\": \"ParakeetCTCConfig\","}
	{"id": "pr_46146_file_tests_models_cosmos3___init__.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "tests/models/cosmos3/__init__.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: tests/models/cosmos3/__init__.py\nStatus: added \| +1 -0\n\n@@ -0,0 +1 @@\n+"}
	{"id": "pr_46146_file_tests_models_cosmos3_test_modeling_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "tests/models/cosmos3/test_modeling_cosmos3.py", "additions": 114, "deletions": 0, "text": "PR #46146 — file change: tests/models/cosmos3/test_modeling_cosmos3.py\nStatus: added \| +114 -0\n\n@@ -0,0 +1,114 @@\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\"\"\"Testing suite for the PyTorch Cosmos3 model.\"\"\"\n+\n+import copy\n+import unittest\n+\n+from transformers import AutoConfig, Cosmos3Config, is_torch_available\n+from transformers.conversion_mapping import get_checkpoint_conversion_mapping\n+from transformers.core_model_loading import WeightRenaming, rename_source_key\n+from transformers.testing_utils import require_torch\n+\n+\n+if is_torch_available():\n+ from transformers import AutoModel, AutoModelForImageTextToText, Cosmos3ForConditionalGeneration, Cosmos3Model\n+\n+\n+def get_tiny_cosmos3_config():\n+ return Cosmos3Config(\n+ text_config={\n+ \"vocab_size\": 99,\n+ \"hidden_size\": 32,\n+ \"intermediate_size\": 64,\n+ \"num_hidden_layers\": 
1,\n+ \"num_attention_heads\": 4,\n+ \"num_key_value_heads\": 2,\n+ \"head_dim\": 8,\n+ \"max_position_embeddings\": 64,\n+ \"pad_token_id\": 0,\n+ \"rope_parameters\": {\n+ \"rope_type\": \"default\",\n+ \"mrope_section\": [16, 8, 8],\n+ \"mrope_interleaved\": True,\n+ \"rope_theta\": 10000,\n+ },\n+ },\n+ vision_config={\n+ \"depth\": 1,\n+ \"hidden_size\": 32,\n+ \"hidden_act\": \"gelu_pytorch_tanh\",\n+ \"intermediate_size\": 64,\n+ \"num_heads\": 4,\n+ \"patch_size\": 16,\n+ \"spatial_merge_size\": 1,\n+ \"temporal_patch_size\": 2,\n+ \"out_hidden_size\": 32,\n+ \"num_position_embeddings\": 16,\n+ \"deepstack_visual_indexes\": [0],\n+ },\n+ image_token_id=3,\n+ video_token_id=4,\n+ vision_start_token_id=5,\n+ vision_end_token_id=6,\n+ tie_word_embeddings=False,\n+ pad_token_id=0,\n+ )\n+\n+\n+class Cosmos3ConfigTest(unittest.TestCase):\n+ def test_auto_config_mapping(self):\n+ config = AutoConfig.for_model(\"cosmos3_omni\")\n+\n+ self.assertIsInstance(config, Cosmos3Config)\n+ self.assertEqual(config.model_type, \"cosmos3_omni\")\n+\n+\n+class Cosmos3ConversionMappingTest(unittest.TestCase):\n+ def test_checkpoint_conversion_mapping_targets_unified_checkpoint_namespaces(self):\n+ mapping = get_checkpoint_conversion_mapping(\"cosmos3_omni\")\n+ renamings = [entry for entry in mapping if isinstance(entry, WeightRenaming)]\n+\n+ self.assertEqual(\n+ rename_source_key(\"model.layers.0.self_attn.q_proj.weight\", renamings, [])[0],\n+ \"model.language_model.layers.0.self_attn.q_proj.weight\",\n+ )\n+ self.assertEqual(\n+ rename_source_key(\"blocks.0.norm1.weight\", renamings, [])[0],\n+ \"model.visual.blocks.0.norm1.weight\",\n+ )\n+ self.assertEqual(\n+ rename_source_key(\"merger.mlp.0.weight\", renamings, [])[0],\n+ \"model.visual.merger.mlp.0.weight\",\n+ )\n+\n+ already_nested_key = \"model.language_model.layers.0.self_attn.q_proj.weight\"\n+ self.assertEqual(rename_source_key(already_nested_key, renamings, [])[0], 
already_nested_key)\n+\n+\n+@require_torch\n+class Cosmos3ModelTest(unittest.TestCase):\n+ def test_auto_model_mappings(self):\n+ config = get_tiny_cosmos3_config()\n+\n+ self.assertIsInstance(AutoModel.from_config(copy.deepcopy(config)), Cosmos3Model)\n+ self.assertIsInstance(\n+ AutoModelForImageTextToText.from_config(copy.deepcopy(config)), Cosmos3ForConditionalGeneration\n+ )\n+\n+ def test_unified_checkpoint_unexpected_keys_are_ignored(self):\n+ self.assertIn(r\"_moe_gen\", Cosmos3Model._keys_to_ignore_on_load_unexpected)\n+ self.assertIn(r\"^llm2sound\\.\", Cosmos3ForConditionalGeneration._keys_to_ignore_on_load_unexpected)\n+ self.assertIn(r\"^lm_head\\.weight$\", Cosmos3Model._keys_to_ignore_on_load_unexpected)\n+ self.assertNotIn(r\"^lm_head\\.weight$\", Cosmos3ForConditionalGeneration._keys_to_ignore_on_load_unexpected)"}
	{"id": "pr_46146_file_tests_utils_test_processing_utils.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "tests/utils/test_processing_utils.py", "additions": 52, "deletions": 0, "text": "PR #46146 — file change: tests/utils/test_processing_utils.py\nStatus: added \| +52 -0\n\n@@ -0,0 +1,52 @@\n+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\n+import unittest\n+\n+from transformers.processing_utils import ProcessorMixin\n+\n+\n+class DummyMultimodalProcessor(ProcessorMixin):\n+ pass\n+\n+\n+class ProcessorMixinTextReplacementTest(unittest.TestCase):\n+ def get_processor(self):\n+ processor = DummyMultimodalProcessor()\n+ processor.image_token = \"<image>\"\n+ processor.video_token = \"<video>\"\n+ return processor\n+\n+ def test_get_text_with_replacements_preserves_missing_replacement_placeholders(self):\n+ processor = self.get_processor()\n+\n+ text, replacement_offsets = processor.get_text_with_replacements(\n+ [\"Look <image> then <video> then <image>.\"],\n+ images_replacements=[\"<image><image>\"],\n+ videos_replacements=[\"<video><video>\"],\n+ )\n+\n+ self.assertEqual(text, [\"Look 
<image><image> then <video><video> then <image>.\"])\n+ self.assertEqual(\n+ [offset[\"replacement\"] for offset in replacement_offsets[0]],\n+ [\"<image><image>\", \"<video><video>\"],\n+ )\n+\n+ def test_get_text_with_replacements_preserves_placeholder_when_no_modality_data_is_provided(self):\n+ processor = self.get_processor()\n+\n+ text, replacement_offsets = processor.get_text_with_replacements([\"Profile <image> without image data.\"])\n+\n+ self.assertEqual(text, [\"Profile <image> without image data.\"])\n+ self.assertEqual(replacement_offsets, [[]])"}
	{"id": "pr_46145", "type": "pr", "number": 46145, "title": "Fix load_adapter OOM caused by full-model warmup sizing", "state": "open", "author": "Yooniel", "labels": [], "created_at": "2026-05-21T15:59:30Z", "updated_at": "2026-05-21T16:17:56Z", "url": "https://github.com/huggingface/transformers/pull/46145", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46145: Fix load_adapter OOM caused by full-model warmup sizing\nState: open \| Merged: False\nAuthor: Yooniel \| Base: main\nLabels: \nCreated: 2026-05-21T15:59:30Z\n\n# What does this PR do?\r\n\r\nFixes an OOM in `load_adapter` on configurations where the base model occupies more than ~half of GPU memory, e.g. Gemma-3-27B in bf16 on a single H100/H200 or Llama-70B on a single 80 GB GPU.\r\n\r\n## Root cause\r\n\r\n`load_adapter` passes every named parameter on the model, base model included, as `expected_keys` to `_load_pretrained_model`. Downstream, `caching_allocator_warmup` sums those into a full base-model byte count and issues a single same-size allocation on top of the already-resident base model, OOMing.\r\n\r\n```text\r\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.87 GiB.\r\nGPU 0 has a total capacity of 94.50 GiB of which 41.85 GiB is free.\r\nIncluding non-PyTorch memory, this process has 52.64 GiB memory in use.\r\n```\r\n\r\nThe allocation attempt, 51.87 GiB, is essentially the size of the base model already resident on the GPU.\r\n\r\n## Fix\r\n\r\nHoist the existing `is_adapter_key` helper above the `_load_pretrained_model` call and apply it to `expected_keys`, so warmup is sized only from adapter parameters. The downstream `missing_keys` filter that already used the helper is preserved.\r\n\r\n## Tests\r\n\r\nAdds a regression test that captures the device map passed to `caching_allocator_warmup` during `load_adapter` and asserts it contains only adapter-owned parameter names, not base-model names. 
Without the fix, the test fails with 84 base-model parameter names leaking into the warmup.\r\n\r\n```bash\r\nmake style\r\nRUN_SLOW=1 python -m unittest tests.peft_integration.test_peft_integration.PeftIntegrationTester.test_peft_load_adapter_warmup_uses_adapter_expected_keys -v\r\n```\r\n\r\nAlso verified the original GH200 repro locally: before the fix, `load_adapter` tried to allocate 51.87 GiB and OOMed; after the fix, the adapter loads successfully.\r\n\r\n## Related\r\n\r\n- #36483, #36428, #36742 — same warmup, fixed for the base-model loading path only; the adapter path was untouched.\r\n- #44637 / #44660 — adjacent open issue/PR about a different `load_adapter` OOM (state-dict materialization in `load_best_model_at_end`), not warmup over-allocation.\r\n\r\nNo associated issue was filed; this is a focused bugfix PR with a local repro, root-cause analysis, and regression test.\r\n\r\n## Code Agent Policy\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.\r\n- [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [x] Did you write any new necessary tests?\r\n\r\n## Who can review?\r\n\r\n- @CyrilVallez (model loading): this change touches the `caching_allocator_warmup` path.\r\n- @BenjaminBossan (PEFT integration): this change is in `integrations/peft.py` and concerns adapter loading semantics.\r\n\n\n--- Comment by github-actions[bot] at 2026-05-21T16:17:56Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46145&sha=399e5f"}
	{"id": "pr_46145_file_src_transformers_integrations_peft.py", "type": "pr_diff", "number": 46145, "title": "Fix load_adapter OOM caused by full-model warmup sizing", "state": "open", "author": "Yooniel", "labels": [], "created_at": "2026-05-21T15:59:30Z", "updated_at": "2026-05-21T16:17:56Z", "url": "https://github.com/huggingface/transformers/pull/46145", "merged": false, "base_branch": "main", "filename": "src/transformers/integrations/peft.py", "additions": 10, "deletions": 10, "text": "PR #46145 — file change: src/transformers/integrations/peft.py\nStatus: modified \| +10 -10\n\n@@ -583,6 +583,13 @@ def load_adapter(\n # Create and add fresh new adapters into the model, unless the weights are hotswapped\n inject_adapter_in_model(peft_config, self, adapter_name)\n \n+ adapter_key_markers = {adapter_name}\n+ if peft_config is not None and getattr(peft_config, \"peft_type\", None) is not None:\n+ adapter_key_markers.add(peft_config.peft_type.value.lower())\n+\n+ def is_adapter_key(key: str) -> bool:\n+ return any(marker in key for marker in adapter_key_markers)\n+\n if not self._hf_peft_config_loaded:\n self._hf_peft_config_loaded = True\n \n@@ -670,9 +677,9 @@ def load_adapter(\n state_dict=adapter_state_dict,\n checkpoint_files=checkpoint_files,\n load_config=load_config,\n- # pass expected keys explicitly, otherwise they are determined from the state_dict, which can contain\n- # unexpected entries, like \"layer.SCB\" from a bnb layer.\n- expected_keys=[n for n, _ in self.named_parameters()],\n+ # Pass expected keys explicitly while excluding non-adapter parameters.\n+ # Otherwise `caching_allocator_warmup` sizes for the full base model.\n+ expected_keys=[n for n, _ in self.named_parameters() if is_adapter_key(n)],\n )\n \n if peft_config.inference_mode:\n@@ -683,13 +690,6 @@ def load_adapter(\n if isinstance(module, BaseTunerLayer):\n module.requires_grad_(False)\n \n- adapter_key_markers = {adapter_name}\n- if peft_config is not None and getattr(peft_config, 
\"peft_type\", None) is not None:\n- adapter_key_markers.add(peft_config.peft_type.value.lower())\n-\n- def is_adapter_key(key: str) -> bool:\n- return any(marker in key for marker in adapter_key_markers)\n-\n loading_info.missing_keys = {k for k in loading_info.missing_keys if is_adapter_key(k)}\n \n log_state_dict_report("}
	{"id": "pr_46145_file_tests_peft_integration_test_peft_integration.py", "type": "pr_diff", "number": 46145, "title": "Fix load_adapter OOM caused by full-model warmup sizing", "state": "open", "author": "Yooniel", "labels": [], "created_at": "2026-05-21T15:59:30Z", "updated_at": "2026-05-21T16:17:56Z", "url": "https://github.com/huggingface/transformers/pull/46145", "merged": false, "base_branch": "main", "filename": "tests/peft_integration/test_peft_integration.py", "additions": 52, "deletions": 0, "text": "PR #46145 — file change: tests/peft_integration/test_peft_integration.py\nStatus: modified \| +52 -0\n\n@@ -694,6 +694,58 @@ def test_peft_add_adapter_with_state_dict_low_cpu_mem_usage(self):\n # after loading, no meta device should be remaining\n self.assertFalse(any((p.device.type == \"meta\") for p in model.parameters()))\n \n+ def test_peft_load_adapter_warmup_uses_adapter_expected_keys(self):\n+ \"\"\"\n+ Check that adapter loading only warms up memory for adapter parameters.\n+ \"\"\"\n+ from peft import LoraConfig\n+\n+ import transformers.modeling_utils as modeling_utils\n+\n+ adapter_name = \"warmup_test_adapter\"\n+ adapter_key_markers = (adapter_name, \"lora\")\n+\n+ for model_id in self.transformers_test_model_ids:\n+ for transformers_class in self.transformers_test_model_classes:\n+ model = transformers_class.from_pretrained(model_id).to(torch_device)\n+\n+ peft_config = LoraConfig()\n+ template_model = transformers_class.from_pretrained(model_id)\n+ template_model.add_adapter(LoraConfig(), adapter_name=adapter_name)\n+ dummy_state_dict = {\n+ name: torch.zeros_like(param)\n+ for name, param in template_model.named_parameters()\n+ if any(marker in name for marker in adapter_key_markers)\n+ }\n+ del template_model\n+ self.assertTrue(dummy_state_dict)\n+\n+ captured_device_maps = []\n+ original_warmup = modeling_utils.caching_allocator_warmup\n+\n+ def capture_warmup(model, expanded_device_map, hf_quantizer):\n+ 
captured_device_maps.append(dict(expanded_device_map))\n+\n+ modeling_utils.caching_allocator_warmup = capture_warmup\n+ try:\n+ with CaptureLogger(logging.get_logger(\"transformers.integrations.peft\")):\n+ model.load_adapter(\n+ adapter_state_dict=dummy_state_dict,\n+ adapter_name=adapter_name,\n+ peft_config=peft_config,\n+ )\n+ finally:\n+ modeling_utils.caching_allocator_warmup = original_warmup\n+\n+ self.assertTrue(captured_device_maps)\n+ warmed_keys = set().union(*(device_map.keys() for device_map in captured_device_maps))\n+ self.assertTrue(warmed_keys)\n+\n+ unexpected_base_keys = [\n+ key for key in warmed_keys if not any(marker in key for marker in adapter_key_markers)\n+ ]\n+ self.assertEqual(unexpected_base_keys, [])\n+\n def test_peft_from_pretrained_hub_kwargs(self):\n \"\"\"\n Tests different combinations of PEFT model + from_pretrained + hub kwargs"}
	{"id": "pr_46142", "type": "pr", "number": 46142, "title": "Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config", "state": "open", "author": "Charly21r", "labels": [], "created_at": "2026-05-21T13:17:26Z", "updated_at": "2026-05-21T13:42:33Z", "url": "https://github.com/huggingface/transformers/pull/46142", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46142: Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config\nState: open \| Merged: False\nAuthor: Charly21r \| Base: main\nLabels: \nCreated: 2026-05-21T13:17:26Z\n\n# What does this PR do?\r\n\r\nFixes #46121\r\n\r\n`RotaryEmbeddingConfigMixin.ignore_keys_at_rope_validation` is a `set` at the class level, but JSON has no `set` type, so any `config.json` that serializes this field (e.g. checkpoints written by LoRA merge / export tooling like `ms-swift`) loads it back as a `list` instance attribute that shadows the class default. `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` then does:\r\n\r\n`self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}`\r\n\r\nwhich raises `TypeError: unsupported operand type(s) for \|: 'list' and 'set'` whenever `partial_rotary_factor` is also set on the config. 
In practice this prevents serving such merged checkpoints (observed downstream in vLLM with merged Qwen3.5 checkpoints).\r\n\r\nThis PR coerces the attribute to a set before the union in `src/transformers/modeling_rope_utils.py`, and adds a regression test in `tests/utils/test_modeling_rope_utils.py` covering both direct attribute assignment and the `from_dict` round-trip path that mirrors the JSON-deserialization flow.\r\n\r\n## Code Agent Policy\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?\r\n- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link\r\n to it if that's the case.\r\n- [x] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\n@Rocketknight1 \n\n--- Comment by github-actions[bot] at 2026-05-21T13:33:18Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46142&sha=1267e6\n\n--- Comment by Rocketknight1 at 2026-05-21T13:42:33Z ---\nLGTM, CI issues seem unrelated. It's core code though, so I'll wait for a core maintainer to approve in case I'm totally wrong here! cc @arthurzucker @cyrilvallez @vasqu "}
	{"id": "pr_46142_file_src_transformers_modeling_rope_utils.py", "type": "pr_diff", "number": 46142, "title": "Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config", "state": "open", "author": "Charly21r", "labels": [], "created_at": "2026-05-21T13:17:26Z", "updated_at": "2026-05-21T13:42:33Z", "url": "https://github.com/huggingface/transformers/pull/46142", "merged": false, "base_branch": "main", "filename": "src/transformers/modeling_rope_utils.py", "additions": 1, "deletions": 1, "text": "PR #46142 — file change: src/transformers/modeling_rope_utils.py\nStatus: modified \| +1 -1\n\n@@ -719,7 +719,7 @@ def convert_rope_params_to_dict(self, **kwargs):\n partial_rotary_factor = kwargs.get(\"partial_rotary_factor\", getattr(self, \"partial_rotary_factor\", None))\n if partial_rotary_factor is not None:\n self.rope_parameters.setdefault(\"partial_rotary_factor\", partial_rotary_factor)\n- self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\n+ self.ignore_keys_at_rope_validation = set(self.ignore_keys_at_rope_validation) \| {\"partial_rotary_factor\"}\n \n self.standardize_rope_params()\n return kwargs"}
	{"id": "pr_46142_file_tests_utils_test_modeling_rope_utils.py", "type": "pr_diff", "number": 46142, "title": "Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config", "state": "open", "author": "Charly21r", "labels": [], "created_at": "2026-05-21T13:17:26Z", "updated_at": "2026-05-21T13:42:33Z", "url": "https://github.com/huggingface/transformers/pull/46142", "merged": false, "base_branch": "main", "filename": "tests/utils/test_modeling_rope_utils.py", "additions": 22, "deletions": 0, "text": "PR #46142 — file change: tests/utils/test_modeling_rope_utils.py\nStatus: modified \| +22 -0\n\n@@ -136,6 +136,28 @@ def test_yarn_original_original_max_position_embeddings_validation(self):\n self.assertEqual(len(logs.output), 1)\n self.assertIn(\"implicit factor\", logs.output[0])\n \n+ def test_convert_rope_params_to_dict_with_list_ignore_keys(self):\n+ # Regression test: `ignore_keys_at_rope_validation` becomes a list when loaded from a config.json\n+ # (JSON has no set type). 
`convert_rope_params_to_dict` used to do `list \| set` and crash with\n+ # TypeError when `partial_rotary_factor` was also set.\n+ config = LlamaConfig(partial_rotary_factor=0.25)\n+ config.ignore_keys_at_rope_validation = [\"mrope_section\", \"mrope_interleaved\"]\n+\n+ config.convert_rope_params_to_dict(partial_rotary_factor=0.25)\n+\n+ self.assertIsInstance(config.ignore_keys_at_rope_validation, set)\n+ self.assertEqual(\n+ config.ignore_keys_at_rope_validation,\n+ {\"mrope_section\", \"mrope_interleaved\", \"partial_rotary_factor\"},\n+ )\n+\n+ # Round-trip through from_dict to mimic the JSON-deserialized path that triggered this in production.\n+ cfg_dict = config.to_dict()\n+ cfg_dict[\"ignore_keys_at_rope_validation\"] = [\"mrope_section\", \"mrope_interleaved\"]\n+ reloaded = LlamaConfig.from_dict(cfg_dict)\n+ reloaded.convert_rope_params_to_dict(partial_rotary_factor=0.25)\n+ self.assertIsInstance(reloaded.ignore_keys_at_rope_validation, set)\n+\n def test_rope_validation_with_per_attention_type_nested_rope(self):\n \"\"\"Mirrors `test_rope_validation` with `config.layer_types` set, so that\n `rope_parameters` takes the per-attention-type nested shape.\"\"\""}
	{"id": "pr_46141", "type": "pr", "number": 46141, "title": "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions", "state": "open", "author": "ryota-komatsu", "labels": [], "created_at": "2026-05-21T12:43:29Z", "updated_at": "2026-05-21T13:04:23Z", "url": "https://github.com/huggingface/transformers/pull/46141", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46141: Fix FSDP2 and distributed checkpointing imports for older PyTorch versions\nState: open \| Merged: False\nAuthor: ryota-komatsu \| Base: main\nLabels: \nCreated: 2026-05-21T12:43:29Z\n\n# What does this PR do?\r\n\r\nThis PR updates the PyTorch version constraints for specific distributed features to prevent `ImportError` and `ModuleNotFoundError` crashes on older PyTorch versions:\r\n- Bumps the minimum PyTorch requirement for FSDP2 from `>=2.5` to `>=2.6`.\r\n- Add a minimum PyTorch requirement of `>=2.7` for distributed checkpoint saving.\r\n\r\nCurrently, attempting to initialize FSDP2 with `torch==2.5` results in an import error because `CPUOffloadPolicy`, `MixedPrecisionPolicy`, and `OffloadPolicy` are not available in 'torch.distributed.fsdp' for that version.\r\n\r\nSimilarly, attempting to use distributed checkpointing on versions earlier than `torch==2.7` crashes because `HuggingFaceStorageWriter` does not exist in `torch.distributed.checkpoint.hf_storage`.\r\n\r\nTracebacks\r\n```\r\ntransformers/distributed/fsdp.py\", line 34, in <module>\r\n from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, OffloadPolicy\r\nImportError: cannot import name 'CPUOffloadPolicy' from 'torch.distributed.fsdp'\r\n```\r\n\r\n```\r\ntransformers/distributed/utils.py\", line 42, in <module>\r\n from torch.distributed.checkpoint.hf_storage import HuggingFaceStorageWriter\r\nModuleNotFoundError: No module named 'torch.distributed.checkpoint.hf_storage'\r\n```\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] 
This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link\r\n to it if that's the case.\r\n- [ ] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [ ] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\nAnyone in the community is free to review the PR once the tests have passed. Feel free to tag\r\nmembers/contributors who may be interested in your PR.\r\n\r\n- distributed: @3outeille @ArthurZucker\n\n--- Comment by github-actions[bot] at 2026-05-21T13:04:23Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46141&sha=8f98cb"}
	{"id": "pr_46141_file_src_transformers_distributed_fsdp.py", "type": "pr_diff", "number": 46141, "title": "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions", "state": "open", "author": "ryota-komatsu", "labels": [], "created_at": "2026-05-21T12:43:29Z", "updated_at": "2026-05-21T13:04:23Z", "url": "https://github.com/huggingface/transformers/pull/46141", "merged": false, "base_branch": "main", "filename": "src/transformers/distributed/fsdp.py", "additions": 5, "deletions": 5, "text": "PR #46141 — file change: src/transformers/distributed/fsdp.py\nStatus: modified \| +5 -5\n\n@@ -28,7 +28,7 @@\n if is_torch_available():\n import torch\n \n-if is_torch_available() and is_torch_greater_or_equal(\"2.5\"):\n+if is_torch_available() and is_torch_greater_or_equal(\"2.6\"):\n import torch.distributed as dist\n from torch.distributed._composable.fsdp import fully_shard\n from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, OffloadPolicy\n@@ -91,8 +91,8 @@ def initialize_fsdp(\n if fsdp_plan is None:\n return device_map, device_mesh, None\n \n- if not is_torch_greater_or_equal(\"2.5\"):\n- raise OSError(\"FSDP2 is only supported for `torch>=2.5`.\")\n+ if not is_torch_greater_or_equal(\"2.6\"):\n+ raise OSError(\"FSDP2 is only supported for `torch>=2.6`.\")\n \n if device_mesh is None:\n # Detect the accelerator on the machine\n@@ -338,8 +338,8 @@ def apply_fully_shard_data_parallel(\n if not is_torch_available():\n raise ImportError(\"PyTorch is required for FSDP support\")\n \n- if not is_torch_greater_or_equal(\"2.5\"):\n- raise OSError(\"FSDP2 requires torch>=2.5\")\n+ if not is_torch_greater_or_equal(\"2.6\"):\n+ raise OSError(\"FSDP2 requires torch>=2.6\")\n \n if fsdp_plan is None:\n fsdp_plan = {}"}
	{"id": "pr_46141_file_src_transformers_distributed_utils.py", "type": "pr_diff", "number": 46141, "title": "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions", "state": "open", "author": "ryota-komatsu", "labels": [], "created_at": "2026-05-21T12:43:29Z", "updated_at": "2026-05-21T13:04:23Z", "url": "https://github.com/huggingface/transformers/pull/46141", "merged": false, "base_branch": "main", "filename": "src/transformers/distributed/utils.py", "additions": 9, "deletions": 1, "text": "PR #46141 — file change: src/transformers/distributed/utils.py\nStatus: modified \| +9 -1\n\n@@ -39,14 +39,16 @@\n if is_torch_available():\n import torch\n import torch.distributed.checkpoint as dcp\n- from torch.distributed.checkpoint.hf_storage import HuggingFaceStorageWriter\n from torch.distributed.checkpoint.state_dict import (\n get_model_state_dict,\n get_optimizer_state_dict,\n set_optimizer_state_dict,\n )\n from torch.distributed.tensor import DTensor\n \n+ if is_torch_greater_or_equal(\"2.7\"):\n+ from torch.distributed.checkpoint.hf_storage import HuggingFaceStorageWriter\n+\n \n def _ensure_torch_distributed(device_type: str):\n \"\"\"Initialize torch.distributed if not already initialized.\"\"\"\n@@ -103,6 +105,9 @@ def init_device_mesh(distributed_config: DistributedConfig) -> torch.distributed\n if not is_torch_greater_or_equal(\"2.5\"):\n raise OSError(\"Distributed training with DistributedConfig requires `torch>=2.5`.\")\n \n+ if distributed_config.fsdp_size > 1 and not is_torch_greater_or_equal(\"2.6\"):\n+ raise OSError(\"FSDP2 requires `torch>=2.6`.\")\n+\n device_type = torch._C._get_accelerator().type\n _ensure_torch_distributed(device_type)\n \n@@ -205,6 +210,9 @@ def save_model_checkpoint_distributed(model, checkpoint_dir: str) -> None:\n gate\|\|up MoE weights) are replicated to a full tensor on every rank\n before the save, otherwise DCP cannot encode that placement.\n \"\"\"\n+ if not is_torch_greater_or_equal(\"2.7\"):\n+ 
raise OSError(\"Distributed checkpoint saving requires `torch>=2.7`.\")\n+\n state_dict = get_model_state_dict(model)\n for key, value in list(state_dict.items()):\n if ("}
	{"id": "pr_46140", "type": "pr", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46140: Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads\nState: closed \| Merged: False\nAuthor: adityasingh2400 \| Base: main\nLabels: Code agent slop\nCreated: 2026-05-21T11:23:35Z\n\n# What does this PR do?\n\nFixes #46082.\n\n`LlamaAttention` already sizes its projections from `num_attention_heads * head_dim` rather than `hidden_size`, so a config where `hidden_size % num_attention_heads != 0` is well-defined as long as `head_dim` is explicitly provided. The divisibility check in `LlamaConfig.validate_architecture` fires unconditionally though (it runs after `__post_init__` has filled in the fallback `head_dim`, so checking `head_dim is not None` in the validator doesn't work).\n\nThis PR follows the approach @matdou outlined in the issue:\n\n- Capture `self._head_dim_was_explicit = self.head_dim is not None` in `__post_init__` before falling back to the derived value.\n- Gate the divisibility error in `validate_architecture` on `not self._head_dim_was_explicit`.\n\n`_head_dim_was_explicit` is recomputed in `__post_init__`, so save/reload via `save_pretrained` / `from_pretrained` works without persisting the flag (the saved `head_dim` is the explicit value, so the flag is set correctly on reload).\n\nThe original validation error is preserved when `head_dim` is not explicitly provided.\n\n## Reproduction (from the issue)\n\n```python\nfrom transformers import LlamaConfig, LlamaForCausalLM\n\nconfig = LlamaConfig(\n vocab_size=99,\n hidden_size=512,\n 
intermediate_size=1024,\n num_hidden_layers=1,\n num_attention_heads=9,\n num_key_value_heads=1,\n head_dim=56,\n)\nmodel = LlamaForCausalLM(config)\n```\n\nPasses after this change, raises before.\n\n## Tests\n\nAdded two cases in `tests/models/llama/test_modeling_llama.py`:\n\n- `head_dim` explicit + non-divisible dims, config accepted, model instantiates.\n- `head_dim` omitted + non-divisible dims, original `ValueError` still raised.\n\n## Who can review?\n\n@ArthurZucker @Cyrilvallez\n\nCredit to @matdou for the diagnosis in the issue comments.\n\n--- Comment by github-actions[bot] at 2026-05-21T11:31:52Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: arcee, aria, cwm, deepseek_v2, eurobert, higgs_audio_v2, hrm_text, jais2, llama\n\n--- Comment by github-actions[bot] at 2026-05-21T11:50:16Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46140&sha=70eb70\n\n--- Comment by adityasingh2400 at 2026-05-21T12:08:04Z ---\nCI note: the 4 failing tests on `tests_torch` / `tests_training_ci` / `tests_tensor_parallel_ci` are all in `tests/models/cohere2_moe/test_modeling_cohere2_moe.py` and reproduce identically on an unrelated PR opened a few minutes after this one (see #46136). The failing assertions are:\n\n- `Cohere2MoeModelTest::test_training_overfit`, `AssertionError: 0.27068585289520714 not greater than 0.9` (the exact same float value reproduces across runs, so it is deterministic, not flaky)\n- `Cohere2MoeModelTest::test_tp_forward` / `test_tp_backward` / `test_tp_generation`, `KeyError: 'rowwise'` raised by the TP partition spec on a Cohere2 MoE layer\n\nBoth classes of failure were introduced when `cohere2_moe` landed yesterday (#46115 on 2026-05-20). My change is scoped to `LlamaConfig` and the modular-converted descendants (arcee, aria, cwm, deepseek_v2, eurobert, higgs_audio_v2, hrm_text, jais2). 
`cohere2_moe` does not derive from Llama and is not touched by this PR.\n\nHappy to file a separate PR for the cohere2_moe breakage if no one is on it already, but flagging it here so this PR is not held on CI red that is upstream of it."}
	{"id": "pr_46140_file_src_transformers_models_arcee_configuration_arcee.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/arcee/configuration_arcee.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/arcee/configuration_arcee.py\nStatus: modified | +6 -1\n\n@@ -101,6 +101,11 @@ class ArceeConfig(PreTrainedConfig):\n head_dim: int | None = None\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (ArceeAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -110,7 +115,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_aria_configuration_aria.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/aria/configuration_aria.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/aria/configuration_aria.py\nStatus: modified | +6 -1\n\n@@ -104,6 +104,11 @@ class AriaTextConfig(PreTrainedConfig):\n moe_num_shared_experts: int = 2\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (AriaTextAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -113,7 +118,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_cwm_configuration_cwm.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cwm/configuration_cwm.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/cwm/configuration_cwm.py\nStatus: modified | +6 -1\n\n@@ -127,6 +127,11 @@ def __post_init__(self, **kwargs):\n self.sliding_window = int(self.sliding_window) if self.sliding_window else None\n self.layer_types = list(self.layer_types)\n self.eos_token_id = self.eos_token_id if self.eos_token_id is not None else [128001, 128008, 128009]\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (CwmAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -136,7 +141,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_deepseek_v2_configuration_deepseek_v2.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/deepseek_v2/configuration_deepseek_v2.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/deepseek_v2/configuration_deepseek_v2.py\nStatus: modified | +6 -1\n\n@@ -139,6 +139,11 @@ class DeepseekV2Config(PreTrainedConfig):\n \n def __post_init__(self, **kwargs):\n self.head_dim = self.qk_rope_head_dim\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (DeepseekV2Attention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -148,7 +153,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_eurobert_configuration_eurobert.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/eurobert/configuration_eurobert.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/eurobert/configuration_eurobert.py\nStatus: modified | +6 -1\n\n@@ -113,6 +113,11 @@ class EuroBertConfig(PreTrainedConfig):\n def __post_init__(self, **kwargs):\n if self.num_key_value_heads is None:\n self.num_key_value_heads = self.num_attention_heads\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (EuroBertAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -122,7 +127,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_higgs_audio_v2_configuration_higgs_audio_v2.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/higgs_audio_v2/configuration_higgs_audio_v2.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/higgs_audio_v2/configuration_higgs_audio_v2.py\nStatus: modified | +6 -1\n\n@@ -133,6 +133,11 @@ def __post_init__(self, **kwargs):\n \"original_max_position_embeddings\": 1024,\n \"rope_type\": \"llama3\",\n }\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (HiggsAudioV2Attention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -142,7 +147,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_hrm_text_configuration_hrm_text.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/hrm_text/configuration_hrm_text.py", "additions": 1, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/hrm_text/configuration_hrm_text.py\nStatus: modified | +1 -1\n\n@@ -140,7 +140,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_jais2_configuration_jais2.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/jais2/configuration_jais2.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/jais2/configuration_jais2.py\nStatus: modified | +6 -1\n\n@@ -102,6 +102,11 @@ class Jais2Config(PreTrainedConfig):\n layer_norm_eps: float = 1e-5\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (Jais2Attention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -111,7 +116,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_llama_configuration_llama.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/llama/configuration_llama.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/llama/configuration_llama.py\nStatus: modified | +6 -1\n\n@@ -105,6 +105,11 @@ class LlamaConfig(PreTrainedConfig):\n head_dim: int | None = None\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (LlamaAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -114,7 +119,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_tests_models_llama_test_modeling_llama.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "tests/models/llama/test_modeling_llama.py", "additions": 38, "deletions": 0, "text": "PR #46140 — file change: tests/models/llama/test_modeling_llama.py\nStatus: modified \| +38 -0\n\n@@ -35,6 +35,7 @@\n import torch\n \n from transformers import (\n+ LlamaConfig,\n LlamaForCausalLM,\n LlamaModel,\n LlamaTokenizer,\n@@ -57,6 +58,43 @@ class LlamaModelTest(CausalLMModelTest, unittest.TestCase):\n # used in `test_torch_compile_for_training`\n _torch_compile_train_cls = LlamaForCausalLM if is_torch_available() else None\n \n+ def test_config_explicit_head_dim_with_non_divisible_hidden_size(self):\n+ # Regression test for https://github.com/huggingface/transformers/issues/46082\n+ # `LlamaAttention` sizes its projections from `num_attention_heads * head_dim`,\n+ # so an explicit `head_dim` should be allowed even when `hidden_size` is not\n+ # divisible by `num_attention_heads`.\n+ config = LlamaConfig(\n+ vocab_size=99,\n+ hidden_size=512,\n+ intermediate_size=1024,\n+ num_hidden_layers=1,\n+ num_attention_heads=9,\n+ num_key_value_heads=1,\n+ head_dim=56,\n+ )\n+ self.assertEqual(config.head_dim, 56)\n+ # Model construction should succeed with the matching projection shapes.\n+ model = LlamaForCausalLM(config)\n+ self.assertEqual(\n+ model.model.layers[0].self_attn.q_proj.weight.shape,\n+ (config.num_attention_heads * config.head_dim, config.hidden_size),\n+ )\n+\n+ def test_config_implicit_head_dim_with_non_divisible_hidden_size_still_raises(self):\n+ # Regression preventer: omitting 
`head_dim` with non-divisible dims must\n+ # still raise, since the auto-derived `head_dim = hidden_size // num_attention_heads`\n+ # would silently truncate.\n+ with self.assertRaises(Exception) as ctx:\n+ LlamaConfig(\n+ vocab_size=99,\n+ hidden_size=512,\n+ intermediate_size=1024,\n+ num_hidden_layers=1,\n+ num_attention_heads=9,\n+ num_key_value_heads=1,\n+ )\n+ self.assertIn(\"not a multiple\", str(ctx.exception))\n+\n \n @require_torch_accelerator\n class LlamaIntegrationTest(unittest.TestCase):"}
	{"id": "pr_46138", "type": "pr", "number": 46138, "title": "chore: update self-comment-ci.yml", "state": "open", "author": "hf-security-analysis[bot]", "labels": [], "created_at": "2026-05-21T09:57:53Z", "updated_at": "2026-05-21T10:10:08Z", "url": "https://github.com/huggingface/transformers/pull/46138", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46138: chore: update self-comment-ci.yml\nState: open \| Merged: False\nAuthor: hf-security-analysis[bot] \| Base: main\nLabels: \nCreated: 2026-05-21T09:57:53Z\n\nUpdate `.github/workflows/self-comment-ci.yml` workflow configuration.\n\ncc @guarin @molbap\n\nCloses huggingface/tracking-issues#487\n<!--slack ts:1779357475.432589 channel:C0AJSP0D53L-->\n\n--- Comment by HuggingFaceDocBuilderDev at 2026-05-21T10:10:08Z ---\nThe docs for this PR live [here](https://moon-ci-docs.huggingface.co/docs/transformers/pr_46138). All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update."}
	{"id": "pr_46138_file_.github_workflows_self-comment-ci.yml", "type": "pr_diff", "number": 46138, "title": "chore: update self-comment-ci.yml", "state": "open", "author": "hf-security-analysis[bot]", "labels": [], "created_at": "2026-05-21T09:57:53Z", "updated_at": "2026-05-21T10:10:08Z", "url": "https://github.com/huggingface/transformers/pull/46138", "merged": false, "base_branch": "main", "filename": ".github/workflows/self-comment-ci.yml", "additions": 2, "deletions": 2, "text": "PR #46138 — file change: .github/workflows/self-comment-ci.yml\nStatus: modified \| +2 -2\n\n@@ -89,9 +89,9 @@ jobs:\n PR_COMMENT: ${{ github.event.comment.body }}\n run: \|\n python -m pip install GitPython\n- python utils/pr_slow_ci_models.py --message \"$PR_COMMENT\" \| tee output.txt\n+ printf '%s' \"$PR_COMMENT\" \| python utils/pr_slow_ci_models.py --message-stdin \| tee output.txt\n echo \"models=$(tail -n 1 output.txt)\" >> $GITHUB_ENV\n- python utils/pr_slow_ci_models.py --message \"$PR_COMMENT\" --quantization \| tee output2.txt\n+ printf '%s' \"$PR_COMMENT\" \| python utils/pr_slow_ci_models.py --message-stdin --quantization \| tee output2.txt\n echo \"quantizations=$(tail -n 1 output2.txt)\" >> $GITHUB_ENV\n \n - name: Show models to test"}
	{"id": "pr_46137", "type": "pr", "number": 46137, "title": "Update self-comment-ci", "state": "closed", "author": "guarin", "labels": [], "created_at": "2026-05-21T09:41:02Z", "updated_at": "2026-05-21T09:57:27Z", "url": "https://github.com/huggingface/transformers/pull/46137", "merged": true, "base_branch": "main", "text": "PULL REQUEST #46137: Update self-comment-ci\nState: closed \| Merged: True\nAuthor: guarin \| Base: main\nLabels: \nCreated: 2026-05-21T09:41:02Z\n\n# What does this PR do?\r\n\r\n<!--\r\nCongratulations! You've made it this far! You're not quite done yet though.\r\n\r\nOnce merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.\r\n\r\nThen, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.\r\n\r\nOnce you're done, someone will review your PR shortly (see the section \"Who can review?\" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.\r\n-->\r\n\r\n<!-- Remove if not applicable -->\r\n\r\nFixes # (issue)\r\n\r\n## Code Agent Policy\r\n\r\nThe Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by\r\ncode agents. We are currently bottlenecked by our ability to review and respond to them. As a result, \r\nwe ask that new users do not submit pure code agent PRs at this time. \r\nYou may use code agents in drafting or to help you diagnose issues. 
We'd also ask autonomous \"OpenClaw\"-like agents\r\nnot to open any PRs or issues for the moment.\r\n\r\nPRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this\r\nrepeatedly or maliciously. \r\n\r\nThis is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, \r\nthis policy is likely to be updated regularly in the near future. For more information, please read [`CONTRIBUTING.md`](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md).\r\n\r\n- [ ] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link\r\n to it if that's the case.\r\n- [ ] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [ ] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\nAnyone in the community is free to review the PR once the tests have passed. 
Feel free to tag\r\nmembers/contributors who may be interested in your PR.\r\n\r\n<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @\r\n\r\n If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.\r\n Please tag fewer than 3 people.\r\n\r\nModels:\r\n\r\n- text models: @ArthurZucker @Cyrilvallez\r\n- vision models: @yonigozlan @molbap\r\n- audio models: @eustlb @ebezzam @vasqu\r\n- multimodal models: @zucchini-nlp\r\n- graph models: @clefourrier\r\n\r\nLibrary:\r\n\r\n- generate: @zucchini-nlp (visual-language models) or @gante (all others)\r\n- continuous batching: @remi-or @ArthurZucker @McPatate\r\n- pipelines: @Rocketknight1\r\n- tokenizers: @ArthurZucker and @itazap\r\n- trainer: @SunMarc\r\n- attention: @vasqu @ArthurZucker @CyrilVallez\r\n- model loading (from pretrained, etc): @CyrilVallez\r\n- distributed: @3outeille @ArthurZucker\r\n- CIs: @ydshieh\r\n\r\nIntegrations:\r\n\r\n- ray/raytune: @richardliaw, @amogkam\r\n- Big Model Inference: @SunMarc\r\n- quantization: @SunMarc\r\n- kernels: @drbh\r\n- peft: @BenjaminBossan @githubnemo\r\n\r\nDevices/Backends:\r\n\r\n- AMD ROCm: @ivarflakstad\r\n- Intel XPU: @IlyasMoutawwakil\r\n- Ascend NPU: @ivarflakstad \r\n\r\nDocumentation: @stevhliu\r\n\r\nResearch projects are not maintained and should be taken as is.\r\n\r\n -->\r\n\n\n--- Comment by HuggingFaceDocBuilderDev at 2026-05-21T09:52:56Z ---\nThe docs for this PR live [here](https://moon-ci-docs.huggingface.co/docs/transformers/pr_46137). All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update."}
	{"id": "pr_46137_file_.github_workflows_self-comment-ci.yml", "type": "pr_diff", "number": 46137, "title": "Update self-comment-ci", "state": "closed", "author": "guarin", "labels": [], "created_at": "2026-05-21T09:41:02Z", "updated_at": "2026-05-21T09:57:27Z", "url": "https://github.com/huggingface/transformers/pull/46137", "merged": true, "base_branch": "main", "filename": ".github/workflows/self-comment-ci.yml", "additions": 1, "deletions": 1, "text": "PR #46137 — file change: .github/workflows/self-comment-ci.yml\nStatus: modified \| +1 -1\n\n@@ -28,7 +28,7 @@ env:\n jobs:\n get-pr-number:\n name: Get PR number\n- if: ${{ github.event.issue.state == 'open' && contains(fromJSON('[\"ydshieh\", \"ArthurZucker\", \"zucchini-nlp\", \"molbap\", \"LysandreJik\", \"Cyrilvallez\", \"Rocketknight1\", \"SunMarc\", \"eustlb\", \"vasqu\", \"ivarflakstad\", \"stevhliu\", \"ebezzam\", \"remi-or\", \"itazap\", \"3outeille\", \"IlyasMoutawwakil\", \"tarekziade\", \"yonigozlan\"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') \|\| startsWith(github.event.comment.body, 'run slow') \|\| startsWith(github.event.comment.body, 'run_slow')) }}\n+ if: ${{ github.event.issue.state == 'open' && contains(fromJSON('[\"ydshieh\", \"ArthurZucker\", \"zucchini-nlp\", \"molbap\", \"LysandreJik\", \"Cyrilvallez\", \"Rocketknight1\", \"SunMarc\", \"eustlb\", \"vasqu\", \"ivarflakstad\", \"stevhliu\", \"ebezzam\", \"remi-or\", \"itazap\", \"3outeille\", \"IlyasMoutawwakil\", \"tarekziade\", \"yonigozlan\", \"guarin\"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') \|\| startsWith(github.event.comment.body, 'run slow') \|\| startsWith(github.event.comment.body, 'run_slow')) }}\n uses: ./.github/workflows/get-pr-number.yml\n \n get-pr-info:"}
	{"id": "pr_46136", "type": "pr", "number": 46136, "title": "Fix is_last off-by-one in MaskGenerationPipeline for partial batches", "state": "open", "author": "J3r3myPerera", "labels": [], "created_at": "2026-05-21T07:50:15Z", "updated_at": "2026-05-21T18:21:07Z", "url": "https://github.com/huggingface/transformers/pull/46136", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46136: Fix is_last off-by-one in MaskGenerationPipeline for partial batches\nState: open \| Merged: False\nAuthor: J3r3myPerera \| Base: main\nLabels: \nCreated: 2026-05-21T07:50:15Z\n\nFixes #46123\r\n\r\nMaskGenerationPipeline.preprocess used i == n_points - points_per_batch to spot the last batch. When n_points isn't a multiple of points_per_batch, that's never true — PipelinePackIterator hits StopIteration and quietly drops the last batch's results.\r\n\r\nFix: i + points_per_batch >= n_points.\r\nTwo fast unit tests in test_pipelines_mask_generation.py: one for the partial-batch case (100 points, batch 64), one for an exact multiple (128 points, batch 64).\r\n\r\n`python -m pytest tests/pipelines/test_pipelines_mask_generation.py::MaskGenerationPipelineTests::test_preprocess_is_last_partial_batch tests/pipelines/test_pipelines_mask_generation.py::MaskGenerationPipelineTests::test_preprocess_is_last_exact_multiple -v\r\n`\r\n#2 passed\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? 
Please add a link\r\n to it if that's the case.\r\nDiscussed in https://github.com/huggingface/transformers/issues/46123\r\n- [ ] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [x] Did you write any new necessary tests?\r\nAdded test_preprocess_is_last_partial_batch and test_preprocess_is_last_exact_multiple to tests/pipelines/test_pipelines_mask_generation.py.\r\n\r\n\r\n## Who can review?\r\n\r\ncc @Rocketknight1 @yonigozlan @qubvel\r\n\n\n--- Comment by Shashank-Tripathi-07 at 2026-05-21T11:42:39Z ---\nHey bro, the code looks good but you said you didn't use AI agents to make this PR but there are em dashes very visible on the comment you made and also in the original issue. This can be a problem as the repo doesn't like Agent Slop even 1%. Take a look again for safety on this. \n\n--- Comment by J3r3myPerera at 2026-05-21T11:49:22Z ---\nThe CI failures here are pre-existing on main — not caused by this change.\r\nci/circleci: tests_tensor_parallel_ci — all 3 failures are in Cohere2MoeModelTest (test_tp_forward, test_tp_backward, test_tp_generation), crashing with KeyError: 'rowwise' in distributed/tensor_parallel.py. This PR doesn't touch any of that.\r\n\r\nci/circleci: tests_training_overfit_ci — 1 failure, also Cohere2MoeModelTest::test_training_overfit, loss only drops 27% vs a 90% threshold. 
Unrelated.\r\n\r\nOnly two files changed:\r\n\r\nsrc/transformers/pipelines/mask_generation.py (1 line)\r\ntests/pipelines/test_pipelines_mask_generation.py (2 tests)\r\n\r\nNeither touches Cohere2MoeModel or anything in the distributed training path.\n\n--- Comment by J3r3myPerera at 2026-05-21T11:56:15Z ---\n> Hey bro, the code looks good but you said you didn't use AI agents to make this PR but there are em dashes very visible on the comment you made and also in the original issue. This can be a problem as the repo doesn't like Agent Slop even 1%. Take a look again for safety on this.\r\n\r\nFair point, and I'll own it. I did use AI to help word the PR description and the issue comment. The fix itself I worked out on my own: i == n_points - points_per_batch only hits when n_points is an exact multiple, so any partial tail batch never gets flagged as last, PipelinePackIterator raises StopIteration and the results are quietly dropped. Replacing it with i + points_per_batch >= n_points handles both cases. I understand what the code does and why the old condition was wrong.\r\n\r\nThat said, em dashes in prose aren't really a reliable signal for agent-generated code. Plenty of people type them on purpose. The actual thing to check is whether the logic holds up. Which I'd rather be judged on.\n\n--- Comment by Rocketknight1 at 2026-05-21T12:46:45Z ---\nYou can ignore those comments, he's just annoyed I wouldn't listen when he claimed his Claude PR was human-written. In this case the actual fix is one line and seems correct, so I don't really care too much whether an agent wrote it or not. You do not actually need to go around hiding all the em-dashes :sweat_smile: \n\n--- Comment by Rocketknight1 at 2026-05-21T13:21:34Z ---\n@J3r3myPerera looks like there might be some CI instability at the moment. Can you wait a bit and then try rebasing or rerunning tests? 
Once the CI is green ping me and I'll merge it.\n\n--- Comment by github-actions[bot] at 2026-05-21T13:30:20Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46136&sha=ba5335\n\n--- Comment by HuggingFaceDocBuilderDev at 2026-05-21T13:31:34Z ---\nThe docs for this PR live [here](https://moon-ci-docs.huggingface.co/docs/transformers/pr_46136). All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.\n\n--- Comment by J3r3myPerera at 2026-05-21T18:21:07Z ---\n> @J3r3myPerera looks like there might be some CI instability at the moment. Can you wait a bit and then try rebasing or rerunning tests? Once the CI is green ping me and I'll merge it.\r\n\r\nSure mate no worries."}
	{"id": "pr_46136_file_src_transformers_pipelines_mask_generation.py", "type": "pr_diff", "number": 46136, "title": "Fix is_last off-by-one in MaskGenerationPipeline for partial batches", "state": "open", "author": "J3r3myPerera", "labels": [], "created_at": "2026-05-21T07:50:15Z", "updated_at": "2026-05-21T18:21:07Z", "url": "https://github.com/huggingface/transformers/pull/46136", "merged": false, "base_branch": "main", "filename": "src/transformers/pipelines/mask_generation.py", "additions": 1, "deletions": 1, "text": "PR #46136 — file change: src/transformers/pipelines/mask_generation.py\nStatus: modified \| +1 -1\n\n@@ -231,7 +231,7 @@ def preprocess(\n for i in range(0, n_points, points_per_batch):\n batched_points = grid_points[:, i : i + points_per_batch, :, :]\n labels = input_labels[:, i : i + points_per_batch]\n- is_last = i == n_points - points_per_batch\n+ is_last = i + points_per_batch >= n_points\n yield {\n \"input_points\": batched_points,\n \"input_labels\": labels,"}
	{"id": "pr_46136_file_tests_pipelines_test_pipelines_mask_generation.py", "type": "pr_diff", "number": 46136, "title": "Fix is_last off-by-one in MaskGenerationPipeline for partial batches", "state": "open", "author": "J3r3myPerera", "labels": [], "created_at": "2026-05-21T07:50:15Z", "updated_at": "2026-05-21T18:21:07Z", "url": "https://github.com/huggingface/transformers/pull/46136", "merged": false, "base_branch": "main", "filename": "tests/pipelines/test_pipelines_mask_generation.py", "additions": 10, "deletions": 0, "text": "PR #46136 — file change: tests/pipelines/test_pipelines_mask_generation.py\nStatus: modified \| +10 -0\n\n@@ -93,6 +93,16 @@ def get_test_pipeline(\n def run_pipeline_test(self, mask_generator, examples):\n pass\n \n+ def test_preprocess_is_last(self):\n+ mask_generator = pipeline(\"mask-generation\", model=\"hf-internal-testing/tiny-random-SamModel\")\n+ mask_generator.image_processor.pad_size = {\"height\": 24, \"width\": 24}\n+ image = \"./tests/fixtures/tests_samples/COCO/000000039769.png\"\n+ for points_per_batch in (100, 64):\n+ with self.subTest(points_per_batch=points_per_batch):\n+ batches = list(mask_generator.preprocess(image, points_per_batch=points_per_batch))\n+ self.assertTrue(batches[-1][\"is_last\"])\n+ self.assertFalse(any(b[\"is_last\"] for b in batches[:-1]))\n+\n @slow\n @require_torch\n def test_small_model_pt(self):"}


	{"id": "issue_46139", "type": "issue", "number": 46139, "title": "Discussion: optional RankSEG-style decoding for Transformers semantic segmentation post-processing", "state": "open", "author": "Leev1s", "labels": [], "created_at": "2026-05-21T10:36:05Z", "updated_at": "2026-05-21T13:12:39Z", "url": "https://github.com/huggingface/transformers/issues/46139", "text": "ISSUE #46139: Discussion: optional RankSEG-style decoding for Transformers semantic segmentation post-processing\nState: open \| Labels: \nAuthor: Leev1s \| Created: 2026-05-21T10:36:05Z\n\n[![RankSEG](https://img.shields.io/badge/RankSEG-GitHub-blue?logo=github)](https://github.com/rankseg/rankseg) [![PyPI](https://badge.fury.io/py/rankseg.svg)](https://pypi.org/project/rankseg/) [![Docs](https://readthedocs.org/projects/rankseg/badge/?version=latest)](https://rankseg.readthedocs.io/en/latest/) [![Transformers docs](https://img.shields.io/badge/docs-Transformers%20integration-brightgreen)](https://rankseg.readthedocs.io/en/latest/integrations_transformers.html) [![Notebook](https://img.shields.io/badge/notebook-Transformers-orange)](https://github.com/rankseg/rankseg/blob/main/notebooks/rankseg_with_transformers.ipynb) [![JMLR 2023](https://img.shields.io/badge/JMLR-2023-black)](https://www.jmlr.org/papers/v24/22-0712.html) [![NeurIPS 2025](https://img.shields.io/badge/NeurIPS-2025-black)](https://openreview.net/forum?id=4tRMm1JJhw)\n\nHi Transformers maintainers,\n\nI wanted to share a small downstream experiment around RankSEG-style decoding for semantic segmentation. 
The short version is: if a Transformers processor can expose resized semantic class probabilities before the final `argmax`, then users can try metric-aware post-processing methods such as RankSEG without changing the model, checkpoint, or preprocessing pipeline.\n\nThis is related to https://github.com/huggingface/transformers/issues/37715, where the discussion is about making the final `argmax` optional and allowing users to access resized class probability maps. I do not want to assume that RankSEG itself belongs in Transformers, but I think it is a useful concrete example of why probability-level semantic segmentation outputs can matter.\n\n## What I Tried\n\nRankSEG is a training-free segmentation decoding method. It takes per-class probability maps and returns a hard segmentation mask optimized for an overlap-style metric such as Dice or IoU. The relevant papers are [RankSEG, JMLR 2023](https://www.jmlr.org/papers/v24/22-0712.html) and [RankSEG-RMA, NeurIPS 2025](https://openreview.net/forum?id=4tRMm1JJhw). There is also a [RankSEG repository](https://github.com/rankseg/rankseg), a [PyPI package](https://pypi.org/project/rankseg/), and a [Transformers integration tutorial](https://rankseg.readthedocs.io/en/latest/integrations_transformers.html).\n\nThe experiment used the usual Transformers inference path first:\n\n```python\ninputs = processor(images=image, return_tensors=\"pt\")\noutputs = model(**inputs)\n```\n\nThen I compared three post-processing choices using the same `outputs`:\n\n```python\n# 1. Baseline: standard SegFormer / Transformers argmax-style decoding\nupsampled_logits = torch.nn.functional.interpolate(\n outputs.logits,\n size=target_size,\n mode=\"bilinear\",\n align_corners=False,\n)\nbaseline = upsampled_logits.argmax(dim=1)[0]\n\n# 2. 
RankSEG optimized for Dice\nrankseg_dice = rankseg_transformers.postprocess(\n outputs,\n model=model,\n target_sizes=target_sizes,\n rankseg_kwargs={\"metric\": \"dice\", \"solver\": \"RMA\", \"output_mode\": \"multiclass\"},\n)\n\n# 3. RankSEG optimized for IoU\nrankseg_iou = rankseg_transformers.postprocess(\n outputs,\n model=model,\n target_sizes=target_sizes,\n rankseg_kwargs={\"metric\": \"iou\", \"solver\": \"RMA\", \"output_mode\": \"multiclass\"},\n)\n```\n\nThe helper above is already implemented outside Transformers in RankSEG's current compatibility layer: [documentation](https://rankseg.readthedocs.io/en/latest/integrations_transformers.html), [source code](https://github.com/rankseg/rankseg/blob/main/rankseg/integration/transformers.py), [example script](https://github.com/rankseg/rankseg/blob/main/examples/transformers_rankseg.py), and [notebook](https://github.com/rankseg/rankseg/blob/main/notebooks/rankseg_with_transformers.ipynb). The same notebook can be opened in [Colab](https://colab.research.google.com/github/rankseg/rankseg/blob/main/notebooks/rankseg_with_transformers.ipynb).\n\n## Small Cityscapes Check\n\nI used `tanganke/cityscapes` only as a lightweight local check because it has a convenient `segmentation_19` ground-truth column. This is not an official Cityscapes benchmark. 
It is a small smoke test over the first 100 validation images, using samplewise macro Dice and IoU over non-empty classes.\n\n\| Model \| Method \| Mean Dice \| Dice delta \| Mean IoU \| IoU delta \|\n\| --- \| --- \| ---: \| ---: \| ---: \| ---: \|\n\| `nvidia/segformer-b0-finetuned-cityscapes-512-1024` \| Transformers argmax \| 0.4608 \| - \| 0.3898 \| - \|\n\| `nvidia/segformer-b0-finetuned-cityscapes-512-1024` \| RankSEG, `metric=\"dice\"` \| 0.4810 \| +0.0202 \| 0.4045 \| +0.0147 \|\n\| `nvidia/segformer-b0-finetuned-cityscapes-512-1024` \| RankSEG, `metric=\"iou\"` \| 0.4813 \| +0.0205 \| 0.4051 \| +0.0153 \|\n\| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` \| Transformers argmax \| 0.4743 \| - \| 0.4015 \| - \|\n\| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` \| RankSEG, `metric=\"dice\"` \| 0.4903 \| +0.0160 \| 0.4128 \| +0.0113 \|\n\| `nvidia/segformer-b1-finetuned-cityscapes-1024-1024` \| RankSEG, `metric=\"iou\"` \| 0.4907 \| +0.0164 \| 0.4134 \| +0.0118 \|\n\nThe result is modest, but it is consistent with the intended use case: the model is unchanged, and only the final decoding step changes.\n\n## Visual Examples\n\nEach image below uses the same layout: baseline `argmax` on the top left, RankSEG optimized for Dice on the top right, ground-truth overlay on the bottom left, and RankSEG optimized for IoU on the bottom right.\n\n<table>\n <tr>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/7wXl/rank_01_sample_0053_ddice_0092_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/7wXl/rank_01_sample_0053_ddice_0092_d.png\" alt=\"SegFormer-B0 Cityscapes example 1\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B0, example 1</sub>\n </td>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/aSr0/rank_02_sample_0029_ddice_0067_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/aSr0/rank_02_sample_0029_ddice_0067_d.png\" 
alt=\"SegFormer-B0 Cityscapes example 2\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B0, example 2</sub>\n </td>\n </tr>\n <tr>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/s6kI/rank_01_sample_0072_ddice_0084_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/s6kI/rank_01_sample_0072_ddice_0084_d.png\" alt=\"SegFormer-B1 Cityscapes example 1\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B1, example 1</sub>\n </td>\n <td width=\"50%\" align=\"center\">\n <a href=\"https://files.seeusercontent.com/2026/05/21/Iyn9/rank_02_sample_0003_ddice_0081_d.png\">\n <img src=\"https://files.seeusercontent.com/2026/05/21/Iyn9/rank_02_sample_0003_ddice_0081_d.png\" alt=\"SegFormer-B1 Cityscapes example 2\" width=\"100%\">\n </a>\n <br>\n <sub>SegFormer-B1, example 2</sub>\n </td>\n </tr>\n</table>\n\n## Why This Relates to Transformers Post-Processing\n\nFor simple semantic segmentation heads, restoring probabilities may look like resizing logits and applying softmax. For other model families, the post-processing path can involve class-query logits, mask logits, null classes, model-specific resizing conventions, or processor-owned logic. That is why a probability-returning option inside the existing Transformers post-processing API would be useful: the model-family-specific restoration would stay in the official processor path, while downstream methods could consume the restored probabilities.\n\nHard segmentation maps could remain the default behavior. The probability path would simply make the intermediate semantic distribution available for downstream decoding, calibration, uncertainty estimation, or metric-aware post-processing such as RankSEG.\n\n## Closing\n\nI understand that adding or changing post-processing APIs has maintenance costs, especially in a library used across many model families. I am not asking maintainers to adopt RankSEG directly. 
I mainly wanted to share a concrete downstream use case showing why resized semantic probability maps could be useful to users who want to experiment beyond `argmax`.\n\nI would also like to thank @statmlben and @ZixunWang, the RankSEG maintainers and authors of the recent RankSEG-RMA work, for developing and maintaining the RankSEG project that made this small Transformers experiment possible.\n\nIf maintainers think this direction is worth exploring, I would be happy to adapt the experiment to a preferred model family, test against a proposed API, or help write documentation/examples in the style that fits Transformers.\n\n\n--- Comment by Rocketknight1 at 2026-05-21T12:51:52Z ---\nYeah, the lack of exposed probability maps is surprising! I'll try to push this internally\n\n--- Comment by Leev1s at 2026-05-21T13:12:38Z ---\nThanks a lot @Rocketknight1, I really appreciate it!"}
	{"id": "issue_46133", "type": "issue", "number": 46133, "title": "Add TIPSv2 (Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment) by Google DeepMind", "state": "open", "author": "farrosalferro", "labels": ["New model"], "created_at": "2026-05-21T00:17:16Z", "updated_at": "2026-05-21T13:30:53Z", "url": "https://github.com/huggingface/transformers/issues/46133", "text": "ISSUE #46133: Add TIPSv2 (Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment) by Google DeepMind\nState: open \| Labels: New model\nAuthor: farrosalferro \| Created: 2026-05-21T00:17:16Z\n\n### Model description\n\nTIPSv2 is a vision-language encoder that addresses the problem of dense alignment between image regions and text through modification in the pretraining recipe (iBOT++, multi-granularity synthetic captions, and memory-saving head-only EMA scheme). Across 9 tasks and 20 datasets the resulting models set new state-of-the-art results on zero-shot semantic segmentation while generally matching or beating recent encoders like SigLIP2, DINOv3, and Perception Encoder on global and dense tasks.\n\nThe model family comes with four sizes (B, L, SO400m, and G) with a variant that comes with DPT head. All of the models are already available in [HuggingFace](https://huggingface.co/collections/google/tipsv2). The code is licensed under Apache 2.0 and the weights have CC-BY 4.0.\n\nI would be glad if I can contribute implementing this model to HuggingFace's Transformer, including model implementation, weight conversion, tests and docs. 
Happy to coordinate with maintainers about the implementation and integration.\n\nThank you.\n\n@NielsRogge @molbap @Rocketknight1 \n\n### Open source status\n\n- [x] The model implementation is available\n- [x] The model weights are available\n\n### Provide useful links for the implementation\n\n* [Official GitHub Repository](https://github.com/google-deepmind/tips)\n* [Weights](https://huggingface.co/collections/google/tipsv2)\n\n--- Comment by Rocketknight1 at 2026-05-21T13:30:53Z ---\ncc @zucchini-nlp as well for VLMs!"}
	{"id": "issue_46132", "type": "issue", "number": 46132, "title": "AttentionInterface.register changes behavior of registered function", "state": "closed", "author": "pjc15111", "labels": ["bug"], "created_at": "2026-05-20T23:24:31Z", "updated_at": "2026-05-21T14:44:37Z", "url": "https://github.com/huggingface/transformers/issues/46132", "text": "ISSUE #46132: AttentionInterface.register changes behavior of registered function\nState: closed \| Labels: bug\nAuthor: pjc15111 \| Created: 2026-05-20T23:24:31Z\n\n### System Info\n\n`transformers env` fails with `NameError: name 'CompletionCreateParamsStreaming' is not defined`\n\nI am running Ubuntu 25.10, Python 3.13.7, and pytorch 2.11.0\n\n### Who can help?\n\n@Cyrilvallez \n\n### Information\n\n- [ ] The official example scripts\n- [x] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [ ] My own task or dataset (give details below)\n\n### Reproduction\n\nIn the following code, model1 (using `attn_implementation=\"sdpa\")` produces plausible prose, while model2 (using `attn_implementation=\"reregistered_sdpa\"`) produces text somewhere between nonsense and gibberish.\n\nIt is necessary to use the `cache_implementation=\"static\"`, but the behavior seems consistent with various models.\n\n```\nfrom transformers import AutoModelForCausalLM, AttentionInterface, pipeline\nfrom transformers.integrations.sdpa_attention import sdpa_attention_forward\n\nmodel1 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", attn_implementation=\"sdpa\")\npipeline1 = pipeline(task=\"text-generation\", model=model1, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline1(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward)\nmodel2 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", 
attn_implementation=\"reregistered_sdpa\")\npipeline2 = pipeline(task=\"text-generation\", model=model2, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline2(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n```\n\n### Expected behavior\n\n`sdpa_attention_forward` should behave the same whether it is called through the pre-registered name of \"sdpa\" or is re-registered with a new name.\n\n--- Comment by Abineshabee at 2026-05-21T08:35:34Z ---\nHi! I investigated this in relation to #40362.\n\nI ran a 3-way comparison using `sshleifer/tiny-gpt2`:\n1. Normal `attn_implementation=\"sdpa\"`\n2. Re-registered sdpa without `AttentionMaskInterface` registration\n3. Re-registered sdpa with `AttentionMaskInterface` registration\n\nAll three produced different outputs, meaning simply adding the mask registration (the fix from #40362) does not fully reproduce the built-in `\"sdpa\"` behavior — at least on this tiny model.\n\nHowever, `tiny-gpt2` may be too small/random to draw firm conclusions. Could the original author confirm whether adding `AttentionMaskInterface.register(...)` alongside `AttentionInterface.register(...)` fixes the issue with Llama + `cache_implementation=\"static\"`?\n\nIf the mask registration alone does not fix it, the `cache_implementation=\"static\"` interaction may be a separate or deeper bug worth investigating independently from #40362.\n\n--- Comment by pjc15111 at 2026-05-21T13:58:33Z ---\n@Abineshabee You are absolutely right that I needed to register an AttentionMaskInterface, which is clearly documented.\n\nHowever, this does not fix the problem. 
I changed the example code to:\n\n```\nfrom transformers import AutoModelForCausalLM, AttentionInterface, AttentionMaskInterface, pipeline\nfrom transformers.integrations.sdpa_attention import sdpa_attention_forward\nfrom transformers.masking_utils import sdpa_mask\n\nmodel1 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", attn_implementation=\"sdpa\")\npipeline1 = pipeline(task=\"text-generation\", model=model1, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline1(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n\nmy_new_sdpa = sdpa_attention_forward\nAttentionMaskInterface.register(\"reregistred_sdpa\", sdpa_mask)\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward)\nmodel2 = AutoModelForCausalLM.from_pretrained(\"meta-llama/Llama-3.2-1B\", attn_implementation=\"reregistered_sdpa\")\npipeline2 = pipeline(task=\"text-generation\", model=model2, tokenizer=\"meta-llama/Llama-3.2-1B\",\n cache_implementation=\"static\")\nprint(pipeline2(\"It was a bright cold day in April, and the clocks were striking thirteen.\"))\n```\n\nI still get sensible output from the first model and garbage from the second.\n\n```\nLoading weights: 100%\|█████████████████████████████████████████████████████████████████████████████████\| 146/146 [00:00<00:00, 10794.06it/s]\n[transformers] Passing `generation_config` together with generation-related arguments=({'cache_implementation'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.\n[transformers] Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n[transformers] Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. 
(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.\n[{'generated_text': 'It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled in his breast pocket, pressed the button which sent the electric clock back the four minutes to its appointed quarters. Oh, it was a fine day for murder.\\nHe was a member of the Party, a Member of the Party, one of the inner circle, a trusted servant of Big Brother, a Party member. And he was a man with his own desires and his own needs and his own private ambitions. He wanted to be rich, to be important, to be in charge. But he could never have the life he wanted. He could never have it all. And he knew it. He had learned this long ago.\\nHe had learned it from the way his mother had looked at him when she had said to him, \"My boy, you are nothing. You are nobody. You are a dog. You are a worm. You are a nobody. You are a nobody.\"\\nAnd he had learned it from the way his Uncle Joe had looked at him when he had said to him, \"My boy, you are nothing. You are nobody. You are a dog. You are a worm. You are a nobody. You are a nobody.\"\\nAnd he had learned it from the way his Aunt Jeanie had looked at him'}]\nLoading weights: 100%\|██████████████████████████████████████████████████████████████████████████████████\| 146/146 [00:00<00:00, 9365.01it/s]\n[transformers] Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.\n[transformers] Both `max_new_tokens` (=256) and `max_length`(=20) seem to have been set. 
`max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)\n[{'generated_text': 'It was a bright cold day in April, and the clocks were striking thirteen. Winston Churchill had a large chocolate telescofficially, thuh, come the fella, thank you! npt in the round your manad, 2, 1- 1-1 1, 1, the in the sky. 19111; 369551; 7\\nW61; 5.\\n11. 9 5. 3712; 1n; 3624. 7n 111\\nYou 19.\\nItal1; 21; 5,1; 1. 9; 5; 21; 4-1; 21- 5; 4- 9- 5. 9; 6-1; 2- 5. 4; 4- 4- 4; 5- 4- 5; 4- 4- 3- 2- 3, 3- 2-\\nIt; 2- 1; 2-1; 2\\n5 4. 1 4. 1 4. 1\\nYou 5 3 1 5 1 5 1'}]\n```\n\n--- Comment by Abineshabee at 2026-05-21T14:15:09Z ---\nThanks for testing! I noticed there may be a small typo in the registration:\n\n```python\nAttentionMaskInterface.register(\"reregistred_sdpa\", sdpa_mask) # <- \"reregistred\" (missing 'e')\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward) # <- \"reregistered\" (correct)\n```\n\nThe mask is being registered under a different name than the attention function, so `attention_mask=None` may still be passed. Could you try with matching names:\n\n```python\nAttentionMaskInterface.register(\"reregistered_sdpa\", sdpa_mask) # names must match\nAttentionInterface.register(\"reregistered_sdpa\", sdpa_attention_forward)\n```\n\nIf the outputs are still wrong after fixing the typo, then this is definitely a deeper bug beyond #40362.\n\n--- Comment by pjc15111 at 2026-05-21T14:44:37Z ---\n@Abineshabee That seems to fix it."}
	{"id": "issue_46129", "type": "issue", "number": 46129, "title": "[deepseek_v4] conversion_mapping doesn't cover mtp.* paths — MTP keys silently random-init even after _keys_to_ignore is empty", "state": "open", "author": "pasta-paul", "labels": [], "created_at": "2026-05-20T21:39:47Z", "updated_at": "2026-05-21T11:32:35Z", "url": "https://github.com/huggingface/transformers/issues/46129", "text": "ISSUE #46129: [deepseek_v4] conversion_mapping doesn't cover mtp.* paths — MTP keys silently random-init even after _keys_to_ignore is empty\nState: open \| Labels: \nAuthor: pasta-paul \| Created: 2026-05-20T21:39:47Z\n\n## Summary\n\n`transformers.conversion_mapping.get_checkpoint_conversion_mapping(\"deepseek_v4\")` returns 41 `WeightRenaming` entries that rename upstream-internal naming to HF naming (`attn.` → `self_attn.`, `ffn.` → `mlp.`, `attn_norm.` → `input_layernorm.`, `attn.wq_a.` → `self_attn.q_a_proj.`, `attn.attn_sink` → `self_attn.sinks`, etc.).\n\nEntries 6–38 are anchored at `^layers\\.(\\d+)\\.` — they only fire on main-layer keys. None cover `mtp.\\d+.` paths.\n\nCombined with the existing `_keys_to_ignore_on_load_unexpected = [r\"(^\|\\.)mtp\\..\"]` regex on `DeepseekV4PreTrainedModel` (filed separately as huggingface/transformers#46127), `mtp.` keys never reach the model at all. Even after that regex is dropped* (as #46127 does), the MTP keys arrive in upstream form (`mtp.0.attn.wq_a.weight`) — but the MTP submodules expect HF naming (`mtp.0.self_attn.q_a_proj.weight`). The keys are then flagged \"unexpected\", the submodules remain \"uninitialized\", and `_initialize_weights` falls through to `_init_weights` → `init.normal_` random-initializes the MTP block.\n\n## Symptom\n\nThe model loads \"successfully\" (no errors, no warnings about missing keys after the regex is dropped), `model.mtp[0]` exists with the right structure, `from_pretrained` returns. 
But `model.mtp[0].self_attn.q_a_proj.weight` is random Gaussian, not the value in the safetensors file. Silent corruption of the MTP draft head. Any downstream calibration / quantization / inference using `model.mtp` produces garbage.\n\n## Repro\n\n```python\n# (assumes huggingface/transformers#46127 is applied — DeepseekV4NextNPredictor\n# exists, _keys_to_ignore_on_load_unexpected = [])\nfrom transformers import AutoModelForCausalLM\nmodel = AutoModelForCausalLM.from_pretrained(\"<DSv4-Flash BF16 with mtp.* keys>\")\n\n# Compare loaded vs source\nimport safetensors.torch as st\nfrom pathlib import Path\nloaded_w = model.model.mtp[0].self_attn.q_a_proj.weight\nfor shard in sorted(Path(\"<path>\").glob(\"model-*.safetensors\")):\n    with st.safe_open(shard, framework=\"pt\") as f:\n        if \"mtp.0.attn.wq_a.weight\" in f.keys():\n            source_w = f.get_tensor(\"mtp.0.attn.wq_a.weight\")\n            break\n\ndiff = (loaded_w.cpu().float() - source_w.cpu().float()).abs().max().item()\nprint(f\"max_diff = {diff}\")\n# Without conversion mapping for mtp.*: diff ≈ random Gaussian range (e.g. 0.1+)\n# With the mtp.* mapping extension: diff ≈ 0\n```\n\n## Proposed fix\n\nAdd 33 `mtp.\\d+.` equivalents mirroring the existing `^layers\\.(\\d+)\\.` entries to `_checkpoint_conversion_mapping` for the `deepseek_v4` architecture. The 6 model-level entries (`embed.`, `head.`, `norm.`, `hc_head_`) do NOT need to be mirrored — MTP doesn't have its own copy of those (it shares `embed_tokens` and `lm_head` with the main model).\n\nSpecifically, for each of these patterns, add a parallel entry anchored at `^mtp\\.(\\d+)\\.`:\n\n```\n^layers\\.(\\d+)\\.attn_norm\\. → layers.\\1.input_layernorm.\n^layers\\.(\\d+)\\.ffn_norm\\. 
→ layers.\\1.post_attention_layernorm.\n^layers\\.(\\d+)\\.hc_attn_fn$ → layers.\\1.attn_hc.fn\n^layers\\.(\\d+)\\.hc_attn_base$ → layers.\\1.attn_hc.base\n^layers\\.(\\d+)\\.hc_attn_scale$ → layers.\\1.attn_hc.scale\n^layers\\.(\\d+)\\.hc_ffn_fn$ → layers.\\1.ffn_hc.fn\n^layers\\.(\\d+)\\.hc_ffn_base$ → layers.\\1.ffn_hc.base\n^layers\\.(\\d+)\\.hc_ffn_scale$ → layers.\\1.ffn_hc.scale\n^layers\\.(\\d+)\\.attn\\. → layers.\\1.self_attn.\n^layers\\.(\\d+)\\.ffn\\. → layers.\\1.mlp.\n^layers\\.(\\d+)\\.self_attn\\.attn_sink$ → layers.\\1.self_attn.sinks\n^layers\\.(\\d+)\\.self_attn\\.(.*?)\\.wq_a\\. → layers.\\1.self_attn.\\2.q_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.*?)\\.wq_b\\. → layers.\\1.self_attn.\\2.q_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.*?)\\.wkv\\. → layers.\\1.self_attn.\\2.kv_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.*?)\\.wgate\\. → layers.\\1.self_attn.\\2.gate_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.*?)\\.wo_a\\. → layers.\\1.self_attn.\\2.o_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.(.*?)\\.wo_b\\. → layers.\\1.self_attn.\\2.o_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.wq_a\\. → layers.\\1.self_attn.q_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.wq_b\\. → layers.\\1.self_attn.q_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.wkv\\. → layers.\\1.self_attn.kv_proj.\n^layers\\.(\\d+)\\.self_attn\\.wo_a\\. → layers.\\1.self_attn.o_a_proj.\n^layers\\.(\\d+)\\.self_attn\\.wo_b\\. → layers.\\1.self_attn.o_b_proj.\n^layers\\.(\\d+)\\.self_attn\\.q_norm\\. → layers.\\1.self_attn.q_a_norm.\n^layers\\.(\\d+)\\.mlp\\.gate\\.bias$ → layers.\\1.mlp.gate.e_score_correction_bias\n^layers\\.(\\d+)\\.mlp\\.shared_experts\\.w1\\. → layers.\\1.mlp.shared_experts.gate_proj.\n^layers\\.(\\d+)\\.mlp\\.shared_experts\\.w2\\. → layers.\\1.mlp.shared_experts.down_proj.\n^layers\\.(\\d+)\\.mlp\\.shared_experts\\.w3\\. 
→ layers.\\1.mlp.shared_experts.up_proj.\n```\n\nThe entries at indexes 17–22 (compressor/indexer renames) only need to mirror if MTP can be configured with `compressed_sparse_attention` or `heavily_compressed_attention` layer_type. For DSv4-Flash, MTP uses `sliding_attention` (compressor = None — see #46127 discussion), so those 6 entries don't need to mirror, but mirroring them is harmless (the regex just won't match anything).\n\n## Runtime workaround for downstream users\n\nUntil upstream lands, here's the runtime mirror:\n\n```python\nfrom transformers.conversion_mapping import (\n    get_checkpoint_conversion_mapping,\n    register_checkpoint_conversion_mapping,\n)\nexisting = get_checkpoint_conversion_mapping(\"deepseek_v4\")\nadded = []\nfor entry in existing:\n    sp = getattr(entry, \"source_patterns\", None)\n    tp = getattr(entry, \"target_patterns\", None)\n    if sp is None or tp is None:\n        continue\n    sp_list = sp if isinstance(sp, (list, tuple)) else [sp]\n    tp_list = tp if isinstance(tp, (list, tuple)) else [tp]\n    new_sp, new_tp = [], []\n    for s, t in zip(sp_list, tp_list):\n        if isinstance(s, str) and s.startswith(r\"^layers\\.(\\d+)\\.\"):\n            new_sp.append(s.replace(r\"^layers\\.(\\d+)\\.\", r\"^mtp\\.(\\d+)\\.\", 1))\n            new_tp.append(t.replace(\"layers.\\\\1.\", \"mtp.\\\\1.\", 1))\n    if new_sp:\n        added.append(type(entry)(\n            source_patterns=new_sp if len(new_sp) > 1 else new_sp[0],\n            target_patterns=new_tp if len(new_tp) > 1 else new_tp[0],\n        ))\nregister_checkpoint_conversion_mapping(\n    \"deepseek_v4\", list(existing) + added, overwrite=True)\n```\n\n## Detection — value-verification assertion\n\nA 50-line fixture that catches this regression class (and the related layer_type bug at #46127) by comparing a loaded MTP tensor to its source:\n\n```python\nimport safetensors.torch as st\nfrom pathlib import Path\n\nloaded_w = model.model.mtp[0].self_attn.q_a_proj.weight\nsource_w = None\nfor shard in sorted(Path(model_path).glob(\"model-*.safetensors\")):\n    with 
st.safe_open(shard, framework=\"pt\") as f:\n        if \"mtp.0.attn.wq_a.weight\" in f.keys():\n            source_w = f.get_tensor(\"mtp.0.attn.wq_a.weight\")\n            break\nassert source_w is not None\ndiff = (loaded_w.cpu().float() - source_w.cpu().float()).abs().max().item()\nassert diff < 1e-4, f\"MTP weight mismatch: {diff} (silent random-init?)\"\n```\n\nThis belongs as a test under `tests/models/deepseek_v4/` paired with #46127.\n\n## Related\n\n- #46127 — adds `DeepseekV4NextNPredictor` class + `Model.mtp` ModuleList + `sliding_attention` layer_type for MTP. The class shim PR. This issue is the *companion* — even with the class shim, the conversion mapping needs to be extended for MTP keys to actually load into the new submodules.\n- vllm-project/llm-compressor#2735 — calibration-side rollup of both issues.\n- vllm-project/llm-compressor#2739 — companion mapping extension PR (for the `ARCH_TO_2D_MAPPINGS` that lives on llm-compressor's side).\n\n\n--- Comment by Rocketknight1 at 2026-05-21T11:32:35Z ---\ncc @arthurzucker for DeepSeek V4"}
	{"id": "issue_46123", "type": "issue", "number": 46123, "title": "MaskGenerationPipeline: is_last never True on final partial batch, silently dropping results", "state": "open", "author": "J3r3myPerera", "labels": ["bug"], "created_at": "2026-05-20T18:25:00Z", "updated_at": "2026-05-21T08:03:17Z", "url": "https://github.com/huggingface/transformers/issues/46123", "text": "ISSUE #46123: MaskGenerationPipeline: is_last never True on final partial batch, silently dropping results\nState: open \| Labels: bug\nAuthor: J3r3myPerera \| Created: 2026-05-20T18:25:00Z\n\n### System Info\n\ntransformers version: current main\nAffected file: src/transformers/pipelines/mask_generation.py\n\n### Who can help?\n\n_No response_\n\n### Information\n\n- [ ] The official example scripts\n- [x] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [ ] My own task or dataset (give details below)\n\n### Description\n\nI was looking into the MaskGenerationPipeline and noticed that when you set a points_per_batch value that doesn't divide evenly into the total number of grid points, the pipeline quietly drops the results from the last batch — no error, no warning, just missing masks.\n\nThe root cause is this line in preprocess:\n`is_last = i == n_points - points_per_batch`\n\neg: n_points=100, points_per_batch=64. The loop runs at i=0 and i=64. At i=64, the check asks 64 == 100-64 which is 64 == 36 — always False. So the final batch never gets flagged as the last one.\n\nThe pipeline's PipelinePackIterator relies on this is_last flag to know when to stop accumulating results. 
When it never sees is_last=True, it calls next() on an already-finished generator, hits StopIteration, and exits — leaving the last batch's masks on the floor.\n\nWith SAM's default point grid, n_points is rarely a round multiple of the default points_per_batch=64, so this silently affects most real-world usage.\n\n### Reproduction\n\n```python\nfrom transformers import pipeline\nfrom PIL import Image\nimport requests\n\nimage = Image.open(requests.get(\"http://images.cocodataset.org/val2017/000000039769.jpg\", stream=True).raw)\n\ngenerator = pipeline(\"mask-generation\", model=\"facebook/sam-vit-base\")\n\n# points_per_batch=50 causes n_points % points_per_batch != 0 for typical grids\noutputs_partial = generator(image, points_per_batch=50)\noutputs_full = generator(image, points_per_batch=None) # all at once, no batching\n\n# outputs_partial[\"masks\"] will have fewer masks than outputs_full[\"masks\"]\nprint(len(outputs_partial[\"masks\"]), \"vs\", len(outputs_full[\"masks\"]))\n``` \n\n### Expected behavior\n\nAll generated masks should be returned regardless of whether n_points is a multiple of points_per_batch.\n\n--- Comment by ADiTyaRaj8969 at 2026-05-21T04:34:18Z ---\nHey @J3r3myPerera, reproduced this locally on current main. Root cause is exactly what you described — `is_last = i == n_points - points_per_batch` only fires when `n_points` is divisible by `points_per_batch`, so for most real grids `PipelinePackIterator` never sees `is_last=True` and drops the final accumulator on `StopIteration`.\n\nIf you're not already on it, I'd like to take this. Plan is:\n\n- one-line predicate change in `src/transformers/pipelines/mask_generation.py::preprocess`: `is_last = i + points_per_batch >= n_points`. 
This is true exactly once per loop (on the final iteration, whose slice reaches or runs past the end of `grid_points`), and works whether or not the division is exact.\n- fast offline regression test in `tests/pipelines/test_pipelines_mask_generation.py` that exercises `preprocess` directly with a mocked `image_processor` and `model`. Covers `(n_points, points_per_batch)` pairs `(100, 64)`, `(100, 50)`, `(1024, 50)`, `(7, 3)`, `(5, 5)`, `(4, 8)`. For each it asserts: number of yielded batches equals `ceil(n_points / points_per_batch)`, the final batch has `is_last=True`, and every earlier batch has `is_last=False`.\n\nConfirmed no existing open PR for this with `gh pr list --repo huggingface/transformers --state open --search \"46123 in:body\"`. Happy to hand back if you'd rather take it yourself, or wait for a maintainer to pick it up.\n\n--- Comment by J3r3myPerera at 2026-05-21T04:43:37Z ---\nHi @ADiTyaRaj8969, I am already on this and found the same one-line fix. I have the fix ready locally."}
	{"id": "issue_46121", "type": "issue", "number": 46121, "title": "`convert_rope_params_to_dict` raises `TypeError` when `ignore_keys_at_rope_validation` is a JSON-loaded list", "state": "open", "author": "Charly21r", "labels": ["bug"], "created_at": "2026-05-20T15:30:41Z", "updated_at": "2026-05-21T11:18:15Z", "url": "https://github.com/huggingface/transformers/issues/46121", "text": "ISSUE #46121: `convert_rope_params_to_dict` raises `TypeError` when `ignore_keys_at_rope_validation` is a JSON-loaded list\nState: open \| Labels: bug\nAuthor: Charly21r \| Created: 2026-05-20T15:30:41Z\n\n### System Info\n\n- transformers: 5.8.1\n- Python: 3.14.0\n- OS: macOS (reproduced locally; same class of failure reported with vLLM + Qwen3.5 merged HF checkpoints on Linux eval jobs)\n- Model family: Qwen3_5TextConfig / Qwen3.5\n\n### Who can help?\n\n@ArthurZucker @Cyrilvallez \n\n### Information\n\n- [ ] The official example scripts\n- [x] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [x] My own task or dataset (give details below)\n\n\n### Description\n\n`RotaryEmbeddingConfigMixin.convert_rope_params_to_dict()` raises a `TypeError` when `ignore_keys_at_rope_validation` is a list (e.g. deserialized from JSON in a `config.json`) because the union with `{\"partial_rotary_factor\"}` is performed without normalizing the operand to a set first:\n\n```\nTypeError: unsupported operand type(s) for \|: 'list' and 'set'\n```\n\nIn 5.8.1, `modeling_rope_utils.py` line 722:\n\n```python\nself.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\n```\n\n`ignore_keys_at_rope_validation` is a class attribute on `RotaryEmbeddingConfigMixin` (`set()` by default; configs like `Qwen3_5TextConfig` set it to `{\"mrope_section\", \"mrope_interleaved\"}`). 
When a `config.json` contains this field as a JSON array, `from_dict` / `__init__` sets it as an instance attribute (list), which shadows the class-level set. The next call to `convert_rope_params_to_dict` then evaluates `list \| set` and crashes.\n\n\n### Reproduction\n\n```python\nimport transformers\nfrom transformers import Qwen3_5TextConfig\n\nprint(f\"transformers version: {transformers.__version__}\") # 5.8.1\n\ncfg = Qwen3_5TextConfig.from_dict({\n \"model_type\": \"qwen3_5_text\",\n \"vocab_size\": 100,\n \"hidden_size\": 64,\n \"num_hidden_layers\": 2,\n \"num_attention_heads\": 2,\n \"num_key_value_heads\": 2,\n \"ignore_keys_at_rope_validation\": [\"mrope_section\", \"mrope_interleaved\"], # list, as from JSON\n \"partial_rotary_factor\": 0.25,\n \"rope_parameters\": {\"rope_type\": \"default\", \"rope_theta\": 10_000_000},\n})\n\nprint(type(cfg.ignore_keys_at_rope_validation).__name__) # 'list' (shadows class-level set)\n\ncfg.convert_rope_params_to_dict(partial_rotary_factor=0.25)\n```\n\nTraceback:\n\n```text\n File \".../transformers/modeling_rope_utils.py\", line 722, in convert_rope_params_to_dict\n self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~\nTypeError: unsupported operand type(s) for \|: 'list' and 'set'\n```\n\n### Expected behavior\n\n`convert_rope_params_to_dict` should accept list or set (or any iterable of strings) and normalize before union.\n\n\n### Actual behavior\n\n`TypeError: unsupported operand type(s) for \|: 'list' and 'set'` whenever `ignore_keys_at_rope_validation` is a list (as JSON forces) and `partial_rotary_factor` is not `None`.\n\n\n\n### Downstream impact (vLLM / merged checkpoints)\n\nWe hit this in production when serving merged Qwen3.5 Hugging Face checkpoints with vLLM. LoRA merge / export tools (e.g. 
ms-swift) write `ignore_keys_at_rope_validation` into `config.json`:\n\n```json\n\"ignore_keys_at_rope_validation\": [\"mrope_section\", \"mrope_interleaved\"]\n```\n\nJSON has no set type, so the field becomes a list at load time. During serving stack startup, config / RoPE initialization can hit `convert_rope_params_to_dict` with that list-typed instance attribute → `TypeError` → the server never becomes healthy and batch eval jobs fail before any inference.\n\nThis is a Transformers robustness issue (callers should not crash on list input); vLLM is the serving stack where we observe it. Current workarounds: strip `ignore_keys_at_rope_validation` from checkpoint JSON before `vllm serve`, or monkey-patch `modeling_rope_utils.py` to wrap with `set()` before `\|`.\n\n### Suggested fix\n\nCoerce to `set` before every `\|` union involving `ignore_keys_at_rope_validation`. The single-line change in `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` covers the reported path.\n\n\nI'm happy to open a PR against `main` that:\n\n1. Adds the `set(...)` coercion on the union line in `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` (and any other union sites I find).\n2. Adds a regression test covering JSON-list input for `ignore_keys_at_rope_validation` — both via direct call and via `from_dict` round-trip — so this can't silently regress again across refactors.\n\nJust let me know if you'd prefer a different shape (e.g. normalizing in `__setattr__`, or hardening `validate_rope` directly) and I'll match that.\n\n\n--- Comment by he-yufeng at 2026-05-20T15:52:11Z ---\nI rechecked current `main` and this path looks fixed there already: `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` now accepts `ignore_keys_at_rope_validation` separately and normalizes it with `set(...)` before adding `partial_rotary_factor`.\n\nSo the crash looks reproducible on 5.8.1, but I don't see the same `list \| set` path on `main` anymore. 
Could you retry against current `main` / the next nightly to confirm whether this only needs a release, or whether there is a second path still producing a list-valued ignore set?\r\n\n\n--- Comment by Charly21r at 2026-05-20T17:08:27Z ---\nJust ran the repro against pip install git+https://github.com/huggingface/transformers@main (resolves to 5.8.0.dev0, HEAD 52b82b2 as of 2026-05-20). Same TypeError at modeling_rope_utils.py:722. Full output:\n\n```\ntransformers version: 5.8.0.dev0\nAfter from_dict: type(cfg.ignore_keys_at_rope_validation) = list, value = ['mrope_section', 'mrope_interleaved']\n...\nFile \".../transformers/modeling_rope_utils.py\", line 722, in convert_rope_params_to_dict\n self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\nTypeError: unsupported operand type(s) for \|: 'list' and 'set'\n```\n\nSo main is still affected. Happy to push the PR (set() coercion + regression test) whenever you confirm the shape you want.\n\n--- Comment by Rocketknight1 at 2026-05-21T11:18:15Z ---\n@Charly21r yeah, this seems real. The set coercion line seems like the right fix, if you want to make that PR."}
	{"id": "issue_46097", "type": "issue", "number": 46097, "title": "Path Traversal in Sharded Checkpoint Loader via Unsanitized `weight_map` Entries in `.index.json`", "state": "open", "author": "karnakarreddi", "labels": ["bug"], "created_at": "2026-05-20T05:12:29Z", "updated_at": "2026-05-21T05:00:05Z", "url": "https://github.com/huggingface/transformers/issues/46097", "text": "ISSUE #46097: Path Traversal in Sharded Checkpoint Loader via Unsanitized `weight_map` Entries in `.index.json`\nState: open \| Labels: bug\nAuthor: karnakarreddi \| Created: 2026-05-20T05:12:29Z\n\n### System Info\n\nDetails\nThe vulnerable code is in get_checkpoint_shard_files in hub.py. When loading a sharded checkpoint from a local directory, the function reads an index JSON file and extracts shard filenames from the weight_map field without any validation:\n\nwith open(index_filename) as f:\n index = json.loads(f.read())\n\nshard_filenames = sorted(set(index[\"weight_map\"].values()))\nThese filenames are then joined directly to the model directory path:\n\nif os.path.isdir(pretrained_model_name_or_path):\n shard_filenames = [os.path.join(pretrained_model_name_or_path, subfolder, f) for f in shard_filenames]\n return shard_filenames, sharded_metadata\nThere is no check for:\n\nPath traversal sequences (..)\nAbsolute path prefixes (/)\nSymbolic links\nWhether the resolved paths remain within the model directory\nThe returned file paths are passed back to the caller (_get_resolved_checkpoint_files in modeling_utils.py), which uses them to load model weights — effectively enabling reads of arbitrary files the process has access to.\n\nWhy existing guards are insufficient\nThe caller _get_resolved_checkpoint_files only validates that the index file itself exists on disk (via os.path.isfile on the .safetensors.index.json path). It does not inspect or sanitize the contents of the index file before passing them to get_checkpoint_shard_files. 
An attacker-controlled directory needs only contain a valid index JSON file to satisfy this check.\n\nThe cached_files function (called for non-local/Hub models) does include file existence checks, but the local directory branch in get_checkpoint_shard_files returns immediately after os.path.join — cached_files is never reached for local paths.\n\n\nhttps://github.com/huggingface/transformers/blob/ba06e3fbdf355c363ac067ebcda210017e90a852/src/transformers/utils/hub.py#L836 \n\n### Who can help?\n\n@Cyrilvallez \n\n### Information\n\n- [ ] The official example scripts\n- [ ] My own modified scripts\n\n### Tasks\n\n- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)\n- [ ] My own task or dataset (give details below)\n\n### Reproduction\n\nPoC\nStep 1: Create a malicious model directory\nmkdir -p /tmp/malicious_model\nStep 2: Create a crafted index file\nWrite the following to /tmp/malicious_model/model.safetensors.index.json:\n\n{\n \"metadata\": {\n \"total_size\": 1000\n },\n \"weight_map\": {\n \"model.layer.weight\": \"../../etc/passwd\",\n \"model.embed.weight\": \"../../etc/hostname\"\n }\n}\nStep 3: Trigger the vulnerability\nfrom transformers import AutoModel\n\n The loading pipeline will:\n 1. Find model.safetensors.index.json in the local directory\n 2. Set is_sharded = True\n 3. Call get_checkpoint_shard_files, which will return:\n [\"/tmp/malicious_model/../../etc/passwd\",\n \"/tmp/malicious_model/../../etc/hostname\"]\n These resolve to /etc/passwd and /etc/hostname\nmodel = AutoModel.from_pretrained(\"/tmp/malicious_model\")\n\n\n### Expected behavior\n\nObserve the result\nget_checkpoint_shard_files returns the traversed paths without error. The downstream model loading code will attempt to open and read these files as tensor data. 
While the files may fail to deserialize as valid safetensors, the file contents are accessed by the process, and depending on error handling, logging, or exception messages, data may be exposed.\n\nA more targeted attack could point shard paths at:\n\nOther users' cached model files in ~/.cache/huggingface/\nAPI tokens stored in ~/.cache/huggingface/token\nApplication configuration or secrets files\nAny file readable by the process\n\nVulnerability type: Arbitrary file read via path traversal (CWE-20 / CWE-22)\n\nWho is affected:\n\nAny user or automated system that loads models from untrusted local directories using from_pretrained or any code path that invokes get_checkpoint_shard_files.\nML pipelines and platforms that accept user-uploaded model directories (e.g., evaluation platforms, model hosting services, shared compute environments).\nDevelopers who download and load models from sources outside the Hugging Face Hub without additional validation.\nAttack prerequisites:\n\nThe attacker must be able to provide a local directory (or a directory downloaded/extracted from an untrusted source) that the victim passes to from_pretrained.\nNo authentication or special privileges are required beyond the ability to place files on the filesystem.\nRecommended fix:\nSanitize all filenames extracted from the weight_map before constructing paths:\n\nReject any filename containing .. 
components or absolute path prefixes.\nAfter joining paths, validate that the resolved path (via os.path.realpath) remains within the expected model directory.\nConsider rejecting filenames with path separator characters entirely, since shard files should be flat names like model-00001-of-00003.safetensors.\nExample fix:\n\nimport os\n\nif os.path.isdir(pretrained_model_name_or_path):\n base_dir = os.path.realpath(os.path.join(pretrained_model_name_or_path, subfolder))\n safe_paths = []\n for f in shard_filenames:\n full_path = os.path.realpath(os.path.join(base_dir, f))\n if not full_path.startswith(base_dir + os.sep):\n raise ValueError(\n f\"Shard filename '{f}' in the checkpoint index resolves outside \"\n f\"the model directory. This may indicate a malicious index file.\"\n )\n safe_paths.append(full_path)\n return safe_paths, sharded_metadata\n\n\n\n--- Comment by Rocketknight1 at 2026-05-20T11:03:25Z ---\nHmn, this doesn't seem like a serious bug, right? The attacker would have to induce the user to load a malicious model, and the only result would be that it tries to read a file on the user's local system and fails because that file is not safetensors format. Even in the unlikely event that sensitive data is contained in the error message, the attacker would have no access to it because it's entirely local to the user machine.\n\n--- Comment by karnakarreddi at 2026-05-20T11:35:33Z ---\nMy main concern is that weight_map entries are treated as trusted filesystem paths without any boundary validation. Even if safetensors parsing fails later, the loader still resolves and opens attacker-controlled paths outside the model directory. so i kept severity as medium though..\n\n--- Comment by matdou at 2026-05-20T11:40:13Z ---\nThe \"entirely local\" argument assumes single-user deployments. Any service that calls from_pretrained on a user-supplied path, so evaluation APIs, CI pipelines, etc. is exposed. 
\nThe fix is a handful of lines of path validation (negligible). Seems worth merging.\n\n--- Comment by karnakarreddi at 2026-05-20T11:44:40Z ---\nThanks, that matches my concern as well. I can also put together a small PR with the boundary validation if that would help move this forward.\n\n\n--- Comment by karnakarreddi at 2026-05-21T05:00:05Z ---\nhttps://github.com/huggingface/transformers/pull/46134 I created small PR. @Rocketknight1 Please have a look whenever you get some time. "}
	{"id": "issue_46095", "type": "issue", "number": 46095, "title": "[deepseekv4]Does Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?", "state": "open", "author": "young-creator", "labels": ["Feature request"], "created_at": "2026-05-20T04:39:29Z", "updated_at": "2026-05-21T18:46:46Z", "url": "https://github.com/huggingface/transformers/issues/46095", "text": "ISSUE #46095: [deepseekv4]Does Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?\nState: open \| Labels: Feature request\nAuthor: young-creator \| Created: 2026-05-20T04:39:29Z\n\n### Feature request\n\nFor [deepseekv4], the weight names provided in the Hugging Face DeepSeek-V4-Flash weights seem not to match the Transformers weight names. Does Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?\n\n<img width=\"3098\" height=\"1744\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/63eed220-9a7f-48dd-83dc-328b7b1ea22c\" />\n\n<img width=\"1792\" height=\"524\" alt=\"Image\" src=\"https://github.com/user-attachments/assets/9d477055-40ad-4517-bef0-b5bdc5bba08f\" />\n\n### Motivation\n\nDoes Transformers provide a weight conversion script to convert the Hugging Face weights into a format that can be read by Transformers from_pretrained?\n\n\n### Your contribution\n\nIf not, can we submit a PR?\n\n--- Comment by ArjunSrivastava1 at 2026-05-20T09:34:37Z ---\ni think u r maybe referring to some sort of docs for the conversion of weights? 
in which case yeah, there is\ni just was fixing those up a while ago, found it like that\n\nfeel free to read em up here: [link](https://moon-ci-docs.huggingface.co/docs/transformers/pr_45892/en/weightconverter)\n\nif somethings still missing, then lmk, if its all fine and found what u were looking for, then lmk that too \n\n--- Comment by Rocketknight1 at 2026-05-20T11:11:05Z ---\nYeah, I don't think there's a bug here! This is likely a case of dynamic weight renaming.\n\n--- Comment by BiggHeadd at 2026-05-20T13:30:39Z ---\nI found this branch [DeepSeek-V4-Flash-Base-support] have complete dynamic weight renaming.\n\n--- Comment by young-creator at 2026-05-21T03:33:04Z ---\n> i think u r maybe referring to some sort of docs for the conversion of weights? in which case yeah, there is i just was fixing those up a while ago, found it like that\n> \n> feel free to read em up here: [link](https://moon-ci-docs.huggingface.co/docs/transformers/pr_45892/en/weightconverter)\n> \n> if somethings still missing, then lmk, if its all fine and found what u were looking for, then lmk that too\n\nI do need the weight mapping, as I plan to finetune DeepSeek-V4 using the Transformers modeling code. \n\n--- Comment by Prachi-kushwaha at 2026-05-21T18:46:46Z ---\n> > i think u r maybe referring to some sort of docs for the conversion of weights? in which case yeah, there is i just was fixing those up a while ago, found it like that\n> > feel free to read em up here: [link](https://moon-ci-docs.huggingface.co/docs/transformers/pr_45892/en/weightconverter)\n> > if somethings still missing, then lmk, if its all fine and found what u were looking for, then lmk that too\n> \n> I do need the weight mapping, as I plan to finetune DeepSeek-V4 using the Transformers modeling code.\n\nyou can use LoRa/QLoRa for fine-tuning you don't need weight mapping in this scenario"}
	{"id": "pr_46148", "type": "pr", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46148: [Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export\nState: open \| Merged: False\nAuthor: yuvrajsharma9981 \| Base: main\nLabels: \nCreated: 2026-05-21T22:27:49Z\n\nHi,\n\n\\`torch.export.export\\` fails on Qwen3Next-family models with \\`GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0, 1)\\`. The crash traces to \\`Qwen3NextModel._update_linear_attn_mask\\`:\n\n\\`\\`\\`python\nif (past_key_values is not None and past_key_values.has_previous_state()) or (\n attention_mask is not None and torch.all(attention_mask == 1)\n):\n linear_attn_mask = None\n\\`\\`\\`\n\n\\`torch.all(attention_mask == 1)\\` produces a 0-dim bool tensor, and Python's \\`if\\` does an implicit \\`.item()\\` on it — an unbacked symbolic int the exporter can't resolve. 
Net effect: any user wanting an AOT package (\\`torch._inductor.aoti_compile_and_package\\` → \\`.pt2\\`) for any model in this family is blocked at the export step.\n\nI tripped on this trying to AOT compile Qwen3.5 for fast serving — the eager forward works, the export step crashes.\n\n## Scope\n\nFix lands at the modular source-of-truth, so the same patch propagates to all four models that inherit \\`Qwen3NextModel._update_linear_attn_mask\\`:\n\n- Qwen3Next (direct)\n- Qwen3.5 (\\`Qwen3_5TextModel(Qwen3NextModel)\\`)\n- Qwen3.5-MoE (\\`Qwen3_5MoeTextModel\\` via the same lineage)\n- OLMo Hybrid (\\`OlmoHybridModel(Qwen3NextModel)\\`)\n\n## Fix\n\nSmallest behavior-preserving thing I could come up with: keep the eager-mode fast-path identical, and skip the data-dependent branch only when \\`torch.compiler.is_compiling()\\` is true. The downstream linear-attention layer treats an all-1s mask as a cheap no-op, so the exported graph runs correctly for the no-padding case that the eager path was short-circuiting.\n\n\\`\\`\\`python\ndef _update_linear_attn_mask(self, attention_mask, past_key_values):\n linear_attn_mask = attention_mask\n if past_key_values is not None and past_key_values.has_previous_state():\n return None\n if torch.compiler.is_compiling():\n return linear_attn_mask\n if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n\\`\\`\\`\n\nTwo notes on the ordering:\n\n1. The cached-forward check stays first so users exporting a decode-step graph still get the cached-skip optimization baked into the resulting graph — that branch is already export-compatible (Python object state, not a tensor \\`.item()\\`).\n2. 
\\`torch.compiler.is_compiling()\\` is the public PyTorch idiom for \"behave differently under trace\"; runtime behavior for everyone not exporting is byte-identical to before.\n\n## Reproducer\n\nFails on v5.9.0 + torch 2.11:\n\n\\`\\`\\`python\nimport torch\nfrom transformers import AutoModelForCausalLM\n\nm = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen3.5-4B\", torch_dtype=torch.bfloat16)\nm.eval()\n\nclass W(torch.nn.Module):\n def __init__(s, m): super().__init__(); s.m = m\n def forward(s, ids, mask):\n return s.m(input_ids=ids, attention_mask=mask).logits\n\nids = torch.ones(2, 128, dtype=torch.long)\nmask = torch.ones(2, 128, dtype=torch.long)\n\ntorch.export.export(W(m), (ids, mask), dynamic_shapes={\n \"ids\": {0: torch.export.Dim.AUTO, 1: torch.export.Dim.AUTO},\n \"mask\": {0: torch.export.Dim.AUTO, 1: torch.export.Dim.AUTO},\n})\n\\`\\`\\`\n\nAfter the change, verified locally with a forked-source install:\n- eager runtime with all-1s mask still returns \\`None\\` (existing optimization preserved)\n- \\`torch.export.export(...)\\` succeeds and traces a clean graph\n\n## Commits\n\n- First commit edited the generated \\`modeling_qwen3_5.py\\` directly — CI correctly flagged this via \\`check_repository_consistency\\`.\n- Second commit moves the fix to \\`modular_qwen3_next.py\\` (the source-of-truth) and regenerates the affected \\`modeling_*.py\\` files via \\`make fix-repo\\`. Both checks pass locally now.\n\nHappy to add tests under \\`tests/models/qwen3_next/\\` (and the inheriting models) if that's the preferred shape — held off pending guidance on existing export-compat coverage conventions.\n\nThanks!\n\n--- Comment by github-actions[bot] at 2026-05-21T22:40:52Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: olmo_hybrid, qwen3_5, qwen3_5_moe, qwen3_next"}
	{"id": "pr_46148_file_src_transformers_models_olmo_hybrid_modeling_olmo_hybrid.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/olmo_hybrid/modeling_olmo_hybrid.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/olmo_hybrid/modeling_olmo_hybrid.py\nStatus: modified \| +12 -4\n\n@@ -1032,12 +1032,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_5_modeling_qwen3_5.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_5/modeling_qwen3_5.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_5/modeling_qwen3_5.py\nStatus: modified \| +12 -4\n\n@@ -1238,12 +1238,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_5_moe_modeling_qwen3_5_moe.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_5_moe/modeling_qwen3_5_moe.py\nStatus: modified \| +12 -4\n\n@@ -1358,12 +1358,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_next_modeling_qwen3_next.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_next/modeling_qwen3_next.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_next/modeling_qwen3_next.py\nStatus: modified \| +12 -4\n\n@@ -993,12 +993,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46148_file_src_transformers_models_qwen3_next_modular_qwen3_next.py", "type": "pr_diff", "number": 46148, "title": "[Qwen3Next] preserve linear-attn-mask optimization under torch.compile/export", "state": "open", "author": "yuvrajsharma9981", "labels": [], "created_at": "2026-05-21T22:27:49Z", "updated_at": "2026-05-21T22:40:53Z", "url": "https://github.com/huggingface/transformers/pull/46148", "merged": false, "base_branch": "main", "filename": "src/transformers/models/qwen3_next/modular_qwen3_next.py", "additions": 12, "deletions": 4, "text": "PR #46148 — file change: src/transformers/models/qwen3_next/modular_qwen3_next.py\nStatus: modified \| +12 -4\n\n@@ -749,12 +749,20 @@ def _update_linear_attn_mask(self, attention_mask, past_key_values):\n NOTE: Left-padding is used for linear attention mask.\n No need for zeroing states when\n 1. Cached forward\n- 2. Attending to all inputs\n+ 2. Attending to all inputs (eager-mode only — the\n+ ``torch.all(attention_mask == 1)`` check is data-dependent\n+ and isn't traceable by ``torch.export``)\n \"\"\"\n linear_attn_mask = attention_mask\n- if (past_key_values is not None and past_key_values.has_previous_state()) or (\n- attention_mask is not None and torch.all(attention_mask == 1)\n- ):\n+ if past_key_values is not None and past_key_values.has_previous_state():\n+ return None\n+ if torch.compiler.is_compiling():\n+ # Skip the data-dependent optimization under torch.compile /\n+ # torch.export. The downstream linear-attention layer handles\n+ # an all-1s mask as a cheap no-op, so runtime behavior of the\n+ # exported graph is unchanged for no-padding inputs.\n+ return linear_attn_mask\n+ if attention_mask is not None and torch.all(attention_mask == 1):\n linear_attn_mask = None\n return linear_attn_mask\n "}
	{"id": "pr_46147", "type": "pr", "number": 46147, "title": "Use attention interface in RoFormerSelfAttention", "state": "open", "author": "HamzaDogann", "labels": [], "created_at": "2026-05-21T20:30:41Z", "updated_at": "2026-05-21T21:12:02Z", "url": "https://github.com/huggingface/transformers/pull/46147", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46147: Use attention interface in RoFormerSelfAttention\nState: open \| Merged: False\nAuthor: HamzaDogann \| Base: main\nLabels: \nCreated: 2026-05-21T20:30:41Z\n\nRoFormer's self-attention was using a hardcoded eager implementation,\r\nmaking it impossible to use alternative attention backends.\r\n\r\nThis PR replaces the hardcoded computation with `ALL_ATTENTION_FUNCTIONS`\r\ndispatch and adds a local `eager_attention_forward` as the default fallback,\r\npreserving existing behavior while enabling `flash_attention_2`, `sdpa`,\r\nand custom attention implementations via `_attn_implementation`.\r\n\r\nCloses #46144\n\n--- Comment by github-actions[bot] at 2026-05-21T21:12:01Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: roformer\n\n--- Comment by Copilot at 2026-05-21T20:36:48Z ---\n`flash_attention_forward` falls back to `module.is_causal` when `is_causal` is not provided. `RoFormerSelfAttention` does not define `is_causal` and the interface call doesn’t pass it, so using `_attn_implementation=\"flash_attention_*\"` will raise an `AttributeError` (and cross-attn would be incorrectly treated as causal if `is_causal` were set globally). Pass an explicit `is_causal` value per call (likely `False` here, since RoFormer builds a bidirectional mask) and forward `output_attentions` so backend wrappers can warn/behave consistently.\n\n--- Comment by Copilot at 2026-05-21T20:36:48Z ---\n`eager_attention_forward` is identical to the shared implementation used across the repo (e.g. 
BERT/ALBERT) but is missing the `# Copied from transformers.models.bert.modeling_bert.eager_attention_forward` marker. Adding the marker helps `make fixup` keep this function in sync with upstream changes.\n\n--- Comment by Copilot at 2026-05-21T20:36:49Z ---\nThis change introduces a new attention-backend dispatch path for RoFormer via `ALL_ATTENTION_FUNCTIONS`. There are existing RoFormer modeling tests, but none cover non-eager dispatch or custom attention registration; adding a small test (e.g., setting `config._attn_implementation=\"sdpa\"` or registering a dummy backend key) would help prevent regressions."}
	{"id": "pr_46147_file_src_transformers_models_roformer_modeling_roformer.py", "type": "pr_diff", "number": 46147, "title": "Use attention interface in RoFormerSelfAttention", "state": "open", "author": "HamzaDogann", "labels": [], "created_at": "2026-05-21T20:30:41Z", "updated_at": "2026-05-21T21:12:02Z", "url": "https://github.com/huggingface/transformers/pull/46147", "merged": false, "base_branch": "main", "filename": "src/transformers/models/roformer/modeling_roformer.py", "additions": 56, "deletions": 22, "text": "PR #46147 — file change: src/transformers/models/roformer/modeling_roformer.py\nStatus: modified \| +56 -22\n\n@@ -13,7 +13,6 @@\n # limitations under the License.\n \"\"\"PyTorch RoFormer model.\"\"\"\n \n-import math\n from collections.abc import Callable\n \n import numpy as np\n@@ -36,15 +35,45 @@\n SequenceClassifierOutput,\n TokenClassifierOutput,\n )\n-from ...modeling_utils import PreTrainedModel\n+from ...modeling_utils import ALL_ATTENTION_FUNCTIONS, PreTrainedModel\n+from ...processing_utils import Unpack\n from ...pytorch_utils import apply_chunking_to_forward\n-from ...utils import auto_docstring, logging\n+from ...utils import TransformersKwargs, auto_docstring, logging\n from .configuration_roformer import RoFormerConfig\n \n \n logger = logging.get_logger(__name__)\n \n \n+# Copied from transformers.models.bert.modeling_bert.eager_attention_forward\n+def eager_attention_forward(\n+ module: nn.Module,\n+ query: torch.Tensor,\n+ key: torch.Tensor,\n+ value: torch.Tensor,\n+ attention_mask: torch.Tensor \| None,\n+ scaling: float \| None = None,\n+ dropout: float = 0.0,\n+ **kwargs: Unpack[TransformersKwargs],\n+):\n+ if scaling is None:\n+ scaling = query.size(-1) ** -0.5\n+\n+ # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n+ attn_weights = torch.matmul(query, key.transpose(2, 3)) * scaling\n+\n+ if attention_mask is not None:\n+ attn_weights = attn_weights + attention_mask\n+\n+ attn_weights = 
nn.functional.softmax(attn_weights, dim=-1)\n+ attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)\n+\n+ attn_output = torch.matmul(attn_weights, value)\n+ attn_output = attn_output.transpose(1, 2).contiguous()\n+\n+ return attn_output, attn_weights\n+\n+\n # Copied from transformers.models.marian.modeling_marian.MarianSinusoidalPositionalEmbedding with Marian->RoFormer\n class RoFormerSinusoidalPositionalEmbedding(nn.Embedding):\n \"\"\"This module produces sinusoidal positional embeddings of any length.\"\"\"\n@@ -121,9 +150,11 @@ def __init__(self, config, layer_idx=None):\n f\"heads ({config.num_attention_heads})\"\n )\n \n+ self.config = config\n self.num_attention_heads = config.num_attention_heads\n self.attention_head_size = int(config.hidden_size / config.num_attention_heads)\n self.all_head_size = self.num_attention_heads * self.attention_head_size\n+ self.scaling = self.attention_head_size**-0.5\n \n self.query = nn.Linear(config.hidden_size, self.all_head_size)\n self.key = nn.Linear(config.hidden_size, self.all_head_size)\n@@ -193,26 +224,25 @@ def forward(\n if is_cross_attention and isinstance(past_key_values, EncoderDecoderCache):\n past_key_values.is_updated[self.layer_idx] = True\n \n- # Take the dot product between \"query\" and \"key\" to get the raw attention scores.\n- attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))\n-\n- attention_scores = attention_scores / math.sqrt(self.attention_head_size)\n- if attention_mask is not None:\n- # Apply the attention mask is (precomputed for all layers in RoFormerModel forward() function)\n- attention_scores = attention_scores + attention_mask\n-\n- # Normalize the attention scores to probabilities.\n- attention_probs = nn.functional.softmax(attention_scores, dim=-1)\n-\n- # This is actually dropping out entire tokens to attend to, which might\n- # seem a bit unusual, but is taken from the original Transformer paper.\n- attention_probs = 
self.dropout(attention_probs)\n-\n- context_layer = torch.matmul(attention_probs, value_layer)\n+ attention_interface: Callable = ALL_ATTENTION_FUNCTIONS.get_interface(\n+ self.config._attn_implementation, eager_attention_forward\n+ )\n \n- context_layer = context_layer.permute(0, 2, 1, 3).contiguous()\n- new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)\n- context_layer = context_layer.view(new_context_layer_shape)\n+ context_layer, attention_probs = attention_interface(\n+ self,\n+ query_layer,\n+ key_layer,\n+ value_layer,\n+ attention_mask,\n+ dropout=0.0 if not self.training else self.dropout.p,\n+ scaling=self.scaling,\n+ # RoFormer precomputes a (bidirectional) mask in `RoFormerModel`, so the backend\n+ # must not apply an additional causal mask.\n+ is_causal=False,\n+ output_attentions=output_attentions,\n+ **kwargs,\n+ )\n+ context_layer = context_layer.reshape(*input_shape, -1).contiguous()\n \n return context_layer, attention_probs\n \n@@ -617,6 +647,10 @@ class RoFormerPreTrainedModel(PreTrainedModel):\n config: RoFormerConfig\n base_model_prefix = \"roformer\"\n supports_gradient_checkpointing = True\n+ _supports_flash_attn = True\n+ _supports_sdpa = True\n+ _supports_flex_attn = True\n+ _supports_attention_backend = True\n \n @torch.no_grad()\n def _init_weights(self, module):"}
	{"id": "pr_46146", "type": "pr", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46146: Added cosmos3 model and bugfixed Qwen3-VL\nState: open \| Merged: False\nAuthor: MaciejBalaNV \| Base: main\nLabels: \nCreated: 2026-05-21T19:47:46Z\n\n# What does this PR do?\r\n\r\n<!--\r\nCongratulations! You've made it this far! You're not quite done yet though.\r\n\r\nOnce merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.\r\n\r\nThen, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.\r\n\r\nOnce you're done, someone will review your PR shortly (see the section \"Who can review?\" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.\r\n-->\r\n\r\n<!-- Remove if not applicable -->\r\n\r\nThis PR adds a support for Cosmos3 Reasoner model (not released yet). It's a Mixture Of Transformers model, where we have a Generator and a Reasoner tower in a unified checkpoint. The Reasoner tower has Qwen3-VL architecture, so we can directly reuse it. 
However, we need extra code to handle the checkpoint mapping, since the final checkpoint will be in a unified Reasoner+Generator diffusers format.\r\n\r\nAdditionally, this PR fixes one issue which currently is present on top of tree - when using latest vllm and latest transformers build from source, even basic `vllm serve Qwen/Qwen3-VL-8B-Instruct` fails during dummy run. This root-cause of the bug is this commit: `ba06e3fbdf355c363ac067ebcda210017e90a852`, reverting it also fixes Qwen-VL.\r\n\r\n## Code Agent Policy\r\n\r\nThe Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by\r\ncode agents. We are currently bottlenecked by our ability to review and respond to them. As a result, \r\nwe ask that new users do not submit pure code agent PRs at this time. \r\nYou may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous \"OpenClaw\"-like agents\r\nnot to open any PRs or issues for the moment.\r\n\r\nPRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this\r\nrepeatedly or maliciously. \r\n\r\nThis is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, \r\nthis policy is likely to be updated regularly in the near future. For more information, please read [`CONTRIBUTING.md`](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md).\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? 
Please add a link\r\n to it if that's the case.\r\n- [x] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [x] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\nAnyone in the community is free to review the PR once the tests have passed. Feel free to tag\r\nmembers/contributors who may be interested in your PR.\r\n\r\n@yonigozlan for a vision model review\r\n\r\n<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @\r\n\r\n If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.\r\n Please tag fewer than 3 people.\r\n\r\nModels:\r\n\r\n- text models: @ArthurZucker @Cyrilvallez\r\n- vision models: @yonigozlan @molbap\r\n- audio models: @eustlb @ebezzam @vasqu\r\n- multimodal models: @zucchini-nlp\r\n- graph models: @clefourrier\r\n\r\nLibrary:\r\n\r\n- generate: @zucchini-nlp (visual-language models) or @gante (all others)\r\n- continuous batching: @remi-or @ArthurZucker @McPatate\r\n- pipelines: @Rocketknight1\r\n- tokenizers: @ArthurZucker and @itazap\r\n- trainer: @SunMarc\r\n- attention: @vasqu @ArthurZucker @CyrilVallez\r\n- model loading (from pretrained, etc): @CyrilVallez\r\n- distributed: @3outeille @ArthurZucker\r\n- CIs: @ydshieh\r\n\r\nIntegrations:\r\n\r\n- ray/raytune: @richardliaw, @amogkam\r\n- Big Model Inference: @SunMarc\r\n- quantization: @SunMarc\r\n- kernels: @drbh\r\n- peft: @BenjaminBossan @githubnemo\r\n\r\nDevices/Backends:\r\n\r\n- AMD ROCm: @ivarflakstad\r\n- Intel XPU: @IlyasMoutawwakil\r\n- Ascend NPU: @ivarflakstad \r\n\r\nDocumentation: @stevhliu\r\n\r\nResearch projects are not maintained and should be taken as is.\r\n\r\n -->\r\n\n\n--- Comment by 
github-actions[bot] at 2026-05-21T19:49:01Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: auto, cosmos3\n\n--- Comment by github-actions[bot] at 2026-05-21T20:05:30Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46146&sha=61cd69"}
	{"id": "pr_46146_file_docs_source_en__toctree.yml", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "docs/source/en/_toctree.yml", "additions": 2, "deletions": 0, "text": "PR #46146 — file change: docs/source/en/_toctree.yml\nStatus: modified \| +2 -0\n\n@@ -1208,6 +1208,8 @@\n title: ColPali\n - local: model_doc/colqwen2\n title: ColQwen2\n+ - local: model_doc/cosmos3\n+ title: Cosmos3 Omni\n - local: model_doc/data2vec\n title: Data2Vec\n - local: model_doc/deepseek_vl"}
	{"id": "pr_46146_file_docs_source_en_model_doc_cosmos3.md", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "docs/source/en/model_doc/cosmos3.md", "additions": 89, "deletions": 0, "text": "PR #46146 — file change: docs/source/en/model_doc/cosmos3.md\nStatus: added \| +89 -0\n\n@@ -0,0 +1,89 @@\n+<!--Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+\n+Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with\n+the License. You may obtain a copy of the License at\n+\n+http://www.apache.org/licenses/LICENSE-2.0\n+\n+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on\n+an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\n+specific language governing permissions and limitations under the License.\n+\n+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be\n+rendered properly in your Markdown viewer.\n+\n+-->\n+\n+<div style=\"float: right;\">\n+ <div class=\"flex flex-wrap space-x-1\">\n+<img alt=\"FlashAttention\" src=\"https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat\">\n+<img alt=\"SDPA\" src=\"https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white\"> </div>\n+</div>\n+\n+# Cosmos3 Omni\n+\n+[Cosmos3](https://huggingface.co/nvidia/Cosmos3-Nano) is a mixture-of-transformers (MoT) Vision Foundation Model from NVIDIA, composed of a Reasoner tower and a Generator tower. 
The two towers share the same input embedding and visual encoder but use disjoint MoT experts for understanding vs. generation, plus cross-modal adapters (`llm2vae`, `llm2sound`, `llm2action`, etc.) that connect the language model to image / audio / action heads.\n+\n+The transformers integration loads only the Reasoner tower from a unified Cosmos3 checkpoint. The Reasoner is architecturally identical to [Qwen3-VL](./qwen3_vl): `Cosmos3ForConditionalGeneration` is a thin subclass of `Qwen3VLForConditionalGeneration`. Loading is driven by two transformations registered automatically when `model_type` is `\"cosmos3_omni\"`:\n+\n+1. The checkpoint's flat namespaces are re-targeted to Qwen3-VL's nested layout: `model.<>` → `model.language_model.<>` and `blocks.` / `merger.` / `patch_embed.` / `pos_embed.` / `deepstack_merger_list.` → `model.visual.<>`.\n+2. Generator / sound / action parameters (`_moe_gen`, `llm2vae`, `vae2llm`, `time_embedder`, `llm2sound`, `sound2llm`, `sound_modality_embed`, `llm2action`, `action2llm`, `action_modality_embed`) are skipped on load.\n+\n+## Usage\n+\n+```python\n+import torch\n+from transformers import AutoProcessor, Cosmos3ForConditionalGeneration\n+\n+model = Cosmos3ForConditionalGeneration.from_pretrained(\n+ \"nvidia/Cosmos3-Nano\",\n+ dtype=torch.float16,\n+ device_map=\"auto\",\n+ attn_implementation=\"sdpa\",\n+)\n+processor = AutoProcessor.from_pretrained(\"nvidia/Cosmos3-Nano\")\n+\n+conversation = [\n+ {\n+ \"role\": \"user\",\n+ \"content\": [\n+ {\"type\": \"image\", \"image\": \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg\"},\n+ {\"type\": \"text\", \"text\": \"Caption the image in detail.\"},\n+ ],\n+ },\n+]\n+\n+inputs = processor.apply_chat_template(\n+ conversation,\n+ tokenize=True,\n+ add_generation_prompt=True,\n+ return_dict=True,\n+ return_tensors=\"pt\",\n+).to(model.device)\n+\n+generated_ids = model.generate(**inputs, max_new_tokens=512)\n+output = 
processor.batch_decode(\n+ [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],\n+ skip_special_tokens=True,\n+ clean_up_tokenization_spaces=False,\n+)\n+print(output[0])\n+```\n+\n+## Cosmos3Config\n+\n+[[autodoc]] Cosmos3Config\n+\n+## Cosmos3Model\n+\n+[[autodoc]] Cosmos3Model\n+ - forward\n+ - get_video_features\n+ - get_image_features\n+\n+## Cosmos3ForConditionalGeneration\n+\n+[[autodoc]] Cosmos3ForConditionalGeneration\n+ - forward\n+ - get_video_features\n+ - get_image_features"}
	{"id": "pr_46146_file_src_transformers_conversion_mapping.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/conversion_mapping.py", "additions": 14, "deletions": 0, "text": "PR #46146 — file change: src/transformers/conversion_mapping.py\nStatus: modified \| +14 -0\n\n@@ -580,6 +580,20 @@ def _build_checkpoint_conversion_mapping():\n operations=[Transpose(1, 2, check_dims=True)],\n ),\n ],\n+ \"cosmos3_omni\": [\n+ # Cosmos3 unified checkpoints store the Reasoner LLM under a flat `model.` namespace\n+ # (no `language_model.` nesting) and the ViT under flat `blocks.` / `merger.` /\n+ # `patch_embed.` / `pos_embed.` / `deepstack_merger_list.`. Re-target both to the\n+ # nested Qwen3-VL layout (`model.language_model.` and `model.visual.`).\n+ WeightRenaming(\n+ source_patterns=r\"^model\\.(?!language_model\\.)(.+)$\",\n+ target_patterns=r\"model.language_model.\\1\",\n+ ),\n+ WeightRenaming(\n+ source_patterns=r\"^(blocks\\.\|merger\\.\|patch_embed\\.\|pos_embed\\.\|deepstack_merger_list\\.)(.*)$\",\n+ target_patterns=r\"model.visual.\\1\\2\",\n+ ),\n+ ],\n \"phimoe\": [\n WeightRenaming(\".block_sparse_moe.\", \".mlp.\"),\n WeightRenaming(\".gate.weight\", \".router.weight\"),"}
	{"id": "pr_46146_file_src_transformers_models___init__.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/__init__.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/__init__.py\nStatus: modified \| +1 -0\n\n@@ -78,6 +78,7 @@\n from .convbert import *\n from .convnext import *\n from .convnextv2 import *\n+ from .cosmos3 import *\n from .cpm import *\n from .cpmant import *\n from .csm import *"}
	{"id": "pr_46146_file_src_transformers_models_auto_auto_mappings.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/auto_mappings.py", "additions": 2, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/auto_mappings.py\nStatus: modified \| +2 -0\n\n@@ -105,6 +105,7 @@\n (\"convbert\", \"ConvBertConfig\"),\n (\"convnext\", \"ConvNextConfig\"),\n (\"convnextv2\", \"ConvNextV2Config\"),\n+ (\"cosmos3_omni\", \"Cosmos3Config\"),\n (\"cpmant\", \"CpmAntConfig\"),\n (\"csm\", \"CsmConfig\"),\n (\"csm_depth_decoder_model\", \"CsmDepthDecoderConfig\"),\n@@ -688,6 +689,7 @@\n (\"clip_vision_model\", \"clip\"),\n (\"clipseg_text_model\", \"clipseg\"),\n (\"clipseg_vision_model\", \"clipseg\"),\n+ (\"cosmos3_omni\", \"cosmos3\"),\n (\"clvp_decoder\", \"clvp\"),\n (\"clvp_encoder\", \"clvp\"),\n (\"csm_depth_decoder_model\", \"csm\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_image_processing_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/image_processing_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/image_processing_auto.py\nStatus: modified \| +1 -0\n\n@@ -74,6 +74,7 @@\n (\"colpali\", {\"torchvision\": \"SiglipImageProcessor\", \"pil\": \"SiglipImageProcessorPil\"}),\n (\"colqwen2\", {\"torchvision\": \"Qwen2VLImageProcessor\", \"pil\": \"Qwen2VLImageProcessorPil\"}),\n (\"convnextv2\", {\"torchvision\": \"ConvNextImageProcessor\", \"pil\": \"ConvNextImageProcessorPil\"}),\n+ (\"cosmos3_omni\", {\"torchvision\": \"Qwen2VLImageProcessor\", \"pil\": \"Qwen2VLImageProcessorPil\"}),\n (\"cvt\", {\"torchvision\": \"ConvNextImageProcessor\", \"pil\": \"ConvNextImageProcessorPil\"}),\n (\"data2vec-vision\", {\"torchvision\": \"BeitImageProcessor\", \"pil\": \"BeitImageProcessorPil\"}),\n (\"deimv2\", {\"torchvision\": \"RTDetrImageProcessor\", \"pil\": \"RTDetrImageProcessorPil\"}),"}
	{"id": "pr_46146_file_src_transformers_models_auto_modeling_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/modeling_auto.py", "additions": 2, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/modeling_auto.py\nStatus: modified \| +2 -0\n\n@@ -97,6 +97,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):\n (\"convbert\", \"ConvBertModel\"),\n (\"convnext\", \"ConvNextModel\"),\n (\"convnextv2\", \"ConvNextV2Model\"),\n+ (\"cosmos3_omni\", \"Cosmos3Model\"),\n (\"cpmant\", \"CpmAntModel\"),\n (\"csm\", \"CsmForConditionalGeneration\"),\n (\"ctrl\", \"CTRLModel\"),\n@@ -995,6 +996,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):\n (\"blip-2\", \"Blip2ForConditionalGeneration\"),\n (\"chameleon\", \"ChameleonForConditionalGeneration\"),\n (\"cohere2_vision\", \"Cohere2VisionForConditionalGeneration\"),\n+ (\"cosmos3_omni\", \"Cosmos3ForConditionalGeneration\"),\n (\"deepseek_vl\", \"DeepseekVLForConditionalGeneration\"),\n (\"deepseek_vl_hybrid\", \"DeepseekVLHybridForConditionalGeneration\"),\n (\"emu3\", \"Emu3ForConditionalGeneration\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_processing_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/processing_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/processing_auto.py\nStatus: modified \| +1 -0\n\n@@ -69,6 +69,7 @@\n (\"colmodernvbert\", \"ColModernVBertProcessor\"),\n (\"colpali\", \"ColPaliProcessor\"),\n (\"colqwen2\", \"ColQwen2Processor\"),\n+ (\"cosmos3_omni\", \"Qwen3VLProcessor\"),\n (\"deepseek_vl\", \"DeepseekVLProcessor\"),\n (\"deepseek_vl_hybrid\", \"DeepseekVLHybridProcessor\"),\n (\"dia\", \"DiaProcessor\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_tokenization_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/tokenization_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/tokenization_auto.py\nStatus: modified \| +1 -0\n\n@@ -99,6 +99,7 @@\n (\"cohere2\", \"CohereTokenizer\" if is_tokenizers_available() else None),\n (\"colqwen2\", \"Qwen2Tokenizer\" if is_tokenizers_available() else None),\n (\"convbert\", \"BertTokenizer\" if is_tokenizers_available() else None),\n+ (\"cosmos3_omni\", \"Qwen2Tokenizer\" if is_tokenizers_available() else None),\n (\"cpm\", \"CpmTokenizer\" if is_tokenizers_available() else None),\n (\"cpmant\", \"CpmAntTokenizer\"),\n (\"ctrl\", \"CTRLTokenizer\"),"}
	{"id": "pr_46146_file_src_transformers_models_auto_video_processing_auto.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/auto/video_processing_auto.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/auto/video_processing_auto.py\nStatus: modified \| +1 -0\n\n@@ -54,6 +54,7 @@\n # Merge non-standard mapping names with auto-inferred `VIDEO_PROCESSOR_MAPPING_NAMES`\n MISSING_VIDEO_PROCESSOR_MAPPING_NAMES = OrderedDict(\n [\n+ (\"cosmos3_omni\", \"Qwen3VLVideoProcessor\"),\n (\"exaone4_5\", \"Qwen2VLVideoProcessor\"),\n (\"instructblip\", \"InstructBlipVideoVideoProcessor\"),\n (\"pe_audio_video\", \"PeVideoVideoProcessor\"),"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3___init__.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/__init__.py", "additions": 27, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/__init__.py\nStatus: added \| +27 -0\n\n@@ -0,0 +1,27 @@\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+from typing import TYPE_CHECKING\n+\n+from ...utils import _LazyModule\n+from ...utils.import_utils import define_import_structure\n+\n+\n+if TYPE_CHECKING:\n+ from .configuration_cosmos3 import *\n+ from .modeling_cosmos3 import *\n+else:\n+ import sys\n+\n+ _file = globals()[\"__file__\"]\n+ sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3_configuration_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/configuration_cosmos3.py", "additions": 40, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/configuration_cosmos3.py\nStatus: added \| +40 -0\n\n@@ -0,0 +1,40 @@\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# This file was automatically generated from src/transformers/models/cosmos3/modular_cosmos3.py.\n+# Do NOT edit this file manually as any edits will be overwritten by the generation of\n+# the file from the modular. If any change should be done, please apply the change to the\n+# modular_cosmos3.py file directly. One of our CI enforces this.\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. 
All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+from huggingface_hub.dataclasses import strict\n+\n+from ...utils import auto_docstring\n+from ..qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig\n+\n+\n+@auto_docstring(checkpoint=\"nvidia/Cosmos3-Nano\")\n+@strict\n+class Cosmos3Config(Qwen3VLConfig):\n+ r\"\"\"\n+ Configuration for the [Cosmos3](https://huggingface.co/nvidia/Cosmos3-Nano) Reasoner tower.\n+\n+ The Reasoner tower is architecturally identical to Qwen3-VL, so this config inherits all\n+ fields from [`Qwen3VLConfig`] and only changes `model_type` so that conversion mappings\n+ and key-renaming rules dispatch correctly when loading a unified Cosmos3 checkpoint.\n+ \"\"\"\n+\n+ model_type = \"cosmos3_omni\"\n+\n+\n+__all__ = [\"Cosmos3Config\"]"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3_modeling_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/modeling_cosmos3.py", "additions": 62, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/modeling_cosmos3.py\nStatus: added \| +62 -0\n\n@@ -0,0 +1,62 @@\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# This file was automatically generated from src/transformers/models/cosmos3/modular_cosmos3.py.\n+# Do NOT edit this file manually as any edits will be overwritten by the generation of\n+# the file from the modular. If any change should be done, please apply the change to the\n+# modular_cosmos3.py file directly. One of our CI enforces this.\n+# 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. 
All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\"\"\"Cosmos3 model — loads the Reasoner tower of a Cosmos3 MoT checkpoint into Qwen3-VL.\"\"\"\n+\n+from ..qwen3_vl.modeling_qwen3_vl import Qwen3VLForConditionalGeneration, Qwen3VLModel\n+from .configuration_cosmos3 import Cosmos3Config\n+\n+\n+_COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS = [\n+ r\"_moe_gen\",\n+ r\"^llm2vae\\.\",\n+ r\"^vae2llm\\.\",\n+ r\"^time_embedder\\.\",\n+ r\"^llm2sound\\.\",\n+ r\"^sound2llm\\.\",\n+ r\"^sound_modality_embed$\",\n+ r\"^llm2action\\.\",\n+ r\"^action2llm\\.\",\n+ r\"^action_modality_embed$\",\n+]\n+\n+\n+class Cosmos3Model(Qwen3VLModel):\n+ config: Cosmos3Config\n+\n+ # Base-model loading from a unified Cosmos3 checkpoint drops the Generator tower,\n+ # cross-modal adapters, and the causal-LM head.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS + [\n+ r\"^lm_head\\.weight$\"\n+ ]\n+\n+\n+class Cosmos3ForConditionalGeneration(Qwen3VLForConditionalGeneration):\n+ config: Cosmos3Config\n+\n+ # The unified Cosmos3 checkpoint stores both the Reasoner tower (loaded here) and the\n+ # Generator tower / cross-modal adapters (dropped). These patterns silence the\n+ # \"unexpected keys\" warning for parameters that belong to the dropped components.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS\n+\n+\n+__all__ = [\n+ \"Cosmos3ForConditionalGeneration\",\n+ \"Cosmos3Model\",\n+]"}
	{"id": "pr_46146_file_src_transformers_models_cosmos3_modular_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cosmos3/modular_cosmos3.py", "additions": 77, "deletions": 0, "text": "PR #46146 — file change: src/transformers/models/cosmos3/modular_cosmos3.py\nStatus: added \| +77 -0\n\n@@ -0,0 +1,77 @@\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\"\"\"Cosmos3 model — loads the Reasoner tower of a Cosmos3 MoT checkpoint into Qwen3-VL.\"\"\"\n+\n+from huggingface_hub.dataclasses import strict\n+\n+from ...utils import auto_docstring\n+from ..qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig\n+from ..qwen3_vl.modeling_qwen3_vl import Qwen3VLForConditionalGeneration, Qwen3VLModel\n+\n+\n+@auto_docstring(checkpoint=\"nvidia/Cosmos3-Nano\")\n+@strict\n+class Cosmos3Config(Qwen3VLConfig):\n+ r\"\"\"\n+ Configuration for the [Cosmos3](https://huggingface.co/nvidia/Cosmos3-Nano) Reasoner tower.\n+\n+ The Reasoner tower is architecturally identical to Qwen3-VL, so this config inherits all\n+ fields from [`Qwen3VLConfig`] and only changes 
`model_type` so that conversion mappings\n+ and key-renaming rules dispatch correctly when loading a unified Cosmos3 checkpoint.\n+ \"\"\"\n+\n+ model_type = \"cosmos3_omni\"\n+\n+\n+_COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS = [\n+ # Generator (image / video diffusion) MoT expert + cross-modal projections\n+ r\"_moe_gen\",\n+ r\"^llm2vae\\.\",\n+ r\"^vae2llm\\.\",\n+ r\"^time_embedder\\.\",\n+ # Sound tower\n+ r\"^llm2sound\\.\",\n+ r\"^sound2llm\\.\",\n+ r\"^sound_modality_embed$\",\n+ # Action tower\n+ r\"^llm2action\\.\",\n+ r\"^action2llm\\.\",\n+ r\"^action_modality_embed$\",\n+]\n+\n+\n+class Cosmos3Model(Qwen3VLModel):\n+ config: Cosmos3Config\n+\n+ # Base-model loading from a unified Cosmos3 checkpoint drops the Generator tower,\n+ # cross-modal adapters, and the causal-LM head.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS + [\n+ r\"^lm_head\\.weight$\"\n+ ]\n+\n+\n+class Cosmos3ForConditionalGeneration(Qwen3VLForConditionalGeneration):\n+ config: Cosmos3Config\n+\n+ # The unified Cosmos3 checkpoint stores both the Reasoner tower (loaded here) and the\n+ # Generator tower / cross-modal adapters (dropped). These patterns silence the\n+ # \"unexpected keys\" warning for parameters that belong to the dropped components.\n+ _keys_to_ignore_on_load_unexpected = _COSMOS3_DROPPED_UNIFIED_CHECKPOINT_KEYS\n+\n+\n+__all__ = [\n+ \"Cosmos3Config\",\n+ \"Cosmos3ForConditionalGeneration\",\n+ \"Cosmos3Model\",\n+]"}
	{"id": "pr_46146_file_src_transformers_processing_utils.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/processing_utils.py", "additions": 10, "deletions": 1, "text": "PR #46146 — file change: src/transformers/processing_utils.py\nStatus: modified \| +10 -1\n\n@@ -876,7 +876,16 @@ def get_text_with_replacements(\n expanded_sample.append(text[batch_idx][last:start])\n \n mm_type = m.lastgroup\n- replacement_text = next(replacements_iters[mm_type])\n+ replacement_text = next(replacements_iters[mm_type], None)\n+ if replacement_text is None:\n+ # No replacement available for this modality — leave the\n+ # placeholder in place so the tokenizer can still encode it\n+ # as a special token. This happens during text-only passes\n+ # (e.g. vLLM's dummy profiling) where the prompt contains\n+ # placeholders but no mm data is provided.\n+ expanded_sample.append(m.group())\n+ last = end\n+ continue\n replacement_offsets.append(\n {\n \"type\": mm_type,"}
	{"id": "pr_46146_file_src_transformers_utils_auto_docstring.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "src/transformers/utils/auto_docstring.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: src/transformers/utils/auto_docstring.py\nStatus: modified \| +1 -0\n\n@@ -74,6 +74,7 @@\n \"x-clip\": \"XCLIPConfig\",\n \"kosmos2\": \"Kosmos2Config\",\n \"kosmos2-5\": \"Kosmos2_5Config\",\n+ \"cosmos3\": \"Cosmos3Config\",\n \"donut\": \"DonutSwinConfig\",\n \"esmfold\": \"EsmConfig\",\n \"parakeet\": \"ParakeetCTCConfig\","}
	{"id": "pr_46146_file_tests_models_cosmos3___init__.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "tests/models/cosmos3/__init__.py", "additions": 1, "deletions": 0, "text": "PR #46146 — file change: tests/models/cosmos3/__init__.py\nStatus: added \| +1 -0\n\n@@ -0,0 +1 @@\n+"}
	{"id": "pr_46146_file_tests_models_cosmos3_test_modeling_cosmos3.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "tests/models/cosmos3/test_modeling_cosmos3.py", "additions": 114, "deletions": 0, "text": "PR #46146 — file change: tests/models/cosmos3/test_modeling_cosmos3.py\nStatus: added \| +114 -0\n\n@@ -0,0 +1,114 @@\n+# Copyright 2026 NVIDIA Corporation and The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\"\"\"Testing suite for the PyTorch Cosmos3 model.\"\"\"\n+\n+import copy\n+import unittest\n+\n+from transformers import AutoConfig, Cosmos3Config, is_torch_available\n+from transformers.conversion_mapping import get_checkpoint_conversion_mapping\n+from transformers.core_model_loading import WeightRenaming, rename_source_key\n+from transformers.testing_utils import require_torch\n+\n+\n+if is_torch_available():\n+ from transformers import AutoModel, AutoModelForImageTextToText, Cosmos3ForConditionalGeneration, Cosmos3Model\n+\n+\n+def get_tiny_cosmos3_config():\n+ return Cosmos3Config(\n+ text_config={\n+ \"vocab_size\": 99,\n+ \"hidden_size\": 32,\n+ \"intermediate_size\": 64,\n+ \"num_hidden_layers\": 
1,\n+ \"num_attention_heads\": 4,\n+ \"num_key_value_heads\": 2,\n+ \"head_dim\": 8,\n+ \"max_position_embeddings\": 64,\n+ \"pad_token_id\": 0,\n+ \"rope_parameters\": {\n+ \"rope_type\": \"default\",\n+ \"mrope_section\": [16, 8, 8],\n+ \"mrope_interleaved\": True,\n+ \"rope_theta\": 10000,\n+ },\n+ },\n+ vision_config={\n+ \"depth\": 1,\n+ \"hidden_size\": 32,\n+ \"hidden_act\": \"gelu_pytorch_tanh\",\n+ \"intermediate_size\": 64,\n+ \"num_heads\": 4,\n+ \"patch_size\": 16,\n+ \"spatial_merge_size\": 1,\n+ \"temporal_patch_size\": 2,\n+ \"out_hidden_size\": 32,\n+ \"num_position_embeddings\": 16,\n+ \"deepstack_visual_indexes\": [0],\n+ },\n+ image_token_id=3,\n+ video_token_id=4,\n+ vision_start_token_id=5,\n+ vision_end_token_id=6,\n+ tie_word_embeddings=False,\n+ pad_token_id=0,\n+ )\n+\n+\n+class Cosmos3ConfigTest(unittest.TestCase):\n+ def test_auto_config_mapping(self):\n+ config = AutoConfig.for_model(\"cosmos3_omni\")\n+\n+ self.assertIsInstance(config, Cosmos3Config)\n+ self.assertEqual(config.model_type, \"cosmos3_omni\")\n+\n+\n+class Cosmos3ConversionMappingTest(unittest.TestCase):\n+ def test_checkpoint_conversion_mapping_targets_unified_checkpoint_namespaces(self):\n+ mapping = get_checkpoint_conversion_mapping(\"cosmos3_omni\")\n+ renamings = [entry for entry in mapping if isinstance(entry, WeightRenaming)]\n+\n+ self.assertEqual(\n+ rename_source_key(\"model.layers.0.self_attn.q_proj.weight\", renamings, [])[0],\n+ \"model.language_model.layers.0.self_attn.q_proj.weight\",\n+ )\n+ self.assertEqual(\n+ rename_source_key(\"blocks.0.norm1.weight\", renamings, [])[0],\n+ \"model.visual.blocks.0.norm1.weight\",\n+ )\n+ self.assertEqual(\n+ rename_source_key(\"merger.mlp.0.weight\", renamings, [])[0],\n+ \"model.visual.merger.mlp.0.weight\",\n+ )\n+\n+ already_nested_key = \"model.language_model.layers.0.self_attn.q_proj.weight\"\n+ self.assertEqual(rename_source_key(already_nested_key, renamings, [])[0], 
already_nested_key)\n+\n+\n+@require_torch\n+class Cosmos3ModelTest(unittest.TestCase):\n+ def test_auto_model_mappings(self):\n+ config = get_tiny_cosmos3_config()\n+\n+ self.assertIsInstance(AutoModel.from_config(copy.deepcopy(config)), Cosmos3Model)\n+ self.assertIsInstance(\n+ AutoModelForImageTextToText.from_config(copy.deepcopy(config)), Cosmos3ForConditionalGeneration\n+ )\n+\n+ def test_unified_checkpoint_unexpected_keys_are_ignored(self):\n+ self.assertIn(r\"_moe_gen\", Cosmos3Model._keys_to_ignore_on_load_unexpected)\n+ self.assertIn(r\"^llm2sound\\.\", Cosmos3ForConditionalGeneration._keys_to_ignore_on_load_unexpected)\n+ self.assertIn(r\"^lm_head\\.weight$\", Cosmos3Model._keys_to_ignore_on_load_unexpected)\n+ self.assertNotIn(r\"^lm_head\\.weight$\", Cosmos3ForConditionalGeneration._keys_to_ignore_on_load_unexpected)"}
	{"id": "pr_46146_file_tests_utils_test_processing_utils.py", "type": "pr_diff", "number": 46146, "title": "Added cosmos3 model and bugfixed Qwen3-VL", "state": "open", "author": "MaciejBalaNV", "labels": [], "created_at": "2026-05-21T19:47:46Z", "updated_at": "2026-05-22T00:45:08Z", "url": "https://github.com/huggingface/transformers/pull/46146", "merged": false, "base_branch": "main", "filename": "tests/utils/test_processing_utils.py", "additions": 52, "deletions": 0, "text": "PR #46146 — file change: tests/utils/test_processing_utils.py\nStatus: added \| +52 -0\n\n@@ -0,0 +1,52 @@\n+# Copyright 2026 The HuggingFace Inc. team. All rights reserved.\n+#\n+# Licensed under the Apache License, Version 2.0 (the \"License\");\n+# you may not use this file except in compliance with the License.\n+# You may obtain a copy of the License at\n+#\n+# http://www.apache.org/licenses/LICENSE-2.0\n+#\n+# Unless required by applicable law or agreed to in writing, software\n+# distributed under the License is distributed on an \"AS IS\" BASIS,\n+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n+# See the License for the specific language governing permissions and\n+# limitations under the License.\n+\n+import unittest\n+\n+from transformers.processing_utils import ProcessorMixin\n+\n+\n+class DummyMultimodalProcessor(ProcessorMixin):\n+ pass\n+\n+\n+class ProcessorMixinTextReplacementTest(unittest.TestCase):\n+ def get_processor(self):\n+ processor = DummyMultimodalProcessor()\n+ processor.image_token = \"<image>\"\n+ processor.video_token = \"<video>\"\n+ return processor\n+\n+ def test_get_text_with_replacements_preserves_missing_replacement_placeholders(self):\n+ processor = self.get_processor()\n+\n+ text, replacement_offsets = processor.get_text_with_replacements(\n+ [\"Look <image> then <video> then <image>.\"],\n+ images_replacements=[\"<image><image>\"],\n+ videos_replacements=[\"<video><video>\"],\n+ )\n+\n+ self.assertEqual(text, [\"Look 
<image><image> then <video><video> then <image>.\"])\n+ self.assertEqual(\n+ [offset[\"replacement\"] for offset in replacement_offsets[0]],\n+ [\"<image><image>\", \"<video><video>\"],\n+ )\n+\n+ def test_get_text_with_replacements_preserves_placeholder_when_no_modality_data_is_provided(self):\n+ processor = self.get_processor()\n+\n+ text, replacement_offsets = processor.get_text_with_replacements([\"Profile <image> without image data.\"])\n+\n+ self.assertEqual(text, [\"Profile <image> without image data.\"])\n+ self.assertEqual(replacement_offsets, [[]])"}
	{"id": "pr_46145", "type": "pr", "number": 46145, "title": "Fix load_adapter OOM caused by full-model warmup sizing", "state": "open", "author": "Yooniel", "labels": [], "created_at": "2026-05-21T15:59:30Z", "updated_at": "2026-05-21T16:17:56Z", "url": "https://github.com/huggingface/transformers/pull/46145", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46145: Fix load_adapter OOM caused by full-model warmup sizing\nState: open \| Merged: False\nAuthor: Yooniel \| Base: main\nLabels: \nCreated: 2026-05-21T15:59:30Z\n\n# What does this PR do?\r\n\r\nFixes an OOM in `load_adapter` on configurations where the base model occupies more than ~half of GPU memory, e.g. Gemma-3-27B in bf16 on a single H100/H200 or Llama-70B on a single 80 GB GPU.\r\n\r\n## Root cause\r\n\r\n`load_adapter` passes every named parameter on the model, base model included, as `expected_keys` to `_load_pretrained_model`. Downstream, `caching_allocator_warmup` sums those into a full base-model byte count and issues a single same-size allocation on top of the already-resident base model, OOMing.\r\n\r\n```text\r\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 51.87 GiB.\r\nGPU 0 has a total capacity of 94.50 GiB of which 41.85 GiB is free.\r\nIncluding non-PyTorch memory, this process has 52.64 GiB memory in use.\r\n```\r\n\r\nThe allocation attempt, 51.87 GiB, is essentially the size of the base model already resident on the GPU.\r\n\r\n## Fix\r\n\r\nHoist the existing `is_adapter_key` helper above the `_load_pretrained_model` call and apply it to `expected_keys`, so warmup is sized only from adapter parameters. The downstream `missing_keys` filter that already used the helper is preserved.\r\n\r\n## Tests\r\n\r\nAdds a regression test that captures the device map passed to `caching_allocator_warmup` during `load_adapter` and asserts it contains only adapter-owned parameter names, not base-model names. 
Without the fix, the test fails with 84 base-model parameter names leaking into the warmup.\r\n\r\n```bash\r\nmake style\r\nRUN_SLOW=1 python -m unittest tests.peft_integration.test_peft_integration.PeftIntegrationTester.test_peft_load_adapter_warmup_uses_adapter_expected_keys -v\r\n```\r\n\r\nAlso verified the original GH200 repro locally: before the fix, `load_adapter` tried to allocate 51.87 GiB and OOMed; after the fix, the adapter loads successfully.\r\n\r\n## Related\r\n\r\n- #36483, #36428, #36742 — same warmup, fixed for the base-model loading path only; the adapter path was untouched.\r\n- #44637 / #44660 — adjacent open issue/PR about a different `load_adapter` OOM (state-dict materialization in `load_best_model_at_end`), not warmup over-allocation.\r\n\r\nNo associated issue was filed; this is a focused bugfix PR with a local repro, root-cause analysis, and regression test.\r\n\r\n## Code Agent Policy\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case.\r\n- [ ] Did you make sure to update the documentation with your changes? 
Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [x] Did you write any new necessary tests?\r\n\r\n## Who can review?\r\n\r\n- @CyrilVallez (model loading): this change touches the `caching_allocator_warmup` path.\r\n- @BenjaminBossan (PEFT integration): this change is in `integrations/peft.py` and concerns adapter loading semantics.\r\n\n\n--- Comment by github-actions[bot] at 2026-05-21T16:17:56Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46145&sha=399e5f"}
	{"id": "pr_46145_file_src_transformers_integrations_peft.py", "type": "pr_diff", "number": 46145, "title": "Fix load_adapter OOM caused by full-model warmup sizing", "state": "open", "author": "Yooniel", "labels": [], "created_at": "2026-05-21T15:59:30Z", "updated_at": "2026-05-21T16:17:56Z", "url": "https://github.com/huggingface/transformers/pull/46145", "merged": false, "base_branch": "main", "filename": "src/transformers/integrations/peft.py", "additions": 10, "deletions": 10, "text": "PR #46145 — file change: src/transformers/integrations/peft.py\nStatus: modified \| +10 -10\n\n@@ -583,6 +583,13 @@ def load_adapter(\n # Create and add fresh new adapters into the model, unless the weights are hotswapped\n inject_adapter_in_model(peft_config, self, adapter_name)\n \n+ adapter_key_markers = {adapter_name}\n+ if peft_config is not None and getattr(peft_config, \"peft_type\", None) is not None:\n+ adapter_key_markers.add(peft_config.peft_type.value.lower())\n+\n+ def is_adapter_key(key: str) -> bool:\n+ return any(marker in key for marker in adapter_key_markers)\n+\n if not self._hf_peft_config_loaded:\n self._hf_peft_config_loaded = True\n \n@@ -670,9 +677,9 @@ def load_adapter(\n state_dict=adapter_state_dict,\n checkpoint_files=checkpoint_files,\n load_config=load_config,\n- # pass expected keys explicitly, otherwise they are determined from the state_dict, which can contain\n- # unexpected entries, like \"layer.SCB\" from a bnb layer.\n- expected_keys=[n for n, _ in self.named_parameters()],\n+ # Pass expected keys explicitly while excluding non-adapter parameters.\n+ # Otherwise `caching_allocator_warmup` sizes for the full base model.\n+ expected_keys=[n for n, _ in self.named_parameters() if is_adapter_key(n)],\n )\n \n if peft_config.inference_mode:\n@@ -683,13 +690,6 @@ def load_adapter(\n if isinstance(module, BaseTunerLayer):\n module.requires_grad_(False)\n \n- adapter_key_markers = {adapter_name}\n- if peft_config is not None and getattr(peft_config, 
\"peft_type\", None) is not None:\n- adapter_key_markers.add(peft_config.peft_type.value.lower())\n-\n- def is_adapter_key(key: str) -> bool:\n- return any(marker in key for marker in adapter_key_markers)\n-\n loading_info.missing_keys = {k for k in loading_info.missing_keys if is_adapter_key(k)}\n \n log_state_dict_report("}
	{"id": "pr_46145_file_tests_peft_integration_test_peft_integration.py", "type": "pr_diff", "number": 46145, "title": "Fix load_adapter OOM caused by full-model warmup sizing", "state": "open", "author": "Yooniel", "labels": [], "created_at": "2026-05-21T15:59:30Z", "updated_at": "2026-05-21T16:17:56Z", "url": "https://github.com/huggingface/transformers/pull/46145", "merged": false, "base_branch": "main", "filename": "tests/peft_integration/test_peft_integration.py", "additions": 52, "deletions": 0, "text": "PR #46145 — file change: tests/peft_integration/test_peft_integration.py\nStatus: modified \| +52 -0\n\n@@ -694,6 +694,58 @@ def test_peft_add_adapter_with_state_dict_low_cpu_mem_usage(self):\n # after loading, no meta device should be remaining\n self.assertFalse(any((p.device.type == \"meta\") for p in model.parameters()))\n \n+ def test_peft_load_adapter_warmup_uses_adapter_expected_keys(self):\n+ \"\"\"\n+ Check that adapter loading only warms up memory for adapter parameters.\n+ \"\"\"\n+ from peft import LoraConfig\n+\n+ import transformers.modeling_utils as modeling_utils\n+\n+ adapter_name = \"warmup_test_adapter\"\n+ adapter_key_markers = (adapter_name, \"lora\")\n+\n+ for model_id in self.transformers_test_model_ids:\n+ for transformers_class in self.transformers_test_model_classes:\n+ model = transformers_class.from_pretrained(model_id).to(torch_device)\n+\n+ peft_config = LoraConfig()\n+ template_model = transformers_class.from_pretrained(model_id)\n+ template_model.add_adapter(LoraConfig(), adapter_name=adapter_name)\n+ dummy_state_dict = {\n+ name: torch.zeros_like(param)\n+ for name, param in template_model.named_parameters()\n+ if any(marker in name for marker in adapter_key_markers)\n+ }\n+ del template_model\n+ self.assertTrue(dummy_state_dict)\n+\n+ captured_device_maps = []\n+ original_warmup = modeling_utils.caching_allocator_warmup\n+\n+ def capture_warmup(model, expanded_device_map, hf_quantizer):\n+ 
captured_device_maps.append(dict(expanded_device_map))\n+\n+ modeling_utils.caching_allocator_warmup = capture_warmup\n+ try:\n+ with CaptureLogger(logging.get_logger(\"transformers.integrations.peft\")):\n+ model.load_adapter(\n+ adapter_state_dict=dummy_state_dict,\n+ adapter_name=adapter_name,\n+ peft_config=peft_config,\n+ )\n+ finally:\n+ modeling_utils.caching_allocator_warmup = original_warmup\n+\n+ self.assertTrue(captured_device_maps)\n+ warmed_keys = set().union(*(device_map.keys() for device_map in captured_device_maps))\n+ self.assertTrue(warmed_keys)\n+\n+ unexpected_base_keys = [\n+ key for key in warmed_keys if not any(marker in key for marker in adapter_key_markers)\n+ ]\n+ self.assertEqual(unexpected_base_keys, [])\n+\n def test_peft_from_pretrained_hub_kwargs(self):\n \"\"\"\n Tests different combinations of PEFT model + from_pretrained + hub kwargs"}
	{"id": "pr_46142", "type": "pr", "number": 46142, "title": "Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config", "state": "open", "author": "Charly21r", "labels": [], "created_at": "2026-05-21T13:17:26Z", "updated_at": "2026-05-21T13:42:33Z", "url": "https://github.com/huggingface/transformers/pull/46142", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46142: Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config\nState: open \| Merged: False\nAuthor: Charly21r \| Base: main\nLabels: \nCreated: 2026-05-21T13:17:26Z\n\n# What does this PR do?\r\n\r\nFixes #46121\r\n\r\n`RotaryEmbeddingConfigMixin.ignore_keys_at_rope_validation` is a `set` at the class level, but JSON has no `set` type, so any `config.json` that serializes this field (e.g. checkpoints written by LoRA merge / export tooling like `ms-swift`) loads it back as a `list` instance attribute that shadows the class default. `RotaryEmbeddingConfigMixin.convert_rope_params_to_dict` then does:\r\n\r\n`self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}`\r\n\r\nwhich raises `TypeError: unsupported operand type(s) for \|: 'list' and 'set'` whenever `partial_rotary_factor` is also set on the config. 
In practice this prevents serving such merged checkpoints (observed downstream in vLLM with merged Qwen3.5 checkpoints).\r\n\r\nThis PR coerces the attribute to a set before the union in `src/transformers/modeling_rope_utils.py`, and adds a regression test in `tests/utils/test_modeling_rope_utils.py` covering both direct attribute assignment and the `from_dict` round-trip path that mirrors the JSON-deserialization flow.\r\n\r\n## Code Agent Policy\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?\r\n- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link\r\n to it if that's the case.\r\n- [x] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\n@Rocketknight1 \n\n--- Comment by github-actions[bot] at 2026-05-21T13:33:18Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46142&sha=1267e6\n\n--- Comment by Rocketknight1 at 2026-05-21T13:42:33Z ---\nLGTM, CI issues seem unrelated. It's core code though, so I'll wait for a core maintainer to approve in case I'm totally wrong here! cc @arthurzucker @cyrilvallez @vasqu "}
	{"id": "pr_46142_file_src_transformers_modeling_rope_utils.py", "type": "pr_diff", "number": 46142, "title": "Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config", "state": "open", "author": "Charly21r", "labels": [], "created_at": "2026-05-21T13:17:26Z", "updated_at": "2026-05-21T13:42:33Z", "url": "https://github.com/huggingface/transformers/pull/46142", "merged": false, "base_branch": "main", "filename": "src/transformers/modeling_rope_utils.py", "additions": 1, "deletions": 1, "text": "PR #46142 — file change: src/transformers/modeling_rope_utils.py\nStatus: modified \| +1 -1\n\n@@ -719,7 +719,7 @@ def convert_rope_params_to_dict(self, **kwargs):\n partial_rotary_factor = kwargs.get(\"partial_rotary_factor\", getattr(self, \"partial_rotary_factor\", None))\n if partial_rotary_factor is not None:\n self.rope_parameters.setdefault(\"partial_rotary_factor\", partial_rotary_factor)\n- self.ignore_keys_at_rope_validation = self.ignore_keys_at_rope_validation \| {\"partial_rotary_factor\"}\n+ self.ignore_keys_at_rope_validation = set(self.ignore_keys_at_rope_validation) \| {\"partial_rotary_factor\"}\n \n self.standardize_rope_params()\n return kwargs"}
	{"id": "pr_46142_file_tests_utils_test_modeling_rope_utils.py", "type": "pr_diff", "number": 46142, "title": "Fix TypeError on list-typed ignore_keys_at_rope_validation in RoPE config", "state": "open", "author": "Charly21r", "labels": [], "created_at": "2026-05-21T13:17:26Z", "updated_at": "2026-05-21T13:42:33Z", "url": "https://github.com/huggingface/transformers/pull/46142", "merged": false, "base_branch": "main", "filename": "tests/utils/test_modeling_rope_utils.py", "additions": 22, "deletions": 0, "text": "PR #46142 — file change: tests/utils/test_modeling_rope_utils.py\nStatus: modified \| +22 -0\n\n@@ -136,6 +136,28 @@ def test_yarn_original_original_max_position_embeddings_validation(self):\n self.assertEqual(len(logs.output), 1)\n self.assertIn(\"implicit factor\", logs.output[0])\n \n+ def test_convert_rope_params_to_dict_with_list_ignore_keys(self):\n+ # Regression test: `ignore_keys_at_rope_validation` becomes a list when loaded from a config.json\n+ # (JSON has no set type). 
`convert_rope_params_to_dict` used to do `list \| set` and crash with\n+ # TypeError when `partial_rotary_factor` was also set.\n+ config = LlamaConfig(partial_rotary_factor=0.25)\n+ config.ignore_keys_at_rope_validation = [\"mrope_section\", \"mrope_interleaved\"]\n+\n+ config.convert_rope_params_to_dict(partial_rotary_factor=0.25)\n+\n+ self.assertIsInstance(config.ignore_keys_at_rope_validation, set)\n+ self.assertEqual(\n+ config.ignore_keys_at_rope_validation,\n+ {\"mrope_section\", \"mrope_interleaved\", \"partial_rotary_factor\"},\n+ )\n+\n+ # Round-trip through from_dict to mimic the JSON-deserialized path that triggered this in production.\n+ cfg_dict = config.to_dict()\n+ cfg_dict[\"ignore_keys_at_rope_validation\"] = [\"mrope_section\", \"mrope_interleaved\"]\n+ reloaded = LlamaConfig.from_dict(cfg_dict)\n+ reloaded.convert_rope_params_to_dict(partial_rotary_factor=0.25)\n+ self.assertIsInstance(reloaded.ignore_keys_at_rope_validation, set)\n+\n def test_rope_validation_with_per_attention_type_nested_rope(self):\n \"\"\"Mirrors `test_rope_validation` with `config.layer_types` set, so that\n `rope_parameters` takes the per-attention-type nested shape.\"\"\""}
	{"id": "pr_46141", "type": "pr", "number": 46141, "title": "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions", "state": "open", "author": "ryota-komatsu", "labels": [], "created_at": "2026-05-21T12:43:29Z", "updated_at": "2026-05-21T13:04:23Z", "url": "https://github.com/huggingface/transformers/pull/46141", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46141: Fix FSDP2 and distributed checkpointing imports for older PyTorch versions\nState: open \| Merged: False\nAuthor: ryota-komatsu \| Base: main\nLabels: \nCreated: 2026-05-21T12:43:29Z\n\n# What does this PR do?\r\n\r\nThis PR updates the PyTorch version constraints for specific distributed features to prevent `ImportError` and `ModuleNotFoundError` crashes on older PyTorch versions:\r\n- Bumps the minimum PyTorch requirement for FSDP2 from `>=2.5` to `>=2.6`.\r\n- Add a minimum PyTorch requirement of `>=2.7` for distributed checkpoint saving.\r\n\r\nCurrently, attempting to initialize FSDP2 with `torch==2.5` results in an import error because `CPUOffloadPolicy`, `MixedPrecisionPolicy`, and `OffloadPolicy` are not available in 'torch.distributed.fsdp' for that version.\r\n\r\nSimilarly, attempting to use distributed checkpointing on versions earlier than `torch==2.7` crashes because `HuggingFaceStorageWriter` does not exist in `torch.distributed.checkpoint.hf_storage`.\r\n\r\nTracebacks\r\n```\r\ntransformers/distributed/fsdp.py\", line 34, in <module>\r\n from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, OffloadPolicy\r\nImportError: cannot import name 'CPUOffloadPolicy' from 'torch.distributed.fsdp'\r\n```\r\n\r\n```\r\ntransformers/distributed/utils.py\", line 42, in <module>\r\n from torch.distributed.checkpoint.hf_storage import HuggingFaceStorageWriter\r\nModuleNotFoundError: No module named 'torch.distributed.checkpoint.hf_storage'\r\n```\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] 
This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link\r\n to it if that's the case.\r\n- [ ] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [ ] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\nAnyone in the community is free to review the PR once the tests have passed. Feel free to tag\r\nmembers/contributors who may be interested in your PR.\r\n\r\n- distributed: @3outeille @ArthurZucker\n\n--- Comment by github-actions[bot] at 2026-05-21T13:04:23Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46141&sha=8f98cb"}
	{"id": "pr_46141_file_src_transformers_distributed_fsdp.py", "type": "pr_diff", "number": 46141, "title": "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions", "state": "open", "author": "ryota-komatsu", "labels": [], "created_at": "2026-05-21T12:43:29Z", "updated_at": "2026-05-21T13:04:23Z", "url": "https://github.com/huggingface/transformers/pull/46141", "merged": false, "base_branch": "main", "filename": "src/transformers/distributed/fsdp.py", "additions": 5, "deletions": 5, "text": "PR #46141 — file change: src/transformers/distributed/fsdp.py\nStatus: modified \| +5 -5\n\n@@ -28,7 +28,7 @@\n if is_torch_available():\n import torch\n \n-if is_torch_available() and is_torch_greater_or_equal(\"2.5\"):\n+if is_torch_available() and is_torch_greater_or_equal(\"2.6\"):\n import torch.distributed as dist\n from torch.distributed._composable.fsdp import fully_shard\n from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy, OffloadPolicy\n@@ -91,8 +91,8 @@ def initialize_fsdp(\n if fsdp_plan is None:\n return device_map, device_mesh, None\n \n- if not is_torch_greater_or_equal(\"2.5\"):\n- raise OSError(\"FSDP2 is only supported for `torch>=2.5`.\")\n+ if not is_torch_greater_or_equal(\"2.6\"):\n+ raise OSError(\"FSDP2 is only supported for `torch>=2.6`.\")\n \n if device_mesh is None:\n # Detect the accelerator on the machine\n@@ -338,8 +338,8 @@ def apply_fully_shard_data_parallel(\n if not is_torch_available():\n raise ImportError(\"PyTorch is required for FSDP support\")\n \n- if not is_torch_greater_or_equal(\"2.5\"):\n- raise OSError(\"FSDP2 requires torch>=2.5\")\n+ if not is_torch_greater_or_equal(\"2.6\"):\n+ raise OSError(\"FSDP2 requires torch>=2.6\")\n \n if fsdp_plan is None:\n fsdp_plan = {}"}
	{"id": "pr_46141_file_src_transformers_distributed_utils.py", "type": "pr_diff", "number": 46141, "title": "Fix FSDP2 and distributed checkpointing imports for older PyTorch versions", "state": "open", "author": "ryota-komatsu", "labels": [], "created_at": "2026-05-21T12:43:29Z", "updated_at": "2026-05-21T13:04:23Z", "url": "https://github.com/huggingface/transformers/pull/46141", "merged": false, "base_branch": "main", "filename": "src/transformers/distributed/utils.py", "additions": 9, "deletions": 1, "text": "PR #46141 — file change: src/transformers/distributed/utils.py\nStatus: modified \| +9 -1\n\n@@ -39,14 +39,16 @@\n if is_torch_available():\n import torch\n import torch.distributed.checkpoint as dcp\n- from torch.distributed.checkpoint.hf_storage import HuggingFaceStorageWriter\n from torch.distributed.checkpoint.state_dict import (\n get_model_state_dict,\n get_optimizer_state_dict,\n set_optimizer_state_dict,\n )\n from torch.distributed.tensor import DTensor\n \n+ if is_torch_greater_or_equal(\"2.7\"):\n+ from torch.distributed.checkpoint.hf_storage import HuggingFaceStorageWriter\n+\n \n def _ensure_torch_distributed(device_type: str):\n \"\"\"Initialize torch.distributed if not already initialized.\"\"\"\n@@ -103,6 +105,9 @@ def init_device_mesh(distributed_config: DistributedConfig) -> torch.distributed\n if not is_torch_greater_or_equal(\"2.5\"):\n raise OSError(\"Distributed training with DistributedConfig requires `torch>=2.5`.\")\n \n+ if distributed_config.fsdp_size > 1 and not is_torch_greater_or_equal(\"2.6\"):\n+ raise OSError(\"FSDP2 requires `torch>=2.6`.\")\n+\n device_type = torch._C._get_accelerator().type\n _ensure_torch_distributed(device_type)\n \n@@ -205,6 +210,9 @@ def save_model_checkpoint_distributed(model, checkpoint_dir: str) -> None:\n gate\|\|up MoE weights) are replicated to a full tensor on every rank\n before the save, otherwise DCP cannot encode that placement.\n \"\"\"\n+ if not is_torch_greater_or_equal(\"2.7\"):\n+ 
raise OSError(\"Distributed checkpoint saving requires `torch>=2.7`.\")\n+\n state_dict = get_model_state_dict(model)\n for key, value in list(state_dict.items()):\n if ("}
	{"id": "pr_46140", "type": "pr", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46140: Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads\nState: closed \| Merged: False\nAuthor: adityasingh2400 \| Base: main\nLabels: Code agent slop\nCreated: 2026-05-21T11:23:35Z\n\n# What does this PR do?\n\nFixes #46082.\n\n`LlamaAttention` already sizes its projections from `num_attention_heads * head_dim` rather than `hidden_size`, so a config where `hidden_size % num_attention_heads != 0` is well-defined as long as `head_dim` is explicitly provided. The divisibility check in `LlamaConfig.validate_architecture` fires unconditionally though (it runs after `__post_init__` has filled in the fallback `head_dim`, so checking `head_dim is not None` in the validator doesn't work).\n\nThis PR follows the approach @matdou outlined in the issue:\n\n- Capture `self._head_dim_was_explicit = self.head_dim is not None` in `__post_init__` before falling back to the derived value.\n- Gate the divisibility error in `validate_architecture` on `not self._head_dim_was_explicit`.\n\n`_head_dim_was_explicit` is recomputed in `__post_init__`, so save/reload via `save_pretrained` / `from_pretrained` works without persisting the flag (the saved `head_dim` is the explicit value, so the flag is set correctly on reload).\n\nThe original validation error is preserved when `head_dim` is not explicitly provided.\n\n## Reproduction (from the issue)\n\n```python\nfrom transformers import LlamaConfig, LlamaForCausalLM\n\nconfig = LlamaConfig(\n vocab_size=99,\n hidden_size=512,\n 
intermediate_size=1024,\n num_hidden_layers=1,\n num_attention_heads=9,\n num_key_value_heads=1,\n head_dim=56,\n)\nmodel = LlamaForCausalLM(config)\n```\n\nPasses after this change, raises before.\n\n## Tests\n\nAdded two cases in `tests/models/llama/test_modeling_llama.py`:\n\n- `head_dim` explicit + non-divisible dims, config accepted, model instantiates.\n- `head_dim` omitted + non-divisible dims, original `ValueError` still raised.\n\n## Who can review?\n\n@ArthurZucker @Cyrilvallez\n\nCredit to @matdou for the diagnosis in the issue comments.\n\n--- Comment by github-actions[bot] at 2026-05-21T11:31:52Z ---\n[For maintainers] Suggested jobs to run (before merge)\n\nrun-slow: arcee, aria, cwm, deepseek_v2, eurobert, higgs_audio_v2, hrm_text, jais2, llama\n\n--- Comment by github-actions[bot] at 2026-05-21T11:50:16Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46140&sha=70eb70\n\n--- Comment by adityasingh2400 at 2026-05-21T12:08:04Z ---\nCI note: the 4 failing tests on `tests_torch` / `tests_training_ci` / `tests_tensor_parallel_ci` are all in `tests/models/cohere2_moe/test_modeling_cohere2_moe.py` and reproduce identically on an unrelated PR opened a few minutes after this one (see #46136). The failing assertions are:\n\n- `Cohere2MoeModelTest::test_training_overfit`, `AssertionError: 0.27068585289520714 not greater than 0.9` (the exact same float value reproduces across runs, so it is deterministic, not flaky)\n- `Cohere2MoeModelTest::test_tp_forward` / `test_tp_backward` / `test_tp_generation`, `KeyError: 'rowwise'` raised by the TP partition spec on a Cohere2 MoE layer\n\nBoth classes of failure were introduced when `cohere2_moe` landed yesterday (#46115 on 2026-05-20). My change is scoped to `LlamaConfig` and the modular-converted descendants (arcee, aria, cwm, deepseek_v2, eurobert, higgs_audio_v2, hrm_text, jais2). 
`cohere2_moe` does not derive from Llama and is not touched by this PR.\n\nHappy to file a separate PR for the cohere2_moe breakage if no one is on it already, but flagging it here so this PR is not held on CI red that is upstream of it."}
	{"id": "pr_46140_file_src_transformers_models_arcee_configuration_arcee.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/arcee/configuration_arcee.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/arcee/configuration_arcee.py\nStatus: modified \| +6 -1\n\n@@ -101,6 +101,11 @@ class ArceeConfig(PreTrainedConfig):\n head_dim: int \| None = None\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (ArceeAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -110,7 +115,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_aria_configuration_aria.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/aria/configuration_aria.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/aria/configuration_aria.py\nStatus: modified \| +6 -1\n\n@@ -104,6 +104,11 @@ class AriaTextConfig(PreTrainedConfig):\n moe_num_shared_experts: int = 2\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (AriaTextAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -113,7 +118,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_cwm_configuration_cwm.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/cwm/configuration_cwm.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/cwm/configuration_cwm.py\nStatus: modified \| +6 -1\n\n@@ -127,6 +127,11 @@ def __post_init__(self, **kwargs):\n self.sliding_window = int(self.sliding_window) if self.sliding_window else None\n self.layer_types = list(self.layer_types)\n self.eos_token_id = self.eos_token_id if self.eos_token_id is not None else [128001, 128008, 128009]\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (CwmAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -136,7 +141,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_deepseek_v2_configuration_deepseek_v2.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/deepseek_v2/configuration_deepseek_v2.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/deepseek_v2/configuration_deepseek_v2.py\nStatus: modified \| +6 -1\n\n@@ -139,6 +139,11 @@ class DeepseekV2Config(PreTrainedConfig):\n \n def __post_init__(self, **kwargs):\n self.head_dim = self.qk_rope_head_dim\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (DeepseekV2Attention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -148,7 +153,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_eurobert_configuration_eurobert.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/eurobert/configuration_eurobert.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/eurobert/configuration_eurobert.py\nStatus: modified \| +6 -1\n\n@@ -113,6 +113,11 @@ class EuroBertConfig(PreTrainedConfig):\n def __post_init__(self, **kwargs):\n if self.num_key_value_heads is None:\n self.num_key_value_heads = self.num_attention_heads\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (EuroBertAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -122,7 +127,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_higgs_audio_v2_configuration_higgs_audio_v2.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/higgs_audio_v2/configuration_higgs_audio_v2.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/higgs_audio_v2/configuration_higgs_audio_v2.py\nStatus: modified \| +6 -1\n\n@@ -133,6 +133,11 @@ def __post_init__(self, **kwargs):\n \"original_max_position_embeddings\": 1024,\n \"rope_type\": \"llama3\",\n }\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (HiggsAudioV2Attention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -142,7 +147,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_hrm_text_configuration_hrm_text.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/hrm_text/configuration_hrm_text.py", "additions": 1, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/hrm_text/configuration_hrm_text.py\nStatus: modified \| +1 -1\n\n@@ -140,7 +140,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_jais2_configuration_jais2.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/jais2/configuration_jais2.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/jais2/configuration_jais2.py\nStatus: modified \| +6 -1\n\n@@ -102,6 +102,11 @@ class Jais2Config(PreTrainedConfig):\n layer_norm_eps: float = 1e-5\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (Jais2Attention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -111,7 +116,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_src_transformers_models_llama_configuration_llama.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "src/transformers/models/llama/configuration_llama.py", "additions": 6, "deletions": 1, "text": "PR #46140 — file change: src/transformers/models/llama/configuration_llama.py\nStatus: modified \| +6 -1\n\n@@ -105,6 +105,11 @@ class LlamaConfig(PreTrainedConfig):\n head_dim: int \| None = None\n \n def __post_init__(self, **kwargs):\n+ # Track whether `head_dim` was explicitly provided so `validate_architecture`\n+ # can allow non-divisible `hidden_size`/`num_attention_heads` when the user\n+ # has supplied an explicit `head_dim` (LlamaAttention sizes its projections\n+ # from `num_attention_heads * head_dim`, so this case is well-defined).\n+ self._head_dim_was_explicit = self.head_dim is not None\n if self.head_dim is None:\n self.head_dim = self.hidden_size // self.num_attention_heads\n if self.num_key_value_heads is None:\n@@ -114,7 +119,7 @@ def __post_init__(self, **kwargs):\n \n def validate_architecture(self):\n \"\"\"Part of `@strict`-powered validation. Validates the architecture of the config.\"\"\"\n- if self.hidden_size % self.num_attention_heads != 0:\n+ if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:\n raise ValueError(\n f\"The hidden size ({self.hidden_size}) is not a multiple of the number of attention \"\n f\"heads ({self.num_attention_heads}).\""}
	{"id": "pr_46140_file_tests_models_llama_test_modeling_llama.py", "type": "pr_diff", "number": 46140, "title": "Fix LlamaConfig rejecting explicit head_dim when hidden_size is not divisible by num_attention_heads", "state": "closed", "author": "adityasingh2400", "labels": ["Code agent slop"], "created_at": "2026-05-21T11:23:35Z", "updated_at": "2026-05-21T12:08:04Z", "url": "https://github.com/huggingface/transformers/pull/46140", "merged": false, "base_branch": "main", "filename": "tests/models/llama/test_modeling_llama.py", "additions": 38, "deletions": 0, "text": "PR #46140 — file change: tests/models/llama/test_modeling_llama.py\nStatus: modified \| +38 -0\n\n@@ -35,6 +35,7 @@\n import torch\n \n from transformers import (\n+ LlamaConfig,\n LlamaForCausalLM,\n LlamaModel,\n LlamaTokenizer,\n@@ -57,6 +58,43 @@ class LlamaModelTest(CausalLMModelTest, unittest.TestCase):\n # used in `test_torch_compile_for_training`\n _torch_compile_train_cls = LlamaForCausalLM if is_torch_available() else None\n \n+ def test_config_explicit_head_dim_with_non_divisible_hidden_size(self):\n+ # Regression test for https://github.com/huggingface/transformers/issues/46082\n+ # `LlamaAttention` sizes its projections from `num_attention_heads * head_dim`,\n+ # so an explicit `head_dim` should be allowed even when `hidden_size` is not\n+ # divisible by `num_attention_heads`.\n+ config = LlamaConfig(\n+ vocab_size=99,\n+ hidden_size=512,\n+ intermediate_size=1024,\n+ num_hidden_layers=1,\n+ num_attention_heads=9,\n+ num_key_value_heads=1,\n+ head_dim=56,\n+ )\n+ self.assertEqual(config.head_dim, 56)\n+ # Model construction should succeed with the matching projection shapes.\n+ model = LlamaForCausalLM(config)\n+ self.assertEqual(\n+ model.model.layers[0].self_attn.q_proj.weight.shape,\n+ (config.num_attention_heads * config.head_dim, config.hidden_size),\n+ )\n+\n+ def test_config_implicit_head_dim_with_non_divisible_hidden_size_still_raises(self):\n+ # Regression preventer: omitting 
`head_dim` with non-divisible dims must\n+ # still raise, since the auto-derived `head_dim = hidden_size // num_attention_heads`\n+ # would silently truncate.\n+ with self.assertRaises(Exception) as ctx:\n+ LlamaConfig(\n+ vocab_size=99,\n+ hidden_size=512,\n+ intermediate_size=1024,\n+ num_hidden_layers=1,\n+ num_attention_heads=9,\n+ num_key_value_heads=1,\n+ )\n+ self.assertIn(\"not a multiple\", str(ctx.exception))\n+\n \n @require_torch_accelerator\n class LlamaIntegrationTest(unittest.TestCase):"}
	{"id": "pr_46138", "type": "pr", "number": 46138, "title": "chore: update self-comment-ci.yml", "state": "open", "author": "hf-security-analysis[bot]", "labels": [], "created_at": "2026-05-21T09:57:53Z", "updated_at": "2026-05-21T10:10:08Z", "url": "https://github.com/huggingface/transformers/pull/46138", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46138: chore: update self-comment-ci.yml\nState: open \| Merged: False\nAuthor: hf-security-analysis[bot] \| Base: main\nLabels: \nCreated: 2026-05-21T09:57:53Z\n\nUpdate `.github/workflows/self-comment-ci.yml` workflow configuration.\n\ncc @guarin @molbap\n\nCloses huggingface/tracking-issues#487\n<!--slack ts:1779357475.432589 channel:C0AJSP0D53L-->\n\n--- Comment by HuggingFaceDocBuilderDev at 2026-05-21T10:10:08Z ---\nThe docs for this PR live [here](https://moon-ci-docs.huggingface.co/docs/transformers/pr_46138). All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update."}
	{"id": "pr_46138_file_.github_workflows_self-comment-ci.yml", "type": "pr_diff", "number": 46138, "title": "chore: update self-comment-ci.yml", "state": "open", "author": "hf-security-analysis[bot]", "labels": [], "created_at": "2026-05-21T09:57:53Z", "updated_at": "2026-05-21T10:10:08Z", "url": "https://github.com/huggingface/transformers/pull/46138", "merged": false, "base_branch": "main", "filename": ".github/workflows/self-comment-ci.yml", "additions": 2, "deletions": 2, "text": "PR #46138 — file change: .github/workflows/self-comment-ci.yml\nStatus: modified \| +2 -2\n\n@@ -89,9 +89,9 @@ jobs:\n PR_COMMENT: ${{ github.event.comment.body }}\n run: \|\n python -m pip install GitPython\n- python utils/pr_slow_ci_models.py --message \"$PR_COMMENT\" \| tee output.txt\n+ printf '%s' \"$PR_COMMENT\" \| python utils/pr_slow_ci_models.py --message-stdin \| tee output.txt\n echo \"models=$(tail -n 1 output.txt)\" >> $GITHUB_ENV\n- python utils/pr_slow_ci_models.py --message \"$PR_COMMENT\" --quantization \| tee output2.txt\n+ printf '%s' \"$PR_COMMENT\" \| python utils/pr_slow_ci_models.py --message-stdin --quantization \| tee output2.txt\n echo \"quantizations=$(tail -n 1 output2.txt)\" >> $GITHUB_ENV\n \n - name: Show models to test"}
	{"id": "pr_46137", "type": "pr", "number": 46137, "title": "Update self-comment-ci", "state": "closed", "author": "guarin", "labels": [], "created_at": "2026-05-21T09:41:02Z", "updated_at": "2026-05-21T09:57:27Z", "url": "https://github.com/huggingface/transformers/pull/46137", "merged": true, "base_branch": "main", "text": "PULL REQUEST #46137: Update self-comment-ci\nState: closed \| Merged: True\nAuthor: guarin \| Base: main\nLabels: \nCreated: 2026-05-21T09:41:02Z\n\n# What does this PR do?\r\n\r\n<!--\r\nCongratulations! You've made it this far! You're not quite done yet though.\r\n\r\nOnce merged, your PR is going to appear in the release notes with the title you set, so make sure it's a great title that fully reflects the extent of your awesome contribution.\r\n\r\nThen, please replace this with a description of the change and which issue is fixed (if applicable). Please also include relevant motivation and context. List any dependencies (if any) that are required for this change.\r\n\r\nOnce you're done, someone will review your PR shortly (see the section \"Who can review?\" below to tag some potential reviewers). They may suggest changes to make the code even better. If no one reviewed your PR after a week has passed, don't hesitate to post a new comment @-mentioning the same persons---sometimes notifications get lost.\r\n-->\r\n\r\n<!-- Remove if not applicable -->\r\n\r\nFixes # (issue)\r\n\r\n## Code Agent Policy\r\n\r\nThe Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by\r\ncode agents. We are currently bottlenecked by our ability to review and respond to them. As a result, \r\nwe ask that new users do not submit pure code agent PRs at this time. \r\nYou may use code agents in drafting or to help you diagnose issues. 
We'd also ask autonomous \"OpenClaw\"-like agents\r\nnot to open any PRs or issues for the moment.\r\n\r\nPRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this\r\nrepeatedly or maliciously. \r\n\r\nThis is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result, \r\nthis policy is likely to be updated regularly in the near future. For more information, please read [`CONTRIBUTING.md`](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md).\r\n\r\n- [ ] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link\r\n to it if that's the case.\r\n- [ ] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [ ] Did you write any new necessary tests?\r\n\r\n\r\n## Who can review?\r\n\r\nAnyone in the community is free to review the PR once the tests have passed. 
Feel free to tag\r\nmembers/contributors who may be interested in your PR.\r\n\r\n<!-- Your PR will be replied to more quickly if you can figure out the right person to tag with @\r\n\r\n If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.\r\n Please tag fewer than 3 people.\r\n\r\nModels:\r\n\r\n- text models: @ArthurZucker @Cyrilvallez\r\n- vision models: @yonigozlan @molbap\r\n- audio models: @eustlb @ebezzam @vasqu\r\n- multimodal models: @zucchini-nlp\r\n- graph models: @clefourrier\r\n\r\nLibrary:\r\n\r\n- generate: @zucchini-nlp (visual-language models) or @gante (all others)\r\n- continuous batching: @remi-or @ArthurZucker @McPatate\r\n- pipelines: @Rocketknight1\r\n- tokenizers: @ArthurZucker and @itazap\r\n- trainer: @SunMarc\r\n- attention: @vasqu @ArthurZucker @CyrilVallez\r\n- model loading (from pretrained, etc): @CyrilVallez\r\n- distributed: @3outeille @ArthurZucker\r\n- CIs: @ydshieh\r\n\r\nIntegrations:\r\n\r\n- ray/raytune: @richardliaw, @amogkam\r\n- Big Model Inference: @SunMarc\r\n- quantization: @SunMarc\r\n- kernels: @drbh\r\n- peft: @BenjaminBossan @githubnemo\r\n\r\nDevices/Backends:\r\n\r\n- AMD ROCm: @ivarflakstad\r\n- Intel XPU: @IlyasMoutawwakil\r\n- Ascend NPU: @ivarflakstad \r\n\r\nDocumentation: @stevhliu\r\n\r\nResearch projects are not maintained and should be taken as is.\r\n\r\n -->\r\n\n\n--- Comment by HuggingFaceDocBuilderDev at 2026-05-21T09:52:56Z ---\nThe docs for this PR live [here](https://moon-ci-docs.huggingface.co/docs/transformers/pr_46137). All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update."}
	{"id": "pr_46137_file_.github_workflows_self-comment-ci.yml", "type": "pr_diff", "number": 46137, "title": "Update self-comment-ci", "state": "closed", "author": "guarin", "labels": [], "created_at": "2026-05-21T09:41:02Z", "updated_at": "2026-05-21T09:57:27Z", "url": "https://github.com/huggingface/transformers/pull/46137", "merged": true, "base_branch": "main", "filename": ".github/workflows/self-comment-ci.yml", "additions": 1, "deletions": 1, "text": "PR #46137 — file change: .github/workflows/self-comment-ci.yml\nStatus: modified \| +1 -1\n\n@@ -28,7 +28,7 @@ env:\n jobs:\n get-pr-number:\n name: Get PR number\n- if: ${{ github.event.issue.state == 'open' && contains(fromJSON('[\"ydshieh\", \"ArthurZucker\", \"zucchini-nlp\", \"molbap\", \"LysandreJik\", \"Cyrilvallez\", \"Rocketknight1\", \"SunMarc\", \"eustlb\", \"vasqu\", \"ivarflakstad\", \"stevhliu\", \"ebezzam\", \"remi-or\", \"itazap\", \"3outeille\", \"IlyasMoutawwakil\", \"tarekziade\", \"yonigozlan\"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') \|\| startsWith(github.event.comment.body, 'run slow') \|\| startsWith(github.event.comment.body, 'run_slow')) }}\n+ if: ${{ github.event.issue.state == 'open' && contains(fromJSON('[\"ydshieh\", \"ArthurZucker\", \"zucchini-nlp\", \"molbap\", \"LysandreJik\", \"Cyrilvallez\", \"Rocketknight1\", \"SunMarc\", \"eustlb\", \"vasqu\", \"ivarflakstad\", \"stevhliu\", \"ebezzam\", \"remi-or\", \"itazap\", \"3outeille\", \"IlyasMoutawwakil\", \"tarekziade\", \"yonigozlan\", \"guarin\"]'), github.actor) && (startsWith(github.event.comment.body, 'run-slow') \|\| startsWith(github.event.comment.body, 'run slow') \|\| startsWith(github.event.comment.body, 'run_slow')) }}\n uses: ./.github/workflows/get-pr-number.yml\n \n get-pr-info:"}
	{"id": "pr_46136", "type": "pr", "number": 46136, "title": "Fix is_last off-by-one in MaskGenerationPipeline for partial batches", "state": "open", "author": "J3r3myPerera", "labels": [], "created_at": "2026-05-21T07:50:15Z", "updated_at": "2026-05-21T18:21:07Z", "url": "https://github.com/huggingface/transformers/pull/46136", "merged": false, "base_branch": "main", "text": "PULL REQUEST #46136: Fix is_last off-by-one in MaskGenerationPipeline for partial batches\nState: open \| Merged: False\nAuthor: J3r3myPerera \| Base: main\nLabels: \nCreated: 2026-05-21T07:50:15Z\n\nFixes #46123\r\n\r\nMaskGenerationPipeline.preprocess used i == n_points - points_per_batch to spot the last batch. When n_points isn't a multiple of points_per_batch, that's never true — PipelinePackIterator hits StopIteration and quietly drops the last batch's results.\r\n\r\nFix: i + points_per_batch >= n_points.\r\nTwo fast unit tests in test_pipelines_mask_generation.py: one for the partial-batch case (100 points, batch 64), one for an exact multiple (128 points, batch 64).\r\n\r\n`python -m pytest tests/pipelines/test_pipelines_mask_generation.py::MaskGenerationPipelineTests::test_preprocess_is_last_partial_batch tests/pipelines/test_pipelines_mask_generation.py::MaskGenerationPipelineTests::test_preprocess_is_last_exact_multiple -v\r\n`\r\n#2 passed\r\n\r\n- [x] I confirm that this is not a pure code agent PR.\r\n\r\n## Before submitting\r\n- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).\r\n- [x] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request),\r\n Pull Request section?\r\n- [x] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? 
Please add a link\r\n to it if that's the case.\r\nDiscussed in https://github.com/huggingface/transformers/issues/46123\r\n- [ ] Did you make sure to update the documentation with your changes? Here are the\r\n [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and\r\n [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation).\r\n- [x] Did you write any new necessary tests?\r\nAdded test_preprocess_is_last_partial_batch and test_preprocess_is_last_exact_multiple to tests/pipelines/test_pipelines_mask_generation.py.\r\n\r\n\r\n## Who can review?\r\n\r\ncc @Rocketknight1 @yonigozlan @qubvel\r\n\n\n--- Comment by Shashank-Tripathi-07 at 2026-05-21T11:42:39Z ---\nHey bro, the code looks good but you said you didn't use AI agents to make this PR but there are em dashes very visible on the comment you made and also in the original issue. This can be a problem as the repo doesn't like Agent Slop even 1%. Take a look again for safety on this. \n\n--- Comment by J3r3myPerera at 2026-05-21T11:49:22Z ---\nThe CI failures here are pre-existing on main — not caused by this change.\r\nci/circleci: tests_tensor_parallel_ci — all 3 failures are in Cohere2MoeModelTest (test_tp_forward, test_tp_backward, test_tp_generation), crashing with KeyError: 'rowwise' in distributed/tensor_parallel.py. This PR doesn't touch any of that.\r\n\r\nci/circleci: tests_training_overfit_ci — 1 failure, also Cohere2MoeModelTest::test_training_overfit, loss only drops 27% vs a 90% threshold. 
Unrelated.\r\n\r\nOnly two files changed:\r\n\r\nsrc/transformers/pipelines/mask_generation.py (1 line)\r\ntests/pipelines/test_pipelines_mask_generation.py (2 tests)\r\n\r\nNeither touches Cohere2MoeModel or anything in the distributed training path.\n\n--- Comment by J3r3myPerera at 2026-05-21T11:56:15Z ---\n> Hey bro, the code looks good but you said you didn't use AI agents to make this PR but there are em dashes very visible on the comment you made and also in the original issue. This can be a problem as the repo doesn't like Agent Slop even 1%. Take a look again for safety on this.\r\n\r\nFair point, and I'll own it. I did use AI to help word the PR description and the issue comment. The fix itself I worked out on my own: i == n_points - points_per_batch only hits when n_points is an exact multiple, so any partial tail batch never gets flagged as last, PipelinePackIterator raises StopIteration and the results are quietly dropped. Replacing it with i + points_per_batch >= n_points handles both cases. I understand what the code does and why the old condition was wrong.\r\n\r\nThat said, em dashes in prose aren't really a reliable signal for agent-generated code. Plenty of people type them on purpose. The actual thing to check is whether the logic holds up. Which I'd rather be judged on.\n\n--- Comment by Rocketknight1 at 2026-05-21T12:46:45Z ---\nYou can ignore those comments, he's just annoyed I wouldn't listen when he claimed his Claude PR was human-written. In this case the actual fix is one line and seems correct, so I don't really care too much whether an agent wrote it or not. You do not actually need to go around hiding all the em-dashes :sweat_smile: \n\n--- Comment by Rocketknight1 at 2026-05-21T13:21:34Z ---\n@J3r3myPerera looks like there might be some CI instability at the moment. Can you wait a bit and then try rebasing or rerunning tests? 
Once the CI is green ping me and I'll merge it.\n\n--- Comment by github-actions[bot] at 2026-05-21T13:30:20Z ---\nView the CircleCI Test Summary for this PR:\n\nhttps://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46136&sha=ba5335\n\n--- Comment by HuggingFaceDocBuilderDev at 2026-05-21T13:31:34Z ---\nThe docs for this PR live [here](https://moon-ci-docs.huggingface.co/docs/transformers/pr_46136). All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.\n\n--- Comment by J3r3myPerera at 2026-05-21T18:21:07Z ---\n> @J3r3myPerera looks like there might be some CI instability at the moment. Can you wait a bit and then try rebasing or rerunning tests? Once the CI is green ping me and I'll merge it.\r\n\r\nSure mate no worries."}
	{"id": "pr_46136_file_src_transformers_pipelines_mask_generation.py", "type": "pr_diff", "number": 46136, "title": "Fix is_last off-by-one in MaskGenerationPipeline for partial batches", "state": "open", "author": "J3r3myPerera", "labels": [], "created_at": "2026-05-21T07:50:15Z", "updated_at": "2026-05-21T18:21:07Z", "url": "https://github.com/huggingface/transformers/pull/46136", "merged": false, "base_branch": "main", "filename": "src/transformers/pipelines/mask_generation.py", "additions": 1, "deletions": 1, "text": "PR #46136 — file change: src/transformers/pipelines/mask_generation.py\nStatus: modified \| +1 -1\n\n@@ -231,7 +231,7 @@ def preprocess(\n for i in range(0, n_points, points_per_batch):\n batched_points = grid_points[:, i : i + points_per_batch, :, :]\n labels = input_labels[:, i : i + points_per_batch]\n- is_last = i == n_points - points_per_batch\n+ is_last = i + points_per_batch >= n_points\n yield {\n \"input_points\": batched_points,\n \"input_labels\": labels,"}
	{"id": "pr_46136_file_tests_pipelines_test_pipelines_mask_generation.py", "type": "pr_diff", "number": 46136, "title": "Fix is_last off-by-one in MaskGenerationPipeline for partial batches", "state": "open", "author": "J3r3myPerera", "labels": [], "created_at": "2026-05-21T07:50:15Z", "updated_at": "2026-05-21T18:21:07Z", "url": "https://github.com/huggingface/transformers/pull/46136", "merged": false, "base_branch": "main", "filename": "tests/pipelines/test_pipelines_mask_generation.py", "additions": 10, "deletions": 0, "text": "PR #46136 — file change: tests/pipelines/test_pipelines_mask_generation.py\nStatus: modified \| +10 -0\n\n@@ -93,6 +93,16 @@ def get_test_pipeline(\n def run_pipeline_test(self, mask_generator, examples):\n pass\n \n+ def test_preprocess_is_last(self):\n+ mask_generator = pipeline(\"mask-generation\", model=\"hf-internal-testing/tiny-random-SamModel\")\n+ mask_generator.image_processor.pad_size = {\"height\": 24, \"width\": 24}\n+ image = \"./tests/fixtures/tests_samples/COCO/000000039769.png\"\n+ for points_per_batch in (100, 64):\n+ with self.subTest(points_per_batch=points_per_batch):\n+ batches = list(mask_generator.preprocess(image, points_per_batch=points_per_batch))\n+ self.assertTrue(batches[-1][\"is_last\"])\n+ self.assertFalse(any(b[\"is_last\"] for b in batches[:-1]))\n+\n @slow\n @require_torch\n def test_small_model_pt(self):"}