andrevp committed on
Commit 33b4062 · verified · 1 Parent(s): af85fd9

Upload MiniCPM-o 4.5 MLX 4-bit quantized model

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,137 @@
+ ---
+ license: apache-2.0
+ license_link: https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE
+ base_model: openbmb/MiniCPM-o-4_5
+ tags:
+ - mlx
+ - vision
+ - multimodal
+ - vlm
+ - minicpm
+ - apple-silicon
+ - quantized
+ language:
+ - en
+ - zh
+ - id
+ - fr
+ - de
+ library_name: mlx
+ pipeline_tag: image-text-to-text
+ ---
+
+ # MiniCPM-o 4.5 — MLX 4-bit Quantized
+
+ 4-bit quantized [MLX](https://github.com/ml-explore/mlx) conversion of [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) for fast inference on Apple Silicon (M1/M2/M3/M4).
+
+ ## Model Details
+
+ | | |
+ |---|---|
+ | **Base model** | [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) |
+ | **Architecture** | SigLIP2 (27L) + Perceiver Resampler + Qwen3 LLM (36L) |
+ | **Parameters** | ~8B |
+ | **Quantization** | 4-bit (5.255 effective bits) — LLM quantized, vision encoder & resampler full precision |
+ | **Size on disk** | ~5.3 GB |
+ | **Framework** | [MLX](https://github.com/ml-explore/mlx) via [mlx-vlm](https://github.com/Blaizzy/mlx-vlm) |
+
+ ## Performance (M4 Pro, 24 GB RAM)
+
+ | Mode | Prompt Processing | Generation | Peak Memory |
+ |------|-------------------|------------|-------------|
+ | Text-only | ~100 tok/s | ~55 tok/s | ~5.8 GB |
+ | Image + Text | ~150 tok/s | ~51 tok/s | ~6.5 GB |
+
+ ## Capabilities
+
+ - Image understanding & description
+ - OCR / text extraction from images
+ - Chart & diagram analysis
+ - Math equation solving from images
+ - Visual reasoning & counting
+ - Code generation
+ - Multilingual (English, Chinese, Indonesian, French, German, etc.)
+
+ ## Requirements
+
+ - Apple Silicon Mac (M1 or later)
+ - Python 3.10+
+ - ~8 GB free RAM
+
+ ```bash
+ pip install mlx-vlm torch transformers Pillow
+ ```
+
+ ## Quick Start
+
+ ### Python API
+
+ ```python
+ from mlx_vlm import load
+ from mlx_vlm.generate import generate_step
+ import mlx.core as mx
+
+ model, processor = load("andrevp/MiniCPM-o-4_5-MLX", trust_remote_code=True)
+
+ # Text-only
+ text = "<|im_start|>user\nWhat is machine learning?<|im_end|>\n<|im_start|>assistant\n"
+ input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])
+
+ tokens = []
+ for token, _ in generate_step(input_ids, model, None, None, temp=0.0):
+     tok_val = token.item()
+     tokens.append(tok_val)
+     if processor.tokenizer.decode([tok_val]) in ["<|im_end|>", "<|endoftext|>"]:
+         break
+
+ print(processor.tokenizer.decode(tokens, skip_special_tokens=True))
+ ```
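+
+ The example above hard-codes the ChatML markers. A minimal sketch of an alternative, assuming the processor wraps a standard `transformers` tokenizer that picks up the bundled `chat_template.jinja` (version-dependent, not verified against this exact checkpoint):
+
+ ```python
+ # Sketch only: build the generation prompt from the repo's chat template
+ # instead of hand-writing the <|im_start|>/<|im_end|> markers.
+ messages = [{"role": "user", "content": "What is machine learning?"}]
+ text = processor.tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ input_ids = mx.array(processor.tokenizer(text, return_tensors="np")["input_ids"])
+ # ...then run the same generate_step loop shown above.
+ ```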
+
+ ### Chat Script
+
+ A standalone `chat_minicpmo.py` script is available in the [conversion repository](https://github.com/andrevp):
+
+ ```bash
+ # Single-shot with image
+ python chat_minicpmo.py photo.jpg -p "What's in this image?"
+
+ # Single-shot text-only
+ python chat_minicpmo.py -p "Explain quantum computing briefly."
+
+ # Interactive mode
+ python chat_minicpmo.py
+
+ # Interactive with pre-loaded image
+ python chat_minicpmo.py photo.jpg
+ ```
+
+ Interactive commands: `/image <path>` | `/clear` | `/quit`
+
+ ## Quantization Details
+
+ - **LLM layers**: 4-bit quantized (group_size=64, affine mode)
+ - **Vision encoder (SigLIP2)**: Full precision (not quantized)
+ - **Perceiver Resampler**: Full precision (not quantized)
+ - **Weight breakdown**: 907 LLM keys (quantized) + 437 vision keys + 17 resampler keys (full precision)
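+
+ This kind of split can be expressed with MLX's selective quantization hook. A minimal sketch, not the exact script used for this conversion; the path substrings `"vpm"` and `"resampler"` are assumptions about the parameter tree:
+
+ ```python
+ import mlx.nn as nn
+
+ # Sketch: quantize only the LLM's Linear layers to 4 bits (group_size=64),
+ # leaving vision-encoder and resampler weights in full precision.
+ def keep_full_precision(path: str) -> bool:
+     # Assumed module-path names; adjust to the actual weight layout.
+     return ("vpm" in path) or ("resampler" in path)
+
+ nn.quantize(
+     model,  # an mlx.nn.Module holding the converted weights
+     group_size=64,
+     bits=4,
+     class_predicate=lambda path, module: isinstance(module, nn.Linear)
+     and not keep_full_precision(path),
+ )
+ ```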
+
+ ## Limitations
+
+ - **Vision-language only**: Audio input (Whisper encoder) and TTS output (CosyVoice2) from the original model are not included in this conversion.
+ - **Single image per turn**: Processes one image at a time.
+ - Quantization may slightly reduce output quality compared to the full-precision model.
+
+ ## License
+
+ This model is released under the **Apache-2.0** license, following the original [openbmb/MiniCPM-o-4_5](https://huggingface.co/openbmb/MiniCPM-o-4_5) license.
+
+ See the [original license](https://github.com/OpenBMB/MiniCPM-V/blob/main/LICENSE) for full terms.
+
+ ## Disclaimer
+
+ > As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgments. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers. We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.
+
+ ## Credits
+
+ - **Original model**: [OpenBMB](https://github.com/OpenBMB) — [MiniCPM-o 4.5](https://huggingface.co/openbmb/MiniCPM-o-4_5)
+ - **MLX framework**: [Apple ML Explore](https://github.com/ml-explore/mlx)
+ - **mlx-vlm**: [Prince Canuma](https://github.com/Blaizzy/mlx-vlm)
added_tokens.json ADDED
@@ -0,0 +1,107 @@
1
+ {
2
+ "</answer>": 151686,
3
+ "</box>": 151674,
4
+ "</focus>": 151688,
5
+ "</image>": 151670,
6
+ "</image_id>": 151682,
7
+ "</image_save_to>": 151696,
8
+ "</line>": 151690,
9
+ "</perception>": 151692,
10
+ "</point>": 151678,
11
+ "</quad>": 151676,
12
+ "</ref>": 151672,
13
+ "</slice>": 151680,
14
+ "</source_image>": 151694,
15
+ "</think>": 151668,
16
+ "</tool_call>": 151658,
17
+ "</tool_response>": 151666,
18
+ "</unit>": 151684,
19
+ "<answer>": 151685,
20
+ "<box>": 151673,
21
+ "<focus>": 151687,
22
+ "<image>": 151669,
23
+ "<image_id>": 151681,
24
+ "<image_save_to>": 151695,
25
+ "<line>": 151689,
26
+ "<perception>": 151691,
27
+ "<point>": 151677,
28
+ "<quad>": 151675,
29
+ "<ref>": 151671,
30
+ "<slice>": 151679,
31
+ "<source_image>": 151693,
32
+ "<think>": 151667,
33
+ "<tool_call>": 151657,
34
+ "<tool_response>": 151665,
35
+ "<unit>": 151683,
36
+ "<|audio_end|>": 151699,
37
+ "<|audio_start|>": 151697,
38
+ "<|audio|>": 151698,
39
+ "<|box_end|>": 151649,
40
+ "<|box_start|>": 151648,
41
+ "<|emotion_end|>": 151711,
42
+ "<|emotion_start|>": 151710,
43
+ "<|endoftext|>": 151643,
44
+ "<|file_sep|>": 151664,
45
+ "<|fim_middle|>": 151660,
46
+ "<|fim_pad|>": 151662,
47
+ "<|fim_prefix|>": 151659,
48
+ "<|fim_suffix|>": 151661,
49
+ "<|im_end|>": 151645,
50
+ "<|im_start|>": 151644,
51
+ "<|image_pad|>": 151655,
52
+ "<|interrupt|>": 151707,
53
+ "<|listen|>": 151705,
54
+ "<|object_ref_end|>": 151647,
55
+ "<|object_ref_start|>": 151646,
56
+ "<|pitch_end|>": 151715,
57
+ "<|pitch_start|>": 151714,
58
+ "<|quad_end|>": 151651,
59
+ "<|quad_start|>": 151650,
60
+ "<|repo_name|>": 151663,
61
+ "<|speak|>": 151706,
62
+ "<|speed_end|>": 151713,
63
+ "<|speed_start|>": 151712,
64
+ "<|spk_bos|>": 151700,
65
+ "<|spk_eos|>": 151702,
66
+ "<|spk|>": 151701,
67
+ "<|turn_bos|>": 151716,
68
+ "<|timbre_10|>": 151726,
69
+ "<|timbre_11|>": 151727,
70
+ "<|timbre_12|>": 151728,
71
+ "<|timbre_13|>": 151729,
72
+ "<|timbre_14|>": 151730,
73
+ "<|timbre_15|>": 151731,
74
+ "<|timbre_16|>": 151732,
75
+ "<|timbre_17|>": 151733,
76
+ "<|timbre_18|>": 151734,
77
+ "<|timbre_19|>": 151735,
78
+ "<|turn_eos|>": 151717,
79
+ "<|timbre_20|>": 151736,
80
+ "<|timbre_21|>": 151737,
81
+ "<|timbre_22|>": 151738,
82
+ "<|timbre_23|>": 151739,
83
+ "<|timbre_24|>": 151740,
84
+ "<|timbre_25|>": 151741,
85
+ "<|timbre_26|>": 151742,
86
+ "<|timbre_27|>": 151743,
87
+ "<|timbre_28|>": 151744,
88
+ "<|timbre_29|>": 151745,
89
+ "<|chunk_eos|>": 151718,
90
+ "<|timbre_30|>": 151746,
91
+ "<|timbre_31|>": 151747,
92
+ "<|chunk_bos|>": 151719,
93
+ "<|chunk_tts_bos|>": 151720,
94
+ "<|chunk_tts_eos|>": 151721,
95
+ "<|tts_pad|>": 151722,
96
+ "<|timbre_7|>": 151723,
97
+ "<|timbre_8|>": 151724,
98
+ "<|timbre_9|>": 151725,
99
+ "<|tts_bos|>": 151703,
100
+ "<|tts_eos|>": 151704,
101
+ "<|vad_end|>": 151709,
102
+ "<|vad_start|>": 151708,
103
+ "<|video_pad|>": 151656,
104
+ "<|vision_end|>": 151653,
105
+ "<|vision_pad|>": 151654,
106
+ "<|vision_start|>": 151652
107
+ }
chat_template.jinja ADDED
@@ -0,0 +1,88 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0].role == 'system' %}
4
+ {{- messages[0].content + '\n\n' }}
5
+ {%- endif %}
6
+ {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
7
+ {%- for tool in tools %}
8
+ {{- "\n" }}
9
+ {{- tool | tojson }}
10
+ {%- endfor %}
11
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
12
+ {%- else %}
13
+ {%- if messages[0].role == 'system' %}
14
+ {{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
15
+ {%- endif %}
16
+ {%- endif %}
17
+ {%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
18
+ {%- for message in messages[::-1] %}
19
+ {%- set index = (messages|length - 1) - loop.index0 %}
20
+ {%- if ns.multi_step_tool and message.role == "user" and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
21
+ {%- set ns.multi_step_tool = false %}
22
+ {%- set ns.last_query_index = index %}
23
+ {%- endif %}
24
+ {%- endfor %}
25
+ {%- for message in messages %}
26
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
27
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
28
+ {%- elif message.role == "assistant" %}
29
+ {%- set content = message.content %}
30
+ {%- set reasoning_content = '' %}
31
+ {%- if message.reasoning_content is defined and message.reasoning_content is not none %}
32
+ {%- set reasoning_content = message.reasoning_content %}
33
+ {%- else %}
34
+ {%- if '</think>' in message.content %}
35
+ {%- set content = message.content.split('</think>')[-1].lstrip('\n') %}
36
+ {%- set reasoning_content = message.content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
37
+ {%- endif %}
38
+ {%- endif %}
39
+ {%- if loop.index0 > ns.last_query_index %}
40
+ {%- if loop.last or (not loop.last and reasoning_content) %}
41
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
42
+ {%- else %}
43
+ {{- '<|im_start|>' + message.role + '\n' + content }}
44
+ {%- endif %}
45
+ {%- else %}
46
+ {{- '<|im_start|>' + message.role + '\n' + content }}
47
+ {%- endif %}
48
+ {%- if message.tool_calls %}
49
+ {%- for tool_call in message.tool_calls %}
50
+ {%- if (loop.first and content) or (not loop.first) %}
51
+ {{- '\n' }}
52
+ {%- endif %}
53
+ {%- if tool_call.function %}
54
+ {%- set tool_call = tool_call.function %}
55
+ {%- endif %}
56
+ {{- '<tool_call>\n{"name": "' }}
57
+ {{- tool_call.name }}
58
+ {{- '", "arguments": ' }}
59
+ {%- if tool_call.arguments is string %}
60
+ {{- tool_call.arguments }}
61
+ {%- else %}
62
+ {{- tool_call.arguments | tojson }}
63
+ {%- endif %}
64
+ {{- '}\n</tool_call>' }}
65
+ {%- endfor %}
66
+ {%- endif %}
67
+ {{- '<|im_end|>\n' }}
68
+ {%- elif message.role == "tool" %}
69
+ {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
70
+ {{- '<|im_start|>user' }}
71
+ {%- endif %}
72
+ {{- '\n<tool_response>\n' }}
73
+ {{- message.content }}
74
+ {{- '\n</tool_response>' }}
75
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
76
+ {{- '<|im_end|>\n' }}
77
+ {%- endif %}
78
+ {%- endif %}
79
+ {%- endfor %}
80
+ {%- if add_generation_prompt %}
81
+ {{- '<|im_start|>assistant\n' }}
82
+ {%- if enable_thinking is defined and enable_thinking is false %}
83
+ {{- '<think>\n\n</think>\n\n' }}
84
+ {%- endif %}
85
+ {%- if use_tts_template is defined and use_tts_template is true %}
86
+ {{- '<|tts_bos|>' }}
87
+ {%- endif %}
88
+ {%- endif %}
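
The template above is what a chat-template-aware tokenizer renders at generation time. A minimal sketch of rendering it directly with Jinja2 (mirroring how `transformers` compiles chat templates with `trim_blocks`/`lstrip_blocks`; `enable_thinking` is the switch defined in the template itself):

```python
from jinja2 import Environment

# Sketch only: render chat_template.jinja for a single user turn.
env = Environment(trim_blocks=True, lstrip_blocks=True)
template = env.from_string(open("chat_template.jinja").read())

prompt = template.render(
    messages=[{"role": "user", "content": "Describe this image."}],
    add_generation_prompt=True,
    enable_thinking=False,  # emits an empty <think>...</think> block
    tools=None,             # no tool schemas, so the tools branch is skipped
)
print(prompt)
```
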
config.json ADDED
@@ -0,0 +1,297 @@
1
+ {
2
+ "architectures": [
3
+ "MiniCPMO"
4
+ ],
5
+ "attention_bias": false,
6
+ "attention_dropout": 0.0,
7
+ "audio_chunk_length": 1.0,
8
+ "audio_config": {
9
+ "_attn_implementation_autoset": true,
10
+ "_name_or_path": "openai/whisper-medium",
11
+ "activation_dropout": 0.0,
12
+ "activation_function": "gelu",
13
+ "apply_spec_augment": false,
14
+ "architectures": [
15
+ "MiniCPMWhisperEncoder"
16
+ ],
17
+ "attention_dropout": 0.0,
18
+ "begin_suppress_tokens": [
19
+ 220,
20
+ 50257
21
+ ],
22
+ "bos_token_id": 50257,
23
+ "classifier_proj_size": 256,
24
+ "d_model": 1024,
25
+ "decoder_attention_heads": 16,
26
+ "decoder_ffn_dim": 4096,
27
+ "decoder_layerdrop": 0.0,
28
+ "decoder_layers": 24,
29
+ "decoder_start_token_id": 50258,
30
+ "dropout": 0.0,
31
+ "encoder_attention_heads": 16,
32
+ "encoder_ffn_dim": 4096,
33
+ "encoder_layerdrop": 0.0,
34
+ "encoder_layers": 24,
35
+ "eos_token_id": 50257,
36
+ "forced_decoder_ids": [
37
+ [
38
+ 1,
39
+ 50259
40
+ ],
41
+ [
42
+ 2,
43
+ 50359
44
+ ],
45
+ [
46
+ 3,
47
+ 50363
48
+ ]
49
+ ],
50
+ "init_std": 0.02,
51
+ "mask_feature_length": 10,
52
+ "mask_feature_min_masks": 0,
53
+ "mask_feature_prob": 0.0,
54
+ "mask_time_length": 10,
55
+ "mask_time_min_masks": 2,
56
+ "mask_time_prob": 0.05,
57
+ "max_length": 448,
58
+ "max_source_positions": 1500,
59
+ "max_target_positions": 448,
60
+ "median_filter_width": 7,
61
+ "model_type": "whisper",
62
+ "num_hidden_layers": 24,
63
+ "num_mel_bins": 80,
64
+ "pad_token_id": 50257,
65
+ "scale_embedding": false,
66
+ "suppress_tokens": [
67
+ 1,
68
+ 2,
69
+ 7,
70
+ 8,
71
+ 9,
72
+ 10,
73
+ 14,
74
+ 25,
75
+ 26,
76
+ 27,
77
+ 28,
78
+ 29,
79
+ 31,
80
+ 58,
81
+ 59,
82
+ 60,
83
+ 61,
84
+ 62,
85
+ 63,
86
+ 90,
87
+ 91,
88
+ 92,
89
+ 93,
90
+ 359,
91
+ 503,
92
+ 522,
93
+ 542,
94
+ 873,
95
+ 893,
96
+ 902,
97
+ 918,
98
+ 922,
99
+ 931,
100
+ 1350,
101
+ 1853,
102
+ 1982,
103
+ 2460,
104
+ 2627,
105
+ 3246,
106
+ 3253,
107
+ 3268,
108
+ 3536,
109
+ 3846,
110
+ 3961,
111
+ 4183,
112
+ 4667,
113
+ 6585,
114
+ 6647,
115
+ 7273,
116
+ 9061,
117
+ 9383,
118
+ 10428,
119
+ 10929,
120
+ 11938,
121
+ 12033,
122
+ 12331,
123
+ 12562,
124
+ 13793,
125
+ 14157,
126
+ 14635,
127
+ 15265,
128
+ 15618,
129
+ 16553,
130
+ 16604,
131
+ 18362,
132
+ 18956,
133
+ 20075,
134
+ 21675,
135
+ 22520,
136
+ 26130,
137
+ 26161,
138
+ 26435,
139
+ 28279,
140
+ 29464,
141
+ 31650,
142
+ 32302,
143
+ 32470,
144
+ 36865,
145
+ 42863,
146
+ 47425,
147
+ 49870,
148
+ 50254,
149
+ 50258,
150
+ 50358,
151
+ 50359,
152
+ 50360,
153
+ 50361,
154
+ 50362
155
+ ],
156
+ "torch_dtype": "float32",
157
+ "use_cache": true,
158
+ "use_weighted_layer_sum": false,
159
+ "vocab_size": 51865
160
+ },
161
+ "audio_pool_step": 5,
162
+ "auto_map": {
163
+ "AutoConfig": "configuration_minicpmo.MiniCPMOConfig",
164
+ "AutoModel": "modeling_minicpmo.MiniCPMO",
165
+ "AutoModelForCausalLM": "modeling_minicpmo.MiniCPMO"
166
+ },
167
+ "batch_vision_input": true,
168
+ "bos_token_id": 151643,
169
+ "drop_vision_last_layer": false,
170
+ "eos_token_id": [
171
+ 151645,
172
+ 151643
173
+ ],
174
+ "head_dim": 128,
175
+ "hidden_act": "silu",
176
+ "hidden_size": 4096,
177
+ "image_size": 448,
178
+ "init_audio": true,
179
+ "init_tts": true,
180
+ "init_vision": true,
181
+ "initializer_range": 0.02,
182
+ "intermediate_size": 12288,
183
+ "listen_speak_type": "asr",
184
+ "max_position_embeddings": 40960,
185
+ "max_window_layers": 36,
186
+ "model_type": "minicpmo",
187
+ "num_attention_heads": 32,
188
+ "num_hidden_layers": 36,
189
+ "num_key_value_heads": 8,
190
+ "patch_size": 14,
191
+ "quantization": {
192
+ "group_size": 64,
193
+ "bits": 4,
194
+ "mode": "affine"
195
+ },
196
+ "quantization_config": {
197
+ "group_size": 64,
198
+ "bits": 4,
199
+ "mode": "affine"
200
+ },
201
+ "query_num": 64,
202
+ "rms_norm_eps": 1e-06,
203
+ "rope_scaling": null,
204
+ "rope_theta": 1000000,
205
+ "slice_config": {
206
+ "max_slice_nums": 1,
207
+ "model_type": "minicpmv",
208
+ "patch_size": 14,
209
+ "scale_resolution": 448
210
+ },
211
+ "slice_mode": true,
212
+ "sliding_window": null,
213
+ "stream_input": true,
214
+ "tie_word_embeddings": false,
215
+ "transformers_version": "4.51.0",
216
+ "tts_config": {
217
+ "_attn_implementation_autoset": true,
218
+ "attention_type": "full_attention",
219
+ "attn_implementation": "sdpa",
220
+ "audio_bos_token_id": 151687,
221
+ "audio_tokenizer_sample_rate": 16000,
222
+ "audio_tokenizer_type": "s3tokenizer",
223
+ "aug_layer_loss_weight": false,
224
+ "aug_loss_weight": false,
225
+ "backbone_model": "llama",
226
+ "condition_type": "hidden_text_merge",
227
+ "cosyvoice_config_path": null,
228
+ "cosyvoice_model_dir": null,
229
+ "filter_tts_loss": false,
230
+ "hidden_act": "silu",
231
+ "hidden_size": 768,
232
+ "interleaved": false,
233
+ "intermediate_size": 3072,
234
+ "llm_dim": 4096,
235
+ "llm_dim_model_base": 256,
236
+ "llm_down_scale": false,
237
+ "llm_hidden_size": 4096,
238
+ "llm_intermediate_size": 768,
239
+ "long_weight": 0.1,
240
+ "max_position_embeddings": 4096,
241
+ "model_type": "minicpmtts",
242
+ "normalize_projected_hidden": true,
243
+ "num_attention_heads": 12,
244
+ "num_audio_tokens": 6562,
245
+ "num_hidden_layers": 20,
246
+ "num_key_value_heads": 12,
247
+ "num_mel_bins": 100,
248
+ "num_text_tokens": 152064,
249
+ "num_vq": 1,
250
+ "projector_type": "mlp",
251
+ "recomputed_chunks": 1,
252
+ "s3_stream_chunk_size": 25,
253
+ "s3_stream_generate": false,
254
+ "s3_stream_n_timesteps": 10,
255
+ "s3_stream_prelook_size": 3,
256
+ "short_weight": 0.1,
257
+ "streaming": false,
258
+ "streaming_audio_chunk_size": 50,
259
+ "streaming_sliding_window": false,
260
+ "streaming_sliding_window_audio_frame_rate": 50,
261
+ "streaming_sliding_window_audio_init_text_length": 10,
262
+ "streaming_sliding_window_audio_window_size": 300,
263
+ "streaming_sliding_window_average_speed": 5,
264
+ "streaming_sliding_window_fast_speed": 7,
265
+ "streaming_sliding_window_max_text_len": 500,
266
+ "streaming_sliding_window_slow_speed": 3,
267
+ "streaming_sliding_window_text_window_size": 50,
268
+ "streaming_text_chunk_max": 7,
269
+ "streaming_text_chunk_min": 3,
270
+ "streaming_text_reserved_len": 300,
271
+ "text_eos_token_id": 151692,
272
+ "tts_filter_loss_fix": false,
273
+ "use_llm_hidden_state": false,
274
+ "use_text": true,
275
+ "window_size": 2
276
+ },
277
+ "use_cache": true,
278
+ "use_image_id": true,
279
+ "use_sliding_window": false,
280
+ "version": "4.5",
281
+ "vision_batch_size": 16,
282
+ "vision_config": {
283
+ "_attn_implementation_autoset": true,
284
+ "attention_dropout": 0.0,
285
+ "hidden_act": "gelu_pytorch_tanh",
286
+ "hidden_size": 1152,
287
+ "image_size": 980,
288
+ "intermediate_size": 4304,
289
+ "layer_norm_eps": 1e-06,
290
+ "model_type": "siglip_vision_model",
291
+ "num_attention_heads": 16,
292
+ "num_channels": 3,
293
+ "num_hidden_layers": 27,
294
+ "patch_size": 14
295
+ },
296
+ "vocab_size": 151748
297
+ }
configuration_minicpmo.py ADDED
@@ -0,0 +1,260 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ # Copyright 2026 The OpenBMB Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ import os
18
+ from typing import Union
19
+
20
+ from transformers import PretrainedConfig
21
+ from transformers import Qwen3Config
22
+ from transformers import WhisperConfig
23
+ from transformers.utils import logging
24
+
25
+ from .modeling_navit_siglip import SiglipVisionConfig
26
+
27
+ logger = logging.get_logger(__name__)
28
+
29
+
30
+ class MiniCPMVSliceConfig(PretrainedConfig):
31
+ model_type = "minicpmv"
32
+
33
+ def __init__(
34
+ self,
35
+ patch_size=14,
36
+ max_slice_nums=9,
37
+ scale_resolution=448,
38
+ **kwargs,
39
+ ):
40
+ super().__init__(**kwargs)
41
+ self.patch_size = patch_size
42
+ self.max_slice_nums = max_slice_nums
43
+ self.scale_resolution = scale_resolution
44
+
45
+ @classmethod
46
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
47
+ cls._set_token_in_kwargs(kwargs)
48
+
49
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
50
+
51
+ if config_dict.get("model_type") == "minicpmv":
52
+ config_dict = config_dict["slice_config"]
53
+
54
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
55
+ logger.warning(
56
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
57
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
58
+ )
59
+
60
+ return cls.from_dict(config_dict, **kwargs)
61
+
62
+
63
+ class MiniCPMTTSConfig(PretrainedConfig):
64
+ model_type = "minicpmtts"
65
+
66
+ def __init__(
67
+ self,
68
+ llm_dim: int = 2560,
69
+ llm_intermediate_size: int = 768,
70
+ llm_down_scale: bool = False,
71
+ llm_dim_model_base: int = 256,
72
+ projector_type: str = "mlp",
73
+ hidden_act: str = "silu",
74
+ aug_loss_weight: bool = False,
75
+ aug_layer_loss_weight: bool = False,
76
+ filter_tts_loss: bool = False,
77
+ tts_filter_loss_fix: bool = False,
78
+ long_weight: float = 0.1,
79
+ short_weight: float = 0.1,
80
+ hidden_size: int = 768,
81
+ intermediate_size: int = 3072,
82
+ num_attention_heads: int = 12,
83
+ num_hidden_layers: int = 20,
84
+ num_key_value_heads: int = 12,
85
+ max_position_embeddings: int = 4096,
86
+ num_audio_tokens: int = 4097,
87
+ num_text_tokens: int = 21178,
88
+ num_mel_bins: int = 100,
89
+ num_vq: int = 1,
90
+ use_llm_hidden_state: bool = False,
91
+ audio_bos_token_id: int = 21132,
92
+ text_eos_token_id: int = 21133,
93
+ use_text: bool = True,
94
+ streaming: bool = False,
95
+ streaming_text_chunk_min: int = 3,
96
+ streaming_text_chunk_max: int = 7,
97
+ streaming_text_reserved_len: int = 300,
98
+ streaming_audio_chunk_size: int = 50,
99
+ attn_implementation: str = "sdpa",
100
+ condition_type: str = "llm_hidden",
101
+ backbone_model: str = "llama",
102
+ audio_tokenizer_type: str = "wavtokenizer",
103
+ audio_tokenizer_sample_rate: int = 24000,
104
+ streaming_sliding_window: bool = False,
105
+ streaming_sliding_window_max_text_len: int = 500,
106
+ streaming_sliding_window_average_speed: int = 5,
107
+ streaming_sliding_window_fast_speed: int = 7,
108
+ streaming_sliding_window_slow_speed: int = 3,
109
+ streaming_sliding_window_audio_frame_rate: int = 50,
110
+ streaming_sliding_window_audio_init_text_length: int = 10,
111
+ streaming_sliding_window_audio_window_size: int = 300,
112
+ normalize_projected_hidden: bool = False,
113
+ interleaved: bool = False,
114
+ attention_type: str = "sliding_recompute",
115
+ recomputed_chunks: int = 1,
116
+ window_size: int = 2,
117
+ **kwargs,
118
+ ):
119
+ super().__init__(**kwargs)
120
+
121
+ self.llm_dim = llm_dim
122
+ self.llm_hidden_size = llm_dim
123
+ self.llm_intermediate_size = llm_intermediate_size
124
+ self.llm_down_scale = llm_down_scale
125
+ self.llm_dim_model_base = llm_dim_model_base
126
+ self.projector_type = projector_type
127
+ self.aug_loss_weight = aug_loss_weight
128
+ self.aug_layer_loss_weight = aug_layer_loss_weight
129
+ self.tts_filter_loss_fix = tts_filter_loss_fix
130
+ self.filter_tts_loss = filter_tts_loss
131
+ self.long_weight = long_weight
132
+ self.short_weight = short_weight
133
+ self.hidden_act = hidden_act
134
+
135
+ self.hidden_size = hidden_size
136
+ self.intermediate_size = intermediate_size
137
+ self.num_attention_heads = num_attention_heads
138
+ self.num_hidden_layers = num_hidden_layers
139
+ self.num_key_value_heads = num_key_value_heads
140
+ self.max_position_embeddings = max_position_embeddings
141
+ self.num_audio_tokens = num_audio_tokens
142
+ self.num_text_tokens = num_text_tokens
143
+ self.num_mel_bins = num_mel_bins
144
+ self.num_vq = num_vq
145
+ self.use_llm_hidden_state = use_llm_hidden_state
146
+ self.audio_bos_token_id = audio_bos_token_id
147
+ self.text_eos_token_id = text_eos_token_id
148
+ self.use_text = use_text
149
+ self.streaming = streaming
150
+ self.streaming_text_chunk_min = streaming_text_chunk_min
151
+ self.streaming_text_chunk_max = streaming_text_chunk_max
152
+ self.streaming_text_reserved_len = streaming_text_reserved_len
153
+ self.streaming_audio_chunk_size = streaming_audio_chunk_size
154
+ self.attn_implementation = attn_implementation
155
+ self.condition_type = condition_type
156
+ self.backbone_model = backbone_model
157
+ self.audio_tokenizer_type = audio_tokenizer_type
158
+ self.audio_tokenizer_sample_rate = audio_tokenizer_sample_rate
159
+
160
+ self.streaming_sliding_window = streaming_sliding_window
161
+ self.streaming_sliding_window_max_text_len = streaming_sliding_window_max_text_len
162
+ self.streaming_sliding_window_average_speed = streaming_sliding_window_average_speed
163
+ self.streaming_sliding_window_fast_speed = streaming_sliding_window_fast_speed
164
+ self.streaming_sliding_window_slow_speed = streaming_sliding_window_slow_speed
165
+ self.streaming_sliding_window_audio_frame_rate = streaming_sliding_window_audio_frame_rate
166
+ self.streaming_sliding_window_audio_init_text_length = streaming_sliding_window_audio_init_text_length
167
+ self.streaming_sliding_window_audio_window_size = streaming_sliding_window_audio_window_size
168
+
169
+ self.normalize_projected_hidden = normalize_projected_hidden
170
+
171
+ self.interleaved = interleaved
172
+ self.attention_type = attention_type
173
+ self.recomputed_chunks = recomputed_chunks
174
+ self.window_size = window_size
175
+
176
+
177
+ class MiniCPMOConfig(Qwen3Config):
178
+ model_type = "minicpmo"
179
+ keys_to_ignore_at_inference = ["past_key_values"]
180
+
181
+ default_vision_config = {
182
+ "hidden_size": 1152,
183
+ "image_size": 980,
184
+ "intermediate_size": 4304,
185
+ "model_type": "siglip",
186
+ "num_attention_heads": 16,
187
+ "num_hidden_layers": 27,
188
+ "patch_size": 14,
189
+ }
190
+
191
+ def __init__(
192
+ self,
193
+ use_cache=True,
194
+ query_num=64,
195
+ image_size=448,
196
+ drop_vision_last_layer=True,
197
+ batch_vision_input=True,
198
+ slice_config=None,
199
+ vision_config=None,
200
+ audio_config=None,
201
+ tts_config=None,
202
+ use_image_id=True,
203
+ vision_batch_size=16,
204
+ audio_pool_step=5,
205
+ audio_chunk_length=1.0,
206
+ stream_input=False,
207
+ listen_speak_type="asr",
208
+ init_vision=True,
209
+ init_audio=True,
210
+ init_tts=True,
211
+ **kwargs,
212
+ ):
213
+ self.use_cache = use_cache
214
+ self.query_num = query_num
215
+ self.image_size = image_size
216
+ self.drop_vision_last_layer = drop_vision_last_layer
217
+ self.batch_vision_input = batch_vision_input
218
+ self.use_image_id = use_image_id
219
+ self.vision_batch_size = vision_batch_size
220
+ self.audio_pool_step = audio_pool_step
221
+ self.audio_chunk_length = audio_chunk_length
222
+ self.stream_input = stream_input
223
+ self.listen_speak_type = listen_speak_type
224
+
225
+ self.init_vision = init_vision
226
+ self.init_audio = init_audio
227
+ self.init_tts = init_tts
228
+
229
+ if slice_config is None:
230
+ self.slice_config = MiniCPMVSliceConfig(max_slice_nums=1)
231
+ else:
232
+ self.slice_config = MiniCPMVSliceConfig(**slice_config)
233
+ self.slice_mode = True
234
+
235
+ # same as HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit add tgt_sizes
236
+ if vision_config is None:
237
+ self.vision_config = SiglipVisionConfig(**self.default_vision_config)
238
+ logger.info("vision_config is None, using default vision config")
239
+ elif isinstance(vision_config, dict):
240
+ self.vision_config = SiglipVisionConfig(**vision_config)
241
+ elif isinstance(vision_config, SiglipVisionConfig):
242
+ self.vision_config = vision_config
243
+
244
+ if audio_config is None:
245
+ self.audio_config = WhisperConfig()
246
+ elif isinstance(audio_config, dict):
247
+ self.audio_config = WhisperConfig(**audio_config)
248
+ elif isinstance(audio_config, WhisperConfig):
249
+ self.audio_config = audio_config
250
+
251
+ if tts_config is None:
252
+ self.tts_config = MiniCPMTTSConfig()
253
+ elif isinstance(tts_config, dict):
254
+ self.tts_config = MiniCPMTTSConfig(**tts_config)
255
+ elif isinstance(tts_config, MiniCPMTTSConfig):
256
+ self.tts_config = tts_config
257
+
258
+ self.patch_size = self.vision_config.patch_size
259
+
260
+ super().__init__(**kwargs)
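
Because `config.json` maps `AutoConfig` to this class through `auto_map`, the nested sub-configs can be inspected without loading any weights. A minimal sketch, assuming `trust_remote_code=True` and access to the Hub:

```python
from transformers import AutoConfig

# Sketch only: instantiate MiniCPMOConfig from the repo's config.json.
cfg = AutoConfig.from_pretrained("andrevp/MiniCPM-o-4_5-MLX", trust_remote_code=True)

print(type(cfg).__name__)                   # MiniCPMOConfig
print(cfg.num_hidden_layers)                # 36 (Qwen3 LLM layers)
print(cfg.vision_config.num_hidden_layers)  # 27 (SigLIP vision layers)
print(cfg.query_num, cfg.slice_config.max_slice_nums)
```
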
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "bos_token_id": 151643,
+   "do_sample": true,
+   "eos_token_id": [
+     151645,
+     151643
+   ],
+   "pad_token_id": 151643,
+   "temperature": 0.6,
+   "top_k": 20,
+   "top_p": 0.95
+ }
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5e0d2dbc6eacf34177bcf3badf36ead8719f48e1b4c8adb2a2754be5a73218e6
+ size 5092993723
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b35835d4c370c057a6c11d87cb91565f54e973c09a9c6b8ee5564140ce34d2a
+ size 527444905
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
modeling_minicpmo.py ADDED
The diff for this file is too large to render. See raw diff
 
modeling_navit_siglip.py ADDED
@@ -0,0 +1,981 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Google AI and The HuggingFace Team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """PyTorch Siglip model."""
16
+ # Copied from HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit and add tgt_sizes
17
+
18
+
19
+ import math
20
+ import os
21
+ import warnings
22
+ from dataclasses import dataclass
23
+ from typing import Optional
24
+ from typing import Tuple
25
+ from typing import Union
26
+
27
+ import numpy as np
28
+ import torch
29
+ import torch.nn.functional as F
30
+ import torch.utils.checkpoint
31
+ from torch import nn
32
+ from torch.nn.init import _calculate_fan_in_and_fan_out
33
+ from transformers.activations import ACT2FN
34
+ from transformers.configuration_utils import PretrainedConfig
35
+ from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask
36
+ from transformers.modeling_outputs import BaseModelOutput
37
+ from transformers.modeling_outputs import BaseModelOutputWithPooling
38
+ from transformers.modeling_utils import PreTrainedModel
39
+ from transformers.utils import add_start_docstrings
40
+ from transformers.utils import add_start_docstrings_to_model_forward
41
+ from transformers.utils import is_flash_attn_2_available
42
+ from transformers.utils import logging
43
+ from transformers.utils import ModelOutput
44
+ from transformers.utils import replace_return_docstrings
45
+
46
+ logger = logging.get_logger(__name__)
47
+
48
+
49
+ class SiglipVisionConfig(PretrainedConfig):
50
+ r"""
51
+ This is the configuration class to store the configuration of a [`SiglipVisionModel`]. It is used to instantiate a
52
+ Siglip vision encoder according to the specified arguments, defining the model architecture. Instantiating a
53
+ configuration with the defaults will yield a similar configuration to that of the vision encoder of the Siglip
54
+ [google/siglip-base-patch16-224](https://huggingface.co/google/siglip-base-patch16-224) architecture.
55
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
56
+ documentation from [`PretrainedConfig`] for more information.
57
+ Args:
58
+ hidden_size (`int`, *optional*, defaults to 768):
59
+ Dimensionality of the encoder layers and the pooler layer.
60
+ intermediate_size (`int`, *optional*, defaults to 3072):
61
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
62
+ num_hidden_layers (`int`, *optional*, defaults to 12):
63
+ Number of hidden layers in the Transformer encoder.
64
+ num_attention_heads (`int`, *optional*, defaults to 12):
65
+ Number of attention heads for each attention layer in the Transformer encoder.
66
+ num_channels (`int`, *optional*, defaults to 3):
67
+ Number of channels in the input images.
68
+ image_size (`int`, *optional*, defaults to 224):
69
+ The size (resolution) of each image.
70
+ patch_size (`int`, *optional*, defaults to 16):
71
+ The size (resolution) of each patch.
72
+ hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
73
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
74
+ `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
75
+ layer_norm_eps (`float`, *optional*, defaults to 1e-06):
76
+ The epsilon used by the layer normalization layers.
77
+ attention_dropout (`float`, *optional*, defaults to 0.0):
78
+ The dropout ratio for the attention probabilities.
79
+ Example:
80
+ ```python
81
+ >>> from transformers import SiglipVisionConfig, SiglipVisionModel
82
+ >>> # Initializing a SiglipVisionConfig with google/siglip-base-patch16-224 style configuration
83
+ >>> configuration = SiglipVisionConfig()
84
+ >>> # Initializing a SiglipVisionModel (with random weights) from the google/siglip-base-patch16-224 style configuration
85
+ >>> model = SiglipVisionModel(configuration)
86
+ >>> # Accessing the model configuration
87
+ >>> configuration = model.config
88
+ ```"""
89
+
90
+ model_type = "siglip_vision_model"
91
+
92
+ def __init__(
93
+ self,
94
+ hidden_size=768,
95
+ intermediate_size=3072,
96
+ num_hidden_layers=12,
97
+ num_attention_heads=12,
98
+ num_channels=3,
99
+ image_size=224,
100
+ patch_size=16,
101
+ hidden_act="gelu_pytorch_tanh",
102
+ layer_norm_eps=1e-6,
103
+ attention_dropout=0.0,
104
+ **kwargs,
105
+ ):
106
+ super().__init__(**kwargs)
107
+
108
+ self.hidden_size = hidden_size
109
+ self.intermediate_size = intermediate_size
110
+ self.num_hidden_layers = num_hidden_layers
111
+ self.num_attention_heads = num_attention_heads
112
+ self.num_channels = num_channels
113
+ self.patch_size = patch_size
114
+ self.image_size = image_size
115
+ self.attention_dropout = attention_dropout
116
+ self.layer_norm_eps = layer_norm_eps
117
+ self.hidden_act = hidden_act
118
+
119
+ @classmethod
120
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
121
+ cls._set_token_in_kwargs(kwargs)
122
+
123
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
124
+
125
+ # get the vision config dict if we are loading from SiglipConfig
126
+ if config_dict.get("model_type") == "siglip":
127
+ config_dict = config_dict["vision_config"]
128
+
129
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
130
+ logger.warning(
131
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
132
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
133
+ )
134
+
135
+ return cls.from_dict(config_dict, **kwargs)
136
+
137
+
138
+ _CHECKPOINT_FOR_DOC = "google/siglip-base-patch16-224"
139
+
140
+ SIGLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [
141
+ "google/siglip-base-patch16-224",
142
+ # See all SigLIP models at https://huggingface.co/models?filter=siglip
143
+ ]
144
+
145
+ if is_flash_attn_2_available():
146
+ from flash_attn import flash_attn_func
147
+ from flash_attn import flash_attn_varlen_func
148
+ from flash_attn.bert_padding import index_first_axis # noqa
149
+ from flash_attn.bert_padding import pad_input
150
+ from flash_attn.bert_padding import unpad_input
151
+
152
+
153
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
154
+ def _get_unpad_data(attention_mask):
155
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
156
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
157
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
158
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.torch.int32), (1, 0))
159
+ return (
160
+ indices,
161
+ cu_seqlens,
162
+ max_seqlen_in_batch,
163
+ )
164
+
165
+
166
+ def _trunc_normal_(tensor, mean, std, a, b):
167
+ # Cut & paste from PyTorch official master until it's in a few official releases - RW
168
+ # Method based on https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf
169
+ def norm_cdf(x):
170
+ # Computes standard normal cumulative distribution function
171
+ return (1.0 + math.erf(x / math.sqrt(2.0))) / 2.0
172
+
173
+ if (mean < a - 2 * std) or (mean > b + 2 * std):
174
+ warnings.warn(
175
+ "mean is more than 2 std from [a, b] in nn.init.trunc_normal_. "
176
+ "The distribution of values may be incorrect.",
177
+ stacklevel=2,
178
+ )
179
+
180
+ # Values are generated by using a truncated uniform distribution and
181
+ # then using the inverse CDF for the normal distribution.
182
+ # Get upper and lower cdf values
183
+ l = norm_cdf((a - mean) / std)
184
+ u = norm_cdf((b - mean) / std)
185
+
186
+ # Uniformly fill tensor with values from [l, u], then translate to
187
+ # [2l-1, 2u-1].
188
+ tensor.uniform_(2 * l - 1, 2 * u - 1)
189
+
190
+ # Use inverse cdf transform for normal distribution to get truncated
191
+ # standard normal
192
+ if tensor.dtype in [torch.float16, torch.bfloat16]:
193
+ # The `erfinv_` op is not (yet?) defined in float16+cpu, bfloat16+gpu
194
+ og_dtype = tensor.dtype
195
+ tensor = tensor.to(torch.float32)
196
+ tensor.erfinv_()
197
+ tensor = tensor.to(og_dtype)
198
+ else:
199
+ tensor.erfinv_()
200
+
201
+ # Transform to proper mean, std
202
+ tensor.mul_(std * math.sqrt(2.0))
203
+ tensor.add_(mean)
204
+
205
+ # Clamp to ensure it's in the proper range
206
+ if tensor.dtype == torch.float16:
207
+ # The `clamp_` op is not (yet?) defined in float16+cpu
208
+ tensor = tensor.to(torch.float32)
209
+ tensor.clamp_(min=a, max=b)
210
+ tensor = tensor.to(torch.float16)
211
+ else:
212
+ tensor.clamp_(min=a, max=b)
213
+
214
+
215
+ def trunc_normal_tf_(
216
+ tensor: torch.Tensor,
217
+ mean: float = 0.0,
218
+ std: float = 1.0,
219
+ a: float = -2.0,
220
+ b: float = 2.0,
221
+ ) -> torch.Tensor:
222
+ """Fills the input Tensor with values drawn from a truncated
223
+ normal distribution. The values are effectively drawn from the
224
+ normal distribution :math:`\\mathcal{N}(\text{mean}, \text{std}^2)`
225
+ with values outside :math:`[a, b]` redrawn until they are within
226
+ the bounds. The method used for generating the random values works
227
+ best when :math:`a \\leq \text{mean} \\leq b`.
228
+ NOTE: this 'tf' variant behaves closer to Tensorflow / JAX impl where the
229
+ bounds [a, b] are applied when sampling the normal distribution with mean=0, std=1.0
230
+ and the result is subsquently scaled and shifted by the mean and std args.
231
+ Args:
232
+ tensor: an n-dimensional `torch.Tensor`
233
+ mean: the mean of the normal distribution
234
+ std: the standard deviation of the normal distribution
235
+ a: the minimum cutoff value
236
+ b: the maximum cutoff value
237
+ """
238
+ with torch.no_grad():
239
+ _trunc_normal_(tensor, 0, 1.0, a, b)
240
+ tensor.mul_(std).add_(mean)
241
+
242
+
243
+ def variance_scaling_(tensor, scale=1.0, mode="fan_in", distribution="normal"):
244
+ fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
245
+ if mode == "fan_in":
246
+ denom = fan_in
247
+ elif mode == "fan_out":
248
+ denom = fan_out
249
+ elif mode == "fan_avg":
250
+ denom = (fan_in + fan_out) / 2
251
+
252
+ variance = scale / denom
253
+
254
+ if distribution == "truncated_normal":
255
+ # constant is stddev of standard normal truncated to (-2, 2)
256
+ trunc_normal_tf_(tensor, std=math.sqrt(variance) / 0.87962566103423978)
257
+ elif distribution == "normal":
258
+ with torch.no_grad():
259
+ tensor.normal_(std=math.sqrt(variance))
260
+ elif distribution == "uniform":
261
+ bound = math.sqrt(3 * variance)
262
+ with torch.no_grad():
263
+ tensor.uniform_(-bound, bound)
264
+ else:
265
+ raise ValueError(f"invalid distribution {distribution}")
266
+
267
+
268
+ def lecun_normal_(tensor):
269
+ variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
270
+
271
+
272
+ def default_flax_embed_init(tensor):
273
+ variance_scaling_(tensor, mode="fan_in", distribution="normal")
274
+
275
+
276
+ @dataclass
277
+ # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Siglip
278
+ class SiglipVisionModelOutput(ModelOutput):
279
+ """
280
+ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
281
+ Args:
282
+ image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
283
+ The image embeddings obtained by applying the projection layer to the pooler_output.
284
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
285
+ Sequence of hidden-states at the output of the last layer of the model.
286
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
287
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
288
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
289
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
290
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
291
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
292
+ sequence_length)`.
293
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
294
+ heads.
295
+ """
296
+
297
+ image_embeds: Optional[torch.FloatTensor] = None
298
+ last_hidden_state: torch.FloatTensor = None
299
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
300
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
301
+
302
+
303
+ class SiglipVisionEmbeddings(nn.Module):
304
+ def __init__(self, config: SiglipVisionConfig):
305
+ super().__init__()
306
+ self.config = config
307
+ self.embed_dim = config.hidden_size
308
+ self.image_size = config.image_size
309
+ self.patch_size = config.patch_size
310
+
311
+ self.patch_embedding = nn.Conv2d(
312
+ in_channels=config.num_channels,
313
+ out_channels=self.embed_dim,
314
+ kernel_size=self.patch_size,
315
+ stride=self.patch_size,
316
+ padding="valid",
317
+ )
318
+
319
+ self.num_patches_per_side = self.image_size // self.patch_size
320
+ self.num_patches = self.num_patches_per_side**2
321
+ self.num_positions = self.num_patches
322
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
323
+
324
+ def forward(
325
+ self,
326
+ pixel_values: torch.FloatTensor,
327
+ patch_attention_mask: torch.BoolTensor,
328
+ tgt_sizes: Optional[torch.IntTensor] = None,
329
+ ) -> torch.Tensor:
330
+ batch_size = pixel_values.size(0)
331
+
332
+ patch_embeds = self.patch_embedding(pixel_values)
333
+ embeddings = patch_embeds.flatten(2).transpose(1, 2)
334
+
335
+ max_im_h, max_im_w = pixel_values.size(2), pixel_values.size(3)
336
+ max_nb_patches_h, max_nb_patches_w = (
337
+ max_im_h // self.patch_size,
338
+ max_im_w // self.patch_size,
339
+ )
340
+ boundaries = torch.arange(1 / self.num_patches_per_side, 1.0, 1 / self.num_patches_per_side)
341
+ position_ids = torch.full(
342
+ size=(
343
+ batch_size,
344
+ max_nb_patches_h * max_nb_patches_w,
345
+ ),
346
+ fill_value=0,
347
+ )
348
+
349
+ for batch_idx, p_attn_mask in enumerate(patch_attention_mask):
350
+ if tgt_sizes is not None:
351
+ nb_patches_h = tgt_sizes[batch_idx][0]
352
+ nb_patches_w = tgt_sizes[batch_idx][1]
353
+ else:
354
+ nb_patches_h = p_attn_mask[:, 0].sum()
355
+ nb_patches_w = p_attn_mask[0].sum()
356
+
357
+ fractional_coords_h = torch.arange(0, 1 - 1e-6, 1 / nb_patches_h)
358
+ fractional_coords_w = torch.arange(0, 1 - 1e-6, 1 / nb_patches_w)
359
+
360
+ bucket_coords_h = torch.bucketize(fractional_coords_h, boundaries, right=True)
361
+ bucket_coords_w = torch.bucketize(fractional_coords_w, boundaries, right=True)
362
+
363
+ pos_ids = (bucket_coords_h[:, None] * self.num_patches_per_side + bucket_coords_w).flatten()
364
+ position_ids[batch_idx][p_attn_mask.view(-1).cpu()] = pos_ids
365
+
366
+ position_ids = position_ids.to(self.position_embedding.weight.device)
367
+
368
+ embeddings = embeddings + self.position_embedding(position_ids)
369
+ return embeddings
370
+
371
+
372
+ class SiglipAttention(nn.Module):
373
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
374
+
375
+ # Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__
376
+ def __init__(self, config):
377
+ super().__init__()
378
+ self.config = config
379
+ self.embed_dim = config.hidden_size
380
+ self.num_heads = config.num_attention_heads
381
+ self.head_dim = self.embed_dim // self.num_heads
382
+ if self.head_dim * self.num_heads != self.embed_dim:
383
+ raise ValueError(
384
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
385
+ f" {self.num_heads})."
386
+ )
387
+ self.scale = self.head_dim**-0.5
388
+ self.dropout = config.attention_dropout
389
+
390
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
391
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
392
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
393
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
394
+
395
+ def forward(
396
+ self,
397
+ hidden_states: torch.Tensor,
398
+ attention_mask: Optional[torch.Tensor] = None,
399
+ output_attentions: Optional[bool] = False,
400
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
401
+ """Input shape: Batch x Time x Channel"""
402
+
403
+ batch_size, q_len, _ = hidden_states.size()
404
+
405
+ query_states = self.q_proj(hidden_states)
406
+ key_states = self.k_proj(hidden_states)
407
+ value_states = self.v_proj(hidden_states)
408
+
409
+ query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
410
+ key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
411
+ value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
412
+
413
+ k_v_seq_len = key_states.shape[-2]
414
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
415
+
416
+ if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len):
417
+ raise ValueError(
418
+ f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is"
419
+ f" {attn_weights.size()}"
420
+ )
421
+
422
+ if attention_mask is not None:
423
+ if attention_mask.size() != (batch_size, 1, q_len, k_v_seq_len):
424
+ raise ValueError(
425
+ f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}"
426
+ )
427
+ attn_weights = attn_weights + attention_mask
428
+
429
+ # upcast attention to fp32
430
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
431
+ attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
432
+ attn_output = torch.matmul(attn_weights, value_states)
433
+
434
+ if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim):
435
+ raise ValueError(
436
+ f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is"
437
+ f" {attn_output.size()}"
438
+ )
439
+
440
+ attn_output = attn_output.transpose(1, 2).contiguous()
441
+ attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim)
442
+
443
+ attn_output = self.out_proj(attn_output)
444
+
445
+ return attn_output, attn_weights
446
+
447
+
448
+ class SiglipFlashAttention2(SiglipAttention):
449
+ """
450
+ Llama flash attention module. This module inherits from `LlamaAttention` as the weights of the module stays
451
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
452
+ flash attention and deal with padding tokens in case the input contains any of them.
453
+ """
454
+
455
+ def __init__(self, *args, **kwargs):
456
+ super().__init__(*args, **kwargs)
457
+ self.is_causal = False # Hack to make sure we don't use a causal mask
458
+
459
+ def forward(
460
+ self,
461
+ hidden_states: torch.Tensor,
462
+ attention_mask: Optional[torch.LongTensor] = None,
463
+ position_ids: Optional[torch.LongTensor] = None,
464
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
465
+ output_attentions: bool = False,
466
+ use_cache: bool = False,
467
+ **kwargs,
468
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
469
+ output_attentions = False
470
+
471
+ bsz, q_len, _ = hidden_states.size()
472
+
473
+ query_states = self.q_proj(hidden_states)
474
+ key_states = self.k_proj(hidden_states)
475
+ value_states = self.v_proj(hidden_states)
476
+
477
+ # Flash attention requires the input to have the shape
478
+ # batch_size x seq_length x head_dim x hidden_dim
479
+ # therefore we just need to keep the original shape
480
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
481
+ key_states = key_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
482
+ value_states = value_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
483
+
484
+ kv_seq_len = key_states.shape[-2]
485
+ if past_key_value is not None:
486
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
487
+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
488
+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
489
+
490
+ # if past_key_value is not None:
491
+ # cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
492
+ # key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
493
+
494
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
495
+ # to be able to avoid many of these transpose/reshape/view.
496
+ query_states = query_states.transpose(1, 2)
497
+ key_states = key_states.transpose(1, 2)
498
+ value_states = value_states.transpose(1, 2)
499
+
500
+ dropout_rate = self.dropout if self.training else 0.0
501
+
502
+ # In PEFT, usually we cast the layer norms in float32 for training stability reasons
503
+ # therefore the input hidden states gets silently casted in float32. Hence, we need
504
+ # cast them back in the correct dtype just to be sure everything works as expected.
505
+ # This might slowdown training & inference so it is recommended to not cast the LayerNorms
506
+ # in fp32. (LlamaRMSNorm handles it correctly)
507
+
508
+ input_dtype = query_states.dtype
509
+ if input_dtype == torch.float32:
510
+ if torch.is_autocast_enabled():
511
+ target_dtype = torch.get_autocast_gpu_dtype()
512
+ # Handle the case where the model is quantized
513
+ elif hasattr(self.config, "_pre_quantization_dtype"):
514
+ target_dtype = self.config._pre_quantization_dtype
515
+ else:
516
+ target_dtype = self.q_proj.weight.dtype
517
+
518
+ logger.warning_once(
519
+ "The input hidden states seems to be silently casted in float32, this might be related to the fact"
520
+ " you have upcasted embedding or layer norm layers in float32. We will cast back the input in"
521
+ f" {target_dtype}."
522
+ )
523
+
524
+ query_states = query_states.to(target_dtype)
525
+ key_states = key_states.to(target_dtype)
526
+ value_states = value_states.to(target_dtype)
527
+
528
+ attn_output = self._flash_attention_forward(
529
+ query_states,
530
+ key_states,
531
+ value_states,
532
+ attention_mask,
533
+ q_len,
534
+ dropout=dropout_rate,
535
+ )
536
+
537
+ attn_output = attn_output.reshape(bsz, q_len, self.embed_dim).contiguous()
538
+ attn_output = self.out_proj(attn_output)
539
+
540
+ if not output_attentions:
541
+ attn_weights = None
542
+
543
+ return attn_output, attn_weights
544
+
545
+ def _flash_attention_forward(
546
+ self,
547
+ query_states,
548
+ key_states,
549
+ value_states,
550
+ attention_mask,
551
+ query_length,
552
+ dropout=0.0,
553
+ softmax_scale=None,
554
+ ):
555
+ """
556
+ Calls the forward method of Flash Attention - if the input hidden states contain at least one padding token
557
+ first unpad the input, then computes the attention scores and pad the final attention scores.
558
+ Args:
559
+ query_states (`torch.Tensor`):
560
+ Input query states to be passed to Flash Attention API
561
+ key_states (`torch.Tensor`):
562
+ Input key states to be passed to Flash Attention API
563
+ value_states (`torch.Tensor`):
564
+ Input value states to be passed to Flash Attention API
565
+ attention_mask (`torch.Tensor`):
566
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
567
+ position of padding tokens and 1 for the position of non-padding tokens.
568
+ dropout (`float`, *optional*):
569
+ Attention dropout
570
+ softmax_scale (`float`, *optional*):
571
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
572
+ """
573
+
574
+ # TODO: Remove the `query_length != 1` check once Flash Attention for RoCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
575
+ causal = self.is_causal and query_length != 1
576
+
577
+ # Contains at least one padding token in the sequence
578
+ if attention_mask is not None:
579
+ batch_size = query_states.shape[0]
580
+ (
581
+ query_states,
582
+ key_states,
583
+ value_states,
584
+ indices_q,
585
+ cu_seq_lens,
586
+ max_seq_lens,
587
+ ) = self._upad_input(query_states, key_states, value_states, attention_mask, query_length)
588
+
589
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
590
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
591
+
592
+ attn_output_unpad = flash_attn_varlen_func(
593
+ query_states,
594
+ key_states,
595
+ value_states,
596
+ cu_seqlens_q=cu_seqlens_q,
597
+ cu_seqlens_k=cu_seqlens_k,
598
+ max_seqlen_q=max_seqlen_in_batch_q,
599
+ max_seqlen_k=max_seqlen_in_batch_k,
600
+ dropout_p=dropout,
601
+ softmax_scale=softmax_scale,
602
+ causal=causal,
603
+ )
604
+
605
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
606
+ else:
607
+ attn_output = flash_attn_func(
608
+ query_states,
609
+ key_states,
610
+ value_states,
611
+ dropout,
612
+ softmax_scale=softmax_scale,
613
+ causal=causal,
614
+ )
615
+
616
+ return attn_output
617
+
618
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
619
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
620
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
621
+
622
+ key_layer = index_first_axis(
623
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim),
624
+ indices_k,
625
+ )
626
+ value_layer = index_first_axis(
627
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim),
628
+ indices_k,
629
+ )
630
+ if query_length == kv_seq_len:
631
+ query_layer = index_first_axis(
632
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim),
633
+ indices_k,
634
+ )
635
+ cu_seqlens_q = cu_seqlens_k
636
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
637
+ indices_q = indices_k
638
+ elif query_length == 1:
639
+ max_seqlen_in_batch_q = 1
640
+ cu_seqlens_q = torch.arange(
641
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
642
+ ) # There is a memcpy here, that is very bad.
643
+ indices_q = cu_seqlens_q[:-1]
644
+ query_layer = query_layer.squeeze(1)
645
+ else:
646
+ # The -q_len: slice assumes left padding.
647
+ attention_mask = attention_mask[:, -query_length:]
648
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
649
+
650
+ return (
651
+ query_layer,
652
+ key_layer,
653
+ value_layer,
654
+ indices_q,
655
+ (cu_seqlens_q, cu_seqlens_k),
656
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
657
+ )
658
+
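+ # Note on the unpadded ("varlen") layout used above (illustrative): tokens from all sequences in the
+ # batch are packed into a single dimension and `cu_seqlens` marks the boundaries. For example, per-sample
+ # key lengths [3, 5] give cu_seqlens_k = [0, 3, 8] and max_seqlen_in_batch_k = 5.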
659
+
660
+ # Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->Siglip
661
+ class SiglipMLP(nn.Module):
662
+ def __init__(self, config):
663
+ super().__init__()
664
+ self.config = config
665
+ self.activation_fn = ACT2FN[config.hidden_act]
666
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
667
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
668
+
669
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
670
+ hidden_states = self.fc1(hidden_states)
671
+ hidden_states = self.activation_fn(hidden_states)
672
+ hidden_states = self.fc2(hidden_states)
673
+ return hidden_states
674
+
675
+
676
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->Siglip
677
+ class SiglipEncoderLayer(nn.Module):
678
+ def __init__(self, config: SiglipVisionConfig):
679
+ super().__init__()
680
+ self.embed_dim = config.hidden_size
681
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
682
+ self.self_attn = SiglipAttention(config) if not self._use_flash_attention_2 else SiglipFlashAttention2(config)
683
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
684
+ self.mlp = SiglipMLP(config)
685
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
686
+
687
+ def forward(
688
+ self,
689
+ hidden_states: torch.Tensor,
690
+ attention_mask: torch.Tensor,
691
+ output_attentions: Optional[bool] = False,
692
+ ) -> Tuple[torch.FloatTensor]:
693
+ """
694
+ Args:
695
+ hidden_states (`torch.FloatTensor`):
696
+ Input to the layer of shape `(batch, seq_len, embed_dim)`.
697
+ attention_mask (`torch.FloatTensor`):
698
+ Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values.
699
+ output_attentions (`bool`, *optional*, defaults to `False`):
700
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
701
+ returned tensors for more detail.
702
+ """
703
+ residual = hidden_states
704
+
705
+ hidden_states = self.layer_norm1(hidden_states)
706
+ hidden_states, attn_weights = self.self_attn(
707
+ hidden_states=hidden_states,
708
+ attention_mask=attention_mask,
709
+ output_attentions=output_attentions,
710
+ )
711
+ hidden_states = residual + hidden_states
712
+
713
+ residual = hidden_states
714
+ hidden_states = self.layer_norm2(hidden_states)
715
+ hidden_states = self.mlp(hidden_states)
716
+ hidden_states = residual + hidden_states
717
+
718
+ outputs = (hidden_states,)
719
+
720
+ if output_attentions:
721
+ outputs += (attn_weights,)
722
+
723
+ return outputs
724
+
725
+
726
+ class SiglipPreTrainedModel(PreTrainedModel):
727
+ """
728
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
729
+ models.
730
+ """
731
+
732
+ config_class = SiglipVisionConfig
733
+ base_model_prefix = "siglip"
734
+ supports_gradient_checkpointing = True
735
+
736
+ def _init_weights(self, module):
737
+ """Initialize the weights"""
738
+
739
+ if isinstance(module, SiglipVisionEmbeddings):
740
+ width = self.config.hidden_size
741
+ nn.init.normal_(module.position_embedding.weight, std=1 / np.sqrt(width))
742
+ elif isinstance(module, nn.Embedding):
743
+ default_flax_embed_init(module.weight)
744
+ elif isinstance(module, SiglipAttention):
745
+ nn.init.normal_(module.q_proj.weight)
746
+ nn.init.normal_(module.k_proj.weight)
747
+ nn.init.normal_(module.v_proj.weight)
748
+ nn.init.normal_(module.out_proj.weight)
749
+ nn.init.zeros_(module.q_proj.bias)
750
+ nn.init.zeros_(module.k_proj.bias)
751
+ nn.init.zeros_(module.v_proj.bias)
752
+ nn.init.zeros_(module.out_proj.bias)
753
+ elif isinstance(module, SiglipMLP):
754
+ nn.init.normal_(module.fc1.weight)
755
+ nn.init.normal_(module.fc2.weight)
756
+ nn.init.normal_(module.fc1.bias, std=1e-6)
757
+ nn.init.normal_(module.fc2.bias, std=1e-6)
758
+ elif isinstance(module, (nn.Linear, nn.Conv2d)):
759
+ lecun_normal_(module.weight)
760
+ if module.bias is not None:
761
+ nn.init.zeros_(module.bias)
762
+ elif isinstance(module, nn.LayerNorm):
763
+ module.bias.data.zero_()
764
+ module.weight.data.fill_(1.0)
765
+
766
+
767
+ SIGLIP_START_DOCSTRING = r"""
768
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
769
+ library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
770
+ etc.)
771
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
772
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
773
+ and behavior.
774
+ Parameters:
775
+ config ([`SiglipVisionConfig`]): Model configuration class with all the parameters of the model.
776
+ Initializing with a config file does not load the weights associated with the model, only the
777
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
778
+ """
779
+
780
+
781
+ SIGLIP_VISION_INPUTS_DOCSTRING = r"""
782
+ Args:
783
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
784
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
785
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
786
+ output_attentions (`bool`, *optional*):
787
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
788
+ tensors for more detail.
789
+ output_hidden_states (`bool`, *optional*):
790
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
791
+ more detail.
792
+ return_dict (`bool`, *optional*):
793
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
794
+ """
795
+
796
+
797
+ # Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->Siglip
798
+ class SiglipEncoder(nn.Module):
799
+ """
800
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
801
+ [`SiglipEncoderLayer`].
802
+ Args:
803
+ config: SiglipConfig
804
+ """
805
+
806
+ def __init__(self, config: SiglipVisionConfig):
807
+ super().__init__()
808
+ self.config = config
809
+ self.layers = nn.ModuleList([SiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
810
+ self.gradient_checkpointing = False
811
+
812
+ # Ignore copy
813
+ def forward(
814
+ self,
815
+ inputs_embeds,
816
+ attention_mask: Optional[torch.Tensor] = None,
817
+ output_attentions: Optional[bool] = None,
818
+ output_hidden_states: Optional[bool] = None,
819
+ return_dict: Optional[bool] = None,
820
+ ) -> Union[Tuple, BaseModelOutput]:
821
+ r"""
822
+ Args:
823
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
824
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
825
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
826
+ than the model's internal embedding lookup matrix.
827
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
828
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
829
+ - 1 for tokens that are **not masked**,
830
+ - 0 for tokens that are **masked**.
831
+ [What are attention masks?](../glossary#attention-mask)
832
+ output_attentions (`bool`, *optional*):
833
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
834
+ returned tensors for more detail.
835
+ output_hidden_states (`bool`, *optional*):
836
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
837
+ for more detail.
838
+ return_dict (`bool`, *optional*):
839
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
840
+ """
841
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
842
+ output_hidden_states = (
843
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
844
+ )
845
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
846
+
847
+ encoder_states = () if output_hidden_states else None
848
+ all_attentions = () if output_attentions else None
849
+
850
+ hidden_states = inputs_embeds
851
+ for encoder_layer in self.layers:
852
+ if output_hidden_states:
853
+ encoder_states = encoder_states + (hidden_states,)
854
+ if self.gradient_checkpointing and self.training:
855
+ layer_outputs = self._gradient_checkpointing_func(
856
+ encoder_layer.__call__,
857
+ hidden_states,
858
+ attention_mask,
859
+ output_attentions,
860
+ )
861
+ else:
862
+ layer_outputs = encoder_layer(
863
+ hidden_states,
864
+ attention_mask,
865
+ output_attentions=output_attentions,
866
+ )
867
+
868
+ hidden_states = layer_outputs[0]
869
+
870
+ if output_attentions:
871
+ all_attentions = all_attentions + (layer_outputs[1],)
872
+
873
+ if output_hidden_states:
874
+ encoder_states = encoder_states + (hidden_states,)
875
+
876
+ if not return_dict:
877
+ return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
878
+ return BaseModelOutput(
879
+ last_hidden_state=hidden_states,
880
+ hidden_states=encoder_states,
881
+ attentions=all_attentions,
882
+ )
883
+
884
+
885
+ @add_start_docstrings(
886
+ """The vision model from SigLIP without any head or projection on top.""",
887
+ SIGLIP_START_DOCSTRING,
888
+ )
889
+ class SiglipVisionTransformer(SiglipPreTrainedModel):
890
+ config_class = SiglipVisionConfig
891
+ main_input_name = "pixel_values"
892
+ _supports_flash_attn_2 = True
893
+ _no_split_modules = []
894
+
895
+ def __init__(self, config: SiglipVisionConfig):
896
+ super().__init__(config)
897
+ self.config = config
898
+ embed_dim = config.hidden_size
899
+
900
+ self.embeddings = SiglipVisionEmbeddings(config)
901
+ self.encoder = SiglipEncoder(config)
902
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
903
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
904
+
905
+ # Initialize weights and apply final processing
906
+ self.post_init()
907
+
908
+ def get_input_embeddings(self) -> nn.Module:
909
+ return self.embeddings.patch_embedding
910
+
911
+ @add_start_docstrings_to_model_forward(SIGLIP_VISION_INPUTS_DOCSTRING)
912
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=SiglipVisionConfig)
913
+ def forward(
914
+ self,
915
+ pixel_values,
916
+ patch_attention_mask: Optional[torch.BoolTensor] = None,
917
+ tgt_sizes: Optional[torch.IntTensor] = None,
918
+ output_attentions: Optional[bool] = None,
919
+ output_hidden_states: Optional[bool] = None,
920
+ return_dict: Optional[bool] = None,
921
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
922
+ r"""
923
+ Returns:
924
+ """
925
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
926
+ output_hidden_states = (
927
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
928
+ )
929
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
930
+
931
+ batch_size = pixel_values.size(0)
932
+ if patch_attention_mask is None:
933
+ patch_attention_mask = torch.ones(
934
+ size=(
935
+ batch_size,
936
+ pixel_values.size(2) // self.config.patch_size,
937
+ pixel_values.size(3) // self.config.patch_size,
938
+ ),
939
+ dtype=torch.bool,
940
+ device=pixel_values.device,
941
+ )
942
+
943
+ hidden_states = self.embeddings(
944
+ pixel_values=pixel_values,
945
+ patch_attention_mask=patch_attention_mask,
946
+ tgt_sizes=tgt_sizes,
947
+ )
948
+
949
+ patch_attention_mask = patch_attention_mask.view(batch_size, -1)
950
+ # The call to `_upad_input` in `_flash_attention_forward` is expensive
951
+ # So when the `patch_attention_mask` is full of 1s (i.e. attending to the whole sequence),
952
+ # we avoid passing the attention_mask, which is equivalent to attending to the full sequence
953
+ if not torch.any(~patch_attention_mask):
954
+ attention_mask = None
955
+ else:
956
+ attention_mask = (
957
+ _prepare_4d_attention_mask(patch_attention_mask, hidden_states.dtype)
958
+ if not self._use_flash_attention_2
959
+ else patch_attention_mask
960
+ )
961
+
962
+ encoder_outputs = self.encoder(
963
+ inputs_embeds=hidden_states,
964
+ attention_mask=attention_mask,
965
+ output_attentions=output_attentions,
966
+ output_hidden_states=output_hidden_states,
967
+ return_dict=return_dict,
968
+ )
969
+
970
+ last_hidden_state = encoder_outputs[0]
971
+ last_hidden_state = self.post_layernorm(last_hidden_state)
972
+
973
+ if not return_dict:
974
+ return (last_hidden_state, None) + encoder_outputs[1:]
975
+
976
+ return BaseModelOutputWithPooling(
977
+ last_hidden_state=last_hidden_state,
978
+ pooler_output=None,
979
+ hidden_states=encoder_outputs.hidden_states,
980
+ attentions=encoder_outputs.attentions,
981
+ )
preprocessor_config.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "image_processor_type": "MiniCPMVImageProcessor",
3
+ "feature_extractor_type": "MiniCPMAAudioProcessor",
4
+ "auto_map": {
5
+ "AutoProcessor": "processing_minicpmo.MiniCPMOProcessor",
6
+ "AutoImageProcessor": "processing_minicpmo.MiniCPMVImageProcessor",
7
+ "AutoFeatureExtractor": "processing_minicpmo.MiniCPMAAudioProcessor"
8
+ },
9
+ "processor_class": "MiniCPMOProcessor",
10
+ "max_slice_nums": 9,
11
+ "scale_resolution": 448,
12
+ "patch_size": 14,
13
+ "use_image_id": true,
14
+ "image_feature_size": 64,
15
+ "im_start": "<image>",
16
+ "im_end": "</image>",
17
+ "slice_start": "<slice>",
18
+ "slice_end": "</slice>",
19
+ "unk": "<unk>",
20
+ "im_id_start": "<image_id>",
21
+ "im_id_end": "</image_id>",
22
+ "slice_mode": true,
23
+ "audio_pool_step": 5,
24
+ "norm_mean": [
25
+ 0.5,
26
+ 0.5,
27
+ 0.5
28
+ ],
29
+ "norm_std": [
30
+ 0.5,
31
+ 0.5,
32
+ 0.5
33
+ ],
34
+ "version": 4.5
35
+ }
processing_minicpmo.py ADDED
@@ -0,0 +1,1665 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ # Copyright 2026 The OpenBMB Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ import copy
18
+ import math
19
+ import re
20
+ from typing import Any
21
+ from typing import Dict
22
+ from typing import List
23
+ from typing import Optional
24
+ from typing import Tuple
25
+ from typing import Union
26
+
27
+ import numpy as np
28
+ import torch
29
+ from PIL import Image
30
+ from transformers import AutoImageProcessor
31
+ from transformers.audio_utils import spectrogram
32
+ from transformers.audio_utils import window_function
33
+ from transformers.image_processing_utils import BaseImageProcessor
34
+ from transformers.image_processing_utils import BatchFeature
35
+ from transformers.image_transforms import to_channel_dimension_format
36
+ from transformers.image_utils import ChannelDimension
37
+ from transformers.image_utils import ImageInput
38
+ from transformers.image_utils import infer_channel_dimension_format
39
+ from transformers.image_utils import is_torch_tensor
40
+ from transformers.image_utils import to_numpy_array
41
+ from transformers.image_utils import valid_images
42
+ from transformers.models.whisper.feature_extraction_whisper import WhisperFeatureExtractor
43
+ from transformers.processing_utils import ProcessorMixin
44
+ from transformers.tokenization_utils_base import PreTokenizedInput
45
+ from transformers.tokenization_utils_base import TextInput
46
+ from transformers.utils import is_torch_device
47
+ from transformers.utils import is_torch_dtype
48
+ from transformers.utils import requires_backends
49
+ from transformers.utils import TensorType
50
+
51
+
52
+ def recursive_converter(converter, value):
53
+ if isinstance(value, list):
54
+ new_value = []
55
+ for v in value:
56
+ new_value += [recursive_converter(converter, v)]
57
+ return new_value
58
+ else:
59
+ return converter(value)
60
+
61
+
62
+ class MiniCPMOBatchFeature(BatchFeature):
63
+ """Extend from BatchFeature for supporting various image size"""
64
+
65
+ def __init__(self, data: Optional[Dict[str, Any]] = None, tensor_type: Union[None, str, TensorType] = None):
66
+ super().__init__(data)
67
+ self.convert_to_tensors(tensor_type=tensor_type)
68
+
69
+ def convert_to_tensors(self, tensor_type: Optional[Union[str, TensorType]] = None, **kwargs):
70
+ if tensor_type is None:
71
+ return self
72
+
73
+ is_tensor, as_tensor = self._get_is_as_tensor_fns(tensor_type)
74
+
75
+ def converter(value):
76
+ try:
77
+ if not is_tensor(value):
78
+ tensor = as_tensor(value)
79
+ return tensor
80
+ except: # noqa E722
81
+ if key == "overflowing_values":
82
+ raise ValueError("Unable to create tensor returning overflowing values of different lengths. ")
83
+ raise ValueError(
84
+ "Unable to create tensor, you should probably activate padding "
85
+ "with 'padding=True' to have batched tensors with the same length."
86
+ )
87
+
88
+ for key, value in self.items():
89
+ self[key] = recursive_converter(converter, value)
90
+ return self
91
+
92
+ def to(self, *args, **kwargs) -> "MiniCPMOBatchFeature":
93
+ requires_backends(self, ["torch"])
94
+ import torch
95
+
96
+ def cast_tensor(v):
97
+ if not torch.is_tensor(v):
98
+ return v
99
+
100
+ if torch.is_floating_point(v):
101
+ return v.to(*args, **kwargs)
102
+ elif device is not None:
103
+ return v.to(device=device)
104
+ else:
105
+ return v
106
+
107
+ new_data = {}
108
+ device = kwargs.get("device")
109
+ if device is None and len(args) > 0:
110
+ arg = args[0]
111
+ if is_torch_dtype(arg):
112
+ pass
113
+ elif isinstance(arg, str) or is_torch_device(arg) or isinstance(arg, int):
114
+ device = arg
115
+ else:
116
+ raise ValueError(f"Attempting to cast a BatchFeature to type {str(arg)}. This is not supported.")
117
+
118
+ # We cast only floating point tensors to avoid issues with tokenizers casting `LongTensor` to `FloatTensor`
119
+ for k, v in self.items():
120
+ new_data[k] = recursive_converter(cast_tensor, v)
121
+ self.data = new_data
122
+ return self
123
+
124
+
125
+ class MiniCPMVImageProcessor(BaseImageProcessor):
126
+ model_input_names = ["pixel_values"]
127
+
128
+ def __init__(self, max_slice_nums=9, scale_resolution=448, patch_size=14, **kwargs):
129
+ super().__init__(**kwargs)
130
+ self.max_slice_nums = max_slice_nums
131
+ self.scale_resolution = scale_resolution
132
+ self.patch_size = patch_size
133
+ self.use_image_id = kwargs.pop("use_image_id", False)
134
+ self.image_feature_size = kwargs.pop("image_feature_size", 64)
135
+ self.im_start_token = kwargs.pop("im_start", "<image>")
136
+ self.im_end_token = kwargs.pop("im_end", "</image>")
137
+ self.slice_start_token = kwargs.pop("slice_start", "<slice>")
138
+ self.slice_end_token = kwargs.pop("slice_end", "</slice>")
139
+ self.unk_token = kwargs.pop("unk", "<unk>")
140
+ self.im_id_start = kwargs.pop("im_id_start", "<image_id>")
141
+ self.im_id_end = kwargs.pop("im_id_end", "</image_id>")
142
+ self.slice_mode = kwargs.pop("slice_mode", True)
143
+
144
+ self.mean = np.array(kwargs.pop("norm_mean", [0.5, 0.5, 0.5]))
145
+ self.std = np.array(kwargs.pop("norm_std", [0.5, 0.5, 0.5]))
146
+ self.version = kwargs.pop("version", 2.0)
147
+
148
+ @staticmethod
149
+ def ensure_divide(length, patch_size):
150
+ return max(round(length / patch_size) * patch_size, patch_size)
151
+
152
+ def find_best_resize(self, original_size, scale_resolution, patch_size, allow_upscale=False):
153
+ width, height = original_size
154
+ if (width * height > scale_resolution * scale_resolution) or allow_upscale:
155
+ r = width / height
156
+ height = int(scale_resolution / math.sqrt(r))
157
+ width = int(height * r)
158
+ best_width = self.ensure_divide(width, patch_size)
159
+ best_height = self.ensure_divide(height, patch_size)
160
+ return best_width, best_height
161
+
162
+ def get_refine_size(self, original_size, grid, scale_resolution, patch_size, allow_upscale=False):
163
+ width, height = original_size
164
+ grid_x, grid_y = grid
165
+
166
+ refine_width = self.ensure_divide(width, grid_x)
167
+ refine_height = self.ensure_divide(height, grid_y)
168
+
169
+ grid_width = refine_width / grid_x
170
+ grid_height = refine_height / grid_y
171
+
172
+ best_grid_size = self.find_best_resize(
173
+ (grid_width, grid_height), scale_resolution, patch_size, allow_upscale=allow_upscale
174
+ )
175
+ refine_size = (best_grid_size[0] * grid_x, best_grid_size[1] * grid_y)
176
+ return refine_size
177
+
178
+ @staticmethod
179
+ def split_to_patches(image, grid):
180
+ patches = []
181
+ width, height = image.size
182
+ grid_x = int(width / grid[0])
183
+ grid_y = int(height / grid[1])
184
+ for i in range(0, height, grid_y):
185
+ images = []
186
+ for j in range(0, width, grid_x):
187
+ box = (j, i, j + grid_x, i + grid_y)
188
+ patch = image.crop(box)
189
+ images.append(patch)
190
+ patches.append(images)
191
+ return patches
192
+
193
+ def slice_image(self, image, max_slice_nums=9, scale_resolution=448, patch_size=14, never_split=False):
194
+ original_size = image.size
195
+ source_image = None
196
+ best_grid = self.get_sliced_grid(original_size, max_slice_nums, never_split)
197
+ patches = []
198
+
199
+ if best_grid is None:
200
+ # don't need to slice, upsample
201
+ best_size = self.find_best_resize(original_size, scale_resolution, patch_size, allow_upscale=True)
202
+ source_image = image.resize(best_size, resample=Image.Resampling.BICUBIC)
203
+ else:
204
+ # source image, down-sampling and ensure divided by patch_size
205
+ best_resize = self.find_best_resize(original_size, scale_resolution, patch_size)
206
+ source_image = image.copy().resize(best_resize, resample=Image.Resampling.BICUBIC)
207
+ refine_size = self.get_refine_size(
208
+ original_size, best_grid, scale_resolution, patch_size, allow_upscale=True
209
+ )
210
+ refine_image = image.resize(refine_size, resample=Image.Resampling.BICUBIC)
211
+ patches = self.split_to_patches(refine_image, best_grid)
212
+
213
+ return source_image, patches, best_grid
214
+
215
+ def get_grid_placeholder(self, grid):
216
+ if grid is None:
217
+ return ""
218
+ slice_image_placeholder = (
219
+ self.slice_start_token + self.unk_token * self.image_feature_size + self.slice_end_token
220
+ )
221
+
222
+ cols = grid[0]
223
+ rows = grid[1]
224
+ slices = []
225
+ for i in range(rows):
226
+ lines = []
227
+ for j in range(cols):
228
+ lines.append(slice_image_placeholder)
229
+ slices.append("".join(lines))
230
+
231
+ slice_placeholder = "\n".join(slices)
232
+ return slice_placeholder
233
+
234
+ def get_image_id_placeholder(self, idx=0):
235
+ return f"{self.im_id_start}{idx}{self.im_id_end}"
236
+
237
+ def get_sliced_images(self, image, max_slice_nums=None):
238
+ slice_images = []
239
+
240
+ if not self.slice_mode:
241
+ return [image]
242
+
243
+ max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums)
244
+ assert max_slice_nums > 0
245
+ source_image, patches, sliced_grid = self.slice_image(
246
+ image, max_slice_nums, self.scale_resolution, self.patch_size # default: 9 # default: 448 # default: 14
247
+ )
248
+
249
+ slice_images.append(source_image)
250
+ if len(patches) > 0:
251
+ for i in range(len(patches)):
252
+ for j in range(len(patches[0])):
253
+ slice_images.append(patches[i][j])
254
+ return slice_images
255
+
256
+ def get_sliced_grid(self, image_size, max_slice_nums, nerver_split=False):
257
+ original_width, original_height = image_size
258
+ log_ratio = math.log(original_width / original_height)
259
+ ratio = original_width * original_height / (self.scale_resolution * self.scale_resolution)
260
+ multiple = min(math.ceil(ratio), max_slice_nums)
261
+ if multiple <= 1 or nerver_split:
262
+ return None
263
+ candidate_split_grids_nums = []
264
+ for i in [multiple - 1, multiple, multiple + 1]:
265
+ if i == 1 or i > max_slice_nums:
266
+ continue
267
+ candidate_split_grids_nums.append(i)
268
+
269
+ candidate_grids = []
270
+ for split_grids_nums in candidate_split_grids_nums:
271
+ m = 1
272
+ while m <= split_grids_nums:
273
+ if split_grids_nums % m == 0:
274
+ candidate_grids.append([m, split_grids_nums // m])
275
+ m += 1
276
+
277
+ best_grid = [1, 1]
278
+ min_error = float("inf")
279
+ for grid in candidate_grids:
280
+ error = abs(log_ratio - math.log(grid[0] / grid[1]))
281
+ if error < min_error:
282
+ best_grid = grid
283
+ min_error = error
284
+
285
+ return best_grid
286
+
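+ # Worked example for the grid search above (numbers are illustrative): a 1920x1080 image with
+ # scale_resolution=448 and max_slice_nums=9 gives ratio ~ 10.3, so multiple = min(ceil(10.3), 9) = 9.
+ # Candidate grid counts are [8, 9] (10 exceeds max_slice_nums); among their factor pairs,
+ # [4, 2] minimises |log(1920/1080) - log(cols/rows)|, so best_grid = [4, 2] (4 columns x 2 rows).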
287
+ def get_slice_image_placeholder(self, image_size, image_idx=0, max_slice_nums=None, use_image_id=None):
288
+ max_slice_nums = self.max_slice_nums if max_slice_nums is None else int(max_slice_nums)
289
+ assert max_slice_nums > 0
290
+ grid = self.get_sliced_grid(image_size=image_size, max_slice_nums=max_slice_nums)
291
+
292
+ image_placeholder = self.im_start_token + self.unk_token * self.image_feature_size + self.im_end_token
293
+ use_image_id = self.use_image_id if use_image_id is None else bool(use_image_id)
294
+ if use_image_id:
295
+ final_placeholder = self.get_image_id_placeholder(image_idx) + image_placeholder
296
+ else:
297
+ final_placeholder = image_placeholder
298
+
299
+ if self.slice_mode:
300
+ final_placeholder = final_placeholder + self.get_grid_placeholder(grid=grid)
301
+ return final_placeholder
302
+
303
+ @staticmethod
304
+ def to_pil_image(image, rescale=None) -> Image.Image:
305
+ """Converts `image` to a PIL Image. Optionally rescales it and puts the channel dimension back
306
+ as the last axis if needed.
307
+
308
+ Args:
309
+ image (`Image.Image` or `numpy.ndarray` or `torch.Tensor`):
310
+ The image to convert to the PIL Image format.
311
+ rescale (`bool`, *optional*):
312
+ whether to apply the scaling factor (to make pixel values integers between 0 and 255). Will
313
+ default to `True` if the image type is a floating type, `False` otherwise.
314
+ """
315
+ if isinstance(image, Image.Image):
316
+ return image
317
+ if is_torch_tensor(image):
318
+ image = image.numpy()
319
+
320
+ if isinstance(image, np.ndarray):
321
+ if rescale is None:
322
+ # rescale default to the array being of floating type.
323
+ rescale = isinstance(image.flat[0], np.floating)
324
+ # If the channel has been moved to the first dim, we put it back at the end.
325
+ if image.ndim == 3 and image.shape[0] in [1, 3]:
326
+ image = image.transpose(1, 2, 0)
327
+ if rescale:
328
+ image = image * 255
329
+ image = image.astype(np.uint8)
330
+ return Image.fromarray(image)
331
+ return image
332
+
333
+ def reshape_by_patch(self, image):
334
+ image = torch.from_numpy(image)
335
+ patch_size = self.patch_size
336
+ patches = torch.nn.functional.unfold(image, (patch_size, patch_size), stride=(patch_size, patch_size))
337
+
338
+ patches = patches.reshape(image.size(0), patch_size, patch_size, -1)
339
+ patches = patches.permute(0, 1, 3, 2).reshape(image.size(0), patch_size, -1)
340
+ return patches.numpy()
341
+
342
+ def preprocess(
343
+ self,
344
+ images: Union[Image.Image, List[Image.Image], List[List[Image.Image]]],
345
+ do_pad: Optional[bool] = True,
346
+ max_slice_nums: int = None,
347
+ return_tensors: Optional[Union[str, TensorType]] = None,
348
+ **kwargs,
349
+ ) -> MiniCPMOBatchFeature:
350
+ if isinstance(images, Image.Image):
351
+ images_list = [[images]]
352
+ elif isinstance(images[0], Image.Image):
353
+ images_list = [images]
354
+ else:
355
+ images_list = images
356
+
357
+ new_images_list = []
358
+ image_sizes_list = []
359
+ tgt_sizes_list = []
360
+
361
+ for _images in images_list:
362
+ if _images is None or len(_images) == 0:
363
+ new_images_list.append([])
364
+ image_sizes_list.append([])
365
+ tgt_sizes_list.append([])
366
+ continue
367
+ if not valid_images(_images):
368
+ raise ValueError(
369
+ "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
370
+ "torch.Tensor, tf.Tensor or jax.ndarray."
371
+ )
372
+
373
+ _images = [self.to_pil_image(image).convert("RGB") for image in _images]
374
+ input_data_format = infer_channel_dimension_format(np.array(_images[0]))
375
+
376
+ new_images = []
377
+ image_sizes = [image.size for image in _images]
378
+ tgt_sizes = []
379
+ for image in _images:
380
+ image_patches = self.get_sliced_images(image, max_slice_nums)
381
+ image_patches = [to_numpy_array(image).astype(np.float32) / 255 for image in image_patches]
382
+ image_patches = [
383
+ self.normalize(image=image, mean=self.mean, std=self.std, input_data_format=input_data_format)
384
+ for image in image_patches
385
+ ]
386
+ image_patches = [
387
+ to_channel_dimension_format(image, ChannelDimension.FIRST, input_channel_dim=input_data_format)
388
+ for image in image_patches
389
+ ]
390
+ for slice_image in image_patches:
391
+ new_images.append(self.reshape_by_patch(slice_image))
392
+ tgt_sizes.append(
393
+ np.array((slice_image.shape[1] // self.patch_size, slice_image.shape[2] // self.patch_size))
394
+ )
395
+
396
+ if tgt_sizes:
397
+ tgt_sizes = np.vstack(tgt_sizes)
398
+
399
+ new_images_list.append(new_images)
400
+ image_sizes_list.append(image_sizes)
401
+ tgt_sizes_list.append(tgt_sizes)
402
+ return MiniCPMOBatchFeature(
403
+ data={"pixel_values": new_images_list, "image_sizes": image_sizes_list, "tgt_sizes": tgt_sizes_list},
404
+ tensor_type=return_tensors,
405
+ )
406
+
407
+
408
+ AutoImageProcessor.register("MiniCPMVImageProcessor", MiniCPMVImageProcessor)
409
+
410
+
411
+ def chunk_audio(audio: np.ndarray, max_duration_seconds: int = 30, sample_rate: int = 16000) -> List[np.ndarray]:
412
+ """split long audio into chunks
413
+
414
+ Args:
415
+ audio:
416
+ max_duration_seconds:
417
+ sample_rate:
418
+
419
+ Returns:
420
+ chunks
421
+ """
422
+ max_len = int(max_duration_seconds * sample_rate)
423
+
424
+ if len(audio) <= max_len:
425
+ return [audio]
426
+
427
+ chunks = []
428
+ for i in range(0, len(audio), max_len):
429
+ chunk = audio[i : i + max_len]
430
+ chunks.append(chunk)
431
+
432
+ return chunks
433
+
434
+
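+ # Usage sketch (hypothetical variable names): for a 70 s clip at 16 kHz,
+ #     chunks = chunk_audio(audio, max_duration_seconds=30, sample_rate=16000)
+ # returns three arrays of 480_000, 480_000 and 160_000 samples (30 s + 30 s + 10 s).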
435
+ def process_audio_batch(
436
+ audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]],
437
+ feature_extractor,
438
+ sampling_rate: int = 16000,
439
+ max_duration_seconds: int = 30,
440
+ return_attention_mask: bool = True,
441
+ ) -> Tuple[torch.Tensor, List[torch.Tensor]]:
442
+ """extract audio mel features
443
+
444
+ Args:
445
+ audios:
446
+ feature_extractor: WhisperFeatureExtractor
447
+ sampling_rate:
448
+ max_duration_seconds:
449
+ return_attention_mask:
450
+
451
+ Returns:
452
+ (audio_features, audio_feature_lens)
453
+ audio_features: [batch_size, n_mels, max_frames]
454
+ audio_feature_lens:
455
+ """
456
+ if isinstance(audios, np.ndarray):
457
+ audios_list = [[audios]]
458
+ elif len(audios) > 0 and isinstance(audios[0], np.ndarray):
459
+ audios_list = [audios]
460
+ else:
461
+ audios_list = audios
462
+
463
+ audio_features_all = []
464
+ audio_feature_lens_list = []
465
+
466
+ for batch_audios in audios_list:
467
+ batch_lens = []
468
+
469
+ for audio in batch_audios:
470
+ chunks = chunk_audio(audio, max_duration_seconds, sampling_rate)
471
+
472
+ for chunk in chunks:
473
+ audio_input = feature_extractor(
474
+ chunk,
475
+ sampling_rate=sampling_rate,
476
+ return_tensors="pt",
477
+ padding="max_length",
478
+ return_attention_mask=return_attention_mask,
479
+ )
480
+
481
+ audio_feature = audio_input["input_features"] # [1, 80, frames]
482
+
483
+ if return_attention_mask:
484
+ actual_len = audio_input["attention_mask"].sum(dim=1) # Tensor([frames])
485
+ audio_feature = audio_feature[:, :, : actual_len[0]]
486
+ batch_lens.append(actual_len[0])
487
+ else:
488
+ batch_lens.append(torch.tensor(audio_feature.shape[2]))
489
+
490
+ audio_features_all.append(audio_feature.squeeze(0)) # [80, frames]
491
+
492
+ if len(batch_lens) > 0:
493
+ audio_feature_lens_list.append(torch.hstack(batch_lens))
494
+ else:
495
+ audio_feature_lens_list.append(torch.tensor([]))
496
+
497
+ # pad to same length
498
+ if audio_features_all:
499
+ audio_features = torch.nn.utils.rnn.pad_sequence(
500
+ [feat.transpose(0, 1) for feat in audio_features_all], batch_first=True, padding_value=0.0
501
+ ).transpose(
502
+ 1, 2
503
+ ) # [batch, 80, max_frames]
504
+ else:
505
+ audio_features = torch.tensor([])
506
+
507
+ return audio_features, audio_feature_lens_list
508
+
509
+
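+ # Usage sketch (hypothetical variable names):
+ #     feats, lens = process_audio_batch(audio_list, whisper_fe, sampling_rate=16000)
+ # where `whisper_fe` is a WhisperFeatureExtractor-style extractor; `feats` comes back as
+ # [batch, n_mels, max_frames] (80 mel bins here) and `lens` holds the per-item frame counts.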
510
+ def regroup_audio_features(
511
+ audio_features: torch.Tensor, audio_feature_lens: List[torch.Tensor], regroup_seconds: int, fps: int = 100
512
+ ) -> Tuple[torch.Tensor, List[torch.Tensor]]:
513
+ """regroup audio features to fixed duration
514
+
515
+ Args:
516
+ audio_features: [batch, n_mels, frames]
517
+ audio_feature_lens: each batch's actual length
518
+ regroup_seconds: regroup duration (seconds)
519
+ fps: frames per second
520
+
521
+ Returns:
522
+ (regrouped_features, regrouped_lens)
523
+ """
524
+ # flatten to continuous frames sequence
525
+ all_lens = []
526
+ for lens in audio_feature_lens:
527
+ if isinstance(lens, torch.Tensor):
528
+ all_lens.extend(lens.tolist())
529
+ elif isinstance(lens, list):
530
+ all_lens.extend([int(x) for x in lens])
531
+
532
+ if len(all_lens) == 0:
533
+ return torch.tensor([]), []
534
+
535
+ # concatenate all valid features
536
+ flat_slices = [audio_features[i, :, :L] for i, L in enumerate(all_lens)] # [n_mels, L]
537
+
538
+ if len(flat_slices) == 1:
539
+ full_feat = flat_slices[0]
540
+ else:
541
+ full_feat = torch.cat(flat_slices, dim=1) # [n_mels, total_frames]
542
+
543
+ # split to fixed frames
544
+ frames_per_seg = int(regroup_seconds * fps)
545
+ segments = []
546
+
547
+ for start in range(0, full_feat.size(1), frames_per_seg):
548
+ seg = full_feat[:, start : start + frames_per_seg]
549
+ if seg.size(1) > 0:
550
+ segments.append(seg)
551
+
552
+ if len(segments) == 0:
553
+ return torch.tensor([]), []
554
+
555
+ # pad and convert to batch
556
+ seg_lens = [s.size(1) for s in segments]
557
+ segs_transposed = [s.transpose(0, 1) for s in segments]
558
+
559
+ padded = torch.nn.utils.rnn.pad_sequence(segs_transposed, batch_first=True, padding_value=0.0) # [N, max_T, n_mels]
560
+
561
+ padded = padded.transpose(1, 2) # [N, n_mels, max_T]
562
+ lens_tensor = torch.tensor(seg_lens, dtype=torch.int32, device=padded.device)
563
+
564
+ return padded, [lens_tensor]
565
+
566
+
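+ # Example for the regrouping above (illustrative numbers): with regroup_seconds=30 and fps=100,
+ # frames_per_seg = 3000, so 7500 valid frames become segments of 3000, 3000 and 1500 frames,
+ # returned as a padded [3, n_mels, 3000] tensor plus a one-element list holding lengths [3000, 3000, 1500].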
567
+ class MiniCPMAAudioProcessor(WhisperFeatureExtractor):
568
+ """
569
+ On top of WhisperFeatureExtractor:
570
+ - support dynamic_log_norm (original max-8dB, adjustable dynamic_range_db)
571
+ - or fixed log_floor_db (e.g. -10dB)
572
+ - this is because we need to do streaming scheme, in which we can't do dynamic setting
573
+ - this can be modified in the middle, through set_dynamic_log_norm
574
+ Two paths (torch / numpy) keep consistent clipping and scaling order:
575
+ log10 -> (dynamic/fixed lower limit clipping) -> (+4)/4
576
+ """
577
+
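+ # Worked example of the normalization described above (illustrative values): for a mel bin with
+ # log10 value -6.2 and an utterance maximum of -0.5, the dynamic threshold is -0.5 - 8.0 = -8.5,
+ # so -6.2 is kept and scaled to (-6.2 + 4) / 4 = -0.55, while a very quiet bin at -9.3 is clipped to -8.5.
+ # With the fixed floor (log_floor_db = -10.0), the -9.3 bin is kept and becomes (-9.3 + 4) / 4 = -1.325.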
578
+ def __init__(
579
+ self,
580
+ *args,
581
+ dynamic_log_norm: bool = True,
582
+ dynamic_range_db: float = 8.0,
583
+ log_floor_db: float = -10.0,
584
+ **kwargs,
585
+ ):
586
+ super().__init__(*args, **kwargs)
587
+ self.dynamic_log_norm = bool(dynamic_log_norm)
588
+ self.dynamic_range_db = float(dynamic_range_db)
589
+ self.log_floor_db = float(log_floor_db)
590
+
591
+ def set_spac_log_norm(
592
+ self,
593
+ dynamic_range_db: Optional[float] = None,
594
+ log_floor_db: Optional[float] = None,
595
+ *,
596
+ inplace: bool = True,
597
+ ) -> "MiniCPMAAudioProcessor":
598
+ """Hot update dynamic/fixed lower limit strategy.
599
+
600
+ Args:
601
+ enabled: True=use dynamic threshold (max - dynamic_range_db), False=use fixed lower limit log_floor_db.
602
+ None means keep unchanged.
603
+ dynamic_range_db: dynamic range (dB), only effective when enabled=True. None means keep unchanged.
604
+ log_floor_db: fixed log floor (dB, usually <= 0), only effective when enabled=False. None means keep unchanged.
605
+ inplace: True directly modify current instance; False return a shallow copy and modify on it.
606
+
607
+ Returns:
608
+ self or new instance (when inplace=False).
609
+ """
610
+
611
+ target = self if inplace else copy.copy(self)
612
+
613
+ if dynamic_range_db is not None:
614
+ val = float(dynamic_range_db)
615
+ if val < 0:
616
+ raise ValueError("dynamic_range_db must be >= 0.")
617
+ target.dynamic_log_norm = True # explicitly set the value to dynamic mode
618
+ target.dynamic_range_db = val
619
+
620
+ if log_floor_db is not None:
621
+ val = float(log_floor_db)
622
+ # usually log10(mel) maximum is not more than ~0dB, floor should be <= 0; here do loose validation
623
+ if val > 0:
624
+ raise ValueError("log_floor_db should be <= 0 (log10 scale).")
625
+ target.dynamic_log_norm = False # explicitly set the value to fixed lower limit mode
626
+ target.log_floor_db = val
627
+
628
+ return target
629
+
630
+ def _np_extract_fbank_features(self, waveform_batch: np.ndarray, device: str) -> np.ndarray:
631
+ """NumPy version consistent with upstream, but replace max-8dB with configurable dynamic/fixed lower limit clipping."""
632
+ if device != "cpu":
633
+ raise ValueError(
634
+ f"Got device `{device}` for feature extraction, but feature extraction on CUDA accelerator "
635
+ "devices requires torch. Set device='cpu' or install torch."
636
+ )
637
+
638
+ log_spec_batch: List[np.ndarray] = []
639
+ for waveform in waveform_batch:
640
+ # generate log10 Mel
641
+ log_spec = spectrogram(
642
+ waveform,
643
+ window_function(self.n_fft, "hann"),
644
+ frame_length=self.n_fft,
645
+ hop_length=self.hop_length,
646
+ power=2.0,
647
+ dither=self.dither,
648
+ mel_filters=self.mel_filters,
649
+ log_mel="log10",
650
+ )
651
+ # consistent with upstream: remove the last frame
652
+ log_spec = log_spec[:, :-1]
653
+
654
+ # dynamic/fixed clipping
655
+ if self.dynamic_log_norm:
656
+ threshold = log_spec.max() - self.dynamic_range_db
657
+ log_spec = np.maximum(log_spec, threshold)
658
+ else:
659
+ log_spec = np.maximum(log_spec, self.log_floor_db)
660
+
661
+ # consistent with Whisper linear scaling
662
+ log_spec = (log_spec + 4.0) / 4.0
663
+
664
+ log_spec_batch.append(log_spec)
665
+
666
+ return np.array(log_spec_batch)
667
+
668
+ def _torch_extract_fbank_features(self, waveform: np.ndarray, device: str = "cpu") -> np.ndarray:
669
+ if torch is None:
670
+ raise RuntimeError("PyTorch is not installed, cannot compute STFT on GPU.")
671
+
672
+ waveform = torch.from_numpy(waveform).to(device, torch.float32)
673
+ window = torch.hann_window(self.n_fft, device=device)
674
+
675
+ if self.dither != 0.0:
676
+ waveform = waveform + self.dither * torch.randn_like(waveform)
677
+
678
+ stft = torch.stft(waveform, n_fft=self.n_fft, hop_length=self.hop_length, window=window, return_complex=True)
679
+ magnitudes = stft[..., :-1].abs() ** 2
680
+
681
+ mel_filters = torch.from_numpy(self.mel_filters).to(device, torch.float32) # [n_mels, 1+n_fft//2]
682
+ mel_spec = mel_filters.T @ magnitudes # [..., n_mels, T]
683
+
684
+ log_spec = torch.clamp(mel_spec, min=1e-10).log10() # <= 0
685
+
686
+ if self.dynamic_log_norm:
687
+ if waveform.dim() == 2:
688
+ max_val_t = log_spec.max(dim=2, keepdim=True)[0] # over T
689
+ max_val_bt = max_val_t.max(dim=1, keepdim=True)[0] # over mel
690
+ threshold = max_val_bt - self.dynamic_range_db
691
+ log_spec = torch.maximum(log_spec, threshold)
692
+ else:
693
+ threshold = log_spec.max() - self.dynamic_range_db
694
+ log_spec = torch.maximum(log_spec, threshold)
695
+ else:
696
+ floor_tensor = torch.tensor(self.log_floor_db, dtype=log_spec.dtype, device=log_spec.device)
697
+ log_spec = torch.maximum(log_spec, floor_tensor)
698
+
699
+ log_spec = (log_spec + 4.0) / 4.0
700
+
701
+ if device != "cpu":
702
+ log_spec = log_spec.detach().cpu()
703
+ return log_spec.numpy()
704
+
705
+ def process(self, *args, **kwargs):
706
+ """Alias of __call__ for convenience."""
707
+ return self.__call__(*args, **kwargs)
708
+
709
+
710
+ class StreamingMelProcessorExact:
711
+ """Strictly offline equivalent streaming Mel processor.
712
+
713
+ - accumulate all historical audio into buffer; use the same feature_extractor to calculate the entire mel after each addition.
714
+ - only output "stable" frames: the frame center does not depend on future (right) context, i.e. center + n_fft//2 <= current buffer length.
715
+ - output the last batch of frames at the end (flush), ensuring complete consistency with offline full-calculation.
716
+
717
+ Cost: Each call performs feature extraction on the accumulated buffer (can be optimized to incremental if needed).
718
+ """
719
+
720
+ def __init__(
721
+ self,
722
+ feature_extractor: MiniCPMAAudioProcessor,
723
+ chunk_ms: int = 100,
724
+ first_chunk_ms: Optional[int] = None,
725
+ sample_rate: int = 16000,
726
+ n_fft: int = 400,
727
+ hop_length: int = 160,
728
+ n_mels: int = 80,
729
+ cnn_redundancy_ms: int = 10, # (given in ms, usually 10ms=1 frame)
730
+ # sliding window parameters
731
+ enable_sliding_window: bool = False, # whether to enable sliding window
732
+ slide_trigger_seconds: float = 30.0, # trigger threshold for sliding window in seconds
733
+ slide_stride_seconds: float = 10.0, # stride for sliding window in seconds
734
+ ):
735
+ self.feature_extractor = feature_extractor
736
+ self.chunk_ms = chunk_ms
737
+ self.first_chunk_ms = first_chunk_ms if first_chunk_ms is not None else chunk_ms
738
+ self.sample_rate = sample_rate
739
+ self.n_fft = n_fft
740
+ self.hop_length = hop_length
741
+ self.n_mels = n_mels
742
+
743
+ self.chunk_samples = int(round(chunk_ms * sample_rate / 1000))
744
+ self.chunk_frames = self.chunk_samples // hop_length
745
+ # align to hop_length to avoid frame boundary issues
746
+ hop = self.hop_length
747
+ raw_first_samples = int(round(self.first_chunk_ms * sample_rate / 1000))
748
+ aligned_first = max(hop, (raw_first_samples // hop) * hop)
749
+ self.first_chunk_samples = aligned_first
750
+ self.half_window = n_fft // 2 # required right context
751
+
752
+ # redundancy frames (in frames), <=1 frame: 10ms → 1 frame
753
+ self.cnn_redundancy_ms = cnn_redundancy_ms
754
+ self.cnn_redundancy_samples = int(cnn_redundancy_ms * sample_rate / 1000)
755
+ self.cnn_redundancy_frames = max(0, self.cnn_redundancy_samples // hop_length)
756
+
757
+ # sliding window configuration (Trigger mode)
758
+ self.enable_sliding_window = enable_sliding_window
759
+ self.trigger_seconds = slide_trigger_seconds
760
+ self.slide_seconds = slide_stride_seconds
761
+
762
+ # shift/base (global frame coordinates)
763
+ self.left_samples_dropped = 0 # samples dropped from the left
764
+ self.base_T = 0 # index of the "global frame" corresponding to mel_full[:, :, 0]
765
+
766
+ self.reset()
767
+
768
+ def reset(self):
769
+ self.buffer = np.zeros(0, dtype=np.float32)
770
+ self.last_emitted_T = 0
771
+ self.total_samples_processed = 0
772
+ self.chunk_count = 0
773
+ self.is_first = True
774
+ self.left_samples_dropped = 0
775
+ self.base_T = 0
776
+
777
+ def get_chunk_size(self) -> int:
778
+ return self.first_chunk_samples if self.is_first else self.chunk_samples
779
+
780
+ def get_expected_output_frames(self) -> int:
781
+ raise NotImplementedError("get_expected_output_frames is not implemented")
782
+
783
+ def _extract_full(self) -> torch.Tensor:
784
+ # when buffer length is less than n_fft, Whisper's internal STFT will raise an error in center=True and pad mode
785
+ # (pad is greater than input length). At that point there is no stable frame to output yet.
786
+ if len(self.buffer) < self.n_fft:
787
+ raise ValueError(f"buffer length is shorter than n_fft {len(self.buffer)} < {self.n_fft}")
788
+ # if buffer length is less than 5s, use set_spac_log_norm(log_floor_db=-10) or the last cached result
789
+ if len(self.buffer) < 5 * self.sample_rate:
790
+ # TODO: ideally this threshold should be chosen via experiments; the current value is based on experience (see MiniCPMAAudioProcessor).
791
+ self.feature_extractor.set_spac_log_norm(log_floor_db=-10)
792
+ # if buffer length is greater than 5s, use set_spac_log_norm(dynamic_range_db=8)
793
+ else:
794
+ self.feature_extractor.set_spac_log_norm(dynamic_range_db=8)
795
+ feats = self.feature_extractor(
796
+ self.buffer,
797
+ sampling_rate=self.sample_rate,
798
+ return_tensors="pt",
799
+ padding=False,
800
+ )
801
+ return feats.input_features # [1, 80, T]
802
+
803
+ def _stable_frames_count(self) -> int:
804
+ # number of stable frames = floor((len(buffer) - half_window) / hop) + 1, minimum is 0
805
+ L = int(self.buffer.shape[0])
806
+ if L <= 0:
807
+ return 0
808
+ if L < self.half_window:
809
+ return 0
810
+ return max(0, (L - self.half_window) // self.hop_length + 1)
811
+
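+ # Example: with n_fft=400 (half_window=200) and hop_length=160, a 1600-sample buffer yields
+ # (1600 - 200) // 160 + 1 = 9 stable frames; a buffer shorter than 200 samples yields 0.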
812
+ def _maybe_slide_buffer(self):
813
+ """Trigger mode sliding window: when the buffer reaches the trigger threshold, slide a fixed length window."""
814
+ if not self.enable_sliding_window:
815
+ return
816
+
817
+ sr = self.sample_rate
818
+ hop = self.hop_length
819
+ L = len(self.buffer)
820
+
821
+ # convert seconds to samples
822
+ trigger_samples = int(self.trigger_seconds * sr)
823
+ stride_samples = int(self.slide_seconds * sr)
824
+
825
+ # check if the trigger threshold is reached
826
+ if L < trigger_samples:
827
+ return
828
+
829
+ # calculate the number of samples to drop (fixed sliding stride_samples)
830
+ drop = stride_samples
831
+
832
+ # cannot drop the left context that is still needed for subsequent emission
833
+ # in trigger mode, we only need to protect the minimum necessary data
834
+ # i.e. ensure that we do not discard frames that may be needed in the future
835
+ last_emitted_local = self.last_emitted_T - self.base_T
836
+
837
+ # only protect necessary context (e.g. the most recent 1 second data)
838
+ min_keep_seconds = 1.0 # keep at least 1 second of data to ensure continuity
839
+ min_keep_samples = int(min_keep_seconds * sr)
840
+
841
+ # guard_samples are the minimum samples we must keep
842
+ guard_samples = min(min_keep_samples, L - drop)
843
+
844
+ # limit: do not exceed the safe boundary; and align hop
845
+ max_allowed_drop = max(0, L - guard_samples)
846
+ drop = min(drop, max_allowed_drop)
847
+ drop = (drop // hop) * hop
848
+
849
+ if drop <= 0:
850
+ return
851
+
852
+ # truly drop & update base
853
+ self.buffer = self.buffer[drop:]
854
+ self.left_samples_dropped += drop
855
+ self.base_T += drop // hop
856
+
857
+ def process(self, audio_chunk: np.ndarray, is_last_chunk: bool = False) -> Tuple[torch.Tensor, Dict]:
858
+ self.chunk_count += 1
859
+ # append to buffer
860
+ if len(self.buffer) == 0:
861
+ self.buffer = audio_chunk.astype(np.float32, copy=True)
862
+ else:
863
+ self.buffer = np.concatenate([self.buffer, audio_chunk.astype(np.float32, copy=True)])
864
+
865
+ # sliding window processing
866
+ self._maybe_slide_buffer()
867
+
868
+ # full extraction (for the current window)
869
+ mel_full = self._extract_full()
870
+ T_full = mel_full.shape[-1] # local frames in the current window
871
+ stable_T = min(T_full, self._stable_frames_count()) # local stable frames
872
+ stable_T_global = self.base_T + stable_T # map to global frame coordinates
873
+
874
+ # plan the core frames for the current emission (global coordinates)
875
+ core_start_g = self.last_emitted_T
876
+ core_end_g = core_start_g + self.chunk_frames
877
+ required_stable_g = core_end_g + self.cnn_redundancy_frames
878
+
879
+ if stable_T_global >= required_stable_g or is_last_chunk:
880
+ emit_start_g = max(0, core_start_g - self.cnn_redundancy_frames)
881
+ emit_end_g = core_end_g + self.cnn_redundancy_frames
882
+
883
+ # global -> local index
884
+ emit_start = max(0, emit_start_g - self.base_T)
885
+ emit_end = emit_end_g - self.base_T
886
+ emit_start = max(0, min(emit_start, T_full))
887
+ emit_end = max(emit_start, min(emit_end, T_full))
888
+
889
+ mel_output = mel_full[:, :, emit_start:emit_end]
890
+ self.last_emitted_T = core_end_g # only advance the core frame pointer (global)
891
+ else:
892
+ mel_output = mel_full[:, :, 0:0]
893
+
894
+ self.total_samples_processed += len(audio_chunk)
895
+ self.is_first = False
896
+
897
+ info = {
898
+ "type": "exact_chunk",
899
+ "chunk_number": self.chunk_count,
900
+ "emitted_frames": mel_output.shape[-1],
901
+ "stable_T": stable_T,
902
+ "T_full": T_full,
903
+ "base_T": self.base_T,
904
+ "stable_T_global": stable_T_global,
905
+ "buffer_len_samples": int(self.buffer.shape[0]),
906
+ "left_samples_dropped": self.left_samples_dropped,
907
+ "core_start": core_start_g, # if keep the original field name, use the global value here
908
+ "core_end": core_end_g, # same as above
909
+ }
910
+ return mel_output, info
911
+
912
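# Illustrative usage sketch (not part of this file): feeding a waveform to the exact
# streaming mel processor in 100 ms chunks and collecting the emitted slices. `proc` is
# assumed to be a StreamingMelProcessorExact configured as in
# MiniCPMOProcessor._init_streaming_processor below; `wav` is a float32 16 kHz array.
import numpy as np
import torch

def stream_mels(proc, wav: np.ndarray, chunk_samples: int = 1600) -> torch.Tensor:
    pieces = []
    for start in range(0, len(wav), chunk_samples):
        chunk = wav[start:start + chunk_samples]
        last = start + chunk_samples >= len(wav)
        mel, _info = proc.process(chunk, is_last_chunk=last)
        if mel.shape[-1] > 0:
            pieces.append(mel)
    tail = proc.flush()                   # emit any frames still held back
    if tail.shape[-1] > 0:
        pieces.append(tail)
    return torch.cat(pieces, dim=-1)      # [1, 80, total_frames]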
+ def flush(self) -> torch.Tensor:
913
+ """Called when the stream ends, output the remaining unemitted frames, ensuring consistency with offline (calculated by global coordinates)."""
914
+ if len(self.buffer) == 0:
915
+ return torch.zeros(1, 80, 0)
916
+
917
+ mel_full = self._extract_full()
918
+ T_local = mel_full.shape[-1]
919
+ T_global = self.base_T + T_local
920
+
921
+ if self.last_emitted_T < T_global:
922
+ start_l = max(0, self.last_emitted_T - self.base_T)
923
+ tail = mel_full[:, :, start_l:]
924
+ self.last_emitted_T = T_global
925
+ return tail
926
+ return mel_full[:, :, 0:0]
927
+
928
+ def get_config(self) -> Dict:
929
+ return {
930
+ "chunk_ms": self.chunk_ms,
931
+ "first_chunk_ms": self.first_chunk_ms,
932
+ "effective_first_chunk_ms": self.first_chunk_samples / self.sample_rate * 1000.0,
933
+ "sample_rate": self.sample_rate,
934
+ "n_fft": self.n_fft,
935
+ "hop_length": self.hop_length,
936
+ "cnn_redundancy_ms": self.cnn_redundancy_ms,
937
+ "cnn_redundancy_frames": self.cnn_redundancy_frames,
938
+ "enable_sliding_window": self.enable_sliding_window,
939
+ "trigger_seconds": self.trigger_seconds,
940
+ "slide_seconds": self.slide_seconds,
941
+ }
942
+
943
+ def get_state(self) -> Dict:
944
+ return {
945
+ "chunk_count": self.chunk_count,
946
+ "last_emitted_T": self.last_emitted_T,
947
+ "total_samples_processed": self.total_samples_processed,
948
+ "buffer_len": int(self.buffer.shape[0]),
949
+ "base_T": self.base_T,
950
+ "left_samples_dropped": self.left_samples_dropped,
951
+ }
952
+
953
+ def get_snapshot(self) -> Dict:
954
+ """Get a complete state snapshot (including buffer), used for recovery from a fast start.
955
+
956
+ Returns:
957
+ A dictionary containing the complete state; pass it to restore_snapshot to restore.
958
+ """
959
+ buffer_copy = self.buffer.copy()
960
+ snapshot = {
961
+ "chunk_count": self.chunk_count,
962
+ "last_emitted_T": self.last_emitted_T,
963
+ "total_samples_processed": self.total_samples_processed,
964
+ "buffer": buffer_copy,
965
+ "base_T": self.base_T,
966
+ "left_samples_dropped": self.left_samples_dropped,
967
+ "is_first": self.is_first,
968
+ # save the feature_extractor state (crucial for keeping mel feature extraction deterministic)
969
+ "fe_dynamic_log_norm": getattr(self.feature_extractor, "dynamic_log_norm", None),
970
+ "fe_dynamic_range_db": getattr(self.feature_extractor, "dynamic_range_db", None),
971
+ "fe_log_floor_db": getattr(self.feature_extractor, "log_floor_db", None),
972
+ }
973
+
974
+ return snapshot
975
+
976
+ def restore_snapshot(self, snapshot: Dict) -> None:
977
+ """Restore state from a snapshot
978
+
979
+ Args:
980
+ snapshot: the snapshot dictionary returned by get_snapshot
981
+ """
982
+ # record the state before restoration
983
+ prev_state = {
984
+ "chunk_count": self.chunk_count,
985
+ "last_emitted_T": self.last_emitted_T,
986
+ "buffer_len": len(self.buffer),
987
+ }
988
+
989
+ # restore state
990
+ self.chunk_count = snapshot["chunk_count"]
991
+ self.last_emitted_T = snapshot["last_emitted_T"]
992
+ self.total_samples_processed = snapshot["total_samples_processed"]
993
+ self.buffer = snapshot["buffer"].copy() # copy buffer
994
+ self.base_T = snapshot["base_T"]
995
+ self.left_samples_dropped = snapshot["left_samples_dropped"]
996
+ self.is_first = snapshot["is_first"]
997
+
998
+ # restore the feature_extractor state (crucial for keeping mel feature extraction deterministic)
999
+ if snapshot.get("fe_dynamic_log_norm") is not None:
1000
+ self.feature_extractor.dynamic_log_norm = snapshot["fe_dynamic_log_norm"]
1001
+ if snapshot.get("fe_dynamic_range_db") is not None:
1002
+ self.feature_extractor.dynamic_range_db = snapshot["fe_dynamic_range_db"]
1003
+ if snapshot.get("fe_log_floor_db") is not None:
1004
+ self.feature_extractor.log_floor_db = snapshot["fe_log_floor_db"]
1005
+
1006
+
1007
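# Illustrative sketch (not part of this file): the speculative-rollback pattern built on
# get_snapshot / restore_snapshot above. The names `mel_proc` and `run_generation` are
# hypothetical; only the snapshot round-trip mirrors the methods defined above.
snap = mel_proc.get_snapshot()            # full copy of buffer + feature-extractor state
speculative_ok = run_generation()         # e.g. start answering at a tentative VAD pause
if not speculative_ok:                    # the user kept talking: undo the speculation
    mel_proc.restore_snapshot(snap)       # buffer, base_T and extractor state restored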
+ class MiniCPMOProcessor(ProcessorMixin):
1008
+ attributes = ["image_processor", "audio_processor", "tokenizer"]
1009
+ audio_processor_class = "AutoFeatureExtractor"
1010
+ image_processor_class = "AutoImageProcessor"
1011
+ tokenizer_class = "AutoTokenizer"
1012
+
1013
+ def __init__(self, image_processor=None, audio_processor=None, tokenizer=None, **kwargs):
1014
+ super().__init__(image_processor, audio_processor, tokenizer)
1015
+
1016
+ self.version = image_processor.version if image_processor else None
1017
+ # audio feature pooling step, needs to be consistent with config.audio_pool_step
1018
+ self.pool_step = kwargs.get("audio_pool_step", 5)
1019
+
1020
+ # initialize the streaming audio processor
1021
+ self._streaming_mel_processor = None
1022
+ if audio_processor is not None:
1023
+ self._init_streaming_processor()
1024
+
1025
+ def get_audio_placeholder(
1026
+ self,
1027
+ audio_lens: int,
1028
+ chunk_input: bool = True,
1029
+ chunk_length: int = 1,
1030
+ ) -> str:
1031
+ """
1032
+ Public method to get audio placeholder string for vLLM integration.
1033
+
1034
+ Args:
1035
+ audio_lens: Length of audio in samples
1036
+ chunk_input: Whether to use chunked processing
1037
+ chunk_length: Chunk length in seconds
1038
+
1039
+ Returns:
1040
+ Audio placeholder string
1041
+ """
1042
+ pool_step = self.pool_step
1043
+ feature_lens = math.ceil(audio_lens / self.audio_processor.hop_length)
1044
+
1045
+ feature_lens = (feature_lens - 1) // 2 + 1
1046
+ output_lens = (feature_lens - pool_step) // pool_step + 1
1047
+
1048
+ if chunk_input:
1049
+ fbank_feat_in_chunk = int(chunk_length * 100)
1050
+ cnn_feat_in_chunk = (fbank_feat_in_chunk - 1) // 2 + 1
1051
+ audio_embeds_in_chunk = (cnn_feat_in_chunk - pool_step) // pool_step + 1
1052
+ num_audio_chunks = (output_lens + audio_embeds_in_chunk - 1) // audio_embeds_in_chunk
1053
+
1054
+ place_holders = ""
1055
+ total_unk_len = 0
1056
+ for _ in range(num_audio_chunks):
1057
+ unk_len = min(audio_embeds_in_chunk, output_lens - total_unk_len)
1058
+ place_holders += self.tokenizer.audio_start + "<unk>" * unk_len + self.tokenizer.audio_end
1059
+ total_unk_len += unk_len
1060
+ audio_placeholder = place_holders
1061
+ else:
1062
+ audio_placeholder = self.tokenizer.audio_start + "<unk>" * output_lens + self.tokenizer.audio_end
1063
+
1064
+ return audio_placeholder
1065
+
1066
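# Worked example (illustrative, not part of this file) of the placeholder length
# arithmetic above, with hop_length=160 and pool_step=5: one second of 16 kHz audio gives
#   feature_lens = ceil(16000 / 160)   = 100   # mel frames
#   after CNN:   (100 - 1) // 2 + 1    = 50    # 2x temporal downsample
#   output_lens: (50 - 5) // 5 + 1     = 10    # pooled audio embeddings
# so each second of audio maps to 10 <unk> placeholder tokens between
# tokenizer.audio_start and tokenizer.audio_end.
import math
assert math.ceil(16000 / 160) == 100
assert (100 - 1) // 2 + 1 == 50
assert (50 - 5) // 5 + 1 == 10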
+ def _init_streaming_processor(
1067
+ self,
1068
+ chunk_ms: int = 100,
1069
+ cnn_redundancy_ms: int = 0,
1070
+ *,
1071
+ mode: str = "exact",
1072
+ first_chunk_ms: Optional[int] = None,
1073
+ enable_sliding_window: bool = False,
1074
+ slide_trigger_seconds: float = 30.0,
1075
+ slide_stride_seconds: float = 10.0,
1076
+ ):
1077
+ """Initialize the streaming processor
1078
+
1079
+ Args:
1080
+ chunk_ms: Chunk size in milliseconds, also the sliding step.
1081
+ cnn_redundancy_ms: CNN boundary redundancy in milliseconds (before and after), 0 means standard mode.
1082
+ mode: streaming processing mode, currently only supports "exact"
1083
+ first_chunk_ms: the size of the first chunk (milliseconds), if not specified, it is the same as chunk_ms
1084
+ enable_sliding_window: whether to enable sliding window (trigger mode)
1085
+ slide_trigger_seconds: trigger threshold for sliding window in seconds
1086
+ slide_stride_seconds: stride for sliding window in seconds
1087
+ """
1088
+ if mode == "exact":
1089
+ self._streaming_mel_processor = StreamingMelProcessorExact(
1090
+ feature_extractor=self.audio_processor,
1091
+ chunk_ms=chunk_ms,
1092
+ first_chunk_ms=first_chunk_ms,
1093
+ sample_rate=16000,
1094
+ cnn_redundancy_ms=cnn_redundancy_ms,
1095
+ enable_sliding_window=enable_sliding_window,
1096
+ slide_trigger_seconds=slide_trigger_seconds,
1097
+ slide_stride_seconds=slide_stride_seconds,
1098
+ )
1099
+ else:
1100
+ raise ValueError(f"Unsupported mode: {mode}, only 'exact' is supported")
1101
+ self._streaming_mode = "exact"  # only "exact" is supported; any other mode raised above
1102
+
1103
+ def set_streaming_mode(
1104
+ self,
1105
+ mode: str = "exact",
1106
+ chunk_ms: int = 100,
1107
+ cnn_redundancy_ms: int = 0,
1108
+ *,
1109
+ first_chunk_ms: Optional[int] = None,
1110
+ enable_sliding_window: bool = False,
1111
+ slide_trigger_seconds: float = 30.0,
1112
+ slide_stride_seconds: float = 10.0,
1113
+ ):
1114
+ """Set streaming processing mode
1115
+
1116
+ Args:
1117
+ mode: streaming processing mode, currently only supports "exact"
1118
+ chunk_ms: chunk size in milliseconds, also the sliding step.
1119
+ cnn_redundancy_ms: CNN boundary redundancy in milliseconds (before and after), 0 means standard mode.
1120
+ first_chunk_ms: the size of the first chunk (milliseconds), if not specified, it is the same as chunk_ms
1121
+ enable_sliding_window: whether to enable sliding window (trigger mode)
1122
+ slide_trigger_seconds: trigger threshold for sliding window in seconds
1123
+ slide_stride_seconds: stride for sliding window in seconds
1124
+ """
1125
+ if self.audio_processor is None:
1126
+ raise ValueError("audio_processor is not set, cannot initialize the streaming processor")
1127
+ self._init_streaming_processor(
1128
+ chunk_ms=chunk_ms,
1129
+ cnn_redundancy_ms=cnn_redundancy_ms,
1130
+ mode=mode,
1131
+ first_chunk_ms=first_chunk_ms,
1132
+ enable_sliding_window=enable_sliding_window,
1133
+ slide_trigger_seconds=slide_trigger_seconds,
1134
+ slide_stride_seconds=slide_stride_seconds,
1135
+ )
1136
+
1137
+ def process_image(
1138
+ self,
1139
+ images: Optional[ImageInput] = None,
1140
+ do_pad: bool = True,
1141
+ max_slice_nums: int = 1,
1142
+ return_tensors: str = "pt",
1143
+ ) -> MiniCPMOBatchFeature:
1144
+ """Process image data
1145
+
1146
+ Args:
1147
+ images: input images
1148
+ do_pad: whether to pad
1149
+ max_slice_nums: maximum number of slices
1150
+ return_tensors: return tensor type
1151
+ Returns:
1152
+ MiniCPMOBatchFeature object
1153
+ """
1154
+ if images is None:
1155
+ return MiniCPMOBatchFeature(data={"pixel_values": [[]], "image_sizes": [[]], "tgt_sizes": [[]]})
1156
+
1157
+ result = self.image_processor(
1158
+ images, do_pad=do_pad, max_slice_nums=max_slice_nums, return_tensors=return_tensors
1159
+ )
1160
+
1161
+ model_inputs = {
1162
+ "pixel_values": result.get("pixel_values", [[]]),
1163
+ "image_sizes": result.get("image_sizes", [[]]),
1164
+ "tgt_sizes": result.get("tgt_sizes", [[]]),
1165
+ }
1166
+
1167
+ return MiniCPMOBatchFeature(data=model_inputs)
1168
+
1169
+ def process_audio(
1170
+ self,
1171
+ audios: Optional[Union[np.ndarray, List[np.ndarray]]] = None,
1172
+ sampling_rate: int = 16000,
1173
+ regroup_to_seconds: Optional[int] = None,
1174
+ fps: int = 100,
1175
+ ) -> MiniCPMOBatchFeature:
1176
+ """Process audio data in batch
1177
+
1178
+ Args:
1179
+ audios: audio data
1180
+ sampling_rate: sampling rate
1181
+ regroup_to_seconds: regroup duration in seconds
1182
+ fps: frames per second
1183
+ Returns:
1184
+ MiniCPMOBatchFeature object
1185
+ """
1186
+ if audios is None:
1187
+ return MiniCPMOBatchFeature(data={"audio_features": [], "audio_feature_lens": []})
1188
+
1189
+ audio_features, audio_feature_lens = process_audio_batch(
1190
+ audios=audios,
1191
+ feature_extractor=self.audio_processor,
1192
+ sampling_rate=sampling_rate,
1193
+ max_duration_seconds=30,
1194
+ return_attention_mask=True,
1195
+ )
1196
+
1197
+ if regroup_to_seconds is not None and len(audio_features) > 0:
1198
+ audio_features, audio_feature_lens = regroup_audio_features(
1199
+ audio_features=audio_features,
1200
+ audio_feature_lens=audio_feature_lens,
1201
+ regroup_seconds=regroup_to_seconds,
1202
+ fps=fps,
1203
+ )
1204
+
1205
+ model_inputs = {"audio_features": audio_features, "audio_feature_lens": audio_feature_lens}
1206
+
1207
+ return MiniCPMOBatchFeature(data=model_inputs)
1208
+
1209
+ def process_audio_streaming(
1210
+ self,
1211
+ audio_chunk: np.ndarray,
1212
+ reset: bool = False,
1213
+ return_batch_feature: bool = False,
1214
+ is_last_chunk: bool = False,
1215
+ ) -> Union[Tuple[torch.Tensor, dict], MiniCPMOBatchFeature]:
1216
+ """Process audio chunk in streaming
1217
+
1218
+ Args:
1219
+ audio_chunk: audio data chunk of arbitrary length (e.g. a 125 ms first chunk followed by 100 ms chunks)
1220
+ reset: whether to reset the processor state
1221
+ return_batch_feature: whether to return MiniCPMOBatchFeature format (consistent with process_audio)
+ is_last_chunk: whether this is the final chunk of the stream (forces emission of the remaining frames)
1222
+ Returns:
1223
+ If return_batch_feature=False:
1224
+ (audio_features, info)
1225
+ - audio_features: [1, 80, n_frames] mel features
1226
+ - info: processing information dictionary
1227
+ If return_batch_feature=True:
1228
+ MiniCPMOBatchFeature object, containing:
1229
+ - audio_features: [1, 80, n_frames] mel features
1230
+ - audio_feature_lens: [tensor([n_frames])]
1231
+ - info: processing information (as an extra attribute)
1232
+ """
1233
+ if self._streaming_mel_processor is None:
1234
+ raise ValueError("Streaming processor not initialized, please ensure audio_processor is set")
1235
+
1236
+ if reset:
1237
+ self._streaming_mel_processor.reset()
1238
+
1239
+ # process chunk
1240
+ mel_features, info = self._streaming_mel_processor.process(audio_chunk, is_last_chunk=is_last_chunk)
1241
+
1242
+ # determine the return format based on the parameters
1243
+ if return_batch_feature:
1244
+ # return the format consistent with process_audio
1245
+ # note: info returns emitted_frames, which represents the actual output frames
1246
+ n_frames = info.get("emitted_frames", mel_features.shape[-1])
1247
+ model_inputs = {
1248
+ "audio_features": mel_features,
1249
+ "audio_feature_lens": [torch.tensor([n_frames])],
1250
+ "streaming_info": info, # add streaming processing information
1251
+ }
1252
+ return MiniCPMOBatchFeature(data=model_inputs)
1253
+ else:
1254
+ return mel_features, info
1255
+
1256
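# Illustrative usage sketch (not part of this file): the processor-level streaming path.
# `processor` is assumed to be a loaded MiniCPMOProcessor and `chunks` a list of float32
# 16 kHz numpy arrays (e.g. 100 ms each).
processor.set_streaming_mode(mode="exact", chunk_ms=100)
processor.reset_streaming()
for i, chunk in enumerate(chunks):
    batch = processor.process_audio_streaming(
        chunk,
        return_batch_feature=True,
        is_last_chunk=(i == len(chunks) - 1),
    )
    mel = batch["audio_features"]          # [1, 80, emitted_frames] (may be empty)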
+ def reset_streaming(self):
1257
+ if self._streaming_mel_processor is not None:
1258
+ self._streaming_mel_processor.reset()
1259
+
1260
+ def get_streaming_chunk_size(self) -> int:
1261
+ if self._streaming_mel_processor is None:
1262
+ raise ValueError("Streaming processor not initialized")
1263
+ return self._streaming_mel_processor.get_chunk_size()
1264
+
1265
+ def configure_streaming(
1266
+ self,
1267
+ chunk_ms: int = 100,
1268
+ enable_sliding_window: bool = False,
1269
+ slide_trigger_seconds: float = 30.0,
1270
+ slide_stride_seconds: float = 10.0,
1271
+ ):
1272
+ """Configure streaming processor parameters
1273
+
1274
+ Args:
1275
+ chunk_ms: chunk size in milliseconds
1276
+ enable_sliding_window: whether to enable sliding window (trigger mode)
1277
+ slide_trigger_seconds: trigger threshold for sliding window in seconds
1278
+ slide_stride_seconds: stride for sliding window in seconds
1279
+ """
1280
+ if self.audio_processor is None:
1281
+ raise ValueError("audio_processor is not set")
1282
+
1283
+ self._init_streaming_processor(
1284
+ chunk_ms=chunk_ms,
1285
+ enable_sliding_window=enable_sliding_window,
1286
+ slide_trigger_seconds=slide_trigger_seconds,
1287
+ slide_stride_seconds=slide_stride_seconds,
1288
+ )
1289
+
1290
+ def get_streaming_config(self) -> dict:
1291
+ if self._streaming_mel_processor is None:
1292
+ return {}
1293
+ return self._streaming_mel_processor.get_config()
1294
+
1295
+ def get_streaming_state(self) -> dict:
1296
+ if self._streaming_mel_processor is None:
1297
+ return {}
1298
+ return self._streaming_mel_processor.get_state()
1299
+
1300
+ def get_streaming_snapshot(self) -> dict:
1301
+ if self._streaming_mel_processor is None:
1302
+ return {}
1303
+ return self._streaming_mel_processor.get_snapshot()
1304
+
1305
+ def restore_streaming_snapshot(self, snapshot: dict) -> None:
1306
+ if self._streaming_mel_processor is None:
1307
+ return
1308
+ if not snapshot:
1309
+ return
1310
+ self._streaming_mel_processor.restore_snapshot(snapshot)
1311
+
1312
+ def __call__(
1313
+ self,
1314
+ text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
1315
+ images: ImageInput = None,
1316
+ audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]]] = None,
1317
+ audio_parts: Optional[list] = None,
1318
+ max_length: Optional[int] = None,
1319
+ do_pad: Optional[bool] = True,
1320
+ max_slice_nums: int = None,
1321
+ use_image_id: bool = True,
1322
+ stream_input: bool = False,
1323
+ return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
1324
+ sampling_rate: Optional[int] = 16000,
1325
+ online_streaming: bool = False,
1326
+ audio_chunk_idx: int = 0,
1327
+ is_last_chunk: bool = False,
1328
+ **kwargs,
1329
+ ) -> MiniCPMOBatchFeature:
1330
+ if images is not None:
1331
+ image_inputs = self.process_image(
1332
+ images=images, do_pad=do_pad, max_slice_nums=max_slice_nums, return_tensors=return_tensors
1333
+ )
1334
+ else:
1335
+ image_inputs = None
1336
+
1337
+ audio_features, audio_feature_lens, audio_phs = self.audio_feature_extract(
1338
+ audios,
1339
+ audio_parts,
1340
+ stream_input,
1341
+ sampling_rate,
1342
+ online_streaming=online_streaming,
1343
+ is_last_chunk=is_last_chunk,
1344
+ )
1345
+
1346
+ model_inputs = self._convert_omni_to_inputs(
1347
+ image_inputs,
1348
+ audio_phs,
1349
+ text,
1350
+ max_slice_nums=max_slice_nums,
1351
+ use_image_id=use_image_id,
1352
+ max_length=max_length,
1353
+ **kwargs,
1354
+ )
1355
+
1356
+ model_inputs["audio_features"] = audio_features
1357
+ model_inputs["audio_feature_lens"] = audio_feature_lens
1358
+
1359
+ result = MiniCPMOBatchFeature(data={**model_inputs})
1360
+
1361
+ if online_streaming:
1362
+ result.use_extra_context = True
1363
+ result.prefix_extra_frames = 0 if audio_chunk_idx == 0 else 2
1364
+ result.suffix_extra_frames = 2
1365
+ result.chunk_idx = audio_chunk_idx
1366
+
1367
+ return result
1368
+
1369
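# Illustrative usage sketch (not part of this file): a single offline call mixing text and
# one audio clip. The "<audio>./</audio>" marker is the pattern expanded by
# _convert_omni_to_inputs below; `processor` and `wav` (float32, 16 kHz) are assumed, and
# the chat markup simply follows the tokenizer's bos/eos tokens from special_tokens_map.json.
inputs = processor(
    text="<|im_start|>user\n<audio>./</audio>\nTranscribe this clip.<|im_end|>\n"
         "<|im_start|>assistant\n",
    audios=wav,
    sampling_rate=16000,
)
# inputs["input_ids"], inputs["audio_features"], inputs["audio_bounds"], ... can now be
# passed to the model.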
+ def audio_feature_extract(
1370
+ self,
1371
+ audios: Union[np.ndarray, List[np.ndarray], List[List[np.ndarray]], None] = None,
1372
+ audio_parts: Optional[list] = None,
1373
+ stream_input: Optional[bool] = False,
1374
+ sampling_rate: Optional[int] = None,
1375
+ chunk_length: Optional[int] = 1,
1376
+ online_streaming: bool = False,
1377
+ is_last_chunk: bool = False,
1378
+ **kwargs,
1379
+ ):
1380
+ if audios is None:
1381
+ return [], [], []
1382
+
1383
+ if isinstance(audios, np.ndarray):
1384
+ audios_list = [[audios]]
1385
+ elif isinstance(audios[0], np.ndarray):
1386
+ audios_list = [audios]
1387
+ else:
1388
+ audios_list = audios
1389
+
1390
+ if audio_parts is not None:
1391
+ assert len(audio_parts) == len(audios_list)
1392
+ for parts, audios in zip(audio_parts, audios_list):
1393
+ assert len(parts) == len(audios)
1394
+
1395
+ audio_feature_lens_list = []
1396
+ audio_ph_list = []
1397
+ audio_features_all = []
1398
+
1399
+ # audio placeholder not dependent on audio_parts
1400
+ for audios in audios_list:
1401
+ if audios:
1402
+ audio_ph_list.append(
1403
+ [
1404
+ self.get_audio_placeholder(len(a), chunk_input=stream_input, chunk_length=chunk_length)
1405
+ for a in audios
1406
+ ]
1407
+ )
1408
+ else:
1409
+ audio_ph_list.append([])
1410
+
1411
+ for idx, audios in enumerate(audios_list):
1412
+ if audio_parts is not None:
1413
+ # merge consecutive audio segments that belong to the same part
1414
+ audio_part = audio_parts[idx]
1415
+ merge_audio = []
1416
+ cur_audio = []
1417
+ for aid, (part, audio) in enumerate(zip(audio_part, audios)):
1418
+ if aid == 0 or audio_part[aid] == audio_part[aid - 1]:
1419
+ cur_audio.append(audio)
1420
+ else:
1421
+ merge_audio.append(np.hstack(cur_audio))
1422
+ cur_audio = [audio]
1423
+ if cur_audio:
1424
+ merge_audio.append(np.hstack(cur_audio))
1425
+ else:
1426
+ merge_audio = audios
1427
+
1428
+ # If the audio exceeds 30 seconds, split it into chunks every 30 seconds.
1429
+ final_merge_audio = []
1430
+ max_audio_inp_len = 30 * sampling_rate
1431
+ for audio in merge_audio:
1432
+ if len(audio) <= max_audio_inp_len:
1433
+ final_merge_audio.append(audio)
1434
+ else:
1435
+ for i in range(math.ceil(len(audio) / max_audio_inp_len)):
1436
+ final_merge_audio.append(audio[i * max_audio_inp_len : (i + 1) * max_audio_inp_len])
1437
+
1438
+ audio_feature_lens = []
1439
+
1440
+ if audios:
1441
+ if online_streaming:
1442
+ # online streaming: only a single audio is supported; use the process_audio_streaming return format directly
1443
+ assert (
1444
+ len(final_merge_audio) == 1
1445
+ ), f"online streaming mode only supports single audio, currently there are {len(final_merge_audio)}"
1446
+ audio = final_merge_audio[0]
1447
+ result = self.process_audio_streaming(
1448
+ audio, reset=False, return_batch_feature=True, is_last_chunk=is_last_chunk
1449
+ )
1450
+ audio_features_all.append(
1451
+ result["audio_features"].squeeze(0)
1452
+ ) # [1, 80, T] -> [80, T], keep consistent with batch processing
1453
+ audio_feature_lens_list.append(result["audio_feature_lens"][0])
1454
+ else:
1455
+ # batch processing
1456
+ audio_inputs = self.audio_processor(
1457
+ final_merge_audio,
1458
+ sampling_rate=sampling_rate,
1459
+ return_attention_mask=True,
1460
+ padding="max_length",
1461
+ return_tensors="pt",
1462
+ **kwargs,
1463
+ )
1464
+ audio_feature = audio_inputs["input_features"]
1465
+ actual_lens = audio_inputs["attention_mask"].sum(dim=1)
1466
+
1467
+ for feat, lens in zip(audio_feature, actual_lens):
1468
+ audio_features_all.append(feat[:, :lens])
1469
+ audio_feature_lens.append(lens)
1470
+
1471
+ audio_feature_lens = torch.hstack(audio_feature_lens)
1472
+ audio_feature_lens_list.append(audio_feature_lens)
1473
+ else:
1474
+ audio_feature_lens_list.append([])
1475
+
1476
+ if audio_features_all:
1477
+ audio_features = [i.permute(1, 0) for i in audio_features_all]
1478
+ audio_features = torch.nn.utils.rnn.pad_sequence(
1479
+ audio_features, batch_first=True, padding_value=0.0
1480
+ ).permute(0, 2, 1)
1481
+ else:
1482
+ audio_features = []
1483
+
1484
+ return audio_features, audio_feature_lens_list, audio_ph_list
1485
+
1486
+ def _convert(self, input_str, max_inp_length: Optional[int] = None):
1487
+ old_input_ids = self.tokenizer.encode(input_str)
1488
+
1489
+ listen_token_id = self.tokenizer.convert_tokens_to_ids("<|listen|>")
1490
+ input_ids = []
1491
+ for token in old_input_ids:
1492
+ if token != listen_token_id:
1493
+ input_ids.append(token)
1494
+
1495
+ if max_inp_length is not None:
1496
+ input_ids = input_ids[:max_inp_length]
1497
+ input_ids = torch.tensor(input_ids, dtype=torch.int32)
1498
+
1499
+ ## image bound
1500
+ start_cond = (input_ids == self.tokenizer.im_start_id) | (input_ids == self.tokenizer.slice_start_id)
1501
+ end_cond = (input_ids == self.tokenizer.im_end_id) | (input_ids == self.tokenizer.slice_end_id)
1502
+
1503
+ image_start_idx = torch.where(start_cond)[0]
1504
+ image_start_idx += 1
1505
+ image_end_idx = torch.where(end_cond)[0]
1506
+
1507
+ valid_image_nums = max(len(image_start_idx), len(image_end_idx))
1508
+
1509
+ image_bounds = torch.hstack(
1510
+ [
1511
+ image_start_idx[:valid_image_nums].unsqueeze(-1),
1512
+ image_end_idx[:valid_image_nums].unsqueeze(-1),
1513
+ ]
1514
+ )
1515
+
1516
+ ## audio bound
1517
+ audio_start_idx = torch.where(input_ids == self.tokenizer.audio_start_id)[0]
1518
+ audio_end_idx = torch.where(input_ids == self.tokenizer.audio_end_id)[0]
1519
+ assert len(audio_start_idx) == len(audio_end_idx)
1520
+ audio_bounds = torch.hstack([(audio_start_idx + 1).unsqueeze(-1), audio_end_idx.unsqueeze(-1)])
1521
+
1522
+ spk_start_idx = torch.where(input_ids == self.tokenizer.spk_start_id)[0]
1523
+ spk_end_idx = torch.where(input_ids == self.tokenizer.spk_end_id)[0]
1524
+ assert len(spk_start_idx) == len(spk_end_idx)
1525
+ spk_bounds = torch.hstack([(spk_start_idx + 1).unsqueeze(-1), spk_end_idx.unsqueeze(-1)])
1526
+
1527
+ return input_ids, image_bounds, audio_bounds, spk_bounds
1528
+
1529
+ def _convert_omni_to_inputs(
1530
+ self,
1531
+ images,
1532
+ audio_phs,
1533
+ texts: Union[str, List[str]],
1534
+ truncation=None,
1535
+ max_length=None,
1536
+ max_slice_nums=None,
1537
+ use_image_id=None,
1538
+ return_tensors=None,
1539
+ **kwargs,
1540
+ ):
1541
+ if images is None and audio_phs is None:
1542
+ model_inputs = self.tokenizer(
1543
+ texts, return_tensors=return_tensors, truncation=truncation, max_length=max_length, **kwargs
1544
+ )
1545
+ return MiniCPMOBatchFeature(data={**model_inputs})
1546
+
1547
+ image_pattern = "<image>./</image>"
1548
+ audio_pattern = "<audio>./</audio>"
1549
+ split_pattern = f"({image_pattern}|{audio_pattern})"
1550
+
1551
+ if isinstance(texts, str):
1552
+ texts = [texts]
1553
+
1554
+ bs = len(texts)
1555
+ if images is not None:
1556
+ images, image_sizes, tgt_sizes = images["pixel_values"], images["image_sizes"], images["tgt_sizes"]
1557
+ else:
1558
+ images, image_sizes, tgt_sizes = [[]] * bs, [[]] * bs, [[]] * bs
1559
+
1560
+ input_ids_list = []
1561
+ image_bounds_list = []
1562
+ audio_bounds_list = []
1563
+ spk_bounds_list = []
1564
+
1565
+ for index, text in enumerate(texts):
1566
+ text_chunks = re.split(split_pattern, text)
1567
+
1568
+ image_tags = re.findall(image_pattern, text)
1569
+ audio_tags = re.findall(audio_pattern, text)
1570
+
1571
+ if image_tags:
1572
+ assert images is not None
1573
+ assert len(image_tags) == len(image_sizes[index])
1574
+ if audio_tags:
1575
+ assert audio_phs is not None
1576
+ assert len(audio_tags) == len(audio_phs[index])
1577
+
1578
+ image_id = 0
1579
+ audio_id = 0
1580
+ for i, chunk in enumerate(text_chunks):
1581
+ if chunk == image_pattern:
1582
+ image_placeholder = self.image_processor.get_slice_image_placeholder(
1583
+ image_sizes[index][image_id], image_id, max_slice_nums, use_image_id
1584
+ )
1585
+ image_id += 1
1586
+ text_chunks[i] = image_placeholder
1587
+ elif chunk == audio_pattern:
1588
+ audio_placeholder = audio_phs[index][audio_id]
1589
+ audio_id += 1
1590
+ text_chunks[i] = audio_placeholder
1591
+
1592
+ final_text = "".join(text_chunks)
1593
+ input_ids, image_bounds, audio_bounds, spk_bounds = self._convert(final_text, max_length)
1594
+
1595
+ input_ids_list.append(input_ids)
1596
+ image_bounds_list.append(image_bounds)
1597
+ audio_bounds_list.append(audio_bounds)
1598
+ spk_bounds_list.append(spk_bounds)
1599
+
1600
+ padded_input_ids, padding_lengths = self.pad(input_ids_list, padding_side="left")
1601
+ attention_mask = torch.ones_like(padded_input_ids, dtype=torch.bool)
1602
+ for i, length in enumerate(padding_lengths):
1603
+ image_bounds_list[i] = image_bounds_list[i] + length
1604
+ audio_bounds_list[i] = audio_bounds_list[i] + length
1605
+ spk_bounds_list[i] = spk_bounds_list[i] + length
1606
+ attention_mask[i, :length] = False
1607
+
1608
+ data = {
1609
+ "input_ids": padded_input_ids,
1610
+ "attention_mask": attention_mask,
1611
+ "pixel_values": images,
1612
+ "image_sizes": image_sizes,
1613
+ "image_bound": image_bounds_list,
1614
+ "tgt_sizes": tgt_sizes,
1615
+ "audio_bounds": audio_bounds_list,
1616
+ "spk_bounds": spk_bounds_list,
1617
+ }
1618
+
1619
+ return data
1620
+
1621
+ def pad(self, inputs, max_length=None, padding_value=0, padding_side="left"):
1622
+ items = []
1623
+ if isinstance(inputs[0], list):
1624
+ assert isinstance(inputs[0][0], torch.Tensor)
1625
+ for it in inputs:
1626
+ for tr in it:
1627
+ items.append(tr)
1628
+ else:
1629
+ assert isinstance(inputs[0], torch.Tensor)
1630
+ items = inputs
1631
+
1632
+ batch_size = len(items)
1633
+ shape = items[0].shape
1634
+ dim = len(shape)
1635
+ assert dim <= 2
1636
+ if max_length is None:
1637
+ max_length = 0
1638
+ max_length = max(max_length, max(item.shape[-1] for item in items))
1639
+ min_length = min(item.shape[-1] for item in items)
1640
+ dtype = items[0].dtype
1641
+
1642
+ if dim == 0:
1643
+ return torch.stack([item for item in items], dim=0), [0]
1644
+ elif dim == 1:
1645
+ if max_length == min_length:
1646
+ return torch.stack([item for item in items], dim=0), [0] * batch_size
1647
+ tensor = torch.zeros((batch_size, max_length), dtype=dtype) + padding_value
1648
+ else:
1649
+ tensor = torch.zeros((batch_size, max_length, shape[-1]), dtype=dtype) + padding_value
1650
+
1651
+ padding_length = []
1652
+ for i, item in enumerate(items):
1653
+ if dim == 1:
1654
+ if padding_side == "left":
1655
+ tensor[i, -len(item) :] = item.clone()
1656
+ else:
1657
+ tensor[i, : len(item)] = item.clone()
1658
+ elif dim == 2:
1659
+ if padding_side == "left":
1660
+ tensor[i, -len(item) :, :] = item.clone()
1661
+ else:
1662
+ tensor[i, : len(item), :] = item.clone()
1663
+ padding_length.append(tensor.shape[-1] - len(item))
1664
+
1665
+ return tensor, padding_length
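# Illustrative sketch (not part of this file): left-padding as implemented by pad() above.
# Two 1-D id tensors of lengths 3 and 5 are padded to length 5 on the left, and the returned
# padding lengths are the per-row offsets later added to the image/audio/spk bound indices
# in _convert_omni_to_inputs.
import torch
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6, 7, 8])
# expected result of pad([a, b], padding_value=0, padding_side="left"):
#   tensor([[0, 0, 1, 2, 3],
#           [4, 5, 6, 7, 8]])   with padding_lengths == [2, 0]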
processor_config.json ADDED
@@ -0,0 +1,102 @@
1
+ {
2
+ "audio_processor": {
3
+ "audio_pool_step": 5,
4
+ "auto_map": {
5
+ "AutoFeatureExtractor": "processing_minicpmo.MiniCPMAAudioProcessor",
6
+ "AutoImageProcessor": "processing_minicpmo.MiniCPMVImageProcessor",
7
+ "AutoProcessor": "processing_minicpmo.MiniCPMOProcessor"
8
+ },
9
+ "chunk_length": 30,
10
+ "dither": 0.0,
11
+ "dynamic_log_norm": true,
12
+ "dynamic_range_db": 8.0,
13
+ "feature_extractor_type": "MiniCPMAAudioProcessor",
14
+ "feature_size": 80,
15
+ "hop_length": 160,
16
+ "im_end": "</image>",
17
+ "im_id_end": "</image_id>",
18
+ "im_id_start": "<image_id>",
19
+ "im_start": "<image>",
20
+ "image_feature_size": 64,
21
+ "image_processor_type": "MiniCPMVImageProcessor",
22
+ "log_floor_db": -10.0,
23
+ "max_slice_nums": 9,
24
+ "n_fft": 400,
25
+ "n_samples": 480000,
26
+ "nb_max_frames": 3000,
27
+ "norm_mean": [
28
+ 0.5,
29
+ 0.5,
30
+ 0.5
31
+ ],
32
+ "norm_std": [
33
+ 0.5,
34
+ 0.5,
35
+ 0.5
36
+ ],
37
+ "padding_side": "right",
38
+ "padding_value": 0.0,
39
+ "patch_size": 14,
40
+ "return_attention_mask": false,
41
+ "sampling_rate": 16000,
42
+ "scale_resolution": 448,
43
+ "slice_end": "</slice>",
44
+ "slice_mode": true,
45
+ "slice_start": "<slice>",
46
+ "unk": "<unk>",
47
+ "use_image_id": true,
48
+ "version": 4.5
49
+ },
50
+ "auto_map": {
51
+ "AutoProcessor": "processing_minicpmo.MiniCPMOProcessor"
52
+ },
53
+ "image_processor": {
54
+ "audio_pool_step": 5,
55
+ "auto_map": {
56
+ "AutoFeatureExtractor": "processing_minicpmo.MiniCPMAAudioProcessor",
57
+ "AutoImageProcessor": "processing_minicpmo.MiniCPMVImageProcessor",
58
+ "AutoProcessor": "processing_minicpmo.MiniCPMOProcessor"
59
+ },
60
+ "im_end": "</image>",
61
+ "im_end_token": "</image>",
62
+ "im_id_end": "</image_id>",
63
+ "im_id_start": "<image_id>",
64
+ "im_start": "<image>",
65
+ "im_start_token": "<image>",
66
+ "image_feature_size": 64,
67
+ "image_processor_type": "MiniCPMVImageProcessor",
68
+ "max_slice_nums": 9,
69
+ "mean": [
70
+ 0.5,
71
+ 0.5,
72
+ 0.5
73
+ ],
74
+ "norm_mean": [
75
+ 0.5,
76
+ 0.5,
77
+ 0.5
78
+ ],
79
+ "norm_std": [
80
+ 0.5,
81
+ 0.5,
82
+ 0.5
83
+ ],
84
+ "patch_size": 14,
85
+ "scale_resolution": 448,
86
+ "slice_end": "</slice>",
87
+ "slice_end_token": "</slice>",
88
+ "slice_mode": true,
89
+ "slice_start": "<slice>",
90
+ "slice_start_token": "<slice>",
91
+ "std": [
92
+ 0.5,
93
+ 0.5,
94
+ 0.5
95
+ ],
96
+ "unk": "<unk>",
97
+ "unk_token": "<unk>",
98
+ "use_image_id": true,
99
+ "version": 4.5
100
+ },
101
+ "processor_class": "MiniCPMOProcessor"
102
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,88 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<unk>",
4
+ "<image>",
5
+ "</image>",
6
+ "<ref>",
7
+ "</ref>",
8
+ "<box>",
9
+ "</box>",
10
+ "<quad>",
11
+ "</quad>",
12
+ "<point>",
13
+ "</point>",
14
+ "<slice>",
15
+ "</slice>",
16
+ "<image_id>",
17
+ "</image_id>",
18
+ "<unit>",
19
+ "</unit>",
20
+ "<answer>",
21
+ "</answer>",
22
+ "<focus>",
23
+ "</focus>",
24
+ "<line>",
25
+ "</line>",
26
+ "<perception>",
27
+ "</perception>",
28
+ "<source_image>",
29
+ "</source_image>",
30
+ "<image_save_to>",
31
+ "</image_save_to>",
32
+ "<|audio_start|>",
33
+ "<|audio|>",
34
+ "<|audio_end|>",
35
+ "<|spk_bos|>",
36
+ "<|spk|>",
37
+ "<|spk_eos|>",
38
+ "<|tts_bos|>",
39
+ "<|tts_eos|>",
40
+ "<|listen|>",
41
+ "<|speak|>",
42
+ "<|interrupt|>",
43
+ "<|vad_start|>",
44
+ "<|vad_end|>",
45
+ "<|emotion_start|>",
46
+ "<|emotion_end|>",
47
+ "<|speed_start|>",
48
+ "<|speed_end|>",
49
+ "<|pitch_start|>",
50
+ "<|pitch_end|>",
51
+ "<|turn_bos|>",
52
+ "<|turn_eos|>",
53
+ "<|chunk_eos|>",
54
+ "<|chunk_bos|>",
55
+ "<|chunk_tts_bos|>",
56
+ "<|chunk_tts_eos|>",
57
+ "<|tts_pad|>",
58
+ "<|timbre_7|>",
59
+ "<|timbre_8|>",
60
+ "<|timbre_9|>",
61
+ "<|timbre_10|>",
62
+ "<|timbre_11|>",
63
+ "<|timbre_12|>",
64
+ "<|timbre_13|>",
65
+ "<|timbre_14|>",
66
+ "<|timbre_15|>",
67
+ "<|timbre_16|>",
68
+ "<|timbre_17|>",
69
+ "<|timbre_18|>",
70
+ "<|timbre_19|>",
71
+ "<|timbre_20|>",
72
+ "<|timbre_21|>",
73
+ "<|timbre_22|>",
74
+ "<|timbre_23|>",
75
+ "<|timbre_24|>",
76
+ "<|timbre_25|>",
77
+ "<|timbre_26|>",
78
+ "<|timbre_27|>",
79
+ "<|timbre_28|>",
80
+ "<|timbre_29|>",
81
+ "<|timbre_30|>",
82
+ "<|timbre_31|>"
83
+ ],
84
+ "bos_token": "<|im_start|>",
85
+ "eos_token": "<|im_end|>",
86
+ "pad_token": "<|endoftext|>",
87
+ "unk_token": "<unk>"
88
+ }
tokenization_minicpmo_fast.py ADDED
@@ -0,0 +1,120 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ # Copyright 2026 The OpenBMB Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ from typing import List
18
+
19
+ from transformers import Qwen2TokenizerFast
20
+
21
+
22
+ class MiniCPMOTokenizerFast(Qwen2TokenizerFast):
23
+ def __init__(self, **kwargs):
24
+ self._bad_token_ids = kwargs.pop("bad_token_ids", [])
25
+
26
+ super().__init__(**kwargs)
27
+
28
+ # image
29
+ self.im_start = "<image>"
30
+ self.im_end = "</image>"
31
+ self.ref_start = "<ref>"
32
+ self.ref_end = "</ref>"
33
+ self.box_start = "<box>"
34
+ self.box_end = "</box>"
35
+ self.quad_start = "<quad>"
36
+ self.quad_end = "</quad>"
37
+ self.slice_start = "<slice>"
38
+ self.slice_end = "</slice>"
39
+ self.im_id_start = "<image_id>"
40
+ self.im_id_end = "</image_id>"
41
+
42
+ # audio
43
+ self.audio_start = "<|audio_start|>"
44
+ self.audio_end = "<|audio_end|>"
45
+ self.spk_start = "<|spk_bos|>"
46
+ self.spk_end = "<|spk_eos|>"
47
+ self.tts_start = "<|tts_bos|>"
48
+ self.tts_end = "<|tts_eos|>"
49
+
50
+ @property
51
+ def eos_id(self):
52
+ return self.eos_token_id
53
+
54
+ @property
55
+ def bos_id(self):
56
+ return self.bos_token_id
57
+
58
+ @property
59
+ def unk_id(self):
60
+ return self.unk_token_id
61
+
62
+ @property
63
+ def im_start_id(self):
64
+ return self.convert_tokens_to_ids(self.im_start)
65
+
66
+ @property
67
+ def im_end_id(self):
68
+ return self.convert_tokens_to_ids(self.im_end)
69
+
70
+ @property
71
+ def slice_start_id(self):
72
+ return self.convert_tokens_to_ids(self.slice_start)
73
+
74
+ @property
75
+ def slice_end_id(self):
76
+ return self.convert_tokens_to_ids(self.slice_end)
77
+
78
+ @property
79
+ def im_id_start_id(self):
80
+ return self.convert_tokens_to_ids(self.im_id_start)
81
+
82
+ @property
83
+ def im_id_end_id(self):
84
+ return self.convert_tokens_to_ids(self.im_id_end)
85
+
86
+ @property
87
+ def audio_start_id(self):
88
+ return self.convert_tokens_to_ids(self.audio_start)
89
+
90
+ @property
91
+ def audio_end_id(self):
92
+ return self.convert_tokens_to_ids(self.audio_end)
93
+
94
+ @property
95
+ def spk_start_id(self):
96
+ return self.convert_tokens_to_ids(self.spk_start)
97
+
98
+ @property
99
+ def spk_end_id(self):
100
+ return self.convert_tokens_to_ids(self.spk_end)
101
+
102
+ @property
103
+ def tts_start_id(self):
104
+ return self.convert_tokens_to_ids(self.tts_start)
105
+
106
+ @property
107
+ def tts_end_id(self):
108
+ return self.convert_tokens_to_ids(self.tts_end)
109
+
110
+ @staticmethod
111
+ def escape(text: str) -> str:
112
+ return text
113
+
114
+ @staticmethod
115
+ def unescape(text: str) -> str:
116
+ return text
117
+
118
+ @property
119
+ def bad_token_ids(self) -> List[int]:
120
+ return self._bad_token_ids
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66664f87759d9e829e7ef0ded96976727374dcd7ca6f3ae9bfe89bbda541e5af
3
+ size 11437708
tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "auto_map": {
4
+ "AutoProcessor": "processing_minicpmo.MiniCPMOProcessor",
5
+ "AutoTokenizer": [
6
+ null,
7
+ "tokenization_minicpmo_fast.MiniCPMOTokenizerFast"
8
+ ]
9
+ },
10
+ "backend": "tokenizers",
11
+ "bos_token": "<|im_start|>",
12
+ "clean_up_tokenization_spaces": false,
13
+ "eos_token": "<|im_end|>",
14
+ "errors": "replace",
15
+ "is_local": true,
16
+ "model_max_length": 131072,
17
+ "pad_token": "<|endoftext|>",
18
+ "processor_class": "MiniCPMOProcessor",
19
+ "split_special_tokens": false,
20
+ "tokenizer_class": "MiniCPMOTokenizer",
21
+ "unk_token": "<unk>"
22
+ }
utils.py ADDED
@@ -0,0 +1,2417 @@
1
+ #!/usr/bin/env python
2
+ # -*- coding: utf-8 -*-
3
+ # Copyright 2026 The OpenBMB Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ import logging
18
+ from dataclasses import dataclass
19
+ from typing import Any
20
+ from typing import Dict
21
+ from typing import List
22
+ from typing import Literal
23
+ from typing import Optional
24
+ from typing import Tuple
25
+ from typing import Union
26
+
27
+ import torch
28
+ import torch.nn.functional as F
29
+ import torch.nn.utils.parametrize as P
30
+ from transformers.cache_utils import DynamicCache
31
+
32
+ logger = logging.getLogger(__name__)
33
+
34
+
35
+ # text
36
+ @dataclass
37
+ class GenerateChunkOutput:
38
+ chunk_token_ids: torch.Tensor
39
+ current_inputs_embeds: torch.Tensor
40
+ input_last_hidden_states: Optional[torch.Tensor] # for tts use_speaker_embedding
41
+ last_hidden_states: Optional[torch.Tensor] # for tts input feature (projector_semantic)
42
+ past_key_values: Optional[torch.Tensor]
43
+ finished: bool
44
+
45
+
46
+ class ChunkPrefillChunkGenerate:
47
+ def __init__(self, model, tokenizer, terminators):
48
+ self.tokenizer = tokenizer
49
+ self.model = model
50
+ self.terminators = terminators
51
+ self.terminators_ids = [tokenizer.convert_tokens_to_ids(i) for i in self.terminators]
52
+ self.embedding_layer = self.model.get_input_embeddings()
53
+
54
+ self.forbidden_tokens = [
55
+ ":",
56
+ ":",
57
+ ";",
58
+ "#",
59
+ "“",
60
+ "”",
61
+ "‘",
62
+ "’",
63
+ "@",
64
+ "*",
65
+ "【",
66
+ "】",
67
+ "「",
68
+ "」",
69
+ "(",
70
+ ")",
71
+ "(",
72
+ ")",
73
+ "[",
74
+ "]",
75
+ "&",
76
+ "/",
77
+ "$",
78
+ ]
79
+
80
+ self.forbidden_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in self.forbidden_tokens]
81
+ bad_token_ids = getattr(tokenizer, "bad_token_ids", [])
82
+ if bad_token_ids:
83
+ self.forbidden_token_ids.extend(bad_token_ids)
84
+
85
+ @staticmethod
86
+ def prepare_generation_config(do_sample, max_new_tokens=50, min_new_tokens=0, **kwargs):
87
+ num_beams = kwargs.get("num_beams", 3)
88
+ generation_config = {
89
+ "num_beams": num_beams,
90
+ "top_p": 0.8,
91
+ "top_k": 100,
92
+ "temperature": 0.7,
93
+ "do_sample": True,
94
+ "repetition_penalty": 1.05,
95
+ }
96
+
97
+ if do_sample:
98
+ generation_config.update(
99
+ {
100
+ "top_p": 0.8,
101
+ "top_k": 100,
102
+ "temperature": 0.7,
103
+ "do_sample": True,
104
+ "repetition_penalty": 1.05,
105
+ }
106
+ )
107
+ elif num_beams > 1:
108
+ generation_config.update({"num_beams": num_beams, "repetition_penalty": 1.2, "do_sample": False})
109
+ else:
110
+ generation_config.update({"do_sample": False, "repetition_penalty": 1.05})
111
+
112
+ generation_config.update((k, kwargs[k]) for k in generation_config.keys() & kwargs.keys())
113
+ generation_config["min_new_tokens"] = min_new_tokens
114
+ generation_config["max_new_tokens"] = max_new_tokens
115
+
116
+ return generation_config
117
+
118
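# Illustrative check (not part of this file): how prepare_generation_config above resolves
# its branches and lets caller-supplied kwargs override the defaults.
cfg = ChunkPrefillChunkGenerate.prepare_generation_config(do_sample=False, num_beams=3)
assert cfg["do_sample"] is False and cfg["repetition_penalty"] == 1.2   # beam-search branch
cfg = ChunkPrefillChunkGenerate.prepare_generation_config(do_sample=True, temperature=0.5)
assert cfg["do_sample"] is True and cfg["temperature"] == 0.5           # caller overrides win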
+ def chunk_generate(
119
+ self,
120
+ inputs_embeds: torch.Tensor,
121
+ past_key_values,
122
+ is_first_generate_chunk: bool,
123
+ chunk_size: int,
124
+ return_hidden_states: bool,
125
+ do_sample: bool,
126
+ temperature: float,
127
+ top_p: float,
128
+ top_k: int,
129
+ repetition_penalty: float = 1.05,
130
+ length_penalty: float = 1.0,
131
+ all_input_ids: Optional[torch.Tensor] = None,
132
+ ) -> GenerateChunkOutput:
133
+ """
134
+ Args:
135
+ inputs_embeds: [1, seq_len, hidden_dim], Input embeddings of current chunk.
136
+ past_key_values: [num_layers, 2, batch_size, num_heads, seq_len, head_dim], Past key values for llm.
137
+ is_first_generate_chunk: bool, Whether this is the first generate chunk.
138
+ chunk_size: int, The size of the current chunk, default is 10, and it is fixed during training.
139
+ return_hidden_states: bool Whether to return the hidden states, default is True.
140
+ do_sample: bool Whether to sample from the model, default is True.
141
+ temperature: float The temperature for the model, default is 0.7.
142
+ top_p: float The top-p for the model, default is 0.8.
143
+ top_k: int The top-k for the model, default is 100.
144
+ repetition_penalty: float, The repetition penalty for the model, default is 1.05.
145
+ length_penalty: float, The length penalty for the model, default is 1.0. Higher value means more detailed generation.
146
+ all_input_ids: Optional[torch.Tensor], The input ids for the current chunk.
147
+ """
148
+
149
+ finished = False
150
+ current_inputs_embeds = inputs_embeds.clone()
151
+ input_last_hidden_states = []
152
+ last_hidden_states = []
153
+ generated_tokens = []
154
+
155
+ for token_idx in range(chunk_size):
156
+ if is_first_generate_chunk and token_idx == 0:
157
+ # first generate chunk, prefill inputs_embeds
158
+ model_inputs = {
159
+ "inputs_embeds": current_inputs_embeds,
160
+ "past_key_values": past_key_values,
161
+ "use_cache": True,
162
+ "output_hidden_states": return_hidden_states,
163
+ }
164
+ else: # for all other cases: prefill the latest generated token
165
+ model_inputs = {
166
+ "inputs_embeds": current_inputs_embeds[:, -1:, :],
167
+ "past_key_values": past_key_values,
168
+ "use_cache": True,
169
+ "output_hidden_states": return_hidden_states,
170
+ }
171
+
172
+ with torch.no_grad():
173
+ outputs = self.model(**model_inputs)
174
+
175
+ # last token's logits
176
+ logits = outputs.logits[:, -1, :].to(copy=True, dtype=torch.float32, device=inputs_embeds.device)
177
+
178
+ # suppress specific tokens during decoding (equivalent to suppress_tokens in model.generate)
179
+ if self.forbidden_token_ids:
180
+ logits[:, self.forbidden_token_ids] = float("-inf")
181
+
182
+ past_key_values = outputs.past_key_values
183
+
184
+ PENALTY_WINDOW_SIZE = 128
185
+
186
+ # apply repetition penalty
187
+ if repetition_penalty != 1.0:
188
+ # get token ids for repetition penalty
189
+ if all_input_ids is not None:
190
+ # use global input ids (including original input and generated part)
191
+ if len(generated_tokens) > 0:
192
+ generated_token_ids = torch.cat(generated_tokens, dim=1)
193
+ current_sequence = torch.cat(
194
+ [
195
+ all_input_ids[:, -PENALTY_WINDOW_SIZE:],
196
+ generated_token_ids,
197
+ ],
198
+ dim=1,
199
+ )
200
+ else:
201
+ current_sequence = all_input_ids[:, -PENALTY_WINDOW_SIZE:]
202
+ unique_token_ids = torch.unique(current_sequence.squeeze(0))
203
+ elif len(generated_tokens) > 0:
204
+ # revert to original logic: only use generated tokens
205
+ generated_token_ids = torch.cat(generated_tokens, dim=1).squeeze(0)
206
+ unique_token_ids = torch.unique(generated_token_ids)
207
+ else:
208
+ unique_token_ids = torch.tensor([], dtype=torch.long, device=logits.device)
209
+
210
+ # apply repetition penalty
211
+ for token_id in unique_token_ids:
212
+ if logits[0, token_id] > 0:
213
+ logits[0, token_id] = logits[0, token_id] / repetition_penalty
214
+ else:
215
+ logits[0, token_id] = logits[0, token_id] * repetition_penalty
216
+
217
+ # apply length penalty, higher value means more detailed generation
218
+ if length_penalty != 1.0:
219
+ for eos_token_id in self.terminators_ids:
220
+ if logits[0, eos_token_id] > 0:
221
+ logits[0, eos_token_id] = logits[0, eos_token_id] / length_penalty
222
+ else:
223
+ logits[0, eos_token_id] = logits[0, eos_token_id] * length_penalty
224
+
225
+ # apply temperature
226
+ if temperature != 1.0:
227
+ logits = logits / temperature
228
+
229
+ if do_sample:
230
+ # Top-k filtering
231
+ if top_k > 0:
232
+ top_k_logits, top_k_indices = torch.topk(logits, min(top_k, logits.size(-1)))
233
+ logits_filtered = torch.full_like(logits, float("-inf"))
234
+ logits_filtered.scatter_(1, top_k_indices, top_k_logits)
235
+ logits = logits_filtered
236
+
237
+ # Top-p filtering
238
+ if top_p < 1.0:
239
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
240
+ cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
241
+
242
+ # remove tokens with cumulative probability greater than top_p
243
+ sorted_indices_to_remove = cumulative_probs > top_p
244
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
245
+ sorted_indices_to_remove[..., 0] = 0
246
+
247
+ indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
248
+ logits[indices_to_remove] = float("-inf")
249
+
250
+ # sampling
251
+ probs = F.softmax(logits, dim=-1)
252
+ next_token = torch.multinomial(probs, num_samples=1)
253
+ else:
254
+ next_token = torch.argmax(logits, dim=-1, keepdim=True)
255
+
256
+ if return_hidden_states:
257
+ if is_first_generate_chunk and token_idx == 0:
258
+ input_last_hidden_states.append(outputs.hidden_states[-1])
259
+ else:
260
+ last_hidden_states.append(outputs.hidden_states[-1])
261
+
262
+ # if terminator token, stop generating
263
+ if next_token.item() in self.terminators_ids:
264
+ finished = True
265
+ break
266
+
267
+ generated_tokens.append(next_token)
268
+
269
+ # convert new token to embeddings and concatenate
270
+ next_token_embed = self.embedding_layer(next_token)
271
+
272
+ # update inputs_embeds, add one
273
+ current_inputs_embeds = torch.cat([current_inputs_embeds, next_token_embed], dim=1)
274
+
275
+ if len(generated_tokens) > 0:
276
+ chunk_token_ids = torch.cat(generated_tokens, dim=1)
277
+ else:
278
+ # special case: the very first prediction of this chunk was an EOS token, so no new tokens were produced; return an empty tensor of shape (1, 0)
279
+ if finished:
280
+ chunk_token_ids = torch.zeros((1, 0), dtype=torch.long, device=current_inputs_embeds.device)
281
+ else:
282
+ raise Exception("this should not happen")
283
+
284
+ if len(last_hidden_states) > 0:
285
+ last_hidden_states = torch.cat(last_hidden_states, dim=1)
286
+ else:
287
+ # special case: no per-token hidden states were collected here (either the first
+ # prediction of the chunk was an EOS token, or hidden states were not requested);
+ # torch.cat on an empty list would raise, so return None and let the caller reuse
+ # the previous chunk's last hidden state.
+ last_hidden_states = None
292
+
293
+ if len(input_last_hidden_states) > 0:
294
+ input_last_hidden_states = torch.cat(input_last_hidden_states, dim=1)
295
+ else:
296
+ input_last_hidden_states = None
297
+
298
+ return GenerateChunkOutput(
299
+ chunk_token_ids=chunk_token_ids,
300
+ current_inputs_embeds=current_inputs_embeds,
301
+ input_last_hidden_states=input_last_hidden_states,
302
+ last_hidden_states=last_hidden_states,
303
+ past_key_values=past_key_values,
304
+ finished=finished,
305
+ )
306
+
307
+
308
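# Illustrative usage sketch (not part of this file): driving chunk_generate in a loop so
# tokens stream out in fixed-size chunks. `gen` is assumed to be a ChunkPrefillChunkGenerate
# instance; `prompt_embeds` and `cache` come from an earlier prefill of the multimodal prompt.
def generate_stream(gen, prompt_embeds, cache, chunk_size=10, max_chunks=50):
    embeds, first = prompt_embeds, True
    for _ in range(max_chunks):
        out = gen.chunk_generate(
            inputs_embeds=embeds,
            past_key_values=cache,
            is_first_generate_chunk=first,
            chunk_size=chunk_size,
            return_hidden_states=True,
            do_sample=True,
            temperature=0.7,
            top_p=0.8,
            top_k=100,
        )
        yield out.chunk_token_ids, out.finished
        embeds, cache, first = out.current_inputs_embeds, out.past_key_values, False
        if out.finished:
            break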
+ def streaming_token_decoder(token_iterator, tokenizer, skip_special_tokens=False):
309
+ """
310
+ Incrementally decode tokens from an iterator, handling partial multi-byte characters.
311
+
312
+ When streaming tokens, multi-byte characters (like Chinese) may be split across multiple
313
+ tokens. Decoding partial tokens results in replacement characters (U+FFFD). This function
314
+ buffers tokens and only yields complete characters.
315
+
316
+ Args:
317
+ token_iterator: An iterator yielding (token_ids, is_finished) tuples.
318
+ token_ids can be torch.Tensor or any iterable of integers.
319
+ tokenizer: The tokenizer to use for decoding.
320
+ skip_special_tokens: Whether to skip special tokens during decoding.
321
+
322
+ Yields:
323
+ (decoded_text, is_finished) tuples where decoded_text is the new text since last yield.
324
+ """
325
+ accumulated_token_ids = []
326
+ yielded_text_len = 0
327
+
328
+ for token_ids, is_finished in token_iterator:
329
+ # Accumulate token IDs
330
+ if torch.is_tensor(token_ids):
331
+ accumulated_token_ids.extend(token_ids.reshape(-1).tolist())
332
+ else:
333
+ accumulated_token_ids.extend(list(token_ids) if hasattr(token_ids, "__iter__") else [token_ids])
334
+
335
+ # Decode all accumulated tokens
336
+ full_decoded = tokenizer.decode(accumulated_token_ids, skip_special_tokens=skip_special_tokens)
337
+
338
+ if is_finished:
339
+ # Final chunk - yield all remaining text
340
+ new_text = full_decoded[yielded_text_len:]
341
+ yield new_text, is_finished
342
+ else:
343
+ # Find safe prefix without incomplete multi-byte characters
344
+ # The replacement character '�' (U+FFFD) indicates incomplete decoding
345
+ new_text = full_decoded[yielded_text_len:]
346
+
347
+ # Hold back text ending with replacement character (incomplete UTF-8 sequence)
348
+ safe_end = len(new_text)
349
+ while safe_end > 0 and new_text[safe_end - 1] == "\ufffd":
350
+ safe_end -= 1
351
+
352
+ safe_text = new_text[:safe_end] if safe_end > 0 else ""
353
+ yielded_text_len += len(safe_text)
354
+ yield safe_text, is_finished
355
+
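+ # --- illustrative usage sketch (not part of the upstream file) ----------------
+ # Minimal example of driving streaming_token_decoder; `tok` stands for any
+ # Hugging Face tokenizer and `_toy_chunks` is a hypothetical helper that emits
+ # one token id at a time, so multi-byte characters arrive split across chunks.
+ #
+ # def _toy_chunks(tok, text):
+ #     ids = tok.encode(text, add_special_tokens=False)
+ #     for i, tid in enumerate(ids):
+ #         yield [tid], i == len(ids) - 1
+ #
+ # for piece, done in streaming_token_decoder(_toy_chunks(tok, "你好, world"), tok):
+ #     print(piece, end="", flush=True)   # only complete characters are printed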
356
+
357
+ def torch_clone_recursive(obj):
358
+ """Recursively clone nested containers of torch.Tensors.
359
+
360
+ Supported container types: dict, list, tuple. Tensors are cloned; any other
361
+ type raises a ValueError.
362
+ """
363
+ if torch.is_tensor(obj):
364
+ return obj.clone()
365
+ elif isinstance(obj, dict):
366
+ return {k: torch_clone_recursive(v) for k, v in obj.items()}
367
+ elif isinstance(obj, list):
368
+ return [torch_clone_recursive(v) for v in obj]
369
+ elif isinstance(obj, tuple):
370
+ return tuple(torch_clone_recursive(v) for v in obj)
371
+ else:
372
+ raise ValueError(f"Unsupported type: {type(obj)}")
373
+
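+ # Example (illustrative, not part of the upstream file): deep-copying a legacy
+ # tuple-style KV cache of (key, value) pairs so the copy can be mutated safely.
+ #
+ # kv = ((torch.zeros(1, 8, 4, 64), torch.zeros(1, 8, 4, 64)),)   # one layer (K, V)
+ # kv_copy = torch_clone_recursive(kv)                            # independent tensors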
374
+
375
+ def rotate_half(x: torch.Tensor) -> torch.Tensor:
376
+ """Rotate half the hidden dims of the input for RoPE."""
377
+ dim = x.shape[-1]
378
+ x1 = x[..., : dim // 2]
379
+ x2 = x[..., dim // 2 :]
380
+ return torch.cat((-x2, x1), dim=-1)
381
+
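+ # Worked example (illustrative, not part of the upstream file): for a last-dim
+ # vector [x1, x2, x3, x4], rotate_half returns [-x3, -x4, x1, x2]. RoPE is then
+ # applied as q = x * cos + rotate_half(x) * sin, with the cos/sin halves
+ # duplicated so that rotate_half(rotate_half(x)) == -x makes the rotation
+ # invertible (see realign_rotary_suffix below).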
382
+
383
+ @dataclass
384
+ class SpeculativeSnapshot:
385
+ """Speculative snapshot for VAD speculative rollback.
386
+
387
+ Used in VAD speculative execution: creates a snapshot after streaming_prefill
388
+ and before streaming_generate. If speculation fails (user continues speaking),
389
+ the state can be restored to continue streaming_prefill.
390
+
391
+ Implementation:
392
+ - LLM KV Cache: only record length, restore by truncation (zero extra VRAM)
393
+ - Audio KV Cache: requires cloning, as generate sets it to None
394
+ - Mel processor: save full state snapshot (including buffer)
395
+ """
396
+
397
+ # KV Cache length (for truncation recovery)
398
+ llm_cache_length: int
399
+ audio_cache_length: int
400
+
401
+ # session state
402
+ new_user_msg: bool
403
+ llm_generated: bool
404
+ llm_generate_completed: bool
405
+
406
+ # Round management
407
+ next_round_id: int
408
+ pending_round_id: Optional[int]
409
+ omni_chunk_history_length: int
410
+
411
+ # TTS state (requires cloning, but usually small)
412
+ tts_last_turn_tokens: Optional[torch.Tensor]
413
+
414
+ # Streaming processor state
415
+ audio_chunk_idx: int
416
+
417
+ # Mel processor state snapshot (including buffer)
418
+ mel_processor_snapshot: Optional[dict] = None
419
+
420
+ # Audio encoder KV cache (requires cloning to ensure determinism after recovery)
421
+ audio_past_key_values: Optional[tuple] = None
422
+
423
+ # timestamp (for debugging)
424
+ timestamp: float = 0.0
425
+
426
+ # debug field: for verifying correctness of recovery
427
+ llm_cache_checksum: Optional[float] = None # LLM KV Cache first layer K sum
428
+ audio_cache_checksum: Optional[float] = None # Audio KV Cache first layer K sum
429
+ mel_buffer_checksum: Optional[float] = None # Mel buffer sum
430
+
431
+ # RNG state (key: for ensuring determinism of dithering etc. after recovery)
432
+ rng_state_cpu: Optional[torch.Tensor] = None # torch CPU RNG state
433
+ rng_state_cuda: Optional[torch.Tensor] = None # torch CUDA RNG state (if on GPU)
434
+
435
+ def summary(self) -> str:
436
+ mel_buf_len = 0
437
+ if self.mel_processor_snapshot:
438
+ buf = self.mel_processor_snapshot.get("buffer")
439
+ if buf is not None:
440
+ mel_buf_len = len(buf)
441
+ return (
442
+ f"llm_cache={self.llm_cache_length}, "
443
+ f"audio_cache={self.audio_cache_length}, "
444
+ f"audio_chunk_idx={self.audio_chunk_idx}, "
445
+ f"mel_buffer={mel_buf_len}, "
446
+ f"history_len={self.omni_chunk_history_length}, "
447
+ f"new_user_msg={self.new_user_msg}, "
448
+ f"llm_generated={self.llm_generated}"
449
+ )
450
+
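+ # --- illustrative sketch (not part of the upstream file) ----------------------
+ # How a snapshot might be captured before speculative generation; `session` and
+ # its attributes are hypothetical stand-ins for the caller's streaming state.
+ #
+ # snap = SpeculativeSnapshot(
+ #     llm_cache_length=get_kv_cache_length(session.llm_past_key_values),
+ #     audio_cache_length=get_kv_cache_length(session.audio_past_key_values),
+ #     new_user_msg=session.new_user_msg,
+ #     llm_generated=session.llm_generated,
+ #     llm_generate_completed=session.llm_generate_completed,
+ #     next_round_id=session.next_round_id,
+ #     pending_round_id=session.pending_round_id,
+ #     omni_chunk_history_length=len(session.omni_chunk_history),
+ #     tts_last_turn_tokens=None,
+ #     audio_chunk_idx=session.audio_chunk_idx,
+ #     audio_past_key_values=torch_clone_recursive(session.audio_past_key_values),
+ # )
+ # # If speculation fails, truncate the LLM KV cache back to snap.llm_cache_length
+ # # and restore the cloned audio cache / mel-processor state before resuming prefill.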
451
+
452
+ # tts
453
+ @dataclass
454
+ class TTSSamplingParams:
455
+ top_p: float = 0.85
456
+ min_p: float = 0.01
457
+ top_k: int = 25
458
+ repetition_penalty: float = 1.05
459
+ temperature: float = 0.8
460
+ win_size: int = 16
461
+ tau_r: float = 0.1
462
+
463
+
464
+ class TTSStreamingGenerator:
465
+ """
466
+ Streaming generator for TTS that processes chunks and yields audio tokens in real-time.
467
+
468
+ Supported attention types:
469
+ - full_attention: Full attention, all tokens can attend to each other
470
+ - sliding_window: Sliding window attention, KV cache is truncated to fixed size (token_window_size)
471
+ - sliding_recompute: Sliding recompute, only keep previous chunk and recompute with current chunk
472
+ - reindex: Keep first chunk as sink, reindex sliding window positions via RoPE rotation
473
+ """
474
+
475
+ def __init__(
476
+ self,
477
+ model,
478
+ temperature: float,
479
+ eos_token: Union[int, torch.Tensor],
480
+ chunk_size: int = 25, # s3tokenizer 1s = 25token
481
+ tts_last_turn_tokens: torch.Tensor = None,
482
+ logits_processors=None,
483
+ logits_warpers=None,
484
+ ):
485
+ self.tts = model
486
+ self.device = model.device
487
+ self.temperature = torch.tensor([temperature], dtype=torch.float, device=self.device)
488
+ self.eos_token = (
489
+ torch.tensor(eos_token, device=self.device) if isinstance(eos_token, int) else eos_token.to(self.device)
490
+ )
491
+
492
+ self.num_vq = model.num_vq
493
+ self.num_audio_tokens = model.num_audio_tokens
494
+ self.recomputed_chunks = model.recomputed_chunks
495
+ self.emb_code = model.emb_code
496
+ self.head_code = model.head_code
497
+
498
+ # Attention type and window sizes
499
+ self.attention_type = model.attention_type # "full_attention", "sliding_window", "sliding_recompute", "reindex"
500
+ self.chunk_window_size = model.chunk_window_size # chunk-level window for sliding_recompute (default 2)
501
+ self.token_window_size = model.token_window_size # token-level window for sliding_window/reindex (default 300)
502
+
503
+ # RoPE config (for reindex mode)
504
+ self.rope_theta = model.model.config.rope_theta
505
+ self.head_dim = model.model.config.hidden_size // model.model.config.num_attention_heads
506
+
507
+ # Logits processors
508
+ self.logits_processors = logits_processors if logits_processors is not None else []
509
+ # Logits warpers (like TopP/TopK), separate from processors
510
+ self.logits_warpers = logits_warpers if logits_warpers is not None else []
511
+
512
+ # initialize state
513
+ self.past_key_values = None
514
+ self.text_start_pos = 0
515
+ self.idx = -1 # start from -1, become 0 when first called
516
+ self.all_conditions = []
517
+ self.all_generated_tokens = []
518
+ self.tts_last_turn_tokens = tts_last_turn_tokens
519
+ self.spk_emb = None
520
+
521
+ audio_bos = [self.tts.audio_bos_token_id]
522
+ audio_bos = torch.Tensor(audio_bos).to(self.tts.emb_text.weight.device, dtype=torch.long)
523
+
524
+ self.audio_bos_embeds = self.tts.emb_text(audio_bos).unsqueeze(0)
525
+ self.text_eos_embed = self.tts.emb_text(
526
+ torch.tensor(
527
+ [self.tts.config.text_eos_token_id],
528
+ device=self.tts.emb_text.weight.device,
529
+ dtype=torch.long,
530
+ )
531
+ ).unsqueeze(0)
532
+
533
+ # token buffer: accumulates generated tokens until chunk_size is reached, then they are yielded to the caller
534
+ self.chunk_size = chunk_size
535
+ self._token_buffer: List[torch.Tensor] = []
536
+
537
+ # Chunk info tracking for sliding_recompute and reindex
538
+ self._chunk_info: List[dict] = []
539
+ self._total_seq_len = 0
540
+
541
+ # Reindex mode: track sink (first chunk) length
542
+ self._sink_kv_len = 0
543
+
544
+ def _build_recompute_inputs(self, current_condition: torch.Tensor) -> torch.Tensor:
545
+ """Build recompute inputs for sliding_recompute mode."""
546
+ if len(self._chunk_info) == 0:
547
+ return current_condition
548
+
549
+ prev_chunk = self._chunk_info[-1]
550
+ prev_condition = prev_chunk["condition"]
551
+ prev_audio_tokens = prev_chunk["audio_tokens"]
552
+
553
+ recompute_list = [prev_condition]
554
+ if len(prev_audio_tokens) > 0:
555
+ prev_audio_embeds = torch.cat([self.emb_code[0](tok) for tok in prev_audio_tokens], dim=1)
556
+ recompute_list.append(prev_audio_embeds)
557
+
558
+ recompute_list.append(current_condition)
559
+ return torch.cat(recompute_list, dim=1)
560
+
561
+ def _truncate_kv_cache_sliding_window(self):
562
+ """Truncate KV cache for sliding_window mode."""
563
+ if self.past_key_values is None:
564
+ return
565
+
566
+ if hasattr(self.past_key_values, "get_seq_length"):
567
+ current_kv_len = self.past_key_values.get_seq_length()
568
+ else:
569
+ current_kv_len = self.past_key_values[0][0].shape[2]
570
+
571
+ if current_kv_len <= self.token_window_size:
572
+ return
573
+
574
+ new_cache = DynamicCache()
575
+ num_layers = (
576
+ len(self.past_key_values.key_cache)
577
+ if hasattr(self.past_key_values, "key_cache")
578
+ else len(self.past_key_values)
579
+ )
580
+
581
+ for layer_idx in range(num_layers):
582
+ if hasattr(self.past_key_values, "key_cache"):
583
+ key = self.past_key_values.key_cache[layer_idx][:, :, -self.token_window_size :, :]
584
+ value = self.past_key_values.value_cache[layer_idx][:, :, -self.token_window_size :, :]
585
+ else:
586
+ key = self.past_key_values[layer_idx][0][:, :, -self.token_window_size :, :]
587
+ value = self.past_key_values[layer_idx][1][:, :, -self.token_window_size :, :]
588
+ new_cache.update(key, value, layer_idx)
589
+
590
+ self.past_key_values = new_cache
591
+
592
+ @staticmethod
593
+ def _apply_rope_rotation(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
594
+ """Apply RoPE rotation to tensor."""
595
+ return x * cos + rotate_half(x) * sin
596
+
597
+ def _compute_rope_cos_sin(self, positions: torch.Tensor, device: torch.device, dtype: torch.dtype):
598
+ """Compute RoPE cos and sin for given positions."""
599
+ dim_half = self.head_dim // 2
600
+ freq_seq = torch.arange(0, dim_half, dtype=torch.float32, device=device)
601
+ inv_freq = 1.0 / (self.rope_theta ** (freq_seq / dim_half))
602
+
603
+ # positions: [seq_len]
604
+ angles = positions.float().unsqueeze(-1) * inv_freq.unsqueeze(0) # [seq_len, dim_half]
605
+ angles = torch.cat([angles, angles], dim=-1) # [seq_len, head_dim]
606
+
607
+ cos = angles.cos().to(dtype)
608
+ sin = angles.sin().to(dtype)
609
+ return cos, sin
610
+
611
+ def _reindex_kv_cache(self):
612
+ """
613
+ Reindex KV cache for reindex mode:
614
+ 1. Keep first chunk as attention sink
615
+ 2. Keep last chunk
616
+ 3. Discard middle chunks
617
+ 4. Reindex the last chunk's key positions to be right after sink via RoPE rotation
618
+ """
619
+ if self.past_key_values is None or len(self._chunk_info) < 2:
620
+ return
621
+
622
+ # Get current KV cache length
623
+ if hasattr(self.past_key_values, "get_seq_length"):
624
+ current_kv_len = self.past_key_values.get_seq_length()
625
+ else:
626
+ current_kv_len = self.past_key_values[0][0].shape[2]
627
+
628
+ # Calculate sink length (first chunk)
629
+ sink_len = self._chunk_info[0]["condition_len"] + self._chunk_info[0]["audio_token_count"]
630
+
631
+ # Last chunk length
632
+ last_chunk = self._chunk_info[-1]
633
+ last_chunk_len = last_chunk["condition_len"] + last_chunk["audio_token_count"]
634
+
635
+ keep_len = sink_len + last_chunk_len
636
+
637
+ # Get device and dtype
638
+ device = self.past_key_values.key_cache[0].device
639
+ dtype = self.past_key_values.key_cache[0].dtype
640
+
641
+ if current_kv_len <= keep_len:
642
+ last_chunk_kv_len = current_kv_len - sink_len
643
+ if last_chunk_kv_len <= 0:
644
+ return
645
+ self.text_start_pos = current_kv_len
646
+ return
647
+
648
+ # Step 1: Truncate KV cache - keep sink and last chunk
649
+ new_cache = DynamicCache()
650
+ num_layers = len(self.past_key_values.key_cache)
651
+
652
+ original_start_pos = current_kv_len - last_chunk_len
653
+ new_start_pos = sink_len
654
+ delta = new_start_pos - original_start_pos # This is a scalar constant
655
+ delta_positions = torch.full((last_chunk_len,), delta, dtype=torch.float32, device=device)
656
+
657
+ # Compute rotation cos/sin
658
+ cos, sin = self._compute_rope_cos_sin(delta_positions, device, dtype)
659
+ cos = cos.unsqueeze(0).unsqueeze(0) # [1, 1, seq_len, head_dim]
660
+ sin = sin.unsqueeze(0).unsqueeze(0)
661
+
662
+ for layer_idx in range(num_layers):
663
+ key_full = self.past_key_values.key_cache[layer_idx]
664
+ value_full = self.past_key_values.value_cache[layer_idx]
665
+
666
+ # Extract sink and last chunk
667
+ key_sink = key_full[:, :, :sink_len, :]
668
+ value_sink = value_full[:, :, :sink_len, :]
669
+ key_last = key_full[:, :, -last_chunk_len:, :]
670
+ value_last = value_full[:, :, -last_chunk_len:, :]
671
+
672
+ # Apply RoPE rotation to reindex key positions
673
+ key_last_reindexed = self._apply_rope_rotation(key_last, cos, sin)
674
+
675
+ # Concatenate sink and reindexed last chunk
676
+ key = torch.cat([key_sink, key_last_reindexed], dim=2)
677
+ value = torch.cat([value_sink, value_last], dim=2)
678
+
679
+ new_cache.update(key, value, layer_idx)
680
+
681
+ self.past_key_values = new_cache
682
+
683
+ # Update text_start_pos to reflect new positions
684
+ self.text_start_pos = sink_len + last_chunk_len
685
+
686
+ @torch.inference_mode()
687
+ def generate_with_buffer(
688
+ self,
689
+ condition: torch.Tensor,
690
+ text_finished: bool = False,
691
+ max_new_token: int = 500,
692
+ ):
693
+ """input a condition embedding chunk, generate audio token each time,
694
+ and accumulating them in a buffer; yield only once the buffer reaches chunk_size.
695
+
696
+ Yields:
697
+ (tokens, finished) tuples, where tokens has shape [1, chunk_size] (the final chunk may be shorter or empty).
698
+ """
699
+ self.idx += 1
700
+ self.device = self.tts.device
701
+
702
+ # if text finished, first concatenate Text EOS
703
+ if text_finished:
704
+ condition = torch.cat([condition, self.text_eos_embed], dim=1)
705
+
706
+ # always concatenate Audio BOS
707
+ condition = torch.cat([condition, self.audio_bos_embeds], dim=1).to(self.device)
708
+
709
+ self.all_conditions.append(condition)
710
+
711
+ # Initialize current chunk info
712
+ current_chunk_info = {
713
+ "condition_len": condition.shape[1],
714
+ "audio_token_count": 0,
715
+ "condition": condition.clone(),
716
+ "audio_tokens": [],
717
+ }
718
+
719
+ # Handle different attention types
720
+ if self.attention_type == "sliding_recompute" and self.idx >= 1:
721
+ # sliding_recompute: discard KV cache, recompute with previous + current chunk
722
+ self.past_key_values = None
723
+ current_condition = self._build_recompute_inputs(condition)
724
+ self.text_start_pos = 0
725
+ elif self.attention_type == "reindex" and self.idx >= 1:
726
+ # reindex: truncate KV cache keeping sink + last chunk, reindex positions via RoPE
727
+ self._reindex_kv_cache()
728
+ current_condition = condition
729
+ # Always update text_start_pos based on actual KV cache length (like reference code)
730
+ if self.past_key_values is not None:
731
+ if hasattr(self.past_key_values, "get_seq_length"):
732
+ kv_len = self.past_key_values.get_seq_length()
733
+ else:
734
+ kv_len = self.past_key_values[0][0].shape[2]
735
+ self.text_start_pos = kv_len
736
+ else:
737
+ current_condition = condition
738
+
739
+ condition_length = current_condition.shape[1]
740
+ prefill_len = condition_length
741
+ finished = torch.zeros(1, dtype=torch.bool, device=self.device)
742
+ chunk_generated_tokens = []
743
+
744
+ for t in range(max_new_token):
745
+ if t == 0:
746
+ inputs_embeds = current_condition
747
+ pos_ids = torch.arange(
748
+ self.text_start_pos,
749
+ self.text_start_pos + condition_length,
750
+ dtype=torch.long,
751
+ device=self.device,
752
+ ).unsqueeze(0)
753
+ else:
754
+ last = self.all_generated_tokens[-1]
755
+ # last: [1,1], directly as code id
756
+ inputs_embeds = self.emb_code[0](last)
757
+ pos_ids = torch.tensor(
758
+ [self.text_start_pos + prefill_len + t - 1],
759
+ dtype=torch.long,
760
+ device=self.device,
761
+ ).unsqueeze(0)
762
+
763
+ outputs = self.tts.model(
764
+ position_ids=pos_ids,
765
+ past_key_values=self.past_key_values,
766
+ inputs_embeds=inputs_embeds,
767
+ use_cache=True,
768
+ )
769
+ hidden_states = outputs.last_hidden_state
770
+
771
+ # Handle KV cache based on attention type
772
+ if self.attention_type == "sliding_window":
773
+ self.past_key_values = outputs.past_key_values
774
+ self._truncate_kv_cache_sliding_window()
775
+ else:
776
+ self.past_key_values = outputs.past_key_values
777
+
778
+ with P.cached():
779
+ logits = torch.empty(
780
+ hidden_states.size(0),
781
+ hidden_states.size(1),
782
+ self.num_audio_tokens,
783
+ self.num_vq,
784
+ dtype=torch.float,
785
+ device=self.device,
786
+ )
787
+ for num_vq_iter in range(self.num_vq):
788
+ x: torch.Tensor = self.head_code[num_vq_iter](hidden_states)
789
+ logits[..., num_vq_iter] = x
790
+ del x
791
+
792
+ del hidden_states
793
+
794
+ logits = logits[:, -1].float()
795
+
796
+ logits = logits.permute(0, 2, 1)
797
+ logits = logits.reshape(-1, logits.size(2))
798
+
799
+ logits /= self.temperature
800
+
801
+ audio_bos = len(self.all_generated_tokens) == 0 and t == 0
802
+
803
+ if not audio_bos:
804
+ # use all tokens generated so far as input for processors/warpers (aligned with modeling_minicpmo)
805
+ all_generated_tokens = torch.cat(self.all_generated_tokens, dim=1).to(self.device) # [1, T]
806
+ for processor in self.logits_processors:
807
+ logits = processor(all_generated_tokens, logits)
808
+
809
+ for warper in self.logits_warpers:
810
+ logits = warper(all_generated_tokens, logits)
811
+ del all_generated_tokens
812
+
813
+ # sample next token (only use first codebook, same as generate)
814
+ scores = F.softmax(logits, dim=-1)
815
+ idx_next = torch.multinomial(scores, num_samples=1) # [(B*num_vq), 1]
816
+ next_id = idx_next.view(-1, self.num_vq)[:, 0:1] # only take first codebook → [B, 1]
817
+ del scores
818
+
819
+ if next_id.eq(
820
+ self.eos_token
821
+ ).any(): # an audio EOS token was generated: this chunk is finished, stop generating new tokens
822
+ finished[:] = True
823
+ else: # the EOS token is never added to the buffer; it contributes no audio.
824
+ # convert next_id to correct shape [1, 1], no num_vq dimension
825
+ if next_id.dim() == 0: # if scalar
826
+ next_tok = next_id.unsqueeze(0).unsqueeze(0) # [1, 1]
827
+ elif next_id.dim() == 1: # if 1D [1]
828
+ next_tok = next_id.unsqueeze(0) # [1, 1]
829
+ else:
830
+ next_tok = next_id
831
+
832
+ self.all_generated_tokens.append(next_tok)
833
+ chunk_generated_tokens.append(next_tok)
834
+
835
+ # Update chunk info for sliding_recompute
836
+ current_chunk_info["audio_tokens"].append(next_tok.clone())
837
+ current_chunk_info["audio_token_count"] += 1
838
+
839
+ self._token_buffer.append(next_tok)
840
+
841
+ if len(self._token_buffer) == 0:
842
+ # case 1: if this is the last text chunk, yield an empty tensor with finished=True
843
+ if text_finished:
844
+ yield torch.empty(1, 0, dtype=torch.long, device=self.device), True
845
+ break
846
+ # case 2: if not last text chunk, break directly
847
+ else:
848
+ break
849
+ else: # buffer has something
850
+ # case 1: if buffer is larger/equal to chunk_size, yield out
851
+ if len(self._token_buffer) >= self.chunk_size:
852
+ batch = torch.cat(self._token_buffer[: self.chunk_size], dim=1) # [1, chunk_size]
853
+ yield batch, False # → [1, chunk_size]
854
+ # discard yielded part
855
+ self._token_buffer = self._token_buffer[self.chunk_size :]
856
+
857
+ # case 2: if buffer is smaller than chunk_size
858
+ else:
859
+ # if generation finished, and is the last text chunk, yield all remaining tokens, then break
860
+ if finished.all():
861
+ if text_finished:
862
+ batch = torch.cat(self._token_buffer, dim=1) # [1, remaining] (fewer than chunk_size)
863
+ yield batch, True # → [1, remaining], finished
864
+ self._token_buffer = []
865
+ break
866
+ else:
867
+ # not the last text chunk: wait for the next text chunk to fill the buffer, so this call ends here
868
+ break
869
+ else: # generation of this audio chunk is not finished, continue generating
870
+ continue
871
+
872
+ # Save current chunk info for sliding_recompute and reindex
873
+ self._chunk_info.append(current_chunk_info)
874
+ self._total_seq_len += condition.shape[1] + len(chunk_generated_tokens)
875
+
876
+ # Update text_start_pos based on attention type
877
+ if self.attention_type == "sliding_recompute":
878
+ # sliding_recompute: will be reset at next chunk start, update normally here
879
+ self.text_start_pos += prefill_len + len(chunk_generated_tokens)
880
+ elif self.attention_type == "reindex":
881
+ # reindex: position based on actual KV cache length (positions have been reindexed to be continuous)
882
+ if self.past_key_values is not None:
883
+ if hasattr(self.past_key_values, "get_seq_length"):
884
+ self.text_start_pos = self.past_key_values.get_seq_length()
885
+ else:
886
+ self.text_start_pos = self.past_key_values[0][0].shape[2]
887
+ else:
888
+ self.text_start_pos += condition.shape[1] + len(chunk_generated_tokens)
889
+ else:
890
+ self.text_start_pos += condition.shape[1] + len(chunk_generated_tokens)
891
+ # note: remaining tokens in buffer will be kept, and accumulated next time
892
+
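+ # --- illustrative usage sketch (not part of the upstream file) ----------------
+ # Rough shape of how the streaming TTS generator is driven; `tts_model`,
+ # `condition_chunks` (a list of [1, T, hidden] condition embeddings), `eos_id`
+ # and `vocode` are hypothetical placeholders.
+ #
+ # gen = TTSStreamingGenerator(model=tts_model, temperature=0.8, eos_token=eos_id)
+ # for i, cond in enumerate(condition_chunks):
+ #     last = i == len(condition_chunks) - 1
+ #     for audio_tokens, finished in gen.generate_with_buffer(cond, text_finished=last):
+ #         vocode(audio_tokens)   # [1, chunk_size] codec ids (last chunk may be shorter)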
893
+
894
+ # sliding window
895
+ @dataclass
896
+ class StreamingWindowConfig:
897
+ text_window_high_tokens: int = 8000
898
+ text_window_low_tokens: int = 6000
899
+
900
+
901
+ @dataclass
902
+ class DuplexWindowConfig:
903
+ """duplex sliding window configuration
904
+
905
+ sliding window mode:
906
+ - "off": disable sliding window
907
+ - "basic": basic sliding window (trigger by cache length)
908
+ - "context": sliding window with context (trigger by unit number, preserve generated text to previous)
909
+ """
910
+
911
+ # sliding window mode
912
+ sliding_window_mode: str = "off" # "off" / "basic" / "context"
913
+
914
+ # basic sliding window parameters
915
+ basic_window_high_tokens: int = 8000 # high watermark: trigger sliding window when exceeded
916
+ basic_window_low_tokens: int = 6000 # low watermark: keep to this value after sliding window
917
+
918
+ # context sliding window parameters
919
+ context_previous_max_tokens: int = 500 # maximum number of tokens kept as "previous" context
920
+ context_max_units: int = 24 # maximum number of units (sliding window triggers when exceeded)
921
+
922
+ # verification mode (for comparison test)
923
+ verify_mode: bool = False # whether to enable verification log
924
+
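+ # Example (illustrative, not part of the upstream file): a context-preserving
+ # window that starts evicting whole units once more than 24 are cached, while
+ # keeping at most ~500 tokens of their generated text as "previous" context.
+ #
+ # window_cfg = DuplexWindowConfig(
+ #     sliding_window_mode="context",
+ #     context_previous_max_tokens=500,
+ #     context_max_units=24,
+ # )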
925
+
926
+ def as_dynamic_cache(past_key_values):
927
+ """Convert legacy tuple cache to DynamicCache if needed."""
928
+ if isinstance(past_key_values, DynamicCache):
929
+ return past_key_values
930
+
931
+ if isinstance(past_key_values, tuple):
932
+ return DynamicCache.from_legacy_cache(past_key_values)
933
+
934
+ return past_key_values
935
+
936
+
937
+ def get_kv_cache_length(cache) -> int:
938
+ """Get the sequence length of a KV cache.
939
+
940
+ Args:
941
+ cache: DynamicCache or tuple-based cache
942
+
943
+ Returns:
944
+ The number of tokens in the cache
945
+ """
946
+ if cache is None:
947
+ return 0
948
+
949
+ if isinstance(cache, DynamicCache):
950
+ if not cache.key_cache or not cache.key_cache[0].numel():
951
+ return 0
952
+ return cache.key_cache[0].shape[-2]
953
+
954
+ if isinstance(cache, tuple):
955
+ return cache[0][0].shape[2]
956
+
957
+ return 0
958
+
959
+
960
+ def get_rotary_cos_sin(
961
+ head_dim: int,
962
+ positions: torch.Tensor,
963
+ device: torch.device,
964
+ dtype: torch.dtype,
965
+ rope_theta: float = 10000.0,
966
+ inv_freq_cache: Optional[Dict[Tuple, torch.Tensor]] = None,
967
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
968
+ """Compute RoPE cos and sin components for given positions.
969
+
970
+ Args:
971
+ head_dim: Dimension of each attention head
972
+ positions: Position indices tensor
973
+ device: Target device
974
+ dtype: Target dtype
975
+ rope_theta: RoPE base frequency (default 10000.0)
976
+ inv_freq_cache: Optional cache dict for inverse frequencies
977
+
978
+ Returns:
979
+ Tuple of (cos, sin) tensors with shape [1, 1, seq_len, head_dim]
980
+ """
981
+ cache_key = (head_dim, device)
982
+
983
+ inv_freq = inv_freq_cache.get(cache_key) if inv_freq_cache is not None else None
984
+ if inv_freq is None or inv_freq.device != device or inv_freq.shape[0] != head_dim // 2:
985
+ exponent = torch.arange(0, head_dim, 2, device=device, dtype=torch.float32) / head_dim
986
+ inv_freq = 1.0 / (rope_theta**exponent)
987
+ if inv_freq_cache is not None:
988
+ inv_freq_cache[cache_key] = inv_freq
989
+
990
+ positions = positions.to(device=device, dtype=torch.float32)
991
+ angles = torch.einsum("i,j->ij", positions, inv_freq)
992
+ cos = torch.cos(angles)
993
+ sin = torch.sin(angles)
994
+
995
+ # Use cat instead of repeat_interleave, consistent with model's original RotaryEmbedding
996
+ # Original: emb = torch.cat((freqs, freqs), dim=-1) -> [f0, f1, ..., f_{d/2}, f0, f1, ..., f_{d/2}]
997
+ cos_full = torch.cat([cos, cos], dim=-1).to(dtype=dtype)
998
+ sin_full = torch.cat([sin, sin], dim=-1).to(dtype=dtype)
999
+ cos_full = cos_full.unsqueeze(0).unsqueeze(0)
1000
+ sin_full = sin_full.unsqueeze(0).unsqueeze(0)
1001
+ return cos_full, sin_full
1002
+
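+ # Example (illustrative, not part of the upstream file): cos/sin for three
+ # positions of a 64-dim head come back as [1, 1, 3, 64] tensors, ready to
+ # broadcast over [batch, heads, seq, head_dim] key tensors.
+ #
+ # cos, sin = get_rotary_cos_sin(64, torch.arange(3), torch.device("cpu"), torch.float32)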
1003
+
1004
+ def realign_rotary_suffix(
1005
+ suffix_keys: torch.Tensor,
1006
+ old_positions: torch.Tensor,
1007
+ new_positions: torch.Tensor,
1008
+ rope_theta: float = 10000.0,
1009
+ inv_freq_cache: Optional[Dict[Tuple, torch.Tensor]] = None,
1010
+ ) -> torch.Tensor:
1011
+ """Realign RoPE position encoding after cache eviction.
1012
+
1013
+ When tokens are dropped from the middle of a cache, the suffix tokens
1014
+ need their RoPE embeddings recalculated with new position indices.
1015
+
1016
+ Args:
1017
+ suffix_keys: Key tensor to realign, shape [batch, heads, seq_len, head_dim]
1018
+ old_positions: Original position indices
1019
+ new_positions: New position indices after eviction
1020
+ rope_theta: RoPE base frequency
1021
+ inv_freq_cache: Optional cache dict for inverse frequencies
1022
+
1023
+ Returns:
1024
+ Realigned key tensor with same shape as input
1025
+ """
1026
+ if suffix_keys.numel() == 0:
1027
+ return suffix_keys
1028
+
1029
+ head_dim = suffix_keys.shape[-1]
1030
+ device = suffix_keys.device
1031
+ dtype = suffix_keys.dtype
1032
+
1033
+ # Compute old position cos/sin
1034
+ cos_old, sin_old = get_rotary_cos_sin(head_dim, old_positions, device, dtype, rope_theta, inv_freq_cache)
1035
+
1036
+ # Inverse transform: recover original key
1037
+ base = cos_old * suffix_keys - sin_old * rotate_half(suffix_keys)
1038
+
1039
+ # Compute new position cos/sin
1040
+ cos_new, sin_new = get_rotary_cos_sin(head_dim, new_positions, device, dtype, rope_theta, inv_freq_cache)
1041
+
1042
+ # Forward transform: re-encode with new positions
1043
+ return cos_new * base + sin_new * rotate_half(base)
1044
+
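+ # Why the inverse/forward transform above works (note, not part of the upstream
+ # file): with RoPE applied as k = x*cos(θ) + rotate_half(x)*sin(θ) and cos/sin
+ # built by duplicating the half-dim angles, rotate_half(rotate_half(x)) == -x,
+ # so x = k*cos(θ) - rotate_half(k)*sin(θ) recovers the unrotated key exactly;
+ # re-applying cos/sin at the new positions then yields the realigned key.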
1045
+
1046
+ def drop_tokens_from_cache(
1047
+ cache: Optional[DynamicCache | Tuple],
1048
+ length: int,
1049
+ preserve: int,
1050
+ position_offset: int,
1051
+ rope_theta: float = 10000.0,
1052
+ inv_freq_cache: Optional[Dict[Tuple, torch.Tensor]] = None,
1053
+ ) -> Tuple[Optional[DynamicCache], int, bool]:
1054
+ """Drop tokens from a KV cache while preserving system prompt.
1055
+
1056
+ Removes tokens in the range [preserve, preserve + length) from the cache,
1057
+ realigning RoPE embeddings for the suffix.
1058
+
1059
+ Args:
1060
+ cache: DynamicCache or tuple-based cache (will be converted to DynamicCache)
1061
+ length: Number of tokens to drop
1062
+ preserve: Number of tokens to preserve at the start (system prompt)
1063
+ position_offset: Current position offset for RoPE calculation
1064
+ rope_theta: RoPE base frequency
1065
+ inv_freq_cache: Optional cache dict for inverse frequencies
1066
+
1067
+ Returns:
1068
+ Tuple of (cache, new_position_offset, success)
1069
+ Note: Tuple cache will be converted to DynamicCache. Modification is in-place.
1070
+ """
1071
+ if cache is None or length <= 0:
1072
+ return cache, position_offset, False
1073
+
1074
+ cache = as_dynamic_cache(cache)
1075
+
1076
+ total_len = get_kv_cache_length(cache)
1077
+ if total_len <= 0:
1078
+ return cache, position_offset, False
1079
+
1080
+ preserve = min(preserve, total_len)
1081
+ available = total_len - preserve
1082
+
1083
+ if available < length:
1084
+ logger.warning(
1085
+ "Cannot drop %d tokens: only %d available (total=%d, preserve=%d)",
1086
+ length,
1087
+ available,
1088
+ total_len,
1089
+ preserve,
1090
+ )
1091
+ return cache, position_offset, False
1092
+
1093
+ suffix_len = total_len - preserve - length
1094
+ # note: after RoPE reindexing, cache positions are already compacted (starting right after the preserved prefix),
1095
+ # so position_offset must not be added here; use the actual layout of the current cache
1096
+ suffix_offset = preserve + length # suffix current position in cache
1097
+ prefix_offset = preserve # new start position of the suffix (immediately after the preserved prefix)
1098
+
1099
+ # Prepare position tensors for RoPE realignment
1100
+ old_positions = None
1101
+ new_positions = None
1102
+ if suffix_len > 0:
1103
+ device = cache.key_cache[0].device
1104
+ old_positions = torch.arange(
1105
+ suffix_offset,
1106
+ suffix_offset + suffix_len,
1107
+ device=device,
1108
+ dtype=torch.long,
1109
+ )
1110
+ new_positions = torch.arange(
1111
+ prefix_offset,
1112
+ prefix_offset + suffix_len,
1113
+ device=device,
1114
+ dtype=torch.long,
1115
+ )
1116
+
1117
+ keep_len = total_len - length
1118
+
1119
+ # Process each layer (in-place modification)
1120
+ for layer_idx in range(len(cache.key_cache)):
1121
+ key_tensor = cache.key_cache[layer_idx]
1122
+ value_tensor = cache.value_cache[layer_idx]
1123
+
1124
+ if not key_tensor.numel():
1125
+ continue
1126
+
1127
+ # Preserve prefix (system prompt)
1128
+ prefix_keys = key_tensor[:, :, :preserve, :]
1129
+ prefix_values = value_tensor[:, :, :preserve, :]
1130
+
1131
+ if suffix_len > 0:
1132
+ # Keep and realign suffix
1133
+ suffix_keys = key_tensor[:, :, preserve + length :, :]
1134
+ suffix_values = value_tensor[:, :, preserve + length :, :]
1135
+
1136
+ if old_positions is not None and new_positions is not None and suffix_keys.numel():
1137
+ suffix_keys = realign_rotary_suffix(
1138
+ suffix_keys,
1139
+ old_positions,
1140
+ new_positions,
1141
+ rope_theta,
1142
+ inv_freq_cache,
1143
+ )
1144
+
1145
+ cache.key_cache[layer_idx] = torch.cat([prefix_keys, suffix_keys], dim=-2).contiguous()
1146
+ cache.value_cache[layer_idx] = torch.cat([prefix_values, suffix_values], dim=-2).contiguous()
1147
+ else:
1148
+ cache.key_cache[layer_idx] = prefix_keys.contiguous()
1149
+ cache.value_cache[layer_idx] = prefix_values.contiguous()
1150
+
1151
+ cache.crop(keep_len)
1152
+ cache._seen_tokens = max(keep_len, 0)
1153
+
1154
+ new_offset = position_offset + length
1155
+ logger.debug("Dropped %d tokens from cache, new length=%d", length, keep_len)
1156
+
1157
+ return cache, new_offset, True
1158
+
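+ # --- illustrative usage sketch (not part of the upstream file) ----------------
+ # Dropping the 128 oldest non-system tokens while keeping the first `system_len`
+ # tokens intact; `cache`, `system_len` and `pos_offset` are hypothetical
+ # placeholders for the caller's state.
+ #
+ # cache, pos_offset, ok = drop_tokens_from_cache(
+ #     cache=cache,
+ #     length=128,
+ #     preserve=system_len,
+ #     position_offset=pos_offset,
+ #     rope_theta=10000.0,
+ # )
+ # if not ok:
+ #     logger.warning("nothing dropped: fewer than 128 evictable tokens in cache")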
1159
+
1160
+ # stream decoder
1161
+ def top_k_top_p_filtering(logits, top_k=0, top_p=0.0, filter_value=-float("inf")):
1162
+ logits = logits.clone()
1163
+
1164
+ # Top-k filtering
1165
+ if top_k > 0:
1166
+ top_k = min(top_k, logits.size(-1))
1167
+ indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
1168
+ logits[indices_to_remove] = filter_value
1169
+
1170
+ # Top-p (nucleus) filtering
1171
+ if top_p > 0.0:
1172
+ sorted_logits, sorted_indices = torch.sort(logits, descending=True)
1173
+ probs = F.softmax(sorted_logits, dim=-1)
1174
+ cumulative_probs = torch.cumsum(probs, dim=-1)
1175
+
1176
+ sorted_indices_to_remove = cumulative_probs > top_p
1177
+ # keep the first token that exceeds top_p
1178
+ sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
1179
+ sorted_indices_to_remove[..., 0] = 0
1180
+
1181
+ indices_to_remove = sorted_indices[sorted_indices_to_remove]
1182
+ logits[0, indices_to_remove] = filter_value
1183
+
1184
+ return logits
1185
+
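+ # --- illustrative usage sketch (not part of the upstream file) ----------------
+ # Typical sampling step with this filter, assuming `logits` is a [1, vocab] row.
+ #
+ # filtered = top_k_top_p_filtering(logits, top_k=50, top_p=0.9)
+ # probs = F.softmax(filtered / 0.8, dim=-1)        # 0.8 acts as the temperature
+ # next_id = torch.multinomial(probs, num_samples=1)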
1186
+
1187
+ class StreamDecoder:
1188
+ def __init__(self, llm, tokenizer, special_token_ids=None, forbidden_token_ids=None):
1189
+ self.m = llm
1190
+ self.tokenizer = tokenizer
1191
+ self.listen_id = self.tokenizer.eos_token_id
1192
+
1193
+ self.chunk_eos_id = self.tokenizer.convert_tokens_to_ids("<|chunk_eos|>")
1194
+ self.chunk_tts_eos_id = self.tokenizer.convert_tokens_to_ids("<|chunk_tts_eos|>")
1195
+ self.turn_eos_id = self.tokenizer.convert_tokens_to_ids("<|turn_eos|>")
1196
+ self.speak_id = self.tokenizer.convert_tokens_to_ids("<|speak|>")
1197
+
1198
+ self.special_token_ids = special_token_ids if special_token_ids is not None else []
1199
+
1200
+ # cache special tokens (used for context sliding window filtering)
1201
+ self._all_special_ids = set()
1202
+ self._all_special_tokens_text = set()
1203
+ if self.tokenizer:
1204
+ if hasattr(self.tokenizer, "all_special_ids"):
1205
+ self._all_special_ids = set(self.tokenizer.all_special_ids)
1206
+ if hasattr(self.tokenizer, "all_special_tokens"):
1207
+ self._all_special_tokens_text = set(self.tokenizer.all_special_tokens)
1208
+
1209
+ custom_special_tokens = [
1210
+ "<unit>",
1211
+ "</unit>",
1212
+ "<image>",
1213
+ "</image>",
1214
+ "<slice>",
1215
+ "</slice>",
1216
+ "<|listen|>",
1217
+ "<|speak|>",
1218
+ "<|tts_bos|>",
1219
+ "<|tts_eos|>",
1220
+ "<|audio_start|>",
1221
+ "<|audio_end|>",
1222
+ "<|chunk_eos|>",
1223
+ "<|chunk_tts_eos|>",
1224
+ "<|turn_eos|>",
1225
+ "<|audio_start|>",
1226
+ "<|audio_end|>",
1227
+ ]
1228
+ self._all_special_tokens_text.update(custom_special_tokens)
1229
+ for token in custom_special_tokens:
1230
+ token_id = self.tokenizer.convert_tokens_to_ids(token)
1231
+ if token_id is not None and token_id != self.tokenizer.unk_token_id:
1232
+ self._all_special_ids.add(token_id)
1233
+
1234
+ if forbidden_token_ids is None:
1235
+ self.forbidden_token_ids = []
1236
+ elif isinstance(forbidden_token_ids, int):
1237
+ self.forbidden_token_ids = [forbidden_token_ids]
1238
+ else:
1239
+ self.forbidden_token_ids = forbidden_token_ids
1240
+ self.forbidden_token_ids.append(self.chunk_eos_id)
1241
+
1242
+ assert isinstance(self.forbidden_token_ids, list)
1243
+
1244
+ self.cache = None
1245
+ self.context = ""
1246
+ self.generated_tokens = [] # track generated tokens
1247
+ self.generated_special_tokens = [] # track generated special tokens
1248
+ self.reset()
1249
+ self.embeds = None
1250
+ self.system_embeds = None
1251
+
1252
+ # sliding window related states
1253
+ self._unit_history: List[Dict[str, Any]] = []
1254
+ self._next_unit_id: int = 0
1255
+ self._pending_unit_id: Optional[int] = None
1256
+ self._pending_unit_start_cache_len: int = 0
1257
+ self._system_preserve_length: int = 0
1258
+ self._position_offset: int = 0
1259
+ self._window_config = DuplexWindowConfig()
1260
+ self._window_enabled: bool = True
1261
+ self._rope_inv_freq_cache: Dict[Tuple, torch.Tensor] = {}
1262
+
1263
+ # context preserving sliding window states
1264
+ # initial cache layout: [prefix] [suffix] [units...]
1265
+ # after first sliding window: [prefix] [previous_marker + content] [suffix] [units...]
1266
+ # (prefix is fixed, previous content is dynamic, suffix is fixed, units form the sliding region)
1267
+ self._preserve_prefix_length: int = 0 # original prefix length (fixed)
1268
+ self._previous_content_length: int = 0 # previous content length (dynamic, including marker)
1269
+ self._suffix_token_ids: List[int] = [] # suffix token ids (e.g. <|im_end|>)
1270
+
1271
+ # previous marker (added dynamically after first sliding window)
1272
+ self._previous_marker: str = "\n\nprevious: " # fixed prefix marker
1273
+ self._previous_marker_token_ids: List[int] = [] # marker token ids (initialized)
1274
+ self._has_previous: bool = False # whether previous marker has been added
1275
+
1276
+ # previous content
1277
+ self._previous_text: str = "" # accumulated generated text (without marker)
1278
+ self._previous_token_ids: List[int] = [] # previous full token ids (including marker)
1279
+
1280
+ # validation statistics
1281
+ self._sliding_event_count: int = 0 # sliding window trigger count
1282
+ self._total_dropped_tokens: int = 0 # total dropped token count
1283
+ self._total_dropped_units: int = 0 # total dropped unit count
1284
+
1285
+ def sliding_embeds(self):
1286
+ # tmp = system_embeds
1287
+ # tmp -> embeds after 5s
1288
+ # reset
1289
+ # feed
1290
+ pass
1291
+
1292
+ def reset(self):
1293
+ self.context = ""
1294
+ self.cache = None
1295
+ self.generated_tokens = []
1296
+ self.generated_special_tokens = []
1297
+ self.embeds = None
1298
+ self.system_embeds = None
1299
+
1300
+ # sliding window state reset
1301
+ old_unit_count = len(self._unit_history) if hasattr(self, "_unit_history") else 0
1302
+ self._unit_history = []
1303
+ self._next_unit_id = 0
1304
+ self._pending_unit_id = None
1305
+ self._pending_unit_start_cache_len = 0
1306
+ self._system_preserve_length = 0
1307
+ self._position_offset = 0
1308
+ self._rope_inv_freq_cache = {}
1309
+
1310
+ # context preserving sliding window state reset
1311
+ self._preserve_prefix_length = 0
1312
+ self._previous_content_length = 0
1313
+ self._suffix_token_ids = []
1314
+ self._previous_marker = "\n\nprevious: "
1315
+ self._previous_marker_token_ids = []
1316
+ self._has_previous = False
1317
+ self._previous_text = ""
1318
+ self._previous_token_ids = []
1319
+
1320
+ # validation statistics
1321
+ self._sliding_event_count = 0 # sliding window trigger count
1322
+ self._total_dropped_tokens = 0 # total dropped token count
1323
+ self._total_dropped_units = 0 # total dropped unit count
1324
+
1325
+ def get_cache_length(self) -> int:
1326
+ if self.cache is None:
1327
+ return 0
1328
+ if isinstance(self.cache, DynamicCache):
1329
+ if len(self.cache.key_cache) > 0 and self.cache.key_cache[0].numel() > 0:
1330
+ return self.cache.key_cache[0].shape[2]
1331
+ return 0
1332
+ # Tuple cache format
1333
+ return self.cache[0][0].shape[2]
1334
+
1335
+ def get_total_generated_tokens(self) -> int:
1336
+ return sum(len(u.get("generated_tokens", [])) for u in self._unit_history)
1337
+
1338
+ def register_unit_start(self) -> int:
1339
+ self._pending_unit_id = self._next_unit_id
1340
+ self._pending_unit_start_cache_len = self.get_cache_length()
1341
+ return self._pending_unit_id
1342
+
1343
+ def register_unit_end(
1344
+ self,
1345
+ input_type: str,
1346
+ generated_tokens: Optional[List[int]] = None,
1347
+ is_listen: bool = False,
1348
+ generated_text: Optional[str] = None,
1349
+ ):
1350
+ """Call when unit ends, record unit information
1351
+
1352
+ Should be called after feeding </unit> token
1353
+
1354
+ Args:
1355
+ input_type: "audio" / "video" / "omni" / "system"
1356
+ generated_tokens: tokens generated by the unit (token ids)
1357
+ is_listen: whether the unit is in listen state
1358
+ generated_text: text generated by the unit (used for context preserving mode)
1359
+ """
1360
+ if self._pending_unit_id is None:
1361
+ logger.warning("register_unit_end called without register_unit_start")
1362
+ return
1363
+
1364
+ # calculate the length of the unit
1365
+ current_cache_len = self.get_cache_length()
1366
+ unit_len = current_cache_len - self._pending_unit_start_cache_len
1367
+
1368
+ if unit_len > 0:
1369
+ entry = {
1370
+ "unit_id": self._pending_unit_id,
1371
+ "length": unit_len,
1372
+ "type": input_type,
1373
+ "generated_tokens": generated_tokens or [],
1374
+ "generated_text": generated_text or "", # used for context preserving mode
1375
+ "is_listen": is_listen,
1376
+ }
1377
+ self._unit_history.append(entry)
1378
+
1379
+ self._pending_unit_id = None
1380
+ self._pending_unit_start_cache_len = 0
1381
+ self._next_unit_id += 1
1382
+
1383
+ def register_system_prompt(self):
1384
+ """Call after system prompt prefill, record preserve length"""
1385
+ self._system_preserve_length = self.get_cache_length()
1386
+
1387
+ # sliding window core methods
1388
+
1389
+ def _get_rope_theta(self) -> float:
1390
+ """get model rope_theta configuration"""
1391
+ return float(getattr(self.m.config, "rope_theta", 10000.0))
1392
+
1393
+ def _drop_tokens_from_cache(self, length: int) -> bool:
1394
+ """remove specified number of tokens from cache (protect system prompt)
1395
+
1396
+ remove tokens in the range [preserve, preserve + length)
1397
+ supports DynamicCache and tuple cache formats
1398
+ """
1399
+ if self.cache is None or length <= 0:
1400
+ return False
1401
+
1402
+ cache_type = "DynamicCache" if isinstance(self.cache, DynamicCache) else "TupleCache"
1403
+ cache_len_before = self.get_cache_length()
1404
+ offset_before = self._position_offset
1405
+
1406
+ new_cache, new_offset, success = drop_tokens_from_cache(
1407
+ cache=self.cache,
1408
+ length=length,
1409
+ preserve=self._system_preserve_length,
1410
+ position_offset=self._position_offset,
1411
+ rope_theta=self._get_rope_theta(),
1412
+ inv_freq_cache=self._rope_inv_freq_cache,
1413
+ )
1414
+ if success:
1415
+ self.cache = new_cache # For DynamicCache this is the same object (in-place)
1416
+ self._position_offset = new_offset
1417
+
1418
+ return success
1419
+
1420
+ def _drop_unit(self, unit_id: int) -> bool:
1421
+ """remove specified unit"""
1422
+ entries = [u for u in self._unit_history if u["unit_id"] == unit_id]
1423
+ if not entries:
1424
+ return False
1425
+
1426
+ total_len = sum(e["length"] for e in entries)
1427
+ if total_len <= 0:
1428
+ for e in entries:
1429
+ self._unit_history.remove(e)
1430
+ return False
1431
+
1432
+ if not self._drop_tokens_from_cache(total_len):
1433
+ return False
1434
+
1435
+ for e in entries:
1436
+ self._unit_history.remove(e)
1437
+
1438
+ return True
1439
+
1440
+ def _drop_next_unit(self) -> bool:
1441
+ """remove the earliest non-system unit"""
1442
+ for entry in self._unit_history:
1443
+ unit_id = entry.get("unit_id")
1444
+ if unit_id is None:
1445
+ continue
1446
+ # skip system type
1447
+ if entry.get("type") == "system":
1448
+ continue
1449
+ if self._drop_unit(unit_id):
1450
+ return True
1451
+ return False
1452
+
1453
+ def enforce_window(self) -> bool:
1454
+ """enforce sliding window strategy (same as single-mode, only look at cache length)
1455
+
1456
+ When the cache length exceeds the high watermark, repeatedly drop the earliest unit
1457
+ until the cache length falls below the low watermark.
1458
+ """
1459
+ if not self._window_enabled:
1460
+ return False
1461
+
1462
+ cfg = self._window_config
1463
+ cache_len_before = self.get_cache_length()
1464
+
1465
+ if cache_len_before <= cfg.basic_window_high_tokens:
1466
+ return False # not above the high watermark; nothing to drop
1467
+
1468
+ dropped_count = 0
1469
+ cache_len = cache_len_before
1470
+ while cache_len > cfg.basic_window_low_tokens:
1471
+ if not self._drop_next_unit():
1472
+ break
1473
+ dropped_count += 1
1474
+ cache_len = self.get_cache_length()
1475
+
1476
+ if dropped_count > 0:
1477
+ # update statistics counters
1478
+ self._sliding_event_count += 1
1479
+ self._total_dropped_tokens += cache_len_before - cache_len
1480
+ self._total_dropped_units += dropped_count
1481
+
1482
+ # consistency check
1483
+ expected = self._system_preserve_length + sum(u["length"] for u in self._unit_history)
1484
+ is_consistent = expected == cache_len
1485
+ if not is_consistent:
1486
+ logger.error(
1487
+ "CONSISTENCY ERROR! preserve=%d + sum(units)=%d != cache=%d, offset=%d",
1488
+ self._system_preserve_length,
1489
+ sum(u["length"] for u in self._unit_history),
1490
+ cache_len,
1491
+ self._position_offset,
1492
+ )
1493
+
1494
+ return dropped_count > 0
1495
+
1496
+ # context preserving sliding window methods
1497
+
1498
+ def register_system_prompt_with_context(
1499
+ self,
1500
+ suffix_token_ids: Optional[List[int]] = None,
1501
+ context_previous_marker: str = "\n\nprevious: ",
1502
+ ):
1503
+ """register system prompt (with context preserving mode)
1504
+
1505
+ initial cache layout: [prefix] [suffix] [units...]
1506
+ after first sliding window: [prefix] [context_previous_marker + content] [suffix] [units...]
1507
+
1508
+ when calling this method, cache should only have prefix (without previous marker)
1509
+ suffix will be fed in later
1510
+
1511
+ Args:
1512
+ suffix_token_ids: suffix token ids (e.g. id of <|im_end|>)
1513
+ context_previous_marker: previous marker prefix, e.g. "\\n\\nprevious: "
1514
+ """
1515
+ # prefix = current cache content (fixed, without previous marker)
1516
+ self._preserve_prefix_length = self.get_cache_length()
1517
+ self._previous_content_length = 0 # initially no previous content
1518
+ self._suffix_token_ids = suffix_token_ids or []
1519
+ # total preserve length = prefix + suffix (initially no previous)
1520
+ self._system_preserve_length = self._preserve_prefix_length + len(self._suffix_token_ids)
1521
+
1522
+ # initialize previous related states
1523
+ self._previous_marker = context_previous_marker
1524
+ self._previous_marker_token_ids = (
1525
+ self.tokenizer.encode(context_previous_marker, add_special_tokens=False) if self.tokenizer else []
1526
+ )
1527
+ self._has_previous = False
1528
+ self._previous_text = ""
1529
+ self._previous_token_ids = []
1530
+
1531
+ def _extract_generated_text(self, units: List[Dict[str, Any]]) -> Tuple[str, List[int]]:
1532
+ """extract generated text and token ids from units
1533
+
1534
+ Args:
1535
+ units: list of units to extract
1536
+
1537
+ Returns:
1538
+ (text, token_ids): concatenated text and token ids (filtered out special tokens)
1539
+ """
1540
+ text_parts = []
1541
+ token_ids = []
1542
+
1543
+ for u in units:
1544
+ # only keep generated content of non-listen units
1545
+ if u.get("is_listen", False):
1546
+ continue
1547
+ gen_text = u.get("generated_text", "")
1548
+ gen_tokens = u.get("generated_tokens", [])
1549
+
1550
+ # filter out special tokens from text
1551
+ if gen_text:
1552
+ clean_text = gen_text
1553
+ for st in self._all_special_tokens_text:
1554
+ clean_text = clean_text.replace(st, "")
1555
+ if clean_text.strip():
1556
+ text_parts.append(clean_text)
1557
+
1558
+ # filter out special tokens
1559
+ if gen_tokens:
1560
+ filtered_tokens = [t for t in gen_tokens if t not in self._all_special_ids]
1561
+ token_ids.extend(filtered_tokens)
1562
+
1563
+ return "".join(text_parts), token_ids
1564
+
1565
+ def _rebuild_cache_with_previous(
1566
+ self,
1567
+ new_previous_tokens: List[int],
1568
+ units_to_keep_len: Optional[int] = None,
1569
+ ) -> bool:
1570
+ """rebuild cache, insert new previous content between prefix and suffix
1571
+
1572
+ cache layout change:
1573
+ [prefix] [old_prev] [suffix] [old_units] → [prefix] [new_prev] [suffix] [remaining_units]
1574
+
1575
+ Args:
1576
+ new_previous_tokens: new previous token ids
1577
+ units_to_keep_len: length of units to keep (from cache end backwards)
1578
+ if None, calculate based on unit_history
1579
+
1580
+ Returns:
1581
+ whether successful rebuild
1582
+ """
1583
+ if self.cache is None:
1584
+ return False
1585
+
1586
+ old_previous_len = self._previous_content_length
1587
+ new_previous_len = len(new_previous_tokens)
1588
+ suffix_len = len(self._suffix_token_ids)
1589
+ total_cache_len = self.get_cache_length()
1590
+
1591
+ # calculate length of units to keep
1592
+ if units_to_keep_len is None:
1593
+ units_to_keep_len = sum(u["length"] for u in self._unit_history)
1594
+
1595
+ # special case: if "previous" is unchanged (both old and new are empty), the prefix+suffix part of the cache need not be rebuilt,
1596
+ # but the units' RoPE still has to be reindexed (a unit was deleted, so positions shifted)
1597
+ if new_previous_len == 0 and old_previous_len == 0:
1598
+ # cache layout: [prefix(7)] [suffix(1)] [units...]
1599
+ # only keep prefix + suffix + remaining_units
1600
+ preserve_len = self._preserve_prefix_length + suffix_len
1601
+
1602
+ # simply slice cache: [prefix+suffix] + [remaining_units]
1603
+ # remaining_units in cache end
1604
+ if units_to_keep_len > 0:
1605
+ # [0:preserve_len] + [total-units_to_keep_len:total]
1606
+ prefix_suffix_cache = self._slice_cache(0, preserve_len)
1607
+ units_cache = self._slice_cache(total_cache_len - units_to_keep_len, None)
1608
+
1609
+ # calculate number of dropped tokens
1610
+ dropped_tokens = total_cache_len - preserve_len - units_to_keep_len
1611
+
1612
+ # reindex units RoPE: position from (preserve_len + dropped_tokens) to preserve_len
1613
+ # note: no position_offset, because cache position has been compressed (from 0 start)
1614
+ if dropped_tokens > 0:
1615
+ old_start = preserve_len + dropped_tokens
1616
+ new_start = preserve_len
1617
+ units_cache = self._reindex_rope_for_cache(units_cache, old_start, new_start, units_to_keep_len)
1618
+
1619
+ self.cache = self._concat_caches(prefix_suffix_cache, units_cache)
1620
+ else:
1621
+ self.cache = self._slice_cache(0, preserve_len)
1622
+
1623
+ return True
1624
+
1625
+ # 1. get prefix cache (fixed)
1626
+ prefix_end = self._preserve_prefix_length
1627
+ prefix_cache = self._slice_cache(0, prefix_end)
1628
+
1629
+ # 2. get units cache to keep (from end)
1630
+ units_start_in_old_cache = total_cache_len - units_to_keep_len
1631
+ units_cache = None
1632
+ if units_to_keep_len > 0:
1633
+ units_cache = self._slice_cache(units_start_in_old_cache, None)
1634
+
1635
+ # 3. calculate new previous + suffix cache (needs forward)
1636
+ # merge previous tokens and suffix tokens
1637
+ prev_suffix_tokens = new_previous_tokens + self._suffix_token_ids
1638
+ prev_suffix_len = len(prev_suffix_tokens)
1639
+
1640
+ new_prefix_prev_suffix_cache = prefix_cache
1641
+ if prev_suffix_len > 0:
1642
+ # Embed tokens
1643
+ prev_suffix_embeds = self.embed_tokens(prev_suffix_tokens)
1644
+ # calculate start position (after prefix)
1645
+ start_pos = self._preserve_prefix_length + self._position_offset
1646
+
1647
+ # forward calculate KV cache
1648
+ with torch.no_grad():
1649
+ device = prev_suffix_embeds.device
1650
+ position_ids = torch.arange(
1651
+ start_pos,
1652
+ start_pos + prev_suffix_len,
1653
+ device=device,
1654
+ ).unsqueeze(0)
1655
+
1656
+ # use prefix cache as past_key_values
1657
+ outputs = self.m(
1658
+ inputs_embeds=(
1659
+ prev_suffix_embeds.unsqueeze(0) if prev_suffix_embeds.dim() == 2 else prev_suffix_embeds
1660
+ ),
1661
+ position_ids=position_ids,
1662
+ past_key_values=prefix_cache,
1663
+ use_cache=True,
1664
+ return_dict=True,
1665
+ )
1666
+ # new cache contains prefix + new_previous + suffix
1667
+ new_prefix_prev_suffix_cache = outputs.past_key_values
1668
+
1669
+ # 4. adjust units cache RoPE
1670
+ # new layout: [prefix] [new_prev] [suffix] [units]
1671
+ # note: no position_offset, because cache position has been compressed (from 0 start)
1672
+ new_system_total = prefix_end + new_previous_len + suffix_len
1673
+ if units_cache is not None and self._get_cache_len(units_cache) > 0:
1674
+ old_start = units_start_in_old_cache
1675
+ new_start = new_system_total
1676
+
1677
+ if old_start != new_start:
1678
+ units_cache = self._reindex_rope_for_cache(units_cache, old_start, new_start, units_to_keep_len)
1679
+
1680
+ # 5. concatenate new cache
1681
+ if units_cache is not None and self._get_cache_len(units_cache) > 0:
1682
+ self.cache = self._concat_caches(new_prefix_prev_suffix_cache, units_cache)
1683
+ else:
1684
+ self.cache = new_prefix_prev_suffix_cache
1685
+
1686
+ # 6. update length
1687
+ self._previous_content_length = new_previous_len
1688
+ # total preserve length = prefix + previous + suffix
1689
+ self._system_preserve_length = prefix_end + new_previous_len + suffix_len
1690
+
1691
+ # build preview strings describing the new cache layout (handy for debug logging)
1692
+ prev_text_preview = self._previous_text[:50] + "..." if len(self._previous_text) > 50 else self._previous_text
1693
+ suffix_preview = self.tokenizer.decode(self._suffix_token_ids) if self._suffix_token_ids else ""
1694
+ return True
1695
+
1696
+ def _slice_cache(self, start: int, end: Optional[int], clone: bool = True):
1697
+ """slice cache
1698
+
1699
+ Args:
1700
+ start: start position
1701
+ end: end position (None means to end)
1702
+ clone: whether to clone (default True, to prevent shared memory issues)
1703
+ """
1704
+ if self.cache is None:
1705
+ return None
1706
+ if isinstance(self.cache, DynamicCache):
1707
+ # DynamicCache
1708
+ new_key_cache = [
1709
+ k[:, :, start:end, :].clone() if clone else k[:, :, start:end, :] for k in self.cache.key_cache
1710
+ ]
1711
+ new_value_cache = [
1712
+ v[:, :, start:end, :].clone() if clone else v[:, :, start:end, :] for v in self.cache.value_cache
1713
+ ]
1714
+ new_cache = DynamicCache()
1715
+ new_cache.key_cache = new_key_cache
1716
+ new_cache.value_cache = new_value_cache
1717
+ return new_cache
1718
+ else:
1719
+ # Tuple cache
1720
+ if clone:
1721
+ return tuple(
1722
+ (layer[0][:, :, start:end, :].clone(), layer[1][:, :, start:end, :].clone()) for layer in self.cache
1723
+ )
1724
+ else:
1725
+ return tuple((layer[0][:, :, start:end, :], layer[1][:, :, start:end, :]) for layer in self.cache)
1726
+
1727
+ @staticmethod
1728
+ def _get_cache_len(cache) -> int:
1729
+ if cache is None:
1730
+ return 0
1731
+ if isinstance(cache, DynamicCache):
1732
+ if len(cache.key_cache) > 0 and cache.key_cache[0].numel() > 0:
1733
+ return cache.key_cache[0].shape[2]
1734
+ return 0
1735
+
1736
+ if cache and cache[0] and cache[0][0] is not None:
1737
+ return cache[0][0].shape[2]
1738
+ return 0
1739
+
1740
+ @staticmethod
1741
+ def _concat_caches(cache1, cache2):
1742
+ if cache1 is None:
1743
+ return cache2
1744
+ if cache2 is None:
1745
+ return cache1
1746
+
1747
+ if isinstance(cache1, DynamicCache):
1748
+ new_cache = DynamicCache()
1749
+ new_cache.key_cache = [torch.cat([k1, k2], dim=2) for k1, k2 in zip(cache1.key_cache, cache2.key_cache)]
1750
+ new_cache.value_cache = [
1751
+ torch.cat([v1, v2], dim=2) for v1, v2 in zip(cache1.value_cache, cache2.value_cache)
1752
+ ]
1753
+ return new_cache
1754
+ else:
1755
+ return tuple(
1756
+ (
1757
+ torch.cat([layer1[0], layer2[0]], dim=2),
1758
+ torch.cat([layer1[1], layer2[1]], dim=2),
1759
+ )
1760
+ for layer1, layer2 in zip(cache1, cache2)
1761
+ )
1762
+
1763
+ def _reindex_rope_for_cache(self, cache, old_start: int, new_start: int, length: int):
1764
+ """reindex RoPE position for cache"""
1765
+ if cache is None or length <= 0:
1766
+ return cache
1767
+
1768
+ if isinstance(cache, DynamicCache):
1769
+ device = cache.key_cache[0].device if cache.key_cache else None
1770
+ else:
1771
+ device = cache[0][0].device if cache and cache[0] else None
1772
+
1773
+ if device is None:
1774
+ return cache
1775
+
1776
+ old_positions = torch.arange(old_start, old_start + length, device=device, dtype=torch.long)
1777
+ new_positions = torch.arange(new_start, new_start + length, device=device, dtype=torch.long)
1778
+
1779
+ rope_theta = self._get_rope_theta()
1780
+
1781
+ if isinstance(cache, DynamicCache):
1782
+ new_key_cache = []
1783
+ for k in cache.key_cache:
1784
+ new_k = realign_rotary_suffix(k, old_positions, new_positions, rope_theta, self._rope_inv_freq_cache)
1785
+ new_key_cache.append(new_k)
1786
+ cache.key_cache = new_key_cache
1787
+ return cache
1788
+ else:
1789
+ new_cache = []
1790
+ for layer in cache:
1791
+ new_k = realign_rotary_suffix(
1792
+ layer[0], old_positions, new_positions, rope_theta, self._rope_inv_freq_cache
1793
+ )
1794
+ new_cache.append((new_k, layer[1]))
1795
+ return tuple(new_cache)
1796
+
1797
+ def _update_previous(
1798
+ self,
1799
+ new_text: str,
1800
+ new_tokens: List[int],
1801
+ max_tokens: int,
1802
+ ) -> None:
1803
+ """update previous context (also update cache)
1804
+
1805
+ On the first sliding-window event the marker + text is added dynamically; later events only append text.
1806
+ When the content exceeds max_tokens it is truncated from the left (the marker is kept).
1807
+ rebuild cache to maintain consistency
1808
+
1809
+ Args:
1810
+ new_text: new text
1811
+ new_tokens: new token ids
1812
+ max_tokens: maximum token count of the previous content (excluding the marker)
1813
+ """
1814
+ marker_len = len(self._previous_marker_token_ids)
1815
+ tokens_to_drop = 0
1816
+
1817
+ # if no new content, do not add marker, but still need to rebuild cache
1818
+ if not new_tokens and not new_text:
1819
+ # still need to rebuild cache (because a unit was deleted)
1820
+ self._rebuild_cache_with_previous(self._previous_token_ids)
1821
+ return
1822
+
1823
+ if not self._has_previous:
1824
+ # first time there is actual content: add marker + text
1825
+ self._previous_text = new_text
1826
+ self._previous_token_ids = self._previous_marker_token_ids.copy() + new_tokens
1827
+ self._has_previous = True
1828
+ else:
1829
+ # subsequent sliding window: append text to previous
1830
+ self._previous_text += new_text
1831
+ self._previous_token_ids.extend(new_tokens)
1832
+
1833
+ # calculate token count of content (without marker)
1834
+ content_token_count = len(self._previous_token_ids) - marker_len
1835
+
1836
+ # check if need to truncate content (keep marker)
1837
+ if content_token_count > max_tokens:
1838
+ # truncate left content, keep marker + latest max_tokens content
1839
+ tokens_to_drop = content_token_count - max_tokens
1840
+ old_text = self._previous_text
1841
+ # keep marker + truncated content
1842
+ content_tokens = self._previous_token_ids[marker_len + tokens_to_drop :]
1843
+ self._previous_token_ids = self._previous_marker_token_ids.copy() + content_tokens
1844
+ # redecode text (only decode content part)
1845
+ try:
1846
+ self._previous_text = self.tokenizer.decode(
1847
+ content_tokens,
1848
+ skip_special_tokens=True,
1849
+ )
1850
+ except Exception as e:
1851
+ logger.warning("_update_previous: decode failed: %s", e)
1852
+
1853
+ # rebuild cache
1854
+ self._rebuild_cache_with_previous(self._previous_token_ids)
1855
+
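The truncation bookkeeping above amounts to simple list slicing: the marker tokens are always kept and the content is trimmed from the left. A toy illustration with made-up token ids:

```python
marker = [101, 102]                       # hypothetical marker token ids
previous = marker + [1, 2, 3, 4, 5, 6, 7]
max_tokens = 4

content_count = len(previous) - len(marker)       # 7
if content_count > max_tokens:
    drop = content_count - max_tokens             # 3
    content = previous[len(marker) + drop:]       # [4, 5, 6, 7]
    previous = marker + content
print(previous)                                   # [101, 102, 4, 5, 6, 7]
```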
1856
+ def _drop_unit_with_context(
1857
+ self,
1858
+ unit_id: int,
1859
+ max_previous_tokens: int,
1860
+ ) -> Tuple[bool, str, List[int]]:
1861
+ """remove specified unit and return its generated content (for context preserving)
1862
+
1863
+ process:
1864
+ 1. extract generated content of unit
1865
+ 2. remove the unit from the cache (the prefix and previous context are left untouched)
1866
+ 3. append generated content to previous
1867
+ 4. rebuild cache (in _update_previous)
1868
+
1869
+ Args:
1870
+ unit_id: unit ID to remove
1871
+ max_previous_tokens: previous maximum token count
1872
+
1873
+ Returns:
1874
+ (success, extracted_text, extracted_tokens): whether successful, extracted text and tokens
1875
+ """
1876
+ entries = [u for u in self._unit_history if u["unit_id"] == unit_id]
1877
+ if not entries:
1878
+ return False, "", []
1879
+
1880
+ # extract generated content
1881
+ extracted_text, extracted_tokens = self._extract_generated_text(entries)
1882
+
1883
+ # calculate total length
1884
+ total_len = sum(e["length"] for e in entries)
1885
+ if total_len <= 0:
1886
+ for e in entries:
1887
+ self._unit_history.remove(e)
1888
+ return False, extracted_text, extracted_tokens
1889
+
1890
+ cache_before = self.get_cache_length()
1891
+
1892
+ # remove from unit_history (record for later processing)
1893
+ for e in entries:
1894
+ self._unit_history.remove(e)
1895
+
1896
+ # note: we no longer call _drop_tokens_from_cache here,
1897
+ # because _update_previous rebuilds the entire cache
1898
+
1899
+ # update previous (also rebuild cache)
1900
+ self._update_previous(extracted_text, extracted_tokens, max_previous_tokens)
1901
+
1902
+ return True, extracted_text, extracted_tokens
1903
+
1904
+ def _drop_next_unit_with_context(self, max_previous_tokens: int) -> bool:
1905
+ """remove the earliest non-system unit (with context preserving)"""
1906
+ for entry in self._unit_history:
1907
+ unit_id = entry.get("unit_id")
1908
+ if unit_id is None:
1909
+ continue
1910
+ if entry.get("type") == "system":
1911
+ continue
1912
+ success, _, _ = self._drop_unit_with_context(unit_id, max_previous_tokens)
1913
+ if success:
1914
+ return True
1915
+ return False
1916
+
1917
+ def enforce_window_with_context(self) -> bool:
1918
+ """context preserving sliding window execution
1919
+
1920
+ when unit count exceeds max_units, remove the earliest unit,
1921
+ and accumulate its generated content to previous.
1922
+ Cache will be automatically rebuilt in _update_previous.
1923
+
1924
+ Returns:
1925
+ whether sliding window is executed
1926
+ """
1927
+ if not self._window_enabled:
1928
+ return False
1929
+
1930
+ cfg = self._window_config
1931
+
1932
+ if cfg.sliding_window_mode != "context":
1933
+ # not context mode: fall back to the basic sliding window
1934
+ return self.enforce_window()
1935
+
1936
+ cache_len_before = self.get_cache_length()
1937
+ units_before = len(self._unit_history)
1938
+
1939
+ # context preserving mode: only check if unit count exceeds limit
1940
+ # (if previous exceeds its limit, _update_previous truncates it from the left automatically)
1941
+ if units_before <= cfg.context_max_units:
1942
+ return False
1943
+
1944
+ # sliding window loop: remove unit until count ≤ max_units
1945
+ dropped_count = 0
1946
+ while len(self._unit_history) > cfg.context_max_units:
1947
+ if not self._drop_next_unit_with_context(cfg.context_previous_max_tokens):
1948
+ break
1949
+
1950
+ dropped_count += 1
1951
+
1952
+ cache_len_after = self.get_cache_length()
1953
+
1954
+ if dropped_count > 0:
1955
+ # update statistics counter
1956
+ self._sliding_event_count += 1
1957
+ self._total_dropped_tokens += cache_len_before - cache_len_after
1958
+ self._total_dropped_units += dropped_count
1959
+
1960
+ # consistency check: cache length must equal system prefix + remaining unit lengths
1961
+ if not self._verify_consistency():
+ logger.warning("enforce_window_with_context: cache length inconsistent after sliding window")
1962
+
1963
+ return dropped_count > 0
1964
+
1965
+ def get_previous_context(self) -> Tuple[str, List[int]]:
1966
+ """get current accumulated previous context
1967
+
1968
+ Returns:
1969
+ (previous_text, previous_token_ids): current accumulated text and token ids
1970
+ """
1971
+ return self._previous_text, self._previous_token_ids.copy()
1972
+
1973
+ def get_window_stats(self) -> Dict[str, Any]:
1974
+ """get sliding window statistics"""
1975
+ unit_lengths = [u["length"] for u in self._unit_history]
1976
+ return {
1977
+ "cache_length": self.get_cache_length(),
1978
+ "unit_count": len(self._unit_history),
1979
+ "unit_lengths": unit_lengths,
1980
+ "unit_total_length": sum(unit_lengths),
1981
+ "system_preserve_length": self._system_preserve_length,
1982
+ "position_offset": self._position_offset,
1983
+ "window_enabled": self._window_enabled,
1984
+ "total_generated_tokens": self.get_total_generated_tokens(),
1985
+ "pending_unit_id": self._pending_unit_id,
1986
+ "next_unit_id": self._next_unit_id,
1987
+ "config": {
1988
+ "sliding_window_mode": self._window_config.sliding_window_mode,
1989
+ "basic_window_high_tokens": self._window_config.basic_window_high_tokens,
1990
+ "basic_window_low_tokens": self._window_config.basic_window_low_tokens,
1991
+ "context_previous_max_tokens": self._window_config.context_previous_max_tokens,
1992
+ "context_max_units": self._window_config.context_max_units,
1993
+ },
1994
+ # context preserving related
1995
+ "preserve_prefix_length": self._preserve_prefix_length,
1996
+ "previous_content_length": self._previous_content_length,
1997
+ "suffix_token_count": len(self._suffix_token_ids),
1998
+ "previous_text_length": len(self._previous_text),
1999
+ "previous_token_count": len(self._previous_token_ids),
2000
+ "has_system_template": self._system_prompt_template is not None,
2001
+ }
2002
+
2003
+ def _verify_consistency(self) -> bool:
2004
+ """verify unit history and cache length consistency"""
2005
+ expected = self._system_preserve_length + sum(u["length"] for u in self._unit_history)
2006
+ actual = self.get_cache_length()
2007
+ return expected == actual
2008
+
2009
+ def print_verification_summary(self) -> Dict[str, Any]:
2010
+ """print verification summary (for comparing off/basic/context mode)
2011
+
2012
+ Returns:
2013
+ dictionary containing key verification data
2014
+ """
2015
+ cfg = self._window_config
2016
+
2017
+ # collect all generated text
2018
+ all_generated_text = []
2019
+ all_generated_tokens = []
2020
+ for u in self._unit_history:
2021
+ if not u.get("is_listen", False):
2022
+ gen_text = u.get("generated_text", "")
2023
+ gen_tokens = u.get("generated_tokens", [])
2024
+ if gen_text:
2025
+ all_generated_text.append(gen_text)
2026
+ if gen_tokens:
2027
+ all_generated_tokens.extend(gen_tokens)
2028
+
2029
+ combined_text = "".join(all_generated_text)
2030
+
2031
+ summary = {
2032
+ "mode": cfg.sliding_window_mode,
2033
+ "final_cache_length": self.get_cache_length(),
2034
+ "final_unit_count": len(self._unit_history),
2035
+ "sliding_event_count": self._sliding_event_count,
2036
+ "total_dropped_tokens": self._total_dropped_tokens,
2037
+ "total_dropped_units": self._total_dropped_units,
2038
+ "total_generated_tokens": len(all_generated_tokens),
2039
+ "generated_text": combined_text,
2040
+ "previous_text": self._previous_text,
2041
+ "previous_token_count": len(self._previous_token_ids),
2042
+ "position_offset": self._position_offset,
2043
+ "system_preserve_length": self._system_preserve_length,
2044
+ }
2045
+
2046
+ return summary
2047
+
2048
+ def set_window_config(self, config: DuplexWindowConfig) -> None:
2049
+ """set sliding window configuration"""
2050
+ self._window_config = config
2051
+
2052
+ def set_window_enabled(self, enabled: bool) -> None:
2053
+ """enable/disable sliding window"""
2054
+ old_enabled = self._window_enabled
2055
+ self._window_enabled = enabled
2056
+
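A hedged usage sketch for enabling the context-preserving window. The keyword arguments mirror the fields read back in `get_window_stats` above, but the exact `DuplexWindowConfig` constructor signature and the `session` variable are assumptions:

```python
# `session` is assumed to be an instance of this duplex session class.
cfg = DuplexWindowConfig(
    sliding_window_mode="context",
    context_max_units=8,
    context_previous_max_tokens=512,
)
session.set_window_config(cfg)
session.set_window_enabled(True)

# After each completed unit, trigger the window and inspect the bookkeeping.
slid = session.enforce_window_with_context()
stats = session.get_window_stats()
print(slid, stats["unit_count"], stats["cache_length"], stats["previous_token_count"])
```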
2057
+ def get_context(self):
2058
+ return self.context
2059
+
2060
+ def embed_token(self, tid):
2061
+ if isinstance(tid, int):
2062
+ tid = torch.tensor([tid], device=self.m.device)
2063
+ return self.m.model.embed_tokens(tid)
2064
+
2065
+ def embed_tokens(self, token_ids: List[int]) -> torch.Tensor:
2066
+ """batch embed multiple tokens
2067
+
2068
+ Args:
2069
+ token_ids: list of token ids
2070
+
2071
+ Returns:
2072
+ embeddings tensor [L, H]
2073
+ """
2074
+ if not token_ids:
2075
+ return torch.empty(0, self.m.config.hidden_size, device=self.m.device)
2076
+ tids = torch.tensor(token_ids, device=self.m.device)
2077
+ return self.m.model.embed_tokens(tids)
2078
+
2079
+ @torch.no_grad()
2080
+ def feed(self, embeds: torch.Tensor, return_logits: bool = False):
2081
+ """
2082
+ embeds : [L, H] —— new embedding sequence fed into model at once
2083
+ """
2084
+ L = embeds.size(0)
2085
+ device = embeds.device
2086
+
2087
+ past_len = self.get_cache_length()
2088
+ pos_ids = torch.arange(past_len, past_len + L, device=device).unsqueeze(0) # [1, L]
2089
+
2090
+ out = self.m(
2091
+ inputs_embeds=embeds.unsqueeze(0), # [1, L, H]
2092
+ position_ids=pos_ids,
2093
+ past_key_values=self.cache,
2094
+ # use_cache = True,
2095
+ return_dict=True,
2096
+ output_hidden_states=True,
2097
+ # attention_mask=attention_mask
2098
+ )
2099
+ self.cache = out.past_key_values
2100
+
2101
+ if return_logits:
2102
+ logits = self.m.lm_head(out.hidden_states[-1])[:, -1] # [1, vocab]
2103
+ return logits, out.hidden_states[-1]
2104
+
2105
+ @torch.no_grad()
2106
+ def decode(
2107
+ self,
2108
+ logits,
2109
+ mode: Literal["sampling", "greedy"] = "sampling",
2110
+ temperature=0.7,
2111
+ top_k=20,
2112
+ top_p=0.8,
2113
+ listen_top_k=None,
2114
+ listen_prob_scale=1.0,
2115
+ text_repetition_penalty=1.05,
2116
+ text_repetition_window_size=512,
2117
+ ):
2118
+ """
2119
+ Args:
2120
+ logits:
2121
+ mode: sampling or greedy
2122
+ temperature:
2123
+ top_k:
2124
+ top_p:
2125
+ listen_top_k: force listen_id to be in top-k to keep
2126
+ listen_prob_scale: multiply listen_id probability by a weight (<1 means decrease, >1 means increase)
2127
+ text_repetition_penalty: repetition penalty coefficient, >1.0 means decrease repetition, <1.0 means increase repetition
2128
+ text_repetition_window_size: repetition penalty window size
2129
+
2130
+ Sampling strategy:
2131
+ 1. first sample all tokens with original logits (apply temperature)
2132
+ 2. if sampled chunk_eos, return directly (keep the original model's decision of when to stop)
2133
+ 3. if not sampled chunk_eos, mask it (set logit to -inf), continue sampling text tokens
2134
+ 4. apply repetition penalty, top-k, top-p, etc. to the text tokens for the final sampling
2135
+ """
2136
+
2137
+ logits = logits.clone()
2138
+
2139
+ # 0. independently check chunk_eos before sampling
2140
+ eos_id = self.chunk_eos_id
2141
+
2142
+ with torch.no_grad():
2143
+ if mode == "greedy":
2144
+ sampled_token = torch.argmax(logits[0]).item()
2145
+ else:
2146
+ original_probs = F.softmax(logits[0], dim=-1)
2147
+ sampled_token = torch.multinomial(original_probs, num_samples=1).item()
2148
+
2149
+ # if sampled chunk_eos, return directly
2150
+ if sampled_token == eos_id:
2151
+ next_token_id = torch.tensor([eos_id], device=logits.device)
2152
+ next_token_str = self.tokenizer.decode(next_token_id)
2153
+
2154
+ return next_token_id
2155
+
2156
+ # chunk_eos was not sampled: mask it (and any other forbidden tokens) by setting their logits to -inf
2157
+ if self.forbidden_token_ids:
2158
+ logits[:, self.forbidden_token_ids] = float("-inf")
2159
+
2160
+ # 1. apply repetition penalty
2161
+ if text_repetition_penalty != 1.0 and len(self.generated_tokens) > 0:
2162
+ # get recently generated (non-special) tokens within the penalty window
2163
+ recent_tokens = self.generated_tokens[-text_repetition_window_size:]
2164
+
2165
+ # deduplicate so each token is penalized once
2166
+ recent_tokens = list(set(recent_tokens))
2167
+
2168
+ # apply penalty to repeated tokens
2169
+ for token_id in recent_tokens:
2170
+ if token_id < logits.size(-1): # ensure token_id is in vocabulary range
2171
+ if text_repetition_penalty > 1.0:
2172
+ # penalize repetition: decrease logits
2173
+ logits[0, token_id] /= text_repetition_penalty
2174
+ else:
2175
+ # encourage repetition: increase logits
2176
+ logits[0, token_id] *= 1.0 / text_repetition_penalty
2177
+
2178
+ if listen_prob_scale != 1.0: # modify listen token logit separately
2179
+ logits[0, self.listen_id] *= listen_prob_scale
2180
+
2181
+ listen_rank = (logits[0] > logits[0, self.listen_id]).sum().item()
2182
+
2183
+ if listen_top_k is not None and listen_rank < listen_top_k: # listen_id is in top-k, return directly
2184
+ next_token_id = torch.tensor([self.listen_id], device=logits.device)
2185
+ next_token_str = self.tokenizer.decode(next_token_id)
2186
+
2187
+ if next_token_str == "<|listen|>":
2188
+ self.context += " "
2189
+ else:
2190
+ self.context += next_token_str
2191
+
2192
+ return next_token_id
2193
+
2194
+ if mode == "greedy":
2195
+ next_token_id = torch.argmax(logits, dim=-1)
2196
+ elif mode == "sampling":
2197
+ logits = logits / temperature
2198
+ logits = top_k_top_p_filtering(logits, top_k=top_k, top_p=top_p)
2199
+ probs = F.softmax(logits, dim=-1)
2200
+ next_token_id = torch.multinomial(probs, num_samples=1).squeeze(1)
2201
+ else:
2202
+ raise ValueError(f"Unsupported decode mode: {mode}")
2203
+
2204
+ if next_token_id.item() not in self.special_token_ids:
2205
+ self.generated_tokens.append(next_token_id.item())
2206
+ else:
2207
+ self.generated_special_tokens.append(next_token_id.item())
2208
+
2209
+ return next_token_id
2210
+
2211
+
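Taken together, `embed_tokens`, `feed`, and `decode` form the inner streaming loop. A minimal sketch, assuming `session` is an instance of this class and `prompt_ids` is a list of already-tokenized input ids:

```python
# Prefill: embed the prompt and run one forward pass to get next-token logits.
embeds = session.embed_tokens(prompt_ids)             # [L, H]
logits, _ = session.feed(embeds, return_logits=True)  # [1, vocab]

generated = []
for _ in range(64):  # illustrative cap on chunk length
    next_id = session.decode(logits, mode="sampling", temperature=0.7, top_k=20, top_p=0.8)
    if next_id.item() == session.chunk_eos_id:
        break
    generated.append(next_id.item())
    # Feed the chosen token back in to advance the cache by one position.
    logits, _ = session.feed(session.embed_token(next_id.item()), return_logits=True)

print(session.tokenizer.decode(generated))
```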
2212
+ def _download_url_to_tempfile(url: str, suffix: str = "", timeout: int = 60) -> str:
2213
+ """
2214
+ Download a URL to a temporary file and return the path.
2215
+
2216
+ Args:
2217
+ url: HTTP/HTTPS URL to download
2218
+ suffix: File suffix (e.g., ".jpg", ".wav", ".mp4")
2219
+ timeout: Download timeout in seconds
2220
+
2221
+ Returns:
2222
+ Path to the downloaded temporary file
2223
+ """
2224
+ import tempfile
2225
+
2226
+ import requests
2227
+
2228
+ response = requests.get(url, timeout=timeout)
2229
+ response.raise_for_status()
2230
+
2231
+ with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as f:
2232
+ f.write(response.content)
2233
+ return f.name
2234
+
2235
+
2236
+ def _is_url(path: str) -> bool:
2237
+ return path.startswith(("http://", "https://"))
2238
+
2239
+
2240
+ def normalize_content_item(item) -> Union[str, Any, List[Any]]:
2241
+ """Normalize structured content item to native format.
2242
+
2243
+ Supports:
2244
+ - Native format: str, PIL.Image, np.ndarray (pass through)
2245
+ - OpenAI structured format:
2246
+ - {"type": "text", "text": "..."} -> str
2247
+ - {"type": "image_url", "image_url": {"url": "..."}} -> PIL.Image
2248
+ - {"type": "audio_url", "audio_url": {"url": "..."}} -> np.ndarray
2249
+ - {"type": "video_url", "video_url": {"url": "...", ...}} -> List[Image, ndarray, ...]
2250
+
2251
+ URL formats supported:
2252
+ - Local file path: "/path/to/file.jpg"
2253
+ - HTTP/HTTPS URL: "https://example.com/image.jpg"
2254
+
2255
+ Args:
2256
+ item: Content item to normalize
2257
+
2258
+ Returns:
2259
+ Normalized item. For video_url, returns a tuple ("__video_contents__", list)
2260
+ that will be flattened by normalize_content().
2261
+
2262
+ Raises:
2263
+ ValueError: If content type is unknown or unsupported
2264
+ """
2265
+ import os
2266
+
2267
+ import numpy as np
2268
+ from PIL import Image
2269
+
2270
+ if isinstance(item, str):
2271
+ return item
2272
+ if isinstance(item, Image.Image):
2273
+ return item
2274
+ if isinstance(item, np.ndarray):
2275
+ return item
2276
+
2277
+ if isinstance(item, dict):
2278
+ item_type = item.get("type")
2279
+
2280
+ if item_type == "text":
2281
+ return item.get("text", "")
2282
+
2283
+ elif item_type == "image_url":
2284
+ image_url_obj = item.get("image_url", {})
2285
+ url = image_url_obj.get("url", "") if isinstance(image_url_obj, dict) else image_url_obj
2286
+
2287
+ if _is_url(url):
2288
+ # Download to temp file
2289
+ temp_path = _download_url_to_tempfile(url, suffix=".jpg", timeout=30)
2290
+ img = Image.open(temp_path)
2291
+ img.load()  # force PIL to read the pixel data before the temp file is removed
+ os.unlink(temp_path)
2292
+ return img
2293
+ else:
2294
+ return Image.open(url)
2295
+ elif item_type == "audio_url":
2296
+ import librosa
2297
+
2298
+ audio_url_obj = item.get("audio_url", {})
2299
+ url = audio_url_obj.get("url", "") if isinstance(audio_url_obj, dict) else audio_url_obj
2300
+
2301
+ if _is_url(url):
2302
+ # Download to temp file
2303
+ temp_path = _download_url_to_tempfile(url, suffix=".wav", timeout=60)
2304
+ audio_np, _ = librosa.load(temp_path, sr=16000, mono=True)
2305
+ os.unlink(temp_path)
2306
+ return audio_np
2307
+ else:
2308
+ audio_np, _ = librosa.load(url, sr=16000, mono=True)
2309
+ return audio_np
2310
+ elif item_type == "video_url":
2311
+ # Video processing - returns a LIST of items (frames + audio segments)
2312
+ # Note: Unlike image_url/audio_url which return single items,
2313
+ # video_url returns a list that will be flattened into the content
2314
+ from minicpmo.utils import get_video_frame_audio_segments
2315
+
2316
+ video_url_obj = item.get("video_url", {})
2317
+ if isinstance(video_url_obj, dict):
2318
+ video_url = video_url_obj.get("url", "")
2319
+ # Get optional parameters from video_url object (OpenAI style)
2320
+ stack_frames = video_url_obj.get("stack_frames", 1)
2321
+ use_ffmpeg = video_url_obj.get("use_ffmpeg", False)
2322
+ use_audio = video_url_obj.get("use_audio", True)
2323
+ else:
2324
+ video_url = video_url_obj
2325
+ stack_frames = 1
2326
+ use_ffmpeg = False
2327
+ use_audio = True
2328
+
2329
+ # Handle HTTP/HTTPS URL - download to temp file
2330
+ temp_video_path = None
2331
+ if _is_url(video_url):
2332
+ temp_video_path = _download_url_to_tempfile(video_url, suffix=".mp4", timeout=120)
2333
+ video_path = temp_video_path
2334
+ else:
2335
+ video_path = video_url
2336
+
2337
+ # Extract frames and audio segments
2338
+ video_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(
2339
+ video_path,
2340
+ stack_frames=stack_frames,
2341
+ use_ffmpeg=use_ffmpeg,
2342
+ use_audio=use_audio
2343
+ )
2344
+
2345
+ # Clean up temp file if downloaded
2346
+ if temp_video_path is not None:
2347
+ os.unlink(temp_video_path)
2348
+
2349
+ # Build omni_contents (interleaved frames and audio, or frames only)
2350
+ omni_contents = []
2351
+ for i in range(len(video_frames)):
2352
+ omni_contents.append(video_frames[i])
2353
+ if use_audio and audio_segments is not None:
2354
+ omni_contents.append(audio_segments[i])
2355
+ if stacked_frames is not None and i < len(stacked_frames) and stacked_frames[i] is not None:
2356
+ omni_contents.append(stacked_frames[i])
2357
+
2358
+ # Return as a special marker to be flattened later
2359
+ return "__video_contents__", omni_contents
2360
+ else:
2361
+ raise ValueError(f"Unknown content type: {item_type}")
2362
+
2363
+ raise ValueError(f"Cannot normalize content item of type: {type(item)}")
2364
+
2365
+
2366
+ def normalize_content(content) -> list:
2367
+ """Normalize message content to list of native items.
2368
+
2369
+ Input formats:
2370
+ - str: "hello" -> ["hello"]
2371
+ - list of native items: [str, Image, np.ndarray] -> pass through with normalization
2372
+ - list of structured items: [{"type": "text", ...}] -> normalize each
2373
+ - video_url type: automatically expanded to omni_contents (interleaved frames and audio)
2374
+ - mixed: works too
2375
+
2376
+ Args:
2377
+ content: Message content in any supported format
2378
+
2379
+ Returns:
2380
+ List of native items (str, PIL.Image, np.ndarray)
2381
+
2382
+ Examples:
2383
+ >>> normalize_content("hello")
2384
+ ["hello"]
2385
+
2386
+ >>> normalize_content([{"type": "text", "text": "hi"}])
2387
+ ["hi"]
2388
+
2389
+ >>> normalize_content([{"type": "video", "video": "/path/to/video.mp4"}])
2390
+ [<PIL.Image>, <np.ndarray>, <PIL.Image>, <np.ndarray>, ...]
2391
+ """
2392
+ import numpy as np
2393
+ from PIL import Image
2394
+
2395
+ if isinstance(content, str):
2396
+ return [content]
2397
+
2398
+ if isinstance(content, list):
2399
+ result = []
2400
+ for item in content:
2401
+ normalized = normalize_content_item(item)
2402
+ # Handle video content (returns tuple with marker)
2403
+ if isinstance(normalized, tuple) and len(normalized) == 2 and normalized[0] == "__video_contents__":
2404
+ # Flatten video contents into result
2405
+ result.extend(normalized[1])
2406
+ else:
2407
+ result.append(normalized)
2408
+ return result
2409
+
2410
+ # Single non-list item (Image or np.ndarray)
2411
+ if isinstance(content, (Image.Image, np.ndarray)):
2412
+ return [content]
2413
+
2414
+ normalized = normalize_content_item(content)
2415
+ if isinstance(normalized, tuple) and len(normalized) == 2 and normalized[0] == "__video_contents__":
2416
+ return normalized[1]
2417
+ return [normalized]
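A short usage sketch showing how an OpenAI-style structured message maps onto native items; the image URL below is a placeholder, and resolving it requires network access:

```python
content = [
    {"type": "text", "text": "Describe this image."},
    {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder URL
]
items = normalize_content(content)
# items -> ["Describe this image.", <PIL.Image.Image ...>]
```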
vocab.json ADDED
The diff for this file is too large to render. See raw diff