erfanzar committed on
Commit
65399b9
·
verified ·
1 Parent(s): f9ad04c

Upload KimiVLForConditionalGeneration

This view is limited to 50 files because the commit contains too many changes. See raw diff.
Files changed (50)
  1. .gitattributes +0 -0
  2. README.md +123 -0
  3. checkpoint_metadata.json +8 -0
  4. config.json +399 -0
  5. generation_config.json +10 -0
  6. model/language_model/lm_head/kernel/.zarray +1 -0
  7. model/language_model/lm_head/kernel/0.0 +3 -0
  8. model/language_model/lm_head/kernel/0.1 +3 -0
  9. model/language_model/lm_head/kernel/0.2 +3 -0
  10. model/language_model/lm_head/kernel/0.3 +3 -0
  11. model/language_model/model/embed_tokens/embedding/.zarray +1 -0
  12. model/language_model/model/embed_tokens/embedding/0.0 +3 -0
  13. model/language_model/model/embed_tokens/embedding/0.1 +3 -0
  14. model/language_model/model/embed_tokens/embedding/0.2 +3 -0
  15. model/language_model/model/embed_tokens/embedding/0.3 +3 -0
  16. model/language_model/model/layers/0/input_layernorm/kernel/.zarray +1 -0
  17. model/language_model/model/layers/0/input_layernorm/kernel/0 +0 -0
  18. model/language_model/model/layers/0/mlp/down_proj/kernel/.zarray +1 -0
  19. model/language_model/model/layers/0/mlp/down_proj/kernel/0.0 +3 -0
  20. model/language_model/model/layers/0/mlp/down_proj/kernel/1.0 +3 -0
  21. model/language_model/model/layers/0/mlp/down_proj/kernel/2.0 +3 -0
  22. model/language_model/model/layers/0/mlp/down_proj/kernel/3.0 +3 -0
  23. model/language_model/model/layers/0/mlp/gate_proj/kernel/.zarray +1 -0
  24. model/language_model/model/layers/0/mlp/gate_proj/kernel/0.0 +3 -0
  25. model/language_model/model/layers/0/mlp/gate_proj/kernel/0.1 +3 -0
  26. model/language_model/model/layers/0/mlp/gate_proj/kernel/0.2 +3 -0
  27. model/language_model/model/layers/0/mlp/gate_proj/kernel/0.3 +3 -0
  28. model/language_model/model/layers/0/mlp/up_proj/kernel/.zarray +1 -0
  29. model/language_model/model/layers/0/mlp/up_proj/kernel/0.0 +3 -0
  30. model/language_model/model/layers/0/mlp/up_proj/kernel/0.1 +3 -0
  31. model/language_model/model/layers/0/mlp/up_proj/kernel/0.2 +3 -0
  32. model/language_model/model/layers/0/mlp/up_proj/kernel/0.3 +3 -0
  33. model/language_model/model/layers/0/post_attention_layernorm/kernel/.zarray +1 -0
  34. model/language_model/model/layers/0/post_attention_layernorm/kernel/0 +0 -0
  35. model/language_model/model/layers/0/self_attn/kv_a_layernorm/kernel/.zarray +1 -0
  36. model/language_model/model/layers/0/self_attn/kv_a_layernorm/kernel/0 +0 -0
  37. model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/.zarray +1 -0
  38. model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.0 +3 -0
  39. model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.1 +3 -0
  40. model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.2 +3 -0
  41. model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.3 +3 -0
  42. model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/.zarray +1 -0
  43. model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.0 +3 -0
  44. model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.1 +3 -0
  45. model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.2 +3 -0
  46. model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.3 +3 -0
  47. model/language_model/model/layers/0/self_attn/o_proj/kernel/.zarray +1 -0
  48. model/language_model/model/layers/0/self_attn/o_proj/kernel/0.0 +3 -0
  49. model/language_model/model/layers/0/self_attn/o_proj/kernel/1.0 +3 -0
  50. model/language_model/model/layers/0/self_attn/o_proj/kernel/2.0 +3 -0
.gitattributes CHANGED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,123 @@
---
tags:
- EasyDeL
- KimiVLForConditionalGeneration
- TaskType.IMAGE_TEXT_TO_TEXT
- AttentionMechanisms.RAGGED_PAGE_ATTENTION_V3
- safetensors
- TPU
- GPU
- XLA
- Flax
---

<p align="center">
  <a href="https://github.com/erfanzar/EasyDeL">
    <img src="https://raw.githubusercontent.com/erfanzar/easydel/main/images/easydel-logo-with-text.png" height="80">
  </a>
</p>

<p align="center">
  <a href="https://github.com/erfanzar/EasyDeL">
    <img src="https://img.shields.io/badge/🤗_EasyDeL-0.2.0-blue.svg" />
  </a>
  <a href="https://github.com/erfanzar/EasyDeL">
    <img src="https://img.shields.io/badge/Model_Type-KimiVLForConditionalGeneration-green.svg" />
  </a>
</p>

# EasyDeL/Kimi-VL-A3B-Instruct

A model implemented using the EasyDeL framework, designed to deliver optimal performance for large-scale natural language processing tasks.

## Overview

This model is built using [EasyDeL](https://github.com/erfanzar/EasyDeL), an open-source framework designed to enhance and streamline the training and serving of machine learning models, with a primary focus on JAX/Flax on TPU/GPU at scale.

EasyDeL provides an efficient, highly optimized, and customizable machine learning model compatible with both GPU and TPU environments. Built with JAX, this model supports advanced features such as sharded model parallelism and customized kernels, making it suitable for distributed training and inference.

## Features Provided by EasyDeL

**EasyDeL Framework Features:**

- **Efficient Implementation**: Built with JAX/Flax for high-performance computation.
- **Modern Architecture**: Built on Flax NNX for better integration, modularity, and performance.
- **Multi-Device Support**: Optimized to run on TPU, GPU, and CPU environments.
- **Sharded Model Parallelism**: Supports model parallelism across multiple devices for scalability (using `auto_shard_model=True`).
- **Customizable Precision**: Allows specification of `dtype`, `param_dtype`, and `precision`.
- **Advanced Serving**: Includes the `eSurge` LLM serving engine, `vWhisper` speech endpoints, and OpenAI-compatible APIs.
- **Optimized Kernels**: Integrates multiple attention mechanisms (such as `AttentionMechanisms.RAGGED_PAGE_ATTENTION_V3`) and platform-specific optimizations.

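The `param_dtype`/`dtype` split above separates storage precision from computation precision. As a rough illustration of that distinction (plain NumPy, not the EasyDeL API):

```python
import numpy as np

# Illustration only: parameters are *stored* in half precision
# (param_dtype), and computation runs in the dtype of the operands (dtype).
params_fp32 = np.arange(6, dtype=np.float32).reshape(3, 2)
params = params_fp32.astype(np.float16)   # param_dtype: storage precision
x = np.ones((1, 3), dtype=np.float16)     # activations in the compute dtype
y = x @ params                            # the matmul stays in float16
print(y.dtype)  # float16
```

Halving the parameter dtype halves checkpoint memory at a (usually small) accuracy cost; `bfloat16` trades mantissa bits for the full float32 exponent range.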
## Installation

To use this model via EasyDeL, first install EasyDeL:

```bash
pip install easydel
```

## Usage

### Loading the Pre-trained Model

To load this pre-trained model with EasyDeL:

```python
from easydel import AutoEasyDeLModelForCausalLM, EasyDeLBaseConfigDict, AttentionMechanisms
from jax import numpy as jnp, lax

# Define max_length if needed for memory optimization
max_length = None

# Load the model and its parameters.
# Set auto_shard_model=True to automatically distribute the model across devices.
model = AutoEasyDeLModelForCausalLM.from_pretrained(
    "EasyDeL/Kimi-VL-A3B-Instruct",
    config_kwargs=EasyDeLBaseConfigDict(
        # use_scan_mlp=False,  # Set to True to potentially reduce memory usage
        attn_dtype=jnp.float16,  # Or jnp.bfloat16
        # freq_max_position_embeddings=max_length,  # Set if using RoPE and truncation is needed
        # mask_max_position_embeddings=max_length,  # Set if a max length is defined
        attn_mechanism=AttentionMechanisms.RAGGED_PAGE_ATTENTION_V3,  # Matches the mechanism used by this model
    ),
    dtype=jnp.float16,  # Or jnp.bfloat16 - computation data type
    param_dtype=jnp.float16,  # Or jnp.bfloat16 - parameter data type
    precision=lax.Precision("fastest"),  # One of "default", "fastest", "high", "highest"
    auto_shard_model=True,  # Auto-shard across available devices
)
```

## Supported Tasks

The primary task for this model is **TaskType.IMAGE_TEXT_TO_TEXT**. Further specific supported tasks are not explicitly listed.

## Limitations

**General Limitations:**

- **Hardware Dependency**: Performance can vary significantly depending on the hardware (TPU/GPU) used.
- **JAX/Flax Setup Required**: The environment must support JAX/Flax for optimal use.
- **Experimental Features**: Some EasyDeL features (such as custom kernels) may require additional configuration.

## License 📜

EasyDeL is released under the Apache v2 license. The license for this specific model may differ; please consult the original model repository or documentation.

```
# Apache License 2.0 (referring to the EasyDeL framework)
# ... (full license text is included in the main repository) ...
```

## Citation

If you use EasyDeL in your research or work, please cite it:

```bibtex
@misc{ZareChavoshi_2023,
  title={EasyDeL: An open-source library for enhancing and streamlining the training process of machine learning models},
  url={https://github.com/erfanzar/EasyDeL},
  author={Zare Chavoshi, Erfan},
  year={2023}
}
```

Please also consider citing the original paper or source for the **EasyDeL/Kimi-VL-A3B-Instruct** model architecture if applicable.
checkpoint_metadata.json ADDED
@@ -0,0 +1,8 @@
{
  "version": "0.0.85",
  "timestamp": "2025-12-15T11:21:29.648456",
  "checksum": {},
  "array_metadata": {},
  "framework_version": null,
  "custom_metadata": {}
}
config.json ADDED
@@ -0,0 +1,399 @@
{
  "architectures": [
    "KimiVLForConditionalGeneration"
  ],
  "attn_mechanism": "ragged_page_attention_v3",
  "auto_map": {
    "AutoConfig": "configuration_kimi_vl.KimiVLConfig",
    "AutoModel": "modeling_kimi_vl.KimiVLForConditionalGeneration",
    "AutoModelForCausalLM": "modeling_kimi_vl.KimiVLForConditionalGeneration"
  },
  "backend": null,
  "bits": null,
  "blocksize_b": 1,
  "blocksize_k": 128,
  "blocksize_q": 128,
  "decode_attn_mechanism": null,
  "dtype": "bfloat16",
  "easy_method": "train",
  "fcm_max_ratio": 0.0,
  "fcm_min_ratio": 0.0,
  "flash_attention_backward_pass_impl": "triton",
  "freq_max_position_embeddings": 40960,
  "fsdp_is_ep_bound": true,
  "gradient_checkpointing": "",
  "gradient_checkpointing_targets": null,
  "hardware_abstraction": true,
  "ignore_index": -100,
  "kv_cache_quantization_config": null,
  "kv_cache_sharding_sequence_axis_name": "sp",
  "mask_max_position_embeddings": 40960,
  "media_placeholder_token_id": 163605,
  "model_type": "kimi_vl",
  "moe_force_xla_gmm": false,
  "moe_method": "fused_moe",
  "moe_tiling_size_batch": 4,
  "moe_tiling_size_dim": 128,
  "moe_tiling_size_seqlen": 128,
  "operation_configs": null,
  "pad_token_id": 0,
  "pallas_k_block_size": 128,
  "pallas_m_block_size": 128,
  "pallas_n_block_size": 128,
  "partition_axis": {
    "attention_dim_axis": null,
    "attention_kv_dim_axis": null,
    "batch_axis": [
      "fsdp",
      "dp"
    ],
    "bias_head_sequence_axis": null,
    "bias_key_sequence_axis": null,
    "data_parallel_axis": "dp",
    "decode_attention_dim_axis": null,
    "decode_attention_kv_dim_axis": null,
    "decode_batch_axis": [
      "fsdp",
      "dp"
    ],
    "decode_head_axis": "tp",
    "decode_key_sequence_axis": "sp",
    "decode_kv_head_axis": "tp",
    "decode_query_sequence_axis": null,
    "expert_axis": "ep",
    "expert_gate_axis": null,
    "expert_parallel_axis": "ep",
    "fully_sharded_data_parallel_axis": "fsdp",
    "head_axis": "tp",
    "hidden_state_axis": "tp",
    "key_sequence_axis": "sp",
    "kv_head_axis": "tp",
    "mlp_intermediate_axis": "tp",
    "query_sequence_axis": "sp",
    "sequence_axis": "sp",
    "sequence_parallel_axis": "sp",
    "tensor_parallel_axis": "tp",
    "vocab_axis": "tp"
  },
  "platform": null,
  "precompute_masks": true,
  "pretraining_tp": 1,
  "quantization_config": null,
  "scan_attention_layers": false,
  "scan_mlp_chunk_size": 1024,
  "scan_ring_attention": true,
  "sequence_axis_name": "sp",
  "sharding_axis_dims": [
    1,
    1,
    1,
    -1,
    1
  ],
  "sharding_axis_names": [
    "dp",
    "fsdp",
    "ep",
    "tp",
    "sp"
  ],
  "sharding_dcn_axis_dims": null,
  "sp_is_ep_bound": true,
  "text_config": {
    "architectures": [
      "KimiVLForConditionalGeneration"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "attn_mechanism": "ragged_page_attention_v3",
    "auto_map": {
      "AutoConfig": "configuration_kimi_vl.KimiVLConfig",
      "AutoModel": "modeling_kimi_vl.KimiVLForConditionalGeneration",
      "AutoModelForCausalLM": "modeling_kimi_vl.KimiVLForConditionalGeneration"
    },
    "aux_loss_alpha": 0.001,
    "backend": null,
    "bits": null,
    "blocksize_b": 1,
    "blocksize_k": 128,
    "blocksize_q": 128,
    "bos_token_id": 163584,
    "decode_attn_mechanism": null,
    "dtype": "bfloat16",
    "easy_method": "train",
    "eos_token_id": 163585,
    "ep_size": 1,
    "fcm_max_ratio": 0.0,
    "fcm_min_ratio": 0.0,
    "first_k_dense_replace": 1,
    "flash_attention_backward_pass_impl": "triton",
    "freq_max_position_embeddings": 40960,
    "fsdp_is_ep_bound": true,
    "gradient_checkpointing": "",
    "gradient_checkpointing_targets": null,
    "hardware_abstraction": true,
    "head_dim": 192,
    "hidden_act": "silu",
    "hidden_size": 2048,
    "initializer_range": 0.02,
    "intermediate_size": 11264,
    "kv_cache_quantization_config": null,
    "kv_cache_sharding_sequence_axis_name": "sp",
    "kv_lora_rank": 512,
    "layer_types": [
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention",
      "full_attention"
    ],
    "mask_max_position_embeddings": 40960,
    "max_position_embeddings": 131072,
    "model_type": "deepseek_v3",
    "moe_force_xla_gmm": false,
    "moe_intermediate_size": 1408,
    "moe_layer_freq": 1,
    "moe_method": "fused_moe",
    "moe_tiling_size_batch": 4,
    "moe_tiling_size_dim": 128,
    "moe_tiling_size_seqlen": 128,
    "n_group": 1,
    "n_routed_experts": 64,
    "n_shared_experts": 2,
    "norm_topk_prob": true,
    "num_attention_heads": 16,
    "num_experts_per_tok": 6,
    "num_hidden_layers": 27,
    "num_key_value_heads": 16,
    "num_nextn_predict_layers": 1,
    "operation_configs": null,
    "pad_token_id": 163839,
    "pallas_k_block_size": 128,
    "pallas_m_block_size": 128,
    "pallas_n_block_size": 128,
    "partition_axis": {
      "attention_dim_axis": null,
      "attention_kv_dim_axis": null,
      "batch_axis": [
        "fsdp",
        "dp"
      ],
      "bias_head_sequence_axis": null,
      "bias_key_sequence_axis": null,
      "data_parallel_axis": "dp",
      "decode_attention_dim_axis": null,
      "decode_attention_kv_dim_axis": null,
      "decode_batch_axis": [
        "fsdp",
        "dp"
      ],
      "decode_head_axis": "tp",
      "decode_key_sequence_axis": "sp",
      "decode_kv_head_axis": "tp",
      "decode_query_sequence_axis": null,
      "expert_axis": "ep",
      "expert_gate_axis": null,
      "expert_parallel_axis": "ep",
      "fully_sharded_data_parallel_axis": "fsdp",
      "head_axis": "tp",
      "hidden_state_axis": "tp",
      "key_sequence_axis": "sp",
      "kv_head_axis": "tp",
      "mlp_intermediate_axis": "tp",
      "query_sequence_axis": "sp",
      "sequence_axis": "sp",
      "sequence_parallel_axis": "sp",
      "tensor_parallel_axis": "tp",
      "vocab_axis": "tp"
    },
    "platform": null,
    "precompute_masks": true,
    "pretraining_tp": 1,
    "q_lora_rank": null,
    "qk_nope_head_dim": 128,
    "qk_rope_head_dim": 64,
    "quantization_config": null,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 800000.0,
    "routed_scaling_factor": 2.446,
    "scan_attention_layers": false,
    "scan_mlp_chunk_size": 1024,
    "scan_ring_attention": true,
    "scoring_func": "sigmoid",
    "seq_aux": true,
    "sequence_axis_name": "sp",
    "sharding_axis_dims": [
      1,
      1,
      1,
      -1,
      1
    ],
    "sharding_axis_names": [
      "dp",
      "fsdp",
      "ep",
      "tp",
      "sp"
    ],
    "sharding_dcn_axis_dims": null,
    "sp_is_ep_bound": true,
    "topk_group": 1,
    "topk_method": "noaux_tc",
    "use_cache": true,
    "use_expert_tensor_mode": false,
    "use_ring_of_experts": false,
    "use_scan_mlp": false,
    "use_sharded_kv_caching": false,
    "use_sharding_constraint": false,
    "v_head_dim": 128,
    "vocab_size": 163840
  },
  "tie_word_embeddings": false,
  "transformers_version": "4.57.3",
  "use_expert_tensor_mode": false,
  "use_ring_of_experts": false,
  "use_scan_mlp": false,
  "use_sharded_kv_caching": false,
  "use_sharding_constraint": false,
  "vision_config": {
    "architectures": [
      "KimiVLForConditionalGeneration"
    ],
    "attn_mechanism": "ragged_page_attention_v3",
    "auto_map": {
      "AutoConfig": "configuration_kimi_vl.KimiVLConfig",
      "AutoModel": "modeling_kimi_vl.KimiVLForConditionalGeneration",
      "AutoModelForCausalLM": "modeling_kimi_vl.KimiVLForConditionalGeneration"
    },
    "backend": null,
    "bits": null,
    "blocksize_b": 1,
    "blocksize_k": 128,
    "blocksize_q": 128,
    "decode_attn_mechanism": null,
    "dtype": "bfloat16",
    "easy_method": "train",
    "fcm_max_ratio": 0.0,
    "fcm_min_ratio": 0.0,
    "flash_attention_backward_pass_impl": "triton",
    "freq_max_position_embeddings": 40960,
    "fsdp_is_ep_bound": true,
    "gradient_checkpointing": "",
    "gradient_checkpointing_targets": null,
    "hardware_abstraction": true,
    "hidden_size": 1152,
    "init_pos_emb_height": 64,
    "init_pos_emb_width": 64,
    "intermediate_size": 4304,
    "kv_cache_quantization_config": null,
    "kv_cache_sharding_sequence_axis_name": "sp",
    "mask_max_position_embeddings": 40960,
    "merge_kernel_size": [
      2,
      2
    ],
    "model_type": "moonvit",
    "moe_force_xla_gmm": false,
    "moe_method": "fused_moe",
    "moe_tiling_size_batch": 4,
    "moe_tiling_size_dim": 128,
    "moe_tiling_size_seqlen": 128,
    "num_attention_heads": 16,
    "num_hidden_layers": 27,
    "operation_configs": null,
    "pallas_k_block_size": 128,
    "pallas_m_block_size": 128,
    "pallas_n_block_size": 128,
    "partition_axis": {
      "attention_dim_axis": null,
      "attention_kv_dim_axis": null,
      "batch_axis": [
        "fsdp",
        "dp"
      ],
      "bias_head_sequence_axis": null,
      "bias_key_sequence_axis": null,
      "data_parallel_axis": "dp",
      "decode_attention_dim_axis": null,
      "decode_attention_kv_dim_axis": null,
      "decode_batch_axis": [
        "fsdp",
        "dp"
      ],
      "decode_head_axis": "tp",
      "decode_key_sequence_axis": "sp",
      "decode_kv_head_axis": "tp",
      "decode_query_sequence_axis": null,
      "expert_axis": "ep",
      "expert_gate_axis": null,
      "expert_parallel_axis": "ep",
      "fully_sharded_data_parallel_axis": "fsdp",
      "head_axis": "tp",
      "hidden_state_axis": "tp",
      "key_sequence_axis": "sp",
      "kv_head_axis": "tp",
      "mlp_intermediate_axis": "tp",
      "query_sequence_axis": "sp",
      "sequence_axis": "sp",
      "sequence_parallel_axis": "sp",
      "tensor_parallel_axis": "tp",
      "vocab_axis": "tp"
    },
    "patch_size": 14,
    "platform": null,
    "precompute_masks": true,
    "pretraining_tp": 1,
    "quantization_config": null,
    "scan_attention_layers": false,
    "scan_mlp_chunk_size": 1024,
    "scan_ring_attention": true,
    "sequence_axis_name": "sp",
    "sharding_axis_dims": [
      1,
      1,
      1,
      -1,
      1
    ],
    "sharding_axis_names": [
      "dp",
      "fsdp",
      "ep",
      "tp",
      "sp"
    ],
    "sharding_dcn_axis_dims": null,
    "sp_is_ep_bound": true,
    "use_expert_tensor_mode": false,
    "use_ring_of_experts": false,
    "use_scan_mlp": false,
    "use_sharded_kv_caching": false,
    "use_sharding_constraint": false,
    "vocab_size": 163840
  },
  "vocab_size": 163840
}
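In `sharding_axis_dims`, the `-1` entry (on the `tp` axis here) acts as a wildcard for "all remaining devices", like a `-1` in a reshape. A small sketch of how such a spec could be resolved against a device count (illustrative helper assuming at most one `-1`; not EasyDeL's internal code):

```python
import math

def resolve_axis_dims(axis_dims, n_devices):
    """Resolve a sharding spec where -1 means 'all remaining devices'.

    Illustrative helper (assumes at most one -1), not part of EasyDeL's API.
    """
    fixed = math.prod(d for d in axis_dims if d != -1)
    if n_devices % fixed != 0:
        raise ValueError("device count not divisible by the fixed axis dims")
    return tuple(n_devices // fixed if d == -1 else d for d in axis_dims)

# Axes are ("dp", "fsdp", "ep", "tp", "sp"); with 8 devices everything
# lands on the tensor-parallel axis.
print(resolve_axis_dims((1, 1, 1, -1, 1), 8))  # (1, 1, 1, 8, 1)
```

The resolved tuple is what would define the device mesh shape, one entry per name in `sharding_axis_names`.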
generation_config.json ADDED
@@ -0,0 +1,10 @@
{
  "bos_token_id": 163584,
  "do_sample": true,
  "eos_token_id": [
    163585
  ],
  "pad_token_id": 163838,
  "temperature": 0.2,
  "transformers_version": "4.57.3"
}
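`do_sample: true` with `temperature: 0.2` means tokens are sampled from a sharpened softmax over the logits rather than taken greedily. A minimal sketch of temperature sampling (plain NumPy, not the actual generation loop):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a token id from logits scaled by 1/temperature.

    Lower temperature -> sharper distribution -> closer to argmax.
    """
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
# At a very low temperature the distribution collapses onto the argmax.
print(sample_with_temperature([1.0, 4.0, 2.0], 1e-6, rng))  # 1
```

At `temperature=0.2` the model still samples, but strongly favors its top candidates; at `temperature=1.0` it samples from the unmodified distribution.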
model/language_model/lm_head/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2048,40960],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048,163840],"zarr_format":2}
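This `.zarray` descriptor explains the sibling chunk files below: a `[2048, 163840]` array stored in `[2048, 40960]` chunks gives a 1×4 chunk grid, and with `"dimension_separator": "."` the Zarr v2 chunk keys are `0.0` through `0.3`. The naming can be reproduced by hand:

```python
import math

shape = (2048, 163840)   # full lm_head kernel
chunks = (2048, 40960)   # per-chunk extent

# Number of chunks along each dimension (ceil-divide).
grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
print(grid)  # (1, 4)

# Zarr v2 chunk keys, joined with the "." dimension separator.
keys = [f"{i}.{j}" for i in range(grid[0]) for j in range(grid[1])]
print(keys)  # ['0.0', '0.1', '0.2', '0.3']
```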
model/language_model/lm_head/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a63e9c3ff3401c20facd55ecb94d60847d1006607efbd5c7e8aceff55d02084b
size 131102355
model/language_model/lm_head/kernel/0.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:33e75e80569aa4c19cde796fe58f236fbf968f3a2237fefe33394ebd50ad10f7
size 130719145
model/language_model/lm_head/kernel/0.2 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:438cf01dcd64730af247c6c8128d2e54cf4ee8fd214923f8e436c9d7d136bf78
size 130532818
model/language_model/lm_head/kernel/0.3 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:58ee279bd18ec439d7f228e3d1df025a67098ad34e8c2b52c67e96a065d92187
size 130384513
model/language_model/model/embed_tokens/embedding/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[163840,512],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[163840,2048],"zarr_format":2}
model/language_model/model/embed_tokens/embedding/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be8065b661be9ad3be95867fb8089352c2305de4cca9aa40f5e7d0806c37ac1a
size 130810030
model/language_model/model/embed_tokens/embedding/0.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5064efeb57188fc540e2e2c9598eb03f909d685c2f53de470d9c46d832cb7e76
size 130810992
model/language_model/model/embed_tokens/embedding/0.2 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:366f6841e6e5abbc9ff762ebbf87427760ace33e335342a4a69ec3bbfada9512
size 130807493
model/language_model/model/embed_tokens/embedding/0.3 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:37545eb2886fe601aeb669c4a35890e135032a15353f8716142db70bd5472b19
size 130807472
model/language_model/model/layers/0/input_layernorm/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2048],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048],"zarr_format":2}
model/language_model/model/layers/0/input_layernorm/kernel/0 ADDED
Binary file (2.24 kB).
 
model/language_model/model/layers/0/mlp/down_proj/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2816,2048],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[11264,2048],"zarr_format":2}
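Note that `down_proj` is chunked along its *first* dimension (`[11264, 2048]` split into `[2816, 2048]` blocks), so its chunk files run `0.0`, `1.0`, `2.0`, `3.0` rather than `0.0`–`0.3` as for `lm_head`. The same Zarr v2 naming rule predicts this:

```python
import math

shape = (11264, 2048)   # full down_proj kernel
chunks = (2816, 2048)   # chunked along dimension 0 only

grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunks))
print(grid)  # (4, 1)

keys = [f"{i}.{j}" for i in range(grid[0]) for j in range(grid[1])]
print(keys)  # ['0.0', '1.0', '2.0', '3.0']
```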
model/language_model/model/layers/0/mlp/down_proj/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bf60a32f0b3214fe9b7ac01a6d1e2f0c5be77f0d856be9b3cc711638b577e5d1
size 9021981
model/language_model/model/layers/0/mlp/down_proj/kernel/1.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:59638343feb8361352694bb7c87f91bcd140d29619658bd77cc1e0101ecb0c92
size 9021753
model/language_model/model/layers/0/mlp/down_proj/kernel/2.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b302ffe8b6b1551439721f113ff717beeab8029a0013a6fe36fa58de348fd0d0
size 9020679
model/language_model/model/layers/0/mlp/down_proj/kernel/3.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4d7d7804f3d4b74fce03c521a6c5aff8267679141d62dbb9a15a03e91d812f65
size 9021529
model/language_model/model/layers/0/mlp/gate_proj/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2048,2816],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048,11264],"zarr_format":2}
model/language_model/model/layers/0/mlp/gate_proj/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e942350103be8aa516671d8b9f4857054d62f4bb59b6f1cc0c0d30a1b3337226
size 9029129
model/language_model/model/layers/0/mlp/gate_proj/kernel/0.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6e431343bda4d6c5ac45225b8a8903a70e52363c599c7ee8fed3a3b2813092b6
size 9029287
model/language_model/model/layers/0/mlp/gate_proj/kernel/0.2 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5f35b8a4b06bcf1afae6fc1a5c41cb1393f710d19fa280e8ea55af6f05f2f609
size 9029213
model/language_model/model/layers/0/mlp/gate_proj/kernel/0.3 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4ed0c0c6779b9e6c6197c54f5fd89aa2c3ca99531dda4a1b7201af571e64173a
size 9028435
model/language_model/model/layers/0/mlp/up_proj/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2048,2816],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048,11264],"zarr_format":2}
model/language_model/model/layers/0/mlp/up_proj/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:10850d1789777f84472564c7a1659081e6672ec35b6b018b2f65e5ac47a0c780
size 9028870
model/language_model/model/layers/0/mlp/up_proj/kernel/0.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:11416ff25247b61577a28c9d77d85ce52b5de3e70c58adb9270905bb304a388a
size 9028826
model/language_model/model/layers/0/mlp/up_proj/kernel/0.2 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:17e97b3e55ce2e988fc58caab9a2ebdacf984702d70d21b485b0cac75bcf69db
size 9028315
model/language_model/model/layers/0/mlp/up_proj/kernel/0.3 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fe6d299d6c14199a0601a540dabaeadca817c58b690b3815393e96f0013af14a
size 9027448
model/language_model/model/layers/0/post_attention_layernorm/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2048],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048],"zarr_format":2}
model/language_model/model/layers/0/post_attention_layernorm/kernel/0 ADDED
Binary file (1.86 kB).
 
model/language_model/model/layers/0/self_attn/kv_a_layernorm/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[512],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[512],"zarr_format":2}
model/language_model/model/layers/0/self_attn/kv_a_layernorm/kernel/0 ADDED
Binary file (730 Bytes).
 
model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[2048,144],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048,576],"zarr_format":2}
model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:051dcdda963f127302981ccc4151c884488ea8eeddd946c1b537ab4ac750936f
size 461611
model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4b31e64bae7db92df30e0c60b23633def93f2a9162c974df9db0677f45abf5ff
size 461422
model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.2 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:75a22b7852e0cfd23ee23203234daec00b5187ab0531277a9cc7dc19275f55ea
size 461724
model/language_model/model/layers/0/self_attn/kv_a_proj_with_mqa/kernel/0.3 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:38166fc7a4fb13c84176a7e1fe1ad3fb17b59387e206fc7ce7af213177ef6c07
size 465596
model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[512,1024],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[512,4096],"zarr_format":2}
model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eca0650432a15207c76ab6ee550f0894bc84cb22c6cd3306e375b47d10236a97
size 842700
model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.1 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03e9dde9879a68cd8689de0c0e284a531ebb5a6b61924d259d26b43352d1feb6
size 851802
model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.2 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:221683d111f1969998441fdb2a17411b031e13e7eff596796249172626660d23
size 846798
model/language_model/model/layers/0/self_attn/kv_b_proj/kernel/0.3 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b5d32510b19ba51e931249a92756bddba5d4e12cdfd4d3ec6c582bce063c50ea
size 852033
model/language_model/model/layers/0/self_attn/o_proj/kernel/.zarray ADDED
@@ -0,0 +1 @@
{"chunks":[512,2048],"compressor":{"id":"zstd","level":1},"dimension_separator":".","dtype":"bfloat16","fill_value":null,"filters":null,"order":"C","shape":[2048,2048],"zarr_format":2}
model/language_model/model/layers/0/self_attn/o_proj/kernel/0.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bf995252924f618313b1b47735d0afde9638e64a164d8a2692968832bc42f454
size 1632089
model/language_model/model/layers/0/self_attn/o_proj/kernel/1.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:daf927088d1e489cbdae6c365dc53b1ec819b37d191c76fac337858a75972dd0
size 1629443
model/language_model/model/layers/0/self_attn/o_proj/kernel/2.0 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b982469cf6789ad1418136046eaefaa5c6bf824d03fe4729a2894d057bfcdd5a
size 1629410