Added chat_template

by shizhediao2 - opened Nov 4, 2024

base: refs/heads/main ← from: refs/pr/1

+839

-93886

Files changed (16)

README.md +62 -158
added_tokens.json +0 -3
config.json +14 -14
generation_config.json +2 -1
images/instruct_performance.png +0 -0
images/performance1.png +0 -0
images/performance2.png +0 -0
instruct_performance.png +0 -0
tokenizer.model → model-00001-of-00002.safetensors +2 -2
model.safetensors → model-00002-of-00002.safetensors +2 -2
model.safetensors.index.json +618 -0
modeling_hymba.py +137 -177
setup.sh +0 -44
special_tokens_map.json +0 -30
tokenizer.json +0 -0
tokenizer_config.json +0 -52

README.md CHANGED

@@ -1,201 +1,105 @@
 ---
-base_model:
-- nvidia/Hymba-1.5B-Base
-library_name: transformers
-license: other
-license_name: nvidia-open-model-license
-license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
-pipeline_tag: text-generation
 ---
-# Hymba-1.5B-Instruct
-<p align="center">
- 💾 <a href="https://github.com/NVlabs/hymba">Github</a>&nbsp&nbsp | &nbsp&nbsp 📄 <a href="https://arxiv.org/abs/2411.13676">Paper</a> | &nbsp&nbsp 📜 <a href="https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/">Blog</a> &nbsp
-</p>
-## Model Overview
-Hymba-1.5B-Instruct is a 1.5B parameter model finetuned from [Hymba-1.5B-Base](https://huggingface.co/nvidia/Hymba-1.5B-Base) using a combination of open source instruction datasets and internally collected synthetic datasets. This model is finetuned with supervised fine-tuning and direct preference optimization.
-Hymba-1.5B-Instruct is capable of many complex and important tasks like math reasoning, function calling, and role playing.
-This model is ready for commercial use.
-**Model Developer:** NVIDIA
-**Model Dates:** Hymba-1.5B-Instruct was trained between September 4, 2024 and November 10th, 2024.
-**License:**
-This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
-## Model Architecture
-> ⚡️ We've released a minimal implementation of Hymba on GitHub to help developers understand and implement its design principles in their own models. Check it out! [barebones-hymba](https://github.com/NVlabs/hymba/tree/main/barebones_hymba).
->
-Hymba-1.5B-Instruct has a model embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504, with 32 layers in total, 16 SSM states, 3 full attention layers, the rest are sliding window attention. Unlike the standard Transformer, each attention layer in Hymba has a hybrid combination of standard attention heads and Mamba heads in parallel.  Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
-Features of this architecture:
-- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.
 <div align="center">
 <img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/module.png" alt="Hymba Module" width="600">
 </div>
-- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention.
-- Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency.
 <div align="center">
 <img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
 </div>
-## Performance Highlights
-- Hymba-1.5B-Instruct outperforms popular small language models and achieves the highest average performance across all tasks.
 <div align="center">
-<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/instruct_performance.png" alt="Compare with SoTA Small LMs" width="600">
 </div>
-## Model Usage
-### Step 1: Environment Setup
-Since Hymba-1.5B-Instruct employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on Pytorch2.5 and other related dependencies, we provide two ways to setup the environment:
-- **[Local install]** Install the related packages using our provided `setup.sh` (support CUDA 12.1/12.4):
-```
-wget --header="Authorization: Bearer YOUR_HF_TOKEN" https://huggingface.co/nvidia/Hymba-1.5B-Base/resolve/main/setup.sh
-bash setup.sh
-```
-- **[Docker]** A docker image is provided with all of Hymba's dependencies installed. You can download our docker image and start a container using the following commands:
-```
-docker pull ghcr.io/tilmto/hymba:v1
-docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
-```
-### Step 2: Chat with Hymba-1.5B-Instruct
-After setting up the environment, you can use the following script to chat with our Model
-```py
-from transformers import AutoModelForCausalLM, AutoTokenizer, StopStringCriteria, StoppingCriteriaList
-import torch
-# Load the tokenizer and model
-repo_name = "nvidia/Hymba-1.5B-Instruct"
-tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
-model = model.cuda().to(torch.bfloat16)
-# Chat with Hymba
-prompt = input()
-messages = [
-    {"role": "system", "content": "You are a helpful assistant."}
-]
-messages.append({"role": "user", "content": prompt})
-# Apply chat template
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda')
-stopping_criteria = StoppingCriteriaList([StopStringCriteria(tokenizer=tokenizer, stop_strings="</s>")])
-outputs = model.generate(
-    tokenized_chat,
-    max_new_tokens=256,
-    do_sample=False,
-    temperature=0.7,
-    use_cache=True,
-    stopping_criteria=stopping_criteria
-)
-input_length = tokenized_chat.shape[1]
-response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
-print(f"Model response: {response}")
 ```
-The prompt template used by Hymba-1.5B-Instruct is as follows, which has been integrated into the tokenizer and can be applied using `tokenizer.apply_chat_template`:
 ```
-<extra_id_0>System
-{system prompt}
-<extra_id_1>User
-<tool> ... </tool>
-<context> ... </context>
-{prompt}
-<extra_id_1>Assistant
-<toolcall> ... </toolcall>
-<extra_id_1>Tool
-{tool response}
-<extra_id_1>Assistant\n
 ```
-## Finetuning Hymba
-[LMFlow](https://github.com/OptimalScale/LMFlow) is a complete pipeline for fine-tuning large language models.
-The following steps provide an example of how to fine-tune the `Hymba-1.5B-Base` models using LMFlow.
-1. Using Docker
-    ```
-      docker pull ghcr.io/tilmto/hymba:v1
-      docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
-    ```
-2. Install LMFlow
-    ```
-      git clone https://github.com/OptimalScale/LMFlow.git
-      cd LMFlow
-      conda create -n lmflow python=3.9 -y
-      conda activate lmflow
-      conda install mpi4py
-      pip install -e .
-    ```
-3. Fine-tune the model using the following command.
-    ```
-      cd LMFlow
-      bash ./scripts/run_finetune_hymba.sh
-    ```
-With LMFlow, you can also fine-tune the model on your custom dataset. The only thing you need to do is transform your dataset into the [LMFlow data format](https://optimalscale.github.io/LMFlow/examples/DATASETS.html).
-In addition to full-finetuniing, you can also fine-tune hymba efficiently with [DoRA](https://arxiv.org/html/2402.09353v4), [LoRA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lora), [LISA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lisa), [Flash Attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md), and other acceleration techniques.
-For more details, please refer to the [LMFlow for Hymba](https://github.com/OptimalScale/LMFlow/tree/main/experimental/Hymba) documentation.
-## Limitations
-The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
-The testing suggests that this model is susceptible to jailbreak attacks. If using this model in a RAG or agentic setting, we recommend strong output validation controls to ensure security and safety risks from user-controlled model outputs are consistent with the intended use cases.
-## Ethical Considerations
-NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
-Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
-## Citation
-```
-@misc{dong2024hymbahybridheadarchitecturesmall,
-      title={Hymba: A Hybrid-head Architecture for Small Language Models},
-      author={Xin Dong and Yonggan Fu and Shizhe Diao and Wonmin Byeon and Zijia Chen and Ameya Sunil Mahabaleshwarkar and Shih-Yang Liu and Matthijs Van Keirsbilck and Min-Hung Chen and Yoshi Suhara and Yingyan Lin and Jan Kautz and Pavlo Molchanov},
-      year={2024},
-      eprint={2411.13676},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL},
-      url={https://arxiv.org/abs/2411.13676},
-}
 ```

 ---
+{}
 ---
+# Hymba: A Hybrid-head Architecture for Small Language Models
+[[Slide](https://docs.google.com/presentation/d/1uidqBfDy8a149yE1-AKtNnPm1qwa01hp8sOj3_KAoMI/edit#slide=id.g2f73b22dcb8_0_1017)][Technical Report]  **!!! This huggingface repo is still under development.**
+Developed by Deep Learning Efficiency Research (DLER) team at NVIDIA Research.
+## Hymba: A Novel LM Architecture
+- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs
 <div align="center">
 <img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/module.png" alt="Hymba Module" width="600">
 </div>
+- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention
+- Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency
 <div align="center">
 <img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
 </div>
+## Hymba: Performance Highlights
+- [Hymba-1.5B-Base](https://huggingface.co/nvidia/Hymba-1.5B): Outperform all sub-2B public models, e.g., matching Llama 3.2 3B’s commonsense reasoning accuracy, being 3.49× faster, and reducing cache size by 11.7×
 <div align="center">
+<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/performance1.png" alt="Compare with SoTA Small LMs" width="600">
 </div>
+- Hymba-1.5B-Instruct: Outperform SOTA small LMs.
+<div align="center">
+<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/instruct_performance.png" alt="Compare with SoTA Small LMs" width="600">
+</div>
+## Hymba-1.5B-Instruct: Model Usage
+We release our Hymba-1.5B-Instruct model and offer the instructions to use our model as follows.
+### Step 1: Environment Setup
+Since our model employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on Pytorch2.5 and other related dependencies, we provide three ways to set up the environment:
+- **[Pip]** Install the related packages using our provided `requirement.txt`:
+```
+pip install -r https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/requirements.txt
 ```
+- **[Docker]** We have prepared a docker image with all of Hymba's dependencies installed. You can download our docker image and start a container using the following commands:
 ```
+wget http://10.137.9.244:8000/hymba_docker.tar
+docker load -i hymba.tar
+docker run --security-opt seccomp=unconfined --gpus all -v /home/$USER:/home/$USER -it hymba:v1 bash
 ```
+- **[Internal Only]** If you are an internal user from NVIDIA and are using the ORD cluster, you can use our prepared `sqsh` file to apply for an interactive node:
+   ```
+   srun -A nvr_lpr_llm --partition interactive --time 4:00:00 --gpus 8 --container-image /lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25.sqsh --container-mounts=$HOME:/home,/lustre:/lustre  --pty bash
+   ```
+### Step 2: Chat with Hymba
+After setting up the environment, you can use the following script to chat with our Model
+```
+from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer, AutoModel
+from huggingface_hub import login
+import torch
+login()
+# Load LLaMA2's tokenizer
+tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b")
+# Load Hymba-1.5B
+model = AutoModelForCausalLM.from_pretrained("nvidia/Hymba-1.5B-Instruct", trust_remote_code=True).cuda().to(torch.bfloat16)
+# Chat with our model
+def chat_with_model(prompt, model, tokenizer, max_length=64):
+    inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
+    outputs = model.generate(inputs.input_ids, max_length=max_length, do_sample=False, temperature=0.7, use_cache=True)
+    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+    return response
+print("Chat with the model (type 'exit' to quit):")
+while True:
+    print("User:")
+    prompt = input()
+    if prompt.lower() == "exit":
+        break
+    # Get the model's response
+    response = chat_with_model(prompt, model, tokenizer)
+    print(f"Model: {response}")
 ```

added_tokens.json DELETED

@@ -1,3 +0,0 @@
-{
-  "[PAD]": 32000
-}

config.json CHANGED

@@ -15,6 +15,14 @@
   "conv_dim": {
     "0": 3200,
     "1": 3200,
     "10": 3200,
     "11": 3200,
     "12": 3200,
@@ -25,7 +33,6 @@
     "17": 3200,
     "18": 3200,
     "19": 3200,
-    "2": 3200,
     "20": 3200,
     "21": 3200,
     "22": 3200,
@@ -36,15 +43,8 @@
     "27": 3200,
     "28": 3200,
     "29": 3200,
-    "3": 3200,
     "30": 3200,
-    "31": 3200,
-    "4": 3200,
-    "5": 3200,
-    "6": 3200,
-    "7": 3200,
-    "8": 3200,
-    "9": 3200
   },
   "eos_token_id": 2,
   "global_attn_idx": [
@@ -160,7 +160,7 @@
   "mamba_expand": 2,
   "mamba_inner_layernorms": true,
   "mamba_proj_bias": false,
-  "max_position_embeddings": 8192,
   "memory_tokens_interspersed_every": 0,
   "mlp_hidden_act": "silu",
   "model_type": "hymba",
@@ -171,18 +171,18 @@
   "num_key_value_heads": 5,
   "num_mamba": 1,
   "num_memory_tokens": 128,
-  "orig_max_position_embeddings": 2048,
   "output_router_logits": false,
   "pad_token_id": 0,
   "rms_norm_eps": 1e-06,
   "rope": true,
   "rope_theta": 10000.0,
-  "rope_type": "ntk",
   "router_aux_loss_coef": 0.001,
-  "seq_length": 8192,
   "sliding_window": 1024,
   "tie_word_embeddings": true,
-  "torch_dtype": "bfloat16",
   "transformers_version": "4.44.0",
   "use_cache": false,
   "use_mamba_kernels": true,

   "conv_dim": {
     "0": 3200,
     "1": 3200,
+    "2": 3200,
+    "3": 3200,
+    "4": 3200,
+    "5": 3200,
+    "6": 3200,
+    "7": 3200,
+    "8": 3200,
+    "9": 3200,
     "10": 3200,
     "11": 3200,
     "12": 3200,
     "17": 3200,
     "18": 3200,
     "19": 3200,
     "20": 3200,
     "21": 3200,
     "22": 3200,
     "27": 3200,
     "28": 3200,
     "29": 3200,
     "30": 3200,
+    "31": 3200
   },
   "eos_token_id": 2,
   "global_attn_idx": [
   "mamba_expand": 2,
   "mamba_inner_layernorms": true,
   "mamba_proj_bias": false,
+  "max_position_embeddings": 1024,
   "memory_tokens_interspersed_every": 0,
   "mlp_hidden_act": "silu",
   "model_type": "hymba",
   "num_key_value_heads": 5,
   "num_mamba": 1,
   "num_memory_tokens": 128,
+  "orig_max_position_embeddings": null,
   "output_router_logits": false,
   "pad_token_id": 0,
   "rms_norm_eps": 1e-06,
   "rope": true,
   "rope_theta": 10000.0,
+  "rope_type": null,
   "router_aux_loss_coef": 0.001,
+  "seq_length": 1024,
   "sliding_window": 1024,
   "tie_word_embeddings": true,
+  "torch_dtype": "float32",
   "transformers_version": "4.44.0",
   "use_cache": false,
   "use_mamba_kernels": true,
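
The config.json hunks above can be read as five value changes plus a reordering of the `conv_dim` keys. The sketch below restates that in Python, using only values copied from the `-` and `+` lines. The helper `changed_keys` is a hypothetical name introduced for illustration, and the assumption that every `conv_dim` entry (including the ones elided between hunks) is 3200 is not confirmed by the visible lines.

```python
# Values copied from the config.json hunks: "-" lines vs "+" lines.
removed = {
    "max_position_embeddings": 8192,
    "orig_max_position_embeddings": 2048,
    "rope_type": "ntk",
    "seq_length": 8192,
    "torch_dtype": "bfloat16",
}
added = {
    "max_position_embeddings": 1024,
    "orig_max_position_embeddings": None,  # JSON null
    "rope_type": None,                     # JSON null
    "seq_length": 1024,
    "torch_dtype": "float32",
}


def changed_keys(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ."""
    keys = sorted(old.keys() | new.keys())
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


changes = changed_keys(removed, added)
for key, (old, new) in changes.items():
    print(f"{key}: {old!r} -> {new!r}")

# conv_dim: one side lists keys in string-sorted order ("0", "1", "10", ...),
# the other in numeric order ("0", "1", "2", ...).
# Assumption: all 32 entries map to 3200 on both sides.
conv_dim_string_sorted = {str(i): 3200 for i in sorted(range(32), key=str)}
conv_dim_numeric = {str(i): 3200 for i in range(32)}

# Dict equality ignores key order, so this is an ordering-only difference.
order_only = (
    conv_dim_string_sorted == conv_dim_numeric
    and list(conv_dim_string_sorted) != list(conv_dim_numeric)
)
print("conv_dim differs only in key order:", order_only)
```

Under that assumption, the `conv_dim` lines are noise for review purposes; the position, RoPE, and dtype fields are the changes that affect behavior.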

generation_config.json CHANGED

@@ -4,5 +4,6 @@
   "eos_token_id": 2,
   "pad_token_id": 0,
   "transformers_version": "4.44.0",
-  "use_cache": false
 }

   "eos_token_id": 2,
   "pad_token_id": 0,
   "transformers_version": "4.44.0",
+  "use_cache": false,
+  "chat_template": "{{'<extra_id_0>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'\n' + message['content'].strip()}}{% if tools or contexts %}{{'\n'}}{% endif %}{% endif %}{% endfor %}{% if tools %}{% for tool in tools %}{{ '\n<tool> ' + tool|tojson + ' </tool>' }}{% endfor %}{% endif %}{% if contexts %}{% if tools %}{{'\n'}}{% endif %}{% for context in contexts %}{{ '\n<context> ' + context.strip() + ' </context>' }}{% endfor %}{% endif %}{{'\n\n'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '<extra_id_1>User\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<extra_id_1>Assistant\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'tool' %}{{ '<extra_id_1>Tool\n' + message['content'].strip() + '\n' }}{% endif %}{% endfor %}{%- if add_generation_prompt %}{{'<extra_id_1>Assistant\n'}}{%- endif %}"
 }
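
The `chat_template` string added above is dense on a single line, so a plain-Python restatement of its branches may help when reviewing it. This is a sketch under stated assumptions, not the tokenizer's implementation: `render_hymba_prompt` is a hypothetical helper, and `json.dumps` stands in for Jinja's `tojson` filter, whose exact escaping and formatting may differ.

```python
import json


def render_hymba_prompt(messages, tools=None, contexts=None, add_generation_prompt=True):
    """Approximate the Jinja chat_template branch by branch (illustrative only)."""
    out = ["<extra_id_0>System"]

    # System messages are emitted first, directly under the System header.
    for message in messages:
        if message["role"] == "system":
            out.append("\n" + message["content"].strip())
            if tools or contexts:
                out.append("\n")

    # Optional tool definitions, one <tool> ... </tool> entry per tool.
    for tool in tools or []:
        out.append("\n<tool> " + json.dumps(tool) + " </tool>")

    # Optional retrieved contexts, separated from tools by an extra newline.
    if contexts:
        if tools:
            out.append("\n")
        for context in contexts:
            out.append("\n<context> " + context.strip() + " </context>")

    out.append("\n\n")

    # Conversation turns; roles other than user/assistant/tool are skipped here.
    headers = {"user": "User", "assistant": "Assistant", "tool": "Tool"}
    for message in messages:
        role = message["role"]
        if role in headers:
            out.append("<extra_id_1>" + headers[role] + "\n" + message["content"].strip() + "\n")

    # Trailing header that cues the model to answer.
    if add_generation_prompt:
        out.append("<extra_id_1>Assistant\n")
    return "".join(out)


prompt = render_hymba_prompt(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ]
)
print(prompt)
```

For the two-message conversation above, the sketch yields a `System` block, a blank line, one `User` turn, and a trailing `<extra_id_1>Assistant` header followed by a newline.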

images/instruct_performance.png CHANGED

images/performance1.png ADDED

images/performance2.png ADDED

instruct_performance.png DELETED

Binary file (97.9 kB)

tokenizer.model → model-00001-of-00002.safetensors RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
-size 499723

 version https://git-lfs.github.com/spec/v1
+oid sha256:7f01b19a43514af19def4c812a1d453dfd66f5c1b0be9674090a5bf37b699fc1
+size 4988876320

model.safetensors → model-00002-of-00002.safetensors RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:83e5b3b0f41d82964e0c22809786ff0eb10afc116d43cbbe53325ebf6cba85f1
-size 3045665048

 version https://git-lfs.github.com/spec/v1
+oid sha256:b11f9bec9246d8dc80612bb4e9d20f58b5744ca90ffae8944fffa0658789fde8
+size 1102383712
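
The byte counts in the two LFS pointers can be cross-checked against the `total_size` field of model.safetensors.index.json and against the `torch_dtype` change in config.json. The arithmetic below is a sketch that assumes `total_size` counts tensor bytes only, with the remainder of each file being safetensors header data; neither point is stated in the diff itself.

```python
# Sizes in bytes, copied from the Git LFS pointers and index metadata in this diff.
shard_1 = 4_988_876_320      # model-00001-of-00002.safetensors ("+" side)
shard_2 = 1_102_383_712      # model-00002-of-00002.safetensors ("+" side)
index_total = 6_091_191_296  # "total_size" in model.safetensors.index.json
single_file = 3_045_665_048  # model.safetensors ("-" side)

# The two shards together are slightly larger than the indexed tensor bytes.
shards_on_disk = shard_1 + shard_2
overhead = shards_on_disk - index_total  # assumed to be header/metadata bytes

# If the sharded weights are float32 (4 bytes per value), this is the value count.
params_if_fp32 = index_total // 4

# The same values stored as bfloat16 (2 bytes each) would take half the space.
bf16_bytes = params_if_fp32 * 2

print(f"shards on disk : {shards_on_disk:,} bytes (overhead {overhead:,})")
print(f"values if fp32 : {params_if_fp32:,}")
print(f"bf16 estimate  : {bf16_bytes:,} bytes vs single file {single_file:,}")
```

Read this way, the sharded checkpoint holds about 1.52 billion float32 values, and the single 3.05 GB file is within roughly 70 kB of the same values stored in bfloat16, which is consistent with the `torch_dtype` lines in config.json.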

model.safetensors.index.json ADDED

@@ -0,0 +1,618 @@

+{
+  "metadata": {
+    "total_size": 6091191296
+  },
+  "weight_map": {
+    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
+    "model.final_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.0.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.1.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.10.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.11.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.12.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.13.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.14.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.15.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.16.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.17.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.18.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.19.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.2.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.20.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.21.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.22.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.23.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.24.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.25.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.A_log.0": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.D.0": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.26.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.26.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.26.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.A_log.0": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.D.0": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.27.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.A_log.0": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.D.0": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.28.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.A_log.0": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.D.0": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.29.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.3.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.A_log.0": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.D.0": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.30.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.A_log.0": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.D.0": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
+    "model.layers.31.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
+    "model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.4.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.5.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.6.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.7.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.8.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.A_log.0": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.D.0": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
+    "model.layers.9.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
+    "model.memory_tokens": "model-00001-of-00002.safetensors"
+  }
+}

modeling_hymba.py CHANGED

@@ -39,13 +39,16 @@ from .configuration_hymba import HymbaConfig
 from torch.utils.checkpoint import checkpoint
-from flash_attn import flash_attn_func, flash_attn_varlen_func
-from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
-_flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
-from einops import rearrange, repeat, reduce, pack, unpack
-from einops.layers.torch import Rearrange
 if is_torch_fx_available():
@@ -396,7 +399,7 @@ class HybridMambaAttentionDynamicCache(DynamicCache):
             if has_mamba_state:
                 if hasattr(config, 'conv_dim'):
-                    conv_dim = config.conv_dim[str(i)]
                 else:
                     conv_dim = intermediate_size
                 self.conv_states += [
@@ -543,14 +546,6 @@ class HymbaAttention(nn.Module):
         if self.config.rope:
             self._init_rope()
-    def set_rope(self, rope_type, orig_max_position_embeddings, max_position_embeddings):
-        self.config.rope_type = rope_type
-        self.config.orig_max_position_embeddings = orig_max_position_embeddings
-        self.config.max_position_embeddings = max_position_embeddings
-        self._init_rope()
     def _init_rope(self):
@@ -1233,7 +1228,7 @@ class HymbaFlexAttention(HymbaFlashAttention2):
         self.attn_mask = or_masks(attn_mask, register_mask)
-        self.block_mask = create_block_mask(self.attn_mask, B=None, H=None, Q_LEN=qk_length, KV_LEN=qk_length)
         self.flex_attention = torch.compile(flex_attention)
@@ -1523,7 +1518,7 @@ class HymbaBlock(nn.Module):
         num_ssm_param = 1
         if not hasattr(config, 'conv_dim'):
-            config.conv_dim = {str(i):0 for i in range(config.num_hidden_layers)}
         self.conv1d = nn.Conv1d(
             in_channels=self.intermediate_size,
@@ -1534,7 +1529,7 @@ class HymbaBlock(nn.Module):
             padding=self.conv_kernel_size - 1
             )
-        config.conv_dim[str(self.layer_idx)] = self.intermediate_size
         self.x_proj = nn.ModuleList([nn.Linear(self.intermediate_size, self.time_step_rank + self.ssm_state_size * 2, bias=False) for _ in range(num_ssm_param)])
         self.dt_proj = nn.ModuleList([nn.Linear(self.time_step_rank, self.intermediate_size, bias=True) for _ in range(num_ssm_param)])
@@ -1579,133 +1574,145 @@ class HymbaBlock(nn.Module):
     def cuda_kernels_forward(self, hidden_states: torch.Tensor, cache_params: HybridMambaAttentionDynamicCache = None, attention_mask=None, position_ids=None, kv_last_layer=None, use_cache=False, use_swa=False):
         projected_states = self.in_proj(hidden_states).transpose(1, 2)  ## (bs, latent_dim, seq_len)
-        ## Handle padding for Mamba: Set padding tokens to 0
-        if projected_states.shape[-1] > 1 and attention_mask is not None and (attention_mask == 0).any():
-            projected_states = projected_states * attention_mask.unsqueeze(1).to(projected_states)
-        batch_size, seq_len, _ = hidden_states.shape
-        use_precomputed_states = (
-            cache_params is not None
-            and cache_params.has_previous_state
-            and seq_len == 1
-            and cache_params.conv_states[self.layer_idx].shape[0]
-            == cache_params.ssm_states[self.layer_idx].shape[0]
-            == batch_size
-            and use_cache
-        )
-        hidden_states, gate = projected_states.tensor_split((self.latent_dim,), dim=1)
-        conv_weights = self.conv1d.weight.view(self.conv1d.weight.size(0), self.conv1d.weight.size(2))
-        if self.reuse_kv:
-            query_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size,), dim=1)
-            query_states = query_states.transpose(1,2)
         else:
-            query_states, key_states, value_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size, self.attn_hidden_size + self.k_hidden_size, self.attn_hidden_size + self.k_hidden_size + self.v_hidden_size), dim=1)
-            query_states = query_states.transpose(1,2)
-            key_states = key_states.transpose(1,2)
-            value_states = value_states.transpose(1,2)
-        if use_precomputed_states:
-            hidden_states = causal_conv1d_update(
-                hidden_states.squeeze(-1),
-                cache_params.conv_states[self.layer_idx],
-                conv_weights,
-                self.conv1d.bias,
-                self.activation,
             )
-            hidden_states = hidden_states.unsqueeze(-1)
-            cache_params.mamba_past_length[self.layer_idx] += seq_len
-        else:
-            if cache_params is not None:
-                conv_states = nn.functional.pad(
-                    hidden_states, (self.conv_kernel_size - hidden_states.shape[-1], 0)
-                )
-                cache_params.conv_states[self.layer_idx].copy_(conv_states)
-                cache_params.mamba_past_length[self.layer_idx] += seq_len
-            hidden_states = causal_conv1d_fn(
-                hidden_states, conv_weights, self.conv1d.bias, activation=self.activation
-            )
-        ## Handle padding for Mamba: Set padding tokens to 0
-        if seq_len > 1 and attention_mask is not None and (attention_mask == 0).any():
-            hidden_states = hidden_states * attention_mask.unsqueeze(1).to(hidden_states)
-        if self.reuse_kv:
-            assert kv_last_layer is not None
-            attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, kv_last_layer=kv_last_layer, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
-        else:
-            attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, key_states=key_states, value_states=value_states, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
-        ## Mamba head
-        index = 0
-        ssm_parameters = self.x_proj[index](hidden_states.transpose(1, 2))
-        time_step, B, C = torch.split(
-            ssm_parameters, [self.time_step_rank, self.ssm_state_size, self.ssm_state_size], dim=-1
-        )
-        time_step, B, C = self._apply_layernorms(time_step, B, C)
-        if hasattr(self.dt_proj[index], "base_layer"):
-            time_proj_bias = self.dt_proj[index].base_layer.bias
-            self.dt_proj[index].base_layer.bias = None
-        else:
-            time_proj_bias = self.dt_proj[index].bias
-            self.dt_proj[index].bias = None
-        discrete_time_step = self.dt_proj[index](time_step).transpose(1, 2)  # [batch, intermediate_size, seq_len]
-        if hasattr(self.dt_proj[index], "base_layer"):
-            self.dt_proj[index].base_layer.bias = time_proj_bias
-        else:
-            self.dt_proj[index].bias = time_proj_bias
-        A = -torch.exp(self.A_log[index].float())
-        time_proj_bias = time_proj_bias.float() if time_proj_bias is not None else None
-        if use_precomputed_states:
-            scan_outputs = selective_state_update(
-                cache_params.ssm_states[self.layer_idx],
-                hidden_states[..., 0],
-                discrete_time_step[..., 0],
-                A,
-                B[:, 0],
-                C[:, 0],
-                self.D[index],
-                gate[..., 0],
-                time_proj_bias,
-                dt_softplus=True,
-            ).unsqueeze(-1)
-        else:
-            outputs = selective_scan_fn(
-                hidden_states,
-                discrete_time_step,
-                A,
-                B.transpose(1, 2),
-                C.transpose(1, 2),
-                self.D[index].float(),
-                z=gate,
-                delta_bias=time_proj_bias,
-                delta_softplus=True,
-                return_last_state=True,
             )
-            if len(outputs) == 3:
-                scan_outputs, ssm_state, _ = outputs
             else:
-                scan_outputs, ssm_state = outputs
-            if ssm_state is not None and cache_params is not None:
-                cache_params.ssm_states[self.layer_idx].copy_(ssm_state)
-        scan_outputs = scan_outputs.transpose(1, 2)
-        hidden_states = (self.pre_avg_layernorm1(attn_outputs) + self.pre_avg_layernorm2(scan_outputs)) / 2
-        contextualized_states = self.out_proj(hidden_states)
         return contextualized_states, attn_key_value
@@ -2025,49 +2032,6 @@ class HymbaPreTrainedModel(PreTrainedModel):
-def shift_zeros_to_front(attention_mask, hidden_states, position_ids):
-    """
-    Move all zero entries in 'attention_mask' to the front of the sequence
-    and reorder 'hidden_states' accordingly, preserving the order of zeros
-    and the order of ones.
-    Args:
-      attention_mask: (batch_size, seq_len), values in {0, 1}.
-      hidden_states:  (batch_size, seq_len, dim).
-    Returns:
-      shifted_mask:   (batch_size, seq_len) with zeros at the front.
-      shifted_states: (batch_size, seq_len, dim) reordered accordingly.
-    """
-    B, L = attention_mask.shape
-    D = hidden_states.shape[-1]
-    shifted_mask = torch.empty_like(attention_mask)
-    shifted_states = torch.empty_like(hidden_states)
-    shifted_position_ids = torch.empty_like(position_ids)
-    # Process each batch row independently
-    for b in range(B):
-        row_mask = attention_mask[b]       # (seq_len,)
-        row_states = hidden_states[b]      # (seq_len, dim)
-        row_pos = position_ids[b]       # (seq_len,)
-        # Find positions of zeros and ones
-        zero_indices = torch.where(row_mask == 0)[0]
-        one_indices  = torch.where(row_mask == 1)[0]
-        # Concatenate zero indices (in order) then one indices
-        new_order = torch.cat([zero_indices, one_indices], dim=0)
-        # Reorder mask and states
-        shifted_mask[b] = row_mask[new_order]
-        shifted_states[b] = row_states[new_order]
-        shifted_position_ids[b] = row_pos[new_order]
-    return shifted_mask, shifted_states, shifted_position_ids
 HYMBA_INPUTS_DOCSTRING = r"""
     Args: To be added later. Please refer to the forward function.
 """
@@ -2236,11 +2200,7 @@ class HymbaModel(HymbaPreTrainedModel):
             if position_ids is not None and position_ids.shape[1] != inputs_embeds.shape[1]:
                 position_ids = torch.arange(inputs_embeds.shape[1], device=inputs_embeds.device).unsqueeze(0)
-            ## Handle paddings: Shift all padding tokens to the beginning of the sequence
-            if inputs_embeds.shape[1] > 1 and attention_mask is not None and (attention_mask == 0).any():
-                attention_mask, inputs_embeds, position_ids = shift_zeros_to_front(attention_mask, inputs_embeds, position_ids)
         attention_mask_raw = attention_mask
         if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:

 from torch.utils.checkpoint import checkpoint
+try:
+    from flash_attn import flash_attn_func, flash_attn_varlen_func
+    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
+    _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
+    from einops import rearrange, repeat, reduce, pack, unpack
+    from einops.layers.torch import Rearrange
+except ImportError:
+    pass
 if is_torch_fx_available():
             if has_mamba_state:
                 if hasattr(config, 'conv_dim'):
+                    conv_dim = config.conv_dim[i]
                 else:
                     conv_dim = intermediate_size
                 self.conv_states += [
         if self.config.rope:
             self._init_rope()
     def _init_rope(self):
         self.attn_mask = or_masks(attn_mask, register_mask)
+        self.block_mask = create_block_mask(self.attn_mask, B=None, H=None, Q_LEN=qk_length, KV_LEN=qk_length, _compile=True)
         self.flex_attention = torch.compile(flex_attention)
         num_ssm_param = 1
         if not hasattr(config, 'conv_dim'):
+            config.conv_dim = {i:0 for i in range(config.num_hidden_layers)}
         self.conv1d = nn.Conv1d(
             in_channels=self.intermediate_size,
             padding=self.conv_kernel_size - 1
             )
+        config.conv_dim[self.layer_idx] = self.intermediate_size
         self.x_proj = nn.ModuleList([nn.Linear(self.intermediate_size, self.time_step_rank + self.ssm_state_size * 2, bias=False) for _ in range(num_ssm_param)])
         self.dt_proj = nn.ModuleList([nn.Linear(self.time_step_rank, self.intermediate_size, bias=True) for _ in range(num_ssm_param)])
     def cuda_kernels_forward(self, hidden_states: torch.Tensor, cache_params: HybridMambaAttentionDynamicCache = None, attention_mask=None, position_ids=None, kv_last_layer=None, use_cache=False, use_swa=False):
         projected_states = self.in_proj(hidden_states).transpose(1, 2)  ## (bs, latent_dim, seq_len)
+        if (
+                self.training and cache_params is None and not self.apply_inner_layernorms
+        ):  # Doesn't support outputting the states -> used for training
+            contextualized_states = mamba_inner_fn(
+                projected_states,
+                self.conv1d.weight,
+                self.conv1d.bias if self.use_conv_bias else None,
+                self.x_proj.weight,
+                self.dt_proj.weight,
+                self.out_proj.weight,
+                self.out_proj.bias.float() if self.use_bias else None,
+                -torch.exp(self.A_log.float()),
+                None,  # input-dependent B
+                None,  # input-dependent C
+                self.D.float(),
+                delta_bias=self.dt_proj.bias.float(),
+                delta_softplus=True,
+            )
         else:
+            batch_size, seq_len, _ = hidden_states.shape
+            use_precomputed_states = (
+                cache_params is not None
+                and cache_params.has_previous_state
+                and seq_len == 1
+                and cache_params.conv_states[self.layer_idx].shape[0]
+                == cache_params.ssm_states[self.layer_idx].shape[0]
+                == batch_size
+                and use_cache
             )
+            hidden_states, gate = projected_states.tensor_split((self.latent_dim,), dim=1)
+            conv_weights = self.conv1d.weight.view(self.conv1d.weight.size(0), self.conv1d.weight.size(2))
+            if self.reuse_kv:
+                query_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size,), dim=1)
+                query_states = query_states.transpose(1,2)
+            else:
+                query_states, key_states, value_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size, self.attn_hidden_size + self.k_hidden_size, self.attn_hidden_size + self.k_hidden_size + self.v_hidden_size), dim=1)
+                query_states = query_states.transpose(1,2)
+                key_states = key_states.transpose(1,2)
+                value_states = value_states.transpose(1,2)
+            if use_precomputed_states:
+                hidden_states = causal_conv1d_update(
+                    hidden_states.squeeze(-1),
+                    cache_params.conv_states[self.layer_idx],
+                    conv_weights,
+                    self.conv1d.bias,
+                    self.activation,
+                )
+                hidden_states = hidden_states.unsqueeze(-1)
+                cache_params.mamba_past_length[self.layer_idx] += seq_len
+            else:
+                if cache_params is not None:
+                    conv_states = nn.functional.pad(
+                        hidden_states, (self.conv_kernel_size - hidden_states.shape[-1], 0)
+                    )
+                    cache_params.conv_states[self.layer_idx].copy_(conv_states)
+                    cache_params.mamba_past_length[self.layer_idx] += seq_len
+                hidden_states = causal_conv1d_fn(
+                    hidden_states, conv_weights, self.conv1d.bias, activation=self.activation
+                )
+            if self.reuse_kv:
+                assert kv_last_layer is not None
+                attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, kv_last_layer=kv_last_layer, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
+            else:
+                attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, key_states=key_states, value_states=value_states, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
+            ## Mamba head
+            index = 0
+            ssm_parameters = self.x_proj[index](hidden_states.transpose(1, 2))
+            time_step, B, C = torch.split(
+                ssm_parameters, [self.time_step_rank, self.ssm_state_size, self.ssm_state_size], dim=-1
             )
+            time_step, B, C = self._apply_layernorms(time_step, B, C)
+            if hasattr(self.dt_proj[index], "base_layer"):
+                time_proj_bias = self.dt_proj[index].base_layer.bias
+                self.dt_proj[index].base_layer.bias = None
             else:
+                time_proj_bias = self.dt_proj[index].bias
+                self.dt_proj[index].bias = None
+            discrete_time_step = self.dt_proj[index](time_step).transpose(1, 2)  # [batch, intermediate_size, seq_len]
+            if hasattr(self.dt_proj[index], "base_layer"):
+                self.dt_proj[index].base_layer.bias = time_proj_bias
+            else:
+                self.dt_proj[index].bias = time_proj_bias
+            A = -torch.exp(self.A_log[index].float())
+            time_proj_bias = time_proj_bias.float() if time_proj_bias is not None else None
+            if use_precomputed_states:
+                scan_outputs = selective_state_update(
+                    cache_params.ssm_states[self.layer_idx],
+                    hidden_states[..., 0],
+                    discrete_time_step[..., 0],
+                    A,
+                    B[:, 0],
+                    C[:, 0],
+                    self.D[index],
+                    gate[..., 0],
+                    time_proj_bias,
+                    dt_softplus=True,
+                ).unsqueeze(-1)
+            else:
+                outputs = selective_scan_fn(
+                    hidden_states,
+                    discrete_time_step,
+                    A,
+                    B.transpose(1, 2),
+                    C.transpose(1, 2),
+                    self.D[index].float(),
+                    z=gate,
+                    delta_bias=time_proj_bias,
+                    delta_softplus=True,
+                    return_last_state=True,
+                )
+                if len(outputs) == 3:
+                    scan_outputs, ssm_state, _ = outputs
+                else:
+                    scan_outputs, ssm_state = outputs
+                if ssm_state is not None and cache_params is not None:
+                    cache_params.ssm_states[self.layer_idx].copy_(ssm_state)
+            scan_outputs = scan_outputs.transpose(1, 2)
+            hidden_states = (self.pre_avg_layernorm1(attn_outputs) + self.pre_avg_layernorm2(scan_outputs)) / 2
+            contextualized_states = self.out_proj(hidden_states)
         return contextualized_states, attn_key_value
 HYMBA_INPUTS_DOCSTRING = r"""
     Args: To be added later. Please refer to the forward function.
 """
             if position_ids is not None and position_ids.shape[1] != inputs_embeds.shape[1]:
                 position_ids = torch.arange(inputs_embeds.shape[1], device=inputs_embeds.device).unsqueeze(0)
         attention_mask_raw = attention_mask
         if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
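
The two sides of this diff differ in whether `config.conv_dim` is keyed by `str(i)` or by the integer `i`. The sketch below shows why that choice can matter, assuming the config dictionary is serialized through JSON at some point (as Hugging Face configs typically are). JSON object keys are always strings, so integer keys do not survive a round trip. The names `num_hidden_layers = 3` and `lookup_conv_dim` are illustrative only and are not part of the model code.

```python
import json

num_hidden_layers = 3  # hypothetical small value for illustration

# Integer-keyed mapping, as built by `{i: 0 for i in range(...)}`.
conv_dim = {i: 0 for i in range(num_hidden_layers)}

# A JSON round trip turns every integer key into a string key.
restored = json.loads(json.dumps(conv_dim))
print(list(conv_dim.keys()))   # integer keys
print(list(restored.keys()))   # string keys


def lookup_conv_dim(mapping, layer_idx):
    """Hypothetical helper that tolerates both integer and string keys."""
    if layer_idx in mapping:
        return mapping[layer_idx]
    return mapping[str(layer_idx)]
```

With integer keys, `restored[0]` would raise `KeyError` after reloading, while a lookup that falls back to `str(layer_idx)` works for both the in-memory and the reloaded mapping.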

setup.sh DELETED

@@ -1,44 +0,0 @@
-#!/bin/bash
-# Prompt user to specify CUDA version
-read -p "Enter CUDA version (12.1 or 12.4): " cuda_version
-# Verify CUDA version input
-if [[ "$cuda_version" != "12.1" && "$cuda_version" != "12.4" ]]; then
-  echo "Invalid CUDA version specified. Please choose either 12.1 or 12.4."
-  exit 1
-fi
-# Install PyTorch with the specified CUDA version
-conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=$cuda_version -c pytorch -c nvidia
-# Install other packages
-pip install --upgrade transformers
-pip install tiktoken
-pip install sentencepiece
-pip install protobuf
-pip install ninja einops triton packaging
-# Clone and install Mamba
-git clone https://github.com/state-spaces/mamba.git
-cd mamba
-pip install -e .
-cd ..
-# Clone and install causal-conv1d with specified CUDA version
-git clone https://github.com/Dao-AILab/causal-conv1d.git
-cd causal-conv1d
-export CUDA_HOME=/usr/local/cuda-$cuda_version
-TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" python setup.py install
-cd ..
-# Clone and install attention-gym
-git clone https://github.com/pytorch-labs/attention-gym.git
-cd attention-gym
-pip install .
-cd ..
-# Install Flash Attention
-pip install flash_attn
-echo "Installation completed with CUDA $cuda_version."

special_tokens_map.json DELETED

@@ -1,30 +0,0 @@
-{
-  "bos_token": {
-    "content": "<s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "pad_token": {
-    "content": "[PAD]",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "<unk>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
-}

tokenizer.json DELETED

The diff for this file is too large to render.

tokenizer_config.json DELETED

@@ -1,52 +0,0 @@
-{
-  "add_bos_token": true,
-  "add_eos_token": false,
-  "add_prefix_space": true,
-  "added_tokens_decoder": {
-    "0": {
-      "content": "<unk>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "1": {
-      "content": "<s>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "2": {
-      "content": "</s>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "32000": {
-      "content": "[PAD]",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    }
-  },
-  "bos_token": "<s>",
-  "chat_template": "{{'<extra_id_0>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'\n' + message['content'].strip()}}{% if tools or contexts %}{{'\n'}}{% endif %}{% endif %}{% endfor %}{% if tools %}{% for tool in tools %}{{ '\n<tool> ' + tool|tojson + ' </tool>' }}{% endfor %}{% endif %}{% if contexts %}{% if tools %}{{'\n'}}{% endif %}{% for context in contexts %}{{ '\n<context> ' + context.strip() + ' </context>' }}{% endfor %}{% endif %}{{'\n\n'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '<extra_id_1>User\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<extra_id_1>Assistant\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'tool' %}{{ '<extra_id_1>Tool\n' + message['content'].strip() + '\n' }}{% endif %}{% endfor %}{%- if add_generation_prompt %}{{'<extra_id_1>Assistant\n'}}{%- endif %}",
-  "clean_up_tokenization_spaces": false,
-  "eos_token": "</s>",
-  "legacy": true,
-  "model_max_length": 1000000000000000019884624838656,
-  "pad_token": "[PAD]",
-  "padding_side": "left",
-  "sp_model_kwargs": {},
-  "spaces_between_special_tokens": false,
-  "tokenizer_class": "LlamaTokenizer",
-  "unk_token": "<unk>",
-  "use_default_system_prompt": false
-}
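
The `chat_template` in the deleted `tokenizer_config.json` defines the prompt layout: a `<extra_id_0>System` header, optional system text, a blank line, then one `<extra_id_1>` block per turn. The following plain-Python sketch reproduces that layout for the simple case without `tools` or `contexts`. It is a simplified re-implementation for illustration, not the Jinja template itself, and the names `ROLE_HEADERS` and `render_prompt` are hypothetical.

```python
# Turn headers taken from the chat_template string in the diff.
ROLE_HEADERS = {
    "user": "<extra_id_1>User\n",
    "assistant": "<extra_id_1>Assistant\n",
    "tool": "<extra_id_1>Tool\n",
}


def render_prompt(messages, add_generation_prompt=True):
    """Simplified rendering of the template; `tools` and `contexts` are not handled."""
    parts = ["<extra_id_0>System"]
    # System messages are emitted first, each on its own line.
    for message in messages:
        if message["role"] == "system":
            parts.append("\n" + message["content"].strip())
    parts.append("\n\n")
    # Then every user, assistant, and tool turn in order.
    for message in messages:
        header = ROLE_HEADERS.get(message["role"])
        if header is not None:
            parts.append(header + message["content"].strip() + "\n")
    if add_generation_prompt:
        parts.append("<extra_id_1>Assistant\n")
    return "".join(parts)


prompt = render_prompt(
    [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Who are you?"},
    ]
)
print(prompt)
```

Note that the system header is always emitted, even when no system message is present, so a conversation with only a user turn still starts with `<extra_id_0>System` followed by a blank line.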