Added chat_template
#1
by
shizhediao2
- opened
- README.md +62 -158
- added_tokens.json +0 -3
- config.json +14 -14
- generation_config.json +2 -1
- images/instruct_performance.png +0 -0
- images/performance1.png +0 -0
- images/performance2.png +0 -0
- instruct_performance.png +0 -0
- tokenizer.model → model-00001-of-00002.safetensors +2 -2
- model.safetensors → model-00002-of-00002.safetensors +2 -2
- model.safetensors.index.json +618 -0
- modeling_hymba.py +137 -177
- setup.sh +0 -44
- special_tokens_map.json +0 -30
- tokenizer.json +0 -0
- tokenizer_config.json +0 -52
README.md
CHANGED
|
@@ -1,201 +1,105 @@
|
|
| 1 |
---
|
| 2 |
-
|
| 3 |
-
- nvidia/Hymba-1.5B-Base
|
| 4 |
-
library_name: transformers
|
| 5 |
-
license: other
|
| 6 |
-
license_name: nvidia-open-model-license
|
| 7 |
-
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
|
| 8 |
-
pipeline_tag: text-generation
|
| 9 |
---
|
|
|
|
| 10 |
|
| 11 |
-
#
|
| 12 |
|
| 13 |
-
|
| 14 |
-
💾 <a href="https://github.com/NVlabs/hymba">Github</a>   |    📄 <a href="https://arxiv.org/abs/2411.13676">Paper</a> |    📜 <a href="https://developer.nvidia.com/blog/hymba-hybrid-head-architecture-boosts-small-language-model-performance/">Blog</a>  
|
| 15 |
-
</p>
|
| 16 |
|
| 17 |
-
## Model Overview
|
| 18 |
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
Hymba-1.5B-Instruct is capable of many complex and important tasks like math reasoning, function calling, and role playing.
|
| 23 |
-
|
| 24 |
-
This model is ready for commercial use.
|
| 25 |
-
|
| 26 |
-
**Model Developer:** NVIDIA
|
| 27 |
-
|
| 28 |
-
**Model Dates:** Hymba-1.5B-Instruct was trained between September 4, 2024 and November 10th, 2024.
|
| 29 |
-
|
| 30 |
-
**License:**
|
| 31 |
-
This model is released under the [NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf).
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
## Model Architecture
|
| 35 |
-
|
| 36 |
-
> ⚡️ We've released a minimal implementation of Hymba on GitHub to help developers understand and implement its design principles in their own models. Check it out! [barebones-hymba](https://github.com/NVlabs/hymba/tree/main/barebones_hymba).
|
| 37 |
-
>
|
| 38 |
-
|
| 39 |
-
Hymba-1.5B-Instruct has a model embedding size of 1600, 25 attention heads, and an MLP intermediate dimension of 5504, with 32 layers in total, 16 SSM states, 3 full attention layers, the rest are sliding window attention. Unlike the standard Transformer, each attention layer in Hymba has a hybrid combination of standard attention heads and Mamba heads in parallel. Additionally, it uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
|
| 40 |
-
|
| 41 |
-
Features of this architecture:
|
| 42 |
-
|
| 43 |
-
- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.
|
| 44 |
|
| 45 |
<div align="center">
|
| 46 |
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/module.png" alt="Hymba Module" width="600">
|
| 47 |
</div>
|
| 48 |
|
| 49 |
-
- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention
|
| 50 |
|
| 51 |
-
- Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency
|
| 52 |
|
| 53 |
<div align="center">
|
| 54 |
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
|
| 55 |
</div>
|
| 56 |
|
| 57 |
|
| 58 |
-
## Performance Highlights
|
| 59 |
-
|
| 60 |
-
- Hymba-1.5B-Instruct outperforms popular small language models and achieves the highest average performance across all tasks.
|
| 61 |
-
|
| 62 |
|
| 63 |
<div align="center">
|
| 64 |
-
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/
|
| 65 |
</div>
|
| 66 |
|
| 67 |
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
### Step 1: Environment Setup
|
| 71 |
-
|
| 72 |
-
Since Hymba-1.5B-Instruct employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on Pytorch2.5 and other related dependencies, we provide two ways to setup the environment:
|
| 73 |
|
| 74 |
-
- **[Local install]** Install the related packages using our provided `setup.sh` (support CUDA 12.1/12.4):
|
| 75 |
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
```
|
| 80 |
|
| 81 |
-
- **[Docker]** A docker image is provided with all of Hymba's dependencies installed. You can download our docker image and start a container using the following commands:
|
| 82 |
-
```
|
| 83 |
-
docker pull ghcr.io/tilmto/hymba:v1
|
| 84 |
-
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
|
| 85 |
-
```
|
| 86 |
|
|
|
|
| 87 |
|
| 88 |
-
|
| 89 |
-
After setting up the environment, you can use the following script to chat with our Model
|
| 90 |
|
| 91 |
-
|
| 92 |
-
from transformers import AutoModelForCausalLM, AutoTokenizer, StopStringCriteria, StoppingCriteriaList
|
| 93 |
-
import torch
|
| 94 |
|
| 95 |
-
|
| 96 |
-
repo_name = "nvidia/Hymba-1.5B-Instruct"
|
| 97 |
-
|
| 98 |
-
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
|
| 99 |
-
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True)
|
| 100 |
-
model = model.cuda().to(torch.bfloat16)
|
| 101 |
-
|
| 102 |
-
# Chat with Hymba
|
| 103 |
-
prompt = input()
|
| 104 |
-
|
| 105 |
-
messages = [
|
| 106 |
-
{"role": "system", "content": "You are a helpful assistant."}
|
| 107 |
-
]
|
| 108 |
-
messages.append({"role": "user", "content": prompt})
|
| 109 |
-
|
| 110 |
-
# Apply chat template
|
| 111 |
-
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to('cuda')
|
| 112 |
-
stopping_criteria = StoppingCriteriaList([StopStringCriteria(tokenizer=tokenizer, stop_strings="</s>")])
|
| 113 |
-
outputs = model.generate(
|
| 114 |
-
tokenized_chat,
|
| 115 |
-
max_new_tokens=256,
|
| 116 |
-
do_sample=False,
|
| 117 |
-
temperature=0.7,
|
| 118 |
-
use_cache=True,
|
| 119 |
-
stopping_criteria=stopping_criteria
|
| 120 |
-
)
|
| 121 |
-
input_length = tokenized_chat.shape[1]
|
| 122 |
-
response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
|
| 123 |
-
|
| 124 |
-
print(f"Model response: {response}")
|
| 125 |
|
|
|
|
|
|
|
|
|
|
| 126 |
```
|
| 127 |
|
| 128 |
-
|
| 129 |
|
| 130 |
```
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
<extra_id_1>User
|
| 135 |
-
<tool> ... </tool>
|
| 136 |
-
<context> ... </context>
|
| 137 |
-
{prompt}
|
| 138 |
-
<extra_id_1>Assistant
|
| 139 |
-
<toolcall> ... </toolcall>
|
| 140 |
-
<extra_id_1>Tool
|
| 141 |
-
{tool response}
|
| 142 |
-
<extra_id_1>Assistant\n
|
| 143 |
```
|
| 144 |
|
|
|
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
[LMFlow](https://github.com/OptimalScale/LMFlow) is a complete pipeline for fine-tuning large language models.
|
| 150 |
-
The following steps provide an example of how to fine-tune the `Hymba-1.5B-Base` models using LMFlow.
|
| 151 |
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
```
|
| 155 |
-
docker pull ghcr.io/tilmto/hymba:v1
|
| 156 |
-
docker run --gpus all -v /home/$USER:/home/$USER -it ghcr.io/tilmto/hymba:v1 bash
|
| 157 |
-
```
|
| 158 |
-
2. Install LMFlow
|
| 159 |
-
|
| 160 |
-
```
|
| 161 |
-
git clone https://github.com/OptimalScale/LMFlow.git
|
| 162 |
-
cd LMFlow
|
| 163 |
-
conda create -n lmflow python=3.9 -y
|
| 164 |
-
conda activate lmflow
|
| 165 |
-
conda install mpi4py
|
| 166 |
-
pip install -e .
|
| 167 |
-
```
|
| 168 |
-
|
| 169 |
-
3. Fine-tune the model using the following command.
|
| 170 |
-
|
| 171 |
-
```
|
| 172 |
-
cd LMFlow
|
| 173 |
-
bash ./scripts/run_finetune_hymba.sh
|
| 174 |
-
```
|
| 175 |
-
|
| 176 |
-
With LMFlow, you can also fine-tune the model on your custom dataset. The only thing you need to do is transform your dataset into the [LMFlow data format](https://optimalscale.github.io/LMFlow/examples/DATASETS.html).
|
| 177 |
-
In addition to full-finetuniing, you can also fine-tune hymba efficiently with [DoRA](https://arxiv.org/html/2402.09353v4), [LoRA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lora), [LISA](https://github.com/OptimalScale/LMFlow?tab=readme-ov-file#lisa), [Flash Attention](https://github.com/OptimalScale/LMFlow/blob/main/readme/flash_attn2.md), and other acceleration techniques.
|
| 178 |
-
For more details, please refer to the [LMFlow for Hymba](https://github.com/OptimalScale/LMFlow/tree/main/experimental/Hymba) documentation.
|
| 179 |
-
|
| 180 |
-
## Limitations
|
| 181 |
-
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
|
| 182 |
-
|
| 183 |
-
The testing suggests that this model is susceptible to jailbreak attacks. If using this model in a RAG or agentic setting, we recommend strong output validation controls to ensure security and safety risks from user-controlled model outputs are consistent with the intended use cases.
|
| 184 |
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
|
|
|
| 188 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 189 |
|
| 190 |
-
## Citation
|
| 191 |
-
```
|
| 192 |
-
@misc{dong2024hymbahybridheadarchitecturesmall,
|
| 193 |
-
title={Hymba: A Hybrid-head Architecture for Small Language Models},
|
| 194 |
-
author={Xin Dong and Yonggan Fu and Shizhe Diao and Wonmin Byeon and Zijia Chen and Ameya Sunil Mahabaleshwarkar and Shih-Yang Liu and Matthijs Van Keirsbilck and Min-Hung Chen and Yoshi Suhara and Yingyan Lin and Jan Kautz and Pavlo Molchanov},
|
| 195 |
-
year={2024},
|
| 196 |
-
eprint={2411.13676},
|
| 197 |
-
archivePrefix={arXiv},
|
| 198 |
-
primaryClass={cs.CL},
|
| 199 |
-
url={https://arxiv.org/abs/2411.13676},
|
| 200 |
-
}
|
| 201 |
```
|
|
|
|
| 1 |
---
|
| 2 |
+
{}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
---
|
| 4 |
+
# Hymba: A Hybrid-head Architecture for Small Language Models
|
| 5 |
|
| 6 |
+
[[Slide](https://docs.google.com/presentation/d/1uidqBfDy8a149yE1-AKtNnPm1qwa01hp8sOj3_KAoMI/edit#slide=id.g2f73b22dcb8_0_1017)][Technical Report] **!!! This huggingface repo is still under development.**
|
| 7 |
|
| 8 |
+
Developed by Deep Learning Efficiency Research (DLER) team at NVIDIA Research.
|
|
|
|
|
|
|
| 9 |
|
|
|
|
| 10 |
|
| 11 |
+
## Hymba: A Novel LM Architecture
|
| 12 |
+
- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
<div align="center">
|
| 15 |
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/module.png" alt="Hymba Module" width="600">
|
| 16 |
</div>
|
| 17 |
|
| 18 |
+
- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the burden of "forced-to-attend" in attention
|
| 19 |
|
| 20 |
+
- Integrate with cross-layer KV sharing and global-local attention to further boost memory and computation efficiency
|
| 21 |
|
| 22 |
<div align="center">
|
| 23 |
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
|
| 24 |
</div>
|
| 25 |
|
| 26 |
|
| 27 |
+
## Hymba: Performance Highlights
|
| 28 |
+
- [Hymba-1.5B-Base](https://huggingface.co/nvidia/Hymba-1.5B): Outperform all sub-2B public models, e.g., matching Llama 3.2 3B’s commonsense reasoning accuracy, being 3.49× faster, and reducing cache size by 11.7×
|
|
|
|
|
|
|
| 29 |
|
| 30 |
<div align="center">
|
| 31 |
+
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/performance1.png" alt="Compare with SoTA Small LMs" width="600">
|
| 32 |
</div>
|
| 33 |
|
| 34 |
|
| 35 |
+
- Hymba-1.5B-Instruct: Outperform SOTA small LMs.
|
|
|
|
|
|
|
|
|
|
|
|
|
| 36 |
|
|
|
|
| 37 |
|
| 38 |
+
<div align="center">
|
| 39 |
+
<img src="https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/images/instruct_performance.png" alt="Compare with SoTA Small LMs" width="600">
|
| 40 |
+
</div>
|
|
|
|
| 41 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 42 |
|
| 43 |
+
## Hymba-1.5B-Instruct: Model Usage
|
| 44 |
|
| 45 |
+
We release our Hymba-1.5B-Instruct model and offer the instructions to use our model as follows.
|
|
|
|
| 46 |
|
| 47 |
+
### Step 1: Environment Setup
|
|
|
|
|
|
|
| 48 |
|
| 49 |
+
Since our model employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on Pytorch2.5 and other related dependencies, we provide three ways to set up the environment:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
+
- **[Pip]** Install the related packages using our provided `requirement.txt`:
|
| 52 |
+
```
|
| 53 |
+
pip install -r https://huggingface.co/nvidia/Hymba-1.5B-Instruct/resolve/main/requirements.txt
|
| 54 |
```
|
| 55 |
|
| 56 |
+
- **[Docker]** We have prepared a docker image with all of Hymba's dependencies installed. You can download our docker image and start a container using the following commands:
|
| 57 |
|
| 58 |
```
|
| 59 |
+
wget http://10.137.9.244:8000/hymba_docker.tar
|
| 60 |
+
docker load -i hymba.tar
|
| 61 |
+
docker run --security-opt seccomp=unconfined --gpus all -v /home/$USER:/home/$USER -it hymba:v1 bash
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 62 |
```
|
| 63 |
|
| 64 |
+
- **[Internal Only]** If you are an internal user from NVIDIA and are using the ORD cluster, you can use our prepared `sqsh` file to apply for an interactive node:
|
| 65 |
|
| 66 |
+
```
|
| 67 |
+
srun -A nvr_lpr_llm --partition interactive --time 4:00:00 --gpus 8 --container-image /lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25.sqsh --container-mounts=$HOME:/home,/lustre:/lustre --pty bash
|
| 68 |
+
```
|
|
|
|
|
|
|
| 69 |
|
| 70 |
+
### Step 2: Chat with Hymba
|
| 71 |
+
After setting up the environment, you can use the following script to chat with our Model
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
|
| 73 |
+
```
|
| 74 |
+
from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer, AutoModel
|
| 75 |
+
from huggingface_hub import login
|
| 76 |
+
import torch
|
| 77 |
|
| 78 |
+
login()
|
| 79 |
+
|
| 80 |
+
# Load LLaMA2's tokenizer
|
| 81 |
+
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b")
|
| 82 |
+
|
| 83 |
+
# Load Hymba-1.5B
|
| 84 |
+
model = AutoModelForCausalLM.from_pretrained("nvidia/Hymba-1.5B-Instruct", trust_remote_code=True).cuda().to(torch.bfloat16)
|
| 85 |
+
|
| 86 |
+
# Chat with our model
|
| 87 |
+
def chat_with_model(prompt, model, tokenizer, max_length=64):
|
| 88 |
+
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
|
| 89 |
+
outputs = model.generate(inputs.input_ids, max_length=max_length, do_sample=False, temperature=0.7, use_cache=True)
|
| 90 |
+
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 91 |
+
return response
|
| 92 |
+
|
| 93 |
+
print("Chat with the model (type 'exit' to quit):")
|
| 94 |
+
while True:
|
| 95 |
+
print("User:")
|
| 96 |
+
prompt = input()
|
| 97 |
+
if prompt.lower() == "exit":
|
| 98 |
+
break
|
| 99 |
+
|
| 100 |
+
# Get the model's response
|
| 101 |
+
response = chat_with_model(prompt, model, tokenizer)
|
| 102 |
+
|
| 103 |
+
print(f"Model: {response}")
|
| 104 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
```
|
added_tokens.json
DELETED
|
@@ -1,3 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"[PAD]": 32000
|
| 3 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
config.json
CHANGED
|
@@ -15,6 +15,14 @@
|
|
| 15 |
"conv_dim": {
|
| 16 |
"0": 3200,
|
| 17 |
"1": 3200,
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 18 |
"10": 3200,
|
| 19 |
"11": 3200,
|
| 20 |
"12": 3200,
|
|
@@ -25,7 +33,6 @@
|
|
| 25 |
"17": 3200,
|
| 26 |
"18": 3200,
|
| 27 |
"19": 3200,
|
| 28 |
-
"2": 3200,
|
| 29 |
"20": 3200,
|
| 30 |
"21": 3200,
|
| 31 |
"22": 3200,
|
|
@@ -36,15 +43,8 @@
|
|
| 36 |
"27": 3200,
|
| 37 |
"28": 3200,
|
| 38 |
"29": 3200,
|
| 39 |
-
"3": 3200,
|
| 40 |
"30": 3200,
|
| 41 |
-
"31": 3200
|
| 42 |
-
"4": 3200,
|
| 43 |
-
"5": 3200,
|
| 44 |
-
"6": 3200,
|
| 45 |
-
"7": 3200,
|
| 46 |
-
"8": 3200,
|
| 47 |
-
"9": 3200
|
| 48 |
},
|
| 49 |
"eos_token_id": 2,
|
| 50 |
"global_attn_idx": [
|
|
@@ -160,7 +160,7 @@
|
|
| 160 |
"mamba_expand": 2,
|
| 161 |
"mamba_inner_layernorms": true,
|
| 162 |
"mamba_proj_bias": false,
|
| 163 |
-
"max_position_embeddings":
|
| 164 |
"memory_tokens_interspersed_every": 0,
|
| 165 |
"mlp_hidden_act": "silu",
|
| 166 |
"model_type": "hymba",
|
|
@@ -171,18 +171,18 @@
|
|
| 171 |
"num_key_value_heads": 5,
|
| 172 |
"num_mamba": 1,
|
| 173 |
"num_memory_tokens": 128,
|
| 174 |
-
"orig_max_position_embeddings":
|
| 175 |
"output_router_logits": false,
|
| 176 |
"pad_token_id": 0,
|
| 177 |
"rms_norm_eps": 1e-06,
|
| 178 |
"rope": true,
|
| 179 |
"rope_theta": 10000.0,
|
| 180 |
-
"rope_type":
|
| 181 |
"router_aux_loss_coef": 0.001,
|
| 182 |
-
"seq_length":
|
| 183 |
"sliding_window": 1024,
|
| 184 |
"tie_word_embeddings": true,
|
| 185 |
-
"torch_dtype": "
|
| 186 |
"transformers_version": "4.44.0",
|
| 187 |
"use_cache": false,
|
| 188 |
"use_mamba_kernels": true,
|
|
|
|
| 15 |
"conv_dim": {
|
| 16 |
"0": 3200,
|
| 17 |
"1": 3200,
|
| 18 |
+
"2": 3200,
|
| 19 |
+
"3": 3200,
|
| 20 |
+
"4": 3200,
|
| 21 |
+
"5": 3200,
|
| 22 |
+
"6": 3200,
|
| 23 |
+
"7": 3200,
|
| 24 |
+
"8": 3200,
|
| 25 |
+
"9": 3200,
|
| 26 |
"10": 3200,
|
| 27 |
"11": 3200,
|
| 28 |
"12": 3200,
|
|
|
|
| 33 |
"17": 3200,
|
| 34 |
"18": 3200,
|
| 35 |
"19": 3200,
|
|
|
|
| 36 |
"20": 3200,
|
| 37 |
"21": 3200,
|
| 38 |
"22": 3200,
|
|
|
|
| 43 |
"27": 3200,
|
| 44 |
"28": 3200,
|
| 45 |
"29": 3200,
|
|
|
|
| 46 |
"30": 3200,
|
| 47 |
+
"31": 3200
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
},
|
| 49 |
"eos_token_id": 2,
|
| 50 |
"global_attn_idx": [
|
|
|
|
| 160 |
"mamba_expand": 2,
|
| 161 |
"mamba_inner_layernorms": true,
|
| 162 |
"mamba_proj_bias": false,
|
| 163 |
+
"max_position_embeddings": 1024,
|
| 164 |
"memory_tokens_interspersed_every": 0,
|
| 165 |
"mlp_hidden_act": "silu",
|
| 166 |
"model_type": "hymba",
|
|
|
|
| 171 |
"num_key_value_heads": 5,
|
| 172 |
"num_mamba": 1,
|
| 173 |
"num_memory_tokens": 128,
|
| 174 |
+
"orig_max_position_embeddings": null,
|
| 175 |
"output_router_logits": false,
|
| 176 |
"pad_token_id": 0,
|
| 177 |
"rms_norm_eps": 1e-06,
|
| 178 |
"rope": true,
|
| 179 |
"rope_theta": 10000.0,
|
| 180 |
+
"rope_type": null,
|
| 181 |
"router_aux_loss_coef": 0.001,
|
| 182 |
+
"seq_length": 1024,
|
| 183 |
"sliding_window": 1024,
|
| 184 |
"tie_word_embeddings": true,
|
| 185 |
+
"torch_dtype": "float32",
|
| 186 |
"transformers_version": "4.44.0",
|
| 187 |
"use_cache": false,
|
| 188 |
"use_mamba_kernels": true,
|
generation_config.json
CHANGED
|
@@ -4,5 +4,6 @@
|
|
| 4 |
"eos_token_id": 2,
|
| 5 |
"pad_token_id": 0,
|
| 6 |
"transformers_version": "4.44.0",
|
| 7 |
-
"use_cache": false
|
|
|
|
| 8 |
}
|
|
|
|
| 4 |
"eos_token_id": 2,
|
| 5 |
"pad_token_id": 0,
|
| 6 |
"transformers_version": "4.44.0",
|
| 7 |
+
"use_cache": false,
|
| 8 |
+
"chat_template": "{{'<extra_id_0>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'\n' + message['content'].strip()}}{% if tools or contexts %}{{'\n'}}{% endif %}{% endif %}{% endfor %}{% if tools %}{% for tool in tools %}{{ '\n<tool> ' + tool|tojson + ' </tool>' }}{% endfor %}{% endif %}{% if contexts %}{% if tools %}{{'\n'}}{% endif %}{% for context in contexts %}{{ '\n<context> ' + context.strip() + ' </context>' }}{% endfor %}{% endif %}{{'\n\n'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '<extra_id_1>User\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<extra_id_1>Assistant\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'tool' %}{{ '<extra_id_1>Tool\n' + message['content'].strip() + '\n' }}{% endif %}{% endfor %}{%- if add_generation_prompt %}{{'<extra_id_1>Assistant\n'}}{%- endif %}"
|
| 9 |
}
|
images/instruct_performance.png
CHANGED
|
|
images/performance1.png
ADDED
|
images/performance2.png
ADDED
|
instruct_performance.png
DELETED
|
Binary file (97.9 kB)
|
|
|
tokenizer.model → model-00001-of-00002.safetensors
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:7f01b19a43514af19def4c812a1d453dfd66f5c1b0be9674090a5bf37b699fc1
|
| 3 |
+
size 4988876320
|
model.safetensors → model-00002-of-00002.safetensors
RENAMED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b11f9bec9246d8dc80612bb4e9d20f58b5744ca90ffae8944fffa0658789fde8
|
| 3 |
+
size 1102383712
|
model.safetensors.index.json
ADDED
|
@@ -0,0 +1,618 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"metadata": {
|
| 3 |
+
"total_size": 6091191296
|
| 4 |
+
},
|
| 5 |
+
"weight_map": {
|
| 6 |
+
"model.embed_tokens.weight": "model-00001-of-00002.safetensors",
|
| 7 |
+
"model.final_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 8 |
+
"model.layers.0.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 9 |
+
"model.layers.0.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 10 |
+
"model.layers.0.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 11 |
+
"model.layers.0.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 12 |
+
"model.layers.0.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 13 |
+
"model.layers.0.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 14 |
+
"model.layers.0.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 15 |
+
"model.layers.0.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 16 |
+
"model.layers.0.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 17 |
+
"model.layers.0.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 18 |
+
"model.layers.0.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 19 |
+
"model.layers.0.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 20 |
+
"model.layers.0.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 21 |
+
"model.layers.0.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 22 |
+
"model.layers.0.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 23 |
+
"model.layers.0.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 24 |
+
"model.layers.0.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 25 |
+
"model.layers.0.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 26 |
+
"model.layers.0.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 27 |
+
"model.layers.1.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 28 |
+
"model.layers.1.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 29 |
+
"model.layers.1.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 30 |
+
"model.layers.1.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 31 |
+
"model.layers.1.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 32 |
+
"model.layers.1.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 33 |
+
"model.layers.1.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 34 |
+
"model.layers.1.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 35 |
+
"model.layers.1.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 36 |
+
"model.layers.1.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 37 |
+
"model.layers.1.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 38 |
+
"model.layers.1.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 39 |
+
"model.layers.1.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 40 |
+
"model.layers.1.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 41 |
+
"model.layers.1.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 42 |
+
"model.layers.1.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 43 |
+
"model.layers.1.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 44 |
+
"model.layers.1.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 45 |
+
"model.layers.1.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 46 |
+
"model.layers.10.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 47 |
+
"model.layers.10.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 48 |
+
"model.layers.10.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 49 |
+
"model.layers.10.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 50 |
+
"model.layers.10.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 51 |
+
"model.layers.10.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 52 |
+
"model.layers.10.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 53 |
+
"model.layers.10.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 54 |
+
"model.layers.10.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 55 |
+
"model.layers.10.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 56 |
+
"model.layers.10.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 57 |
+
"model.layers.10.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 58 |
+
"model.layers.10.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 59 |
+
"model.layers.10.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 60 |
+
"model.layers.10.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 61 |
+
"model.layers.10.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 62 |
+
"model.layers.10.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 63 |
+
"model.layers.10.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 64 |
+
"model.layers.10.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 65 |
+
"model.layers.11.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 66 |
+
"model.layers.11.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 67 |
+
"model.layers.11.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 68 |
+
"model.layers.11.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 69 |
+
"model.layers.11.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 70 |
+
"model.layers.11.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 71 |
+
"model.layers.11.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 72 |
+
"model.layers.11.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 73 |
+
"model.layers.11.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 74 |
+
"model.layers.11.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 75 |
+
"model.layers.11.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 76 |
+
"model.layers.11.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 77 |
+
"model.layers.11.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 78 |
+
"model.layers.11.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 79 |
+
"model.layers.11.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 80 |
+
"model.layers.11.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 81 |
+
"model.layers.11.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 82 |
+
"model.layers.11.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 83 |
+
"model.layers.11.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 84 |
+
"model.layers.12.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 85 |
+
"model.layers.12.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 86 |
+
"model.layers.12.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 87 |
+
"model.layers.12.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 88 |
+
"model.layers.12.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 89 |
+
"model.layers.12.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 90 |
+
"model.layers.12.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 91 |
+
"model.layers.12.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 92 |
+
"model.layers.12.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 93 |
+
"model.layers.12.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 94 |
+
"model.layers.12.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 95 |
+
"model.layers.12.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 96 |
+
"model.layers.12.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 97 |
+
"model.layers.12.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 98 |
+
"model.layers.12.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 99 |
+
"model.layers.12.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 100 |
+
"model.layers.12.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 101 |
+
"model.layers.12.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 102 |
+
"model.layers.12.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 103 |
+
"model.layers.13.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 104 |
+
"model.layers.13.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 105 |
+
"model.layers.13.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 106 |
+
"model.layers.13.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 107 |
+
"model.layers.13.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 108 |
+
"model.layers.13.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 109 |
+
"model.layers.13.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 110 |
+
"model.layers.13.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 111 |
+
"model.layers.13.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 112 |
+
"model.layers.13.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 113 |
+
"model.layers.13.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 114 |
+
"model.layers.13.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 115 |
+
"model.layers.13.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 116 |
+
"model.layers.13.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 117 |
+
"model.layers.13.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 118 |
+
"model.layers.13.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 119 |
+
"model.layers.13.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 120 |
+
"model.layers.13.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 121 |
+
"model.layers.13.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 122 |
+
"model.layers.14.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 123 |
+
"model.layers.14.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 124 |
+
"model.layers.14.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 125 |
+
"model.layers.14.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 126 |
+
"model.layers.14.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 127 |
+
"model.layers.14.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 128 |
+
"model.layers.14.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 129 |
+
"model.layers.14.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 130 |
+
"model.layers.14.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 131 |
+
"model.layers.14.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 132 |
+
"model.layers.14.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 133 |
+
"model.layers.14.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 134 |
+
"model.layers.14.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 135 |
+
"model.layers.14.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 136 |
+
"model.layers.14.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 137 |
+
"model.layers.14.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 138 |
+
"model.layers.14.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 139 |
+
"model.layers.14.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 140 |
+
"model.layers.14.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 141 |
+
"model.layers.15.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 142 |
+
"model.layers.15.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 143 |
+
"model.layers.15.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 144 |
+
"model.layers.15.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 145 |
+
"model.layers.15.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 146 |
+
"model.layers.15.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 147 |
+
"model.layers.15.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 148 |
+
"model.layers.15.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 149 |
+
"model.layers.15.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 150 |
+
"model.layers.15.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 151 |
+
"model.layers.15.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 152 |
+
"model.layers.15.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 153 |
+
"model.layers.15.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 154 |
+
"model.layers.15.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 155 |
+
"model.layers.15.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 156 |
+
"model.layers.15.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 157 |
+
"model.layers.15.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 158 |
+
"model.layers.15.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 159 |
+
"model.layers.15.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 160 |
+
"model.layers.16.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 161 |
+
"model.layers.16.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 162 |
+
"model.layers.16.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 163 |
+
"model.layers.16.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 164 |
+
"model.layers.16.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 165 |
+
"model.layers.16.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 166 |
+
"model.layers.16.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 167 |
+
"model.layers.16.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 168 |
+
"model.layers.16.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 169 |
+
"model.layers.16.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 170 |
+
"model.layers.16.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 171 |
+
"model.layers.16.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 172 |
+
"model.layers.16.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 173 |
+
"model.layers.16.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 174 |
+
"model.layers.16.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 175 |
+
"model.layers.16.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 176 |
+
"model.layers.16.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 177 |
+
"model.layers.16.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 178 |
+
"model.layers.16.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 179 |
+
"model.layers.17.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 180 |
+
"model.layers.17.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 181 |
+
"model.layers.17.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 182 |
+
"model.layers.17.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 183 |
+
"model.layers.17.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 184 |
+
"model.layers.17.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 185 |
+
"model.layers.17.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 186 |
+
"model.layers.17.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 187 |
+
"model.layers.17.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 188 |
+
"model.layers.17.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 189 |
+
"model.layers.17.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 190 |
+
"model.layers.17.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 191 |
+
"model.layers.17.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 192 |
+
"model.layers.17.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 193 |
+
"model.layers.17.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 194 |
+
"model.layers.17.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 195 |
+
"model.layers.17.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 196 |
+
"model.layers.17.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 197 |
+
"model.layers.17.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 198 |
+
"model.layers.18.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 199 |
+
"model.layers.18.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 200 |
+
"model.layers.18.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 201 |
+
"model.layers.18.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 202 |
+
"model.layers.18.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 203 |
+
"model.layers.18.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 204 |
+
"model.layers.18.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 205 |
+
"model.layers.18.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 206 |
+
"model.layers.18.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 207 |
+
"model.layers.18.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 208 |
+
"model.layers.18.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 209 |
+
"model.layers.18.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 210 |
+
"model.layers.18.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 211 |
+
"model.layers.18.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 212 |
+
"model.layers.18.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 213 |
+
"model.layers.18.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 214 |
+
"model.layers.18.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 215 |
+
"model.layers.18.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 216 |
+
"model.layers.18.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 217 |
+
"model.layers.19.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 218 |
+
"model.layers.19.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 219 |
+
"model.layers.19.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 220 |
+
"model.layers.19.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 221 |
+
"model.layers.19.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 222 |
+
"model.layers.19.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 223 |
+
"model.layers.19.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 224 |
+
"model.layers.19.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 225 |
+
"model.layers.19.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 226 |
+
"model.layers.19.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 227 |
+
"model.layers.19.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 228 |
+
"model.layers.19.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 229 |
+
"model.layers.19.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 230 |
+
"model.layers.19.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 231 |
+
"model.layers.19.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 232 |
+
"model.layers.19.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 233 |
+
"model.layers.19.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 234 |
+
"model.layers.19.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 235 |
+
"model.layers.19.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 236 |
+
"model.layers.2.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 237 |
+
"model.layers.2.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 238 |
+
"model.layers.2.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 239 |
+
"model.layers.2.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 240 |
+
"model.layers.2.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 241 |
+
"model.layers.2.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 242 |
+
"model.layers.2.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 243 |
+
"model.layers.2.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 244 |
+
"model.layers.2.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 245 |
+
"model.layers.2.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 246 |
+
"model.layers.2.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 247 |
+
"model.layers.2.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 248 |
+
"model.layers.2.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 249 |
+
"model.layers.2.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 250 |
+
"model.layers.2.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 251 |
+
"model.layers.2.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 252 |
+
"model.layers.2.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 253 |
+
"model.layers.2.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 254 |
+
"model.layers.2.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 255 |
+
"model.layers.20.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 256 |
+
"model.layers.20.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 257 |
+
"model.layers.20.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 258 |
+
"model.layers.20.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 259 |
+
"model.layers.20.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 260 |
+
"model.layers.20.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 261 |
+
"model.layers.20.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 262 |
+
"model.layers.20.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 263 |
+
"model.layers.20.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 264 |
+
"model.layers.20.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 265 |
+
"model.layers.20.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 266 |
+
"model.layers.20.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 267 |
+
"model.layers.20.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 268 |
+
"model.layers.20.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 269 |
+
"model.layers.20.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 270 |
+
"model.layers.20.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 271 |
+
"model.layers.20.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 272 |
+
"model.layers.20.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 273 |
+
"model.layers.20.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 274 |
+
"model.layers.21.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 275 |
+
"model.layers.21.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 276 |
+
"model.layers.21.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 277 |
+
"model.layers.21.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 278 |
+
"model.layers.21.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 279 |
+
"model.layers.21.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 280 |
+
"model.layers.21.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 281 |
+
"model.layers.21.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 282 |
+
"model.layers.21.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 283 |
+
"model.layers.21.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 284 |
+
"model.layers.21.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 285 |
+
"model.layers.21.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 286 |
+
"model.layers.21.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 287 |
+
"model.layers.21.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 288 |
+
"model.layers.21.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 289 |
+
"model.layers.21.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 290 |
+
"model.layers.21.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 291 |
+
"model.layers.21.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 292 |
+
"model.layers.21.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 293 |
+
"model.layers.22.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 294 |
+
"model.layers.22.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 295 |
+
"model.layers.22.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 296 |
+
"model.layers.22.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 297 |
+
"model.layers.22.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 298 |
+
"model.layers.22.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 299 |
+
"model.layers.22.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 300 |
+
"model.layers.22.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 301 |
+
"model.layers.22.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 302 |
+
"model.layers.22.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 303 |
+
"model.layers.22.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 304 |
+
"model.layers.22.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 305 |
+
"model.layers.22.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 306 |
+
"model.layers.22.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 307 |
+
"model.layers.22.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 308 |
+
"model.layers.22.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 309 |
+
"model.layers.22.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 310 |
+
"model.layers.22.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 311 |
+
"model.layers.22.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 312 |
+
"model.layers.23.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 313 |
+
"model.layers.23.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 314 |
+
"model.layers.23.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 315 |
+
"model.layers.23.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 316 |
+
"model.layers.23.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 317 |
+
"model.layers.23.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 318 |
+
"model.layers.23.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 319 |
+
"model.layers.23.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 320 |
+
"model.layers.23.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 321 |
+
"model.layers.23.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 322 |
+
"model.layers.23.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 323 |
+
"model.layers.23.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 324 |
+
"model.layers.23.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 325 |
+
"model.layers.23.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 326 |
+
"model.layers.23.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 327 |
+
"model.layers.23.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 328 |
+
"model.layers.23.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 329 |
+
"model.layers.23.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 330 |
+
"model.layers.23.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 331 |
+
"model.layers.24.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 332 |
+
"model.layers.24.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 333 |
+
"model.layers.24.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 334 |
+
"model.layers.24.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 335 |
+
"model.layers.24.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 336 |
+
"model.layers.24.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 337 |
+
"model.layers.24.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 338 |
+
"model.layers.24.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 339 |
+
"model.layers.24.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 340 |
+
"model.layers.24.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 341 |
+
"model.layers.24.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 342 |
+
"model.layers.24.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 343 |
+
"model.layers.24.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 344 |
+
"model.layers.24.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 345 |
+
"model.layers.24.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 346 |
+
"model.layers.24.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 347 |
+
"model.layers.24.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 348 |
+
"model.layers.24.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 349 |
+
"model.layers.24.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 350 |
+
"model.layers.25.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 351 |
+
"model.layers.25.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 352 |
+
"model.layers.25.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 353 |
+
"model.layers.25.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 354 |
+
"model.layers.25.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 355 |
+
"model.layers.25.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 356 |
+
"model.layers.25.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 357 |
+
"model.layers.25.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 358 |
+
"model.layers.25.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 359 |
+
"model.layers.25.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 360 |
+
"model.layers.25.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 361 |
+
"model.layers.25.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 362 |
+
"model.layers.25.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 363 |
+
"model.layers.25.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 364 |
+
"model.layers.25.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 365 |
+
"model.layers.25.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 366 |
+
"model.layers.25.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 367 |
+
"model.layers.25.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 368 |
+
"model.layers.25.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 369 |
+
"model.layers.26.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 370 |
+
"model.layers.26.mamba.A_log.0": "model-00002-of-00002.safetensors",
|
| 371 |
+
"model.layers.26.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 372 |
+
"model.layers.26.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 373 |
+
"model.layers.26.mamba.D.0": "model-00002-of-00002.safetensors",
|
| 374 |
+
"model.layers.26.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
|
| 375 |
+
"model.layers.26.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
|
| 376 |
+
"model.layers.26.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 377 |
+
"model.layers.26.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
|
| 378 |
+
"model.layers.26.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 379 |
+
"model.layers.26.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
|
| 380 |
+
"model.layers.26.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
|
| 381 |
+
"model.layers.26.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 382 |
+
"model.layers.26.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 383 |
+
"model.layers.26.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 384 |
+
"model.layers.26.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
|
| 385 |
+
"model.layers.26.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
|
| 386 |
+
"model.layers.26.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
|
| 387 |
+
"model.layers.26.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 388 |
+
"model.layers.27.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 389 |
+
"model.layers.27.mamba.A_log.0": "model-00002-of-00002.safetensors",
|
| 390 |
+
"model.layers.27.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 391 |
+
"model.layers.27.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 392 |
+
"model.layers.27.mamba.D.0": "model-00002-of-00002.safetensors",
|
| 393 |
+
"model.layers.27.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
|
| 394 |
+
"model.layers.27.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
|
| 395 |
+
"model.layers.27.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 396 |
+
"model.layers.27.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
|
| 397 |
+
"model.layers.27.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 398 |
+
"model.layers.27.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
|
| 399 |
+
"model.layers.27.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
|
| 400 |
+
"model.layers.27.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
|
| 401 |
+
"model.layers.27.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
|
| 402 |
+
"model.layers.27.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 403 |
+
"model.layers.27.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
|
| 404 |
+
"model.layers.27.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
|
| 405 |
+
"model.layers.27.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
|
| 406 |
+
"model.layers.27.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 407 |
+
"model.layers.28.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 408 |
+
"model.layers.28.mamba.A_log.0": "model-00002-of-00002.safetensors",
|
| 409 |
+
"model.layers.28.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 410 |
+
"model.layers.28.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 411 |
+
"model.layers.28.mamba.D.0": "model-00002-of-00002.safetensors",
|
| 412 |
+
"model.layers.28.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
|
| 413 |
+
"model.layers.28.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
|
| 414 |
+
"model.layers.28.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 415 |
+
"model.layers.28.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
|
| 416 |
+
"model.layers.28.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 417 |
+
"model.layers.28.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
|
| 418 |
+
"model.layers.28.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
|
| 419 |
+
"model.layers.28.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
|
| 420 |
+
"model.layers.28.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
|
| 421 |
+
"model.layers.28.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 422 |
+
"model.layers.28.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
|
| 423 |
+
"model.layers.28.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
|
| 424 |
+
"model.layers.28.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
|
| 425 |
+
"model.layers.28.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 426 |
+
"model.layers.29.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 427 |
+
"model.layers.29.mamba.A_log.0": "model-00002-of-00002.safetensors",
|
| 428 |
+
"model.layers.29.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 429 |
+
"model.layers.29.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 430 |
+
"model.layers.29.mamba.D.0": "model-00002-of-00002.safetensors",
|
| 431 |
+
"model.layers.29.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
|
| 432 |
+
"model.layers.29.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
|
| 433 |
+
"model.layers.29.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 434 |
+
"model.layers.29.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
|
| 435 |
+
"model.layers.29.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 436 |
+
"model.layers.29.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
|
| 437 |
+
"model.layers.29.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
|
| 438 |
+
"model.layers.29.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
|
| 439 |
+
"model.layers.29.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
|
| 440 |
+
"model.layers.29.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 441 |
+
"model.layers.29.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
|
| 442 |
+
"model.layers.29.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
|
| 443 |
+
"model.layers.29.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
|
| 444 |
+
"model.layers.29.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 445 |
+
"model.layers.3.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 446 |
+
"model.layers.3.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 447 |
+
"model.layers.3.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 448 |
+
"model.layers.3.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 449 |
+
"model.layers.3.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 450 |
+
"model.layers.3.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 451 |
+
"model.layers.3.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 452 |
+
"model.layers.3.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 453 |
+
"model.layers.3.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 454 |
+
"model.layers.3.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 455 |
+
"model.layers.3.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 456 |
+
"model.layers.3.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 457 |
+
"model.layers.3.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 458 |
+
"model.layers.3.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 459 |
+
"model.layers.3.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 460 |
+
"model.layers.3.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 461 |
+
"model.layers.3.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 462 |
+
"model.layers.3.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 463 |
+
"model.layers.3.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 464 |
+
"model.layers.30.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 465 |
+
"model.layers.30.mamba.A_log.0": "model-00002-of-00002.safetensors",
|
| 466 |
+
"model.layers.30.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 467 |
+
"model.layers.30.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 468 |
+
"model.layers.30.mamba.D.0": "model-00002-of-00002.safetensors",
|
| 469 |
+
"model.layers.30.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
|
| 470 |
+
"model.layers.30.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
|
| 471 |
+
"model.layers.30.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 472 |
+
"model.layers.30.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
|
| 473 |
+
"model.layers.30.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 474 |
+
"model.layers.30.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
|
| 475 |
+
"model.layers.30.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
|
| 476 |
+
"model.layers.30.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
|
| 477 |
+
"model.layers.30.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
|
| 478 |
+
"model.layers.30.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 479 |
+
"model.layers.30.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
|
| 480 |
+
"model.layers.30.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
|
| 481 |
+
"model.layers.30.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
|
| 482 |
+
"model.layers.30.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 483 |
+
"model.layers.31.input_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 484 |
+
"model.layers.31.mamba.A_log.0": "model-00002-of-00002.safetensors",
|
| 485 |
+
"model.layers.31.mamba.B_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 486 |
+
"model.layers.31.mamba.C_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 487 |
+
"model.layers.31.mamba.D.0": "model-00002-of-00002.safetensors",
|
| 488 |
+
"model.layers.31.mamba.conv1d.bias": "model-00002-of-00002.safetensors",
|
| 489 |
+
"model.layers.31.mamba.conv1d.weight": "model-00002-of-00002.safetensors",
|
| 490 |
+
"model.layers.31.mamba.dt_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 491 |
+
"model.layers.31.mamba.dt_proj.0.bias": "model-00002-of-00002.safetensors",
|
| 492 |
+
"model.layers.31.mamba.dt_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 493 |
+
"model.layers.31.mamba.in_proj.weight": "model-00002-of-00002.safetensors",
|
| 494 |
+
"model.layers.31.mamba.out_proj.weight": "model-00002-of-00002.safetensors",
|
| 495 |
+
"model.layers.31.mamba.pre_avg_layernorm1.weight": "model-00002-of-00002.safetensors",
|
| 496 |
+
"model.layers.31.mamba.pre_avg_layernorm2.weight": "model-00002-of-00002.safetensors",
|
| 497 |
+
"model.layers.31.mamba.x_proj.0.weight": "model-00002-of-00002.safetensors",
|
| 498 |
+
"model.layers.31.moe.experts.0.down_proj.weight": "model-00002-of-00002.safetensors",
|
| 499 |
+
"model.layers.31.moe.experts.0.gate_proj.weight": "model-00002-of-00002.safetensors",
|
| 500 |
+
"model.layers.31.moe.experts.0.up_proj.weight": "model-00002-of-00002.safetensors",
|
| 501 |
+
"model.layers.31.pre_moe_layernorm.weight": "model-00002-of-00002.safetensors",
|
| 502 |
+
"model.layers.4.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 503 |
+
"model.layers.4.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 504 |
+
"model.layers.4.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 505 |
+
"model.layers.4.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 506 |
+
"model.layers.4.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 507 |
+
"model.layers.4.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 508 |
+
"model.layers.4.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 509 |
+
"model.layers.4.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 510 |
+
"model.layers.4.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 511 |
+
"model.layers.4.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 512 |
+
"model.layers.4.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 513 |
+
"model.layers.4.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 514 |
+
"model.layers.4.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 515 |
+
"model.layers.4.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 516 |
+
"model.layers.4.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 517 |
+
"model.layers.4.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 518 |
+
"model.layers.4.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 519 |
+
"model.layers.4.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 520 |
+
"model.layers.4.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 521 |
+
"model.layers.5.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 522 |
+
"model.layers.5.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 523 |
+
"model.layers.5.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 524 |
+
"model.layers.5.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 525 |
+
"model.layers.5.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 526 |
+
"model.layers.5.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 527 |
+
"model.layers.5.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 528 |
+
"model.layers.5.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 529 |
+
"model.layers.5.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 530 |
+
"model.layers.5.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 531 |
+
"model.layers.5.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 532 |
+
"model.layers.5.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 533 |
+
"model.layers.5.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 534 |
+
"model.layers.5.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 535 |
+
"model.layers.5.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 536 |
+
"model.layers.5.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 537 |
+
"model.layers.5.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 538 |
+
"model.layers.5.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 539 |
+
"model.layers.5.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 540 |
+
"model.layers.6.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 541 |
+
"model.layers.6.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 542 |
+
"model.layers.6.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 543 |
+
"model.layers.6.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 544 |
+
"model.layers.6.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 545 |
+
"model.layers.6.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 546 |
+
"model.layers.6.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 547 |
+
"model.layers.6.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 548 |
+
"model.layers.6.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 549 |
+
"model.layers.6.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 550 |
+
"model.layers.6.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 551 |
+
"model.layers.6.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 552 |
+
"model.layers.6.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 553 |
+
"model.layers.6.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 554 |
+
"model.layers.6.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 555 |
+
"model.layers.6.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 556 |
+
"model.layers.6.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 557 |
+
"model.layers.6.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 558 |
+
"model.layers.6.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 559 |
+
"model.layers.7.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 560 |
+
"model.layers.7.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 561 |
+
"model.layers.7.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 562 |
+
"model.layers.7.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 563 |
+
"model.layers.7.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 564 |
+
"model.layers.7.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 565 |
+
"model.layers.7.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 566 |
+
"model.layers.7.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 567 |
+
"model.layers.7.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 568 |
+
"model.layers.7.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 569 |
+
"model.layers.7.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 570 |
+
"model.layers.7.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 571 |
+
"model.layers.7.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 572 |
+
"model.layers.7.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 573 |
+
"model.layers.7.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 574 |
+
"model.layers.7.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 575 |
+
"model.layers.7.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 576 |
+
"model.layers.7.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 577 |
+
"model.layers.7.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 578 |
+
"model.layers.8.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 579 |
+
"model.layers.8.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 580 |
+
"model.layers.8.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 581 |
+
"model.layers.8.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 582 |
+
"model.layers.8.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 583 |
+
"model.layers.8.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 584 |
+
"model.layers.8.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 585 |
+
"model.layers.8.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 586 |
+
"model.layers.8.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 587 |
+
"model.layers.8.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 588 |
+
"model.layers.8.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 589 |
+
"model.layers.8.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 590 |
+
"model.layers.8.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 591 |
+
"model.layers.8.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 592 |
+
"model.layers.8.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 593 |
+
"model.layers.8.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 594 |
+
"model.layers.8.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 595 |
+
"model.layers.8.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 596 |
+
"model.layers.8.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 597 |
+
"model.layers.9.input_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 598 |
+
"model.layers.9.mamba.A_log.0": "model-00001-of-00002.safetensors",
|
| 599 |
+
"model.layers.9.mamba.B_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 600 |
+
"model.layers.9.mamba.C_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 601 |
+
"model.layers.9.mamba.D.0": "model-00001-of-00002.safetensors",
|
| 602 |
+
"model.layers.9.mamba.conv1d.bias": "model-00001-of-00002.safetensors",
|
| 603 |
+
"model.layers.9.mamba.conv1d.weight": "model-00001-of-00002.safetensors",
|
| 604 |
+
"model.layers.9.mamba.dt_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 605 |
+
"model.layers.9.mamba.dt_proj.0.bias": "model-00001-of-00002.safetensors",
|
| 606 |
+
"model.layers.9.mamba.dt_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 607 |
+
"model.layers.9.mamba.in_proj.weight": "model-00001-of-00002.safetensors",
|
| 608 |
+
"model.layers.9.mamba.out_proj.weight": "model-00001-of-00002.safetensors",
|
| 609 |
+
"model.layers.9.mamba.pre_avg_layernorm1.weight": "model-00001-of-00002.safetensors",
|
| 610 |
+
"model.layers.9.mamba.pre_avg_layernorm2.weight": "model-00001-of-00002.safetensors",
|
| 611 |
+
"model.layers.9.mamba.x_proj.0.weight": "model-00001-of-00002.safetensors",
|
| 612 |
+
"model.layers.9.moe.experts.0.down_proj.weight": "model-00001-of-00002.safetensors",
|
| 613 |
+
"model.layers.9.moe.experts.0.gate_proj.weight": "model-00001-of-00002.safetensors",
|
| 614 |
+
"model.layers.9.moe.experts.0.up_proj.weight": "model-00001-of-00002.safetensors",
|
| 615 |
+
"model.layers.9.pre_moe_layernorm.weight": "model-00001-of-00002.safetensors",
|
| 616 |
+
"model.memory_tokens": "model-00001-of-00002.safetensors"
|
| 617 |
+
}
|
| 618 |
+
}
|
modeling_hymba.py
CHANGED
|
@@ -39,13 +39,16 @@ from .configuration_hymba import HymbaConfig
|
|
| 39 |
from torch.utils.checkpoint import checkpoint
|
| 40 |
|
| 41 |
|
| 42 |
-
|
| 43 |
-
from flash_attn
|
|
|
|
| 44 |
|
| 45 |
-
_flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
|
| 46 |
|
| 47 |
-
from einops import rearrange, repeat, reduce, pack, unpack
|
| 48 |
-
from einops.layers.torch import Rearrange
|
|
|
|
|
|
|
| 49 |
|
| 50 |
|
| 51 |
if is_torch_fx_available():
|
|
@@ -396,7 +399,7 @@ class HybridMambaAttentionDynamicCache(DynamicCache):
|
|
| 396 |
|
| 397 |
if has_mamba_state:
|
| 398 |
if hasattr(config, 'conv_dim'):
|
| 399 |
-
conv_dim = config.conv_dim[
|
| 400 |
else:
|
| 401 |
conv_dim = intermediate_size
|
| 402 |
self.conv_states += [
|
|
@@ -543,14 +546,6 @@ class HymbaAttention(nn.Module):
|
|
| 543 |
|
| 544 |
if self.config.rope:
|
| 545 |
self._init_rope()
|
| 546 |
-
|
| 547 |
-
|
| 548 |
-
def set_rope(self, rope_type, orig_max_position_embeddings, max_position_embeddings):
|
| 549 |
-
self.config.rope_type = rope_type
|
| 550 |
-
self.config.orig_max_position_embeddings = orig_max_position_embeddings
|
| 551 |
-
self.config.max_position_embeddings = max_position_embeddings
|
| 552 |
-
|
| 553 |
-
self._init_rope()
|
| 554 |
|
| 555 |
|
| 556 |
def _init_rope(self):
|
|
@@ -1233,7 +1228,7 @@ class HymbaFlexAttention(HymbaFlashAttention2):
|
|
| 1233 |
|
| 1234 |
self.attn_mask = or_masks(attn_mask, register_mask)
|
| 1235 |
|
| 1236 |
-
self.block_mask = create_block_mask(self.attn_mask, B=None, H=None, Q_LEN=qk_length, KV_LEN=qk_length)
|
| 1237 |
|
| 1238 |
self.flex_attention = torch.compile(flex_attention)
|
| 1239 |
|
|
@@ -1523,7 +1518,7 @@ class HymbaBlock(nn.Module):
|
|
| 1523 |
num_ssm_param = 1
|
| 1524 |
|
| 1525 |
if not hasattr(config, 'conv_dim'):
|
| 1526 |
-
config.conv_dim = {
|
| 1527 |
|
| 1528 |
self.conv1d = nn.Conv1d(
|
| 1529 |
in_channels=self.intermediate_size,
|
|
@@ -1534,7 +1529,7 @@ class HymbaBlock(nn.Module):
|
|
| 1534 |
padding=self.conv_kernel_size - 1
|
| 1535 |
)
|
| 1536 |
|
| 1537 |
-
config.conv_dim[
|
| 1538 |
|
| 1539 |
self.x_proj = nn.ModuleList([nn.Linear(self.intermediate_size, self.time_step_rank + self.ssm_state_size * 2, bias=False) for _ in range(num_ssm_param)])
|
| 1540 |
self.dt_proj = nn.ModuleList([nn.Linear(self.time_step_rank, self.intermediate_size, bias=True) for _ in range(num_ssm_param)])
|
|
@@ -1579,133 +1574,145 @@ class HymbaBlock(nn.Module):
|
|
| 1579 |
def cuda_kernels_forward(self, hidden_states: torch.Tensor, cache_params: HybridMambaAttentionDynamicCache = None, attention_mask=None, position_ids=None, kv_last_layer=None, use_cache=False, use_swa=False):
|
| 1580 |
projected_states = self.in_proj(hidden_states).transpose(1, 2) ## (bs, latent_dim, seq_len)
|
| 1581 |
|
| 1582 |
-
|
| 1583 |
-
|
| 1584 |
-
|
| 1585 |
-
|
| 1586 |
-
|
| 1587 |
-
|
| 1588 |
-
|
| 1589 |
-
|
| 1590 |
-
|
| 1591 |
-
|
| 1592 |
-
|
| 1593 |
-
|
| 1594 |
-
|
| 1595 |
-
|
| 1596 |
-
|
| 1597 |
-
|
| 1598 |
-
|
| 1599 |
-
|
| 1600 |
|
| 1601 |
-
if self.reuse_kv:
|
| 1602 |
-
query_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size,), dim=1)
|
| 1603 |
-
query_states = query_states.transpose(1,2)
|
| 1604 |
else:
|
| 1605 |
-
|
| 1606 |
-
|
| 1607 |
-
|
| 1608 |
-
|
| 1609 |
-
|
| 1610 |
-
|
| 1611 |
-
|
| 1612 |
-
|
| 1613 |
-
|
| 1614 |
-
cache_params.conv_states[self.layer_idx],
|
| 1615 |
-
conv_weights,
|
| 1616 |
-
self.conv1d.bias,
|
| 1617 |
-
self.activation,
|
| 1618 |
)
|
| 1619 |
-
hidden_states = hidden_states.unsqueeze(-1)
|
| 1620 |
|
| 1621 |
-
|
| 1622 |
-
else:
|
| 1623 |
-
if cache_params is not None:
|
| 1624 |
-
conv_states = nn.functional.pad(
|
| 1625 |
-
hidden_states, (self.conv_kernel_size - hidden_states.shape[-1], 0)
|
| 1626 |
-
)
|
| 1627 |
|
| 1628 |
-
|
| 1629 |
|
| 1630 |
-
|
| 1631 |
-
|
| 1632 |
-
|
| 1633 |
-
|
| 1634 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1635 |
|
| 1636 |
-
|
| 1637 |
-
|
| 1638 |
-
|
| 1639 |
-
|
| 1640 |
-
|
| 1641 |
-
|
| 1642 |
-
attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, kv_last_layer=kv_last_layer, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
|
| 1643 |
-
else:
|
| 1644 |
-
attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, key_states=key_states, value_states=value_states, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
|
| 1645 |
|
| 1646 |
-
|
| 1647 |
-
index = 0
|
| 1648 |
-
ssm_parameters = self.x_proj[index](hidden_states.transpose(1, 2))
|
| 1649 |
-
time_step, B, C = torch.split(
|
| 1650 |
-
ssm_parameters, [self.time_step_rank, self.ssm_state_size, self.ssm_state_size], dim=-1
|
| 1651 |
-
)
|
| 1652 |
-
time_step, B, C = self._apply_layernorms(time_step, B, C)
|
| 1653 |
|
| 1654 |
-
|
| 1655 |
-
|
| 1656 |
-
|
| 1657 |
-
|
| 1658 |
-
|
| 1659 |
-
|
| 1660 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1661 |
|
| 1662 |
-
|
| 1663 |
-
|
| 1664 |
-
|
| 1665 |
-
|
| 1666 |
-
|
| 1667 |
-
A = -torch.exp(self.A_log[index].float())
|
| 1668 |
-
|
| 1669 |
-
time_proj_bias = time_proj_bias.float() if time_proj_bias is not None else None
|
| 1670 |
-
if use_precomputed_states:
|
| 1671 |
-
scan_outputs = selective_state_update(
|
| 1672 |
-
cache_params.ssm_states[self.layer_idx],
|
| 1673 |
-
hidden_states[..., 0],
|
| 1674 |
-
discrete_time_step[..., 0],
|
| 1675 |
-
A,
|
| 1676 |
-
B[:, 0],
|
| 1677 |
-
C[:, 0],
|
| 1678 |
-
self.D[index],
|
| 1679 |
-
gate[..., 0],
|
| 1680 |
-
time_proj_bias,
|
| 1681 |
-
dt_softplus=True,
|
| 1682 |
-
).unsqueeze(-1)
|
| 1683 |
-
else:
|
| 1684 |
-
outputs = selective_scan_fn(
|
| 1685 |
-
hidden_states,
|
| 1686 |
-
discrete_time_step,
|
| 1687 |
-
A,
|
| 1688 |
-
B.transpose(1, 2),
|
| 1689 |
-
C.transpose(1, 2),
|
| 1690 |
-
self.D[index].float(),
|
| 1691 |
-
z=gate,
|
| 1692 |
-
delta_bias=time_proj_bias,
|
| 1693 |
-
delta_softplus=True,
|
| 1694 |
-
return_last_state=True,
|
| 1695 |
)
|
| 1696 |
-
|
| 1697 |
-
|
| 1698 |
-
|
|
|
|
|
|
|
| 1699 |
else:
|
| 1700 |
-
|
|
|
|
|
|
|
| 1701 |
|
| 1702 |
-
if
|
| 1703 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1704 |
|
| 1705 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1706 |
|
| 1707 |
-
|
| 1708 |
-
|
| 1709 |
|
| 1710 |
return contextualized_states, attn_key_value
|
| 1711 |
|
|
@@ -2025,49 +2032,6 @@ class HymbaPreTrainedModel(PreTrainedModel):
|
|
| 2025 |
|
| 2026 |
|
| 2027 |
|
| 2028 |
-
def shift_zeros_to_front(attention_mask, hidden_states, position_ids):
|
| 2029 |
-
"""
|
| 2030 |
-
Move all zero entries in 'attention_mask' to the front of the sequence
|
| 2031 |
-
and reorder 'hidden_states' accordingly, preserving the order of zeros
|
| 2032 |
-
and the order of ones.
|
| 2033 |
-
|
| 2034 |
-
Args:
|
| 2035 |
-
attention_mask: (batch_size, seq_len), values in {0, 1}.
|
| 2036 |
-
hidden_states: (batch_size, seq_len, dim).
|
| 2037 |
-
|
| 2038 |
-
Returns:
|
| 2039 |
-
shifted_mask: (batch_size, seq_len) with zeros at the front.
|
| 2040 |
-
shifted_states: (batch_size, seq_len, dim) reordered accordingly.
|
| 2041 |
-
"""
|
| 2042 |
-
B, L = attention_mask.shape
|
| 2043 |
-
D = hidden_states.shape[-1]
|
| 2044 |
-
|
| 2045 |
-
shifted_mask = torch.empty_like(attention_mask)
|
| 2046 |
-
shifted_states = torch.empty_like(hidden_states)
|
| 2047 |
-
shifted_position_ids = torch.empty_like(position_ids)
|
| 2048 |
-
|
| 2049 |
-
# Process each batch row independently
|
| 2050 |
-
for b in range(B):
|
| 2051 |
-
row_mask = attention_mask[b] # (seq_len,)
|
| 2052 |
-
row_states = hidden_states[b] # (seq_len, dim)
|
| 2053 |
-
row_pos = position_ids[b] # (seq_len,)
|
| 2054 |
-
|
| 2055 |
-
# Find positions of zeros and ones
|
| 2056 |
-
zero_indices = torch.where(row_mask == 0)[0]
|
| 2057 |
-
one_indices = torch.where(row_mask == 1)[0]
|
| 2058 |
-
|
| 2059 |
-
# Concatenate zero indices (in order) then one indices
|
| 2060 |
-
new_order = torch.cat([zero_indices, one_indices], dim=0)
|
| 2061 |
-
|
| 2062 |
-
# Reorder mask and states
|
| 2063 |
-
shifted_mask[b] = row_mask[new_order]
|
| 2064 |
-
shifted_states[b] = row_states[new_order]
|
| 2065 |
-
shifted_position_ids[b] = row_pos[new_order]
|
| 2066 |
-
|
| 2067 |
-
return shifted_mask, shifted_states, shifted_position_ids
|
| 2068 |
-
|
| 2069 |
-
|
| 2070 |
-
|
| 2071 |
HYMBA_INPUTS_DOCSTRING = r"""
|
| 2072 |
Args: To be added later. Please refer to the forward function.
|
| 2073 |
"""
|
|
@@ -2236,11 +2200,7 @@ class HymbaModel(HymbaPreTrainedModel):
|
|
| 2236 |
|
| 2237 |
if position_ids is not None and position_ids.shape[1] != inputs_embeds.shape[1]:
|
| 2238 |
position_ids = torch.arange(inputs_embeds.shape[1], device=inputs_embeds.device).unsqueeze(0)
|
| 2239 |
-
|
| 2240 |
-
## Handle paddings: Shift all padding tokens to the beginning of the sequence
|
| 2241 |
-
if inputs_embeds.shape[1] > 1 and attention_mask is not None and (attention_mask == 0).any():
|
| 2242 |
-
attention_mask, inputs_embeds, position_ids = shift_zeros_to_front(attention_mask, inputs_embeds, position_ids)
|
| 2243 |
-
|
| 2244 |
attention_mask_raw = attention_mask
|
| 2245 |
|
| 2246 |
if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
|
|
|
|
| 39 |
from torch.utils.checkpoint import checkpoint
|
| 40 |
|
| 41 |
|
| 42 |
+
try:
|
| 43 |
+
from flash_attn import flash_attn_func, flash_attn_varlen_func
|
| 44 |
+
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
|
| 45 |
|
| 46 |
+
_flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
|
| 47 |
|
| 48 |
+
from einops import rearrange, repeat, reduce, pack, unpack
|
| 49 |
+
from einops.layers.torch import Rearrange
|
| 50 |
+
except ImportError:
|
| 51 |
+
pass
|
| 52 |
|
| 53 |
|
| 54 |
if is_torch_fx_available():
|
|
|
|
| 399 |
|
| 400 |
if has_mamba_state:
|
| 401 |
if hasattr(config, 'conv_dim'):
|
| 402 |
+
conv_dim = config.conv_dim[i]
|
| 403 |
else:
|
| 404 |
conv_dim = intermediate_size
|
| 405 |
self.conv_states += [
|
|
|
|
| 546 |
|
| 547 |
if self.config.rope:
|
| 548 |
self._init_rope()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 549 |
|
| 550 |
|
| 551 |
def _init_rope(self):
|
|
|
|
| 1228 |
|
| 1229 |
self.attn_mask = or_masks(attn_mask, register_mask)
|
| 1230 |
|
| 1231 |
+
self.block_mask = create_block_mask(self.attn_mask, B=None, H=None, Q_LEN=qk_length, KV_LEN=qk_length, _compile=True)
|
| 1232 |
|
| 1233 |
self.flex_attention = torch.compile(flex_attention)
|
| 1234 |
|
|
|
|
| 1518 |
num_ssm_param = 1
|
| 1519 |
|
| 1520 |
if not hasattr(config, 'conv_dim'):
|
| 1521 |
+
config.conv_dim = {i:0 for i in range(config.num_hidden_layers)}
|
| 1522 |
|
| 1523 |
self.conv1d = nn.Conv1d(
|
| 1524 |
in_channels=self.intermediate_size,
|
|
|
|
| 1529 |
padding=self.conv_kernel_size - 1
|
| 1530 |
)
|
| 1531 |
|
| 1532 |
+
config.conv_dim[self.layer_idx] = self.intermediate_size
|
| 1533 |
|
| 1534 |
self.x_proj = nn.ModuleList([nn.Linear(self.intermediate_size, self.time_step_rank + self.ssm_state_size * 2, bias=False) for _ in range(num_ssm_param)])
|
| 1535 |
self.dt_proj = nn.ModuleList([nn.Linear(self.time_step_rank, self.intermediate_size, bias=True) for _ in range(num_ssm_param)])
|
|
|
|
| 1574 |
def cuda_kernels_forward(self, hidden_states: torch.Tensor, cache_params: HybridMambaAttentionDynamicCache = None, attention_mask=None, position_ids=None, kv_last_layer=None, use_cache=False, use_swa=False):
|
| 1575 |
projected_states = self.in_proj(hidden_states).transpose(1, 2) ## (bs, latent_dim, seq_len)
|
| 1576 |
|
| 1577 |
+
if (
|
| 1578 |
+
self.training and cache_params is None and not self.apply_inner_layernorms
|
| 1579 |
+
): # Doesn't support outputting the states -> used for training
|
| 1580 |
+
contextualized_states = mamba_inner_fn(
|
| 1581 |
+
projected_states,
|
| 1582 |
+
self.conv1d.weight,
|
| 1583 |
+
self.conv1d.bias if self.use_conv_bias else None,
|
| 1584 |
+
self.x_proj.weight,
|
| 1585 |
+
self.dt_proj.weight,
|
| 1586 |
+
self.out_proj.weight,
|
| 1587 |
+
self.out_proj.bias.float() if self.use_bias else None,
|
| 1588 |
+
-torch.exp(self.A_log.float()),
|
| 1589 |
+
None, # input-dependent B
|
| 1590 |
+
None, # input-dependent C
|
| 1591 |
+
self.D.float(),
|
| 1592 |
+
delta_bias=self.dt_proj.bias.float(),
|
| 1593 |
+
delta_softplus=True,
|
| 1594 |
+
)
|
| 1595 |
|
|
|
|
|
|
|
|
|
|
| 1596 |
else:
|
| 1597 |
+
batch_size, seq_len, _ = hidden_states.shape
|
| 1598 |
+
use_precomputed_states = (
|
| 1599 |
+
cache_params is not None
|
| 1600 |
+
and cache_params.has_previous_state
|
| 1601 |
+
and seq_len == 1
|
| 1602 |
+
and cache_params.conv_states[self.layer_idx].shape[0]
|
| 1603 |
+
== cache_params.ssm_states[self.layer_idx].shape[0]
|
| 1604 |
+
== batch_size
|
| 1605 |
+
and use_cache
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1606 |
)
|
|
|
|
| 1607 |
|
| 1608 |
+
hidden_states, gate = projected_states.tensor_split((self.latent_dim,), dim=1)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1609 |
|
| 1610 |
+
conv_weights = self.conv1d.weight.view(self.conv1d.weight.size(0), self.conv1d.weight.size(2))
|
| 1611 |
|
| 1612 |
+
if self.reuse_kv:
|
| 1613 |
+
query_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size,), dim=1)
|
| 1614 |
+
query_states = query_states.transpose(1,2)
|
| 1615 |
+
else:
|
| 1616 |
+
query_states, key_states, value_states, hidden_states = hidden_states.tensor_split((self.attn_hidden_size, self.attn_hidden_size + self.k_hidden_size, self.attn_hidden_size + self.k_hidden_size + self.v_hidden_size), dim=1)
|
| 1617 |
+
|
| 1618 |
+
query_states = query_states.transpose(1,2)
|
| 1619 |
+
key_states = key_states.transpose(1,2)
|
| 1620 |
+
value_states = value_states.transpose(1,2)
|
| 1621 |
+
|
| 1622 |
+
if use_precomputed_states:
|
| 1623 |
+
hidden_states = causal_conv1d_update(
|
| 1624 |
+
hidden_states.squeeze(-1),
|
| 1625 |
+
cache_params.conv_states[self.layer_idx],
|
| 1626 |
+
conv_weights,
|
| 1627 |
+
self.conv1d.bias,
|
| 1628 |
+
self.activation,
|
| 1629 |
+
)
|
| 1630 |
+
hidden_states = hidden_states.unsqueeze(-1)
|
| 1631 |
|
| 1632 |
+
cache_params.mamba_past_length[self.layer_idx] += seq_len
|
| 1633 |
+
else:
|
| 1634 |
+
if cache_params is not None:
|
| 1635 |
+
conv_states = nn.functional.pad(
|
| 1636 |
+
hidden_states, (self.conv_kernel_size - hidden_states.shape[-1], 0)
|
| 1637 |
+
)
|
|
|
|
|
|
|
|
|
|
| 1638 |
|
| 1639 |
+
cache_params.conv_states[self.layer_idx].copy_(conv_states)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1640 |
|
| 1641 |
+
cache_params.mamba_past_length[self.layer_idx] += seq_len
|
| 1642 |
+
|
| 1643 |
+
hidden_states = causal_conv1d_fn(
|
| 1644 |
+
hidden_states, conv_weights, self.conv1d.bias, activation=self.activation
|
| 1645 |
+
)
|
| 1646 |
+
|
| 1647 |
+
if self.reuse_kv:
|
| 1648 |
+
assert kv_last_layer is not None
|
| 1649 |
+
attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, kv_last_layer=kv_last_layer, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
|
| 1650 |
+
else:
|
| 1651 |
+
attn_outputs, attn_key_value = self.self_attn(attention_mask=attention_mask, position_ids=position_ids, query_states=query_states, key_states=key_states, value_states=value_states, use_swa=use_swa, use_cache=use_cache, past_key_value=cache_params)
|
| 1652 |
|
| 1653 |
+
## Mamba head
|
| 1654 |
+
index = 0
|
| 1655 |
+
ssm_parameters = self.x_proj[index](hidden_states.transpose(1, 2))
|
| 1656 |
+
time_step, B, C = torch.split(
|
| 1657 |
+
ssm_parameters, [self.time_step_rank, self.ssm_state_size, self.ssm_state_size], dim=-1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1658 |
)
|
| 1659 |
+
time_step, B, C = self._apply_layernorms(time_step, B, C)
|
| 1660 |
+
|
| 1661 |
+
if hasattr(self.dt_proj[index], "base_layer"):
|
| 1662 |
+
time_proj_bias = self.dt_proj[index].base_layer.bias
|
| 1663 |
+
self.dt_proj[index].base_layer.bias = None
|
| 1664 |
else:
|
| 1665 |
+
time_proj_bias = self.dt_proj[index].bias
|
| 1666 |
+
self.dt_proj[index].bias = None
|
| 1667 |
+
discrete_time_step = self.dt_proj[index](time_step).transpose(1, 2) # [batch, intermediate_size, seq_len]
|
| 1668 |
|
| 1669 |
+
if hasattr(self.dt_proj[index], "base_layer"):
|
| 1670 |
+
self.dt_proj[index].base_layer.bias = time_proj_bias
|
| 1671 |
+
else:
|
| 1672 |
+
self.dt_proj[index].bias = time_proj_bias
|
| 1673 |
+
|
| 1674 |
+
A = -torch.exp(self.A_log[index].float())
|
| 1675 |
+
|
| 1676 |
+
time_proj_bias = time_proj_bias.float() if time_proj_bias is not None else None
|
| 1677 |
+
if use_precomputed_states:
|
| 1678 |
+
scan_outputs = selective_state_update(
|
| 1679 |
+
cache_params.ssm_states[self.layer_idx],
|
| 1680 |
+
hidden_states[..., 0],
|
| 1681 |
+
discrete_time_step[..., 0],
|
| 1682 |
+
A,
|
| 1683 |
+
B[:, 0],
|
| 1684 |
+
C[:, 0],
|
| 1685 |
+
self.D[index],
|
| 1686 |
+
gate[..., 0],
|
| 1687 |
+
time_proj_bias,
|
| 1688 |
+
dt_softplus=True,
|
| 1689 |
+
).unsqueeze(-1)
|
| 1690 |
+
else:
|
| 1691 |
+
outputs = selective_scan_fn(
|
| 1692 |
+
hidden_states,
|
| 1693 |
+
discrete_time_step,
|
| 1694 |
+
A,
|
| 1695 |
+
B.transpose(1, 2),
|
| 1696 |
+
C.transpose(1, 2),
|
| 1697 |
+
self.D[index].float(),
|
| 1698 |
+
z=gate,
|
| 1699 |
+
delta_bias=time_proj_bias,
|
| 1700 |
+
delta_softplus=True,
|
| 1701 |
+
return_last_state=True,
|
| 1702 |
+
)
|
| 1703 |
|
| 1704 |
+
if len(outputs) == 3:
|
| 1705 |
+
scan_outputs, ssm_state, _ = outputs
|
| 1706 |
+
else:
|
| 1707 |
+
scan_outputs, ssm_state = outputs
|
| 1708 |
+
|
| 1709 |
+
if ssm_state is not None and cache_params is not None:
|
| 1710 |
+
cache_params.ssm_states[self.layer_idx].copy_(ssm_state)
|
| 1711 |
+
|
| 1712 |
+
scan_outputs = scan_outputs.transpose(1, 2)
|
| 1713 |
|
| 1714 |
+
hidden_states = (self.pre_avg_layernorm1(attn_outputs) + self.pre_avg_layernorm2(scan_outputs)) / 2
|
| 1715 |
+
contextualized_states = self.out_proj(hidden_states)
|
| 1716 |
|
| 1717 |
return contextualized_states, attn_key_value
|
| 1718 |
|
|
|
|
| 2032 |
|
| 2033 |
|
| 2034 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2035 |
HYMBA_INPUTS_DOCSTRING = r"""
|
| 2036 |
Args: To be added later. Please refer to the forward function.
|
| 2037 |
"""
|
|
|
|
| 2200 |
|
| 2201 |
if position_ids is not None and position_ids.shape[1] != inputs_embeds.shape[1]:
|
| 2202 |
position_ids = torch.arange(inputs_embeds.shape[1], device=inputs_embeds.device).unsqueeze(0)
|
| 2203 |
+
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2204 |
attention_mask_raw = attention_mask
|
| 2205 |
|
| 2206 |
if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
|
setup.sh
DELETED
|
@@ -1,44 +0,0 @@
|
|
| 1 |
-
#!/bin/bash
|
| 2 |
-
|
| 3 |
-
# Prompt user to specify CUDA version
|
| 4 |
-
read -p "Enter CUDA version (12.1 or 12.4): " cuda_version
|
| 5 |
-
|
| 6 |
-
# Verify CUDA version input
|
| 7 |
-
if [[ "$cuda_version" != "12.1" && "$cuda_version" != "12.4" ]]; then
|
| 8 |
-
echo "Invalid CUDA version specified. Please choose either 12.1 or 12.4."
|
| 9 |
-
exit 1
|
| 10 |
-
fi
|
| 11 |
-
|
| 12 |
-
# Install PyTorch with the specified CUDA version
|
| 13 |
-
conda install pytorch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 pytorch-cuda=$cuda_version -c pytorch -c nvidia
|
| 14 |
-
|
| 15 |
-
# Install other packages
|
| 16 |
-
pip install --upgrade transformers
|
| 17 |
-
pip install tiktoken
|
| 18 |
-
pip install sentencepiece
|
| 19 |
-
pip install protobuf
|
| 20 |
-
pip install ninja einops triton packaging
|
| 21 |
-
|
| 22 |
-
# Clone and install Mamba
|
| 23 |
-
git clone https://github.com/state-spaces/mamba.git
|
| 24 |
-
cd mamba
|
| 25 |
-
pip install -e .
|
| 26 |
-
cd ..
|
| 27 |
-
|
| 28 |
-
# Clone and install causal-conv1d with specified CUDA version
|
| 29 |
-
git clone https://github.com/Dao-AILab/causal-conv1d.git
|
| 30 |
-
cd causal-conv1d
|
| 31 |
-
export CUDA_HOME=/usr/local/cuda-$cuda_version
|
| 32 |
-
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;8.9;9.0" python setup.py install
|
| 33 |
-
cd ..
|
| 34 |
-
|
| 35 |
-
# Clone and install attention-gym
|
| 36 |
-
git clone https://github.com/pytorch-labs/attention-gym.git
|
| 37 |
-
cd attention-gym
|
| 38 |
-
pip install .
|
| 39 |
-
cd ..
|
| 40 |
-
|
| 41 |
-
# Install Flash Attention
|
| 42 |
-
pip install flash_attn
|
| 43 |
-
|
| 44 |
-
echo "Installation completed with CUDA $cuda_version."
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
special_tokens_map.json
DELETED
|
@@ -1,30 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"bos_token": {
|
| 3 |
-
"content": "<s>",
|
| 4 |
-
"lstrip": false,
|
| 5 |
-
"normalized": false,
|
| 6 |
-
"rstrip": false,
|
| 7 |
-
"single_word": false
|
| 8 |
-
},
|
| 9 |
-
"eos_token": {
|
| 10 |
-
"content": "</s>",
|
| 11 |
-
"lstrip": false,
|
| 12 |
-
"normalized": false,
|
| 13 |
-
"rstrip": false,
|
| 14 |
-
"single_word": false
|
| 15 |
-
},
|
| 16 |
-
"pad_token": {
|
| 17 |
-
"content": "[PAD]",
|
| 18 |
-
"lstrip": false,
|
| 19 |
-
"normalized": false,
|
| 20 |
-
"rstrip": false,
|
| 21 |
-
"single_word": false
|
| 22 |
-
},
|
| 23 |
-
"unk_token": {
|
| 24 |
-
"content": "<unk>",
|
| 25 |
-
"lstrip": false,
|
| 26 |
-
"normalized": false,
|
| 27 |
-
"rstrip": false,
|
| 28 |
-
"single_word": false
|
| 29 |
-
}
|
| 30 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
tokenizer.json
DELETED
|
The diff for this file is too large to render.
See raw diff
|
|
|
tokenizer_config.json
DELETED
|
@@ -1,52 +0,0 @@
|
|
| 1 |
-
{
|
| 2 |
-
"add_bos_token": true,
|
| 3 |
-
"add_eos_token": false,
|
| 4 |
-
"add_prefix_space": true,
|
| 5 |
-
"added_tokens_decoder": {
|
| 6 |
-
"0": {
|
| 7 |
-
"content": "<unk>",
|
| 8 |
-
"lstrip": false,
|
| 9 |
-
"normalized": false,
|
| 10 |
-
"rstrip": false,
|
| 11 |
-
"single_word": false,
|
| 12 |
-
"special": true
|
| 13 |
-
},
|
| 14 |
-
"1": {
|
| 15 |
-
"content": "<s>",
|
| 16 |
-
"lstrip": false,
|
| 17 |
-
"normalized": false,
|
| 18 |
-
"rstrip": false,
|
| 19 |
-
"single_word": false,
|
| 20 |
-
"special": true
|
| 21 |
-
},
|
| 22 |
-
"2": {
|
| 23 |
-
"content": "</s>",
|
| 24 |
-
"lstrip": false,
|
| 25 |
-
"normalized": false,
|
| 26 |
-
"rstrip": false,
|
| 27 |
-
"single_word": false,
|
| 28 |
-
"special": true
|
| 29 |
-
},
|
| 30 |
-
"32000": {
|
| 31 |
-
"content": "[PAD]",
|
| 32 |
-
"lstrip": false,
|
| 33 |
-
"normalized": false,
|
| 34 |
-
"rstrip": false,
|
| 35 |
-
"single_word": false,
|
| 36 |
-
"special": true
|
| 37 |
-
}
|
| 38 |
-
},
|
| 39 |
-
"bos_token": "<s>",
|
| 40 |
-
"chat_template": "{{'<extra_id_0>System'}}{% for message in messages %}{% if message['role'] == 'system' %}{{'\n' + message['content'].strip()}}{% if tools or contexts %}{{'\n'}}{% endif %}{% endif %}{% endfor %}{% if tools %}{% for tool in tools %}{{ '\n<tool> ' + tool|tojson + ' </tool>' }}{% endfor %}{% endif %}{% if contexts %}{% if tools %}{{'\n'}}{% endif %}{% for context in contexts %}{{ '\n<context> ' + context.strip() + ' </context>' }}{% endfor %}{% endif %}{{'\n\n'}}{% for message in messages %}{% if message['role'] == 'user' %}{{ '<extra_id_1>User\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'assistant' %}{{ '<extra_id_1>Assistant\n' + message['content'].strip() + '\n' }}{% elif message['role'] == 'tool' %}{{ '<extra_id_1>Tool\n' + message['content'].strip() + '\n' }}{% endif %}{% endfor %}{%- if add_generation_prompt %}{{'<extra_id_1>Assistant\n'}}{%- endif %}",
|
| 41 |
-
"clean_up_tokenization_spaces": false,
|
| 42 |
-
"eos_token": "</s>",
|
| 43 |
-
"legacy": true,
|
| 44 |
-
"model_max_length": 1000000000000000019884624838656,
|
| 45 |
-
"pad_token": "[PAD]",
|
| 46 |
-
"padding_side": "left",
|
| 47 |
-
"sp_model_kwargs": {},
|
| 48 |
-
"spaces_between_special_tokens": false,
|
| 49 |
-
"tokenizer_class": "LlamaTokenizer",
|
| 50 |
-
"unk_token": "<unk>",
|
| 51 |
-
"use_default_system_prompt": false
|
| 52 |
-
}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|