ConicCat committed on Mar 31

Commit 0776dca (verified) · 1 Parent(s): bbf0c64

Upload folder using huggingface_hub

Files changed (16)

.gitattributes +1 -0
README.md +143 -0
adapter_config.json +46 -0
adapter_model.safetensors +3 -0
block_config.py +118 -0
chat_template.jinja +20 -0
config.json +1497 -0
configuration_decilm.py +65 -0
debug.log +730 -0
runs/Mar31_01-27-28_b8de28f8ab2a/events.out.tfevents.1774920448.b8de28f8ab2a.3556.0 +3 -0
runs/Mar31_01-31-17_b8de28f8ab2a/events.out.tfevents.1774920677.b8de28f8ab2a.6000.0 +3 -0
runs/Mar31_01-41-00_b8de28f8ab2a/events.out.tfevents.1774921260.b8de28f8ab2a.9806.0 +3 -0
tokenizer.json +3 -0
tokenizer_config.json +14 -0
transformers_4_44_2__configuration_llama.py +203 -0
transformers_4_44_2__modeling_rope_utils.py +559 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

@@ -0,0 +1,143 @@

+---
+library_name: peft
+license: other
+base_model: nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
+tags:
+- axolotl
+- base_model:adapter:nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
+- lora
+- transformers
+datasets:
+- ConicCat/GLiMA_Thinking
+- ConicCat/Gutenberg-SFT
+- ConicCat/Condor-SFT-Filtered
+- ConicCat/Ao3_Soft_Refusal
+- ConicCat/VSF
+pipeline_tag: text-generation
+model-index:
+- name: Writer-Stage-1
+  results: []
+---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
+<details><summary>See axolotl config</summary>
+axolotl version: `0.16.0.dev0`
+```yaml
+base_model: nvidia/Llama-3_3-Nemotron-Super-49B-v1_5
+load_in_8bit: true
+load_in_4bit: false
+sequence_len: 5120
+max_sample_length: 5120
+sample_packing: true
+gradient_checkpointing: true
+bf16: true
+tf32: true
+flash_attention: true
+lora_mlp_kernel: false
+lora_qkv_kernel: false
+lora_o_kernel: false
+datasets:
+  - path: ConicCat/GLiMA_Thinking
+    type: chat_template
+    roles_to_train: []
+    train_on_eos: turn
+    message_field_training: train
+  - path: ConicCat/Gutenberg-SFT
+    type: chat_template
+  - path: ConicCat/Condor-SFT-Filtered
+    split: train[:250]
+    type: chat_template
+  - path: ConicCat/Ao3_Soft_Refusal
+    type: chat_template
+  - path: ConicCat/VSF
+    type: chat_template
+chat_template_jinja: "{% set bos = \"<|begin_of_text|>\" %}{%- set enable_thinking = false -%}{% set system_start_header = \"<|start_header_id|>\" %}{% set system_end_header = \"<|end_header_id|>\n\n\" %}{% set start_header = \"<|start_header_id|>\" %}{% set end_header = \"<|end_header_id|>\n\n\" %}{% set eot = \"<|eot_id|>\" %}{% set system_token = \"system\" %}{% set user_token = \"user\" %}{% set assistant_token = \"assistant\" %}{% set tool_token = \"tool\" %}{{- bos ~ system_start_header ~ system_token ~ system_end_header -}}{%- if messages[0].role == 'system' and messages[0].content != '' -%}{%- set system_content = messages[0].content -%}{%- if '/no_think' in system_content -%}{%- set system_content = system_content.replace('/no_think', '')|trim -%}{%- set enable_thinking = false -%}{%- elif '/think' in system_content -%}{%- set system_content = system_content.replace('/think', '')|trim -%}{%- set enable_thinking = true -%}{%- endif -%}{{- system_content + '\n\n' -}}{%- endif -%}{%- if tools -%}{{- 'You can use the following tools to assist the user if required:\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\nIf you decide to call any tool(s), use the following format:\n<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, {{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\nResponse from tool(s) will be returned in this format:\n<TOOL_RESPONSE>[{{\"response\": \"tool_response1\"}}, {{\"response\": \"tool_response2\"}}]</TOOL_RESPONSE>\n\nBased on the results returned by the tool(s), you can call additional tools if needed, correct tool calls if any errors are found, or just respond with the answer to the user.' 
-}}{%- endif -%}{{- eot -}}{%- for message in messages -%}{%- if message.role == user_token -%}{{- start_header ~ user_token ~ end_header -}}{{ message.content -}}{{ eot -}}{%- elif message.role == assistant_token -%}{%- if '</think>' in message.content -%}{%- set content = message.content.split('</think>')[-1].lstrip() -%}{%- else -%}{%- set content = message.content -%}{%- endif -%}{{- start_header ~ assistant_token ~ end_header -}}{{ content -}}{%- if message.tool_calls -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- eot -}}{%- elif message.role == tool_token -%}{%- if loop.first or (messages[loop.index0 - 1].role != tool_token) -%}{{- start_header ~ tool_token ~ end_header -}}{{ '<TOOL_RESPONSE>[' -}}{%- endif -%}{{- message.content -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == tool_token) else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != tool_token) -%}{{- ']</TOOL_RESPONSE>' -}}{{ eot -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- start_header ~ assistant_token ~ end_header -}}{%- if not enable_thinking -%}{{- '<think>\n\n</think>\n\n' -}}{%- endif -%}{%- endif -%}"
+trust_remote_code: true
+adapter: lora
+lora_r: 32
+lora_alpha: 64
+lora_dropout: 0.0
+lora_bias: None
+lora_target_linear: true
+use_tensorboard: true
+optimizer: paged_adamw_8bit
+learning_rate: 1.25e-5 # 1e-4 / 4
+loraplus_lr_ratio: 16
+# Training arguments
+output_dir: ./Writer-Stage-1
+num_epochs: 3
+micro_batch_size: 1
+gradient_accumulation_steps: 16
+save_strategy: 'no'
+warmup_ratio: 0.05
+lr_scheduler: 'constant_with_warmup'
+max_grad_norm: 1
+logging_steps: 1
+seed: 42
+```
+</details><br>
+# Writer-Stage-1
+This model is a fine-tuned version of [nvidia/Llama-3_3-Nemotron-Super-49B-v1_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5) on the ConicCat/GLiMA_Thinking, the ConicCat/Gutenberg-SFT, the ConicCat/Condor-SFT-Filtered, the ConicCat/Ao3_Soft_Refusal and the ConicCat/VSF datasets.
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 1.25e-05
+- train_batch_size: 1
+- eval_batch_size: 1
+- seed: 42
+- gradient_accumulation_steps: 16
+- total_train_batch_size: 16
+- optimizer: Use OptimizerNames.PAGED_ADAMW_8BIT with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: constant_with_warmup
+- lr_scheduler_warmup_steps: 2
+- training_steps: 54
+### Training results
+### Framework versions
+- PEFT 0.18.1
+- Transformers 5.3.0
+- Pytorch 2.9.1+cu128
+- Datasets 4.5.0
+- Tokenizers 0.22.2

adapter_config.json ADDED

@@ -0,0 +1,46 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": null,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 64,
+  "lora_bias": false,
+  "lora_dropout": 0.0,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.18.1",
+  "qalora_group_size": 16,
+  "r": 32,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "o_proj",
+    "q_proj",
+    "gate_proj",
+    "down_proj",
+    "k_proj",
+    "v_proj",
+    "up_proj"
+  ],
+  "target_parameters": [],
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
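
For readers unfamiliar with these fields, the adapter's effective scaling factor follows from `r`, `lora_alpha`, and `use_rslora`. The sketch below is illustrative and not part of the commit: it inlines a subset of the JSON above rather than reading the file, and the helper name `lora_scaling` is hypothetical. It relies on the standard LoRA rule of `lora_alpha / r`, with `lora_alpha / sqrt(r)` when rank-stabilized LoRA is enabled.

```python
import json
import math

# Subset of adapter_config.json from this commit, inlined for a self-contained example.
adapter_config = json.loads("""
{
  "r": 32,
  "lora_alpha": 64,
  "use_rslora": false,
  "target_modules": ["o_proj", "q_proj", "gate_proj", "down_proj",
                     "k_proj", "v_proj", "up_proj"]
}
""")


def lora_scaling(config: dict) -> float:
    """Return the factor applied to the low-rank update B @ A (hypothetical helper)."""
    r = config["r"]
    alpha = config["lora_alpha"]
    if config.get("use_rslora", False):
        return alpha / math.sqrt(r)  # rank-stabilized variant
    return alpha / r  # standard LoRA scaling


scaling = lora_scaling(adapter_config)
print(f"scaling = {scaling}, adapted modules = {len(adapter_config['target_modules'])}")
```

With `lora_alpha: 64` and `r: 32`, the low-rank update is scaled by 2.0 across all seven attention and MLP projections.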

adapter_model.safetensors ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:35aec8e7edeb4728f563221db3318570ae5a184c1fd972dc577cca44c6ddab69
+size 1203621016

block_config.py ADDED

@@ -0,0 +1,118 @@

+import dataclasses
+import json
+import warnings
+from dataclasses import dataclass, MISSING
+from functools import partial
+from typing import Optional, Any
+@partial(dataclass, frozen=True, kw_only=True)
+class JsonComparable:
+    def to_json(self) -> str:
+        return json.dumps(dataclasses.asdict(self))
+    def __eq__(self, other: "JsonComparable") -> bool:
+        return self.to_json() == other.to_json()
+    def __hash__(self) -> int:
+        return hash(self.to_json())
+    def __lt__(self, other: "JsonComparable") -> bool:
+        return self.to_json() < other.to_json()
+@partial(dataclass, frozen=True, kw_only=True)
+class SubblockConfig(JsonComparable):
+    no_op: bool = False
+    replace_with_linear: bool = False
+    sparsify: Optional[list[str]] = None
+    def __post_init__(self):
+        assert not (self.no_op and self.replace_with_linear)
+    def _force_setattr(self, name: str, value: Any) -> None:
+        """
+        Set an attribute even in frozen dataclasses.
+        Use only inside __post_init__!
+        """
+        object.__setattr__(self, name, value)
+@partial(dataclass, frozen=True, kw_only=True)
+class AttentionConfig(SubblockConfig):
+    n_heads_in_group: Optional[int] = None
+    window_length: Optional[int] = None
+    num_sink_tokens: Optional[int] = None
+    use_prefill_window_in_sink_attention: bool = False
+    unshifted_sink: bool = False
+    def __post_init__(self):
+        super().__post_init__()
+        assert not (self.no_op and self.replace_with_linear)
+        if self.no_op or self.replace_with_linear:
+            for irrelevant_att in ["n_heads_in_group", "window_length", "num_sink_tokens"]:
+                self._force_setattr(irrelevant_att, None)
+        else:
+            assert self.n_heads_in_group is not None
+        if self.is_sink:
+            assert not (self.unshifted_sink and self.use_prefill_window_in_sink_attention), \
+                ("Unshifted sink uses its own kind of explicit masking, not standard window. "
+                 "Set use_prefill_window_in_sink_attention to False.")
+            assert not (self.num_sink_tokens == 0 and not self.unshifted_sink), \
+                "Fake sink attention with 0 sink tokens is only supported with unshifted_sink=True"
+    @property
+    def prefill_sliding_window(self) -> Optional[int]:
+        if self.window_length is not None:
+            if not self.is_sink or self.use_prefill_window_in_sink_attention:
+                return self.window_length
+        return None
+    @property
+    def is_sliding(self) -> bool:
+        return self.prefill_sliding_window is not None
+    @property
+    def is_sink(self) -> bool:
+        return (
+                (self.window_length is not None)
+                and
+                (self.num_sink_tokens is not None)
+        )
+@partial(dataclass, frozen=True, kw_only=True)
+class FFNConfig(SubblockConfig):
+    ffn_mult: Optional[float] = None
+    def __post_init__(self):
+        super().__post_init__()
+        if self.no_op or self.replace_with_linear:
+            self._force_setattr("ffn_mult", None)
+        else:
+            assert self.ffn_mult is not None
+            self._force_setattr("ffn_mult", round(self.ffn_mult, 6))
+@partial(dataclass, frozen=True, kw_only=True)
+class BlockConfig(JsonComparable):
+    attention: AttentionConfig = MISSING
+    ffn: FFNConfig = MISSING
+    def __post_init__(self):
+        """
+        Init subblock dataclasses from dicts
+        """
+        for subblock_name in dataclasses.fields(self):
+            subblock_config = getattr(self, subblock_name.name)
+            if isinstance(subblock_config, dict):
+                subblock_fields = [field.name for field in dataclasses.fields(subblock_name.type)]
+                unsupported_fields = [field_name for field_name in subblock_config.keys()
+                                      if field_name not in subblock_fields]
+                if len(unsupported_fields) > 0:
+                    warnings.warn(f"Removed unsupported fields {unsupported_fields} from {subblock_name.type.__name__}")
+                subblock_config = {k: v for k, v in subblock_config.items() if k not in unsupported_fields}
+                object.__setattr__(self, subblock_name.name,
+                                   subblock_name.type(**subblock_config))  # __setattr__ to overcome frozen=True

chat_template.jinja ADDED

@@ -0,0 +1,20 @@

+{% set bos = "<|begin_of_text|>" %}{%- set enable_thinking = false -%}{% set system_start_header = "<|start_header_id|>" %}{% set system_end_header = "<|end_header_id|>
+" %}{% set start_header = "<|start_header_id|>" %}{% set end_header = "<|end_header_id|>
+" %}{% set eot = "<|eot_id|>" %}{% set system_token = "system" %}{% set user_token = "user" %}{% set assistant_token = "assistant" %}{% set tool_token = "tool" %}{{- bos ~ system_start_header ~ system_token ~ system_end_header -}}{%- if messages[0].role == 'system' and messages[0].content != '' -%}{%- set system_content = messages[0].content -%}{%- if '/no_think' in system_content -%}{%- set system_content = system_content.replace('/no_think', '')|trim -%}{%- set enable_thinking = false -%}{%- elif '/think' in system_content -%}{%- set system_content = system_content.replace('/think', '')|trim -%}{%- set enable_thinking = true -%}{%- endif -%}{{- system_content + '
+' -}}{%- endif -%}{%- if tools -%}{{- 'You can use the following tools to assist the user if required:
+<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>
+If you decide to call any tool(s), use the following format:
+<TOOLCALL>[{{"name": "tool_name1", "arguments": "tool_args1"}}, {{"name": "tool_name2", "arguments": "tool_args2"}}]</TOOLCALL>
+Response from tool(s) will be returned in this format:
+<TOOL_RESPONSE>[{{"response": "tool_response1"}}, {{"response": "tool_response2"}}]</TOOL_RESPONSE>
+Based on the results returned by the tool(s), you can call additional tools if needed, correct tool calls if any errors are found, or just respond with the answer to the user.' -}}{%- endif -%}{{- eot -}}{%- for message in messages -%}{%- if message.role == user_token -%}{{- start_header ~ user_token ~ end_header -}}{{ message.content -}}{{ eot -}}{%- elif message.role == assistant_token -%}{%- if '</think>' in message.content -%}{%- set content = message.content.split('</think>')[-1].lstrip() -%}{%- else -%}{%- set content = message.content -%}{%- endif -%}{{- start_header ~ assistant_token ~ end_header -}}{{ content -}}{%- if message.tool_calls -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{"name": "' + fn.name + '", "arguments": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- eot -}}{%- elif message.role == tool_token -%}{%- if loop.first or (messages[loop.index0 - 1].role != tool_token) -%}{{- start_header ~ tool_token ~ end_header -}}{{ '<TOOL_RESPONSE>[' -}}{%- endif -%}{{- message.content -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == tool_token) else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != tool_token) -%}{{- ']</TOOL_RESPONSE>' -}}{{ eot -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- start_header ~ assistant_token ~ end_header -}}{%- if not enable_thinking -%}{{- '<think>
+</think>
+' -}}{%- endif -%}{%- endif -%}
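
The template's reasoning toggle is hard to follow in its single-line form. The sketch below restates that logic in plain Python as a reading aid. It is not code from the commit, the function names are hypothetical, and it covers only the system-prompt switch and the generation prompt, not tool handling. The string literals are taken from the `chat_template_jinja` value in the axolotl config.

```python
def resolve_thinking(system_content: str) -> tuple[str, bool]:
    """Mirror the template's /think and /no_think handling (hypothetical helper)."""
    enable_thinking = False  # the template defaults to reasoning off
    if "/no_think" in system_content:
        system_content = system_content.replace("/no_think", "").strip()
    elif "/think" in system_content:
        system_content = system_content.replace("/think", "").strip()
        enable_thinking = True
    return system_content, enable_thinking


def generation_prompt(enable_thinking: bool) -> str:
    """Build the text appended when add_generation_prompt is set."""
    prompt = "<|start_header_id|>assistant<|end_header_id|>\n\n"
    if not enable_thinking:
        # An empty think block is pre-filled so the model skips reasoning.
        prompt += "<think>\n\n</think>\n\n"
    return prompt


cleaned, thinking = resolve_thinking("You are a novelist. /think")
print(repr(cleaned), thinking)
print(repr(generation_prompt(thinking)))
```

Reasoning is therefore opt-in under this template: without `/think` in the system message, the assistant turn starts with an empty `<think>` block.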

config.json ADDED

@@ -0,0 +1,1497 @@

+{
+  "architectures": [
+    "DeciLMForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_decilm.DeciLMConfig",
+    "AutoModelForCausalLM": "modeling_decilm.DeciLMForCausalLM"
+  },
+  "block_configs": [
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 2.625,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 2.625,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 2.625,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 3.28125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 2.625,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 2.625,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 2.625,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.3125,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 0.5,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 0.5,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 0.5,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 0.5,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 1.0,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 0.5,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": null,
+        "no_op": true,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 0.5,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    },
+    {
+      "attention": {
+        "n_heads_in_group": 8,
+        "no_op": false,
+        "num_sink_tokens": null,
+        "replace_with_linear": false,
+        "sparsify": null,
+        "unshifted_sink": false,
+        "use_prefill_window_in_sink_attention": false,
+        "window_length": null
+      },
+      "ffn": {
+        "ffn_mult": 5.25,
+        "no_op": false,
+        "replace_with_linear": false,
+        "sparsify": null
+      }
+    }
+  ],
+  "bos_token_id": 128000,
+  "dtype": "bfloat16",
+  "eos_token_id": 128009,
+  "hidden_act": "silu",
+  "hidden_size": 8192,
+  "initializer_range": 0.02,
+  "intermediate_size": null,
+  "max_position_embeddings": 131072,
+  "mlp_bias": false,
+  "model_type": "nemotron-nas",
+  "num_attention_heads": 64,
+  "num_hidden_layers": 80,
+  "num_key_value_heads": null,
+  "pad_token_id": null,
+  "pretraining_tp": 1,
+  "quantization_config": {
+    "_load_in_4bit": false,
+    "_load_in_8bit": true,
+    "bnb_4bit_compute_dtype": "float32",
+    "bnb_4bit_quant_storage": "uint8",
+    "bnb_4bit_quant_type": "fp4",
+    "bnb_4bit_use_double_quant": false,
+    "llm_int8_enable_fp32_cpu_offload": false,
+    "llm_int8_has_fp16_weight": false,
+    "llm_int8_skip_modules": null,
+    "llm_int8_threshold": 6.0,
+    "load_in_4bit": false,
+    "load_in_8bit": true,
+    "quant_method": "bitsandbytes"
+  },
+  "rms_norm_eps": 1e-05,
+  "rope_parameters": {
+    "factor": 16.0,
+    "high_freq_factor": 4.0,
+    "low_freq_factor": 1.0,
+    "original_max_position_embeddings": 8192,
+    "rope_theta": 500000.0,
+    "rope_type": "llama3"
+  },
+  "rope_theta": 500000.0,
+  "tie_word_embeddings": false,
+  "transformers_version": "5.3.0",
+  "use_cache": false,
+  "vocab_size": 128256
+}
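
The `block_configs` list above is long enough that its per-layer pattern is hard to see by eye. The following is a minimal sketch, not part of the commit, that tallies how many blocks keep attention, how many skip it (`"no_op": true`), and which `ffn_mult` values occur. The function name `summarize_blocks` and the three-block sample are hypothetical; only the key names (`attention`, `ffn`, `no_op`, `replace_with_linear`, `ffn_mult`) are taken from the JSON above.

```python
from collections import Counter


def summarize_blocks(block_configs):
    """Tally attention and FFN variants in a block_configs-style list of dicts."""
    attention_kinds = Counter()
    ffn_mults = Counter()
    for block in block_configs:
        attn = block["attention"]
        if attn["no_op"]:
            attention_kinds["no_op"] += 1      # attention skipped in this block
        elif attn["replace_with_linear"]:
            attention_kinds["linear"] += 1     # attention replaced by a linear layer
        else:
            attention_kinds["attention"] += 1  # regular attention block

        ffn = block["ffn"]
        # Key the FFN tally by multiplier, or by "no_op" when the FFN is skipped.
        ffn_mults["no_op" if ffn["no_op"] else ffn["ffn_mult"]] += 1
    return attention_kinds, ffn_mults


# Hypothetical sample mirroring the shape of the entries in config.json.
sample_blocks = [
    {"attention": {"n_heads_in_group": 8, "no_op": False, "replace_with_linear": False},
     "ffn": {"ffn_mult": 5.25, "no_op": False, "replace_with_linear": False}},
    {"attention": {"n_heads_in_group": None, "no_op": True, "replace_with_linear": False},
     "ffn": {"ffn_mult": 1.3125, "no_op": False, "replace_with_linear": False}},
    {"attention": {"n_heads_in_group": None, "no_op": True, "replace_with_linear": False},
     "ffn": {"ffn_mult": 1.3125, "no_op": False, "replace_with_linear": False}},
]

attention_kinds, ffn_mults = summarize_blocks(sample_blocks)
print(dict(attention_kinds))
print(dict(ffn_mults))
```

To run this against the real file, one could load it with `json.load` and pass the value under the `"block_configs"` key, assuming the file is available locally as `config.json`.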

configuration_decilm.py ADDED

	@@ -0,0 +1,65 @@

+# coding=utf-8
+# Copyright 2024 Nvidia Corporation. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import dataclasses
+import warnings
+from typing import Dict, Any
+
+from transformers.utils import is_flash_attn_2_available
+
+from .block_config import BlockConfig
+from .transformers_4_44_2__configuration_llama import LlamaConfig
+from .transformers_4_44_2__modeling_rope_utils import \
+    rope_config_validation  # fake import to make AutoConfig infer the dependency
+
+rope_config_validation  # this line is here to make sure that auto-formatting doesn't remove the import
+
+
+class DeciLMConfig(LlamaConfig):
+    model_type = "nemotron-nas"
+
+    def __init__(
+            self,
+            block_configs: list[dict] | list[BlockConfig] = None,
+            **kwargs,
+    ):
+        attn_implementation = kwargs.pop("attn_implementation", None)
+        if attn_implementation is None and is_flash_attn_2_available():
+            attn_implementation = "flash_attention_2"
+
+        if block_configs is not None:
+            if isinstance(block_configs[0], dict):
+                block_configs = [BlockConfig(**conf) for conf in block_configs]
+
+            using_unshifted_sink = any([block_config.attention.unshifted_sink for block_config in block_configs])
+            if using_unshifted_sink and attn_implementation != "eager":
+                warnings.warn("Forcing attn_implementation='eager' since some attention layers use unshifted sink")
+                attn_implementation = "eager"
+
+        super().__init__(attn_implementation=attn_implementation, **kwargs)
+
+        self.intermediate_size = None
+        self.num_key_value_heads = None
+
+        if block_configs is not None:
+            assert len(block_configs) == self.num_hidden_layers
+
+        self.block_configs: list[BlockConfig] = block_configs
+
+    def to_dict(self) -> Dict[str, Any]:
+        self_dict = super().to_dict()
+        if self.block_configs is not None:
+            self_dict["block_configs"] = [dataclasses.asdict(conf) for conf in self.block_configs]
+        return self_dict
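
`DeciLMConfig` turns each `block_configs` dict into a `BlockConfig` on load, forces eager attention when any block uses an unshifted sink, and converts the objects back to dicts in `to_dict` via `dataclasses.asdict`. The sketch below illustrates that round trip in isolation. It is not the repository's code: `block_config.py` is not shown in this diff, so `AttentionSketch`, `FFNSketch`, `BlockSketch`, and `pick_attn_implementation` are hypothetical stand-ins with a reduced set of fields, and the nested-dict handling in `__post_init__` is an assumption about how such a dataclass could accept JSON input.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttentionSketch:  # hypothetical stand-in for the attention sub-config
    n_heads_in_group: Optional[int] = None
    no_op: bool = False
    unshifted_sink: bool = False


@dataclass
class FFNSketch:  # hypothetical stand-in for the FFN sub-config
    ffn_mult: Optional[float] = None
    no_op: bool = False


@dataclass
class BlockSketch:  # hypothetical stand-in for BlockConfig
    attention: AttentionSketch
    ffn: FFNSketch

    def __post_init__(self):
        # Accept nested dicts, which is how the values arrive from JSON.
        if isinstance(self.attention, dict):
            self.attention = AttentionSketch(**self.attention)
        if isinstance(self.ffn, dict):
            self.ffn = FFNSketch(**self.ffn)


def pick_attn_implementation(blocks, requested=None):
    """Mirror the config's rule: any unshifted sink forces eager attention."""
    if any(b.attention.unshifted_sink for b in blocks) and requested != "eager":
        return "eager"
    return requested


raw = [
    {"attention": {"n_heads_in_group": 8}, "ffn": {"ffn_mult": 5.25}},
    {"attention": {"no_op": True}, "ffn": {"ffn_mult": 1.3125}},
]
blocks = [BlockSketch(**conf) for conf in raw]            # dicts -> dataclasses
round_tripped = [dataclasses.asdict(b) for b in blocks]   # dataclasses -> dicts
```

Because `dataclasses.asdict` recurses into nested dataclasses, the serialized form includes every field with its default, which is consistent with the fully spelled-out entries in the `config.json` shown earlier in this diff.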

debug.log ADDED

	@@ -0,0 +1,730 @@

+[2026-03-31 02:46:14,052] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:10906] baseline 0.000GB ()
+[2026-03-31 02:46:14,053] [INFO] [axolotl.cli.config.load_cfg:341] [PID:10906] config:
+{
+  "activation_offloading": false,
+  "adapter": "lora",
+  "axolotl_config_path": "writer.yaml",
+  "base_model": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "base_model_config": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "batch_size": 16,
+  "bf16": true,
+  "capabilities": {
+    "bf16": true,
+    "compute_capability": "sm_90",
+    "fp8": true,
+    "n_gpu": 1,
+    "n_node": 1,
+    "tf32": true
+  },
+  "chat_template": "jinja",
+  "chat_template_jinja": "{% set bos = \"<|begin_of_text|>\" %}{%- set enable_thinking = false -%}{% set system_start_header = \"<|start_header_id|>\" %}{% set system_end_header = \"<|end_header_id|>\n\n\" %}{% set start_header = \"<|start_header_id|>\" %}{% set end_header = \"<|end_header_id|>\n\n\" %}{% set eot = \"<|eot_id|>\" %}{% set system_token = \"system\" %}{% set user_token = \"user\" %}{% set assistant_token = \"assistant\" %}{% set tool_token = \"tool\" %}{{- bos ~ system_start_header ~ system_token ~ system_end_header -}}{%- if messages[0].role == 'system' and messages[0].content != '' -%}{%- set system_content = messages[0].content -%}{%- if '/no_think' in system_content -%}{%- set system_content = system_content.replace('/no_think', '')|trim -%}{%- set enable_thinking = false -%}{%- elif '/think' in system_content -%}{%- set system_content = system_content.replace('/think', '')|trim -%}{%- set enable_thinking = true -%}{%- endif -%}{{- system_content + '\n\n' -}}{%- endif -%}{%- if tools -%}{{- 'You can use the following tools to assist the user if required:\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\nIf you decide to call any tool(s), use the following format:\n<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, {{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\nResponse from tool(s) will be returned in this format:\n<TOOL_RESPONSE>[{{\"response\": \"tool_response1\"}}, {{\"response\": \"tool_response2\"}}]</TOOL_RESPONSE>\n\nBased on the results returned by the tool(s), you can call additional tools if needed, correct tool calls if any errors are found, or just respond with the answer to the user.' 
-}}{%- endif -%}{{- eot -}}{%- for message in messages -%}{%- if message.role == user_token -%}{{- start_header ~ user_token ~ end_header -}}{{ message.content -}}{{ eot -}}{%- elif message.role == assistant_token -%}{%- if '</think>' in message.content -%}{%- set content = message.content.split('</think>')[-1].lstrip() -%}{%- else -%}{%- set content = message.content -%}{%- endif -%}{{- start_header ~ assistant_token ~ end_header -}}{{ content -}}{%- if message.tool_calls -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- eot -}}{%- elif message.role == tool_token -%}{%- if loop.first or (messages[loop.index0 - 1].role != tool_token) -%}{{- start_header ~ tool_token ~ end_header -}}{{ '<TOOL_RESPONSE>[' -}}{%- endif -%}{{- message.content -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == tool_token) else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != tool_token) -%}{{- ']</TOOL_RESPONSE>' -}}{{ eot -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- start_header ~ assistant_token ~ end_header -}}{%- if not enable_thinking -%}{{- '<think>\n\n</think>\n\n' -}}{%- endif -%}{%- endif -%}",
+  "context_parallel_size": 1,
+  "dataloader_num_workers": 1,
+  "dataloader_pin_memory": true,
+  "dataloader_prefetch_factor": 256,
+  "dataset_num_proc": 8,
+  "datasets": [
+    {
+      "chat_template": "tokenizer_default",
+      "message_field_training": "train",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/GLiMA_Thinking",
+      "roles_to_train": [],
+      "train_on_eos": "turn",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/Gutenberg-SFT",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/Condor-SFT-Filtered",
+      "split": "train[:250]",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/Ao3_Soft_Refusal",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/VSF",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    }
+  ],
+  "ddp": false,
+  "device": "cuda:0",
+  "device_map": "auto",
+  "dion_rank_fraction": 1.0,
+  "dion_rank_multiple_of": 1,
+  "eaft_alpha": 1.0,
+  "eaft_k": 20,
+  "env_capabilities": {
+    "torch_version": "2.9.1"
+  },
+  "eval_batch_size": 1,
+  "eval_causal_lm_metrics": [
+    "sacrebleu",
+    "comet",
+    "ter",
+    "chrf"
+  ],
+  "eval_max_new_tokens": 128,
+  "eval_sample_packing": true,
+  "eval_table_size": 0,
+  "experimental_skip_move_to_device": true,
+  "flash_attention": true,
+  "fp16": false,
+  "generate_samples": false,
+  "generation_do_sample": true,
+  "generation_max_new_tokens": 50,
+  "generation_prompt_ratio": 0.5,
+  "generation_temperature": 0.7,
+  "gradient_accumulation_steps": 16,
+  "gradient_checkpointing": true,
+  "gradient_checkpointing_kwargs": {
+    "use_reentrant": true
+  },
+  "include_tkps": true,
+  "is_llama_derived_model": true,
+  "layer_offloading": false,
+  "learning_rate": 1.25e-05,
+  "lisa_layers_attribute": "model.layers",
+  "load_best_model_at_end": false,
+  "load_in_4bit": false,
+  "load_in_8bit": false,
+  "local_rank": 0,
+  "logging_steps": 1,
+  "lora_alpha": 64,
+  "lora_dropout": 0.0,
+  "lora_mlp_kernel": false,
+  "lora_o_kernel": false,
+  "lora_qkv_kernel": false,
+  "lora_r": 32,
+  "lora_target_linear": true,
+  "loraplus_lr_embedding": 1e-06,
+  "loraplus_lr_ratio": 16.0,
+  "lr_scheduler": "constant_with_warmup",
+  "max_grad_norm": 1.0,
+  "mean_resizing_embeddings": false,
+  "merge_method": "memory_efficient",
+  "micro_batch_size": 1,
+  "model_config_type": "nemotron-nas",
+  "num_epochs": 3.0,
+  "num_generation_samples": 3,
+  "optimizer": "paged_adamw_8bit",
+  "otel_metrics_host": "localhost",
+  "otel_metrics_port": 8000,
+  "output_dir": "./Writer-Stage-1",
+  "pad_to_sequence_len": true,
+  "pretrain_multipack_attn": true,
+  "profiler_steps_start": 0,
+  "qlora_sharded_model_loading": false,
+  "quantize_moe_experts": false,
+  "ray_num_workers": 1,
+  "resources_per_worker": {
+    "GPU": 1
+  },
+  "sample_packing": true,
+  "sample_packing_bin_size": 200,
+  "sample_packing_group_size": 100000,
+  "save_only_model": false,
+  "save_safetensors": true,
+  "save_strategy": "no",
+  "seed": 42,
+  "sequence_len": 5120,
+  "shuffle_before_merging_datasets": false,
+  "shuffle_merged_datasets": true,
+  "skip_prepare_dataset": false,
+  "streaming_multipack_buffer_size": 10000,
+  "strict": false,
+  "tensor_parallel_size": 1,
+  "tf32": true,
+  "tiled_mlp_use_original_mlp": true,
+  "tokenizer_config": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "tokenizer_save_jinja_files": true,
+  "torch_dtype": "torch.bfloat16",
+  "train_on_inputs": false,
+  "trl": {
+    "async_prefetch": false,
+    "log_completions": false,
+    "mask_truncated_completions": false,
+    "ref_model_mixup_alpha": 0.9,
+    "ref_model_sync_steps": 64,
+    "replay_buffer_size": 0,
+    "replay_recompute_logps": true,
+    "reroll_max_groups": 1,
+    "reroll_start_fraction": 1.0,
+    "reward_num_workers": 1,
+    "scale_rewards": true,
+    "skip_zero_advantage_batches": true,
+    "sync_ref_model": false,
+    "use_data_producer": false,
+    "use_vllm": false,
+    "vllm_lora_sync": false,
+    "vllm_server_host": "0.0.0.0",
+    "vllm_server_port": 8000
+  },
+  "trust_remote_code": true,
+  "use_otel_metrics": false,
+  "use_ray": false,
+  "use_tensorboard": true,
+  "val_set_size": 0.0,
+  "vllm": {
+    "device": "auto",
+    "dtype": "auto",
+    "gpu_memory_utilization": 0.9,
+    "host": "0.0.0.0",
+    "port": 8000
+  },
+  "warmup_ratio": 0.05,
+  "weight_decay": 0.0,
+  "world_size": 1
+}
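A quick arithmetic check on the hyperparameters in the dump above. This is a minimal sketch that assumes the usual conventions: effective batch size is micro batch times accumulation steps times world size, and the LoRA scaling factor is `lora_alpha / lora_r`. The variable names mirror the config keys, and the token figure is an upper bound that relies on `pad_to_sequence_len: true` filling every packed sequence to `sequence_len`.

```python
# Values copied from the config dump; the formulas are common conventions,
# assumed here rather than taken from axolotl's source.
micro_batch_size = 1
gradient_accumulation_steps = 16
world_size = 1
sequence_len = 5120
lora_r = 32
lora_alpha = 64

# Sequences that contribute to one optimizer step.
effective_batch_size = micro_batch_size * gradient_accumulation_steps * world_size

# Upper bound on tokens per optimizer step when every sequence is padded/packed to sequence_len.
tokens_per_step = effective_batch_size * sequence_len

# Multiplier applied to the low-rank update under the standard LoRA convention.
lora_scale = lora_alpha / lora_r

print(effective_batch_size, tokens_per_step, lora_scale)
```

The effective batch size of 16 agrees with the `"batch_size": 16` entry in the config dump logged at 02:46:14,760.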
+[2026-03-31 02:46:14,057] [INFO] [axolotl.utils.schemas.validation.check_eval_packing:129] [PID:10906] explicitly setting `eval_sample_packing` to match `sample_packing`
+[2026-03-31 02:46:14,057] [WARNING] [axolotl.utils.schemas.validation.check_sample_packing_without_attention:190] [PID:10906] sample_packing without flash, sdp, xformers, sage, or flex attention does not handle cross sample decontamination.
+[2026-03-31 02:46:14,057] [INFO] [axolotl.utils.schemas.validation.hint_sample_packing_padding:239] [PID:10906] Setting `pad_to_sequence_len: true` to prevent memory leaks when sample_packing
+[2026-03-31 02:46:14,057] [WARNING] [axolotl.utils.schemas.model.hint_trust_remote_code:103] [PID:10906] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
+[2026-03-31 02:46:14,759] [DEBUG] [axolotl.utils.config.log_gpu_memory_usage:127] [PID:10906] baseline 0.000GB ()
+[2026-03-31 02:46:14,760] [INFO] [axolotl.cli.config.load_cfg:341] [PID:10906] config:
+{
+  "activation_offloading": false,
+  "adapter": "lora",
+  "axolotl_config_path": "writer.yaml",
+  "base_model": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "base_model_config": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "batch_size": 16,
+  "bf16": true,
+  "capabilities": {
+    "bf16": true,
+    "compute_capability": "sm_90",
+    "fp8": true,
+    "n_gpu": 1,
+    "n_node": 1,
+    "tf32": true
+  },
+  "chat_template": "jinja",
+  "chat_template_jinja": "{% set bos = \"<|begin_of_text|>\" %}{%- set enable_thinking = false -%}{% set system_start_header = \"<|start_header_id|>\" %}{% set system_end_header = \"<|end_header_id|>\n\n\" %}{% set start_header = \"<|start_header_id|>\" %}{% set end_header = \"<|end_header_id|>\n\n\" %}{% set eot = \"<|eot_id|>\" %}{% set system_token = \"system\" %}{% set user_token = \"user\" %}{% set assistant_token = \"assistant\" %}{% set tool_token = \"tool\" %}{{- bos ~ system_start_header ~ system_token ~ system_end_header -}}{%- if messages[0].role == 'system' and messages[0].content != '' -%}{%- set system_content = messages[0].content -%}{%- if '/no_think' in system_content -%}{%- set system_content = system_content.replace('/no_think', '')|trim -%}{%- set enable_thinking = false -%}{%- elif '/think' in system_content -%}{%- set system_content = system_content.replace('/think', '')|trim -%}{%- set enable_thinking = true -%}{%- endif -%}{{- system_content + '\n\n' -}}{%- endif -%}{%- if tools -%}{{- 'You can use the following tools to assist the user if required:\n<AVAILABLE_TOOLS>[' -}}{%- for tool in tools -%}{{- (tool.function if tool.function is defined else tool) | tojson -}}{{- ', ' if not loop.last else '' -}}{%- endfor -%}{{- ']</AVAILABLE_TOOLS>\n\nIf you decide to call any tool(s), use the following format:\n<TOOLCALL>[{{\"name\": \"tool_name1\", \"arguments\": \"tool_args1\"}}, {{\"name\": \"tool_name2\", \"arguments\": \"tool_args2\"}}]</TOOLCALL>\n\nResponse from tool(s) will be returned in this format:\n<TOOL_RESPONSE>[{{\"response\": \"tool_response1\"}}, {{\"response\": \"tool_response2\"}}]</TOOL_RESPONSE>\n\nBased on the results returned by the tool(s), you can call additional tools if needed, correct tool calls if any errors are found, or just respond with the answer to the user.' 
-}}{%- endif -%}{{- eot -}}{%- for message in messages -%}{%- if message.role == user_token -%}{{- start_header ~ user_token ~ end_header -}}{{ message.content -}}{{ eot -}}{%- elif message.role == assistant_token -%}{%- if '</think>' in message.content -%}{%- set content = message.content.split('</think>')[-1].lstrip() -%}{%- else -%}{%- set content = message.content -%}{%- endif -%}{{- start_header ~ assistant_token ~ end_header -}}{{ content -}}{%- if message.tool_calls -%}{{- '<TOOLCALL>[' -}}{%- for call in message.tool_calls -%}{%- set fn = call.function if call.function is defined else call -%}{{- '{\"name\": \"' + fn.name + '\", \"arguments\": ' -}}{%- if fn.arguments is string -%}{{- fn.arguments -}}{%- else -%}{{- fn.arguments | tojson -}}{%- endif -%}{{- '}' + (', ' if not loop.last else '') -}}{%- endfor -%}{{- ']</TOOLCALL>' -}}{%- endif -%}{{- eot -}}{%- elif message.role == tool_token -%}{%- if loop.first or (messages[loop.index0 - 1].role != tool_token) -%}{{- start_header ~ tool_token ~ end_header -}}{{ '<TOOL_RESPONSE>[' -}}{%- endif -%}{{- message.content -}}{{- ', ' if not loop.last and (messages[loop.index0 + 1].role == tool_token) else '' -}}{%- if loop.last or (messages[loop.index0 + 1].role != tool_token) -%}{{- ']</TOOL_RESPONSE>' -}}{{ eot -}}{%- endif -%}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{- start_header ~ assistant_token ~ end_header -}}{%- if not enable_thinking -%}{{- '<think>\n\n</think>\n\n' -}}{%- endif -%}{%- endif -%}",
+  "context_parallel_size": 1,
+  "dataloader_num_workers": 1,
+  "dataloader_pin_memory": true,
+  "dataloader_prefetch_factor": 256,
+  "dataset_num_proc": 8,
+  "datasets": [
+    {
+      "chat_template": "tokenizer_default",
+      "message_field_training": "train",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/GLiMA_Thinking",
+      "roles_to_train": [],
+      "train_on_eos": "turn",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/Gutenberg-SFT",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/Condor-SFT-Filtered",
+      "split": "train[:250]",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/Ao3_Soft_Refusal",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    },
+    {
+      "chat_template": "tokenizer_default",
+      "message_property_mappings": {
+        "content": "content",
+        "role": "role"
+      },
+      "path": "ConicCat/VSF",
+      "trust_remote_code": false,
+      "type": "chat_template"
+    }
+  ],
+  "ddp": false,
+  "device": "cuda:0",
+  "device_map": "auto",
+  "dion_rank_fraction": 1.0,
+  "dion_rank_multiple_of": 1,
+  "eaft_alpha": 1.0,
+  "eaft_k": 20,
+  "env_capabilities": {
+    "torch_version": "2.9.1"
+  },
+  "eval_batch_size": 1,
+  "eval_causal_lm_metrics": [
+    "sacrebleu",
+    "comet",
+    "ter",
+    "chrf"
+  ],
+  "eval_max_new_tokens": 128,
+  "eval_sample_packing": true,
+  "eval_table_size": 0,
+  "experimental_skip_move_to_device": true,
+  "flash_attention": false,
+  "fp16": false,
+  "generate_samples": false,
+  "generation_do_sample": true,
+  "generation_max_new_tokens": 50,
+  "generation_prompt_ratio": 0.5,
+  "generation_temperature": 0.7,
+  "gradient_accumulation_steps": 16,
+  "gradient_checkpointing": true,
+  "gradient_checkpointing_kwargs": {
+    "use_reentrant": true
+  },
+  "include_tkps": true,
+  "is_llama_derived_model": true,
+  "layer_offloading": false,
+  "learning_rate": 1.25e-05,
+  "lisa_layers_attribute": "model.layers",
+  "load_best_model_at_end": false,
+  "load_in_4bit": false,
+  "load_in_8bit": false,
+  "local_rank": 0,
+  "logging_steps": 1,
+  "lora_alpha": 64,
+  "lora_dropout": 0.0,
+  "lora_mlp_kernel": false,
+  "lora_o_kernel": false,
+  "lora_qkv_kernel": false,
+  "lora_r": 32,
+  "lora_target_linear": true,
+  "loraplus_lr_embedding": 1e-06,
+  "loraplus_lr_ratio": 16.0,
+  "lr_scheduler": "constant_with_warmup",
+  "max_grad_norm": 1.0,
+  "mean_resizing_embeddings": false,
+  "merge_lora": true,
+  "merge_method": "memory_efficient",
+  "micro_batch_size": 1,
+  "model_config_type": "nemotron-nas",
+  "num_epochs": 3.0,
+  "num_generation_samples": 3,
+  "optimizer": "paged_adamw_8bit",
+  "otel_metrics_host": "localhost",
+  "otel_metrics_port": 8000,
+  "output_dir": "./Writer-Stage-1",
+  "pad_to_sequence_len": true,
+  "pretrain_multipack_attn": true,
+  "profiler_steps_start": 0,
+  "qlora_sharded_model_loading": false,
+  "quantize_moe_experts": false,
+  "ray_num_workers": 1,
+  "resources_per_worker": {
+    "GPU": 1
+  },
+  "sample_packing": true,
+  "sample_packing_bin_size": 200,
+  "sample_packing_group_size": 100000,
+  "save_only_model": false,
+  "save_safetensors": true,
+  "save_strategy": "no",
+  "seed": 42,
+  "sequence_len": 5120,
+  "shuffle_before_merging_datasets": false,
+  "shuffle_merged_datasets": true,
+  "skip_prepare_dataset": false,
+  "streaming_multipack_buffer_size": 10000,
+  "strict": false,
+  "tensor_parallel_size": 1,
+  "tf32": true,
+  "tiled_mlp_use_original_mlp": true,
+  "tokenizer_config": "nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
+  "tokenizer_save_jinja_files": true,
+  "torch_dtype": "torch.bfloat16",
+  "train_on_inputs": false,
+  "trl": {
+    "async_prefetch": false,
+    "log_completions": false,
+    "mask_truncated_completions": false,
+    "ref_model_mixup_alpha": 0.9,
+    "ref_model_sync_steps": 64,
+    "replay_buffer_size": 0,
+    "replay_recompute_logps": true,
+    "reroll_max_groups": 1,
+    "reroll_start_fraction": 1.0,
+    "reward_num_workers": 1,
+    "scale_rewards": true,
+    "skip_zero_advantage_batches": true,
+    "sync_ref_model": false,
+    "use_data_producer": false,
+    "use_vllm": false,
+    "vllm_lora_sync": false,
+    "vllm_server_host": "0.0.0.0",
+    "vllm_server_port": 8000
+  },
+  "trust_remote_code": true,
+  "use_otel_metrics": false,
+  "use_ray": false,
+  "use_tensorboard": true,
+  "val_set_size": 0.0,
+  "vllm": {
+    "device": "auto",
+    "dtype": "auto",
+    "gpu_memory_utilization": 0.9,
+    "host": "0.0.0.0",
+    "port": 8000
+  },
+  "warmup_ratio": 0.05,
+  "weight_decay": 0.0,
+  "world_size": 1
+}
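The `chat_template_jinja` string in the dump above toggles reasoning through `/think` and `/no_think` markers in the system message. The sketch below restates that branch in plain Python to make the behaviour easier to follow. The function names `resolve_thinking` and `generation_prefix` are hypothetical helpers, not part of axolotl or the tokenizer, and the sketch covers only the system-message and generation-prompt branches of the template.

```python
from typing import Tuple


def resolve_thinking(system_content: str) -> Tuple[str, bool]:
    """Mirror the template's handling of /think and /no_think in the system message."""
    enable_thinking = False  # the template starts with enable_thinking = false
    if "/no_think" in system_content:
        system_content = system_content.replace("/no_think", "").strip()
    elif "/think" in system_content:
        system_content = system_content.replace("/think", "").strip()
        enable_thinking = True
    return system_content, enable_thinking


def generation_prefix(enable_thinking: bool) -> str:
    """Text the template emits when add_generation_prompt is set."""
    header = "<|start_header_id|>assistant<|end_header_id|>\n\n"
    # With thinking disabled, the template pre-fills an empty think block.
    return header if enable_thinking else header + "<think>\n\n</think>\n\n"


content, thinking = resolve_thinking("You are a novelist. /think")
print(repr(content), thinking)
print(repr(generation_prefix(thinking)))
```

Because the default is `false`, a conversation without a system message, or with a system message that carries neither marker, gets the empty think block appended to the generation prompt.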
+[2026-03-31 02:46:14,760] [DEBUG] [axolotl.cli.merge_lora.do_merge_lora:32] [PID:10906] Using memory-efficient LoRA merging method...
+[2026-03-31 02:46:14,760] [DEBUG] [axolotl.cli.merge_lora._do_merge_lora_efficient:79] [PID:10906] Using memory-efficient LoRA merging method...
+[2026-03-31 02:46:19,620] [DEBUG] [axolotl.cli.utils.lora_merge.merge_lora_sharded_efficient:854] [PID:10906] Loading LoRA weights from Writer-Stage-1/adapter_model.safetensors
+[2026-03-31 02:46:19,633] [DEBUG] [axolotl.cli.utils.lora_merge.merge_lora_sharded_efficient:860] [PID:10906] Keeping LoRA weights on CPU; will move per-tensor during merge
+[2026-03-31 02:46:19,633] [DEBUG] [axolotl.cli.utils.lora_merge.merge_lora_sharded_efficient:866] [PID:10906] Found 21 model shards in /workspace/data/huggingface-cache/hub/models--nvidia--Llama-3_3-Nemotron-Super-49B-v1_5/snapshots/420ba7d28211abf116b8b103ab700d92619daf98
+[2026-03-31 02:46:19,633] [INFO] [axolotl.cli.utils.lora_merge.copy_non_model_files:303] [PID:10906] Copying non-model files to output directory...
+[2026-03-31 02:46:19,633] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying config.json to output
+[2026-03-31 02:46:19,633] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying configuration_decilm.py to output
+[2026-03-31 02:46:19,633] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__configuration_llama.py to output
+[2026-03-31 02:46:19,634] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__modeling_rope_utils.py to output
+[2026-03-31 02:46:19,634] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying block_config.py to output
+[2026-03-31 02:46:19,634] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying tokenizer_config.json to output
+[2026-03-31 02:46:19,634] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying tokenizer.json to output
+[2026-03-31 02:46:19,638] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying special_tokens_map.json to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying modeling_decilm.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__modeling_outputs.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__cache_utils.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__pytorch_utils.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__activations.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying variable_cache.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__modeling_flash_attention_utils_backward_compat.py to output
+[2026-03-31 02:46:19,639] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying transformers_4_44_2__modeling_attn_mask_utils.py to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying generation_config.json to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying llama_nemotron_toolcall_parser_no_streaming.py to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying README.md to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying PRIVACY.md to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying BIAS.md to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying .gitattributes to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying accuracy_chart.png to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying SAFETY&SECURITY.md to output
+[2026-03-31 02:46:19,640] [DEBUG] [axolotl.cli.utils.lora_merge.copy_non_model_files:324] [PID:10906] Copying EXPLAINABILITY.md to output
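The `_merge_tensor_with_lora` lines that follow each print two shapes, for example `[32, 8192]` and `[14336, 32]`. These are consistent with a rank-32 pair of LoRA matrices, A (rank x in_features) and B (out_features x rank), being folded into one base weight. The sketch below shows the standard merge formula, W' = W + (alpha / r) * B @ A, on toy matrices in pure Python. It illustrates the arithmetic only and is not axolotl's implementation; `matmul` and `merge_lora` are hypothetical helper names.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: out_features=2, in_features=3, rank=1.
# alpha=2 with r=1 gives the same scale (2.0) as lora_alpha=64 with lora_r=32.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # rank x in_features
lora_b = [[0.5], [-0.5]]     # out_features x rank
merged = merge_lora(weight, lora_a, lora_b, alpha=2, r=1)
print(merged)
```

Merging one tensor at a time, as the log describes ("will move per-tensor during merge"), keeps peak memory near the size of a single weight matrix plus its low-rank factors instead of the full model.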
+[2026-03-31 02:46:20,696] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.0.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([14336, 32])
+[2026-03-31 02:46:21,426] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.0.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([14336, 32])
+[2026-03-31 02:46:22,225] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.0.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:22,280] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.0.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:22,820] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.0.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:23,341] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.0.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:23,394] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:46:24,838] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:26,250] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:27,647] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:27,699] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:28,154] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:28,618] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.1.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:28,670] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.2.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:28,722] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.2.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:29,202] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.2.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:34,816] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.2.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:36,246] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.2.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:37,651] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.2.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:38,131] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:46:39,614] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:41,043] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:42,447] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:42,497] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:42,956] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:43,452] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.3.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:43,505] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:46:44,942] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:46,427] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:47,935] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:47,987] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:48,458] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:48,915] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.4.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:54,686] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.5.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:56,164] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.5.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:46:57,648] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.5.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:57,687] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.5.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:58,223] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.5.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:46:58,718] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.5.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:46:58,764] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.6.mlp.down_proj.weight: torch.Size([32, 14336]), torch.Size([8192, 32])
+[2026-03-31 02:46:59,519] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.6.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([14336, 32])
+[2026-03-31 02:47:00,309] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.6.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([14336, 32])
+[2026-03-31 02:47:01,112] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.7.mlp.down_proj.weight: torch.Size([32, 14336]), torch.Size([8192, 32])
+[2026-03-31 02:47:01,813] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.7.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([14336, 32])
+[2026-03-31 02:47:02,515] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.7.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([14336, 32])
+[2026-03-31 02:47:03,290] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:47:04,711] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:06,150] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:07,654] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:07,706] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:08,159] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:08,635] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.8.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:08,687] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:08,782] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:09,296] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:14,969] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.10.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:16,429] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.10.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:17,853] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.10.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:17,888] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.10.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:18,394] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.10.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:18,937] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.10.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:18,982] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.11.mlp.down_proj.weight: torch.Size([32, 17920]), torch.Size([8192, 32])
+[2026-03-31 02:47:19,927] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.11.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([17920, 32])
+[2026-03-31 02:47:20,862] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.11.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([17920, 32])
+[2026-03-31 02:47:21,841] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.12.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:23,267] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.12.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:23,318] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.12.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:23,760] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.12.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:24,253] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.12.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:24,310] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:47:25,835] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:27,317] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:28,733] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.9.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:34,842] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.12.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:36,252] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:47:37,651] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:39,044] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:40,529] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:40,576] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:41,056] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:41,533] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.13.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:41,585] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:47:43,009] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:44,442] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:45,912] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:45,965] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:46,456] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:46,904] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.14.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:46,961] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.15.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:47,017] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.15.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:47,457] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.15.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:47:47,951] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.15.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:47:53,756] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.15.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:55,227] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.15.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:56,651] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:47:58,125] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:47:59,605] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:01,046] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:01,081] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:01,556] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:02,015] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.16.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:02,051] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:48:03,496] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:04,865] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:06,337] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:06,373] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:06,859] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:07,346] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.17.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:07,382] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.18.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:07,482] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.18.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:08,034] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.18.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:14,139] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.18.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:15,617] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.18.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:17,054] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.18.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:17,529] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:48:18,954] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:20,434] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:21,897] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:21,935] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:22,358] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:22,851] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.19.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:22,904] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:48:24,335] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:25,757] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:27,225] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:27,282] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:27,819] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:28,336] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.20.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:34,285] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.21.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:35,749] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.21.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:37,242] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.21.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:37,284] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.21.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:37,791] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.21.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:38,257] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.21.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:38,294] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:48:39,730] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:41,243] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:42,655] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:42,694] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:43,157] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:43,653] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.22.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:43,706] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.23.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:45,154] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.23.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:46,559] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.23.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:46,612] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.23.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:47,125] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.23.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:47,652] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.23.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:53,245] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:48:54,731] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:56,186] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:48:57,635] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:57,671] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:58,217] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:48:58,732] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.24.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:48:58,785] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:49:00,244] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:01,692] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:03,112] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:03,169] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:03,656] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:04,140] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.25.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:04,195] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.26.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:05,648] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.26.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:05,706] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.26.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:06,253] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.26.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:06,752] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.26.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:12,102] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.26.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:13,563] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:49:15,043] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:16,455] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:17,848] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:17,894] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:18,356] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:18,837] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.27.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:18,887] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:49:20,329] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:21,814] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:23,243] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:23,295] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:23,830] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:24,322] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.28.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:24,376] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.29.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:24,429] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.29.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:24,936] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.29.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:25,455] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.29.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:31,229] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.29.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:32,646] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.29.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:34,113] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:49:35,536] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:36,945] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:38,428] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:38,474] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:38,953] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:39,444] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.30.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:39,496] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.mlp.down_proj.weight: torch.Size([32, 28672]), torch.Size([8192, 32])
+[2026-03-31 02:49:40,842] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.mlp.gate_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:42,246] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.mlp.up_proj.weight: torch.Size([32, 8192]), torch.Size([28672, 32])
+[2026-03-31 02:49:43,648] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:43,706] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.self_attn.o_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:44,155] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:44,646] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.31.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:44,701] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.32.self_attn.k_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:44,781] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.32.self_attn.q_proj.weight: torch.Size([32, 8192]), torch.Size([8192, 32])
+[2026-03-31 02:49:45,256] [DEBUG] [axolotl.cli.utils.lora_merge._merge_tensor_with_lora:411] [PID:10906] Merging LoRA for model.layers.32.self_attn.v_proj.weight: torch.Size([32, 8192]), torch.Size([1024, 32])
+[2026-03-31 02:49:49,768] [ERROR] [axolotl.telemetry.errors.wrapper:158] [PID:10906] Error captured in telemetry. Run ID: 77193302-fa43-4dfd-ab04-45c91b8c4748
+Traceback (most recent call last):
+  File "/root/miniconda3/envs/py3.11/bin/axolotl", line 6, in <module>
+    sys.exit(main())
+             ^^^^^^
+  File "/workspace/axolotl/src/axolotl/cli/main.py", line 347, in main
+    cli()
+  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/click/core.py", line 1485, in __call__
+    return self.main(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/click/core.py", line 1406, in main
+    rv = self.invoke(ctx)
+         ^^^^^^^^^^^^^^^^
+  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/click/core.py", line 1873, in invoke
+    return _process_result(sub_ctx.command.invoke(sub_ctx))
+                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/click/core.py", line 1269, in invoke
+    return ctx.invoke(self.callback, **ctx.params)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/click/core.py", line 824, in invoke
+    return callback(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/axolotl/src/axolotl/cli/utils/args.py", line 48, in wrapper
+    return func(*args, **filtered_kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/axolotl/src/axolotl/cli/main.py", line 293, in merge_lora
+    do_cli(config=config, **kwargs)
+  File "/workspace/axolotl/src/axolotl/cli/merge_lora.py", line 169, in do_cli
+    do_merge_lora(cfg=parsed_cfg)
+  File "/workspace/axolotl/src/axolotl/telemetry/errors.py", line 127, in wrapper
+    return func(*args, **kwargs)
+           ^^^^^^^^^^^^^^^^^^^^^
+  File "/workspace/axolotl/src/axolotl/cli/merge_lora.py", line 33, in do_merge_lora
+    _do_merge_lora_efficient(cfg=cfg)
+  File "/workspace/axolotl/src/axolotl/cli/merge_lora.py", line 108, in _do_merge_lora_efficient
+    merge_lora_sharded_efficient(
+  File "/workspace/axolotl/src/axolotl/cli/utils/lora_merge.py", line 940, in merge_lora_sharded_efficient
+    safetensors.torch.save_file(
+  File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/safetensors/torch.py", line 307, in save_file
+    serialize_file(_flatten(tensors), filename, metadata=metadata)
+safetensors_rust.SafetensorError: Error while serializing: I/O error: No space left on device (os error 28)
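
The traceback above ends in `safetensors.torch.save_file` with `No space left on device (os error 28)`, so the merge ran out of disk while writing merged shards. A pre-flight check along the following lines might catch this earlier. It is only a sketch: `estimate_checkpoint_bytes` and `has_room_for` are hypothetical helpers, not part of axolotl or safetensors, and the parameter count and 2-byte (bf16) storage are assumptions about the merged model.

```python
import shutil
import tempfile


def estimate_checkpoint_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Rough size of the saved weights, ignoring file headers and metadata."""
    return num_params * bytes_per_param


def has_room_for(path: str, required_bytes: int, safety_margin: float = 0.10) -> bool:
    """Return True if the filesystem holding `path` has `required_bytes` free,
    plus a fractional safety margin."""
    if required_bytes < 0:
        raise ValueError("required_bytes must be non-negative")
    free = shutil.disk_usage(path).free
    return free >= int(required_bytes * (1.0 + safety_margin))


if __name__ == "__main__":
    # Assumption: about 49e9 parameters stored in bf16 (2 bytes each).
    needed = estimate_checkpoint_bytes(49_000_000_000, bytes_per_param=2)
    out_dir = tempfile.gettempdir()  # stand-in for the real output directory
    if not has_room_for(out_dir, needed):
        print(f"Not enough free space in {out_dir} for ~{needed / 1e9:.0f} GB of weights")
```

Running such a check before the merge starts costs nothing, whereas the failure in the log only surfaced after many layers had already been merged.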

runs/Mar31_01-27-28_b8de28f8ab2a/events.out.tfevents.1774920448.b8de28f8ab2a.3556.0 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:76b91afdc544c9574c1ccca57556de6d448d7ec49328dfe6e6f02bce5d22f2b7
+size 46082
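
The three added lines above are a Git LFS pointer file rather than the binary itself: a spec version, an object id (hash algorithm and digest), and the size in bytes. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:76b91afdc544c9574c1ccca57556de6d448d7ec49328dfe6e6f02bce5d22f2b7
size 46082
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size"])
```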

runs/Mar31_01-31-17_b8de28f8ab2a/events.out.tfevents.1774920677.b8de28f8ab2a.6000.0 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:01aafbab852593ccec0eeaee70a3ad2a5a537356fd3d439706fa189c52846c1c
+size 48037

runs/Mar31_01-41-00_b8de28f8ab2a/events.out.tfevents.1774921260.b8de28f8ab2a.9806.0 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8f6e84333aff94e98274813987c957ec4450376a766c375677fceeb038fc0aa2
+size 81795

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6b9e4e7fb171f92fd137b777cc2714bf87d11576700a1dcd7a399e7bbe39537b
+size 17209920

tokenizer_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "backend": "tokenizers",
+  "bos_token": "<|begin_of_text|>",
+  "clean_up_tokenization_spaces": true,
+  "eos_token": "<|eot_id|>",
+  "is_local": false,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 131072,
+  "pad_token": "<|eot_id|>",
+  "tokenizer_class": "TokenizersBackend"
+}

transformers_4_44_2__configuration_llama.py ADDED

	@@ -0,0 +1,203 @@

+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""LLaMA model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from .transformers_4_44_2__modeling_rope_utils import rope_config_validation
+class LlamaConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the LLaMA-7B.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 32000):
+            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`LlamaModel`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 11008):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by mean-pooling all the original heads within that group. For more details, check out [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+            `num_attention_heads`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 2048):
+            The maximum sequence length that this model might ever be used with. Llama 1 supports up to 2048 tokens,
+            Llama 2 up to 4096, CodeLlama up to 16384.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        pad_token_id (`int`, *optional*):
+            Padding token id.
+        bos_token_id (`int`, *optional*, defaults to 1):
+            Beginning of stream token id.
+        eos_token_id (`int`, *optional*, defaults to 2):
+            End of stream token id.
+        pretraining_tp (`int`, *optional*, defaults to 1):
+            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
+            understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
+            results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether to tie the input and output word embeddings.
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. NOTE: if you apply a new rope type
+            and expect the model to work with a longer `max_position_embeddings`, we recommend updating this value
+            accordingly.
+            Expected contents:
+                `rope_type` (`str`):
+                    The sub-variant of RoPE to use. Can be one of ['default', 'linear', 'dynamic', 'yarn', 'longrope',
+                    'llama3'], with 'default' being the original RoPE implementation.
+                `factor` (`float`, *optional*):
+                    Used with all rope types except 'default'. The scaling factor to apply to the RoPE embeddings. In
+                    most scaling types, a `factor` of x will enable the model to handle sequences of length x *
+                    original maximum pre-trained length.
+                `original_max_position_embeddings` (`int`, *optional*):
+                    Used with 'dynamic', 'longrope' and 'llama3'. The original max position embeddings used during
+                    pretraining.
+                `attention_factor` (`float`, *optional*):
+                    Used with 'yarn' and 'longrope'. The scaling factor to be applied on the attention
+                    computation. If unspecified, it defaults to the value recommended by the implementation, using the
+                    `factor` field to infer the suggested value.
+                `beta_fast` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for extrapolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 32.
+                `beta_slow` (`float`, *optional*):
+                    Only used with 'yarn'. Parameter to set the boundary for interpolation (only) in the linear
+                    ramp function. If unspecified, it defaults to 1.
+                `short_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to short contexts (<
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `long_factor` (`List[float]`, *optional*):
+                    Only used with 'longrope'. The scaling factor to be applied to long contexts (>
+                    `original_max_position_embeddings`). Must be a list of numbers with the same length as the hidden
+                    size divided by the number of attention heads divided by 2
+                `low_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to low frequency components of the RoPE
+                `high_freq_factor` (`float`, *optional*):
+                    Only used with 'llama3'. Scaling factor applied to high frequency components of the RoPE
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers during self-attention.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+        mlp_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
+    ```python
+    >>> from transformers import LlamaModel, LlamaConfig
+    >>> # Initializing a LLaMA llama-7b style configuration
+    >>> configuration = LlamaConfig()
+    >>> # Initializing a model from the llama-7b style configuration
+    >>> model = LlamaModel(configuration)
+    >>> # Accessing the model configuration
+    >>> configuration = model.config
+    ```"""
+    model_type = "llama"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        vocab_size=32000,
+        hidden_size=4096,
+        intermediate_size=11008,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=None,
+        hidden_act="silu",
+        max_position_embeddings=2048,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        pad_token_id=None,
+        bos_token_id=1,
+        eos_token_id=2,
+        pretraining_tp=1,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        attention_bias=False,
+        attention_dropout=0.0,
+        mlp_bias=False,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.pretraining_tp = pretraining_tp
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        self.mlp_bias = mlp_bias
+        # Validate the correctness of rotary position embeddings parameters
+        # BC: if there is a 'type' field, move it to 'rope_type'.
+        if self.rope_scaling is not None and "type" in self.rope_scaling:
+            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
+        rope_config_validation(self)
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )

transformers_4_44_2__modeling_rope_utils.py ADDED

	@@ -0,0 +1,559 @@

+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import math
+from typing import Optional, Tuple
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import is_torch_available, logging
+logger = logging.get_logger(__name__)
+if is_torch_available():
+    import torch
+def _compute_default_rope_parameters(
+    config: Optional[PretrainedConfig] = None,
+    device: Optional["torch.device"] = None,
+    seq_len: Optional[int] = None,
+    **rope_kwargs,
+) -> Tuple["torch.Tensor", float]:
+    """
+    Computes the inverse frequencies according to the original RoPE implementation
+    Args:
+        config ([`~transformers.PretrainedConfig`]):
+            The model configuration.
+        device (`torch.device`):
+            The device to use for initialization of the inverse frequencies.
+        seq_len (`int`, *optional*):
+            The current sequence length. Unused for this type of RoPE.
+        rope_kwargs (`Dict`, *optional*):
+            Backward compatibility with the previous RoPE class instantiation; will be removed in v4.45.
+    Returns:
+        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+    """
+    if config is not None and len(rope_kwargs) > 0:
+        raise ValueError(
+            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
+            f"`_compute_default_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
+        )
+    if len(rope_kwargs) > 0:
+        base = rope_kwargs["base"]
+        dim = rope_kwargs["dim"]
+    elif config is not None:
+        base = config.rope_theta
+        partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+        head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        dim = int(head_dim * partial_rotary_factor)
+    attention_factor = 1.0  # Unused in this type of RoPE
+    # Compute the inverse frequencies
+    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device) / dim))
+    return inv_freq, attention_factor
+def _compute_linear_scaling_rope_parameters(
+    config: Optional[PretrainedConfig] = None,
+    device: Optional["torch.device"] = None,
+    seq_len: Optional[int] = None,
+    **rope_kwargs,
+) -> Tuple["torch.Tensor", float]:
+    """
+    Computes the inverse frequencies with linear scaling. Credits to the Reddit user /u/kaiokendev
+    Args:
+        config ([`~transformers.PretrainedConfig`]):
+            The model configuration.
+        device (`torch.device`):
+            The device to use for initialization of the inverse frequencies.
+        seq_len (`int`, *optional*):
+            The current sequence length. Unused for this type of RoPE.
+        rope_kwargs (`Dict`, *optional*):
+            Backward compatibility with the previous RoPE class instantiation; will be removed in v4.45.
+    Returns:
+        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+    """
+    if config is not None and len(rope_kwargs) > 0:
+        raise ValueError(
+            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
+            f"`_compute_linear_scaling_rope_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
+        )
+    if len(rope_kwargs) > 0:
+        factor = rope_kwargs["factor"]
+    elif config is not None:
+        factor = config.rope_scaling["factor"]
+    # Gets the default RoPE parameters
+    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)
+    # Then applies linear scaling to the frequencies.
+    # NOTE: originally, scaling was applied to the position_ids. However, we get `embs = inv_freq @ position_ids`, so
+    # applying scaling to the inverse frequencies is equivalent.
+    inv_freq /= factor
+    return inv_freq, attention_factor
+def _compute_dynamic_ntk_parameters(
+    config: Optional[PretrainedConfig] = None,
+    device: Optional["torch.device"] = None,
+    seq_len: Optional[int] = None,
+    **rope_kwargs,
+) -> Tuple["torch.Tensor", float]:
+    """
+    Computes the inverse frequencies with NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla
+    Args:
+        config ([`~transformers.PretrainedConfig`]):
+            The model configuration.
+        device (`torch.device`):
+            The device to use for initialization of the inverse frequencies.
+        seq_len (`int`, *optional*):
+            The current sequence length, used to update the dynamic RoPE at inference time.
+        rope_kwargs (`Dict`, *optional*):
+            Backward compatibility with the previous RoPE class instantiation; will be removed in v4.45.
+    Returns:
+        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+        post-processing scaling factor applied to the computed cos/sin (unused in this type of RoPE).
+    """
+    # TODO (joao): use the new `original_max_position_embeddings` from rope_scaling
+    if config is not None and len(rope_kwargs) > 0:
+        raise ValueError(
+            "Unexpected arguments: `**rope_kwargs` and `config` are mutually exclusive in "
+            f"`_compute_dynamic_ntk_parameters`, got `rope_kwargs`={rope_kwargs} and `config`={config}"
+        )
+    if len(rope_kwargs) > 0:
+        base = rope_kwargs["base"]
+        dim = rope_kwargs["dim"]
+        max_position_embeddings = rope_kwargs["max_position_embeddings"]
+        factor = rope_kwargs["factor"]
+    elif config is not None:
+        base = config.rope_theta
+        partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+        head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+        dim = int(head_dim * partial_rotary_factor)
+        max_position_embeddings = config.max_position_embeddings
+        factor = config.rope_scaling["factor"]
+    attention_factor = 1.0  # Unused in this type of RoPE
+    # seq_len: default to max_position_embeddings, e.g. at init time
+    seq_len = seq_len if seq_len is not None and seq_len > max_position_embeddings else max_position_embeddings
+    # Compute the inverse frequencies
+    base = base * ((factor * seq_len / max_position_embeddings) - (factor - 1)) ** (dim / (dim - 2))
+    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.int64).float().to(device) / dim))
+    return inv_freq, attention_factor
+def _compute_yarn_parameters(
+    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
+) -> Tuple["torch.Tensor", float]:
+    """
+    Computes the inverse frequencies with YaRN scaling. Please refer to the
+    [original paper](https://arxiv.org/abs/2309.00071)
+    Args:
+        config ([`~transformers.PretrainedConfig`]):
+            The model configuration.
+        device (`torch.device`):
+            The device to use for initialization of the inverse frequencies.
+        seq_len (`int`, *optional*):
+            The current sequence length. Unused for this type of RoPE.
+        rope_kwargs (`Dict`, *optional*):
+            Backward compatibility with the previous RoPE class instantiation; will be removed in v4.45.
+    Returns:
+        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+        post-processing scaling factor applied to the computed cos/sin.
+    """
+    # No need to keep BC with yarn, unreleased when this new pattern was created.
+    if len(rope_kwargs) > 0:
+        raise ValueError(
+            f"Unexpected arguments: `**rope_kwargs` should be unset in `_compute_yarn_parameters`, got {rope_kwargs}"
+        )
+    base = config.rope_theta
+    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+    dim = int(head_dim * partial_rotary_factor)
+    max_position_embeddings = config.max_position_embeddings
+    factor = config.rope_scaling["factor"]
+    # Sets the attention factor as suggested in the paper
+    attention_factor = config.rope_scaling.get("attention_factor")
+    if attention_factor is None:
+        attention_factor = 0.1 * math.log(factor) + 1.0
+    # Optional config options
+    # beta_fast/beta_slow: as suggested in the paper, default to 32/1 (correspondingly)
+    beta_fast = config.rope_scaling.get("beta_fast") or 32
+    beta_slow = config.rope_scaling.get("beta_slow") or 1
+    # Compute the inverse frequencies
+    def find_correction_dim(num_rotations, dim, base, max_position_embeddings):
+        """Inverse dimension formula to find the dimension based on the number of rotations"""
+        return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))
+    def find_correction_range(low_rot, high_rot, dim, base, max_position_embeddings):
+        """Find dimension range bounds based on rotations"""
+        low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
+        high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
+        return max(low, 0), min(high, dim - 1)
+    def linear_ramp_factor(min, max, dim):
+        if min == max:
+            max += 0.001  # Prevent singularity
+        linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
+        ramp_func = torch.clamp(linear_func, 0, 1)
+        return ramp_func
+    # Note on variable naming: "interpolation" comes from the original technique, where we interpolate the position IDs
+    # to expand the possible context length. In other words, interpolation = apply scaling factor.
+    pos_freqs = base ** (torch.arange(0, dim, 2).float().to(device) / dim)
+    inv_freq_extrapolation = 1.0 / pos_freqs
+    inv_freq_interpolation = 1.0 / (factor * pos_freqs)
+    low, high = find_correction_range(beta_fast, beta_slow, dim, base, max_position_embeddings)
+    # Get n-dimensional rotational scaling corrected for extrapolation
+    inv_freq_extrapolation_factor = 1 - linear_ramp_factor(low, high, dim // 2).float().to(device)
+    inv_freq = (
+        inv_freq_interpolation * (1 - inv_freq_extrapolation_factor)
+        + inv_freq_extrapolation * inv_freq_extrapolation_factor
+    )
+    return inv_freq, attention_factor
+def _compute_longrope_parameters(
+    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
+) -> Tuple["torch.Tensor", float]:
+    """
+    Computes the inverse frequencies with LongRoPE scaling. Please refer to the
+    [original implementation](https://github.com/microsoft/LongRoPE)
+    Args:
+        config ([`~transformers.PretrainedConfig`]):
+            The model configuration.
+        device (`torch.device`):
+            The device to use for initialization of the inverse frequencies.
+        seq_len (`int`, *optional*):
+            The current sequence length. Unused for this type of RoPE.
+        rope_kwargs (`Dict`, *optional*):
+            Backward compatibility with the previous RoPE class instantiation; will be removed in v4.45.
+    Returns:
+        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+        post-processing scaling factor applied to the computed cos/sin.
+    """
+    # TODO (joao): use the new `original_max_position_embeddings` from rope_scaling
+    # No need to keep BC with longrope, unreleased when this new pattern was created.
+    if len(rope_kwargs) > 0:
+        raise ValueError(
+            "Unexpected arguments: `**rope_kwargs` should be unset in `_compute_longrope_parameters`, got "
+            f"{rope_kwargs}"
+        )
+    base = config.rope_theta
+    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+    dim = int(head_dim * partial_rotary_factor)
+    long_factor = config.rope_scaling["long_factor"]
+    short_factor = config.rope_scaling["short_factor"]
+    factor = config.rope_scaling.get("factor")
+    attention_factor = config.rope_scaling.get("attention_factor")
+    # NOTE: Phi3 (and potentially other models) modify `max_position_embeddings` and have a
+    # `original_max_position_embeddings` field containing the pretrained value. They use the ratio between these two
+    # values to compute the default attention scaling factor, instead of using `factor`.
+    if hasattr(config, "original_max_position_embeddings"):
+        max_position_embeddings = config.original_max_position_embeddings
+        expanded_max_position_embeddings = config.max_position_embeddings
+        factor = expanded_max_position_embeddings / max_position_embeddings
+    else:
+        max_position_embeddings = config.max_position_embeddings
+        expanded_max_position_embeddings = max_position_embeddings * factor
+    # Sets the attention factor as suggested in the paper
+    if attention_factor is None:
+        if factor <= 1.0:
+            attention_factor = 1.0
+        else:
+            attention_factor = math.sqrt(1 + math.log(factor) / math.log(max_position_embeddings))
+    # Compute the inverse frequencies -- scaled based on the target sequence length
+    if expanded_max_position_embeddings > max_position_embeddings:
+        ext_factors = torch.tensor(long_factor, dtype=torch.float32, device=device)
+    else:
+        ext_factors = torch.tensor(short_factor, dtype=torch.float32, device=device)
+    inv_freq_shape = torch.arange(0, dim, 2, dtype=torch.int64, device=device).float() / dim
+    inv_freq = 1.0 / (ext_factors * base**inv_freq_shape)
+    return inv_freq, attention_factor
+def _compute_llama3_parameters(
+    config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None, **rope_kwargs
+) -> Tuple["torch.Tensor", float]:
+    """
+    Computes the inverse frequencies for Llama 3.1.
+    Args:
+        config ([`~transformers.PretrainedConfig`]):
+            The model configuration.
+        device (`torch.device`):
+            The device to use for initialization of the inverse frequencies.
+        seq_len (`int`, *optional*):
+            The current sequence length. Unused for this type of RoPE.
+        rope_kwargs (`Dict`, *optional*):
+            Backward compatibility with the previous RoPE class instantiation; will be removed in v4.45.
+    Returns:
+        Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
+        post-processing scaling factor applied to the computed cos/sin.
+    """
+    # Gets the default RoPE parameters
+    inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len, **rope_kwargs)
+    factor = config.rope_scaling["factor"]  # `8` in the original implementation
+    low_freq_factor = config.rope_scaling["low_freq_factor"]  # `1` in the original implementation
+    high_freq_factor = config.rope_scaling["high_freq_factor"]  # `4` in the original implementation
+    old_context_len = config.rope_scaling["original_max_position_embeddings"]  # `8192` in the original implementation
+    low_freq_wavelen = old_context_len / low_freq_factor
+    high_freq_wavelen = old_context_len / high_freq_factor
+    wavelen = 2 * math.pi / inv_freq
+    # wavelen < high_freq_wavelen: do nothing
+    # wavelen > low_freq_wavelen: divide by factor
+    inv_freq_llama = torch.where(wavelen > low_freq_wavelen, inv_freq / factor, inv_freq)
+    # otherwise: interpolate between the two, using a smooth factor
+    smooth_factor = (old_context_len / wavelen - low_freq_factor) / (high_freq_factor - low_freq_factor)
+    smoothed_inv_freq = (1 - smooth_factor) * inv_freq_llama / factor + smooth_factor * inv_freq_llama
+    is_medium_freq = ~(wavelen < high_freq_wavelen) * ~(wavelen > low_freq_wavelen)
+    inv_freq_llama = torch.where(is_medium_freq, smoothed_inv_freq, inv_freq_llama)
+    return inv_freq_llama, attention_factor
+# This maps the "rope_type" string field in rope config to the corresponding function to compute the RoPE parameters
+# from the model config. You can append new {'rope_type': callable} pairs to this dictionary to enable custom RoPE
+# parameterizations, as long as the callable has the same signature.
+ROPE_INIT_FUNCTIONS = {
+    "default": _compute_default_rope_parameters,
+    "linear": _compute_linear_scaling_rope_parameters,
+    "dynamic": _compute_dynamic_ntk_parameters,
+    "yarn": _compute_yarn_parameters,
+    "longrope": _compute_longrope_parameters,
+    "llama3": _compute_llama3_parameters,
+}
+def _check_received_keys(rope_type: str, received_keys: set, required_keys: set, optional_keys: Optional[set] = None):
+    """Compare the received keys in `config.rope_scaling` against the expected and optional keys"""
+    # BC: "rope_type" was originally "type" -- let's gracefully handle it
+    if "rope_type" not in received_keys and "type" in received_keys:
+        received_keys -= {"type"}
+        received_keys.add("rope_type")
+    missing_keys = required_keys - received_keys
+    if missing_keys:
+        raise KeyError(f"Missing required keys in `rope_scaling` for 'rope_type'='{rope_type}': {missing_keys}")
+    if optional_keys is not None:
+        unused_keys = received_keys - required_keys - optional_keys
+    else:
+        unused_keys = received_keys - required_keys
+    if unused_keys:
+        logger.warning(f"Unrecognized keys in `rope_scaling` for 'rope_type'='{rope_type}': {unused_keys}")
+def _validate_default_rope_parameters(config: PretrainedConfig):
+    rope_scaling = config.rope_scaling
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+    required_keys = {"rope_type"}
+    received_keys = set(rope_scaling.keys())
+    _check_received_keys(rope_type, received_keys, required_keys)
+def _validate_linear_scaling_rope_parameters(config: PretrainedConfig):
+    rope_scaling = config.rope_scaling
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+    required_keys = {"rope_type", "factor"}
+    received_keys = set(rope_scaling.keys())
+    _check_received_keys(rope_type, received_keys, required_keys)
+    factor = rope_scaling["factor"]
+    if factor is None or not isinstance(factor, float) or factor < 1.0:
+        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
+def _validate_dynamic_scaling_rope_parameters(config: PretrainedConfig):
+    rope_scaling = config.rope_scaling
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+    required_keys = {"rope_type", "factor"}
+    # TODO (joao): update logic for the inclusion of `original_max_position_embeddings`
+    optional_keys = {"original_max_position_embeddings"}
+    received_keys = set(rope_scaling.keys())
+    _check_received_keys(rope_type, received_keys, required_keys, optional_keys)
+    factor = rope_scaling["factor"]
+    if factor is None or not isinstance(factor, float) or factor < 1.0:
+        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
+def _validate_yarn_parameters(config: PretrainedConfig):
+    rope_scaling = config.rope_scaling
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+    required_keys = {"rope_type", "factor"}
+    optional_keys = {"attention_factor", "beta_fast", "beta_slow"}
+    received_keys = set(rope_scaling.keys())
+    _check_received_keys(rope_type, received_keys, required_keys, optional_keys)
+    factor = rope_scaling["factor"]
+    if factor is None or not isinstance(factor, float) or factor < 1.0:
+        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
+    attention_factor = rope_scaling.get("attention_factor")
+    if attention_factor is not None and (not isinstance(attention_factor, float) or attention_factor < 0):
+        logger.warning(
+            f"`rope_scaling`'s attention_factor field must be a float greater than 0, got {attention_factor}"
+        )
+    beta_fast = rope_scaling.get("beta_fast")
+    if beta_fast is not None and not isinstance(beta_fast, float):
+        logger.warning(f"`rope_scaling`'s beta_fast field must be a float, got {beta_fast}")
+    beta_slow = rope_scaling.get("beta_slow")
+    if beta_slow is not None and not isinstance(beta_slow, float):
+        logger.warning(f"`rope_scaling`'s beta_slow field must be a float, got {beta_slow}")
+    if (beta_fast or 32) < (beta_slow or 1):
+        logger.warning(
+            f"`rope_scaling`'s beta_fast field must be greater than beta_slow, got beta_fast={beta_fast} "
+            f"(defaults to 32 if None) and beta_slow={beta_slow} (defaults to 1 if None)"
+        )
+def _validate_longrope_parameters(config: PretrainedConfig):
+    rope_scaling = config.rope_scaling
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+    required_keys = {"rope_type", "short_factor", "long_factor"}
+    # TODO (joao): update logic for the inclusion of `original_max_position_embeddings`
+    optional_keys = {"attention_factor", "factor", "original_max_position_embeddings"}
+    received_keys = set(rope_scaling.keys())
+    _check_received_keys(rope_type, received_keys, required_keys, optional_keys)
+    partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
+    head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
+    dim = int(head_dim * partial_rotary_factor)
+    short_factor = rope_scaling.get("short_factor")
+    if not (isinstance(short_factor, list) and all(isinstance(x, (int, float)) for x in short_factor)):
+        logger.warning(f"`rope_scaling`'s short_factor field must be a list of numbers, got {short_factor}")
+    if not len(short_factor) == dim // 2:
+        logger.warning(f"`rope_scaling`'s short_factor field must have length {dim // 2}, got {len(short_factor)}")
+    long_factor = rope_scaling.get("long_factor")
+    if not (isinstance(long_factor, list) and all(isinstance(x, (int, float)) for x in long_factor)):
+        logger.warning(f"`rope_scaling`'s long_factor field must be a list of numbers, got {long_factor}")
+    if not len(long_factor) == dim // 2:
+        logger.warning(f"`rope_scaling`'s long_factor field must have length {dim // 2}, got {len(long_factor)}")
+    # Handle Phi3 divergence: prefer the use of `attention_factor` and/or `factor` over
+    # `original_max_position_embeddings` to compute internal variables. The latter lives outside `rope_scaling` and is
+    # unique to longrope (= undesirable)
+    if hasattr(config, "original_max_position_embeddings"):
+        logger.warning_once(
+            "This model has set an `original_max_position_embeddings` field, to be used together with "
+            "`max_position_embeddings` to determine a scaling factor. Please set the `factor` field of `rope_scaling` "
+            "with this ratio instead -- we recommend the use of this field over `original_max_position_embeddings`, "
+            "as it is compatible with most model architectures."
+        )
+    else:
+        factor = rope_scaling.get("factor")
+        if factor is None:
+            logger.warning("Missing required keys in `rope_scaling`: 'factor'")
+        elif not isinstance(factor, float) or factor < 1.0:
+            logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
+        attention_factor = rope_scaling.get("attention_factor")
+        if attention_factor is not None and (not isinstance(attention_factor, float) or attention_factor < 0):
+            logger.warning(
+                f"`rope_scaling`'s attention_factor field must be a float greater than 0, got {attention_factor}"
+            )
+def _validate_llama3_parameters(config: PretrainedConfig):
+    rope_scaling = config.rope_scaling
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None))  # BC: "rope_type" was originally "type"
+    required_keys = {"rope_type", "factor", "original_max_position_embeddings", "low_freq_factor", "high_freq_factor"}
+    received_keys = set(rope_scaling.keys())
+    _check_received_keys(rope_type, received_keys, required_keys)
+    factor = rope_scaling["factor"]
+    if factor is None or not isinstance(factor, float) or factor < 1.0:
+        logger.warning(f"`rope_scaling`'s factor field must be a float >= 1, got {factor}")
+    low_freq_factor = rope_scaling["low_freq_factor"]
+    high_freq_factor = rope_scaling["high_freq_factor"]
+    if low_freq_factor is None or not isinstance(low_freq_factor, float):
+        logger.warning(f"`rope_scaling`'s low_freq_factor field must be a float, got {low_freq_factor}")
+    if high_freq_factor is None or not isinstance(high_freq_factor, float):
+        logger.warning(f"`rope_scaling`'s high_freq_factor field must be a float, got {high_freq_factor}")
+    if high_freq_factor <= low_freq_factor:
+        logger.warning(
+            "`rope_scaling`'s high_freq_factor field must be greater than low_freq_factor, got high_freq_factor="
+            f"{high_freq_factor} and low_freq_factor={low_freq_factor}"
+        )
+    original_max_position_embeddings = rope_scaling["original_max_position_embeddings"]
+    if original_max_position_embeddings is None or not isinstance(original_max_position_embeddings, int):
+        logger.warning(
+            "`rope_scaling`'s original_max_position_embeddings field must be an integer, got "
+            f"{original_max_position_embeddings}"
+        )
+    if original_max_position_embeddings >= config.max_position_embeddings:
+        logger.warning(
+            "`rope_scaling`'s original_max_position_embeddings field must be less than max_position_embeddings, got "
+            f"{original_max_position_embeddings} and max_position_embeddings={config.max_position_embeddings}"
+        )
+# Like `ROPE_INIT_FUNCTIONS`, this validation function mapping can be dynamically updated for custom RoPE types.
+ROPE_VALIDATION_FUNCTIONS = {
+    "default": _validate_default_rope_parameters,
+    "linear": _validate_linear_scaling_rope_parameters,
+    "dynamic": _validate_dynamic_scaling_rope_parameters,
+    "yarn": _validate_yarn_parameters,
+    "longrope": _validate_longrope_parameters,
+    "llama3": _validate_llama3_parameters,
+}
+def rope_config_validation(config: PretrainedConfig):
+    """
+    Validate the RoPE config arguments, given a `PretrainedConfig` object
+    """
+    rope_scaling = getattr(config, "rope_scaling", None)  # not a default parameter in `PretrainedConfig`
+    if rope_scaling is None:
+        return
+    # BC: "rope_type" was originally "type"
+    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
+    validation_fn = ROPE_VALIDATION_FUNCTIONS.get(rope_type)
+    if validation_fn is not None:
+        validation_fn(config)
+    else:
+        logger.warning(
+            f"Missing validation function mapping in `ROPE_VALIDATION_FUNCTIONS` for 'rope_type'='{rope_type}'"
+        )