---
base_model: zai-org/GLM-4.7-Flash
library_name: peft
model_name: output-2
tags:
- base_model:adapter:zai-org/GLM-4.7-Flash
- lora
- sft
- transformers
- trl
licence: license
pipeline_tag: text-generation
---

# output-2

This model is a LoRA fine-tune of [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash), trained with supervised fine-tuning (SFT) via TRL and PEFT.
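
A minimal inference sketch, assuming the adapter weights are published under a repo id of your own (shown here as the placeholder `your-username/output-2`) and that `transformers` and `peft` are installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "zai-org/GLM-4.7-Flash"
adapter_id = "your-username/output-2"  # placeholder: replace with the actual adapter repo id

tokenizer = AutoTokenizer.from_pretrained(base_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype="bfloat16", trust_remote_code=True
)
# Attach the LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(base, adapter_id)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```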

**W&B run:** [https://wandb.ai/cooawoo-personal/huggingface/runs/ay5ml51v](https://wandb.ai/cooawoo-personal/huggingface/runs/ay5ml51v)

## Training procedure

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning rate | `2e-05` |
| LR scheduler | cosine |
| Per-device batch size | 1 |
| Gradient accumulation | 4 |
| Effective batch size | 4 |
| Epochs | 2 |
| Max sequence length | 8192 |
| Optimizer | adamw_torch_fused |
| Weight decay | 0.01 |
| Warmup ratio | 0.03 |
| Max gradient norm | 1.0 |
| Precision | bf16 |
| Gradient checkpointing | yes |
| Loss type | nll |
| Chunked cross-entropy | yes |
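
The effective batch size in the table follows directly from the per-device batch size and gradient accumulation; the value of 4 implies a single device (an assumption, since the card does not state the GPU count):

```python
per_device_batch = 1  # from the table above
grad_accum = 4        # gradient accumulation steps
num_gpus = 1          # assumption: effective batch of 4 implies one device

effective_batch = per_device_batch * grad_accum * num_gpus
assert effective_batch == 4  # matches the table
```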


### LoRA configuration

| Parameter | Value |
|-----------|-------|
| Rank (r) | 32 |
| Alpha | 16 |
| Target modules | kv_a_proj_with_mqa, kv_b_proj, mlp.down_proj, mlp.gate_proj, mlp.up_proj, o_proj, q_a_proj, q_b_proj, shared_expert.down_proj, shared_expert.gate_proj, shared_expert.up_proj |
| rsLoRA | yes |
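
With rsLoRA enabled, the adapter output is scaled by `alpha / sqrt(r)` rather than the classic `alpha / r`, which keeps the update magnitude stable as the rank grows. For the values above the two scalings work out to:

```python
import math

r, alpha = 32, 16  # rank and alpha from the table above

classic_scaling = alpha / r            # standard LoRA: 0.5
rslora_scaling = alpha / math.sqrt(r)  # rank-stabilized LoRA: ~2.83
```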


### Dataset statistics

| Dataset | Samples | Total tokens | Trainable tokens |
|---------|--------:|-------------:|-----------------:|
| rpDungeon/some-revised-datasets/springdragon_processed.jsonl | 2,473 | 5,421,492 | 5,421,492 |
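
A quick sanity check on the table: the average sample is well under the 8192-token window, and the `split` truncation strategy implies at least this many training chunks (a lower bound, since each sample is split independently):

```python
samples = 2_473
total_tokens = 5_421_492
max_length = 8_192

avg_tokens = total_tokens / samples      # ~2192 tokens per sample on average
min_chunks = -(-total_tokens // max_length)  # ceiling division: lower bound on chunks
```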


<details>
<summary>Training config</summary>

```yaml
model_name_or_path: zai-org/GLM-4.7-Flash
data_config: data.yaml
prepared_dataset: prepared
output_dir: output-2
attn_implementation: flash_attention_2
bf16: true
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
use_cce: true
padding_free: false
dataloader_num_workers: 4
dataloader_pin_memory: true
aux_loss_top_prob_weight: 0.05
neftune_noise_alpha: 5
max_length: 8192
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
truncation_strategy: split
use_peft: true
lora_r: 32
lora_alpha: 16
lora_dropout: 0.0
use_rslora: true
lora_target_modules:
- q_a_proj
- q_b_proj
- kv_a_proj_with_mqa
- kv_b_proj
- o_proj
- shared_expert.gate_proj
- shared_expert.up_proj
- shared_expert.down_proj
- mlp.gate_proj
- mlp.up_proj
- mlp.down_proj
model_init_kwargs:
  trust_remote_code: true
  torch_dtype: bfloat16
trust_remote_code: true
optim: adamw_torch_fused
learning_rate: 2.0e-05
lr_scheduler_type: cosine
warmup_ratio: 0.03
weight_decay: 0.01
max_grad_norm: 1.0
num_train_epochs: 2
logging_steps: 1
disable_tqdm: false
saves_per_epoch: 4
eval_strategy: 'no'
save_total_limit: 3
report_to: wandb
run_name: glm47-sonic-springdragon
```

</details>
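
For reference, the `cosine` schedule with `warmup_ratio: 0.03` from the config above can be sketched as a simplified reimplementation (not the exact Transformers scheduler code):

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Linear warmup for the first warmup_ratio of steps, then cosine decay to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```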

<details>
<summary>Data config</summary>

```yaml
datasets:
- path: rpDungeon/some-revised-datasets
  data_files: springdragon_processed.jsonl
  type: text
  columns:
  - text
  truncation_strategy: split
shuffle_datasets: true
shuffle_combined: true
shuffle_seed: 42
eval_split: 0
split_seed: 42
assistant_only_loss: false
```

</details>

### Framework versions

- PEFT: 0.18.1
- Loft: 0.1.0
- Transformers: 5.3.0
- Pytorch: 2.9.1
- Datasets: 4.6.1
- Tokenizers: 0.22.2