File size: 5,433 Bytes
1705308
 
 
 
 
 
 
e88f53f
 
1705308
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e88f53f
1705308
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e88f53f
 
 
 
1705308
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
---
library_name: transformers
license: apache-2.0
base_model: AiForgeMaster/Qwen3-4B-P3-TC-1
tags:
- axolotl
- generated_from_trainer
datasets:
- AiForgeMaster/glaiceai-natural-reasoning-10k
model-index:
- name: Qwen3-4B-P3-TC-RSSFT-1
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.13.0.dev0`
```yaml
# axolotl train config.yaml

# Prevent NCCL timeout
ddp_timeout: 7200  # 2 hours timeout instead of 10 minutes

# Load model from local models directory first, fallback to HuggingFace if not found
base_model: AiForgeMaster/Qwen3-4B-P3-TC-1  # Local path - will fallback to Qwen/Qwen3-4B if not found locally
# Automatically upload checkpoint and final model to HF
hub_model_id: AiForgeMaster/Qwen3-4B-P3-TC-RSSFT-1

load_in_8bit: false
load_in_4bit: false
strict: false

# SFT dataset configuration - using HuggingFace datasets
datasets:
  - path: AiForgeMaster/glaiceai-natural-reasoning-10k  # Private HF dataset - requires API key
    type: alpaca_chat.load_qa
    # skip: 0 # number of rows of data to skip over from the beginning

# Local paths relative to working directory
dataset_prepared_path: ./data/prepared
val_set_size: 0.0  # Set to 0 for SFT (no validation split)
output_dir: ./outputs

# Cache directories for HuggingFace downloads (relative to working dir)
# This ensures models and datasets are downloaded to local directories
hf_use_auth_token: true  # Use HF token for private repos if needed

sequence_len: 8192
sample_packing: false  # Standard for SFT
eval_sample_packing: false  # Disable for SFT

# WandB configuration - fill in your details
wandb_project: ngpt-cpt
wandb_entity: null
wandb_watch: gradients
wandb_name: qwen3_4b_p3_tc_rssft_1
wandb_log_model: end

# Batch size configuration (total effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus)
# For batch size 8-16: micro_batch_size=2, gradient_accumulation_steps=4 gives effective batch size of 8 per GPU
gradient_accumulation_steps: 4
micro_batch_size: 8  # Adjust based on your GPU memory
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5  # Good learning rate for SFT

bf16: auto
tf32: true

max_grad_norm: 1.0

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 10  # Log every 10 steps
flash_attention: true

warmup_steps: 150  # Good warmup for SFT
# Checkpoint saving configuration - save every 50 steps
save_steps: 50
save_strategy: steps
save_total_limit: 5  # Keep only 5 most recent checkpoints
save_only_model: false  # Save full checkpoint including optimizer state

# Evaluation configuration removed for pure SFT (val_set_size: 0.0)
# eval_steps: 2000  # Not supported when val_set_size == 0
# eval_strategy: steps  # Not supported when val_set_size == 0
weight_decay: 0.01  # Good weight decay for SFT

# Liger optimizations for memory efficiency and speed
plugins:
  - axolotl.integrations.liger.LigerPlugin

liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

# Additional SFT optimizations
# Enable for first run to validate checkpoint saving works
save_first_step: true

# Memory optimizations
dataloader_pin_memory: true
dataloader_num_workers: 4
remove_unused_columns: true

# Advanced training settings for SFT
# Calculate max_steps for full epoch: dataset_size / (micro_batch_size * gradient_accumulation_steps * num_gpus)
# max_steps: 175  # Set for one full epoch with your dataset size
num_epochs: 1
group_by_length: true  # Good for SFT efficiency
train_on_inputs: true  # train on user inputs in SFT

# Loss monitoring
loss_watchdog_threshold: 10.0  # Stop if loss exceeds this value
loss_watchdog_patience: 3

# Garbage collection to manage memory
gc_steps: 100  # Run garbage collection every 100 steps
```

</details><br>

[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/uskfoundation/ngpt-cpt/runs/azltguw6)
# Qwen3-4B-P3-TC-RSSFT-1

This model is a fine-tuned version of [AiForgeMaster/Qwen3-4B-P3-TC-1](https://huggingface.co/AiForgeMaster/Qwen3-4B-P3-TC-1) on the AiForgeMaster/glaiceai-natural-reasoning-10k dataset.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 150
- training_steps: 312

### Training results



### Framework versions

- Transformers 4.55.4
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.21.4