---
library_name: transformers
license: apache-2.0
base_model: AiForgeMaster/Qwen3-4B-P3-TC-1
tags:
- axolotl
- generated_from_trainer
datasets:
- AiForgeMaster/glaiceai-natural-reasoning-10k
model-index:
- name: Qwen3-4B-P3-TC-RSSFT-1
  results: []
---
[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>
axolotl version: `0.13.0.dev0`
```yaml
# axolotl train config.yaml
# Prevent NCCL timeout
ddp_timeout: 7200 # 2 hours timeout instead of 10 minutes
# Base model: resolved from a local models directory first, downloaded from the Hugging Face Hub otherwise
base_model: AiForgeMaster/Qwen3-4B-P3-TC-1
# Automatically upload checkpoint and final model to HF
hub_model_id: AiForgeMaster/Qwen3-4B-P3-TC-RSSFT-1
load_in_8bit: false
load_in_4bit: false
strict: false
# SFT dataset configuration - using HuggingFace datasets
datasets:
  - path: AiForgeMaster/glaiceai-natural-reasoning-10k # Private HF dataset - requires API key
    type: alpaca_chat.load_qa
    # skip: 0 # number of rows of data to skip over from the beginning
# Local paths relative to working directory
dataset_prepared_path: ./data/prepared
val_set_size: 0.0 # Set to 0 for SFT (no validation split)
output_dir: ./outputs
# Cache directories for HuggingFace downloads (relative to working dir)
# This ensures models and datasets are downloaded to local directories
hf_use_auth_token: true # Use HF token for private repos if needed
sequence_len: 8192
sample_packing: false # Standard for SFT
eval_sample_packing: false # Disable for SFT
# WandB configuration - fill in your details
wandb_project: ngpt-cpt
wandb_entity: null
wandb_watch: gradients
wandb_name: qwen3_4b_p3_tc_rssft_1
wandb_log_model: end
# Batch size configuration (total effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus)
# micro_batch_size=8 with gradient_accumulation_steps=4 gives an effective batch size of 32 per GPU
gradient_accumulation_steps: 4
micro_batch_size: 8 # Adjust based on your GPU memory
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5 # Good learning rate for SFT
bf16: auto
tf32: true
max_grad_norm: 1.0
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 10 # Log every 10 steps
flash_attention: true
warmup_steps: 150 # Good warmup for SFT
# Checkpoint saving configuration - save every 50 steps
save_steps: 50
save_strategy: steps
save_total_limit: 5 # Keep only 5 most recent checkpoints
save_only_model: false # Save full checkpoint including optimizer state
# Evaluation configuration removed for pure SFT (val_set_size: 0.0)
# eval_steps: 2000 # Not supported when val_set_size == 0
# eval_strategy: steps # Not supported when val_set_size == 0
weight_decay: 0.01 # Good weight decay for SFT
# Liger optimizations for memory efficiency and speed
plugins:
- axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
# Additional SFT optimizations
# Enable for first run to validate checkpoint saving works
save_first_step: true
# Memory optimizations
dataloader_pin_memory: true
dataloader_num_workers: 4
remove_unused_columns: true
# Advanced training settings for SFT
# Calculate max_steps for full epoch: dataset_size / (micro_batch_size * gradient_accumulation_steps * num_gpus)
# max_steps: 312 # One full epoch: ~10,000 examples / effective batch size 32
num_epochs: 1
group_by_length: true # Good for SFT efficiency
train_on_inputs: true # train on user inputs in SFT
# Loss monitoring
loss_watchdog_threshold: 10.0 # Stop if loss exceeds this value
loss_watchdog_patience: 3
# Garbage collection to manage memory
gc_steps: 100 # Run garbage collection every 100 steps
```
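
The `alpaca_chat.load_qa` strategy above consumes question-answer pairs. To my reading of Axolotl's loader, each dataset row carries a `question` and an `answer` field; treat the field names as an assumption and verify them against your Axolotl version. A minimal sketch of one JSONL training record:

```python
import json

# Hypothetical example record for alpaca_chat.load_qa-style QA data.
# Field names ("question"/"answer") are an assumption; check your Axolotl version.
record = {
    "question": "What is the boiling point of water at sea level?",
    "answer": "100 degrees Celsius (212 degrees Fahrenheit).",
}

line = json.dumps(record)      # one JSONL line per training example
parsed = json.loads(line)
print(sorted(parsed.keys()))   # ['answer', 'question']
```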
</details><br>
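
The `loss_watchdog_threshold` / `loss_watchdog_patience` pair in the config aborts a run whose loss diverges. A minimal sketch of the idea (my own illustration of the mechanism, not Axolotl's implementation):

```python
def watchdog_should_stop(losses, threshold=10.0, patience=3):
    """Return True once `patience` consecutive losses exceed `threshold`."""
    streak = 0
    for loss in losses:
        # A single healthy step resets the streak; only sustained spikes trigger a stop.
        streak = streak + 1 if loss > threshold else 0
        if streak >= patience:
            return True
    return False

print(watchdog_should_stop([2.1, 11.0, 12.5, 2.0]))   # False: streak broken by a healthy step
print(watchdog_should_stop([2.1, 11.0, 12.5, 13.0]))  # True: 3 bad steps in a row
```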
[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/uskfoundation/ngpt-cpt/runs/azltguw6)
# Qwen3-4B-P3-TC-RSSFT-1
This model is a fine-tuned version of [AiForgeMaster/Qwen3-4B-P3-TC-1](https://huggingface.co/AiForgeMaster/Qwen3-4B-P3-TC-1) on the AiForgeMaster/glaiceai-natural-reasoning-10k dataset.
## Model description
This is a full-parameter supervised fine-tune (SFT) of [AiForgeMaster/Qwen3-4B-P3-TC-1](https://huggingface.co/AiForgeMaster/Qwen3-4B-P3-TC-1), trained with Axolotl without 4-bit or 8-bit quantization. Further details are not yet documented.
## Intended uses & limitations
More information needed
## Training and evaluation data
Training used the private AiForgeMaster/glaiceai-natural-reasoning-10k question-answer dataset, loaded with Axolotl's `alpaca_chat.load_qa` strategy. No validation split was held out (`val_set_size: 0.0`), so no evaluation ran during training.
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 32
- optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 150
- training_steps: 312
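
The step count is consistent with the batch arithmetic in the config: the effective batch size is micro_batch_size × gradient_accumulation_steps × num_gpus, and one epoch over the dataset then yields about 312 optimizer steps. Note that the single-GPU count and the exact dataset size are inferred, not stated in the card:

```python
micro_batch_size = 8
gradient_accumulation_steps = 4
num_gpus = 1        # assumed: total_train_batch_size of 32 implies a single GPU
dataset_size = 10_000  # assumed from the dataset name ("...-10k")

effective_batch = micro_batch_size * gradient_accumulation_steps * num_gpus
steps_per_epoch = dataset_size // effective_batch

print(effective_batch)   # 32, matching total_train_batch_size above
print(steps_per_epoch)   # 312, matching training_steps above
```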
### Training results
### Framework versions
- Transformers 4.55.4
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.21.4