See axolotl config

axolotl version: `0.12.2`

```yaml
base_model: Coloss/Qwen3-8B-Instruct
#base_model: /leonardo_work/EUHPC_A04_045/training/model-fp32

# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

strict: false

#resume_from_checkpoint: /leonardo_work/EUHPC_A04_045/training/ale_outputs/pluto-8B-sft/checkpoint-4040
#resume_from_checkpoint: /leonardo_work/EUHPC_A04_045/training/ale_outputs/pluto-8B-sft-32
#auto_resume_from_checkpoints: true

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_fused_linear_cross_entropy: true
liger_cross_entropy: false  # explicitly disabled so the fused version takes over
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true

#chat_template: qwen3
datasets:
  - path: Coloss/Omnia-v5-Nesso
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
#dataset_prepared_path: ./ale_outputs/tokenized-omni-v5-v.2
dataset_prepared_path: /leonardo_work/EUHPC_A04_045/training/ale_outputs/tokenized-omnia-v6-nesso
val_set_size: 0.0005
output_dir: ./ale_outputs/pluto-8B-sft

#do_bench_eval: true
#bench_dataset: /leonardo_work/EUHPC_A04_045/training/examples/qwen3/eval_mix_train.json

sequence_len: 8000
excess_length_strategy: truncate
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 5
#max_steps: 50

optimizer: adamw_torch_fused  #adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 4e-5

bf16: auto
fp16: false
tf32: true

wandb_mode: "offline"
wandb_project: pluto-8b
wandb_entity: mii-llm
wandb_name: pluto-8b-sft

#gradient_checkpointing: true
#gradient_checkpointing_kwargs:
#  use_reentrant: false

logging_steps: 1
sdp_attention: false
flash_attention: true
warmup_ratio: 0.1
evals_per_epoch: 5
saves_per_epoch: 5
save_total_limit: 5
weight_decay: 0.0

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_offload_optimizer: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT  #SHARDED_STATE_DICT
  fsdp_activation_checkpointing: true

special_tokens:
```
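The `datasets` section above expects ShareGPT-style records, where each row holds a `conversations` list whose messages use `from`/`value` keys; `message_property_mappings` renames those to the `role`/`content` keys the chat template expects. A minimal sketch of that renaming, with a made-up record (the message contents are hypothetical, not taken from the dataset):

```python
# Illustrative ShareGPT-style record matching the config's field names:
# field_messages = conversations, role <- from, content <- value.
record = {
    "conversations": [
        {"from": "human", "value": "Ciao, come stai?"},
        {"from": "gpt", "value": "Tutto bene, grazie!"},
    ]
}

# What the mapping effectively does before the chat template is applied:
# rename the per-message keys from from/value to role/content.
messages = [
    {"role": m["from"], "content": m["value"]}
    for m in record["conversations"]
]

print(messages[0]["role"])  # human
```

If the dataset already used `role`/`content` keys, the `message_property_mappings` block could be dropped entirely.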
# ale_outputs/pluto-8B-sft
This model is a fine-tuned version of Coloss/Qwen3-8B-Instruct on the Coloss/Omnia-v5-Nesso dataset. It achieves the following results on the evaluation set:

- Loss: 0.6982
- Max memory active (GiB): 48.58
- Max memory allocated (GiB): 47.86
- Device memory reserved (GiB): 53.4
## Model description

More information needed.

## Intended uses & limitations

More information needed.

## Training and evaluation data

More information needed.
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 4e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 487
- training_steps: 4875
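The derived values in this list can be cross-checked from the config. A quick sketch (variable names are just for illustration; the arithmetic uses only the numbers above):

```python
# Values taken from the config / hyperparameter list above.
micro_batch_size = 2
gradient_accumulation_steps = 4
num_devices = 32
training_steps = 4875
warmup_ratio = 0.1
num_epochs = 5
evals_per_epoch = 5

# Effective global batch size: per-device batch x accumulation x devices.
total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices

# Warmup steps from warmup_ratio, and the evaluation interval implied by
# evals_per_epoch over the total step budget.
warmup_steps = int(training_steps * warmup_ratio)
steps_per_epoch = training_steps // num_epochs
eval_interval = steps_per_epoch // evals_per_epoch

print(total_train_batch_size)  # 256
print(warmup_steps)            # 487
print(eval_interval)           # 195
```

The 195-step interval matches the evaluation steps logged in the results table (195, 390, 585, ...).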
### Training results

| Training Loss | Epoch | Step | Validation Loss | Mem Active (GiB) | Mem Allocated (GiB) | Mem Reserved (GiB) |
|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.3858 | 48.57 | 47.85 | 49.2 |
| 0.7354 | 0.1999 | 195 | 0.7380 | 48.58 | 47.86 | 53.4 |
| 0.6999 | 0.3999 | 390 | 0.7103 | 48.58 | 47.86 | 53.4 |
| 0.6721 | 0.5998 | 585 | 0.6947 | 48.58 | 47.86 | 53.4 |
| 0.6809 | 0.7998 | 780 | 0.6828 | 48.58 | 47.86 | 53.4 |
| 0.6463 | 0.9997 | 975 | 0.6737 | 48.58 | 47.86 | 53.4 |
| 0.6561 | 1.1989 | 1170 | 0.6666 | 48.58 | 47.86 | 53.4 |
| 0.6136 | 1.3989 | 1365 | 0.6657 | 48.58 | 47.86 | 53.4 |
| 0.5917 | 1.5988 | 1560 | 0.6652 | 48.58 | 47.86 | 53.4 |
| 0.6024 | 1.7988 | 1755 | 0.6618 | 48.58 | 47.86 | 53.4 |
| 0.5731 | 1.9987 | 1950 | 0.6581 | 48.58 | 47.86 | 53.4 |
| 0.5549 | 2.1979 | 2145 | 0.6580 | 48.58 | 47.86 | 53.4 |
| 0.5455 | 2.3978 | 2340 | 0.6639 | 48.58 | 47.86 | 53.4 |
| 0.5214 | 2.5978 | 2535 | 0.6702 | 48.58 | 47.86 | 53.4 |
| 0.5185 | 2.7977 | 2730 | 0.6680 | 48.58 | 47.86 | 53.4 |
| 0.5032 | 2.9977 | 2925 | 0.6625 | 48.58 | 47.86 | 53.4 |
| 0.5031 | 3.1969 | 3120 | 0.6705 | 48.58 | 47.86 | 53.4 |
| 0.4604 | 3.3968 | 3315 | 0.6857 | 48.58 | 47.86 | 53.4 |
| 0.4369 | 3.5968 | 3510 | 0.6898 | 48.58 | 47.86 | 53.4 |
| 0.4483 | 3.7967 | 3705 | 0.6856 | 48.58 | 47.86 | 53.4 |
| 0.455 | 3.9967 | 3900 | 0.6826 | 48.58 | 47.86 | 53.4 |
| 0.4433 | 4.1958 | 4095 | 0.6905 | 48.58 | 47.86 | 53.4 |
| 0.4112 | 4.3958 | 4290 | 0.7021 | 48.58 | 47.86 | 53.4 |
| 0.3856 | 4.5957 | 4485 | 0.7068 | 48.58 | 47.86 | 53.4 |
| 0.415 | 4.7957 | 4680 | 0.7011 | 48.58 | 47.86 | 53.4 |
| 0.4301 | 4.9956 | 4875 | 0.6982 | 48.58 | 47.86 | 53.4 |
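Worth noting from the table: validation loss bottoms out around epoch 2 and drifts upward afterwards while training loss keeps falling, so the later epochs overfit. A small sketch using a subset of rows transcribed from the table:

```python
# (step, validation loss) pairs transcribed from the table above (subset).
val_loss = {
    975: 0.6737,   # epoch 1
    1950: 0.6581,  # epoch 2
    2145: 0.6580,
    2925: 0.6625,  # epoch 3
    3900: 0.6826,  # epoch 4
    4875: 0.6982,  # epoch 5 (final)
}

# Find the checkpoint with the lowest validation loss.
best_step = min(val_loss, key=val_loss.get)
print(best_step, val_loss[best_step])  # 2145 0.658
```

With `saves_per_epoch: 5` in the config, a checkpoint near that step should exist, so an early-stopped checkpoint around epoch 2 may be preferable to the final one if validation loss is the criterion.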
### Framework versions

- Transformers 4.55.2
- PyTorch 2.6.0+cu126
- Datasets 4.0.0
- Tokenizers 0.21.4
Model tree for Coloss/Nesso-8B-sft-v0.1 — base model: Coloss/Qwen3-8B-Instruct.