training on v100
#3
by AlexHung29629 - opened
Hello @yentinglin, I tried training on V100 GPUs, but the loss curve looks unusual. Does it match your loss curve?
Below are my configurations and logs:
```yaml
base_model: mistralai/Mistral-Small-24B-Instruct-2501

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: open-r1/OpenR1-Math-220k
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: from
    message_field_content: value
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./ref_placeholder/

sequence_len: 16384
sample_packing: true
eval_sample_packing: False
pad_to_sequence_len: true

wandb_project: Reasoning
wandb_entity:
wandb_watch:
wandb_name: Mistral-24B-SFT-Reasoning
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3.json

special_tokens:
  pad_token: "<pad>"
```
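One note on the precision flags (my observation, not from the original post): the V100 is compute capability 7.0 and has no native bf16 support, which is presumably why the config sets `bf16: false` and falls back to fp16. A quick way to check on a given machine:

```python
import torch

# Report whether the current GPU can run bf16 kernels.
# V100 (compute capability 7.0) cannot; Ampere and newer (8.0+) can.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: {major}.{minor}")
    print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")
```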
DeepSpeed ZeRO-3 config with ZeRO++:
```json
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  },
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 6,
    "loss_scale_window": 100,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
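For context on the fp16 block (my gloss, not part of the original post): `"loss_scale": 0` enables dynamic loss scaling, starting at 2^6 = 64 via `initial_scale_power` (the DeepSpeed default is 2^16; a low starting scale means fewer skipped steps from early overflows). Below is a minimal sketch of the mechanism the other knobs (`loss_scale_window`, `hysteresis`, `min_loss_scale`) control, assuming typical dynamic-scaler semantics rather than DeepSpeed's exact code:

```python
# Minimal sketch (assumed semantics, not DeepSpeed's actual implementation)
# of the dynamic fp16 loss scaling enabled by "loss_scale": 0 above.
class DynamicLossScaler:
    def __init__(self, initial_scale_power=6, loss_scale_window=100,
                 hysteresis=2, min_loss_scale=1):
        self.scale = 2.0 ** initial_scale_power  # 2**6 = 64 with this config
        self.window = loss_scale_window
        self.hysteresis = hysteresis             # overflows tolerated before shrinking
        self.min_scale = min_loss_scale
        self.good_steps = 0
        self.overflows = 0

    def update(self, found_inf: bool) -> bool:
        """Return True if the optimizer step should be applied."""
        if found_inf:
            # Overflow: skip the step; after `hysteresis` overflows, halve the scale.
            self.good_steps = 0
            self.overflows += 1
            if self.overflows >= self.hysteresis:
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.overflows = 0
            return False
        # Clean step: after `loss_scale_window` of them in a row, try doubling.
        self.good_steps += 1
        if self.good_steps >= self.window:
            self.scale *= 2.0
            self.good_steps = 0
        return True
```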
It may be due to the absence of `max_grad_norm: 1.0` in the config.
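(For reference, not from the original thread: with `"gradient_clipping": "auto"` in the DeepSpeed config above, the trainer's `max_grad_norm` value is what gets applied. The operation it configures is global L2-norm gradient clipping before each optimizer step, as in this PyTorch sketch:)

```python
import torch

# What max_grad_norm: 1.0 configures: before the optimizer step, rescale all
# gradients so their combined L2 norm does not exceed 1.0.
model = torch.nn.Linear(16, 16)
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"pre-clip gradient norm: {total_norm:.3f}")
```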
The root cause was a software stack that is incompatible with the V100. I have placed the modified Dockerfile and code here:
https://github.com/alex-ht/axolotl/blob/v100/docker/Dockerfile-v100
AlexHung29629 changed discussion status to closed
