Training on V100

#3
by AlexHung29629 - opened

Hello @yentinglin, I tried training on V100 GPUs, but my loss curve looks unusual. Does it match yours?
Below are my configuration and logs.

base_model: mistralai/Mistral-Small-24B-Instruct-2501

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: open-r1/OpenR1-Math-220k
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: from
    message_field_content: value
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./ref_placeholder/

sequence_len: 16384
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

wandb_project: Reasoning
wandb_entity:
wandb_watch:
wandb_name: Mistral-24B-SFT-Reasoning
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3.json
special_tokens:
  pad_token: "<pad>"
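One thing the precision flags above reflect: the V100 (compute capability 7.0) has no bf16 support, which only arrived with Ampere (8.0), so fp16 is the only mixed-precision option here. A tiny sketch of that check (the `supports_bf16` helper is hypothetical; at runtime `torch.cuda.is_bf16_supported()` or `torch.cuda.get_device_capability()` gives the same answer):

```python
# Why this config runs fp16 rather than bf16: bf16 tensor ops require
# compute capability >= 8.0 (Ampere or newer), while the V100 is 7.0.

def supports_bf16(capability: tuple) -> bool:
    """Return True if a GPU with this (major, minor) capability can run bf16."""
    return tuple(capability) >= (8, 0)

print(supports_bf16((7, 0)))  # V100 -> False
print(supports_bf16((8, 0)))  # A100 -> True
```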

DeepSpeed ZeRO-3 with ZeRO++:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  },
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 6,
    "loss_scale_window": 100,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
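For reference, `"loss_scale": 0` tells DeepSpeed to use dynamic loss scaling, starting at 2^6 = 64 (`initial_scale_power: 6`) rather than the much larger default. A minimal sketch of how that style of scaler behaves (this `DynamicLossScaler` class is an illustrative simplification, not DeepSpeed's actual implementation):

```python
class DynamicLossScaler:
    """Illustrative dynamic fp16 loss scaling, loosely following the
    config above (initial_scale_power=6, loss_scale_window=100,
    hysteresis=2, min_loss_scale=1). Not DeepSpeed's real code."""

    def __init__(self, initial_scale_power=6, loss_scale_window=100,
                 hysteresis=2, min_loss_scale=1.0):
        self.scale = 2.0 ** initial_scale_power  # 2**6 = 64
        self.window = loss_scale_window
        self.min_scale = min_loss_scale
        self.hysteresis = hysteresis
        self.good_steps = 0
        self.overflow_budget = hysteresis

    def update(self, overflow: bool) -> None:
        """Call once per step with whether the scaled grads overflowed."""
        if overflow:
            self.good_steps = 0
            self.overflow_budget -= 1
            if self.overflow_budget <= 0:
                # too many overflows: halve the scale, reset the budget
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.overflow_budget = self.hysteresis
        else:
            self.good_steps += 1
            if self.good_steps % self.window == 0:
                # a full clean window: try a larger scale again
                self.scale *= 2.0
```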

(attached image: loss-curve screenshot)

It may be due to the absence of "max_grad_norm: 1.0".
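If the spikes come from unclipped fp16 gradients, that would be a one-line addition to the axolotl config above (assuming the standard `max_grad_norm` option, which axolotl passes through to the trainer):

```yaml
# clip the global gradient norm to 1.0 before each optimizer step
max_grad_norm: 1.0
```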

The root cause is that parts of the software stack are incompatible with the V100. I have put the modified Dockerfile and code here:
https://github.com/alex-ht/axolotl/blob/v100/docker/Dockerfile-v100

AlexHung29629 changed discussion status to closed
