Training on V100

#3
by AlexHung29629 - opened

Hello @yentinglin, I tried training on V100 GPUs, but my loss curve looks unusual. Does it match yours?
Below are my configuration and logs.

base_model: mistralai/Mistral-Small-24B-Instruct-2501

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

datasets:
  - path: yentinglin/s1K-1.1-trl-format
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: open-r1/OpenR1-Math-220k
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    message_field_role: from
    message_field_content: value
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./ref_placeholder/

sequence_len: 16384
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

wandb_project: Reasoning
wandb_entity:
wandb_watch:
wandb_name: Mistral-24B-SFT-Reasoning
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 5
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
bfloat16: false
float16: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
logging_steps: 1
flash_attention: false
xformers_attention: false
sdp_attention: false

warmup_ratio: 0.1
saves_per_epoch: 2
weight_decay: 0.0
deepspeed: deepspeed_configs/zero3.json
special_tokens:
  pad_token: "<pad>"
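One thing the precision flags above reflect: the V100 (compute capability 7.0) has no bf16 support, which only arrived with Ampere (8.0), so fp16 is the only mixed-precision option here. A tiny sketch of that check (the `supports_bf16` helper is hypothetical; at runtime `torch.cuda.is_bf16_supported()` or `torch.cuda.get_device_capability()` gives the same answer):

```python
# Why this config runs fp16 rather than bf16: bf16 tensor ops require
# compute capability >= 8.0 (Ampere or newer), while the V100 is 7.0.

def supports_bf16(capability: tuple) -> bool:
    """Return True if a GPU with this (major, minor) capability can run bf16."""
    return tuple(capability) >= (8, 0)

print(supports_bf16((7, 0)))  # V100 -> False
print(supports_bf16((8, 0)))  # A100 -> True
```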

DeepSpeed ZeRO-3 with ZeRO++:

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true,
    "zero_quantized_weights": true,
    "zero_hpz_partition_size": 8,
    "zero_quantized_gradients": true
  },
  "bf16": {
    "enabled": false
  },
  "fp16": {
    "enabled": true,
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 6,
    "loss_scale_window": 100,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
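For reference, `"loss_scale": 0` tells DeepSpeed to use dynamic loss scaling, starting at 2^6 = 64 (`initial_scale_power: 6`) rather than the much larger default. A minimal sketch of how that style of scaler behaves (this `DynamicLossScaler` class is an illustrative simplification, not DeepSpeed's actual implementation):

```python
class DynamicLossScaler:
    """Illustrative dynamic fp16 loss scaling, loosely following the
    config above (initial_scale_power=6, loss_scale_window=100,
    hysteresis=2, min_loss_scale=1). Not DeepSpeed's real code."""

    def __init__(self, initial_scale_power=6, loss_scale_window=100,
                 hysteresis=2, min_loss_scale=1.0):
        self.scale = 2.0 ** initial_scale_power  # 2**6 = 64
        self.window = loss_scale_window
        self.min_scale = min_loss_scale
        self.hysteresis = hysteresis
        self.good_steps = 0
        self.overflow_budget = hysteresis

    def update(self, overflow: bool) -> None:
        """Call once per step with whether the scaled grads overflowed."""
        if overflow:
            self.good_steps = 0
            self.overflow_budget -= 1
            if self.overflow_budget <= 0:
                # too many overflows: halve the scale, reset the budget
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.overflow_budget = self.hysteresis
        else:
            self.good_steps += 1
            if self.good_steps % self.window == 0:
                # a full clean window: try a larger scale again
                self.scale *= 2.0
```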

(attached image: loss-curve screenshot)

It may be due to the absence of "max_grad_norm: 1.0".
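If the spikes come from unclipped fp16 gradients, that would be a one-line addition to the axolotl config above (assuming the standard `max_grad_norm` option, which axolotl passes through to the trainer):

```yaml
# clip the global gradient norm to 1.0 before each optimizer step
max_grad_norm: 1.0
```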

The root cause is that parts of the software stack are incompatible with the V100. I have put the modified Dockerfile and code here:
https://github.com/alex-ht/axolotl/blob/v100/docker/Dockerfile-v100

AlexHung29629 changed discussion status to closed
