Model Request

#931
by ErosCoder37 - opened

Hey, does anyone here train models or know someone who does? I'm trying to fine-tune the prithivMLmods/Deepthink-Llama-3-8B-Preview safetensors model using data from ErosCoder37/Eros-1.

I do have quite some fine-tuning experience. Based on your dataset, you are likely related to or part of @Enderchef 's team. I recommend you take a look at https://huggingface.co/mradermacher/model_requests/discussions/920 and adapt the axolotl script I posted there.

Because I'm so nice, I even adapted the axolotl configuration for you. Just rent two GPUs with at least 24 GB of GPU memory each, like 2x RTX 4090 from RunPod or a similar provider, and let it train for around 5 hours until it is done.

base_model: prithivMLmods/Deepthink-Llama-3-8B-Preview 
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false

datasets:
  - path: ErosCoder37/Eros-1
    chat_template: llama3
    type:
      system_prompt: ""
      field_system: system
      field_instruction: input
      field_output: output
dataset_prepared_path:
val_set_size: 0.05
output_dir: ./outputs/lora-out

adapter: lora
lora_model_dir:

sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 4
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 0.0001

bf16: auto
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_steps: 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|end_of_text|>
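
Once you have saved the config, launching the run usually looks roughly like this (the filename eros-lora.yaml is just an example name I picked, and the exact entry point depends on your axolotl version):

# optional: tokenize and cache the dataset before training
python -m axolotl.cli.preprocess eros-lora.yaml
# start the LoRA fine-tuning run
accelerate launch -m axolotl.cli.train eros-lora.yaml

Newer axolotl releases also ship an `axolotl train eros-lora.yaml` command; use whichever your installed version documents.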

It might be cheaper to rent a single A100 80 GB (maybe even a much cheaper 48 GB GPU will do), in which case just delete the fsdp and fsdp_config sections and add the following to speed up single-GPU training:

lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true

Wait - am I able to train it on a system prompt via system_prompt: "" instead of a dataset?

Usually your dataset would contain a system prompt, a prompt, and a response for every row. Because your dataset lacks a system prompt, I configured the axolotl training to not use one either. It still uses your dataset for training, it just won't prepend any system prompt. Instead of an empty string you could also hardcode a system prompt that fits your fine-tune, so the fine-tune only gets activated by similar system prompts.
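
For example, the datasets section with a hardcoded system prompt could look like this (the prompt text here is purely illustrative, pick whatever fits your use case):

datasets:
  - path: ErosCoder37/Eros-1
    chat_template: llama3
    type:
      system_prompt: "You are Eros, a coding assistant."
      field_system: system
      field_instruction: input
      field_output: output

At inference time, using the same (or a similar) system prompt will then trigger the fine-tuned behaviour most reliably.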
