See axolotl config

axolotl version: 0.12.0.dev0

# axolotl train config.yml --deepspeed deepspeed_configs/zero2.json

# Resume from checkpoint configuration
resume_from_checkpoint: ./outputs/checkpoint-650

# Prevent NCCL timeout
ddp_timeout: 7200  # 7200 s = 2 hours, longer than the default timeout

# Base model: a local directory at this relative path is used if present, otherwise the HuggingFace Hub repo
base_model: Qwen/Qwen3-4B
# Automatically upload checkpoints and the final model to the HuggingFace Hub
hub_model_id: AiForgeMaster/Qwen3-4B-Pretrain-v1

load_in_8bit: false
load_in_4bit: false
strict: false

# Pre-training dataset configuration - using HuggingFace datasets
pretraining_dataset:
  - name: default
    path: AiForgeMaster/Smart_Merge_2025_07_25  # Private HF dataset - requires API key
    split: train
    text_column: text # column in dataset with the data, usually `text`
    type: pretrain
    trust_remote_code: false
    # skip: 0 # number of rows of data to skip over from the beginning

# Local paths relative to working directory
dataset_prepared_path: ./data/prepared
val_set_size: 0.0  # Set to 0 for pure pre-training (no validation split)
output_dir: ./outputs

# No cache directories are configured in this file; HuggingFace downloads use the environment's cache settings
hf_use_auth_token: true  # Use HF token for private repos if needed

sequence_len: 16384
sample_packing: true
eval_sample_packing: false  # Disable for pre-training
# Enable sample concatenation for pre-training
pretraining_sample_concatenation: true

# LoRA/QLoRA configuration for memory-efficient training (commented out: this run trains all weights)
#adapter: lora
#lora_model_dir:
#lora_r: 32
#lora_alpha: 64
#lora_dropout: 0.05
#lora_target_linear: false  # Set to false when using explicit target_modules
#lora_target_modules:       # Correct axolotl syntax for target modules
#  - "q_proj"
#  - "k_proj"
#  - "v_proj"
#  - "o_proj"
#  - "gate_proj"
#  - "up_proj"
#  - "down_proj"
#  - "embed_tokens"
#  - "lm_head"

# WandB configuration - fill in your details
wandb_project: ngpt-cpt
wandb_entity: null
wandb_watch: gradients
wandb_name: qwen3_4b_pretraining_v9_resume_2
wandb_log_model: end

# Batch size configuration (total effective batch size = micro_batch_size * gradient_accumulation_steps * num_gpus)
# Here: micro_batch_size=6, gradient_accumulation_steps=3 gives an effective batch size of 18 per GPU
gradient_accumulation_steps: 3
micro_batch_size: 6  # Adjust based on your GPU memory
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 1e-4  # Lower learning rate for pre-training

bf16: auto
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
logging_steps: 10  # Log every 10 steps
flash_attention: true

warmup_steps: 100  # More warmup steps for pre-training
# Checkpoint saving configuration - save every 50 steps
save_steps: 50
save_strategy: steps
save_total_limit: 5  # Keep only 5 most recent checkpoints
save_only_model: false  # Save full checkpoint including optimizer state

# Evaluation configuration removed for pure pre-training (val_set_size: 0.0)
# eval_steps: 2000  # Not supported when val_set_size == 0
# eval_strategy: steps  # Not supported when val_set_size == 0
weight_decay: 0.01  # Small weight decay for pre-training

# Liger optimizations for memory efficiency and speed
plugins:
  - axolotl.integrations.liger.LigerPlugin

liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

# Additional pre-training optimizations
# Enable for first run to validate checkpoint saving works
save_first_step: false

# Memory optimizations
dataloader_pin_memory: true
dataloader_num_workers: 4
remove_unused_columns: true

# Advanced training settings for pre-training
# Calculate max_steps for full epoch: dataset_size / (micro_batch_size * gradient_accumulation_steps * num_gpus)
# With 24,578 rows, a per-GPU batch of 18 (6 * 3), and assuming 2 GPUs: 24,578 / (6 * 3 * 2) = ~682 steps per epoch
max_steps: 682  # Set for one full epoch with your dataset size
group_by_length: false  # Disable for pre-training with sample packing
train_on_inputs: true  # Train on all tokens for pre-training

# Loss monitoring
loss_watchdog_threshold: 10.0  # Stop if loss exceeds this value
loss_watchdog_patience: 3

# Garbage collection to manage memory
gc_steps: 100  # Run garbage collection every 100 steps
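
The max_steps comment in the config can be checked with a few lines of Python. This is only a sketch of that arithmetic: the function name is made up for illustration, and num_gpus = 2 is the assumption stated in the config comment, not something the card confirms.

```python
# Values copied from the config above. num_gpus = 2 is the config
# author's stated assumption; steps_per_epoch is a hypothetical helper.
def steps_per_epoch(num_rows: int, micro_batch_size: int,
                    gradient_accumulation_steps: int, num_gpus: int) -> int:
    """Whole optimizer steps available in one pass over num_rows samples."""
    effective_batch = micro_batch_size * gradient_accumulation_steps * num_gpus
    return num_rows // effective_batch


effective = 6 * 3 * 2  # micro_batch_size * gradient_accumulation_steps * num_gpus
steps = steps_per_epoch(24_578, 6, 3, 2)
print(effective, steps)  # prints "36 682"
```

The calculation treats each dataset row as one training sample. With sample_packing enabled, rows are concatenated into sequences of up to sequence_len tokens, so the real number of steps in one epoch may differ from this estimate.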

Qwen3-4B-Pretrain-v1

This model is a continued pre-training of Qwen/Qwen3-4B on the private dataset AiForgeMaster/Smart_Merge_2025_07_25 named in the axolotl config above.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 1
eval_batch_size: 6
seed: 42
gradient_accumulation_steps: 3
total_train_batch_size: 3
optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 100
training_steps: 682
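
To get a feel for what these scheduler settings imply, the sketch below evaluates the learning rate at a few steps. It assumes the cosine scheduler has the usual shape of linear warmup followed by a half-cosine decay to zero; the function name is illustrative and this is not the trainer's actual code.

```python
import math


def cosine_lr(step: int, base_lr: float = 1e-4,
              warmup_steps: int = 100, total_steps: int = 682) -> float:
    """Learning rate at `step`, assuming linear warmup then half-cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for s in (0, 50, 100, 391, 682):
    print(s, cosine_lr(s))
```

Under this assumption the rate peaks at 1e-4 when warmup ends at step 100, falls to about half of that midway through the decay (step 391), and reaches zero at step 682.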

Training results

Framework versions

Transformers 4.54.0
Pytorch 2.7.1+cu126
Datasets 4.0.0
Tokenizers 0.21.2

Safetensors

Model size: 4B params
Tensor type: BF16

Model tree for AiForgeMaster/Qwen3-4B-Pretrain-v1-p2

Base model: Qwen/Qwen3-4B-Base
Finetuned: Qwen/Qwen3-4B
Finetuned (578): this model
Finetunes: 1 model