codellama-fine-tuning / RETRAIN_WITH_CHAT_FORMAT.md
Prithvik-1's picture
Upload RETRAIN_WITH_CHAT_FORMAT.md with huggingface_hub
170941e verified

πŸ”„ Retrain with CodeLlama Chat Template Format

βœ… What Was Done

  1. βœ… Reformatted Dataset - Created chat template format dataset
  2. βœ… Split Dataset - Split into train/val/test (70/9/15)
  3. βœ… Updated Training Script - Tokenization now handles chat format correctly

πŸ“‚ New Dataset Location

Chat Format Dataset:

  • Original: datasets/processed/elinnos_fifo_codellama_chat_format.jsonl (94 samples)
  • Split Train: datasets/processed/split_chat_format/train.jsonl (70 samples)
  • Split Val: datasets/processed/split_chat_format/val.jsonl (9 samples)
  • Split Test: datasets/processed/split_chat_format/test.jsonl (15 samples)

πŸš€ Retrain Command

cd /workspace/ftt/codellama-migration
source /venv/main/bin/activate

python3 scripts/training/finetune_codellama.py \
    --base-model models/base-models/CodeLlama-7B-Instruct \
    --dataset datasets/processed/split_chat_format/train.jsonl \
    --val-dataset datasets/processed/split_chat_format/val.jsonl \
    --output-dir training-outputs/codellama-fifo-v2-chat \
    --max-length 1536 \
    --num-epochs 5 \
    --learning-rate 2e-5 \
    --batch-size 4 \
    --gradient-accumulation-steps 4 \
    --lora-r 48 \
    --lora-alpha 96 \
    --resume-from-checkpoint auto

Or use the training script:

bash start_training_chat_format.sh

πŸ” Key Changes

  1. Training Format:

    • Old: instruction + EOS + response + EOS
    • New: instruction + response + EOS (instruction already has chat template)
  2. Inference Format:

    • Use CodeLlama chat template during inference
    • Match the training format exactly

πŸ“Š Expected Results

After retraining with chat format:

  • βœ… Model should generate Verilog code (not unrelated text)
  • βœ… Model should understand the task correctly
  • βœ… Outputs should match training data format

⚠️ Important Notes

  • Old model won't work - The format mismatch means the old model can't be used
  • Must retrain - New format requires retraining from scratch
  • Use new dataset - Always use split_chat_format for training