---
license: apache-2.0
tags:
- training
- reproduction
- qwen
- deepspeed
- consumer-gpu
- 4bit-quantization
---

# GoodGlinda-7B Training Code

[Model](https://huggingface.co/YellowLabsStudio/goodglinda-7b-verifier)
[Dataset](https://huggingface.co/datasets/YellowLabsStudio/goodglinda-training-data)
[Eval Space](https://huggingface.co/spaces/YellowLabsStudio/goodglinda-7b-eval)
[Training Code](https://huggingface.co/YellowLabsStudio/goodglinda-training-code)

The model runs a hierarchical three-tier architecture on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), both overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours. If you can, use water cooling or another liquid-cooling setup and save yourself the trouble.

I initially wasted two days trying to implement pipeline parallelism before admitting defeat and switching to DeepSpeed ZeRO-2 with CPU offloading.

## My Hardware Setup

* **CPU:** Intel Core i7-12700 (12th Gen)
* **GPU 0:** RTX 4060 (8GB VRAM) - auxiliary/offloading (throttled at 83°C before the paste fix)
* **GPU 1:** RTX 5070 Ti (16GB VRAM) - primary training (30% idle time due to asymmetry)
* **RAM:** 64GB DDR5-4800
* **Storage:** 2TB NVMe Gen4
* **Power Supply:** 850W (upgraded from 650W mid-training after voltage drops)

## Quick Start

Install dependencies:

```bash
pip install -r requirements.txt
```

Generate the dataset (I used DeepSeek-V2 as teacher):

```bash
python prepare_data.py \
  --output_dir ./data \
  --num_samples 50000 \
  --teacher_model deepseek-ai/deepseek-llm-7b-chat
```
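
The script itself isn't reproduced here; a minimal, hypothetical skeleton of its shape, with the teacher call stubbed out (`query_teacher` and the prompt/response JSONL field names are my assumptions, not the repo's actual API):

```python
import json

def query_teacher(prompt: str) -> str:
    """Stub for the teacher-model call (DeepSeek in the run above).
    Replace with a real inference or API call."""
    return f"teacher answer for: {prompt}"

def build_dataset(prompts, out_path: str) -> int:
    """Write prompt/response pairs as JSONL; returns the number written."""
    n = 0
    with open(out_path, "w") as f:
        for p in prompts:
            record = {"prompt": p, "response": query_teacher(p)}
            f.write(json.dumps(record) + "\n")
            n += 1
    return n
```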

Train (72 hours, single run):

```bash
deepspeed --num_gpus=2 train.py \
  --deepspeed ds_config.json \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --output_dir ./output \
  --num_train_epochs 3 \
  --learning_rate 2e-4 \
  --warmup_steps 500
```
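
For reference, a ZeRO-2 + CPU-offload config of the kind described might look like this. This is a sketch, not the repo's exact ds_config.json; the batch settings match the Key Configurations table, while bf16 and gradient clipping are my assumptions:

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 2,
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```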

## Key Configurations

I used 4-bit NormalFloat with double quantization to squeeze the model into 8GB.

| Parameter | Value | Notes |
|-----------|-------|-------|
| Quantization | 4-bit NormalFloat | Double quantization enabled |
| Optimizer | AdamW 8-bit | CPU offloading via DeepSpeed ZeRO-2 |
| Effective Batch Size | 8 | 2 per GPU × 2 GPUs × 2 gradient accumulation steps |
| Learning Rate | 2e-4 | Cosine decay with 10% warmup |
| LoRA Rank | 64 | Targeting q, k, v, o projections |
| Training Duration | 72 hours | Continuous, single run, no restarts |
| Peak Temp (4060) | 83°C -> 79°C | After thermal paste replacement |

## Repository Structure

```
├── train.py            # Main training script (simplified skeleton)
├── ds_config.json      # DeepSpeed ZeRO-2 config for asymmetric VRAM
├── prepare_data.py     # Dataset generation using DeepSeek-V2
├── verify_data.py      # Validation checks
├── requirements.txt    # Locked versions
├── logs/
│   ├── loss_curves.png       # I screengrabbed this at hour 68
│   ├── gpu_utilization.log   # Shows the 30% idle time on 5070 Ti
│   └── thermal_stats.log     # 83°C spike visible at hour 14
└── checkpoints/        # Saved every 500 steps
```

## What I Learned

**Pipeline Parallelism Failure:** I spent two days trying to split layers across the 8GB and 16GB cards manually. It failed constantly due to communication overhead. ZeRO-2 with CPU offloading solved this in 20 minutes but left the 5070 Ti underutilized.

**Thermal Management:** The RTX 4060 required aggressive intervention. I set a custom fan curve (80% speed at 75°C), replaced the thermal paste at hour 18 (dropping temps by 4°C), and added case fans. Without these, the card would throttle to 2.1GHz, adding roughly 40% to training time. Water cooling might have been the better solution.
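
The fan curve is simple enough to sketch. This is a hypothetical helper for illustration; the actual curve lives in vendor tuning software, not code:

```python
def fan_speed_percent(temp_c: float) -> int:
    """Map GPU temperature to a target fan speed, mirroring the curve
    described above (80% at 75 C, near-max before the 83 C throttle point).
    The values outside those two anchor points are assumptions."""
    if temp_c < 60:
        return 40  # quiet baseline
    if temp_c <= 75:
        # ramp linearly from 40% at 60 C to 80% at 75 C
        return int(40 + (temp_c - 60) * (80 - 40) / (75 - 60))
    if temp_c < 83:
        return 90  # aggressive: try to stay under the throttle point
    return 100     # at 83 C and above the card throttles anyway
```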

**Single Run Limitations:** Running multiple seeds was not feasible in a home lab. This is a single 72-hour run; your results may vary ±3-5% due to random initialization.

## Reproducibility Notes

What you can reproduce:
* Training procedure with identical hyperparameters
* Dataset generation pipeline (requires DeepSeek-V2 API access)
* Verification protocol on 200 samples
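
The repo's verify_data.py isn't shown here, but a 200-sample spot check of the kind listed above might look like this minimal sketch (it assumes JSONL records with prompt/response fields, which is my guess at the schema):

```python
import json
import random

def spot_check(path: str, n: int = 200, seed: int = 0) -> int:
    """Sample n records from a JSONL dataset and count how many pass
    a basic sanity check: non-empty prompt and response fields."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    sample = random.Random(seed).sample(records, min(n, len(records)))
    passed = 0
    for rec in sample:
        if rec.get("prompt", "").strip() and rec.get("response", "").strip():
            passed += 1
    return passed
```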

Known limitations:
* Single training run (no seed averaging)
* Exact loss curves may vary ±3-5% due to hardware noise
* Thermal throttling events may affect timing (depends on ambient temperature)

Details on the full methodology will appear in an upcoming publication.

## Troubleshooting

**CUDA OOM Errors:** I hit these constantly on the 4060. Reduce `per_device_train_batch_size` to 1, increase `gradient_accumulation_steps` to 4, or enable more aggressive gradient checkpointing.
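
The point of that swap is that the effective batch size stays constant, so the optimization behavior shouldn't change (a quick arithmetic check):

```python
def effective_batch(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Effective batch size = micro-batch per GPU x GPUs x accumulation steps."""
    return per_device * num_gpus * grad_accum

# original setup from the table: 2 per GPU x 2 GPUs x 2 accumulation steps
assert effective_batch(2, 2, 2) == 8
# OOM fallback suggested above: smaller micro-batch, more accumulation
assert effective_batch(1, 2, 4) == 8
```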

**Thermal Throttling:** Monitor with `nvidia-smi dmon`. Target <83°C sustained. Consider undervolting (I stabilized at 0.95V @ 2.75GHz).
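
If you want to watch for throttling programmatically, you can parse the dmon output. The column layout below is assumed from dmon's default output (gpu, pwr, gtemp, ...); check the `#` header row on your driver version before trusting the index:

```python
def max_gpu_temp(dmon_output: str) -> int:
    """Return the hottest GPU temperature seen in `nvidia-smi dmon` text.
    Assumes the default layout where the third column is gpu temp in C."""
    temps = []
    for line in dmon_output.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and header rows
        fields = line.split()
        temps.append(int(fields[2]))
    return max(temps)

# example text in dmon's default shape (values here are illustrative)
sample = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec
# Idx      W      C      C      %      %      %      %
    0    115     83      -     98     71      0      0
    1    180     61      -     64     40      0      0
"""
```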

**Slow Training:** Expected throughput is ~1.8 samples/sec with my asymmetric setup. A symmetric dual-16GB setup would hit ~2.5 samples/sec, but I cannot afford that. The PCIe bottleneck is the limiting factor, not compute.

## License

Apache 2.0. Commercial use permitted with attribution.