---
license: apache-2.0
tags:
- training
- reproduction
- qwen
- deepspeed
- consumer-gpu
- 4bit-quantization
---

# GoodGlinda-7B Training Code

[Model](https://huggingface.co/YellowLabsStudio/goodglinda-7b-verifier)
[Dataset](https://huggingface.co/datasets/YellowLabsStudio/goodglinda-training-data)
[Eval Space](https://huggingface.co/spaces/YellowLabsStudio/goodglinda-7b-eval)
[Training Code](https://huggingface.co/YellowLabsStudio/goodglinda-training-code)

The model runs a hierarchical three-tier architecture on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), both overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours. If you can, use water cooling or another liquid-cooling setup and save yourself the trouble.

I initially wasted two days trying to implement pipeline parallelism before admitting defeat and switching to DeepSpeed ZeRO-2 with CPU offloading.

## My Hardware Setup

* **CPU:** Intel Core i7-12700 (12th Gen)
* **GPU 0:** RTX 4060 (8GB VRAM) - auxiliary/offloading (throttled at 83°C before the paste fix)
* **GPU 1:** RTX 5070 Ti (16GB VRAM) - primary training (30% idle time due to asymmetry)
* **RAM:** 64GB DDR5-4800
* **Storage:** 2TB NVMe Gen4
* **Power Supply:** 850W (upgraded from 650W mid-training after voltage drops)

## Quick Start

Install dependencies:

```bash
pip install -r requirements.txt
```

Generate the dataset (I used DeepSeek-V2 as teacher):

```bash
python prepare_data.py \
  --output_dir ./data \
  --num_samples 50000 \
  --teacher_model deepseek-ai/deepseek-llm-7b-chat
```
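
The script itself isn't reproduced here; a minimal, hypothetical skeleton of its shape, with the teacher call stubbed out (`query_teacher` and the prompt/response JSONL field names are my assumptions, not the repo's actual API):

```python
import json

def query_teacher(prompt: str) -> str:
    """Stub for the teacher-model call (DeepSeek in the run above).
    Replace with a real inference or API call."""
    return f"teacher answer for: {prompt}"

def build_dataset(prompts, out_path: str) -> int:
    """Write prompt/response pairs as JSONL; returns the number written."""
    n = 0
    with open(out_path, "w") as f:
        for p in prompts:
            record = {"prompt": p, "response": query_teacher(p)}
            f.write(json.dumps(record) + "\n")
            n += 1
    return n
```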

Train (72 hours, single run):

```bash
deepspeed --num_gpus=2 train.py \
  --deepspeed ds_config.json \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --output_dir ./output \
  --num_train_epochs 3 \
  --learning_rate 2e-4 \
  --warmup_steps 500
```
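
For reference, a ZeRO-2 + CPU-offload config of the kind described might look like this. This is a sketch, not the repo's exact ds_config.json; the batch settings match the Key Configurations table, while bf16 and gradient clipping are my assumptions:

```json
{
  "train_micro_batch_size_per_gpu": 2,
  "gradient_accumulation_steps": 2,
  "bf16": { "enabled": true },
  "gradient_clipping": 1.0,
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```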

## Key Configurations

I used 4-bit NormalFloat with double quantization to squeeze the model into 8GB.

| Parameter | Value | Notes |
|-----------|-------|-------|
| Quantization | 4-bit NormalFloat | Double quantization enabled |
| Optimizer | AdamW 8-bit | CPU offloading via DeepSpeed ZeRO-2 |
| Effective Batch Size | 8 | 2 per GPU × 2 GPUs × 2 gradient accumulation steps |
| Learning Rate | 2e-4 | Cosine decay with 10% warmup |
| LoRA Rank | 64 | Targeting q, k, v, o projections |
| Training Duration | 72 hours | Continuous, single run, no restarts |
| Peak Temp (4060) | 83°C -> 79°C | After thermal paste replacement |

## Repository Structure

```
├── train.py            # Main training script (simplified skeleton)
├── ds_config.json      # DeepSpeed ZeRO-2 config for asymmetric VRAM
├── prepare_data.py     # Dataset generation using DeepSeek-V2
├── verify_data.py      # Validation checks
├── requirements.txt    # Locked versions
├── logs/
│   ├── loss_curves.png       # I screengrabbed this at hour 68
│   ├── gpu_utilization.log   # Shows the 30% idle time on 5070 Ti
│   └── thermal_stats.log     # 83°C spike visible at hour 14
└── checkpoints/        # Saved every 500 steps
```

## What I Learned

**Pipeline Parallelism Failure:** I spent two days trying to split layers across the 8GB and 16GB cards manually. It failed constantly due to communication overhead. ZeRO-2 with CPU offloading solved this in 20 minutes but left the 5070 Ti underutilized.

**Thermal Management:** The RTX 4060 required aggressive intervention. I set a custom fan curve (80% speed at 75°C), replaced the thermal paste at hour 18 (dropping temps by 4°C), and added case fans. Without these, the card would throttle to 2.1GHz, adding roughly 40% to training time. Water cooling might have been the better solution.
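
The fan curve is simple enough to sketch. This is a hypothetical helper for illustration; the actual curve lives in vendor tuning software, not code:

```python
def fan_speed_percent(temp_c: float) -> int:
    """Map GPU temperature to a target fan speed, mirroring the curve
    described above (80% at 75 C, near-max before the 83 C throttle point).
    The values outside those two anchor points are assumptions."""
    if temp_c < 60:
        return 40  # quiet baseline
    if temp_c <= 75:
        # ramp linearly from 40% at 60 C to 80% at 75 C
        return int(40 + (temp_c - 60) * (80 - 40) / (75 - 60))
    if temp_c < 83:
        return 90  # aggressive: try to stay under the throttle point
    return 100     # at 83 C and above the card throttles anyway
```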

**Single Run Limitations:** Running multiple seeds was not feasible in a home lab. This is a single 72-hour run; your results may vary ±3-5% due to random initialization.

## Reproducibility Notes

What you can reproduce:
* Training procedure with identical hyperparameters
* Dataset generation pipeline (requires DeepSeek-V2 API access)
* Verification protocol on 200 samples
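
The repo's verify_data.py isn't shown here, but a 200-sample spot check of the kind listed above might look like this minimal sketch (it assumes JSONL records with prompt/response fields, which is my guess at the schema):

```python
import json
import random

def spot_check(path: str, n: int = 200, seed: int = 0) -> int:
    """Sample n records from a JSONL dataset and count how many pass
    a basic sanity check: non-empty prompt and response fields."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    sample = random.Random(seed).sample(records, min(n, len(records)))
    passed = 0
    for rec in sample:
        if rec.get("prompt", "").strip() and rec.get("response", "").strip():
            passed += 1
    return passed
```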

Known limitations:
* Single training run (no seed averaging)
* Exact loss curves may vary ±3-5% due to hardware noise
* Thermal throttling events may affect timing (depends on ambient temperature)

Details on the full methodology will appear in an upcoming publication.

## Troubleshooting

**CUDA OOM Errors:** I hit these constantly on the 4060. Reduce `per_device_train_batch_size` to 1, increase `gradient_accumulation_steps` to 4, or enable more aggressive gradient checkpointing.
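
The point of that swap is that the effective batch size stays constant, so the optimization behavior shouldn't change (a quick arithmetic check):

```python
def effective_batch(per_device: int, num_gpus: int, grad_accum: int) -> int:
    """Effective batch size = micro-batch per GPU x GPUs x accumulation steps."""
    return per_device * num_gpus * grad_accum

# original setup from the table: 2 per GPU x 2 GPUs x 2 accumulation steps
assert effective_batch(2, 2, 2) == 8
# OOM fallback suggested above: smaller micro-batch, more accumulation
assert effective_batch(1, 2, 4) == 8
```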

**Thermal Throttling:** Monitor with `nvidia-smi dmon`. Target <83°C sustained. Consider undervolting (I stabilized at 0.95V @ 2.75GHz).
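
If you want to watch for throttling programmatically, you can parse the dmon output. The column layout below is assumed from dmon's default output (gpu, pwr, gtemp, ...); check the `#` header row on your driver version before trusting the index:

```python
def max_gpu_temp(dmon_output: str) -> int:
    """Return the hottest GPU temperature seen in `nvidia-smi dmon` text.
    Assumes the default layout where the third column is gpu temp in C."""
    temps = []
    for line in dmon_output.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and header rows
        fields = line.split()
        temps.append(int(fields[2]))
    return max(temps)

# example text in dmon's default shape (values here are illustrative)
sample = """\
# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec
# Idx      W      C      C      %      %      %      %
    0    115     83      -     98     71      0      0
    1    180     61      -     64     40      0      0
"""
```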

**Slow Training:** Expected throughput is ~1.8 samples/sec with my asymmetric setup. A symmetric dual-16GB setup would hit ~2.5 samples/sec, but I cannot afford that. The PCIe bottleneck is the limiting factor, not compute.

## License

Apache 2.0. Commercial use permitted with attribution.