---
license: apache-2.0
tags:
  - training
  - reproduction
  - qwen
  - deepspeed
  - consumer-gpu
  - 4bit-quantization
---

# GoodGlinda-7B Training Code

[![Model](https://img.shields.io/badge/🤗%20Model-GoodGlinda--7B--Verifier-blue)](https://huggingface.co/YellowLabsStudio/goodglinda-7b-verifier)
[![Dataset](https://img.shields.io/badge/🤗%20Dataset-Training--Data-blue)](https://huggingface.co/datasets/YellowLabsStudio/goodglinda-training-data)
[![Space](https://img.shields.io/badge/🤗%20Space-Live--Demo-blue)](https://huggingface.co/spaces/YellowLabsStudio/goodglinda-7b-eval)
[![Code](https://img.shields.io/badge/🤗%20Code-Training--Scripts-blue)](https://huggingface.co/YellowLabsStudio/goodglinda-training-code)
![HomeLab AI](https://img.shields.io/badge/Research-HomeLab%20AI-gold)
![Independent](https://img.shields.io/badge/Status-Independent-black)
![Hardware](https://img.shields.io/badge/HW-RTX%204060%2F5070Ti-green)
![License](https://img.shields.io/badge/License-Apache%202.0-blue)

The model runs a hierarchical three-tier architecture on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), both overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours.
I recommend water cooling or another liquid cooling setup to spare yourself this trouble.

I initially wasted two days trying to implement pipeline parallelism before admitting defeat and switching to DeepSpeed ZeRO-2 with CPU offloading.

## My Hardware Setup

*   **CPU:** Intel Core i7-12700 (12th Gen)
*   **GPU 0:** RTX 4060 (8GB VRAM) - Auxiliary/Offloading (throttled at 83°C before paste fix)
*   **GPU 1:** RTX 5070 Ti (16GB VRAM) - Primary training (30% idle time due to asymmetry)
*   **RAM:** 64GB DDR5-4800
*   **Storage:** 2TB NVMe Gen4
*   **Power Supply:** 850W (upgraded from 650W mid-training after voltage drops)

## Quick Start

Install dependencies:
```bash
pip install -r requirements.txt
```

Generate the dataset (I used DeepSeek-V2 as teacher):

```bash
python prepare_data.py \
  --output_dir ./data \
  --num_samples 50000 \
  --teacher_model deepseek-ai/deepseek-llm-7b-chat
```

Train (72 hours, single run):

```bash
deepspeed --num_gpus=2 train.py \
  --deepspeed ds_config.json \
  --model_name Qwen/Qwen2.5-7B-Instruct \
  --output_dir ./output \
  --num_train_epochs 3 \
  --learning_rate 2e-4 \
  --warmup_steps 500
```
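The repo's actual `ds_config.json` is the source of truth; as a rough illustration of what a ZeRO-2 config with optimizer CPU offloading looks like, here is a minimal version built in Python. All values are illustrative assumptions, not the repo's exact settings.

```python
import json

# Hypothetical minimal DeepSpeed ZeRO-2 config with CPU offloading.
# Values are illustrative assumptions, not the repo's exact settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "zero_optimization": {
        "stage": 2,                      # ZeRO-2: shard optimizer state + gradients
        "offload_optimizer": {           # push optimizer state into system RAM
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}

config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

Offloading the optimizer state to 64GB of system RAM is what makes the 8GB card viable at all; the price is extra PCIe traffic on every step.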

## Key Configurations

I used 4-bit NormalFloat with double quantization to squeeze into 8GB.

| Parameter | Value | Notes |
|-----------|--------|-------|
| Quantization | 4-bit NormalFloat | Double quantization enabled |
| Optimizer | AdamW 8-bit | CPU offloading via DeepSpeed ZeRO-2 |
| Effective Batch Size | 8 | 2 per GPU × 2 GPUs × 2 gradient accumulation steps |
| Learning Rate | 2e-4 | Cosine decay with 10% warmup |
| LoRA Rank | 64 | Targeting q, k, v, o projections |
| Training Duration | 72 hours | Continuous, single run, no restarts |
| Peak Temp (4060) | 83°C → 79°C | After thermal paste replacement |
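The cosine-decay-with-warmup schedule in the table can be sketched in a few lines of plain Python. This is a hypothetical re-implementation for illustration; the actual run used the trainer's built-in scheduler.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.10):
    """Linear warmup to peak_lr over the first 10% of steps,
    then cosine decay to zero (illustrative sketch)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                      # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))   # cosine decay

# lr_at(100, 1000) hits the 2e-4 peak at the end of warmup, then decays toward 0
```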

## Repository Structure

```
├── train.py                 # Main training script (simplified skeleton)
├── ds_config.json           # DeepSpeed ZeRO-2 config for asymmetric VRAM
├── prepare_data.py          # Dataset generation using DeepSeek-V2
├── verify_data.py           # Validation checks
├── requirements.txt         # Locked versions
├── logs/
│   ├── loss_curves.png      # I screengrabbed this at hour 68
│   ├── gpu_utilization.log  # Shows the 30% idle time on 5070 Ti
│   └── thermal_stats.log    # 83°C spike visible at hour 14
└── checkpoints/             # Saved every 500 steps
```
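`verify_data.py` holds the actual validation checks; as a hedged sketch of the kind of per-record check such a script might run (the field names `prompt` and `response` are assumptions, not the repo's actual schema):

```python
import json

def validate_sample(line):
    """Hypothetical check: a JSONL line must parse and carry
    non-empty 'prompt' and 'response' fields (assumed schema)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError:
        return False
    return (bool(str(rec.get("prompt", "")).strip())
            and bool(str(rec.get("response", "")).strip()))
```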


## What I Learned

**Pipeline Parallelism Failure:** I spent two days trying to split layers across the 8GB and 16GB cards manually. It failed constantly due to communication overhead. ZeRO-2 with CPU offloading solved this in 20 minutes but left the 5070 Ti underutilized.

**Thermal Management:** The RTX 4060 required aggressive intervention. I set a custom fan curve (80% speed at 75°C), replaced the thermal paste at hour 18 (dropping temperatures by 4°C), and added case fans. Without these, the card would throttle to 2.1 GHz, adding roughly 40% to training time. Water cooling might have been the better solution, but I haven't tested it.

**Single-Run Limitations:** Running multiple seeds was not feasible in a home lab, so this is a single 72-hour run. Your results may vary ±3-5% due to random initialization.

## Reproducibility Notes

What you can reproduce:
*   Training procedure with identical hyperparameters
*   Dataset generation pipeline (requires DeepSeek-V2 API access)
*   Verification protocol on 200 samples

Known limitations:
*   Single training run (no seed averaging)
*   Exact loss curves may vary ±3-5% due to hardware noise
*   Thermal throttling events may affect timing (depends on ambient temperature)

Details on the full methodology will appear in an upcoming publication.

## Troubleshooting

**CUDA OOM Errors:** I hit these constantly on the 4060. Reduce `per_device_train_batch_size` to 1, increase `gradient_accumulation_steps` to 4, or enable more aggressive gradient checkpointing.
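Halving the micro-batch while doubling accumulation keeps the effective batch size constant, so training dynamics are unchanged; a quick sanity check of the arithmetic:

```python
def effective_batch_size(per_device, num_gpus, grad_accum_steps):
    """Effective batch = micro-batch per GPU x number of GPUs x accumulation steps."""
    return per_device * num_gpus * grad_accum_steps

# Original config: 2 per GPU on 2 GPUs with 2 accumulation steps -> 8
# OOM fallback:    1 per GPU on 2 GPUs with 4 accumulation steps -> 8 (unchanged)
```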

**Thermal Throttling:** Monitor with `nvidia-smi dmon`. Target <83°C sustained. Consider undervolting (I stabilized at 0.95 V @ 2.75 GHz).

**Slow Training:** Expected throughput is ~1.8 samples/sec with my asymmetric setup. A symmetric dual-16GB setup would hit ~2.5 samples/sec, but I cannot afford that. The PCIe bottleneck is the limiting factor, not compute.

## License

Apache 2.0. Commercial use permitted with attribution.