# 🚀 CommitGuard — Comprehensive GCP Deployment & Training Guide (A10G)

This document is a deep-dive, step-by-step manual for deploying the CommitGuard environment and training pipeline to a Google Cloud Platform (GCP) instance. We are targeting an **NVIDIA A10G GPU** to execute **GRPO (Group Relative Policy Optimization)** on the Llama-3.2-3B model.

---

## 📋 1. Prerequisites: Setting Up Your Toolbox
Before you touch the cloud, you must ensure your local environment and external accounts are configured. These are the building blocks of the entire run.

### A. GCP Account & Project Setup
*   **Active Project:** You must have a GCP project created. Note your `PROJECT_ID`.
*   **GPU Quota:** By default, GCP projects have 0 quota for GPUs. You must navigate to `IAM & Admin > Quotas` and request a limit increase for `NVIDIA_A10G_GPUS` in your desired region (e.g., `us-central1`). **Do this 24 hours in advance.**

### B. Weights & Biases (WandB) for Visualization
*   **Why?** RL training can be unstable. WandB lets you monitor the "Reward" and "KL Divergence" curves in real time from your browser.
*   **Action:** Create a free account at [wandb.ai](https://wandb.ai), navigate to your settings, and copy your **API Key**.

### C. Hugging Face Account & Llama Access
*   **Model Gating:** Llama-3.2-3B is a gated model. You must visit the [model page](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) and apply for access. Approval usually takes 30-60 minutes.
*   **Access Token:** Generate a "Write" token in your Hugging Face settings to allow the VM to download the model and upload your finished adapters.

### D. Local gcloud CLI Initialization
*   **Installation:** Install the Google Cloud SDK on your laptop.
*   **Authentication:** Run `gcloud auth login` and `gcloud config set project [YOUR_PROJECT_ID]`. This allows your local terminal to "talk" to GCP.

---

## 🛠 Step 1: Provisioning the High-Performance VM
We are using the **g2-standard-4** machine type, which is purpose-built for GPU-accelerated AI workloads.

### Detailed Breakdown of the Creation Command
*   **`--machine-type g2-standard-4`:** Provides 4 vCPUs and 16GB of system RAM, ensuring the CPU doesn't bottleneck the GPU.
*   **`--accelerator type=nvidia-a10g,count=1`:** Attaches the A10G GPU. Its 24GB of VRAM is the "Goldilocks" zone for 3B parameter models—enough to handle the model plus the multiple "generations" required by the GRPO algorithm.
*   **`--image-family common-cu121`:** Uses a specialized Google image that comes with **CUDA 12.1 and NVIDIA drivers pre-installed**. This saves you 30 minutes of manual driver installation.
*   **`--provisioning-model=SPOT`:** **CRITICAL FOR BUDGET.** Spot instances use excess capacity and are ~70% cheaper than standard instances. If the instance is reclaimed by Google, your 50-step checkpoints ensure you don't lose much progress.

```bash
gcloud compute instances create commitguard-trainer \
    --project=[PROJECT_ID] \
    --zone=us-central1-a \
    --machine-type=g2-standard-4 \
    --accelerator=count=1,type=nvidia-a10g \
    --image-project=ml-images \
    --image-family=common-cu121 \
    --boot-disk-size=100GB \
    --boot-disk-type=pd-balanced \
    --maintenance-policy=TERMINATE \
    --provisioning-model=SPOT
```

---

## 🏗 Step 2: Environment Preparation
Once the VM is "Running," we need to turn it into a specialized CommitGuard lab.

### A. Secure Connection (SSH)
Connect to the machine's terminal:
```bash
gcloud compute ssh commitguard-trainer --zone=us-central1-a
```

### B. Repository & Virtual Environment
We isolate our dependencies to prevent conflicts with system-level Python packages.
```bash
# Clone the project
git clone https://github.com/[YOUR_USER]/commitguard.git
cd commitguard

# Create a 'venv' (Virtual Environment)
python3 -m venv .venv
source .venv/bin/activate

# Authenticate with Hugging Face (Required for gated Llama models)
huggingface-cli login
```

### C. Installing the "Train" Stack
Running `pip install -e ".[train]"` installs the `commitguard` package in "editable" mode along with all optional training libraries such as `torch`, `peft`, and `trl`.
```bash
pip install -U pip
pip install -e ".[train]"

# Flash Attention 2: This is a specialized kernel that makes Llama training 
# significantly faster and more memory-efficient on A10G hardware.
pip install flash-attn --no-build-isolation
```

---

## 📡 Step 3: Launching the Verifiable Reward Server
CommitGuard uses **RLVR (Reinforcement Learning with Verifiable Rewards)**. In this setup, the model doesn't just "guess" whether it's right; it submits an action to a server that computes a reward from hard evidence.
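To illustrate what "verifiable" means here, a reward can be computed by hard checks rather than a learned judge. The sketch below is hypothetical: the `<verdict>` tag and the score values are assumptions for illustration, not CommitGuard's actual schema.

```python
import re

# Hypothetical verifiable-reward sketch: -1.0 for malformed output, 0.0 for a
# well-formed but wrong verdict, 1.0 for a correct one. The <verdict> tag and
# the score values are illustrative assumptions, not CommitGuard's actual API.
def verifiable_reward(response: str, ground_truth: str) -> float:
    match = re.search(r"<verdict>\s*(\w+)\s*</verdict>", response)
    if match is None:
        return -1.0  # format penalty: the XML contract was violated
    return 1.0 if match.group(1) == ground_truth else 0.0

print(verifiable_reward("<verdict>vulnerable</verdict>", "vulnerable"))  # 1.0
print(verifiable_reward("I think it's fine", "vulnerable"))              # -1.0
```

Because the check is mechanical, the same response always earns the same reward, which is what keeps RL training grounded.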

### Running in the Background
Since training takes hours, we run the server in the background using the `&` symbol.
```bash
# Start the server
python -m commitguard_env.server &

# Verify Health: This ensures the database and API are ready.
# If this fails, the trainer will hang indefinitely.
curl http://localhost:8000/health
# You should see: {"status":"healthy"}
```
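If you script the launch, it can help to poll the health endpoint until it responds instead of hoping the server is already up. A stdlib-only sketch (the URL matches the `curl` check above; the helper name is ours):

```python
import json
import time
import urllib.request

# Poll the reward server's health endpoint until it reports healthy, so the
# trainer never starts against a dead endpoint. Stdlib-only sketch; the
# function name is an assumption, the URL matches the curl check above.
def wait_for_health(url="http://localhost:8000/health", timeout=60.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if json.load(resp).get("status") == "healthy":
                    return True
        except OSError:
            pass  # server not up yet; retry
        time.sleep(1)
    return False
```

Calling `wait_for_health()` right before launching the trainer turns the "trainer hangs indefinitely" failure mode into a fast, explicit error.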

---

## 🧠 Step 4: Executing the GRPO Training Run
GRPO is a reinforcement learning algorithm: it asks the model to generate 4 different answers for the same code diff, compares them against each other, and rewards the ones that follow the XML format and correctly identify the vulnerability.
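The "compare them to each other" step can be sketched in a few lines: each completion's reward is normalized against its own group's mean and standard deviation. This is a simplified illustration of the group-relative advantage, not TRL's actual implementation.

```python
# Simplified sketch of GRPO's group-relative advantage (illustrative only;
# the real logic lives inside TRL's GRPOTrainer).
def group_relative_advantages(rewards):
    """Score each completion relative to its group's mean and std."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std or 1.0  # degenerate group: all rewards identical
    return [(r - mean) / std for r in rewards]

# Four generations for one diff: two follow the XML format (reward 1.0),
# two do not (reward 0.0).
print(group_relative_advantages([1.0, 1.0, 0.0, 0.0]))  # [1.0, 1.0, -1.0, -1.0]
```

Completions above the group average are reinforced and those below are penalized, which is why GRPO needs no separate value network.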

### Hyperparameter Explanation
*   **`--steps 500`:** The model produces roughly 2,000 completions (4 generations x 500 steps).
*   **`4-bit Quantization`:** Automatically handled by the script. It "compresses" the model weights so they fit into the GPU's memory without losing accuracy.
*   **`LoRA r=8`:** "Low-Rank Adaptation." Instead of training 3 billion parameters, we only train about 5 million. This makes training stable and fast.
*   **`--live`:** Tells the script to fetch rewards from the server we started in Step 3.
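The "about 5 million" trainable-parameter figure for LoRA r=8 can be sanity-checked with back-of-envelope arithmetic: each adapted projection of shape (d_in, d_out) adds r x (d_in + d_out) parameters. The sketch below assumes the four attention projections are targeted and uses Llama-3.2-3B's dimensions (hidden size 3072, 28 layers, KV projection width 1024); your script's actual target modules may differ.

```python
# Back-of-envelope check of the "~5M trainable parameters" claim for LoRA r=8,
# assuming the four attention projections are the LoRA targets. Dimensions are
# Llama-3.2-3B's: hidden size 3072, 28 layers, KV projection width 1024.
r, layers, hidden, kv = 8, 28, 3072, 1024
per_layer = (
    r * (hidden + hidden)    # q_proj: 3072 -> 3072
    + r * (hidden + kv)      # k_proj: 3072 -> 1024
    + r * (hidden + kv)      # v_proj: 3072 -> 1024
    + r * (hidden + hidden)  # o_proj: 3072 -> 3072
)
print(layers * per_layer)  # 4587520, i.e. ~4.6M out of ~3B total
```

That is roughly 0.15% of the full model, which is why LoRA training fits comfortably alongside the quantized base weights on a 24GB GPU.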

```bash
# Login to WandB so your graphs show up online
export WANDB_API_KEY=[YOUR_WANDB_KEY]

python scripts/train_grpo.py \
    --model_name "meta-llama/Llama-3.2-3B-Instruct" \
    --output_dir "./outputs/commitguard-final" \
    --steps 500 \
    --live \
    --wandb "commitguard-rlvr"
```

---

## 💾 Step 5: Post-Run Weight Management & Cleanup
Once the 500 steps are complete, the "brain" of your agent exists as a LoRA adapter in the `./outputs` folder.

### A. Permanent Storage (Hugging Face)
The VM's disk is temporary. Move your weights to Hugging Face immediately.
```bash
huggingface-cli login --token [YOUR_HF_TOKEN]
huggingface-cli upload [HF_USERNAME]/commitguard-llama3b-adapter ./outputs/commitguard-final
```

### B. Cost Control: Deleting the VM
**DO NOT FORGET THIS STEP.** An idle A10G instance costs money every hour.
```bash
# Exit the VM
exit

# Delete from your local terminal
gcloud compute instances delete commitguard-trainer --zone=us-central1-a
```

---

## 🆘 Critical Troubleshooting

### "CUDA Out of Memory"
*   **Symptom:** Training crashes with a long error ending in `OutOfMemoryError`.
*   **Fix:** The "Group" in GRPO is currently set to 4 generations. Open `scripts/train_grpo.py` and change `num_generations=4` to `num_generations=2`. This cuts memory usage in half.

### "Connection Refused"
*   **Symptom:** Reward function returns -1.0 for everything or throws errors.
*   **Fix:** Your environment server crashed or was never started. Run `ps aux | grep commitguard_env` to check whether it is still running; if not, relaunch it with the command from Step 3.

### The "Midnight Fallback"
If the 3B model is too slow for the submission deadline:
*   Switch to the **1.5B Qwen** model. It uses the same XML format but is 2x faster.
*   Command: `python scripts/train_grpo.py --model_name "Qwen/Qwen2.5-1.5B-Instruct" ...`

---

## ✅ Final Success Checklist
1. [ ] **Health Check:** `curl` returns healthy.
2. [ ] **WandB Tracking:** You can see the `reward` curve moving on the website.
3. [ ] **Checkpoints:** You see folders like `checkpoint-50`, `checkpoint-100` in the output directory.
4. [ ] **Clean Exit:** The VM is deleted after the adapter is uploaded to Hugging Face.