AntiAtropos / training.md
div18
feat(grader): add task-aware grading with new metrics and composite scoring
e012e73
# QLoRA + OpenEnv Training Architecture (HF-Centric)
## 🧠 SYSTEM OVERVIEW
You are building a QLoRA + reward-based training system fully inside Hugging Face infrastructure.
Core principles:
- GPU = compute only (ephemeral)
- Hub = source of truth (persistent)
- Training = reproducible + resumable
- Metrics = structured + queryable
---
## 🧱 ARCHITECTURE
### 1. Code Repository (Control Plane)
Stored on Hugging Face Hub
Contents:
- train.py
- openenv_loop.py
- model_utils.py
- eval.py
- requirements.txt
- config.yaml
---
### 2. Training Runtime (Execution Plane)
Runs via Hugging Face Training Jobs
- GPU: A10G recommended
- Pulls code from Hub
- Uses temporary disk
- Must explicitly persist outputs
---
### 3. Model Storage
Stored on Hugging Face Hub
Contents:
- LoRA adapter weights
- Optional merged model
---
### 4. Metrics Storage
Stored as dataset on Hugging Face Hub
Example schema:
{
"run_id": "exp_001",
"step": 120,
"train_reward": 0.82,
"eval_reward_base": 0.65,
"eval_reward_ft": 0.78,
"loss": 1.23
}
---
### 5. Optional Artifacts
- plots (.png)
- logs
- evaluation outputs
---
## πŸ” TRAINING PROCESS FLOW
### Step 1 β€” Initialization
- Load base model (4-bit)
- Attach LoRA adapters
- Load tokenizer
- Resume from checkpoint if exists
---
### Step 2 β€” Training Loop
generate β†’ evaluate β†’ reward β†’ update β†’ log β†’ checkpoint
---
### Step 3 β€” Logging
- Buffer metrics locally
- Push every N steps to dataset
---
### Step 4 β€” Evaluation
- Evaluate base vs fine-tuned
- Log results
---
### Step 5 β€” Checkpointing
- Save locally
- Push to Hub
- Ensure resumability
---
### Step 6 β€” Visualization
- Generate plots
- Save + push
---
## πŸ”„ RESUME LOGIC
- Check latest checkpoint
- Resume if exists
- Else start fresh
---
## πŸ“Š POST-TRAINING
- Load dataset
- Plot reward curves
- Compare models
---
## βš™οΈ CONFIG
- batch_size
- learning_rate
- lora_rank
- eval_interval
- checkpoint_interval
- push_interval
- run_id
---
## ⚠️ FAILURE HANDLING
- Frequent checkpointing
- Frequent metric pushes
---
## 🧠 FINAL MODEL
HF Hub = database
HF Jobs = compute
QLoRA = training layer
OpenEnv = reward
Datasets = tracking
---
## βœ… SUCCESS CRITERIA
- Resumable training
- Persistent metrics
- Comparable runs
- No data loss