AntiAtropos / training.md
div18
feat(grader): add task-aware grading with new metrics and composite scoring
e012e73

QLoRA + OpenEnv Training Architecture (HF-Centric)

🧠 SYSTEM OVERVIEW

You are building a QLoRA + reward-based training system fully inside Hugging Face infrastructure.

Core principles:

  • GPU = compute only (ephemeral)
  • Hub = source of truth (persistent)
  • Training = reproducible + resumable
  • Metrics = structured + queryable

🧱 ARCHITECTURE

1. Code Repository (Control Plane)

Stored on Hugging Face Hub

Contents:

  • train.py
  • openenv_loop.py
  • model_utils.py
  • eval.py
  • requirements.txt
  • config.yaml

2. Training Runtime (Execution Plane)

Runs via Hugging Face Training Jobs

  • GPU: A10G recommended
  • Pulls code from Hub
  • Uses temporary disk
  • Must explicitly persist outputs

3. Model Storage

Stored on Hugging Face Hub

Contents:

  • LoRA adapter weights
  • Optional merged model

4. Metrics Storage

Stored as dataset on Hugging Face Hub

Example schema: { "run_id": "exp_001", "step": 120, "train_reward": 0.82, "eval_reward_base": 0.65, "eval_reward_ft": 0.78, "loss": 1.23 }


5. Optional Artifacts

  • plots (.png)
  • logs
  • evaluation outputs

πŸ” TRAINING PROCESS FLOW

Step 1 β€” Initialization

  • Load base model (4-bit)
  • Attach LoRA adapters
  • Load tokenizer
  • Resume from checkpoint if exists

Step 2 β€” Training Loop

generate β†’ evaluate β†’ reward β†’ update β†’ log β†’ checkpoint


Step 3 β€” Logging

  • Buffer metrics locally
  • Push every N steps to dataset

Step 4 β€” Evaluation

  • Evaluate base vs fine-tuned
  • Log results

Step 5 β€” Checkpointing

  • Save locally
  • Push to Hub
  • Ensure resumability

Step 6 β€” Visualization

  • Generate plots
  • Save + push

πŸ”„ RESUME LOGIC

  • Check latest checkpoint
  • Resume if exists
  • Else start fresh

πŸ“Š POST-TRAINING

  • Load dataset
  • Plot reward curves
  • Compare models

βš™οΈ CONFIG

  • batch_size
  • learning_rate
  • lora_rank
  • eval_interval
  • checkpoint_interval
  • push_interval
  • run_id

⚠️ FAILURE HANDLING

  • Frequent checkpointing
  • Frequent metric pushes

🧠 FINAL MODEL

HF Hub = database
HF Jobs = compute
QLoRA = training layer
OpenEnv = reward
Datasets = tracking


βœ… SUCCESS CRITERIA

  • Resumable training
  • Persistent metrics
  • Comparable runs
  • No data loss