# GoodGlinda-7B-Verifier


The model runs a hierarchical three-tier architecture on consumer hardware. I trained it on an Intel Core i7-12700 with an RTX 4060 (8GB) and an RTX 5070 Ti (16GB), both overclocked and undervolted, in an asymmetric configuration that left the 5070 Ti idle 30% of the time. At hour 14, the 4060 hit 83°C and throttled. I replaced the thermal paste at hour 18 and watched temperatures stabilize at 79°C for the remaining 54 hours. If you attempt a run like this, I recommend liquid cooling to spare yourself the hassle.

## Architecture

GoodGlinda modifies Qwen2.5-7B-Instruct (28 layers, not 32) with three early-exit classification heads:

  • Local Head (Layer 7): 50ms latency, catches obvious errors with >0.9 confidence. Handles roughly 60% of traffic.
  • Arbitration Head (Layer 14): 200ms latency, softmax scoring over peer disagreements. Activates when confidence sits between 0.3 and 0.9.
  • Global Head (Layer 28): 1.5s latency, beam search with width 3 (not MCTS). Catches sophisticated adversarial attempts.
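The routing between tiers can be sketched as a simple confidence dispatch. This is an illustrative reconstruction from the thresholds stated above (0.9 for the Tier 1 exit, 0.3 for the Tier 2 trigger); the function name and exact boundary handling are my assumptions, not the released code.

```python
def route_tier(local_confidence: float) -> str:
    """Sketch of the early-exit routing described above.

    >= 0.9      -> accept the Layer-7 local head's verdict (~50ms path)
    0.3 to 0.9  -> escalate to the Layer-14 arbitration head (~200ms)
    <  0.3      -> fall through to the Layer-28 global head (~1.5s)
    """
    if local_confidence >= 0.9:
        return "local"
    if local_confidence >= 0.3:
        return "arbitration"
    return "global"
```

With roughly 60% of traffic resolving at the local head, average latency stays far below the 1.5s worst case.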

The training loss combined all three tiers: L_local + 1.0 * L_arb + 0.5 * L_global. I used 4-bit NormalFloat quantization with double quantization via QLoRA. DeepSpeed ZeRO-2 with CPU offloading kept the 8GB card from exploding.
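The weighted combination above is just a scalar sum; a minimal sketch, with the weights taken from the formula in this card (the function name and keyword arguments are illustrative):

```python
def combined_loss(l_local, l_arb, l_global, w_arb=1.0, w_global=0.5):
    """Total training loss: L_local + 1.0 * L_arb + 0.5 * L_global."""
    return l_local + w_arb * l_arb + w_global * l_global
```

Down-weighting the global head keeps the expensive Tier 3 objective from dominating gradients early in training.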

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "YellowLabsStudio/goodglinda-7b-verifier",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("YellowLabsStudio/goodglinda-7b-verifier")

# Verify a candidate solution
task = "Write a function to validate email addresses"
candidate = "def validate(email): return '@' in email"
prompt = f"Task: {task}\nCandidate: {candidate}\nVerdict:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.3)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Performance

I tested on 2,000 injected error samples (1,000 buggy, 1,000 clean) because I could not afford multiple training runs for confidence intervals. Results from the single 72-hour run:

| Benchmark        | Error Detection | False Positive Rate |
|-----------------|----------------|---------------------|
| HumanEval+MBPP  | 76%            | 9%                  |
| GSM8K+MATH      | 78%            | 8%                  |

Majority voting baseline hit 34%. My flat baseline (no hierarchical heads) reached 68%. The 8-10 point improvement over the flat baseline comes from the tiered architecture, not just fine-tuning.

Late one night at hour 68, I watched the loss curve descend like a fever breaking. The arbitration head finally started converging.

## Hardware Reality

**Minimum:** 8GB VRAM (RTX 4060, RTX 3070)  
**Recommended:** 16GB for Tier 3 beam search without CPU offloading  
**My Setup:** Intel i7-12700, 64GB DDR5-4800, RTX 4060 (8GB) + RTX 5070 Ti (16GB)

The asymmetric VRAM caused headaches. DeepSpeed partitioned the optimizer states across both cards, but the 4060's memory ceiling forced aggressive CPU offloading. I wasted two days trying pipeline parallelism before switching to ZeRO-2.
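For reference, a ZeRO-2 configuration with CPU optimizer offloading along the lines described above might look like this. The batch and accumulation numbers are my assumptions for an 8GB card, not values from the actual run:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```

Stage 2 partitions optimizer states and gradients across GPUs but replicates parameters, which sidesteps the inter-stage communication patterns that made pipeline parallelism painful on mismatched cards.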

## Limitations

This is a single training run. No seed averaging, no cross-validation. The numbers could vary ±3-5 percentage points if I retrained. I distilled all 50,000 samples from DeepSeek-V2, so the model mimics its teacher's biases. The thresholds (0.9 for Tier 1 exit, 0.3 for Tier 2 trigger) are hand-tuned on 500 validation samples, not learned. 

Quantization artifacts from 4-bit NF4 training likely degrade fine-grained discrimination compared to FP16. I cannot verify this without a 24GB GPU for full-precision comparison.

Details on the full methodology will appear in an upcoming publication. For now, see the training code repository for configs and logs.

## Links

*   **Training Code & Configs:** [goodglinda-training-code](https://huggingface.co/YellowLabsStudio/goodglinda-training-code)
*   **Dataset:** [goodglinda-training-data](https://huggingface.co/datasets/YellowLabsStudio/goodglinda-training-data)
*   **Live Demo:** [goodglinda-7b-eval](https://huggingface.co/spaces/YellowLabsStudio/goodglinda-7b-eval)

## License

Apache 2.0. Commercial use permitted with attribution.