# AgentJudge v2
Fine-tuned Qwen3-8B (QLoRA) for evaluating AI agent trajectories. Scores tool selection, argument correctness, reasoning quality, output faithfulness, and step efficiency.
Outperforms GPT-4o-as-judge on reasoning quality (ρ 0.633 vs 0.118), output faithfulness (0.264 vs 0.122), and step efficiency (0.633 vs 0.156) at roughly 5x lower cost.
## Results
| Metric | AgentJudge v2 | GPT-4o | Gemini |
|---|---|---|---|
| ρ reasoning_quality | 0.633 | 0.118 | 0.730 |
| ρ output_faithfulness | 0.264 | 0.122 | 0.507 |
| ρ step_efficiency | 0.633 | 0.156 | 0.730 |
| ρ tool_selection | -0.067 | 0.092 | 0.730 |
| Cost (22 evals) | $0.020 | $0.099 | $0.040 |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit NF4 quantization, then attach the LoRA adapter.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "Jarvis710/agentjudge-v2")
model.eval()
```
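To score a trajectory, render it into a single evaluation prompt and parse the per-dimension scores out of the model's reply. The exact prompt template and output schema used to train agentjudge-v2 are not documented here, so the helpers below (`build_prompt`, `parse_scores`, the 1-5 scale, and the JSON reply format) are illustrative assumptions, not the trained format:

```python
import json
import re

# The five dimensions the judge scores (from the card above).
DIMENSIONS = [
    "tool_selection",
    "argument_correctness",
    "reasoning_quality",
    "output_faithfulness",
    "step_efficiency",
]

def build_prompt(task: str, trajectory: list[dict]) -> str:
    """Render an agent trajectory into one evaluation prompt (assumed format)."""
    steps = "\n".join(
        f"Step {i}: tool={s['tool']} args={json.dumps(s['args'])} -> {s['result']}"
        for i, s in enumerate(trajectory, 1)
    )
    dims = ", ".join(DIMENSIONS)
    return (
        f"Task: {task}\n\nTrajectory:\n{steps}\n\n"
        f"Score each dimension ({dims}) from 1-5 and reply as a JSON object."
    )

def parse_scores(generation: str) -> dict[str, int]:
    """Extract the first JSON object from the reply, keeping known dimensions."""
    match = re.search(r"\{.*\}", generation, re.DOTALL)
    scores = json.loads(match.group(0))
    return {k: int(v) for k, v in scores.items() if k in DIMENSIONS}
```

Pass `build_prompt(...)` through `tokenizer` and `model.generate`, then run the decoded text through `parse_scores`; because Qwen3's thinking mode emits reasoning before the answer, the regex skips everything up to the first JSON object.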
## Training
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 16 |
| Learning rate | 5e-5 |
| Epochs | 5 |
| Training examples | 675 (229 real + 458 synthetic) |
| Compute | Google Colab A100, ~40 min |
## Dataset
229 real agent trajectories from BFCL and HotPotQA, labeled by dual-judge consensus (GPT-4o + Gemini), plus 458 synthetic failures spanning 6 failure types, each with unique per-trajectory reasoning.
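The dual-judge consensus labeling can be sketched as below. The merge rule (keep a dimension when the two judges agree within one point, averaging their scores; otherwise flag it) is an assumption for illustration, not the documented pipeline:

```python
def consensus_label(gpt4o_scores: dict, gemini_scores: dict, max_gap: int = 1):
    """Merge two judges' per-dimension scores into a consensus label.

    A dimension is kept (averaged) when the judges agree within `max_gap`;
    otherwise it is flagged as disputed for manual review.
    """
    agreed, disputed = {}, []
    for dim in gpt4o_scores.keys() & gemini_scores.keys():
        a, b = gpt4o_scores[dim], gemini_scores[dim]
        if abs(a - b) <= max_gap:
            agreed[dim] = (a + b) / 2
        else:
            disputed.append(dim)
    return agreed, disputed
```

Averaging only when judges roughly agree keeps obviously contested labels out of the training set; the real pipeline's threshold and tie-breaking may differ.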
## Limitations
- Verdict prediction is weak (κ near zero): dimension scores are reliable, overall verdicts are not
- Trained on BFCL and HotPotQA only, so it may not generalize to other agent types
- Inference latency is ~64 s on an A100 due to Qwen3 thinking mode; a vLLM deployment would reduce this to ~2 s