# AgentJudge v2
Fine-tuned Qwen3-8B (QLoRA) for evaluating AI agent trajectories. Scores tool selection, argument correctness, reasoning quality, output faithfulness, and step efficiency.
Outperforms GPT-4o-as-judge on reasoning quality (ρ 0.633 vs 0.118), output faithfulness (0.264 vs 0.122), and step efficiency (0.633 vs 0.156) at roughly 5x lower cost.
## Results
| Metric | AgentJudge v2 | GPT-4o | Gemini |
|---|---|---|---|
| ρ reasoning_quality | 0.633 | 0.118 | 0.730 |
| ρ output_faithfulness | 0.264 | 0.122 | 0.507 |
| ρ step_efficiency | 0.633 | 0.156 | 0.730 |
| ρ tool_selection | -0.067 | 0.092 | 0.730 |
| Cost (22 evals) | $0.020 | $0.099 | $0.040 |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit NF4 quantization, then attach the LoRA adapter.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "Jarvis710/agentjudge-v2")
model.eval()
```
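To score a trajectory, render it into a single evaluation prompt and parse the per-dimension scores out of the model's reply. The exact prompt template and output schema used to train agentjudge-v2 are not documented here, so the helpers below (`build_prompt`, `parse_scores`, the 1-5 scale, and the JSON reply format) are illustrative assumptions, not the trained format:

```python
import json
import re

# The five dimensions the judge scores (from the card above).
DIMENSIONS = [
    "tool_selection",
    "argument_correctness",
    "reasoning_quality",
    "output_faithfulness",
    "step_efficiency",
]

def build_prompt(task: str, trajectory: list[dict]) -> str:
    """Render an agent trajectory into one evaluation prompt (assumed format)."""
    steps = "\n".join(
        f"Step {i}: tool={s['tool']} args={json.dumps(s['args'])} -> {s['result']}"
        for i, s in enumerate(trajectory, 1)
    )
    dims = ", ".join(DIMENSIONS)
    return (
        f"Task: {task}\n\nTrajectory:\n{steps}\n\n"
        f"Score each dimension ({dims}) from 1-5 and reply as a JSON object."
    )

def parse_scores(generation: str) -> dict[str, int]:
    """Extract the first JSON object from the reply, keeping known dimensions."""
    match = re.search(r"\{.*\}", generation, re.DOTALL)
    scores = json.loads(match.group(0))
    return {k: int(v) for k, v in scores.items() if k in DIMENSIONS}
```

Pass `build_prompt(...)` through `tokenizer` and `model.generate`, then run the decoded text through `parse_scores`; because Qwen3's thinking mode emits reasoning before the answer, the regex skips everything up to the first JSON object.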
## Training
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 16 |
| Learning rate | 5e-5 |
| Epochs | 5 |
| Training examples | 675 (229 real + 458 synthetic) |
| Compute | Google Colab A100, ~40 min |
## Dataset
229 real agent trajectories from BFCL and HotPotQA, labeled by dual-judge consensus (GPT-4o + Gemini), plus 458 synthetic failures spanning 6 failure types, each with unique per-trajectory reasoning.
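The dual-judge consensus labeling can be sketched as below. The merge rule (keep a dimension when the two judges agree within one point, averaging their scores; otherwise flag it) is an assumption for illustration, not the documented pipeline:

```python
def consensus_label(gpt4o_scores: dict, gemini_scores: dict, max_gap: int = 1):
    """Merge two judges' per-dimension scores into a consensus label.

    A dimension is kept (averaged) when the judges agree within `max_gap`;
    otherwise it is flagged as disputed for manual review.
    """
    agreed, disputed = {}, []
    for dim in gpt4o_scores.keys() & gemini_scores.keys():
        a, b = gpt4o_scores[dim], gemini_scores[dim]
        if abs(a - b) <= max_gap:
            agreed[dim] = (a + b) / 2
        else:
            disputed.append(dim)
    return agreed, disputed
```

Averaging only when judges roughly agree keeps obviously contested labels out of the training set; the real pipeline's threshold and tie-breaking may differ.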
## Limitations
- Verdict prediction is weak (κ near zero): dimension scores are reliable, overall verdicts are not
- Trained on BFCL and HotPotQA only, so it may not generalize to other agent types
- Inference latency is ~64 s on an A100 due to Qwen3 thinking mode; a vLLM deployment would reduce this to ~2 s