# AgentJudge v2

Fine-tuned Qwen3-8B (QLoRA) for evaluating AI agent trajectories. Scores tool selection, argument correctness, reasoning quality, output faithfulness, and step efficiency.

Outperforms GPT-4o-as-judge on reasoning quality (0.633 vs 0.118), output faithfulness (0.264 vs 0.122), and step efficiency (0.633 vs 0.156) at 5x lower cost.

## Results

| Metric | AgentJudge v2 | GPT-4o | Gemini |
|---|---|---|---|
| ρ reasoning_quality | 0.633 | 0.118 | 0.730 |
| ρ output_faithfulness | 0.264 | 0.122 | 0.507 |
| ρ step_efficiency | 0.633 | 0.156 | 0.730 |
| ρ tool_selection | -0.067 | 0.092 | 0.730 |
| Cost (22 evals) | $0.020 | $0.099 | $0.040 |
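The ρ values above are Spearman rank correlations between each judge's per-dimension scores and the reference labels. For readers unfamiliar with the metric, here is a minimal, dependency-free sketch (Pearson correlation computed over midranks; in practice you would use `scipy.stats.spearmanr`):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    def ranks(vals):
        # Assign 1-based ranks, averaging ranks over ties (midranks).
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            midrank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = midrank
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

A ρ of 1.0 means the judge ranks trajectories identically to the reference labels; 0 means no monotonic relationship.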

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization, matching the QLoRA training setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Attach the LoRA adapter to the quantized base model
model = PeftModel.from_pretrained(base_model, "Jarvis710/agentjudge-v2")
model.eval()
```
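To score a trajectory, it must first be serialized into an evaluation prompt. The exact template the model was trained on is not shown here, so the sketch below is a hypothetical serialization — the field names (`thought`, `tool`, `args`, `observation`) and the 1–5 scale are illustrative assumptions, not the model's confirmed format:

```python
DIMENSIONS = [
    "tool_selection",
    "argument_correctness",
    "reasoning_quality",
    "output_faithfulness",
    "step_efficiency",
]

def build_eval_prompt(task, steps):
    """Serialize a trajectory (hypothetical schema) into a judge prompt."""
    lines = [f"Task: {task}", "", "Trajectory:"]
    for i, step in enumerate(steps, 1):
        lines.append(f"Step {i}:")
        lines.append(f"  Thought: {step['thought']}")
        lines.append(f"  Tool: {step['tool']}({step['args']})")
        lines.append(f"  Observation: {step['observation']}")
    lines.append("")
    lines.append("Score each dimension from 1-5: " + ", ".join(DIMENSIONS))
    return "\n".join(lines)
```

The resulting string can then be tokenized and passed to `model.generate` as usual.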

## Training

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen3-8B |
| Method | QLoRA (4-bit NF4) |
| LoRA rank | 16 |
| Learning rate | 5e-5 |
| Epochs | 5 |
| Training examples | 675 (229 real + 458 synthetic) |
| Compute | Google Colab A100, ~40 min |
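The hyperparameters above map onto a standard PEFT setup. A minimal configuration sketch — `lora_alpha`, `target_modules`, dropout, and batch size are assumptions filled in with common defaults, not values confirmed by this card:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Rank, learning rate, and epochs come from the table above;
# everything marked "assumed" is an illustrative default.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                                            # assumed (commonly 2x rank)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,                                        # assumed
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="agentjudge-v2",
    learning_rate=5e-5,
    num_train_epochs=5,
    per_device_train_batch_size=4,  # assumed
    bf16=True,
)
```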

## Dataset

229 real agent trajectories from BFCL and HotPotQA, labeled using dual-judge consensus (GPT-4o + Gemini). 458 synthetic failures with 6 failure types, each with unique per-trajectory reasoning.
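The card does not specify the aggregation rule behind the dual-judge consensus. One plausible reading — keep a dimension only when the two judges roughly agree, then average — can be sketched as follows; the `max_gap` threshold and the averaging rule are assumptions, not the documented procedure:

```python
def consensus_label(judge_a, judge_b, max_gap=1):
    """Combine two judges' per-dimension scores (dicts of dimension -> score).

    A dimension is labeled only when the judges' scores are within
    max_gap of each other; otherwise it is dropped as disputed.
    (Threshold and averaging are illustrative assumptions.)
    """
    label = {}
    for dim in judge_a.keys() & judge_b.keys():
        a, b = judge_a[dim], judge_b[dim]
        if abs(a - b) <= max_gap:
            label[dim] = (a + b) / 2
    return label
```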

## Limitations

- Verdict prediction is weak (κ near zero): dimension scores are reliable, overall verdicts are not
- Trained on BFCL and HotPotQA only; may not generalize to other agent types
- Inference latency is ~64 s on an A100 due to Qwen3 thinking mode; vLLM deployment would reduce this to ~2 s
