AgentSight β€” Step-Level Hallucination Classifier

AgentSight is the first gradient-trained step-level hallucination classifier on the AgentHallu benchmark.
It identifies which step in an autonomous agent trajectory is the root cause of a hallucination.

Results (locked test set, one-shot evaluation)

Metric Value 95% CI Note
Step Localisation Acc 47.8% (32/67) [36.3%, 59.5%] vs Gemini 41.1% β€” p=0.163, not stat. sig.
Judgment Macro-F1 54.7% β€” n=105 trajectories
Judgment Precision 58.5% β€” threshold=0.40
Judgment Recall 55.7% β€”

The +6.7 point margin over Gemini-2.5-Pro is practically meaningful but not statistically significant at n=67. The 95% CI spans 23 points. We report this honestly.

Key Finding

Removing the [ACTION]+[OBS] tool-execution channel from the input drops step localisation accuracy by 31.8 percentage points (48.5% β†’ 16.7% on the validation set). Tool output content is the dominant localisation signal.

Architecture

  • Backbone: DeBERTa-v3-base with LoRA adapters (r=16, Ξ±=64) β€” 1.42% of parameters trained
  • Context encoder: 3-layer Pre-LN Transformer (8 heads) over the step sequence
  • Head: 2-layer MLP β†’ binary hallucination logit per step
  • Input: [QUERY] task [THOUGHT] reasoning [ACTION] tool_calls [OBS] responses (max 512 tokens)
  • Decision threshold: 0.40 (tuned on validation set)

Quick Start

from agentsight_sdk import AgentMonitor

# Auto-downloads weights from this repo
monitor = AgentMonitor("YOUR_HF_USERNAME/agentsight")

result = monitor.evaluate_trajectory(raw_trajectory_dict)

print(result["is_hallucinated"])               # True / False
print(result["predicted_root_cause_step"])      # e.g. 3
print(result["step_probabilities"])             # [0.12, 0.08, 0.91, 0.04]

Input Format

{
  "question": "What is the capital of France?",
  "history": [
    {
      "step": 1,
      "content": "I will search for this.",
      "tool_calls": [{"name": "web_search", "arguments": {"query": "capital France"}}],
      "tool_responses": ["Paris is the capital of France."]
    }
  ]
}

Training

git clone https://github.com/Minato-sudo/agentsight
cd agentsight
pip install -r requirements.txt
./run_full_pipeline.sh   # verify β†’ train β†’ ablate β†’ test (one-shot)

Citation

@misc{agentsight2026,
  title={AgentSight: Step-Level Hallucination Localisation in Autonomous Agent Trajectories via Tool-Execution Context},
  author={Namikaze, Minato},
  year={2026},
  note={Paper + code: https://github.com/Minato-sudo/agentsight}
}

Files

File Description
best_agentsight.pth PyTorch state dict β€” load with AgentSightModel()
best_agentsight_meta.json Threshold (0.40), val metrics, best epoch
tokenizer_config/ DeBERTa-v3-base tokeniser config (max_len=512)

Limitations

  • Evaluated on AgentHallu only β€” generalisation to other benchmarks unknown
  • n=67 test set means wide confidence intervals
  • Judgment F1 (54.7%) is below frontier models (Gemini 64.6%, GPT-5 70.2%)
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Space using talha1234567/Agentic-Ai 1

Evaluation results

  • Step Localisation Accuracy (test, n=67) on AgentHallu
    self-reported
    0.478
  • Judgment Macro-F1 (test, n=105) on AgentHallu
    self-reported
    0.547