uncertain-calibrate

Fine-tuned from meta-llama/Llama-3.1-8B-Instruct via GRPO reinforcement learning to emit a special <uncertain> token when the model is uncertain during reasoning, enabling uncertainty-guided adaptive retrieval.

What it does

The model reasons step-by-step and inserts <uncertain> at any point where it lacks confidence in a fact. A lightweight ridge regression probe (trained on layer-13 hidden states at the <uncertain> span) then decides whether to trigger BM25 retrieval and a second-pass generation.
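At inference time, gating reduces to scoring the hidden state at the <uncertain> span with the linear probe and thresholding. A minimal sketch, where the mean-pooling choice, function names, and threshold are assumptions rather than the released implementation:

```python
import numpy as np

def probe_score(span_hidden_states, w, b):
    """Score an <uncertain> span: mean-pool the layer-13 hidden states
    over the span tokens, then apply the linear ridge probe (w, b)."""
    pooled = np.mean(span_hidden_states, axis=0)  # (hidden_dim,)
    return float(pooled @ w + b)

def should_retrieve(span_hidden_states, w, b, threshold=0.5):
    """Trigger BM25 retrieval and a second pass when the probe fires.
    The 0.5 default threshold is illustrative, not tuned."""
    return probe_score(span_hidden_states, w, b) > threshold
```

In practice `w` and `b` would come from the companion probe artifact; the sketch only shows the decision rule.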

Training

  • Base model: meta-llama/Llama-3.1-8B-Instruct
  • Training method: GRPO (Group Relative Policy Optimization) with EM-based reward; the model is rewarded for correct final answers, encouraging it to emit <uncertain> in contexts where retrieval would help
  • Target datasets: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
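The EM-based reward amounts to SQuAD-style answer normalization followed by exact match. A hypothetical sketch of such a reward function (the exact normalization and reward shaping used in training are not documented here):

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def em_reward(prediction, gold_answers):
    """Reward 1.0 iff the normalized prediction exactly matches
    any normalized gold answer, else 0.0."""
    pred = normalize_answer(prediction)
    return 1.0 if any(pred == normalize_answer(g) for g in gold_answers) else 0.0
```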

Retrieval gating (probe)

To use this model for adaptive RAG, a separate ridge regression probe must be trained on layer-13 hidden states over <uncertain> spans. The probe's AUROC on held-out data is ~0.82. Use the companion probe artifact uncertain_probe_layer13_alpha3000.pkl from the AdaRAGUE repository.
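Probe training itself is straightforward: fit ridge regression on pooled layer-13 features against binary retrieve/no-retrieve labels, then check AUROC on held-out data. A self-contained numpy sketch with closed-form ridge and a rank-based AUROC, using alpha=3000 to match the artifact name; the synthetic data and feature dimension are illustrative only:

```python
import numpy as np

def fit_ridge(X, y, alpha=3000.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U statistic); assumes no score ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = (labels == 1).sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic stand-in for pooled layer-13 features with binary labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (200, 16)),    # "retrieval helped"
               rng.normal(-1.0, 1.0, (200, 16))])  # "no retrieval needed"
y = np.concatenate([np.ones(200), np.zeros(200)])
w = fit_ridge(X, y, alpha=3000.0)
```

On real data you would of course score a held-out split rather than the training set.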

Evaluation (dev_500_subsampled, 500 questions × 5 datasets, with probe gating)

Dataset          EM    F1    Trigger Rate
HotpotQA         32.6  42.7  67.4%
MuSiQue           7.6  14.1  94.2%
2WikiMultiHopQA  26.2  29.6  59.2%
NQ               31.4  41.0  52.0%
TriviaQA         56.6  63.2  34.0%
Overall          30.9  38.1  61.4%

Trigger rate = fraction of questions where the probe decided to retrieve.
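F1 in the table is standard token-level overlap between prediction and gold answer. A sketch of the metric, assuming SQuAD-style scoring (the normalization step is simplified to lowercasing here):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-level F1: harmonic mean of precision and recall over
    the multiset intersection of whitespace tokens."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```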

Intended use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jamesjunyuguo/uncertain-calibrate")
model = AutoModelForCausalLM.from_pretrained(
    "jamesjunyuguo/uncertain-calibrate",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

SYSTEM = (
    "You are a helpful reasoning assistant. Think step by step. "
    "If at any point you are uncertain about a fact, emit the special token "
    "<uncertain> to signal that you need more information. "
    "End your response with 'Answer: <your answer>' on the last line."
)

prompt = tokenizer.apply_chat_template([
    {"role": "system", "content": SYSTEM},
    {"role": "user",   "content": "Who directed the film Interstellar?"},
], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Keep special tokens so any emitted <uncertain> markers stay visible.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))
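When the first pass emits <uncertain> and the probe fires, the adaptive-RAG loop retrieves with BM25 and regenerates with the evidence prepended. The model card does not ship retrieval code; below is a minimal self-contained sketch with a toy in-memory Okapi BM25 (in practice you would use a real index, e.g. rank_bm25 or Pyserini over a Wikipedia dump), where all names and the prompt format are illustrative assumptions:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with Okapi BM25 (toy, in-memory)."""
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter()
    for t in toks:
        df.update(set(t))  # document frequency per term
    N = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in query.lower().split():
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def second_pass_prompt(question, first_pass, docs, top_k=2):
    """If the first pass contains <uncertain>, build a retrieval-augmented
    prompt from the top-k BM25 documents; otherwise return None."""
    if "<uncertain>" not in first_pass:
        return None  # model was confident; keep the first-pass answer
    ranked = sorted(zip(bm25_scores(question, docs), docs), reverse=True)
    evidence = "\n".join(d for _, d in ranked[:top_k])
    return f"Context:\n{evidence}\n\nQuestion: {question}"
```

The returned prompt would then go through the same chat template and generate call as the first pass.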

