---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - adaptive-rag
  - uncertainty-quantification
  - retrieval-augmented-generation
  - question-answering
language:
  - en
---

verbal-calibrate

Fine-tuned from meta-llama/Llama-3.1-8B-Instruct to express calibrated verbal confidence for adaptive retrieval-augmented generation (RAG).

What it does

Given a factual question, the model reasons step by step and ends every response with exactly two lines:

Answer: <answer>
Confidence: <decimal between 0 and 1>

The confidence score is trained to reflect the model's genuine uncertainty about the answer. At inference time, a confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model fetches external evidence only when it needs it.
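The gating loop described above can be sketched as follows. `generate` and `retrieve` are hypothetical callables standing in for the model call and a BM25 index; they are not part of this repository:

```python
import re

CONFIDENCE_THRESHOLD = 0.5  # below this, retrieval is triggered


def parse_response(text):
    """Extract the final Answer/Confidence lines from a model response."""
    answer = re.search(r"Answer:\s*(.+)", text)
    conf = re.search(r"Confidence:\s*([01](?:\.\d+)?)", text)
    return (
        answer.group(1).strip() if answer else "",
        float(conf.group(1)) if conf else 0.0,
    )


def adaptive_answer(question, generate, retrieve):
    """First pass without context; retrieve and regenerate only if confidence is low."""
    first = generate(question, context=None)
    answer, confidence = parse_response(first)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, confidence, False  # confident enough, no retrieval
    # Low confidence: fetch BM25 passages and run a second pass with context.
    passages = retrieve(question, k=5)
    second = generate(question, context="\n".join(passages))
    answer, confidence = parse_response(second)
    return answer, confidence, True
```

The second pass reuses the same answer format, so one parser handles both passes.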

Training

  • Base model: meta-llama/Llama-3.1-8B-Instruct
  • Training method: Supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
  • Target datasets: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
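One way to check the calibration step is expected calibration error (ECE): bin predictions by confidence and measure the gap between mean confidence and accuracy per bin. A minimal sketch (the equal-width binning here is an illustrative choice, not necessarily the procedure used for this checkpoint):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Clamp c == 1.0 into the top bin.
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - accuracy)
    return ece
```

A well-calibrated model scores near 0; a model that says 1.0 while being wrong scores near 1.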

Evaluation (dev_500_subsampled, 500 questions × 5 datasets)

| Dataset         | EM   | F1   | Trigger Rate |
|-----------------|------|------|--------------|
| HotpotQA        | 32.0 | 43.8 | 61.6%        |
| MuSiQue         | 11.8 | 18.8 | 76.8%        |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2%        |
| NQ              | 32.4 | 44.4 | 25.0%        |
| TriviaQA        | 53.2 | 62.5 | 28.8%        |
| Overall         | 31.6 | 40.5 | 48.1%        |

Trigger rate is the fraction of questions on which confidence < 0.5 triggered retrieval.
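The trigger-rate column is just the share of per-question confidences under the 0.5 threshold. With made-up scores (not from the evaluation run):

```python
# Hypothetical per-question confidences for five questions.
confidences = [0.9, 0.3, 0.45, 0.8, 0.2]

# Count how many fall below the retrieval threshold.
trigger_rate = sum(c < 0.5 for c in confidences) / len(confidences)
print(f"{trigger_rate:.1%}")  # 3 of 5 questions fall below 0.5
```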

Intended use

  • Adaptive retrieval gating with verbalized confidence
  • Confidence-aware factual QA
  • Research on uncertainty calibration and selective retrieval

```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jamesjunyuguo/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained(
    "jamesjunyuguo/verbal-calibrate", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question="What is the capital of France?")
}], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)

# Parse the verbalized confidence; values below 0.5 should trigger retrieval.
match = re.search(r"Confidence:\s*([01](?:\.\d+)?)", response)
confidence = float(match.group(1)) if match else 0.0
```
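When the parsed confidence comes in under 0.5, a second pass can prepend retrieved passages to the question. The context format below is an assumption for illustration; the card does not specify the exact second-pass prompt:

```python
question = "What is the capital of France?"
# Passages a BM25 retriever might return (hypothetical).
passages = ["Paris is the capital and most populous city of France."]

# Number the passages so the model can ground its reasoning in them.
context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
second_pass_user_message = (
    "Context:\n"
    f"{context}\n\n"
    "Answer the following factual question step by step, using the context above, "
    "then state your answer and how confident you are.\n\n"
    f"{question}\n\n"
    "Your response must end with exactly these two lines:\n"
    "Answer: $Answer\n"
    "Confidence: $Confidence"
)
```

Keeping the two-line answer format in the second pass lets the same parser handle both generations.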



Model: jamesjunyuguo/verbal-calibrate · 8B params · BF16 (safetensors) · finetuned from meta-llama/Llama-3.1-8B-Instruct