---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
  - adaptive-rag
  - uncertainty-quantification
  - retrieval-augmented-generation
  - question-answering
language:
  - en
---

verbal-calibrate

Fine-tuned from meta-llama/Llama-3.1-8B-Instruct to express calibrated verbal confidence for adaptive retrieval-augmented generation (RAG).

What it does

Given a factual question, the model reasons step by step and ends every response with exactly two lines:

Answer: <answer>
Confidence: <decimal between 0 and 1>

The confidence score is trained to reflect the model's genuine uncertainty about the answer. At inference time, a confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model fetches external evidence only when it needs it.
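The gating loop described above can be sketched as follows. `generate` and `retrieve` are hypothetical callables standing in for the model call and a BM25 index; they are not part of this repository:

```python
import re

CONFIDENCE_THRESHOLD = 0.5  # below this, retrieval is triggered


def parse_response(text):
    """Extract the final Answer/Confidence lines from a model response."""
    answer = re.search(r"Answer:\s*(.+)", text)
    conf = re.search(r"Confidence:\s*([01](?:\.\d+)?)", text)
    return (
        answer.group(1).strip() if answer else "",
        float(conf.group(1)) if conf else 0.0,
    )


def adaptive_answer(question, generate, retrieve):
    """First pass without context; retrieve and regenerate only if confidence is low."""
    first = generate(question, context=None)
    answer, confidence = parse_response(first)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, confidence, False  # confident enough, no retrieval
    # Low confidence: fetch BM25 passages and run a second pass with context.
    passages = retrieve(question, k=5)
    second = generate(question, context="\n".join(passages))
    answer, confidence = parse_response(second)
    return answer, confidence, True
```

The second pass reuses the same answer format, so one parser handles both passes.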

Training

  • Base model: meta-llama/Llama-3.1-8B-Instruct
  • Training method: Supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
  • Target datasets: Multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
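One way to check the calibration step is expected calibration error (ECE): bin predictions by confidence and measure the gap between mean confidence and accuracy per bin. A minimal sketch (the equal-width binning here is an illustrative choice, not necessarily the procedure used for this checkpoint):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Clamp c == 1.0 into the top bin.
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    n = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(mean_conf - accuracy)
    return ece
```

A well-calibrated model scores near 0; a model that says 1.0 while being wrong scores near 1.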

Evaluation (dev_500_subsampled, 500 questions × 5 datasets)

| Dataset         | EM   | F1   | Trigger Rate |
|-----------------|------|------|--------------|
| HotpotQA        | 32.0 | 43.8 | 61.6%        |
| MuSiQue         | 11.8 | 18.8 | 76.8%        |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2%        |
| NQ              | 32.4 | 44.4 | 25.0%        |
| TriviaQA        | 53.2 | 62.5 | 28.8%        |
| Overall         | 31.6 | 40.5 | 48.1%        |

Trigger rate is the fraction of questions on which confidence < 0.5 triggered retrieval.
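The trigger-rate column is just the share of per-question confidences under the 0.5 threshold. With made-up scores (not from the evaluation run):

```python
# Hypothetical per-question confidences for five questions.
confidences = [0.9, 0.3, 0.45, 0.8, 0.2]

# Count how many fall below the retrieval threshold.
trigger_rate = sum(c < 0.5 for c in confidences) / len(confidences)
print(f"{trigger_rate:.1%}")  # 3 of 5 questions fall below 0.5
```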

Intended use

  • Adaptive retrieval gating with verbalized confidence
  • Confidence-aware factual QA
  • Research on uncertainty calibration and selective retrieval

```python
import re

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("jamesjunyuguo/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained(
    "jamesjunyuguo/verbal-calibrate", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question="What is the capital of France?")
}], tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)

# Parse the verbalized confidence; values below 0.5 should trigger retrieval.
match = re.search(r"Confidence:\s*([01](?:\.\d+)?)", response)
confidence = float(match.group(1)) if match else 0.0
```
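When the parsed confidence comes in under 0.5, a second pass can prepend retrieved passages to the question. The context format below is an assumption for illustration; the card does not specify the exact second-pass prompt:

```python
question = "What is the capital of France?"
# Passages a BM25 retriever might return (hypothetical).
passages = ["Paris is the capital and most populous city of France."]

# Number the passages so the model can ground its reasoning in them.
context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
second_pass_user_message = (
    "Context:\n"
    f"{context}\n\n"
    "Answer the following factual question step by step, using the context above, "
    "then state your answer and how confident you are.\n\n"
    f"{question}\n\n"
    "Your response must end with exactly these two lines:\n"
    "Answer: $Answer\n"
    "Confidence: $Confidence"
)
```

Keeping the two-line answer format in the second pass lets the same parser handle both generations.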



Model: jamesjunyuguo/verbal-calibrate · 8B params · BF16 (safetensors) · finetuned from meta-llama/Llama-3.1-8B-Instruct