---
license: llama3.1
base_model: meta-llama/Llama-3.1-8B-Instruct
tags:
- adaptive-rag
- uncertainty-quantification
- retrieval-augmented-generation
- question-answering
language:
- en
---
# verbal-calibrate
Fine-tuned from meta-llama/Llama-3.1-8B-Instruct to express calibrated verbal confidence for adaptive retrieval-augmented generation (RAG).
## What it does

Given a factual question, the model reasons step by step and ends every response with exactly two lines:

```
Answer: <answer>
Confidence: <decimal between 0 and 1>
```

The confidence score reflects the model's genuine uncertainty about the answer. At inference time, a confidence below 0.5 triggers BM25 retrieval and a second-pass generation with the retrieved context, so the model retrieves external evidence only when it needs it.
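The gating step can be sketched as follows. The two-line output format and the 0.5 threshold come from the description above; the parsing helpers themselves (`parse_output`, `needs_retrieval`) are illustrative names, not part of the released code, and you would plug in your own BM25 retriever for the second pass:

```python
import re

CONFIDENCE_THRESHOLD = 0.5  # below this, fall back to retrieval


def parse_output(text: str):
    """Extract the final 'Answer:' / 'Confidence:' lines the model is trained to emit."""
    answer = re.search(r"Answer:\s*(.+)", text)
    conf = re.search(r"Confidence:\s*([01](?:\.\d+)?)", text)
    if answer is None or conf is None:
        # Malformed output: treat as maximally uncertain so retrieval fires.
        return None, 0.0
    return answer.group(1).strip(), float(conf.group(1))


def needs_retrieval(text: str) -> bool:
    """Decide whether to run the BM25 + second-pass branch."""
    _, confidence = parse_output(text)
    return confidence < CONFIDENCE_THRESHOLD
```

Treating unparseable output as confidence 0.0 is a deliberate fail-safe: if the model breaks format, the pipeline falls back to retrieval rather than trusting an unreadable answer.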
## Training

- Base model: meta-llama/Llama-3.1-8B-Instruct
- Training method: supervised fine-tuning on QA data with confidence labels, followed by calibration to align expressed confidence with empirical accuracy
- Target datasets: multi-hop QA (HotpotQA, MuSiQue, 2WikiMultiHopQA) and open-domain QA (NQ, TriviaQA)
## Evaluation (dev_500_subsampled, 500 questions × 5 datasets)
| Dataset | EM | F1 | Trigger Rate |
|---|---|---|---|
| HotpotQA | 32.0 | 43.8 | 61.6% |
| MuSiQue | 11.8 | 18.8 | 76.8% |
| 2WikiMultiHopQA | 28.4 | 32.9 | 48.2% |
| NQ | 32.4 | 44.4 | 25.0% |
| TriviaQA | 53.2 | 62.5 | 28.8% |
| Overall | 31.6 | 40.5 | 48.1% |
Trigger rate = fraction of questions where confidence < 0.5 triggered retrieval.
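EM and F1 here are the standard open-domain QA metrics. A minimal sketch of both, with the usual SQuAD-style normalization (lowercase, strip punctuation and articles) slightly simplified; function names are illustrative, not from the evaluation code:

```python
import re
import string
from collections import Counter


def normalize(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def f1(pred: str, gold: str) -> float:
    """Token-level F1 over the normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Trigger rate is simply the fraction of questions whose parsed confidence fell below 0.5.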
## Intended use

- Adaptive retrieval gating with verbalized confidence
- Confidence-aware factual QA
- Research on uncertainty calibration and selective retrieval

Example:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-username/verbal-calibrate")
model = AutoModelForCausalLM.from_pretrained("your-username/verbal-calibrate")

prompt = tokenizer.apply_chat_template([{
    "role": "user",
    "content": (
        "Answer the following factual question step by step, then state your answer "
        "and how confident you are.\n\n"
        "{question}\n\n"
        "Your response must end with exactly these two lines:\n"
        "Answer: $Answer\n"
        "Confidence: $Confidence\n\n"
        "Where $Confidence is a decimal between 0 and 1."
    ).format(question="What is the capital of France?")
}], tokenize=False, add_generation_prompt=True)

# Generate and decode only the newly produced tokens.
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```