ConfTuner: Training Large Language Models to Express Their Confidence Verbally

This model is a fine-tuned version of mistralai/Ministral-8B-Instruct-2410 optimized for uncertainty calibration using the ConfTuner method, as presented in the paper ConfTuner: Training Large Language Models to Express Their Confidence Verbally, accepted at NeurIPS 2025.

Model Details

Model Description

Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems.
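As context for the tokenized Brier score mentioned above: the classical Brier score for a stated confidence p and binary correctness y is (p − y)²; lower is better, and it is minimized in expectation by reporting the true probability of being correct, which is what makes it a proper scoring rule. A minimal sketch of the classical (untokenized) score:

```python
def brier_score(confidence: float, correct: bool) -> float:
    """Classical Brier score for a single prediction: squared error
    between the stated confidence and the actual correctness."""
    return (confidence - float(correct)) ** 2

# A calibrated 80% confidence on a correct answer scores low...
brier_score(0.8, True)   # ≈ 0.04
# ...while the same confidence on a wrong answer is penalized heavily.
brier_score(0.8, False)  # ≈ 0.64
```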

  • Developed by: Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi
  • Model type: LoRA-fine-tuned Causal Language Model (MistralForCausalLM)
  • Language(s) (NLP): English
  • License: Other (No explicit license was provided for this fine-tuned model. Please refer to the base model's license and the associated project for further details.)
  • Finetuned from model: mistralai/Ministral-8B-Instruct-2410

Uses

Direct Use

This model is intended for research and applications that require Large Language Models to express their confidence verbally, particularly in high-stakes domains like science, law, and healthcare where reliable uncertainty quantification is crucial. It can be used to generate text responses that include calibrated verbal confidence statements.
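A minimal sketch of loading the checkpoint with transformers and eliciting a verbalized confidence (this requires downloading the 8B model; the prompt below is illustrative only, and the exact elicitation format used in training is defined in the project's GitHub repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "liushiliushi/ConfTuner-Ministral"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt only; see the repository for the training-time format.
messages = [{"role": "user", "content": (
    "What is the capital of Australia? Answer briefly, then state your "
    "confidence as a number from 0 to 100.")}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```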

Out-of-Scope Use

The model is not intended for use in production environments without rigorous domain-specific validation and bias mitigation. It should not be used in critical applications where potential miscalibration of verbalized confidence could lead to significant adverse outcomes. Deliberately generating misleadingly confident or underconfident responses is out of scope.

Bias, Risks, and Limitations

While ConfTuner aims to improve verbalized confidence calibration, LLMs can still exhibit biases present in their training data and may generate incorrect answers. The method does not entirely eliminate risks of overconfidence or hallucination. Thorough evaluation for specific use cases is essential.

Recommendations

Users are advised to validate the model's performance and calibration on their target tasks and datasets. Human oversight is recommended for critical applications to ensure responsible deployment.

How to Get Started with the Model

To get started with the model, you can refer to the official GitHub repository for detailed instructions on dataset generation, training, and testing.

The repository provides an inference.py script for evaluating the fine-tuned model. First, clone the repository and install dependencies:

git clone git@github.com:liushiliushi/Uncertainty_ft.git
cd Uncertainty_ft
pip install -r requirements.txt

Then, you can evaluate the fine-tuned model using the inference.py script:

# Example: Validate on confidence levels 0-100
python inference.py \
    --model_name /path/to/your/checkpoint \
    --dataset dataset_name \
    --use_wandb

Replace /path/to/your/checkpoint with the path to the model checkpoint (e.g., liushiliushi/ConfTuner-Ministral) and dataset_name with your desired dataset (e.g., hotpot_qa, trivia_qa).

For testing on linguistic confidence levels (high/medium/low) or cascading performance, refer to the Testing section in the GitHub README.
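Evaluation scripts score responses by the confidence the model verbalizes. A minimal sketch of extracting a 0-100 confidence from generated text (a hypothetical helper, not the repository's parser):

```python
import re

def extract_confidence(text):
    """Pull the last 'Confidence: N' style number (0-100) out of a
    response. Returns a probability in [0, 1], or None if no match."""
    matches = re.findall(r"[Cc]onfidence\s*[:is]*\s*(\d{1,3})", text)
    if not matches:
        return None
    value = int(matches[-1])
    return min(max(value, 0), 100) / 100.0

extract_confidence("Answer: Canberra. Confidence: 85")  # 0.85
```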

Training Details

The model was fine-tuned using the ConfTuner method, which introduces a novel tokenized Brier score loss function to encourage verbalized confidence calibration.

Training Data

Training datasets were generated for different base models (Llama-3.1, Qwen, Mistral), as well as from GPT-4o's responses. The datasets used include:

  • HotpotQA
  • TriviaQA
  • Grade School Math (GSM8k)
  • TruthfulQA
  • StrategyQA

More details on dataset generation can be found in the GitHub repository.

Training Procedure

ConfTuner fine-tunes LLMs efficiently, without requiring ground-truth confidence scores or proxy confidence estimates. The training process uses accelerate with DeepSpeed for distributed training and Parameter-Efficient Fine-Tuning (PEFT) with the LoRA method. The training commands also expose options for a consistency loss (--add_loss_con), training on coarse confidence levels (--train_coarse), and merging the PEFT weights after training (--merge_peft).
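To illustrate the idea behind a tokenized Brier-style loss, here is a simplified sketch assuming confidence is verbalized as one of K discrete levels; this is an illustration of the principle, not the paper's exact implementation:

```python
import math

def tokenized_brier_loss(conf_logits, correct):
    """Brier-style loss over K discrete verbalized-confidence bins.

    conf_logits: logits over K confidence levels spaced evenly in [0, 1]
                 (e.g. "0%", "50%", "100%" for K = 3).
    correct: 1.0 if the model's answer was right, else 0.0.
    """
    exps = [math.exp(l) for l in conf_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    k = len(probs)
    levels = [i / (k - 1) for i in range(k)]
    expected_conf = sum(p * v for p, v in zip(probs, levels))
    return (expected_conf - correct) ** 2

# Mass on the highest bin: low loss when correct, high loss when wrong,
# which incentivizes the model to report its true chance of being right.
logits = [0.0, 0.0, 5.0]
assert tokenized_brier_loss(logits, 1.0) < tokenized_brier_loss(logits, 0.0)
```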

Training Hyperparameters

Key training hyperparameters used for fine-tuning the Mistral model include:

  • Training regime: bf16 mixed precision
  • --num_processes: 4
  • --use_deepspeed: Enabled with llama_recipes/configs/ds_config.json
  • --add_loss_con: False
  • --train_coarse: True
  • --use_peft: True
  • --peft_method: lora
  • --batch_size_training: 4
  • --val_batch_size: 4
  • --generate: llm
  • --lr: 3e-5
  • --loss_type: brier
  • --num_epochs: 2
  • --merge_peft: True
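The LoRA settings above can be expressed with the peft library roughly as follows; note that the rank, alpha, dropout, and target modules are not listed in this card, so the values below are placeholders (see the repository's training configs for the actual values):

```python
from peft import LoraConfig, get_peft_model

# Placeholder values: r, lora_alpha, lora_dropout, and target_modules
# are NOT specified in this model card.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)
```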

Evaluation

The model's performance on verbalized confidence and cascading behavior is evaluated across various reasoning tasks.
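Model cascading, one of the downstream uses evaluated, routes queries based on the verbalized confidence: keep the cheap model's answer when its confidence is high, otherwise defer to a stronger model. A minimal sketch (hypothetical helper; the threshold is illustrative):

```python
def cascade(question, weak_answer, weak_confidence, strong_model, threshold=0.7):
    """Accept the cheap model's answer when its verbalized confidence
    clears the threshold; otherwise defer to a stronger, costlier model."""
    if weak_confidence >= threshold:
        return weak_answer
    return strong_model(question)

cascade("Q?", "A", 0.9, lambda q: "B")  # "A": confident, keep weak answer
cascade("Q?", "A", 0.3, lambda q: "B")  # "B": defer to the strong model
```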

Testing Data, Factors & Metrics

Testing Data

Testing datasets include versions of:

  • HotpotQA
  • TriviaQA
  • Grade School Math (GSM8k)
  • TruthfulQA
  • StrategyQA

Metrics

Evaluation is performed on confidence levels ranging from 0-100, as well as qualitative linguistic confidence levels (high/medium/low). Cascading performance, often evaluated with black-box models like GPT-4o, is also a key metric.
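Calibration over 0-100 confidence levels is typically summarized with metrics such as Expected Calibration Error (ECE). A minimal binned-ECE sketch (the repository defines the exact metrics used):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """Binned ECE: weighted mean |accuracy - mean confidence| per bin.
    confidences: predicted probabilities in [0, 1]; corrects: 0/1 labels."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Perfectly calibrated toy data: 50% confidence, 50% empirical accuracy.
expected_calibration_error([0.5, 0.5], [1, 0])  # 0.0
```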

Model Card Contact

For questions or feedback, please refer to the GitHub repository for contact information.

Framework versions

  • PEFT 0.12.0
  • Transformers 4.46.3