ConfTuner: Training Large Language Models to Express Their Confidence Verbally

This model is a fine-tuned version of mistralai/Ministral-8B-Instruct-2410 optimized for uncertainty calibration using the ConfTuner method, as presented in the paper ConfTuner: Training Large Language Models to Express Their Confidence Verbally, accepted at NeurIPS 2025.

Model Details

Model Description

Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems.
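As context for the tokenized Brier score mentioned above: the classical Brier score for a stated confidence p and binary correctness y is (p − y)²; lower is better, and it is minimized in expectation by reporting the true probability of being correct, which is what makes it a proper scoring rule. A minimal sketch of the classical (untokenized) score:

```python
def brier_score(confidence: float, correct: bool) -> float:
    """Classical Brier score for a single prediction: squared error
    between the stated confidence and the actual correctness."""
    return (confidence - float(correct)) ** 2

# A calibrated 80% confidence on a correct answer scores low...
brier_score(0.8, True)   # ≈ 0.04
# ...while the same confidence on a wrong answer is penalized heavily.
brier_score(0.8, False)  # ≈ 0.64
```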

  • Developed by: Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi
  • Model type: LoRA-fine-tuned Causal Language Model (MistralForCausalLM)
  • Language(s) (NLP): English
  • License: Other (No explicit license was provided for this fine-tuned model. Please refer to the base model's license and the associated project for further details.)
  • Finetuned from model: mistralai/Ministral-8B-Instruct-2410

Uses

Direct Use

This model is intended for research and applications that require Large Language Models to express their confidence verbally, particularly in high-stakes domains like science, law, and healthcare where reliable uncertainty quantification is crucial. It can be used to generate text responses that include calibrated verbal confidence statements.
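A minimal sketch of loading the checkpoint with transformers and eliciting a verbalized confidence (this requires downloading the 8B model; the prompt below is illustrative only, and the exact elicitation format used in training is defined in the project's GitHub repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "liushiliushi/ConfTuner-Ministral"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Illustrative prompt only; see the repository for the training-time format.
messages = [{"role": "user", "content": (
    "What is the capital of Australia? Answer briefly, then state your "
    "confidence as a number from 0 to 100.")}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```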

Out-of-Scope Use

The model is not intended for use in production environments without rigorous domain-specific validation and bias mitigation. It should not be used in critical applications where potential miscalibration of verbalized confidence could lead to significant adverse outcomes. Deliberately generating misleadingly confident or underconfident responses is out of scope.

Bias, Risks, and Limitations

While ConfTuner aims to improve verbalized confidence calibration, LLMs can still exhibit biases present in their training data and may generate incorrect answers. The method does not entirely eliminate risks of overconfidence or hallucination. Thorough evaluation for specific use cases is essential.

Recommendations

Users are advised to validate the model's performance and calibration on their target tasks and datasets. Human oversight is recommended for critical applications to ensure responsible deployment.

How to Get Started with the Model

To get started with the model, you can refer to the official GitHub repository for detailed instructions on dataset generation, training, and testing.

The repository provides an inference.py script for evaluating the fine-tuned model. First, clone the repository and install dependencies:

git clone git@github.com:liushiliushi/Uncertainty_ft.git
cd Uncertainty_ft
pip install -r requirements.txt

Then, you can evaluate the fine-tuned model using the inference.py script:

# Example: Validate on confidence levels 0-100
python inference.py \
    --model_name /path/to/your/checkpoint \
    --dataset dataset_name \
    --use_wandb

Replace /path/to/your/checkpoint with the path to the model checkpoint (e.g., liushiliushi/ConfTuner-Ministral) and dataset_name with your desired dataset (e.g., hotpot_qa, trivia_qa).

For testing on linguistic confidence levels (high/medium/low) or cascading performance, refer to the Testing section in the GitHub README.
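Evaluation scripts score responses by the confidence the model verbalizes. A minimal sketch of extracting a 0-100 confidence from generated text (a hypothetical helper, not the repository's parser):

```python
import re

def extract_confidence(text):
    """Pull the last 'Confidence: N' style number (0-100) out of a
    response. Returns a probability in [0, 1], or None if no match."""
    matches = re.findall(r"[Cc]onfidence\s*[:is]*\s*(\d{1,3})", text)
    if not matches:
        return None
    value = int(matches[-1])
    return min(max(value, 0), 100) / 100.0

extract_confidence("Answer: Canberra. Confidence: 85")  # 0.85
```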

Training Details

The model was fine-tuned using the ConfTuner method, which introduces a novel tokenized Brier score loss function to encourage verbalized confidence calibration.

Training Data

Training datasets were generated for different base models (Llama-3.1, Qwen, Mistral), as well as from GPT-4o's responses. The datasets used include:

  • HotpotQA
  • TriviaQA
  • Grade School Math (GSM8k)
  • TruthfulQA
  • StrategyQA

More details on dataset generation can be found in the GitHub repository.

Training Procedure

ConfTuner fine-tunes LLMs efficiently, without requiring ground-truth confidence scores or proxy confidence estimates. The training process uses accelerate with DeepSpeed for distributed training and Parameter-Efficient Fine-Tuning (PEFT) with the LoRA method. The training commands also expose options for a consistency loss (--add_loss_con), training on coarse confidence levels (--train_coarse), and merging the PEFT weights after training (--merge_peft).
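To illustrate the idea behind a tokenized Brier-style loss, here is a simplified sketch assuming confidence is verbalized as one of K discrete levels; this is an illustration of the principle, not the paper's exact implementation:

```python
import math

def tokenized_brier_loss(conf_logits, correct):
    """Brier-style loss over K discrete verbalized-confidence bins.

    conf_logits: logits over K confidence levels spaced evenly in [0, 1]
                 (e.g. "0%", "50%", "100%" for K = 3).
    correct: 1.0 if the model's answer was right, else 0.0.
    """
    exps = [math.exp(l) for l in conf_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    k = len(probs)
    levels = [i / (k - 1) for i in range(k)]
    expected_conf = sum(p * v for p, v in zip(probs, levels))
    return (expected_conf - correct) ** 2

# Mass on the highest bin: low loss when correct, high loss when wrong,
# which incentivizes the model to report its true chance of being right.
logits = [0.0, 0.0, 5.0]
assert tokenized_brier_loss(logits, 1.0) < tokenized_brier_loss(logits, 0.0)
```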

Training Hyperparameters

Key training hyperparameters used for fine-tuning the Mistral model include:

  • Training regime: bf16 mixed precision
  • --num_processes: 4
  • --use_deepspeed: Enabled with llama_recipes/configs/ds_config.json
  • --add_loss_con: False
  • --train_coarse: True
  • --use_peft: True
  • --peft_method: lora
  • --batch_size_training: 4
  • --val_batch_size: 4
  • --generate: llm
  • --lr: 3e-5
  • --loss_type: brier
  • --num_epochs: 2
  • --merge_peft: True
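The LoRA settings above can be expressed with the peft library roughly as follows; note that the rank, alpha, dropout, and target modules are not listed in this card, so the values below are placeholders (see the repository's training configs for the actual values):

```python
from peft import LoraConfig, get_peft_model

# Placeholder values: r, lora_alpha, lora_dropout, and target_modules
# are NOT specified in this model card.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)
```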

Evaluation

The model's performance on verbalized confidence and cascading behavior is evaluated across various reasoning tasks.
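Model cascading, one of the downstream uses evaluated, routes queries based on the verbalized confidence: keep the cheap model's answer when its confidence is high, otherwise defer to a stronger model. A minimal sketch (hypothetical helper; the threshold is illustrative):

```python
def cascade(question, weak_answer, weak_confidence, strong_model, threshold=0.7):
    """Accept the cheap model's answer when its verbalized confidence
    clears the threshold; otherwise defer to a stronger, costlier model."""
    if weak_confidence >= threshold:
        return weak_answer
    return strong_model(question)

cascade("Q?", "A", 0.9, lambda q: "B")  # "A": confident, keep weak answer
cascade("Q?", "A", 0.3, lambda q: "B")  # "B": defer to the strong model
```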

Testing Data, Factors & Metrics

Testing Data

Testing datasets include versions of:

  • HotpotQA
  • TriviaQA
  • Grade School Math (GSM8k)
  • TruthfulQA
  • StrategyQA

Metrics

Evaluation is performed on confidence levels ranging from 0-100, as well as qualitative linguistic confidence levels (high/medium/low). Cascading performance, often evaluated with black-box models like GPT-4o, is also a key metric.
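Calibration over 0-100 confidence levels is typically summarized with metrics such as Expected Calibration Error (ECE). A minimal binned-ECE sketch (the repository defines the exact metrics used):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """Binned ECE: weighted mean |accuracy - mean confidence| per bin.
    confidences: predicted probabilities in [0, 1]; corrects: 0/1 labels."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(corrects[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Perfectly calibrated toy data: 50% confidence, 50% empirical accuracy.
expected_calibration_error([0.5, 0.5], [1, 0])  # 0.0
```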

Model Card Contact

For questions or feedback, please refer to the GitHub repository for contact information.

Framework versions

  • PEFT 0.12.0
  • Transformers 4.46.3