---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- custom_generate
- sampling
---
|
|
|
|
|
# DeepConf Custom Generation Strategy
|
|
|
|
|
This repository implements the DeepConf confidence-based early-stopping generation strategy for Hugging Face Transformers models, as presented in the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).
|
|
|
|
|
- **Project Page:** [https://jiaweizzhao.github.io/deepconf](https://jiaweizzhao.github.io/deepconf)
- **GitHub Repository:** [https://github.com/facebookresearch/deepconf](https://github.com/facebookresearch/deepconf)
|
|
|
|
|
## Overview |
|
|
|
|
|
DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It monitors the confidence of generated tokens and stops generation when confidence falls below a threshold. The confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling/filtering is applied), following the methodology from the official implementation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. |
|
|
|
|
|
## Parameters |
|
|
|
|
|
- `enable_conf` (bool): Whether to enable the DeepConf strategy. Defaults to `False`.
- `enable_early_stopping` (bool): Whether to apply early stopping during generation (online mode) or only track confidences for post-processing (batch mode). Defaults to `True`.
- `window_size` (int): Size of the sliding window for confidence calculation. Defaults to `2048`.
- `threshold` (float): Confidence threshold for early stopping. Defaults to `17.0`.
- `conf_topk` (int): Number of top tokens from the full vocabulary used for confidence calculation. Defaults to `20`.
- `output_confidences` (bool): If `True` and `return_dict_in_generate=True`, returns a per-step confidence tensor alongside generated sequences for debugging/visualization.
- `deepconf_variant` (str): Optional variant for automatic threshold calibration (`"low"` or `"high"`). Requires `deepconf_warmup_confidences`.
- `deepconf_warmup_confidences` (list/tensor): Warmup confidence values for threshold calibration. Used with `deepconf_variant`.
- `deepconf_eta` (float): Optional override for the eta value in the threshold calculation (defaults: 0.1 for `"low"`, 0.9 for `"high"`).
|
|
|
|
|
## Usage |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
To use this custom generation strategy, pass the Hub repository to the `generate` method via the `custom_generate` argument:
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("your-model")

# Prepare your prompt
question = "What is the square root of 144?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Configure generation with DeepConf
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    enable_conf=True,              # Enable DeepConf
    window_size=2048,              # Sliding window size
    threshold=17.0,                # Confidence threshold
    conf_topk=20,                  # Top-k for confidence (default: 20)
    output_confidences=True,       # Return confidence scores
    return_dict_in_generate=True,  # Required for confidence output
)

# Generate with DeepConf (Hub repo)
outputs = model.generate(
    **inputs,
    generation_config=gen_config,
    custom_generate="kashif/DeepConf",  # Hugging Face Hub repo
    trust_remote_code=True,
)

# Access results
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated: {generated_text}")

# Access per-step confidences if requested
if hasattr(outputs, "confidences"):
    confidences = outputs.confidences  # Shape: (batch_size, num_generated_tokens)
    print(f"Min confidence: {confidences.min().item():.3f}")
    print(f"Mean confidence: {confidences.mean().item():.3f}")
```
|
|
|
|
|
### Calibration (DeepConf-low/high) |
|
|
|
|
|
DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior. |
|
|
|
|
|
**Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences: |
|
|
|
|
|
```python
from transformers import GenerationConfig

# Configure warmup generation
warmup_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=256,
    enable_conf=True,             # Enable confidence tracking
    return_dict_in_generate=True,
    output_confidences=True,
    num_return_sequences=8,       # Generate 8 warmup sequences
)

# Generate warmup sequences
warmup_out = model.generate(
    **inputs,
    generation_config=warmup_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)

# Extract minimum confidence per sequence (C_t = min over all steps)
warmup_C = warmup_out.confidences.min(dim=1).values.tolist()
print(f"Warmup min confidences: {warmup_C}")
```
|
|
|
|
|
**Step 2: Production Generation** - Use warmup confidences to auto-derive threshold: |
|
|
|
|
|
```python
# Configure production generation with calibrated threshold
gen_cfg = GenerationConfig(
    do_sample=True,
    max_new_tokens=512,
    enable_conf=True,
    return_dict_in_generate=True,
    output_confidences=True,

    # Automatic threshold calibration
    deepconf_variant="low",                # "low" (aggressive) or "high" (permissive)
    deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
)

# Generate with calibrated threshold
outputs = model.generate(
    **inputs,
    generation_config=gen_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
```
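For intuition, the calibrated threshold can be sketched as a percentile of the warmup minimum confidences. The exact eta-to-percentile mapping below is an assumption for illustration (`calibrate_threshold` is a hypothetical helper, not part of this repository): it takes the `(1 - eta)`-quantile, so `"low"` (eta = 0.1) yields a high, aggressive threshold and `"high"` (eta = 0.9) a low, permissive one.

```python
def calibrate_threshold(warmup_min_confs, variant="low", eta=None):
    """Illustrative threshold calibration from warmup minimum confidences.

    Assumed mapping: "low" keeps roughly the top 10% most confident traces
    (eta=0.1 -> 0.9-quantile), "high" keeps ~90% (eta=0.9 -> 0.1-quantile).
    """
    if eta is None:
        eta = 0.1 if variant == "low" else 0.9
    xs = sorted(warmup_min_confs)
    # Linear-interpolation quantile at q = 1 - eta
    q = 1.0 - eta
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
```

Consult the repository implementation for the exact rule; this sketch only shows why "low" stops more traces than "high".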
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Confidence Calculation |
|
|
|
|
|
The confidence score for each generated token is calculated as follows: |
|
|
1. **Extract top-k tokens**: Get the top-k (default: 20) tokens with the highest probabilities from the full vocabulary.
2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens.
3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens.
|
|
|
|
|
### Online Stopping |
|
|
|
|
|
The online method uses a sliding window of confidence scores: |
|
|
- Maintains a window of the last `window_size` (default: 2048) confidence scores.
- Calculates the mean confidence over this window.
- Stops generation when `mean_confidence < threshold`.
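A minimal sketch of this sliding-window check (illustrative only; the helper name and signature are assumptions, not the repository's API):

```python
from collections import deque

def should_stop(conf_window: deque, new_conf: float,
                window_size: int = 2048, threshold: float = 17.0) -> bool:
    """Append the newest confidence score and test the windowed-mean stop rule."""
    conf_window.append(new_conf)
    if len(conf_window) > window_size:
        conf_window.popleft()  # keep only the last `window_size` scores
    return sum(conf_window) / len(conf_window) < threshold
```

Constructing the window as `deque(maxlen=window_size)` would make the eviction implicit; the explicit `popleft` is kept here to mirror the description above.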
|
|
|
|
|
## Requirements |
|
|
|
|
|
- PyTorch >= 1.13.0 |
|
|
- Transformers >= 4.35.0 |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{fu2025deep, |
|
|
  title={Deep Think with Confidence},
|
|
author={Fu, Yichao and Wang, Xuewei and Tian, Yuandong and Zhao, Jiawei}, |
|
|
journal={arXiv preprint arXiv:2508.15260}, |
|
|
year={2025} |
|
|
} |
|
|
``` |