---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- custom_generate
- sampling
---
# DeepCONF Custom Generation Strategy
This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, as presented in the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).
- **Project Page:** [https://jiaweizzhao.github.io/deepconf](https://jiaweizzhao.github.io/deepconf)
- **GitHub Repository:** [https://github.com/facebookresearch/deepconf](https://github.com/facebookresearch/deepconf)
## Overview
DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It monitors the confidence of generated tokens and stops generation when the mean confidence over a sliding window falls below a threshold. Each token's confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling filters are applied), following the methodology of the official implementation. DeepConf requires no additional model training or hyperparameter tuning and integrates seamlessly into existing serving frameworks.
## Parameters
- `enable_conf` (bool): Whether to enable the DeepCONF strategy. Defaults to `False`.
- `enable_early_stopping` (bool): Whether to apply early stopping during generation (online mode) or just track confidences for post-processing (batch mode). Defaults to `True`.
- `window_size` (int): Size of the sliding window for confidence calculation. Defaults to `2048`.
- `threshold` (float): Confidence threshold for early stopping. Defaults to `17.0`.
- `conf_topk` (int): Number of top tokens to use for confidence calculation from the full vocabulary. Defaults to `20`.
- `output_confidences` (bool): If `True` and `return_dict_in_generate=True`, returns a per-step confidence tensor alongside generated sequences for debugging/visualization.
- `deepconf_variant` (str): Optional variant for automatic threshold calibration (`"low"` or `"high"`). Requires `deepconf_warmup_confidences`.
- `deepconf_warmup_confidences` (list/tensor): Warmup confidence values for threshold calibration. Used with `deepconf_variant`.
- `deepconf_eta` (float): Optional override for eta value in threshold calculation (defaults: 0.1 for low, 0.9 for high).
## Usage
### Basic Usage
To use this custom generation strategy, point the `custom_generate` argument of `generate` at this Hub repository:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-model")
# Prepare your prompt
question = "What is the square root of 144?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Configure generation with DeepCONF
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    enable_conf=True,              # Enable DeepCONF
    window_size=2048,              # Sliding window size
    threshold=17.0,                # Confidence threshold
    conf_topk=20,                  # Top-k for confidence (default: 20)
    output_confidences=True,       # Return confidence scores
    return_dict_in_generate=True,  # Required for confidence output
)
# Generate with DeepCONF (Hub repo)
outputs = model.generate(
    **inputs,
    generation_config=gen_config,
    custom_generate="kashif/DeepConf",  # Hugging Face Hub repo
    trust_remote_code=True
)
# Access results
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated: {generated_text}")
# Access per-step confidences if requested
if hasattr(outputs, "confidences"):
    confidences = outputs.confidences  # Shape: (batch_size, num_generated_tokens)
    print(f"Min confidence: {confidences.min().item():.3f}")
    print(f"Mean confidence: {confidences.mean().item():.3f}")
```
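When `enable_early_stopping=False`, DeepConf runs in batch mode: every trace runs to completion while per-step confidences are still tracked, and you filter the finished traces yourself. A minimal post-filtering sketch; the median cutoff below is an illustrative choice, not part of this repo's API:
```python
# Batch mode: track confidences without stopping, filter afterwards
batch_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    enable_conf=True,
    enable_early_stopping=False,  # Track only; never stop early
    output_confidences=True,
    return_dict_in_generate=True,
    num_return_sequences=8,
)
outputs = model.generate(
    **inputs,
    generation_config=batch_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
# Keep traces whose minimum confidence clears the batch median
# (illustrative cutoff; substitute whatever filtering rule suits your pipeline)
min_conf = outputs.confidences.min(dim=1).values  # (num_return_sequences,)
kept = [seq for seq, c in zip(outputs.sequences, min_conf) if c >= min_conf.median()]
print(f"Kept {len(kept)} of {len(min_conf)} traces")
```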
### Calibration (DeepConf-low/high)
DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior.
**Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:
```python
from transformers import GenerationConfig
# Configure warmup generation
warmup_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=256,
    enable_conf=True,             # Enable confidence tracking
    return_dict_in_generate=True,
    output_confidences=True,
    num_return_sequences=8,       # Generate 8 warmup sequences
)
# Generate warmup sequences
warmup_out = model.generate(
    **inputs,
    generation_config=warmup_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
# Extract minimum confidence per sequence (C_t = min over all steps)
warmup_C = warmup_out.confidences.min(dim=1).values.tolist()
print(f"Warmup min confidences: {warmup_C}")
```
**Step 2: Production Generation** - Use warmup confidences to auto-derive threshold:
```python
# Configure production generation with calibrated threshold
gen_cfg = GenerationConfig(
    do_sample=True,
    max_new_tokens=512,
    enable_conf=True,
    return_dict_in_generate=True,
    output_confidences=True,
    # Automatic threshold calibration
    deepconf_variant="low",                # "low" (aggressive) or "high" (permissive)
    deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
)
# Generate with calibrated threshold
outputs = model.generate(
    **inputs,
    generation_config=gen_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
```
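The variant maps to a percentile rule over the warmup minimum confidences: per the paper, DeepConf-low keeps roughly the top 10% most-confident traces (eta = 0.1) and DeepConf-high the top 90% (eta = 0.9). A sketch of the presumed calibration, assuming the threshold is the `(1 - eta)` quantile of the warmup values; the repo's exact rule may differ:
```python
import torch

def calibrate_threshold(warmup_confidences, eta):
    # Presumed rule: stop traces whose windowed mean confidence falls
    # below the (1 - eta) quantile of the warmup minimum confidences
    values = torch.tensor(warmup_confidences, dtype=torch.float32)
    return torch.quantile(values, 1.0 - eta).item()

print("low  threshold:", calibrate_threshold(warmup_C, eta=0.1))  # aggressive
print("high threshold:", calibrate_threshold(warmup_C, eta=0.9))  # permissive
```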
## Technical Details
### Confidence Calculation
The confidence score for each generated token is calculated as follows:
1. **Extract top-k tokens**: Take the top-k (default: 20) highest-probability tokens from the full vocabulary.
2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens.
3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens.
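A self-contained sketch of this computation (illustrative, not the repo's internal code):
```python
import torch
import torch.nn.functional as F

def token_confidence(logits: torch.Tensor, conf_topk: int = 20) -> torch.Tensor:
    # logits: (batch_size, vocab_size) raw scores *before* sampling filters
    log_probs = F.log_softmax(logits, dim=-1)               # full-vocab log probs
    topk_log_probs, _ = log_probs.topk(conf_topk, dim=-1)   # top-k by probability
    return -topk_log_probs.mean(dim=-1)                     # (batch_size,)

# A peaked (confident) distribution scores higher than a flat (uncertain) one
peaked = torch.zeros(1, 1000)
peaked[0, 0] = 10.0
flat = torch.zeros(1, 1000)
print(token_confidence(peaked).item(), token_confidence(flat).item())
```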
### Online Stopping
The online method uses a sliding window of confidence scores:
- Maintains a window of the last `window_size` (default: 2048) confidence scores.
- Calculates the mean confidence over this window.
- Stops generation when `mean_confidence < threshold`.
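A sketch of the stopping rule using a `deque` as the sliding window (standalone illustration; whether the check also applies before the window fills is an implementation detail of the repo):
```python
from collections import deque

window = deque(maxlen=2048)  # window_size: keeps only the newest scores

def should_stop(confidence: float, threshold: float = 17.0) -> bool:
    # Append the newest per-token confidence, then test the windowed mean
    window.append(confidence)
    return sum(window) / len(window) < threshold

# Inside a decode loop (sketch):
#   conf = token_confidence(next_token_logits).item()
#   if should_stop(conf):
#       break
```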
## Requirements
- PyTorch >= 1.13.0
- Transformers >= 4.35.0
## Citation
```bibtex
@article{fu2025deep,
  title={Deep Think with Confidence},
  author={Fu, Yichao and Wang, Xuewei and Tian, Yuandong and Zhao, Jiawei},
  journal={arXiv preprint arXiv:2508.15260},
  year={2025}
}
``` |