---
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- custom_generate
- sampling
---

# DeepCONF Custom Generation Strategy

This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, as presented in the paper [Deep Think with Confidence](https://huggingface.co/papers/2508.15260).

- **Project Page:** [https://jiaweizzhao.github.io/deepconf](https://jiaweizzhao.github.io/deepconf)
- **GitHub Repository:** [https://github.com/facebookresearch/deepconf](https://github.com/facebookresearch/deepconf)

## Overview

DeepConf uses model-internal confidence signals to filter out low-quality reasoning traces during or after generation. This implementation monitors the confidence of generated tokens and stops generation when the confidence falls below a threshold. Confidence is calculated as the negative mean log probability of the top-k tokens from the full vocabulary (before sampling/filtering is applied), following the methodology of the official implementation. A higher value therefore indicates a more peaked, more confident next-token distribution. The method requires no additional model training or hyperparameter tuning and can be integrated into existing serving frameworks.

## Parameters

- `enable_conf` (bool): Whether to enable the DeepCONF strategy. Defaults to `False`.
- `enable_early_stopping` (bool): If `True`, stop generation early when confidence drops below the threshold (online mode); if `False`, only track confidences for post-processing (batch mode). Defaults to `True`.
- `window_size` (int): Size of the sliding window for confidence calculation. Defaults to `2048`.
- `threshold` (float): Confidence threshold for early stopping. Defaults to `17.0`.
- `conf_topk` (int): Number of top tokens to use for confidence calculation from the full vocabulary. Defaults to `20`.
- `output_confidences` (bool): If `True` and `return_dict_in_generate=True`, returns a per-step confidence tensor alongside generated sequences for debugging/visualization.
- `deepconf_variant` (str): Optional variant for automatic threshold calibration (`"low"` or `"high"`). Requires `deepconf_warmup_confidences`.
- `deepconf_warmup_confidences` (list/tensor): Warmup confidence values for threshold calibration. Used with `deepconf_variant`.
- `deepconf_eta` (float): Optional override for eta value in threshold calculation (defaults: 0.1 for low, 0.9 for high).

## Usage

### Basic Usage

To use this custom generation strategy, pass the Hub repository name to the `generate` method through the `custom_generate` argument:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-model")

# Prepare your prompt
question = "What is the square root of 144?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Configure generation with DeepCONF
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    enable_conf=True,              # Enable DeepCONF
    window_size=2048,              # Sliding window size
    threshold=17.0,                # Confidence threshold
    conf_topk=20,                  # Top-k for confidence (default: 20)
    output_confidences=True,       # Return confidence scores
    return_dict_in_generate=True,  # Required for confidence output
)

# Generate with DeepCONF (Hub repo)
outputs = model.generate(
    **inputs,
    generation_config=gen_config,
    custom_generate="kashif/DeepConf",  # Hugging Face Hub repo
    trust_remote_code=True
)

# Access results
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated: {generated_text}")

# Access per-step confidences if requested
if hasattr(outputs, 'confidences'):
    confidences = outputs.confidences  # Shape: (batch_size, num_generated_tokens)
    print(f"Min confidence: {confidences.min().item():.3f}")
    print(f"Mean confidence: {confidences.mean().item():.3f}")
```

### Calibration (DeepConf-low/high)

DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior.

**Step 1: Warmup Phase** - Generate multiple sequences and collect their minimum confidences:

```python
from transformers import GenerationConfig

# Configure warmup generation
warmup_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=256,
    enable_conf=True,               # Enable confidence tracking
    return_dict_in_generate=True,
    output_confidences=True,
    num_return_sequences=8,         # Generate 8 warmup sequences
)

# Generate warmup sequences
warmup_out = model.generate(
    **inputs,
    generation_config=warmup_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)

# Extract minimum confidence per sequence (C_t = min over all steps)
warmup_C = warmup_out.confidences.min(dim=1).values.tolist()
print(f"Warmup min confidences: {warmup_C}")
```

**Step 2: Production Generation** - Use warmup confidences to auto-derive threshold:

```python
# Configure production generation with calibrated threshold
gen_cfg = GenerationConfig(
    do_sample=True,
    max_new_tokens=512,
    enable_conf=True,
    return_dict_in_generate=True,
    output_confidences=True,

    # Automatic threshold calibration
    deepconf_variant="low",  # "low" (aggressive) or "high" (permissive)
    deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
)

# Generate with calibrated threshold
outputs = model.generate(
    **inputs,
    generation_config=gen_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
```
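
To make the calibration step more concrete, the following is a minimal, framework-free sketch of how a threshold could be derived from warmup confidences. It assumes the threshold is the `(1 - eta)` quantile of the warmup values, which matches the "low is aggressive, high is permissive" description and the eta defaults listed under Parameters; the function name `calibrate_threshold` and the linear-interpolation quantile are illustrative assumptions, not necessarily what this repository's code does.

```python
def calibrate_threshold(warmup_confidences, variant="low", eta=None):
    """Hypothetical helper: derive a stopping threshold from warmup confidences.

    Assumption: the threshold is the (1 - eta) quantile of the warmup values,
    so a small eta ("low") yields a high, aggressive threshold.
    """
    default_eta = {"low": 0.1, "high": 0.9}  # defaults taken from the parameter list
    if eta is None:
        eta = default_eta[variant]
    scores = sorted(warmup_confidences)
    if not scores:
        raise ValueError("warmup_confidences must not be empty")

    # Linear interpolation between the two nearest ranks.
    position = (1.0 - eta) * (len(scores) - 1)
    lower = int(position)
    upper = min(lower + 1, len(scores) - 1)
    fraction = position - lower
    return scores[lower] + fraction * (scores[upper] - scores[lower])


warmup = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]
low_threshold = calibrate_threshold(warmup, variant="low")    # near the top of the range
high_threshold = calibrate_threshold(warmup, variant="high")  # near the bottom of the range
print(low_threshold, high_threshold)
```

Under this assumption, `"low"` stops any trace whose confidence falls below roughly the 90th percentile of the warmup set, while `"high"` only stops traces below roughly the 10th percentile.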

## Technical Details

### Confidence Calculation

The confidence score for each generated token is calculated as follows:
1. **Extract top-k tokens**: Get the top-k (default: 20) tokens with highest probabilities from the full vocabulary.
2. **Compute log probabilities**: Calculate log probabilities for these top-k tokens.
3. **Average**: The confidence score is `-mean(log_probs)` of the top-k tokens.
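
The three steps above can be sketched in plain Python for a single generation step. This is an illustrative version that works on a list of raw logits; the function name `token_confidence` is hypothetical, and an actual implementation would typically operate on batched tensors instead.

```python
import math


def token_confidence(logits, k=20):
    """Negative mean log-probability of the k most likely tokens (illustrative)."""
    # Log-softmax over the full vocabulary, stabilized by subtracting the max logit.
    max_logit = max(logits)
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    log_probs = [x - log_z for x in logits]

    # Keep the k highest log-probabilities (k is capped at the vocabulary size).
    k = min(k, len(log_probs))
    top_k = sorted(log_probs, reverse=True)[:k]

    # Confidence is the negated average of those log-probabilities.
    return -sum(top_k) / k


# A peaked distribution scores higher than a uniform one.
peaked = token_confidence([10.0, 0.0, 0.0, 0.0], k=4)
flat = token_confidence([0.0, 0.0, 0.0, 0.0], k=4)
print(peaked > flat)
```

When one token dominates, the remaining top-k tokens have very negative log probabilities, so the negated mean is large. A uniform distribution gives the smallest possible value, `log(vocab_size)` when `k` covers the whole vocabulary.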

### Online Stopping

The online method uses a sliding window of confidence scores:
- Maintains a window of the last `window_size` (default: 2048) confidence scores.
- Calculates the mean confidence over this window.
- Stops generation when: `mean_confidence < threshold`.
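
The sliding-window check can be sketched with a bounded deque. The class name `ConfidenceWindow` is hypothetical, and the sketch assumes the stopping test only starts once the window is full; an implementation could equally evaluate partially filled windows, so treat that choice as an assumption rather than documented behavior.

```python
from collections import deque


class ConfidenceWindow:
    """Illustrative sliding-window monitor for per-token confidence scores."""

    def __init__(self, window_size=2048, threshold=17.0):
        # deque(maxlen=...) drops the oldest score automatically when full.
        self.scores = deque(maxlen=window_size)
        self.threshold = threshold

    def should_stop(self, confidence):
        """Record one score and report whether generation should stop."""
        self.scores.append(confidence)
        # Assumption: do not test until the window holds window_size scores.
        if len(self.scores) < self.scores.maxlen:
            return False
        mean_confidence = sum(self.scores) / len(self.scores)
        return mean_confidence < self.threshold


window = ConfidenceWindow(window_size=3, threshold=5.0)
decisions = [window.should_stop(c) for c in [10.0, 10.0, 10.0, 1.0, 1.0]]
print(decisions)
```

In this toy run the window mean only drops below the threshold on the last score, when the window contains `[10.0, 1.0, 1.0]`.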

## Requirements

- PyTorch >= 1.13.0
- Transformers >= 4.35.0

## Citation

```bibtex
@article{fu2025deep,
  title={Deep think with confidence},
  author={Fu, Yichao and Wang, Xuewei and Tian, Yuandong and Zhao, Jiawei},
  journal={arXiv preprint arXiv:2508.15260},
  year={2025}
}
```