library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
  - custom_generate
  - sampling

DeepCONF Custom Generation Strategy

This repository implements the DeepCONF (Deep Confidence-based Early Stopping) generation strategy for Hugging Face Transformers models, as presented in the paper "Deep Think with Confidence".

Overview

DeepConf uses model-internal confidence signals to filter out low-quality reasoning traces during or after generation. It tracks a confidence score for every generated token and, in online mode, stops generation once the mean confidence over a sliding window falls below a threshold. The score is the negative mean log probability of the top-k tokens, taken from the full-vocabulary distribution before any sampling or filtering is applied, following the methodology of the official implementation. A higher score therefore indicates a more peaked, more confident distribution. The method requires no additional model training or hyperparameter tuning and can be integrated into existing serving frameworks.

Parameters

  • enable_conf (bool): Whether to enable the DeepCONF strategy. Defaults to False.
  • enable_early_stopping (bool): Whether to apply early stopping during generation (online mode) or just track confidences for post-processing (batch mode). Defaults to True.
  • window_size (int): Size of the sliding window for confidence calculation. Defaults to 2048.
  • threshold (float): Confidence threshold for early stopping. Defaults to 17.0.
  • conf_topk (int): Number of top tokens to use for confidence calculation from the full vocabulary. Defaults to 20.
  • output_confidences (bool): If True and return_dict_in_generate=True, returns a per-step confidence tensor alongside generated sequences for debugging/visualization.
  • deepconf_variant (str): Optional variant for automatic threshold calibration ("low" or "high"). Requires deepconf_warmup_confidences.
  • deepconf_warmup_confidences (list/tensor): Warmup confidence values for threshold calibration. Used with deepconf_variant.
  • deepconf_eta (float): Optional override for eta value in threshold calculation (defaults: 0.1 for low, 0.9 for high).

Usage

Basic Usage

To use this custom generation strategy, you can pass it directly to the generate method:

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "your-model",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-model")

# Prepare your prompt
question = "What is the square root of 144?"
messages = [{"role": "user", "content": question}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Configure generation with DeepCONF
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    enable_conf=True,              # Enable DeepCONF
    window_size=2048,              # Sliding window size
    threshold=17.0,                # Confidence threshold
    conf_topk=20,                  # Top-k for confidence (default: 20)
    output_confidences=True,       # Return confidence scores
    return_dict_in_generate=True,  # Required for confidence output
)

# Generate with DeepCONF (Hub repo)
outputs = model.generate(
    **inputs,
    generation_config=gen_config,
    custom_generate="kashif/DeepConf",  # Hugging Face Hub repo
    trust_remote_code=True
)

# Access results
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated: {generated_text}")

# Access per-step confidences if they were returned
confidences = getattr(outputs, "confidences", None)
if confidences is not None:
    # Shape: (batch_size, num_generated_tokens)
    print(f"Min confidence: {confidences.min().item():.3f}")
    print(f"Mean confidence: {confidences.mean().item():.3f}")
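
When early stopping is disabled, the tracked confidences can be used to filter traces after generation. The following is a minimal sketch of one such post-processing step, not part of this repository: it scores each trace by its lowest sliding-window mean confidence and keeps the traces that stay at or above a threshold. The helper names `lowest_window_confidence` and `keep_confident` are hypothetical, and the sketch assumes each inner list holds only the steps that were actually generated (for a tensor, `outputs.confidences.tolist()` would give the nested lists).

```python
def lowest_window_confidence(step_confidences, window_size):
    """Return the smallest sliding-window mean over one trace (hypothetical helper)."""
    n = len(step_confidences)
    if n == 0:
        raise ValueError("trace has no generated steps")
    # Shrink the window for traces shorter than window_size.
    w = min(window_size, n)
    window_sum = sum(step_confidences[:w])
    lowest = window_sum / w
    for i in range(w, n):
        # Slide the window one step: add the new value, drop the oldest.
        window_sum += step_confidences[i] - step_confidences[i - w]
        lowest = min(lowest, window_sum / w)
    return lowest


def keep_confident(traces, window_size, threshold):
    """Return indices of traces whose lowest window mean is >= threshold."""
    return [
        i
        for i, trace in enumerate(traces)
        if lowest_window_confidence(trace, window_size) >= threshold
    ]


# Toy per-step confidences for three traces (illustrative values only).
traces = [
    [20.0, 20.0, 20.0, 20.0],
    [20.0, 10.0, 10.0, 20.0],
    [18.0, 18.0, 18.0, 18.0],
]
kept = keep_confident(traces, window_size=2, threshold=15.0)
print(kept)
```

The second trace is dropped because one of its windows averages 10.0, even though its overall mean is the same as a trace that would pass a mean-based filter at 15.0.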

Calibration (DeepConf-low/high)

DeepConf's online stopping threshold can be automatically derived from a warmup phase. This allows you to calibrate the threshold based on actual model behavior.

Step 1: Warmup phase. Generate multiple sequences and collect their minimum confidences:

from transformers import GenerationConfig

# Configure warmup generation
warmup_cfg = GenerationConfig(
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=256,
    enable_conf=True,               # Enable confidence tracking
    return_dict_in_generate=True,
    output_confidences=True,
    num_return_sequences=8,         # Generate 8 warmup sequences
)

# Generate warmup sequences
warmup_out = model.generate(
    **inputs,
    generation_config=warmup_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)

# Extract minimum confidence per sequence (C_t = min over all steps)
warmup_C = warmup_out.confidences.min(dim=1).values.tolist()
print(f"Warmup min confidences: {warmup_C}")

Step 2: Production generation. Use the warmup confidences to derive the threshold automatically:

# Configure production generation with calibrated threshold
gen_cfg = GenerationConfig(
    do_sample=True,
    max_new_tokens=512,
    enable_conf=True,
    return_dict_in_generate=True,
    output_confidences=True,

    # Automatic threshold calibration
    deepconf_variant="low",  # "low" (aggressive) or "high" (permissive)
    deepconf_warmup_confidences=warmup_C,  # Pass warmup confidences
)

# Generate with calibrated threshold
outputs = model.generate(
    **inputs,
    generation_config=gen_cfg,
    custom_generate="kashif/DeepConf",
    trust_remote_code=True,
)
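
To make the calibration step more concrete, here is a hedged sketch of how a threshold could be derived from the warmup values. It assumes the threshold is the (1 - eta) quantile of the warmup minimum confidences, which mirrors the percentile rule described in the paper; the exact interpolation used by this repository may differ. The function name `calibrate_threshold` is hypothetical.

```python
def calibrate_threshold(warmup_confidences, variant="low", eta=None):
    """Sketch: threshold as the (1 - eta) quantile of the warmup confidences.

    Assumption: eta defaults to 0.1 for "low" and 0.9 for "high", as listed
    in the Parameters section; the quantile rule itself is an assumption.
    """
    if eta is None:
        eta = {"low": 0.1, "high": 0.9}[variant]
    values = sorted(warmup_confidences)
    if not values:
        raise ValueError("need at least one warmup confidence")
    # Linear interpolation between the two nearest order statistics.
    pos = (1.0 - eta) * (len(values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return values[lo] + frac * (values[hi] - values[lo])


# Toy warmup values (illustrative only).
warmup = [float(v) for v in range(10, 21)]  # 10.0, 11.0, ..., 20.0
low_thr = calibrate_threshold(warmup, variant="low")
high_thr = calibrate_threshold(warmup, variant="high")
print(low_thr, high_thr)
```

Under this assumption the "low" variant yields the higher threshold, so more traces fall below it and are stopped early, which matches its description as the aggressive setting.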

Technical Details

Confidence Calculation

The confidence score for each generated token is calculated as follows:

  1. Extract top-k tokens: Get the top-k (default: 20) tokens with the highest probabilities from the full vocabulary.
  2. Compute log probabilities: Take the log probabilities of these k tokens under the full-vocabulary softmax.
  3. Average: The confidence score is the negated mean, -mean(log_probs), over the k tokens.
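
The three steps above can be sketched in plain Python for a single decoding step. This is an illustration only: a real implementation would operate on batched logits tensors, and the function name `token_confidence` is hypothetical.

```python
import math


def token_confidence(logits, k=20):
    """Negative mean log probability of the k most likely tokens."""
    # Numerically stable log-softmax over the full vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    # Keep the k largest log probabilities (fewer if the vocabulary is smaller).
    top = sorted(log_probs, reverse=True)[:k]
    return -sum(top) / len(top)


# Toy 4-token vocabulary (illustrative values only).
peaked = [10.0, 0.0, 0.0, 0.0]  # one token dominates
flat = [1.0, 1.0, 1.0, 1.0]     # uniform distribution
print(token_confidence(peaked, k=4))
print(token_confidence(flat, k=4))
```

A peaked distribution gives a larger score than a flat one, because the non-dominant tokens in the top-k have very low log probabilities. This is why low scores signal uncertainty.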

Online Stopping

The online method uses a sliding window of confidence scores:

  • Maintains a window of the last window_size (default: 2048) confidence scores.
  • Calculates the mean confidence over this window.
  • Stops generation when: mean_confidence < threshold.
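
A minimal sketch of this stopping rule, assuming the check only starts once the window is full (whether this repository also checks partially filled windows is not stated here). The function name `first_stop_step` is hypothetical.

```python
from collections import deque


def first_stop_step(confidences, window_size=2048, threshold=17.0):
    """Return the first step at which the window mean drops below threshold.

    Returns None if generation would never be stopped.
    Assumption: no check is made until window_size scores are available.
    """
    window = deque(maxlen=window_size)
    running_sum = 0.0
    for step, conf in enumerate(confidences):
        if len(window) == window_size:
            # The oldest score is about to be evicted by append().
            running_sum -= window[0]
        window.append(conf)
        running_sum += conf
        if len(window) == window_size and running_sum / window_size < threshold:
            return step
    return None


# Toy confidence trace (illustrative values only).
trace = [20.0, 20.0, 20.0, 10.0, 10.0, 10.0]
stop = first_stop_step(trace, window_size=3, threshold=15.0)
print(stop)
```

Keeping a running sum avoids recomputing the mean over the whole window at every step, which matters for large windows such as the default of 2048.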

Requirements

  • PyTorch >= 1.13.0
  • Transformers >= 4.35.0

Citation

@article{fu2025deep,
  title={Deep think with confidence},
  author={Fu, Yichao and Wang, Xuewei and Tian, Yuandong and Zhao, Jiawei},
  journal={arXiv preprint arXiv:2508.15260},
  year={2025}
}