---
datasets:
- zwhe99/DeepMath-103K
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
---
# AutoDeco
Official Implementation of "[The End of Manual Decoding: Towards Truly End-to-End Language Models](https://arxiv.org/abs/2510.26697)"

**AutoDeco** is a framework that equips Large Language Models (LLMs) with token-level adaptive decoding. By attaching lightweight prediction heads to a pre-trained model, AutoDeco dynamically predicts the optimal temperature and top-p parameters for each token during decoding.

## 🎯 Key Features

- **Token-Level Decoding Parameter Prediction**: Dynamically predict decoding parameters (temperature and top-p) for each generated token
- **Lightweight Design**: Only adds two small MLP prediction heads (~5MB), without modifying the base model
- **Universal Architecture**: Supports multiple mainstream LLM architectures (Llama, Qwen2/2.5, Qwen3, MoE models, etc.)
- **End-to-End Training**: Trained end-to-end using only the standard cross-entropy loss, with gradients flowing implicitly to the prediction heads
- **Flexible Training**: Supports independent training of temperature head, top-p head, or joint training
- **Efficient Deployment**: Only the AutoDeco prediction head weights are saved during training; they are merged with the base model for decoding
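
As a rough illustration of the "lightweight" claim, a head of this kind can be sketched as a small two-layer MLP over the hidden states. This is an assumption-laden sketch: the class name `DecoHead`, the bottleneck width, and the softplus output are illustrative choices, not the repository's exact design.

```python
import torch
import torch.nn as nn

class DecoHead(nn.Module):
    """Hypothetical sketch of an AutoDeco-style prediction head: a small
    two-layer MLP mapping each token's hidden state to one positive scalar
    (a temperature or a top-p value). Names and sizes are assumptions."""

    def __init__(self, hidden_size: int, bottleneck: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.SiLU(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> (batch, seq_len), one positive value per token
        return nn.functional.softplus(self.mlp(hidden_states)).squeeze(-1)
```

With a hidden size of 4096, two such heads hold on the order of a million parameters each, consistent with the ~1-2M head sizes in the table below.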

## 🏗️ Architecture

The AutoDeco framework consists of two core components:

![AutoDeco Architecture](figure/arch.png)

### Model Workflow

```
Input Tokens
    ↓
Base LLM (frozen during head training)
    ↓
Hidden States
    ├──→ LM Head → Logits
    ├──→ TempHead → Temperature
    └──→ TopPHead → Top-P
```

During training, the base LLM parameters are frozen, and only the two prediction heads are trained.
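
Applying the predicted parameters at decode time amounts to per-token temperature scaling followed by nucleus (top-p) truncation. A minimal, dependency-free sketch of that step (the function name and list-based arithmetic are illustrative, not the repository's implementation):

```python
import math
import random

def sample_token(logits, temperature, top_p):
    """Sample one token id from raw logits using a per-token temperature and
    top-p value, as AutoDeco's predicted parameters would be applied at each
    decoding step. Illustrative sketch, not the repo's sampler."""
    # Temperature-scaled softmax
    probs = [math.exp(l / temperature) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Keep the smallest set of highest-probability tokens whose mass >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the nucleus and draw a token
    total = sum(probs[i] for i in kept)
    r = random.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Because both `temperature` and `top_p` are recomputed from the hidden state at every step, no global decoding hyperparameters need to be tuned by hand.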

## 🤖 Supported Models

AutoDeco supports all mainstream autoregressive LLMs; the supported architectures are unified under a single `AutoDecoModelForCausalLM` interface.



<div align="center">

| **Base Model** | **#Base Params** | **#AutoDeco Params** | **Download** |
| :------------: | :------------: | :------------: | :------------: |
| Llama-3.1-Nemotron-Nano-8B-v1 | 8B | 2.1M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Llama-Nemotron-8B)   |
| DeepSeek-R1-Distill-Qwen-7B   | 7B | 1.84M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-R1-Distill-Qwen-7B)   |
| Qwen3-30B-A3B-Instruct-2507   | 30B | 1.05M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Qwen3-30B-A3B-Instruct-2507)   |
| OpenAI-GPT-OSS-20B   | 20B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-20B)   |
| OpenAI-GPT-OSS-120B   | 120B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-120B)  |
| Qwen3-235B-A22B-Thinking   | 235B | 2.1M | [🤗 HuggingFace](https://huggingface.co/zacks917/AutoDeco-Qwen3-235B-A22B-Thinking-2507)  |
| DeepSeek-V3.1-Terminus   | 671B | - | Coming Soon  |

</div>



## 🚀 Installation

### Recommended Requirements

- Python >= 3.10
- PyTorch >= 2.0
- CUDA >= 12.0 (recommended for training)

### Install Dependencies

```bash
# Clone repository
cd AutoDeco

# Install core dependencies
pip install -r requirements.txt

# Optional: for training monitoring
pip install wandb
```

## 💡 Quick Start

### Initialize AutoDeco Model

```bash
python script/construct_autodeco.py \
    --base_model_name_or_path path_to_your_base_LLM \
    --output_dir path_to_your_AutoDeco_model
```

<!-- ### 2. Inference

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")
inputs = tokenizer("What is the meaning of life?", return_tensors="pt")

# Forward pass to get predictions
outputs = model(**inputs)

# outputs contains:
# - outputs.logits: Regular language model logits
# - outputs.temp_logits: Predicted temperature values
# - outputs.top_p_logits: Predicted top-p values
```

### 3. Efficient Inference with vLLM

We have integrated AutoDeco with vLLM for efficient batch inference:

- Install vLLM from source code first
    ```bash
    cd vllm
    pip install -e .
    ```

- Inference
    ```bash
    # Use training script for evaluation
    python llm_eval.py \
        --model_name_or_path path/to/autodeco_model \
        --dataset aime24 \
        --temp 1.0 \
        --top_p 1.0 \
        --k 16 \
        --tp_size 4
    ``` -->

## 🔥 Training

### Prepare Training Data

Training data should be in JSONL format, with one sample per line. AutoDeco supports standard conversation format:


```json
{
  "prompt": "formatted prompt text",
  "completion": "expected completion"
}
```

For example:

```json
{
  "prompt": "<|im_start|>user\nEvaluate the limit:$$\\lim_{(x, y) \\to (1, 2)} \\frac{(x-1)(y-2)-x+3}{x^2-2x+y^2-4}$$\nMake sure you output the final answer within \\boxed{}<|im_end|>\n<|im_start|>assistant\n",
  "completion": "......### ✅ Final Answer:\n$$\n\\boxed{-1}\n$$"
}
```
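
Serializing training pairs into this format is straightforward; a small helper (the function name `to_jsonl` is illustrative, not part of the repository) makes the required keys explicit:

```python
import json

def to_jsonl(samples):
    """Serialize prompt/completion pairs to JSONL text, one JSON object per
    line, matching the training-data format above. Illustrative helper."""
    lines = []
    for s in samples:
        if not {"prompt", "completion"} <= set(s):
            raise ValueError(f"sample missing required keys: {s}")
        # ensure_ascii=False keeps chat-template tokens and unicode readable
        lines.append(json.dumps(s, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Write the returned string to a file in the data directory referenced by `DATA_NAME` in the training script.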

### Train AutoDeco Heads

Use the provided training script:

```bash
# Edit script/trl_train.sh to configure parameters
# Key parameters:
# - MODEL_NAME_OR_PATH: Your initialized AutoDeco Model Path
# - DATA_NAME: Training data filename (in data directory)
# - MAX_LENGTH: Maximum sequence length
# - train_temp: Whether to train temperature head
# - train_top_p: Whether to train top-p head

bash script/trl_train.sh
```

Training configuration examples:

```bash
# Train only temperature head
accelerate launch trl_train.py \
    --model_name_or_path AutoDeco-Llama-3.1-8B \
    --dataset_name train_data.jsonl \
    --train_temp true \
    --train_top_p false \
    --learning_rate 5e-6 \
    --num_train_epochs 1 \
    --output_dir ckpt/llama3_temp_head
```

## 📊 Inference

### Batch Evaluation with vLLM

```bash
# Single evaluation
python llm_eval.py \
    --model_name_or_path ckpt/autodeco_model \
    --dataset aime24 \
    --temp 1.0 \
    --top_p 1.0 \
    --k 16 \
    --seed 42

# Batch evaluation with script (automatically generates multiple random seeds)
bash script/test_generation.sh aime24 1.0 1.0 -1 1.0 path/to/model
```

Evaluation results are saved in the `generation_log/` directory, including:
- Pass@K metrics
- Average accuracy
- Detailed generation results for each sample
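
Pass@K is conventionally computed with the standard unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch, assuming this repository follows that formula:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations (of which c are
    correct) is correct. Standard formula; assumed, not verified, to match
    this repository's metric."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `--k 16` generations per problem, `pass_at_k(16, c, 1)` reduces to the average accuracy `c / 16`.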

### Deploy with vLLM
```bash
# example: serve the merged full model (see "Merge AutoDeco Heads" under Advanced Usage)
vllm serve path_to_your_full_model
```

## 📁 Project Structure
```
AutoDeco/
├── model/                          # Model definitions
│   └── templlm_auto.py            # Unified AutoDeco model (recommended)
├── trainer/                        # Trainers
│   └── trl_Temp.py                # AutoDeco trainer
├── script/                         # Scripts
│   ├── trl_train.sh               # Training launch script
│   ├── test_generation.sh         # Batch evaluation script
│   └── merge_autodeco.py          # Merge or split heads
├── config/                         # Configuration files
│   └── deepspeed/                 # DeepSpeed configuration
│       └── deepspeed_zero3_gradaccu4.yaml
├── trl_train.py                   # Training main program
├── llm_eval.py                    # Evaluation main program (vLLM)
├── boxed_extract.py               # Answer extraction tool
├── requirements.txt               # Project dependencies
└── README.md                      # This document
```

## 🔧 Advanced Usage

### 1. Extract AutoDeco Heads from AutoDeco Model

```bash
python merge_autodeco.py split \
    --full-checkpoint path_to_your_full_model \
    --output path_to_split_head
```

This generates a lightweight checkpoint (~5MB) containing:
- `config.json`: AutoDeco configuration (including base_model_name_or_path)
- `autodeco_heads.safetensors`: Prediction head weights

### 2. Merge AutoDeco Heads to Base Model (for vLLM Deployment)

If you need to create a complete model file with heads for inference engines like vLLM:

```bash
python merge_autodeco.py merge \
    --autodeco-path path_to_autodeco_heads \
    --base-model-path path_to_base_LLM \
    --output path_to_your_full_model
```


## 📝 Citation

If you use AutoDeco in your research, please cite:

```bibtex
@misc{wang2025endmanualdecodingtruly,
      title={The End of Manual Decoding: Towards Truly End-to-End Language Models}, 
      author={Zhichao Wang and Dongyang Ma and Xinting Huang and Deng Cai and Tian Lan and Jiahao Xu and Haitao Mi and Xiaoying Tang and Yan Wang},
      year={2025},
      eprint={2510.26697},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.26697}, 
}
```

<!-- ## Acknowledgments

- Built on [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl)
- Training framework uses [DeepSpeed](https://github.com/microsoft/DeepSpeed)
- Inference optimization uses [vLLM](https://github.com/vllm-project/vllm) -->