---
datasets:
- zwhe99/DeepMath-103K
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
---
# AutoDeco
Official Implementation of "[The End of Manual Decoding: Towards Truly End-to-End Language Models](https://arxiv.org/abs/2510.26697)"

**AutoDeco** is a framework that equips Large Language Models (LLMs) with token-level adaptive decoding. By attaching lightweight prediction heads to a pre-trained model, AutoDeco dynamically predicts the optimal temperature and top-p parameters for each token during decoding.

## 🎯 Key Features

- **Token-Level Decoding Parameter Prediction**: Dynamically predict decoding parameters (temperature and top-p) for each generated token
- **Lightweight Design**: Only adds two small MLP prediction heads (~5MB), without modifying the base model
- **Universal Architecture**: Supports multiple mainstream LLM architectures (Llama, Qwen2/2.5, Qwen3, MoE models, etc.)
- **End-to-End Training**: Trained end-to-end using only the standard cross-entropy loss, with gradients flowing implicitly to the prediction heads
- **Flexible Training**: Supports independent training of temperature head, top-p head, or joint training
- **Efficient Deployment**: Only the AutoDeco prediction head weights are saved during training; they are merged with the base model for decoding
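
As a rough illustration of the "lightweight" claim, a head of this kind can be sketched as a small two-layer MLP over the hidden states. This is an assumption-laden sketch: the class name `DecoHead`, the bottleneck width, and the softplus output are illustrative choices, not the repository's exact design.

```python
import torch
import torch.nn as nn

class DecoHead(nn.Module):
    """Hypothetical sketch of an AutoDeco-style prediction head: a small
    two-layer MLP mapping each token's hidden state to one positive scalar
    (a temperature or a top-p value). Names and sizes are assumptions."""

    def __init__(self, hidden_size: int, bottleneck: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.SiLU(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden) -> (batch, seq_len), one positive value per token
        return nn.functional.softplus(self.mlp(hidden_states)).squeeze(-1)
```

With a hidden size of 4096, two such heads hold on the order of a million parameters each, consistent with the ~1-2M head sizes in the table below.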

## 🏗️ Architecture

The AutoDeco framework consists of two core components:

![AutoDeco Architecture](figure/arch.png)

### Model Workflow

```
Input Tokens
    ↓
Base LLM (frozen during head training)
    ↓
Hidden States
    ├──→ LM Head → Logits
    ├──→ TempHead → Temperature
    └──→ TopPHead → Top-P
```

During training, the base LLM parameters are frozen, and only the two prediction heads are trained.
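
Applying the predicted parameters at decode time amounts to per-token temperature scaling followed by nucleus (top-p) truncation. A minimal, dependency-free sketch of that step (the function name and list-based arithmetic are illustrative, not the repository's implementation):

```python
import math
import random

def sample_token(logits, temperature, top_p):
    """Sample one token id from raw logits using a per-token temperature and
    top-p value, as AutoDeco's predicted parameters would be applied at each
    decoding step. Illustrative sketch, not the repo's sampler."""
    # Temperature-scaled softmax
    probs = [math.exp(l / temperature) for l in logits]
    z = sum(probs)
    probs = [p / z for p in probs]
    # Keep the smallest set of highest-probability tokens whose mass >= top_p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the nucleus and draw a token
    total = sum(probs[i] for i in kept)
    r = random.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Because both `temperature` and `top_p` are recomputed from the hidden state at every step, no global decoding hyperparameters need to be tuned by hand.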

## 🤖 Supported Models

AutoDeco supports all mainstream autoregressive LLMs; the supported architectures are unified under a single `AutoDecoModelForCausalLM` interface.



<div align="center">

| **Base Model** | **#Base Params** | **#AutoDeco Params** | **Download** |
| :------------: | :------------: | :------------: | :------------: |
| Llama-3.1-Nemotron-Nano-8B-v1 | 8B | 2.1M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Llama-Nemotron-8B)   |
| DeepSeek-R1-Distill-Qwen-7B   | 7B | 1.84M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-R1-Distill-Qwen-7B)   |
| Qwen3-30B-A3B-Instruct-2507   | 30B | 1.05M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Qwen3-30B-A3B-Instruct-2507)   |
| OpenAI-GPT-OSS-20B   | 20B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-20B)   |
| OpenAI-GPT-OSS-120B   | 120B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-120B)  |
| Qwen3-235B-A22B-Thinking   | 235B | 2.1M | [🤗 HuggingFace](https://huggingface.co/zacks917/AutoDeco-Qwen3-235B-A22B-Thinking-2507)  |
| DeepSeek-V3.1-Terminus   | 671B | - | Coming Soon  |

</div>



## 🚀 Installation

### Recommended Requirements

- Python >= 3.10
- PyTorch >= 2.0
- CUDA >= 12.0 (recommended for training)

### Install Dependencies

```bash
# Clone repository
cd AutoDeco

# Install core dependencies
pip install -r requirements.txt

# Optional: for training monitoring
pip install wandb
```

## 💡 Quick Start

### Initialize AutoDeco Model

```bash
python script/construct_autodeco.py \
    --base_model_name_or_path path_to_your_base_LLM \
    --output_dir path_to_your_AutoDeco_model
```

<!-- ### 2. Inference

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")
inputs = tokenizer("What is the meaning of life?", return_tensors="pt")

# Forward pass to get predictions
outputs = model(**inputs)

# outputs contains:
# - outputs.logits: Regular language model logits
# - outputs.temp_logits: Predicted temperature values
# - outputs.top_p_logits: Predicted top-p values
```

### 3. Efficient Inference with vLLM

We have integrated AutoDeco with vLLM for efficient batch inference:

- Install vLLM from source code first
    ```bash
    cd vllm
    pip install -e .
    ```

- Inference
    ```bash
    # Use training script for evaluation
    python llm_eval.py \
        --model_name_or_path path/to/autodeco_model \
        --dataset aime24 \
        --temp 1.0 \
        --top_p 1.0 \
        --k 16 \
        --tp_size 4
    ``` -->

## 🔥 Training

### Prepare Training Data

Training data should be in JSONL format, with one sample per line. AutoDeco supports standard conversation format:


```json
{
  "prompt": "formatted prompt text",
  "completion": "expected completion"
}
```

For example:

```json
{
  "prompt": "<|im_start|>user\nEvaluate the limit:$$\\lim_{(x, y) \\to (1, 2)} \\frac{(x-1)(y-2)-x+3}{x^2-2x+y^2-4}$$\nMake sure you output the final answer within \\boxed{}<|im_end|>\n<|im_start|>assistant\n",
  "completion": "......### ✅ Final Answer:\n$$\n\\boxed{-1}\n$$"
}
```
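
Serializing training pairs into this format is straightforward; a small helper (the function name `to_jsonl` is illustrative, not part of the repository) makes the required keys explicit:

```python
import json

def to_jsonl(samples):
    """Serialize prompt/completion pairs to JSONL text, one JSON object per
    line, matching the training-data format above. Illustrative helper."""
    lines = []
    for s in samples:
        if not {"prompt", "completion"} <= set(s):
            raise ValueError(f"sample missing required keys: {s}")
        # ensure_ascii=False keeps chat-template tokens and unicode readable
        lines.append(json.dumps(s, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Write the returned string to a file in the data directory referenced by `DATA_NAME` in the training script.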

### Train AutoDeco Heads

Use the provided training script:

```bash
# Edit script/trl_train.sh to configure parameters
# Key parameters:
# - MODEL_NAME_OR_PATH: Your initialized AutoDeco Model Path
# - DATA_NAME: Training data filename (in data directory)
# - MAX_LENGTH: Maximum sequence length
# - train_temp: Whether to train temperature head
# - train_top_p: Whether to train top-p head

bash script/trl_train.sh
```

Training configuration examples:

```bash
# Train only temperature head
accelerate launch trl_train.py \
    --model_name_or_path AutoDeco-Llama-3.1-8B \
    --dataset_name train_data.jsonl \
    --train_temp true \
    --train_top_p false \
    --learning_rate 5e-6 \
    --num_train_epochs 1 \
    --output_dir ckpt/llama3_temp_head
```

## 📊 Inference

### Batch Evaluation with vLLM

```bash
# Single evaluation
python llm_eval.py \
    --model_name_or_path ckpt/autodeco_model \
    --dataset aime24 \
    --temp 1.0 \
    --top_p 1.0 \
    --k 16 \
    --seed 42

# Batch evaluation with script (automatically generates multiple random seeds)
bash script/test_generation.sh aime24 1.0 1.0 -1 1.0 path/to/model
```

Evaluation results are saved in the `generation_log/` directory, including:
- Pass@K metrics
- Average accuracy
- Detailed generation results for each sample
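
Pass@K is conventionally computed with the standard unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch, assuming this repository follows that formula:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations (of which c are
    correct) is correct. Standard formula; assumed, not verified, to match
    this repository's metric."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `--k 16` generations per problem, `pass_at_k(16, c, 1)` reduces to the average accuracy `c / 16`.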

### Deploy with vLLM
```bash
# example: serve the merged full model (see "Merge AutoDeco Heads" under Advanced Usage)
vllm serve path_to_your_full_model
```

## 📁 Project Structure
```
AutoDeco/
├── model/                          # Model definitions
│   └── templlm_auto.py            # Unified AutoDeco model (recommended)
├── trainer/                        # Trainers
│   └── trl_Temp.py                # AutoDeco trainer
├── script/                         # Scripts
│   ├── trl_train.sh               # Training launch script
│   ├── test_generation.sh         # Batch evaluation script
│   └── merge_autodeco.py          # Merge or split heads
├── config/                         # Configuration files
│   └── deepspeed/                 # DeepSpeed configuration
│       └── deepspeed_zero3_gradaccu4.yaml
├── trl_train.py                   # Training main program
├── llm_eval.py                    # Evaluation main program (vLLM)
├── boxed_extract.py               # Answer extraction tool
├── requirements.txt               # Project dependencies
└── README.md                      # This document
```

## 🔧 Advanced Usage

### 1. Extract AutoDeco Heads from AutoDeco Model

```bash
python merge_autodeco.py split \
    --full-checkpoint path_to_your_full_model \
    --output path_to_split_head
```

This generates a lightweight checkpoint (~5MB) containing:
- `config.json`: AutoDeco configuration (including base_model_name_or_path)
- `autodeco_heads.safetensors`: Prediction head weights

### 2. Merge AutoDeco Heads to Base Model (for vLLM Deployment)

If you need to create a complete model file with heads for inference engines like vLLM:

```bash
python merge_autodeco.py merge \
    --autodeco-path path_to_autodeco_heads \
    --base-model-path path_to_base_LLM \
    --output path_to_your_full_model
```


## 📝 Citation

If you use AutoDeco in your research, please cite:

```bibtex
@misc{wang2025endmanualdecodingtruly,
      title={The End of Manual Decoding: Towards Truly End-to-End Language Models}, 
      author={Zhichao Wang and Dongyang Ma and Xinting Huang and Deng Cai and Tian Lan and Jiahao Xu and Haitao Mi and Xiaoying Tang and Yan Wang},
      year={2025},
      eprint={2510.26697},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.26697}, 
}
```

<!-- ## Acknowledgments

- Built on [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl)
- Training framework uses [DeepSpeed](https://github.com/microsoft/DeepSpeed)
- Inference optimization uses [vLLM](https://github.com/vllm-project/vllm) -->