---
datasets:
- zwhe99/DeepMath-103K
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
---
# AutoDeco
Official Implementation of "[The End of Manual Decoding: Towards Truly End-to-End Language Models](https://arxiv.org/abs/2510.26697)"
**AutoDeco** is a framework that equips Large Language Models (LLMs) with token-level adaptive decoding. By adding lightweight prediction heads on top of a pre-trained model, AutoDeco dynamically predicts the optimal temperature and top-p for each token during decoding.
## 🎯 Key Features
- **Token-Level Decoding Parameter Prediction**: Dynamically predict decoding parameters (temperature and top-p) for each generated token
- **Lightweight Design**: Only adds two small MLP prediction heads (~5MB), without modifying the base model
- **Universal Architecture**: Supports multiple mainstream LLM architectures (Llama, Qwen2/2.5, Qwen3, MoE models, etc.)
- **End-to-End Training**: The heads are trained using only the standard cross-entropy loss; gradients reach them implicitly, with no auxiliary objective
- **Flexible Training**: Supports independent training of temperature head, top-p head, or joint training
- **Efficient Deployment**: Only the lightweight AutoDeco head weights are saved during training; for decoding they are merged back into the base model
## 🏗️ Architecture
The AutoDeco framework consists of two core components:

### Model Workflow
```
Input Tokens
↓
Base LLM (frozen during head training)
↓
Hidden States
├──→ LM Head → Logits
├──→ TempHead → Temperature
└──→ TopPHead → Top-P
```
During training, the base LLM parameters are frozen, and only the two prediction heads are trained.
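To make the diagram concrete, here is a minimal sketch of what such a prediction head could look like. The layer sizes, activation, and output squashing are assumptions for illustration, not the exact code in `model/templlm_auto.py`:

```python
# A minimal sketch of an AutoDeco prediction head (hypothetical sizes/activations;
# the real implementation lives in model/templlm_auto.py).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoDecoHead(nn.Module):
    """Small MLP mapping each token's hidden state to one scalar."""
    def __init__(self, hidden_size: int, intermediate_size: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq, hidden) -> (batch, seq); softplus keeps the output
        # positive (for a top-p head, a sigmoid squashing to (0, 1) would fit).
        return F.softplus(self.mlp(hidden_states)).squeeze(-1)

hidden = torch.randn(1, 16, 4096)        # dummy hidden states (hidden size assumed)
temperature = AutoDecoHead(4096)(hidden)  # per-token temperature
print(temperature.shape)                  # torch.Size([1, 16])
```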
## 🤖 Supported Models
AutoDeco is designed to work with any autoregressive LLM; the supported architectures are unified behind a single `AutoDecoModelForCausalLM` interface.
<div align="center">
| **Base Model** | **#Base Params** | **#AutoDeco Params** | **Download** |
| :------------: | :------------: | :------------: | :------------: |
| Llama-3.1-Nemotron-Nano-8B-v1 | 8B | 2.1M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Llama-Nemotron-8B) |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 1.84M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-R1-Distill-Qwen-7B) |
| Qwen3-30B-A3B-Instruct-2507 | 30B | 1.05M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Qwen3-30B-A3B-Instruct-2507) |
| OpenAI-GPT-OSS-20B | 20B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-20B) |
| OpenAI-GPT-OSS-120B | 120B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-120B) |
| Qwen3-235B-A22B-Thinking | 235B | 2.1M | [🤗 HuggingFace](https://huggingface.co/zacks917/AutoDeco-Qwen3-235B-A22B-Thinking-2507) |
| DeepSeek-V3.1-Terminus | 671B | - | Coming Soon |
</div>
## 🚀 Installation
### Recommended Requirements
- Python >= 3.10
- PyTorch >= 2.0
- CUDA >= 12.0 (recommended for training)
### Install Dependencies
```bash
# Clone the repository, then enter it
cd AutoDeco
# Install core dependencies
pip install -r requirements.txt
# Optional: for training monitoring
pip install wandb
```
## 💡 Quick Start
### Initialize AutoDeco Model
```bash
python script/construct_autodeco.py \
--base_model_name_or_path path_to_your_base_LLM \
--output_dir path_to_your_AutoDeco_model
```
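Once initialized, the model can be loaded through the unified interface. The sketch below is a hedged usage example: the import path for `AutoDecoModelForCausalLM` is assumed to be `model/templlm_auto.py`, and the model path is a placeholder.

```python
# Hedged usage sketch: load an initialized AutoDeco model and inspect the
# per-token predictions. Import path and model path are assumptions.
from transformers import AutoTokenizer
from model.templlm_auto import AutoDecoModelForCausalLM  # assumed import path

model = AutoDecoModelForCausalLM.from_pretrained("path_to_your_AutoDeco_model")
tokenizer = AutoTokenizer.from_pretrained("path_to_your_AutoDeco_model")

inputs = tokenizer("What is the meaning of life?", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # regular language-model logits
print(outputs.temp_logits)    # predicted per-token temperature
print(outputs.top_p_logits)   # predicted per-token top-p
```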
<!-- ### 2. Inference
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("path/to/model")
inputs = tokenizer("What is the meaning of life?", return_tensors="pt")
# Forward pass to get predictions
outputs = model(**inputs)
# outputs contains:
# - outputs.logits: Regular language model logits
# - outputs.temp_logits: Predicted temperature values
# - outputs.top_p_logits: Predicted top-p values
```
### 3. Efficient Inference with vLLM
We have integrated AutoDeco with vLLM for efficient batch inference:
- Install vLLM from source code first
```bash
cd vllm
pip install -e .
```
- Inference
```bash
# Use training script for evaluation
python llm_eval.py \
--model_name_or_path path/to/autodeco_model \
--dataset aime24 \
--temp 1.0 \
--top_p 1.0 \
--k 16 \
--tp_size 4
``` -->
## 🔥 Training
### Prepare Training Data
Training data should be in JSONL format, with one sample per line. AutoDeco supports standard conversation format:
```json
{
  "prompt": "formatted prompt text",
  "completion": "expected completion"
}
```
For example:
```json
{
  "prompt": "<|im_start|>user\nEvaluate the limit:$$\\lim_{(x, y) \\to (1, 2)} \\frac{(x-1)(y-2)-x+3}{x^2-2x+y^2-4}$$\nMake sure you output the final answer within \\boxed{}<|im_end|>\n<|im_start|>assistant\n",
  "completion": "......### ✅ Final Answer:\n$$\n\\boxed{-1}\n$$"
}
```
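A small helper like the following can serialize samples into this layout (the file name is a placeholder):

```python
# Write training samples as JSONL: one JSON object per line.
import json

samples = [
    {"prompt": "formatted prompt text", "completion": "expected completion"},
]

with open("data/train_data.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```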
### Train AutoDeco Heads
Use the provided training script:
```bash
# Edit script/trl_train.sh to configure parameters
# Key parameters:
# - MODEL_NAME_OR_PATH: Your initialized AutoDeco Model Path
# - DATA_NAME: Training data filename (in data directory)
# - MAX_LENGTH: Maximum sequence length
# - train_temp: Whether to train temperature head
# - train_top_p: Whether to train top-p head
bash script/trl_train.sh
```
Training configuration examples:
```bash
# Train only temperature head
accelerate launch trl_train.py \
--model_name_or_path AutoDeco-Llama-3.1-8B \
--dataset_name train_data.jsonl \
--train_temp true \
--train_top_p false \
--learning_rate 5e-6 \
--num_train_epochs 1 \
--output_dir ckpt/llama3_temp_head
```
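Conceptually, end-to-end training works because the predicted temperature rescales the logits inside the ordinary cross-entropy loss, so gradients flow into the head with no auxiliary objective. A schematic sketch of that idea (not the repository's exact trainer code):

```python
# Schematic view of the end-to-end objective: the per-token predicted
# temperature divides the logits before the usual cross-entropy, so
# gradients reach the temperature head through the LM loss alone.
import torch
import torch.nn.functional as F

def temperature_scaled_ce(logits: torch.Tensor,
                          temps: torch.Tensor,
                          labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab), temps: (batch, seq), labels: (batch, seq)
    scaled = logits / temps.clamp_min(1e-4).unsqueeze(-1)
    return F.cross_entropy(
        scaled.view(-1, scaled.size(-1)), labels.view(-1), ignore_index=-100
    )
```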
## 📊 Inference
### Batch Evaluation with vLLM
```bash
# Single evaluation
python llm_eval.py \
--model_name_or_path ckpt/autodeco_model \
--dataset aime24 \
--temp 1.0 \
--top_p 1.0 \
--k 16 \
--seed 42
# Batch evaluation with script (automatically generates multiple random seeds)
bash script/test_generation.sh aime24 1.0 1.0 -1 1.0 path/to/model
```
Evaluation results are saved in the `generation_log/` directory, including:
- Pass@K metrics
- Average accuracy
- Detailed generation results for each sample
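Pass@K is typically computed with the unbiased estimator of Chen et al. (2021); whether `llm_eval.py` uses this exact formula is an assumption, but the sketch below shows the standard calculation:

```python
# Unbiased Pass@K estimator: 1 - C(n-c, k) / C(n, k), in numerically
# stable product form. n = samples per problem, c = correct samples.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # fewer than k incorrect samples: always pass
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=16, c=4, k=8))
```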
### Deploy with vLLM
```bash
# Serve the merged full model (see "Merge AutoDeco Heads to Base Model" below)
vllm serve path_to_your_full_model
```
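`vllm serve` exposes an OpenAI-compatible HTTP API (port 8000 by default), so the served model can be queried as below; the model name and prompt are placeholders:

```python
# Query the OpenAI-compatible endpoint exposed by `vllm serve`.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "path_to_your_full_model",  # must match the served model
        "prompt": "What is the meaning of life?",
        "max_tokens": 256,
    },
)
print(resp.json()["choices"][0]["text"])
```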
## 📁 Project Structure
```
AutoDeco/
├── model/                          # Model definitions
│   └── templlm_auto.py             # Unified AutoDeco model (recommended)
│
├── trainer/                        # Trainers
│   └── trl_Temp.py                 # AutoDeco trainer
│
├── script/                         # Scripts
│   ├── trl_train.sh                # Training launch script
│   ├── test_generation.sh          # Batch evaluation script
│   └── merge_autodeco.py           # Merge or split heads
│
├── config/                         # Configuration files
│   └── deepspeed/                  # DeepSpeed configuration
│       └── deepspeed_zero3_gradaccu4.yaml
│
├── trl_train.py                    # Main training program
├── llm_eval.py                     # Main evaluation program (vLLM)
├── boxed_extract.py                # Answer extraction tool
├── requirements.txt                # Python dependencies
└── README.md                       # This document
```
## 🔧 Advanced Usage
### 1. Extract AutoDeco Heads from AutoDeco Model
```bash
python merge_autodeco.py split \
--full-checkpoint path_to_your_full_model \
--output path_to_split_head
```
This generates a lightweight checkpoint (~5MB) containing:
- `config.json`: AutoDeco configuration (including base_model_name_or_path)
- `autodeco_heads.safetensors`: Head weights
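The head weights can be inspected directly with `safetensors` (the path is a placeholder):

```python
# List the tensors stored in the lightweight heads checkpoint.
from safetensors.torch import load_file

heads = load_file("path_to_split_head/autodeco_heads.safetensors")
for name, tensor in heads.items():
    print(name, tuple(tensor.shape))
```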
### 2. Merge AutoDeco Heads to Base Model (for vLLM Deployment)
If you need to create a complete model file with heads for inference engines like vLLM:
```bash
python merge_autodeco.py merge \
--autodeco-path path_to_autodeco_heads \
--base-model-path path_to_base_LLM \
--output path_to_your_full_model
```
## 📝 Citation
If you use AutoDeco in your research, please cite:
```bibtex
@misc{wang2025endmanualdecodingtruly,
title={The End of Manual Decoding: Towards Truly End-to-End Language Models},
author={Zhichao Wang and Dongyang Ma and Xinting Huang and Deng Cai and Tian Lan and Jiahao Xu and Haitao Mi and Xiaoying Tang and Yan Wang},
year={2025},
eprint={2510.26697},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.26697},
}
```
<!-- ## Acknowledgments
- Built on [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl)
- Training framework uses [DeepSpeed](https://github.com/microsoft/DeepSpeed)
- Inference optimization uses [vLLM](https://github.com/vllm-project/vllm) -->