---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-generation
tags:
- text diffusion model
- language model
- code generation
---
# CoDA: Coding LM via Diffusion Adaptation

**CoDA-1.7B** is a lightweight diffusion language model for code generation developed by Salesforce AI Research. Unlike traditional autoregressive models, CoDA leverages discrete diffusion processes to enable bidirectional context understanding and efficient code completion.

- πŸ“„ [Technical Report](https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf)
- πŸ’» [Code Repository](https://github.com/SalesforceAIResearch/CoDA/)

## πŸ“Š Model Details

- **Model Size**: 1.7B parameters
- **Architecture**: Diffusion-based language model
- **Training**: TPU-based pre-training with GPU fine-tuning
- **Primary Use**: Code generation and completion tasks

## ✨ Key Features

- **Bidirectional Context**: Diffusion modeling enables understanding of both past and future tokens
- **Confidence-Guided Sampling**: Maintains competitive inference latency through intelligent sampling (see the sketch after this list)
- **Lightweight Design**: Achieves strong performance with fewer parameters than comparable models
- **Open Training Pipeline**: Fully reproducible training from pre-training to fine-tuning
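
To make the confidence-guided sampling idea concrete, here is a minimal, hypothetical sketch of one denoising step: the model scores every still-masked position and commits only its most confident predictions, leaving the rest masked for later steps. Function and variable names are illustrative and are not CoDA's actual API.

```python
import torch

def confidence_guided_step(logits, tokens, mask_positions, num_to_unmask):
    """One illustrative denoising step: commit only the most confident
    masked positions; keep the rest masked for later steps.

    logits:         (seq_len, vocab_size) model predictions for the sequence
    tokens:         (seq_len,) current token ids (masked slots hold a MASK id)
    mask_positions: (seq_len,) boolean tensor marking still-masked positions
    num_to_unmask:  how many positions to commit in this step
    """
    probs = torch.softmax(logits, dim=-1)
    confidence, predictions = probs.max(dim=-1)  # best token + its probability per position
    # Ignore positions that are already filled in.
    confidence = torch.where(mask_positions, confidence, torch.tensor(-1.0))

    # Pick the masked positions the model is most certain about.
    commit = torch.topk(confidence, k=num_to_unmask).indices
    tokens = tokens.clone()
    tokens[commit] = predictions[commit]
    mask_positions = mask_positions.clone()
    mask_positions[commit] = False
    return tokens, mask_positions
```

The `ALG` setting described later presumably swaps in a different confidence measure (for example, prediction entropy), but the overall shape of the loop stays the same.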

## πŸ“ˆ Performance

CoDA-1.7B-Instruct demonstrates competitive performance on standard code generation benchmarks:

| Model | HumanEval | HumanEval+ | MBPP | MBPP+ | EvalPlus |
|-------|-----------|------------|------|-------|----------|
| **CoDA-Base** | 29.3 | 23.8 | 35.2 | 46.0 | 34.9 |
| **CoDA-Instruct** | **54.3** | **47.6** | 47.2 | **63.2** | **55.4** |
| Dream-Base | 56.7 | 50.0 | 68.7 | 57.4 | 53.7 |
| Dream-7B-Instruct | 57.9 | 53.7 | 68.3 | 56.1 | 54.9 |
| LLaDA-8B-Instruct | 35.4 | 31.7 | 31.5 | 28.6 | 30.2 |

**🎯 Key Finding**: CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters while maintaining significantly lower computational requirements. CoDA offers an advantageous balance between inference speed and accuracy compared to larger diffusion models.

## πŸŽ“ Training Methodology

CoDA employs a three-stage process:

1. **Pre-training** with bidirectional masking
2. **Post-training** with instruction-format data
3. **Inference** with progressive denoising
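
As a rough illustration of stage 1, a discrete-diffusion pre-training step masks a random fraction of tokens and trains the model to reconstruct them with full bidirectional attention. The snippet below is a conceptual sketch only, not CoDA's actual training code; the masking schedule and loss weighting in the real pipeline may differ.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, input_ids, mask_token_id):
    """Conceptual pre-training step for a masked discrete-diffusion LM:
    sample a masking ratio, corrupt that fraction of tokens, and score the
    model only on the positions it must reconstruct."""
    t = torch.rand(())                      # random noise level in (0, 1)
    noise = torch.rand(input_ids.shape)
    masked = noise < t                      # each token masked with probability t
    corrupted = torch.where(
        masked, torch.full_like(input_ids, mask_token_id), input_ids
    )

    logits = model(corrupted).logits        # bidirectional prediction over all positions
    loss = F.cross_entropy(
        logits[masked],                     # only masked positions contribute
        input_ids[masked],
    )
    return loss
```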

## πŸ› οΈ Usage

### πŸš€ Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Salesforce/CoDA-v0-Instruct"
# trust_remote_code may be required so the checkpoint's custom diffusion
# generation code is loaded alongside the weights.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Generate code
prompt = "Write a Python function to calculate fibonacci numbers"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_tokens=256,        # maximum tokens to generate
    diffusion_steps=128,   # number of denoising steps
    temperature=0.0        # greedy (deterministic) decoding
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### πŸš€ Deployment

For production deployment, the repository includes a FastAPI server that exposes OpenAI-compatible APIs:

```bash
# Clone the repository
git clone https://github.com/SalesforceAIResearch/CoDA
cd CoDA

# Set up environment
python3 -m venv .venv
source .venv/bin/activate
pip install -r serving/requirements.txt

# Export your Hugging Face token
export HF_TOKEN="hf_..."

# Start the server
bash serving/fast-api/start_server.sh
```

The server will listen on `http://localhost:8000`.
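
Once the server is up, any OpenAI-compatible client can talk to it. The snippet below is a minimal sketch using plain `requests` against the conventional `/v1/chat/completions` route; the exact route and supported fields are assumptions based on the OpenAI-compatible claim, so check `serving/fast-api` for the definitive interface.

```python
import requests

# Hypothetical example: POST a chat request to the locally running server.
# The route and payload follow the OpenAI chat-completions convention.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Salesforce/CoDA-v0-Instruct",
        "messages": [
            {"role": "user", "content": "Write a Python function to reverse a string."}
        ],
        "max_tokens": 256,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```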

### πŸ’¬ Interactive CLI

```bash
python serving/fast-api/chat_cli.py \
  --base-url http://localhost:8000 \
  --model Salesforce/CoDA-v0-Instruct \
  --stream \
  --show-meta
```

### βš™οΈ Generation Hyperparameters

Customize generation behavior with environment variables:

```bash
export MAX_TOKENS=512          # Maximum tokens to generate
export TEMPERATURE=0.7         # Sampling temperature
export TOP_P=0.9              # Nucleus sampling threshold
export STEPS=128              # Number of diffusion steps
export ALG="entropy"          # Sampling algorithm
export ALG_TEMP=0.1           # Algorithm temperature
export BLOCK_LENGTH=32        # Block size for processing
```

**Recommended Settings**:
- **Fast inference**: `STEPS=64`, `TEMPERATURE=0.0`
- **Quality generation**: `STEPS=128`, `TEMPERATURE=0.7`, `TOP_P=0.9`
- **High quality**: `STEPS=256`, `TEMPERATURE=0.5`, `TOP_P=0.95`

## πŸ”§ Training from Scratch

The complete training pipeline is available in our [repository](https://github.com/SalesforceAIResearch/CoDA):

```bash
# Clone the repository
git clone https://github.com/SalesforceAIResearch/CoDA
cd CoDA
```

### 🧠 Pre-training on TPU
```bash
# Configure TPU environment
cd pre-train
cp env.example .env  # Add your TPU metadata
bash setup_tpu.sh

# Launch pre-training
bash recipes/midtrain_v4_512.sh
```

### 🎯 Supervised Fine-tuning
```bash
# Set up fine-tuning environment
cd post-train/LLaMA-Factory
pip install -r requirements.txt

# Configure dataset and run fine-tuning
bash ../../run_sft.sh
```

### πŸ“Š Evaluation
```bash
cd evaluation/lm_eval
bash eval_mbpp_humaneval.sh
```

## πŸ“š Citation

If you use CoDA in your work, please cite:

```bibtex
@misc{coda2025,
  title={CoDA: Coding LM via Diffusion Adaptation},
  author={Chen, Haolin and Wang, Shiyu and Qin, Can and Pang, Bo and Liu, Zuxin and Qiu, Jielin and Zhang, Jianguo and Zhou, Yingbo and Chen, Zeyuan and Xu, Ran and Heinecke, Shelby and Savarese, Silvio and Xiong, Caiming and Wang, Huan and Yao, Weiran},
  year={2025},
  publisher={Salesforce AI Research}
}
```

## πŸ”— Resources

- πŸ“„ **Technical Report**: [technical_report.pdf](https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf)
- πŸ’» **Code Repository**: [github.com/SalesforceAIResearch/CoDA](https://github.com/SalesforceAIResearch/CoDA)
- πŸ€— **Model Hub**: [Salesforce CoDA collection](https://huggingface.co/collections/Salesforce/coda-68d627d87921c0e28a69e340)

## πŸ™ Acknowledgements

We thank Lingpeng Kong for insightful discussions and Jialei Chen for technical support with TPU infrastructure.

---

*🏒 Developed by Salesforce AI Research*