---
base_model:
- Qwen/Qwen2.5-Coder-7B-Instruct
library_name: transformers
tags:
- verilog
pipeline_tag: text-generation
---

## CodeV-R1-Qwen-7B

[Project page](https://iprc-dip.github.io/CodeV-R1)

<div class="figure-container" style="display: flex; flex-direction: column; gap: 15px; max-width: 850px;">
  <div style="display: flex; gap: 10px; justify-content: center; margin-bottom: -3rem;">
    <img src="./assets/rtllm_tts.png" alt="RTLLM TTS Results" width="400">
    <img src="./assets/rtllm_tts_flops.png" alt="RTLLM TTS FLOPs Results" width="400">
  </div>
  <figcaption class="caption has-text-centered has-text-grey" style="font-size: 0.8rem;">
    Test-time scaling curves. <strong>Left</strong>: Inference time as a function of token length. <strong>Right</strong>: Inference time vs. estimated FLOPs consumption.
    When measured by FLOPs consumption, our <strong>CodeV-R1-Qwen-7B</strong> achieves better results with fewer computational resources than DeepSeek-R1, highlighting its superior efficiency.
  </figcaption>
</div>

### 1. Introduction

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high‐quality NL–code pairs, and the prohibitive computation cost of RLVR. 

To this end, we introduce **CodeV-R1**, an RLVR framework for training Verilog generation LLMs, continuing the work initiated with [CodeV](https://huggingface.co/collections/yang-z/codev-6698a560cd94e61a9675fa2a). First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code–NL–code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage distill-then-RL training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that reduces training cost by adaptively adjusting the sampling rate.

**CodeV-R1-Qwen-7B** is obtained by reinforcement learning (RL) fine-tuning on top of **CodeV-R1-Distill-Qwen-7B**. The distillation-based precursor, **CodeV-R1-Distill-Qwen-7B**, is available [here](https://huggingface.co/zhuyaoyu/CodeV-R1-Distill-Qwen-7B).
For more training details, please refer to our [paper](https://arxiv.org/abs/2505.24183).

### 2. Evaluation Results

During evaluation, the maximum generation length is set to 16,384 tokens and the temperature to 0.6, with 20 responses generated per query to estimate the pass@1 score.
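With 20 samples per query, pass@1 can be computed with the standard unbiased pass@k estimator. The sketch below is illustrative (the function and the example counts are ours, not the authors' evaluation script); for k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples, of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem correct counts out of n = 20 samples each.
correct_counts = [20, 14, 0, 7]
scores = [pass_at_k(20, c, 1) for c in correct_counts]
print(sum(scores) / len(scores))  # → 0.5125
```

For k = 1 this is simply c/n, but the general form also covers pass@5 or pass@10 from the same 20 samples.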

We evaluate on Verilog benchmarks including VerilogEval and RTLLM. For VerilogEval v2, we examine zero-shot performance on both specification-to-RTL translation and code completion. For RTLLM, we report results on version 1.1, which offers a broader set of baselines for comparison. Furthermore, we find that learning the reasoning process for Verilog problems, as distilled from DeepSeek-R1, also enhances the model's out-of-domain mathematical capabilities.

#### VerilogEval (v2)

| Model                       | Model size  | Type        | Spec-to-rtl | Completion |
| --------------------------- | ----------- | ----------- | ----------- | ---------- |
| GPT-4o                      | Undisclosed | General     | 62.5%       | 59.0%      |
| GPT-4 Turbo                 | Undisclosed | General     | 61.1%       | 53.9%      |
| GPT-4                       | Undisclosed | General     | 32.0%       | 42.3%      |
| Mistral Large               | Undisclosed | General     | 37.5%       | 34.0%      |
| Llama3.1                    | 405B        | General     | 57.2%       | 56.4%      |
| Llama3.1                    | 70B         | General     | 42.8%       | 35.3%      |
| Llama3                      | 70B         | General     | 43.9%       | 37.8%      |
| Llama2                      | 70B         | General     | 5.3%        | 1.3%       |
| Llama3.1                    | 8B          | General     | 19.1%       | 2.6%       |
| CodeLlama                   | 70B         | Coding      | 34.9%       | 37.2%      |
| DeepSeek Coder              | 33B         | Coding      | 21.7%       | 25.0%      |
| CodeGemma                   | 7B          | Coding      | 9.5%        | 8.3%       |
| DeepSeek Coder              | 6.7B        | Coding      | 29.6%       | 24.4%      |
| RTL-Coder                   | 6.7B        | Verilog RTL | 36.8%       | 35.9%      |
| **CodeV-R1-distill (ours)** | 7B          | Verilog RTL | 65.2%       | 65.5%      |
| **CodeV-R1 (ours)**         | 7B          | Verilog RTL | **68.8%**   | **69.9%**  |

#### RTLLM (v1.1)

| Model                       | Model size  | Type        | Pass@1    |
| --------------------------- | ----------- | ----------- | --------- |
| GPT-4o                      | Undisclosed | General     | 33.8%     |
| GPT-3.5 Turbo               | Undisclosed | General     | 28.3%     |
| Llama3.1                    | 405B        | General     | 38.9%     |
| Nemotron-4                  | 340B        | General     | 18.9%     |
| Llama3.1                    | 8B          | General     | 19.1%     |
| CodeLlama                   | 7B          | Coding      | 17.9%     |
| CodeQwen                    | 7B          | Coding      | 24.1%     |
| Starcoder2                  | 15B         | Coding      | 15.5%     |
| DeepSeek Coder              | 6.7B        | Coding      | 23.1%     |
| DeepSeek-Coder-V2           | 16B         | Coding      | 33.1%     |
| DeepSeek-Coder-V2           | 236B        | Coding      | 34.5%     |
| RTL-Coder                   | 6.7B        | Verilog RTL | 36.8%     |
| CraftRTL                    | 6.7B        | Verilog RTL | 53.1%     |
| **CodeV-R1-distill (ours)** | 7B          | Verilog RTL | 56.2%     |
| **CodeV-R1 (ours)**         | 7B          | Verilog RTL | **72.9%** |

For RTLLM v1.1, we also plot results showing pass rate against model size.
<div style="display: flex; gap: 10px;">
 <img src="./assets/rtllm_acc_vs_model_size.png" alt="RTLLM TTS Results" width="1200">
</div>

### 3. Usage

CodeV-R1-Distill-Qwen-7B can be used in the same manner as Qwen or Llama models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```bash
vllm serve zhuyaoyu/CodeV-R1-Distill-Qwen-7B --tensor-parallel-size 2 --max-model-len 16384 --enforce-eager
```
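The served model exposes vLLM's OpenAI-compatible endpoint (`/v1/chat/completions`). Below is a minimal sketch of how a request body could be assembled; the system prompt string is abbreviated here (the full prompt is given under Usage Recommendations), and the helper function is ours, not part of the model's tooling.

```python
import json

# Abbreviated placeholder; use the full system prompt from Usage Recommendations.
SYSTEM_PROMPT = "You are a helpful assistant. The assistant first thinks about the reasoning process..."

def build_request(spec: str, temperature: float = 0.6, max_tokens: int = 16384) -> str:
    """Assemble an OpenAI-style chat request body for the vLLM server."""
    body = {
        "model": "zhuyaoyu/CodeV-R1-Distill-Qwen-7B",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": spec},
        ],
        "temperature": temperature,  # matches the evaluation setting
        "max_tokens": max_tokens,    # matches the 16,384-token generation limit
    }
    return json.dumps(body)

req = build_request("Write a Verilog module implementing a 4-bit counter.")
```

The temperature and token limit default to the values used during evaluation; POST the resulting JSON to `http://localhost:8000/v1/chat/completions` once the server is running.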

**Usage Recommendations**

During training and evaluation, we use the following system prompt:

````
You are a helpful assistant. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>.  Now the user asks you to write verilog code. After thinking, when you finally reach a conclusion, enclose the final verilog code in ```verilog ``` within <answer> </answer> tags. i.e., <answer> ```verilog
 module top_module(in, out, ...) ... ``` </answer>.
````

It is recommended to use this prompt during inference.
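Responses produced under this prompt wrap the final code in `<answer>` tags with a ```` ```verilog ```` fence, so the code must be extracted before running a testbench. A small helper like the following can do this (a sketch; the regex is ours, not from the paper):

```python
import re

def extract_verilog(response: str):
    """Return the Verilog code inside <answer> ```verilog ... ``` </answer>, or None."""
    m = re.search(r"<answer>.*?```verilog\s*(.*?)```.*?</answer>", response, re.DOTALL)
    return m.group(1).strip() if m else None

sample = (
    "<think>reasoning...</think>"
    "<answer> ```verilog\n"
    "module top(input a, output b);\n"
    "assign b = a;\n"
    "endmodule\n"
    "``` </answer>"
)
print(extract_verilog(sample))
```

Returning `None` for malformed responses lets the caller count them as failures rather than crash mid-evaluation.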

### 4. License

CodeV-R1-Qwen-7B is derived from the [Qwen-2.5 series](https://github.com/QwenLM/Qwen2.5), originally licensed under the [Apache 2.0 License](https://huggingface.co/Qwen/Qwen2.5-1.5B/blob/main/LICENSE), and fine-tuned on 87k samples curated with DeepSeek-R1.

### 5. Citation

If you find our model helpful, please cite our [paper](https://arxiv.org/abs/2505.24183):

```tex
@misc{zhu2025codevr1,
      title={CodeV-R1: Reasoning-Enhanced Verilog Generation}, 
      author={Yaoyu Zhu and Di Huang and Hanqi Lyu and Xiaoyun Zhang and Chongxiao Li and Wenxuan Shi and Yutong Wu and Jianan Mu and Jinghua Wang and Yang Zhao and Pengwei Jin and Shuyao Cheng and Shengwen Liang and Xishan Zhang and Rui Zhang and Zidong Du and Qi Guo and Xing Hu and Yunji Chen},
      year={2025},
      eprint={2505.24183},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.24183}, 
}
```