Commit f5d7af1 (verified) by henryen · Parent: 0f854c9

Update README.md

---
license: gpl-3.0
base_model: deepseek-ai/deepseek-coder-7b-instruct-v1.5
library_name: peft
---

# OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection

### Introduction
OriGen is a LoRA model fine-tuned for Verilog code generation. It is trained on top of DeepSeek Coder 7B using datasets generated through code-to-code augmentation and self-reflection.

OriGen_Fix is a LoRA model fine-tuned for fixing syntax errors in Verilog code. It is trained on top of OriGen.

The models have been uploaded to Hugging Face, and the repository contains the inference scripts. The dataset and the data generation flow will be released soon.

- **Hugging Face**:
  - https://huggingface.co/henryen/OriGen
  - https://huggingface.co/henryen/OriGen_Fix
- **Repository**: https://github.com/pku-liang/OriGen

### Evaluation Results
<img src="figures/evaluation.png" alt="evaluation" width="1000"/>

### Quick Start

Before running the following code, please install the required packages:

```bash
conda create -n origen python=3.11
conda activate origen
pip install -r requirements.txt
```

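The inference script below requests FlashAttention-2 via `attn_implementation="flash_attention_2"`, which requires the `flash-attn` package and a supported GPU. As a minimal sketch (the helper name is ours, not part of this repository), you can detect whether `flash-attn` is installed and fall back to PyTorch's built-in `"sdpa"` attention otherwise:

```python
# Hypothetical helper: choose an attention implementation based on what is installed.
# "flash_attention_2" needs the flash-attn package; "sdpa" (PyTorch's scaled
# dot-product attention) is available everywhere and is a safe fallback.
import importlib.util

def pick_attn_implementation() -> str:
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation())
```

The returned string can be passed directly as the `attn_implementation` argument of `AutoModelForCausalLM.from_pretrained`.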
Here is an example of how to use the model.
Please note that the base model, DeepSeek Coder 7B, is loaded in float16 precision even though its default precision is bfloat16. This choice was made because our experiments showed that LoRA adapters trained in float16 outperform those trained in bfloat16.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch
from peft import PeftModel
import json

model_name = "deepseek-ai/deepseek-coder-7b-instruct-v1.5"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# device_map="auto" already places the model on the GPU,
# so no extra .to("cuda") call is needed afterwards.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# Attach the OriGen_Fix LoRA adapter on top of the base model.
model = PeftModel.from_pretrained(model, model_id="henryen/OriGen_Fix")
model.eval()

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt_template = """
### Instruction: Please act as a professional Verilog designer. Your task is to debug a Verilog module.\nYou will receive a task, an original verilog code with syntax and function errors, and the corresponding error messages. \nYou should generate a corrected code based on the original code and the error messages
Your task:
{description}
Original code:
{original_code}
Error message:
{error}
You should now generate a correct code.
### Response:{header}
"""

def generate_code(data):
    description = data["description"]
    original_code = data["original_code"]
    error = data["error"]
    header = data["module_header"]
    prompt = prompt_template.format(description=description, original_code=original_code, error=error, header=header)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Greedy decoding: do_sample=False makes generation deterministic,
    # so no temperature setting is needed.
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        streamer=streamer,
    )

    # Strip the prompt tokens so only the completion is returned.
    input_length = inputs["input_ids"].shape[1]
    completion = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
    return completion

input_file = "./data/example/example_error.jsonl"
output_file = "./data/example/example_fix.jsonl"

with open(input_file, "r") as f, open(output_file, "w") as f2:
    for line in f:
        data = json.loads(line)
        completion = generate_code(data)
        json.dump({"task_id": data["task_id"], "completion": completion}, f2)
        f2.write("\n")
```

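Each line of `example_error.jsonl` is a JSON object; the script above reads the fields `task_id`, `description`, `original_code`, `error`, and `module_header` from it. As a rough sketch (the field values here are invented for illustration, only the key names come from the script), a record round-trips like this:

```python
# Sketch of a JSONL record in the shape consumed by generate_code above.
# The key names match what the script reads; the values are made up.
import json

record = {
    "task_id": "example_0",
    "description": "Implement out = (a & b) | (c & d) and its complement out_n.",
    "original_code": "module top (input a, b, c, d, output out, out_n);\n  // buggy body\nendmodule",
    "error": "syntax error near 'endmodule'",
    "module_header": "module top (input a, b, c, d, output out, out_n);",
}

line = json.dumps(record)   # one JSONL line
parsed = json.loads(line)   # parses back to the same dict
print(parsed["task_id"])
```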
The output will be:
```verilog
wire and0_out;
wire and1_out;
wire or0_out;

assign and0_out = a & b;
assign and1_out = c & d;
assign or0_out = and0_out | and1_out;
assign out = or0_out;
assign out_n = ~or0_out;

endmodule
```

You can check its correctness using the testbench provided under the folder `./data/example/`:
```bash
cd ./data/example/
evaluate_functional_correctness example_fix.jsonl --problem_file example_problem.jsonl
```

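`evaluate_functional_correctness` follows the HumanEval-style harness convention of reporting pass@k. As a reminder of the metric (this snippet is ours, not part of the repository), the unbiased estimator for n samples of which c pass is 1 - C(n-c, k)/C(n, k):

```python
# Unbiased pass@k estimator used by HumanEval-style harnesses:
# pass@k = 1 - C(n - c, k) / C(n, k), for n samples of which c are correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # -> 0.3
```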

### Paper
**arXiv:** https://arxiv.org/abs/2407.16237

Please cite our paper if you find this model useful.

```bibtex
@article{2024origen,
  title={OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection},
  author={Cui, Fan and Yin, Chenyang and Zhou, Kexing and Xiao, Youwei and Sun, Guangyu and Xu, Qiang and Guo, Qipeng and Song, Demin and Lin, Dahua and Zhang, Xingcheng and others},
  journal={arXiv preprint arXiv:2407.16237},
  year={2024}
}
```