GiuLeo01 committed on
Commit 5765c17 · verified · 1 Parent(s): 8f97403

Update README.md

Files changed (1): README.md (+163 -9)

README.md CHANGED
@@ -1,31 +1,185 @@
  ---
  base_model: unsloth/Qwen2.5-Coder-3B-Instruct
  library_name: transformers
- model_name: outputs
  tags:
- - generated_from_trainer
  - unsloth
- - trl
  - sft
- licence: license
  ---

- # Model Card for outputs

- This model is a fine-tuned version of [unsloth/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).

  ## Quick start

  ```python
  from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="GiuLeo01/outputs", device="cuda")
  output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
  print(output["generated_text"])
  ```

  ## Training procedure

  [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/giulioleonardi2001-universit-di-pisa/QwenCoder_fine_tuning/runs/57yilwhe)
 
  ---
  base_model: unsloth/Qwen2.5-Coder-3B-Instruct
  library_name: transformers
  tags:
  - unsloth
+ - grpo
  - sft
+ - fortran
  ---

+ # Introduction
+
+ <!-- Provide a quick summary of what the model is/does. -->
+ This model is a prototype large language model fine-tuned specifically for Fortran 90 code generation. It is based on Qwen 2.5 Coder 3B Instruct and was refined using Supervised Fine-Tuning.
+
+ **There is a more powerful version of this model, which has additionally been fine-tuned using Reinforcement Learning with Verifiable Rewards (via GRPO).**
+
+ The model was fine-tuned briefly, without any human-labeled data, and using only a single consumer GPU. Despite these constraints, training led to a roughly 400% boost in performance on simple to moderately complex Fortran program generation tasks (HumanEval-like). Compilation errors dropped sharply as well, and the model now performs close to much larger general-purpose models that were not specifically trained for this task.
+
+ ## Evaluation
+ Because no established benchmark exists for Fortran code generation, a quick-and-dirty Fortran adaptation of the HumanEval benchmark was created to evaluate the model. This benchmark is currently under review and will be released publicly at a later date.
+
+ According to the current demo version of the FortranHumanEval benchmark:
+
+ | Model | pass@1 | Compile Error Rate |
+ |----------------------------------|--------|--------------------|
+ | **FortranCodeGen 3B** | 23.17% | 17.68% |
+ | **FortranCodeGen 3B (SFT only)** | 19.51% | 31.09% |
+ | Qwen 2.5 Coder 3B Instruct | 5.48% | 63.41% |
+ | GPT-4o mini | 18.90% | 43.90% |
+ | GPT-4o | 32.31% | 17.07% |
+
+ Compared to its base model (Qwen 2.5 Coder 3B Instruct), FortranCodeGen 3B shows a strong improvement, raising pass@1 accuracy from 5.48% to 23.17% and cutting the compile error rate from 63.41% to 17.68%. This highlights the effectiveness of this simple fine-tuning process, even though it was performed with limited resources: no human-labeled data, a small synthetic dataset, and training on a single consumer GPU (an L4 :'( ).
+
+ Compared to GPT-4o mini, FortranCodeGen 3B performs better on both pass@1 accuracy (23.17% vs. 18.90%) and compile reliability (17.68% vs. 43.90%). This suggests that task-specific fine-tuning can produce better results than more general, (probably) larger models.
+
+ While it does not yet match the overall performance of GPT-4o, which achieves 32.31% pass@1, FortranCodeGen 3B reaches a comparable level of compilation correctness (17.68% vs. 17.07%), suggesting that its outputs are syntactically robust and close to executable even when they do not solve the full task.
+
+ These results confirm that targeted specialization can significantly improve performance on underrepresented tasks, and they suggest a promising direction for very-low-resource fine-tuning in legacy or niche programming languages.
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ The model is highly specialized in generating Fortran 90 programs that read input from stdin and print output to stdout. It is recommended to run the model with a low temperature (or to disable sampling entirely) to maximize accuracy.
+
+ Before running any generated code, it is always a good idea to check how the program handles input from stdin, especially if you are new to Fortran. A minimal way to perform that check is sketched below.
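+ The sketch compiles a generated program with `gfortran` and runs it on sample input. It is a minimal example, assuming `gfortran` is installed and on `PATH`; the helper name and file layout are illustrative, not part of this repository:
+
+ ```python
+ import pathlib
+ import subprocess
+ import tempfile
+
+ def check_fortran_program(source: str, stdin_text: str) -> str:
+     """Compile a generated Fortran 90 program and run it on sample stdin."""
+     with tempfile.TemporaryDirectory() as tmp:
+         src = pathlib.Path(tmp) / "prog.f90"
+         src.write_text(source)
+         exe = pathlib.Path(tmp) / "prog"
+         # Compile with gfortran; raises CalledProcessError on a compile error.
+         subprocess.run(["gfortran", str(src), "-o", str(exe)], check=True)
+         # Run the binary, feeding the sample input on stdin.
+         result = subprocess.run(
+             [str(exe)], input=stdin_text, capture_output=True, text=True, check=True
+         )
+         return result.stdout
+
+ # Example: feed input that matches the program's expected stdin format.
+ # print(check_fortran_program(generated_code, "3\n1.0 2.0 3.0\n"))
+ ```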
+
  ## Quick start

  ```python
  from transformers import pipeline

+ question = "Write me a Fortran program that, given an array of real numbers from stdin, prints the average."
+ generator = pipeline("text-generation", model="GiuLeo01/FortranCodeGen-3b-SynthData", device="cuda")
  output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
  print(output["generated_text"])
  ```
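+
+ Following the recommendation in the Uses section, sampling can also be disabled for deterministic output. The settings below (`do_sample=False` and a larger `max_new_tokens` budget) are illustrative assumptions, not values prescribed by this card:
+
+ ```python
+ # Greedy decoding: do_sample=False disables sampling entirely; 512 new tokens
+ # (an illustrative budget) leaves room for a complete Fortran program.
+ output = generator(
+     [{"role": "user", "content": question}],
+     max_new_tokens=512,
+     do_sample=False,
+     return_full_text=False,
+ )[0]
+ print(output["generated_text"])
+ ```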
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The goal of this experiment was to specialize a model in a complex task, Fortran code generation, without using manually annotated data, which is particularly hard to find for this programming language.
+
+ #### Supervised Data
+ 1) A subset of the MBPP dataset (~600 examples) was selected.
+ 2) The task descriptions were automatically adapted to Fortran with precise instructions, using OpenAI o3-mini.
+ 3) Tasks were filtered using embeddings and manually reviewed to ensure that no examples too similar to HumanEval tasks were included in the training set.
+ 4) Each task was automatically labeled by three stronger (and bigger) models: OpenAI o3-mini, Qwen 2.5 Coder 32B, and OpenAI GPT-4o.
+ 5) Labels were automatically validated through unit tests.
+ 6) Only correct solutions were kept, at most one per task, prioritized in the following order: OpenAI o3-mini > Qwen 2.5 Coder 32B > OpenAI GPT-4o (see the sketch below).
+
+ This simple process led to the creation of a small, synthetically labeled training set used for supervised fine-tuning.
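+
+ The selection in step 6 can be sketched as follows; the function and data layout are hypothetical, and only the priority order comes from the description above:
+
+ ```python
+ # Hypothetical sketch of step 6: keep at most one verified solution per task,
+ # preferring o3-mini, then Qwen 2.5 Coder 32B, then GPT-4o.
+ PRIORITY = ["o3-mini", "qwen2.5-coder-32b", "gpt-4o"]
+
+ def select_solution(candidates):
+     """candidates maps a model name to a (code, passed_unit_tests) pair."""
+     for model in PRIORITY:
+         code, passed = candidates.get(model, (None, False))
+         if passed:
+             return code
+     return None  # the task is dropped if no model produced a correct solution
+ ```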
+
+ IMPORTANT: Do not evaluate this model on MBPP-derived benchmarks, due to the data overlap described above.
80
+
81
+ #### Reinforcement Learning with Verifiable Rewards Data
82
+
83
+ In this phase, both the programming tasks and their test cases were generated automatically using a large language model (OpenAI o3-mini).
84
+
85
+ 1) The model received detailed instructions regarding:
86
+ - the expected format of the task descriptions
87
+ - the difficulty level of the problems
88
+ - the structure and format of the test cases
89
+ 2) To ensure a wide variety of tasks, 30 distinct themes were defined, including:
90
+ > *string manipulation and formatting, basic array processing (1D arrays), simple numeric sequences, frequency counting in arrays, finding prime numbers, basic sorting algorithms on 1D arrays, simple recursive functions, pattern detection in strings, calculating GCD and LCM, basic statistics (mean, median), string encoding/decoding, subarray sums, basic combinatorial calculations, bitwise operations, date and time manipulation, palindrome substring detection, basic hashing techniques, number base conversions, array rotation (1D), counting unique elements, string compression, validating numeric strings, string reversal with conditions, generating Fibonacci sequence, checking balanced parentheses, basic queue and stack problems (using 1D arrays), counting vowels and consonants, integer factorization, simple encryption/decryption, basic logical puzzles.*
91
+
92
+ 3) For each theme, the model was prompted once to generate 10 unique programming problems and their corresponding test cases.
93
+
94
+ This final step was key to generating high-quality synthetic data. Without a clearly defined theme, the model tends to repeat or default to similar types of tasks.
95
+ By guiding generation through specific topics, I built a synthetic dataset of 300 examples, each composed of a task and a corresponding test case.
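+
+ A minimal sketch of that loop, under stated assumptions: `generate` and `parse_problems` are hypothetical callables (the original used OpenAI o3-mini with more detailed prompts and parsing):
+
+ ```python
+ # One prompt per theme, each asking for 10 unique problems with a matching
+ # test case (30 themes x 10 problems = 300 examples).
+ THEMES = [
+     "string manipulation and formatting",
+     "basic array processing (1D arrays)",
+     # ... the remaining 28 themes from the list above
+ ]
+
+ PROMPT_TEMPLATE = (
+     "Generate 10 unique Fortran 90 programming problems about {theme}. "
+     "For each problem, include a task description and one stdin/stdout test case."
+ )
+
+ def build_dataset(generate, parse_problems):
+     """generate: hypothetical LLM call; parse_problems: hypothetical parser."""
+     dataset = []
+     for theme in THEMES:
+         response = generate(PROMPT_TEMPLATE.format(theme=theme))
+         dataset.extend(parse_problems(response))
+     return dataset
+ ```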
+
+ ### Training Procedure
+
+ #### Supervised Fine-Tuning
+
+ The annotated example dataset was split into training and validation sets (80/20 split) and used to perform full fine-tuning of the model.
+
+ Training was carried out for 10 epochs. The key hyperparameters were as follows; a hedged TRL configuration sketch appears after the loss curves:
+
+ * batch size = 4
+ * gradient accumulation steps = 4
+ * learning rate = 2e-5
+ * learning rate scheduler = cosine
+ * weight decay = 0.01
+
+ ![Training Loss](./imgs/sft_train_loss.png)
+ ![Evaluation Loss](./imgs/sft_eval_loss.png)
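+
+ For reference, a hedged sketch of how these hyperparameters could map onto a TRL `SFTConfig`; the exact training script is not part of this card, and the output path and dataset variables are placeholders:
+
+ ```python
+ from trl import SFTConfig, SFTTrainer
+
+ # Hyperparameters from the list above; everything else is an assumption.
+ config = SFTConfig(
+     output_dir="outputs",              # placeholder path
+     num_train_epochs=10,
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     learning_rate=2e-5,
+     lr_scheduler_type="cosine",
+     weight_decay=0.01,
+ )
+ trainer = SFTTrainer(
+     model="unsloth/Qwen2.5-Coder-3B-Instruct",
+     args=config,
+     train_dataset=train_ds,            # the synthetic SFT split (not shown)
+     eval_dataset=val_ds,
+ )
+ trainer.train()
+ ```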
+
+ #### Reinforcement Learning with Verifiable Rewards
+
+ In this stage, a QLoRA adapter was trained with the GRPO algorithm to refine the generated Fortran programs. The goal was to reduce compilation errors and further improve the accuracy of the generated solutions.
+
+ The model was quantized to 4-bit, and a LoRA adapter was used with `rank=32` and `alpha=64`.
+
+ The reward function used throughout this phase was very simple:
+
+ * a reward of 1 was given if the generated program compiled successfully
+ * an additional 3 points were awarded if it passed the test case
+
+ This yields a reward range of [0, 4]; a sketch is shown below.
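+
+ A minimal sketch of this scheme (the GRPO plumbing is not shown; compilation and test execution are assumed to be checked elsewhere, e.g. with a helper like the one in the Uses section):
+
+ ```python
+ # Reward scheme from above: +1 for successful compilation,
+ # +3 more for passing the test case, so rewards lie in [0, 4].
+ def reward(compiles: bool, passes_test: bool) -> float:
+     score = 0.0
+     if compiles:
+         score += 1.0
+         if passes_test:  # a program can only pass the test if it compiled
+             score += 3.0
+     return score
+ ```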
+
+ The initial training phase was run for 3 epochs with the following settings (a configuration sketch follows the reward curves):
+
+ * batch size = 16
+ * number of generations = 4
+ * learning rate = 1e-5
+ * learning rate scheduler = cosine
+ * weight decay = 0.1
+ * max gradient norm = 0.5
+
+ ![Compile Reward](./imgs/grpo_1_compile_reward.png)
+ ![Correct Reward](./imgs/grpo_1_correct_reward.png)
+ ![Tot Reward](./imgs/grpo_1_tot_reward.png)
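+
+ As with the SFT stage, a hedged sketch of how these settings might map onto TRL's `GRPOConfig`; mapping "batch size" to `per_device_train_batch_size` is an assumption, and the reward wiring is omitted:
+
+ ```python
+ from trl import GRPOConfig
+
+ # First-phase hyperparameters from the list above; the path is a placeholder.
+ config = GRPOConfig(
+     output_dir="grpo_outputs",         # placeholder path
+     num_train_epochs=3,
+     per_device_train_batch_size=16,
+     num_generations=4,                 # completions sampled per prompt
+     learning_rate=1e-5,
+     lr_scheduler_type="cosine",
+     weight_decay=0.1,
+     max_grad_norm=0.5,
+ )
+ ```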
+
+ A second phase followed, resetting the learning rate to `1e-6` with a linear decay schedule.
+
+ ![Compile Reward](./imgs/grpo_2_compile_reward.png)
+ ![Correct Reward](./imgs/grpo_2_correct_reward.png)
+ ![Tot Reward](./imgs/grpo_2_tot_reward.png)
+
+ ## Citation
+
+ If you use this model or parts of this work, please consider citing the references below.
+
+ ## References
+
+ * Qwen/Qwen2.5-Coder-3B-Instruct: [https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct)
+ * OpenAI o3-mini: [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models)
+ * OpenAI GPT-4o: [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models)
+ * Group Relative Policy Optimization (GRPO): [https://arxiv.org/abs/2402.03300](https://arxiv.org/abs/2402.03300)
+ * Unsloth (fast and memory-efficient fine-tuning via QLoRA): [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
+ * Hugging Face Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
+
+ ## Disclaimer on Use of Proprietary Models
+
+ Some of the training data used for this model was generated or labeled with proprietary large language models, namely OpenAI o3-mini and GPT-4o. These models were used to synthesize programming tasks, adapt natural-language descriptions, and automatically label code solutions for supervised fine-tuning and reinforcement learning.
+
+ No raw outputs from these proprietary models are included in this repository or redistributed in any form. All generated data has been filtered, validated, and used solely to train a distinct, task-specific model.
+
+ This model is **not intended to replicate or imitate any specific proprietary system**; it is designed only for a specialized use case (program generation in Fortran) and for research purposes.
+
  ## Training procedure

  [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/giulioleonardi2001-universit-di-pisa/QwenCoder_fine_tuning/runs/57yilwhe)