GiuLeo01 committed on
Commit 5765c17 · verified · 1 Parent(s): 8f97403

Update README.md

Files changed (1): README.md (+163 -9)

README.md CHANGED
@@ -1,31 +1,185 @@
  ---
  base_model: unsloth/Qwen2.5-Coder-3B-Instruct
  library_name: transformers
- model_name: outputs
  tags:
- - generated_from_trainer
  - unsloth
- - trl
  - sft
- licence: license
  ---

- # Model Card for outputs

- This model is a fine-tuned version of [unsloth/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/unsloth/Qwen2.5-Coder-3B-Instruct).
- It has been trained using [TRL](https://github.com/huggingface/trl).

  ## Quick start

  ```python
  from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="GiuLeo01/outputs", device="cuda")
  output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
  print(output["generated_text"])
  ```

  ## Training procedure

  [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/giulioleonardi2001-universit-di-pisa/QwenCoder_fine_tuning/runs/57yilwhe)
 
  ---
  base_model: unsloth/Qwen2.5-Coder-3B-Instruct
  library_name: transformers
  tags:
  - unsloth
+ - grpo
  - sft
+ - fortran
  ---

+ # Introduction
+
+ <!-- Provide a quick summary of what the model is/does. -->
+ This model is a prototype large language model fine-tuned specifically for Fortran 90 code generation. It is based on Qwen 2.5 Coder 3B Instruct and was refined using Supervised Fine-Tuning.
+
+ **There is a more powerful version of this model, which has additionally been fine-tuned using Reinforcement Learning with Verifiable Rewards (via GRPO).**
+
+ The model was fine-tuned briefly, without any human-labeled data, and using only a single consumer GPU. Despite these constraints, training led to a roughly 400% boost in performance on simple to moderately complex Fortran program generation tasks (HumanEval-like). Compilation errors dropped sharply as well, and the model now performs close to much larger general-purpose models that were not specifically trained for this task.
+
+ ## Evaluation
+ Because no established benchmark exists for Fortran code generation, a quick-and-dirty Fortran adaptation of the HumanEval benchmark was created to evaluate the model. This benchmark is currently under review and will be released publicly at a later date.
+
+ According to the current demo version of the FortranHumanEval benchmark:
+
+ | Model | pass@1 | Compile Error Rate |
+ |----------------------------------|--------|--------------------|
+ | **FortranCodeGen 3B** | 23.17% | 17.68% |
+ | **FortranCodeGen 3B (SFT only)** | 19.51% | 31.09% |
+ | Qwen 2.5 Coder 3B Instruct | 5.48% | 63.41% |
+ | GPT-4o mini | 18.90% | 43.90% |
+ | GPT-4o | 32.31% | 17.07% |
+
+ Compared to its base model (Qwen 2.5 Coder 3B Instruct), FortranCodeGen 3B shows a strong improvement, raising pass@1 accuracy from 5.48% to 23.17% and cutting the compile error rate from 63.41% to 17.68%. This highlights the effectiveness of this simple fine-tuning process, even though it was performed with limited resources: no human-labeled data, a small synthetic dataset, and training on a single consumer GPU (an L4 :'( ).
+
+ Compared to GPT-4o mini, FortranCodeGen 3B performs better on both pass@1 accuracy (23.17% vs. 18.90%) and compile reliability (17.68% vs. 43.90%). This suggests that task-specific fine-tuning can produce better results than more general, (probably) larger models.
+
+ While it does not yet match the overall performance of GPT-4o, which achieves 32.31% pass@1, FortranCodeGen 3B reaches a comparable level of compilation correctness (17.68% vs. 17.07%), suggesting that its outputs are syntactically robust and close to executable even when they do not solve the full task.
+
+ These results confirm that targeted specialization can significantly improve performance on underrepresented tasks, and they suggest a promising direction for very-low-resource fine-tuning in legacy or niche programming languages.
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ The model is highly specialized in generating Fortran 90 programs that read input from stdin and print output to stdout. It is recommended to run the model with a low temperature (or to disable sampling entirely) to maximize accuracy.
+
+ Before running any generated code, it is always a good idea to check how the program handles input from stdin, especially if you are new to Fortran. A minimal way to perform that check is sketched below.
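+ The sketch compiles a generated program with `gfortran` and runs it on sample input. It is a minimal example, assuming `gfortran` is installed and on `PATH`; the helper name and file layout are illustrative, not part of this repository:
+
+ ```python
+ import pathlib
+ import subprocess
+ import tempfile
+
+ def check_fortran_program(source: str, stdin_text: str) -> str:
+     """Compile a generated Fortran 90 program and run it on sample stdin."""
+     with tempfile.TemporaryDirectory() as tmp:
+         src = pathlib.Path(tmp) / "prog.f90"
+         src.write_text(source)
+         exe = pathlib.Path(tmp) / "prog"
+         # Compile with gfortran; raises CalledProcessError on a compile error.
+         subprocess.run(["gfortran", str(src), "-o", str(exe)], check=True)
+         # Run the binary, feeding the sample input on stdin.
+         result = subprocess.run(
+             [str(exe)], input=stdin_text, capture_output=True, text=True, check=True
+         )
+         return result.stdout
+
+ # Example: feed input that matches the program's expected stdin format.
+ # print(check_fortran_program(generated_code, "3\n1.0 2.0 3.0\n"))
+ ```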
+
  ## Quick start

  ```python
  from transformers import pipeline

+ question = "Write me a Fortran program that, given an array of real numbers from stdin, prints the average."
+ generator = pipeline("text-generation", model="GiuLeo01/FortranCodeGen-3b-SynthData", device="cuda")
  output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
  print(output["generated_text"])
  ```
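+
+ Following the recommendation in the Uses section, sampling can also be disabled for deterministic output. The settings below (`do_sample=False` and a larger `max_new_tokens` budget) are illustrative assumptions, not values prescribed by this card:
+
+ ```python
+ # Greedy decoding: do_sample=False disables sampling entirely; 512 new tokens
+ # (an illustrative budget) leaves room for a complete Fortran program.
+ output = generator(
+     [{"role": "user", "content": question}],
+     max_new_tokens=512,
+     do_sample=False,
+     return_full_text=False,
+ )[0]
+ print(output["generated_text"])
+ ```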
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The goal of this experiment was to specialize a model in a complex task, Fortran code generation, without using manually annotated data, which is particularly hard to find for this programming language.
+
+ #### Supervised Data
+ 1) A subset of the MBPP dataset (~600 examples) was selected.
+ 2) The task descriptions were automatically adapted to Fortran with precise instructions, using OpenAI o3-mini.
+ 3) Tasks were filtered using embeddings and manually reviewed to ensure that no examples too similar to HumanEval tasks were included in the training set.
+ 4) Each task was automatically labeled by three stronger (and bigger) models: OpenAI o3-mini, Qwen 2.5 Coder 32B, and OpenAI GPT-4o.
+ 5) Labels were automatically validated through unit tests.
+ 6) Only correct solutions were kept, at most one per task, prioritized in the following order: OpenAI o3-mini > Qwen 2.5 Coder 32B > OpenAI GPT-4o (see the sketch below).
+
+ This simple process led to the creation of a small, synthetically labeled training set used for supervised fine-tuning.
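+
+ The selection in step 6 can be sketched as follows; the function and data layout are hypothetical, and only the priority order comes from the description above:
+
+ ```python
+ # Hypothetical sketch of step 6: keep at most one verified solution per task,
+ # preferring o3-mini, then Qwen 2.5 Coder 32B, then GPT-4o.
+ PRIORITY = ["o3-mini", "qwen2.5-coder-32b", "gpt-4o"]
+
+ def select_solution(candidates):
+     """candidates maps a model name to a (code, passed_unit_tests) pair."""
+     for model in PRIORITY:
+         code, passed = candidates.get(model, (None, False))
+         if passed:
+             return code
+     return None  # the task is dropped if no model produced a correct solution
+ ```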
+
+ IMPORTANT: Do not evaluate this model on MBPP-derived benchmarks, due to the data overlap described above.
80
+
81
+ #### Reinforcement Learning with Verifiable Rewards Data
82
+
83
+ In this phase, both the programming tasks and their test cases were generated automatically using a large language model (OpenAI o3-mini).
84
+
85
+ 1) The model received detailed instructions regarding:
86
+ - the expected format of the task descriptions
87
+ - the difficulty level of the problems
88
+ - the structure and format of the test cases
89
+ 2) To ensure a wide variety of tasks, 30 distinct themes were defined, including:
90
+ > *string manipulation and formatting, basic array processing (1D arrays), simple numeric sequences, frequency counting in arrays, finding prime numbers, basic sorting algorithms on 1D arrays, simple recursive functions, pattern detection in strings, calculating GCD and LCM, basic statistics (mean, median), string encoding/decoding, subarray sums, basic combinatorial calculations, bitwise operations, date and time manipulation, palindrome substring detection, basic hashing techniques, number base conversions, array rotation (1D), counting unique elements, string compression, validating numeric strings, string reversal with conditions, generating Fibonacci sequence, checking balanced parentheses, basic queue and stack problems (using 1D arrays), counting vowels and consonants, integer factorization, simple encryption/decryption, basic logical puzzles.*
91
+
92
+ 3) For each theme, the model was prompted once to generate 10 unique programming problems and their corresponding test cases.
93
+
94
+ This final step was key to generating high-quality synthetic data. Without a clearly defined theme, the model tends to repeat or default to similar types of tasks.
95
+ By guiding generation through specific topics, I built a synthetic dataset of 300 examples, each composed of a task and a corresponding test case.
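+
+ A minimal sketch of that loop, under stated assumptions: `generate` and `parse_problems` are hypothetical callables (the original used OpenAI o3-mini with more detailed prompts and parsing):
+
+ ```python
+ # One prompt per theme, each asking for 10 unique problems with a matching
+ # test case (30 themes x 10 problems = 300 examples).
+ THEMES = [
+     "string manipulation and formatting",
+     "basic array processing (1D arrays)",
+     # ... the remaining 28 themes from the list above
+ ]
+
+ PROMPT_TEMPLATE = (
+     "Generate 10 unique Fortran 90 programming problems about {theme}. "
+     "For each problem, include a task description and one stdin/stdout test case."
+ )
+
+ def build_dataset(generate, parse_problems):
+     """generate: hypothetical LLM call; parse_problems: hypothetical parser."""
+     dataset = []
+     for theme in THEMES:
+         response = generate(PROMPT_TEMPLATE.format(theme=theme))
+         dataset.extend(parse_problems(response))
+     return dataset
+ ```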
+
+ ### Training Procedure
+
+ #### Supervised Fine-Tuning
+
+ The annotated example dataset was split into training and validation sets (80/20 split) and used to perform full fine-tuning of the model.
+
+ Training was carried out for 10 epochs. The key hyperparameters were as follows; a hedged TRL configuration sketch appears after the loss curves:
+
+ * batch size = 4
+ * gradient accumulation steps = 4
+ * learning rate = 2e-5
+ * learning rate scheduler = cosine
+ * weight decay = 0.01
+
+ ![Training Loss](./imgs/sft_train_loss.png)
+ ![Evaluation Loss](./imgs/sft_eval_loss.png)
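+
+ For reference, a hedged sketch of how these hyperparameters could map onto a TRL `SFTConfig`; the exact training script is not part of this card, and the output path and dataset variables are placeholders:
+
+ ```python
+ from trl import SFTConfig, SFTTrainer
+
+ # Hyperparameters from the list above; everything else is an assumption.
+ config = SFTConfig(
+     output_dir="outputs",              # placeholder path
+     num_train_epochs=10,
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=4,
+     learning_rate=2e-5,
+     lr_scheduler_type="cosine",
+     weight_decay=0.01,
+ )
+ trainer = SFTTrainer(
+     model="unsloth/Qwen2.5-Coder-3B-Instruct",
+     args=config,
+     train_dataset=train_ds,            # the synthetic SFT split (not shown)
+     eval_dataset=val_ds,
+ )
+ trainer.train()
+ ```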
+
+ #### Reinforcement Learning with Verifiable Rewards
+
+ In this stage, a QLoRA adapter was trained with the GRPO algorithm to refine the generated Fortran programs. The goal was to reduce compilation errors and further improve the accuracy of the generated solutions.
+
+ The model was quantized to 4-bit, and a LoRA adapter was used with `rank=32` and `alpha=64`.
+
+ The reward function used throughout this phase was very simple:
+
+ * a reward of 1 was given if the generated program compiled successfully
+ * an additional 3 points were awarded if it passed the test case
+
+ This yields a reward range of [0, 4]; a sketch is shown below.
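+
+ A minimal sketch of this scheme (the GRPO plumbing is not shown; compilation and test execution are assumed to be checked elsewhere, e.g. with a helper like the one in the Uses section):
+
+ ```python
+ # Reward scheme from above: +1 for successful compilation,
+ # +3 more for passing the test case, so rewards lie in [0, 4].
+ def reward(compiles: bool, passes_test: bool) -> float:
+     score = 0.0
+     if compiles:
+         score += 1.0
+         if passes_test:  # a program can only pass the test if it compiled
+             score += 3.0
+     return score
+ ```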
+
+ The initial training phase was run for 3 epochs with the following settings (a configuration sketch follows the reward curves):
+
+ * batch size = 16
+ * number of generations = 4
+ * learning rate = 1e-5
+ * learning rate scheduler = cosine
+ * weight decay = 0.1
+ * max gradient norm = 0.5
+
+ ![Compile Reward](./imgs/grpo_1_compile_reward.png)
+ ![Correct Reward](./imgs/grpo_1_correct_reward.png)
+ ![Tot Reward](./imgs/grpo_1_tot_reward.png)
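+
+ As with the SFT stage, a hedged sketch of how these settings might map onto TRL's `GRPOConfig`; mapping "batch size" to `per_device_train_batch_size` is an assumption, and the reward wiring is omitted:
+
+ ```python
+ from trl import GRPOConfig
+
+ # First-phase hyperparameters from the list above; the path is a placeholder.
+ config = GRPOConfig(
+     output_dir="grpo_outputs",         # placeholder path
+     num_train_epochs=3,
+     per_device_train_batch_size=16,
+     num_generations=4,                 # completions sampled per prompt
+     learning_rate=1e-5,
+     lr_scheduler_type="cosine",
+     weight_decay=0.1,
+     max_grad_norm=0.5,
+ )
+ ```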
+
+ A second phase followed, resetting the learning rate to `1e-6` with a linear decay schedule.
+
+ ![Compile Reward](./imgs/grpo_2_compile_reward.png)
+ ![Correct Reward](./imgs/grpo_2_correct_reward.png)
+ ![Tot Reward](./imgs/grpo_2_tot_reward.png)
+
+ ## Citation
+
+ If you use this model or parts of this work, please consider citing the references below.
+
+ ## References
+
+ * Qwen/Qwen2.5-Coder-3B-Instruct: [https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct)
+ * OpenAI o3-mini: [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models)
+ * OpenAI GPT-4o: [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models)
+ * Group Relative Policy Optimization (GRPO): [https://arxiv.org/abs/2402.03300](https://arxiv.org/abs/2402.03300)
+ * Unsloth (fast and memory-efficient fine-tuning via QLoRA): [https://github.com/unslothai/unsloth](https://github.com/unslothai/unsloth)
+ * Hugging Face Transformers: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)
+
+ ## Disclaimer on Use of Proprietary Models
+
+ Some of the training data used for this model was generated or labeled with proprietary large language models, namely OpenAI o3-mini and GPT-4o. These models were used to synthesize programming tasks, adapt natural-language descriptions, and automatically label code solutions for supervised fine-tuning and reinforcement learning.
+
+ No raw outputs from these proprietary models are included in this repository or redistributed in any form. All generated data has been filtered, validated, and used solely to train a distinct, task-specific model.
+
+ This model is **not intended to replicate or imitate any specific proprietary system**; it is designed only for a specialized use case (program generation in Fortran) and for research purposes.
+
  ## Training procedure

  [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/giulioleonardi2001-universit-di-pisa/QwenCoder_fine_tuning/runs/57yilwhe)