Update README.md
According to the current demo version of the FortranHumanEval benchmark:

| Model | pass@1 | Compile error rate |
|---|---|---|
| GPT-4o mini | 18.90% | 43.90% |
| GPT-4o | 32.31% | 17.07% |
Compared to its base model (Qwen 2.5 Coder 3B Instruct), FortranCodeGen 3B shows a strong improvement, increasing pass@1 accuracy from 5.48% to 23.17% and reducing the compile error rate from 63.41% to 17.68%. This highlights the effectiveness of a simple fine-tuning process, even when performed with limited resources: no human-labeled data, a small synthetic dataset, and training on a single L4 GPU.
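For context on how these numbers are typically produced: pass@1 here presumably follows the standard HumanEval pass@k estimator (the exact evaluation code for FortranHumanEval is not shown in this README, so treat this as an illustrative sketch, not the benchmark's actual implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    pass@k = 1 - C(n - c, k) / C(n, k),
    where n = samples generated per task and c = samples passing all tests.
    """
    if n - c < k:
        # Every size-k draw must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 the estimator reduces to the plain pass rate c / n,
# e.g. 50 passing samples out of 200 gives pass@1 = 0.25.
print(pass_at_k(200, 50, 1))  # 0.25
```

The per-task scores are then averaged over the benchmark; the compile error rate is simply the fraction of generated programs that fail to compile.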
When compared to GPT-4o mini, FortranCodeGen 3B outperforms it on both pass@1 accuracy (23.17% vs. 18.90%) and compile reliability (17.68% vs. 43.90% compile error rate). This suggests that task-specific fine-tuning can produce better results than more general, and likely much larger, models.
While it doesn't yet match the overall performance of GPT-4o (32.31% pass@1), FortranCodeGen 3B reaches a comparable level of compilation correctness (17.68% vs. 17.07% compile error rate), suggesting that its outputs are syntactically robust and close to executable even when they don't solve the full task.
These results confirm that targeted specialization can significantly enhance model performance on underrepresented tasks, and suggest a promising direction for very-low-resource fine-tuning in legacy or niche programming languages.
## Uses