# Model Card: palmyra-mini-thinking-b

## Introduction
Palmyra-mini-thinking-b is a generative model specialized for complex reasoning and problem solving. It performs particularly well on mathematical and programming challenges, reflecting a robust grasp of abstract concepts and logical structure. These strengths stem from specialized training aimed at tasks that demand deep, multi-step thinking.

## Mathematical Prowess
The model's mathematical abilities are particularly noteworthy. It scores 0.925 on the AMC23 benchmark, indicating a strong grasp of advanced high-school mathematics, and 0.882 on MATH500, demonstrating proficiency across a wide range of mathematical problems. It also holds up in competition-level mathematics, scoring 0.6 on AIME24 (pass@1, avg-of-1) and 0.5733 on OlympiadBench (extractive_match). These results highlight the model's capacity for sophisticated mathematical reasoning, making it a useful tool for both educational and research applications.

## Excellence in Competitive Programming
Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.
## Benchmark Scores

| Benchmark | Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match) | 0.4268 |
| minerva_math (exact_match) | 0.0708 |
| mmlu_pro (exact_match) | 0.2926 |
| hendrycks_math | 0.0016 |
| ifeval (inst_level_loose_acc) | 0.3297 |
| mathqa (acc) | 0.3045 |
| humaneval (pass@1) | 0.0732 |
| BBH (get-answer, exact_match) | 0.288 |
| mbpp | 0.168 |
| leaderboard_musr (acc_norm) | 0.3796 |
| gpqa diamond (lighteval, pass@1:8_samples) | 0.3958 |
| AIME24 (pass@1, avg-of-1) | 0.6 |
| AIME25 (pass@1, avg-of-1) | 0.5 |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) | 0.2873 |
| AMC23 | 0.925 |
| MATH500 | 0.882 |
| Minerva | 0.2941 |
| OlympiadBench (extractive_match) | 0.5733 |
| Codecontests (pass_rate) | 0.2018 |
| Codeforces (pass_rate) | 0.6343 |
| Taco (pass_rate) | 0.3456 |
| APPS (all_levels) | 0.0584 |
| HMMT23 (extractive_match) | 0.2333 |
| Average | 0.359378 |
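
The reported average is the unweighted mean of the 23 individual benchmark scores. As a quick sanity check (not part of any evaluation harness), it can be reproduced with a few lines of Python:

```python
# Benchmark scores from the table above, in table order.
scores = [
    0.4268, 0.0708, 0.2926, 0.0016, 0.3297, 0.3045, 0.0732, 0.288,
    0.168, 0.3796, 0.3958, 0.6, 0.5, 0.2873, 0.925, 0.882, 0.2941,
    0.5733, 0.2018, 0.6343, 0.3456, 0.0584, 0.2333,
]

# Unweighted mean, rounded to the six decimal places reported above.
average = round(sum(scores) / len(scores), 6)
print(average)  # 0.359378
```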
## Intended Use
This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
## Limitations
The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.
## Ethical Considerations
As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.