Commit d69be75 by tperes (verified) · 1 parent: ad8ebba

Create README.md
# Model Card: palmyra-mini-thinking-b

## Introduction

Palmyra-mini-thinking-b is a generative model specialized for complex reasoning and problem solving. It performs strongly on mathematical and programming challenges, showing a robust grasp of abstract concepts and logical structure. This performance reflects specialized training aimed at tasks that demand deep, multi-step thinking.
## Mathematical Prowess

The model's mathematical abilities are particularly noteworthy. It scores 0.925 on the AMC23 benchmark, indicating a strong grasp of advanced high-school mathematics, and 0.882 on MATH500, demonstrating proficiency across a wide range of problem types. It also holds up in competition settings, scoring 0.6 on AIME24 (pass@1, avg-of-1) and 0.5733 on OlympiadBench (extractive_match). These results point to a capacity for sophisticated mathematical reasoning that suits both educational and research applications.
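The pass@1 scores above are per-problem solve rates averaged over a single sample per problem. For context, a sketch of the unbiased pass@k estimator commonly used in code and reasoning evaluations (it is an assumption, not stated in this card, that these harnesses use this exact formula):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), is correct."""
    if n - c < k:  # every size-k draw must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (avg-of-1), pass@1 reduces to c/n:
print(pass_at_k(n=1, c=1, k=1))  # 1.0 for a solved problem
```

With k = 1 this reduces to the fraction of problems solved on the first attempt, which is how the avg-of-1 figures should be read.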
## Excellence in Competitive Programming

Beyond mathematics, Palmyra-mini-thinking-b demonstrates strong performance in the competitive programming arena. Its score of 0.6343 on the Codeforces (pass_rate) benchmark underscores its ability to understand complex algorithmic problems and generate correct, efficient code. This capability suggests the model is well-suited for tasks involving code generation, debugging, and algorithmic design, making it a valuable asset for software developers and computer science researchers.
## Benchmark Scores
| Benchmark                                                        |    Score |
|:-----------------------------------------------------------------|---------:|
| gsm8k (strict-match)                                             |   0.4268 |
| minerva_math (exact_match)                                       |   0.0708 |
| mmlu_pro (exact_match)                                           |   0.2926 |
| hendrycks_math                                                   |   0.0016 |
| ifeval (inst_level_loose_acc)                                    |   0.3297 |
| mathqa (acc)                                                     |   0.3045 |
| humaneval (pass@1)                                               |   0.0732 |
| BBH (get-answer)(exact_match)                                    |   0.288  |
| mbpp                                                             |   0.168  |
| leaderboard_musr (acc_norm)                                      |   0.3796 |
| gpqa (lighteval gpqa diamond, pass@1, 8 samples)                 |   0.3958 |
| AIME24 (pass@1, avg-of-1)                                        |   0.6    |
| AIME25 (pass@1, avg-of-1)                                        |   0.5    |
| Livecodebench-codegen (livecodebench/code_generation_lite v4_v5) |   0.2873 |
| AMC23                                                            |   0.925  |
| MATH500                                                          |   0.882  |
| Minerva                                                          |   0.2941 |
| OlympiadBench (extractive_match)                                 |   0.5733 |
| Codecontests (pass_rate)                                         |   0.2018 |
| Codeforces (pass_rate)                                           |   0.6343 |
| Taco (pass_rate)                                                 |   0.3456 |
| APPS (all_levels)                                                |   0.0584 |
| HMMT23 (extractive_match)                                        |   0.2333 |
| Average                                                          | 0.359378 |
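The Average row is the plain arithmetic mean of the 23 benchmark scores; a quick sketch to reproduce it, with the values copied from the table in row order:

```python
# Scores from the benchmark table above, in row order.
scores = [
    0.4268, 0.0708, 0.2926, 0.0016, 0.3297, 0.3045, 0.0732, 0.288,
    0.168, 0.3796, 0.3958, 0.6, 0.5, 0.2873, 0.925, 0.882, 0.2941,
    0.5733, 0.2018, 0.6343, 0.3456, 0.0584, 0.2333,
]

average = sum(scores) / len(scores)
print(round(average, 6))  # 0.359378
```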
## Intended Use

This model is intended for research and development in the field of generative AI, particularly for tasks requiring mathematical and logical reasoning.
## Limitations

The model's performance has been evaluated on a specific set of benchmarks. Its performance on other tasks or in real-world applications may vary.
## Ethical Considerations

As with any language model, there is a potential for generating biased or inaccurate information. Users should be aware of these limitations and use the model responsibly.