fakerbaby committed
Commit e0f5026 · verified · 1 Parent(s): 318fc98

Update README.md

Files changed (1):
  1. README.md +14 -14
README.md CHANGED
@@ -78,25 +78,25 @@ These innovations lead to Broad Reasoning Generalization, allowing our RL-powere
 # Visual-Language Models Benchmark Comparison
 
 | Category | Benchmark | Metric | Skywork-38B | QVQ-72B | InternVL-78B | Qwen-72B | Claude 3.7 | GPT-4o |
-|----------------|-------------------------|---------|------------:|--------:|-------------:|--------:|----------:|-------:|
+|----------------|-------------------------|---------|------------:|--------:|-------------:|--------:|----------:|---------:|
 | **General** | MMMU (val) | Acc. | 🏆 **76.0** | 70.3 | 72.2 | 70.3 | 75.0 | 70.7 |
-| | EMMA (mini-cot) | Acc. | 40.3 | 32.0 | 38.3 | 39.3 | 🏆 **56.5** | 36.0 |
+| | EMMA (mini-cot) | Acc. | 40.3 | 32.0 | 38.3 | 39.3 | **56.5** | 36.0 |
 | | MMMU-pro | Acc. | 🏆 **55.4** | 46.9* | 48.6 | 51.1 | 50.0 | 54.5 |
 | | MMK12 | Acc. | 🏆 **78.5** | 62.7* | 67.4* | 70.5* | 55.3 | 49.9 |
-| | MMstar | Acc. | 70.6 | 60.8 | 🏆 **72.5** | 70.8 | 68.8 | 65.1 |
+| | MMstar | Acc. | 70.6 | 60.8 | **72.5** | 70.8 | 68.8 | 65.1 |
-| | MMBench-en-1.1 | Acc. | 85.7 | 72.6* | 87.7 | 🏆 **88.0** | 82.0 | 84.3 |
+| | MMBench-en-1.1 | Acc. | 85.7 | 72.6* | 87.7 | **88.0** | 82.0 | 84.3 |
-| | HallusionBench | Acc. | 🏆**61.3** | 55.3* | 59.1 | 55.2 | 58.3 | 56.2 |
+| | HallusionBench | Acc. | 🏆 **61.3** | 55.3* | 59.1 | 55.2 | 58.3 | 56.2 |
-| **Mathematics**| MathVista (mini) | Acc. | 77.1 | 71.4 | 🏆 **79.0** | 74.8 | 66.8 | 62.9 |
+| **Mathematics**| MathVista (mini) | Acc. | 77.1 | 71.4 | **79.0** | 74.8 | 66.8 | 62.9 |
-| | MathVerse (vision-only) | Acc. | 🏆**59.6** | 45.1 | 51.0 | 57.6 | **49.9*** | 49.9 |
+| | MathVerse (vision-only) | Acc. | 🏆 **59.6** | 45.1 | 51.0 | 57.6 | 49.9* | 49.9 |
-| | MathVision | Acc. | 52.6 | 35.9 | 43.1 | 38.1 | 🏆 58.6 | 31.2 |
+| | MathVision | Acc. | 52.6 | 35.9 | 43.1 | 38.1 | 58.6 | 31.2 |
-| | WeMath (strict) | Acc. |🏆**56.5** | 37.7 | 46.1 | 50.6 | 48.9* | 50.6 |
+| | WeMath (strict) | Acc. |🏆 **56.5** | 37.7 | 46.1 | 50.6 | 48.9* | 50.6 |
-| **Logic** | Visulogic | Acc. | 🏆**28.5** | 23.5* | 27.7 | 26.2 | 25.9 | 26.3 |
+| **Logic** | Visulogic | Acc. | 🏆 **28.5** | 23.5* | 27.7 | 26.2 | 25.9 | 26.3 |
-| | LogicVista | Acc. | 59.7 | 53.8 | 55.9 | 57.1 | 60.6* | 🏆 **64.4** |
+| | LogicVista | Acc. | 59.7 | 53.8 | 55.9 | 57.1 | 60.6* | **64.4** |
-| | MME-reasoning | Acc. | 🏆**42.8** | 35.2 | 32.1 | 34.1 | 34.1 | 30.2 |
+| | MME-reasoning | Acc. | 🏆 **42.8** | 35.2 | 32.1 | 34.1 | 34.1 | 30.2 |
 | **Physics** | PhyX (mc-text-minimal) | Acc. | 🏆 **52.8** | 35.2* | 40.5 | 44.8 | 41.6 | 43.8 |
-| | SeePhys | Acc. | 31.5 | 22.5 | 19.0* | 24.2 | 🏆 **34.6** | 21.9 |
+| | SeePhys | Acc. | 31.5 | 22.5 | 19.0* | 24.2 | **34.6** | 21.9 |
 
-🏆 **Top performer** in each benchmark
+🏆 **Top performer** of Skywork-R1V3 in each benchmark
 [*] indicates results from our evaluation framework.
 
 ## 4. Usage
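The commit's legend change restricts the 🏆 marker to benchmarks where Skywork-R1V3 has the top score, so the marker placement can be checked mechanically against each row. A minimal sketch of such a check (the `top_score_index` helper is hypothetical, and assumes rows are formatted exactly like the markdown table above, with three label columns before the scores):

```python
import re

def top_score_index(row: str) -> int:
    """Given one '| ... |' markdown table row, return the 0-based index
    (among the score columns) of the highest score, i.e. the column that
    should carry the trophy marker."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    scores = []
    for cell in cells[3:]:  # skip the Category, Benchmark, and Metric columns
        m = re.search(r"\d+(?:\.\d+)?", cell)  # ignore **bold** and * markers
        scores.append(float(m.group()) if m else float("-inf"))
    return max(range(len(scores)), key=scores.__getitem__)

row = "| **General** | MMMU (val) | Acc. | 76.0 | 70.3 | 72.2 | 70.3 | 75.0 | 70.7 |"
print(top_score_index(row))  # 0 -> the Skywork-38B column leads this row
```

Under the new legend, a row would get the 🏆 marker only when this index points at the Skywork column; rows like MathVision, where another model leads, stay unmarked.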