Accuracy by dataset:

| Model | FinQA | ConvFinQA | TATQA |
|---|---|---|---|
| Llama 3 70B Instruct | 0.809 | 0.709 | 0.772 |
| Llama 3 8B Instruct | 0.767 | 0.268 | 0.706 |
| DBRX Instruct | 0.738 | 0.252 | 0.633 |
| DeepSeek LLM (67B) | 0.742 | 0.174 | 0.355 |
| Gemma 2 27B | 0.768 | 0.268 | 0.734 |
| Gemma 2 9B | 0.779 | 0.292 | 0.750 |
| Mistral (7B) Instruct v0.3 | 0.655 | 0.199 | 0.553 |
| Mixtral-8x22B Instruct | 0.766 | 0.285 | 0.666 |
| Mixtral-8x7B Instruct | 0.611 | 0.315 | 0.501 |
| Qwen 2 Instruct (72B) | 0.819 | 0.269 | 0.715 |
| WizardLM-2 8x22B | 0.796 | 0.247 | 0.725 |
| DeepSeek-V3 | 0.840 | 0.261 | 0.779 |
| DeepSeek R1 | 0.836 | 0.853 | 0.858 |
| QwQ-32B-Preview | 0.793 | 0.282 | 0.796 |
| Jamba 1.5 Mini | 0.666 | 0.218 | 0.586 |
| Jamba 1.5 Large | 0.790 | 0.225 | 0.660 |
| Claude 3.5 Sonnet | 0.844 | 0.402 | 0.700 |
| Claude 3 Haiku | 0.803 | 0.421 | 0.733 |
| Cohere Command R 7B | 0.709 | 0.212 | 0.716 |
| Cohere Command R + | 0.776 | 0.259 | 0.698 |
| Google Gemini 1.5 Pro | 0.829 | 0.280 | 0.763 |
| OpenAI gpt-4o | 0.836 | 0.749 | 0.754 |
| OpenAI o1-mini | 0.799 | 0.840 | 0.698 |
Note: In the original rendering, color highlighting indicated performance ranking (Best, Strong, Good); the highlighting is not preserved in this plain-text table.