FLaME: Financial Language Model Evaluation Results

This page presents the results of the FLaME evaluation across various financial NLP tasks. Each tab shows performance metrics for a different task category.

Overall Performance Across All Tasks

Model | Information Retrieval: FiNER F1, FR F1, RD F1, FNXL F1 | Sentiment Analysis: FE F1, FiQA MSE, SQA F1, FPB F1 | Causal Analysis: CD F1, CC F1 | Text Classification: B77 F1, FB F1, FOMC F1, NC F1, HL Acc | Question Answering: CFQA Acc, FinQA Acc, TQA Acc | Summarization: ECTSum BERT-F1, EDTSum BERT-F1
Llama 3 70B Instruct | 0.701 0.332 0.883 0.020 | 0.469 0.123 0.535 0.902 | 0.142 0.192 | 0.645 0.309 0.652 0.386 0.811 | 0.709 0.809 0.772 | 0.754 0.817
Llama 3 8B Instruct | 0.565 0.289 0.705 0.003 | 0.350 0.161 0.600 0.698 | 0.049 0.234 | 0.512 0.659 0.497 0.511 0.763 | 0.268 0.767 0.706 | 0.757 0.811
DBRX Instruct | 0.489 0.304 0.778 0.009 | 0.006 0.160 0.436 0.499 | 0.087 0.231 | 0.574 0.483 0.193 0.319 0.746 | 0.252 0.738 0.633 | 0.729 0.806
DeepSeek LLM (67B) | 0.745 0.334 0.879 0.007 | 0.416 0.118 0.462 0.811 | 0.025 0.193 | 0.578 0.492 0.407 0.151 0.778 | 0.174 0.742 0.355 | 0.681 0.807
Gemma 2 27B | 0.761 0.356 0.902 0.006 | 0.298 0.100 0.515 0.884 | 0.133 0.242 | 0.621 0.538 0.620 0.408 0.808 | 0.268 0.768 0.734 | 0.723 0.814
Gemma 2 9B | 0.651 0.331 0.892 0.005 | 0.367 0.189 0.491 0.940 | 0.105 0.207 | 0.609 0.541 0.519 0.365 0.856 | 0.292 0.779 0.750 | 0.585 0.817
Mistral (7B) Instruct v0.3 | 0.526 0.276 0.771 0.004 | 0.368 0.135 0.522 0.841 | 0.052 0.227 | 0.528 0.503 0.542 0.412 0.779 | 0.199 0.655 0.553 | 0.750 0.811
Mixtral-8x22B Instruct | 0.635 0.367 0.811 0.009 | 0.435 0.221 0.510 0.776 | 0.125 0.308 | 0.602 0.221 0.465 0.513 0.835 | 0.285 0.766 0.666 | 0.758 0.815
Mixtral-8x7B Instruct | 0.598 0.282 0.845 0.009 | 0.267 0.208 0.498 0.893 | 0.055 0.229 | 0.547 0.396 0.603 0.583 0.805 | 0.315 0.611 0.501 | 0.747 0.810
Qwen 2 Instruct (72B) | 0.748 0.348 0.854 0.012 | 0.483 0.205 0.576 0.901 | 0.190 0.184 | 0.627 0.495 0.605 0.639 0.830 | 0.269 0.819 0.715 | 0.752 0.811
WizardLM-2 8x22B | 0.744 0.355 0.852 0.008 | 0.226 0.129 0.566 0.779 | 0.114 0.201 | 0.648 0.500 0.505 0.272 0.797 | 0.247 0.796 0.725 | 0.735 0.808
DeepSeek-V3 | 0.790 0.437 0.934 0.045 | 0.549 0.150 0.583 0.814 | 0.198 0.170 | 0.714 0.487 0.578 0.675 0.729 | 0.261 0.840 0.779 | 0.750 0.815
DeepSeek R1 | 0.807 0.393 0.952 0.057 | 0.587 0.110 0.499 0.902 | 0.337 0.202 | 0.763 0.419 0.670 0.688 0.769 | 0.853 0.836 0.858 | 0.759 0.804
QwQ-32B-Preview | 0.685 0.270 0.656 0.001 | 0.005 0.141 0.550 0.815 | 0.131 0.220 | 0.613 0.784 0.555 0.020 0.744 | 0.282 0.793 0.796 | 0.696 0.817
Jamba 1.5 Mini | 0.552 0.284 0.844 0.005 | 0.132 0.119 0.418 0.765 | 0.043 0.270 | 0.508 0.898 0.499 0.151 0.682 | 0.218 0.666 0.586 | 0.741 0.816
Jamba 1.5 Large | 0.693 0.341 0.862 0.005 | 0.397 0.183 0.582 0.798 | 0.074 0.176 | 0.628 0.618 0.550 0.541 0.782 | 0.225 0.790 0.660 | 0.734 0.818
Claude 3.5 Sonnet | 0.799 0.439 0.891 0.047 | 0.655 0.101 0.553 0.944 | 0.196 0.197 | 0.668 0.634 0.674 0.692 0.827 | 0.402 0.844 0.700 | 0.767 0.813
Claude 3 Haiku | 0.711 0.285 0.883 0.015 | 0.494 0.167 0.463 0.908 | 0.081 0.200 | 0.622 0.022 0.631 0.558 0.781 | 0.421 0.803 0.733 | 0.646 0.808
Cohere Command R 7B | 0.748 0.194 0.845 0.018 | 0.441 0.164 0.532 0.840 | 0.057 0.255 | 0.516 0.762 0.459 0.068 0.770 | 0.212 0.709 0.716 | 0.750 0.815
Cohere Command R + | 0.756 0.333 0.922 0.021 | 0.452 0.106 0.533 0.699 | 0.080 0.238 | 0.651 0.684 0.393 0.118 0.812 | 0.259 0.776 0.698 | 0.751 0.810
Google Gemini 1.5 Pro | 0.712 0.374 0.944 0.019 | 0.393 0.144 0.593 0.885 | 0.196 0.217 | 0.418 0.336 0.579 0.525 0.837 | 0.280 0.829 0.763 | 0.777 0.817
OpenAI gpt-4o | 0.766 0.399 0.942 0.037 | 0.523 0.184 0.541 0.928 | 0.130 0.222 | 0.710 0.524 0.664 0.750 0.824 | 0.749 0.836 0.754 | 0.773 0.816
OpenAI o1-mini | 0.761 0.403 0.876 0.010 | 0.662 0.120 0.542 0.917 | 0.289 0.209 | 0.670 0.612 0.635 0.720 0.769 | 0.840 0.799 0.698 | 0.763 0.816

Note: Color highlighting indicates performance ranking: Best, Strong, Good.
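
One way to explore these numbers is to rank models within each column and average the ranks. Below is a minimal sketch, assuming the overall table has been exported to a CSV (hypothetical file name "flame_overall.csv") with a "Model" column and one numeric column per dataset metric; it is an illustration only, not the official FLaME aggregation.

```python
# A minimal sketch, assuming the overall table has been exported to a CSV
# (hypothetical file name "flame_overall.csv") with a "Model" column and one
# numeric column per dataset metric (e.g. "FiNER F1", "FiQA MSE", "HL Acc").
import pandas as pd

df = pd.read_csv("flame_overall.csv").set_index("Model")

# Rank models within each column; FiQA reports MSE, where lower is better,
# so its ranking direction is flipped relative to the other metrics.
ranks = pd.DataFrame({
    col: df[col].rank(ascending=(col == "FiQA MSE"), method="min")
    for col in df.columns
})

# Mean rank across all columns gives a rough overall ordering
# (an illustration only, not the official FLaME aggregation).
print(ranks.mean(axis=1).sort_values().head(10))
```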

Causal Analysis Task Results

Model | Causal Detection (CD): Accuracy, Precision, Recall, F1 | Causal Classification (CC): Precision, Recall, F1, Accuracy
Llama 3 70B Instruct 0.148 0.429 0.148 0.142 0.241 0.329 0.192 0.198
Llama 3 8B Instruct 0.097 0.341 0.097 0.049 0.232 0.241 0.234 0.380
DBRX Instruct 0.078 0.521 0.078 0.087 0.276 0.313 0.231 0.235
DeepSeek LLM (67B) 0.026 0.214 0.026 0.025 0.141 0.328 0.193 0.221
Gemma 2 27B 0.115 0.510 0.115 0.133 0.309 0.310 0.242 0.262
Gemma 2 9B 0.115 0.394 0.115 0.105 0.275 0.294 0.207 0.258
Mistral (7B) Instruct v0.3 0.078 0.455 0.078 0.052 0.339 0.361 0.227 0.258
Mixtral-8x22B Instruct 0.131 0.486 0.131 0.125 0.344 0.310 0.308 0.318
Mixtral-8x7B Instruct 0.088 0.510 0.088 0.055 0.308 0.314 0.229 0.273
Qwen 2 Instruct (72B) 0.139 0.489 0.139 0.190 0.208 0.330 0.184 0.188
WizardLM-2 8x22B 0.076 0.453 0.076 0.114 0.263 0.347 0.201 0.237
DeepSeek-V3 0.164 0.528 0.164 0.198 0.194 0.327 0.170 0.248
DeepSeek R1 0.245 0.643 0.245 0.337 0.385 0.318 0.202 0.221
QwQ-32B-Preview 0.110 0.473 0.110 0.131 0.193 0.262 0.220 0.465
Jamba 1.5 Mini 0.050 0.280 0.050 0.043 0.323 0.283 0.270 0.295
Jamba 1.5 Large 0.076 0.517 0.076 0.074 0.268 0.248 0.176 0.200
Claude 3.5 Sonnet 0.154 0.564 0.154 0.196 0.259 0.336 0.197 0.235
Claude 3 Haiku 0.082 0.388 0.082 0.081 0.369 0.347 0.200 0.203
Cohere Command R 7B 0.089 0.363 0.089 0.057 0.379 0.356 0.255 0.275
Cohere Command R + 0.090 0.453 0.090 0.080 0.353 0.336 0.238 0.265
Google Gemini 1.5 Pro 0.165 0.514 0.165 0.196 0.265 0.357 0.217 0.258
OpenAI gpt-4o 0.082 0.576 0.082 0.130 0.254 0.327 0.222 0.235
OpenAI o1-mini 0.206 0.648 0.206 0.289 0.325 0.316 0.209 0.233

Note: Color highlighting indicates performance ranking: Best, Strong, Good.

Information Retrieval Task Results

Model | FiNER-ORD: Precision, Recall, F1, Accuracy | FinRED: Accuracy, Precision, Recall, F1 | ReFiND: Accuracy, Precision, Recall, F1 | FNXL: Precision, Recall, F1, Accuracy | FinEntity: Precision, Recall, Accuracy, F1
Llama 3 70B Instruct 0.715 0.693 0.701 0.911 0.314 0.454 0.314 0.332 0.879 0.904 0.879 0.883 0.015 0.030 0.020 0.010 0.474 0.485 0.485 0.469
Llama 3 8B Instruct 0.581 0.558 0.565 0.854 0.296 0.357 0.296 0.289 0.723 0.755 0.723 0.705 0.003 0.004 0.003 0.002 0.301 0.478 0.478 0.350
DBRX Instruct 0.516 0.476 0.489 0.802 0.329 0.371 0.329 0.304 0.766 0.825 0.766 0.778 0.008 0.011 0.009 0.005 0.004 0.014 0.014 0.006
DeepSeek LLM (67B) 0.752 0.742 0.745 0.917 0.344 0.403 0.344 0.334 0.874 0.890 0.874 0.879 0.005 0.009 0.007 0.003 0.456 0.405 0.405 0.416
Gemma 2 27B 0.772 0.754 0.761 0.923 0.352 0.437 0.352 0.356 0.897 0.914 0.897 0.902 0.005 0.008 0.006 0.003 0.320 0.295 0.295 0.298
Gemma 2 9B 0.665 0.643 0.651 0.886 0.336 0.373 0.336 0.331 0.885 0.902 0.885 0.892 0.004 0.008 0.005 0.003 0.348 0.419 0.419 0.367
Mistral (7B) Instruct v0.3 0.540 0.522 0.526 0.806 0.278 0.383 0.278 0.276 0.767 0.817 0.767 0.771 0.004 0.006 0.004 0.002 0.337 0.477 0.477 0.368
Mixtral-8x22B Instruct 0.653 0.625 0.635 0.870 0.381 0.414 0.381 0.367 0.807 0.847 0.807 0.811 0.010 0.008 0.009 0.005 0.428 0.481 0.481 0.435
Mixtral-8x7B Instruct 0.613 0.591 0.598 0.875 0.291 0.376 0.291 0.282 0.840 0.863 0.840 0.845 0.007 0.012 0.009 0.005 0.251 0.324 0.324 0.267
Qwen 2 Instruct (72B) 0.766 0.742 0.748 0.899 0.365 0.407 0.365 0.348 0.850 0.881 0.850 0.854 0.010 0.016 0.012 0.006 0.468 0.530 0.530 0.483
WizardLM-2 8x22B 0.755 0.741 0.744 0.920 0.362 0.397 0.362 0.355 0.846 0.874 0.846 0.852 0.008 0.009 0.008 0.004 0.222 0.247 0.247 0.226
DeepSeek-V3 0.798 0.787 0.790 0.945 0.450 0.463 0.450 0.437 0.927 0.943 0.927 0.934 0.034 0.067 0.045 0.023 0.563 0.544 0.544 0.549
DeepSeek R1 0.813 0.805 0.807 0.944 0.412 0.424 0.412 0.393 0.946 0.960 0.946 0.952 0.044 0.082 0.057 0.029 0.600 0.586 0.586 0.587
QwQ-32B-Preview 0.695 0.681 0.685 0.907 0.278 0.396 0.278 0.270 0.680 0.770 0.680 0.656 0.001 0.001 0.001 0.000 0.005 0.005 0.005 0.005
Jamba 1.5 Mini 0.564 0.556 0.552 0.818 0.308 0.450 0.308 0.284 0.830 0.864 0.830 0.844 0.004 0.006 0.005 0.003 0.119 0.182 0.182 0.132
Jamba 1.5 Large 0.707 0.687 0.693 0.883 0.341 0.452 0.341 0.341 0.856 0.890 0.856 0.862 0.004 0.005 0.005 0.002 0.403 0.414 0.414 0.397
Claude 3.5 Sonnet 0.811 0.794 0.799 0.922 0.455 0.465 0.455 0.439 0.873 0.927 0.873 0.891 0.034 0.080 0.047 0.024 0.658 0.668 0.668 0.655
Claude 3 Haiku 0.732 0.700 0.711 0.895 0.294 0.330 0.294 0.285 0.879 0.917 0.879 0.883 0.011 0.022 0.015 0.008 0.498 0.517 0.517 0.494
Cohere Command R + 0.769 0.750 0.756 0.902 0.353 0.405 0.353 0.333 0.917 0.930 0.917 0.922 0.016 0.032 0.021 0.011 0.462 0.459 0.459 0.452
Google Gemini 1.5 Pro 0.728 0.705 0.712 0.891 0.373 0.436 0.373 0.374 0.934 0.955 0.934 0.944 0.014 0.028 0.019 0.010 0.399 0.400 0.400 0.393
OpenAI gpt-4o 0.778 0.760 0.766 0.911 0.402 0.445 0.402 0.399 0.931 0.955 0.931 0.942 0.027 0.056 0.037 0.019 0.537 0.517 0.517 0.523
OpenAI o1-mini 0.772 0.755 0.761 0.922 0.407 0.444 0.407 0.403 0.867 0.900 0.867 0.876 0.007 0.015 0.010 0.005 0.661 0.681 0.681 0.662

Note: Color highlighting indicates performance ranking: Best, Strong, Good.
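
For the token-labeling dataset in this tab (FiNER-ORD), span-level precision, recall, and F1 are the natural metrics. The sketch below shows how such scores are typically computed from BIO tags with the seqeval package; the tags and entity types are made-up examples, and this is not necessarily the exact scoring code used to produce the table.

```python
# A minimal sketch of entity-level NER scoring in the style of FiNER-ORD,
# using the seqeval package; illustrative only, not necessarily FLaME's scorer.
from seqeval.metrics import precision_score, recall_score, f1_score

# One list of BIO tags per sentence (gold vs. model output); made-up examples.
y_true = [["B-PER", "I-PER", "O", "B-ORG"], ["O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["O", "B-LOC", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```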

Question Answering Task Results

Model | FinQA Accuracy | ConvFinQA Accuracy | TATQA Accuracy
Llama 3 70B Instruct 0.809 0.709 0.772
Llama 3 8B Instruct 0.767 0.268 0.706
DBRX Instruct 0.738 0.252 0.633
DeepSeek LLM (67B) 0.742 0.174 0.355
Gemma 2 27B 0.768 0.268 0.734
Gemma 2 9B 0.779 0.292 0.750
Mistral (7B) Instruct v0.3 0.655 0.199 0.553
Mixtral-8x22B Instruct 0.766 0.285 0.666
Mixtral-8x7B Instruct 0.611 0.315 0.501
Qwen 2 Instruct (72B) 0.819 0.269 0.715
WizardLM-2 8x22B 0.796 0.247 0.725
DeepSeek-V3 0.840 0.261 0.779
DeepSeek R1 0.836 0.853 0.858
QwQ-32B-Preview 0.793 0.282 0.796
Jamba 1.5 Mini 0.666 0.218 0.586
Jamba 1.5 Large 0.790 0.225 0.660
Claude 3.5 Sonnet 0.844 0.402 0.700
Claude 3 Haiku 0.803 0.421 0.733
Cohere Command R 7B 0.709 0.212 0.716
Cohere Command R + 0.776 0.259 0.698
Google Gemini 1.5 Pro 0.829 0.280 0.763
OpenAI gpt-4o 0.836 0.749 0.754
OpenAI o1-mini 0.799 0.840 0.698

Note: Color highlighting indicates performance ranking: Best, Strong, Good.
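
The QA datasets are scored by accuracy over mostly numeric answers. As a rough illustration of how a numeric prediction can be matched against a gold answer after normalization, here is a hedged sketch; the normalization rules and tolerance are assumptions, not necessarily FLaME's scorer.

```python
# Illustrative numeric-answer matching for FinQA-style QA; a hedged sketch,
# not necessarily the scoring used to produce the accuracies above.
def normalize_number(text: str) -> float | None:
    """Strip currency/percent symbols and thousands separators, then parse."""
    cleaned = text.strip().replace("$", "").replace("%", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None

def numeric_match(pred: str, gold: str, rel_tol: float = 1e-3) -> bool:
    p, g = normalize_number(pred), normalize_number(gold)
    if p is None or g is None:
        # Fall back to a case-insensitive string comparison for non-numeric answers.
        return pred.strip().lower() == gold.strip().lower()
    return abs(p - g) <= rel_tol * max(1.0, abs(g))

print(numeric_match("$1,234.50", "1234.5"))  # True
print(numeric_match("12.3%", "12.7"))        # False at this tolerance
```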

Sentiment Analysis Task Results

Model | FiQA Task 1: MSE, MAE, r² Score | FinEntity: Precision, Recall, Accuracy, F1 | SubjECTive-QA: Precision, Recall, F1, Accuracy | Financial Phrase Bank (FPB): Accuracy, Precision, Recall, F1
Llama 3 70B Instruct 0.123 0.290 0.272 0.474 0.485 0.485 0.469 0.652 0.573 0.535 0.573 0.901 0.904 0.901 0.902
Llama 3 8B Instruct 0.161 0.344 0.045 0.301 0.478 0.478 0.350 0.635 0.625 0.600 0.625 0.738 0.801 0.738 0.698
DBRX Instruct 0.160 0.321 0.052 0.004 0.014 0.014 0.006 0.654 0.541 0.436 0.541 0.524 0.727 0.524 0.499
DeepSeek LLM (67B) 0.118 0.278 0.302 0.456 0.405 0.405 0.416 0.676 0.544 0.462 0.544 0.815 0.867 0.815 0.811
Gemma 2 27B 0.100 0.266 0.406 0.320 0.295 0.295 0.298 0.562 0.524 0.515 0.524 0.890 0.896 0.890 0.884
Gemma 2 9B 0.189 0.352 -0.120 0.348 0.419 0.419 0.367 0.570 0.499 0.491 0.499 0.940 0.941 0.940 0.940
Mistral (7B) Instruct v0.3 0.135 0.278 0.200 0.337 0.477 0.477 0.368 0.607 0.542 0.522 0.542 0.847 0.854 0.847 0.841
Mixtral-8x22B Instruct 0.221 0.364 -0.310 0.428 0.481 0.481 0.435 0.614 0.538 0.510 0.538 0.768 0.845 0.768 0.776
Mixtral-8x7B Instruct 0.208 0.307 -0.229 0.251 0.324 0.324 0.267 0.611 0.518 0.498 0.518 0.896 0.898 0.896 0.893
Qwen 2 Instruct (72B) 0.205 0.409 -0.212 0.468 0.530 0.530 0.483 0.644 0.601 0.576 0.601 0.904 0.908 0.904 0.901
WizardLM-2 8x22B 0.129 0.283 0.239 0.222 0.247 0.247 0.226 0.611 0.570 0.566 0.570 0.765 0.853 0.765 0.779
DeepSeek-V3 0.150 0.311 0.111 0.563 0.544 0.544 0.549 0.640 0.572 0.583 0.572 0.828 0.851 0.828 0.814
DeepSeek R1 0.110 0.289 0.348 0.600 0.586 0.586 0.587 0.644 0.489 0.499 0.489 0.904 0.907 0.904 0.902
QwQ-32B-Preview 0.141 0.290 0.165 0.005 0.005 0.005 0.005 0.629 0.534 0.550 0.534 0.812 0.827 0.812 0.815
Jamba 1.5 Mini 0.119 0.282 0.293 0.119 0.182 0.182 0.132 0.380 0.525 0.418 0.525 0.784 0.814 0.784 0.765
Jamba 1.5 Large 0.183 0.363 -0.085 0.403 0.414 0.414 0.397 0.635 0.573 0.582 0.573 0.824 0.850 0.824 0.798
Claude 3.5 Sonnet 0.101 0.268 0.402 0.658 0.668 0.668 0.655 0.634 0.585 0.553 0.585 0.944 0.945 0.944 0.944
Claude 3 Haiku 0.167 0.349 0.008 0.498 0.517 0.517 0.494 0.619 0.538 0.463 0.538 0.907 0.913 0.907 0.908
Cohere Command R 7B 0.164 0.319 0.028 0.457 0.446 0.446 0.441 0.609 0.547 0.532 0.547 0.835 0.861 0.835 0.840
Cohere Command R + 0.106 0.274 0.373 0.462 0.459 0.459 0.452 0.608 0.547 0.533 0.547 0.741 0.806 0.741 0.699
Google Gemini 1.5 Pro 0.144 0.329 0.149 0.399 0.400 0.400 0.393 0.642 0.587 0.593 0.587 0.890 0.895 0.890 0.885
OpenAI gpt-4o 0.184 0.317 -0.089 0.537 0.517 0.517 0.523 0.639 0.515 0.541 0.515 0.929 0.931 0.929 0.928
OpenAI o1-mini 0.120 0.295 0.289 0.661 0.681 0.681 0.662 0.660 0.515 0.542 0.515 0.918 0.917 0.918 0.917

Note: Color highlighting indicates performance ranking: Best, Strong, Good.
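
Unlike the classification-style datasets in this tab, FiQA Task 1 is scored as a regression over continuous sentiment scores (MSE, MAE, r² Score). A minimal scikit-learn sketch of those three metrics follows; the score arrays are made-up examples, not FLaME data.

```python
# A minimal sketch of the regression metrics reported for FiQA Task 1
# (MSE, MAE, r² Score), computed with scikit-learn; made-up example values.
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

gold = [0.35, -0.62, 0.10, 0.88, -0.15]   # reference sentiment scores in [-1, 1]
pred = [0.30, -0.50, 0.05, 0.70, -0.20]   # model-predicted scores

print("MSE:", mean_squared_error(gold, pred))
print("MAE:", mean_absolute_error(gold, pred))
print("r²: ", r2_score(gold, pred))
```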

Text Classification Task Results

Model | Banking77: Accuracy, Precision, Recall, F1 | FinBench: Accuracy, Precision, Recall, F1 | FOMC: Accuracy, Precision, Recall, F1 | NumClaim: Precision, Recall, Accuracy, F1 | Headlines: Accuracy
Llama 3 70B Instruct 0.660 0.748 0.660 0.645 0.222 0.826 0.222 0.309 0.661 0.662 0.661 0.652 0.430 0.240 0.980 0.386 0.811
Llama 3 8B Instruct 0.534 0.672 0.534 0.512 0.543 0.857 0.543 0.659 0.565 0.618 0.565 0.497 0.801 0.463 0.571 0.511 0.763
DBRX Instruct 0.578 0.706 0.578 0.574 0.359 0.851 0.359 0.483 0.285 0.572 0.285 0.193 0.222 0.190 1.000 0.319 0.746
DeepSeek LLM (67B) 0.596 0.711 0.596 0.578 0.369 0.856 0.369 0.492 0.532 0.678 0.532 0.407 0.832 1.000 0.082 0.151 0.778
Gemma 2 27B 0.639 0.730 0.639 0.621 0.410 0.849 0.410 0.538 0.651 0.704 0.651 0.620 0.471 0.257 1.000 0.408 0.808
Gemma 2 9B 0.630 0.710 0.630 0.609 0.412 0.848 0.412 0.541 0.595 0.694 0.595 0.519 0.371 0.224 0.990 0.365 0.856
Mistral (7B) Instruct v0.3 0.547 0.677 0.547 0.528 0.375 0.839 0.375 0.503 0.587 0.598 0.587 0.542 0.521 0.266 0.918 0.412 0.779
Mixtral-8x22B Instruct 0.622 0.718 0.622 0.602 0.166 0.811 0.166 0.221 0.562 0.709 0.562 0.465 0.732 0.384 0.775 0.513 0.835
Mixtral-8x7B Instruct 0.567 0.693 0.567 0.547 0.285 0.838 0.285 0.396 0.623 0.636 0.623 0.603 0.765 0.431 0.898 0.583 0.805
Qwen 2 Instruct (72B) 0.644 0.730 0.644 0.627 0.370 0.848 0.370 0.495 0.623 0.639 0.623 0.605 0.821 0.506 0.867 0.639 0.830
WizardLM-2 8x22B 0.664 0.737 0.664 0.648 0.373 0.842 0.373 0.500 0.583 0.710 0.583 0.505 0.831 0.630 0.173 0.272 0.797
DeepSeek-V3 0.722 0.774 0.722 0.714 0.362 0.845 0.362 0.487 0.625 0.712 0.625 0.578 0.860 0.586 0.796 0.675 0.729
DeepSeek R1 0.772 0.789 0.772 0.763 0.306 0.846 0.306 0.419 0.679 0.682 0.679 0.670 0.851 0.557 0.898 0.688 0.769
QwQ-32B-Preview 0.577 0.747 0.577 0.613 0.716 0.871 0.716 0.784 0.591 0.630 0.591 0.555 0.819 1.000 0.010 0.020 0.744
Jamba 1.5 Mini 0.528 0.630 0.528 0.508 0.913 0.883 0.913 0.898 0.572 0.678 0.572 0.499 0.812 0.429 0.092 0.151 0.682
Jamba 1.5 Large 0.642 0.746 0.642 0.628 0.494 0.851 0.494 0.618 0.597 0.650 0.597 0.550 0.855 0.639 0.469 0.541 0.782
Claude 3.5 Sonnet 0.682 0.755 0.682 0.668 0.513 0.854 0.513 0.634 0.675 0.677 0.675 0.674 0.879 0.646 0.745 0.692 0.827
Claude 3 Haiku 0.639 0.735 0.639 0.622 0.067 0.674 0.067 0.022 0.633 0.634 0.633 0.631 0.838 0.556 0.561 0.558 0.781
Cohere Command R 7B 0.530 0.650 0.530 0.516 0.682 0.868 0.682 0.762 0.536 0.505 0.536 0.459 0.797 0.210 0.041 0.068 0.770
Cohere Command R + 0.660 0.747 0.660 0.651 0.575 0.859 0.575 0.684 0.526 0.655 0.526 0.393 0.804 0.333 0.071 0.118 0.812
Google Gemini 1.5 Pro 0.483 0.487 0.483 0.418 0.240 0.823 0.240 0.336 0.619 0.667 0.619 0.579 0.700 0.369 0.908 0.525 0.837
OpenAI gpt-4o 0.704 0.792 0.704 0.710 0.396 0.846 0.396 0.524 0.681 0.719 0.681 0.664 0.896 0.667 0.857 0.750 0.824
OpenAI o1-mini 0.681 0.760 0.681 0.670 0.487 0.851 0.487 0.612 0.651 0.670 0.651 0.635 0.888 0.664 0.786 0.720 0.769

Note: Color highlighting indicates performance ranking: Best, Strong, Good.
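
Each classification dataset in this tab reports Accuracy alongside Precision, Recall, and F1. The sketch below shows how such a set of metrics can be computed with scikit-learn; the weighted averaging and the FOMC-style labels are assumptions for illustration, not necessarily FLaME's configuration.

```python
# A minimal sketch of Accuracy / Precision / Recall / F1 for a classification
# task (FOMC-style labels used as a made-up example).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["hawkish", "dovish", "neutral", "hawkish", "neutral"]
pred = ["hawkish", "neutral", "neutral", "dovish", "neutral"]

acc = accuracy_score(gold, pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    gold, pred, average="weighted", zero_division=0
)
print(f"Accuracy={acc:.3f}  Precision={prec:.3f}  Recall={rec:.3f}  F1={f1:.3f}")
```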

Text Summarization Task Results

Model | ECTSum: BERTScore Precision, BERTScore Recall, BERTScore F1 | EDTSum: BERTScore Precision, BERTScore Recall, BERTScore F1
Llama 3 70B Instruct 0.715 0.801 0.754 0.793 0.844 0.817
Llama 3 8B Instruct 0.724 0.796 0.757 0.785 0.841 0.811
DBRX Instruct 0.680 0.786 0.729 0.774 0.843 0.806
DeepSeek LLM (67B) 0.692 0.678 0.681 0.779 0.840 0.807
Gemma 2 27B 0.680 0.777 0.723 0.801 0.829 0.814
Gemma 2 9B 0.651 0.531 0.585 0.803 0.833 0.817
Mistral (7B) Instruct v0.3 0.702 0.806 0.750 0.783 0.842 0.811
Mixtral-8x22B Instruct 0.713 0.812 0.758 0.790 0.843 0.815
Mixtral-8x7B Instruct 0.727 0.773 0.747 0.785 0.839 0.810
Qwen 2 Instruct (72B) 0.709 0.804 0.752 0.781 0.846 0.811
WizardLM-2 8x22B 0.677 0.806 0.735 0.774 0.847 0.808
DeepSeek-V3 0.703 0.806 0.750 0.791 0.842 0.815
DeepSeek R1 0.724 0.800 0.759 0.770 0.843 0.804
QwQ-32B-Preview 0.653 0.751 0.696 0.797 0.841 0.817
Jamba 1.5 Mini 0.692 0.798 0.741 0.798 0.838 0.816
Jamba 1.5 Large 0.679 0.800 0.734 0.799 0.841 0.818
Claude 3.5 Sonnet 0.737 0.802 0.767 0.786 0.843 0.813
Claude 3 Haiku 0.683 0.617 0.646 0.778 0.844 0.808
Cohere Command R 7B 0.724 0.781 0.750 0.790 0.844 0.815
Cohere Command R + 0.724 0.782 0.751 0.789 0.834 0.810
Google Gemini 1.5 Pro 0.757 0.800 0.777 0.800 0.836 0.817
OpenAI gpt-4o 0.755 0.793 0.773 0.795 0.840 0.816
OpenAI o1-mini 0.731 0.801 0.763 0.795 0.840 0.816

Note: Color highlighting indicates performance ranking: Best, Strong, Good.
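
Both summarization datasets are scored with BERTScore Precision, Recall, and F1. A minimal sketch using the bert-score package follows; the default English model is an assumption and may differ from FLaME's exact configuration.

```python
# A minimal BERTScore sketch for ECTSum/EDTSum-style summaries; the language
# default ("en") and underlying model choice are assumptions, not FLaME's setup.
from bert_score import score

candidates = ["Revenue grew 12% year over year, driven by cloud demand."]
references = ["The company reported 12% YoY revenue growth on strong cloud sales."]

P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore P={P.mean():.3f}  R={R.mean():.3f}  F1={F1.mean():.3f}")
```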

Model Cost Analysis

Model FOMC FPB FinQA FiQA-1 FiQA-2 HL FB FR RD EDTSum B77 CD CC ECTSum FE FiNER FNXL NC TQA CFQA SQA Total
Llama 3 70B Instruct 0.10 0.11 1.14 0.06 0.72 1.00 0.40 0.38 1.34 1.94 1.64 0.07 0.05 1.56 0.12 0.33 0.25 0.09 1.11 2.96 1.17 16.54
Llama 3 8B Instruct 0.02 0.03 0.25 0.01 0.16 0.22 0.09 0.09 0.32 0.43 0.37 0.02 0.01 0.36 0.03 0.08 0.06 0.02 0.26 0.69 0.26 3.79
DBRX Instruct 0.14 0.17 1.50 0.06 0.95 1.29 0.56 0.57 2.05 2.93 2.14 0.11 0.10 2.45 0.17 0.47 0.34 0.13 1.47 4.19 1.55 23.35
DeepSeek LLM (67B) 0.10 0.12 1.25 0.05 0.76 0.87 0.42 0.37 1.45 1.85 2.03 0.08 0.05 0.83 0.13 0.34 0.24 0.09 1.20 3.17 1.17 16.57
Gemma 2 27B 0.08 0.09 1.05 0.05 0.66 0.91 0.30 0.34 1.37 1.75 1.77 0.07 0.04 1.46 0.11 0.30 0.21 0.08 1.00 2.84 1.04 15.50
Gemma 2 9B 0.03 0.03 0.40 0.02 0.24 0.33 0.12 0.14 0.51 0.66 0.66 0.03 0.02 0.00 0.04 0.11 0.08 0.03 0.37 1.08 0.39 5.29
Mistral (7B) Instruct v0.3 0.03 0.03 0.28 0.01 0.18 0.24 0.10 0.09 0.36 0.57 0.48 0.02 0.01 0.45 0.03 0.08 0.06 0.02 0.27 0.78 0.26 4.36
Mixtral-8x22B Instruct 0.14 0.17 1.80 0.07 1.05 1.44 0.58 0.56 2.04 3.42 2.89 0.11 0.07 2.66 0.18 0.48 0.35 0.14 1.73 4.90 1.55 26.35
Mixtral-8x7B Instruct 0.08 0.09 0.88 0.04 0.53 0.70 0.30 0.30 1.07 1.72 1.50 0.06 0.05 1.30 0.09 0.24 0.20 0.07 0.87 2.55 0.78 13.41
Qwen 2 Instruct (72B) 0.10 0.12 1.29 0.05 0.74 0.96 0.43 0.43 1.44 2.36 1.61 0.08 0.05 1.80 0.12 0.34 0.24 0.10 1.18 3.41 1.17 18.02
WizardLM-2 8x22B 0.16 0.19 1.94 0.08 1.07 1.47 0.61 0.61 2.24 3.47 3.00 0.11 0.10 2.85 0.18 0.49 0.34 0.14 1.94 5.31 1.55 27.87
DeepSeek-V3 0.13 0.15 1.57 0.07 0.98 1.36 0.52 0.54 2.10 2.99 2.55 0.11 0.06 2.33 0.16 0.55 0.28 0.12 1.56 4.28 1.62 24.03
DeepSeek R1 1.99 2.10 14.18 1.48 17.82 20.11 6.63 12.65 31.00 21.15 23.28 3.75 1.06 15.02 7.31 8.34 11.21 1.88 13.72 39.42 9.07 263.16
QwQ-32B-Preview 0.15 0.18 2.38 0.08 0.93 1.37 0.60 0.68 2.18 3.12 2.36 0.11 0.07 2.76 0.14 0.65 0.54 0.14 2.61 7.83 1.55 30.43
Jamba 1.5 Mini 0.02 0.03 0.30 0.02 0.23 0.22 0.10 0.08 0.44 0.55 0.51 0.02 0.01 0.49 0.05 0.10 0.07 0.02 0.25 0.72 0.26 4.47
Jamba 1.5 Large 0.31 0.36 4.42 0.30 3.47 4.81 1.78 0.94 4.97 5.80 5.51 0.35 0.13 7.07 0.56 1.67 0.77 0.30 2.87 7.45 2.59 56.42
Claude 3.5 Sonnet 0.62 0.72 6.98 0.55 6.50 8.81 3.44 3.21 12.32 9.50 11.11 0.61 0.22 7.09 0.90 3.01 1.79 0.57 9.18 16.86 3.89 107.87
Claude 3 Haiku 0.06 0.07 0.56 0.05 0.54 0.73 0.28 0.25 0.82 0.81 0.90 0.05 0.02 0.21 0.06 0.23 0.14 0.05 0.64 1.28 0.32 8.07
Cohere Command R 7B 0.01 0.01 0.08 0.00 0.07 0.09 0.04 0.03 0.11 0.11 0.10 0.01 0.00 0.08 0.01 0.03 0.01 0.01 0.08 0.19 0.05 1.09
Cohere Command R + 0.41 0.45 5.40 0.35 4.41 4.00 2.30 0.93 3.87 7.03 7.21 0.43 0.12 5.55 0.48 1.69 0.97 0.42 4.59 10.09 3.24 63.95
Google Gemini 1.5 Pro 0.23 0.21 2.26 0.18 2.20 2.78 1.02 0.49 2.27 3.45 2.70 0.21 0.07 2.65 0.25 0.87 0.58 0.21 2.13 5.78 1.62 32.16
OpenAI gpt-4o 0.35 0.41 4.99 0.32 4.45 5.33 1.55 1.21 5.77 6.57 5.00 0.35 0.14 4.85 0.44 1.94 0.96 0.34 4.95 10.36 3.24 63.52
OpenAI o1-mini 0.90 0.90 5.25 0.73 9.70 12.20 3.27 4.89 13.60 1.29 9.29 2.56 0.75 3.18 2.92 1.91 6.39 0.92 6.97 15.71 1.42 104.73
Cost tiers: Low Cost ($0-$10), Medium Cost ($10-$35), High Cost ($35-$70), Very High Cost ($70+)

Note: All costs are in USD and represent the expense of running each model on each dataset. Colors indicate cost tiers based on total cost, with darker blue representing higher cost. For cost-efficiency analysis, compare these costs with the corresponding performance metrics in the other tabs.
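
As suggested above, the cost table can be joined with any metric table to get a rough value-for-money view. Below is a minimal sketch, assuming both tables have been exported to CSVs (hypothetical file names "flame_costs.csv" and "flame_qa_accuracy.csv") keyed by a "Model" column; the FinQA column names are assumptions for illustration.

```python
# A minimal cost-efficiency sketch: accuracy points per dollar on FinQA,
# assuming hypothetical CSV exports of the cost and QA-accuracy tables.
import pandas as pd

costs = pd.read_csv("flame_costs.csv").set_index("Model")        # USD per dataset
qa = pd.read_csv("flame_qa_accuracy.csv").set_index("Model")     # accuracy per dataset

# Accuracy per dollar spent on FinQA: a rough value-for-money signal,
# not an official FLaME metric.
efficiency = (qa["FinQA"] / costs["FinQA"]).sort_values(ascending=False)
print(efficiency.head(10))
```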