Huzaifa Pardawala
committed on
Commit
·
9ed293f
1
Parent(s):
464a24c
adding changes to the results and contributions sections
index.html +281 -161
index.html
CHANGED
|
@@ -614,6 +614,7 @@
|
|
| 614 |
</div>
|
| 615 |
</div>
|
| 616 |
|
|
|
|
| 617 |
<!-- Task 3 -->
|
| 618 |
<div class="task-performance mb-4">
|
| 619 |
<div class="columns is-vcentered">
|
|
@@ -643,7 +644,7 @@
|
|
| 643 |
</div>
|
| 644 |
</div>
|
| 645 |
<div class="column is-2 has-text-centered">
|
| 646 |
-
<span class="tag is-info is-light">
|
| 647 |
</div>
|
| 648 |
</div>
|
| 649 |
</div>
|
|
@@ -671,104 +672,155 @@
|
|
| 671 |
</div>
|
| 672 |
</div>
|
| 673 |
|
| 674 |
-
|
| 675 |
-
|
| 676 |
-
|
| 677 |
-
|
| 678 |
-
|
| 679 |
-
|
| 680 |
-
|
| 681 |
-
|
| 682 |
-
|
| 683 |
-
|
| 684 |
-
|
| 685 |
-
|
| 686 |
-
|
| 687 |
-
|
| 688 |
-
|
| 689 |
-
|
| 690 |
-
|
| 691 |
-
|
| 692 |
-
|
| 693 |
-
|
| 694 |
-
|
| 695 |
-
|
| 696 |
-
|
| 697 |
-
|
| 698 |
-
|
| 699 |
-
<p class="mt-2 mb-0">OpenAI o1-mini</p>
|
| 700 |
-
</div>
|
| 701 |
-
</div>
|
| 702 |
-
<div class="column">
|
| 703 |
-
<div class="has-text-centered">
|
| 704 |
-
<span class="icon is-large has-text-bronze"><i class="fas fa-trophy fa-2x"></i></span>
|
| 705 |
-
<p class="mt-2 mb-0">Claude 3.5 Sonnet</p>
|
| 706 |
-
</div>
|
| 707 |
-
</div>
|
| 708 |
-
</div>
|
| 709 |
-
</div>
|
| 710 |
-
|
| 711 |
-
<p class="mb-2">Key insights from our model analysis:</p>
|
| 712 |
-
<ul>
|
| 713 |
-
<li>Inconsistent scaling: larger parameter sizes do not guarantee higher performance</li>
|
| 714 |
-
<li>Open-weight models show competitive performance on many tasks</li>
|
| 715 |
-
<li>Dramatic price differences between models ($4-260 USD)</li>
|
| 716 |
-
<li>Cost-conscious choices should be based on specific use cases</li>
|
| 717 |
-
</ul>
|
| 718 |
-
</div>
|
| 719 |
-
</div>
|
| 720 |
-
</div>
|
| 721 |
</div>
|
| 722 |
-
|
| 723 |
-
|
| 724 |
-
|
| 725 |
-
|
| 726 |
-
|
| 727 |
-
|
| 728 |
-
|
| 729 |
-
|
| 730 |
-
|
| 731 |
-
|
| 732 |
-
|
| 733 |
-
|
| 734 |
-
|
| 735 |
-
|
| 736 |
-
|
| 737 |
-
|
| 738 |
-
|
| 739 |
-
|
| 740 |
-
|
| 741 |
-
|
| 742 |
-
|
| 743 |
-
|
| 744 |
-
|
| 745 |
-
|
| 746 |
-
|
| 747 |
-
|
| 748 |
-
|
| 749 |
-
|
| 750 |
-
|
| 751 |
-
|
| 752 |
-
|
| 753 |
-
|
| 754 |
-
|
| 755 |
-
|
| 756 |
-
|
| 757 |
-
|
| 758 |
-
|
| 759 |
-
|
| 760 |
-
|
| 761 |
-
|
| 762 |
-
|
| 763 |
-
|
| 764 |
-
|
| 765 |
-
|
| 766 |
</div>
|
| 767 |
</div>
|
| 768 |
</div>
|
| 769 |
</div>
|
| 770 |
</div>
|
| 771 |
-
|
| 772 |
</div>
|
| 773 |
</section>
|
| 774 |
|
|
@@ -780,155 +832,223 @@
|
|
| 780 |
<h2 class="title is-3 section-title has-text-centered">Contributions & Future Work</h2>
|
| 781 |
|
| 782 |
<div class="content">
|
| 783 |
<!-- Contributions -->
|
| 784 |
<div class="box has-background-white-ter mb-5">
|
| 785 |
<h4 class="title is-4 has-text-centered mb-4">Key Contributions</h4>
|
| 786 |
-
|
| 787 |
-
<div class="columns is-multiline">
|
|
|
|
| 788 |
<!-- Contribution 1 -->
|
| 789 |
<div class="column is-6">
|
| 790 |
<div class="media">
|
| 791 |
<div class="media-left">
|
| 792 |
<span class="icon has-text-primary is-large">
|
| 793 |
-
<i class="fas fa-
|
| 794 |
</span>
|
| 795 |
</div>
|
| 796 |
<div class="media-content">
|
| 797 |
-
<p class="has-text-weight-semibold">
|
| 798 |
-
<p class="is-size-7">
|
| 799 |
</div>
|
| 800 |
</div>
|
| 801 |
</div>
|
| 802 |
-
|
| 803 |
<!-- Contribution 2 -->
|
| 804 |
<div class="column is-6">
|
| 805 |
<div class="media">
|
| 806 |
<div class="media-left">
|
| 807 |
<span class="icon has-text-primary is-large">
|
| 808 |
-
<i class="fas fa-
|
| 809 |
</span>
|
| 810 |
</div>
|
| 811 |
<div class="media-content">
|
| 812 |
-
<p class="has-text-weight-semibold">
|
| 813 |
-
<p class="is-size-7">
|
| 814 |
</div>
|
| 815 |
</div>
|
| 816 |
</div>
|
| 817 |
-
|
| 818 |
<!-- Contribution 3 -->
|
| 819 |
<div class="column is-6">
|
| 820 |
<div class="media">
|
| 821 |
<div class="media-left">
|
| 822 |
<span class="icon has-text-primary is-large">
|
| 823 |
-
<i class="fas fa-
|
| 824 |
</span>
|
| 825 |
</div>
|
| 826 |
<div class="media-content">
|
| 827 |
-
<p class="has-text-weight-semibold">
|
| 828 |
-
<p class="is-size-7">
|
| 829 |
</div>
|
| 830 |
</div>
|
| 831 |
</div>
|
| 832 |
-
|
| 833 |
<!-- Contribution 4 -->
|
| 834 |
<div class="column is-6">
|
| 835 |
<div class="media">
|
| 836 |
<div class="media-left">
|
| 837 |
<span class="icon has-text-primary is-large">
|
| 838 |
-
<i class="fas fa-
|
| 839 |
</span>
|
| 840 |
</div>
|
| 841 |
<div class="media-content">
|
| 842 |
-
<p class="has-text-weight-semibold">
|
| 843 |
-
<p class="is-size-7">
|
| 844 |
</div>
|
| 845 |
</div>
|
| 846 |
</div>
|
| 847 |
-
|
| 848 |
<!-- Contribution 5 -->
|
| 849 |
<div class="column is-6">
|
| 850 |
<div class="media">
|
| 851 |
<div class="media-left">
|
| 852 |
<span class="icon has-text-primary is-large">
|
| 853 |
-
<i class="fas fa-
|
| 854 |
</span>
|
| 855 |
</div>
|
| 856 |
<div class="media-content">
|
| 857 |
-
<p class="has-text-weight-semibold">Cost-Performance
|
| 858 |
-
<p class="is-size-7">
|
| 859 |
</div>
|
| 860 |
</div>
|
| 861 |
</div>
|
| 862 |
-
|
| 863 |
<!-- Contribution 6 -->
|
| 864 |
<div class="column is-6">
|
| 865 |
<div class="media">
|
| 866 |
<div class="media-left">
|
| 867 |
<span class="icon has-text-primary is-large">
|
| 868 |
-
<i class="fas fa-
|
| 869 |
</span>
|
| 870 |
</div>
|
| 871 |
<div class="media-content">
|
| 872 |
<p class="has-text-weight-semibold">Open-Source Implementation</p>
|
| 873 |
-
<p class="is-size-7">
|
| 874 |
</div>
|
| 875 |
</div>
|
| 876 |
</div>
|
|
|
|
| 877 |
</div>
|
| 878 |
</div>
|
| 879 |
|
| 880 |
<!-- Limitations & Future Work -->
|
| 881 |
-
|
| 882 |
-
|
| 883 |
-
|
| 884 |
-
|
| 885 |
-
|
| 886 |
-
|
| 887 |
-
|
| 888 |
-
|
| 889 |
-
|
| 890 |
-
|
| 891 |
-
|
| 892 |
-
|
| 893 |
-
|
| 894 |
-
|
| 895 |
-
|
| 896 |
-
|
| 897 |
-
|
| 898 |
-
|
| 899 |
-
|
| 900 |
-
</div>
|
| 901 |
-
</div>
|
| 902 |
</div>
|
| 903 |
-
|
| 904 |
-
|
| 905 |
-
|
| 906 |
-
|
| 907 |
-
|
| 908 |
-
|
| 909 |
-
|
| 910 |
-
|
| 911 |
-
|
| 912 |
-
|
| 913 |
-
|
| 914 |
-
|
| 915 |
-
|
| 916 |
-
|
| 917 |
-
<li>Extend to more languages beyond English</li>
|
| 918 |
-
<li>Explore few-shot and chain-of-thought prompting</li>
|
| 919 |
-
<li>Evaluate domain-adaptive training for finance</li>
|
| 920 |
-
<li>Expand dataset coverage across more financial sectors</li>
|
| 921 |
-
<li>Benchmark efficiency trade-offs</li>
|
| 922 |
-
<li>Develop more nuanced evaluation metrics</li>
|
| 923 |
-
</ul>
|
| 924 |
-
</div>
|
| 925 |
-
</div>
|
| 926 |
</div>
|
| 927 |
</div>
|
| 928 |
</div>
|
| 929 |
</div>
|
| 930 |
</div>
|
| 931 |
</div>
|
| 932 |
<!--/ Contributions -->
|
| 933 |
</div>
|
| 934 |
</section>
|
|
|
|
| 614 |
</div>
|
| 615 |
</div>
|
| 616 |
|
| 617 |
+
|
| 618 |
<!-- Task 3 -->
|
| 619 |
<div class="task-performance mb-4">
|
| 620 |
<div class="columns is-vcentered">
|
|
|
|
| 644 |
</div>
|
| 645 |
</div>
|
| 646 |
<div class="column is-2 has-text-centered">
|
| 647 |
+
<span class="tag is-info is-light">75-82 %</span>
|
| 648 |
</div>
|
| 649 |
</div>
|
| 650 |
</div>
|
|
|
|
| 672 |
</div>
|
| 673 |
</div>
|
| 674 |
|
| 675 |
+
<!-- </section>
|
| 676 |
+
</div> -->
|
| 677 |
+
<!-- <section class="section">
|
| 678 |
+
<div class="container"> -->
|
| 679 |
+
<!-- Model Performance Highlights -->
|
| 680 |
+
<div class="card mb-5">
|
| 681 |
+
<div class="card-header">
|
| 682 |
+
<p class="card-header-title">
|
| 683 |
+
<span class="icon mr-2"><i class="fas fa-medal"></i></span>
|
| 684 |
+
Model Performance Highlights
|
| 685 |
+
</p>
|
| 686 |
+
</div>
|
| 687 |
+
<div class="card-content">
|
| 688 |
+
<div class="content">
|
| 689 |
+
<p class="has-text-weight-bold mb-4 has-text-centered">🏆 Top Performing Models</p>
|
| 690 |
+
|
| 691 |
+
<div class="columns is-centered is-multiline">
|
| 692 |
+
<!-- DeepSeek R1 -->
|
| 693 |
+
<div class="column is-4">
|
| 694 |
+
<div class="is-flex is-flex-direction-column is-align-items-center">
|
| 695 |
+
<figure class="image is-128x128 mb-3">
|
| 696 |
+
<img src="static/images/deepseek_logo.png" alt="DeepSeek R1 Logo">
|
| 697 |
+
</figure>
|
| 698 |
+
<p class="is-size-4 has-text-weight-semibold mb-2">DeepSeek R1</p>
|
| 699 |
+
<span class="icon is-large has-text-warning"><i class="fas fa-trophy fa-2x"></i></span>
|
| 700 |
</div>
|
| 701 |
+
</div>
|
| 702 |
+
|
| 703 |
+
<!-- OpenAI o1-mini -->
|
| 704 |
+
<div class="column is-4">
|
| 705 |
+
<div class="is-flex is-flex-direction-column is-align-items-center">
|
| 706 |
+
<figure class="image is-128x128 mb-3">
|
| 707 |
+
<img src="static/images/openai_logo.png" alt="OpenAI Logo">
|
| 708 |
+
</figure>
|
| 709 |
+
<p class="is-size-4 has-text-weight-semibold mb-2">OpenAI o1-mini</p>
|
| 710 |
+
<span class="icon is-large has-text-grey"><i class="fas fa-trophy fa-2x"></i></span>
|
| 711 |
+
</div>
|
| 712 |
+
</div>
|
| 713 |
+
|
| 714 |
+
<!-- Claude 3.5 Sonnet -->
|
| 715 |
+
<div class="column is-4">
|
| 716 |
+
<div class="is-flex is-flex-direction-column is-align-items-center">
|
| 717 |
+
<figure class="image is-128x128 mb-3">
|
| 718 |
+
<img src="static/images/claude_logo.png" alt="Claude 3.5 Sonnet Logo">
|
| 719 |
+
</figure>
|
| 720 |
+
<p class="is-size-4 has-text-weight-semibold mb-2">Claude 3.5 Sonnet</p>
|
| 721 |
+
<span class="icon is-large has-text-bronze"><i class="fas fa-trophy fa-2x"></i></span>
|
| 722 |
+
</div>
|
| 723 |
+
</div>
|
| 724 |
+
|
| 725 |
+
</div>
|
| 726 |
+
|
| 727 |
+
<hr>
|
| 728 |
+
|
| 729 |
+
<p class="has-text-weight-bold mb-3">🔍 Key Insights from Model Analysis</p>
|
| 730 |
+
|
| 731 |
+
<div class="notification is-info is-light py-3 px-4">
|
| 732 |
+
<p><strong>🏆 No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 Sonnet excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
|
| 733 |
+
<p><strong>⚖️ Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
|
| 734 |
+
<p><strong>🛠️ Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
|
| 735 |
+
<p><strong>💰 Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
|
| 736 |
+
<p><strong>📉 Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
|
| 737 |
+
<p><strong>🔢 Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
|
| 738 |
+
</div>
|
| 739 |
+
</div>
|
| 740 |
+
</div>
|
| 741 |
+
</div>
|
| 742 |
+
<!-- Error Analysis & Key Findings -->
|
| 743 |
+
<div class="card">
|
| 744 |
+
<div class="card-header">
|
| 745 |
+
<p class="card-header-title">
|
| 746 |
+
<span class="icon mr-2"><i class="fas fa-exclamation-triangle"></i></span>
|
| 747 |
+
Error Analysis & Key Findings
|
| 748 |
+
</p>
|
| 749 |
+
</div>
|
| 750 |
+
<div class="card-content">
|
| 751 |
+
<div class="content">
|
| 752 |
+
<p class="mb-4">Common challenges and limitations identified in our evaluations:</p>
|
| 753 |
+
|
| 754 |
+
<!-- Individual Error Categories -->
|
| 755 |
+
<div class="error-category mb-4">
|
| 756 |
+
<p class="has-text-weight-bold mb-1">Concerns regarding outdated models</p>
|
| 757 |
+
<div class="notification is-danger is-light py-2 px-3">
|
| 758 |
+
<p class="is-size-7 mb-0"><strong>Llama 2 13B Chat</strong> produces trivial or empty responses, possibly due to misalignment during fine-tuning.</p>
|
| 759 |
+
</div>
|
| 760 |
+
</div>
|
| 761 |
+
|
| 762 |
+
<div class="error-category mb-4">
|
| 763 |
+
<p class="has-text-weight-bold mb-1">Numeric Regression Issues</p>
|
| 764 |
+
<div class="notification is-danger is-light py-2 px-3">
|
| 765 |
+
<p class="is-size-7 mb-0">LMs struggle with precision and rounding in continuous-valued regressions (e.g., financial percentages). Post-hoc normalization is needed.</p>
|
| 766 |
+
</div>
|
| 767 |
+
</div>
|
| 768 |
+
|
| 769 |
+
<div class="error-category mb-4">
|
| 770 |
+
<p class="has-text-weight-bold mb-1">Data Contamination</p>
|
| 771 |
+
<div class="notification is-danger is-light py-2 px-3">
|
| 772 |
+
<p class="is-size-7 mb-0">Overlap between public financial datasets and pretraining corpora can inflate zero-shot performance, requiring time-split test sets.</p>
|
| 773 |
+
</div>
|
| 774 |
+
</div>
|
| 775 |
+
|
| 776 |
+
<div class="error-category mb-4">
|
| 777 |
+
<p class="has-text-weight-bold mb-1">Challenges in Causal Classification</p>
|
| 778 |
+
<div class="notification is-danger is-light py-2 px-3">
|
| 779 |
+
<p class="is-size-7 mb-0">Most models struggle with financial causal reasoning, requiring structured knowledge bases or explicit symbolic reasoning.</p>
|
| 780 |
+
</div>
|
| 781 |
+
</div>
|
| 782 |
+
|
| 783 |
+
|
| 784 |
+
<div class="error-category mb-4">
|
| 785 |
+
<p class="has-text-weight-bold mb-1">Language Drift</p>
|
| 786 |
+
<div class="notification is-warning is-light py-2 px-3">
|
| 787 |
+
<p class="is-size-7 mb-0"><strong>Qwen 2 72B</strong> exhibits unintended shifts to Chinese output in English summarization tasks, indicating strong pretraining priors.</p>
|
| 788 |
+
</div>
|
| 789 |
+
</div>
|
| 790 |
+
|
| 791 |
+
<div class="error-category mb-4">
|
| 792 |
+
<p class="has-text-weight-bold mb-1">Summarization Nuances</p>
|
| 793 |
+
<div class="notification is-warning is-light py-2 px-3">
|
| 794 |
+
<p class="is-size-7 mb-0">Models achieve high BERTScores (~80-82%) on extractive summarization but perform worse on abstractive tasks, especially those involving finance-specific jargon.</p>
|
| 795 |
+
</div>
|
| 796 |
+
</div>
|
| 797 |
+
|
| 798 |
+
|
| 799 |
+
<div class="error-category mb-4">
|
| 800 |
+
<p class="has-text-weight-bold mb-1">Prompt Design Limitations</p>
|
| 801 |
+
<div class="notification is-warning is-light py-2 px-3">
|
| 802 |
+
<p class="is-size-7 mb-0">Prompts tuned on <strong>Llama 3 8B</strong> may not generalize across models, leading to inconsistencies in label generation (e.g., minor syntactic variations).</p>
|
| 803 |
+
</div>
|
| 804 |
+
</div>
|
| 805 |
+
|
| 806 |
+
<div class="error-category mb-4">
|
| 807 |
+
<p class="has-text-weight-bold mb-1">Differences in QA Datasets</p>
|
| 808 |
+
<div class="notification is-warning is-light py-2 px-3">
|
| 809 |
+
<p class="is-size-7 mb-0">Models consistently score lower on <strong>ConvFinQA</strong> than on <strong>FinQA</strong>, owing to ConvFinQA's multi-turn dialogue complexity.</p>
|
| 810 |
</div>
|
| 811 |
</div>
|
| 812 |
+
|
| 813 |
+
<div class="error-category mb-4">
|
| 814 |
+
<p class="has-text-weight-bold mb-1">Efficiency and Cost Considerations</p>
|
| 815 |
+
<div class="notification is-warning is-light py-2 px-3">
|
| 816 |
+
<p class="is-size-7 mb-0">Inference costs vary by up to <strong>2×</strong> among similarly sized models, requiring a balance between performance and resource usage.</p>
|
| 817 |
+
</div>
|
| 818 |
+
</div>
|
| 819 |
+
|
| 820 |
</div>
|
| 821 |
</div>
|
| 822 |
</div>
|
| 823 |
+
|
| 824 |
</div>
|
| 825 |
</section>
|
| 826 |
|
|
|
|
| 832 |
<h2 class="title is-3 section-title has-text-centered">Contributions & Future Work</h2>
|
| 833 |
|
| 834 |
<div class="content">
|
| 835 |
+
<!-- Contributions Overview -->
|
| 836 |
+
<div class="notification is-info is-light has-text-centered mb-5">
|
| 837 |
+
<p class="is-size-5 has-text-weight-semibold">
|
| 838 |
+
Our work introduces a standardized, large-scale, and holistic evaluation framework for financial language models.
|
| 839 |
+
</p>
|
| 840 |
+
</div>
|
| 841 |
+
|
| 842 |
<!-- Contributions -->
|
| 843 |
<div class="box has-background-white-ter mb-5">
|
| 844 |
<h4 class="title is-4 has-text-centered mb-4">Key Contributions</h4>
|
| 845 |
+
|
| 846 |
+
<div class="columns is-multiline is-centered">
|
| 847 |
+
|
| 848 |
<!-- Contribution 1 -->
|
| 849 |
<div class="column is-6">
|
| 850 |
<div class="media">
|
| 851 |
<div class="media-left">
|
| 852 |
<span class="icon has-text-primary is-large">
|
| 853 |
+
<i class="fas fa-cogs fa-lg"></i>
|
| 854 |
</span>
|
| 855 |
</div>
|
| 856 |
<div class="media-content">
|
| 857 |
+
<p class="has-text-weight-semibold">Standardized Evaluation Framework</p>
|
| 858 |
+
<p class="is-size-7">We introduce an open-source, modular benchmarking suite for systematic LM evaluations on core financial NLP tasks.</p>
|
| 859 |
</div>
|
| 860 |
</div>
|
| 861 |
</div>
|
| 862 |
+
|
| 863 |
<!-- Contribution 2 -->
|
| 864 |
<div class="column is-6">
|
| 865 |
<div class="media">
|
| 866 |
<div class="media-left">
|
| 867 |
<span class="icon has-text-primary is-large">
|
| 868 |
+
<i class="fas fa-chart-line fa-lg"></i>
|
| 869 |
</span>
|
| 870 |
</div>
|
| 871 |
<div class="media-content">
|
| 872 |
+
<p class="has-text-weight-semibold">Large-Scale Model Assessment</p>
|
| 873 |
+
<p class="is-size-7">We benchmark 23 foundation LMs—open-weight and proprietary—across 20 financial tasks, revealing performance-cost trade-offs.</p>
|
| 874 |
</div>
|
| 875 |
</div>
|
| 876 |
</div>
|
| 877 |
+
|
| 878 |
<!-- Contribution 3 -->
|
| 879 |
<div class="column is-6">
|
| 880 |
<div class="media">
|
| 881 |
<div class="media-left">
|
| 882 |
<span class="icon has-text-primary is-large">
|
| 883 |
+
<i class="fas fa-database fa-lg"></i>
|
| 884 |
</span>
|
| 885 |
</div>
|
| 886 |
<div class="media-content">
|
| 887 |
+
<p class="has-text-weight-semibold">Holistic Dataset Taxonomy</p>
|
| 888 |
+
<p class="is-size-7">We establish a structured dataset taxonomy, categorizing financial NLP tasks based on domain, data format, and linguistic complexity.</p>
|
| 889 |
</div>
|
| 890 |
</div>
|
| 891 |
</div>
|
| 892 |
+
|
| 893 |
<!-- Contribution 4 -->
|
| 894 |
<div class="column is-6">
|
| 895 |
<div class="media">
|
| 896 |
<div class="media-left">
|
| 897 |
<span class="icon has-text-primary is-large">
|
| 898 |
+
<i class="fas fa-users fa-lg"></i>
|
| 899 |
</span>
|
| 900 |
</div>
|
| 901 |
<div class="media-content">
|
| 902 |
+
<p class="has-text-weight-semibold">Living Benchmark & Open Collaboration</p>
|
| 903 |
+
<p class="is-size-7">We introduce a continuously updated leaderboard, inviting researchers to contribute new datasets and evaluation results.</p>
|
| 904 |
</div>
|
| 905 |
</div>
|
| 906 |
</div>
|
| 907 |
+
|
| 908 |
<!-- Contribution 5 -->
|
| 909 |
<div class="column is-6">
|
| 910 |
<div class="media">
|
| 911 |
<div class="media-left">
|
| 912 |
<span class="icon has-text-primary is-large">
|
| 913 |
+
<i class="fas fa-balance-scale fa-lg"></i>
|
| 914 |
</span>
|
| 915 |
</div>
|
| 916 |
<div class="media-content">
|
| 917 |
+
<p class="has-text-weight-semibold">Error Analysis & Cost-Performance Insights</p>
|
| 918 |
+
<p class="is-size-7">We analyze systematic model errors and quantify cost-performance trade-offs for informed deployment in real-world applications.</p>
|
| 919 |
</div>
|
| 920 |
</div>
|
| 921 |
</div>
|
| 922 |
+
|
| 923 |
<!-- Contribution 6 -->
|
| 924 |
<div class="column is-6">
|
| 925 |
<div class="media">
|
| 926 |
<div class="media-left">
|
| 927 |
<span class="icon has-text-primary is-large">
|
| 928 |
+
<i class="fas fa-code-branch fa-lg"></i>
|
| 929 |
</span>
|
| 930 |
</div>
|
| 931 |
<div class="media-content">
|
| 932 |
<p class="has-text-weight-semibold">Open-Source Implementation</p>
|
| 933 |
+
<p class="is-size-7">We release a fully open-source framework, enabling the research community to extend and refine financial LM evaluation methodologies.</p>
|
| 934 |
</div>
|
| 935 |
</div>
|
| 936 |
</div>
|
| 937 |
+
|
| 938 |
</div>
|
| 939 |
</div>
|
| 940 |
|
| 941 |
<!-- Limitations & Future Work -->
|
| 942 |
+
<div class="columns is-multiline">
|
| 943 |
+
<!-- Limitations -->
|
| 944 |
+
<div class="column is-6">
|
| 945 |
+
<div class="card h-100">
|
| 946 |
+
<div class="card-header">
|
| 947 |
+
<p class="card-header-title">
|
| 948 |
+
<span class="icon mr-2"><i class="fas fa-exclamation-circle"></i></span>
|
| 949 |
+
Limitations
|
| 950 |
+
</p>
|
| 951 |
+
</div>
|
| 952 |
+
<div class="card-content">
|
| 953 |
+
<div class="content">
|
| 954 |
+
<p class="mb-3">
|
| 955 |
+
While our benchmark provides valuable insights, several limitations must be acknowledged:
|
| 956 |
+
</p>
|
| 957 |
+
|
| 958 |
+
<div class="notification is-danger is-light py-2 px-3 mb-3">
|
| 959 |
+
<p class="has-text-weight-bold mb-1">❌ Data Contamination Risks</p>
|
| 960 |
+
<p class="is-size-7 mb-0">Benchmark testing data may overlap with model pretraining corpora, leading to artificially inflated performance. We are actively developing novel datasets to mitigate these risks.</p>
|
| 961 |
</div>
|
| 962 |
+
|
| 963 |
+
<div class="notification is-warning is-light py-2 px-3 mb-3">
|
| 964 |
+
<p class="has-text-weight-bold mb-1">⚠️ Dataset Size & Diversity</p>
|
| 965 |
+
<p class="is-size-7 mb-0">Our dataset scope is limited, affecting model generalization across diverse financial domains and languages.</p>
|
| 966 |
+
</div>
|
| 967 |
+
|
| 968 |
+
<div class="notification is-warning is-light py-2 px-3 mb-3">
|
| 969 |
+
<p class="has-text-weight-bold mb-1">⚠️ Zero-Shot Focus</p>
|
| 970 |
+
<p class="is-size-7 mb-0">Due to budget constraints, our evaluations rely on zero-shot learning only, without fine-tuning or few-shot prompting.</p>
|
| 971 |
+
</div>
|
| 972 |
+
|
| 973 |
+
<div class="notification is-warning is-light py-2 px-3 mb-3">
|
| 974 |
+
<p class="has-text-weight-bold mb-1">⚠️ Limited Adaptation Strategies</p>
|
| 975 |
+
<p class="is-size-7 mb-0">We do not explore chain-of-thought reasoning or advanced prompting, though these techniques are known to improve model performance.</p>
|
| 976 |
</div>
|
| 977 |
+
|
| 978 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 979 |
+
<p class="has-text-weight-bold mb-1">ℹ️ English Language Bias</p>
|
| 980 |
+
<p class="is-size-7 mb-0">The benchmark focuses primarily on English because most available financial datasets are in English, which limits insights into multilingual model performance.</p>
|
| 981 |
+
</div>
|
| 982 |
+
|
| 983 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 984 |
+
<p class="has-text-weight-bold mb-1">ℹ️ Real-World Complex Tasks</p>
|
| 985 |
+
<p class="is-size-7 mb-0">Existing tasks do not fully capture the dynamic and evolving nature of financial markets, requiring ongoing dataset expansion.</p>
|
| 986 |
+
</div>
|
| 987 |
+
|
| 988 |
+
<p class="is-italic is-size-7 mt-4">
|
| 989 |
+
Recognizing these limitations is essential for improving future financial NLP benchmarks. Our ongoing work aims to address these challenges through dataset refinement, broader task coverage, and multilingual support.
|
| 990 |
+
</p>
|
| 991 |
</div>
|
| 992 |
</div>
|
| 993 |
</div>
|
| 994 |
</div>
|
| 995 |
+
|
| 996 |
+
<!-- Future Work -->
|
| 997 |
+
<div class="column is-6">
|
| 998 |
+
<div class="card h-100">
|
| 999 |
+
<div class="card-header">
|
| 1000 |
+
<p class="card-header-title">
|
| 1001 |
+
<span class="icon mr-2"><i class="fas fa-lightbulb"></i></span>
|
| 1002 |
+
Future Work
|
| 1003 |
+
</p>
|
| 1004 |
+
</div>
|
| 1005 |
+
<div class="card-content">
|
| 1006 |
+
<div class="content">
|
| 1007 |
+
<p class="mb-3">
|
| 1008 |
+
To strengthen the robustness and adaptability of our framework, we advocate for open collaboration within the research community
|
| 1009 |
+
and propose the following future directions to expand its capabilities:
|
| 1010 |
+
</p>
|
| 1011 |
+
|
| 1012 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 1013 |
+
<p class="has-text-weight-bold mb-1">🌍 Multilingual Expansion</p>
|
| 1014 |
+
<p class="is-size-7 mb-0">Extending benchmarks beyond English to include multilingual financial datasets and evaluations.</p>
|
| 1015 |
+
</div>
|
| 1016 |
+
|
| 1017 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 1018 |
+
<p class="has-text-weight-bold mb-1">🧠 Few-Shot & Chain-of-Thought</p>
|
| 1019 |
+
<p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
|
| 1020 |
+
</div>
|
| 1021 |
+
|
| 1022 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 1023 |
+
<p class="has-text-weight-bold mb-1">📊 Domain-Adaptive Training</p>
|
| 1024 |
+
<p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of finance-specific terminology and reasoning.</p>
|
| 1025 |
+
</div>
|
| 1026 |
+
|
| 1027 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 1028 |
+
<p class="has-text-weight-bold mb-1">🔍 Expanded Dataset Coverage</p>
|
| 1029 |
+
<p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
|
| 1030 |
+
</div>
|
| 1031 |
+
|
| 1032 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 1033 |
+
<p class="has-text-weight-bold mb-1">⚖️ Efficiency & Cost Benchmarking</p>
|
| 1034 |
+
<p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
|
| 1035 |
+
</div>
|
| 1036 |
+
|
| 1037 |
+
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 1038 |
+
<p class="has-text-weight-bold mb-1">📈 Advanced Evaluation Metrics</p>
|
| 1039 |
+
<p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
|
| 1040 |
+
</div>
|
| 1041 |
+
|
| 1042 |
+
<p class="is-italic is-size-7 mt-4">
|
| 1043 |
+
These improvements will enable more accurate and fair comparisons of financial language models,
|
| 1044 |
+
fostering greater transparency, reproducibility, and real-world applicability.
|
| 1045 |
+
</p>
|
| 1046 |
+
</div>
|
| 1047 |
+
</div>
|
| 1048 |
+
</div>
|
| 1049 |
</div>
|
| 1050 |
+
</div>
|
| 1051 |
+
|
| 1052 |
<!--/ Contributions -->
|
| 1053 |
</div>
|
| 1054 |
</section>
|