Huzaifa Pardawala committed on
Commit
9ed293f
·
1 Parent(s): 464a24c

adding changes to the results and contributions sections

Files changed (1)
  1. index.html +281 -161
index.html CHANGED
@@ -614,6 +614,7 @@
614
  </div>
615
  </div>
616
 
 
617
  <!-- Task 3 -->
618
  <div class="task-performance mb-4">
619
  <div class="columns is-vcentered">
@@ -643,7 +644,7 @@
643
  </div>
644
  </div>
645
  <div class="column is-2 has-text-centered">
646
- <span class="tag is-info is-light">0.75-0.82</span>
647
  </div>
648
  </div>
649
  </div>
@@ -671,104 +672,155 @@
671
  </div>
672
  </div>
673
 
674
- <!-- Results details in cards -->
675
- <div class="columns is-multiline">
676
- <!-- Model Performance Card -->
677
- <div class="column is-6">
678
- <div class="card h-100">
679
- <div class="card-header">
680
- <p class="card-header-title">
681
- <span class="icon mr-2"><i class="fas fa-medal"></i></span>
682
- Model Performance Highlights
683
- </p>
684
- </div>
685
- <div class="card-content">
686
- <div class="content">
687
- <div class="model-ranking mb-4">
688
- <p class="has-text-weight-bold mb-2">Top Performers:</p>
689
- <div class="columns is-mobile">
690
- <div class="column">
691
- <div class="has-text-centered">
692
- <span class="icon is-large has-text-warning"><i class="fas fa-trophy fa-2x"></i></span>
693
- <p class="mt-2 mb-0">DeepSeek R1</p>
694
- </div>
695
- </div>
696
- <div class="column">
697
- <div class="has-text-centered">
698
- <span class="icon is-large has-text-grey"><i class="fas fa-trophy fa-2x"></i></span>
699
- <p class="mt-2 mb-0">OpenAI o1-mini</p>
700
- </div>
701
- </div>
702
- <div class="column">
703
- <div class="has-text-centered">
704
- <span class="icon is-large has-text-bronze"><i class="fas fa-trophy fa-2x"></i></span>
705
- <p class="mt-2 mb-0">Claude 3.5 Sonnet</p>
706
- </div>
707
- </div>
708
- </div>
709
- </div>
710
-
711
- <p class="mb-2">Key insights from our model analysis:</p>
712
- <ul>
713
- <li>Inconsistent scaling: larger parameter sizes do not guarantee higher performance</li>
714
- <li>Open-weight models show competitive performance on many tasks</li>
715
- <li>Dramatic price differences between models ($4-260 USD)</li>
716
- <li>Cost-conscious choices should be based on specific use cases</li>
717
- </ul>
718
- </div>
719
- </div>
720
- </div>
721
  </div>
722
-
723
- <!-- Error Analysis Card -->
724
- <div class="column is-6">
725
- <div class="card h-100">
726
- <div class="card-header">
727
- <p class="card-header-title">
728
- <span class="icon mr-2"><i class="fas fa-exclamation-triangle"></i></span>
729
- Error Analysis
730
- </p>
731
- </div>
732
- <div class="card-content">
733
- <div class="content">
734
- <p class="mb-3">Common error patterns identified across models:</p>
735
-
736
- <div class="error-category mb-3">
737
- <p class="has-text-weight-bold mb-1">Numeric Reasoning</p>
738
- <div class="notification is-danger is-light py-2 px-3">
739
- <p class="is-size-7 mb-0">Models struggled with consistent numeric formats and financial calculations</p>
740
- </div>
741
- </div>
742
-
743
- <div class="error-category mb-3">
744
- <p class="has-text-weight-bold mb-1">Language Consistency</p>
745
- <div class="notification is-warning is-light py-2 px-3">
746
- <p class="is-size-7 mb-0">Occasional non-English outputs or language drift</p>
747
- </div>
748
- </div>
749
-
750
- <div class="error-category mb-3">
751
- <p class="has-text-weight-bold mb-1">Classification Complexity</p>
752
- <div class="notification is-warning is-light py-2 px-3">
753
- <p class="is-size-7 mb-0">Difficulties with longer label sets and fine-grained distinctions</p>
754
- </div>
755
- </div>
756
-
757
- <div class="error-category mb-3">
758
- <p class="has-text-weight-bold mb-1">Causal Reasoning</p>
759
- <div class="notification is-danger is-light py-2 px-3">
760
- <p class="is-size-7 mb-0">Challenges with cause-effect relationships due to data scarcity</p>
761
- </div>
762
- </div>
763
- </div>
764
- </div>
765
- </div>
766
  </div>
767
  </div>
 
 
 
 
 
 
 
 
768
  </div>
769
  </div>
770
  </div>
771
- <!--/ Results -->
772
  </div>
773
  </section>
774
 
@@ -780,155 +832,223 @@
780
  <h2 class="title is-3 section-title has-text-centered">Contributions & Future Work</h2>
781
 
782
  <div class="content">
 
 
 
 
 
 
 
783
  <!-- Contributions -->
784
  <div class="box has-background-white-ter mb-5">
785
  <h4 class="title is-4 has-text-centered mb-4">Key Contributions</h4>
786
-
787
- <div class="columns is-multiline">
 
788
  <!-- Contribution 1 -->
789
  <div class="column is-6">
790
  <div class="media">
791
  <div class="media-left">
792
  <span class="icon has-text-primary is-large">
793
- <i class="fas fa-check-circle fa-lg"></i>
794
  </span>
795
  </div>
796
  <div class="media-content">
797
- <p class="has-text-weight-semibold">First Holistic Benchmark</p>
798
- <p class="is-size-7">The first benchmarking suite specifically designed for financial language model evaluation</p>
799
  </div>
800
  </div>
801
  </div>
802
-
803
  <!-- Contribution 2 -->
804
  <div class="column is-6">
805
  <div class="media">
806
  <div class="media-left">
807
  <span class="icon has-text-primary is-large">
808
- <i class="fas fa-check-circle fa-lg"></i>
809
  </span>
810
  </div>
811
  <div class="media-content">
812
- <p class="has-text-weight-semibold">Comprehensive Taxonomy</p>
813
- <p class="is-size-7">Organization of financial NLP tasks by task type, domain, and language</p>
814
  </div>
815
  </div>
816
  </div>
817
-
818
  <!-- Contribution 3 -->
819
  <div class="column is-6">
820
  <div class="media">
821
  <div class="media-left">
822
  <span class="icon has-text-primary is-large">
823
- <i class="fas fa-check-circle fa-lg"></i>
824
  </span>
825
  </div>
826
  <div class="media-content">
827
- <p class="has-text-weight-semibold">Standardized Framework</p>
828
- <p class="is-size-7">Modular design for customizable assessment across tasks</p>
829
  </div>
830
  </div>
831
  </div>
832
-
833
  <!-- Contribution 4 -->
834
  <div class="column is-6">
835
  <div class="media">
836
  <div class="media-left">
837
  <span class="icon has-text-primary is-large">
838
- <i class="fas fa-check-circle fa-lg"></i>
839
  </span>
840
  </div>
841
  <div class="media-content">
842
- <p class="has-text-weight-semibold">Model Comparison</p>
843
- <p class="is-size-7">Thorough comparison of open and closed source models</p>
844
  </div>
845
  </div>
846
  </div>
847
-
848
  <!-- Contribution 5 -->
849
  <div class="column is-6">
850
  <div class="media">
851
  <div class="media-left">
852
  <span class="icon has-text-primary is-large">
853
- <i class="fas fa-check-circle fa-lg"></i>
854
  </span>
855
  </div>
856
  <div class="media-content">
857
- <p class="has-text-weight-semibold">Cost-Performance Analysis</p>
858
- <p class="is-size-7">Analysis of tradeoffs for different language models</p>
859
  </div>
860
  </div>
861
  </div>
862
-
863
  <!-- Contribution 6 -->
864
  <div class="column is-6">
865
  <div class="media">
866
  <div class="media-left">
867
  <span class="icon has-text-primary is-large">
868
- <i class="fas fa-check-circle fa-lg"></i>
869
  </span>
870
  </div>
871
  <div class="media-content">
872
  <p class="has-text-weight-semibold">Open-Source Implementation</p>
873
- <p class="is-size-7">Framework allowing researchers to extend the benchmark</p>
874
  </div>
875
  </div>
876
  </div>
 
877
  </div>
878
  </div>
879
 
880
  <!-- Limitations & Future Work -->
881
- <div class="columns is-multiline">
882
- <!-- Limitations -->
883
- <div class="column is-6">
884
- <div class="card h-100">
885
- <div class="card-header">
886
- <p class="card-header-title">
887
- <span class="icon mr-2"><i class="fas fa-exclamation-circle"></i></span>
888
- Limitations
889
- </p>
890
- </div>
891
- <div class="card-content">
892
- <div class="content">
893
- <ul>
894
- <li>Limited dataset size and diversity</li>
895
- <li>Current focus on zero-shot scenarios only</li>
896
- <li>English-language focus due to availability of benchmarks</li>
897
- <li>No evaluation of advanced prompting techniques</li>
898
- <li>Tasks don't capture full breadth of real-world financial scenarios</li>
899
- </ul>
900
- </div>
901
- </div>
902
  </div>
903
- </div>
904
-
905
- <!-- Future Work -->
906
- <div class="column is-6">
907
- <div class="card h-100">
908
- <div class="card-header">
909
- <p class="card-header-title">
910
- <span class="icon mr-2"><i class="fas fa-lightbulb"></i></span>
911
- Future Work
912
- </p>
913
- </div>
914
- <div class="card-content">
915
- <div class="content">
916
- <ul>
917
- <li>Extend to more languages beyond English</li>
918
- <li>Explore few-shot and chain-of-thought prompting</li>
919
- <li>Evaluate domain-adaptive training for finance</li>
920
- <li>Expand dataset coverage across more financial sectors</li>
921
- <li>Benchmark efficiency trade-offs</li>
922
- <li>Develop more nuanced evaluation metrics</li>
923
- </ul>
924
- </div>
925
- </div>
926
  </div>
927
  </div>
928
  </div>
929
  </div>
930
  </div>
931
  </div>
 
 
932
  <!--/ Contributions -->
933
  </div>
934
  </section>
 
614
  </div>
615
  </div>
616
 
617
+
618
  <!-- Task 3 -->
619
  <div class="task-performance mb-4">
620
  <div class="columns is-vcentered">
 
644
  </div>
645
  </div>
646
  <div class="column is-2 has-text-centered">
647
+ <span class="tag is-info is-light">75-82%</span>
648
  </div>
649
  </div>
650
  </div>
 
672
  </div>
673
  </div>
674
 
675
+ <!-- </section>
676
+ </div> -->
677
+ <!-- <section class="section">
678
+ <div class="container"> -->
679
+ <!-- Model Performance Highlights -->
680
+ <div class="card mb-5">
681
+ <div class="card-header">
682
+ <p class="card-header-title">
683
+ <span class="icon mr-2"><i class="fas fa-medal"></i></span>
684
+ Model Performance Highlights
685
+ </p>
686
+ </div>
687
+ <div class="card-content">
688
+ <div class="content">
689
+ <p class="has-text-weight-bold mb-4 has-text-centered">🏆 Top Performing Models</p>
690
+
691
+ <div class="columns is-centered is-multiline">
692
+ <!-- DeepSeek R1 -->
693
+ <div class="column is-4">
694
+ <div class="is-flex is-flex-direction-column is-align-items-center">
695
+ <figure class="image is-128x128 mb-3">
696
+ <img src="static/images/deepseek_logo.png" alt="DeepSeek R1 Logo">
697
+ </figure>
698
+ <p class="is-size-4 has-text-weight-semibold mb-2">DeepSeek R1</p>
699
+ <span class="icon is-large has-text-warning"><i class="fas fa-trophy fa-2x"></i></span>
700
  </div>
701
+ </div>
702
+
703
+ <!-- OpenAI o1-mini -->
704
+ <div class="column is-4">
705
+ <div class="is-flex is-flex-direction-column is-align-items-center">
706
+ <figure class="image is-128x128 mb-3">
707
+ <img src="static/images/openai_logo.png" alt="OpenAI Logo">
708
+ </figure>
709
+ <p class="is-size-4 has-text-weight-semibold mb-2">OpenAI o1-mini</p>
710
+ <span class="icon is-large has-text-grey"><i class="fas fa-trophy fa-2x"></i></span>
711
+ </div>
712
+ </div>
713
+
714
+ <!-- Claude 3.5 Sonnet -->
715
+ <div class="column is-4">
716
+ <div class="is-flex is-flex-direction-column is-align-items-center">
717
+ <figure class="image is-128x128 mb-3">
718
+ <img src="static/images/claude_logo.png" alt="Claude 3.5 Sonnet Logo">
719
+ </figure>
720
+ <p class="is-size-4 has-text-weight-semibold mb-2">Claude 3.5 Sonnet</p>
721
+ <span class="icon is-large has-text-bronze"><i class="fas fa-trophy fa-2x"></i></span>
722
+ </div>
723
+ </div>
724
+
725
+ </div>
726
+
727
+ <hr>
728
+
729
+ <p class="has-text-weight-bold mb-3">🔍 Key Insights from Model Analysis</p>
730
+
731
+ <div class="notification is-info is-light py-3 px-4">
732
+ <p><strong>🏆 No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 Sonnet excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
733
+ <p><strong>⚖️ Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
734
+ <p><strong>🛠️ Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
735
+ <p><strong>💰 Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
736
+ <p><strong>📉 Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
737
+ <p><strong>🔢 Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
738
+ </div>
739
+ </div>
740
+ </div>
741
+ </div>
742
+ <!-- Error Analysis & Key Findings -->
743
+ <div class="card">
744
+ <div class="card-header">
745
+ <p class="card-header-title">
746
+ <span class="icon mr-2"><i class="fas fa-exclamation-triangle"></i></span>
747
+ Error Analysis & Key Findings
748
+ </p>
749
+ </div>
750
+ <div class="card-content">
751
+ <div class="content">
752
+ <p class="mb-4">Common challenges and limitations identified in our evaluations:</p>
753
+
754
+ <!-- Individual Error Categories -->
755
+ <div class="error-category mb-4">
756
+ <p class="has-text-weight-bold mb-1">Concerns regarding outdated models</p>
757
+ <div class="notification is-danger is-light py-2 px-3">
758
+ <p class="is-size-7 mb-0"><strong>Llama 2 13B Chat</strong> produces trivial or empty responses, possibly due to misalignment during fine-tuning.</p>
759
+ </div>
760
+ </div>
761
+
762
+ <div class="error-category mb-4">
763
+ <p class="has-text-weight-bold mb-1">Numeric Regression Issues</p>
764
+ <div class="notification is-danger is-light py-2 px-3">
765
+ <p class="is-size-7 mb-0">LMs struggle with precision and rounding in continuous-valued regressions (e.g., financial percentages). Post-hoc normalization is needed.</p>
766
+ </div>
767
+ </div>
768
+
769
+ <div class="error-category mb-4">
770
+ <p class="has-text-weight-bold mb-1">Data Contamination</p>
771
+ <div class="notification is-danger is-light py-2 px-3">
772
+ <p class="is-size-7 mb-0">Overlap between public financial datasets and pretraining corpora can inflate zero-shot performance, requiring time-split test sets.</p>
773
+ </div>
774
+ </div>
775
+
776
+ <div class="error-category mb-4">
777
+ <p class="has-text-weight-bold mb-1">Challenges in Causal Classification</p>
778
+ <div class="notification is-danger is-light py-2 px-3">
779
+ <p class="is-size-7 mb-0">Most models struggle with financial causal reasoning, requiring structured knowledge bases or explicit symbolic reasoning.</p>
780
+ </div>
781
+ </div>
782
+
783
+
784
+ <div class="error-category mb-4">
785
+ <p class="has-text-weight-bold mb-1">Language Drift</p>
786
+ <div class="notification is-warning is-light py-2 px-3">
787
+ <p class="is-size-7 mb-0"><strong>Qwen 2 72B</strong> exhibits unintended shifts to Chinese output in English summarization tasks, indicating strong pretraining priors.</p>
788
+ </div>
789
+ </div>
790
+
791
+ <div class="error-category mb-4">
792
+ <p class="has-text-weight-bold mb-1">Summarization Nuances</p>
793
+ <div class="notification is-warning is-light py-2 px-3">
794
+ <p class="is-size-7 mb-0">Models achieve high BERTScores (~80-82%) on extractive summarization but score lower on abstractive tasks, especially on text dense with finance-specific jargon.</p>
795
+ </div>
796
+ </div>
797
+
798
+
799
+ <div class="error-category mb-4">
800
+ <p class="has-text-weight-bold mb-1">Prompt Design Limitations</p>
801
+ <div class="notification is-warning is-light py-2 px-3">
802
+ <p class="is-size-7 mb-0">Prompts tuned on <strong>Llama 3 8B</strong> may not generalize across models, leading to inconsistencies in label generation (e.g., minor syntactic variations).</p>
803
+ </div>
804
+ </div>
805
+
806
+ <div class="error-category mb-4">
807
+ <p class="has-text-weight-bold mb-1">Differences in QA Datasets</p>
808
+ <div class="notification is-warning is-light py-2 px-3">
809
+ <p class="is-size-7 mb-0">Models consistently score lower on <strong>ConvFinQA</strong> than on <strong>FinQA</strong> because of ConvFinQA's multi-turn dialogue complexity.</p>
810
  </div>
811
  </div>
812
+
813
+ <div class="error-category mb-4">
814
+ <p class="has-text-weight-bold mb-1">Efficiency and Cost Considerations</p>
815
+ <div class="notification is-warning is-light py-2 px-3">
816
+ <p class="is-size-7 mb-0">Inference costs vary by up to <strong>2×</strong> among similarly sized models, requiring a balance between performance and resource usage.</p>
817
+ </div>
818
+ </div>
819
+
820
  </div>
821
  </div>
822
  </div>
823
+
824
  </div>
825
  </section>
826
 
 
832
  <h2 class="title is-3 section-title has-text-centered">Contributions & Future Work</h2>
833
 
834
  <div class="content">
835
+ <!-- Contributions Overview -->
836
+ <div class="notification is-info is-light has-text-centered mb-5">
837
+ <p class="is-size-5 has-text-weight-semibold">
838
+ Our work introduces a standardized, large-scale, and holistic evaluation framework for financial language models.
839
+ </p>
840
+ </div>
841
+
842
  <!-- Contributions -->
843
  <div class="box has-background-white-ter mb-5">
844
  <h4 class="title is-4 has-text-centered mb-4">Key Contributions</h4>
845
+
846
+ <div class="columns is-multiline is-centered">
847
+
848
  <!-- Contribution 1 -->
849
  <div class="column is-6">
850
  <div class="media">
851
  <div class="media-left">
852
  <span class="icon has-text-primary is-large">
853
+ <i class="fas fa-cogs fa-lg"></i>
854
  </span>
855
  </div>
856
  <div class="media-content">
857
+ <p class="has-text-weight-semibold">Standardized Evaluation Framework</p>
858
+ <p class="is-size-7">We introduce an open-source, modular benchmarking suite for systematic LM evaluations on core financial NLP tasks.</p>
859
  </div>
860
  </div>
861
  </div>
862
+
863
  <!-- Contribution 2 -->
864
  <div class="column is-6">
865
  <div class="media">
866
  <div class="media-left">
867
  <span class="icon has-text-primary is-large">
868
+ <i class="fas fa-chart-line fa-lg"></i>
869
  </span>
870
  </div>
871
  <div class="media-content">
872
+ <p class="has-text-weight-semibold">Large-Scale Model Assessment</p>
873
+ <p class="is-size-7">We benchmark 23 foundation LMs—open-weight and proprietary—across 20 financial tasks, revealing performance-cost trade-offs.</p>
874
  </div>
875
  </div>
876
  </div>
877
+
878
  <!-- Contribution 3 -->
879
  <div class="column is-6">
880
  <div class="media">
881
  <div class="media-left">
882
  <span class="icon has-text-primary is-large">
883
+ <i class="fas fa-database fa-lg"></i>
884
  </span>
885
  </div>
886
  <div class="media-content">
887
+ <p class="has-text-weight-semibold">Holistic Dataset Taxonomy</p>
888
+ <p class="is-size-7">We establish a structured dataset taxonomy, categorizing financial NLP tasks based on domain, data format, and linguistic complexity.</p>
889
  </div>
890
  </div>
891
  </div>
892
+
893
  <!-- Contribution 4 -->
894
  <div class="column is-6">
895
  <div class="media">
896
  <div class="media-left">
897
  <span class="icon has-text-primary is-large">
898
+ <i class="fas fa-users fa-lg"></i>
899
  </span>
900
  </div>
901
  <div class="media-content">
902
+ <p class="has-text-weight-semibold">Living Benchmark & Open Collaboration</p>
903
+ <p class="is-size-7">We introduce a continuously updated leaderboard, inviting researchers to contribute new datasets and evaluation results.</p>
904
  </div>
905
  </div>
906
  </div>
907
+
908
  <!-- Contribution 5 -->
909
  <div class="column is-6">
910
  <div class="media">
911
  <div class="media-left">
912
  <span class="icon has-text-primary is-large">
913
+ <i class="fas fa-balance-scale fa-lg"></i>
914
  </span>
915
  </div>
916
  <div class="media-content">
917
+ <p class="has-text-weight-semibold">Error Analysis & Cost-Performance Insights</p>
918
+ <p class="is-size-7">We analyze systematic model errors and quantify cost-performance trade-offs for informed deployment in real-world applications.</p>
919
  </div>
920
  </div>
921
  </div>
922
+
923
  <!-- Contribution 6 -->
924
  <div class="column is-6">
925
  <div class="media">
926
  <div class="media-left">
927
  <span class="icon has-text-primary is-large">
928
+ <i class="fas fa-code-branch fa-lg"></i>
929
  </span>
930
  </div>
931
  <div class="media-content">
932
  <p class="has-text-weight-semibold">Open-Source Implementation</p>
933
+ <p class="is-size-7">We release a fully open-source framework, enabling the research community to extend and refine financial LM evaluation methodologies.</p>
934
  </div>
935
  </div>
936
  </div>
937
+
938
  </div>
939
  </div>
940
 
941
  <!-- Limitations & Future Work -->
942
+ <div class="columns is-multiline">
943
+ <!-- Limitations -->
944
+ <div class="column is-6">
945
+ <div class="card h-100">
946
+ <div class="card-header">
947
+ <p class="card-header-title">
948
+ <span class="icon mr-2"><i class="fas fa-exclamation-circle"></i></span>
949
+ Limitations
950
+ </p>
951
+ </div>
952
+ <div class="card-content">
953
+ <div class="content">
954
+ <p class="mb-3">
955
+ While our benchmark provides valuable insights, several limitations must be acknowledged:
956
+ </p>
957
+
958
+ <div class="notification is-danger is-light py-2 px-3 mb-3">
959
+ <p class="has-text-weight-bold mb-1">❌ Data Contamination Risks</p>
960
+ <p class="is-size-7 mb-0">Benchmark test data may overlap with model pretraining corpora, leading to artificially inflated performance. We are actively developing novel datasets to mitigate this risk.</p>
 
 
961
  </div>
962
+
963
+ <div class="notification is-warning is-light py-2 px-3 mb-3">
964
+ <p class="has-text-weight-bold mb-1">⚠️ Dataset Size & Diversity</p>
965
+ <p class="is-size-7 mb-0">Our dataset scope is limited, affecting model generalization across diverse financial domains and languages.</p>
966
+ </div>
967
+
968
+ <div class="notification is-warning is-light py-2 px-3 mb-3">
969
+ <p class="has-text-weight-bold mb-1">⚠️ Zero-Shot Focus</p>
970
+ <p class="is-size-7 mb-0">Due to budget constraints, our evaluations use zero-shot prompting only, without fine-tuning or few-shot prompting.</p>
971
+ </div>
972
+
973
+ <div class="notification is-warning is-light py-2 px-3 mb-3">
974
+ <p class="has-text-weight-bold mb-1">⚠️ Limited Adaptation Strategies</p>
975
+ <p class="is-size-7 mb-0">We do not explore chain-of-thought reasoning or advanced prompting, though these techniques are known to improve model performance.</p>
 
 
 
 
 
 
 
 
 
976
  </div>
977
+
978
+ <div class="notification is-info is-light py-2 px-3 mb-3">
979
+ <p class="has-text-weight-bold mb-1">ℹ️ English Language Bias</p>
980
+ <p class="is-size-7 mb-0">The benchmark primarily focuses on English due to the availability of financial datasets, limiting insights into multilingual model performance.</p>
981
+ </div>
982
+
983
+ <div class="notification is-info is-light py-2 px-3 mb-3">
984
+ <p class="has-text-weight-bold mb-1">ℹ️ Real-World Complex Tasks</p>
985
+ <p class="is-size-7 mb-0">Existing tasks do not fully capture the dynamic and evolving nature of financial markets, requiring ongoing dataset expansion.</p>
986
+ </div>
987
+
988
+ <p class="is-italic is-size-7 mt-4">
989
+ Recognizing these limitations is essential for improving future financial NLP benchmarks. Our ongoing work aims to address these challenges through dataset refinement, broader task coverage, and multilingual support.
990
+ </p>
991
  </div>
992
  </div>
993
  </div>
994
  </div>
995
+
996
+ <!-- Future Work -->
997
+ <div class="column is-6">
998
+ <div class="card h-100">
999
+ <div class="card-header">
1000
+ <p class="card-header-title">
1001
+ <span class="icon mr-2"><i class="fas fa-lightbulb"></i></span>
1002
+ Future Work
1003
+ </p>
1004
+ </div>
1005
+ <div class="card-content">
1006
+ <div class="content">
1007
+ <p class="mb-3">
1008
+ To strengthen the robustness and adaptability of our framework, we advocate for open collaboration within the research community
1009
+ and propose the following future directions to expand its capabilities:
1010
+ </p>
1011
+
1012
+ <div class="notification is-info is-light py-2 px-3 mb-3">
1013
+ <p class="has-text-weight-bold mb-1">🌍 Multilingual Expansion</p>
1014
+ <p class="is-size-7 mb-0">Extending benchmarks beyond English to include multilingual financial datasets and evaluations.</p>
1015
+ </div>
1016
+
1017
+ <div class="notification is-info is-light py-2 px-3 mb-3">
1018
+ <p class="has-text-weight-bold mb-1">🧠 Few-Shot & Chain-of-Thought</p>
1019
+ <p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
1020
+ </div>
1021
+
1022
+ <div class="notification is-info is-light py-2 px-3 mb-3">
1023
+ <p class="has-text-weight-bold mb-1">📊 Domain-Adaptive Training</p>
1024
+ <p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of finance-specific terminology and reasoning.</p>
1025
+ </div>
1026
+
1027
+ <div class="notification is-info is-light py-2 px-3 mb-3">
1028
+ <p class="has-text-weight-bold mb-1">🔍 Expanded Dataset Coverage</p>
1029
+ <p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
1030
+ </div>
1031
+
1032
+ <div class="notification is-info is-light py-2 px-3 mb-3">
1033
+ <p class="has-text-weight-bold mb-1">⚖️ Efficiency & Cost Benchmarking</p>
1034
+ <p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
1035
+ </div>
1036
+
1037
+ <div class="notification is-info is-light py-2 px-3 mb-3">
1038
+ <p class="has-text-weight-bold mb-1">📈 Advanced Evaluation Metrics</p>
1039
+ <p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
1040
+ </div>
1041
+
1042
+ <p class="is-italic is-size-7 mt-4">
1043
+ These improvements will enable more accurate and fair comparisons of financial language models,
1044
+ fostering greater transparency, reproducibility, and real-world applicability.
1045
+ </p>
1046
+ </div>
1047
+ </div>
1048
+ </div>
1049
  </div>
1050
+ </div>
1051
+
1052
  <!--/ Contributions -->
1053
  </div>
1054
  </section>