Glenn Matlin committed
Commit · 632b302
1 Parent(s): db2bd37
updates to index.html
- CLAUDE.md +33 -128
- FLaME/{FLaME__ACL_AAR_Feb_2025_.pdf → FLaME.pdf} +0 -0
- FLaME/FLaME.pdf:Zone.Identifier +4 -0
- FLaME/FLaME[ACLAARFeb2025]/content/0_authors.tex +1 -1
- FLaME/content/0_authors.tex +1 -1
- index.html +29 -29
CLAUDE.md
CHANGED
|
@@ -2,138 +2,43 @@
|
|
| 2 |
|
| 3 |
## Project Overview
|
| 4 |
- FLaME: Holistic Financial Language Model Evaluation
|
| 5 |
-
-
|
| 6 |
-
-
|
| 7 |
-
- Research paper for ACL Annual Advances in Research (Feb 2025)
|
| 8 |
- Hosted on HuggingFace Spaces
|
| 9 |
|
| 10 |
-
##
|
| 11 |
-
-
|
| 12 |
-
-
|
|
|
|
| 13 |
|
| 14 |
## Code Style Guidelines
|
| 15 |
-
- HTML:
|
| 16 |
-
- CSS: Follow Bulma
|
| 17 |
- JavaScript:
|
| 18 |
-
- Use camelCase for variables
|
| 19 |
-
-
|
| 20 |
-
- Include semicolons
|
| 21 |
-
-
|
| 22 |
-
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
-
|
| 30 |
-
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
-
|
| 37 |
-
-
|
| 38 |
-
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
- Keep all CSS in static/css/
|
| 44 |
-
- Keep all JavaScript in static/js/
|
| 45 |
-
- Keep media files in appropriate subdirectories
|
| 46 |
-
- Paper content in FLaME/content/
|
| 47 |
-
- Use section IDs for navigation linking (e.g., #abstract, #methodology)
|
| 48 |
-
|
| 49 |
-
## Interactive Components
|
| 50 |
-
- Navbar with smooth scrolling to sections
|
| 51 |
-
- Performance indicator bars for result visualization
|
| 52 |
-
- Card layouts for key findings with hover effects
|
| 53 |
-
- Methodology workflow diagram with step visualization
|
| 54 |
-
- Interactive feature highlights with icons
|
| 55 |
-
- Getting started guide with numbered steps
|
| 56 |
-
|
| 57 |
-
## Responsive Design
|
| 58 |
-
- Mobile-friendly navigation menu (hamburger on small screens)
|
| 59 |
-
- Stacked cards on mobile devices
|
| 60 |
-
- Adjusted typography and spacing for different screen sizes
|
| 61 |
-
- Media queries for breakpoints at 768px
|
| 62 |
-
|
| 63 |
-
## FLaME Research Paper Information
|
| 64 |
-
|
| 65 |
-
### Authors
|
| 66 |
-
- Oopy Goopy, General Munchkin Man, L'il Jim Bob, Larry
|
| 67 |
-
- Affiliation: Georgia Institute of Technology
|
| 68 |
-
|
| 69 |
-
### Paper Focus and Objective
|
| 70 |
-
- First comprehensive benchmarking framework for evaluating language models on financial NLP tasks
|
| 71 |
-
- Addresses gaps in existing evaluation methodologies for financial language models
|
| 72 |
-
- Provides standardized evaluation framework with open-source implementation
|
| 73 |
-
|
| 74 |
-
### Key Components
|
| 75 |
-
|
| 76 |
-
#### Taxonomy
|
| 77 |
-
- Organized by three dimensions: tasks, domains, and languages
|
| 78 |
-
- Six core FinNLP tasks:
|
| 79 |
-
1. Text classification
|
| 80 |
-
2. Sentiment analysis
|
| 81 |
-
3. Information retrieval
|
| 82 |
-
4. Causal analysis
|
| 83 |
-
5. Text summarization
|
| 84 |
-
6. Question answering
|
| 85 |
-
- Domains categorized by data source, origination, time period, etc.
|
| 86 |
-
- Currently focuses on English language
|
| 87 |
-
|
| 88 |
-
#### Datasets
|
| 89 |
-
Selected based on:
|
| 90 |
-
- Financial domain relevance
|
| 91 |
-
- Fair usage licensing
|
| 92 |
-
- Annotation quality
|
| 93 |
-
- Task substance
|
| 94 |
-
|
| 95 |
-
Key datasets include:
|
| 96 |
-
- Banking: Banking77, FiQA, FinRED
|
| 97 |
-
- Investment: FPB, Headlines, SubjectiveQA
|
| 98 |
-
- Accounting: FinQA, TaT-QA, ConvFinQA
|
| 99 |
-
- Corporate: ECTSum, EDTSum, FinCausal
|
| 100 |
-
- Monetary Policy: FOMC, FNXL
|
| 101 |
-
- Cross-domain: FinBench, NumClaim, ReFINED
|
| 102 |
-
|
| 103 |
-
#### Models Evaluated
|
| 104 |
-
- Proprietary closed-source: GPT-4o & o1-mini, Gemini-1.5, Claude3, Cohere Command R
|
| 105 |
-
- Open-weight: Llama-3, DeepSeekV3 & R-1, Qwen-2 & QwQ, Mistral, Gemma-1 & 2, Mixtral, WizardLM2, DBRX
|
| 106 |
-
- Used deterministic decoding (temperature 0.0, top p of 0.9, repetition penalty of 1)
|
| 107 |
-
|
| 108 |
-
#### Evaluation Process
|
| 109 |
-
- Two-stage approach: generation and extraction
|
| 110 |
-
- Task-specific metrics: accuracy, F1 scores, precision, recall, BLEU scores
|
| 111 |
-
- Standardized zero-shot evaluation
|
| 112 |
-
|
| 113 |
-
### Key Findings
|
| 114 |
-
- No single model performs best across all tasks
|
| 115 |
-
- Performance varies significantly based on domain and task structure
|
| 116 |
-
- Open-weight models show strong cost/performance efficiency
|
| 117 |
-
- Numeric reasoning tasks remain challenging for all models
|
| 118 |
-
- Inconsistent scaling: larger parameter sizes don't guarantee higher performance
|
| 119 |
-
- Models struggle with consistent numeric formats and longer label sets
|
| 120 |
-
- Top performers: DeepSeek R1, OpenAI o1-mini, Claude 3.5 Sonnet
|
| 121 |
-
|
| 122 |
-
### Limitations
|
| 123 |
-
- Limited dataset size and diversity
|
| 124 |
-
- Focus on zero-shot scenarios only
|
| 125 |
-
- English-language focus
|
| 126 |
-
- No evaluation of advanced prompting techniques
|
| 127 |
-
- Doesn't capture full breadth of real-world financial scenarios
|
| 128 |
-
|
| 129 |
-
### Future Directions
|
| 130 |
-
- More advanced prompt engineering
|
| 131 |
-
- Domain-adaptive training for numeric/causal tasks
|
| 132 |
-
- Benchmarking efficiency trade-offs
|
| 133 |
-
- Multi-lingual coverage expansion
|
| 134 |
-
|
| 135 |
-
### Resources
|
| 136 |
-
- Paper PDF: FLaME/FLaME__ACL_AAR_Feb_2025_.pdf
|
| 137 |
-
- ArXiv: https://arxiv.org/abs/2402.14017
|
| 138 |
- GitHub: https://github.com/flame-benchmark/flame
|
| 139 |
- HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
|
|
|
|
| 2 |
|
| 3 |
## Project Overview
|
| 4 |
- FLaME: Holistic Financial Language Model Evaluation
|
| 5 |
+
- LaTeX paper with static website built using Bulma CSS
|
| 6 |
+
- Research project for ACL Annual Advances in Research (Feb 2025)
|
|
|
|
| 7 |
- Hosted on HuggingFace Spaces
|
| 8 |
|
| 9 |
+
## Commands
|
| 10 |
+
- Build LaTeX paper: `pdflatex FLaME.tex && bibtex FLaME && pdflatex FLaME.tex && pdflatex FLaME.tex`
|
| 11 |
+
- Local website testing: `python -m http.server 8000`
|
| 12 |
+
- Fix tooltips: `./fix_tooltips.sh`
|
| 13 |
|
| 14 |
## Code Style Guidelines
|
| 15 |
+
- HTML: Semantic HTML5, Bulma framework conventions
|
| 16 |
+
- CSS: Follow Bulma conventions, keep in static/css/
|
| 17 |
- JavaScript:
|
| 18 |
+
- Use camelCase for variables/functions
|
| 19 |
+
- 2-space indentation
|
| 20 |
+
- Include semicolons
|
| 21 |
+
- Prefer vanilla JS
|
| 22 |
+
- Store in static/js/
|
| 23 |
+
- LaTeX: Follow ACL style guidelines in acl_formatting.md
|
| 24 |
+
|
| 25 |
+
## Paper Structure
|
| 26 |
+
- Main source in FLaME.tex
|
| 27 |
+
- Content modularized in FLaME/content/ directory
|
| 28 |
+
- Six core task areas: text classification, sentiment analysis, info retrieval, causal analysis, summarization, QA
|
| 29 |
+
- Datasets in FLaME/content/datasets/
|
| 30 |
+
- Results in FLaME/content/tables/
|
| 31 |
+
|
| 32 |
+
## Website Design
|
| 33 |
+
- Color scheme: Deep blue (#004d99), Orange (#ff6b00), Light bg (#f8f9fa)
|
| 34 |
+
- Card-based responsive layout
|
| 35 |
+
- Interactive elements with tooltips
|
| 36 |
+
- Results displayed in task-specific tables
|
| 37 |
+
- Media files optimized for web delivery
|
| 38 |
+
- Convert PDF figures to JPG/PNG for web display
|
| 39 |
+
|
| 40 |
+
## Deployment
|
| 41 |
+
- Website deploys to HuggingFace Spaces
|
| 42 |
+
- Paper available as PDF at FLaME/FLaME.pdf
|
|
| 43 |
- GitHub: https://github.com/flame-benchmark/flame
|
| 44 |
- HuggingFace: https://huggingface.co/spaces/flame-benchmark/flame
|
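The build command added to CLAUDE.md in this commit chains `pdflatex`, `bibtex`, and two further `pdflatex` passes. A minimal Python sketch of the same sequence follows, assuming both tools are on `PATH` and that the main source file is `FLaME.tex` as the diff states. The helper names `build_commands` and `build_paper` are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path


def build_commands(main_tex: str = "FLaME.tex") -> list[list[str]]:
    """Return the four-step sequence from CLAUDE.md as argument lists."""
    stem = Path(main_tex).stem  # bibtex takes the job name without ".tex"
    return [
        ["pdflatex", main_tex],
        ["bibtex", stem],
        ["pdflatex", main_tex],
        ["pdflatex", main_tex],
    ]


def build_paper(main_tex: str = "FLaME.tex", cwd: str = ".") -> None:
    """Run each step in order, stopping at the first failure (like `&&`)."""
    for cmd in build_commands(main_tex):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `build_paper()` from the directory that holds `FLaME.tex` would run the same four steps as the shell one-liner. Keeping the command list in a separate function makes the sequence easy to inspect without invoking LaTeX.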
FLaME/{FLaME__ACL_AAR_Feb_2025_.pdf → FLaME.pdf}
RENAMED
|
Binary files a/FLaME/FLaME__ACL_AAR_Feb_2025_.pdf and b/FLaME/FLaME.pdf differ
|
|
|
FLaME/FLaME.pdf:Zone.Identifier
ADDED
|
@@ -0,0 +1,4 @@
|
| 1 |
+
[ZoneTransfer]
|
| 2 |
+
ZoneId=3
|
| 3 |
+
ReferrerUrl=https://www.overleaf.com/project/67380d90965fe3bdf7157e6c
|
| 4 |
+
HostUrl=https://www.overleaf.com/download/project/67380d90965fe3bdf7157e6c/build/195acbb99c5-1bdebd47a7495259/output/output.pdf?compileGroup=priority&clsiserverid=clsi-pre-emp-c2d-c-f-bzgb&enable_pdf_caching=true&popupDownload=true
|
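The `FLaME.pdf:Zone.Identifier` file added above looks like a Windows download marker that was materialized as a regular file, not project content. That reading is an assumption based on its `[ZoneTransfer]` contents. A minimal sketch for spotting such files before committing follows; the helper name `find_zone_identifiers` is hypothetical.

```python
from pathlib import Path


def find_zone_identifiers(root: str) -> list[Path]:
    """List files under root whose name ends in ':Zone.Identifier'."""
    suffix = ":Zone.Identifier"
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.name.endswith(suffix)
    )
```

Running `find_zone_identifiers(".")` at the repository root would list candidates to review. Whether to delete them or add an ignore rule is left to the maintainers.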
FLaME/FLaME[ACLAARFeb2025]/content/0_authors.tex
CHANGED
|
@@ -1 +1 @@
|
|
| 1 |
-
\author{
|
|
|
|
| 1 |
+
\author{Glenn Matlin, {\bf Mika Okamoto}, {\bf Huzaifa Pardwala}, {\bf Yang Yang}, {\bf Sudheer Chava} \\ Georgia Institute of Technology}
|
FLaME/content/0_authors.tex
CHANGED
|
@@ -1 +1 @@
|
|
| 1 |
-
\author{
|
|
|
|
| 1 |
+
\author{Glenn Matlin, {\bf Mika Okamoto}, {\bf Huzaifa Pardwala}, {\bf Yang Yang}, {\bf Sudheer Chava} \\ Georgia Institute of Technology}
|
index.html
CHANGED
|
@@ -93,14 +93,15 @@
|
|
| 93 |
<h1 class="title is-1 publication-title">FLaME: Holistic Financial Language Model Evaluation</h1>
|
| 94 |
<div class="is-size-5 publication-authors">
|
| 95 |
<span class="author-block">
|
| 96 |
-
<a href="#" target="_blank">
|
| 97 |
<span class="author-block">
|
| 98 |
-
<a href="#" target="_blank">
|
| 99 |
<span class="author-block">
|
| 100 |
-
<a href="#" target="_blank">
|
| 101 |
-
|
|
|
|
| 102 |
<span class="author-block">
|
| 103 |
-
<a href="#" target="_blank">
|
| 104 |
</span>
|
| 105 |
</div>
|
| 106 |
|
|
@@ -112,7 +113,7 @@
|
|
| 112 |
<div class="publication-links">
|
| 113 |
<!-- PDF Link. -->
|
| 114 |
<span class="link-block">
|
| 115 |
-
<a href="FLaME/
|
| 116 |
class="external-link button is-normal is-rounded is-dark">
|
| 117 |
<span class="icon">
|
| 118 |
<i class="fas fa-file-pdf"></i>
|
|
@@ -3911,15 +3912,15 @@
|
|
| 3911 |
|
| 3912 |
<hr>
|
| 3913 |
|
| 3914 |
-
<p class="has-text-weight-bold mb-3"
|
| 3915 |
|
| 3916 |
<div class="notification is-info is-light py-3 px-4">
|
| 3917 |
-
<p><strong
|
| 3918 |
-
<p><strong
|
| 3919 |
-
<p><strong
|
| 3920 |
-
<p><strong
|
| 3921 |
-
<p><strong
|
| 3922 |
-
<p><strong
|
| 3923 |
</div>
|
| 3924 |
</div>
|
| 3925 |
</div>
|
|
@@ -4200,27 +4201,27 @@
|
|
| 4200 |
</div>
|
| 4201 |
|
| 4202 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4203 |
-
<p class="has-text-weight-bold mb-1"
|
| 4204 |
<p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
|
| 4205 |
</div>
|
| 4206 |
|
| 4207 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4208 |
-
<p class="has-text-weight-bold mb-1"
|
| 4209 |
<p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
|
| 4210 |
</div>
|
| 4211 |
|
| 4212 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4213 |
-
<p class="has-text-weight-bold mb-1"
|
| 4214 |
<p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
|
| 4215 |
</div>
|
| 4216 |
|
| 4217 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4218 |
-
<p class="has-text-weight-bold mb-1"
|
| 4219 |
<p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
|
| 4220 |
</div>
|
| 4221 |
|
| 4222 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4223 |
-
<p class="has-text-weight-bold mb-1"
|
| 4224 |
<p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
|
| 4225 |
</div>
|
| 4226 |
|
|
@@ -4299,7 +4300,7 @@
|
|
| 4299 |
|
| 4300 |
<div class="feature-item mb-3">
|
| 4301 |
<p class="has-text-weight-bold mb-1">
|
| 4302 |
-
<span class="icon has-text-primary"><i class="fas fa-check"></i></span>
|
| 4303 |
</p>
|
| 4304 |
<p class="is-size-7 ml-4">Ensures consistent evaluation metrics and transparent methodology.</p>
|
| 4305 |
</div>
|
|
@@ -4391,7 +4392,7 @@
|
|
| 4391 |
<div class="column is-6">
|
| 4392 |
<div class="dataset-category box">
|
| 4393 |
<p class="has-text-weight-bold">
|
| 4394 |
-
|
| 4395 |
</p>
|
| 4396 |
<ul>
|
| 4397 |
<li><strong>FinQA</strong> – Multi-step financial numerical reasoning.</li>
|
|
@@ -4405,7 +4406,7 @@
|
|
| 4405 |
<div class="column is-6">
|
| 4406 |
<div class="dataset-category box">
|
| 4407 |
<p class="has-text-weight-bold">
|
| 4408 |
-
|
| 4409 |
</p>
|
| 4410 |
<ul>
|
| 4411 |
<li><strong>ECTSum</strong> – Earnings call transcript summarization.</li>
|
|
@@ -4418,7 +4419,7 @@
|
|
| 4418 |
<div class="column is-6">
|
| 4419 |
<div class="dataset-category box">
|
| 4420 |
<p class="has-text-weight-bold">
|
| 4421 |
-
|
| 4422 |
</p>
|
| 4423 |
<ul>
|
| 4424 |
<li><strong>FiNER-ORD</strong> – Named entity recognition for financial documents.</li>
|
|
@@ -4434,7 +4435,7 @@
|
|
| 4434 |
<div class="column is-6">
|
| 4435 |
<div class="dataset-category box">
|
| 4436 |
<p class="has-text-weight-bold">
|
| 4437 |
-
|
| 4438 |
</p>
|
| 4439 |
<ul>
|
| 4440 |
<li><strong>FiQA (Task 1)</strong> – Aspect-based sentiment analysis.</li>
|
|
@@ -4449,7 +4450,7 @@
|
|
| 4449 |
<div class="column is-6">
|
| 4450 |
<div class="dataset-category box">
|
| 4451 |
<p class="has-text-weight-bold">
|
| 4452 |
-
|
| 4453 |
</p>
|
| 4454 |
<ul>
|
| 4455 |
<li><strong>Numerical Claim Detection</strong> – Fine-grained investor claim detection.</li>
|
|
@@ -4461,11 +4462,11 @@
|
|
| 4461 |
</div>
|
| 4462 |
</div>
|
| 4463 |
|
| 4464 |
-
<!--
|
| 4465 |
<div class="column is-6">
|
| 4466 |
<div class="dataset-category box">
|
| 4467 |
<p class="has-text-weight-bold">
|
| 4468 |
-
|
| 4469 |
</p>
|
| 4470 |
<ul>
|
| 4471 |
<li><strong>FinCausal</strong> – Causal reasoning in financial news.</li>
|
|
@@ -4493,7 +4494,6 @@
|
|
| 4493 |
<pre><code>@article{flame2025,
|
| 4494 |
author = {Goopy, Oopy and Man, General Munchkin and Bob, L'il Jim and Larry},
|
| 4495 |
title = {FLaME: Holistic Financial Language Model Evaluation},
|
| 4496 |
-
journal = {ACL Annual Advances in Research},
|
| 4497 |
year = {2025},
|
| 4498 |
month = {February},
|
| 4499 |
}</code></pre>
|
|
@@ -4510,7 +4510,7 @@
|
|
| 4510 |
<h4 class="has-text-white mb-4"><span class="flame">FLaME</span>: Financial Language Model Evaluation</h4>
|
| 4511 |
|
| 4512 |
<div class="footer-links mb-5">
|
| 4513 |
-
<a class="icon-link mr-3" target="_blank" href="FLaME/
|
| 4514 |
<i class="fas fa-file-pdf fa-lg"></i>
|
| 4515 |
</a>
|
| 4516 |
<a class="icon-link mr-3" href="https://arxiv.org/abs/2402.14017" target="_blank" title="View on arXiv">
|
|
@@ -4525,7 +4525,7 @@
|
|
| 4525 |
</div>
|
| 4526 |
|
| 4527 |
<div class="institution-info mb-4">
|
| 4528 |
-
<p class="has-text-white-ter">Georgia Institute of Technology
|
| 4529 |
</div>
|
| 4530 |
|
| 4531 |
<p class="has-text-white-ter is-size-7">
|
|
|
|
| 93 |
<h1 class="title is-1 publication-title">FLaME: Holistic Financial Language Model Evaluation</h1>
|
| 94 |
<div class="is-size-5 publication-authors">
|
| 95 |
<span class="author-block">
|
| 96 |
+
<a href="#" target="_blank">Glenn Matlin</a><sup>1</sup>,</span>
|
| 97 |
<span class="author-block">
|
| 98 |
+
<a href="#" target="_blank">Mika Okamoto</a><sup>1</sup>,</span>
|
| 99 |
<span class="author-block">
|
| 100 |
+
<a href="#" target="_blank">Huzaifa Pardwala</a><sup>1</sup>,</span>
|
| 101 |
+
<span class="author-block">
|
| 102 |
+
<a href="#" target="_blank">Yang Yang</a><sup>1</sup>,</span>
|
| 103 |
<span class="author-block">
|
| 104 |
+
<a href="#" target="_blank">Sudheer Chava</a><sup>1</sup>
|
| 105 |
</span>
|
| 106 |
</div>
|
| 107 |
|
|
|
|
| 113 |
<div class="publication-links">
|
| 114 |
<!-- PDF Link. -->
|
| 115 |
<span class="link-block">
|
| 116 |
+
<a href="FLaME/FLaME.pdf" target="_blank"
|
| 117 |
class="external-link button is-normal is-rounded is-dark">
|
| 118 |
<span class="icon">
|
| 119 |
<i class="fas fa-file-pdf"></i>
|
|
|
|
| 3912 |
|
| 3913 |
<hr>
|
| 3914 |
|
| 3915 |
+
<p class="has-text-weight-bold mb-3"><span class="icon has-text-primary"><i class="fa-solid fa-magnifying-glass"></i></span> Key Insights from Model Analysis</p>
|
| 3916 |
|
| 3917 |
<div class="notification is-info is-light py-3 px-4">
|
| 3918 |
+
<p><strong><span class="icon"><i class="fa-solid fa-trophy"></i></span> No single dominant model:</strong> DeepSeek R1 leads in complex multi-step QA, while Claude 3.5 excels in sentiment tasks. GPT-4o is strong in classification and summarization.</p>
|
| 3919 |
+
<p><strong><span class="icon"><i class="fa-solid fa-balance-scale"></i></span> Inconsistent scaling:</strong> Larger models don’t always outperform smaller ones—DeepSeek R1 trails in summarization despite excelling in QA.</p>
|
| 3920 |
+
<p><strong><span class="icon"><i class="fa-solid fa-tools"></i></span> Open-weight models:</strong> Many open-weight models like DeepSeek-V3 and Llama 3.1 70B offer competitive performance while being cost-effective.</p>
|
| 3921 |
+
<p><strong><span class="icon"><i class="fa-solid fa-coins"></i></span> Cost-performance disparities:</strong> Running DeepSeek R1 can cost up to <strong>$260</strong> per million tokens, while Claude 3.5 Sonnet and o1-mini cost around <strong>$105</strong>, and Meta’s Llama 3.1 8B only <strong>$4</strong>.</p>
|
| 3922 |
+
<p><strong><span class="icon"><i class="fa-solid fa-chart-line"></i></span> Numeric reasoning challenges:</strong> Even the best models struggle with financial numeric reasoning tasks, achieving low F1 scores (<strong>≤ 0.06</strong>).</p>
|
| 3923 |
+
<p><strong><span class="icon"><i class="fa-solid fa-list-ol"></i></span> Step-by-step deductions:</strong> Multi-turn financial QA (e.g., ConvFinQA) significantly reduces model accuracy due to complex dependencies.</p>
|
| 3924 |
</div>
|
| 3925 |
</div>
|
| 3926 |
</div>
|
|
|
|
| 4201 |
</div>
|
| 4202 |
|
| 4203 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4204 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-brain"></i></span> Few-Shot & Chain-of-Thought</p>
|
| 4205 |
<p class="is-size-7 mb-0">Investigating in-context learning techniques such as few-shot, chain-of-thought, and retrieval-augmented generation (RAG).</p>
|
| 4206 |
</div>
|
| 4207 |
|
| 4208 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4209 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-line"></i></span> Domain-Adaptive Training</p>
|
| 4210 |
<p class="is-size-7 mb-0">Evaluating fine-tuning strategies to enhance model understanding of financial-specific terminology and reasoning.</p>
|
| 4211 |
</div>
|
| 4212 |
|
| 4213 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4214 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-database"></i></span> Expanded Dataset Coverage</p>
|
| 4215 |
<p class="is-size-7 mb-0">Curating datasets from underrepresented financial sectors such as insurance, derivatives, and central banking.</p>
|
| 4216 |
</div>
|
| 4217 |
|
| 4218 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4219 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-balance-scale"></i></span> Efficiency & Cost Benchmarking</p>
|
| 4220 |
<p class="is-size-7 mb-0">Developing detailed trade-off analyses between accuracy, latency, and cost to optimize real-world usability.</p>
|
| 4221 |
</div>
|
| 4222 |
|
| 4223 |
<div class="notification is-info is-light py-2 px-3 mb-3">
|
| 4224 |
+
<p class="has-text-weight-bold mb-1"><span class="icon has-text-primary"><i class="fa-solid fa-chart-bar"></i></span> Advanced Evaluation Metrics</p>
|
| 4225 |
<p class="is-size-7 mb-0">Moving beyond traditional accuracy metrics by incorporating trustworthiness, robustness, and interpretability measures.</p>
|
| 4226 |
</div>
|
| 4227 |
|
|
|
|
| 4300 |
|
| 4301 |
<div class="feature-item mb-3">
|
| 4302 |
<p class="has-text-weight-bold mb-1">
|
| 4303 |
+
<span class="icon has-text-primary"><i class="fas fa-check"></i></span> Reproducible Benchmarking
|
| 4304 |
</p>
|
| 4305 |
<p class="is-size-7 ml-4">Ensures consistent evaluation metrics and transparent methodology.</p>
|
| 4306 |
</div>
|
|
|
|
| 4392 |
<div class="column is-6">
|
| 4393 |
<div class="dataset-category box">
|
| 4394 |
<p class="has-text-weight-bold">
|
| 4395 |
+
📊 Numerical Reasoning & Question Answering
|
| 4396 |
</p>
|
| 4397 |
<ul>
|
| 4398 |
<li><strong>FinQA</strong> – Multi-step financial numerical reasoning.</li>
|
|
|
|
| 4406 |
<div class="column is-6">
|
| 4407 |
<div class="dataset-category box">
|
| 4408 |
<p class="has-text-weight-bold">
|
| 4409 |
+
📝 Text Summarization
|
| 4410 |
</p>
|
| 4411 |
<ul>
|
| 4412 |
<li><strong>ECTSum</strong> – Earnings call transcript summarization.</li>
|
|
|
|
| 4419 |
<div class="column is-6">
|
| 4420 |
<div class="dataset-category box">
|
| 4421 |
<p class="has-text-weight-bold">
|
| 4422 |
+
🔎 Information Retrieval
|
| 4423 |
</p>
|
| 4424 |
<ul>
|
| 4425 |
<li><strong>FiNER-ORD</strong> – Named entity recognition for financial documents.</li>
|
|
|
|
| 4435 |
<div class="column is-6">
|
| 4436 |
<div class="dataset-category box">
|
| 4437 |
<p class="has-text-weight-bold">
|
| 4438 |
+
😐 Sentiment Analysis
|
| 4439 |
</p>
|
| 4440 |
<ul>
|
| 4441 |
<li><strong>FiQA (Task 1)</strong> – Aspect-based sentiment analysis.</li>
|
|
|
|
| 4450 |
<div class="column is-6">
|
| 4451 |
<div class="dataset-category box">
|
| 4452 |
<p class="has-text-weight-bold">
|
| 4453 |
+
🏷️ Text Classification
|
| 4454 |
</p>
|
| 4455 |
<ul>
|
| 4456 |
<li><strong>Numerical Claim Detection</strong> – Fine-grained investor claim detection.</li>
|
|
|
|
| 4462 |
</div>
|
| 4463 |
</div>
|
| 4464 |
|
| 4465 |
+
<!-- 🧠 Causal Analysis -->
|
| 4466 |
<div class="column is-6">
|
| 4467 |
<div class="dataset-category box">
|
| 4468 |
<p class="has-text-weight-bold">
|
| 4469 |
+
🧠 Causal Analysis
|
| 4470 |
</p>
|
| 4471 |
<ul>
|
| 4472 |
<li><strong>FinCausal</strong> – Causal reasoning in financial news.</li>
|
|
|
|
| 4494 |
<pre><code>@article{flame2025,
|
| 4495 |
author = {Goopy, Oopy and Man, General Munchkin and Bob, L'il Jim and Larry},
|
| 4496 |
title = {FLaME: Holistic Financial Language Model Evaluation},
|
|
|
|
| 4497 |
year = {2025},
|
| 4498 |
month = {February},
|
| 4499 |
}</code></pre>
|
|
|
|
| 4510 |
<h4 class="has-text-white mb-4"><span class="flame">FLaME</span>: Financial Language Model Evaluation</h4>
|
| 4511 |
|
| 4512 |
<div class="footer-links mb-5">
|
| 4513 |
+
<a class="icon-link mr-3" target="_blank" href="FLaME/FLaME.pdf" title="Download PDF">
|
| 4514 |
<i class="fas fa-file-pdf fa-lg"></i>
|
| 4515 |
</a>
|
| 4516 |
<a class="icon-link mr-3" href="https://arxiv.org/abs/2402.14017" target="_blank" title="View on arXiv">
|
|
|
|
| 4525 |
</div>
|
| 4526 |
|
| 4527 |
<div class="institution-info mb-4">
|
| 4528 |
+
<p class="has-text-white-ter">Georgia Institute of Technology</p>
|
| 4529 |
</div>
|
| 4530 |
|
| 4531 |
<p class="has-text-white-ter is-size-7">
|