FutureMa committed on
Commit 0eb3939 · verified · 1 Parent(s): 7aac7d8

Upload README.md

  1. README.md +219 -0
README.md ADDED
@@ -0,0 +1,219 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen3-4B-Instruct-2507
+ tags:
+ - finance
+ - earnings-calls
+ - financial-nlp
+ - text-classification
+ - qwen3
+ - llm-as-judge
+ - distillation
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ # Eva-4B: Financial Evasion Detection Model
+
+ Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.
+
+ ## Model Summary
+
+ - **Model name:** Eva-4B
+ - **Task:** 3-way classification of Q&A pairs into:
+   - `direct`
+   - `intermediate`
+   - `fully_evasive`
+ - **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
+ - **Training method:** full-parameter fine-tuning
+ - **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)
+
+ ## Intended Use
+
+ Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
+
+ ## Task Definition
+
+ Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:
+
+ - **direct:** answers the core question with specific information
+ - **intermediate:** provides related information but sidesteps the core question
+ - **fully_evasive:** does not address the question (refusal, redirection, non-response)
+
+ This taxonomy follows the Rasiah framework referenced in the paper.
+
+ ## Dataset: EvasionBench (as reported in the paper)
+
+ ### Sources
+
+ - Earnings call transcripts from the **S&P Capital IQ** database.
+
+ ### Splits
+
+ - **Training:** 30,000 samples (balanced)
+   - direct: 10,000
+   - intermediate: 10,000
+   - fully_evasive: 10,000
+ - **Test (Human):** 1,000 samples (natural distribution)
+   - direct: 412 (41.2%)
+   - intermediate: 256 (25.6%)
+   - fully_evasive: 332 (33.2%)
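As a quick sanity check on the test split above, the sketch below (not taken from the paper) confirms that the class counts sum to 1,000 and computes the accuracy of always predicting the most frequent class, which gives a simple reference point for the reported scores. The names `TEST_COUNTS` and `majority_baseline` are illustrative.

```python
# Test-set class counts as listed in the Splits section.
TEST_COUNTS = {"direct": 412, "intermediate": 256, "fully_evasive": 332}


def majority_baseline(counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


total = sum(TEST_COUNTS.values())
label, acc = majority_baseline(TEST_COUNTS)
print(total, label, f"{acc:.1%}")  # 1000 direct 41.2%
```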
+
+ ### Labeling / Construction
+
+ The training set is constructed via a multi-model annotation framework:
+
+ - Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
+ - Agreement cases (~70–80%) are treated as high-confidence
+ - Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
+ - Final training mix reported: ~25,000 consensus samples (83.5%) and ~5,000 judge-resolved samples (16.5%)
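The routing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: `resolve_label` and `stub_judge` are hypothetical names, and the real judge would be an LLM call that also sees the question and answer. The last lines apply the reported percentages to the 30,000-sample training set.

```python
from typing import Callable, Tuple

LABELS = ("direct", "intermediate", "fully_evasive")


def resolve_label(
    label_a: str,
    label_b: str,
    judge: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (final_label, source), where source is 'consensus' or 'judge'."""
    for label in (label_a, label_b):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    if label_a == label_b:
        # Both annotators agree: keep the label as high-confidence.
        return label_a, "consensus"
    # Annotators disagree: defer to the judge (hypothetical callable).
    final = judge(label_a, label_b)
    if final not in LABELS:
        raise ValueError(f"judge returned unknown label: {final!r}")
    return final, "judge"


def stub_judge(label_a: str, label_b: str) -> str:
    # Placeholder only; stands in for an LLM-as-Judge call.
    return "intermediate"


agree = resolve_label("direct", "direct", stub_judge)
disagree = resolve_label("direct", "intermediate", stub_judge)
print(agree, disagree)

# Reported training mix applied to the 30,000-sample training set.
train_size = 30_000
consensus_n = round(train_size * 0.835)
judge_n = round(train_size * 0.165)
print(consensus_n, judge_n)
```

Recording the source alongside the label keeps consensus and judge-resolved samples distinguishable for later analysis.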
+
+ ### Human validation (test set)
+
+ - A 100-sample subset is double-annotated by two experts.
+ - Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**.
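For readers unfamiliar with the agreement statistic above, here is a minimal pure-Python sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the four-item toy annotations are illustrative only; they are not the paper's data and do not reproduce the reported 0.835.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical).
expert_1 = ["direct", "direct", "intermediate", "fully_evasive"]
expert_2 = ["direct", "intermediate", "intermediate", "fully_evasive"]
kappa = cohens_kappa(expert_1, expert_2)
print(round(kappa, 3))
```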
+
+ ## Training Details
+
+ - **Base model:** Qwen3-4B-Instruct-2507
+ - **Fine-tuning:** full-parameter fine-tuning
+ - **Framework:** MS-Swift
+ - **Hardware:** 2× NVIDIA B200 SXM6 (180 GB VRAM each)
+ - **Epochs:** 2
+ - **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
+ - **Batch size:** 8 per GPU
+ - **Gradient accumulation:** 2 (effective batch size 32 = 8 per GPU × 2 GPUs × 2 accumulation steps)
+ - **Precision:** bfloat16
+ - **Max sequence length:** 2048
+ - **Optimizer:** AdamW
+ - **Gradient checkpointing:** enabled
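The batch-size arithmetic behind these settings can be worked through as below. The step counts are estimates derived from the listed hyperparameters, assuming no samples are dropped and the final partial batch counts as a step; the training framework may round differently, so treat them as approximate.

```python
import math

# Values taken from the Training Details list.
per_gpu_batch = 8
num_gpus = 2
grad_accum = 2
train_samples = 30_000
epochs = 2
warmup_ratio = 0.03

# Samples consumed per optimizer step.
effective_batch = per_gpu_batch * num_gpus * grad_accum

# Estimated optimizer steps (assumption: last partial batch is kept).
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```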
+
+ ## Performance
+
+ ### Top-5 models on the 1,000-sample human test set
+
+ | Rank | Model | Accuracy | F1-Macro |
+ |---:|---|---:|---:|
+ | 1 | Claude Opus 4.5 | 83.9% | 0.838 |
+ | 2 | Gemini-3-Flash | 83.7% | 0.833 |
+ | 3 | GLM-4.7 | 82.6% | 0.809 |
+ | 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
+ | 5 | GPT-5.2 | 80.5% | 0.805 |
+
+ Note: the accuracy values in the paper’s table place Eva-4B 4th, above GPT-5.2 (81.3% vs. 80.5%). The paper also states that Eva-4B **“ranks 5th overall and 2nd among open-source models (after GLM-4.7)”**, which appears inconsistent with the ordering implied by those accuracies.
+
+ ### Per-class F1 (Eva-4B)
+
+ | Class | F1 |
+ |---|---:|
+ | direct | 0.851 |
+ | intermediate | 0.698 |
+ | fully_evasive | 0.873 |
+
+ The paper notes most errors are confusion between **direct** and **intermediate**.
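Macro-F1 is the unweighted mean of the per-class F1 scores, so the per-class values above can be checked against the F1-Macro reported for Eva-4B. A short worked check (variable names are illustrative):

```python
# Per-class F1 values from the table above.
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}

# Macro-F1: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.3f}")  # 0.807
```

The result agrees with the 0.807 F1-Macro listed for Eva-4B, and shows that the `intermediate` class is what pulls the average down.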
+
+ ### Ablation (label-source comparison)
+
+ The paper compares Eva-4B's training labels (multi-model consensus plus judge) against an Opus-only construction:
+
+ - **Qwen-Opus-Only:** 78.9% accuracy
+ - **Eva-4B:** 81.3% accuracy (**+2.4 percentage points**)
+
+ The paper reports that the Opus-only baseline achieves lower training loss but generalizes worse.
+
+ ## Quick Start
+
+ The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
+
+ ````python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ model_name = "FutureMa/Eva-4B"
+
+ # Load the tokenizer and model; dtype and device placement are chosen automatically.
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
+
+ Question: {{question}}
+ Answer: {{answer}}
+
+ Response format:
+ ```json
+ {"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
+ ```
+
+ Answer in ```json content, no other text"""
+
+ question = "What are your revenue expectations for next quarter?"
+ answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."
+
+ # Fill the placeholders with str.replace so the literal JSON braces in the template are left untouched.
+ prompt = (
+     PROMPT_TEMPLATE
+     .replace("{{question}}", question)
+     .replace("{{answer}}", answer)
+ )
+
+ # Wrap the prompt in the chat template expected by the base model.
+ messages = [{"role": "user", "content": prompt}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Sampling is enabled here; pass do_sample=False instead for deterministic labels.
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=128,
+         temperature=0.7,
+         do_sample=True,
+     )
+
+ # Drop the prompt tokens and decode only the newly generated ones.
+ generated = output_ids[0][inputs["input_ids"].shape[1]:]
+ response = tokenizer.decode(generated, skip_special_tokens=True)
+ print(response)
+ ````
+
+ Expected output format:
+
+ ```json
+ {"reason": "...", "label": "direct|intermediate|fully_evasive"}
+ ```
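Downstream code usually needs just the label, so a small parser for this output format can be useful. The sketch below is an assumption about how one might post-process the response, not part of the released code: `parse_label` is a hypothetical helper, and the sample response is made up for illustration. It tolerates an optional `json` code fence around the object and rejects labels outside the three classes.

```python
import json
import re

VALID_LABELS = {"direct", "intermediate", "fully_evasive"}


def parse_label(response: str) -> str:
    """Extract and validate the 'label' field from a model response."""
    # Take the span from the first '{' to the last '}', ignoring any code fence.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    payload = json.loads(match.group(0))
    label = payload.get("label")
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label


# Made-up response for illustration; the fence is built programmatically
# so this example does not contain a literal triple backtick.
FENCE = "`" * 3
sample = FENCE + 'json\n{"reason": "No revenue figure given", "label": "fully_evasive"}\n' + FENCE
parsed = parse_label(sample)
print(parsed)  # fully_evasive
```

Raising on malformed output, rather than returning a default class, makes failed generations visible instead of silently skewing label counts.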
+
+ ## Limitations
+
+ - Domain-specific to earnings call Q&A
+ - English-only evaluation
+ - Multi-model plus judge labeling increases annotation cost (~2.2–2.3× that of single-model labeling)
+ - Judge position bias risk (no position randomization)
+ - Potential self-preference concerns (Opus judging its own predictions)
+ - Subjectivity in the intermediate class (lower agreement)
+ - Temporal drift (training data spans 2005–2023)
+
+ ## Ethics
+
+ Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
+
+ ## Citation
+
+ If you use this model, please cite the accompanying paper:
+
+ ```bibtex
+ @article{ma_evasionbench,
+   title={EvasionBench: Detecting Evasive Answers in Financial Q\&A via Multi-Model Consensus and LLM-as-Judge},
+   author={Ma, Shijian}
+ }
+ ```
+
+ ## Author
+
+ - Shijian Ma (mas8069@foxmail.com)
+
+ ---
+
+ Last updated: 2026-01-12