---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- finance
- earnings-calls
- financial-nlp
- text-classification
- qwen3
- llm-as-judge
- distillation
pipeline_tag: text-generation
library_name: transformers
spaces:
- FutureMa/financial-evasion-detection
---

# Eva-4B: Financial Evasion Detection Model

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)

Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.

## 🚀 Try the Demo

You can test Eva-4B directly in your browser without installation:

**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)**

## Model Summary

- **Model name:** Eva-4B
- **Task:** 3-way classification of Q&A pairs into:
  - `direct`
  - `intermediate`
  - `fully_evasive`
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Training method:** full-parameter fine-tuning
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)

## Intended Use

Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.

## Task Definition

Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:

- **direct:** answers the core question with specific information
- **intermediate:** provides related information but sidesteps the core question
- **fully_evasive:** does not address the question (refusal, redirection, non-response)

This taxonomy follows the Rasiah framework referenced in the paper.

## Dataset: EvasionBench (as reported in the paper)

### Sources

- Earnings call transcripts from the **S&P Capital IQ** database.

### Splits

- **Training:** 30,000 samples (balanced)
  - direct: 10,000
  - intermediate: 10,000
  - fully_evasive: 10,000
- **Test (Human):** 1,000 samples (natural distribution)
  - direct: 412 (41.2%)
  - intermediate: 256 (25.6%)
  - fully_evasive: 332 (33.2%)

### Labeling / Construction

The training set is constructed via a multi-model annotation framework:

- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
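
The consensus-plus-judge flow above can be sketched as follows. This is an illustrative reconstruction, not the paper's actual pipeline code; the function, the `judge` callable, and the provenance tags are all hypothetical names.

```python
# Illustrative sketch of the multi-model labeling flow described above.
# The annotator labels and the judge callable are placeholders, not a real API.
LABELS = {"direct", "intermediate", "fully_evasive"}

def build_training_label(label_a: str, label_b: str, judge) -> tuple:
    """Return (label, provenance) for one Q&A pair.

    label_a / label_b: labels from the two annotator models
    judge: callable that resolves disagreements (LLM-as-Judge role)
    """
    assert label_a in LABELS and label_b in LABELS
    if label_a == label_b:
        return label_a, "consensus"          # ~83.5% of the reported final mix
    return judge(label_a, label_b), "judge"  # ~16.5% judge-resolved

# toy usage with a trivial stand-in judge
label, source = build_training_label("direct", "direct", judge=lambda a, b: a)
print(label, source)  # direct consensus
```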

### Human validation (test set)

- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**.
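
For reference, Cohen's Kappa compares observed annotator agreement against the agreement expected by chance from each annotator's label frequencies. A minimal pure-Python version, run on made-up toy labels (not the paper's annotation data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement rate
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement under independent per-annotator marginals
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# toy example: 9/10 agreements over two classes
a = ["direct"] * 5 + ["fully_evasive"] * 5
b = ["direct"] * 5 + ["fully_evasive"] * 4 + ["direct"]
print(round(cohens_kappa(a, b), 3))  # 0.8
```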

## Training Details

- **Base model:** Qwen3-4B-Instruct-2507
- **Fine-tuning:** full-parameter fine-tuning
- **Framework:** MS-Swift
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each)
- **Epochs:** 2
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
- **Batch size:** 8 per GPU
- **Gradient accumulation:** 2 (effective batch size 32)
- **Precision:** bfloat16
- **Max sequence length:** 2048
- **Optimizer:** AdamW
- **Gradient checkpointing:** enabled
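
The effective batch size follows directly from the numbers above:

```python
per_gpu_batch = 8   # batch size per GPU
num_gpus = 2        # 2x NVIDIA B200
grad_accum = 2      # gradient accumulation steps

effective_batch = per_gpu_batch * num_gpus * grad_accum
print(effective_batch)  # 32
```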

## Performance

### Top-5 models on the 1,000-sample human test set

| Rank | Model | Accuracy | F1-Macro |
|---:|---|---:|---:|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
| 5 | GPT-5.2 | 80.5% | 0.805 |

Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).

### Per-class F1 (Eva-4B)

| Class | F1 |
|------:|---:|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |

The paper notes most errors are confusion between **direct** and **intermediate**.
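
As a sanity check, the F1-Macro value in the leaderboard is the unweighted mean of the per-class F1 scores above:

```python
# Per-class F1 scores from the table above
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}

# Macro-F1 is the unweighted average across classes
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.807, matching the leaderboard entry
```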

### Ablation (label-source comparison)

The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:

- **Qwen-Opus-Only:** 78.9% accuracy
- **Eva-4B:** 81.3% accuracy (**+2.4%** absolute)

The paper reports the Opus-only baseline achieves lower training loss but worse generalization.

## Quick Start

The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A

Question: {{question}}
Answer: {{answer}}

Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```

Answer in json block content, no other text"""

question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."

prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )

generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````

Expected output format:

```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```
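
To consume the output programmatically, you can extract and validate the JSON object. A minimal sketch, assuming the fenced ```json block shown above (with a fallback for bare JSON, since instruction-tuned models occasionally omit the fence):

````python
import json
import re

VALID_LABELS = {"direct", "intermediate", "fully_evasive"}

def parse_prediction(response: str) -> dict:
    """Extract the {"reason": ..., "label": ...} object from a model response."""
    # Prefer a fenced ```json block; fall back to the first {...} span.
    match = re.search(r"```json\s*(\{.*?\})\s*```", response, re.DOTALL)
    raw = match.group(1) if match else re.search(r"\{.*\}", response, re.DOTALL).group(0)
    pred = json.loads(raw)
    if pred.get("label") not in VALID_LABELS:
        raise ValueError(f"unexpected label: {pred.get('label')!r}")
    return pred

# toy response in the expected format
example = '```json\n{"reason": "vague, no numbers", "label": "fully_evasive"}\n```'
print(parse_prediction(example)["label"])  # fully_evasive
````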

## Limitations

- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)

## Ethics

Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.

## Citation

If you use this model, please cite the accompanying paper:

```bibtex
@misc{ma2026evasionbenchdetectingevasiveanswers,
      title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge}, 
      author={Shijian Ma and Yan Lin and Yi Yang},
      year={2026},
      eprint={2601.09142},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.09142}, 
}
```

Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142)

## Author

- Shijian Ma (mas8069@foxmail.com)

---

Last updated: 2026-01-12