---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- finance
- earnings-calls
- financial-nlp
- text-classification
- qwen3
- llm-as-judge
- distillation
pipeline_tag: text-generation
library_name: transformers
spaces:
- FutureMa/financial-evasion-detection
---
# Eva-4B: Financial Evasion Detection Model
Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.
## 🚀 Try the Demo
You can test Eva-4B directly in your browser without installation:
**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)**
## Model Summary
- **Model name:** Eva-4B
- **Task:** 3-way classification of Q&A pairs into:
- `direct`
- `intermediate`
- `fully_evasive`
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Training method:** full-parameter fine-tuning
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)
## Intended Use
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
## Task Definition
Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:
- **direct:** answers the core question with specific information
- **intermediate:** provides related information but sidesteps the core question
- **fully_evasive:** does not address the question (refusal, redirection, non-response)
This taxonomy follows the Rasiah framework referenced in the paper.
## Dataset: EvasionBench (as reported in the paper)
### Sources
- Earnings call transcripts from the **S&P Capital IQ** database.
### Splits
- **Training:** 30,000 samples (balanced)
- direct: 10,000
- intermediate: 10,000
- fully_evasive: 10,000
- **Test (Human):** 1,000 samples (natural distribution)
- direct: 412 (41.2%)
- intermediate: 256 (25.6%)
- fully_evasive: 332 (33.2%)
### Labeling / Construction
The training set is constructed via a multi-model annotation framework:
- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
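The annotation protocol above can be sketched as a small routing function. This is an illustrative sketch only: `annotate_claude`, `annotate_gemini`, and `judge_resolve` are hypothetical stand-ins for the actual model-calling code, which is not part of this repo.

```python
def build_label(question: str, answer: str,
                annotate_claude, annotate_gemini, judge_resolve) -> dict:
    """Label one Q&A pair and record its provenance.

    Agreement between the two annotators (~70-80% of pairs) yields a
    high-confidence consensus label; disagreements (~20-30%) are escalated
    to an LLM-as-Judge for resolution.
    """
    label_a = annotate_claude(question, answer)   # Claude Opus 4.5
    label_b = annotate_gemini(question, answer)   # Gemini-3-Flash
    if label_a == label_b:
        return {"label": label_a, "source": "consensus"}
    resolved = judge_resolve(question, answer, label_a, label_b)
    return {"label": resolved, "source": "judge"}
```

Per the reported mix, roughly 83.5% of training labels come out of the consensus branch and 16.5% out of the judge branch.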
### Human validation (test set)
- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**.
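For reference, Cohen's kappa compares observed agreement against chance agreement derived from each annotator's label marginals. A minimal self-contained sketch (in practice `sklearn.metrics.cohen_kappa_score` computes the same quantity):

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:  # degenerate case: both annotators used a single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.835 on the double-annotated subset indicates strong agreement well above chance.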
## Training Details
- **Base model:** Qwen3-4B-Instruct-2507
- **Fine-tuning:** full-parameter fine-tuning
- **Framework:** MS-Swift
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each)
- **Epochs:** 2
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
- **Batch size:** 8 per GPU
- **Gradient accumulation:** 2 (effective batch size 32)
- **Precision:** bfloat16
- **Max sequence length:** 2048
- **Optimizer:** AdamW
- **Gradient checkpointing:** enabled
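As a sanity check, the stated effective batch size follows directly from the per-GPU batch, GPU count, and gradient accumulation:

```python
per_gpu_batch = 8       # batch size per GPU
num_gpus = 2            # 2x NVIDIA B200
grad_accum_steps = 2    # gradient accumulation
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
print(effective_batch)  # 32
```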
## Performance
### Top-5 models on the 1,000-sample human test set
| Rank | Model | Accuracy | F1-Macro |
|---:|---|---:|---:|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
| 5 | GPT-5.2 | 80.5% | 0.805 |
Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).
### Per-class F1 (Eva-4B)
| Class | F1 |
|------:|---:|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |
The paper notes that most errors stem from confusing **direct** with **intermediate**.
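The F1-Macro of 0.807 reported in the leaderboard above is the unweighted mean of these per-class scores:

```python
# Per-class F1 scores from the table above.
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}
f1_macro = sum(per_class_f1.values()) / len(per_class_f1)
print(round(f1_macro, 3))  # 0.807
```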
### Ablation (label-source comparison)
The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:
- **Qwen-Opus-Only:** 78.9% accuracy
- **Eva-4B:** 81.3% accuracy (**+2.4 percentage points**)
The paper reports the Opus-only baseline achieves lower training loss but worse generalization.
## Quick Start
The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "FutureMa/Eva-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
)
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
Question: {{question}}
Answer: {{answer}}
Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```
Answer in json block content, no other text"""
question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."
prompt = (
PROMPT_TEMPLATE
.replace("{{question}}", question)
.replace("{{answer}}", answer)
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=128,
temperature=0.7,
do_sample=True,
)
generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````
Expected output format:
```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```
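In practice you will want to turn the fenced JSON response into a structured prediction. The helper below is a suggested sketch (not part of this repo) that extracts and validates the payload, tolerating both a bare JSON object and one wrapped in a ```` ```json ```` fence:

```python
import json
import re

VALID_LABELS = {"direct", "intermediate", "fully_evasive"}

def parse_eva_output(text: str) -> dict:
    """Extract the {"reason": ..., "label": ...} payload from a response."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(match.group(0))
    if payload.get("label") not in VALID_LABELS:
        raise ValueError(f"unexpected label: {payload.get('label')!r}")
    return payload
```

For example, `parse_eva_output(response)` on the Quick Start output yields a dict whose `"label"` key is one of the three classes.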
## Limitations
- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)
## Ethics
Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
## Citation
If you use this model, please cite the accompanying paper:
```bibtex
@misc{ma2026evasionbenchdetectingevasiveanswers,
title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge},
author={Shijian Ma and Yan Lin and Yi Yang},
year={2026},
eprint={2601.09142},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.09142},
}
```
Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142)
## Author
- Shijian Ma (mas8069@foxmail.com)
---
Last updated: 2026-01-12