|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qwen/Qwen3-4B-Instruct-2507 |
|
|
tags: |
|
|
- finance |
|
|
- earnings-calls |
|
|
- financial-nlp |
|
|
- text-classification |
|
|
- qwen3 |
|
|
- llm-as-judge |
|
|
- distillation |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
spaces: |
|
|
- FutureMa/financial-evasion-detection |
|
|
--- |
|
|
|
|
|
# Eva-4B: Financial Evasion Detection Model |
|
|
|
|
|
[🤗 Try the demo on Hugging Face Spaces](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)
|
|
|
|
|
Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**. |
|
|
|
|
|
## 🚀 Try the Demo |
|
|
|
|
|
You can test Eva-4B directly in your browser without installation: |
|
|
|
|
|
**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)** |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
- **Model name:** Eva-4B |
|
|
- **Task:** 3-way classification of Q&A pairs into: |
|
|
- `direct` |
|
|
- `intermediate` |
|
|
- `fully_evasive` |
|
|
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507` |
|
|
- **Training method:** full-parameter fine-tuning |
|
|
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class) |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A. |
|
|
|
|
|
## Task Definition |
|
|
|
|
|
Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of: |
|
|
|
|
|
- **direct:** answers the core question with specific information |
|
|
- **intermediate:** provides related information but sidesteps the core question |
|
|
- **fully_evasive:** does not address the question (refusal, redirection, non-response) |
|
|
|
|
|
This taxonomy follows the Rasiah framework referenced in the paper. |
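
To make the boundaries concrete, here are hypothetical mini-examples of each label (written for this card as illustrations; they are not drawn from EvasionBench):

```python
# Hypothetical mini-examples of the three labels (illustrative only).
EXAMPLES = {
    "direct": {
        "question": "What was gross margin this quarter?",
        "answer": "Gross margin was 42.3%, up 80 basis points year over year.",
    },
    "intermediate": {
        "question": "What was gross margin this quarter?",
        "answer": "Our cost program is helping margins, and we like the trend.",
    },
    "fully_evasive": {
        "question": "What was gross margin this quarter?",
        "answer": "We don't break that out; let's take the next question.",
    },
}
```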
|
|
|
|
|
## Dataset: EvasionBench (as reported in the paper) |
|
|
|
|
|
### Sources |
|
|
|
|
|
- Earnings call transcripts from the **S&P Capital IQ** database. |
|
|
|
|
|
### Splits |
|
|
|
|
|
- **Training:** 30,000 samples (balanced) |
|
|
- direct: 10,000 |
|
|
- intermediate: 10,000 |
|
|
- fully_evasive: 10,000 |
|
|
- **Test (Human):** 1,000 samples (natural distribution) |
|
|
- direct: 412 (41.2%) |
|
|
- intermediate: 256 (25.6%) |
|
|
- fully_evasive: 332 (33.2%) |
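
As a quick sanity check, the natural-distribution percentages follow directly from the counts above:

```python
# Human test-set class counts as reported above.
test_counts = {"direct": 412, "intermediate": 256, "fully_evasive": 332}

total = sum(test_counts.values())  # 1000
for label, n in test_counts.items():
    print(f"{label}: {n} ({n / total:.1%})")  # 41.2%, 25.6%, 33.2%
```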
|
|
|
|
|
### Labeling / Construction |
|
|
|
|
|
The training set is constructed via a multi-model annotation framework: |
|
|
|
|
|
- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash** |
|
|
- Agreement cases (~70–80%) are accepted as high-confidence labels
|
|
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5** |
|
|
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%) |
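
A minimal sketch of this consensus-plus-judge protocol is shown below. The `annotate_opus`, `annotate_gemini`, and `judge_opus` functions are hypothetical placeholders for API calls to the respective models; the paper's actual prompts and tooling are not reproduced here:

```python
# Sketch of the multi-model labeling protocol. The three helpers below are
# hypothetical placeholders for Claude Opus 4.5 / Gemini-3-Flash API calls.

def annotate_opus(question: str, answer: str) -> str:
    raise NotImplementedError("placeholder for a Claude Opus 4.5 call")

def annotate_gemini(question: str, answer: str) -> str:
    raise NotImplementedError("placeholder for a Gemini-3-Flash call")

def judge_opus(question: str, answer: str, candidates: list[str]) -> str:
    raise NotImplementedError("placeholder for the LLM-as-Judge call")

def label_sample(question: str, answer: str) -> tuple[str, str]:
    """Return (label, provenance) for one Q&A pair."""
    label_a = annotate_opus(question, answer)
    label_b = annotate_gemini(question, answer)

    if label_a == label_b:
        # ~70-80% of pairs: annotators agree -> high-confidence consensus.
        return label_a, "consensus"

    # ~20-30% of pairs: disagreement -> judge picks between the candidates.
    return judge_opus(question, answer, [label_a, label_b]), "judge"
```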
|
|
|
|
|
### Human validation (test set) |
|
|
|
|
|
- A 100-sample subset of the test set is independently annotated by two human experts.
|
|
- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**. |
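
For reference, Cohen's Kappa between two annotators can be computed with scikit-learn (a generic sketch on toy labels, not the paper's annotation data):

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-in labels for two annotators (illustrative only).
annotator_1 = ["direct", "intermediate", "direct", "fully_evasive", "direct"]
annotator_2 = ["direct", "direct", "direct", "fully_evasive", "direct"]

print(cohen_kappa_score(annotator_1, annotator_2))  # agreement beyond chance
```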
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base model:** Qwen3-4B-Instruct-2507 |
|
|
- **Fine-tuning:** full-parameter fine-tuning |
|
|
- **Framework:** MS-Swift |
|
|
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each) |
|
|
- **Epochs:** 2 |
|
|
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio) |
|
|
- **Batch size:** 8 per GPU |
|
|
- **Gradient accumulation:** 2 (effective batch size 32) |
|
|
- **Precision:** bfloat16 |
|
|
- **Max sequence length:** 2048 |
|
|
- **Optimizer:** AdamW |
|
|
- **Gradient checkpointing:** enabled |
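
For readers who prefer plain `transformers` over MS-Swift, the reported hyperparameters map roughly onto `TrainingArguments` as below (an approximate sketch, not the paper's actual launch configuration; the 2048-token limit is applied at tokenization time):

```python
from transformers import TrainingArguments

# Approximate transformers-native equivalent of the reported setup.
args = TrainingArguments(
    output_dir="eva-4b-sft",        # hypothetical output path
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    per_device_train_batch_size=8,  # x 2 GPUs x 2 accumulation = 32 effective
    gradient_accumulation_steps=2,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",            # AdamW
)
```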
|
|
|
|
|
## Performance |
|
|
|
|
|
### Top-5 models on the 1,000-sample human test set |
|
|
|
|
|
| Rank | Model | Accuracy | F1-Macro | |
|
|
|---:|---|---:|---:| |
|
|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 | |
|
|
| 2 | Gemini-3-Flash | 83.7% | 0.833 | |
|
|
| 3 | GLM-4.7 | 82.6% | 0.809 | |
|
|
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** | |
|
|
| 5 | GPT-5.2 | 80.5% | 0.805 | |
|
|
|
|
|
Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).
|
|
|
|
|
### Per-class F1 (Eva-4B) |
|
|
|
|
|
| Class | F1 | |
|
|
|---|---:|
|
|
| direct | 0.851 | |
|
|
| intermediate | 0.698 | |
|
|
| fully_evasive | 0.873 | |
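
These per-class scores are consistent with the leaderboard's macro-F1, which is simply their unweighted mean:

```python
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.807, matching the table above
```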
|
|
|
|
|
The paper notes that most errors stem from confusion between **direct** and **intermediate**.
|
|
|
|
|
### Ablation (label-source comparison) |
|
|
|
|
|
The paper compares Eva-4B's training labels (multi-model consensus + judge) against an Opus-only label construction:
|
|
|
|
|
- **Qwen-Opus-Only:** 78.9% accuracy |
|
|
- **Eva-4B:** 81.3% accuracy (**+2.4** percentage points)
|
|
|
|
|
The paper reports the Opus-only baseline achieves lower training loss but worse generalization. |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo. |
|
|
|
|
|
````python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model_name = "FutureMa/Eva-4B" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
|
|
|
|
|
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A |
|
|
|
|
|
Question: {{question}} |
|
|
Answer: {{answer}} |
|
|
|
|
|
Response format: |
|
|
```json |
|
|
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"} |
|
|
``` |
|
|
|
|
|
Answer in json block content, no other text""" |
|
|
|
|
|
question = "What are your revenue expectations for next quarter?" |
|
|
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities." |
|
|
|
|
|
prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)
|
|
|
|
|
messages = [{"role": "user", "content": prompt}] |
|
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer([text], return_tensors="pt").to(model.device) |
|
|
|
|
|
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )
|
|
|
|
|
generated = output_ids[0][inputs["input_ids"].shape[1]:] |
|
|
response = tokenizer.decode(generated, skip_special_tokens=True) |
|
|
print(response) |
|
|
```` |
|
|
|
|
|
Expected output format: |
|
|
|
|
|
```json |
|
|
{"reason": "...", "label": "direct|intermediate|fully_evasive"} |
|
|
``` |
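
Because the model emits its JSON inside a fenced block, a small helper makes downstream use easier (a sketch; harden it for production inputs):

````python
import json
import re

def parse_prediction(response: str) -> dict:
    """Pull the JSON object out of a (possibly fenced) model response."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in: {response!r}")
    return json.loads(match.group(0))

pred = parse_prediction(
    '```json\n{"reason": "no figures given", "label": "fully_evasive"}\n```'
)
print(pred["label"])  # fully_evasive
````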
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Domain-specific to earnings call Q&A |
|
|
- English-only evaluation |
|
|
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model) |
|
|
- Judge position bias risk (no position randomization) |
|
|
- Potential self-preference concerns (Opus judging its own predictions) |
|
|
- Subjectivity in the intermediate class (lower agreement) |
|
|
- Temporal drift (training data spans 2005–2023) |
|
|
|
|
|
## Ethics |
|
|
|
|
|
Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite the accompanying paper: |
|
|
|
|
|
```bibtex |
|
|
@misc{ma2026evasionbenchdetectingevasiveanswers, |
|
|
title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge}, |
|
|
author={Shijian Ma and Yan Lin and Yi Yang}, |
|
|
year={2026}, |
|
|
eprint={2601.09142}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2601.09142}, |
|
|
} |
|
|
``` |
|
|
|
|
|
Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142) |
|
|
|
|
|
## Author |
|
|
|
|
|
- Shijian Ma (mas8069@foxmail.com) |
|
|
|
|
|
--- |
|
|
|
|
|
Last updated: 2026-01-12 |