---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- finance
- earnings-calls
- financial-nlp
- text-classification
- qwen3
- llm-as-judge
- distillation
pipeline_tag: text-generation
library_name: transformers
spaces:
- FutureMa/financial-evasion-detection
---
# Eva-4B: Financial Evasion Detection Model
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)
Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.
## 🚀 Try the Demo
You can test Eva-4B directly in your browser without installation:
**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)**
## Model Summary
- **Model name:** Eva-4B
- **Task:** 3-way classification of Q&A pairs into:
- `direct`
- `intermediate`
- `fully_evasive`
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Training method:** full-parameter fine-tuning
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)
## Intended Use
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
## Task Definition
Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:
- **direct:** answers the core question with specific information
- **intermediate:** provides related information but sidesteps the core question
- **fully_evasive:** does not address the question (refusal, redirection, non-response)
This taxonomy follows the Rasiah framework referenced in the paper.
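To make the taxonomy concrete, here are hand-written illustrative Q&A pairs (invented for this card, not drawn from EvasionBench), one per label:

```python
# Invented illustrative examples, one per label (not EvasionBench samples).
examples = [
    {
        "question": "What was your gross margin this quarter?",
        "answer": "Gross margin was 42.3%, up 150 basis points year over year.",
        "label": "direct",  # answers the core question with a specific figure
    },
    {
        "question": "What was your gross margin this quarter?",
        "answer": "Our cost program is working and we see healthy trends across segments.",
        "label": "intermediate",  # related information, but the number is sidestepped
    },
    {
        "question": "What was your gross margin this quarter?",
        "answer": "We don't break that out; let me talk about the product roadmap instead.",
        "label": "fully_evasive",  # refusal plus redirection
    },
]
```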
## Dataset: EvasionBench (as reported in the paper)
### Sources
- Earnings call transcripts from the **S&P Capital IQ** database.
### Splits
- **Training:** 30,000 samples (balanced)
- direct: 10,000
- intermediate: 10,000
- fully_evasive: 10,000
- **Test (Human):** 1,000 samples (natural distribution)
- direct: 412 (41.2%)
- intermediate: 256 (25.6%)
- fully_evasive: 332 (33.2%)
### Labeling / Construction
The training set is constructed via a multi-model annotation framework:
- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
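A minimal sketch of this flow is below. The callables (`annotate_opus`, `annotate_gemini`, `judge`) are hypothetical stand-ins, not the authors' code; only the control flow follows the description above:

```python
from typing import Callable

Annotator = Callable[[str, str], str]  # (question, answer) -> label

def label_sample(
    question: str,
    answer: str,
    annotate_opus: Annotator,    # hypothetical Claude Opus 4.5 wrapper
    annotate_gemini: Annotator,  # hypothetical Gemini-3-Flash wrapper
    judge: Callable[[str, str, list[str]], str],  # hypothetical judge wrapper
) -> tuple[str, str]:
    """Consensus-then-judge labeling flow, per the description above."""
    label_a = annotate_opus(question, answer)
    label_b = annotate_gemini(question, answer)
    if label_a == label_b:
        # Agreement (~70-80% of pairs): kept as a high-confidence label.
        return label_a, "consensus"
    # Disagreement (~20-30%): Claude Opus 4.5 as judge picks between candidates.
    return judge(question, answer, [label_a, label_b]), "judge_resolved"
```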
### Human validation (test set)
- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**.
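Cohen's Kappa corrects raw agreement for chance; 0.835 falls in the "almost perfect" band of the Landis–Koch scale. It can be computed from two annotators' label lists with scikit-learn (the toy labels below are illustrative, not the actual 100-sample subset):

```python
from sklearn.metrics import cohen_kappa_score

# Toy, illustrative annotations from two experts.
expert_1 = ["direct", "intermediate", "fully_evasive", "direct", "intermediate"]
expert_2 = ["direct", "intermediate", "fully_evasive", "direct", "direct"]

kappa = cohen_kappa_score(expert_1, expert_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```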
## Training Details
- **Base model:** Qwen3-4B-Instruct-2507
- **Fine-tuning:** full-parameter fine-tuning
- **Framework:** MS-Swift
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each)
- **Epochs:** 2
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
- **Batch size:** 8 per GPU
- **Gradient accumulation:** 2 (effective batch size 32)
- **Precision:** bfloat16
- **Max sequence length:** 2048
- **Optimizer:** AdamW
- **Gradient checkpointing:** enabled
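The authors train with MS-Swift; as a reference point only, the sketch below mirrors the reported hyperparameters in plain `transformers.TrainingArguments` (this is not the authors' configuration, and model/dataset wiring is omitted):

```python
from transformers import TrainingArguments

# Mirror of the hyperparameters reported above, for reference only.
# Max sequence length (2048) is applied at tokenization time, not here.
args = TrainingArguments(
    output_dir="eva-4b-sft",
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    per_device_train_batch_size=8,  # x 2 GPUs x 2 accumulation = effective 32
    gradient_accumulation_steps=2,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",            # AdamW
)
```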
## Performance
### Top-5 models on the 1,000-sample human test set
| Rank | Model | Accuracy | F1-Macro |
|---:|---|---:|---:|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
| 5 | GPT-5.2 | 80.5% | 0.805 |
Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).
### Per-class F1 (Eva-4B)
| Class | F1 |
|---|---:|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |
The paper notes most errors are confusion between **direct** and **intermediate**.
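The per-class numbers (and the direct/intermediate confusion) can be reproduced from predictions with scikit-learn; `y_true`/`y_pred` below are illustrative placeholders, not the actual test-set outputs:

```python
from sklearn.metrics import classification_report, confusion_matrix

LABELS = ["direct", "intermediate", "fully_evasive"]

# Illustrative placeholders; in practice these come from running Eva-4B
# over the 1,000-sample human test set.
y_true = ["direct", "intermediate", "fully_evasive", "intermediate"]
y_pred = ["direct", "direct", "fully_evasive", "intermediate"]

# Per-class F1 plus the macro average reported above.
print(classification_report(y_true, y_pred, labels=LABELS, digits=3))

# Rows = true labels, columns = predictions; off-diagonal mass between
# direct and intermediate is the dominant error mode noted above.
print(confusion_matrix(y_true, y_pred, labels=LABELS))
```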
### Ablation (label-source comparison)
The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:
- **Qwen-Opus-Only:** 78.9% accuracy
- **Eva-4B:** 81.3% accuracy (**+2.4 percentage points** absolute)
The paper reports the Opus-only baseline achieves lower training loss but worse generalization.
## Quick Start
The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Prompt template used for fine-tuning; keep it verbatim for best results.
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
Question: {{question}}
Answer: {{answer}}
Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```
Answer in json block content, no other text"""

question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."

# Fill the {{question}} / {{answer}} placeholders.
prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,  # set do_sample=False for deterministic labels
        do_sample=True,
    )

# Decode only the newly generated tokens, excluding the prompt.
generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````
Expected output format:
```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```
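Since the model wraps its JSON in a fenced block, a small parsing helper is handy. This is an illustrative utility, not part of the model repo:

```python
import json
import re

def parse_label(response: str) -> dict:
    """Pull the first {...} object out of a possibly fenced model response."""
    # The expected output is flat JSON, so a non-greedy {...} match is safe.
    match = re.search(r"\{.*?\}", response, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

raw = '```json\n{"reason": "no specifics given", "label": "fully_evasive"}\n```'
print(parse_label(raw)["label"])  # fully_evasive
```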
## Limitations
- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)
## Ethics
Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
## Citation
If you use this model, please cite the accompanying paper:
```bibtex
@misc{ma2026evasionbenchdetectingevasiveanswers,
title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge},
author={Shijian Ma and Yan Lin and Yi Yang},
year={2026},
eprint={2601.09142},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.09142},
}
```
Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142)
## Author
- Shijian Ma (mas8069@foxmail.com)
---
Last updated: 2026-01-12