---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- finance
- earnings-calls
- financial-nlp
- text-classification
- qwen3
- llm-as-judge
- distillation
pipeline_tag: text-generation
library_name: transformers
spaces:
- FutureMa/financial-evasion-detection
---
# Eva-4B: Financial Evasion Detection Model
Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.
## 🚀 Try the Demo
You can test Eva-4B directly in your browser without installation:
**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)**
## Model Summary
- **Model name:** Eva-4B
- **Task:** 3-way classification of Q&A pairs into:
- `direct`
- `intermediate`
- `fully_evasive`
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Training method:** full-parameter fine-tuning
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)
## Intended Use
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
## Task Definition
Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:
- **direct:** answers the core question with specific information
- **intermediate:** provides related information but sidesteps the core question
- **fully_evasive:** does not address the question (refusal, redirection, non-response)
This taxonomy follows the Rasiah framework referenced in the paper.
## Dataset: EvasionBench (as reported in the paper)
### Sources
- Earnings call transcripts from the **S&P Capital IQ** database.
### Splits
- **Training:** 30,000 samples (balanced)
- direct: 10,000
- intermediate: 10,000
- fully_evasive: 10,000
- **Test (Human):** 1,000 samples (natural distribution)
- direct: 412 (41.2%)
- intermediate: 256 (25.6%)
- fully_evasive: 332 (33.2%)
### Labeling / Construction
The training set is constructed via a multi-model annotation framework:
- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
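The annotation protocol above can be sketched as a small routing function. This is an illustrative sketch only: `annotate_claude`, `annotate_gemini`, and `judge_resolve` are hypothetical stand-ins for the actual model-calling code, which is not part of this repo.

```python
def build_label(question: str, answer: str,
                annotate_claude, annotate_gemini, judge_resolve) -> dict:
    """Label one Q&A pair and record its provenance.

    Agreement between the two annotators (~70-80% of pairs) yields a
    high-confidence consensus label; disagreements (~20-30%) are escalated
    to an LLM-as-Judge for resolution.
    """
    label_a = annotate_claude(question, answer)   # Claude Opus 4.5
    label_b = annotate_gemini(question, answer)   # Gemini-3-Flash
    if label_a == label_b:
        return {"label": label_a, "source": "consensus"}
    resolved = judge_resolve(question, answer, label_a, label_b)
    return {"label": resolved, "source": "judge"}
```

Per the reported mix, roughly 83.5% of training labels come out of the consensus branch and 16.5% out of the judge branch.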
### Human validation (test set)
- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**.
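For reference, Cohen's kappa compares observed agreement against chance agreement derived from each annotator's label marginals. A minimal self-contained sketch (in practice `sklearn.metrics.cohen_kappa_score` computes the same quantity):

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:  # degenerate case: both annotators used a single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.835 on the double-annotated subset indicates strong agreement well above chance.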
## Training Details
- **Base model:** Qwen3-4B-Instruct-2507
- **Fine-tuning:** full-parameter fine-tuning
- **Framework:** MS-Swift
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each)
- **Epochs:** 2
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
- **Batch size:** 8 per GPU
- **Gradient accumulation:** 2 (effective batch size 32)
- **Precision:** bfloat16
- **Max sequence length:** 2048
- **Optimizer:** AdamW
- **Gradient checkpointing:** enabled
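As a sanity check, the stated effective batch size follows directly from the per-GPU batch, GPU count, and gradient accumulation:

```python
per_gpu_batch = 8       # batch size per GPU
num_gpus = 2            # 2x NVIDIA B200
grad_accum_steps = 2    # gradient accumulation
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
print(effective_batch)  # 32
```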
## Performance
### Top-5 models on the 1,000-sample human test set
| Rank | Model | Accuracy | F1-Macro |
|---:|---|---:|---:|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
| 5 | GPT-5.2 | 80.5% | 0.805 |
Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).
### Per-class F1 (Eva-4B)
| Class | F1 |
|------:|---:|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |
The paper notes that most errors stem from confusing **direct** with **intermediate**.
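The F1-Macro of 0.807 reported in the leaderboard above is the unweighted mean of these per-class scores:

```python
# Per-class F1 scores from the table above.
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}
f1_macro = sum(per_class_f1.values()) / len(per_class_f1)
print(round(f1_macro, 3))  # 0.807
```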
### Ablation (label-source comparison)
The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:
- **Qwen-Opus-Only:** 78.9% accuracy
- **Eva-4B:** 81.3% accuracy (**+2.4 percentage points**)
The paper reports the Opus-only baseline achieves lower training loss but worse generalization.
## Quick Start
The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "FutureMa/Eva-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
)
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
Question: {{question}}
Answer: {{answer}}
Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```
Answer in json block content, no other text"""
question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."
prompt = (
PROMPT_TEMPLATE
.replace("{{question}}", question)
.replace("{{answer}}", answer)
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
output_ids = model.generate(
**inputs,
max_new_tokens=128,
temperature=0.7,
do_sample=True,
)
generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````
Expected output format:
```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```
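In practice you will want to turn the fenced JSON response into a structured prediction. The helper below is a suggested sketch (not part of this repo) that extracts and validates the payload, tolerating both a bare JSON object and one wrapped in a ```` ```json ```` fence:

```python
import json
import re

VALID_LABELS = {"direct", "intermediate", "fully_evasive"}

def parse_eva_output(text: str) -> dict:
    """Extract the {"reason": ..., "label": ...} payload from a response."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(match.group(0))
    if payload.get("label") not in VALID_LABELS:
        raise ValueError(f"unexpected label: {payload.get('label')!r}")
    return payload
```

For example, `parse_eva_output(response)` on the Quick Start output yields a dict whose `"label"` key is one of the three classes.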
## Limitations
- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)
## Ethics
Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
## Citation
If you use this model, please cite the accompanying paper:
```bibtex
@misc{ma2026evasionbenchdetectingevasiveanswers,
title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge},
author={Shijian Ma and Yan Lin and Yi Yang},
year={2026},
eprint={2601.09142},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.09142},
}
```
Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142)
## Author
- Shijian Ma (mas8069@foxmail.com)
---
Last updated: 2026-01-12