+---
+language:
+- en
+license: apache-2.0
+base_model:
+- Qwen/Qwen3-4B-Instruct-2507
+tags:
+- finance
+- earnings-calls
+- financial-nlp
+- text-classification
+- qwen3
+- llm-as-judge
+- distillation
+pipeline_tag: text-generation
+library_name: transformers
+---
+# Eva-4B: Financial Evasion Detection Model
+Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.
+## Model Summary
+- **Model name:** Eva-4B
+- **Task:** 3-way classification of Q&A pairs into:
+  - `direct`
+  - `intermediate`
+  - `fully_evasive`
+- **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
+- **Training method:** full-parameter fine-tuning
+- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)
+## Intended Use
+Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
+## Task Definition
+Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:
+- **direct:** answers the core question with specific information
+- **intermediate:** provides related information but sidesteps the core question
+- **fully_evasive:** does not address the question (refusal, redirection, non-response)
+This taxonomy follows the Rasiah framework referenced in the paper.
+## Dataset: EvasionBench (as reported in the paper)
+### Sources
+- Earnings call transcripts from the **S&P Capital IQ** database.
+### Splits
+- **Training:** 30,000 samples (balanced)
+  - direct: 10,000
+  - intermediate: 10,000
+  - fully_evasive: 10,000
+- **Test (Human):** 1,000 samples (natural distribution)
+  - direct: 412 (41.2%)
+  - intermediate: 256 (25.6%)
+  - fully_evasive: 332 (33.2%)
+### Labeling / Construction
+The training set is constructed via a multi-model annotation framework:
+- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
+- Agreement cases (~70–80%) are treated as high-confidence
+- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
+- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
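The consensus-then-judge construction above can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: `build_label`, the `judge` callable, and `stub_judge` are hypothetical names, and a real judge would be an LLM call that also sees the question and answer.

```python
from typing import Callable, Tuple

LABELS = {"direct", "intermediate", "fully_evasive"}


def build_label(
    label_a: str,
    label_b: str,
    judge: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (final_label, source) for one Q&A pair.

    label_a and label_b are the two annotators' labels. The judge is
    consulted only when they disagree.
    """
    for label in (label_a, label_b):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    if label_a == label_b:
        # Agreement: keep the shared label as a high-confidence sample.
        return label_a, "consensus"
    final = judge(label_a, label_b)
    if final not in LABELS:
        raise ValueError(f"judge returned unknown label: {final!r}")
    return final, "judge"


def stub_judge(label_a: str, label_b: str) -> str:
    """Placeholder judge for illustration only: always sides with annotator A."""
    return label_a


print(build_label("direct", "direct", stub_judge))
print(build_label("intermediate", "fully_evasive", stub_judge))
```

Tracking the `source` field alongside the label makes it straightforward to report the consensus vs. judge-resolved mix afterwards.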
+### Human validation (test set)
+- A 100-sample subset is double-annotated by two experts.
+- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**.
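For readers unfamiliar with the metric, Cohen's Kappa compares observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below implements the standard formula; the function name `cohens_kappa` and the four-item toy data are illustrative and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (made up for illustration): the annotators disagree on one item.
annotator_1 = ["direct", "direct", "intermediate", "fully_evasive"]
annotator_2 = ["direct", "intermediate", "intermediate", "fully_evasive"]
print(round(cohens_kappa(annotator_1, annotator_2), 3))  # 0.636
```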
+## Training Details
+- **Base model:** Qwen3-4B-Instruct-2507
+- **Fine-tuning:** full-parameter fine-tuning
+- **Framework:** MS-Swift
+- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each)
+- **Epochs:** 2
+- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
+- **Batch size:** 8 per GPU
+- **Gradient accumulation:** 2 (effective batch size 32)
+- **Precision:** bfloat16
+- **Max sequence length:** 2048
+- **Optimizer:** AdamW
+- **Gradient checkpointing:** enabled
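The effective batch size follows from the per-GPU batch size, GPU count, and gradient accumulation listed above. The step counts below are a back-of-the-envelope estimate only: they assume the last partial batch is kept and that warmup steps are rounded up, which may differ from what MS-Swift actually does.

```python
import math

# Values taken from the list above.
per_gpu_batch = 8
num_gpus = 2
grad_accum = 2
train_samples = 30_000
epochs = 2
warmup_ratio = 0.03

# Samples consumed per optimizer step.
effective_batch = per_gpu_batch * num_gpus * grad_accum  # 32

# Assumption: the final partial batch of each epoch is not dropped.
steps_per_epoch = math.ceil(train_samples / effective_batch)  # 938
total_steps = steps_per_epoch * epochs  # 1876

# Assumption: warmup steps are rounded up; frameworks differ on this.
warmup_steps = math.ceil(total_steps * warmup_ratio)  # 57

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```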
+## Performance
+### Top-5 models on the 1,000-sample human test set
+| Rank | Model | Accuracy | F1-Macro |
+|---:|---|---:|---:|
+| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
+| 2 | Gemini-3-Flash | 83.7% | 0.833 |
+| 3 | GLM-4.7 | 82.6% | 0.809 |
+| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
+| 5 | GPT-5.2 | 80.5% | 0.805 |
+Note: in the table above, Eva-4B ranks 4th, ahead of GPT-5.2 on both accuracy (81.3% vs. 80.5%) and F1-Macro (0.807 vs. 0.805). The paper also states that Eva-4B **“ranks 5th overall and 2nd among open-source models (after GLM-4.7)”**, which appears inconsistent with that ordering.
+### Per-class F1 (Eva-4B)
+| Class | F1 |
+|------:|---:|
+| direct | 0.851 |
+| intermediate | 0.698 |
+| fully_evasive | 0.873 |
+The paper notes most errors are confusion between **direct** and **intermediate**.
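F1-Macro is the unweighted mean of the per-class F1 scores, so the per-class values above can be checked against the reported F1-Macro of 0.807. The sketch below shows that check plus a small per-class F1 helper; the names `per_class_f1` and `macro_f1` and the toy labels are illustrative, not code from the paper.

```python
from typing import Dict, Sequence


def per_class_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str],
) -> Dict[str, float]:
    """One-vs-rest F1 for each label: 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(scores.values()) / len(scores)


# Per-class values reported above for Eva-4B.
reported = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}
print(round(macro_f1(reported), 3))  # 0.807

# Toy predictions (made up) showing a direct/intermediate confusion.
y_true = ["direct", "direct", "intermediate", "fully_evasive"]
y_pred = ["direct", "intermediate", "intermediate", "fully_evasive"]
toy_scores = per_class_f1(y_true, y_pred, list(reported))
print(toy_scores)
```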
+### Ablation (label-source comparison)
+The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:
+- **Qwen-Opus-Only:** 78.9% accuracy
+- **Eva-4B:** 81.3% accuracy (**+2.4 percentage points**)
+The paper reports the Opus-only baseline achieves lower training loss but worse generalization.
+## Quick Start
+The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
+````python
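+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Load the tokenizer and model from the Hugging Face Hub.
+model_name = "FutureMa/Eva-4B"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",  # use the dtype stored in the checkpoint config
+    device_map="auto",  # place weights on the available device(s); requires `accelerate`
+)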
+PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
+Question: {{question}}
+Answer: {{answer}}
+Response format:
+```json
+{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
+```
+Answer in ```json content, no other text"""
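+# Example Q&A pair.
+question = "What are your revenue expectations for next quarter?"
+answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."
+# Fill the template with plain string replacement so the JSON braces in the prompt stay untouched.
+prompt = (
+    PROMPT_TEMPLATE
+    .replace("{{question}}", question)
+    .replace("{{answer}}", answer)
+)
+# Wrap the prompt in the model's chat template and tokenize it.
+messages = [{"role": "user", "content": prompt}]
+text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# Sampling makes the output vary between runs; pass do_sample=False
+# (and drop temperature) for deterministic greedy decoding.
+with torch.no_grad():
+    output_ids = model.generate(
+        **inputs,
+        max_new_tokens=128,
+        temperature=0.7,
+        do_sample=True,
+    )
+# Keep only the newly generated tokens (drop the prompt) and decode them.
+generated = output_ids[0][inputs["input_ids"].shape[1]:]
+response = tokenizer.decode(generated, skip_special_tokens=True)
+print(response)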
+````
+Expected output format:
+```json
+{"reason": "...", "label": "direct|intermediate|fully_evasive"}
+```
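Because the model replies with JSON wrapped in a code fence, downstream code usually needs to extract the label. A minimal sketch, assuming the response contains exactly one JSON object; `parse_label` is a hypothetical helper and not part of this repo.

```python
import json
import re
from typing import Optional

VALID_LABELS = {"direct", "intermediate", "fully_evasive"}


def parse_label(response: str) -> Optional[str]:
    """Extract the predicted label from a model response, or None if it is malformed."""
    # Grab everything from the first "{" to the last "}", ignoring any code fence.
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    label = payload.get("label")
    # Reject missing, non-string, or out-of-taxonomy labels.
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return None


# Build a sample fenced response without writing a literal fence in this block.
fence = "`" * 3
sample = (
    f"{fence}json\n"
    '{"reason": "No revenue figure given", "label": "fully_evasive"}\n'
    f"{fence}"
)
print(parse_label(sample))  # fully_evasive
```

Returning `None` for malformed output lets callers decide whether to retry generation or count the sample as an error.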
+## Limitations
+- Domain-specific to earnings call Q&A
+- English-only evaluation
+- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
+- Judge position bias risk (no position randomization)
+- Potential self-preference concerns (Opus judging its own predictions)
+- Subjectivity in the intermediate class (lower agreement)
+- Temporal drift (training data spans 2005–2023)
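One way to read the ~2.2–2.3× annotation cost is as model calls per sample: two annotator calls on every sample plus one judge call on the 20–30% of samples where the annotators disagree. This is an interpretation consistent with the numbers above, not a derivation taken from the paper, and it assumes every call costs the same. `relative_cost` is a hypothetical helper.

```python
def relative_cost(disagreement_rate: float, annotators: int = 2) -> float:
    """Model calls per sample, relative to a single-model pass.

    Assumption: every annotator or judge call has equal cost, and the judge
    is called once per disagreement.
    """
    if not 0.0 <= disagreement_rate <= 1.0:
        raise ValueError("disagreement_rate must be between 0 and 1")
    return annotators + disagreement_rate


for rate in (0.20, 0.30):
    print(f"disagreement {rate:.0%}: {relative_cost(rate):.1f}x")
```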
+## Ethics
+Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
+## Citation
+If you use this model, please cite the accompanying paper:
+```bibtex
+@article{ma_evasionbench,
+  title={EvasionBench: Detecting Evasive Answers in Financial Q\&A via Multi-Model Consensus and LLM-as-Judge},
+  author={Ma, Shijian}
+}
+```
+## Author
+- Shijian Ma (mas8069@foxmail.com)
+---
+Last updated: 2026-01-12