---
language:
- en
license: apache-2.0
base_model:
- Qwen/Qwen3-4B-Instruct-2507
tags:
- finance
- earnings-calls
- financial-nlp
- text-classification
- qwen3
- llm-as-judge
- distillation
pipeline_tag: text-generation
library_name: transformers
spaces:
- FutureMa/financial-evasion-detection
---

# Eva-4B: Financial Evasion Detection Model

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)

Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**.

## 🚀 Try the Demo

You can test Eva-4B directly in your browser without installation:

**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)**

## Model Summary

- **Model name:** Eva-4B
- **Task:** 3-way classification of Q&A pairs into:
  - `direct`
  - `intermediate`
  - `fully_evasive`
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507`
- **Training method:** full-parameter fine-tuning
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class)

## Intended Use

Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.

## Task Definition

Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of:

- **direct:** answers the core question with specific information
- **intermediate:** provides related information but sidesteps the core question
- **fully_evasive:** does not address the question (refusal, redirection, non-response)

This taxonomy follows the Rasiah framework referenced in the paper.

## Dataset: EvasionBench (as reported in the paper)

### Sources

- Earnings call transcripts from the **S&P Capital IQ** database.

### Splits

- **Training:** 30,000 samples (balanced)
  - direct: 10,000
  - intermediate: 10,000
  - fully_evasive: 10,000
- **Test (Human):** 1,000 samples (natural distribution)
  - direct: 412 (41.2%)
  - intermediate: 256 (25.6%)
  - fully_evasive: 332 (33.2%)

### Labeling / Construction

The training set is constructed via a multi-model annotation framework:

- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash**
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5**
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)

### Human validation (test set)

- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: **Cohen's Kappa = 0.835**.

## Training Details

- **Base model:** Qwen3-4B-Instruct-2507
- **Fine-tuning:** full-parameter fine-tuning
- **Framework:** MS-Swift
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each)
- **Epochs:** 2
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio)
- **Batch size:** 8 per GPU
- **Gradient accumulation:** 2 (effective batch size 32)
- **Precision:** bfloat16
- **Max sequence length:** 2048
- **Optimizer:** AdamW
- **Gradient checkpointing:** enabled

## Performance

### Top-5 models on the 1,000-sample human test set

| Rank | Model | Accuracy | F1-Macro |
|---:|---|---:|---:|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** |
| 5 | GPT-5.2 | 80.5% | 0.805 |

Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).
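The evaluation script is not included in this card. Assuming the standard accuracy and macro-averaged F1 definitions, the metrics above can be reproduced with a minimal scikit-learn sketch; the `y_true`/`y_pred` lists below are hypothetical placeholders, not EvasionBench data:

```python
# Minimal sketch: computing the reported metrics with scikit-learn.
# y_true / y_pred are hypothetical placeholders, not EvasionBench data.
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["direct", "intermediate", "fully_evasive"]

y_true = ["direct", "intermediate", "fully_evasive", "direct"]  # gold labels
y_pred = ["direct", "direct", "fully_evasive", "direct"]        # model outputs

accuracy = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, labels=LABELS, average="macro")

print(f"Accuracy: {accuracy:.1%} | F1-Macro: {f1_macro:.3f}")
```

Passing `average=None` to `f1_score` instead returns one F1 value per class, which is how the per-class numbers in the next table are defined.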
### Per-class F1 (Eva-4B)

| Class | F1 |
|:------|---:|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |

The paper notes that most errors are confusions between the **direct** and **intermediate** classes.

### Ablation (label-source comparison)

The paper compares Eva-4B's training labels (multi-model consensus + judge) against an Opus-only label construction:

- **Qwen-Opus-Only:** 78.9% accuracy
- **Eva-4B:** 81.3% accuracy (**+2.4** percentage points)

The paper reports that the Opus-only baseline achieves lower training loss but worse generalization.

## Quick Start

The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Keep this in sync with prompts/evasion_rasiah_fine_tuning_minimalist.txt.
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A

Question: {{question}}

Answer: {{answer}}

Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```

Answer in json block content, no other text"""

question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."

# str.replace (rather than str.format) avoids clashing with the JSON braces
# inside the template.
prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )

# Decode only the newly generated tokens (drop the prompt).
generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````

Expected output format:

```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```

## Limitations

- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)

## Ethics

Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.

## Citation

If you use this model, please cite the accompanying paper:

```bibtex
@misc{ma2026evasionbenchdetectingevasiveanswers,
  title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge},
  author={Shijian Ma and Yan Lin and Yi Yang},
  year={2026},
  eprint={2601.09142},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.09142},
}
```

Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142)

## Author

- Shijian Ma (mas8069@foxmail.com)

---

Last updated: 2026-01-12