EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
A 4B parameter model fine-tuned for detecting evasive answers in earnings call Q&A sessions.
Eva-4B-V2 achieves 84.9% Macro-F1 on the EvasionBench evaluation set, outperforming frontier LLMs:
| Rank | Model | Macro-F1 |
|---|---|---|
| 1 | Eva-4B-V2 | 84.9% |
| 2 | Gemini 3 Flash | 84.6% |
| 3 | Claude Opus 4.5 | 84.4% |
| 4 | GLM-4.7 | 82.9% |
| 5 | GPT-5.2 | 80.9% |
Per-class results for Eva-4B-V2:

| Class | Precision | Recall | F1 |
|---|---|---|---|
| Direct | 90.6% | 75.1% | 82.1% |
| Intermediate | 73.7% | 87.7% | 80.1% |
| Fully Evasive | 93.3% | 91.6% | 92.4% |
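The reported 84.9% Macro-F1 is the unweighted mean of the three per-class F1 scores, which can be checked directly from the precision/recall figures above:

```python
# Verify the reported Macro-F1 from the per-class precision/recall values.
per_class = {
    "direct": (0.906, 0.751),
    "intermediate": (0.737, 0.877),
    "fully_evasive": (0.933, 0.916),
}

def f1(p, r):
    # F1 is the harmonic mean of precision and recall
    return 2 * p * r / (p + r)

f1_scores = {label: f1(p, r) for label, (p, r) in per_class.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(f"{macro_f1:.3f}")  # 0.849
```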
EvasionBench uses a three-way label taxonomy:

| Label | Definition |
|---|---|
| direct | The core question is directly and explicitly answered |
| intermediate | The response provides related context but sidesteps the specific core question |
| fully_evasive | The question is ignored, explicitly refused, or entirely off-topic |
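For downstream processing, the three labels can be represented as a small enum with a validation helper (the class and function names here are illustrative, not part of the release):

```python
from enum import Enum

class EvasionLabel(str, Enum):
    # The three-way taxonomy used by EvasionBench
    DIRECT = "direct"
    INTERMEDIATE = "intermediate"
    FULLY_EVASIVE = "fully_evasive"

def parse_label(raw: str) -> EvasionLabel:
    # Raises ValueError for anything outside the three-label taxonomy.
    return EvasionLabel(raw.strip())

print(parse_label("fully_evasive").value)  # fully_evasive
```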
Training pipeline:

```
Qwen3-4B-Instruct-2507
   │
   ▼  Stage 1: 60K consensus data
Eva-4B-Consensus
   │
   ▼  Stage 2: 24K three-judge data
Eva-4B-V2
```
| Parameter | Stage 1 | Stage 2 |
|---|---|---|
| Dataset | 60K consensus | 24K three-judge |
| Epochs | 2 | 2 |
| Learning Rate | 2e-5 | 2e-5 |
| Batch Size | 32 | 32 |
| Max Length | 2500 | 2048 |
| Precision | bfloat16 | bfloat16 |
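As a rough illustration, a Stage 1 run with these hyperparameters could be wired up with TRL's `SFTTrainer`. This is a sketch only: the actual training scripts are not part of this card, argument names vary across TRL versions, and the dataset split used here is an assumption.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hyperparameters mirror the Stage 1 column of the table above.
config = SFTConfig(
    output_dir="eva-4b-consensus",    # hypothetical output path
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=32,   # or smaller, with gradient accumulation
    max_seq_length=2500,
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",
    args=config,
    # Assumption: the consensus-labeled data lives in the released dataset;
    # the exact split name is not documented here.
    train_dataset=load_dataset("FutureMa/EvasionBench", split="train"),
)
trainer.train()
```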
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "FutureMa/Eva-4B-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
# Prompt template
prompt = """You are a financial analyst. Your task is to detect evasive answers in financial Q&A.
Question: What is the expected margin for Q4?
Answer: We expect it to be 32%.
Response format:
```json
{"label": "direct|intermediate|fully_evasive"}
```
Answer in ```json content, no other text"""
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding; a temperature setting has no effect when do_sample=False
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
# Output: ```json
# {"label": "direct"}
# ```
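Because the model is instructed to wrap its answer in a ```json fence, a small helper can extract and validate the label from the raw generation (a minimal sketch; error handling is deliberately sparse):

```python
import json
import re

VALID_LABELS = {"direct", "intermediate", "fully_evasive"}

def extract_label(text: str) -> str:
    # Pull the first JSON object out of the model output, fenced or bare.
    match = re.search(r"\{.*?\}", text, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in: {text!r}")
    label = json.loads(match.group(0))["label"]
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label}")
    return label

print(extract_label('```json\n{"label": "direct"}\n```'))  # direct
```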
from vllm import LLM, SamplingParams
llm = LLM(model="FutureMa/Eva-4B-V2")
sampling_params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
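For scoring many Q&A pairs at once, the prompt can be built programmatically and the list passed to `llm.generate` as a batch. The template below is a simplified variant of the one shown earlier (the fenced response-format instruction is elided here for brevity), and the example pairs are invented:

```python
# Build one prompt per (question, answer) pair for batched inference.
PROMPT_TEMPLATE = (
    "You are a financial analyst. Your task is to detect evasive answers in financial Q&A.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Respond with JSON only: {{"label": "direct|intermediate|fully_evasive"}}'
)

def build_prompts(pairs):
    # pairs: iterable of (question, answer) tuples
    return [PROMPT_TEMPLATE.format(question=q, answer=a) for q, a in pairs]

prompts = build_prompts([
    ("What is the expected margin for Q4?", "We expect it to be 32%."),
    ("Will you raise full-year guidance?", "Our team culture has never been stronger."),
])
print(len(prompts))  # 2
```

The resulting list can be passed directly as `llm.generate(prompts, sampling_params)`.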
| Resource | URL |
|---|---|
| Dataset | FutureMa/EvasionBench |
| GitHub | IIIIQIIII/EvasionBench |
| Project Page | https://iiiiqiiii.github.io/EvasionBench |
| Paper | arXiv:2601.09142 |
| Colab | Quick Start Notebook |
```bibtex
@misc{ma2026evasionbenchlargescalebenchmarkdetecting,
      title={EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A},
      author={Shijian Ma and Yan Lin and Yi Yang},
      year={2026},
      eprint={2601.09142},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.09142}
}
```
License: Apache 2.0
Base model: Qwen/Qwen3-4B-Instruct-2507