Eva-4B-V2


A 4B-parameter model fine-tuned to detect evasive answers in earnings call Q&A sessions.

Model Description

Performance

Eva-4B-V2 achieves 84.9% Macro-F1 on the EvasionBench evaluation set, outperforming frontier LLMs:

Top 5 Model Performance

| Rank | Model | Macro-F1 |
|------|-------|----------|
| 1 | Eva-4B-V2 | 84.9% |
| 2 | Gemini 3 Flash | 84.6% |
| 3 | Claude Opus 4.5 | 84.4% |
| 4 | GLM-4.7 | 82.9% |
| 5 | GPT-5.2 | 80.9% |

Per-Class Performance

| Class | Precision | Recall | F1 |
|-------|-----------|--------|-----|
| Direct | 90.6% | 75.1% | 82.1% |
| Intermediate | 73.7% | 87.7% | 80.1% |
| Fully Evasive | 93.3% | 91.6% | 92.4% |
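
The reported Macro-F1 is the unweighted mean of the three per-class F1 scores, which can be checked directly against the table above:

```python
# Per-class F1 scores from the table above (in %).
per_class_f1 = {
    "direct": 82.1,
    "intermediate": 80.1,
    "fully_evasive": 92.4,
}

# Macro-F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 1))  # 84.9, matching the reported Macro-F1
```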

Label Definitions

| Label | Definition |
|-------|------------|
| `direct` | The core question is directly and explicitly answered |
| `intermediate` | The response provides related context but sidesteps the specific core question |
| `fully_evasive` | The question is ignored, explicitly refused, or entirely off-topic |

Training

Two-Stage Training Pipeline

Qwen3-4B-Instruct-2507
        │
        ▼ Stage 1: 60K consensus data
        │
Eva-4B-Consensus
        │
        ▼ Stage 2: 24K three-judge data
        │
Eva-4B-V2

Training Configuration

| Parameter | Stage 1 | Stage 2 |
|-----------|---------|---------|
| Dataset | 60K consensus | 24K three-judge |
| Epochs | 2 | 2 |
| Learning Rate | 2e-5 | 2e-5 |
| Batch Size | 32 | 32 |
| Max Length | 2500 | 2048 |
| Precision | bfloat16 | bfloat16 |
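
The same hyperparameters can be written as plain config dictionaries (a sketch only; the actual training scripts are not part of this card). Note that the two stages share all optimization settings and differ only in dataset and maximum sequence length:

```python
# Hyperparameters from the table above, expressed as plain dicts
# (illustrative only, not actual training code).
common = {"epochs": 2, "learning_rate": 2e-5, "batch_size": 32, "precision": "bfloat16"}

stage1 = {**common, "dataset": "60K consensus", "max_length": 2500}
stage2 = {**common, "dataset": "24K three-judge", "max_length": 2048}

# Only the dataset and max length change between stages.
diff = sorted(k for k in stage1 if stage1[k] != stage2[k])
print(diff)  # ['dataset', 'max_length']
```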

Hardware

  • Stage 1: 2x NVIDIA B200 (180GB SXM6)
  • Stage 2: 4x NVIDIA H100 (80GB SXM5)

Usage

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B-V2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# Prompt template
prompt = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A

Question: What is the expected margin for Q4?
Answer: We expect it to be 32%.

Response format:
```json
{"label": "direct|intermediate|fully_evasive"}
```

Answer in ```json content, no other text"""

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding; temperature is ignored when do_sample=False

generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
# Output: ```json
# {"label": "direct"}
# ```

With vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

llm = LLM(model="FutureMa/Eva-4B-V2")
sampling_params = SamplingParams(temperature=0, max_tokens=64)

# Apply the chat template to `prompt` (the instruction string from the
# Transformers example above) before passing it to vLLM.
tokenizer = AutoTokenizer.from_pretrained("FutureMa/Eva-4B-V2")
text = tokenizer.apply_chat_template([{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True, enable_thinking=False)

outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)

Citation

@misc{ma2026evasionbenchlargescalebenchmarkdetecting,
  title={EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A},
  author={Shijian Ma and Yan Lin and Yi Yang},
  year={2026},
  eprint={2601.09142},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2601.09142}
}

License

Apache 2.0
