|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qwen/Qwen3-4B-Instruct-2507 |
|
|
tags: |
|
|
- finance |
|
|
- earnings-calls |
|
|
- financial-nlp |
|
|
- text-classification |
|
|
- qwen3 |
|
|
- llm-as-judge |
|
|
- distillation |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
spaces: |
|
|
- FutureMa/financial-evasion-detection |
|
|
--- |
|
|
|
|
|
# Eva-4B: Financial Evasion Detection Model |
|
|
|
|
|
[🤗 Try the demo on Hugging Face Spaces](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)
|
|
|
|
|
Eva-4B is a **4B-parameter** model for detecting **evasive answers** in **earnings call Q&A**. |
|
|
|
|
|
## 🚀 Try the Demo |
|
|
|
|
|
You can test Eva-4B directly in your browser without installation: |
|
|
|
|
|
**[👉 Click here to open the Interactive Demo](https://huggingface.co/spaces/FutureMa/financial-evasion-detection)** |
|
|
|
|
|
## Model Summary |
|
|
|
|
|
- **Model name:** Eva-4B |
|
|
- **Task:** 3-way classification of Q&A pairs into: |
|
|
- `direct` |
|
|
- `intermediate` |
|
|
- `fully_evasive` |
|
|
- **Base model:** `Qwen/Qwen3-4B-Instruct-2507` |
|
|
- **Training method:** full-parameter fine-tuning |
|
|
- **Training data:** EvasionBench training set (30,000 samples; 10,000 per class) |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A. |
|
|
|
|
|
## Task Definition |
|
|
|
|
|
Given an earnings call **Question** (analyst) and **Answer** (management), the model predicts one of: |
|
|
|
|
|
- **direct:** answers the core question with specific information |
|
|
- **intermediate:** provides related information but sidesteps the core question |
|
|
- **fully_evasive:** does not address the question (refusal, redirection, non-response) |
|
|
|
|
|
This taxonomy follows the Rasiah framework referenced in the paper. |
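
To make the boundaries concrete, here are hypothetical mini-examples of each label (written for this card as illustrations; they are not drawn from EvasionBench):

```python
# Hypothetical mini-examples of the three labels (illustrative only).
EXAMPLES = {
    "direct": {
        "question": "What was gross margin this quarter?",
        "answer": "Gross margin was 42.3%, up 80 basis points year over year.",
    },
    "intermediate": {
        "question": "What was gross margin this quarter?",
        "answer": "Our cost program is helping margins, and we like the trend.",
    },
    "fully_evasive": {
        "question": "What was gross margin this quarter?",
        "answer": "We don't break that out; let's take the next question.",
    },
}
```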
|
|
|
|
|
## Dataset: EvasionBench (as reported in the paper) |
|
|
|
|
|
### Sources |
|
|
|
|
|
- Earnings call transcripts from the **S&P Capital IQ** database. |
|
|
|
|
|
### Splits |
|
|
|
|
|
- **Training:** 30,000 samples (balanced) |
|
|
- direct: 10,000 |
|
|
- intermediate: 10,000 |
|
|
- fully_evasive: 10,000 |
|
|
- **Test (Human):** 1,000 samples (natural distribution) |
|
|
- direct: 412 (41.2%) |
|
|
- intermediate: 256 (25.6%) |
|
|
- fully_evasive: 332 (33.2%) |
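
As a quick sanity check, the natural-distribution percentages follow directly from the counts above:

```python
# Human test-set class counts as reported above.
test_counts = {"direct": 412, "intermediate": 256, "fully_evasive": 332}

total = sum(test_counts.values())  # 1000
for label, n in test_counts.items():
    print(f"{label}: {n} ({n / total:.1%})")  # 41.2%, 25.6%, 33.2%
```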
|
|
|
|
|
### Labeling / Construction |
|
|
|
|
|
The training set is constructed via a multi-model annotation framework: |
|
|
|
|
|
- Two annotators: **Claude Opus 4.5** and **Gemini-3-Flash** |
|
|
- Agreement cases (~70–80%) are accepted as high-confidence labels
|
|
- Disagreement cases (~20–30%) are resolved by an **LLM-as-Judge** protocol using **Claude Opus 4.5** |
|
|
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%) |
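
A minimal sketch of this consensus-plus-judge protocol is shown below. The `annotate_opus`, `annotate_gemini`, and `judge_opus` functions are hypothetical placeholders for API calls to the respective models; the paper's actual prompts and tooling are not reproduced here:

```python
# Sketch of the multi-model labeling protocol. The three helpers below are
# hypothetical placeholders for Claude Opus 4.5 / Gemini-3-Flash API calls.

def annotate_opus(question: str, answer: str) -> str:
    raise NotImplementedError("placeholder for a Claude Opus 4.5 call")

def annotate_gemini(question: str, answer: str) -> str:
    raise NotImplementedError("placeholder for a Gemini-3-Flash call")

def judge_opus(question: str, answer: str, candidates: list[str]) -> str:
    raise NotImplementedError("placeholder for the LLM-as-Judge call")

def label_sample(question: str, answer: str) -> tuple[str, str]:
    """Return (label, provenance) for one Q&A pair."""
    label_a = annotate_opus(question, answer)
    label_b = annotate_gemini(question, answer)

    if label_a == label_b:
        # ~70-80% of pairs: annotators agree -> high-confidence consensus.
        return label_a, "consensus"

    # ~20-30% of pairs: disagreement -> judge picks between the candidates.
    return judge_opus(question, answer, [label_a, label_b]), "judge"
```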
|
|
|
|
|
### Human validation (test set) |
|
|
|
|
|
- A 100-sample subset of the test set is independently annotated by two human experts.
|
|
- Reported inter-annotator agreement: **Cohen’s Kappa = 0.835**. |
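
For reference, Cohen's Kappa between two annotators can be computed with scikit-learn (a generic sketch on toy labels, not the paper's annotation data):

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-in labels for two annotators (illustrative only).
annotator_1 = ["direct", "intermediate", "direct", "fully_evasive", "direct"]
annotator_2 = ["direct", "direct", "direct", "fully_evasive", "direct"]

print(cohen_kappa_score(annotator_1, annotator_2))  # agreement beyond chance
```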
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base model:** Qwen3-4B-Instruct-2507 |
|
|
- **Fine-tuning:** full-parameter fine-tuning |
|
|
- **Framework:** MS-Swift |
|
|
- **Hardware:** 2× NVIDIA B200 SXM6 (180GB VRAM each) |
|
|
- **Epochs:** 2 |
|
|
- **Learning rate:** 2e-5 (linear warmup; 3% warmup ratio) |
|
|
- **Batch size:** 8 per GPU |
|
|
- **Gradient accumulation:** 2 (effective batch size 32) |
|
|
- **Precision:** bfloat16 |
|
|
- **Max sequence length:** 2048 |
|
|
- **Optimizer:** AdamW |
|
|
- **Gradient checkpointing:** enabled |
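
For readers who prefer plain `transformers` over MS-Swift, the reported hyperparameters map roughly onto `TrainingArguments` as below (an approximate sketch, not the paper's actual launch configuration; the 2048-token limit is applied at tokenization time):

```python
from transformers import TrainingArguments

# Approximate transformers-native equivalent of the reported setup.
args = TrainingArguments(
    output_dir="eva-4b-sft",        # hypothetical output path
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    per_device_train_batch_size=8,  # x 2 GPUs x 2 accumulation = 32 effective
    gradient_accumulation_steps=2,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",            # AdamW
)
```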
|
|
|
|
|
## Performance |
|
|
|
|
|
### Top-5 models on the 1,000-sample human test set |
|
|
|
|
|
| Rank | Model | Accuracy | F1-Macro | |
|
|
|---:|---|---:|---:| |
|
|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 | |
|
|
| 2 | Gemini-3-Flash | 83.7% | 0.833 | |
|
|
| 3 | GLM-4.7 | 82.6% | 0.809 | |
|
|
| 4 | **Eva-4B (Ours)** | **81.3%** | **0.807** | |
|
|
| 5 | GPT-5.2 | 80.5% | 0.805 | |
|
|
|
|
|
Note: by accuracy, Eva-4B ranks **2nd among open-source models**, behind GLM-4.7 (82.6%).
|
|
|
|
|
### Per-class F1 (Eva-4B) |
|
|
|
|
|
| Class | F1 | |
|
|
|---|---:|
|
|
| direct | 0.851 | |
|
|
| intermediate | 0.698 | |
|
|
| fully_evasive | 0.873 | |
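
These per-class scores are consistent with the leaderboard's macro-F1, which is simply their unweighted mean:

```python
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.807, matching the table above
```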
|
|
|
|
|
The paper notes that most errors stem from confusion between **direct** and **intermediate**.
|
|
|
|
|
### Ablation (label-source comparison) |
|
|
|
|
|
The paper compares Eva-4B's training labels (multi-model consensus + judge) against an Opus-only label construction:
|
|
|
|
|
- **Qwen-Opus-Only:** 78.9% accuracy |
|
|
- **Eva-4B:** 81.3% accuracy (**+2.4** percentage points)
|
|
|
|
|
The paper reports the Opus-only baseline achieves lower training loss but worse generalization. |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo. |
|
|
|
|
|
````python |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
model_name = "FutureMa/Eva-4B" |
|
|
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
|
|
|
|
|
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A |
|
|
|
|
|
Question: {{question}} |
|
|
Answer: {{answer}} |
|
|
|
|
|
Response format: |
|
|
```json |
|
|
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"} |
|
|
``` |
|
|
|
|
|
Answer in json block content, no other text""" |
|
|
|
|
|
question = "What are your revenue expectations for next quarter?" |
|
|
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities." |
|
|
|
|
|
prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)
|
|
|
|
|
messages = [{"role": "user", "content": prompt}] |
|
|
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
inputs = tokenizer([text], return_tensors="pt").to(model.device) |
|
|
|
|
|
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )
|
|
|
|
|
generated = output_ids[0][inputs["input_ids"].shape[1]:] |
|
|
response = tokenizer.decode(generated, skip_special_tokens=True) |
|
|
print(response) |
|
|
```` |
|
|
|
|
|
Expected output format: |
|
|
|
|
|
```json |
|
|
{"reason": "...", "label": "direct|intermediate|fully_evasive"} |
|
|
``` |
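
Because the model emits its JSON inside a fenced block, a small helper makes downstream use easier (a sketch; harden it for production inputs):

````python
import json
import re

def parse_prediction(response: str) -> dict:
    """Pull the JSON object out of a (possibly fenced) model response."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        raise ValueError(f"no JSON object found in: {response!r}")
    return json.loads(match.group(0))

pred = parse_prediction(
    '```json\n{"reason": "no figures given", "label": "fully_evasive"}\n```'
)
print(pred["label"])  # fully_evasive
````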
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Domain-specific to earnings call Q&A |
|
|
- English-only evaluation |
|
|
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model) |
|
|
- Judge position bias risk (no position randomization) |
|
|
- Potential self-preference concerns (Opus judging its own predictions) |
|
|
- Subjectivity in the intermediate class (lower agreement) |
|
|
- Temporal drift (training data spans 2005–2023) |
|
|
|
|
|
## Ethics |
|
|
|
|
|
Eva-4B is a research artifact and **not financial advice**. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite the accompanying paper: |
|
|
|
|
|
```bibtex |
|
|
@misc{ma2026evasionbenchdetectingevasiveanswers, |
|
|
title={EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge}, |
|
|
author={Shijian Ma and Yan Lin and Yi Yang}, |
|
|
year={2026}, |
|
|
eprint={2601.09142}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2601.09142}, |
|
|
} |
|
|
``` |
|
|
|
|
|
Paper: [https://arxiv.org/abs/2601.09142](https://arxiv.org/abs/2601.09142) |
|
|
|
|
|
## Author |
|
|
|
|
|
- Shijian Ma (mas8069@foxmail.com) |
|
|
|
|
|
--- |
|
|
|
|
|
Last updated: 2026-01-12 |