Qwen3-4B-Evasion-Exp
Fine-tuned Qwen3-4B model for earnings call evasion detection
This model is a full-parameter fine-tune of Qwen3-4B, specialized in classifying management responses to analyst questions during earnings calls using the Rasiah three-class evasion taxonomy.
Model Description
- Base Model: FutureMa/Qwen3-4B-Evasion
- Model Type: Causal Language Model (Fine-tuned)
- Training Method: Full parameter fine-tuning with MS-Swift
- Task: Three-class evasion classification
- Domain: Financial discourse analysis (earnings calls)
Training Details
Training Data
- Dataset: 5,010 labeled earnings call Q&A pairs (Claude Opus 4.5 annotations)
- Source: Balanced sampling from 154K earnings call dataset
- Label Distribution:
  - direct: 1,670 (33.3%)
  - intermediate: 1,670 (33.3%)
  - fully_evasive: 1,670 (33.3%)
- Data Quality: 100% Claude Opus 4.5 labels, avg confidence 0.7974
Training Configuration
- Framework: MS-Swift
- Hardware: 4x H100 GPUs
- Training Time: ~2 hours
- Checkpoint: checkpoint-314
- Learning Rate: 2e-5
- Batch Size: 4 per device
- Epochs: 3
- Max Length: 4096 tokens
Prompt Template
The model uses the Rasiah evasion classification prompt template from:
prompts/evasion_rasiah_only_prompt_template_gpt.txt
Question: [Analyst question]
Answer: [Management response]
Expected Output:
{"rasiah":"direct|intermediate|fully_evasive","confidence":0.85}
Performance
Evaluation on 297-Sample Benchmark (vs. Claude Opus Ground Truth)
| Metric | Score |
|---|---|
| Overall Accuracy | 77.44% |
| Macro F1 | 78.13% |
| Macro Precision | 80.10% |
| Macro Recall | 77.26% |
Per-Class Metrics
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| direct | 80.46% | 73.68% | 76.92% | 95 |
| intermediate | 66.42% | 80.91% | 72.95% | 110 |
| fully_evasive | 93.42% | 77.17% | 84.52% | 92 |
Confusion Matrix
| | Predicted: direct | Predicted: intermediate | Predicted: fully_evasive |
|---|---|---|---|
| True: direct | 70 | 25 | 0 |
| True: intermediate | 16 | 89 | 5 |
| True: fully_evasive | 1 | 20 | 71 |
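The per-class metrics above can be recomputed from this confusion matrix as a sanity check; a minimal, dependency-free sketch:

```python
# Confusion matrix from the table above: rows = true label, cols = predicted
matrix = {
    "direct":        {"direct": 70, "intermediate": 25, "fully_evasive": 0},
    "intermediate":  {"direct": 16, "intermediate": 89, "fully_evasive": 5},
    "fully_evasive": {"direct": 1,  "intermediate": 20, "fully_evasive": 71},
}
labels = list(matrix)
for c in labels:
    tp = matrix[c][c]
    fn = sum(matrix[c][p] for p in labels) - tp   # true c, predicted otherwise
    fp = sum(matrix[t][c] for t in labels) - tp   # predicted c, actually otherwise
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{c}: P={precision:.2%} R={recall:.2%} F1={f1:.2%}")
```

Running this reproduces the per-class table (e.g. direct: P=80.46%, R=73.68%, F1=76.92%).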
Key Findings
- ✅ Best Performance: fully_evasive (F1: 84.52%, Precision: 93.42%)
- ⚠️ Challenge: intermediate class has lower precision (66.42%)
- 📊 Main Confusion: direct ↔ intermediate (41 samples)
Usage
MS-Swift Inference
# Install MS-Swift
pip install ms-swift
# Inference
swift infer \
--ckpt_dir FutureMa/Qwen3-4B-Evasion-Exp \
--eval_human false \
--max_new_tokens 512
Python API (MS-Swift)
from swift.llm import PtEngine
# Load model
engine = PtEngine(
model_id_or_path="FutureMa/Qwen3-4B-Evasion-Exp",
max_batch_size=1,
torch_dtype="bfloat16"
)
# Build prompt
question = "Can you provide more details on the revenue guidance?"
answer = "We'll share more specifics in next quarter's call."
prompt = f"""Question: {question}
Answer: {answer}
"""
# Inference (MS-Swift 3.x style: wrap the prompt in an InferRequest)
from swift.llm import InferRequest, RequestConfig
resp = engine.infer([InferRequest(messages=[{"role": "user", "content": prompt}])],
                    RequestConfig(max_tokens=128))
print(resp[0].choices[0].message.content)
# Expected: {"rasiah":"fully_evasive","confidence":0.85}
Transformers API
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"FutureMa/Qwen3-4B-Evasion-Exp",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"FutureMa/Qwen3-4B-Evasion-Exp",
trust_remote_code=True
)
# Build prompt with full template
prompt = """You are a financial discourse analyst. Your task is to label how an executive responds to an analyst's question during an earnings-call Q&A.
Return exactly one JSON object in a Markdown code block using the `json` tag:
```json
{"rasiah":"direct|intermediate|fully_evasive","confidence":0.00}
```
========================
NOW LABEL THIS PAIR
========================
Question: What are your margin expectations for Q4?
Answer: We're focused on maintaining operational efficiency across all segments.
"""
# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate (greedy decoding keeps the JSON label deterministic)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False
)
# Decode only the newly generated tokens (skip the echoed prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(response)
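Because the prompt template asks for a Markdown-fenced JSON object, downstream code typically needs to extract it from the raw reply. A minimal sketch; the helper name and sample reply are illustrative:

```python
import json
import re

def parse_reply(text: str) -> dict:
    """Extract the JSON object from a ```json fenced block, if present."""
    m = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    payload = m.group(1) if m else text  # fall back to treating text as raw JSON
    return json.loads(payload)

reply = 'Here is the label:\n```json\n{"rasiah":"intermediate","confidence":0.62}\n```'
print(parse_reply(reply))
```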
Evasion Taxonomy (Rasiah Classification)
Label Definitions
direct
- Clear, on-topic resolution to the primary information request
- Provides concrete facts, numbers, or definitive qualitative answers
- Examples: "Yes", "25% margin", "We expect $100M in Q4"
intermediate
- On-topic but incomplete, softened, or only partially responsive
- Provides directional information but avoids specifics
- Examples: "We're optimistic about margins", "Revenue will be strong"
fully_evasive
- Does not provide the requested information at all
- Explicit refusal, deferral, or topic shift
- Examples: "We don't disclose that", "We'll update later", "No comment"
Confidence Calibration
- 0.90-1.00: Very clear match; minimal ambiguity
- 0.70-0.89: Mostly clear; minor edge cases
- 0.50-0.69: Borderline between two labels
- < 0.50: High uncertainty
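These bands can be applied programmatically when triaging predictions; a minimal sketch in which the band names are shorthand for the descriptions above, not an official API:

```python
def confidence_band(score: float) -> str:
    """Map a reported confidence score to the calibration bands above."""
    if score >= 0.90:
        return "very clear"
    if score >= 0.70:
        return "mostly clear"
    if score >= 0.50:
        return "borderline"
    return "high uncertainty"

print(confidence_band(0.85))  # → mostly clear
```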
Intended Use
Primary Use Cases
✅ Financial Discourse Analysis
- Detect evasion patterns in earnings calls
- Assess corporate transparency and disclosure quality
- Analyze Q&A dynamics between analysts and management
✅ Research Applications
- Study information asymmetry in financial markets
- Quantify management communication strategies
- Corporate governance research
✅ Model Development
- Training data generation for evasion detection systems
- Baseline comparison for new evasion detection models
- Fine-tuning starting point for related tasks
Out-of-Scope Use
❌ Not Recommended:
- Legal decision-making without human review
- Real-time trading signals (requires validation)
- Non-earnings call financial discourse (may not generalize)
- General Q&A classification outside finance domain
Limitations
Model Limitations
Domain Specificity
- Trained exclusively on earnings call Q&A
- May not generalize to other financial contexts (analyst reports, investor letters, etc.)
- Performance on non-English earnings calls unknown
Class Imbalance Handling
- Training data is perfectly balanced (33.3% each class)
- Real-world distribution is skewed (intermediate: 50.6%, direct: 31.1%, fully_evasive: 18.3%)
- May over-predict minority classes in production
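One common mitigation, not part of the released model, is prior correction: rescale each class score by the ratio of the deployment prior to the (uniform) training prior, then renormalize. A sketch using the distributions stated above:

```python
# Training was balanced; deployment priors taken from the figures above
TRAIN_PRIOR = {"direct": 1/3, "intermediate": 1/3, "fully_evasive": 1/3}
DEPLOY_PRIOR = {"direct": 0.311, "intermediate": 0.506, "fully_evasive": 0.183}

def prior_corrected(scores: dict) -> dict:
    """Reweight per-class scores by deployment/training prior, then renormalize."""
    adjusted = {c: scores[c] * DEPLOY_PRIOR[c] / TRAIN_PRIOR[c] for c in scores}
    z = sum(adjusted.values())
    return {c: v / z for c, v in adjusted.items()}

scores = {"direct": 0.30, "intermediate": 0.40, "fully_evasive": 0.30}
print(prior_corrected(scores))
```

As expected, the correction boosts the majority `intermediate` class and suppresses the rarer `fully_evasive` class.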
Intermediate Class Ambiguity
- Lower precision on intermediate (66.42%) due to inherent ambiguity
- Boundary cases between direct ↔ intermediate are challenging
- 41 samples (14%) confused between these classes
Context Dependency
- Requires full Q&A pair (question + answer)
- Performance degrades with truncated or incomplete text
- Sensitive to prompt formatting
Data Limitations
Annotation Source
- Single annotator (Claude Opus 4.5)
- No human validation on training data
- Potential model-specific biases
Temporal Coverage
- Source data from specific time period
- May not reflect recent market dynamics or communication trends
Sample Size
- 5,010 training samples (moderate scale)
- Benchmark evaluation on only 297 samples
- Limited coverage of edge cases
Bias and Fairness
Known Biases
- Label Distribution Bias: Training on balanced data (33.3% each) vs real-world skewed distribution may cause miscalibration
- Model Bias: Inherits Claude Opus 4.5's classification preferences (more liberal with direct labels)
- Domain Bias: Earnings calls are formal, scripted interactions; may not apply to informal financial discourse
Fairness Considerations
- Model treats all companies equally (no company-specific features)
- No demographic or geographic features used
- Evaluates content only, not speaker identity
Comparison with Other Models
vs DeepSeek-V3.2 (Original Annotator)
For the same 5,010 samples:
| Label | DeepSeek-V3.2 | Qwen3-4B-Evasion-Exp | Delta |
|---|---|---|---|
| direct | 26.6% | 33.3% (training dist.) | +6.7% |
| intermediate | 35.0% | 33.3% (training dist.) | -1.7% |
| fully_evasive | 38.4% | 33.3% (training dist.) | -5.1% |
Key Difference: DeepSeek-V3.2 is more conservative (higher fully_evasive), while this model follows balanced training distribution.
Benchmark Performance (297 samples)
| Model | Accuracy | Macro F1 | Direct F1 | Intermediate F1 | Fully Evasive F1 |
|---|---|---|---|---|---|
| Qwen3-4B-Evasion-Exp | 77.44% | 78.13% | 76.92% | 72.95% | 84.52% |
| Baseline (placeholder) | - | - | - | - | - |
Training Datasets
Primary Training Data
- Dataset: evasion_opus45_5010_msswift.jsonl
- Samples: 5,010
- Format: MS-Swift JSONL format
- Annotation: Claude Opus 4.5 (100% coverage)
- Sampling: Balanced (1,670 per class)
Data Format Example
{
"conversations": [
{
"role": "user",
"content": "[Full Rasiah prompt template + Q&A pair]"
},
{
"role": "assistant",
"content": "```json\n{\"rasiah\":\"direct\",\"confidence\":0.85}\n```"
}
],
"id": "qa_12943_3224438_1"
}
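A record in this format can be assembled as follows; a minimal sketch in which `make_record` is a hypothetical helper and the user content stands in for the full Rasiah prompt template:

```python
import json

def make_record(qa_id, question, answer, label, confidence):
    """Build one MS-Swift 'conversations' training record for a labeled Q&A pair."""
    user = f"Question: {question}\nAnswer: {answer}"
    target = f'```json\n{{"rasiah":"{label}","confidence":{confidence}}}\n```'
    return {
        "conversations": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": target},
        ],
        "id": qa_id,
    }

rec = make_record("qa_demo_1", "What is Q4 margin guidance?",
                  "We do not disclose that.", "fully_evasive", 0.9)
print(json.dumps(rec))  # one JSONL line
```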
Model Card Authors
- FutureMa (HuggingFace: @FutureMa)
Citation
If you use this model in your research, please cite:
@misc{qwen3_4b_evasion_exp,
author = {FutureMa},
title = {Qwen3-4B-Evasion-Exp: Fine-tuned Model for Earnings Call Evasion Detection},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/FutureMa/Qwen3-4B-Evasion-Exp}
}
Related Work
Evasion Detection Research:
- Rasiah et al. (2018) - "Detecting Evasive Answers in Financial Discourse"
- Hollander et al. (2010) - "Does Silence Speak? An Empirical Analysis of Disclosure Choices During Conference Calls"
Base Model:
@article{qwen3,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025}
}
License
This model is licensed under Apache License 2.0 (inherited from Qwen3 base model).
- Commercial Use: Allowed with attribution
- Modification: Allowed
- Distribution: Allowed
- Patent Use: Allowed
See LICENSE for full terms.
Acknowledgments
- Base Model: Qwen Team for Qwen3-4B
- Training Framework: MS-Swift team
- Annotation: Anthropic Claude Opus 4.5
- Data Source: Capital IQ earnings call transcripts
- Methodology: Rasiah et al. evasion taxonomy
Contact
For questions, issues, or collaborations:
- HuggingFace: @FutureMa
- Issues: GitHub Issues (if applicable)
Model Version: 1.0
Release Date: 2026-01-02
Last Updated: 2026-01-02
Status: ✅ Production Ready