Qwen3-4B-Evasion-Exp
Fine-tuned Qwen3-4B model for earnings call evasion detection
This model is a full-parameter fine-tune of Qwen3-4B, specialized in classifying management responses to analyst questions during earnings calls using the Rasiah three-class evasion taxonomy.
Model Description
- Base Model: FutureMa/Qwen3-4B-Evasion
- Model Type: Causal Language Model (Fine-tuned)
- Training Method: Full parameter fine-tuning with MS-Swift
- Task: Three-class evasion classification
- Domain: Financial discourse analysis (earnings calls)
Training Details
Training Data
- Dataset: 5,010 labeled earnings call Q&A pairs (Claude Opus 4.5 annotations)
- Source: Balanced sampling from 154K earnings call dataset
- Label Distribution:
  - direct: 1,670 (33.3%)
  - intermediate: 1,670 (33.3%)
  - fully_evasive: 1,670 (33.3%)
- Data Quality: 100% Claude Opus 4.5 labels, avg confidence 0.7974
Training Configuration
- Framework: MS-Swift
- Hardware: 4x H100 GPUs
- Training Time: ~2 hours
- Checkpoint: checkpoint-314
- Learning Rate: 2e-5
- Batch Size: 4 per device
- Epochs: 3
- Max Length: 4096 tokens
Prompt Template
The model uses the Rasiah evasion classification prompt template from:
prompts/evasion_rasiah_only_prompt_template_gpt.txt
Question: [Analyst question]
Answer: [Management response]
Expected Output:
{"rasiah":"direct|intermediate|fully_evasive","confidence":0.85}
Performance
Evaluation on 297-Sample Benchmark (vs. Claude Opus Ground Truth)
| Metric | Score |
|---|---|
| Overall Accuracy | 77.44% |
| Macro F1 | 78.13% |
| Macro Precision | 80.10% |
| Macro Recall | 77.26% |
Per-Class Metrics
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| direct | 80.46% | 73.68% | 76.92% | 95 |
| intermediate | 66.42% | 80.91% | 72.95% | 110 |
| fully_evasive | 93.42% | 77.17% | 84.52% | 92 |
Confusion Matrix
| | Predicted: direct | Predicted: intermediate | Predicted: fully_evasive |
|---|---|---|---|
| True: direct | 70 | 25 | 0 |
| True: intermediate | 16 | 89 | 5 |
| True: fully_evasive | 1 | 20 | 71 |
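The per-class metrics above can be recomputed from this confusion matrix as a sanity check; a minimal, dependency-free sketch:

```python
# Confusion matrix from the table above: rows = true label, cols = predicted
matrix = {
    "direct":        {"direct": 70, "intermediate": 25, "fully_evasive": 0},
    "intermediate":  {"direct": 16, "intermediate": 89, "fully_evasive": 5},
    "fully_evasive": {"direct": 1,  "intermediate": 20, "fully_evasive": 71},
}
labels = list(matrix)
for c in labels:
    tp = matrix[c][c]
    fn = sum(matrix[c][p] for p in labels) - tp   # true c, predicted otherwise
    fp = sum(matrix[t][c] for t in labels) - tp   # predicted c, actually otherwise
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{c}: P={precision:.2%} R={recall:.2%} F1={f1:.2%}")
```

Running this reproduces the per-class table (e.g. direct: P=80.46%, R=73.68%, F1=76.92%).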
Key Findings
- ✅ Best Performance: fully_evasive (F1: 84.52%, Precision: 93.42%)
- ⚠️ Challenge: intermediate class has lower precision (66.42%)
- 📊 Main Confusion: direct ↔ intermediate (41 samples)
Usage
MS-Swift Inference
# Install MS-Swift
pip install ms-swift
# Inference
swift infer \
--ckpt_dir FutureMa/Qwen3-4B-Evasion-Exp \
--eval_human false \
--max_new_tokens 512
Python API (MS-Swift)
from swift.llm import PtEngine
# Load model
engine = PtEngine(
model_id_or_path="FutureMa/Qwen3-4B-Evasion-Exp",
max_batch_size=1,
torch_dtype="bfloat16"
)
# Build prompt
question = "Can you provide more details on the revenue guidance?"
answer = "We'll share more specifics in next quarter's call."
prompt = f"""Question: {question}
Answer: {answer}
"""
# Inference (MS-Swift 3.x style: wrap the prompt in an InferRequest)
from swift.llm import InferRequest, RequestConfig
resp = engine.infer([InferRequest(messages=[{"role": "user", "content": prompt}])],
                    RequestConfig(max_tokens=128))
print(resp[0].choices[0].message.content)
# Expected: {"rasiah":"fully_evasive","confidence":0.85}
Transformers API
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"FutureMa/Qwen3-4B-Evasion-Exp",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"FutureMa/Qwen3-4B-Evasion-Exp",
trust_remote_code=True
)
# Build prompt with full template
prompt = """You are a financial discourse analyst. Your task is to label how an executive responds to an analyst's question during an earnings-call Q&A.
Return exactly one JSON object in a Markdown code block using the `json` tag:
```json
{"rasiah":"direct|intermediate|fully_evasive","confidence":0.00}
```
========================
NOW LABEL THIS PAIR
========================
Question: What are your margin expectations for Q4?
Answer: We're focused on maintaining operational efficiency across all segments.
"""
# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Generate (greedy decoding keeps the JSON label deterministic)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False
)
# Decode only the newly generated tokens (skip the echoed prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(response)
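Because the prompt template asks for a Markdown-fenced JSON object, downstream code typically needs to extract it from the raw reply. A minimal sketch; the helper name and sample reply are illustrative:

```python
import json
import re

def parse_reply(text: str) -> dict:
    """Extract the JSON object from a ```json fenced block, if present."""
    m = re.search(r"```json\s*(\{.*?\})\s*```", text, re.DOTALL)
    payload = m.group(1) if m else text  # fall back to treating text as raw JSON
    return json.loads(payload)

reply = 'Here is the label:\n```json\n{"rasiah":"intermediate","confidence":0.62}\n```'
print(parse_reply(reply))
```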
Evasion Taxonomy (Rasiah Classification)
Label Definitions
direct
- Clear, on-topic resolution to the primary information request
- Provides concrete facts, numbers, or definitive qualitative answers
- Examples: "Yes", "25% margin", "We expect $100M in Q4"
intermediate
- On-topic but incomplete, softened, or only partially responsive
- Provides directional information but avoids specifics
- Examples: "We're optimistic about margins", "Revenue will be strong"
fully_evasive
- Does not provide the requested information at all
- Explicit refusal, deferral, or topic shift
- Examples: "We don't disclose that", "We'll update later", "No comment"
Confidence Calibration
- 0.90-1.00: Very clear match; minimal ambiguity
- 0.70-0.89: Mostly clear; minor edge cases
- 0.50-0.69: Borderline between two labels
- < 0.50: High uncertainty
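These bands can be applied programmatically when triaging predictions; a minimal sketch in which the band names are shorthand for the descriptions above, not an official API:

```python
def confidence_band(score: float) -> str:
    """Map a reported confidence score to the calibration bands above."""
    if score >= 0.90:
        return "very clear"
    if score >= 0.70:
        return "mostly clear"
    if score >= 0.50:
        return "borderline"
    return "high uncertainty"

print(confidence_band(0.85))  # → mostly clear
```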
Intended Use
Primary Use Cases
✅ Financial Discourse Analysis
- Detect evasion patterns in earnings calls
- Assess corporate transparency and disclosure quality
- Analyze Q&A dynamics between analysts and management
✅ Research Applications
- Study information asymmetry in financial markets
- Quantify management communication strategies
- Corporate governance research
✅ Model Development
- Training data generation for evasion detection systems
- Baseline comparison for new evasion detection models
- Fine-tuning starting point for related tasks
Out-of-Scope Use
❌ Not Recommended:
- Legal decision-making without human review
- Real-time trading signals (requires validation)
- Non-earnings call financial discourse (may not generalize)
- General Q&A classification outside finance domain
Limitations
Model Limitations
Domain Specificity
- Trained exclusively on earnings call Q&A
- May not generalize to other financial contexts (analyst reports, investor letters, etc.)
- Performance on non-English earnings calls unknown
Class Imbalance Handling
- Training data is perfectly balanced (33.3% each class)
- Real-world distribution is skewed (intermediate: 50.6%, direct: 31.1%, fully_evasive: 18.3%)
- May over-predict minority classes in production
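One common mitigation, not part of the released model, is prior correction: rescale each class score by the ratio of the deployment prior to the (uniform) training prior, then renormalize. A sketch using the distributions stated above:

```python
# Training was balanced; deployment priors taken from the figures above
TRAIN_PRIOR = {"direct": 1/3, "intermediate": 1/3, "fully_evasive": 1/3}
DEPLOY_PRIOR = {"direct": 0.311, "intermediate": 0.506, "fully_evasive": 0.183}

def prior_corrected(scores: dict) -> dict:
    """Reweight per-class scores by deployment/training prior, then renormalize."""
    adjusted = {c: scores[c] * DEPLOY_PRIOR[c] / TRAIN_PRIOR[c] for c in scores}
    z = sum(adjusted.values())
    return {c: v / z for c, v in adjusted.items()}

scores = {"direct": 0.30, "intermediate": 0.40, "fully_evasive": 0.30}
print(prior_corrected(scores))
```

As expected, the correction boosts the majority `intermediate` class and suppresses the rarer `fully_evasive` class.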
Intermediate Class Ambiguity
- Lower precision on intermediate (66.42%) due to inherent ambiguity
- Boundary cases between direct ↔ intermediate are challenging
- 41 samples (14%) confused between these classes
Context Dependency
- Requires full Q&A pair (question + answer)
- Performance degrades with truncated or incomplete text
- Sensitive to prompt formatting
Data Limitations
Annotation Source
- Single annotator (Claude Opus 4.5)
- No human validation on training data
- Potential model-specific biases
Temporal Coverage
- Source data from specific time period
- May not reflect recent market dynamics or communication trends
Sample Size
- 5,010 training samples (moderate scale)
- Benchmark evaluation on only 297 samples
- Limited coverage of edge cases
Bias and Fairness
Known Biases
- Label Distribution Bias: Training on balanced data (33.3% each) vs real-world skewed distribution may cause miscalibration
- Model Bias: Inherits Claude Opus 4.5's classification preferences (more liberal with direct labels)
- Domain Bias: Earnings calls are formal, scripted interactions; may not apply to informal financial discourse
Fairness Considerations
- Model treats all companies equally (no company-specific features)
- No demographic or geographic features used
- Evaluates content only, not speaker identity
Comparison with Other Models
vs DeepSeek-V3.2 (Original Annotator)
For the same 5,010 samples:
| Label | DeepSeek-V3.2 | Qwen3-4B-Evasion-Exp | Delta |
|---|---|---|---|
| direct | 26.6% | 33.3% (training dist.) | +6.7% |
| intermediate | 35.0% | 33.3% (training dist.) | -1.7% |
| fully_evasive | 38.4% | 33.3% (training dist.) | -5.1% |
Key Difference: DeepSeek-V3.2 is more conservative (higher fully_evasive), while this model follows balanced training distribution.
Benchmark Performance (297 samples)
| Model | Accuracy | Macro F1 | Direct F1 | Intermediate F1 | Fully Evasive F1 |
|---|---|---|---|---|---|
| Qwen3-4B-Evasion-Exp | 77.44% | 78.13% | 76.92% | 72.95% | 84.52% |
| Baseline (placeholder) | - | - | - | - | - |
Training Datasets
Primary Training Data
- Dataset: evasion_opus45_5010_msswift.jsonl
- Samples: 5,010
- Format: MS-Swift JSONL format
- Annotation: Claude Opus 4.5 (100% coverage)
- Sampling: Balanced (1,670 per class)
Data Format Example
{
"conversations": [
{
"role": "user",
"content": "[Full Rasiah prompt template + Q&A pair]"
},
{
"role": "assistant",
"content": "```json\n{\"rasiah\":\"direct\",\"confidence\":0.85}\n```"
}
],
"id": "qa_12943_3224438_1"
}
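A record in this format can be assembled as follows; a minimal sketch in which `make_record` is a hypothetical helper and the user content stands in for the full Rasiah prompt template:

```python
import json

def make_record(qa_id, question, answer, label, confidence):
    """Build one MS-Swift 'conversations' training record for a labeled Q&A pair."""
    user = f"Question: {question}\nAnswer: {answer}"
    target = f'```json\n{{"rasiah":"{label}","confidence":{confidence}}}\n```'
    return {
        "conversations": [
            {"role": "user", "content": user},
            {"role": "assistant", "content": target},
        ],
        "id": qa_id,
    }

rec = make_record("qa_demo_1", "What is Q4 margin guidance?",
                  "We do not disclose that.", "fully_evasive", 0.9)
print(json.dumps(rec))  # one JSONL line
```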
Model Card Authors
- FutureMa (HuggingFace: @FutureMa)
Citation
If you use this model in your research, please cite:
@misc{qwen3_4b_evasion_exp,
author = {FutureMa},
title = {Qwen3-4B-Evasion-Exp: Fine-tuned Model for Earnings Call Evasion Detection},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/FutureMa/Qwen3-4B-Evasion-Exp}
}
Related Work
Evasion Detection Research:
- Rasiah et al. (2018) - "Detecting Evasive Answers in Financial Discourse"
- Hollander et al. (2010) - "Does Silence Speak? An Empirical Analysis of Disclosure Choices During Conference Calls"
Base Model:
@article{qwen3,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025}
}
License
This model is licensed under Apache License 2.0 (inherited from Qwen3 base model).
- Commercial Use: Allowed with attribution
- Modification: Allowed
- Distribution: Allowed
- Patent Use: Allowed
See LICENSE for full terms.
Acknowledgments
- Base Model: Qwen Team for Qwen3-4B
- Training Framework: MS-Swift team
- Annotation: Anthropic Claude Opus 4.5
- Data Source: Capital IQ earnings call transcripts
- Methodology: Rasiah et al. evasion taxonomy
Contact
For questions, issues, or collaborations:
- HuggingFace: @FutureMa
- Issues: GitHub Issues (if applicable)
Model Version: 1.0
Release Date: 2026-01-02
Last Updated: 2026-01-02
Status: ✅ Production Ready