Qwen3-4B-Evasion-Exp

Fine-tuned Qwen3-4B model for earnings call evasion detection

This model is a full parameter fine-tuned version of Qwen3-4B, specialized in classifying management responses to analyst questions during earnings calls using the Rasiah 3-class evasion taxonomy.

Model Description

  • Base Model: FutureMa/Qwen3-4B-Evasion
  • Model Type: Causal Language Model (Fine-tuned)
  • Training Method: Full parameter fine-tuning with MS-Swift
  • Task: Three-class evasion classification
  • Domain: Financial discourse analysis (earnings calls)

Training Details

Training Data

  • Dataset: 5,010 labeled earnings call Q&A pairs (Claude Opus 4.5 annotations)
  • Source: Balanced sampling from 154K earnings call dataset
  • Label Distribution:
    • direct: 1,670 (33.3%)
    • intermediate: 1,670 (33.3%)
    • fully_evasive: 1,670 (33.3%)
  • Data Quality: 100% Claude Opus 4.5 labels, avg confidence 0.7974

Training Configuration

  • Framework: MS-Swift
  • Hardware: 4x H100 GPUs
  • Training Time: ~2 hours
  • Checkpoint: checkpoint-314
  • Learning Rate: 2e-5
  • Batch Size: 4 per device
  • Epochs: 3
  • Max Length: 4096 tokens

Prompt Template

The model uses the Rasiah evasion classification prompt template from: prompts/evasion_rasiah_only_prompt_template_gpt.txt

Question: [Analyst question]
Answer: [Management response]

Expected Output:

{"rasiah":"direct|intermediate|fully_evasive","confidence":0.85}
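Downstream consumers need to pull this JSON out of the raw reply, which may arrive bare or wrapped in a ```json fence. A minimal parsing sketch (the regex-based extraction and the validation are our assumptions, not part of the model contract):

```python
import json
import re

def parse_evasion_label(reply: str) -> dict:
    """Extract the {"rasiah": ..., "confidence": ...} object from a model reply.

    Handles both a bare JSON object and one wrapped in a Markdown code fence.
    """
    match = re.search(r'\{[^{}]*"rasiah"[^{}]*\}', reply)
    if match is None:
        raise ValueError(f"no rasiah JSON object found in: {reply!r}")
    parsed = json.loads(match.group(0))
    if parsed["rasiah"] not in {"direct", "intermediate", "fully_evasive"}:
        raise ValueError(f"unexpected label: {parsed['rasiah']}")
    return parsed

# Works on both output styles:
print(parse_evasion_label('{"rasiah":"direct","confidence":0.85}'))
print(parse_evasion_label('```json\n{"rasiah":"fully_evasive","confidence":0.9}\n```'))
```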

Performance

Evaluation on 297-Sample Benchmark (vs. Claude Opus Ground Truth)

| Metric | Score |
|---|---|
| Overall Accuracy | 77.44% |
| Macro F1 | 78.13% |
| Macro Precision | 80.10% |
| Macro Recall | 77.26% |

Per-Class Metrics

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| direct | 80.46% | 73.68% | 76.92% | 95 |
| intermediate | 66.42% | 80.91% | 72.95% | 110 |
| fully_evasive | 93.42% | 77.17% | 84.52% | 92 |

Confusion Matrix

| | Predicted: direct | Predicted: intermediate | Predicted: fully_evasive |
|---|---|---|---|
| True: direct | 70 | 25 | 0 |
| True: intermediate | 16 | 89 | 5 |
| True: fully_evasive | 1 | 20 | 71 |
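Every figure in the metrics tables can be recomputed from this confusion matrix; a quick self-contained check:

```python
# Confusion matrix from above: rows = true class, columns = predicted class,
# in the order direct, intermediate, fully_evasive.
cm = [
    [70, 25, 0],   # true direct
    [16, 89, 5],   # true intermediate
    [1, 20, 71],   # true fully_evasive
]
classes = ["direct", "intermediate", "fully_evasive"]
total = sum(sum(row) for row in cm)  # 297 benchmark samples

accuracy = sum(cm[i][i] for i in range(3)) / total
print(f"accuracy = {accuracy:.4f}")  # 0.7744

for i, name in enumerate(classes):
    predicted = sum(cm[r][i] for r in range(3))   # column sum
    support = sum(cm[i])                          # row sum
    precision = cm[i][i] / predicted
    recall = cm[i][i] / support
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:>14}: P={precision:.2%} R={recall:.2%} F1={f1:.2%} n={support}")
```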

Key Findings

  • Best Performance: fully_evasive (F1: 84.52%, Precision: 93.42%)
  • ⚠️ Challenge: intermediate class has lower precision (66.42%)
  • 📊 Main Confusion: direct ↔ intermediate (41 samples)

Usage

MS-Swift Inference

# Install MS-Swift
pip install ms-swift

# Inference
swift infer \
    --ckpt_dir FutureMa/Qwen3-4B-Evasion-Exp \
    --eval_human false \
    --max_new_tokens 512

Python API (MS-Swift)

from swift.llm import PtEngine, RequestConfig, InferRequest

# Load model
engine = PtEngine(
    "FutureMa/Qwen3-4B-Evasion-Exp",
    max_batch_size=1
)

# Build prompt
question = "Can you provide more details on the revenue guidance?"
answer = "We'll share more specifics in next quarter's call."

prompt = f"""Question: {question}
Answer: {answer}
"""

# Inference (PtEngine expects a list of InferRequest objects)
request = InferRequest(messages=[{"role": "user", "content": prompt}])
resp_list = engine.infer([request], RequestConfig(max_tokens=512, temperature=0))
print(resp_list[0].choices[0].message.content)
# Expected: {"rasiah":"fully_evasive","confidence":0.85}

Transformers API

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FutureMa/Qwen3-4B-Evasion-Exp",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "FutureMa/Qwen3-4B-Evasion-Exp",
    trust_remote_code=True
)

# Build prompt with full template
prompt = """You are a financial discourse analyst. Your task is to label how an executive responds to an analyst's question during an earnings-call Q&A.

Return exactly one JSON object in a Markdown code block using the `json` tag:

```json
{"rasiah":"direct|intermediate|fully_evasive","confidence":0.00}
```

========================
NOW LABEL THIS PAIR
========================
Question: What are your margin expectations for Q4?
Answer: We're focused on maintaining operational efficiency across all segments.
"""

# Apply the chat template (the model was fine-tuned in chat format)
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate (greedy decoding yields a deterministic label)
outputs = model.generate(
    inputs,
    max_new_tokens=100,
    do_sample=False
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Evasion Taxonomy (Rasiah Classification)

Label Definitions

  1. direct

    • Clear, on-topic resolution to the primary information request
    • Provides concrete facts, numbers, or definitive qualitative answers
    • Examples: "Yes", "25% margin", "We expect $100M in Q4"
  2. intermediate

    • On-topic but incomplete, softened, or only partially responsive
    • Provides directional information but avoids specifics
    • Examples: "We're optimistic about margins", "Revenue will be strong"
  3. fully_evasive

    • Does not provide the requested information at all
    • Explicit refusal, deferral, or topic shift
    • Examples: "We don't disclose that", "We'll update later", "No comment"

Confidence Calibration

  • 0.90-1.00: Very clear match; minimal ambiguity
  • 0.70-0.89: Mostly clear; minor edge cases
  • 0.50-0.69: Borderline between two labels
  • < 0.50: High uncertainty
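These bands can be applied mechanically to the parsed confidence score. A small hypothetical helper (the band names here are ours, condensed from the list above):

```python
def confidence_band(confidence: float) -> str:
    """Map a model confidence score to the calibration bands listed above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= 0.90:
        return "very clear"
    if confidence >= 0.70:
        return "mostly clear"
    if confidence >= 0.50:
        return "borderline"
    return "high uncertainty"

print(confidence_band(0.85))  # mostly clear
```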

Intended Use

Primary Use Cases

Financial Discourse Analysis

  • Detect evasion patterns in earnings calls
  • Assess corporate transparency and disclosure quality
  • Analyze Q&A dynamics between analysts and management

Research Applications

  • Study information asymmetry in financial markets
  • Quantify management communication strategies
  • Corporate governance research

Model Development

  • Training data generation for evasion detection systems
  • Baseline comparison for new evasion detection models
  • Fine-tuning starting point for related tasks

Out-of-Scope Use

Not Recommended:

  • Legal decision-making without human review
  • Real-time trading signals (requires validation)
  • Non-earnings call financial discourse (may not generalize)
  • General Q&A classification outside finance domain

Limitations

Model Limitations

  1. Domain Specificity

    • Trained exclusively on earnings call Q&A
    • May not generalize to other financial contexts (analyst reports, investor letters, etc.)
    • Performance on non-English earnings calls unknown
  2. Class Imbalance Handling

    • Training data is perfectly balanced (33.3% each class)
    • Real-world distribution is skewed (intermediate: 50.6%, direct: 31.1%, fully_evasive: 18.3%)
    • May over-predict minority classes in production
  3. Intermediate Class Ambiguity

    • Lower precision on intermediate (66.42%) due to inherent ambiguity
    • Boundary cases between direct and intermediate are challenging
    • 41 samples (14%) confused between these classes
  4. Context Dependency

    • Requires full Q&A pair (question + answer)
    • Performance degrades with truncated or incomplete text
    • Sensitive to prompt formatting
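On limitation 2 (class imbalance): when deployment class priors are known, one standard mitigation is to rescale the model's class probabilities by the ratio of deployment prior to training prior and renormalize. A sketch under that assumption (the correction is our suggestion, not something the model performs; the deployment priors are the real-world figures quoted above):

```python
# Training used balanced classes; the quoted real-world distribution is skewed.
TRAIN_PRIOR = {"direct": 1 / 3, "intermediate": 1 / 3, "fully_evasive": 1 / 3}
DEPLOY_PRIOR = {"direct": 0.311, "intermediate": 0.506, "fully_evasive": 0.183}

def prior_corrected(probs: dict) -> dict:
    """Rescale class probabilities by deployment/training prior and renormalize."""
    scaled = {c: p * DEPLOY_PRIOR[c] / TRAIN_PRIOR[c] for c, p in probs.items()}
    z = sum(scaled.values())
    return {c: v / z for c, v in scaled.items()}

# A borderline prediction shifts toward the real-world majority class:
print(prior_corrected({"direct": 0.40, "intermediate": 0.35, "fully_evasive": 0.25}))
```

Note that this needs class probabilities (e.g. from the label tokens' logits), not the model's self-reported confidence value.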

Data Limitations

  1. Annotation Source

    • Single annotator (Claude Opus 4.5)
    • No human validation on training data
    • Potential model-specific biases
  2. Temporal Coverage

    • Source data from specific time period
    • May not reflect recent market dynamics or communication trends
  3. Sample Size

    • 5,010 training samples (moderate scale)
    • Benchmark evaluation on only 297 samples
    • Limited coverage of edge cases

Bias and Fairness

Known Biases

  • Label Distribution Bias: Training on balanced data (33.3% each) vs real-world skewed distribution may cause miscalibration
  • Model Bias: Inherits Claude Opus 4.5's classification preferences (more liberal with direct labels)
  • Domain Bias: Earnings calls are formal, scripted interactions; may not apply to informal financial discourse

Fairness Considerations

  • Model treats all companies equally (no company-specific features)
  • No demographic or geographic features used
  • Evaluates content only, not speaker identity

Comparison with Other Models

vs DeepSeek-V3.2 (Original Annotator)

For the same 5,010 samples:

| Label | DeepSeek-V3.2 | Qwen3-4B-Evasion-Exp | Delta |
|---|---|---|---|
| direct | 26.6% | 33.3% (training dist.) | +6.7% |
| intermediate | 35.0% | 33.3% (training dist.) | -1.7% |
| fully_evasive | 38.4% | 33.3% (training dist.) | -5.1% |

Key Difference: DeepSeek-V3.2 assigns fully_evasive more often (a more conservative reading of answers), while this model's output tracks its balanced training distribution.

Benchmark Performance (297 samples)

| Model | Accuracy | Macro F1 | Direct F1 | Intermediate F1 | Fully Evasive F1 |
|---|---|---|---|---|---|
| Qwen3-4B-Evasion-Exp | 77.44% | 78.13% | 76.92% | 72.95% | 84.52% |
| Baseline (placeholder) | - | - | - | - | - |

Training Datasets

Primary Training Data

  • Dataset: evasion_opus45_5010_msswift.jsonl
  • Samples: 5,010
  • Format: MS-Swift JSONL format
  • Annotation: Claude Opus 4.5 (100% coverage)
  • Sampling: Balanced (1,670 per class)

Data Format Example

{
  "conversations": [
    {
      "role": "user",
      "content": "[Full Rasiah prompt template + Q&A pair]"
    },
    {
      "role": "assistant",
      "content": "```json\n{\"rasiah\":\"direct\",\"confidence\":0.85}\n```"
    }
  ],
  "id": "qa_12943_3224438_1"
}
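A record in this format can be assembled in a few lines. This sketch is ours, and the template string is an illustrative stand-in for the full Rasiah prompt:

```python
import json

def make_record(record_id: str, question: str, answer: str,
                label: str, confidence: float, template: str) -> str:
    """Serialize one labeled Q&A pair into the MS-Swift conversations JSONL format."""
    target = json.dumps({"rasiah": label, "confidence": confidence},
                        separators=(",", ":"))
    record = {
        "conversations": [
            {"role": "user",
             "content": f"{template}\nQuestion: {question}\nAnswer: {answer}"},
            {"role": "assistant",
             "content": f"```json\n{target}\n```"},
        ],
        "id": record_id,
    }
    return json.dumps(record, ensure_ascii=False)

line = make_record("qa_demo_1",
                   "What are your margin expectations for Q4?",
                   "We don't disclose forward margins.",
                   "fully_evasive", 0.9,
                   "[Rasiah prompt template]")  # stand-in for the real template
print(line)
```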

Model Card Authors

Citation

If you use this model in your research, please cite:

@misc{qwen3_4b_evasion_exp,
  author = {FutureMa},
  title = {Qwen3-4B-Evasion-Exp: Fine-tuned Model for Earnings Call Evasion Detection},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/FutureMa/Qwen3-4B-Evasion-Exp}
}

Related Work

Evasion Detection Research:

  • Rasiah et al. (2018) - "Detecting Evasive Answers in Financial Discourse"
  • Hollander et al. (2010) - "Does Silence Speak? An Empirical Analysis of Disclosure Choices During Conference Calls"

Base Model:

@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2505.09388},
  year={2025}
}

License

This model is licensed under Apache License 2.0 (inherited from Qwen3 base model).

  • Commercial Use: Allowed with attribution
  • Modification: Allowed
  • Distribution: Allowed
  • Patent Use: Allowed

See LICENSE for full terms.

Acknowledgments

  • Base Model: Qwen Team for Qwen3-4B
  • Training Framework: MS-Swift team
  • Annotation: Anthropic Claude Opus 4.5
  • Data Source: Capital IQ earnings call transcripts
  • Methodology: Rasiah et al. evasion taxonomy

Contact

For questions, issues, or collaborations:


Model Version: 1.0
Release Date: 2026-01-02
Last Updated: 2026-01-02
Status: ✅ Production Ready
