qwen3.5-4B-mixed

Qwen3.5-4B fine-tuned for multilingual affective-state identification (MASIVE-style masked affect expression filling) across Romanian, English, Spanish, Persian, and Hindi.

Given a passage with one or more [MASK] positions, the model outputs the distinct affective expressions that fill them, in order of first appearance, separated by spaces.

Training data

5000 train / 1200 val / 5×1000 test rows, 1000 rows per language, scrambled and shuffled (seed 42). Sourced from:

  • ro: sampled from pipeline/data/benchmark_ro_asi_clean.jsonl in the Romanian_ASI repo
  • en / es / fa: sampled from the MASIVE dataset by the same pipeline
  • hi: converted from Hindi affective-state samples via scripts/convert_hindi_to_presentation_format.py

Each row supplies a masked passage and a list of distinct affective expressions (single words or short idiomatic phrases) that fill those masks. The supervision target is " ".join(labels); see plans/sft_run_v1_qwen3.5-4b.md in the training repo for the full semantics.
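As a rough illustration of the target construction described above, the sketch below joins a label list with single spaces. The helper name build_target and the defensive de-duplication step are assumptions for illustration only; the plan document in the training repo remains the reference for the actual semantics.

```python
def build_target(labels):
    """Join labels with single spaces, keeping first-appearance order.

    Hypothetical helper: repeats are dropped defensively, although the
    source states that each row's labels are already distinct.
    """
    seen = set()
    ordered = []
    for label in labels:
        label = " ".join(label.split())  # collapse stray internal whitespace
        if label and label not in seen:
            seen.add(label)
            ordered.append(label)
    return " ".join(ordered)


# Illustrative row, not taken from the dataset.
labels = ["nervous", "calm", "nervous"]
print(build_target(labels))  # prints: nervous calm
```

Multi-word labels pass through unchanged, so a phrase label stays contiguous in the target string.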

Training configuration

Base model: Qwen/Qwen3.5-4B (post-trained, not Instruct)
Framework: HF Trainer + DeepSpeed ZeRO-3 + BF16 + SDPA
Hardware: 3 × A100-40GB
Batch: 2 × 2 grad-accum × 3 GPUs = 12 effective
Optimizer: AdamW, cosine, lr=1e-5, warmup_ratio=0.03, max_grad_norm=1.0
Epochs: 3 (1251 steps)
Best checkpoint: step 825 (best eval_loss 0.796), selected via load_best_model_at_end=true
Seed: 42
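The step counts above can be cross-checked with a little arithmetic. This is a minimal sketch that assumes the final partial batch of each epoch still counts as one optimizer step; the Trainer's exact accounting may differ in edge cases, and the helper epoch_of is hypothetical.

```python
import math

train_rows = 5000
per_device_batch = 2
grad_accum = 2
num_gpus = 3
epochs = 3

effective_batch = per_device_batch * grad_accum * num_gpus
# Assumption: a trailing partial batch still produces one optimizer step.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs


def epoch_of(step):
    """Return the 1-based epoch in which a given optimizer step falls."""
    return math.ceil(step / steps_per_epoch)


print(effective_batch, steps_per_epoch, total_steps)  # prints: 12 417 1251
print(epoch_of(825), epoch_of(850))                   # prints: 2 3
```

Under these assumptions the selected checkpoint (step 825) falls near the end of epoch 2, and step 850 falls early in epoch 3.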

Prompt format

Use the model's chat template with enable_thinking=False. System prompt:

You are an affective-state identifier. Given a sentence with one or more [MASK] positions, output the distinct affective expressions that fill them, in order of first appearance, separated by a single space. Expressions may be single words or short idiomatic phrases. No explanation, no punctuation, no repeats.

Minimal inference example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("alexjerpelea/qwen3.5-4B-mixed", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "alexjerpelea/qwen3.5-4B-mixed", torch_dtype=torch.bfloat16, trust_remote_code=True,
).cuda().eval()

SYSTEM = (
    "You are an affective-state identifier. Given a sentence with one or more "
    "[MASK] positions, output the distinct affective expressions that fill "
    "them, in order of first appearance, separated by a single space. "
    "Expressions may be single words or short idiomatic phrases. "
    "No explanation, no punctuation, no repeats."
)

user_text = "Eram [MASK] de vedenia ei."
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": user_text},
]
prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
ids = tok(prompt, return_tensors="pt").to("cuda")
out = model.generate(**ids, max_new_tokens=50, do_sample=False, num_beams=1,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True).strip())
# → "uluita"

Evaluation

Scored via word-sequence set-match: each gold label (single word or multi-word idiomatic phrase) must appear as a contiguous whitespace-token subsequence of the normalized model output. Evaluation used greedy decoding via transformers (do_sample=False, num_beams=1); results with top-k sampling via vLLM are pending.
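The matching rule can be sketched as follows. The normalization used for the reported scores is not specified here, so the normalize function below (Unicode NFC, lowercasing, punctuation replaced by spaces) is an assumption, and the function names are hypothetical.

```python
import unicodedata


def normalize(text):
    """Assumed normalization: NFC, lowercase, punctuation -> space, then split."""
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return text.split()


def contains_sequence(haystack, needle):
    """True if the token list `needle` occurs contiguously inside `haystack`."""
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def set_match(gold_labels, output):
    """True only if every gold label is found as a contiguous token run."""
    out_tokens = normalize(output)
    return all(contains_sequence(out_tokens, normalize(g)) for g in gold_labels)


print(set_match(["uluita"], "uluita"))          # True
print(set_match(["fed up"], "fed and up"))      # False: tokens are not contiguous
print(set_match(["interesada"], "interesado"))  # False: exact tokens required
```

Averaging set_match over the test rows of a language would give a score in the spirit of set_acc@1, subject to the normalization assumption above.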

language   n      set_acc@1   sim@1 (mBERT)
ro         1000   0.381       0.851
en         1000   0.147       0.848
es         1000   0.191       0.813
fa         1000   0.414       0.902
hi         1000   0.481       0.821
mean       5000   0.323       0.847

Stratified by training-label frequency (set_acc@1 on test rows whose labels[0] appeared ≥ k times in training)

lang   ≥1      ≥5      ≥10     ≥20     ≥50
ro     0.381   0.441   0.480   0.537   0.788
en     0.147   0.174   0.029   -       -
es     0.191   0.226   0.238   0.484   -
fa     0.414   0.604   0.743   0.853   0.853
hi     0.481   0.492   0.510   0.515   0.581

Cells marked - have no reported value.

When gold labels were seen ≥20 times in training, accuracy on non-EN languages is in the 0.48–0.85 range. EN's flat label distribution (328 unique test labels across 1000 rows) prevents any single label from accumulating that density during training, which explains its lower headline number.
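A stratified breakdown of this kind could be computed roughly as below. This is a sketch under stated assumptions: the row fields labels and correct (a per-row set-match result) and the function name are hypothetical, not taken from the repo.

```python
from collections import Counter


def stratified_accuracy(train_rows, test_rows, thresholds=(1, 5, 10, 20, 50)):
    """Accuracy on test rows whose first label occurs >= k times in training.

    Assumes each row is a dict with a "labels" list; test rows also carry
    a boolean "correct" field. Empty buckets yield None.
    """
    freq = Counter(row["labels"][0] for row in train_rows)
    result = {}
    for k in thresholds:
        subset = [r for r in test_rows if freq[r["labels"][0]] >= k]
        result[k] = sum(r["correct"] for r in subset) / len(subset) if subset else None
    return result


# Toy data for illustration only.
train = [{"labels": ["happy"]}] * 3 + [{"labels": ["sad"]}]
test = [
    {"labels": ["happy"], "correct": True},
    {"labels": ["sad"], "correct": False},
    {"labels": ["angry"], "correct": False},
]
print(stratified_accuracy(train, test, thresholds=(1, 2, 5)))
# prints: {1: 0.5, 2: 1.0, 5: None}
```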

Comparison to zero-shot baseline

For Romanian, the prior zero-shot reference from Qwen/Qwen3.5-9B on the same test split was set_acc@1 = 0.22. This SFT run with a smaller 4B model reaches 0.38, an improvement of about 16 pp on the reported figures.

Known limitations

  • Lexical-exact metric: predictions like "stupid" vs gold "foolish", or "interesado" vs gold "interesada" (Spanish gender), score 0 despite being semantically correct. sim@1 (contextual mBERT similarity) is uniformly 0.81–0.90 across all five languages, indicating semantic agreement is much stronger than set_acc@1 suggests.
  • Overfitting onset at step 850: the model starts memorizing the training set in epoch 3. The released checkpoint is from step 825 (pre-overfit). A shorter 2-epoch run would likely generalize similarly.
  • Hindi label noise: short Hindi stems (e.g., डर "fear") can substring-match loanwords (किंडरगार्टेन "kindergarten") during data conversion; a handful of HI training/test rows are affected.
  • FA phrase labels: the model outputs full Persian idioms like دلم تنگ شده correctly when appropriate. Earlier eval pipelines that took only the first whitespace token of the output under-score FA substantially; use word-sequence set-match or an equivalent.
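The Hindi label-noise issue can be reproduced in a few lines. The sketch below only illustrates the failure mode; it does not reproduce the conversion script, and the two passages and both helper names are made up for the example. Whole-token matching is shown as one possible mitigation, with the caveat that it would miss inflected forms of a stem.

```python
def substring_hit(stem, passage):
    """Naive check: the stem occurs anywhere in the passage, even inside a word."""
    return stem in passage


def token_hit(stem, passage):
    """Stricter check: the stem must be a whole whitespace-delimited token."""
    return stem in passage.split()


stem = "डर"  # "fear"
loanword_passage = "बच्चा किंडरगार्टेन जाता है"  # contains the loanword "kindergarten"
genuine_passage = "मुझे डर लगता है"  # contains the stem as a word

print(substring_hit(stem, loanword_passage))  # True: false positive
print(token_hit(stem, loanword_passage))      # False
print(token_hit(stem, genuine_passage))       # True
```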

Citation / Attribution

Base model: Qwen3.5-4B by Alibaba Cloud (Apache-2.0). Fine-tuning: Alex Jerpelea, Romanian_ASI repo. See reports/sft_qwen3.5-4b_v1_results.md in the repo for the full training writeup and per-sample predictions.
