Minh Toàn committed 24 days ago

Commit

4138b08

verified ·

1 Parent(s): 46f6b62

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)

demo_all_tracks_gradio.ipynb +482 -0
demo_all_tracks_gradio_pipeline.py +390 -0
demo_run_from_hf.ipynb +119 -0
demo_run_from_hf_pipeline.py +59 -0
track1/demo_track1_gradio.ipynb +144 -0
track1/demo_track1_gradio_pipeline.py +100 -0
track1/track1_baseline.ipynb +108 -0
track1/track1_baseline_pipeline.py +64 -0
track2/demo_track2_emotion_gradio.ipynb +533 -0
track2/demo_track2_emotion_gradio_pipeline.py +431 -0
track2/demo_track2_gradio.ipynb +175 -0
track2/demo_track2_gradio_pipeline.py +120 -0
track2/exp02_train_emos.ipynb +542 -0
track2/exp02_train_emos_pipeline.py +407 -0
track2/exp03_emos_sailer.ipynb +392 -0
track2/exp03_emos_sailer_pipeline.py +264 -0
track2/exp04_fusion.ipynb +790 -0
track2/exp04_fusion_pipeline.py +652 -0
track2/exp05_vad_audeering.ipynb +443 -0
track2/exp05_vad_audeering_pipeline.py +303 -0
track2/exp06_qmos_train.ipynb +628 -0
track2/exp06_qmos_train_pipeline.py +502 -0
track2/exp07_fusion_qmos.ipynb +780 -0
track2/exp07_fusion_qmos_pipeline.py +654 -0
track2/exp08_finetune_emotion.ipynb +820 -0
track2/exp08_finetune_emotion_pipeline.py +673 -0
track2/exp08b_finetune_resume.ipynb +782 -0
track2/exp08b_finetune_resume_pipeline.py +642 -0
track2/exp09a_qmos_utmosv2_probe.ipynb +339 -0
track2/exp09a_qmos_utmosv2_probe_pipeline.py +239 -0
track2/exp10_finetune_audeering.ipynb +691 -0
track2/exp10_finetune_audeering_pipeline.py +553 -0
track2/exp11_finetune_joint.ipynb +805 -0
track2/exp11_finetune_joint_pipeline.py +665 -0
track2/exp12_wavlm_scratch.ipynb +690 -0
track2/exp12_wavlm_scratch_pipeline.py +564 -0
track2/exp13_finetune_qmos.ipynb +733 -0
track2/exp13_finetune_qmos_pipeline.py +607 -0
track2/exp14_mamba_head.ipynb +952 -0
track2/exp14_mamba_head_pipeline.py +798 -0
track2/exp15_predict.ipynb +698 -0
track2/exp15_predict_pipeline.py +554 -0
track2/exp15_wavlm_mamba_emotion.ipynb +1081 -0
track2/exp15_wavlm_mamba_emotion_pipeline.py +920 -0
track2/exp16_llm_judge.ipynb +650 -0
track2/exp16_llm_judge_pipeline.py +480 -0
track2/track2_baseline.ipynb +130 -0
track2/track2_baseline_pipeline.py +321 -0
track2/track2_prepare_data.ipynb +249 -0
track2/track2_prepare_data_pipeline.py +164 -0

demo_all_tracks_gradio.ipynb ADDED

	@@ -0,0 +1,482 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "739ac809",
+   "metadata": {},
+   "source": [
+    "# VMC2026: Combined Gradio demo for all 3 tracks (one link for the mentor)\n",
+    "\n",
+    "Merges the 3 separate demos (`track1/`, `track2/`, `track3/`) into **one Gradio app with 3 tabs**:\n",
+    "- **Track 1** · Speech Enhancement → **ACR** (absolute quality of A) + **CCR** (A compared with B). Model: URGENT-MOS.\n",
+    "- **Track 2** · Emotional TTS → **EMOS / CAT / VAD**. BEST model = **exp08** (fine-tuned WavLM + audeering).\n",
+    "- **Track 3** · Speaker/Accent → **spk_sim / acc_sim**. Model: fine-tuned ECAPA (the organizers' baseline).\n",
+    "\n",
+    "> **Lazy-load:** each track loads its model only when you click that tab's predict button, so a tab with a missing checkpoint/repo\n",
+    "> reports an error only inside that tab and does NOT crash the whole app. Tracks 1 & 3 need only Internet access; Track 2 also needs the exp08 checkpoint.\n",
+    "\n",
+    "### How to run on Kaggle\n",
+    "1. Settings → **GPU T4 + Internet On**.\n",
+    "2. (For Track 2) Add Input: the Track 2 dataset (`sets/train.csv`, `wav/`, `metadata.csv`) + the dataset containing\n",
+    "   `ft_emotion_full_20epoch.pt` (slug `toanminh222/cache-exp8`). If these are missing, the other 2 tabs still work.\n",
+    "3. **Run All** → the last cell prints a `*.gradio.live` link (valid for ~72h) → send it to the mentor."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6f7119e0",
+   "metadata": {},
+   "source": [
+    "## 1. Install packages (once for all 3 tracks)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4da07abf",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "!pip install -q gradio librosa soundfile speechbrain torchaudio loralib scipy scikit-learn pandas tqdm\n",
+    "\n",
+    "import os, sys, glob, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=False)\n",
+    "\n",
+    "# Lightweight install (Kaggle already ships torch/transformers/numpy → do NOT touch numpy, to avoid an ABI mismatch)\n",
+    "pip_install(\"gradio\", \"librosa\", \"soundfile\", \"speechbrain\", \"torchaudio\",\n",
+    "            \"loralib\", \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "import librosa\n",
+    "import numpy as np\n",
+    "import torch\n",
+    "\n",
+    "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+    "SR = 16000\n",
+    "print(\"Device:\", DEVICE, (\"✅ \" + torch.cuda.get_device_name(0)) if DEVICE == \"cuda\" else \"⚠️ CPU (slow)\")\n",
+    "\n",
+    "def _stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "def _scalar(x):\n",
+    "    return float(x.item()) if hasattr(x, \"item\") else float(x)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ce385ee",
+   "metadata": {},
+   "source": [
+    "## 2. TRACK 1 — URGENT-MOS (ACR + CCR) · lazy-load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8f09a3ee",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "URGENT_REPO = \"/kaggle/working/URGENT-MOS\"\n",
+    "URGENT_CKPT = \"urgent-challenge/urgent-mos-f1c1m5dcorpus\"   # downloaded automatically from HuggingFace\n",
+    "_T1 = {}\n",
+    "\n",
+    "def _t1_load():\n",
+    "    \"\"\"Load URGENT-MOS once (clone repo + sys.path + checkpoint).\"\"\"\n",
+    "    if \"m\" in _T1:\n",
+    "        return _T1[\"m\"]\n",
+    "    if not os.path.isdir(URGENT_REPO):\n",
+    "        subprocess.run(f\"git clone -q https://github.com/vvwangvv/URGENT-MOS.git {URGENT_REPO}\",\n",
+    "                       shell=True, check=True)\n",
+    "    if URGENT_REPO not in sys.path:\n",
+    "        sys.path.insert(0, URGENT_REPO)\n",
+    "    import importlib\n",
+    "    importlib.invalidate_caches()\n",
+    "    try:\n",
+    "        importlib.import_module(\"urgent_mos.api.infer\")\n",
+    "    except Exception:\n",
+    "        subprocess.run(f\"pip install -q -e {URGENT_REPO}\", shell=True, check=False)\n",
+    "        importlib.invalidate_caches()\n",
+    "    from urgent_mos.utils import load_model_from_checkpoint\n",
+    "    m = load_model_from_checkpoint(URGENT_CKPT, DEVICE)\n",
+    "    m.eval()\n",
+    "    _T1[\"m\"] = m\n",
+    "    return m\n",
+    "\n",
+    "def t1_predict(audio_a, audio_b):\n",
+    "    if not audio_a:\n",
+    "        return \"⚠️ Please upload at least **Audio A**.\"\n",
+    "    try:\n",
+    "        m = _t1_load()\n",
+    "        from urgent_mos.api.infer import infer, infer_pairs\n",
+    "        wa = torch.from_numpy(librosa.load(audio_a, sr=SR, mono=True)[0]).float()\n",
+    "        acr_a = max(1.0, min(5.0, _scalar(infer(m, [wa], sample_rate=[SR],\n",
+    "                                                batch_frames=None, num_workers=0)[0][\"mos_overall\"])))\n",
+    "        out = f\"**ACR (Audio A): {acr_a:.3f}**  (absolute quality, scale 1–5)\"\n",
+    "        if audio_b:\n",
+    "            wb = torch.from_numpy(librosa.load(audio_b, sr=SR, mono=True)[0]).float()\n",
+    "            acr_b = max(1.0, min(5.0, _scalar(infer(m, [wb], sample_rate=[SR],\n",
+    "                                                    batch_frames=None, num_workers=0)[0][\"mos_overall\"])))\n",
+    "            ccr = max(-3.0, min(3.0, _scalar(infer_pairs(m, [(wa, wb)], sample_rate=[(SR, SR)],\n",
+    "                                                         batch_frames=None, num_workers=0)[0][\"mos_overall\"])))\n",
+    "            out += (f\"\\n\\n**ACR (Audio B): {acr_b:.3f}**\"\n",
+    "                    f\"\\n\\n**CCR (A vs. B): {ccr:+.3f}**  (>0: A is better than B; scale −3..+3)\")\n",
+    "        return out\n",
+    "    except Exception as e:\n",
+    "        return f\"❌ Track 1 error: `{repr(e)}`\\n\\nCheck that **Internet On** is enabled (URGENT-MOS must be downloaded from GitHub/HuggingFace).\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "21d2f185",
+   "metadata": {},
+   "source": [
+    "## 3. TRACK 3 — ECAPA fine-tuned (spk_sim + acc_sim) · lazy-load"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "143f5e27",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "T3_REPO = \"/kaggle/working/vmc2026-baselines/track3\"\n",
+    "CKPT_SPK = f\"{T3_REPO}/official-egs/spk_sim_adamw_lr1e-3/model_spk_sim_step20000.pt\"\n",
+    "CKPT_ACC = f\"{T3_REPO}/official-egs/acc_sim_adamw_lr1e-3/model_acc_sim_step20000.pt\"\n",
+    "_T3 = {}\n",
+    "\n",
+    "def _t3_load():\n",
+    "    if \"spk\" in _T3:\n",
+    "        return _T3\n",
+    "    repo_root = \"/kaggle/working/vmc2026-baselines\"\n",
+    "    if not os.path.isdir(repo_root):\n",
+    "        subprocess.run(f\"git clone -q https://github.com/voicemos-challenge/vmc2026-baselines.git {repo_root}\",\n",
+    "                       shell=True, check=True)\n",
+    "    if T3_REPO not in sys.path:\n",
+    "        sys.path.insert(0, T3_REPO)\n",
+    "    from model import Model\n",
+    "    spk = Model(mlp_heads=[\"spk_sim\"])\n",
+    "    spk.load_state_dict(torch.load(CKPT_SPK, map_location=\"cpu\"))\n",
+    "    acc = Model(mlp_heads=[\"acc_sim\"])\n",
+    "    acc.load_state_dict(torch.load(CKPT_ACC, map_location=\"cpu\"))\n",
+    "    _T3.update(spk=spk.to(DEVICE).eval(), acc=acc.to(DEVICE).eval())\n",
+    "    return _T3\n",
+    "\n",
+    "def t3_predict(audio_test, audio_ref):\n",
+    "    if not audio_test or not audio_ref:\n",
+    "        return \"⚠️ **Both files** are required: test audio + reference audio.\"\n",
+    "    try:\n",
+    "        M = _t3_load()\n",
+    "        ta = torch.from_numpy(librosa.load(audio_test, sr=SR, mono=True)[0]).float().unsqueeze(0).to(DEVICE)\n",
+    "        tb = torch.from_numpy(librosa.load(audio_ref, sr=SR, mono=True)[0]).float().unsqueeze(0).to(DEVICE)\n",
+    "        with torch.no_grad():\n",
+    "            o_spk = M[\"spk\"](ta, tb)\n",
+    "            spk = float(o_spk[\"spk_sim\"].item())\n",
+    "            acc = float(M[\"acc\"](ta, tb)[\"acc_sim\"].item())\n",
+    "            cos = float(o_spk[\"cos_sim\"].item())\n",
+    "        return (f\"**Speaker similarity: {spk:.3f}**  (1–5)\\n\\n\"\n",
+    "                f\"**Accent similarity : {acc:.3f}**  (1–5)\\n\\n\"\n",
+    "                f\"Zero-shot cosine (for reference): {cos:.3f}\")\n",
+    "    except Exception as e:\n",
+    "        return f\"❌ Track 3 error: `{repr(e)}`\\n\\nCheck that **Internet On** is enabled (needed to clone the baseline repo, which contains the checkpoints).\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6fd2047d",
+   "metadata": {},
+   "source": [
+    "## 4. TRACK 2 — exp08 Emotional TTS Evaluator (EMOS/CAT/VAD) · lazy-load\n",
+    "\n",
+    "BEST model: fine-tuned WavLM (warm-started from SAILER) + frozen audeering → trunk → 3 heads.\n",
+    "The architecture constants MUST match exp08 (the checkpoint does not store them)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "acfc33a7",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "EMO_MAX_SEC, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, USE_AMP = 8, 512, 128, 0.3, True\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\", \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\", \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def _t2_find_ckpt():\n",
+    "    for pat in [\"ft_emotion_full_20epoch*.pt\", \"ft_emotion_full*.pt\"]:\n",
+    "        for base in [\"/kaggle/input\", \"/kaggle/working\"]:\n",
+    "            hits = sorted(glob.glob(os.path.join(base, \"**\", pat), recursive=True))\n",
+    "            if hits:\n",
+    "                return hits[0]\n",
+    "    return \"\"\n",
+    "\n",
+    "_T2 = {}\n",
+    "\n",
+    "def _t2_load():\n",
+    "    if \"infer\" in _T2:\n",
+    "        return _T2[\"infer\"]\n",
+    "    import torch.nn as nn\n",
+    "    ckpt_path = _t2_find_ckpt()\n",
+    "    assert ckpt_path, \"ft_emotion_full*.pt not found. Did you Add Input the exp08 checkpoint dataset (slug toanminh222/cache-exp8)?\"\n",
+    "    # SAILER code, used to build the WavLM backbone\n",
+    "    repo = \"/kaggle/working/vox-profile-release\"\n",
+    "    if not os.path.exists(repo):\n",
+    "        subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                        \"https://github.com/tiantiaf0627/vox-profile-release.git\", repo], check=True)\n",
+    "    if repo not in sys.path:\n",
+    "        sys.path.insert(0, repo)\n",
+    "\n",
+    "    ckpt = torch.load(ckpt_path, map_location=\"cpu\", weights_only=False)\n",
+    "    assert \"wavlm\" in ckpt and \"heads\" in ckpt, \"Checkpoint is missing 'wavlm'/'heads' → the full ft_emotion_full_20epoch.pt is required.\"\n",
+    "    AUD_DIM = int(ckpt.get(\"AUD_DIM\", 0)); USE_AUDEERING = AUD_DIM > 0\n",
+    "\n",
+    "    def find_hf_backbone(module):\n",
+    "        cands = []\n",
+    "        for name, m in module.named_modules():\n",
+    "            enc = getattr(m, \"encoder\", None)\n",
+    "            if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                    and getattr(enc, \"layers\", None) is not None:\n",
+    "                cands.append((name, m))\n",
+    "        if not cands:\n",
+    "            return None, None\n",
+    "        cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "        return cands[0]\n",
+    "\n",
+    "    wavlm = None\n",
+    "    try:\n",
+    "        from src.model.emotion.wavlm_emotion import WavLMWrapper\n",
+    "        _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "        _name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    except Exception as e:\n",
+    "        print(\"⚠️ SAILER wrapper failed:\", repr(e), \"→ falling back to plain WavLM.\")\n",
+    "    if wavlm is None:\n",
+    "        from transformers import WavLMModel\n",
+    "        wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    wavlm = wavlm.to(DEVICE).eval()\n",
+    "    WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "    wavlm.config.layerdrop = 0.0\n",
+    "    wavlm.load_state_dict(ckpt[\"wavlm\"], strict=False)\n",
+    "\n",
+    "    def masked_mean(hidden, attn_mask):\n",
+    "        if attn_mask is None:\n",
+    "            return hidden.mean(dim=1)\n",
+    "        try:\n",
+    "            fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "        except Exception:\n",
+    "            return hidden.mean(dim=1)\n",
+    "        fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "        return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "    # frozen audeering model (only if the checkpoint uses it)\n",
+    "    aud_backbone = aud_head = aud_proc = None\n",
+    "    if USE_AUDEERING:\n",
+    "        from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "        from huggingface_hub import hf_hub_download\n",
+    "        AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "        aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "        aud_backbone = Wav2Vec2Model(Wav2Vec2Config.from_pretrained(AUD_NAME))\n",
+    "        try:\n",
+    "            _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "                hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "        except Exception:\n",
+    "            _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "        bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "        aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "        _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "        aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(),\n",
+    "                                 nn.Linear(_hid, _sd[\"classifier.out_proj.weight\"].shape[0]))\n",
+    "        aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "        aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "        aud_backbone = aud_backbone.to(DEVICE).eval(); aud_head = aud_head.to(DEVICE).eval()\n",
+    "\n",
+    "    @torch.no_grad()\n",
+    "    def audeering_feat(wave):\n",
+    "        x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(DEVICE)\n",
+    "        h = aud_backbone(x)[0].mean(dim=1)\n",
+    "        out = aud_head(h)[0].cpu().numpy()\n",
+    "        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)\n",
+    "        return np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "\n",
+    "    N_EMO = len(EMOTIONS5)\n",
+    "    TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "    class EmoHeads(nn.Module):\n",
+    "        def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "            super().__init__()\n",
+    "            self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                       nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "            self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "            self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "            self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "        def forward(self, feat, tgt):\n",
+    "            h = self.trunk(feat)\n",
+    "            return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "    heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(DEVICE).eval()\n",
+    "    heads.load_state_dict(ckpt[\"heads\"], strict=False)\n",
+    "    emos_mu, emos_sd = float(ckpt[\"emos_mu\"]), float(ckpt[\"emos_sd\"])\n",
+    "    vad_mu = np.asarray(ckpt[\"vad_mu\"], dtype=np.float32); vad_sd = np.asarray(ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "\n",
+    "    def onehot_target(tgt):\n",
+    "        v = np.zeros(N_EMO, dtype=np.float32)\n",
+    "        if tgt in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "        return v\n",
+    "\n",
+    "    @torch.no_grad()\n",
+    "    def infer_wave(wave, target_emotion):\n",
+    "        wave = wave[: EMO_MAX_SEC * SR].astype(np.float32)\n",
+    "        iv = torch.from_numpy(wave).unsqueeze(0).to(DEVICE)\n",
+    "        am = torch.ones((1, len(wave)), dtype=torch.long, device=DEVICE)\n",
+    "        tgt = torch.from_numpy(onehot_target(norm_emotion(target_emotion) if target_emotion else None)).unsqueeze(0).to(DEVICE)\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and DEVICE == \"cuda\"):\n",
+    "            fw = wavlm(iv, attention_mask=am).last_hidden_state\n",
+    "            fw = masked_mean(fw, am)\n",
+    "            if USE_AUDEERING:\n",
+    "                fw = torch.cat([fw, torch.from_numpy(audeering_feat(wave)).unsqueeze(0).to(DEVICE)], dim=1)\n",
+    "            emos_p, cat_l, vad_p = heads(fw, tgt)\n",
+    "        emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "        cat5 = torch.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "        vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "        return emos, cat5, vad3\n",
+    "\n",
+    "    print(f\"✅ Track 2 exp08 loaded (audeering {'ON' if USE_AUDEERING else 'OFF'}) from {ckpt_path}\")\n",
+    "    _T2[\"infer\"] = infer_wave\n",
+    "    return infer_wave\n",
+    "\n",
+    "def t2_predict(audio, target_emotion):\n",
+    "    \"\"\"Returns: verdict(md), EMOS(number), CAT(label dict), VAL, ARO, DOM.\"\"\"\n",
+    "    if not audio:\n",
+    "        return \"### ⚠️ Please upload audio (TTS voice).\", None, {}, None, None, None\n",
+    "    try:\n",
+    "        infer_wave = _t2_load()\n",
+    "        wave, _ = librosa.load(audio, sr=SR, mono=True)\n",
+    "        emos, cat5, vad3 = infer_wave(wave, target_emotion)\n",
+    "        cat_dict = {e: float(cat5[i]) for i, e in enumerate(EMOTIONS5)}\n",
+    "        perceived = EMOTIONS5[int(np.argmax(cat5))]\n",
+    "        if target_emotion:\n",
+    "            match = \"✅ **MATCHES** target\" if perceived == norm_emotion(target_emotion) else \"⚠️ **DIFFERS from** target\"\n",
+    "            band = \"🟢 good\" if emos >= 4 else (\"🟡 fair\" if emos >= 3 else \"🔴 weak\")\n",
+    "            verdict = (f\"### Expressiveness verdict\\n\"\n",
+    "                       f\"- Perceived emotion: **{perceived}** → {match} (`{target_emotion}`)\\n\"\n",
+    "                       f\"- EMOS = **{emos:.2f}/5** → expressiveness {band}\")\n",
+    "        else:\n",
+    "            verdict = (f\"### Expressiveness verdict\\n- Perceived emotion: **{perceived}**\\n\"\n",
+    "                       f\"- *(Select a target emotion to enable EMOS, the match with the intended emotion)*\")\n",
+    "            emos = None\n",
+    "        return verdict, (round(emos, 3) if emos is not None else None), cat_dict, \\\n",
+    "            round(float(vad3[0]), 3), round(float(vad3[1]), 3), round(float(vad3[2]), 3)\n",
+    "    except Exception as e:\n",
+    "        return f\"### ❌ Track 2 error\\n`{repr(e)}`\\n\\nHave you added the exp08 checkpoint via Add Input and enabled Internet On?\", None, {}, None, None, None"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "81da6580",
+   "metadata": {},
+   "source": [
+    "## 5. Combined Gradio UI: 3 tabs + launch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "418857af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gradio as gr\n",
+    "\n",
+    "INTRO = (\n",
+    "    \"# 🎙️ VoiceMOS Challenge 2026: 3-Track Demo\\n\"\n",
+    "    \"One link for all 3 tracks. Each tab takes audio → returns the automatic scorer's scores.\\n\\n\"\n",
+    "    \"| Track | Task | Output |\\n|---|---|---|\\n\"\n",
+    "    \"| **1** | Speech Enhancement | ACR (quality) · CCR (pairwise comparison) |\\n\"\n",
+    "    \"| **2** | Emotional TTS | EMOS · CAT · VAD (5 emotion columns); *best model: exp08* |\\n\"\n",
+    "    \"| **3** | Speaker/Accent | spk_sim · acc_sim |\\n\\n\"\n",
+    "    \"> Each model loads **the first time its button is clicked** (allow ~1–2 minutes for downloads). A tab with a missing checkpoint reports the error only inside that tab.\"\n",
+    ")\n",
+    "\n",
+    "with gr.Blocks(title=\"VMC2026 — Demo 3 Track\") as demo:\n",
+    "    gr.Markdown(INTRO)\n",
+    "\n",
+    "    with gr.Tab(\"1️⃣ Track 1 · Quality (ACR/CCR)\"):\n",
+    "        gr.Markdown(\"Upload **Audio A** → ACR (1–5). Also upload **Audio B** → CCR (A vs B, −3..+3, >0 = A is better).\")\n",
+    "        t1a = gr.Audio(type=\"filepath\", label=\"Audio A (required)\")\n",
+    "        t1b = gr.Audio(type=\"filepath\", label=\"Audio B (optional, used to compute CCR)\")\n",
+    "        t1out = gr.Markdown()\n",
+    "        gr.Button(\"Predict\", variant=\"primary\").click(t1_predict, [t1a, t1b], t1out)\n",
+    "\n",
+    "    with gr.Tab(\"2️⃣ Track 2 · Emotion (EMOS/CAT/VAD)\"):\n",
+    "        gr.Markdown(\"Best model: **exp08** (fine-tuned WavLM + audeering, offline). \"\n",
+    "                    \"Select a **target emotion** to enable EMOS (match with the intended emotion).\")\n",
+    "        with gr.Row():\n",
+    "            with gr.Column(scale=1):\n",
+    "                t2a = gr.Audio(type=\"filepath\", label=\"Audio (TTS voice)\")\n",
+    "                t2tgt = gr.Dropdown(EMOTIONS5, label=\"🎯 Target emotion (for EMOS)\")\n",
+    "                t2btn = gr.Button(\"Score emotion\", variant=\"primary\")\n",
+    "            with gr.Column(scale=2):\n",
+    "                t2verdict = gr.Markdown()\n",
+    "                t2emos = gr.Number(label=\"EMOS: match with target emotion (1–5)\", interactive=False)\n",
+    "                t2cat = gr.Label(label=\"CAT: perceived-emotion distribution (5 classes)\")\n",
+    "                gr.Markdown(\"**VAD: continuous emotion coordinates (1–5):**\")\n",
+    "                with gr.Row():\n",
+    "                    t2val = gr.Number(label=\"Valence (positive↑)\", interactive=False)\n",
+    "                    t2aro = gr.Number(label=\"Arousal (excited↑)\", interactive=False)\n",
+    "                    t2dom = gr.Number(label=\"Dominance (dominant↑)\", interactive=False)\n",
+    "        t2btn.click(t2_predict, [t2a, t2tgt], [t2verdict, t2emos, t2cat, t2val, t2aro, t2dom])\n",
+    "\n",
+    "    with gr.Tab(\"3️⃣ Track 3 · Speaker/Accent\"):\n",
+    "        gr.Markdown(\"Upload the **audio to evaluate** + the **reference audio** → speaker & accent similarity (1–5).\")\n",
+    "        t3t = gr.Audio(type=\"filepath\", label=\"Audio to evaluate (test)\")\n",
+    "        t3r = gr.Audio(type=\"filepath\", label=\"Reference audio\")\n",
+    "        t3out = gr.Markdown()\n",
+    "        gr.Button(\"Predict\", variant=\"primary\").click(t3_predict, [t3t, t3r], t3out)\n",
+    "\n",
+    "demo.launch(share=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b04c3d5e",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Lazy-load:** each `_tN_load()` loads its model once and caches it at module level → a tab that is never used costs no RAM/VRAM.\n",
+    "- Track 1 needs URGENT-MOS (GitHub + HuggingFace); Track 3 clones the baseline repo (checkpoints included); Track 2 needs\n",
+    "  the exp08 checkpoint (`ft_emotion_full_20epoch.pt`, slug `toanminh222/cache-exp8`) + downloads of WavLM/SAILER/audeering.\n",
+    "- The Track 2 constants `TRUNK_HIDDEN/HEAD_HIDDEN/EMO_MAX_SEC` MUST match exp08 (the checkpoint does not store them); wrong values cause shape mismatches.\n",
+    "- The 3 tabs are independent: a missing checkpoint or missing Internet access for one track reports an error only in that tab; the other 2 tabs keep working.\n",
+    "- Requires **GPU T4 + Internet On**. The full Track 2-only version (with an internal validation-metrics tab) is at `track2/demo_track2_emotion_gradio`."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

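The notebook post-processes raw model outputs in three ways: it clamps ACR to 1..5 and CCR to −3..+3, it undoes z-score standardization using the `emos_mu`/`emos_sd` values stored in the checkpoint, and it rescales the audeering head outputs with `1 + 4 * x` while reordering them. The sketch below restates those steps in plain Python. The helper names are hypothetical, and the assumption that the audeering head emits (arousal, dominance, valence) in roughly 0..1 is inferred from the index order used in `audeering_feat`, not stated in the source.

```python
from typing import List, Sequence


def clamp(x: float, lo: float, hi: float) -> float:
    """Limit a raw prediction to the rating scale [lo, hi]."""
    return max(lo, min(hi, x))


def destandardize(z: float, mu: float, sd: float) -> float:
    """Invert z-score standardization: z = (y - mu) / sd  =>  y = z * sd + mu."""
    return z * sd + mu


def audeering_to_vad_1_5(out: Sequence[float]) -> List[float]:
    """Map head outputs to a 1..5 scale, reordered as (valence, arousal, dominance).

    Assumes `out` is ordered (arousal, dominance, valence) with values near 0..1.
    """
    arousal, dominance, valence = out
    return [1 + 4 * valence, 1 + 4 * arousal, 1 + 4 * dominance]
```

For instance, `clamp(5.7, 1.0, 5.0)` gives `5.0`, and a standardized EMOS of `0.5` with `mu=3.0`, `sd=0.8` maps back to about `3.4`.
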
demo_all_tracks_gradio_pipeline.py ADDED

	@@ -0,0 +1,390 @@

+# %% [markdown]
+# # VMC2026: Combined Gradio demo for all 3 tracks (one link for the mentor)
+#
+# Merges the 3 separate demos (`track1/`, `track2/`, `track3/`) into **one Gradio app with 3 tabs**:
+# - **Track 1** · Speech Enhancement → **ACR** (absolute quality of A) + **CCR** (A compared with B). Model: URGENT-MOS.
+# - **Track 2** · Emotional TTS → **EMOS / CAT / VAD**. BEST model = **exp08** (fine-tuned WavLM + audeering).
+# - **Track 3** · Speaker/Accent → **spk_sim / acc_sim**. Model: fine-tuned ECAPA (the organizers' baseline).
+#
+# > **Lazy-load:** each track loads its model only when you click that tab's predict button, so a tab with a missing checkpoint/repo
+# > reports an error only inside that tab and does NOT crash the whole app. Tracks 1 & 3 need only Internet access; Track 2 also needs the exp08 checkpoint.
+#
+# ### How to run on Kaggle
+# 1. Settings → **GPU T4 + Internet On**.
+# 2. (For Track 2) Add Input: the Track 2 dataset (`sets/train.csv`, `wav/`, `metadata.csv`) + the dataset containing
+#    `ft_emotion_full_20epoch.pt` (slug `toanminh222/cache-exp8`). If these are missing, the other 2 tabs still work.
+# 3. **Run All** → the last cell prints a `*.gradio.live` link (valid for ~72h) → send it to the mentor.
+# %% [markdown]
+# ## 1. Install packages (once for all 3 tracks)
+# %%
+# !pip install -q gradio librosa soundfile speechbrain torchaudio loralib scipy scikit-learn pandas tqdm
+import os, sys, glob, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=False)
+# Lightweight install (Kaggle already ships torch/transformers/numpy → do NOT touch numpy, to avoid an ABI mismatch)
+pip_install("gradio", "librosa", "soundfile", "speechbrain", "torchaudio",
+            "loralib", "scipy", "scikit-learn", "pandas", "tqdm")
+import librosa
+import numpy as np
+import torch
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+SR = 16000
+print("Device:", DEVICE, ("✅ " + torch.cuda.get_device_name(0)) if DEVICE == "cuda" else "⚠️ CPU (slow)")
+def _stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+def _scalar(x):
+    return float(x.item()) if hasattr(x, "item") else float(x)
+# %% [markdown]
+# ## 2. TRACK 1 — URGENT-MOS (ACR + CCR) · lazy-load
+# %%
+URGENT_REPO = "/kaggle/working/URGENT-MOS"
+URGENT_CKPT = "urgent-challenge/urgent-mos-f1c1m5dcorpus"   # downloaded automatically from HuggingFace
+_T1 = {}
+def _t1_load():
+    """Load URGENT-MOS once (clone repo + sys.path + checkpoint)."""
+    if "m" in _T1:
+        return _T1["m"]
+    if not os.path.isdir(URGENT_REPO):
+        subprocess.run(f"git clone -q https://github.com/vvwangvv/URGENT-MOS.git {URGENT_REPO}",
+                       shell=True, check=True)
+    if URGENT_REPO not in sys.path:
+        sys.path.insert(0, URGENT_REPO)
+    import importlib
+    importlib.invalidate_caches()
+    try:
+        importlib.import_module("urgent_mos.api.infer")
+    except Exception:
+        subprocess.run(f"pip install -q -e {URGENT_REPO}", shell=True, check=False)
+        importlib.invalidate_caches()
+    from urgent_mos.utils import load_model_from_checkpoint
+    m = load_model_from_checkpoint(URGENT_CKPT, DEVICE)
+    m.eval()
+    _T1["m"] = m
+    return m
+def t1_predict(audio_a, audio_b):
+    if not audio_a:
+        return "⚠️ Please upload at least **Audio A**."
+    try:
+        m = _t1_load()
+        from urgent_mos.api.infer import infer, infer_pairs
+        wa = torch.from_numpy(librosa.load(audio_a, sr=SR, mono=True)[0]).float()
+        acr_a = max(1.0, min(5.0, _scalar(infer(m, [wa], sample_rate=[SR],
+                                                batch_frames=None, num_workers=0)[0]["mos_overall"])))
+        out = f"**ACR (Audio A): {acr_a:.3f}**  (absolute quality, scale 1–5)"
+        if audio_b:
+            wb = torch.from_numpy(librosa.load(audio_b, sr=SR, mono=True)[0]).float()
+            acr_b = max(1.0, min(5.0, _scalar(infer(m, [wb], sample_rate=[SR],
+                                                    batch_frames=None, num_workers=0)[0]["mos_overall"])))
+            ccr = max(-3.0, min(3.0, _scalar(infer_pairs(m, [(wa, wb)], sample_rate=[(SR, SR)],
+                                                         batch_frames=None, num_workers=0)[0]["mos_overall"])))
+            out += (f"\n\n**ACR (Audio B): {acr_b:.3f}**"
+                    f"\n\n**CCR (A vs. B): {ccr:+.3f}**  (>0: A is better than B; scale −3..+3)")
+        return out
+    except Exception as e:
+        return f"❌ Track 1 error: `{repr(e)}`\n\nCheck that **Internet On** is enabled (URGENT-MOS must be downloaded from GitHub/HuggingFace)."
+# %% [markdown]
+# ## 3. TRACK 3 — ECAPA fine-tuned (spk_sim + acc_sim) · lazy-load
+# %%
+T3_REPO = "/kaggle/working/vmc2026-baselines/track3"
+CKPT_SPK = f"{T3_REPO}/official-egs/spk_sim_adamw_lr1e-3/model_spk_sim_step20000.pt"
+CKPT_ACC = f"{T3_REPO}/official-egs/acc_sim_adamw_lr1e-3/model_acc_sim_step20000.pt"
+_T3 = {}
+def _t3_load():
+    if "spk" in _T3:
+        return _T3
+    repo_root = "/kaggle/working/vmc2026-baselines"
+    if not os.path.isdir(repo_root):
+        subprocess.run(f"git clone -q https://github.com/voicemos-challenge/vmc2026-baselines.git {repo_root}",
+                       shell=True, check=True)
+    if T3_REPO not in sys.path:
+        sys.path.insert(0, T3_REPO)
+    from model import Model
+    spk = Model(mlp_heads=["spk_sim"])
+    spk.load_state_dict(torch.load(CKPT_SPK, map_location="cpu"))
+    acc = Model(mlp_heads=["acc_sim"])
+    acc.load_state_dict(torch.load(CKPT_ACC, map_location="cpu"))
+    _T3.update(spk=spk.to(DEVICE).eval(), acc=acc.to(DEVICE).eval())
+    return _T3
+def t3_predict(audio_test, audio_ref):
+    if not audio_test or not audio_ref:
+        return "⚠️ **Both files** are required: test audio + reference audio."
+    try:
+        M = _t3_load()
+        ta = torch.from_numpy(librosa.load(audio_test, sr=SR, mono=True)[0]).float().unsqueeze(0).to(DEVICE)
+        tb = torch.from_numpy(librosa.load(audio_ref, sr=SR, mono=True)[0]).float().unsqueeze(0).to(DEVICE)
+        with torch.no_grad():
+            o_spk = M["spk"](ta, tb)
+            spk = float(o_spk["spk_sim"].item())
+            acc = float(M["acc"](ta, tb)["acc_sim"].item())
+            cos = float(o_spk["cos_sim"].item())
+        return (f"**Speaker similarity: {spk:.3f}**  (1–5)\n\n"
+                f"**Accent similarity : {acc:.3f}**  (1–5)\n\n"
+                f"Zero-shot cosine (for reference): {cos:.3f}")
+    except Exception as e:
+        return f"❌ Track 3 error: `{repr(e)}`\n\nCheck that **Internet** is **On** (the baseline repo containing the checkpoints must be cloned)."
+# %% [markdown]
+# ## 4. TRACK 2 — exp08 Emotional TTS Evaluator (EMOS/CAT/VAD) · lazy-load
+#
+# BEST model: fine-tuned WavLM (warm-started from SAILER) + frozen audeering → trunk → 3 heads.
+# The architecture constants MUST match exp08 (the checkpoint does not store these values).
+# %%
+EMO_MAX_SEC, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, USE_AMP = 8, 512, 128, 0.3, True
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry", "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral", "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def _t2_find_ckpt():
+    for pat in ["ft_emotion_full_20epoch*.pt", "ft_emotion_full*.pt"]:
+        for base in ["/kaggle/input", "/kaggle/working"]:
+            hits = sorted(glob.glob(os.path.join(base, "**", pat), recursive=True))
+            if hits:
+                return hits[0]
+    return ""
+_T2 = {}
+def _t2_load():
+    if "infer" in _T2:
+        return _T2["infer"]
+    import torch.nn as nn
+    ckpt_path = _t2_find_ckpt()
+    assert ckpt_path, "ft_emotion_full*.pt not found. Did you Add Input the exp08 checkpoint dataset (slug toanminh222/cache-exp8)?"
+    # SAILER code, used to build the WavLM backbone
+    repo = "/kaggle/working/vox-profile-release"
+    if not os.path.exists(repo):
+        subprocess.run(["git", "clone", "--depth", "1",
+                        "https://github.com/tiantiaf0627/vox-profile-release.git", repo], check=True)
+    if repo not in sys.path:
+        sys.path.insert(0, repo)
+    ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)
+    assert "wavlm" in ckpt and "heads" in ckpt, "Checkpoint is missing 'wavlm'/'heads' → the full ft_emotion_full_20epoch.pt is required."
+    AUD_DIM = int(ckpt.get("AUD_DIM", 0)); USE_AUDEERING = AUD_DIM > 0
+    def find_hf_backbone(module):
+        cands = []
+        for name, m in module.named_modules():
+            enc = getattr(m, "encoder", None)
+            if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                    and getattr(enc, "layers", None) is not None:
+                cands.append((name, m))
+        if not cands:
+            return None, None
+        cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+        return cands[0]
+    wavlm = None
+    try:
+        from src.model.emotion.wavlm_emotion import WavLMWrapper
+        _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+        _name, wavlm = find_hf_backbone(_wrapper)
+    except Exception as e:
+        print("⚠️ SAILER wrapper failed:", repr(e), "→ falling back to vanilla WavLM.")
+    if wavlm is None:
+        from transformers import WavLMModel
+        wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    wavlm = wavlm.to(DEVICE).eval()
+    WAVLM_DIM = int(wavlm.config.hidden_size)
+    wavlm.config.layerdrop = 0.0
+    wavlm.load_state_dict(ckpt["wavlm"], strict=False)
+    def masked_mean(hidden, attn_mask):
+        if attn_mask is None:
+            return hidden.mean(dim=1)
+        try:
+            fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+        except Exception:
+            return hidden.mean(dim=1)
+        fm = fm.unsqueeze(-1).to(hidden.dtype)
+        return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+    # frozen audeering (only if the checkpoint uses it)
+    aud_backbone = aud_head = aud_proc = None
+    if USE_AUDEERING:
+        from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+        from huggingface_hub import hf_hub_download
+        AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+        aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+        aud_backbone = Wav2Vec2Model(Wav2Vec2Config.from_pretrained(AUD_NAME))
+        try:
+            _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+                hf_hub_download(AUD_NAME, "model.safetensors"))
+        except Exception:
+            _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+        bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+        aud_backbone.load_state_dict(bb_sd, strict=False)
+        _hid = _sd["classifier.dense.weight"].shape[0]
+        aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(),
+                                 nn.Linear(_hid, _sd["classifier.out_proj.weight"].shape[0]))
+        aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+        aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+        aud_backbone = aud_backbone.to(DEVICE).eval(); aud_head = aud_head.to(DEVICE).eval()
+    @torch.no_grad()
+    def audeering_feat(wave):
+        x = aud_proc(wave, sampling_rate=SR).input_values[0]
+        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(DEVICE)
+        h = aud_backbone(x)[0].mean(dim=1)
+        out = aud_head(h)[0].cpu().numpy()
+        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)
+        return np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+    N_EMO = len(EMOTIONS5)
+    TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)
+    class EmoHeads(nn.Module):
+        def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+            super().__init__()
+            self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                       nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+            self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+            self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+            self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+        def forward(self, feat, tgt):
+            h = self.trunk(feat)
+            return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+    heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(DEVICE).eval()
+    heads.load_state_dict(ckpt["heads"], strict=False)
+    emos_mu, emos_sd = float(ckpt["emos_mu"]), float(ckpt["emos_sd"])
+    vad_mu = np.asarray(ckpt["vad_mu"], dtype=np.float32); vad_sd = np.asarray(ckpt["vad_sd"], dtype=np.float32)
+    def onehot_target(tgt):
+        v = np.zeros(N_EMO, dtype=np.float32)
+        if tgt in EMOTIONS5:
+            v[EMOTIONS5.index(tgt)] = 1.0
+        return v
+    @torch.no_grad()
+    def infer_wave(wave, target_emotion):
+        wave = wave[: EMO_MAX_SEC * SR].astype(np.float32)
+        iv = torch.from_numpy(wave).unsqueeze(0).to(DEVICE)
+        am = torch.ones((1, len(wave)), dtype=torch.long, device=DEVICE)
+        tgt = torch.from_numpy(onehot_target(norm_emotion(target_emotion) if target_emotion else None)).unsqueeze(0).to(DEVICE)
+        with torch.cuda.amp.autocast(enabled=USE_AMP and DEVICE == "cuda"):
+            fw = wavlm(iv, attention_mask=am).last_hidden_state
+            fw = masked_mean(fw, am)
+            if USE_AUDEERING:
+                fw = torch.cat([fw, torch.from_numpy(audeering_feat(wave)).unsqueeze(0).to(DEVICE)], dim=1)
+            emos_p, cat_l, vad_p = heads(fw, tgt)
+        emos = float(emos_p.item()) * emos_sd + emos_mu
+        cat5 = torch.softmax(cat_l, 1)[0].float().cpu().numpy()
+        vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+        return emos, cat5, vad3
+    print(f"✅ Track 2 exp08 loaded (audeering {'ON' if USE_AUDEERING else 'OFF'}) from {ckpt_path}")
+    _T2["infer"] = infer_wave
+    return infer_wave
+def t2_predict(audio, target_emotion):
+    """Returns: verdict (md), EMOS (number), CAT (label dict), VAL, ARO, DOM."""
+    if not audio:
+        return "### ⚠️ Please upload an audio file (TTS voice).", None, {}, None, None, None
+    try:
+        infer_wave = _t2_load()
+        wave, _ = librosa.load(audio, sr=SR, mono=True)
+        emos, cat5, vad3 = infer_wave(wave, target_emotion)
+        cat_dict = {e: float(cat5[i]) for i, e in enumerate(EMOTIONS5)}
+        perceived = EMOTIONS5[int(np.argmax(cat5))]
+        if target_emotion:
+            match = "✅ **MATCHES** target" if perceived == norm_emotion(target_emotion) else "⚠️ **DIFFERS from** target"
+            band = "🟢 good" if emos >= 4 else ("🟡 fair" if emos >= 3 else "🔴 weak")
+            verdict = (f"### Expressiveness verdict\n"
+                       f"- Perceived emotion: **{perceived}** → {match} (`{target_emotion}`)\n"
+                       f"- EMOS = **{emos:.2f}/5** → {band} expressiveness")
+        else:
+            verdict = (f"### Expressiveness verdict\n- Perceived emotion: **{perceived}**\n"
+                       f"- *(Select a target emotion to enable EMOS, i.e. how well the audio matches the intended emotion)*")
+            emos = None
+        return verdict, (round(emos, 3) if emos is not None else None), cat_dict, \
+            round(float(vad3[0]), 3), round(float(vad3[1]), 3), round(float(vad3[2]), 3)
+    except Exception as e:
+        return f"### ❌ Track 2 error\n`{repr(e)}`\n\nHave you added the exp08 checkpoint via Add Input and turned Internet On?", None, {}, None, None, None
+# %% [markdown]
+# ## 5. COMBINED Gradio interface: 3 tabs + launch
+# %%
+import gradio as gr
+INTRO = (
+    "# 🎙️ VoiceMOS Challenge 2026: 3-Track Demo\n"
+    "One link for all 3 tracks. Each tab takes audio and returns the scores from the automatic scorer.\n\n"
+    "| Track | Task | Output |\n|---|---|---|\n"
+    "| **1** | Speech Enhancement | ACR (quality) · CCR (pairwise comparison) |\n"
+    "| **2** | Emotional TTS | EMOS · CAT · VAD (5 emotion columns), *best model: exp08* |\n"
+    "| **3** | Speaker/Accent | spk_sim · acc_sim |\n\n"
+    "> Models load **on the first button click** (allow ~1–2 minutes for downloads). A tab with a missing checkpoint reports the error only in that tab."
+)
+with gr.Blocks(title="VMC2026: 3-Track Demo") as demo:
+    gr.Markdown(INTRO)
+    with gr.Tab("1️⃣ Track 1 · Quality (ACR/CCR)"):
+        gr.Markdown("Upload **Audio A** → ACR (1–5). Also upload **Audio B** → CCR (A vs. B, −3..+3, >0 = A is better).")
+        t1a = gr.Audio(type="filepath", label="Audio A (required)")
+        t1b = gr.Audio(type="filepath", label="Audio B (optional, used to compute CCR)")
+        t1out = gr.Markdown()
+        gr.Button("Predict", variant="primary").click(t1_predict, [t1a, t1b], t1out)
+    with gr.Tab("2️⃣ Track 2 · Emotion (EMOS/CAT/VAD)"):
+        gr.Markdown("Best model **exp08** (fine-tuned WavLM + audeering, offline). "
+                    "Select a **target emotion** to enable EMOS (match with the intended emotion).")
+        with gr.Row():
+            with gr.Column(scale=1):
+                t2a = gr.Audio(type="filepath", label="Audio (TTS voice)")
+                t2tgt = gr.Dropdown(EMOTIONS5, label="🎯 Target emotion (for EMOS)")
+                t2btn = gr.Button("Score emotion", variant="primary")
+            with gr.Column(scale=2):
+                t2verdict = gr.Markdown()
+                t2emos = gr.Number(label="EMOS: match with the target emotion (1–5)", interactive=False)
+                t2cat = gr.Label(label="CAT: perceived-emotion distribution (5 classes)")
+                gr.Markdown("**VAD: continuous emotion coordinates (1–5):**")
+                with gr.Row():
+                    t2val = gr.Number(label="Valence (positivity↑)", interactive=False)
+                    t2aro = gr.Number(label="Arousal (activation↑)", interactive=False)
+                    t2dom = gr.Number(label="Dominance (control↑)", interactive=False)
+        t2btn.click(t2_predict, [t2a, t2tgt], [t2verdict, t2emos, t2cat, t2val, t2aro, t2dom])
+    with gr.Tab("3️⃣ Track 3 · Speaker/Accent"):
+        gr.Markdown("Upload the **audio to evaluate** + the **reference audio** → speaker & accent similarity (1–5).")
+        t3t = gr.Audio(type="filepath", label="Audio to evaluate (test)")
+        t3r = gr.Audio(type="filepath", label="Reference audio")
+        t3out = gr.Markdown()
+        gr.Button("Predict", variant="primary").click(t3_predict, [t3t, t3r], t3out)
+demo.launch(share=True)
+# %% [markdown]
+# ## Notes
+# - **Lazy-load:** each `_tN_load()` loads its model once and caches it at module level → a tab you never click costs no RAM/VRAM.
+# - Track 1 needs URGENT-MOS (GitHub + Hugging Face); Track 3 clones the baseline repo (checkpoints included); Track 2 needs
+#   the exp08 checkpoint (`ft_emotion_full_20epoch.pt`, slug `toanminh222/cache-exp8`) + downloads of WavLM/SAILER/audeering.
+# - The Track 2 constants `TRUNK_HIDDEN/HEAD_HIDDEN/EMO_MAX_SEC` MUST match exp08 (the ckpt does not store them); a mismatch causes shape errors.
+# - The 3 tabs are independent: a missing checkpoint/Internet for one track reports an error only in that tab; the other 2 tabs still work.
+# - Requires **GPU T4 + Internet On**. The full Track 2-only version (with an internal validation-metrics tab) is in `track2/demo_track2_emotion_gradio`.

demo_run_from_hf.ipynb ADDED

	@@ -0,0 +1,119 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "50fca144",
+   "metadata": {},
+   "source": [
+    "# VMC2026: Run the Gradio demo on KAGGLE by PULLING the UI code from Hugging Face\n",
+    "\n",
+    "Strategy: **HF = where the UI code lives** (Space `tranminhtoan140601/voicemos2026-demo`),\n",
+    "**Kaggle = where it runs** (free T4 GPU). This notebook downloads `app.py` from the Space and runs it →\n",
+    "it prints a `*.gradio.live` link (alive ~72h) to send to the mentor. NO paid HF GPU is used.\n",
+    "\n",
+    "`app.py` detects its environment: on Kaggle → `share=True` (public link); the Track 2 checkpoint\n",
+    "is downloaded automatically from the HF Models repo `tranminhtoan140601/voicemos2026-track2-emotion`.\n",
+    "\n",
+    "### How to run\n",
+    "1. Settings → **GPU T4 + Internet On**.\n",
+    "2. **Run All** → the last cell prints the `*.gradio.live` link."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "53fd0798",
+   "metadata": {},
+   "source": [
+    "## 1. Install deps (matching the Space); do NOT touch Kaggle's preinstalled numpy/torch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "59468886",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q gradio==6.17.3 huggingface_hub librosa soundfile speechbrain loralib scipy scikit-learn pandas\n",
+    "\n",
+    "import subprocess, sys\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=False)\n",
+    "\n",
+    "pip_install(\"gradio==6.17.3\", \"huggingface_hub\", \"librosa\", \"soundfile\",\n",
+    "            \"speechbrain\", \"loralib\", \"scipy\", \"scikit-learn\", \"pandas\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1feff99f",
+   "metadata": {},
+   "source": [
+    "## 2. Kéo code UI (app.py) từ HF Space về Kaggle"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca69f24e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from huggingface_hub import snapshot_download\n",
+    "\n",
+    "SPACE_REPO = \"tranminhtoan140601/voicemos2026-demo\"\n",
+    "LOCAL_DIR = \"/kaggle/working/vmc_demo\"\n",
+    "\n",
+    "# Download the entire Space repo (app.py + requirements + README) to local disk\n",
+    "snapshot_download(repo_id=SPACE_REPO, repo_type=\"space\", local_dir=LOCAL_DIR)\n",
+    "print(\"✅ Pulled the Space to:\", LOCAL_DIR)\n",
+    "print(\"Files:\", os.listdir(LOCAL_DIR))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "613d7d61",
+   "metadata": {},
+   "source": [
+    "## 3. Run app.py (Kaggle has a GPU → fast; app.py sets share=True itself and prints a gradio.live link)\n",
+    "\n",
+    "`app.py` downloads the Track 2 checkpoint from the HF Models repo and clones URGENT-MOS/SAILER/baseline when a button is clicked.\n",
+    "This cell **runs indefinitely** (Gradio server); wait for the line `Running on public URL: https://....gradio.live`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8e779da3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Run as a subprocess to keep the logs; SPACE_ID is NOT set, so app.py enables share=True itself\n",
+    "subprocess.run([sys.executable, \"app.py\"], cwd=LOCAL_DIR, check=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "288e61cb",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- This is the \"one code source, run where the GPU is free\" approach: to change the UI, edit it on the HF Space → rerun this notebook.\n",
+    "- To run the local version in `kaggle_baseline/demo_all_tracks_gradio` (inline code), use that notebook instead.\n",
+    "- The first button click on each track downloads its model (WavLM/SAILER/URGENT-MOS/ECAPA) → expect a short wait; Kaggle has a GPU, so inference is fast.\n",
+    "- Requires **Internet On** (to download the HF code + models) + **GPU T4**."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

demo_run_from_hf_pipeline.py ADDED

	@@ -0,0 +1,59 @@

+# %% [markdown]
+# # VMC2026: Run the Gradio demo on KAGGLE by PULLING the UI code from Hugging Face
+#
+# Strategy: **HF = where the UI code lives** (Space `tranminhtoan140601/voicemos2026-demo`),
+# **Kaggle = where it runs** (free T4 GPU). This notebook downloads `app.py` from the Space and runs it →
+# it prints a `*.gradio.live` link (alive ~72h) to send to the mentor. NO paid HF GPU is used.
+#
+# `app.py` detects its environment: on Kaggle → `share=True` (public link); the Track 2 checkpoint
+# is downloaded automatically from the HF Models repo `tranminhtoan140601/voicemos2026-track2-emotion`.
+#
+# ### How to run
+# 1. Settings → **GPU T4 + Internet On**.
+# 2. **Run All** → the last cell prints the `*.gradio.live` link.
+# %% [markdown]
+# ## 1. Install deps (matching the Space); do NOT touch Kaggle's preinstalled numpy/torch
+# %%
+# !pip install -q gradio==6.17.3 huggingface_hub librosa soundfile speechbrain loralib scipy scikit-learn pandas
+import subprocess, sys
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=False)
+pip_install("gradio==6.17.3", "huggingface_hub", "librosa", "soundfile",
+            "speechbrain", "loralib", "scipy", "scikit-learn", "pandas")
+# %% [markdown]
+# ## 2. Kéo code UI (app.py) từ HF Space về Kaggle
+# %%
+import os
+from huggingface_hub import snapshot_download
+SPACE_REPO = "tranminhtoan140601/voicemos2026-demo"
+LOCAL_DIR = "/kaggle/working/vmc_demo"
+# Download the entire Space repo (app.py + requirements + README) to local disk
+snapshot_download(repo_id=SPACE_REPO, repo_type="space", local_dir=LOCAL_DIR)
+print("✅ Pulled the Space to:", LOCAL_DIR)
+print("Files:", os.listdir(LOCAL_DIR))
+# %% [markdown]
+# ## 3. Run app.py (Kaggle has a GPU → fast; app.py sets share=True itself and prints a gradio.live link)
+#
+# `app.py` downloads the Track 2 checkpoint from the HF Models repo and clones URGENT-MOS/SAILER/baseline when a button is clicked.
+# This cell **runs indefinitely** (Gradio server); wait for the line `Running on public URL: https://....gradio.live`.
+# %%
+# Run as a subprocess to keep the logs; SPACE_ID is NOT set, so app.py enables share=True itself
+subprocess.run([sys.executable, "app.py"], cwd=LOCAL_DIR, check=True)
+# %% [markdown]
+# ## Notes
+# - This is the "one code source, run where the GPU is free" approach: to change the UI, edit it on the HF Space → rerun this notebook.
+# - To run the local version in `kaggle_baseline/demo_all_tracks_gradio` (inline code), use that notebook instead.
+# - The first button click on each track downloads its model (WavLM/SAILER/URGENT-MOS/ECAPA) → expect a short wait; Kaggle has a GPU, so inference is fast.
+# - Requires **Internet On** (to download the HF code + models) + **GPU T4**.
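
The notebook states that `app.py` turns on `share=True` when `SPACE_ID` is absent, but `app.py` itself is not part of this diff. The sketch below shows one way such a check could look, under that assumption; `should_share` is a hypothetical helper, not a function from the Space.

```python
import os
from typing import Mapping, Optional


def should_share(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the process does not appear to run inside a HF Space.

    Assumption: Hugging Face Spaces export a non-empty SPACE_ID variable,
    while a Kaggle kernel does not.
    """
    env = os.environ if env is None else env
    return not env.get("SPACE_ID")


# Hypothetical usage inside app.py:
#     demo.launch(share=should_share())
on_kaggle = should_share({})                                  # no SPACE_ID
on_space = should_share({"SPACE_ID": "user/some-space"})      # inside a Space
```

Passing the environment as an argument keeps the check testable without mutating `os.environ`.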

track1/demo_track1_gradio.ipynb ADDED

	@@ -0,0 +1,144 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 1: Gradio Demo (Speech Enhancement, ACR + CCR)\n",
+    "\n",
+    "Baseline: **URGENT-MOS**. Upload **Audio A** → **ACR** (quality, 1–5).\n",
+    "Also upload **Audio B** → **CCR** (A vs. B comparison, scale −3..+3, >0 means A is better).\n",
+    "\n",
+    "### How to use on Kaggle\n",
+    "1. Settings → **GPU T4 + Internet On**.\n",
+    "2. **Run All** → the last cell prints a `*.gradio.live` link (alive ~72h) → send it to the mentor."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install + clone URGENT-MOS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q gradio librosa soundfile\n",
+    "!git clone -q https://github.com/vvwangvv/URGENT-MOS.git /kaggle/working/URGENT-MOS\n",
+    "!pip install -q -e /kaggle/working/URGENT-MOS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load the model + prediction function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "68754b2d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, sys, subprocess, librosa\n",
+    "\n",
+    "DEVICE = \"cuda\"\n",
+    "URGENT_REPO = \"/kaggle/working/URGENT-MOS\"\n",
+    "URGENT_CKPT = \"urgent-challenge/urgent-mos-f1c1m5dcorpus\"   # downloaded automatically from Hugging Face\n",
+    "\n",
+    "\n",
+    "def _ensure_urgent_mos():\n",
+    "    \"\"\"Clone + install URGENT-MOS if it is missing (in case the install cell has not run).\"\"\"\n",
+    "    if not os.path.isdir(URGENT_REPO):\n",
+    "        subprocess.run(f\"git clone -q https://github.com/vvwangvv/URGENT-MOS.git {URGENT_REPO}\",\n",
+    "                       shell=True, check=True)\n",
+    "        subprocess.run(f\"pip install -q -e {URGENT_REPO}\", shell=True, check=True)\n",
+    "    if URGENT_REPO not in sys.path:        # the package sits at the repo root → adding it to the path makes it importable\n",
+    "        sys.path.insert(0, URGENT_REPO)\n",
+    "\n",
+    "\n",
+    "_M = {}\n",
+    "\n",
+    "def _load():\n",
+    "    if \"m\" not in _M:\n",
+    "        _ensure_urgent_mos()\n",
+    "        import torch\n",
+    "        from urgent_mos.utils import load_model_from_checkpoint\n",
+    "        dev = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "        m = load_model_from_checkpoint(URGENT_CKPT, dev)\n",
+    "        m.eval()\n",
+    "        _M[\"m\"] = m\n",
+    "    return _M[\"m\"]\n",
+    "\n",
+    "\n",
+    "def _scalar(x):\n",
+    "    return float(x.item()) if hasattr(x, \"item\") else float(x)\n",
+    "\n",
+    "\n",
+    "def predict(audio_a, audio_b):\n",
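+    "    if not audio_a:\n",
+    "        return \"⚠️ Please upload at least Audio A.\"\n",
+    "    import torch\n",
+    "    m = _load()                                          # _load() clones the repo and fixes sys.path FIRST\n",
+    "    from urgent_mos.api.infer import infer, infer_pairs  # the import is only guaranteed to work after that\n",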
+    "    wa = torch.from_numpy(librosa.load(audio_a, sr=16000, mono=True)[0]).float()\n",
+    "    acr_a = max(1.0, min(5.0, _scalar(infer(m, [wa], sample_rate=[16000],\n",
+    "                                            batch_frames=None, num_workers=0)[0][\"mos_overall\"])))\n",
+    "    out = f\"ACR (Audio A): {acr_a:.3f}   (absolute quality, scale 1–5)\"\n",
+    "    if audio_b:\n",
+    "        wb = torch.from_numpy(librosa.load(audio_b, sr=16000, mono=True)[0]).float()\n",
+    "        acr_b = max(1.0, min(5.0, _scalar(infer(m, [wb], sample_rate=[16000],\n",
+    "                                                batch_frames=None, num_workers=0)[0][\"mos_overall\"])))\n",
+    "        ccr = max(-3.0, min(3.0, _scalar(infer_pairs(m, [(wa, wb)], sample_rate=[(16000, 16000)],\n",
+    "                                                     batch_frames=None, num_workers=0)[0][\"mos_overall\"])))\n",
+    "        out += (f\"\\nACR (Audio B): {acr_b:.3f}\"\n",
+    "                f\"\\nCCR (A vs. B): {ccr:+.3f}   (>0: A is better than B; scale −3..+3)\")\n",
+    "    return out"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Gradio interface + launch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gradio as gr\n",
+    "\n",
+    "with gr.Blocks(title=\"VMC2026 Track 1 — ACR/CCR\") as demo:\n",
+    "    gr.Markdown(\"# 🎙️ Track 1 · Speech Enhancement (ACR / CCR)\\n\"\n",
+    "                \"Upload **Audio A** to get ACR. Also upload **Audio B** to get the CCR comparison (A vs. B).\")\n",
+    "    a = gr.Audio(type=\"filepath\", label=\"Audio A (required)\")\n",
+    "    b = gr.Audio(type=\"filepath\", label=\"Audio B (optional, used to compute CCR)\")\n",
+    "    out = gr.Textbox(label=\"Result\", lines=4)\n",
+    "    gr.Button(\"Predict\", variant=\"primary\").click(predict, [a, b], out)\n",
+    "\n",
+    "demo.launch(share=True)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

track1/demo_track1_gradio_pipeline.py ADDED

	@@ -0,0 +1,100 @@

+# %% [markdown]
+# # VMC2026 Track 1: Gradio Demo (Speech Enhancement, ACR + CCR)
+#
+# Baseline: **URGENT-MOS**. Upload **Audio A** → **ACR** (quality, 1–5).
+# Also upload **Audio B** → **CCR** (A vs. B comparison, scale −3..+3, >0 means A is better).
+#
+# ### How to use on Kaggle
+# 1. Settings → **GPU T4 + Internet On**.
+# 2. **Run All** → the last cell prints a `*.gradio.live` link (alive ~72h) → send it to the mentor.
+# %% [markdown]
+# ## 1. Install + clone URGENT-MOS
+# %%
+# !pip install -q gradio librosa soundfile
+# !git clone -q https://github.com/vvwangvv/URGENT-MOS.git /kaggle/working/URGENT-MOS
+# !pip install -q -e /kaggle/working/URGENT-MOS
+# %% [markdown]
+# ## 2. Load the model + prediction function
+# %%
+import os, sys, subprocess, librosa
+DEVICE = "cuda"
+URGENT_REPO = "/kaggle/working/URGENT-MOS"
+URGENT_CKPT = "urgent-challenge/urgent-mos-f1c1m5dcorpus"   # downloaded automatically from Hugging Face
+def _ensure_urgent_mos():
+    """Make sure `urgent_mos` is importable: clone the repo if missing, install deps, add it to sys.path."""
+    if not os.path.isdir(URGENT_REPO):
+        subprocess.run(f"git clone -q https://github.com/vvwangvv/URGENT-MOS.git {URGENT_REPO}",
+                       shell=True, check=True)
+    # the package sits at the repo ROOT → adding it to the path makes it importable, whether or not pip install succeeded
+    if URGENT_REPO not in sys.path:
+        sys.path.insert(0, URGENT_REPO)
+    import importlib
+    importlib.invalidate_caches()
+    # try the import; if a dependency is missing, do an editable install (pulls in torchcodec, hydra-core, omegaconf...)
+    try:
+        importlib.import_module("urgent_mos.api.infer")
+    except Exception:
+        subprocess.run(f"pip install -q -e {URGENT_REPO}", shell=True)
+        importlib.invalidate_caches()
+_M = {}
+def _load():
+    if "m" not in _M:
+        _ensure_urgent_mos()
+        import torch
+        from urgent_mos.utils import load_model_from_checkpoint
+        dev = DEVICE if torch.cuda.is_available() else "cpu"
+        m = load_model_from_checkpoint(URGENT_CKPT, dev)
+        m.eval()
+        _M["m"] = m
+    return _M["m"]
+def _scalar(x):
+    return float(x.item()) if hasattr(x, "item") else float(x)
+def predict(audio_a, audio_b):
+    if not audio_a:
+        return "⚠️ Please upload at least Audio A."
+    import torch
+    m = _load()                                          # _load() ensures the repo + sys.path FIRST
+    from urgent_mos.api.infer import infer, infer_pairs  # only now can this be imported
+    wa = torch.from_numpy(librosa.load(audio_a, sr=16000, mono=True)[0]).float()
+    acr_a = max(1.0, min(5.0, _scalar(infer(m, [wa], sample_rate=[16000],
+                                            batch_frames=None, num_workers=0)[0]["mos_overall"])))
+    out = f"ACR (Audio A): {acr_a:.3f}   (absolute quality, scale 1–5)"
+    if audio_b:
+        wb = torch.from_numpy(librosa.load(audio_b, sr=16000, mono=True)[0]).float()
+        acr_b = max(1.0, min(5.0, _scalar(infer(m, [wb], sample_rate=[16000],
+                                                batch_frames=None, num_workers=0)[0]["mos_overall"])))
+        ccr = max(-3.0, min(3.0, _scalar(infer_pairs(m, [(wa, wb)], sample_rate=[(16000, 16000)],
+                                                     batch_frames=None, num_workers=0)[0]["mos_overall"])))
+        out += (f"\nACR (Audio B): {acr_b:.3f}"
+                f"\nCCR (A vs. B): {ccr:+.3f}   (>0: A is better than B; scale −3..+3)")
+    return out
+# %% [markdown]
+# ## 3. Gradio interface + launch
+# %%
+import gradio as gr
+with gr.Blocks(title="VMC2026 Track 1 — ACR/CCR") as demo:
+    gr.Markdown("# 🎙️ Track 1 · Speech Enhancement (ACR / CCR)\n"
+                "Upload **Audio A** to get ACR. Also upload **Audio B** to get the CCR comparison (A vs. B).")
+    a = gr.Audio(type="filepath", label="Audio A (required)")
+    b = gr.Audio(type="filepath", label="Audio B (optional, used to compute CCR)")
+    out = gr.Textbox(label="Result", lines=4)
+    gr.Button("Predict", variant="primary").click(predict, [a, b], out)
+demo.launch(share=True)
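
`predict` clips each score with nested `max(..., min(...))` calls: ACR to [1, 5] and CCR to [−3, +3]. As a minimal sketch of the same clipping factored into helpers (the names `clamp`, `clip_acr` and `clip_ccr` are hypothetical and do not appear in the notebook):

```python
def clamp(value: float, low: float, high: float) -> float:
    """Limit `value` to the closed interval [low, high]."""
    return max(low, min(high, value))


def clip_acr(score: float) -> float:
    """ACR is an absolute quality score on the 1..5 scale."""
    return clamp(float(score), 1.0, 5.0)


def clip_ccr(score: float) -> float:
    """CCR is a comparison score on the -3..+3 scale (>0: A is better than B)."""
    return clamp(float(score), -3.0, 3.0)


# Raw values here are made up purely to exercise the helpers.
acr_high = clip_acr(5.7)    # above the scale, clipped to the upper bound
ccr_low = clip_ccr(-4.2)    # below the scale, clipped to the lower bound
acr_mid = clip_acr(3.25)    # already in range, returned unchanged
```

Naming the two ranges once avoids repeating the bounds at each of the three call sites in `predict`.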

track1/track1_baseline.ipynb ADDED

	@@ -0,0 +1,108 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 1 — Baseline (URGENT-MOS)\n",
+    "\n",
+    "Runs out of the box: the dev data is public on Hugging Face and the checkpoint downloads automatically.\n",
+    "\n",
+    "**Before running:** Session options → Accelerator = **GPU T4**, Internet = **On** (verify your phone number if required).\n",
+    "\n",
+    "Output: `submission_track1.zip` (contains `predictions.csv`) → submit it to Track 1."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install URGENT-MOS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!git clone -q https://github.com/vvwangvv/URGENT-MOS.git /kaggle/working/URGENT-MOS\n",
+    "!pip install -q -e /kaggle/working/URGENT-MOS"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Smoke test (10 samples)\n",
+    "Checks the environment + downloads the checkpoint `urgent-challenge/urgent-mos-f1c1m5dcorpus` from HF (the first run takes a while)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cd /kaggle/working/URGENT-MOS && python scripts/infer_vmc2026_track1.py --split dev --limit 10 --output /kaggle/working/predictions_smoke.csv\n",
+    "import pandas as pd\n",
+    "pd.read_csv('/kaggle/working/predictions_smoke.csv').head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Full inference (ACR 1008 + CCR 2520)\n",
+    "If you run out of memory (OOM): add `--batch-frames 8000` (or a smaller value)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "!cd /kaggle/working/URGENT-MOS && python scripts/infer_vmc2026_track1.py --split dev --output /kaggle/working/predictions.csv"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Validate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import pandas as pd\ndf = pd.read_csv('/kaggle/working/predictions.csv')\nassert list(df.columns) == ['sample_id', 'pred_score'], df.columns.tolist()\nacr = df[df['sample_id'].str.contains('-acr_')]\nccr = df[df['sample_id'].str.contains('-ccr_')]\nprint(f'Total {len(df)} | ACR {len(acr)} | CCR {len(ccr)}')\nprint('ACR:', acr['pred_score'].min(), '→', acr['pred_score'].max(), '(must be in [1,5])')\nprint('CCR:', ccr['pred_score'].min(), '→', ccr['pred_score'].max(), '(must be in [-3,+3])')"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Build the submission zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "!cd /kaggle/working && zip -j submission_track1.zip predictions.csv && unzip -l submission_track1.zip\nprint('Download /kaggle/working/submission_track1.zip → upload it under My Submissions (select Track 1, deselect the other tracks)')"
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
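
Illustrative sketch (not part of the commit): the "Validate" cell above checks the header and prints the score ranges, but leaves the range check to the reader. A stricter variant, written with only the standard library, could look like the following. `validate_predictions` is a hypothetical name, and the sample IDs in the demo are made up; only the `-acr_` / `-ccr_` markers, the column names, and the [1,5] / [-3,+3] ranges come from the notebook.

```python
import csv
import io


def validate_predictions(csv_text, expected_acr=None, expected_ccr=None):
    """Return a list of problems found in a predictions CSV (empty list = OK)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["sample_id", "pred_score"]:
        return [f"bad header: {reader.fieldnames}"]

    problems = []
    n_acr = n_ccr = 0
    for row in reader:
        sid = row["sample_id"]
        try:
            score = float(row["pred_score"])
        except (TypeError, ValueError):
            problems.append(f"{sid}: non-numeric score")
            continue
        if "-acr_" in sid:
            n_acr += 1
            lo, hi = 1.0, 5.0      # ACR scale from the notebook
        elif "-ccr_" in sid:
            n_ccr += 1
            lo, hi = -3.0, 3.0     # CCR scale from the notebook
        else:
            problems.append(f"{sid}: neither an ACR nor a CCR id")
            continue
        if not lo <= score <= hi:  # also rejects NaN
            problems.append(f"{sid}: score {score} outside [{lo}, {hi}]")

    if expected_acr is not None and n_acr != expected_acr:
        problems.append(f"ACR rows: {n_acr}, expected {expected_acr}")
    if expected_ccr is not None and n_ccr != expected_ccr:
        problems.append(f"CCR rows: {n_ccr}, expected {expected_ccr}")
    return problems


if __name__ == "__main__":
    demo = "sample_id,pred_score\nx-acr_1,3.5\nx-ccr_1,-1.25\n"  # made-up ids
    print(validate_predictions(demo, expected_acr=1, expected_ccr=1))
```

On the dev split the notebook expects 1008 ACR rows and 2520 CCR rows, so those would be the natural values for `expected_acr` and `expected_ccr`.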

track1/track1_baseline_pipeline.py ADDED

	@@ -0,0 +1,64 @@

+# %% [markdown]
+# # VMC2026 Track 1 — Baseline Pipeline (Kaggle)
+#
+# Baseline = **URGENT-MOS** (the pre-trained checkpoint downloads automatically from HuggingFace).
+# The repo ships a script that writes the correct submission format, so almost no extra code is needed.
+#
+# **Not blocked by licensing:** the Track 1 dev data is public on HuggingFace
+# (`urgent-challenge/vmc2026-track1-dev`, configs `acr` + `ccr`).
+#
+# **Usage:** Notebook → GPU T4 + Internet On → run the cells in order.
+# Output: `predictions.csv` (`sample_id,pred_score`) → zip → submit to Track 1.
+# ⚠️ The file inside the zip MUST be named `predictions.csv` (Track 1 guideline); do not leave it as `predictions_dev.csv`.
+# %% [markdown]
+# ## 1. Install URGENT-MOS
+# %%
+# !git clone -q https://github.com/vvwangvv/URGENT-MOS.git /kaggle/working/URGENT-MOS
+# !pip install -q -e /kaggle/working/URGENT-MOS
+# %% [markdown]
+# ## 2. Smoke test (a few samples): check the environment + download the checkpoint
+# The checkpoint `urgent-challenge/urgent-mos-f1c1m5dcorpus` downloads from HF on the first run.
+# %%
+# !cd /kaggle/working/URGENT-MOS && python scripts/infer_vmc2026_track1.py \
+#     --split dev --limit 10 --output /kaggle/working/predictions_smoke.csv
+# import pandas as pd; pd.read_csv("/kaggle/working/predictions_smoke.csv").head()
+# %% [markdown]
+# ## 3. Full inference on the dev set (ACR + CCR)
+# The script downloads the dataset from HF and runs both ACR (1008) and CCR (2520) → a single predictions file.
+# %%
+# !cd /kaggle/working/URGENT-MOS && python scripts/infer_vmc2026_track1.py \
+#     --split dev --output /kaggle/working/predictions.csv
+# If you run out of memory (OOM): add --batch-frames <N> to reduce memory use.
+# %% [markdown]
+# ## 4. Validate + build the submission zip
+# %%
+import pandas as pd
+PRED = "/kaggle/working/predictions.csv"
+df = pd.read_csv(PRED)
+assert list(df.columns) == ["sample_id", "pred_score"], f"Wrong header: {df.columns.tolist()}"
+acr = df[df["sample_id"].str.contains("-acr_")]
+ccr = df[df["sample_id"].str.contains("-ccr_")]
+print(f"Total {len(df)} rows | ACR {len(acr)} | CCR {len(ccr)}")
+print("ACR range:", acr["pred_score"].min(), "→", acr["pred_score"].max(), "(must be in [1,5])")
+print("CCR range:", ccr["pred_score"].min(), "→", ccr["pred_score"].max(), "(must be in [-3,+3])")
+# Expected on dev: ACR=1008, CCR=2520
+# %%
+# !cd /kaggle/working && zip -j submission_track1.zip predictions.csv && unzip -l submission_track1.zip
+# %% [markdown]
+# ## Notes
+# - Submitting: My Submissions → select **Track 1**, **deselect** the other tracks → upload `submission_track1.zip`.
+# - The Track 1 submission file is named **`predictions.csv`** (UNLIKE Tracks 2/3, which use `answer.txt`). The script already writes the correct columns `sample_id,pred_score`.
+# - Eval phase: switch to `--split test` (once the eval data is released on 31/7).
+# - A GPU is recommended; this is inference only, so it is light and fits on a 16 GB T4.
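
Illustrative sketch (not part of the commit): the pipeline builds the archive with `zip -j` in a shell cell. Where a shell is unavailable, roughly the same result can be produced with Python's standard `zipfile` module. `build_submission` is a hypothetical helper; forcing the archive member name to `predictions.csv` reflects the Track 1 naming rule stated above.

```python
import os
import zipfile


def build_submission(pred_path: str, zip_path: str) -> list:
    """Zip one predictions file as a flat archive and return the member names.

    The file is always stored as 'predictions.csv' (no directories), whatever
    the source file is called, mirroring `zip -j` plus the required name.
    """
    if not os.path.isfile(pred_path):
        raise FileNotFoundError(pred_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(pred_path, arcname="predictions.csv")
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "predictions_dev.csv")  # deliberately "wrong" name
        with open(src, "w", encoding="utf-8") as f:
            f.write("sample_id,pred_score\n")
        out = os.path.join(tmp, "submission_track1.zip")
        print(build_submission(src, out))
```

Listing the archive afterwards plays the same role as the `unzip -l` check in the shell version.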

track2/demo_track2_emotion_gradio.ipynb ADDED

	@@ -0,0 +1,533 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "73831f26",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: Gradio demo \"Emotional TTS Evaluator\" (BEST model = exp08)\n",
+    "\n",
+    "This demo uses the **best emotion checkpoint** (`ft_emotion_full_20epoch.pt`: fine-tuned WavLM warm-started from\n",
+    "SAILER + frozen audeering) to score the **5 emotion columns** of a single TTS audio file: **EMOS / CAT / VAL / ARO / DOM**.\n",
+    "Unlike the older demo (`demo_track2_gradio`), which uses the UTMOS+emotion2vec+Gemini baseline, this version needs NO API.\n",
+    "\n",
+    "**2 tabs:**\n",
+    "1. *Score one TTS file*: upload audio + pick the target emotion → emotional-expressiveness scores + a MATCH/MISMATCH interpretation.\n",
+    "2. *Scorer metrics*: computes UTT-SRCC (EMOS/VAD) + CAT-err on the internal val split (train.csv) → shows how reliable the scorer is.\n",
+    "\n",
+    "**Running on Kaggle:** GPU **T4** + Internet **On** → Add Input (1) the Track 2 dataset, (2) the dataset containing\n",
+    "`ft_emotion_full_20epoch.pt` → Run All → the last cell prints a `*.gradio.live` link."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d505183e",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: auto-detect DATA_ROOT + checkpoint"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20d538e6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, glob\n",
+    "\n",
+    "def find_data_root(search_root=\"/kaggle/input\"):\n",
+    "    cands = []\n",
+    "    for train_csv in glob.glob(os.path.join(search_root, \"**\", \"sets\", \"train.csv\"), recursive=True):\n",
+    "        root = os.path.dirname(os.path.dirname(train_csv))\n",
+    "        score = os.path.isdir(os.path.join(root, \"wav\")) + os.path.exists(os.path.join(root, \"metadata.csv\"))\n",
+    "        cands.append((score, root))\n",
+    "    cands.sort(reverse=True)\n",
+    "    return cands\n",
+    "\n",
+    "_cands = find_data_root(\"/kaggle/input\")\n",
+    "if _cands:\n",
+    "    print(\"🔎 DATA_ROOT candidates:\")\n",
+    "    for sc, r in _cands:\n",
+    "        print(f\"   [{sc}/2] {r}\")\n",
+    "    DATA_ROOT = _cands[0][1]\n",
+    "    print(f\"👉 Auto-selected DATA_ROOT = {DATA_ROOT}\")\n",
+    "else:\n",
+    "    DATA_ROOT = \"/kaggle/input/datasets/minhtoan2\"   # fallback\n",
+    "    print(f\"❌ sets/train.csv not found → falling back to {DATA_ROOT} (did you Add Input?)\")\n",
+    "\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"\n",
+    "\n",
+    "# ── exp08 emotion checkpoint (prefer the 20-epoch version = BEST) ─────────────────\n",
+    "CKPT_PATH = \"\"    # << \"\" = auto-detect; or point it manually at \"/kaggle/input/<slug>/ft_emotion_full_20epoch.pt\"\n",
+    "\n",
+    "def find_ckpt(explicit):\n",
+    "    if explicit and os.path.exists(explicit):\n",
+    "        return explicit\n",
+    "    pats = [\"ft_emotion_full_20epoch*.pt\", \"ft_emotion_full*.pt\"]   # prefer the 20-epoch version\n",
+    "    for pat in pats:\n",
+    "        for base in [\"/kaggle/input\", \"/kaggle/working\"]:\n",
+    "            hits = sorted(glob.glob(os.path.join(base, \"**\", pat), recursive=True))\n",
+    "            if hits:\n",
+    "                return hits[0]\n",
+    "    return \"\"\n",
+    "\n",
+    "CKPT_PATH = find_ckpt(CKPT_PATH)\n",
+    "assert CKPT_PATH, \"❌ ft_emotion_full*.pt not found. Did you Add Input the dataset containing the exp08 checkpoint?\"\n",
+    "print(\"✅ Checkpoint:\", CKPT_PATH)\n",
+    "\n",
+    "# ── Architecture constants MUST match exp08 (the ckpt does not store them) ────────────────\n",
+    "DEVICE       = \"cuda\"\n",
+    "SR           = 16000\n",
+    "EMO_MAX_SEC  = 8\n",
+    "TRUNK_HIDDEN = 512\n",
+    "HEAD_HIDDEN  = 128\n",
+    "DROPOUT      = 0.3       # has no effect in eval mode\n",
+    "USE_AMP      = True\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "# exp08 reference scores (internal val / DEV) for comparison in the metrics tab\n",
+    "EXP08 = {\"emos\": 0.811, \"cat_err\": 0.133, \"val\": 0.659, \"aro\": 0.793, \"dom\": 0.751}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "daadb3d8",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + clone the SAILER code"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6303e9cf",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"gradio\", \"loralib\", \"speechbrain\", \"librosa\", \"soundfile\",\n",
+    "            \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "233b6770",
+   "metadata": {},
+   "source": [
+    "## 2. Load the exp08 model (fine-tuned WavLM backbone + frozen audeering + heads), once"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1881d27c",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "import numpy as np\n",
+    "import librosa\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (slow)\")\n",
+    "\n",
+    "ckpt = torch.load(CKPT_PATH, map_location=\"cpu\", weights_only=False)   # ckpt holds numpy objects → needs weights_only=False\n",
+    "assert \"wavlm\" in ckpt and \"heads\" in ckpt, \"❌ Checkpoint is missing 'wavlm'/'heads' → the complete ft_emotion_full_20epoch.pt is required.\"\n",
+    "AUD_DIM = int(ckpt.get(\"AUD_DIM\", 0))\n",
+    "USE_AUDEERING = AUD_DIM > 0\n",
+    "print(\"✅ Loaded ckpt | keys:\", list(ckpt.keys()), \"| AUD_DIM:\", AUD_DIM, \"(audeering\", \"ON)\" if USE_AUDEERING else \"OFF)\")\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    _name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ Built the WavLM backbone from the SAILER wrapper at '.{_name}'\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Failed to load the SAILER wrapper:\", repr(e), \"→ falling back to plain WavLM.\")\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: microsoft/wavlm-large.\")\n",
+    "\n",
+    "wavlm = wavlm.to(device).eval()\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "wavlm.config.layerdrop = 0.0\n",
+    "_miss, _unexp = wavlm.load_state_dict(ckpt[\"wavlm\"], strict=False)\n",
+    "print(f\"🔁 load wavlm: {len(_miss)} missing / {len(_unexp)} unexpected keys (expected ~0)\")\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def wavlm_embed(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    return masked_mean(out, attn_mask)\n",
+    "\n",
+    "# ── frozen audeering (auxiliary features): built only if the ckpt uses it ──\n",
+    "aud_backbone = aud_head = aud_proc = None\n",
+    "if USE_AUDEERING:\n",
+    "    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "    from huggingface_hub import hf_hub_download\n",
+    "    AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "    aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "    try:\n",
+    "        _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "            hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "    except Exception:\n",
+    "        _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "    bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "    aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "    _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _sd[\"classifier.out_proj.weight\"].shape[0]))\n",
+    "    aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "    aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "    aud_backbone = aud_backbone.to(device).eval()\n",
+    "    aud_head = aud_head.to(device).eval()\n",
+    "    assert _hid + 3 == AUD_DIM, f\"⚠️ built AUD_DIM ({_hid+3}) ≠ ckpt ({AUD_DIM})\"\n",
+    "    print(f\"✅ audeering frozen ({AUD_DIM}-D)\")\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def audeering_feat(wave):\n",
+    "    x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "    x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "    h = aud_backbone(x)[0].mean(dim=1)\n",
+    "    out = aud_head(h)[0].cpu().numpy()\n",
+    "    vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]\n",
+    "    return np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "\n",
+    "# ── EmoHeads (matches exp08) + load weights + normalization stats from the ckpt ──\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device).eval()\n",
+    "_hm, _hu = heads.load_state_dict(ckpt[\"heads\"], strict=False)\n",
+    "print(f\"🔁 load heads: {len(_hm)} missing / {len(_hu)} unexpected keys (expected 0)\")\n",
+    "\n",
+    "emos_mu = float(ckpt[\"emos_mu\"]); emos_sd = float(ckpt[\"emos_sd\"])\n",
+    "vad_mu = np.asarray(ckpt[\"vad_mu\"], dtype=np.float32); vad_sd = np.asarray(ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "print(f\"Normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}\")\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(N_EMO, dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3c7ce7b5",
+   "metadata": {},
+   "source": [
+    "## 3. Core inference function (one numpy waveform → emos/cat5/vad3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "daf81b82",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "@torch.no_grad()\n",
+    "def infer_wave(wave, target_emotion):\n",
+    "    \"\"\"wave: numpy float32 (already 16 kHz mono). target_emotion: str or None. Returns (emos, cat5, vad3).\"\"\"\n",
+    "    wave = wave[: EMO_MAX_SEC * SR].astype(np.float32)\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(norm_emotion(target_emotion) if target_emotion else None)).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        fw = wavlm_embed(iv, am)\n",
+    "        if USE_AUDEERING:\n",
+    "            fw = torch.cat([fw, torch.from_numpy(audeering_feat(wave)).unsqueeze(0).to(device)], dim=1)\n",
+    "        emos_p, cat_l, vad_p = heads(fw, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "768a4de5",
+   "metadata": {},
+   "source": [
+    "## 4. Internal-val metric functions (UTT-SRCC + CAT-err): gauge how reliable the scorer is"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f4900fe6",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(N_EMO, dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(N_EMO, dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(N_EMO, 0.2, dtype=np.float32)\n",
+    "        for i in range(N_EMO):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "# per-wav target emotion (for EMOS), read from metadata\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    if os.path.exists(METADATA_CSV):\n",
+    "        with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "            for ln in f:\n",
+    "                parts = ln.strip().split(\"|\")\n",
+    "                if len(parts) >= 2:\n",
+    "                    tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "_target_map = None\n",
+    "_val_df = None\n",
+    "def _prep_eval():\n",
+    "    \"\"\"Lazy: read the labels + hold out a 10% internal val split (seed 42, matching exp08).\"\"\"\n",
+    "    global _target_map, _val_df\n",
+    "    if _val_df is None:\n",
+    "        _target_map = load_target_emotions()\n",
+    "        df = load_train_labels()\n",
+    "        df = df[df[\"wavID\"].map(lambda s: os.path.exists(os.path.join(WAV_DIR, s + \".wav\")))].reset_index(drop=True)\n",
+    "        _, va = train_test_split(np.arange(len(df)), test_size=0.10, random_state=42)\n",
+    "        _val_df = df.iloc[va].reset_index(drop=True)\n",
+    "    return _target_map, _val_df\n",
+    "\n",
+    "def eval_metrics(limit):\n",
+    "    tmap, vdf = _prep_eval()\n",
+    "    n = min(int(limit), len(vdf))\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for i in tqdm(range(n), desc=\"eval\"):\n",
+    "        r = vdf.iloc[i]; sid = r[\"wavID\"]\n",
+    "        wav = os.path.join(WAV_DIR, sid + \".wav\")\n",
+    "        wave, _ = librosa.load(wav, sr=SR, mono=True)\n",
+    "        emos, cat5, vad3 = infer_wave(wave, tmap.get(sid))\n",
+    "        P[\"emos\"].append(emos); Y[\"emos\"].append(float(r[\"emos\"]))\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t].append(float(vad3[j])); Y[t].append(float(r[t]))\n",
+    "        catP.append(cat5); catY.append([r[f\"cat{k}\"] for k in range(N_EMO)])\n",
+    "    rows = []\n",
+    "    for t in [\"emos\", \"val\", \"aro\", \"dom\"]:\n",
+    "        srcc = spearmanr(P[t], Y[t]).correlation\n",
+    "        rows.append([t.upper(), f\"{srcc:.4f}\", f\"{EXP08.get(t, float('nan')):.3f}\"])\n",
+    "    cat_err = float(np.abs(np.array(catP) - np.array(catY)).sum(1).mean())\n",
+    "    rows.append([\"CAT-err ↓\", f\"{cat_err:.4f}\", f\"{EXP08['cat_err']:.3f}\"])\n",
+    "    return rows"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "36b32835",
+   "metadata": {},
+   "source": [
+    "## 5. Gradio interface (2 tabs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5ea0f8f6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gradio as gr\n",
+    "\n",
+    "def ui_predict(audio, target_emotion):\n",
+    "    \"\"\"Returns: verdict(md) · EMOS(number) · CAT(label) · VAL/ARO/DOM(number).\"\"\"\n",
+    "    if not audio:\n",
+    "        return \"### ⚠️ Please upload an audio file.\", None, {}, None, None, None\n",
+    "    wave, _ = librosa.load(audio, sr=SR, mono=True)\n",
+    "    emos, cat5, vad3 = infer_wave(wave, target_emotion)\n",
+    "    cat_dict = {e: float(cat5[i]) for i, e in enumerate(EMOTIONS5)}\n",
+    "    perceived = EMOTIONS5[int(np.argmax(cat5))]\n",
+    "    if target_emotion:\n",
+    "        match = \"✅ **MATCHES** target\" if perceived == norm_emotion(target_emotion) else \"⚠️ **DIFFERS from** target\"\n",
+    "        band = \"🟢 good\" if emos >= 4 else (\"🟡 fair\" if emos >= 3 else \"🔴 weak\")\n",
+    "        verdict = (f\"### Expressiveness verdict\\n\"\n",
+    "                   f\"- Perceived emotion: **{perceived}** → {match} (`{target_emotion}`)\\n\"\n",
+    "                   f\"- EMOS = **{emos:.2f}/5** → expressiveness: {band}\")\n",
+    "    else:\n",
+    "        verdict = (f\"### Expressiveness verdict\\n\"\n",
+    "                   f\"- Perceived emotion: **{perceived}**\\n\"\n",
+    "                   f\"- *(Pick a target emotion to enable EMOS, the match with the intended emotion)*\")\n",
+    "        emos = None\n",
+    "    return verdict, (round(emos, 3) if emos is not None else None), cat_dict, \\\n",
+    "        round(float(vad3[0]), 3), round(float(vad3[1]), 3), round(float(vad3[2]), 3)\n",
+    "\n",
+    "def ui_eval(limit):\n",
+    "    return eval_metrics(limit)\n",
+    "\n",
+    "INTRO = (\n",
+    "    \"# 🎙️ Emotional TTS Evaluator — VoiceMOS 2026 Track 2\\n\"\n",
+    "    \"A scorer for the **emotional expressiveness** of TTS speech, powered by the best model (**exp08**: fine-tuned WavLM + \"\n",
+    "    \"audeering). Offline, no API needed.\\n\\n\"\n",
+    "    \"> **The 5 outputs below ARE Track 2's definition of \\\"expressive emotion\\\"**; each one answers a question:\\n\"\n",
+    "    \"> **EMOS** = does it convey the requested emotion · **CAT** = which emotion listeners perceive · \"\n",
+    "    \"**VAD** = valence / arousal / dominance.\"\n",
+    ")\n",
+    "\n",
+    "with gr.Blocks(title=\"VMC2026 Track 2 — Emotional TTS Evaluator (exp08)\") as demo:\n",
+    "    gr.Markdown(INTRO)\n",
+    "    with gr.Tab(\"🎯 Score one TTS file\"):\n",
+    "        with gr.Row():\n",
+    "            with gr.Column(scale=1):\n",
+    "                a = gr.Audio(type=\"filepath\", label=\"Audio (TTS speech)\")\n",
+    "                tgt = gr.Dropdown(EMOTIONS5, label=\"🎯 Target emotion (for EMOS)\")\n",
+    "                btn = gr.Button(\"Score emotion\", variant=\"primary\")\n",
+    "            with gr.Column(scale=2):\n",
+    "                verdict = gr.Markdown()\n",
+    "                with gr.Row():\n",
+    "                    emos_o = gr.Number(label=\"EMOS: match with the target emotion (1–5)\", interactive=False)\n",
+    "                cat_o = gr.Label(label=\"CAT: perceived-emotion distribution (5 classes)\")\n",
+    "                gr.Markdown(\"**VAD: continuous emotion coordinates (1–5):**\")\n",
+    "                with gr.Row():\n",
+    "                    val_o = gr.Number(label=\"Valence (positive↑)\", interactive=False)\n",
+    "                    aro_o = gr.Number(label=\"Arousal (excited↑)\", interactive=False)\n",
+    "                    dom_o = gr.Number(label=\"Dominance (dominant↑)\", interactive=False)\n",
+    "        btn.click(ui_predict, [a, tgt], [verdict, emos_o, cat_o, val_o, aro_o, dom_o])\n",
+    "    with gr.Tab(\"📊 Scorer reliability\"):\n",
+    "        gr.Markdown(\"Measures how well the model reproduces human labels on the **internal val split** (10% of train.csv, seed 42): \"\n",
+    "                    \"**UTT-SRCC** (EMOS/VAD, higher = better) + **CAT-err** (lower = better).\\n\"\n",
+    "                    \"⚠️ Dev labels are hidden → this is **NOT** a leaderboard score; it only shows how trustworthy the scorer is.\")\n",
+    "        lim = gr.Slider(20, 300, value=100, step=20, label=\"Number of val samples to score (more = slower)\")\n",
+    "        tbl = gr.Dataframe(headers=[\"Column\", \"Model (internal val)\", \"exp08 reference\"],\n",
+    "                           label=\"UTT-SRCC / CAT-err\", interactive=False)\n",
+    "        gr.Button(\"Run evaluation\", variant=\"primary\").click(ui_eval, [lim], [tbl])\n",
+    "\n",
+    "demo.launch(share=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fcea6f73",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- The `TRUNK_HIDDEN/HEAD_HIDDEN` constants MUST match exp08 (the ckpt does not store them); wrong values cause key/shape mismatches.\n",
+    "- EMOS needs a target emotion → until one is picked in the dropdown, only CAT/VAD are shown.\n",
+    "- exp08 = mean-pool (no Mamba) → the demo uses `masked_mean`.\n",
+    "- The metrics tab scores on the internal val split of train.csv (dev labels are hidden) → the numbers should be close to the exp08 reference if the val split is the same.\n",
+    "- Requires GPU T4 + Internet On (WavLM/SAILER/audeering are downloaded on first use)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
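
Illustrative sketch (not part of the commit): the notebook above turns per-rater `emocat` votes into a 5-class distribution (`parse_emocat_votes`, `load_train_labels`) and reports CAT-err as the mean L1 distance between the predicted and labelled distributions (`eval_metrics`). The version below restates that logic with plain Python lists instead of NumPy so it runs without dependencies. `votes_to_distribution` and `cat_err` are hypothetical names, and the alias table is a trimmed copy of `_EMO_ALIAS`.

```python
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
ALIAS = {  # trimmed copy of the notebook's alias table
    "anger": "angry", "happiness": "happy", "joy": "happy", "calm": "neutral",
    "sadness": "sad", "surprise": "surprised", "surprising": "surprised",
}


def norm_emotion(label):
    """Map a raw label to one of EMOTIONS5, or None if it is unknown."""
    key = str(label).strip().lower()
    key = ALIAS.get(key, key)
    return key if key in EMOTIONS5 else None


def votes_to_distribution(cells):
    """Count votes across raters' cells and normalize to a probability vector."""
    counts = [0.0] * len(EMOTIONS5)
    for cell in cells:
        text = str(cell)
        for sep in "/;| ":                 # same separators as the notebook
            text = text.replace(sep, ",")
        for tok in text.split(","):
            emo = norm_emotion(tok)
            if emo is not None:
                counts[EMOTIONS5.index(emo)] += 1.0
    total = sum(counts)
    if total == 0:                         # no usable votes → uniform, as in the notebook
        return [1.0 / len(EMOTIONS5)] * len(EMOTIONS5)
    return [c / total for c in counts]


def cat_err(pred_dists, true_dists):
    """Mean L1 distance between predicted and labelled distributions."""
    per_item = [sum(abs(p - t) for p, t in zip(pred, true))
                for pred, true in zip(pred_dists, true_dists)]
    return sum(per_item) / len(per_item)


if __name__ == "__main__":
    label = votes_to_distribution(["happy", "joy|sad"])   # two happy votes, one sad
    pred = [0.0, 1.0, 0.0, 0.0, 0.0]
    print(label, cat_err([pred], [label]))
```

With this definition CAT-err ranges from 0 (identical distributions) to 2 (disjoint support).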

track2/demo_track2_emotion_gradio_pipeline.py ADDED

	@@ -0,0 +1,431 @@

+# %% [markdown]
+# # VMC2026 Track 2: Gradio demo "Emotional TTS Evaluator" (BEST model = exp08)
+#
+# This demo uses the **best emotion checkpoint** (`ft_emotion_full_20epoch.pt`: fine-tuned WavLM warm-started from
+# SAILER + frozen audeering) to score the **5 emotion columns** of a single TTS audio file: **EMOS / CAT / VAL / ARO / DOM**.
+# Unlike the older demo (`demo_track2_gradio`), which uses the UTMOS+emotion2vec+Gemini baseline, this version needs NO API.
+#
+# **2 tabs:**
+# 1. *Score one TTS file*: upload audio + pick the target emotion → emotional-expressiveness scores + a MATCH/MISMATCH interpretation.
+# 2. *Scorer metrics*: computes UTT-SRCC (EMOS/VAD) + CAT-err on the internal val split (train.csv) → shows how reliable the scorer is.
+#
+# **Running on Kaggle:** GPU **T4** + Internet **On** → Add Input (1) the Track 2 dataset, (2) the dataset containing
+# `ft_emotion_full_20epoch.pt` → Run All → the last cell prints a `*.gradio.live` link.
+# %% [markdown]
+# ## 0. Configuration: auto-detect DATA_ROOT + checkpoint
+# %%
+import os, glob
+def find_data_root(search_root="/kaggle/input"):
+    cands = []
+    for train_csv in glob.glob(os.path.join(search_root, "**", "sets", "train.csv"), recursive=True):
+        root = os.path.dirname(os.path.dirname(train_csv))
+        score = os.path.isdir(os.path.join(root, "wav")) + os.path.exists(os.path.join(root, "metadata.csv"))
+        cands.append((score, root))
+    cands.sort(reverse=True)
+    return cands
+_cands = find_data_root("/kaggle/input")
+if _cands:
+    print("🔎 DATA_ROOT candidates:")
+    for sc, r in _cands:
+        print(f"   [{sc}/2] {r}")
+    DATA_ROOT = _cands[0][1]
+    print(f"👉 Auto-selected DATA_ROOT = {DATA_ROOT}")
+else:
+    DATA_ROOT = "/kaggle/input/datasets/minhtoan2"   # fallback
+    print(f"❌ sets/train.csv not found → falling back to {DATA_ROOT} (did you Add Input the dataset?)")
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"
+# ── exp08 emotion checkpoint (prefer the 20-epoch version = BEST) ─────────────────
+CKPT_PATH = ""    # << "" = auto-detect; or set manually, e.g. "/kaggle/input/<slug>/ft_emotion_full_20epoch.pt"
+def find_ckpt(explicit):
+    if explicit and os.path.exists(explicit):
+        return explicit
+    pats = ["ft_emotion_full_20epoch*.pt", "ft_emotion_full*.pt"]   # prefer the 20-epoch version
+    for pat in pats:
+        for base in ["/kaggle/input", "/kaggle/working"]:
+            hits = sorted(glob.glob(os.path.join(base, "**", pat), recursive=True))
+            if hits:
+                return hits[0]
+    return ""
+CKPT_PATH = find_ckpt(CKPT_PATH)
+assert CKPT_PATH, "❌ ft_emotion_full*.pt not found. Did you Add Input the dataset containing the exp08 checkpoint?"
+print("✅ Checkpoint:", CKPT_PATH)
+# ── Architecture constants MUST match exp08 (the checkpoint does not store them) ────────────────
+DEVICE       = "cuda"
+SR           = 16000
+EMO_MAX_SEC  = 8
+TRUNK_HIDDEN = 512
+HEAD_HIDDEN  = 128
+DROPOUT      = 0.3       # has no effect in eval mode
+USE_AMP      = True
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+# exp08 reference scores (internal validation / DEV) for comparison in the metric tab
+EXP08 = {"emos": 0.811, "cat_err": 0.133, "val": 0.659, "aro": 0.793, "dom": 0.751}
+# %% [markdown]
+# ## 1. Install dependencies + clone the SAILER code
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("gradio", "loralib", "speechbrain", "librosa", "soundfile",
+            "scipy", "scikit-learn", "pandas", "tqdm")
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Load the exp08 model (fine-tuned WavLM backbone + frozen audeering + heads), once only
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+import librosa
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (slow)")
+ckpt = torch.load(CKPT_PATH, map_location="cpu", weights_only=False)   # the checkpoint contains numpy objects, so weights_only must be False
+assert "wavlm" in ckpt and "heads" in ckpt, "❌ Checkpoint is missing 'wavlm'/'heads' → the complete ft_emotion_full_20epoch.pt is required."
+AUD_DIM = int(ckpt.get("AUD_DIM", 0))
+USE_AUDEERING = AUD_DIM > 0
+print("✅ Loaded ckpt | keys:", list(ckpt.keys()), "| AUD_DIM:", AUD_DIM, "(audeering", "ON)" if USE_AUDEERING else "OFF)")
+def find_hf_backbone(module):
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    _name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ Built the WavLM backbone from the SAILER wrapper at '.{_name}'")
+except Exception as e:
+    print("⚠️ Failed to load the SAILER wrapper:", repr(e), "→ falling back to vanilla WavLM.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: microsoft/wavlm-large.")
+wavlm = wavlm.to(device).eval()
+WAVLM_DIM = int(wavlm.config.hidden_size)
+wavlm.config.layerdrop = 0.0
+_miss, _unexp = wavlm.load_state_dict(ckpt["wavlm"], strict=False)
+print(f"🔁 load wavlm: {len(_miss)} missing / {len(_unexp)} unexpected keys (expected ~0)")
+def masked_mean(hidden, attn_mask):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+@torch.no_grad()
+def wavlm_embed(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state
+    return masked_mean(out, attn_mask)
+# ── Frozen audeering model (auxiliary features), built only if the checkpoint uses it ──
+aud_backbone = aud_head = aud_proc = None
+if USE_AUDEERING:
+    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+    from huggingface_hub import hf_hub_download
+    AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+    aud_backbone = Wav2Vec2Model(aud_cfg)
+    try:
+        _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+            hf_hub_download(AUD_NAME, "model.safetensors"))
+    except Exception:
+        _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+    bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+    aud_backbone.load_state_dict(bb_sd, strict=False)
+    _hid = _sd["classifier.dense.weight"].shape[0]
+    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _sd["classifier.out_proj.weight"].shape[0]))
+    aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+    aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+    aud_backbone = aud_backbone.to(device).eval()
+    aud_head = aud_head.to(device).eval()
+    assert _hid + 3 == AUD_DIM, f"⚠️ Built AUD_DIM ({_hid+3}) ≠ ckpt ({AUD_DIM})"
+    print(f"✅ audeering frozen ({AUD_DIM}-D)")
+@torch.no_grad()
+def audeering_feat(wave):
+    x = aud_proc(wave, sampling_rate=SR).input_values[0]
+    x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+    h = aud_backbone(x)[0].mean(dim=1)
+    out = aud_head(h)[0].cpu().numpy()
+    vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]
+    return np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+# ── EmoHeads (matching exp08) + load weights + normalization stats from the checkpoint ──
+N_EMO = len(EMOTIONS5)
+TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device).eval()
+_hm, _hu = heads.load_state_dict(ckpt["heads"], strict=False)
+print(f"🔁 load heads: {len(_hm)} missing / {len(_hu)} unexpected keys (expected 0)")
+emos_mu = float(ckpt["emos_mu"]); emos_sd = float(ckpt["emos_sd"])
+vad_mu = np.asarray(ckpt["vad_mu"], dtype=np.float32); vad_sd = np.asarray(ckpt["vad_sd"], dtype=np.float32)
+print(f"Normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}")
+def onehot_target(tgt):
+    v = np.zeros(N_EMO, dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+# %% [markdown]
+# ## 3. Core inference function (one numpy waveform → emos/cat5/vad3)
+# %%
+@torch.no_grad()
+def infer_wave(wave, target_emotion):
+    """wave: numpy float32 (already 16 kHz mono). target_emotion: str or None. Returns (emos, cat5, vad3)."""
+    wave = wave[: EMO_MAX_SEC * SR].astype(np.float32)
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(norm_emotion(target_emotion) if target_emotion else None)).unsqueeze(0).to(device)
+    with torch.autocast(device_type=device, enabled=USE_AMP and device == "cuda"):   # torch.cuda.amp.autocast is deprecated
+        fw = wavlm_embed(iv, am)
+        if USE_AUDEERING:
+            fw = torch.cat([fw, torch.from_numpy(audeering_feat(wave)).unsqueeze(0).to(device)], dim=1)
+        emos_p, cat_l, vad_p = heads(fw, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+# %% [markdown]
+# ## 4. Internal-validation metrics (UTT-SRCC + CAT-err): how reliable is the scorer?
+# %%
+import pandas as pd
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+from tqdm.auto import tqdm
+def parse_emocat_votes(cell):
+    v = np.zeros(N_EMO, dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(N_EMO, dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(N_EMO, 0.2, dtype=np.float32)
+        for i in range(N_EMO):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+# target emotion per wav (needed for EMOS), read from metadata
+def load_target_emotions():
+    tgt = {}
+    if os.path.exists(METADATA_CSV):
+        with open(METADATA_CSV, encoding="utf-8") as f:
+            for ln in f:
+                parts = ln.strip().split("|")
+                if len(parts) >= 2:
+                    tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+_target_map = None
+_val_df = None
+def _prep_eval():
+    """Lazy: read the labels and hold out a 10% internal validation split (seed 42, matching exp08)."""
+    global _target_map, _val_df
+    if _val_df is None:
+        _target_map = load_target_emotions()
+        df = load_train_labels()
+        df = df[df["wavID"].map(lambda s: os.path.exists(os.path.join(WAV_DIR, s + ".wav")))].reset_index(drop=True)
+        _, va = train_test_split(np.arange(len(df)), test_size=0.10, random_state=42)
+        _val_df = df.iloc[va].reset_index(drop=True)
+    return _target_map, _val_df
+def eval_metrics(limit):
+    tmap, vdf = _prep_eval()
+    n = min(int(limit), len(vdf))
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for i in tqdm(range(n), desc="eval"):
+        r = vdf.iloc[i]; sid = r["wavID"]
+        wav = os.path.join(WAV_DIR, sid + ".wav")
+        wave, _ = librosa.load(wav, sr=SR, mono=True)
+        emos, cat5, vad3 = infer_wave(wave, tmap.get(sid))
+        P["emos"].append(emos); Y["emos"].append(float(r["emos"]))
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t].append(float(vad3[j])); Y[t].append(float(r[t]))
+        catP.append(cat5); catY.append([r[f"cat{k}"] for k in range(N_EMO)])
+    rows = []
+    for t in ["emos", "val", "aro", "dom"]:
+        srcc = spearmanr(P[t], Y[t]).correlation
+        rows.append([t.upper(), f"{srcc:.4f}", f"{EXP08.get(t, float('nan')):.3f}"])
+    cat_err = float(np.abs(np.array(catP) - np.array(catY)).sum(1).mean())
+    rows.append(["CAT-err ↓", f"{cat_err:.4f}", f"{EXP08['cat_err']:.3f}"])
+    return rows
+# %% [markdown]
+# ## 5. Gradio UI (2 tabs)
+# %%
+import gradio as gr
+def ui_predict(audio, target_emotion):
+    """Returns: verdict(md) · EMOS(number) · CAT(label) · VAL/ARO/DOM(number)."""
+    if not audio:
+        return "### ⚠️ Please upload an audio file.", None, {}, None, None, None
+    wave, _ = librosa.load(audio, sr=SR, mono=True)
+    emos, cat5, vad3 = infer_wave(wave, target_emotion)
+    cat_dict = {e: float(cat5[i]) for i, e in enumerate(EMOTIONS5)}
+    perceived = EMOTIONS5[int(np.argmax(cat5))]
+    if target_emotion:
+        match = "✅ **MATCHES** target" if perceived == norm_emotion(target_emotion) else "⚠️ **MISMATCHES** target"
+        band = "🟢 good" if emos >= 4 else ("🟡 fair" if emos >= 3 else "🔴 weak")
+        verdict = (f"### Expressiveness verdict\n"
+                   f"- Perceived emotion: **{perceived}** → {match} (`{target_emotion}`)\n"
+                   f"- EMOS = **{emos:.2f}/5** → {band} expressiveness")
+    else:
+        verdict = (f"### Expressiveness verdict\n"
+                   f"- Perceived emotion: **{perceived}**\n"
+                   f"- *(Pick a target emotion to enable EMOS, the match with the intended emotion)*")
+        emos = None
+    return verdict, (round(emos, 3) if emos is not None else None), cat_dict, \
+        round(float(vad3[0]), 3), round(float(vad3[1]), 3), round(float(vad3[2]), 3)
+def ui_eval(limit):
+    return eval_metrics(limit)
+INTRO = (
+    "# 🎙️ Emotional TTS Evaluator: VoiceMOS 2026 Track 2\n"
+    "A scorer for the **emotional expressiveness** of TTS speech, powered by the best model (**exp08**: fine-tuned WavLM + "
+    "audeering). Runs locally; no API is needed.\n\n"
+    "> **The 5 outputs below ARE Track 2's definition of \"expressive emotion\"**; each one answers a question:\n"
+    "> **EMOS** = does it convey the requested emotion · **CAT** = which emotion do listeners perceive · "
+    "**VAD** = valence / arousal / dominance."
+)
+with gr.Blocks(title="VMC2026 Track 2 — Emotional TTS Evaluator (exp08)") as demo:
+    gr.Markdown(INTRO)
+    with gr.Tab("🎯 Score one TTS file"):
+        with gr.Row():
+            with gr.Column(scale=1):
+                a = gr.Audio(type="filepath", label="Audio (TTS speech)")
+                tgt = gr.Dropdown(EMOTIONS5, label="🎯 Target emotion (for EMOS)")
+                btn = gr.Button("Score emotion", variant="primary")
+            with gr.Column(scale=2):
+                verdict = gr.Markdown()
+                with gr.Row():
+                    emos_o = gr.Number(label="EMOS: match with the target emotion (1–5)", interactive=False)
+                cat_o = gr.Label(label="CAT: distribution of perceived emotion (5 classes)")
+                gr.Markdown("**VAD: continuous emotion coordinates (1–5):**")
+                with gr.Row():
+                    val_o = gr.Number(label="Valence (positive↑)", interactive=False)
+                    aro_o = gr.Number(label="Arousal (excited↑)", interactive=False)
+                    dom_o = gr.Number(label="Dominance (dominant↑)", interactive=False)
+        btn.click(ui_predict, [a, tgt], [verdict, emos_o, cat_o, val_o, aro_o, dom_o])
+    with gr.Tab("📊 Scorer reliability"):
+        gr.Markdown("Measures how well the model reproduces human labels on the **internal validation split** (10% of train.csv, seed 42): "
+                    "**UTT-SRCC** (EMOS/VAD, higher is better) + **CAT-err** (lower is better).\n"
+                    "⚠️ Dev labels are hidden, so this is **NOT** a leaderboard score; it only indicates how trustworthy the scorer is.")
+        lim = gr.Slider(20, 300, value=100, step=20, label="Number of validation samples to score (more = slower)")
+        tbl = gr.Dataframe(headers=["Column", "Model (internal val)", "exp08 reference"],
+                           label="UTT-SRCC / CAT-err", interactive=False)
+        gr.Button("Run evaluation", variant="primary").click(ui_eval, [lim], [tbl])
+demo.launch(share=True)
+# %% [markdown]
+# ## Notes
+# - The constants `TRUNK_HIDDEN`/`HEAD_HIDDEN` MUST match exp08 (the checkpoint does not store them); a wrong value causes key/shape mismatches.
+# - EMOS needs a target emotion: until one is picked in the dropdown, only CAT/VAD are shown.
+# - exp08 uses mean pooling (no Mamba), so the demo uses `masked_mean`.
+# - The metric tab scores the internal validation split of train.csv (dev labels are hidden), so the numbers should be close to the exp08 reference if the validation split is the same.
+# - Requires a T4 GPU and Internet On (WavLM/SAILER/audeering are downloaded on first run).
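
The mean-pooling note above can be made concrete with a small framework-free sketch. It assumes frame features stored as plain Python lists and a 0/1 frame mask; the script's `masked_mean` works on PyTorch tensors and derives the frame mask from WavLM, so this only illustrates the arithmetic, and the `eps` default here simply mirrors the `clamp(min=1e-6)` in that function:

```python
def masked_mean(frames, mask, eps=1e-6):
    """Average only the frames whose mask flag is set.

    frames: list of T feature vectors (each a list of floats, same length)
    mask:   list of T flags, 1 for real frames and 0 for padding
    """
    dim = len(frames[0])
    sums = [0.0] * dim
    kept = 0.0
    for vec, keep in zip(frames, mask):
        if keep:
            kept += 1.0
            for j, x in enumerate(vec):
                sums[j] += x
    # Guard against an all-padding input so the division never hits zero.
    denom = max(kept, eps)
    return [s / denom for s in sums]


# Two real frames followed by one padding frame: the padding is ignored.
pooled = masked_mean([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])
```

In the demo itself every input is a single unpadded utterance, so the mask is all ones and the result equals a plain mean; the masking matters once utterances of different lengths are batched together.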

track2/demo_track2_gradio.ipynb ADDED

	@@ -0,0 +1,175 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: Gradio demo (Emotional TTS: QMOS / CAT / EMOS / VAD)\n",
+    "\n",
+    "- **QMOS** (UTMOS) + **CAT** (emotion2vec, 5 emotions): work out of the box and need only the audio.\n",
+    "- **EMOS / VAD** (Gemini): optional; paste a `GEMINI_API_KEY` and pick a target emotion.\n",
+    "\n",
+    "### Usage on Kaggle\n",
+    "1. Settings → **GPU T4 + Internet On**.\n",
+    "2. **Run All** → the last cell prints a `*.gradio.live` link (valid for about 72 hours)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Installation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q gradio speechmos funasr librosa soundfile google-genai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load models + prediction function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re, json, librosa\n",
+    "\n",
+    "GEMINI_MODEL = \"gemini-2.0-flash\"\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "_M = {}\n",
+    "\n",
+    "def _qmos():\n",
+    "    if \"qmos\" not in _M:\n",
+    "        import torch\n",
+    "        _M[\"qmos\"] = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\", trust_repo=True)\n",
+    "    return _M[\"qmos\"]\n",
+    "\n",
+    "\n",
+    "def _emocat():\n",
+    "    if \"emocat\" not in _M:\n",
+    "        from funasr import AutoModel\n",
+    "        _M[\"emocat\"] = AutoModel(model=\"iic/emotion2vec_plus_large\", hub=\"hf\")\n",
+    "    return _M[\"emocat\"]\n",
+    "\n",
+    "\n",
+    "def _gemini_emos_vad(audio_path, target_emotion, api_key):\n",
+    "    \"\"\"EMOS (1-5 match with the target emotion) + VAD (val/aro/dom, 1-5): a compact demo version via Gemini.\"\"\"\n",
+    "    from google import genai\n",
+    "    from google.genai import types\n",
+    "    client = genai.Client(api_key=api_key)\n",
+    "    part = types.Part.from_bytes(data=open(audio_path, \"rb\").read(), mime_type=\"audio/wav\")\n",
+    "    cfg = types.GenerateContentConfig(temperature=0.0)\n",
+    "\n",
+    "    p_emos = (f\"The target emotion is '{target_emotion}'. On a scale of 1 to 5, how well does the \"\n",
+    "              f\"speaker express that emotion? 5=perfect match, 1=no match. Answer with ONLY one integer 1-5.\")\n",
+    "    r = client.models.generate_content(model=GEMINI_MODEL, config=cfg, contents=[p_emos, part])\n",
+    "    mm = re.search(r\"[1-5]\", getattr(r, \"text\", \"\") or \"\")\n",
+    "    emos = int(mm.group()) if mm else None\n",
+    "\n",
+    "    p_vad = ('Rate this speech on three 1-5 scales: Valence (1=very negative,5=very positive), '\n",
+    "             'Arousal (1=very calm,5=very excited), Dominance (1=very submissive,5=very dominant). '\n",
+    "             'Answer ONLY as JSON: {\"val\":x,\"aro\":y,\"dom\":z}.')\n",
+    "    r2 = client.models.generate_content(model=GEMINI_MODEL, config=cfg, contents=[p_vad, part])\n",
+    "    val = aro = dom = None\n",
+    "    try:\n",
+    "        d = json.loads(re.search(r\"\\{.*\\}\", getattr(r2, \"text\", \"\") or \"\", re.S).group())\n",
+    "        val, aro, dom = d.get(\"val\"), d.get(\"aro\"), d.get(\"dom\")\n",
+    "    except Exception:\n",
+    "        pass\n",
+    "    return emos, (val, aro, dom)\n",
+    "\n",
+    "\n",
+    "def predict(audio, target_emotion, gemini_key):\n",
+    "    import torch\n",
+    "    if not audio:\n",
+    "        return \"⚠️ Please upload an audio file.\", {}\n",
+    "    wav = librosa.load(audio, sr=16000, mono=True)[0]\n",
+    "    # QMOS\n",
+    "    qmos = float(_qmos()(torch.from_numpy(wav).unsqueeze(0), sr=16000).mean().item())\n",
+    "    # CAT\n",
+    "    rec = _emocat().generate(audio, granularity=\"utterance\", extract_embedding=False)\n",
+    "    probs = {e: 0.0 for e in EMOTIONS5}\n",
+    "    for lab, sc in zip(rec[0][\"labels\"], rec[0][\"scores\"]):\n",
+    "        name = lab.split(\"/\")[-1]\n",
+    "        if name in probs:\n",
+    "            probs[name] = float(sc)\n",
+    "    tot = sum(probs.values())\n",
+    "    if tot > 0:\n",
+    "        probs = {k: v / tot for k, v in probs.items()}\n",
+    "\n",
+    "    lines = [f\"QMOS (speech quality, 1–5): {qmos:.3f}\"]\n",
+    "    if gemini_key and target_emotion:\n",
+    "        try:\n",
+    "            emos, (val, aro, dom) = _gemini_emos_vad(audio, target_emotion, gemini_key)\n",
+    "            lines.append(f\"EMOS (match with emotion '{target_emotion}', 1–5): {emos}\")\n",
+    "            lines.append(f\"VAD: Valence: {val} · Arousal: {aro} · Dominance: {dom}\")\n",
+    "        except Exception as e:\n",
+    "            lines.append(f\"(EMOS/VAD error: {e})\")\n",
+    "    else:\n",
+    "        lines.append(\"(EMOS/VAD: paste a GEMINI_API_KEY and pick a target emotion to enable)\")\n",
+    "    return \"\\n\".join(lines), probs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Gradio UI + launch"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import gradio as gr\n",
+    "\n",
+    "with gr.Blocks(title=\"VMC2026 Track 2 — Emotional TTS\") as demo:\n",
+    "    gr.Markdown(\"# 🎙️ Track 2 · Emotional TTS (QMOS / CAT / EMOS / VAD)\\n\"\n",
+    "                \"QMOS and the emotion distribution (CAT) work out of the box. EMOS/VAD need a Gemini key and a target emotion.\")\n",
+    "    a = gr.Audio(type=\"filepath\", label=\"Audio\")\n",
+    "    with gr.Row():\n",
+    "        tgt = gr.Dropdown(EMOTIONS5, label=\"Target emotion (for EMOS, optional)\")\n",
+    "        key = gr.Textbox(label=\"GEMINI_API_KEY (optional)\", type=\"password\")\n",
+    "    out = gr.Textbox(label=\"Numeric results\", lines=5)\n",
+    "    lbl = gr.Label(label=\"CAT: distribution of perceived emotion\")\n",
+    "    gr.Button(\"Predict\", variant=\"primary\").click(predict, [a, tgt, key], [out, lbl])\n",
+    "\n",
+    "demo.launch(share=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- EMOS/VAD here is a compact demo version (shortened prompts). It is NOT identical to the original baseline script and is for illustration only."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
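
The CAT step in `predict` keeps only the five Track 2 emotions from the emotion2vec output and renormalizes them. A minimal sketch of that step on made-up scores follows; the label strings are hypothetical placeholders in the `prefix/name` form that the `split("/")` call implies, and `restrict_to_five` is an illustrative name rather than a function from the notebook:

```python
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]


def restrict_to_five(labels, scores):
    """Keep the scores of the 5 target emotions and rescale them to sum to 1."""
    probs = {e: 0.0 for e in EMOTIONS5}
    for lab, sc in zip(labels, scores):
        name = lab.split("/")[-1]      # keep the part after the last slash
        if name in probs:              # classes outside EMOTIONS5 are dropped
            probs[name] = float(sc)
    total = sum(probs.values())
    if total > 0:
        probs = {k: v / total for k, v in probs.items()}
    return probs


# Hypothetical classifier output with one class outside the 5-way set.
probs = restrict_to_five(["x/happy", "x/sad", "x/fearful"], [0.6, 0.2, 0.2])
```

Dropping the extra classes before renormalizing means any probability mass the classifier put on them is redistributed proportionally over the five kept emotions, which can overstate confidence when the dropped classes were dominant.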

track2/demo_track2_gradio_pipeline.py ADDED

	@@ -0,0 +1,120 @@

+# %% [markdown]
+# # VMC2026 Track 2: Gradio demo (Emotional TTS: QMOS / CAT / EMOS / VAD)
+#
+# - **QMOS** (UTMOS) + **CAT** (emotion2vec, 5 emotions): work out of the box and need only the audio.
+# - **EMOS / VAD** (Gemini): optional; paste a `GEMINI_API_KEY` and pick a target emotion.
+#
+# ### Usage on Kaggle
+# 1. Settings → **GPU T4 + Internet On**.
+# 2. **Run All** → the last cell prints a `*.gradio.live` link (valid for about 72 hours).
+# %% [markdown]
+# ## 1. Installation
+# %%
+# !pip install -q gradio speechmos funasr librosa soundfile google-genai
+# %% [markdown]
+# ## 2. Load models + prediction function
+# %%
+import re, json, librosa
+GEMINI_MODEL = "gemini-2.0-flash"
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_M = {}
+def _qmos():
+    if "qmos" not in _M:
+        import torch
+        _M["qmos"] = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
+    return _M["qmos"]
+def _emocat():
+    if "emocat" not in _M:
+        from funasr import AutoModel
+        _M["emocat"] = AutoModel(model="iic/emotion2vec_plus_large", hub="hf")
+    return _M["emocat"]
+def _gemini_emos_vad(audio_path, target_emotion, api_key):
+    """EMOS (1-5 match with the target emotion) + VAD (val/aro/dom, 1-5): a compact demo version via Gemini."""
+    from google import genai
+    from google.genai import types
+    client = genai.Client(api_key=api_key)
+    part = types.Part.from_bytes(data=open(audio_path, "rb").read(), mime_type="audio/wav")
+    cfg = types.GenerateContentConfig(temperature=0.0)
+    p_emos = (f"The target emotion is '{target_emotion}'. On a scale of 1 to 5, how well does the "
+              f"speaker express that emotion? 5=perfect match, 1=no match. Answer with ONLY one integer 1-5.")
+    r = client.models.generate_content(model=GEMINI_MODEL, config=cfg, contents=[p_emos, part])
+    mm = re.search(r"[1-5]", getattr(r, "text", "") or "")
+    emos = int(mm.group()) if mm else None
+    p_vad = ('Rate this speech on three 1-5 scales: Valence (1=very negative,5=very positive), '
+             'Arousal (1=very calm,5=very excited), Dominance (1=very submissive,5=very dominant). '
+             'Answer ONLY as JSON: {"val":x,"aro":y,"dom":z}.')
+    r2 = client.models.generate_content(model=GEMINI_MODEL, config=cfg, contents=[p_vad, part])
+    val = aro = dom = None
+    try:
+        d = json.loads(re.search(r"\{.*\}", getattr(r2, "text", "") or "", re.S).group())
+        val, aro, dom = d.get("val"), d.get("aro"), d.get("dom")
+    except Exception:
+        pass
+    return emos, (val, aro, dom)
+def predict(audio, target_emotion, gemini_key):
+    import torch
+    if not audio:
+        return "⚠️ Please upload an audio file.", {}
+    wav = librosa.load(audio, sr=16000, mono=True)[0]
+    # QMOS
+    qmos = float(_qmos()(torch.from_numpy(wav).unsqueeze(0), sr=16000).mean().item())
+    # CAT
+    rec = _emocat().generate(audio, granularity="utterance", extract_embedding=False)
+    probs = {e: 0.0 for e in EMOTIONS5}
+    for lab, sc in zip(rec[0]["labels"], rec[0]["scores"]):
+        name = lab.split("/")[-1]
+        if name in probs:
+            probs[name] = float(sc)
+    tot = sum(probs.values())
+    if tot > 0:
+        probs = {k: v / tot for k, v in probs.items()}
+    lines = [f"QMOS (speech quality, 1–5): {qmos:.3f}"]
+    if gemini_key and target_emotion:
+        try:
+            emos, (val, aro, dom) = _gemini_emos_vad(audio, target_emotion, gemini_key)
+            lines.append(f"EMOS (match with emotion '{target_emotion}', 1–5): {emos}")
+            lines.append(f"VAD: Valence: {val} · Arousal: {aro} · Dominance: {dom}")
+        except Exception as e:
+            lines.append(f"(EMOS/VAD error: {e})")
+    else:
+        lines.append("(EMOS/VAD: paste a GEMINI_API_KEY and pick a target emotion to enable)")
+    return "\n".join(lines), probs
+# %% [markdown]
+# ## 3. Gradio UI + launch
+# %%
+import gradio as gr
+with gr.Blocks(title="VMC2026 Track 2 — Emotional TTS") as demo:
+    gr.Markdown("# 🎙️ Track 2 · Emotional TTS (QMOS / CAT / EMOS / VAD)\n"
+                "QMOS and the emotion distribution (CAT) work out of the box. EMOS/VAD need a Gemini key and a target emotion.")
+    a = gr.Audio(type="filepath", label="Audio")
+    with gr.Row():
+        tgt = gr.Dropdown(EMOTIONS5, label="Target emotion (for EMOS, optional)")
+        key = gr.Textbox(label="GEMINI_API_KEY (optional)", type="password")
+    out = gr.Textbox(label="Numeric results", lines=5)
+    lbl = gr.Label(label="CAT: distribution of perceived emotion")
+    gr.Button("Predict", variant="primary").click(predict, [a, tgt, key], [out, lbl])
+demo.launch(share=True)
+# %% [markdown]
+# ## Notes
+# - EMOS/VAD here is a compact demo version (shortened prompts). It is NOT identical to the original baseline script and is for illustration only.
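
`_gemini_emos_vad` turns free-text model replies into numbers with two regular expressions. The sketch below isolates that parsing so it can be exercised without an API key; the function names `parse_emos` and `parse_vad` and the sample replies are illustrative assumptions, not part of the script:

```python
import json
import re


def parse_emos(text):
    """Return the first digit 1-5 found in the reply, or None."""
    m = re.search(r"[1-5]", text or "")
    return int(m.group()) if m else None


def parse_vad(text):
    """Extract val/aro/dom from the first {...} span in the reply."""
    m = re.search(r"\{.*\}", text or "", re.S)
    if not m:
        return None, None, None
    try:
        d = json.loads(m.group())
    except json.JSONDecodeError:
        return None, None, None
    return d.get("val"), d.get("aro"), d.get("dom")


# A reply wrapped in a Markdown fence still parses, because only the
# brace-delimited span is handed to json.loads.
vad = parse_vad('```json\n{"val": 2, "aro": 4, "dom": 3}\n```')
```

One caveat of the `[1-5]` pattern is that it takes the first matching digit anywhere in the text, so a reply such as "Out of 5, I'd say 3" yields 5 rather than 3; the prompt's "Answer with ONLY one integer" instruction is what keeps this workable in practice.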

track2/exp02_train_emos.ipynb ADDED

	@@ -0,0 +1,542 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "b886ed3a",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp02 (trained EMOS), Kaggle\n",
+    "\n",
+    "**Goal:** train a model that predicts **EMOS** (how well the audio matches the target emotion) from the ~12,746\n",
+    "samples with listener labels in `sets/train.csv`, aiming to **beat the 0.194 baseline** (exp01, offline).\n",
+    "\n",
+    "## The idea (read once to get the picture)\n",
+    "EMOS depends on **BOTH the audio AND the target emotion** (the same \"happy\" audio scores high with target=happy\n",
+    "and low with target=sad). So the model must take both as input:\n",
+    "\n",
+    "```\n",
+    "each wav ─► emotion2vec ─► (a) ~D-dim embedding        ┐\n",
+    "                           (b) 5 emotion probabilities ├─► concat ─► MLP head ─► EMOS (1–5)\n",
+    "       target emotion ───► 5-dim one-hot               ┘             (THE PART WE TRAIN)\n",
+    "```\n",
+    "\n",
+    "- **The emotion2vec backbone is FROZEN** (not retrained) and only extracts features. This is light on the GPU and workable with little data.\n",
+    "- **Only a small MLP head is trained**, learning the mapping `(features + target) → listener score`.\n",
+    "- **Gold label** = the mean `eMOS` over all listeners for the same wav (grouped by `wavID`).\n",
+    "- Embeddings are **extracted once and cached to .npz** (12,746 files take a long time; re-extracting wastes GPU hours).\n",
+    "- 10% of train is held out as an **internal validation split** to track SRCC during training (DEV has no labels to self-score).\n",
+    "- Finally, a **complete** `answer.txt` is written: QMOS=SpeechMOS · CAT=emotion2vec · **EMOS=the newly trained head**, ready to submit.\n",
+    "\n",
+    "**Running on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On** → + Add Input the Track 2 dataset\n",
+    "(15,477 wavs, with `sets/train.csv`) → edit `DATA_ROOT` in cell 0 → Run All."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d0fadc26",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7fac05b4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, glob, json, time\n",
+    "\n",
+    "# ── Track 2 data (assembled dataset of 15,477 wavs, includes sets/train.csv) ──\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match your Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header) → target emotion\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # listener labels: lisID,wavID,qMOS,emoCat,eMOS,val,dom,aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"     # wav list for the DEV set (the set to submit during the training phase)\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/emb_cache\"        # where extracted embeddings are stored (reused across runs)\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── Training hyperparameters (change these to experiment) ────────────────────\n",
+    "DEVICE        = \"cuda\"      # \"cuda\" on a Kaggle GPU; \"cpu\" if no GPU is available\n",
+    "HIDDEN        = 256         # hidden-layer width of the MLP head\n",
+    "DROPOUT       = 0.3\n",
+    "LR            = 1e-3\n",
+    "EPOCHS        = 60\n",
+    "BATCH         = 64\n",
+    "VAL_FRAC      = 0.10        # 10% of train → internal validation (measures SRCC)\n",
+    "PATIENCE      = 12          # early stop: halt if val SRCC has not improved for N epochs\n",
+    "SEED          = 42\n",
+    "\n",
+    "LIMIT_TRAIN   = None        # set a small number (e.g. 300) for a quick trial run; None = full\n",
+    "USE_CLASSPROB = True        # add emotion2vec's 5 emotion probabilities to the features (the exp01 signal)\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    \"\"\"Map any emotion label onto one of EMOTIONS5; None if nothing matches.\"\"\"\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(path_or_name):\n",
+    "    \"\"\"Return the file name without extension so wavIDs match across train.csv / metadata / dev.scp.\"\"\"\n",
+    "    return os.path.splitext(os.path.basename(str(path_or_name)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b947aceb",
+   "metadata": {},
+   "source": [
+    "## 1. Install"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "49850676",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q speechmos funasr librosa soundfile pandas scipy scikit-learn tqdm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b4e67d6a",
+   "metadata": {},
+   "source": [
+    "## 2. Read and aggregate labels\n",
+    "- `train.csv`: each row = one listener rating one wav → **average eMOS per wavID** = the gold label.\n",
+    "- `metadata.csv`: provides the **target emotion** of each wav (normalized to 5 classes)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88b2fb84",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    \"\"\"metadata.csv (wavID|emotion|transcript, NO header) → {stem: normalized_emotion|None}.\"\"\"\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    \"\"\"train.csv → DataFrame [wavID(stem), emos], averaged per wav.\"\"\"\n",
+    "    df = pd.read_csv(TRAIN_CSV)\n",
+    "    # Normalize column names (in case the capitalization differs)\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = cols.get(\"wavid\") or cols.get(\"wav\") or list(df.columns)[1]\n",
+    "    emos_col = cols.get(\"emos\") or cols.get(\"emo\") or cols.get(\"emomos\")\n",
+    "    assert emos_col, f\"eMOS column not found in train.csv (available columns: {list(df.columns)})\"\n",
+    "    g = df.groupby(df[wav_col].map(stem))[emos_col].mean()\n",
+    "    out = g.reset_index()\n",
+    "    out.columns = [\"wavID\", \"emos\"]\n",
+    "    return out\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "print(f\"Target emotions: {len(target_map)} | train wavs (aggregated): {len(train_df)}\")\n",
+    "print(\"eMOS stats:\", train_df[\"emos\"].describe()[[\"mean\", \"std\", \"min\", \"max\"]].to_dict())\n",
+    "train_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "06dea3ef",
+   "metadata": {},
+   "source": [
+    "## 3. Extract emotion2vec features (cached)\n",
+    "Each wav gets a single `generate(extract_embedding=True)` call that yields the **embedding** (for EMOS) and\n",
+    "the **5-class probabilities** (for CAT and as a feature). Results are cached to `.npz` so later runs skip extraction."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c63cc1f5",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "_e2v_model = None\n",
+    "def get_e2v():\n",
+    "    global _e2v_model\n",
+    "    if _e2v_model is None:\n",
+    "        from funasr import AutoModel\n",
+    "        _e2v_model = AutoModel(model=\"iic/emotion2vec_plus_large\", hub=\"hf\")\n",
+    "    return _e2v_model\n",
+    "\n",
+    "def extract_one(wav_path):\n",
+    "    \"\"\"→ (emb: np.float32[D], probs5: np.float32[5], sums to 1). None if the wav file is missing.\"\"\"\n",
+    "    if not os.path.exists(wav_path):\n",
+    "        return None\n",
+    "    rec = get_e2v().generate(wav_path, granularity=\"utterance\", extract_embedding=True)\n",
+    "    r = rec[0]\n",
+    "    emb = np.asarray(r[\"feats\"], dtype=np.float32).reshape(-1)\n",
+    "    probs = {e: 0.0 for e in EMOTIONS5}\n",
+    "    for lab, sc in zip(r[\"labels\"], r[\"scores\"]):\n",
+    "        name = lab.split(\"/\")[-1]\n",
+    "        if name in probs:\n",
+    "            probs[name] = float(sc)\n",
+    "    tot = sum(probs.values())\n",
+    "    if tot > 0:\n",
+    "        probs = {k: v / tot for k, v in probs.items()}\n",
+    "    probs5 = np.array([probs[e] for e in EMOTIONS5], dtype=np.float32)\n",
+    "    return emb, probs5\n",
+    "\n",
+    "def extract_set(stems, tag):\n",
+    "    \"\"\"Extract (or load from cache) features for a list of stems. Returns dict {stem: (emb, probs5)}.\n",
+    "    The cache lives at CACHE_DIR/<tag>.npz; stems already cached are skipped, so an interrupted run can resume.\"\"\"\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[{tag}] loaded cache: {len(store)} samples\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if not todo:\n",
+    "        print(f\"[{tag}] cache complete, skipping extraction.\")\n",
+    "    else:\n",
+    "        miss = 0\n",
+    "        for i, s in enumerate(tqdm(todo, desc=f\"extract {tag}\")):\n",
+    "            res = extract_one(os.path.join(WAV_DIR, s + \".wav\"))\n",
+    "            if res is None:\n",
+    "                miss += 1\n",
+    "                continue\n",
+    "            emb, probs5 = res\n",
+    "            store[s] = np.concatenate([emb, probs5]).astype(np.float32)  # [D + 5]\n",
+    "            if (i + 1) % 500 == 0:   # periodic cache save in case the session is interrupted\n",
+    "                np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        if miss:\n",
+    "            print(f\"[{tag}] {miss} files missing → skipped.\")\n",
+    "        print(f\"[{tag}] cache total: {len(store)} samples → {cache_path}\")\n",
+    "    # split back into (emb, probs5)\n",
+    "    out = {}\n",
+    "    for s, vec in store.items():\n",
+    "        out[s] = (vec[:-5], vec[-5:])\n",
+    "    return out\n",
+    "\n",
+    "# Extract features for the train set\n",
+    "train_stems = list(train_df[\"wavID\"])\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "train_feat = extract_set(train_stems, \"train\")\n",
+    "EMB_DIM = next(iter(train_feat.values()))[0].shape[0]\n",
+    "print(\"EMB_DIM =\", EMB_DIM)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "015834c1",
+   "metadata": {},
+   "source": [
+    "## 4. Build features and labels for training\n",
+    "Per-wav feature = `[embedding | (probs5 if enabled) | one-hot target(5)]`. Wavs missing a target or features are dropped."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "62e048bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "def build_feature(stem_id, feat_map):\n",
+    "    pack = feat_map.get(stem_id)\n",
+    "    if pack is None:\n",
+    "        return None\n",
+    "    emb, probs5 = pack\n",
+    "    tgt = target_map.get(stem_id)\n",
+    "    if tgt is None:           # unknown target emotion → this sample cannot be used\n",
+    "        return None\n",
+    "    parts = [emb]\n",
+    "    if USE_CLASSPROB:\n",
+    "        parts.append(probs5)\n",
+    "    parts.append(onehot_target(tgt))\n",
+    "    return np.concatenate(parts).astype(np.float32)\n",
+    "\n",
+    "emos_label = dict(zip(train_df[\"wavID\"], train_df[\"emos\"]))\n",
+    "X, y = [], []\n",
+    "for s in train_stems:\n",
+    "    f = build_feature(s, train_feat)\n",
+    "    if f is None or s not in emos_label:\n",
+    "        continue\n",
+    "    X.append(f); y.append(emos_label[s])\n",
+    "X = np.stack(X); y = np.array(y, dtype=np.float32)\n",
+    "FEAT_DIM = X.shape[1]\n",
+    "print(f\"Train: X={X.shape}  y={y.shape}  FEAT_DIM={FEAT_DIM}\")\n",
+    "\n",
+    "# Standardize features (z-score); keep mean/std so the exact same transform is applied when predicting DEV.\n",
+    "feat_mean = X.mean(0, keepdims=True)\n",
+    "feat_std  = X.std(0, keepdims=True) + 1e-6\n",
+    "Xn = (X - feat_mean) / feat_std"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02718889",
+   "metadata": {},
+   "source": [
+    "## 5. Model (MLP head) + training loop\n",
+    "Loss = MSE. Track **SRCC** on the internal validation set; keep the best model (early stopping)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f5c52fa4",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch, torch.nn as nn\n",
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device)\n",
+    "\n",
+    "Xtr, Xva, ytr, yva = train_test_split(Xn, y, test_size=VAL_FRAC, random_state=SEED)\n",
+    "Xtr_t = torch.tensor(Xtr, device=device); ytr_t = torch.tensor(ytr, device=device).unsqueeze(1)\n",
+    "Xva_t = torch.tensor(Xva, device=device); yva_t = torch.tensor(yva, device=device).unsqueeze(1)\n",
+    "\n",
+    "class EmosHead(nn.Module):\n",
+    "    def __init__(self, d_in, hidden, p):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(d_in, hidden), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(hidden, hidden // 2), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(hidden // 2, 1),\n",
+    "        )\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "model = EmosHead(FEAT_DIM, HIDDEN, DROPOUT).to(device)\n",
+    "opt = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)\n",
+    "lossf = nn.MSELoss()\n",
+    "\n",
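+    "def val_srcc():\n",
+    "    model.eval()\n",
+    "    with torch.no_grad():\n",
+    "        pred = model(Xva_t).cpu().numpy().ravel()\n",
+    "    rho = spearmanr(pred, yva).correlation\n",
+    "    return 0.0 if np.isnan(rho) else float(rho)   # NaN (constant predictions) counts as no correlation\n",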
+    "\n",
+    "best_srcc, best_state, bad = -1.0, None, 0\n",
+    "n = Xtr_t.shape[0]\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    model.train()\n",
+    "    perm = torch.randperm(n, device=device)\n",
+    "    tot = 0.0\n",
+    "    for i in range(0, n, BATCH):\n",
+    "        idx = perm[i:i + BATCH]\n",
+    "        opt.zero_grad()\n",
+    "        out = model(Xtr_t[idx])\n",
+    "        loss = lossf(out, ytr_t[idx])\n",
+    "        loss.backward(); opt.step()\n",
+    "        tot += loss.item() * len(idx)\n",
+    "    srcc = val_srcc()\n",
+    "    if srcc > best_srcc:\n",
+    "        best_srcc, best_state, bad = srcc, {k: v.cpu().clone() for k, v in model.state_dict().items()}, 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "    if ep % 5 == 0 or ep == 1:\n",
+    "        print(f\"epoch {ep:3d} | train MSE {tot/n:.4f} | val SRCC {srcc:.4f} | best {best_srcc:.4f}\")\n",
+    "    if bad >= PATIENCE:\n",
+    "        print(f\"Early stop at epoch {ep} (val SRCC has not improved for {PATIENCE} epochs).\")\n",
+    "        break\n",
+    "\n",
+    "model.load_state_dict(best_state)\n",
+    "print(f\"\\n✅ Best VAL SRCC = {best_srcc:.4f}  (exp01 baseline ≈ 0.194; compare here)\")\n",
+    "\n",
+    "# Save the model and normalization parameters for reuse / the system description.\n",
+    "torch.save({\"state\": best_state, \"feat_mean\": feat_mean, \"feat_std\": feat_std,\n",
+    "            \"EMB_DIM\": EMB_DIM, \"FEAT_DIM\": FEAT_DIM, \"USE_CLASSPROB\": USE_CLASSPROB,\n",
+    "            \"EMOTIONS5\": EMOTIONS5, \"val_srcc\": float(best_srcc)},\n",
+    "           os.path.join(OUT_DIR, \"emos_head.pt\"))\n",
+    "print(\"Saved\", os.path.join(OUT_DIR, \"emos_head.pt\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aef6f3ee",
+   "metadata": {},
+   "source": [
+    "## 6. Predict DEV → complete `answer.txt`\n",
+    "- **EMOS** = the trained head (needs the embedding and the target of each DEV wav).\n",
+    "- **CAT** = emotion2vec 5-class probabilities (already available from feature extraction).\n",
+    "- **QMOS** = SpeechMOS (UTMOS); it is mandatory, so it is computed here to make answer.txt valid."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8f94d7ab",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]   # .wav file names\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "# 6a. Extract emotion2vec features for DEV (separate cache)\n",
+    "dev_feat = extract_set(dev_stems, \"dev\")\n",
+    "\n",
+    "# 6b. EMOS from the trained head\n",
+    "def predict_emos(stem_id):\n",
+    "    f = build_feature(stem_id, dev_feat)\n",
+    "    if f is None:\n",
+    "        return None\n",
+    "    fn = (f[None, :] - feat_mean) / feat_std\n",
+    "    model.eval()\n",
+    "    with torch.no_grad():\n",
+    "        return float(model(torch.tensor(fn, dtype=torch.float32, device=device)).item())\n",
+    "\n",
+    "# 6c. QMOS = SpeechMOS\n",
+    "def run_qmos(names):\n",
+    "    import librosa\n",
+    "    predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\", trust_repo=True)\n",
+    "    out = {}\n",
+    "    from tqdm.auto import tqdm\n",
+    "    for n in tqdm(names, desc=\"QMOS\"):\n",
+    "        p = os.path.join(WAV_DIR, n)\n",
+    "        if not os.path.exists(p):\n",
+    "            continue\n",
+    "        wave, _ = librosa.load(p, sr=16000, mono=True)\n",
+    "        out[n] = float(predictor(torch.from_numpy(wave).unsqueeze(0), sr=16000).mean().item())\n",
+    "    return out\n",
+    "\n",
+    "qmos_scores = run_qmos(dev_names)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6a6680d0",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_emos = n_default = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT\\n\")\n",
+    "        for name in dev_names:\n",
+    "            sid = stem(name)\n",
+    "            emos = predict_emos(sid)\n",
+    "            if emos is None:\n",
+    "                emos = 3.0; n_default += 1\n",
+    "            else:\n",
+    "                n_emos += 1\n",
+    "            qmos = qmos_scores.get(name, 3.0)\n",
+    "            probs5 = dev_feat[sid][1] if sid in dev_feat else np.full(5, 0.2, dtype=np.float32)\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(probs5)}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | EMOS predicted {n_emos}, defaulted {n_default}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6179873f",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "30ee8626",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "!cd /kaggle/working && zip -j submission_track2_exp02.zip answer.txt && unzip -l submission_track2_exp02.zip\n",
+    "print(\"Ready to submit: /kaggle/working/submission_track2_exp02.zip\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "316e3e1f",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- The **VAL SRCC** printed in section 5 is an internal estimate (10% of train); compare it with the 0.194 baseline to see whether the head improves on it.\n",
+    "  The real DEV score is only known after submitting to CodaBench (My Submissions → Track 2, deselect the other tracks).\n",
+    "- For a quick trial: set `LIMIT_TRAIN = 300` in cell 0.\n",
+    "- Embeddings are cached in `/kaggle/working/emb_cache/` → use **Save Version** to keep them, so the next head-training run skips extraction.\n",
+    "- Next improvements: add QMOS/CAT/VAD heads sharing the backbone (the full multi-task exp02);\n",
+    "  try a wav2vec2/WavLM backbone; add a ranking loss; lightly fine-tune the backbone.\n",
+    "- Remember to log config → results → remarks in `docs/04_experiments_log.md` (exp02 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
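Reviewer note: the label aggregation and feature layout that the notebook above describes in sections 2 and 4 can be sketched in pure Python. This is only an illustrative sketch under stated assumptions: the wav IDs, ratings, and the 3-dimensional stand-in embedding are made up, and `mean_emos_per_wav` is a hypothetical helper name (the notebook itself uses a pandas `groupby` and NumPy arrays).

```python
from collections import defaultdict

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]

def mean_emos_per_wav(rows):
    """rows: iterable of (wav_id, emos) listener ratings -> {wav_id: mean emos}."""
    buckets = defaultdict(list)
    for wav_id, emos in rows:
        buckets[wav_id].append(float(emos))
    return {w: sum(v) / len(v) for w, v in buckets.items()}

def onehot_target(target):
    """5-dim one-hot vector for the target emotion (all zeros if unknown)."""
    return [1.0 if e == target else 0.0 for e in EMOTIONS5]

def build_feature(emb, probs5, target, use_classprob=True):
    """Concatenate [embedding | probs5 (optional) | one-hot target]."""
    parts = list(emb)
    if use_classprob:
        parts += list(probs5)
    return parts + onehot_target(target)

# Toy data (hypothetical): two listeners rated wav_a, one rated wav_b.
ratings = [("wav_a", 4), ("wav_a", 5), ("wav_b", 2)]
labels = mean_emos_per_wav(ratings)

# A 3-dim stand-in for the emotion2vec embedding plus 5 class probabilities.
feature = build_feature([0.1, -0.2, 0.3], [0.05, 0.8, 0.1, 0.03, 0.02], "happy")
print(labels)
print(len(feature), feature[-5:])
```

The same audio paired with a different target only changes the last five entries of the vector, which is what lets the head score one clip differently depending on the target emotion.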

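Reviewer note: section 5 of the notebook above trains with MSE but selects the checkpoint by SRCC. A minimal pure-Python sketch of Spearman rank correlation (average ranks for ties, then Pearson on the ranks) shows why the two can disagree: predictions on the wrong scale but in the right order still reach a perfect SRCC. The function names here are illustrative assumptions; the notebook itself calls `scipy.stats.spearmanr`.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def srcc(pred, gold):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(gold))

def mse(pred, gold):
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)

gold = [1.0, 2.0, 3.0, 4.0]
scaled = [10.0, 20.0, 30.0, 40.0]   # wrong scale, correct ordering
constant = [3.0, 3.0, 3.0, 3.0]     # no ranking information at all

print("scaled:   MSE", mse(scaled, gold), "SRCC", srcc(scaled, gold))
print("constant: MSE", mse(constant, gold), "SRCC", srcc(constant, gold))
```

Constant predictions give an undefined (NaN) correlation, so a checkpoint-selection loop that compares SRCC values needs to decide how to treat that case.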
track2/exp02_train_emos_pipeline.py ADDED

	@@ -0,0 +1,407 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp02 (trained EMOS head) on Kaggle
+#
+# **Goal:** train a model that predicts **EMOS** (how well the audio matches the target emotion) from the ~12,746
+# listener-labelled samples in `sets/train.csv`, aiming to **beat the 0.194 baseline** (exp01, offline).
+#
+# ## Idea (read once to get the picture)
+# EMOS depends on **both the audio AND the target emotion** (the same "happy"-sounding audio scores high when
+# target=happy and low when target=sad). The model therefore has to take both as input:
+#
+# ```
+# each wav ─► emotion2vec ─► (a) embedding, ~D dims        ┐
+#                            (b) 5 emotion probabilities   ├─► concat ─► MLP head ─► EMOS (1–5)
+#        target emotion ───► 5-dim one-hot                 ┘             (THE PART WE TRAIN)
+# ```
+#
+# - **The emotion2vec backbone is FROZEN** (not retrained) and only used to extract features. Light on GPU, and fine with little data.
+# - **Only a small MLP head is trained**; it learns the mapping `(features + target) → listener score`.
+# - **Gold label** = mean `eMOS` over all listeners for the same wav (grouped by `wavID`).
+# - Embeddings are **extracted once and cached to .npz** (12,746 files take a long time; re-extracting burns GPU hours).
+# - 10% of train is held out as an **internal validation set** to measure SRCC during training (DEV has no labels to self-score against).
+# - Finally, write a **complete** `answer.txt`: QMOS=SpeechMOS · CAT=emotion2vec · **EMOS=the trained head**, ready to submit.
+#
+# **How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On** → + Add Input with the Track 2
+# dataset (15,477 wavs, includes `sets/train.csv`) → edit `DATA_ROOT` in cell 0 → Run All.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os, glob, json, time
+# ── Track 2 data (assembled dataset of 15,477 wavs, includes sets/train.csv) ──
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header) → target emotion
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # listener labels: lisID,wavID,qMOS,emoCat,eMOS,val,dom,aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"     # wav list for the DEV set (the set to submit during the training phase)
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/emb_cache"        # where extracted embeddings are stored (reused across runs)
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── Training hyperparameters (change these to experiment) ────────────────────
+DEVICE        = "cuda"      # "cuda" on a Kaggle GPU; "cpu" if no GPU is available
+HIDDEN        = 256         # hidden-layer width of the MLP head
+DROPOUT       = 0.3
+LR            = 1e-3
+EPOCHS        = 60
+BATCH         = 64
+VAL_FRAC      = 0.10        # 10% of train → internal validation (measures SRCC)
+PATIENCE      = 12          # early stop: halt if val SRCC has not improved for N epochs
+SEED          = 42
+LIMIT_TRAIN   = None        # set a small number (e.g. 300) for a quick trial run; None = full
+USE_CLASSPROB = True        # add emotion2vec's 5 emotion probabilities to the features (the exp01 signal)
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    """Map any emotion label onto one of EMOTIONS5; None if nothing matches."""
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(path_or_name):
+    """Return the file name without extension so wavIDs match across train.csv / metadata / dev.scp."""
+    return os.path.splitext(os.path.basename(str(path_or_name)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install
+# %%
+# !pip install -q speechmos funasr librosa soundfile pandas scipy scikit-learn tqdm
+# %% [markdown]
+# ## 2. Read and aggregate labels
+# - `train.csv`: each row = one listener rating one wav → **average eMOS per wavID** = the gold label.
+# - `metadata.csv`: provides the **target emotion** of each wav (normalized to 5 classes).
+# %%
+import pandas as pd
+def load_target_emotions():
+    """metadata.csv (wavID|emotion|transcript, NO header) → {stem: normalized_emotion|None}."""
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def load_train_labels():
+    """train.csv → DataFrame [wavID(stem), emos], averaged per wav."""
+    df = pd.read_csv(TRAIN_CSV)
+    # Normalize column names (in case the capitalization differs)
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = cols.get("wavid") or cols.get("wav") or list(df.columns)[1]
+    emos_col = cols.get("emos") or cols.get("emo") or cols.get("emomos")
+    assert emos_col, f"eMOS column not found in train.csv (available columns: {list(df.columns)})"
+    g = df.groupby(df[wav_col].map(stem))[emos_col].mean()
+    out = g.reset_index()
+    out.columns = ["wavID", "emos"]
+    return out
+target_map = load_target_emotions()
+train_df = load_train_labels()
+print(f"Target emotions: {len(target_map)} | train wavs (aggregated): {len(train_df)}")
+print("eMOS stats:", train_df["emos"].describe()[["mean", "std", "min", "max"]].to_dict())
+train_df.head()
+# %% [markdown]
+# ## 3. Extract emotion2vec features (cached)
+# Each wav gets a single `generate(extract_embedding=True)` call that yields the **embedding** (for EMOS) and
+# the **5-class probabilities** (for CAT and as a feature). Results are cached to `.npz` so later runs skip extraction.
+# %%
+import numpy as np
+_e2v_model = None
+def get_e2v():
+    global _e2v_model
+    if _e2v_model is None:
+        from funasr import AutoModel
+        _e2v_model = AutoModel(model="iic/emotion2vec_plus_large", hub="hf")
+    return _e2v_model
+def extract_one(wav_path):
+    """→ (emb: np.float32[D], probs5: np.float32[5], sums to 1). None if the wav file is missing."""
+    if not os.path.exists(wav_path):
+        return None
+    rec = get_e2v().generate(wav_path, granularity="utterance", extract_embedding=True)
+    r = rec[0]
+    emb = np.asarray(r["feats"], dtype=np.float32).reshape(-1)
+    probs = {e: 0.0 for e in EMOTIONS5}
+    for lab, sc in zip(r["labels"], r["scores"]):
+        name = lab.split("/")[-1]
+        if name in probs:
+            probs[name] = float(sc)
+    tot = sum(probs.values())
+    if tot > 0:
+        probs = {k: v / tot for k, v in probs.items()}
+    probs5 = np.array([probs[e] for e in EMOTIONS5], dtype=np.float32)
+    return emb, probs5
+def extract_set(stems, tag):
+    """Extract (or load from cache) features for a list of stems. Returns dict {stem: (emb, probs5)}.
+    The cache lives at CACHE_DIR/<tag>.npz; stems already cached are skipped, so an interrupted run can resume."""
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[{tag}] loaded cache: {len(store)} samples")
+    todo = [s for s in stems if s not in store]
+    if not todo:
+        print(f"[{tag}] cache complete, skipping extraction.")
+    else:
+        miss = 0
+        for i, s in enumerate(tqdm(todo, desc=f"extract {tag}")):
+            res = extract_one(os.path.join(WAV_DIR, s + ".wav"))
+            if res is None:
+                miss += 1
+                continue
+            emb, probs5 = res
+            store[s] = np.concatenate([emb, probs5]).astype(np.float32)  # [D + 5]
+            if (i + 1) % 500 == 0:   # periodic cache save in case the session is interrupted
+                np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        if miss:
+            print(f"[{tag}] {miss} files missing → skipped.")
+        print(f"[{tag}] cache total: {len(store)} samples → {cache_path}")
+    # split back into (emb, probs5)
+    out = {}
+    for s, vec in store.items():
+        out[s] = (vec[:-5], vec[-5:])
+    return out
+# Extract features for the train set
+train_stems = list(train_df["wavID"])
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+train_feat = extract_set(train_stems, "train")
+EMB_DIM = next(iter(train_feat.values()))[0].shape[0]
+print("EMB_DIM =", EMB_DIM)
+# %% [markdown]
+# ## 4. Build features and labels for training
+# Per-wav feature = `[embedding | (probs5 if enabled) | one-hot target(5)]`. Wavs missing a target or features are dropped.
+# %%
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+def build_feature(stem_id, feat_map):
+    pack = feat_map.get(stem_id)
+    if pack is None:
+        return None
+    emb, probs5 = pack
+    tgt = target_map.get(stem_id)
+    if tgt is None:           # unknown target emotion → this sample cannot be used
+        return None
+    parts = [emb]
+    if USE_CLASSPROB:
+        parts.append(probs5)
+    parts.append(onehot_target(tgt))
+    return np.concatenate(parts).astype(np.float32)
+emos_label = dict(zip(train_df["wavID"], train_df["emos"]))
+X, y = [], []
+for s in train_stems:
+    f = build_feature(s, train_feat)
+    if f is None or s not in emos_label:
+        continue
+    X.append(f); y.append(emos_label[s])
+X = np.stack(X); y = np.array(y, dtype=np.float32)
+FEAT_DIM = X.shape[1]
+print(f"Train: X={X.shape}  y={y.shape}  FEAT_DIM={FEAT_DIM}")
+# Standardize features (z-score); keep mean/std so the exact same transform is applied when predicting DEV.
+feat_mean = X.mean(0, keepdims=True)
+feat_std  = X.std(0, keepdims=True) + 1e-6
+Xn = (X - feat_mean) / feat_std
+# %% [markdown]
+# ## 5. Model (MLP head) + training loop
+# Loss = MSE. Track **SRCC** on the internal validation set; keep the best model (early stopping).
+# %%
+import torch, torch.nn as nn
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+torch.manual_seed(SEED); np.random.seed(SEED)
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device)
+Xtr, Xva, ytr, yva = train_test_split(Xn, y, test_size=VAL_FRAC, random_state=SEED)
+Xtr_t = torch.tensor(Xtr, device=device); ytr_t = torch.tensor(ytr, device=device).unsqueeze(1)
+Xva_t = torch.tensor(Xva, device=device); yva_t = torch.tensor(yva, device=device).unsqueeze(1)
+class EmosHead(nn.Module):
+    def __init__(self, d_in, hidden, p):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(d_in, hidden), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(hidden, hidden // 2), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(hidden // 2, 1),
+        )
+    def forward(self, x):
+        return self.net(x)
+model = EmosHead(FEAT_DIM, HIDDEN, DROPOUT).to(device)
+opt = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)
+lossf = nn.MSELoss()
+def val_srcc():
+    model.eval()
+    with torch.no_grad():
+        pred = model(Xva_t).cpu().numpy().ravel()
+    rho = spearmanr(pred, yva).correlation
+    return 0.0 if np.isnan(rho) else float(rho)   # NaN (constant predictions) counts as no correlation
+best_srcc, best_state, bad = -1.0, None, 0
+n = Xtr_t.shape[0]
+for ep in range(1, EPOCHS + 1):
+    model.train()
+    perm = torch.randperm(n, device=device)
+    tot = 0.0
+    for i in range(0, n, BATCH):
+        idx = perm[i:i + BATCH]
+        opt.zero_grad()
+        out = model(Xtr_t[idx])
+        loss = lossf(out, ytr_t[idx])
+        loss.backward(); opt.step()
+        tot += loss.item() * len(idx)
+    srcc = val_srcc()
+    if srcc > best_srcc:
+        best_srcc, best_state, bad = srcc, {k: v.cpu().clone() for k, v in model.state_dict().items()}, 0
+    else:
+        bad += 1
+    if ep % 5 == 0 or ep == 1:
+        print(f"epoch {ep:3d} | train MSE {tot/n:.4f} | val SRCC {srcc:.4f} | best {best_srcc:.4f}")
+    if bad >= PATIENCE:
+        print(f"Early stop at epoch {ep} (val SRCC did not improve for {PATIENCE} epochs).")
+        break
+model.load_state_dict(best_state)
+print(f"\n✅ Best VAL SRCC = {best_srcc:.4f}  (baseline exp01 ≈ 0.194; compare against it)")
+# Save the model and the normalization parameters for reuse and for the system description.
+torch.save({"state": best_state, "feat_mean": feat_mean, "feat_std": feat_std,
+            "EMB_DIM": EMB_DIM, "FEAT_DIM": FEAT_DIM, "USE_CLASSPROB": USE_CLASSPROB,
+            "EMOTIONS5": EMOTIONS5, "val_srcc": float(best_srcc)},
+           os.path.join(OUT_DIR, "emos_head.pt"))
+print("Saved", os.path.join(OUT_DIR, "emos_head.pt"))
+# %% [markdown]
+# ## 6. Predict DEV → full `answer.txt`
+# - **EMOS** = the head trained above (needs the embedding and the target of each DEV wav).
+# - **CAT** = emotion2vec 5-class probabilities (already available from feature extraction).
+# - **QMOS** = SpeechMOS (UTMOS); required, and run here so that answer.txt is valid.
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]   # .wav file names
+dev_names = list_dev()
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+# 6a. Extract emotion2vec features for DEV (separate cache)
+dev_feat = extract_set(dev_stems, "dev")
+# 6b. EMOS from the trained head
+def predict_emos(stem_id):
+    f = build_feature(stem_id, dev_feat)
+    if f is None:
+        return None
+    fn = (f[None, :] - feat_mean) / feat_std
+    model.eval()
+    with torch.no_grad():
+        return float(model(torch.tensor(fn, dtype=torch.float32, device=device)).item())
+# 6c. QMOS = SpeechMOS
+def run_qmos(names):
+    import librosa
+    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)
+    out = {}
+    from tqdm.auto import tqdm
+    for n in tqdm(names, desc="QMOS"):
+        p = os.path.join(WAV_DIR, n)
+        if not os.path.exists(p):
+            continue
+        wave, _ = librosa.load(p, sr=16000, mono=True)
+        out[n] = float(predictor(torch.from_numpy(wave).unsqueeze(0), sr=16000).mean().item())
+    return out
+qmos_scores = run_qmos(dev_names)
+# %%
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_emos = n_default = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT\n")
+        for name in dev_names:
+            sid = stem(name)
+            emos = predict_emos(sid)
+            if emos is None:
+                emos = 3.0; n_default += 1
+            else:
+                n_emos += 1
+            qmos = qmos_scores.get(name, 3.0)
+            probs5 = dev_feat[sid][1] if sid in dev_feat else np.full(5, 0.2, dtype=np.float32)
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(probs5)}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | real EMOS {n_emos}, default {n_default}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+# !cd /kaggle/working && zip -j submission_track2_exp02.zip answer.txt && unzip -l submission_track2_exp02.zip
+print("Ready to submit: /kaggle/working/submission_track2_exp02.zip")
+# %% [markdown]
+# ## Notes
+# - The **VAL SRCC** printed in section 5 is an internal estimate (10% of train); compare it with the 0.194 baseline to see whether the head does better.
+#   The real DEV score is only known after submitting to CodaBench (My Submissions → Track 2, deselect the other tracks).
+# - For a quick trial, set `LIMIT_TRAIN = 300` in cell 0.
+# - Embeddings are cached in `/kaggle/working/emb_cache/` → use **Save Version** to keep them, so the next head-training run does not re-extract them.
+# - Next improvements: add QMOS/CAT/VAD heads on a shared backbone (the full multi-task exp02);
+#   try a wav2vec2/WavLM backbone; add a ranking loss; lightly fine-tune the backbone.
+# - Remember to record config → results → remarks in `docs/04_experiments_log.md` (section exp02).
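
The `validate` function in section 7 of the exp02 script checks only the header and the column count. A stricter check could also parse the `CAT` field and reject non-finite scores. The following is a minimal standard-library sketch, not the project's validator: `check_answer_text`, its `tol` parameter, and the sample rows are hypothetical, and the expectation that the `CAT` probabilities sum to 1 is an assumption about the submission format.

```python
import csv
import io
import math

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]

def check_answer_text(text, tol=1e-3):
    """Return a list of problems found in answer.txt-style CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0][:4] != ["wav", "QMOS", "EMOS", "CAT"]:
        return ["unexpected or missing header"]
    header = rows[0]
    problems = []
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {lineno}: expected {len(header)} columns, got {len(row)}")
            continue
        rec = dict(zip(header, row))
        # Scores must parse as finite floats (NaN and inf are rejected).
        for col in ("QMOS", "EMOS"):
            try:
                ok = math.isfinite(float(rec[col]))
            except ValueError:
                ok = False
            if not ok:
                problems.append(f"line {lineno}: {col} is not a finite number")
        # CAT is expected as "angry:p|happy:p|neutral:p|sad:p|surprised:p".
        try:
            pairs = [item.split(":") for item in rec["CAT"].split("|")]
            labels = [p[0] for p in pairs]
            probs = [float(p[1]) for p in pairs]
        except (IndexError, ValueError):
            problems.append(f"line {lineno}: CAT is malformed")
            continue
        if labels != EMOTIONS5:
            problems.append(f"line {lineno}: CAT labels are {labels}")
        elif abs(sum(probs) - 1.0) > tol:
            problems.append(f"line {lineno}: CAT probabilities sum to {sum(probs):.4g}")
    return problems

# Made-up rows: the first is well formed, the second has a NaN score and a bad CAT sum.
sample = (
    "wav,QMOS,EMOS,CAT\n"
    "a.wav,3.2,4.1,angry:0.1|happy:0.6|neutral:0.2|sad:0.05|surprised:0.05\n"
    "b.wav,nan,3,angry:0.5|happy:0.5|neutral:0.5|sad:0|surprised:0\n"
)
for problem in check_answer_text(sample):
    print(problem)
```

Returning a list of problems rather than asserting on the first one makes it easier to see every bad row in a single pass before zipping the submission.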

track2/exp03_emos_sailer.ipynb ADDED

	@@ -0,0 +1,392 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a6ae46f8",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2 - exp03 (EMOS with SAILER, offline) - Kaggle\n",
+    "\n",
+    "**Goal:** score **EMOS** (how well the audio matches the target emotion) with the **SAILER** model\n",
+    "(`tiantiaf/wavlm-large-categorical-emotion`, winner of the Interspeech 2025 SER challenge),\n",
+    "in place of emotion2vec. NO training: it only reads the probability of the target emotion class.\n",
+    "\n",
+    "## Idea (read once to get the picture)\n",
+    "SAILER takes one wav → outputs **logits over 9 emotion classes** → softmax → **per-class probabilities**.\n",
+    "EMOS measures the match with the target emotion → take **P(target emotion)** directly and map it to the 1–5 scale:\n",
+    "\n",
+    "```\n",
+    "per wav ─► SAILER (WavLM-large) ─► softmax 9-way ─┬─► P(target)  ─► EMOS = 1 + 4·P\n",
+    "                                                  └─► 5-way (renorm) ─► CAT\n",
+    "      target emotion (metadata.csv) ─────────────────┘\n",
+    "```\n",
+    "\n",
+    "- **The 9 SAILER classes:** `Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, Other`.\n",
+    "  → covers all 5 challenge classes (angry/happy/neutral/sad/surprised).\n",
+    "- **EMOS** = `1 + 4·P(target)` (scale [0,1]→[1,5]); SRCC is invariant to this increasing linear rescaling.\n",
+    "- **CAT** = the 5 challenge-class probabilities taken from SAILER itself (renormalized to sum to 1).\n",
+    "- **VAD** = the arousal/valence/dominance values SAILER already outputs (sigmoid 0–1 → 1–5) → one model handles EMOS+CAT+VAD!\n",
+    "- **QMOS** = SpeechMOS (UTMOS); required for `answer.txt` to be valid.\n",
+    "- NO training → can be submitted right away. Compare the EMOS score with the emotion2vec baseline (0.194) and exp01.\n",
+    "\n",
+    "**How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On**\n",
+    "→ + Add Input: the Track 2 dataset (15,477 wav files, with `sets/dev.scp` and `metadata.csv`)\n",
+    "→ edit `DATA_ROOT` in cell 0 → Run All.\n",
+    "\n",
+    "⚠️ SAILER license = **Open RAIL** (non-commercial) → must be declared in `docs/12_system_description.md`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f25f6ed7",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4cb1fd8c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# ── Track 2 data (assembled dataset of 15,477 wav files) ────────────────────\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match your Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header) → target emotion\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"     # list of DEV-set wavs (the set to submit during the training phase)\n",
+    "\n",
+    "OUT_DIR = \"/kaggle/working\"\n",
+    "\n",
+    "DEVICE      = \"cuda\"        # \"cuda\" on a Kaggle GPU; \"cpu\" if no GPU is available\n",
+    "MAX_SECONDS = 15           # SAILER accepts at most 15 s (model limit)\n",
+    "SR          = 16000        # SAILER needs 16 kHz mono\n",
+    "LIMIT       = None          # set a small number (e.g. 20) for a quick trial run; None = full DEV\n",
+    "\n",
+    "# The 5 challenge emotion classes (fixed order for the CAT column)\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "# The 9 SAILER classes (in the model's output order) + the indices of the 5 challenge classes among them\n",
+    "SAILER9 = [\"Anger\", \"Contempt\", \"Disgust\", \"Fear\", \"Happiness\", \"Neutral\", \"Sadness\", \"Surprise\", \"Other\"]\n",
+    "EMO2SAILER = {\"angry\": 0, \"happy\": 4, \"neutral\": 5, \"sad\": 6, \"surprised\": 7}   # EMOTIONS5 → index in SAILER9\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    \"\"\"Map an arbitrary emotion label to one of EMOTIONS5; None if there is no match.\"\"\"\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(path_or_name):\n",
+    "    return os.path.splitext(os.path.basename(str(path_or_name)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "18c48274",
+   "metadata": {},
+   "source": [
+    "## 1. Setup + fetch the SAILER code\n",
+    "SAILER needs the `WavLMWrapper` class from the `vox-profile-release` repo.\n",
+    "⚠️ Do **NOT** run `pip install -e .` (building the repo's wheel often fails on Kaggle). Instead,\n",
+    "just **clone the repo and add it to `sys.path`**, then install only the few libraries the model needs\n",
+    "(`transformers/torch/huggingface_hub` are preinstalled on Kaggle; only `loralib` and `speechbrain` are missing)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd8f98d9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "\n",
+    "# Deps that WavLMWrapper needs (see the imports in src/model/emotion/wavlm_emotion.py) + the libraries for QMOS scoring.\n",
+    "pip_install(\"loralib\", \"speechbrain\", \"speechmos\", \"librosa\", \"soundfile\", \"scipy\", \"tqdm\")\n",
+    "\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)     # so that `from src.model.emotion... import WavLMWrapper` works"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00a49544",
+   "metadata": {},
+   "source": [
+    "## 2. Load the SAILER model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2756567f",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device)\n",
+    "\n",
+    "from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "\n",
+    "sailer = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\").to(device)\n",
+    "sailer.eval()\n",
+    "print(\"✅ Loaded SAILER (wavlm-large-categorical-emotion)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c7b6e84",
+   "metadata": {},
+   "source": [
+    "## 3. Read the target emotion for each wav (from metadata.csv)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8e6f4b36",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def load_target_emotions():\n",
+    "    \"\"\"metadata.csv (wavID|emotion|transcript, NO header) → {stem: normalized_emotion|None}.\"\"\"\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "print(f\"Target emotions: {len(target_map)} wav | examples:\", dict(list(target_map.items())[:3]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7467cfc6",
+   "metadata": {},
+   "source": [
+    "## 4. Score one wav with SAILER → 9-class probabilities + VAD\n",
+    "With `return_feature=True`, WavLMWrapper returns **6 values**:\n",
+    "`predicted (9-class logits), features, detailed_logits, arousal, valence, dominance` (VAD as sigmoid outputs in 0–1).\n",
+    "→ one model handles **EMOS** (P target), **CAT** (5 classes, renormalized) **and VAD** (filling the 3 columns that were empty!)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d3bb3d01",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import librosa\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def sailer_infer(wav_path):\n",
+    "    \"\"\"→ (probs9: float32[9], vad3: float32[3] in the order [VAL,ARO,DOM] on the 1–5 scale);\n",
+    "       None if the file is missing.\"\"\"\n",
+    "    if not os.path.exists(wav_path):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(wav_path, sr=SR, mono=True)\n",
+    "    wave = wave[: MAX_SECONDS * SR]                       # truncate to at most 15 s\n",
+    "    data = torch.from_numpy(wave).float().unsqueeze(0).to(device)\n",
+    "    logits, _feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)\n",
+    "    probs9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)\n",
+    "    # VAD sigmoid [0,1] → 1–5 scale to match the organizers' example (SRCC is invariant to an increasing linear rescaling)\n",
+    "    v, a, d = float(valence.item()), float(arousal.item()), float(dominance.item())\n",
+    "    vad3 = np.array([1 + 4 * v, 1 + 4 * a, 1 + 4 * d], dtype=np.float32)   # [VAL, ARO, DOM]\n",
+    "    return probs9, vad3\n",
+    "\n",
+    "def emos_from_probs(probs9, target):\n",
+    "    \"\"\"EMOS = 1 + 4·P(target). None if the target is unknown → the caller applies the default.\"\"\"\n",
+    "    if target is None or target not in EMO2SAILER:\n",
+    "        return None\n",
+    "    return 1.0 + 4.0 * float(probs9[EMO2SAILER[target]])\n",
+    "\n",
+    "def cat5_from_probs(probs9):\n",
+    "    \"\"\"Take the 5 challenge classes from SAILER's 9 classes and renormalize them to sum to 1.\"\"\"\n",
+    "    v = np.array([probs9[EMO2SAILER[e]] for e in EMOTIONS5], dtype=np.float32)\n",
+    "    s = v.sum()\n",
+    "    return v / s if s > 0 else np.full(5, 0.2, dtype=np.float32)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14f8e54a",
+   "metadata": {},
+   "source": [
+    "## 5. QMOS = SpeechMOS (UTMOS); required for answer.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "992bd84b",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "@torch.no_grad()\n",
+    "def run_qmos(names):\n",
+    "    predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\", trust_repo=True).to(device).eval()\n",
+    "    from tqdm.auto import tqdm\n",
+    "    out = {}\n",
+    "    for n in tqdm(names, desc=\"QMOS\"):\n",
+    "        p = os.path.join(WAV_DIR, n)\n",
+    "        if not os.path.exists(p):\n",
+    "            continue\n",
+    "        wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "        x = torch.from_numpy(wave).unsqueeze(0).to(device)   # move the input to the selected device\n",
+    "        out[n] = float(predictor(x, sr=SR).mean().item())\n",
+    "    return out"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58afec1d",
+   "metadata": {},
+   "source": [
+    "## 6. Run on DEV → full `answer.txt` (QMOS, EMOS, CAT, VAL, ARO, DOM)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "77b1eb8c",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT:\n",
+    "    dev_names = dev_names[:LIMIT]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "qmos_scores = run_qmos(dev_names)\n",
+    "\n",
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    n_emos = n_default = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"SAILER EMOS/CAT/VAD\"):\n",
+    "            sid = stem(name)\n",
+    "            out = sailer_infer(os.path.join(WAV_DIR, name))\n",
+    "            if out is None:\n",
+    "                emos, cat5 = 3.0, np.full(5, 0.2, dtype=np.float32)\n",
+    "                vad3 = np.array([3.0, 3.0, 3.0], dtype=np.float32)\n",
+    "                n_default += 1\n",
+    "            else:\n",
+    "                probs9, vad3 = out\n",
+    "                emos = emos_from_probs(probs9, target_map.get(sid))\n",
+    "                if emos is None:\n",
+    "                    emos = 3.0; n_default += 1\n",
+    "                else:\n",
+    "                    n_emos += 1\n",
+    "                cat5 = cat5_from_probs(probs9)\n",
+    "            qmos = qmos_scores.get(name, 3.0)\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},\"\n",
+    "                    f\"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | real EMOS {n_emos}, default {n_default}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e3efd8d",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f816cbb2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Line {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp03_sailer.zip answer.txt && unzip -l submission_track2_exp03_sailer.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp03_sailer.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ca84ef6",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **This has never actually been run** → for the first run, set `LIMIT = 20` in cell 0 to catch setup errors (repo clone / import / model).\n",
+    "- The real DEV score is only known after submitting to CodaBench (My Submissions → Track 2, deselect the other tracks).\n",
+    "- This notebook switches **EMOS + CAT + VAD** to SAILER (one model covers 6 metric columns). QMOS still uses the existing SpeechMOS.\n",
+    "  For a clean EMOS ablation (keeping CAT = emotion2vec), take only the EMOS column from here and combine it with the CAT column of `track2_baseline`.\n",
+    "- The only setup risk is importing `src.model.emotion.wavlm_emotion` (it requires the vox-profile-release repo).\n",
+    "  If the import fails: check that `REPO_DIR` has been cloned and that `sys.path` contains REPO_DIR (do NOT use pip install -e .).\n",
+    "- Remember to record config → results → remarks in `docs/04_experiments_log.md` (section exp03)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
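
The EMOS and CAT mappings from section 4 of the notebook above can be checked by hand on a single probability vector. The sketch below restates `emos_from_probs` and `cat5_from_probs` with plain Python lists instead of NumPy arrays; the `probs9` values are made up for illustration and are not SAILER output.

```python
# Class order and index mapping as defined in cell 0 of the notebook.
SAILER9 = ["Anger", "Contempt", "Disgust", "Fear", "Happiness",
           "Neutral", "Sadness", "Surprise", "Other"]
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
EMO2SAILER = {"angry": 0, "happy": 4, "neutral": 5, "sad": 6, "surprised": 7}

def emos_from_probs(probs9, target):
    """Map P(target) in [0, 1] linearly onto the 1-5 EMOS scale."""
    if target not in EMO2SAILER:
        return None  # unknown target: the caller falls back to a default score
    return 1.0 + 4.0 * probs9[EMO2SAILER[target]]

def cat5_from_probs(probs9):
    """Keep the five challenge classes and rescale them to sum to 1."""
    kept = [probs9[EMO2SAILER[e]] for e in EMOTIONS5]
    total = sum(kept)
    if total <= 0:
        return [0.2] * 5  # uniform fallback, as in the notebook
    return [p / total for p in kept]

# Made-up 9-class softmax output (sums to 1); "Happiness" gets 0.50.
probs9 = [0.05, 0.10, 0.05, 0.00, 0.50, 0.20, 0.05, 0.05, 0.00]

emos = emos_from_probs(probs9, "happy")   # 1 + 4 * 0.50
cat5 = cat5_from_probs(probs9)            # five kept classes divided by 0.85

print(f"EMOS = {emos:.2f}")
print("CAT  =", "|".join(f"{e}:{p:.4g}" for e, p in zip(EMOTIONS5, cat5)))
```

Note that renormalization discards the mass SAILER assigns to Contempt, Disgust, Fear and Other (0.15 in this example), so a CAT probability can be larger than the raw P(target) used for EMOS.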

track2/exp03_emos_sailer_pipeline.py ADDED

	@@ -0,0 +1,264 @@

+# %% [markdown]
+# # VMC2026 Track 2 - exp03 (EMOS with SAILER, offline) - Kaggle
+#
+# **Goal:** score **EMOS** (how well the audio matches the target emotion) with the **SAILER** model
+# (`tiantiaf/wavlm-large-categorical-emotion`, winner of the Interspeech 2025 SER challenge),
+# in place of emotion2vec. NO training: it only reads the probability of the target emotion class.
+#
+# ## Idea (read once to get the picture)
+# SAILER takes one wav → outputs **logits over 9 emotion classes** → softmax → **per-class probabilities**.
+# EMOS measures the match with the target emotion → take **P(target emotion)** directly and map it to the 1–5 scale:
+#
+# ```
+# per wav ─► SAILER (WavLM-large) ─► softmax 9-way ─┬─► P(target)  ─► EMOS = 1 + 4·P
+#                                                   └─► 5-way (renorm) ─► CAT
+#       target emotion (metadata.csv) ─────────────────┘
+# ```
+#
+# - **The 9 SAILER classes:** `Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, Other`.
+#   → covers all 5 challenge classes (angry/happy/neutral/sad/surprised).
+# - **EMOS** = `1 + 4·P(target)` (scale [0,1]→[1,5]); SRCC is invariant to this increasing linear rescaling.
+# - **CAT** = the 5 challenge-class probabilities taken from SAILER itself (renormalized to sum to 1).
+# - **VAD** = the arousal/valence/dominance values SAILER already outputs (sigmoid 0–1 → 1–5) → one model handles EMOS+CAT+VAD!
+# - **QMOS** = SpeechMOS (UTMOS); required for `answer.txt` to be valid.
+# - NO training → can be submitted right away. Compare the EMOS score with the emotion2vec baseline (0.194) and exp01.
+#
+# **How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On**
+# → + Add Input: the Track 2 dataset (15,477 wav files, with `sets/dev.scp` and `metadata.csv`)
+# → edit `DATA_ROOT` in cell 0 → Run All.
+#
+# ⚠️ SAILER license = **Open RAIL** (non-commercial) → must be declared in `docs/12_system_description.md`.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+# ── Track 2 data (assembled dataset of 15,477 wav files) ────────────────────
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header) → target emotion
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"     # list of DEV-set wavs (the set to submit during the training phase)
+OUT_DIR = "/kaggle/working"
+DEVICE      = "cuda"        # "cuda" on a Kaggle GPU; "cpu" if no GPU is available
+MAX_SECONDS = 15           # SAILER accepts at most 15 s (model limit)
+SR          = 16000        # SAILER needs 16 kHz mono
+LIMIT       = None          # set a small number (e.g. 20) for a quick trial run; None = full DEV
+# The 5 challenge emotion classes (fixed order for the CAT column)
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+# The 9 SAILER classes (in the model's output order) + the indices of the 5 challenge classes among them
+SAILER9 = ["Anger", "Contempt", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise", "Other"]
+EMO2SAILER = {"angry": 0, "happy": 4, "neutral": 5, "sad": 6, "surprised": 7}   # EMOTIONS5 → index in SAILER9
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    """Map an arbitrary emotion label to one of EMOTIONS5; None if there is no match."""
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(path_or_name):
+    return os.path.splitext(os.path.basename(str(path_or_name)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Setup + fetch the SAILER code
+# SAILER needs the `WavLMWrapper` class from the `vox-profile-release` repo.
+# ⚠️ Do **NOT** run `pip install -e .` (building the repo's wheel often fails on Kaggle). Instead,
+# just **clone the repo and add it to `sys.path`**, then install only the few libraries the model needs
+# (`transformers/torch/huggingface_hub` are preinstalled on Kaggle; only `loralib` and `speechbrain` are missing).
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+# Deps that WavLMWrapper needs (see the imports in src/model/emotion/wavlm_emotion.py) + the libraries for QMOS scoring.
+pip_install("loralib", "speechbrain", "speechmos", "librosa", "soundfile", "scipy", "tqdm")
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)     # so that `from src.model.emotion... import WavLMWrapper` works
+# %% [markdown]
+# ## 2. Load the SAILER model
+# %%
+import torch
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device)
+from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+sailer = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion").to(device)
+sailer.eval()
+print("✅ Loaded SAILER (wavlm-large-categorical-emotion)")
+# %% [markdown]
+# ## 3. Read the target emotion for each wav (from metadata.csv)
+# %%
+def load_target_emotions():
+    """metadata.csv (wavID|emotion|transcript, NO header) → {stem: normalized_emotion|None}."""
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+target_map = load_target_emotions()
+print(f"Target emotions: {len(target_map)} wav | examples:", dict(list(target_map.items())[:3]))
+# %% [markdown]
+# ## 4. Score one wav with SAILER → 9-class probabilities + VAD
+# With `return_feature=True`, WavLMWrapper returns **6 values**:
+# `predicted (9-class logits), features, detailed_logits, arousal, valence, dominance` (VAD as sigmoid outputs in 0–1).
+# → one model handles **EMOS** (P target), **CAT** (5 classes, renormalized) **and VAD** (filling the 3 columns that were empty!).
+# %%
+import numpy as np
+import librosa
+@torch.no_grad()
+def sailer_infer(wav_path):
+    """→ (probs9: float32[9], vad3: float32[3] in the order [VAL,ARO,DOM] on the 1–5 scale);
+       None if the file is missing."""
+    if not os.path.exists(wav_path):
+        return None
+    wave, _ = librosa.load(wav_path, sr=SR, mono=True)
+    wave = wave[: MAX_SECONDS * SR]                       # truncate to at most 15 s
+    data = torch.from_numpy(wave).float().unsqueeze(0).to(device)
+    logits, _feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)
+    probs9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)
+    # VAD sigmoid [0,1] → 1–5 scale to match the organizers' example (SRCC is invariant to an increasing linear rescaling)
+    v, a, d = float(valence.item()), float(arousal.item()), float(dominance.item())
+    vad3 = np.array([1 + 4 * v, 1 + 4 * a, 1 + 4 * d], dtype=np.float32)   # [VAL, ARO, DOM]
+    return probs9, vad3
+def emos_from_probs(probs9, target):
+    """EMOS = 1 + 4·P(target). None if the target is unknown → the caller applies the default."""
+    if target is None or target not in EMO2SAILER:
+        return None
+    return 1.0 + 4.0 * float(probs9[EMO2SAILER[target]])
+def cat5_from_probs(probs9):
+    """Take the 5 challenge classes from SAILER's 9 classes and renormalize them to sum to 1."""
+    v = np.array([probs9[EMO2SAILER[e]] for e in EMOTIONS5], dtype=np.float32)
+    s = v.sum()
+    return v / s if s > 0 else np.full(5, 0.2, dtype=np.float32)
+# %% [markdown]
+# ## 5. QMOS = SpeechMOS (UTMOS); required for answer.txt
+# %%
+@torch.no_grad()
+def run_qmos(names):
+    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True).to(device).eval()
+    from tqdm.auto import tqdm
+    out = {}
+    for n in tqdm(names, desc="QMOS"):
+        p = os.path.join(WAV_DIR, n)
+        if not os.path.exists(p):
+            continue
+        wave, _ = librosa.load(p, sr=SR, mono=True)
+        x = torch.from_numpy(wave).unsqueeze(0).to(device)   # move the input to the selected device
+        out[n] = float(predictor(x, sr=SR).mean().item())
+    return out
+# %% [markdown]
+# ## 6. Run on DEV → full `answer.txt` (QMOS, EMOS, CAT, VAL, ARO, DOM)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT:
+    dev_names = dev_names[:LIMIT]
+print("DEV:", len(dev_names), "samples")
+qmos_scores = run_qmos(dev_names)
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    from tqdm.auto import tqdm
+    n_emos = n_default = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="SAILER EMOS/CAT/VAD"):
+            sid = stem(name)
+            out = sailer_infer(os.path.join(WAV_DIR, name))
+            if out is None:
+                emos, cat5 = 3.0, np.full(5, 0.2, dtype=np.float32)
+                vad3 = np.array([3.0, 3.0, 3.0], dtype=np.float32)
+                n_default += 1
+            else:
+                probs9, vad3 = out
+                emos = emos_from_probs(probs9, target_map.get(sid))
+                if emos is None:
+                    emos = 3.0; n_default += 1
+                else:
+                    n_emos += 1
+                cat5 = cat5_from_probs(probs9)
+            qmos = qmos_scores.get(name, 3.0)
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
+                    f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | real EMOS {n_emos}, default {n_default}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp03_sailer.zip answer.txt && unzip -l submission_track2_exp03_sailer.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp03_sailer.zip"))
+# %% [markdown]
+# ## Notes
+# - **This has never actually been run** → for the first run, set `LIMIT = 20` in cell 0 to catch setup errors (repo clone / import / model).
+# - The real DEV score is only known after submitting to CodaBench (My Submissions → Track 2, deselect the other tracks).
+# - This notebook switches **EMOS + CAT + VAD** to SAILER (one model covers 6 metric columns). QMOS still uses the existing SpeechMOS.
+#   For a clean EMOS ablation (keeping CAT = emotion2vec), take only the EMOS column from here and combine it with the CAT column of `track2_baseline`.
+# - The only setup risk is importing `src.model.emotion.wavlm_emotion` (it requires the vox-profile-release repo).
+#   If the import fails: check that `REPO_DIR` has been cloned and that `sys.path` contains REPO_DIR (do NOT use pip install -e .).
+# - Remember to record config → results → remarks in `docs/04_experiments_log.md` (section exp03).
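
Several comments in the script above rely on SRCC being unchanged by the mapping `1 + 4·P`. This holds because Spearman correlation depends only on ranks, and an increasing linear map preserves ranks. The following is a small standard-library sketch of that property; the helper names `average_ranks`, `pearson`, and `spearman` and all sample numbers are illustrative assumptions (the exp02 script computes SRCC with `scipy.stats.spearmanr` instead).

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

probs = [0.10, 0.80, 0.35, 0.60, 0.05]   # made-up P(target) values
human = [2.0, 4.5, 4.0, 3.0, 1.5]        # made-up listener scores
emos = [1 + 4 * p for p in probs]        # rescaled to the 1-5 range

print("SRCC on raw probabilities:", spearman(probs, human))
print("SRCC after 1 + 4*P:       ", spearman(emos, human))
```

The same argument does not cover a decreasing map (a negative slope flips the sign of SRCC), which is why the rescaling is only safe with a positive factor.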

track2/exp04_fusion.ipynb ADDED

	@@ -0,0 +1,790 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d85dcf89",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2 — exp04 (FUSION multi-task) — Kaggle\n",
+    "\n",
+    "**Goal:** merge two complementary backbones (**emotion2vec** is best on EMOS · **SAILER/WavLM** is best on VAD)\n",
+    "into **one multi-task model** that jointly predicts the 5 emotion outputs: **EMOS · CAT · VAL · ARO · DOM**.\n",
+    "QMOS is handled **separately** (still SpeechMOS), following the agreed design: *\"QMOS separate + 5 emotion outputs shared\"*.\n",
+    "\n",
+    "## Idea (read once to get the picture)\n",
+    "Evidence for fusion (from exp01 & exp03): emotion2vec leads on **EMOS** (0.637), SAILER leads on\n",
+    "**VAD** (ARO 0.712 / DOM 0.630). The two models \"see\" emotion differently → **concatenate the features**\n",
+    "of both and let a small network learn on top → expected to beat either model alone.\n",
+    "\n",
+    "```\n",
+    "                ┌─ emotion2vec ─► embedding ~D1 + 5 class probs ───┐\n",
+    " each wav ─────►│                                                  ├─► CONCAT ─► shared TRUNK\n",
+    "                └─ SAILER(WavLM) ► embedding ~D2 + 9 probs + VAD3 ─┘             (Linear+ReLU)\n",
+    "                                                                                       │\n",
+    "                             ┌─────────────────────────────────────────────────────────┤\n",
+    "      target emotion(one-hot)│                                                         │\n",
+    "                             ▼                                                         ▼\n",
+    "                      [EMOS head]                                           [CAT head]  [VAD head]\n",
+    "                      (needs target)                                        (5 classes) (VAL/ARO/DOM)\n",
+    "```\n",
+    "\n",
+    "- **Both backbones are FROZEN** → they only extract features (cached as `.npz`); **only the small trunk + heads are trained**\n",
+    "  → light on GPU, trains in a few minutes, fits a T4. (End-to-end fine-tuning is avoided at first.)\n",
+    "- **EMOS depends on the target** (same audio, different target → different score) → the EMOS head also receives the one-hot target.\n",
+    "  **CAT/VAD** describe the audio itself → they need only the trunk (no target).\n",
+    "- **Gold labels** are aggregated per `wavID` from `sets/train.csv`:\n",
+    "  EMOS = mean `eMOS` · VAL/ARO/DOM = mean `val/aro/dom` · CAT = **5-class vote share** of `emoCat`.\n",
+    "- **Loss balancing = uncertainty weighting** (Kendall 2018): each task has one **learned** weight σ\n",
+    "  → no manual tuning. Set `USE_UNCERTAINTY=False` to fall back to fixed weights when debugging.\n",
+    "- Finally, write `answer.txt` with **all 7 columns**: `wav,QMOS,EMOS,CAT,VAL,ARO,DOM`\n",
+    "  (QMOS=SpeechMOS · the other 5 columns = fusion model) → ready to submit. Baselines to beat: EMOS 0.637 · VAD ARO 0.712.\n",
+    "\n",
+    "**How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On**\n",
+    "→ + Add Input with the Track 2 dataset (15,477 wav files, with `sets/train.csv`, `sets/dev.scp`, `metadata.csv`)\n",
+    "→ edit `DATA_ROOT` in cell 0 → Run All. On the first run, set `LIMIT_TRAIN = 300`, `LIMIT_DEV = 20` to catch setup errors."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5101bb4e",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3fee9b16",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# ── Track 2 data (assembled dataset of 15,477 wav files, with sets/train.csv) ──\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match your Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header) → target emotion\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # listener labels: lisID,wavID,qMOS,emoCat,eMOS,val,dom,aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"     # wav list of the DEV set (the set to submit during the training phase)\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/fusion_cache\"     # embedding cache for both backbones (reused across runs)\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── Training hyperparameters ─────────────────────────────────────────────────\n",
+    "DEVICE          = \"cuda\"      # \"cuda\" on a Kaggle GPU; \"cpu\" if no GPU is available\n",
+    "TRUNK_HIDDEN    = 512         # units in each shared trunk layer\n",
+    "HEAD_HIDDEN     = 128         # hidden units in each head\n",
+    "DROPOUT         = 0.3\n",
+    "LR              = 1e-3\n",
+    "EPOCHS          = 80\n",
+    "BATCH           = 64\n",
+    "VAL_FRAC        = 0.10        # 10% of train → internal validation (per-task SRCC)\n",
+    "PATIENCE        = 15          # early stopping on the aggregate validation score (see val_score)\n",
+    "SEED            = 42\n",
+    "\n",
+    "USE_UNCERTAINTY = True        # True = learned loss balancing (Kendall); False = use the fixed LOSS_W below\n",
+    "LOSS_W          = {\"emos\": 1.0, \"cat\": 1.0, \"val\": 1.0, \"aro\": 1.0, \"dom\": 1.0}  # used only when uncertainty weighting is off\n",
+    "USE_E2V         = True        # toggle the emotion2vec branch of the fusion (for ablation)\n",
+    "USE_SAILER      = True        # toggle the SAILER branch of the fusion (for ablation)\n",
+    "USE_CLASSPROB   = True        # add class probabilities (e2v 5 + sailer 9) + SAILER's VAD3 to the features\n",
+    "\n",
+    "LIMIT_TRAIN     = None        # set a small number (e.g. 300) for a quick trial run; None = full\n",
+    "LIMIT_DEV       = None        # set a small number (e.g. 20) for a quick trial run; None = full\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "# The 9 SAILER classes (in the model's output order) + the indices of the 5 challenge classes among them\n",
+    "SAILER9 = [\"Anger\", \"Contempt\", \"Disgust\", \"Fear\", \"Happiness\", \"Neutral\", \"Sadness\", \"Surprise\", \"Other\"]\n",
+    "EMO2SAILER = {\"angry\": 0, \"happy\": 4, \"neutral\": 5, \"sad\": 6, \"surprised\": 7}   # EMOTIONS5 → index in SAILER9\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    \"\"\"Map any emotion label to one of EMOTIONS5; None if nothing matches.\"\"\"\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(path_or_name):\n",
+    "    \"\"\"Return the file name without extension, so wavIDs match across train.csv / metadata / dev.scp.\"\"\"\n",
+    "    return os.path.splitext(os.path.basename(str(path_or_name)))[0]\n",
+    "\n",
+    "assert USE_E2V or USE_SAILER, \"Enable at least one backbone (USE_E2V or USE_SAILER).\"\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "580854fb",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + fetch the SAILER code\n",
+    "emotion2vec is loaded through `funasr` (offline). SAILER needs `WavLMWrapper` from the `vox-profile-release` repo\n",
+    "→ **clone + sys.path** (do NOT `pip install -e .`, since the wheel build often fails on Kaggle)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0ea1faa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"speechmos\", \"funasr\", \"librosa\", \"soundfile\", \"pandas\", \"scipy\", \"scikit-learn\", \"tqdm\")\n",
+    "\n",
+    "if USE_SAILER:\n",
+    "    pip_install(\"loralib\", \"speechbrain\")   # dependencies required by WavLMWrapper\n",
+    "    REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "    if not os.path.exists(REPO_DIR):\n",
+    "        subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                        \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "    if REPO_DIR not in sys.path:\n",
+    "        sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43033a70",
+   "metadata": {},
+   "source": [
+    "## 2. Read & aggregate labels (grouped by wavID)\n",
+    "- `train.csv`: each row = one listener rating one wav → aggregate **per wavID**:\n",
+    "  EMOS=mean `eMOS` · VAL/ARO/DOM=mean `val/aro/dom` · CAT=**5-class vote share** of `emoCat`.\n",
+    "- `metadata.csv`: provides the **target emotion** of each wav (fed to the EMOS head)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d4051547",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    \"\"\"metadata.csv (wavID|emotion|transcript, NO header) → {stem: normalized_emotion|None}.\"\"\"\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, default_idx=None, df=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    \"\"\"One emoCat cell (possibly multi-label, e.g. 'happy;surprised') → 5-class count vector (unnormalized).\"\"\"\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    \"\"\"train.csv → DataFrame [wavID, emos, val, aro, dom, cat0..cat4] aggregated per wav.\n",
+    "    CAT = 5-class vote share (sums to 1); if a wav has no valid vote → uniform distribution.\"\"\"\n",
+    "    # train.csv is \"|\"-separated; the multi-label emoCat column uses \",\" internally (e.g. \"Angry,Surprised\").\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col  = _col(cols, \"wavid\", \"wav\", default_idx=1, df=df)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col  = _col(cols, \"val\", \"valence\")\n",
+    "    aro_col  = _col(cols, \"aro\", \"arousal\")\n",
+    "    dom_col  = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col  = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"No eMOS column found in train.csv (columns: {list(df.columns)})\"\n",
+    "\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 1.0 / len(EMOTIONS5), dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target emotions: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD labels: {HAS_VAD}\")\n",
+    "print(\"eMOS:\", train_df[\"emos\"].describe()[[\"mean\", \"std\", \"min\", \"max\"]].to_dict())\n",
+    "train_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c5e27b9",
+   "metadata": {},
+   "source": [
+    "## 3. Extract features from both backbones (separate cache per model)\n",
+    "- **emotion2vec** → embedding + 5-class probabilities (as in exp02).\n",
+    "- **SAILER** → embedding (features) + 9-class probabilities + VAD3 (as in exp03).\n",
+    "\n",
+    "Each backbone has its own cache (`e2v_<tag>.npz`, `sailer_<tag>.npz`) → the two can run back to back, and changing one backbone\n",
+    "does not require re-extracting the other. After extraction, **free the GPU** before loading the next backbone."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ebaac593",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device)\n",
+    "if device == \"cuda\":\n",
+    "    print(\"  ✅ GPU:\", torch.cuda.get_device_name(0))\n",
+    "else:\n",
+    "    print(\"  ⚠️ No GPU found! Extracting features for ~15k files on CPU is very slow.\")\n",
+    "    print(\"     → Settings → Accelerator = GPU T4, then rerun.\")\n",
+    "\n",
+    "# ---- emotion2vec ----\n",
+    "def extract_e2v(stems, tag):\n",
+    "    \"\"\"Return {stem: (emb[D1], probs5[5])}. Cached at CACHE_DIR/e2v_<tag>.npz.\"\"\"\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"e2v_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[e2v/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        import logging\n",
+    "        logging.getLogger(\"funasr\").setLevel(logging.ERROR)   # silence funasr's noisy logs\n",
+    "        from funasr import AutoModel\n",
+    "        m = AutoModel(model=\"iic/emotion2vec_plus_large\", hub=\"hf\", device=device,\n",
+    "                      disable_update=True, disable_pbar=True, disable_log=True)   # run on `device` + disable logging\n",
+    "        miss = 0\n",
+    "        for i, s in enumerate(tqdm(todo, desc=f\"e2v {tag}\")):\n",
+    "            wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "            if not os.path.exists(wav):\n",
+    "                miss += 1; continue\n",
+    "            r = m.generate(wav, granularity=\"utterance\", extract_embedding=True)[0]\n",
+    "            emb = np.asarray(r[\"feats\"], dtype=np.float32).reshape(-1)\n",
+    "            probs = {e: 0.0 for e in EMOTIONS5}\n",
+    "            for lab, sc in zip(r[\"labels\"], r[\"scores\"]):\n",
+    "                name = lab.split(\"/\")[-1]\n",
+    "                if name in probs:\n",
+    "                    probs[name] = float(sc)\n",
+    "            tot = sum(probs.values())\n",
+    "            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)\n",
+    "            store[s] = np.concatenate([emb, p5]).astype(np.float32)   # [D1 + 5]\n",
+    "            if (i + 1) % 500 == 0:\n",
+    "                np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del m\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "        if miss:\n",
+    "            print(f\"[e2v/{tag}] {miss} missing files → skipped.\")\n",
+    "    return {s: (v[:-5], v[-5:]) for s, v in store.items()}\n",
+    "\n",
+    "# ---- SAILER ----\n",
+    "def _pool_feat(features):\n",
+    "    \"\"\"features (tensor) → 1-D vector (mean-pooled over every axis except the last, e.g. batch/time).\"\"\"\n",
+    "    f = features.detach().cpu().numpy()\n",
+    "    if f.ndim <= 1:\n",
+    "        return f.reshape(-1).astype(np.float32)\n",
+    "    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)\n",
+    "\n",
+    "def extract_sailer(stems, tag):\n",
+    "    \"\"\"Return {stem: (emb[D2], probs9[9], vad3[3] on a 1-5 scale)}. Cached at CACHE_DIR/sailer_<tag>.npz.\n",
+    "    Each sample is stored as one vector [emb | probs9(9) | vad3(3)] and sliced back on load.\"\"\"\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"sailer_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[sailer/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from src.model.emotion.wavlm_emotion import WavLMWrapper\n",
+    "        sailer = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\").to(device).eval()\n",
+    "        miss = 0\n",
+    "        with torch.no_grad():\n",
+    "            for i, s in enumerate(tqdm(todo, desc=f\"sailer {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    miss += 1; continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                wave = wave[: 15 * 16000]\n",
+    "                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)\n",
+    "                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)\n",
+    "                emb = _pool_feat(feat)\n",
+    "                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)\n",
+    "                vad3 = np.array([1 + 4 * float(valence.item()),\n",
+    "                                 1 + 4 * float(arousal.item()),\n",
+    "                                 1 + 4 * float(dominance.item())], dtype=np.float32)  # [VAL,ARO,DOM]\n",
+    "                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)   # [D2 + 9 + 3]\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del sailer\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "        if miss:\n",
+    "            print(f\"[sailer/{tag}] {miss} missing files → skipped.\")\n",
+    "    return {s: (v[:-12], v[-12:-3], v[-3:]) for s, v in store.items()}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "751d646c",
+   "metadata": {},
+   "source": [
+    "## 4. Build features + labels for training\n",
+    "Audio feature (NOT including the target) = concatenation of the enabled parts:\n",
+    "`[e2v_emb | e2v_probs5 | sailer_emb | sailer_probs9 | sailer_vad3]`.\n",
+    "The one-hot target is kept **separate** (only the EMOS head uses it). Wavs with missing features are dropped."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "005cdf2f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_stems = list(train_df[\"wavID\"])\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "\n",
+    "e2v_tr    = extract_e2v(train_stems, \"train\")    if USE_E2V    else {}\n",
+    "sailer_tr = extract_sailer(train_stems, \"train\") if USE_SAILER else {}\n",
+    "\n",
+    "def audio_feature(sid, e2v_map, sailer_map):\n",
+    "    \"\"\"Concatenate the audio features of one wav. None if a required part is missing.\"\"\"\n",
+    "    parts = []\n",
+    "    if USE_E2V:\n",
+    "        pk = e2v_map.get(sid)\n",
+    "        if pk is None:\n",
+    "            return None\n",
+    "        emb, p5 = pk\n",
+    "        parts.append(emb)\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(p5)\n",
+    "    if USE_SAILER:\n",
+    "        pk = sailer_map.get(sid)\n",
+    "        if pk is None:\n",
+    "            return None\n",
+    "        emb, p9, vad3 = pk\n",
+    "        parts.append(emb)\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(p9); parts.append(vad3)\n",
+    "    return np.concatenate(parts).astype(np.float32)\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "X, T, y_emos, y_vad, y_cat = [], [], [], [], []\n",
+    "for s in train_stems:\n",
+    "    f = audio_feature(s, e2v_tr, sailer_tr)\n",
+    "    tgt = target_map.get(s)\n",
+    "    if f is None or tgt is None or s not in lab.index:\n",
+    "        continue\n",
+    "    X.append(f)\n",
+    "    T.append(onehot_target(tgt))\n",
+    "    y_emos.append(lab.loc[s, \"emos\"])\n",
+    "    y_vad.append([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]])\n",
+    "    y_cat.append([lab.loc[s, f\"cat{i}\"] for i in range(len(EMOTIONS5))])\n",
+    "\n",
+    "X = np.stack(X).astype(np.float32)\n",
+    "T = np.stack(T).astype(np.float32)\n",
+    "y_emos = np.array(y_emos, dtype=np.float32)\n",
+    "y_vad  = np.array(y_vad,  dtype=np.float32)         # [N,3] (VAL,ARO,DOM); may be all NaN if labels are missing\n",
+    "y_cat  = np.array(y_cat,  dtype=np.float32)         # [N,5] distribution summing to 1\n",
+    "FEAT_DIM = X.shape[1]\n",
+    "print(f\"Train: X={X.shape} target={T.shape} emos={y_emos.shape} vad={y_vad.shape} cat={y_cat.shape}\")\n",
+    "\n",
+    "# Standardize audio features (z-score); keep mean/std so DEV prediction applies the identical transform.\n",
+    "feat_mean = X.mean(0, keepdims=True)\n",
+    "feat_std  = X.std(0, keepdims=True) + 1e-6\n",
+    "Xn = (X - feat_mean) / feat_std\n",
+    "\n",
+    "# Standardize the continuous labels (eMOS, VAD) to z-scores → the MSE terms share one scale (steadier uncertainty weighting).\n",
+    "# SRCC is scale-invariant → when writing answer.txt, the z-scores are mapped back to the original scale only for readability.\n",
+    "emos_mu, emos_sd = float(y_emos.mean()), float(y_emos.std() + 1e-6)\n",
+    "y_emos_z = (y_emos - emos_mu) / emos_sd\n",
+    "if HAS_VAD:\n",
+    "    vad_mu = np.nanmean(y_vad, axis=0)\n",
+    "    vad_sd = np.nanstd(y_vad, axis=0) + 1e-6\n",
+    "    y_vad_z = (y_vad - vad_mu) / vad_sd\n",
+    "else:\n",
+    "    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)\n",
+    "    y_vad_z = np.zeros_like(y_vad)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f41faa42",
+   "metadata": {},
+   "source": [
+    "## 5. Multi-task fusion model + training loop\n",
+    "- **Shared trunk**: `Linear(FEAT_DIM→TRUNK_HIDDEN)+ReLU+Dropout`, then `Linear(TRUNK_HIDDEN→TRUNK_HIDDEN)+ReLU+Dropout`.\n",
+    "- **EMOS head**: concatenate `[trunk | one-hot target]` → MLP → 1 (because EMOS depends on the target).\n",
+    "- **CAT head**: trunk → 5 logits → softmax (predicts the vote distribution). Loss = soft-label cross-entropy (equal to KL up to a constant).\n",
+    "- **VAD head**: trunk → 3 (VAL/ARO/DOM). Loss = MSE (skipped when VAD labels are missing).\n",
+    "- **Loss balancing**: uncertainty weighting, total `Σᵢ (exp(-sᵢ)·Lᵢ + sᵢ)`, with **learned** `sᵢ=log σᵢ²`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dc5e0242",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch.nn as nn\n",
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "\n",
+    "idx_all = np.arange(X.shape[0])\n",
+    "tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)\n",
+    "\n",
+    "def to_t(a):\n",
+    "    return torch.tensor(a, dtype=torch.float32, device=device)\n",
+    "\n",
+    "Xn_t, T_t = to_t(Xn), to_t(T)\n",
+    "emos_t = to_t(y_emos_z).unsqueeze(1)\n",
+    "vad_t  = to_t(y_vad_z)\n",
+    "cat_t  = to_t(y_cat)\n",
+    "\n",
+    "class FusionMTL(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(\n",
+    "            nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "        )\n",
+    "        self.emos = nn.Sequential(   # takes [trunk | target]\n",
+    "            nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat  = nn.Sequential(\n",
+    "            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad  = nn.Sequential(\n",
+    "            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "\n",
+    "    def forward(self, x, tgt):\n",
+    "        h = self.trunk(x)\n",
+    "        emos = self.emos(torch.cat([h, tgt], dim=1))\n",
+    "        cat_logits = self.cat(h)\n",
+    "        vad = self.vad(h)\n",
+    "        return emos, cat_logits, vad\n",
+    "\n",
+    "model = FusionMTL(FEAT_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "\n",
+    "# Learned uncertainty weights (log σ²) for the 5 tasks: emos, cat, val, aro, dom.\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "params = list(model.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.Adam(params, lr=LR, weight_decay=1e-5)\n",
+    "\n",
+    "mse = nn.MSELoss(reduction=\"none\")\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    \"\"\"Cross-entropy with soft labels (a distribution): −Σ p·log q.\"\"\"\n",
+    "    logq = F.log_softmax(logits, dim=1)\n",
+    "    return -(target_dist * logq).sum(dim=1)\n",
+    "\n",
+    "def task_losses(emos_p, cat_logits, vad_p, b):\n",
+    "    \"\"\"Return a dict of mean per-task losses for one batch (row indices b).\"\"\"\n",
+    "    L = {}\n",
+    "    L[\"emos\"] = mse(emos_p, emos_t[b]).mean()\n",
+    "    L[\"cat\"]  = soft_ce(cat_logits, cat_t[b]).mean()\n",
+    "    if HAS_VAD:\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vad_t[b, 0:1]).mean()\n",
+    "        L[\"aro\"] = mse(vad_p[:, 1:2], vad_t[b, 1:2]).mean()\n",
+    "        L[\"dom\"] = mse(vad_p[:, 2:3], vad_t[b, 2:3]).mean()\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device)\n",
+    "        L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    return L\n",
+    "\n",
+    "def combine(L):\n",
+    "    \"\"\"Combine the 5 losses into one scalar: uncertainty weighting or fixed weights.\"\"\"\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        tot = 0.0\n",
+    "        for i, t in enumerate(TASKS):\n",
+    "            tot = tot + torch.exp(-log_var[i]) * L[t] + log_var[i]\n",
+    "        return tot\n",
+    "    return sum(LOSS_W[t] * L[t] for t in TASKS)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_val():\n",
+    "    \"\"\"Per-task SRCC on the internal validation split (CAT is reported as −KL so that higher is better; it is not part of val_score).\"\"\"\n",
+    "    model.eval()\n",
+    "    ep, cl, vp = model(Xn_t[va_idx], T_t[va_idx])\n",
+    "    ep = ep.cpu().numpy().ravel()\n",
+    "    out = {\"emos\": spearmanr(ep, y_emos[va_idx]).correlation}\n",
+    "    if HAS_VAD:\n",
+    "        vp = vp.cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            out[t] = spearmanr(vp[:, j], y_vad[va_idx, j]).correlation\n",
+    "    # CAT: mean KL(p‖q), negated so that higher is better (0 is a perfect match)\n",
+    "    q = F.softmax(cl, dim=1).cpu().numpy()\n",
+    "    p = y_cat[va_idx]\n",
+    "    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(1).mean()\n",
+    "    out[\"cat_negkl\"] = float(-kl)\n",
+    "    return out\n",
+    "\n",
+    "def val_score(m):\n",
+    "    \"\"\"Aggregate early-stopping score = mean SRCC over the continuous tasks that have labels.\"\"\"\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "best_score, best_state, bad = -1e9, None, 0\n",
+    "tr_t = torch.tensor(tr_idx, device=device)\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    model.train()\n",
+    "    perm = tr_t[torch.randperm(len(tr_t), device=device)]\n",
+    "    run = 0.0\n",
+    "    for i in range(0, len(perm), BATCH):\n",
+    "        b = perm[i:i + BATCH]\n",
+    "        opt.zero_grad()\n",
+    "        emos_p, cat_logits, vad_p = model(Xn_t[b], T_t[b])\n",
+    "        L = task_losses(emos_p, cat_logits, vad_p, b)\n",
+    "        loss = combine(L)\n",
+    "        loss.backward(); opt.step()\n",
+    "        run += loss.item() * len(b)\n",
+    "    m = eval_val()\n",
+    "    sc = val_score(m)\n",
+    "    if sc > best_score:\n",
+    "        best_score = sc\n",
+    "        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "    if ep % 5 == 0 or ep == 1:\n",
+    "        msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in m)\n",
+    "        print(f\"epoch {ep:3d} | loss {run/len(perm):.4f} | {msg} | best {best_score:.4f}\")\n",
+    "    if bad >= PATIENCE:\n",
+    "        print(f\"Early stop at epoch {ep}.\")\n",
+    "        break\n",
+    "\n",
+    "model.load_state_dict(best_state)\n",
+    "final = eval_val()\n",
+    "print(\"\\n✅ Best internal validation:\")\n",
+    "print(f\"   EMOS SRCC = {final['emos']:.4f}   (baseline: exp01 emotion2vec = 0.637)\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM SRCC = {final['val']:.4f} / {final['aro']:.4f} / {final['dom']:.4f}\"\n",
+    "          f\"   (baseline: SAILER = 0.341 / 0.712 / 0.630)\")\n",
+    "if USE_UNCERTAINTY:\n",
+    "    print(\"   log σ² per task:\", {t: round(float(log_var[i]), 3) for i, t in enumerate(TASKS)})\n",
+    "\n",
+    "# Save the model + normalization parameters.\n",
+    "torch.save({\"state\": best_state, \"feat_mean\": feat_mean, \"feat_std\": feat_std,\n",
+    "            \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "            \"FEAT_DIM\": FEAT_DIM, \"EMOTIONS5\": EMOTIONS5, \"HAS_VAD\": HAS_VAD,\n",
+    "            \"USE_E2V\": USE_E2V, \"USE_SAILER\": USE_SAILER, \"USE_CLASSPROB\": USE_CLASSPROB,\n",
+    "            \"TRUNK_HIDDEN\": TRUNK_HIDDEN, \"HEAD_HIDDEN\": HEAD_HIDDEN, \"val_score\": best_score},\n",
+    "           os.path.join(OUT_DIR, \"fusion_mtl.pt\"))\n",
+    "print(\"Saved\", os.path.join(OUT_DIR, \"fusion_mtl.pt\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "39e3c014",
+   "metadata": {},
+   "source": [
+    "## 6. Predict DEV → full 7-column `answer.txt`\n",
+    "- **EMOS/CAT/VAD** = fusion model (EMOS/VAD z-scores are mapped back to the original scale; CAT = 5-class softmax).\n",
+    "- **QMOS** = SpeechMOS (UTMOS), kept separate by design."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9d06ec4",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]   # .wav file names\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "# 6a. Trích đặc trưng 2 backbone cho DEV (cache riêng)\n",
+    "e2v_dev    = extract_e2v(dev_stems, \"dev\")    if USE_E2V    else {}\n",
+    "sailer_dev = extract_sailer(dev_stems, \"dev\") if USE_SAILER else {}\n",
+    "\n",
+    "# 6b. Dự đoán 5 cột cảm xúc bằng model fusion\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    f = audio_feature(sid, e2v_dev, sailer_dev)\n",
+    "    if f is None:\n",
+    "        return None\n",
+    "    fn = (f[None, :] - feat_mean) / feat_std\n",
+    "    tgt = onehot_target(target_map.get(sid))[None, :]\n",
+    "    model.eval()\n",
+    "    emos_p, cat_logits, vad_p = model(to_t(fn), to_t(tgt))\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu                      # undo the z-score\n",
+    "    cat5 = F.softmax(cat_logits, dim=1)[0].cpu().numpy()\n",
+    "    vad3 = vad_p[0].cpu().numpy() * vad_sd + vad_mu                      # [VAL,ARO,DOM]\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "# 6c. QMOS = SpeechMOS (kept separate)\n",
+    "@torch.no_grad()\n",
+    "def run_qmos(names):\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\", trust_repo=True).to(device).eval()\n",
+    "    out = {}\n",
+    "    for n in tqdm(names, desc=\"QMOS\"):\n",
+    "        p = os.path.join(WAV_DIR, n)\n",
+    "        if not os.path.exists(p):\n",
+    "            continue\n",
+    "        wave, _ = librosa.load(p, sr=16000, mono=True)\n",
+    "        out[n] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device), sr=16000).mean().item())\n",
+    "    return out\n",
+    "\n",
+    "qmos_scores = run_qmos(dev_names)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "999f19fc",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    n_real = n_default = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pred = predict_emotion(sid)\n",
+    "            if pred is None:\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])\n",
+    "                n_default += 1\n",
+    "            else:\n",
+    "                emos, cat5, vad3 = pred\n",
+    "                n_real += 1\n",
+    "            qmos = qmos_scores.get(name, 3.0)\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},\"\n",
+    "                    f\"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | fusion predictions {n_real}, defaults {n_default}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "708acd7a",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ba406750",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Line {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp04_fusion.zip answer.txt && unzip -l submission_track2_exp04_fusion.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp04_fusion.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0f4e2ae",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **First run**: set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` in cell 0 to catch setup errors (repo clone / imports / model loading).\n",
+    "  Once that run succeeds, set both back to `None` for a full run.\n",
+    "- **VAL SRCC** in section 5 is an internal estimate (10% of train) → compare against the baselines EMOS 0.637 / ARO 0.712. The real DEV score\n",
+    "  is only known after submitting to CodaBench (My Submissions → Track 2, deselect the other tracks).\n",
+    "- Embeddings are cached in `/kaggle/working/fusion_cache/` → use **Save Version** to keep them; on later runs, changing\n",
+    "  hyperparameters or the loss weighting only retrains the head (a few minutes), with no re-extraction.\n",
+    "- **Ablations for the paper** (flip the flags in cell 0, retrain the head):\n",
+    "  `USE_E2V=False` (SAILER only) · `USE_SAILER=False` (emotion2vec only) · `USE_UNCERTAINTY=False` (manual weights)\n",
+    "  · `USE_CLASSPROB=False` (embeddings only) → fill in the ablation table in `docs/04_experiments_log.md`.\n",
+    "- SAILER license = **Open RAIL (non-commercial)** → mention it in `docs/12_system_description.md`.\n",
+    "- Remember to record config → results → remarks in `docs/04_experiments_log.md` (exp04 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
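
The `validate` cell that closes the notebook only checks the header and the column count. A stricter check could also parse the `CAT` cell and confirm that the five probabilities are present, in order, and sum to 1. The sketch below is a hypothetical helper (`parse_cat`, `check_rows`, `HEADER` and the tolerance `tol` are illustrative names, not part of the notebook); it assumes the row layout written by `build_answer`, that is `wav,QMOS,EMOS,CAT,VAL,ARO,DOM` with `CAT` formatted as `label:prob` pairs joined by `|`.

```python
import csv
import io
import math

# Same label order as EMOTIONS5 in the notebook.
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
HEADER = ["wav", "QMOS", "EMOS", "CAT", "VAL", "ARO", "DOM"]


def parse_cat(cell):
    """Parse 'angry:0.1|happy:0.2|...' into an ordered {label: probability} dict."""
    out = {}
    for item in cell.split("|"):
        label, value = item.split(":")
        out[label] = float(value)
    return out


def check_rows(text, tol=1e-3):
    """Return a list of (line_number, message) problems found in answer.txt content."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != HEADER:
        return [(1, "unexpected header")]
    problems = []
    for i, row in enumerate(rows[1:], 2):
        if len(row) != len(HEADER):
            problems.append((i, "wrong column count"))
            continue
        cat = parse_cat(row[3])
        if list(cat) != EMOTIONS5:
            problems.append((i, "CAT labels missing or out of order"))
        elif abs(sum(cat.values()) - 1.0) > tol:
            problems.append((i, "CAT probabilities do not sum to 1"))
        # Every numeric column must parse as a finite float.
        for name, cell in zip(HEADER, row):
            if name not in ("wav", "CAT") and not math.isfinite(float(cell)):
                problems.append((i, f"{name} is not finite"))
    return problems
```

One possible use is `check_rows(open(answer_path).read())` right before zipping; an empty list means no problem was found. The value range is deliberately not checked here, because the de-normalized predictions are not clipped in the notebook.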

track2/exp04_fusion_pipeline.py ADDED

	@@ -0,0 +1,652 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp04 (multi-task FUSION) on Kaggle
+#
+# **Goal:** combine two complementary backbones (**emotion2vec** is best on EMOS · **SAILER/WavLM** is best on VAD)
+# into **one multi-task model** that jointly predicts the 5 emotion outputs: **EMOS · CAT · VAL · ARO · DOM**.
+# QMOS stays **separate** (still SpeechMOS), following the agreed design: *"QMOS separate + 5 emotion outputs shared"*.
+#
+# ## The idea (read once)
+# Evidence for fusion (from exp01 & exp03): emotion2vec leads on **EMOS** (0.637), SAILER leads on
+# **VAD** (ARO 0.712 / DOM 0.630). The two models "see" emotion differently → **concatenate the features**
+# of both and let a small network learn from them → expected to beat either model alone.
+#
+# ```
+#                 ┌─ emotion2vec ─► embedding ~D1 + 5-class probs  ─┐
+#  each wav ─────►│                                                 ├─► CONCAT ─► shared TRUNK
+#                 └─ SAILER(WavLM) ► embedding ~D2 + probs9 + VAD3 ─┘             (Linear+ReLU)
+#                                                                              │
+#                              ┌───────────────────────────────────────────────┤
+#       target emotion(one-hot)│                                                │
+#                              ▼                                                ▼
+#                       [EMOS head]                              [CAT head]  [VAD head]
+#                       (uses target)                            (5-class)   (VAL/ARO/DOM)
+# ```
+#
+# - **Both backbones are FROZEN** → they only provide features (cached as `.npz`); **only the small trunk + heads are trained**
+#   → light on GPU, trains in a few minutes, fits a T4. (End-to-end fine-tuning is deliberately avoided at first.)
+# - **EMOS depends on the target** (same audio, different target → different score) → the EMOS head also receives the one-hot target.
+#   **CAT/VAD** describe how the audio itself is perceived → they only need the trunk (no target).
+# - **Gold labels** are aggregated per `wavID` from `sets/train.csv`:
+#   EMOS = mean `eMOS` · VAL/ARO/DOM = mean `val/aro/dom` · CAT = **5-class vote proportions** from `emoCat`.
+# - **Loss balancing = uncertainty weighting** (Kendall 2018): each task has one **learned** weight σ
+#   → no manual tuning. Set `USE_UNCERTAINTY=False` to fall back to fixed weights when debugging.
+# - Finally, write `answer.txt` with **all 7 columns**: `wav,QMOS,EMOS,CAT,VAL,ARO,DOM`
+#   (QMOS = SpeechMOS · the other 5 columns = fusion model) → ready to submit. Baselines: EMOS 0.637 · VAD ARO 0.712.
+#
+# **How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On**
+# → + Add Input: the Track 2 dataset (15,477 wav files, with `sets/train.csv`, `sets/dev.scp`, `metadata.csv`)
+# → edit `DATA_ROOT` in cell 0 → Run All. For the first run, set `LIMIT_TRAIN = 300`, `LIMIT_DEV = 20` to catch setup errors.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+# ── Track 2 data (assembled dataset of 15,477 wav files, with sets/train.csv) ─
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header) → target emotion
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # listener labels, columns: lisID,wavID,qMOS,emoCat,eMOS,val,dom,aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"     # list of DEV wav files (the set to submit in the training phase)
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/fusion_cache"     # embedding cache for both backbones (reused across runs)
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── Training hyperparameters ─────────────────────────────────────────────────
+DEVICE          = "cuda"      # "cuda" on a Kaggle GPU; "cpu" if there is no GPU
+TRUNK_HIDDEN    = 512         # units in each shared trunk layer
+HEAD_HIDDEN     = 128         # hidden units in each head
+DROPOUT         = 0.3
+LR              = 1e-3
+EPOCHS          = 80
+BATCH           = 64
+VAL_FRAC        = 0.10        # 10% of train → internal validation (per-task SRCC)
+PATIENCE        = 15          # early stopping on the overall val score (see val_score() in section 5)
+SEED            = 42
+USE_UNCERTAINTY = True        # True = learned loss balancing (Kendall); False = fixed LOSS_W below
+LOSS_W          = {"emos": 1.0, "cat": 1.0, "val": 1.0, "aro": 1.0, "dom": 1.0}  # only used when uncertainty is off
+USE_E2V         = True        # enable/disable the emotion2vec branch of the fusion (for ablation)
+USE_SAILER      = True        # enable/disable the SAILER branch of the fusion (for ablation)
+USE_CLASSPROB   = True        # add class probabilities (e2v 5 + sailer 9) + SAILER's VAD3 to the features
+LIMIT_TRAIN     = None        # set a small number (e.g. 300) for a quick test run; None = full
+LIMIT_DEV       = None        # set a small number (e.g. 20) for a quick test run; None = full
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+# SAILER's 9 classes (in the model's output order) + the indices of the 5 challenge classes among them
+SAILER9 = ["Anger", "Contempt", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise", "Other"]
+EMO2SAILER = {"angry": 0, "happy": 4, "neutral": 5, "sad": 6, "surprised": 7}   # EMOTIONS5 → index in SAILER9
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    """Map any emotion label to one of EMOTIONS5; None if nothing matches."""
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(path_or_name):
+    """Return the file name without its extension, so wavIDs match across train.csv / metadata / dev.scp."""
+    return os.path.splitext(os.path.basename(str(path_or_name)))[0]
+assert USE_E2V or USE_SAILER, "Enable at least one backbone (USE_E2V or USE_SAILER)."
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install dependencies + fetch the SAILER code
+# emotion2vec runs through `funasr` (offline). SAILER needs `WavLMWrapper` from the `vox-profile-release` repo
+# → **clone + sys.path** (do NOT `pip install -e .`, since building the wheel often fails on Kaggle).
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("speechmos", "funasr", "librosa", "soundfile", "pandas", "scipy", "scikit-learn", "tqdm")
+if USE_SAILER:
+    pip_install("loralib", "speechbrain")   # dependencies required by WavLMWrapper
+    REPO_DIR = "/kaggle/working/vox-profile-release"
+    if not os.path.exists(REPO_DIR):
+        subprocess.run(["git", "clone", "--depth", "1",
+                        "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+    if REPO_DIR not in sys.path:
+        sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Read & aggregate labels (grouped by wavID)
+# - `train.csv`: each row = 1 listener rating 1 wav → aggregate **per wavID**:
+#   EMOS = mean `eMOS` · VAL/ARO/DOM = mean `val/aro/dom` · CAT = **5-class vote proportions** from `emoCat`.
+# - `metadata.csv`: provides the **target emotion** of each wav (fed to the EMOS head).
+# %%
+import numpy as np
+import pandas as pd
+def load_target_emotions():
+    """metadata.csv (wavID|emotion|transcript, NO header) → {stem: normalized_emotion|None}."""
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, default_idx=None, df=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    """One emoCat cell (possibly multi-label, e.g. 'happy;surprised') → 5-class count vector (not normalized)."""
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    """train.csv → DataFrame [wavID, emos, val, aro, dom, cat0..cat4] aggregated per wav.
+    CAT = 5-class vote proportions (sum = 1); if a wav has no valid vote → uniform distribution."""
+    # train.csv is "|"-delimited; a multi-label emoCat cell uses "," internally (e.g. "Angry,Surprised").
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col  = _col(cols, "wavid", "wav", default_idx=1, df=df)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col  = _col(cols, "val", "valence")
+    aro_col  = _col(cols, "aro", "arousal")
+    dom_col  = _col(cols, "dom", "dominance")
+    cat_col  = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"eMOS column not found in train.csv (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 1.0 / len(EMOTIONS5), dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target emotions: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD labels: {HAS_VAD}")
+print("eMOS:", train_df["emos"].describe()[["mean", "std", "min", "max"]].to_dict())
+train_df.head()
+# %% [markdown]
+# ## 3. Extract features from both backbones (separate cache per model)
+# - **emotion2vec** → embedding + 5-class probabilities (as in exp02).
+# - **SAILER** → embedding (features) + 9-class probabilities + VAD3 (as in exp03).
+# Each backbone has its own cache (`e2v_<tag>.npz`, `sailer_<tag>.npz`) → the two can run back to back, and changing one backbone
+# does not require re-extracting the other. After extraction, **GPU memory is freed** before the next backbone is loaded.
+# %%
+import torch
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device)
+if device == "cuda":
+    print("  ✅ GPU:", torch.cuda.get_device_name(0))
+else:
+    print("  ⚠️ No GPU found! Extracting features for ~15k files on CPU is very slow.")
+    print("     → Settings → Accelerator = GPU T4, then rerun.")
+# ---- emotion2vec ----
+def extract_e2v(stems, tag):
+    """→ dict {stem: (emb[D1], probs5[5])}. Cached in CACHE_DIR/e2v_<tag>.npz."""
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"e2v_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[e2v/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from funasr import AutoModel
+        m = AutoModel(model="iic/emotion2vec_plus_large", hub="hf", device=device)   # run on the selected device (GPU when available)
+        miss = 0
+        for i, s in enumerate(tqdm(todo, desc=f"e2v {tag}")):
+            wav = os.path.join(WAV_DIR, s + ".wav")
+            if not os.path.exists(wav):
+                miss += 1; continue
+            r = m.generate(wav, granularity="utterance", extract_embedding=True)[0]
+            emb = np.asarray(r["feats"], dtype=np.float32).reshape(-1)
+            probs = {e: 0.0 for e in EMOTIONS5}
+            for lab, sc in zip(r["labels"], r["scores"]):
+                name = lab.split("/")[-1]
+                if name in probs:
+                    probs[name] = float(sc)
+            tot = sum(probs.values())
+            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)
+            store[s] = np.concatenate([emb, p5]).astype(np.float32)   # [D1 + 5]
+            if (i + 1) % 500 == 0:
+                np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del m
+        if device == "cuda":
+            torch.cuda.empty_cache()   # release GPU memory before the next backbone is loaded
+        if miss:
+            print(f"[e2v/{tag}] {miss} missing files → skipped.")
+    return {s: (v[:-5], v[-5:]) for s, v in store.items()}
+# ---- SAILER ----
+def _pool_feat(features):
+    """features (tensor) → 1-D vector (mean-pooled if a time dimension remains)."""
+    f = features.detach().cpu().numpy()
+    if f.ndim <= 1:
+        return f.reshape(-1).astype(np.float32)
+    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)
+def extract_sailer(stems, tag):
+    """→ dict {stem: (emb[D2], probs9[9], vad3[3] on a 1–5 scale)}. Cached in CACHE_DIR/sailer_<tag>.npz.
+    Each sample is stored as one vector [emb | probs9(9) | vad3(3)] → split again on load."""
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"sailer_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[sailer/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from src.model.emotion.wavlm_emotion import WavLMWrapper
+        sailer = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion").to(device).eval()
+        miss = 0
+        with torch.no_grad():
+            for i, s in enumerate(tqdm(todo, desc=f"sailer {tag}")):
+                wav = os.path.join(WAV_DIR, s + ".wav")
+                if not os.path.exists(wav):
+                    miss += 1; continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                wave = wave[: 15 * 16000]
+                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)
+                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)
+                emb = _pool_feat(feat)
+                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)
+                vad3 = np.array([1 + 4 * float(valence.item()),
+                                 1 + 4 * float(arousal.item()),
+                                 1 + 4 * float(dominance.item())], dtype=np.float32)  # [VAL,ARO,DOM]
+                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)   # [D2 + 9 + 3]
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del sailer
+        if device == "cuda":
+            torch.cuda.empty_cache()   # release GPU memory held by the backbone
+        if miss:
+            print(f"[sailer/{tag}] {miss} missing files → skipped.")
+    return {s: (v[:-12], v[-12:-3], v[-3:]) for s, v in store.items()}
+# %% [markdown]
+# ## 4. Build features + labels for training
+# Audio feature (target NOT included) = concatenation of the enabled parts:
+# `[e2v_emb | e2v_probs5 | sailer_emb | sailer_probs9 | sailer_vad3]`.
+# The one-hot target is kept **separate** (only the EMOS head uses it). Wavs with missing features are dropped.
+# %%
+train_stems = list(train_df["wavID"])
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+e2v_tr    = extract_e2v(train_stems, "train")    if USE_E2V    else {}
+sailer_tr = extract_sailer(train_stems, "train") if USE_SAILER else {}
+def audio_feature(sid, e2v_map, sailer_map):
+    """Concatenate the audio features of one wav. None if a required part is missing."""
+    parts = []
+    if USE_E2V:
+        pk = e2v_map.get(sid)
+        if pk is None:
+            return None
+        emb, p5 = pk
+        parts.append(emb)
+        if USE_CLASSPROB:
+            parts.append(p5)
+    if USE_SAILER:
+        pk = sailer_map.get(sid)
+        if pk is None:
+            return None
+        emb, p9, vad3 = pk
+        parts.append(emb)
+        if USE_CLASSPROB:
+            parts.append(p9); parts.append(vad3)
+    return np.concatenate(parts).astype(np.float32)
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+lab = train_df.set_index("wavID")
+X, T, y_emos, y_vad, y_cat = [], [], [], [], []
+for s in train_stems:
+    f = audio_feature(s, e2v_tr, sailer_tr)
+    tgt = target_map.get(s)
+    if f is None or tgt is None or s not in lab.index:
+        continue
+    X.append(f)
+    T.append(onehot_target(tgt))
+    y_emos.append(lab.loc[s, "emos"])
+    y_vad.append([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]])
+    y_cat.append([lab.loc[s, f"cat{i}"] for i in range(len(EMOTIONS5))])
+X = np.stack(X).astype(np.float32)
+T = np.stack(T).astype(np.float32)
+y_emos = np.array(y_emos, dtype=np.float32)
+y_vad  = np.array(y_vad,  dtype=np.float32)         # [N,3] (VAL,ARO,DOM); may be all NaN if the labels are missing
+y_cat  = np.array(y_cat,  dtype=np.float32)         # [N,5] distribution summing to 1
+FEAT_DIM = X.shape[1]
+print(f"Train: X={X.shape} target={T.shape} emos={y_emos.shape} vad={y_vad.shape} cat={y_cat.shape}")
+# Normalize the audio features (z-score); keep mean/std so the identical transform is applied when predicting DEV.
+feat_mean = X.mean(0, keepdims=True)
+feat_std  = X.std(0, keepdims=True) + 1e-6
+Xn = (X - feat_mean) / feat_std
+# Normalize the continuous labels (eMOS, VAD) to z-scores → the MSEs share one scale (more stable uncertainty weighting).
+# SRCC is invariant to scale → when writing answer.txt, the z-scores only need mapping back to the original scale for readability.
+emos_mu, emos_sd = float(y_emos.mean()), float(y_emos.std() + 1e-6)
+y_emos_z = (y_emos - emos_mu) / emos_sd
+if HAS_VAD:
+    vad_mu = np.nanmean(y_vad, axis=0)
+    vad_sd = np.nanstd(y_vad, axis=0) + 1e-6
+    y_vad_z = (y_vad - vad_mu) / vad_sd
+else:
+    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)
+    y_vad_z = np.zeros_like(y_vad)
+# %% [markdown]
+# ## 5. Multi-task fusion model + training loop
+# - Shared **trunk**: `Linear(FEAT_DIM→TRUNK_HIDDEN)+ReLU+Dropout` (×2).
+# - **EMOS head**: concatenate `[trunk | one-hot target]` → MLP → 1 (because EMOS depends on the target).
+# - **CAT head**: trunk → 5 logits → softmax (predicts the vote distribution). Loss = soft cross-entropy (equal to KL up to a constant).
+# - **VAD head**: trunk → 3 (VAL/ARO/DOM). Loss = MSE (skipped if the VAD labels are missing).
+# - **Loss balancing**: uncertainty weighting, total `Σ exp(-sᵢ)·Lᵢ + sᵢ`, with **learned** `sᵢ=log σᵢ²`.
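+#
+# A short worked note on the weighting rule, written to match `combine()` in the code cell that follows. This is a sketch of the
+# simplified form used in this script, under the assumption that each $L_i$ is treated as fixed while $s_i$ is optimized:
+#
+# $$\mathcal{L}_{\text{total}} = \sum_{i \in \{\text{emos},\,\text{cat},\,\text{val},\,\text{aro},\,\text{dom}\}} \left( e^{-s_i} L_i + s_i \right), \qquad s_i = \log \sigma_i^2$$
+#
+# Setting $\partial \mathcal{L}_{\text{total}} / \partial s_i = -e^{-s_i} L_i + 1 = 0$ gives $s_i = \log L_i$, so the effective weight
+# $e^{-s_i}$ of each task drifts toward $1 / L_i$: tasks with a larger loss are down-weighted, and the $+ s_i$ term keeps $s_i$ from growing without bound.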
+# %%
+import torch.nn as nn
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+idx_all = np.arange(X.shape[0])
+tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)
+def to_t(a):
+    return torch.tensor(a, dtype=torch.float32, device=device)
+Xn_t, T_t = to_t(Xn), to_t(T)
+emos_t = to_t(y_emos_z).unsqueeze(1)
+vad_t  = to_t(y_vad_z)
+cat_t  = to_t(y_cat)
+class FusionMTL(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(
+            nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p),
+        )
+        self.emos = nn.Sequential(   # takes [trunk | target]
+            nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat  = nn.Sequential(
+            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad  = nn.Sequential(
+            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, x, tgt):
+        h = self.trunk(x)
+        emos = self.emos(torch.cat([h, tgt], dim=1))
+        cat_logits = self.cat(h)
+        vad = self.vad(h)
+        return emos, cat_logits, vad
+model = FusionMTL(FEAT_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+# Uncertainty weights (log σ²) for the 5 tasks: emos, cat, val, aro, dom.
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+params = list(model.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.Adam(params, lr=LR, weight_decay=1e-5)
+mse = nn.MSELoss(reduction="none")
+def soft_ce(logits, target_dist):
+    """Cross-entropy with soft labels (a distribution): −Σ p·log q."""
+    logq = F.log_softmax(logits, dim=1)
+    return -(target_dist * logq).sum(dim=1)
+def task_losses(emos_p, cat_logits, vad_p, b):
+    """Return a dict of per-task mean losses for one batch (index tensor b)."""
+    L = {}
+    L["emos"] = mse(emos_p, emos_t[b]).mean()
+    L["cat"]  = soft_ce(cat_logits, cat_t[b]).mean()
+    if HAS_VAD:
+        L["val"] = mse(vad_p[:, 0:1], vad_t[b, 0:1]).mean()
+        L["aro"] = mse(vad_p[:, 1:2], vad_t[b, 1:2]).mean()
+        L["dom"] = mse(vad_p[:, 2:3], vad_t[b, 2:3]).mean()
+    else:
+        z = torch.zeros((), device=device)
+        L["val"] = L["aro"] = L["dom"] = z
+    return L
+def combine(L):
+    """Combine the 5 losses into one scalar: uncertainty weighting or fixed weights."""
+    if USE_UNCERTAINTY:
+        tot = 0.0
+        for i, t in enumerate(TASKS):
+            tot = tot + torch.exp(-log_var[i]) * L[t] + log_var[i]
+        return tot
+    return sum(LOSS_W[t] * L[t] for t in TASKS)
+@torch.no_grad()
+def eval_val():
+    """Per-task SRCC on the internal val set. CAT is reported as −KL so that higher = better; it is logged only, since val_score() uses the SRCCs."""
+    model.eval()
+    ep, cl, vp = model(Xn_t[va_idx], T_t[va_idx])
+    ep = ep.cpu().numpy().ravel()
+    out = {"emos": spearmanr(ep, y_emos[va_idx]).correlation}
+    if HAS_VAD:
+        vp = vp.cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            out[t] = spearmanr(vp[:, j], y_vad[va_idx, j]).correlation
+    # CAT: mean −KL(p‖q) (closer to 0 is better); the sign is flipped so that higher = better, like the SRCCs
+    q = F.softmax(cl, dim=1).cpu().numpy()
+    p = y_cat[va_idx]
+    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(1).mean()
+    out["cat_negkl"] = float(-kl)
+    return out
+def val_score(m):
+    """Overall early-stopping score = mean SRCC over the continuous tasks that have labels."""
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+best_score, best_state, bad = -1e9, None, 0
+tr_t = torch.tensor(tr_idx, device=device)
+for ep in range(1, EPOCHS + 1):
+    model.train()
+    perm = tr_t[torch.randperm(len(tr_t), device=device)]
+    run = 0.0
+    for i in range(0, len(perm), BATCH):
+        b = perm[i:i + BATCH]
+        opt.zero_grad()
+        emos_p, cat_logits, vad_p = model(Xn_t[b], T_t[b])
+        L = task_losses(emos_p, cat_logits, vad_p, b)
+        loss = combine(L)
+        loss.backward(); opt.step()
+        run += loss.item() * len(b)
+    m = eval_val()
+    sc = val_score(m)
+    if sc > best_score:
+        best_score = sc
+        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
+        bad = 0
+    else:
+        bad += 1
+    if ep % 5 == 0 or ep == 1:
+        msg = " ".join(f"{k}={m[k]:.3f}" for k in m)
+        print(f"epoch {ep:3d} | loss {run/len(perm):.4f} | {msg} | best {best_score:.4f}")
+    if bad >= PATIENCE:
+        print(f"Early stop at epoch {ep}.")
+        break
+model.load_state_dict(best_state)
+final = eval_val()
+print("\n✅ Best internal validation scores:")
+print(f"   EMOS SRCC = {final['emos']:.4f}   (baseline: exp01 emotion2vec = 0.637)")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM SRCC = {final['val']:.4f} / {final['aro']:.4f} / {final['dom']:.4f}"
+          f"   (baseline: SAILER = 0.341 / 0.712 / 0.630)")
+if USE_UNCERTAINTY:
+    print("   log σ² per task:", {t: round(float(log_var[i]), 3) for i, t in enumerate(TASKS)})
+# Save the model together with the normalization parameters.
+torch.save({"state": best_state, "feat_mean": feat_mean, "feat_std": feat_std,
+            "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+            "FEAT_DIM": FEAT_DIM, "EMOTIONS5": EMOTIONS5, "HAS_VAD": HAS_VAD,
+            "USE_E2V": USE_E2V, "USE_SAILER": USE_SAILER, "USE_CLASSPROB": USE_CLASSPROB,
+            "TRUNK_HIDDEN": TRUNK_HIDDEN, "HEAD_HIDDEN": HEAD_HIDDEN, "val_score": best_score},
+           os.path.join(OUT_DIR, "fusion_mtl.pt"))
+print("Saved", os.path.join(OUT_DIR, "fusion_mtl.pt"))
+# %% [markdown]
+# ## 6. Predict DEV → full 7-column `answer.txt`
+# - **EMOS/CAT/VAD** = fusion model (EMOS/VAD are mapped back from z-scores to the original scale; CAT = 5-class softmax).
+# - **QMOS** = SpeechMOS (UTMOS), kept separate by design.
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]   # .wav file names
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+# 6a. Extract features from both backbones for DEV (separate cache)
+e2v_dev    = extract_e2v(dev_stems, "dev")    if USE_E2V    else {}
+sailer_dev = extract_sailer(dev_stems, "dev") if USE_SAILER else {}
+# 6b. Predict the 5 emotion columns with the fusion model
+@torch.no_grad()
+def predict_emotion(sid):
+    f = audio_feature(sid, e2v_dev, sailer_dev)
+    if f is None:
+        return None
+    fn = (f[None, :] - feat_mean) / feat_std
+    tgt = onehot_target(target_map.get(sid))[None, :]
+    model.eval()
+    emos_p, cat_logits, vad_p = model(to_t(fn), to_t(tgt))
+    emos = float(emos_p.item()) * emos_sd + emos_mu                      # undo the z-score
+    cat5 = F.softmax(cat_logits, dim=1)[0].cpu().numpy()
+    vad3 = vad_p[0].cpu().numpy() * vad_sd + vad_mu                      # [VAL,ARO,DOM]
+    return emos, cat5, vad3
+# 6c. QMOS = SpeechMOS (kept separate)
+@torch.no_grad()
+def run_qmos(names):
+    import librosa
+    from tqdm.auto import tqdm
+    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True).to(device).eval()
+    out = {}
+    for n in tqdm(names, desc="QMOS"):
+        p = os.path.join(WAV_DIR, n)
+        if not os.path.exists(p):
+            continue
+        wave, _ = librosa.load(p, sr=16000, mono=True)
+        out[n] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device), sr=16000).mean().item())
+    return out
+qmos_scores = run_qmos(dev_names)
+# %%
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    from tqdm.auto import tqdm
+    n_real = n_default = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pred = predict_emotion(sid)
+            if pred is None:
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])
+                n_default += 1
+            else:
+                emos, cat5, vad3 = pred
+                n_real += 1
+            qmos = qmos_scores.get(name, 3.0)
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
+                    f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | fusion predictions {n_real}, defaults {n_default}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp04_fusion.zip answer.txt && unzip -l submission_track2_exp04_fusion.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp04_fusion.zip"))
+# %% [markdown]
+# ## Notes
+# - **First run**: set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` in cell 0 to catch setup errors (repo clone / imports / model loading).
+#   Once that runs cleanly, set both to `None` for the full run.
+# - **VAL SRCC** (validation SRCC) in section 5 is an internal estimate (10% of train) → compare it with the EMOS 0.637 / ARO 0.712 baselines.
+#   The true DEV score is only known after submitting to CodaBench (My Submissions → Track 2, deselect the other tracks).
+# - Embeddings are cached in `/kaggle/working/fusion_cache/` → use **Save Version** to keep them; afterwards, changing
+#   hyperparameters or the loss weighting only requires retraining the head (a few minutes), with no re-extraction.
+# - **Ablations for the paper** (flip the flags in cell 0, retrain the head):
+#   `USE_E2V=False` (SAILER only) · `USE_SAILER=False` (emotion2vec only) · `USE_UNCERTAINTY=False` (hand-set weights)
+#   · `USE_CLASSPROB=False` (embeddings only) → fill in the ablation table in `docs/04_experiments_log.md`.
+# - SAILER license = **Open RAIL (non-commercial)** → mention it in `docs/12_system_description.md`.
+# - Remember to record config → results → remarks in `docs/04_experiments_log.md` (exp04 section).
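
Reviewer sketch (not part of the commit): the `validate` step above checks only the header and the column count. A stricter check could look roughly like the following. It assumes the 7-column layout written by `build_answer` (`wav,QMOS,EMOS,CAT,VAL,ARO,DOM`, with CAT formatted as `label:prob|...`) and assumes every score is meant to lie on a 1–5 scale. The name `check_answer_rows`, the bounds and the tolerance are illustrative choices, not something the notebook defines.

```python
import csv
import io

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
SCORE_COLS = ["QMOS", "EMOS", "VAL", "ARO", "DOM"]

def check_answer_rows(lines, lo=1.0, hi=5.0, tol=1e-3):
    """Return a list of human-readable problems found in answer.txt lines.

    Assumptions (hypothetical, not from the notebook): scores live in
    [lo, hi] and the CAT probabilities should sum to 1 within `tol`.
    """
    problems = []
    for i, row in enumerate(csv.DictReader(lines), start=2):  # line 1 = header
        for col in SCORE_COLS:
            try:
                value = float(row[col])
            except (TypeError, ValueError, KeyError):
                problems.append(f"line {i}: {col} is not a number")
                continue
            if not (lo <= value <= hi):  # this comparison is also False for NaN
                problems.append(f"line {i}: {col}={value} outside [{lo}, {hi}]")
        try:
            # CAT looks like "angry:0.1|happy:0.6|..."
            pairs = [item.split(":") for item in row["CAT"].split("|")]
            labels = [label for label, _ in pairs]
            total = sum(float(prob) for _, prob in pairs)
        except (ValueError, KeyError, AttributeError):
            problems.append(f"line {i}: CAT is malformed")
            continue
        if labels != EMOTIONS5:
            problems.append(f"line {i}: CAT labels are {labels}")
        elif abs(total - 1.0) > tol:
            problems.append(f"line {i}: CAT probabilities sum to {total:.4f}")
    return problems

sample = io.StringIO(
    "wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n"
    "a.wav,3.2,4.1,angry:0.1|happy:0.6|neutral:0.1|sad:0.1|surprised:0.1,3,3.5,2.9\n"
    "b.wav,3.0,5.7,angry:0.5|happy:0.5|neutral:0.5|sad:0|surprised:0,3,3,3\n"
)
problems = check_answer_rows(sample)
for p in problems:
    print(p)
```

On the sample, the first row passes and the second is flagged twice (EMOS out of range, CAT not summing to 1). For a real file, the same function could be called on an open file handle instead of the `StringIO` object.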

track2/exp05_vad_audeering.ipynb ADDED

	@@ -0,0 +1,443 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "40a15eae",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp05 (VAD from audeering MSP-dim), Kaggle\n",
+    "\n",
+    "**Goal:** raise **VAL** (SAILER reaches only 0.341, its lowest column) with a dedicated VAD model,\n",
+    "`audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim` (dimensional; it directly outputs\n",
+    "arousal/dominance/valence, approximately in [0,1]). **All 3 VAD columns are replaced** with audeering.\n",
+    "\n",
+    "## Model assignment (keep what works in exp03, change only VAD)\n",
+    "```\n",
+    "QMOS  ← SpeechMOS (UTMOS)         (kept separate)\n",
+    "EMOS  ← SAILER  (1 + 4·P(target))  ┐ unchanged from exp03\n",
+    "CAT   ← SAILER  (5-class renorm)   ┘\n",
+    "VAL   ← audeering ┐\n",
+    "ARO   ← audeering ├─ REPLACE all 3 (dedicated VAD model)\n",
+    "DOM   ← audeering ┘\n",
+    "```\n",
+    "- Each wav goes through **2 forward passes**: SAILER (EMOS+CAT) + audeering (VAD). NO training.\n",
+    "- The baseline is exp03 (VAD from SAILER: VAL 0.341 / ARO 0.712 / DOM 0.630) → submit to A/B each column.\n",
+    "\n",
+    "**Running on Kaggle:** GPU **T4** + Internet **On** → + Add Input with the Track 2 dataset (containing `sets/dev.scp`,\n",
+    "`metadata.csv`) → edit `DATA_ROOT` → on the first run set `LIMIT = 20` and check that VAD lands plausibly in 1–5 → then set it to `None`.\n",
+    "\n",
+    "⚠️ Licenses: **SAILER = Open RAIL** · **audeering = CC BY-NC-SA 4.0** (both non-commercial) → declare them in `docs/12_`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2c098aff",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa143e27",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match your Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript → target emotion (for EMOS)\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"     # list of DEV-set wavs\n",
+    "\n",
+    "OUT_DIR = \"/kaggle/working\"\n",
+    "\n",
+    "DEVICE      = \"cuda\"\n",
+    "MAX_SECONDS = 15\n",
+    "SR          = 16000\n",
+    "LIMIT       = None          # set to 20 for a quick smoke test; None = full DEV\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "SAILER9 = [\"Anger\", \"Contempt\", \"Disgust\", \"Fear\", \"Happiness\", \"Neutral\", \"Sadness\", \"Surprise\", \"Other\"]\n",
+    "EMO2SAILER = {\"angry\": 0, \"happy\": 4, \"neutral\": 5, \"sad\": 6, \"surprised\": 7}\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(path_or_name):\n",
+    "    return os.path.splitext(os.path.basename(str(path_or_name)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2d1dd91",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + fetch the SAILER code (clone + sys.path, NOT pip install -e .)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d426b50b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "\n",
+    "pip_install(\"loralib\", \"speechbrain\", \"speechmos\", \"librosa\", \"soundfile\", \"scipy\", \"tqdm\")\n",
+    "\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "798ad5ef",
+   "metadata": {},
+   "source": [
+    "## 2. Load the SAILER model (for EMOS + CAT)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5d9ffc83",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device)\n",
+    "if device == \"cuda\":\n",
+    "    print(\"  ✅ GPU:\", torch.cuda.get_device_name(0))\n",
+    "else:\n",
+    "    print(\"  ⚠️ No GPU found → Settings → Accelerator = GPU T4, then rerun.\")\n",
+    "\n",
+    "from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "\n",
+    "sailer = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\").to(device)\n",
+    "sailer.eval()\n",
+    "print(\"✅ Loaded SAILER (wavlm-large-categorical-emotion)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6c18fa40",
+   "metadata": {},
+   "source": [
+    "## 2b. Load the dedicated VAD model: audeering wav2vec2 MSP-dim\n",
+    "⚠️ Subclassing `Wav2Vec2PreTrainedModel` (as in the model card) often breaks across transformers versions\n",
+    "(missing `__file__` / `all_tied_weights_keys`...). The robust workaround: use ONLY `Wav2Vec2Model` (a well-supported\n",
+    "backbone) and **load the regression-head weights by hand** from the checkpoint → the tie-weights/experts code paths are never touched.\n",
+    "⚠️ The model outputs **[arousal, dominance, valence]** in that order, approximately in [0,1] → convert to [VAL,ARO,DOM] on the 1–5 scale when writing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ddf569cd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch.nn as nn\n",
+    "from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "\n",
+    "# 1) wav2vec2 backbone (standard class, no subclassing)\n",
+    "aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "\n",
+    "# 2) download the checkpoint's original state_dict (safetensors preferred)\n",
+    "try:\n",
+    "    _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "        hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "except Exception:\n",
+    "    _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "\n",
+    "# 3) load the backbone part (keys prefixed with \"wav2vec2.\") into Wav2Vec2Model\n",
+    "bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "missing, unexpected = aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "print(f\"  backbone: {len(missing)} missing keys, {len(unexpected)} unexpected keys (strict=False)\")\n",
+    "\n",
+    "# 4) build the regression head with the shapes stored in the checkpoint, then load the \"classifier.*\" weights\n",
+    "_hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "_out = _sd[\"classifier.out_proj.weight\"].shape[0]    # = 3 (arousal, dominance, valence)\n",
+    "aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))\n",
+    "aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"])\n",
+    "aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"])\n",
+    "aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "\n",
+    "aud_backbone = aud_backbone.to(device).eval()\n",
+    "aud_head = aud_head.to(device).eval()\n",
+    "print(f\"✅ Loaded audeering MSP-dim (backbone + head {_hid}→{_out}), the dedicated VAD model\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "849f083a",
+   "metadata": {},
+   "source": [
+    "## 3. Read the target emotion for each wav (for SAILER's EMOS)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "546df027",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import librosa\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "print(f\"Target emotions: {len(target_map)} wavs | e.g.:\", dict(list(target_map.items())[:3]))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1fee644a",
+   "metadata": {},
+   "source": [
+    "## 4. Scoring functions: SAILER (EMOS+CAT) + audeering (VAD)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "54a8ad31",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "@torch.no_grad()\n",
+    "def sailer_probs(wav_path):\n",
+    "    \"\"\"→ probs9 (float32[9]); None if the wav file is missing. Only the 9 class probabilities are used (EMOS+CAT); SAILER's VAD outputs are discarded.\"\"\"\n",
+    "    if not os.path.exists(wav_path):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(wav_path, sr=SR, mono=True)\n",
+    "    wave = wave[: MAX_SECONDS * SR]\n",
+    "    data = torch.from_numpy(wave).float().unsqueeze(0).to(device)\n",
+    "    logits, _feat, _det, _aro, _val, _dom = sailer(data, return_feature=True)\n",
+    "    return F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)\n",
+    "\n",
+    "def emos_from_probs(probs9, target):\n",
+    "    if target is None or target not in EMO2SAILER:\n",
+    "        return None\n",
+    "    return 1.0 + 4.0 * float(probs9[EMO2SAILER[target]])\n",
+    "\n",
+    "def cat5_from_probs(probs9):\n",
+    "    v = np.array([probs9[EMO2SAILER[e]] for e in EMOTIONS5], dtype=np.float32)\n",
+    "    s = v.sum()\n",
+    "    return v / s if s > 0 else np.full(5, 0.2, dtype=np.float32)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def audeering_vad(wav_path):\n",
+    "    \"\"\"VAD from audeering → [VAL, ARO, DOM] on the 1–5 scale; None if the wav file is missing.\n",
+    "    The model outputs [arousal, dominance, valence], approximately in [0,1].\"\"\"\n",
+    "    if not os.path.exists(wav_path):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(wav_path, sr=SR, mono=True)\n",
+    "    wave = wave[: MAX_SECONDS * SR]\n",
+    "    x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "    x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "    h = aud_backbone(x)[0].mean(dim=1)                       # mean-pool over time\n",
+    "    out = aud_head(h)[0].detach().cpu().numpy()              # [arousal, dominance, valence]\n",
+    "    aro, dom, val = float(out[0]), float(out[1]), float(out[2])\n",
+    "    return np.array([1 + 4 * val, 1 + 4 * aro, 1 + 4 * dom], dtype=np.float32)   # [VAL,ARO,DOM]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e662c05e",
+   "metadata": {},
+   "source": [
+    "## 5. QMOS = SpeechMOS (UTMOS), required for answer.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "aacc9e34",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "@torch.no_grad()\n",
+    "def run_qmos(names):\n",
+    "    predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\", trust_repo=True).to(device).eval()\n",
+    "    from tqdm.auto import tqdm\n",
+    "    out = {}\n",
+    "    for n in tqdm(names, desc=\"QMOS\"):\n",
+    "        p = os.path.join(WAV_DIR, n)\n",
+    "        if not os.path.exists(p):\n",
+    "            continue\n",
+    "        wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "        x = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "        out[n] = float(predictor(x, sr=SR).mean().item())\n",
+    "    return out"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d6712414",
+   "metadata": {},
+   "source": [
+    "## 6. Run on DEV → `answer.txt` (QMOS ← UTMOS · EMOS, CAT ← SAILER · VAL,ARO,DOM ← audeering)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "011b2530",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT:\n",
+    "    dev_names = dev_names[:LIMIT]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "qmos_scores = run_qmos(dev_names)\n",
+    "\n",
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    n_emos = n_default = n_vad_def = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"EMOS/CAT(SAILER)+VAD(audeering)\"):\n",
+    "            sid = stem(name)\n",
+    "            wav = os.path.join(WAV_DIR, name)\n",
+    "            # EMOS + CAT from SAILER\n",
+    "            probs9 = sailer_probs(wav)\n",
+    "            if probs9 is None:\n",
+    "                emos, cat5 = 3.0, np.full(5, 0.2, dtype=np.float32); n_default += 1\n",
+    "            else:\n",
+    "                emos = emos_from_probs(probs9, target_map.get(sid))\n",
+    "                if emos is None:\n",
+    "                    emos = 3.0; n_default += 1\n",
+    "                else:\n",
+    "                    n_emos += 1\n",
+    "                cat5 = cat5_from_probs(probs9)\n",
+    "            # VAD from audeering\n",
+    "            vad3 = audeering_vad(wav)\n",
+    "            if vad3 is None:\n",
+    "                vad3 = np.array([3.0, 3.0, 3.0], dtype=np.float32); n_vad_def += 1\n",
+    "            qmos = qmos_scores.get(name, 3.0)\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},\"\n",
+    "                    f\"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | real EMOS {n_emos}, defaults {n_default} | VAD defaults {n_vad_def}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6afa397f",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "749f1366",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Line {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp05_vad-audeering.zip answer.txt && unzip -l submission_track2_exp05_vad-audeering.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp05_vad-audeering.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "69fb16b7",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Relation to exp03:** in exp03, SAILER handles all of EMOS+CAT+VAD (left unchanged, file `exp03_emos_sailer`).\n",
+    "  exp05 (this file) only **switches VAD to audeering**; EMOS/CAT still come from SAILER → submit both to A/B each VAD column.\n",
+    "- **First run:** set `LIMIT = 20` and check that VAL/ARO/DOM look plausible within [1,5] (not all 3, no negatives).\n",
+    "  If the values look off → the arousal/dominance/valence order may be wrong; report it so it can be fixed.\n",
+    "- While running, watch the line that reports the backbone's missing/unexpected key counts: a few missing or extra auxiliary keys is normal;\n",
+    "  hundreds of missing keys means the prefix is wrong → report it.\n",
+    "- If audeering wins on VAL but loses to SAILER on ARO/DOM → the best submission mixes columns\n",
+    "  (VAL from audeering, ARO/DOM from exp03). Log the results in `docs/04_experiments_log.md` (exp05)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
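
Reviewer sketch (not part of the commit): the conversion in `audeering_vad` reorders the model output and maps it linearly from [0,1] to [1,5]. The hand-built head ends in a plain `nn.Linear`, so a raw output can in principle fall slightly outside [0,1], and the mapped value then leaves [1,5]. The snippet below shows the same mapping with optional clipping. The clipping is an assumption on my part (the notebook does not clip), and `to_vad_1_5` is a hypothetical helper name.

```python
def to_vad_1_5(raw, clip=True):
    """Map [arousal, dominance, valence] (roughly in [0,1]) to [VAL, ARO, DOM] on 1-5.

    `clip=True` bounds the result to [1, 5]; this is an added assumption,
    the original notebook writes the unclipped values.
    """
    aro, dom, val = (float(x) for x in raw)
    scaled = [1.0 + 4.0 * v for v in (val, aro, dom)]  # reorder, then rescale
    if clip:
        scaled = [min(5.0, max(1.0, s)) for s in scaled]
    return scaled

print(to_vad_1_5([0.5, 0.25, 1.0]))    # [5.0, 3.0, 2.0]
print(to_vad_1_5([1.1, -0.05, 0.0]))   # [1.0, 5.0, 1.0] after clipping
```

Whether clipping helps depends on the metric: a rank-based score such as SRCC is largely unaffected by it except for ties at the bounds, so it mainly matters if the submission format rejects out-of-range values.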

track2/exp05_vad_audeering_pipeline.py ADDED

	@@ -0,0 +1,303 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp05 (VAD from audeering MSP-dim), Kaggle
+#
+# **Goal:** raise **VAL** (SAILER reaches only 0.341, its lowest column) with a dedicated VAD model,
+# `audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim` (dimensional; it directly outputs
+# arousal/dominance/valence, approximately in [0,1]). **All 3 VAD columns are replaced** with audeering.
+#
+# ## Model assignment (keep what works in exp03, change only VAD)
+# ```
+# QMOS  ← SpeechMOS (UTMOS)         (kept separate)
+# EMOS  ← SAILER  (1 + 4·P(target))  ┐ unchanged from exp03
+# CAT   ← SAILER  (5-class renorm)   ┘
+# VAL   ← audeering ┐
+# ARO   ← audeering ├─ REPLACE all 3 (dedicated VAD model)
+# DOM   ← audeering ┘
+# ```
+# - Each wav goes through **2 forward passes**: SAILER (EMOS+CAT) + audeering (VAD). NO training.
+# - The baseline is exp03 (VAD from SAILER: VAL 0.341 / ARO 0.712 / DOM 0.630) → submit to A/B each column.
+#
+# **Running on Kaggle:** GPU **T4** + Internet **On** → + Add Input with the Track 2 dataset (containing `sets/dev.scp`,
+# `metadata.csv`) → edit `DATA_ROOT` → on the first run set `LIMIT = 20` and check that VAD lands plausibly in 1–5 → then set it to `None`.
+#
+# ⚠️ Licenses: **SAILER = Open RAIL** · **audeering = CC BY-NC-SA 4.0** (both non-commercial) → declare them in `docs/12_`.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript → target emotion (for EMOS)
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"     # list of DEV-set wavs
+OUT_DIR = "/kaggle/working"
+DEVICE      = "cuda"
+MAX_SECONDS = 15
+SR          = 16000
+LIMIT       = None          # set to 20 for a quick smoke test; None = full DEV
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+SAILER9 = ["Anger", "Contempt", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise", "Other"]
+EMO2SAILER = {"angry": 0, "happy": 4, "neutral": 5, "sad": 6, "surprised": 7}
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(path_or_name):
+    return os.path.splitext(os.path.basename(str(path_or_name)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install dependencies + fetch the SAILER code (clone + sys.path, NOT pip install -e .)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+pip_install("loralib", "speechbrain", "speechmos", "librosa", "soundfile", "scipy", "tqdm")
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Load the SAILER model (for EMOS + CAT)
+# %%
+import torch
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device)
+if device == "cuda":
+    print("  ✅ GPU:", torch.cuda.get_device_name(0))
+else:
+    print("  ⚠️ No GPU found → Settings → Accelerator = GPU T4, then rerun.")
+from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+sailer = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion").to(device)
+sailer.eval()
+print("✅ Loaded SAILER (wavlm-large-categorical-emotion)")
+# %% [markdown]
+# ## 2b. Load the dedicated VAD model: audeering wav2vec2 MSP-dim
+# ⚠️ Subclassing `Wav2Vec2PreTrainedModel` (as in the model card) often breaks across transformers versions
+# (missing `__file__` / `all_tied_weights_keys`...). The robust workaround: use ONLY `Wav2Vec2Model` (a well-supported
+# backbone) and **load the regression-head weights by hand** from the checkpoint → the tie-weights/experts code paths are never touched.
+# ⚠️ The model outputs **[arousal, dominance, valence]** in that order, approximately in [0,1] → convert to [VAL,ARO,DOM] on the 1–5 scale when writing.
+# %%
+import torch.nn as nn
+from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+from huggingface_hub import hf_hub_download
+AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+# 1) wav2vec2 backbone (standard class, no subclassing)
+aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+aud_backbone = Wav2Vec2Model(aud_cfg)
+# 2) download the checkpoint's original state_dict (safetensors preferred)
+try:
+    _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+        hf_hub_download(AUD_NAME, "model.safetensors"))
+except Exception:
+    _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+# 3) load the backbone part (keys prefixed with "wav2vec2.") into Wav2Vec2Model
+bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+missing, unexpected = aud_backbone.load_state_dict(bb_sd, strict=False)
+print(f"  backbone: {len(missing)} missing keys, {len(unexpected)} unexpected keys (strict=False)")
+# 4) build the regression head with the shapes stored in the checkpoint, then load the "classifier.*" weights
+_hid = _sd["classifier.dense.weight"].shape[0]
+_out = _sd["classifier.out_proj.weight"].shape[0]    # = 3 (arousal, dominance, valence)
+aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))
+aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"])
+aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"])
+aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+aud_backbone = aud_backbone.to(device).eval()
+aud_head = aud_head.to(device).eval()
+print(f"✅ Loaded audeering MSP-dim (backbone + head {_hid}→{_out}), the dedicated VAD model")
+# %% [markdown]
+# ## 3. Read the target emotion for each wav (for SAILER's EMOS)
+# %%
+import numpy as np
+import librosa
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+target_map = load_target_emotions()
+print(f"Target emotions: {len(target_map)} wavs | e.g.:", dict(list(target_map.items())[:3]))
+# %% [markdown]
+# ## 4. Scoring functions: SAILER (EMOS+CAT) + audeering (VAD)
+# %%
+@torch.no_grad()
+def sailer_probs(wav_path):
+    """→ probs9 (float32[9]); None if the wav file is missing. Only the 9 class probabilities are used (EMOS+CAT); SAILER's VAD outputs are discarded."""
+    if not os.path.exists(wav_path):
+        return None
+    wave, _ = librosa.load(wav_path, sr=SR, mono=True)
+    wave = wave[: MAX_SECONDS * SR]
+    data = torch.from_numpy(wave).float().unsqueeze(0).to(device)
+    logits, _feat, _det, _aro, _val, _dom = sailer(data, return_feature=True)
+    return F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)
+def emos_from_probs(probs9, target):
+    if target is None or target not in EMO2SAILER:
+        return None
+    return 1.0 + 4.0 * float(probs9[EMO2SAILER[target]])
+def cat5_from_probs(probs9):
+    v = np.array([probs9[EMO2SAILER[e]] for e in EMOTIONS5], dtype=np.float32)
+    s = v.sum()
+    return v / s if s > 0 else np.full(5, 0.2, dtype=np.float32)
+@torch.no_grad()
+def audeering_vad(wav_path):
+    """VAD from audeering → [VAL, ARO, DOM] on the 1–5 scale; None if the wav file is missing.
+    The model outputs [arousal, dominance, valence], approximately in [0,1]."""
+    if not os.path.exists(wav_path):
+        return None
+    wave, _ = librosa.load(wav_path, sr=SR, mono=True)
+    wave = wave[: MAX_SECONDS * SR]
+    x = aud_proc(wave, sampling_rate=SR).input_values[0]
+    x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+    h = aud_backbone(x)[0].mean(dim=1)                       # mean-pool over time
+    out = aud_head(h)[0].detach().cpu().numpy()              # [arousal, dominance, valence]
+    aro, dom, val = float(out[0]), float(out[1]), float(out[2])
+    return np.array([1 + 4 * val, 1 + 4 * aro, 1 + 4 * dom], dtype=np.float32)   # [VAL,ARO,DOM]
+# %% [markdown]
+# ## 5. QMOS = SpeechMOS (UTMOS), required for answer.txt
+# %%
+@torch.no_grad()
+def run_qmos(names):
+    predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True).to(device).eval()
+    from tqdm.auto import tqdm
+    out = {}
+    for n in tqdm(names, desc="QMOS"):
+        p = os.path.join(WAV_DIR, n)
+        if not os.path.exists(p):
+            continue
+        wave, _ = librosa.load(p, sr=SR, mono=True)
+        x = torch.from_numpy(wave).unsqueeze(0).to(device)
+        out[n] = float(predictor(x, sr=SR).mean().item())
+    return out
+# %% [markdown]
+# ## 6. Run on DEV → `answer.txt` (QMOS ← UTMOS · EMOS, CAT ← SAILER · VAL,ARO,DOM ← audeering)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT:
+    dev_names = dev_names[:LIMIT]
+print("DEV:", len(dev_names), "samples")
+qmos_scores = run_qmos(dev_names)
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    from tqdm.auto import tqdm
+    n_emos = n_default = n_vad_def = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="EMOS/CAT(SAILER)+VAD(audeering)"):
+            sid = stem(name)
+            wav = os.path.join(WAV_DIR, name)
+            # EMOS + CAT from SAILER
+            probs9 = sailer_probs(wav)
+            if probs9 is None:
+                emos, cat5 = 3.0, np.full(5, 0.2, dtype=np.float32); n_default += 1
+            else:
+                emos = emos_from_probs(probs9, target_map.get(sid))
+                if emos is None:
+                    emos = 3.0; n_default += 1
+                else:
+                    n_emos += 1
+                cat5 = cat5_from_probs(probs9)
+            # VAD from audeering
+            vad3 = audeering_vad(wav)
+            if vad3 is None:
+                vad3 = np.array([3.0, 3.0, 3.0], dtype=np.float32); n_vad_def += 1
+            qmos = qmos_scores.get(name, 3.0)
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
+                    f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | real EMOS {n_emos}, defaults {n_default} | VAD defaults {n_vad_def}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp05_vad-audeering.zip answer.txt && unzip -l submission_track2_exp05_vad-audeering.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp05_vad-audeering.zip"))
+# %% [markdown]
+# ## Notes
+# - **Relation to exp03:** in exp03, SAILER produces EMOS, CAT and VAD (kept unchanged in `exp03_emos_sailer`).
+#   exp05 (this file) only **switches VAD to audeering**; EMOS/CAT still come from SAILER, so submitting both
+#   versions gives an A/B comparison of each VAD column.
+# - **First run:** set `LIMIT = 20` and check that VAL/ARO/DOM fall in [1, 5] and look plausible
+#   (not all 3, no negative values).
+#   If the values look off, the arousal/dominance/valence order may be wrong; report it so it can be fixed.
+# - While running, watch the log line `backbone: thiếu N key, dư M key` (N missing keys, M unexpected keys):
+#   a few missing or extra auxiliary keys is normal; hundreds of missing keys means a wrong prefix, so report it.
+# - If audeering wins on VAL but loses to SAILER on ARO/DOM, the best submission mixes columns
+#   (VAL from audeering, ARO/DOM from exp03). Log the results in `docs/04_experiments_log.md` (exp05).
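
The column mixing suggested in the last note could be sketched roughly as follows. This is a minimal illustration, not part of the commit: the helper `mix_columns` is hypothetical, the column names `VAL`, `ARO`, `DOM` are assumed to match the real `answer.txt` header, and the two inline tables are made-up stand-ins for the exp05 and exp03 files.

```python
import csv
import io

def mix_columns(base_rows, donor_rows, columns, key="wav"):
    """Copy `columns` from donor_rows into base_rows, matching rows on `key`."""
    base_header, donor_header = base_rows[0], donor_rows[0]
    donor_by_key = {r[donor_header.index(key)]: r for r in donor_rows[1:]}
    mixed = [list(base_header)]
    for row in base_rows[1:]:
        row = list(row)
        donor = donor_by_key.get(row[base_header.index(key)])
        if donor is not None:  # keep the base value when the donor lacks this wav
            for col in columns:
                row[base_header.index(col)] = donor[donor_header.index(col)]
        mixed.append(row)
    return mixed

# Toy stand-ins for the two answer.txt files (values are made up; header is an assumption).
exp05_txt = "wav,QMOS,EMOS,VAL,ARO,DOM\na.wav,3.1,3.5,2.0,4.0,3.0\nb.wav,2.9,2.5,3.5,2.5,2.0\n"
exp03_txt = "wav,QMOS,EMOS,VAL,ARO,DOM\na.wav,3.1,3.5,2.4,4.4,3.3\nb.wav,2.9,2.5,3.1,2.2,1.8\n"
exp05_rows = list(csv.reader(io.StringIO(exp05_txt)))
exp03_rows = list(csv.reader(io.StringIO(exp03_txt)))

# Keep VAL from exp05 (audeering), take ARO/DOM from exp03 (SAILER).
mixed = mix_columns(exp05_rows, exp03_rows, ["ARO", "DOM"])
for row in mixed:
    print(",".join(row))
```

Matching on the `wav` key rather than on row position keeps the merge safe if the two files list the wavs in a different order.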

track2/exp06_qmos_train.ipynb ADDED

	@@ -0,0 +1,628 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e2d94d72",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp06 (train QMOS head) on Kaggle\n",
+    "\n",
+    "**Goal:** QMOS is the **only column not yet trained** (it currently uses zero-shot UTMOS, so SRCC is stuck at 0.414).\n",
+    "`train.csv` already has a `qMOS` column, so we train a **small regression head** on SSL features (cached in exp04)\n",
+    "to beat 0.414.\n",
+    "\n",
+    "## Idea (read once)\n",
+    "- Reuse the **emotion2vec + SAILER** features already extracted and cached in `fusion_cache/` (exp04); NO re-extraction.\n",
+    "- Add the **UTMOS score itself** (SpeechMOS) as one input feature, so the head can **learn a correction (residual)**\n",
+    "  on top of UTMOS instead of learning quality from scratch. It should rarely do worse than UTMOS alone,\n",
+    "  but that is not guaranteed, which is why it is checked on a validation split.\n",
+    "- Gold QMOS label = **mean `qMOS` per wav** (averaged over the listeners in `train.csv`).\n",
+    "- A **10% internal validation split** measures SRCC and compares it directly with UTMOS on the SAME split, so we know\n",
+    "  whether the gain is real **before** spending a CodaBench submission.\n",
+    "- Finally: **keep exp04 unchanged** (its 5 emotion columns are the current best) and only **replace the QMOS column** in `answer.txt`.\n",
+    "\n",
+    "```\n",
+    " each wav ─► [e2v_emb | e2v_probs5 | sailer_emb | sailer_probs9 | sailer_vad3 | UTMOS] ─► MLP ─► QMOS\n",
+    "                                                                                  (trained head)\n",
+    "```\n",
+    "\n",
+    "**How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On**.\n",
+    "Under **Add Input**, attach: (1) the Track 2 dataset (15,477 wavs, with `sets/train.csv`); (2) optionally, a dataset containing the\n",
+    "`fusion_cache/*.npz` files saved with Save Version in exp04 (saves ~15 min); (3) the exp04 `answer.txt` to merge the column into.\n",
+    "On the first run set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` to catch setup errors; once it works, set both to `None`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b42d5d49",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "93e29194",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "# ── Data Track 2 ─────────────────────────────────────────────────────────────\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match your Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # listener labels: lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"     # list of wavs in the DEV set\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
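+    "# SHARED cache with exp04. If you saved the cache with Save Version in exp04, point CACHE_DIR at that dataset\n",
+    "# (e.g. \"/kaggle/input/<slug-cache>/fusion_cache\") to skip re-extraction; otherwise the default path re-extracts.\n",
+    "# Note: /kaggle/input is read-only, so new cache files (e.g. utmos_*.npz) cannot be written there; if anything\n",
+    "# still needs extracting, copy the cached .npz files into /kaggle/working/fusion_cache instead.\n",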
+    "CACHE_DIR = \"/kaggle/working/fusion_cache\"\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# exp04 answer.txt (5 emotion columns, current best) into which the new QMOS column is MERGED.\n",
+    "# Point this at your exp04 file. If it is missing, the notebook still writes a separate qmos_dev.csv and prints a warning.\n",
+    "EXP04_ANSWER = \"/kaggle/input/exp04-answer/answer.txt\"   # << EDIT; or \"/kaggle/working/answer.txt\"\n",
+    "\n",
+    "# ── Features used for QMOS ───────────────────────────────────────────────────\n",
+    "USE_E2V        = True     # concatenate the emotion2vec embedding\n",
+    "USE_SAILER     = True     # concatenate the SAILER/WavLM embedding\n",
+    "USE_CLASSPROB  = True     # also concatenate class probabilities (e2v5 + sailer9 + vad3)\n",
+    "USE_UTMOS_FEAT = True     # also concatenate the UTMOS score as one feature (anchors the head to the UTMOS baseline)\n",
+    "\n",
+    "# ── Head training hyperparameters ────────────────────────────────────────────\n",
+    "DEVICE      = \"cuda\"\n",
+    "HIDDEN      = 256\n",
+    "DROPOUT     = 0.3\n",
+    "LR          = 1e-3\n",
+    "EPOCHS      = 120\n",
+    "BATCH       = 64\n",
+    "VAL_FRAC    = 0.10\n",
+    "PATIENCE    = 20\n",
+    "SEED        = 42\n",
+    "RANK_LAMBDA = 0.0         # 0 = MSE only. >0 (e.g. 0.2) adds a pairwise ranking loss (targets rank order, i.e. SRCC)\n",
+    "\n",
+    "LIMIT_TRAIN = None        # small number (e.g. 300) for a trial run; None = full set\n",
+    "LIMIT_DEV   = None\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "assert USE_E2V or USE_SAILER or USE_UTMOS_FEAT, \"Enable at least one feature source.\"\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "47ac221d",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + (if needed) fetch the SAILER code\n",
+    "emotion2vec is loaded through `funasr`; SAILER needs `WavLMWrapper` from the `vox-profile-release` repo (clone + sys.path).\n",
+    "If the cache is complete these models are NOT loaded (they load only when files still need extraction)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "99ba1947",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"speechmos\", \"funasr\", \"librosa\", \"soundfile\", \"pandas\", \"scipy\", \"scikit-learn\", \"tqdm\")\n",
+    "\n",
+    "if USE_SAILER:\n",
+    "    pip_install(\"loralib\", \"speechbrain\")\n",
+    "    REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "    if not os.path.exists(REPO_DIR):\n",
+    "        subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                        \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "    if REPO_DIR not in sys.path:\n",
+    "        sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ac9dcefc",
+   "metadata": {},
+   "source": [
+    "## 2. Gold QMOS labels (average `qMOS` per wavID)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "db4a41a5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def load_qmos_labels():\n",
+    "    \"\"\"train.csv (sep='|') → DataFrame [wavID, qmos], where qmos = mean over listeners per wav.\"\"\"\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col  = cols.get(\"wavid\") or cols.get(\"wav\") or list(df.columns)[1]\n",
+    "    qmos_col = cols.get(\"qmos\") or cols.get(\"mos\")   # keys are lower-cased, so \"qmos\" also matches \"qMOS\"\n",
+    "    assert qmos_col, f\"qMOS column not found in train.csv (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    g = df.groupby(\"_stem\")[qmos_col].mean().reset_index()\n",
+    "    g.columns = [\"wavID\", \"qmos\"]\n",
+    "    return g\n",
+    "\n",
+    "qmos_df = load_qmos_labels()\n",
+    "print(f\"train wavs (aggregated): {len(qmos_df)}\")\n",
+    "print(\"qMOS:\", qmos_df[\"qmos\"].describe()[[\"mean\", \"std\", \"min\", \"max\"]].to_dict())\n",
+    "qmos_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dfd7df0c",
+   "metadata": {},
+   "source": [
+    "## 3. Extract / load features (cache SHARED with exp04) + UTMOS scores\n",
+    "- `extract_e2v` / `extract_sailer`: identical to exp04, cached as `e2v_<tag>.npz` / `sailer_<tag>.npz`.\n",
+    "- `extract_utmos`: scores each wav with UTMOS → cached as `utmos_<tag>.npz` (used both as a feature and as the comparison baseline)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ec1e63a1",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU\")\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "def extract_e2v(stems, tag):\n",
+    "    \"\"\"→ dict {stem: emb_full[D1+5]}. Cached at CACHE_DIR/e2v_<tag>.npz (same as exp04).\"\"\"\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"e2v_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[e2v/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from funasr import AutoModel\n",
+    "        m = AutoModel(model=\"iic/emotion2vec_plus_large\", hub=\"hf\", device=device)\n",
+    "        for i, s in enumerate(tqdm(todo, desc=f\"e2v {tag}\")):\n",
+    "            wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "            if not os.path.exists(wav):\n",
+    "                continue\n",
+    "            r = m.generate(wav, granularity=\"utterance\", extract_embedding=True)[0]\n",
+    "            emb = np.asarray(r[\"feats\"], dtype=np.float32).reshape(-1)\n",
+    "            probs = {e: 0.0 for e in EMOTIONS5}\n",
+    "            for lab, sc in zip(r[\"labels\"], r[\"scores\"]):\n",
+    "                name = lab.split(\"/\")[-1]\n",
+    "                if name in probs:\n",
+    "                    probs[name] = float(sc)\n",
+    "            tot = sum(probs.values())\n",
+    "            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)\n",
+    "            store[s] = np.concatenate([emb, p5]).astype(np.float32)\n",
+    "            if (i + 1) % 500 == 0:\n",
+    "                np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del m\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "    return store   # each value = [D1 | 5]\n",
+    "\n",
+    "def _pool_feat(features):\n",
+    "    f = features.detach().cpu().numpy()\n",
+    "    if f.ndim <= 1:\n",
+    "        return f.reshape(-1).astype(np.float32)\n",
+    "    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)\n",
+    "\n",
+    "def extract_sailer(stems, tag):\n",
+    "    \"\"\"→ dict {stem: vec[D2+9+3]}. Cached at CACHE_DIR/sailer_<tag>.npz (same as exp04).\"\"\"\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"sailer_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[sailer/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from src.model.emotion.wavlm_emotion import WavLMWrapper\n",
+    "        sailer = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\").to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, s in enumerate(tqdm(todo, desc=f\"sailer {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                wave = wave[: 15 * 16000]\n",
+    "                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)\n",
+    "                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)\n",
+    "                emb = _pool_feat(feat)\n",
+    "                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)\n",
+    "                vad3 = np.array([1 + 4 * float(valence.item()),\n",
+    "                                 1 + 4 * float(arousal.item()),\n",
+    "                                 1 + 4 * float(dominance.item())], dtype=np.float32)\n",
+    "                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del sailer\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "    return store   # each value = [D2 | 9 | 3]\n",
+    "\n",
+    "def extract_utmos(names, tag):\n",
+    "    \"\"\"Score each wav with UTMOS (by file NAME, since DEV lists wavs by name). → dict {stem: score}.\n",
+    "    Cached at CACHE_DIR/utmos_<tag>.npz. Used both as a feature and as the comparison baseline.\"\"\"\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"utmos_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: float(z[k]) for k in z.files}\n",
+    "        print(f\"[utmos/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [n for n in names if stem(n) not in store]\n",
+    "    if todo:\n",
+    "        predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\",\n",
+    "                                   trust_repo=True).to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, n in enumerate(tqdm(todo, desc=f\"utmos {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, n if n.endswith(\".wav\") else n + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                sc = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device), sr=16000).mean().item())\n",
+    "                store[stem(n)] = sc\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        del predictor\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aed7338b",
+   "metadata": {},
+   "source": [
+    "## 4. Build features + labels for training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c09bb508",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_stems = list(qmos_df[\"wavID\"])\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "\n",
+    "e2v_tr    = extract_e2v(train_stems, \"train\")    if USE_E2V    else {}\n",
+    "sailer_tr = extract_sailer(train_stems, \"train\") if USE_SAILER else {}\n",
+    "utmos_tr  = extract_utmos(train_stems, \"train\")  if USE_UTMOS_FEAT else {}\n",
+    "\n",
+    "def qmos_feature(sid, e2v_map, sailer_map, utmos_map):\n",
+    "    \"\"\"Concatenate the QMOS features for one wav. Returns None if a required part is missing.\"\"\"\n",
+    "    parts = []\n",
+    "    if USE_E2V:\n",
+    "        v = e2v_map.get(sid)\n",
+    "        if v is None:\n",
+    "            return None\n",
+    "        parts.append(v[:-5])                      # emb e2v\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(v[-5:])                  # probs5\n",
+    "    if USE_SAILER:\n",
+    "        v = sailer_map.get(sid)\n",
+    "        if v is None:\n",
+    "            return None\n",
+    "        parts.append(v[:-12])                     # emb sailer\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(v[-12:])                 # probs9 + vad3\n",
+    "    if USE_UTMOS_FEAT:\n",
+    "        u = utmos_map.get(sid)\n",
+    "        if u is None:\n",
+    "            return None\n",
+    "        parts.append(np.array([u], dtype=np.float32))\n",
+    "    return np.concatenate(parts).astype(np.float32)\n",
+    "\n",
+    "lab = qmos_df.set_index(\"wavID\")[\"qmos\"]\n",
+    "X, y = [], []\n",
+    "for s in train_stems:\n",
+    "    f = qmos_feature(s, e2v_tr, sailer_tr, utmos_tr)\n",
+    "    if f is None or s not in lab.index:\n",
+    "        continue\n",
+    "    X.append(f)\n",
+    "    y.append(float(lab.loc[s]))\n",
+    "\n",
+    "X = np.stack(X).astype(np.float32)\n",
+    "y = np.array(y, dtype=np.float32)\n",
+    "FEAT_DIM = X.shape[1]\n",
+    "print(f\"Train: X={X.shape} y={y.shape}\")\n",
+    "\n",
+    "feat_mean = X.mean(0, keepdims=True)\n",
+    "feat_std  = X.std(0, keepdims=True) + 1e-6\n",
+    "Xn = (X - feat_mean) / feat_std\n",
+    "y_mu, y_sd = float(y.mean()), float(y.std() + 1e-6)\n",
+    "yn = (y - y_mu) / y_sd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "82cc65f8",
+   "metadata": {},
+   "source": [
+    "## 5. Train the QMOS head + compare with UTMOS on the SAME internal validation split\n",
+    "- Head = small MLP (`Linear→ReLU→Dropout ×2 → 1`). Loss = MSE (+ optional pairwise ranking).\n",
+    "- Prints **head SRCC** and **UTMOS SRCC** on the same validation split, showing whether the head really beats the UTMOS baseline (0.414 on the leaderboard)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "324ab564",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch.nn as nn\n",
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "idx_all = np.arange(X.shape[0])\n",
+    "tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)\n",
+    "\n",
+    "def to_t(a):\n",
+    "    return torch.tensor(a, dtype=torch.float32, device=device)\n",
+    "\n",
+    "Xn_t = to_t(Xn); yn_t = to_t(yn).unsqueeze(1)\n",
+    "\n",
+    "class QMOSHead(nn.Module):\n",
+    "    def __init__(self, d_in, h, p):\n",
+    "        super().__init__()\n",
+    "        self.net = nn.Sequential(\n",
+    "            nn.Linear(d_in, h), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(h, h), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(h, 1),\n",
+    "        )\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        return self.net(x)\n",
+    "\n",
+    "model = QMOSHead(FEAT_DIM, HIDDEN, DROPOUT).to(device)\n",
+    "opt = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def pairwise_rank_loss(pred, target):\n",
+    "    \"\"\"Encourage pred to rank like target (zero-margin hinge over all pairs in the batch).\"\"\"\n",
+    "    n = pred.shape[0]\n",
+    "    if n < 2:\n",
+    "        return torch.zeros((), device=device)\n",
+    "    pi, pj = pred.unsqueeze(0), pred.unsqueeze(1)\n",
+    "    ti, tj = target.unsqueeze(0), target.unsqueeze(1)\n",
+    "    sign = torch.sign(ti - tj)                       # +1 if i should rank above j\n",
+    "    diff = pi - pj\n",
+    "    # hinge: penalize pairs whose order is wrong\n",
+    "    return torch.relu(-sign * diff).mean()\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_val():\n",
+    "    model.eval()\n",
+    "    p = model(Xn_t[va_idx]).cpu().numpy().ravel()\n",
+    "    srcc_head = spearmanr(p, y[va_idx]).correlation\n",
+    "    out = {\"head\": float(srcc_head)}\n",
+    "    if USE_UTMOS_FEAT:\n",
+    "        u = X[va_idx, -1]                            # UTMOS column (last feature, un-normalized)\n",
+    "        out[\"utmos\"] = float(spearmanr(u, y[va_idx]).correlation)\n",
+    "    return out\n",
+    "\n",
+    "best, best_state, bad = -1e9, None, 0\n",
+    "tr_t = torch.tensor(tr_idx, device=device)\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    model.train()\n",
+    "    perm = tr_t[torch.randperm(len(tr_t), device=device)]\n",
+    "    run = 0.0\n",
+    "    for i in range(0, len(perm), BATCH):\n",
+    "        b = perm[i:i + BATCH]\n",
+    "        opt.zero_grad()\n",
+    "        pred = model(Xn_t[b])\n",
+    "        loss = mse(pred, yn_t[b])\n",
+    "        if RANK_LAMBDA > 0:\n",
+    "            loss = loss + RANK_LAMBDA * pairwise_rank_loss(pred.ravel(), yn_t[b].ravel())\n",
+    "        loss.backward(); opt.step()\n",
+    "        run += loss.item() * len(b)\n",
+    "    m = eval_val()\n",
+    "    if m[\"head\"] > best:\n",
+    "        best = m[\"head\"]\n",
+    "        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "    if ep % 5 == 0 or ep == 1:\n",
+    "        extra = f\" | UTMOS={m['utmos']:.4f}\" if \"utmos\" in m else \"\"\n",
+    "        print(f\"epoch {ep:3d} | loss {run/len(perm):.4f} | head SRCC={m['head']:.4f}{extra} | best {best:.4f}\")\n",
+    "    if bad >= PATIENCE:\n",
+    "        print(f\"Early stop at epoch {ep}.\")\n",
+    "        break\n",
+    "\n",
+    "if best_state is not None:          # stays None if val SRCC was NaN every epoch (e.g. constant predictions)\n",
+    "    model.load_state_dict(best_state)\n",
+    "final = eval_val()\n",
+    "print(\"\\n✅ VAL (internal):\")\n",
+    "print(f\"   QMOS head SRCC = {final['head']:.4f}\")\n",
+    "if \"utmos\" in final:\n",
+    "    print(f\"   UTMOS  baseline = {final['utmos']:.4f}  (leaderboard reference: 0.414)\")\n",
+    "    print(\"   →\", \"✅ HEAD BEATS UTMOS\" if final[\"head\"] > final[\"utmos\"] else \"⚠️ not better yet; try raising EPOCHS / RANK_LAMBDA / enabling more features\")\n",
+    "\n",
+    "torch.save({\"state\": best_state, \"feat_mean\": feat_mean, \"feat_std\": feat_std,\n",
+    "            \"y_mu\": y_mu, \"y_sd\": y_sd, \"FEAT_DIM\": FEAT_DIM,\n",
+    "            \"USE_E2V\": USE_E2V, \"USE_SAILER\": USE_SAILER,\n",
+    "            \"USE_CLASSPROB\": USE_CLASSPROB, \"USE_UTMOS_FEAT\": USE_UTMOS_FEAT,\n",
+    "            \"val_srcc\": best}, os.path.join(OUT_DIR, \"qmos_head.pt\"))\n",
+    "print(\"Saved\", os.path.join(OUT_DIR, \"qmos_head.pt\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d33a7aca",
+   "metadata": {},
+   "source": [
+    "## 6. Predict QMOS for DEV"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "69efbd00",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "e2v_dev    = extract_e2v(dev_stems, \"dev\")    if USE_E2V    else {}\n",
+    "sailer_dev = extract_sailer(dev_stems, \"dev\") if USE_SAILER else {}\n",
+    "utmos_dev  = extract_utmos(dev_names, \"dev\")  if USE_UTMOS_FEAT else {}\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_qmos(sid):\n",
+    "    f = qmos_feature(sid, e2v_dev, sailer_dev, utmos_dev)\n",
+    "    if f is None:\n",
+    "        return None\n",
+    "    fn = (f[None, :] - feat_mean) / feat_std\n",
+    "    model.eval()\n",
+    "    return float(model(to_t(fn)).item()) * y_sd + y_mu     # undo the z-score\n",
+    "\n",
+    "qmos_pred = {}\n",
+    "n_real = n_def = 0\n",
+    "for n in dev_names:\n",
+    "    sid = stem(n)\n",
+    "    p = predict_qmos(sid)\n",
+    "    if p is None:\n",
+    "        p = utmos_dev.get(sid, 3.0)                          # fall back to UTMOS if features are missing\n",
+    "        n_def += 1\n",
+    "    else:\n",
+    "        n_real += 1\n",
+    "    qmos_pred[n] = p\n",
+    "print(f\"QMOS predictions: head {n_real}, UTMOS fallback {n_def}\")\n",
+    "\n",
+    "# Save separately (for manual merging if needed)\n",
+    "import csv\n",
+    "qmos_csv = os.path.join(OUT_DIR, \"qmos_dev.csv\")\n",
+    "with open(qmos_csv, \"w\", newline=\"\") as f:\n",
+    "    w = csv.writer(f); w.writerow([\"wav\", \"QMOS\"])\n",
+    "    for n in dev_names:\n",
+    "        w.writerow([n, f\"{qmos_pred[n]:.6g}\"])\n",
+    "print(\"Wrote\", qmos_csv)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f3e47def",
+   "metadata": {},
+   "source": [
+    "## 7. Merge the new QMOS into the exp04 answer.txt → new submission\n",
+    "Keep the 5 current-best emotion columns (EMOS/CAT/VAL/ARO/DOM) UNCHANGED; replace only the QMOS column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a3b94589",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def merge_into_exp04(exp04_path, out_path):\n",
+    "    if not os.path.exists(exp04_path):\n",
+    "        print(f\"⚠️ {exp04_path} not found → SKIPPING merge. Use qmos_dev.csv to replace the QMOS column manually,\")\n",
+    "        print(\"   or point EXP04_ANSWER at the exp04 answer.txt and rerun this cell.\")\n",
+    "        return False\n",
+    "    with open(exp04_path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    qi = header.index(\"QMOS\")\n",
+    "    wi = header.index(\"wav\")\n",
+    "    n_swapped = n_miss = 0\n",
+    "    with open(out_path, \"w\", newline=\"\") as f:\n",
+    "        w = csv.writer(f); w.writerow(header)\n",
+    "        for r in rows[1:]:\n",
+    "            name = r[wi]\n",
+    "            if name in qmos_pred:\n",
+    "                r[qi] = f\"{qmos_pred[name]:.6g}\"; n_swapped += 1\n",
+    "            else:\n",
+    "                n_miss += 1\n",
+    "            w.writerow(r)\n",
+    "    print(f\"Merged → {out_path} | replaced {n_swapped} QMOS values, {n_miss} missing (old QMOS kept)\")\n",
+    "    return True\n",
+    "\n",
+    "merged = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "ok = merge_into_exp04(EXP04_ANSWER, merged)\n",
+    "\n",
+    "if ok:\n",
+    "    # validate + zip\n",
+    "    with open(merged) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0]\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Line {i}: wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "    os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp06_qmos.zip answer.txt \"\n",
+    "              f\"&& unzip -l submission_track2_exp06_qmos.zip\")\n",
+    "    print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp06_qmos.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0a517b97",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` to catch errors; once it works, set both to `None`.\n",
+    "- **Fair comparison:** section 5 prints both `head SRCC` and `UTMOS SRCC` on the SAME internal validation split; submit only if head > UTMOS.\n",
+    "  The head SRCC is the best epoch picked on that same split, so it is slightly optimistic.\n",
+    "- If the head **does not beat** 0.414: try (a) raising `EPOCHS`; (b) setting `RANK_LAMBDA=0.2` (optimizes rank order);\n",
+    "  (c) making sure `USE_UTMOS_FEAT=True` (anchors the head to UTMOS); (d) dropping noisy features (turn off `USE_CLASSPROB`).\n",
+    "- **QMOS ablation for the paper:** toggle `USE_E2V/USE_SAILER/USE_UTMOS_FEAT/USE_CLASSPROB` and log the results in `docs/04_experiments_log.md` (exp06).\n",
+    "- The cache is SHARED with exp04 in `fusion_cache/`; remember to **Save Version** to keep it (including the new `utmos_*.npz`).\n",
+    "- Log config → results → observations in `docs/04_experiments_log.md` (exp06 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
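
The optional ranking term enabled by `RANK_LAMBDA` can be checked by hand on a toy batch. The sketch below is a plain-Python re-statement of the same zero-margin hinge (it is not the notebook's torch code; the function name `pairwise_hinge` and the toy numbers are illustrative assumptions): the loss is zero when predictions are ordered like the targets and positive when the order is reversed.

```python
def pairwise_hinge(pred, target):
    """Mean over all ordered pairs (i, j) of max(0, -sign(t_i - t_j) * (p_i - p_j))."""
    n = len(pred)
    if n < 2:
        return 0.0
    total = 0.0
    for j in range(n):
        for i in range(n):
            dt = target[i] - target[j]
            sign = (dt > 0) - (dt < 0)          # +1 if i should rank above j, -1 if below, 0 on ties
            total += max(0.0, -sign * (pred[i] - pred[j]))
    return total / (n * n)                      # same denominator as a mean over the n x n pair matrix

target = [1.0, 2.0, 3.0]
loss_ok = pairwise_hinge([0.1, 0.5, 0.9], target)    # predictions in the same order as the targets
loss_bad = pairwise_hinge([0.9, 0.5, 0.1], target)   # predictions in reversed order
print(loss_ok, loss_bad)
```

Because only the order of predictions matters to this term, it complements MSE when the evaluation metric is SRCC; the scale of the predictions is still pinned down by the MSE part of the loss.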

track2/exp06_qmos_train_pipeline.py ADDED

	@@ -0,0 +1,502 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp06 (train QMOS head) on Kaggle
+#
+# **Goal:** QMOS is the **only column not yet trained** (it currently uses zero-shot UTMOS, so SRCC is stuck at 0.414).
+# `train.csv` already has a `qMOS` column, so we train a **small regression head** on SSL features (cached in exp04)
+# to beat 0.414.
+#
+# ## Idea (read once)
+# - Reuse the **emotion2vec + SAILER** features already extracted and cached in `fusion_cache/` (exp04); NO re-extraction.
+# - Add the **UTMOS score itself** (SpeechMOS) as one input feature, so the head can **learn a correction (residual)**
+#   on top of UTMOS instead of learning quality from scratch. It should rarely do worse than UTMOS alone,
+#   but that is not guaranteed, which is why it is checked on a validation split.
+# - Gold QMOS label = **mean `qMOS` per wav** (averaged over the listeners in `train.csv`).
+# - A **10% internal validation split** measures SRCC and compares it directly with UTMOS on the SAME split, so we know
+#   whether the gain is real **before** spending a CodaBench submission.
+# - Finally: **keep exp04 unchanged** (its 5 emotion columns are the current best) and only **replace the QMOS column** in `answer.txt`.
+#
+# ```
+#  each wav ─► [e2v_emb | e2v_probs5 | sailer_emb | sailer_probs9 | sailer_vad3 | UTMOS] ─► MLP ─► QMOS
+#                                                                                   (trained head)
+# ```
+#
+# **How to run on Kaggle:** Settings → Accelerator = **GPU T4**, Internet = **On**.
+# Under **Add Input**, attach: (1) the Track 2 dataset (15,477 wavs, with `sets/train.csv`); (2) optionally, a dataset containing the
+# `fusion_cache/*.npz` files saved with Save Version in exp04 (saves ~15 min); (3) the exp04 `answer.txt` to merge the column into.
+# On the first run set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` to catch setup errors; once it works, set both to `None`.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+# ── Data Track 2 ─────────────────────────────────────────────────────────────
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # listener labels: lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"     # list of wavs in the DEV set
+OUT_DIR   = "/kaggle/working"
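+# SHARED cache with exp04. If you saved the cache with Save Version in exp04, point CACHE_DIR at that dataset
+# (e.g. "/kaggle/input/<slug-cache>/fusion_cache") to skip re-extraction; otherwise the default path re-extracts.
+# Note: /kaggle/input is read-only, so new cache files (e.g. utmos_*.npz) cannot be written there; if anything
+# still needs extracting, copy the cached .npz files into /kaggle/working/fusion_cache instead.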
+CACHE_DIR = "/kaggle/working/fusion_cache"
+os.makedirs(CACHE_DIR, exist_ok=True)
+# exp04 answer.txt (5 emotion columns, current best) into which the new QMOS column is MERGED.
+# Point this at your exp04 file. If it is missing, the notebook still writes a separate qmos_dev.csv and prints a warning.
+EXP04_ANSWER = "/kaggle/input/exp04-answer/answer.txt"   # << EDIT; or "/kaggle/working/answer.txt"
+# ── Features used for QMOS ───────────────────────────────────────────────────
+USE_E2V        = True     # concatenate the emotion2vec embedding
+USE_SAILER     = True     # concatenate the SAILER/WavLM embedding
+USE_CLASSPROB  = True     # also concatenate class probabilities (e2v5 + sailer9 + vad3)
+USE_UTMOS_FEAT = True     # also concatenate the UTMOS score as one feature (anchors the head to the UTMOS baseline)
+# ── Head training hyperparameters ────────────────────────────────────────────
+DEVICE      = "cuda"
+HIDDEN      = 256
+DROPOUT     = 0.3
+LR          = 1e-3
+EPOCHS      = 120
+BATCH       = 64
+VAL_FRAC    = 0.10
+PATIENCE    = 20
+SEED        = 42
+RANK_LAMBDA = 0.0         # 0 = MSE only. >0 (e.g. 0.2) adds a pairwise ranking loss (targets rank order, i.e. SRCC)
+LIMIT_TRAIN = None        # small number (e.g. 300) for a trial run; None = full set
+LIMIT_DEV   = None
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+assert USE_E2V or USE_SAILER or USE_UTMOS_FEAT, "Enable at least one feature source."
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install dependencies + (if needed) fetch the SAILER code
+# emotion2vec is loaded through `funasr`; SAILER needs `WavLMWrapper` from the `vox-profile-release` repo (clone + sys.path).
+# If the cache is complete, these models are NOT loaded (they load only when some files still need extraction).
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("speechmos", "funasr", "librosa", "soundfile", "pandas", "scipy", "scikit-learn", "tqdm")
+if USE_SAILER:
+    pip_install("loralib", "speechbrain")
+    REPO_DIR = "/kaggle/working/vox-profile-release"
+    if not os.path.exists(REPO_DIR):
+        subprocess.run(["git", "clone", "--depth", "1",
+                        "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+    if REPO_DIR not in sys.path:
+        sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Gold QMOS labels (average `qMOS` per wavID)
+# %%
+import numpy as np
+import pandas as pd
+def load_qmos_labels():
+    """train.csv (sep='|') → DataFrame [wavID, qmos], where qmos is the mean rating per wav."""
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col  = cols.get("wavid") or cols.get("wav") or list(df.columns)[1]
+    qmos_col = cols.get("qmos") or cols.get("mos")   # keys are lower-cased, so "qMOS" is found under "qmos"
+    assert qmos_col, f"qMOS column not found in train.csv (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    g = df.groupby("_stem")[qmos_col].mean().reset_index()
+    g.columns = ["wavID", "qmos"]
+    return g
+qmos_df = load_qmos_labels()
+print(f"train wavs (aggregated): {len(qmos_df)}")
+print("qMOS:", qmos_df["qmos"].describe()[["mean", "std", "min", "max"]].to_dict())
+qmos_df.head()
+# %% [markdown]
+# ## 3. Extract / load features (cache SHARED with exp04) + UTMOS scores
+# - `extract_e2v` / `extract_sailer`: identical to exp04, cached as `e2v_<tag>.npz` / `sailer_<tag>.npz`.
+# - `extract_utmos`: scores each wav with UTMOS → cached as `utmos_<tag>.npz` (used both as a feature and as the comparison baseline).
+# %%
+import torch
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU")
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+def extract_e2v(stems, tag):
+    """→ dict {stem: emb_full[D1+5]}. Cached at CACHE_DIR/e2v_<tag>.npz (same as exp04)."""
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"e2v_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[e2v/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from funasr import AutoModel
+        m = AutoModel(model="iic/emotion2vec_plus_large", hub="hf", device=device)
+        for i, s in enumerate(tqdm(todo, desc=f"e2v {tag}")):
+            wav = os.path.join(WAV_DIR, s + ".wav")
+            if not os.path.exists(wav):
+                continue
+            r = m.generate(wav, granularity="utterance", extract_embedding=True)[0]
+            emb = np.asarray(r["feats"], dtype=np.float32).reshape(-1)
+            probs = {e: 0.0 for e in EMOTIONS5}
+            for lab, sc in zip(r["labels"], r["scores"]):
+                name = lab.split("/")[-1]
+                if name in probs:
+                    probs[name] = float(sc)
+            tot = sum(probs.values())
+            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)
+            store[s] = np.concatenate([emb, p5]).astype(np.float32)
+            if (i + 1) % 500 == 0:
+                np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del m
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return store   # each value = [D1 | 5]
+def _pool_feat(features):
+    f = features.detach().cpu().numpy()
+    if f.ndim <= 1:
+        return f.reshape(-1).astype(np.float32)
+    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)
+def extract_sailer(stems, tag):
+    """→ dict {stem: vec[D2+9+3]}. Cached at CACHE_DIR/sailer_<tag>.npz (same as exp04)."""
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"sailer_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[sailer/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from src.model.emotion.wavlm_emotion import WavLMWrapper
+        sailer = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion").to(device).eval()
+        with torch.no_grad():
+            for i, s in enumerate(tqdm(todo, desc=f"sailer {tag}")):
+                wav = os.path.join(WAV_DIR, s + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                wave = wave[: 15 * 16000]
+                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)
+                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)
+                emb = _pool_feat(feat)
+                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)
+                vad3 = np.array([1 + 4 * float(valence.item()),
+                                 1 + 4 * float(arousal.item()),
+                                 1 + 4 * float(dominance.item())], dtype=np.float32)
+                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del sailer
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return store   # each value = [D2 | 9 | 3]
+def extract_utmos(names, tag):
+    """Score each wav with UTMOS (takes file NAMES, since dev.scp lists .wav files by name). → dict {stem: score}.
+    Cached at CACHE_DIR/utmos_<tag>.npz. Used both as a feature and as the comparison baseline."""
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"utmos_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: float(z[k]) for k in z.files}
+        print(f"[utmos/{tag}] loaded cache: {len(store)}")
+    todo = [n for n in names if stem(n) not in store]
+    if todo:
+        predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
+                                   trust_repo=True).to(device).eval()
+        with torch.no_grad():
+            for i, n in enumerate(tqdm(todo, desc=f"utmos {tag}")):
+                wav = os.path.join(WAV_DIR, n if n.endswith(".wav") else n + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                sc = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device), sr=16000).mean().item())
+                store[stem(n)] = sc
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})
+        np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})
+        del predictor
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return store
+# %% [markdown]
+# ## 4. Build features + labels for training
+# %%
+train_stems = list(qmos_df["wavID"])
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+e2v_tr    = extract_e2v(train_stems, "train")    if USE_E2V    else {}
+sailer_tr = extract_sailer(train_stems, "train") if USE_SAILER else {}
+utmos_tr  = extract_utmos(train_stems, "train")  if USE_UTMOS_FEAT else {}
+def qmos_feature(sid, e2v_map, sailer_map, utmos_map):
+    """Concatenate the QMOS features for one wav. Returns None if a required part is missing."""
+    parts = []
+    if USE_E2V:
+        v = e2v_map.get(sid)
+        if v is None:
+            return None
+        parts.append(v[:-5])                      # emb e2v
+        if USE_CLASSPROB:
+            parts.append(v[-5:])                  # probs5
+    if USE_SAILER:
+        v = sailer_map.get(sid)
+        if v is None:
+            return None
+        parts.append(v[:-12])                     # emb sailer
+        if USE_CLASSPROB:
+            parts.append(v[-12:])                 # probs9 + vad3
+    if USE_UTMOS_FEAT:
+        u = utmos_map.get(sid)
+        if u is None:
+            return None
+        parts.append(np.array([u], dtype=np.float32))
+    return np.concatenate(parts).astype(np.float32)
+lab = qmos_df.set_index("wavID")["qmos"]
+X, y = [], []
+for s in train_stems:
+    f = qmos_feature(s, e2v_tr, sailer_tr, utmos_tr)
+    if f is None or s not in lab.index:
+        continue
+    X.append(f)
+    y.append(float(lab.loc[s]))
+X = np.stack(X).astype(np.float32)
+y = np.array(y, dtype=np.float32)
+FEAT_DIM = X.shape[1]
+print(f"Train: X={X.shape} y={y.shape}")
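+# z-score the features and the target; note the statistics use all rows, including the later validation split
+feat_mean = X.mean(0, keepdims=True)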
+feat_std  = X.std(0, keepdims=True) + 1e-6
+Xn = (X - feat_mean) / feat_std
+y_mu, y_sd = float(y.mean()), float(y.std() + 1e-6)
+yn = (y - y_mu) / y_sd
+# %% [markdown]
+# ## 5. Train the QMOS head + compare with UTMOS on the SAME internal validation split
+# - Head = small MLP (`Linear→ReLU→Dropout ×2 → 1`). Loss = MSE (+ optional pairwise ranking).
+# - Prints **head SRCC** and **UTMOS SRCC** on the same validation split → shows whether the head really beats the UTMOS baseline (0.414 on the leaderboard).
+# %%
+import torch.nn as nn
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+torch.manual_seed(SEED); np.random.seed(SEED)
+idx_all = np.arange(X.shape[0])
+tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)
+def to_t(a):
+    return torch.tensor(a, dtype=torch.float32, device=device)
+Xn_t = to_t(Xn); yn_t = to_t(yn).unsqueeze(1)
+class QMOSHead(nn.Module):
+    def __init__(self, d_in, h, p):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(d_in, h), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(h, h), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(h, 1),
+        )
+    def forward(self, x):
+        return self.net(x)
+model = QMOSHead(FEAT_DIM, HIDDEN, DROPOUT).to(device)
+opt = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=1e-5)
+mse = nn.MSELoss()
+def pairwise_rank_loss(pred, target):
+    """Encourage pred to rank items like target (zero-margin hinge over all pairs in the batch)."""
+    n = pred.shape[0]
+    if n < 2:
+        return torch.zeros((), device=device)
+    pi, pj = pred.unsqueeze(0), pred.unsqueeze(1)
+    ti, tj = target.unsqueeze(0), target.unsqueeze(1)
+    sign = torch.sign(ti - tj)                       # +1 if i should rank above j
+    diff = pi - pj
+    # hinge: penalize pairs whose predicted order is wrong
+    return torch.relu(-sign * diff).mean()
+@torch.no_grad()
+def eval_val():
+    model.eval()
+    p = model(Xn_t[va_idx]).cpu().numpy().ravel()
+    srcc_head = spearmanr(p, y[va_idx]).correlation
+    out = {"head": float(srcc_head)}
+    if USE_UTMOS_FEAT:
+        u = X[va_idx, -1]                            # UTMOS column (last feature, not normalized)
+        out["utmos"] = float(spearmanr(u, y[va_idx]).correlation)
+    return out
+best, best_state, bad = -1e9, None, 0
+tr_t = torch.tensor(tr_idx, device=device)
+for ep in range(1, EPOCHS + 1):
+    model.train()
+    perm = tr_t[torch.randperm(len(tr_t), device=device)]
+    run = 0.0
+    for i in range(0, len(perm), BATCH):
+        b = perm[i:i + BATCH]
+        opt.zero_grad()
+        pred = model(Xn_t[b])
+        loss = mse(pred, yn_t[b])
+        if RANK_LAMBDA > 0:
+            loss = loss + RANK_LAMBDA * pairwise_rank_loss(pred.ravel(), yn_t[b].ravel())
+        loss.backward(); opt.step()
+        run += loss.item() * len(b)
+    m = eval_val()
+    if m["head"] > best:
+        best = m["head"]
+        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
+        bad = 0
+    else:
+        bad += 1
+    if ep % 5 == 0 or ep == 1:
+        extra = f" | UTMOS={m['utmos']:.4f}" if "utmos" in m else ""
+        print(f"epoch {ep:3d} | loss {run/len(perm):.4f} | head SRCC={m['head']:.4f}{extra} | best {best:.4f}")
+    if bad >= PATIENCE:
+        print(f"Early stop at epoch {ep}.")
+        break
+model.load_state_dict(best_state)
+final = eval_val()
+print("\n✅ VAL (internal):")
+print(f"   QMOS head SRCC = {final['head']:.4f}")
+if "utmos" in final:
+    print(f"   UTMOS  baseline = {final['utmos']:.4f}  (leaderboard reference 0.414)")
+    print("   →", "✅ HEAD BEATS UTMOS" if final["head"] > final["utmos"] else "⚠️ not beating UTMOS yet: try raising EPOCHS / RANK_LAMBDA / enabling more features")
+torch.save({"state": best_state, "feat_mean": feat_mean, "feat_std": feat_std,
+            "y_mu": y_mu, "y_sd": y_sd, "FEAT_DIM": FEAT_DIM,
+            "USE_E2V": USE_E2V, "USE_SAILER": USE_SAILER,
+            "USE_CLASSPROB": USE_CLASSPROB, "USE_UTMOS_FEAT": USE_UTMOS_FEAT,
+            "val_srcc": best}, os.path.join(OUT_DIR, "qmos_head.pt"))
+print("Saved", os.path.join(OUT_DIR, "qmos_head.pt"))
+# %% [markdown]
+# ## 6. Predict QMOS for DEV
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+e2v_dev    = extract_e2v(dev_stems, "dev")    if USE_E2V    else {}
+sailer_dev = extract_sailer(dev_stems, "dev") if USE_SAILER else {}
+utmos_dev  = extract_utmos(dev_names, "dev")  if USE_UTMOS_FEAT else {}
+@torch.no_grad()
+def predict_qmos(sid):
+    f = qmos_feature(sid, e2v_dev, sailer_dev, utmos_dev)
+    if f is None:
+        return None
+    fn = (f[None, :] - feat_mean) / feat_std
+    model.eval()
+    return float(model(to_t(fn)).item()) * y_sd + y_mu     # undo the z-score
+qmos_pred = {}
+n_real = n_def = 0
+for n in dev_names:
+    sid = stem(n)
+    p = predict_qmos(sid)
+    if p is None:
+        p = utmos_dev.get(sid, 3.0)                          # fall back to UTMOS if features are missing (3.0 if there is no UTMOS score either)
+        n_def += 1
+    else:
+        n_real += 1
+    qmos_pred[n] = p
+print(f"QMOS predictions: from head {n_real}, UTMOS fallback {n_def}")
+# Save separately (for manual merging if needed)
+import csv
+qmos_csv = os.path.join(OUT_DIR, "qmos_dev.csv")
+with open(qmos_csv, "w", newline="") as f:
+    w = csv.writer(f); w.writerow(["wav", "QMOS"])
+    for n in dev_names:
+        w.writerow([n, f"{qmos_pred[n]:.6g}"])
+print("Wrote", qmos_csv)
+# %% [markdown]
+# ## 7. Merge the new QMOS into exp04's answer.txt → new submission
+# Keep the 5 winning emotion columns (EMOS/CAT/VAL/ARO/DOM) UNCHANGED; replace only the QMOS column.
+# %%
+def merge_into_exp04(exp04_path, out_path):
+    if not os.path.exists(exp04_path):
+        print(f"⚠️ {exp04_path} not found → SKIPPING the merge. Use qmos_dev.csv to replace the QMOS column by hand,")
+        print("   or point EXP04_ANSWER at the correct exp04 answer.txt path and re-run this cell.")
+        return False
+    with open(exp04_path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    qi = header.index("QMOS")
+    wi = header.index("wav")
+    n_swapped = n_miss = 0
+    with open(out_path, "w", newline="") as f:
+        w = csv.writer(f); w.writerow(header)
+        for r in rows[1:]:
+            name = r[wi]
+            if name in qmos_pred:
+                r[qi] = f"{qmos_pred[name]:.6g}"; n_swapped += 1
+            else:
+                n_miss += 1
+            w.writerow(r)
+    print(f"Merge done → {out_path} | replaced {n_swapped} QMOS values, {n_miss} missing (old QMOS kept)")
+    return True
+merged = os.path.join(OUT_DIR, "answer.txt")
+ok = merge_into_exp04(EXP04_ANSWER, merged)
+if ok:
+    # validate + zip
+    with open(merged) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0]
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Row {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+    os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp06_qmos.zip answer.txt "
+              f"&& unzip -l submission_track2_exp06_qmos.zip")
+    print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp06_qmos.zip"))
+# %% [markdown]
+# ## Notes
+# - **First run**: set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` to catch errors; once it works, set both to `None`.
+# - **Fair comparison**: section 5 prints both `head SRCC` and `UTMOS SRCC` on the SAME internal validation split → submit only when head > UTMOS.
+# - If the head **does not yet beat** 0.414: try (a) raising `EPOCHS`; (b) setting `RANK_LAMBDA=0.2` (optimizes rank order);
+#   (c) making sure `USE_UTMOS_FEAT=True` (residual anchor); (d) dropping noisy features (turn off `USE_CLASSPROB`).
+# - **QMOS ablation for the paper**: toggle `USE_E2V/USE_SAILER/USE_UTMOS_FEAT/USE_CLASSPROB` → record in `docs/04_experiments_log.md` (exp06).
+# - The cache is SHARED with exp04 in `fusion_cache/` → remember to **Save Version** to keep it (including the new `utmos_*.npz`).
+# - Log config → results → remarks in `docs/04_experiments_log.md` (exp06 section).
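
The `RANK_LAMBDA` option in section 5 adds a pairwise hinge term to the MSE loss. The sketch below re-implements that term in plain Python (no torch) purely as an illustration of what it measures; the names `sign` and `pairwise_hinge` and the toy inputs are assumptions and are not part of the pipeline.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_hinge(pred, target):
    """Mean zero-margin hinge over all ordered pairs (a, b), diagonal included.

    A pair contributes only when pred orders the two items opposite to target.
    """
    n = len(pred)
    if n < 2:
        return 0.0
    total = 0.0
    for a in range(n):
        for b in range(n):
            s = sign(target[b] - target[a])   # desired order of the pair
            diff = pred[b] - pred[a]          # predicted gap for the pair
            total += max(0.0, -s * diff)      # positive only if the order is wrong
    return total / (n * n)


if __name__ == "__main__":
    # Same order as the target, then fully reversed order.
    print(pairwise_hinge([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(pairwise_hinge([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

Because the margin is zero, constant predictions also score zero on this term, which is one reason it is only used on top of the MSE loss rather than alone.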

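
The notes above recommend submitting only when the head's SRCC exceeds the UTMOS SRCC on the same validation split. The pipeline computes this with `scipy.stats.spearmanr`; the following is a minimal pure-Python sketch of the same quantity (Pearson correlation of average ranks), assuming small in-memory lists. The helper names `average_ranks` and `srcc` and the toy scores are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation = Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return float("nan")  # undefined when one input is constant
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    labels = [2.1, 3.4, 3.9, 4.5, 1.8]      # toy mean-opinion scores
    head = [2.0, 3.0, 4.1, 4.0, 1.5]        # toy head predictions
    baseline = [3.0, 2.5, 3.8, 4.2, 2.9]    # toy baseline scores
    print("head SRCC    :", round(srcc(head, labels), 4))
    print("baseline SRCC:", round(srcc(baseline, labels), 4))
```

Both correlations are computed against the same labels, which is what makes the head-versus-baseline comparison meaningful.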
track2/exp07_fusion_qmos.ipynb ADDED

	@@ -0,0 +1,780 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "c75f9ad6",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp07 (FUSION + QMOS head, UNIFIED 6 columns), Kaggle\n",
+    "\n",
+    "**How it differs from exp04:** exp04 keeps **QMOS separate** (zero-shot UTMOS). exp07 **folds QMOS into the shared trunk**\n",
+    "→ one multi-task model that predicts **all 6 outputs**: QMOS · EMOS · CAT · VAL · ARO · DOM.\n",
+    "\n",
+    "## Hypothesis (yours) to test\n",
+    "\"Natural voice quality is related to perceived emotion\" → if true, QMOS should **benefit** from the shared\n",
+    "emotion representation (emotion2vec + SAILER). **Risk:** both backbones specialize in *emotion* and may not capture\n",
+    "*quality defects/artifacts* well (UTMOS's specialty) → QMOS may **lose** to UTMOS, or the merge may **degrade** EMOS/VAD.\n",
+    "\n",
+    "## Safety nets in the design\n",
+    "- **The UTMOS score is still fed as an input** to the QMOS head (`USE_UTMOS_FEAT`) → the head learns a **correction** on top of\n",
+    "  the UTMOS baseline (leaderboard SRCC 0.414) instead of learning from scratch → unlikely to be worse than UTMOS.\n",
+    "- **Print the metrics for all 6 columns + compare with the exp04 reference** (EMOS 0.788 · CAT err 0.145 · VAL 0.578 · ARO 0.754 · DOM 0.706)\n",
+    "  → warns immediately if merging QMOS degrades the 5 emotion columns.\n",
+    "- **Separate file**; `exp04_fusion_pipeline.py` is NOT touched (exp04 stays intact).\n",
+    "\n",
+    "```\n",
+    " each wav ─► [e2v_emb | e2v_p5 | sailer_emb | sailer_p9 | sailer_vad3] ─► shared TRUNK\n",
+    "                                                                           │\n",
+    "        ┌──────────────┬───────────────┬─────────────┬───────────────────┤\n",
+    "  [QMOS head]      [EMOS head]      [CAT head]    [VAD head]\n",
+    "  trunk + UTMOS    trunk + target    trunk         trunk\n",
+    "```\n",
+    "\n",
+    "**How to run:** GPU T4 + Internet On → Add Input with the Track 2 dataset → edit `DATA_ROOT` → Run All.\n",
+    "On the first run set `LIMIT_TRAIN=300`, `LIMIT_DEV=20`. SHARES the `fusion_cache/` cache with exp04 (adds `utmos_*.npz`)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b4e814c4",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "57b9eedb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/fusion_cache\"     # SHARED with exp04 (adds utmos_*.npz)\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── Hyperparameters ──────────────────────────────────────────────────────────\n",
+    "DEVICE          = \"cuda\"\n",
+    "TRUNK_HIDDEN    = 512\n",
+    "HEAD_HIDDEN     = 128\n",
+    "DROPOUT         = 0.3\n",
+    "LR              = 1e-3\n",
+    "EPOCHS          = 80\n",
+    "BATCH           = 64\n",
+    "VAL_FRAC        = 0.10\n",
+    "PATIENCE        = 15\n",
+    "SEED            = 42\n",
+    "\n",
+    "USE_UNCERTAINTY = True        # balance the 6 losses automatically (Kendall-style uncertainty weighting); False = use the fixed LOSS_W\n",
+    "LOSS_W          = {\"qmos\": 1.0, \"emos\": 1.0, \"cat\": 1.0, \"val\": 1.0, \"aro\": 1.0, \"dom\": 1.0}\n",
+    "USE_E2V         = True\n",
+    "USE_SAILER      = True\n",
+    "USE_CLASSPROB   = True\n",
+    "USE_UTMOS_FEAT  = True        # feed the UTMOS score into the QMOS head (residual anchor around the UTMOS baseline, leaderboard SRCC 0.414)\n",
+    "\n",
+    "LIMIT_TRAIN     = None\n",
+    "LIMIT_DEV       = None\n",
+    "\n",
+    "# exp04 reference scores for comparison (warn if they drop once QMOS is merged in)\n",
+    "EXP04 = {\"emos\": 0.788, \"cat_err\": 0.145, \"val\": 0.578, \"aro\": 0.754, \"dom\": 0.706, \"qmos_utmos\": 0.414}\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "SAILER9 = [\"Anger\", \"Contempt\", \"Disgust\", \"Fear\", \"Happiness\", \"Neutral\", \"Sadness\", \"Surprise\", \"Other\"]\n",
+    "EMO2SAILER = {\"angry\": 0, \"happy\": 4, \"neutral\": 5, \"sad\": 6, \"surprised\": 7}\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "assert USE_E2V or USE_SAILER, \"Enable at least one backbone.\"\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "547ccf32",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + fetch the SAILER code"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "132b9321",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"speechmos\", \"funasr\", \"librosa\", \"soundfile\", \"pandas\", \"scipy\", \"scikit-learn\", \"tqdm\")\n",
+    "\n",
+    "if USE_SAILER:\n",
+    "    pip_install(\"loralib\", \"speechbrain\")\n",
+    "    REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "    if not os.path.exists(REPO_DIR):\n",
+    "        subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                        \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "    if REPO_DIR not in sys.path:\n",
+    "        sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "75c6f07c",
+   "metadata": {},
+   "source": [
+    "## 2. Read & aggregate labels (per wavID), ADDING a qMOS column\n",
+    "Unlike exp04, this also aggregates **qMOS** (= mean `qMOS` per wav) as the label for the QMOS head."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c73a7fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, default_idx=None, df=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    \"\"\"train.csv → DataFrame [wavID, qmos, emos, val, aro, dom, cat0..cat4], aggregated per wav.\"\"\"\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col  = _col(cols, \"wavid\", \"wav\", default_idx=1, df=df)\n",
+    "    qmos_col = _col(cols, \"qmos\", \"mos\")\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col  = _col(cols, \"val\", \"valence\")\n",
+    "    aro_col  = _col(cols, \"aro\", \"arousal\")\n",
+    "    dom_col  = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col  = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert qmos_col, f\"qMOS column not found in train.csv (columns: {list(df.columns)})\"\n",
+    "    assert emos_col, f\"eMOS column not found in train.csv (columns: {list(df.columns)})\"\n",
+    "\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid,\n",
+    "               \"qmos\": float(g[qmos_col].mean()),\n",
+    "               \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 1.0 / len(EMOTIONS5), dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")\n",
+    "print(\"qMOS:\", train_df[\"qmos\"].describe()[[\"mean\", \"std\", \"min\", \"max\"]].to_dict())\n",
+    "print(\"eMOS:\", train_df[\"emos\"].describe()[[\"mean\", \"std\", \"min\", \"max\"]].to_dict())\n",
+    "train_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0726b340",
+   "metadata": {},
+   "source": [
+    "## 3. Extract features from the 2 backbones + UTMOS scores (cache SHARED with exp04)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ae27e424",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU\")\n",
+    "\n",
+    "def extract_e2v(stems, tag):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"e2v_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[e2v/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from funasr import AutoModel\n",
+    "        m = AutoModel(model=\"iic/emotion2vec_plus_large\", hub=\"hf\", device=device)\n",
+    "        for i, s in enumerate(tqdm(todo, desc=f\"e2v {tag}\")):\n",
+    "            wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "            if not os.path.exists(wav):\n",
+    "                continue\n",
+    "            r = m.generate(wav, granularity=\"utterance\", extract_embedding=True)[0]\n",
+    "            emb = np.asarray(r[\"feats\"], dtype=np.float32).reshape(-1)\n",
+    "            probs = {e: 0.0 for e in EMOTIONS5}\n",
+    "            for lab, sc in zip(r[\"labels\"], r[\"scores\"]):\n",
+    "                name = lab.split(\"/\")[-1]\n",
+    "                if name in probs:\n",
+    "                    probs[name] = float(sc)\n",
+    "            tot = sum(probs.values())\n",
+    "            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)\n",
+    "            store[s] = np.concatenate([emb, p5]).astype(np.float32)\n",
+    "            if (i + 1) % 500 == 0:\n",
+    "                np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del m\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return {s: (v[:-5], v[-5:]) for s, v in store.items()}\n",
+    "\n",
+    "def _pool_feat(features):\n",
+    "    f = features.detach().cpu().numpy()\n",
+    "    if f.ndim <= 1:\n",
+    "        return f.reshape(-1).astype(np.float32)\n",
+    "    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)\n",
+    "\n",
+    "def extract_sailer(stems, tag):\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"sailer_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[sailer/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from src.model.emotion.wavlm_emotion import WavLMWrapper\n",
+    "        sailer = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\").to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, s in enumerate(tqdm(todo, desc=f\"sailer {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                wave = wave[: 15 * 16000]\n",
+    "                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)\n",
+    "                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)\n",
+    "                emb = _pool_feat(feat)\n",
+    "                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)\n",
+    "                vad3 = np.array([1 + 4 * float(valence.item()),\n",
+    "                                 1 + 4 * float(arousal.item()),\n",
+    "                                 1 + 4 * float(dominance.item())], dtype=np.float32)\n",
+    "                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del sailer\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return {s: (v[:-12], v[-12:-3], v[-3:]) for s, v in store.items()}\n",
+    "\n",
+    "def extract_utmos(names, tag):\n",
+    "    \"\"\"Score each wav with UTMOS (keyed by NAME, because DEV lists .wav files by name). → dict {stem: score}.\n",
+    "    Cached at CACHE_DIR/utmos_<tag>.npz. Used both as input to the QMOS head and as the comparison baseline.\"\"\"\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"utmos_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: float(z[k]) for k in z.files}\n",
+    "        print(f\"[utmos/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [n for n in names if stem(n) not in store]\n",
+    "    if todo:\n",
+    "        predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\",\n",
+    "                                   trust_repo=True).to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, n in enumerate(tqdm(todo, desc=f\"utmos {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else n + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                store[stem(n)] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device),\n",
+    "                                                 sr=16000).mean().item())\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        del predictor\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e9fad1a0",
+   "metadata": {},
+   "source": [
+    "## 4. Build features + labels for training\n",
+    "Audio (emotion) feature = `[e2v_emb | e2v_p5 | sailer_emb | sailer_p9 | sailer_vad3]` (same as exp04).\n",
+    "Added: the **UTMOS** vector (one score per wav) for the QMOS head, and the **qMOS** label."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "768f374b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_stems = list(train_df[\"wavID\"])\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "\n",
+    "e2v_tr    = extract_e2v(train_stems, \"train\")    if USE_E2V    else {}\n",
+    "sailer_tr = extract_sailer(train_stems, \"train\") if USE_SAILER else {}\n",
+    "utmos_tr  = extract_utmos(train_stems, \"train\")  if USE_UTMOS_FEAT else {}\n",
+    "\n",
+    "def audio_feature(sid, e2v_map, sailer_map):\n",
+    "    parts = []\n",
+    "    if USE_E2V:\n",
+    "        pk = e2v_map.get(sid)\n",
+    "        if pk is None:\n",
+    "            return None\n",
+    "        emb, p5 = pk\n",
+    "        parts.append(emb)\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(p5)\n",
+    "    if USE_SAILER:\n",
+    "        pk = sailer_map.get(sid)\n",
+    "        if pk is None:\n",
+    "            return None\n",
+    "        emb, p9, vad3 = pk\n",
+    "        parts.append(emb)\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(p9); parts.append(vad3)\n",
+    "    return np.concatenate(parts).astype(np.float32)\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "X, T, U, y_qmos, y_emos, y_vad, y_cat = [], [], [], [], [], [], []\n",
+    "for s in train_stems:\n",
+    "    f = audio_feature(s, e2v_tr, sailer_tr)\n",
+    "    tgt = target_map.get(s)\n",
+    "    if f is None or tgt is None or s not in lab.index:\n",
+    "        continue\n",
+    "    if USE_UTMOS_FEAT and s not in utmos_tr:\n",
+    "        continue\n",
+    "    X.append(f)\n",
+    "    T.append(onehot_target(tgt))\n",
+    "    U.append(utmos_tr.get(s, 3.0) if USE_UTMOS_FEAT else 0.0)\n",
+    "    y_qmos.append(lab.loc[s, \"qmos\"])\n",
+    "    y_emos.append(lab.loc[s, \"emos\"])\n",
+    "    y_vad.append([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]])\n",
+    "    y_cat.append([lab.loc[s, f\"cat{i}\"] for i in range(len(EMOTIONS5))])\n",
+    "\n",
+    "X = np.stack(X).astype(np.float32)\n",
+    "T = np.stack(T).astype(np.float32)\n",
+    "U = np.array(U, dtype=np.float32).reshape(-1, 1)\n",
+    "y_qmos = np.array(y_qmos, dtype=np.float32)\n",
+    "y_emos = np.array(y_emos, dtype=np.float32)\n",
+    "y_vad  = np.array(y_vad,  dtype=np.float32)\n",
+    "y_cat  = np.array(y_cat,  dtype=np.float32)\n",
+    "FEAT_DIM = X.shape[1]\n",
+    "print(f\"Train: X={X.shape} U={U.shape} qmos={y_qmos.shape} emos={y_emos.shape} vad={y_vad.shape}\")\n",
+    "\n",
+    "# Standardize audio features + UTMOS (z-score); keep mean/std for inference.\n",
+    "feat_mean = X.mean(0, keepdims=True); feat_std = X.std(0, keepdims=True) + 1e-6\n",
+    "Xn = (X - feat_mean) / feat_std\n",
+    "u_mu, u_sd = float(U.mean()), float(U.std() + 1e-6)\n",
+    "Un = (U - u_mu) / u_sd\n",
+    "\n",
+    "# Standardize the continuous labels to z-scores.\n",
+    "qmos_mu, qmos_sd = float(y_qmos.mean()), float(y_qmos.std() + 1e-6)\n",
+    "y_qmos_z = (y_qmos - qmos_mu) / qmos_sd\n",
+    "emos_mu, emos_sd = float(y_emos.mean()), float(y_emos.std() + 1e-6)\n",
+    "y_emos_z = (y_emos - emos_mu) / emos_sd\n",
+    "if HAS_VAD:\n",
+    "    vad_mu = np.nanmean(y_vad, axis=0); vad_sd = np.nanstd(y_vad, axis=0) + 1e-6\n",
+    "    y_vad_z = (y_vad - vad_mu) / vad_sd\n",
+    "else:\n",
+    "    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)\n",
+    "    y_vad_z = np.zeros_like(y_vad)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5903e6ca",
+   "metadata": {},
+   "source": [
+    "## 5. Multi-task fusion model (6 tasks, 4 heads) + training loop\n",
+    "Added vs. exp04: a **QMOS head** that takes `[trunk | UTMOS]` and outputs 1 value; `qmos` joins the uncertainty weighting (6 tasks)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "68a3a836",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch.nn as nn\n",
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "idx_all = np.arange(X.shape[0])\n",
+    "tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)\n",
+    "\n",
+    "def to_t(a):\n",
+    "    return torch.tensor(a, dtype=torch.float32, device=device)\n",
+    "\n",
+    "Xn_t, T_t, Un_t = to_t(Xn), to_t(T), to_t(Un)\n",
+    "qmos_t = to_t(y_qmos_z).unsqueeze(1)\n",
+    "emos_t = to_t(y_emos_z).unsqueeze(1)\n",
+    "vad_t  = to_t(y_vad_z)\n",
+    "cat_t  = to_t(y_cat)\n",
+    "\n",
+    "class FusionMTL6(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo, use_utmos):\n",
+    "        super().__init__()\n",
+    "        self.use_utmos = use_utmos\n",
+    "        self.trunk = nn.Sequential(\n",
+    "            nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "        )\n",
+    "        self.qmos = nn.Sequential(   # takes [trunk | utmos] when enabled\n",
+    "            nn.Linear(trunk_h + (1 if use_utmos else 0), head_h), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(head_h, 1))\n",
+    "        self.emos = nn.Sequential(   # takes [trunk | target]\n",
+    "            nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat  = nn.Sequential(\n",
+    "            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad  = nn.Sequential(\n",
+    "            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "\n",
+    "    def forward(self, x, tgt, utmos):\n",
+    "        h = self.trunk(x)\n",
+    "        qmos_in = torch.cat([h, utmos], dim=1) if self.use_utmos else h\n",
+    "        qmos = self.qmos(qmos_in)\n",
+    "        emos = self.emos(torch.cat([h, tgt], dim=1))\n",
+    "        cat_logits = self.cat(h)\n",
+    "        vad = self.vad(h)\n",
+    "        return qmos, emos, cat_logits, vad\n",
+    "\n",
+    "model = FusionMTL6(FEAT_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO, USE_UTMOS_FEAT).to(device)\n",
+    "\n",
+    "TASKS = [\"qmos\", \"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "params = list(model.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.Adam(params, lr=LR, weight_decay=1e-5)\n",
+    "mse = nn.MSELoss(reduction=\"none\")\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    logq = F.log_softmax(logits, dim=1)\n",
+    "    return -(target_dist * logq).sum(dim=1)\n",
+    "\n",
+    "def task_losses(qmos_p, emos_p, cat_logits, vad_p, b):\n",
+    "    L = {}\n",
+    "    L[\"qmos\"] = mse(qmos_p, qmos_t[b]).mean()\n",
+    "    L[\"emos\"] = mse(emos_p, emos_t[b]).mean()\n",
+    "    L[\"cat\"]  = soft_ce(cat_logits, cat_t[b]).mean()\n",
+    "    if HAS_VAD:\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vad_t[b, 0:1]).mean()\n",
+    "        L[\"aro\"] = mse(vad_p[:, 1:2], vad_t[b, 1:2]).mean()\n",
+    "        L[\"dom\"] = mse(vad_p[:, 2:3], vad_t[b, 2:3]).mean()\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device)\n",
+    "        L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    return L\n",
+    "\n",
+    "def combine(L):\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        tot = 0.0\n",
+    "        for i, t in enumerate(TASKS):\n",
+    "            tot = tot + torch.exp(-log_var[i]) * L[t] + log_var[i]\n",
+    "        return tot\n",
+    "    return sum(LOSS_W[t] * L[t] for t in TASKS)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_val():\n",
+    "    model.eval()\n",
+    "    qp, ep, cl, vp = model(Xn_t[va_idx], T_t[va_idx], Un_t[va_idx])\n",
+    "    qp = qp.cpu().numpy().ravel(); ep = ep.cpu().numpy().ravel()\n",
+    "    out = {\"qmos\": spearmanr(qp, y_qmos[va_idx]).correlation,\n",
+    "           \"emos\": spearmanr(ep, y_emos[va_idx]).correlation}\n",
+    "    if USE_UTMOS_FEAT:\n",
+    "        out[\"qmos_utmos\"] = spearmanr(U[va_idx, 0], y_qmos[va_idx]).correlation   # standalone UTMOS baseline\n",
+    "    if HAS_VAD:\n",
+    "        vp = vp.cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            out[t] = spearmanr(vp[:, j], y_vad[va_idx, j]).correlation\n",
+    "    q = F.softmax(cl, dim=1).cpu().numpy(); p = y_cat[va_idx]\n",
+    "    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(1).mean()\n",
+    "    out[\"cat_negkl\"] = float(-kl)\n",
+    "    return out\n",
+    "\n",
+    "def val_score(m):\n",
+    "    \"\"\"Early-stopping score = mean SRCC over the continuous tasks (qmos + emos + VAD).\"\"\"\n",
+    "    keys = [\"qmos\", \"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "best_score, best_state, bad = -1e9, None, 0\n",
+    "tr_t = torch.tensor(tr_idx, device=device)\n",
+    "for ep_i in range(1, EPOCHS + 1):\n",
+    "    model.train()\n",
+    "    perm = tr_t[torch.randperm(len(tr_t), device=device)]\n",
+    "    run = 0.0\n",
+    "    for i in range(0, len(perm), BATCH):\n",
+    "        b = perm[i:i + BATCH]\n",
+    "        opt.zero_grad()\n",
+    "        qmos_p, emos_p, cat_logits, vad_p = model(Xn_t[b], T_t[b], Un_t[b])\n",
+    "        L = task_losses(qmos_p, emos_p, cat_logits, vad_p, b)\n",
+    "        loss = combine(L)\n",
+    "        loss.backward(); opt.step()\n",
+    "        run += loss.item() * len(b)\n",
+    "    m = eval_val()\n",
+    "    sc = val_score(m)\n",
+    "    if sc > best_score:\n",
+    "        best_score = sc\n",
+    "        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "    if ep_i % 5 == 0 or ep_i == 1:\n",
+    "        msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"qmos\", \"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "        print(f\"epoch {ep_i:3d} | loss {run/len(perm):.4f} | {msg} | best {best_score:.4f}\")\n",
+    "    if bad >= PATIENCE:\n",
+    "        print(f\"Early stop at epoch {ep_i}.\")\n",
+    "        break\n",
+    "\n",
+    "model.load_state_dict(best_state)\n",
+    "final = eval_val()\n",
+    "print(\"\\n✅ VAL (internal), exp07 (fusion + QMOS head):\")\n",
+    "print(f\"   QMOS SRCC = {final['qmos']:.4f}\", end=\"\")\n",
+    "if \"qmos_utmos\" in final:\n",
+    "    tag = \"✅ beats UTMOS\" if final[\"qmos\"] > final[\"qmos_utmos\"] else \"⚠️ does NOT beat UTMOS yet\"\n",
+    "    print(f\"   (standalone UTMOS = {final['qmos_utmos']:.4f} → {tag})\")\n",
+    "else:\n",
+    "    print()\n",
+    "print(f\"   EMOS SRCC = {final['emos']:.4f}   (exp04 baseline = {EXP04['emos']})\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM = {final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f}\"\n",
+    "          f\"   (exp04 baseline = {EXP04['val']}/{EXP04['aro']}/{EXP04['dom']})\")\n",
+    "# Negative-transfer warning (adding QMOS degrades the emotion columns)\n",
+    "warn = []\n",
+    "if final[\"emos\"] < EXP04[\"emos\"] - 0.02:\n",
+    "    warn.append(f\"EMOS {final['emos']:.3f} < {EXP04['emos']}\")\n",
+    "if HAS_VAD:\n",
+    "    for t in [\"val\", \"aro\", \"dom\"]:\n",
+    "        if final[t] < EXP04[t] - 0.02:\n",
+    "            warn.append(f\"{t.upper()} {final[t]:.3f} < {EXP04[t]}\")\n",
+    "if warn:\n",
+    "    print(\"   ⚠️ NEGATIVE TRANSFER? Emotion scores dropped vs. exp04:\", \"; \".join(warn),\n",
+    "          \"\\n      → consider keeping exp04 for the 5 emotion columns and taking only QMOS from exp07/exp06.\")\n",
+    "else:\n",
+    "    print(\"   ✅ No clear drop in the 5 emotion columns vs. exp04.\")\n",
+    "if USE_UNCERTAINTY:\n",
+    "    print(\"   log σ² per task:\", {t: round(float(log_var[i]), 3) for i, t in enumerate(TASKS)})\n",
+    "\n",
+    "torch.save({\"state\": best_state, \"feat_mean\": feat_mean, \"feat_std\": feat_std,\n",
+    "            \"u_mu\": u_mu, \"u_sd\": u_sd,\n",
+    "            \"qmos_mu\": qmos_mu, \"qmos_sd\": qmos_sd, \"emos_mu\": emos_mu, \"emos_sd\": emos_sd,\n",
+    "            \"vad_mu\": vad_mu, \"vad_sd\": vad_sd, \"FEAT_DIM\": FEAT_DIM,\n",
+    "            \"USE_E2V\": USE_E2V, \"USE_SAILER\": USE_SAILER, \"USE_CLASSPROB\": USE_CLASSPROB,\n",
+    "            \"USE_UTMOS_FEAT\": USE_UTMOS_FEAT, \"val_score\": best_score},\n",
+    "           os.path.join(OUT_DIR, \"fusion_qmos_mtl.pt\"))\n",
+    "print(\"Saved\", os.path.join(OUT_DIR, \"fusion_qmos_mtl.pt\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9a788c48",
+   "metadata": {},
+   "source": [
+    "## 6. Predict DEV → `answer.txt` with all 6 columns (QMOS now comes from the HEAD, not a separate SpeechMOS run)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8acd813f",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "e2v_dev    = extract_e2v(dev_stems, \"dev\")    if USE_E2V    else {}\n",
+    "sailer_dev = extract_sailer(dev_stems, \"dev\") if USE_SAILER else {}\n",
+    "utmos_dev  = extract_utmos(dev_names, \"dev\")  if USE_UTMOS_FEAT else {}\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_all(sid):\n",
+    "    f = audio_feature(sid, e2v_dev, sailer_dev)\n",
+    "    if f is None:\n",
+    "        return None\n",
+    "    fn = (f[None, :] - feat_mean) / feat_std\n",
+    "    tgt = onehot_target(target_map.get(sid))[None, :]\n",
+    "    u = np.array([[utmos_dev.get(sid, 3.0)]], dtype=np.float32)\n",
+    "    un = (u - u_mu) / u_sd\n",
+    "    model.eval()\n",
+    "    qmos_p, emos_p, cat_logits, vad_p = model(to_t(fn), to_t(tgt), to_t(un))\n",
+    "    qmos = float(qmos_p.item()) * qmos_sd + qmos_mu\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_logits, dim=1)[0].cpu().numpy()\n",
+    "    vad3 = vad_p[0].cpu().numpy() * vad_sd + vad_mu\n",
+    "    return qmos, emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    n_real = n_default = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pred = predict_all(sid)\n",
+    "            if pred is None:\n",
+    "                # fallback: QMOS = UTMOS if available, defaults for everything else\n",
+    "                qmos = utmos_dev.get(sid, 3.0)\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])\n",
+    "                n_default += 1\n",
+    "            else:\n",
+    "                qmos, emos, cat5, vad3 = pred\n",
+    "                n_real += 1\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},\"\n",
+    "                    f\"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | head predictions {n_real}, defaults {n_default}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0ac67bb5",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8244b0c3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp07_fusion_qmos.zip answer.txt \"\n",
+    "          f\"&& unzip -l submission_track2_exp07_fusion_qmos.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp07_fusion_qmos.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a73c1c11",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20`; once it works, set both to `None`.\n",
+    "- **Read the section 5 results against two questions:**\n",
+    "  1. Does the QMOS head **beat standalone UTMOS (0.414)**? (the UTMOS comparison line)\n",
+    "  2. Does adding QMOS **degrade** EMOS/VAD relative to exp04? (the \"NEGATIVE TRANSFER?\" line)\n",
+    "- **Submission decision:**\n",
+    "  - If QMOS improves and the emotion columns do NOT drop → submit the exp07 answer.txt (one model for all 6 columns, which is cleaner for the paper).\n",
+    "  - If QMOS improves but the emotion columns DROP → keep exp04 for the 5 emotion columns and splice in only the **QMOS column** from exp07/exp06.\n",
+    "  - If QMOS does not beat UTMOS → conclude that \"quality is orthogonal to emotion\" (still a finding for the paper); keep exp04.\n",
+    "- **Ablation for the paper**: `USE_UTMOS_FEAT=False` (QMOS from the emotion trunk only) → tests your hypothesis directly.\n",
+    "- The cache is SHARED with exp04 under `fusion_cache/` → use **Save Version** to keep it.\n",
+    "- Log config → results → remarks in `docs/04_experiments_log.md` (exp07 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
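
The uncertainty weighting used in section 5 of the notebook above combines the per-task losses as `sum_i exp(-s_i) * L_i + s_i`, where `s_i` is a learned log-variance per task. The following is a minimal, framework-free sketch of that formula only. The function name `combine_uncertainty` and the example numbers are assumptions made for illustration; in the notebook the `s_i` values are a trainable `nn.Parameter`, not fixed constants.

```python
import math

TASKS = ["qmos", "emos", "cat", "val", "aro", "dom"]

def combine_uncertainty(losses, log_var):
    """Sum of exp(-s_i) * L_i + s_i over all tasks (s_i = log-variance)."""
    return sum(math.exp(-log_var[t]) * losses[t] + log_var[t] for t in TASKS)

# Hypothetical per-task losses, all equal to 1.0 for the example.
losses = {t: 1.0 for t in TASKS}

# With every s_i = 0 the result is just the plain sum of the losses.
zero = {t: 0.0 for t in TASKS}
print(combine_uncertainty(losses, zero))

# Raising s for one task halves its weight here, at the price of a +s penalty.
noisy = dict(zero, qmos=math.log(2.0))
print(combine_uncertainty(losses, noisy))
```

The `+ s_i` term is what stops the optimizer from driving every weight to zero: a task can be down-weighted only by paying a penalty that grows with its log-variance.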

track2/exp07_fusion_qmos_pipeline.py ADDED Viewed

	@@ -0,0 +1,654 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp07 (FUSION + QMOS head, all 6 columns UNIFIED), Kaggle
+#
+# **How it differs from exp04:** exp04 keeps **QMOS separate** (zero-shot UTMOS). exp07 **folds QMOS into the shared trunk**
+# → one multi-task model that predicts **all 6 outputs**: QMOS · EMOS · CAT · VAL · ARO · DOM.
+#
+# ## Hypothesis (yours) to test
+# "Natural voice quality is related to perceived emotion" → if true, QMOS should **benefit** from the shared
+# emotion representation (emotion2vec + SAILER). **Risk:** both backbones specialize in *emotion* and may not capture
+# *quality defects/artifacts* (UTMOS's specialty) → QMOS may **underperform** UTMOS, or the merge may **degrade** EMOS/VAD.
+#
+# ## Safety nets in the design
+# - **Still feed the UTMOS score as an input** to the QMOS head (`USE_UTMOS_FEAT`) → the head learns a **correction** around
+#   the UTMOS score (standalone SRCC 0.414) instead of learning from scratch → unlikely to do worse than UTMOS.
+# - **Print the validation SRCC and compare against the exp04 baselines** (EMOS 0.788 · CAT err 0.145 · VAL 0.578 · ARO 0.754 · DOM 0.706)
+#   → warns immediately if adding QMOS degrades the 5 emotion columns.
+# - **Separate file**; does NOT touch `exp04_fusion_pipeline.py` (exp04 stays intact).
+#
+# ```
+# each wav ─► [e2v_emb | e2v_p5 | sailer_emb | sailer_p9 | sailer_vad3] ─► TRUNK (shared)
+#                                                                            │
+#         ┌──────────────┬───────────────┬─────────────┬───────────────────┤
+#   [QMOS head]      [EMOS head]      [CAT head]    [VAD head]
+#   trunk + UTMOS    trunk + target    trunk         trunk
+# ```
+#
+# **How to run:** GPU T4 + Internet On → Add Input (the Track 2 dataset) → edit `DATA_ROOT` → Run All.
+# For the first run set `LIMIT_TRAIN=300`, `LIMIT_DEV=20`. SHARES the `fusion_cache/` cache with exp04 (adds `utmos_*.npz`).
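
The second safety net (comparing against the exp04 baselines) can be sketched as a small standalone check. This is a minimal illustration, not the pipeline's exact code: the function name `negative_transfer_warnings`, the `tol` parameter, and the example scores are assumptions made for this sketch. The baseline numbers are the exp04 values quoted in the header, and the 0.02 margin matches the one used in the training cell.

```python
# Illustrative sketch: flag emotion columns whose validation SRCC fell
# more than `tol` below the exp04 baseline after adding the QMOS head.
EXP04_BASELINE = {"emos": 0.788, "val": 0.578, "aro": 0.754, "dom": 0.706}

def negative_transfer_warnings(final, baseline=EXP04_BASELINE, tol=0.02):
    """Return one message per column that dropped by more than `tol`."""
    warnings = []
    for name, ref in baseline.items():
        score = final.get(name)
        if score is not None and score < ref - tol:
            warnings.append(f"{name.upper()} {score:.3f} < {ref}")
    return warnings

# Hypothetical validation scores, made up for the example.
example = {"emos": 0.790, "val": 0.540, "aro": 0.750, "dom": 0.700}
print(negative_transfer_warnings(example))
```

The tolerance keeps small run-to-run fluctuations from triggering a warning; only a drop larger than the margin is treated as possible negative transfer.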
+# %% [markdown]
+# ## 0. Configuration (EDIT HERE)
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the dataset slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/fusion_cache"     # SHARED with exp04 (adds utmos_*.npz)
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── Hyperparameters ──────────────────────────────────────────────────────────
+DEVICE          = "cuda"
+TRUNK_HIDDEN    = 512
+HEAD_HIDDEN     = 128
+DROPOUT         = 0.3
+LR              = 1e-3
+EPOCHS          = 80
+BATCH           = 64
+VAL_FRAC        = 0.10
+PATIENCE        = 15
+SEED            = 42
+USE_UNCERTAINTY = True        # auto-balance the 6 losses (Kendall); False = use the fixed LOSS_W
+LOSS_W          = {"qmos": 1.0, "emos": 1.0, "cat": 1.0, "val": 1.0, "aro": 1.0, "dom": 1.0}
+USE_E2V         = True
+USE_SAILER      = True
+USE_CLASSPROB   = True
+USE_UTMOS_FEAT  = True        # feed the UTMOS score into the QMOS head (residual anchored on UTMOS, standalone SRCC 0.414)
+LIMIT_TRAIN     = None
+LIMIT_DEV       = None
+# exp04 baselines for comparison (warn if they drop once QMOS is added)
+EXP04 = {"emos": 0.788, "cat_err": 0.145, "val": 0.578, "aro": 0.754, "dom": 0.706, "qmos_utmos": 0.414}
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+SAILER9 = ["Anger", "Contempt", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise", "Other"]
+EMO2SAILER = {"angry": 0, "happy": 4, "neutral": 5, "sad": 6, "surprised": 7}
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+assert USE_E2V or USE_SAILER, "Enable at least one backbone."
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install dependencies + fetch the SAILER code
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("speechmos", "funasr", "librosa", "soundfile", "pandas", "scipy", "scikit-learn", "tqdm")
+if USE_SAILER:
+    pip_install("loralib", "speechbrain")
+    REPO_DIR = "/kaggle/working/vox-profile-release"
+    if not os.path.exists(REPO_DIR):
+        subprocess.run(["git", "clone", "--depth", "1",
+                        "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+    if REPO_DIR not in sys.path:
+        sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Load & aggregate labels (grouped by wavID), now with a qMOS column
+# Unlike exp04: also aggregates **qMOS** (= mean of `qMOS` per wav) as the label for the QMOS head.
+# %%
+import numpy as np
+import pandas as pd
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, default_idx=None, df=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    """train.csv → DataFrame [wavID, qmos, emos, val, aro, dom, cat0..cat4], aggregated per wav."""
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col  = _col(cols, "wavid", "wav", default_idx=1, df=df)
+    qmos_col = _col(cols, "qmos", "mos")
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col  = _col(cols, "val", "valence")
+    aro_col  = _col(cols, "aro", "arousal")
+    dom_col  = _col(cols, "dom", "dominance")
+    cat_col  = _col(cols, "emocat", "cat", "emotion")
+    assert qmos_col, f"qMOS column not found in train.csv (columns: {list(df.columns)})"
+    assert emos_col, f"eMOS column not found in train.csv (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid,
+               "qmos": float(g[qmos_col].mean()),
+               "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 1.0 / len(EMOTIONS5), dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}")
+print("qMOS:", train_df["qmos"].describe()[["mean", "std", "min", "max"]].to_dict())
+print("eMOS:", train_df["emos"].describe()[["mean", "std", "min", "max"]].to_dict())
+train_df.head()
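
The aggregation in section 2 can be illustrated on a toy table. The rows below are invented for the example (they are not from the dataset), and the sketch uses only the standard library rather than pandas. It shows the per-wav mean score and the normalized vote distribution that `load_train_labels` computes; the helper name `aggregate` is an assumption for this sketch.

```python
from collections import defaultdict

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]

# Hypothetical listener rows: (wavID, qMOS, emoCat). Values are made up.
ratings = [
    ("utt001", 4.0, "happy"),
    ("utt001", 3.0, "happy"),
    ("utt001", 5.0, "neutral"),
    ("utt002", 2.0, "sad"),
]

def aggregate(rows):
    """Group listener rows by wav: mean qMOS + normalized emotion votes."""
    groups = defaultdict(list)
    for wav_id, qmos, emo in rows:
        groups[wav_id].append((qmos, emo))
    out = {}
    for wav_id, items in groups.items():
        mean_qmos = sum(q for q, _ in items) / len(items)
        votes = [0.0] * len(EMOTIONS5)
        for _, emo in items:
            if emo in EMOTIONS5:
                votes[EMOTIONS5.index(emo)] += 1.0
        total = sum(votes)
        # Fall back to a uniform distribution when no vote is recognized.
        if total > 0:
            cat = [v / total for v in votes]
        else:
            cat = [1.0 / len(EMOTIONS5)] * len(EMOTIONS5)
        out[wav_id] = {"qmos": mean_qmos, "cat": cat}
    return out

labels = aggregate(ratings)
print(labels["utt001"])
```

Keeping the votes as a distribution instead of a single majority label is what lets the CAT head train with a soft cross-entropy target.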
+# %% [markdown]
+# ## 3. Extract features from the 2 backbones + UTMOS scores (cache SHARED with exp04)
+# %%
+import torch
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU")
+def extract_e2v(stems, tag):
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"e2v_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[e2v/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from funasr import AutoModel
+        m = AutoModel(model="iic/emotion2vec_plus_large", hub="hf", device=device)
+        for i, s in enumerate(tqdm(todo, desc=f"e2v {tag}")):
+            wav = os.path.join(WAV_DIR, s + ".wav")
+            if not os.path.exists(wav):
+                continue
+            r = m.generate(wav, granularity="utterance", extract_embedding=True)[0]
+            emb = np.asarray(r["feats"], dtype=np.float32).reshape(-1)
+            probs = {e: 0.0 for e in EMOTIONS5}
+            for lab, sc in zip(r["labels"], r["scores"]):
+                name = lab.split("/")[-1]
+                if name in probs:
+                    probs[name] = float(sc)
+            tot = sum(probs.values())
+            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)
+            store[s] = np.concatenate([emb, p5]).astype(np.float32)
+            if (i + 1) % 500 == 0:
+                np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del m
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return {s: (v[:-5], v[-5:]) for s, v in store.items()}
+def _pool_feat(features):
+    f = features.detach().cpu().numpy()
+    if f.ndim <= 1:
+        return f.reshape(-1).astype(np.float32)
+    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)
+def extract_sailer(stems, tag):
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"sailer_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[sailer/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from src.model.emotion.wavlm_emotion import WavLMWrapper
+        sailer = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion").to(device).eval()
+        with torch.no_grad():
+            for i, s in enumerate(tqdm(todo, desc=f"sailer {tag}")):
+                wav = os.path.join(WAV_DIR, s + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                wave = wave[: 15 * 16000]
+                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)
+                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)
+                emb = _pool_feat(feat)
+                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)
+                vad3 = np.array([1 + 4 * float(valence.item()),
+                                 1 + 4 * float(arousal.item()),
+                                 1 + 4 * float(dominance.item())], dtype=np.float32)
+                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del sailer
+        if device == "cuda":
+            torch.cuda.empty_cache()
+    return {s: (v[:-12], v[-12:-3], v[-3:]) for s, v in store.items()}
+def extract_utmos(names, tag):
+    """Score each wav with UTMOS (keyed by NAME, because DEV lists .wav files by name). → dict {stem: score}.
+    Cached at CACHE_DIR/utmos_<tag>.npz. Used both as input to the QMOS head and as a comparison baseline."""
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"utmos_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: float(z[k]) for k in z.files}
+        print(f"[utmos/{tag}] loaded cache: {len(store)}")
+    todo = [n for n in names if stem(n) not in store]
+    if todo:
+        predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
+                                   trust_repo=True).to(device).eval()
+        with torch.no_grad():
+            for i, n in enumerate(tqdm(todo, desc=f"utmos {tag}")):
+                wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else n + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                store[stem(n)] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device),
+                                                 sr=16000).mean().item())
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})
+        np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})
+        del predictor
+        if device == "cuda":
+            torch.cuda.empty_cache()
+    return store
+# %% [markdown]
+# ## 4. Build features + labels for training
+# Audio (emotion) features = `[e2v_emb | e2v_p5 | sailer_emb | sailer_p9 | sailer_vad3]` (as in exp04).
+# Added: the **UTMOS** score (one number per wav) for the QMOS head, and the **qMOS** label.
+# %%
+train_stems = list(train_df["wavID"])
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+e2v_tr    = extract_e2v(train_stems, "train")    if USE_E2V    else {}
+sailer_tr = extract_sailer(train_stems, "train") if USE_SAILER else {}
+utmos_tr  = extract_utmos(train_stems, "train")  if USE_UTMOS_FEAT else {}
+def audio_feature(sid, e2v_map, sailer_map):
+    parts = []
+    if USE_E2V:
+        pk = e2v_map.get(sid)
+        if pk is None:
+            return None
+        emb, p5 = pk
+        parts.append(emb)
+        if USE_CLASSPROB:
+            parts.append(p5)
+    if USE_SAILER:
+        pk = sailer_map.get(sid)
+        if pk is None:
+            return None
+        emb, p9, vad3 = pk
+        parts.append(emb)
+        if USE_CLASSPROB:
+            parts.append(p9); parts.append(vad3)
+    return np.concatenate(parts).astype(np.float32)
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+lab = train_df.set_index("wavID")
+X, T, U, y_qmos, y_emos, y_vad, y_cat = [], [], [], [], [], [], []
+for s in train_stems:
+    f = audio_feature(s, e2v_tr, sailer_tr)
+    tgt = target_map.get(s)
+    if f is None or tgt is None or s not in lab.index:
+        continue
+    if USE_UTMOS_FEAT and s not in utmos_tr:
+        continue
+    X.append(f)
+    T.append(onehot_target(tgt))
+    U.append(utmos_tr.get(s, 3.0) if USE_UTMOS_FEAT else 0.0)
+    y_qmos.append(lab.loc[s, "qmos"])
+    y_emos.append(lab.loc[s, "emos"])
+    y_vad.append([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]])
+    y_cat.append([lab.loc[s, f"cat{i}"] for i in range(len(EMOTIONS5))])
+X = np.stack(X).astype(np.float32)
+T = np.stack(T).astype(np.float32)
+U = np.array(U, dtype=np.float32).reshape(-1, 1)
+y_qmos = np.array(y_qmos, dtype=np.float32)
+y_emos = np.array(y_emos, dtype=np.float32)
+y_vad  = np.array(y_vad,  dtype=np.float32)
+y_cat  = np.array(y_cat,  dtype=np.float32)
+FEAT_DIM = X.shape[1]
+print(f"Train: X={X.shape} U={U.shape} qmos={y_qmos.shape} emos={y_emos.shape} vad={y_vad.shape}")
+# Z-score the audio features and UTMOS; keep mean/std for inference.
+feat_mean = X.mean(0, keepdims=True); feat_std = X.std(0, keepdims=True) + 1e-6
+Xn = (X - feat_mean) / feat_std
+u_mu, u_sd = float(U.mean()), float(U.std() + 1e-6)
+Un = (U - u_mu) / u_sd
+# Z-score the continuous labels.
+qmos_mu, qmos_sd = float(y_qmos.mean()), float(y_qmos.std() + 1e-6)
+y_qmos_z = (y_qmos - qmos_mu) / qmos_sd
+emos_mu, emos_sd = float(y_emos.mean()), float(y_emos.std() + 1e-6)
+y_emos_z = (y_emos - emos_mu) / emos_sd
+if HAS_VAD:
+    vad_mu = np.nanmean(y_vad, axis=0); vad_sd = np.nanstd(y_vad, axis=0) + 1e-6
+    y_vad_z = (y_vad - vad_mu) / vad_sd
+else:
+    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)
+    y_vad_z = np.zeros_like(y_vad)
+# %% [markdown]
+# ## 5. Multi-task fusion model (6 tasks) + training loop
+# New relative to exp04: a **QMOS head** that takes `[trunk | UTMOS]` → 1; `qmos` joins the uncertainty weighting (6 tasks).
+# %%
+import torch.nn as nn
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+idx_all = np.arange(X.shape[0])
+tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)
+def to_t(a):
+    return torch.tensor(a, dtype=torch.float32, device=device)
+Xn_t, T_t, Un_t = to_t(Xn), to_t(T), to_t(Un)
+qmos_t = to_t(y_qmos_z).unsqueeze(1)
+emos_t = to_t(y_emos_z).unsqueeze(1)
+vad_t  = to_t(y_vad_z)
+cat_t  = to_t(y_cat)
+class FusionMTL6(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo, use_utmos):
+        super().__init__()
+        self.use_utmos = use_utmos
+        self.trunk = nn.Sequential(
+            nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p),
+        )
+        self.qmos = nn.Sequential(   # takes [trunk | utmos] when enabled
+            nn.Linear(trunk_h + (1 if use_utmos else 0), head_h), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(head_h, 1))
+        self.emos = nn.Sequential(   # takes [trunk | target]
+            nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat  = nn.Sequential(
+            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad  = nn.Sequential(
+            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, x, tgt, utmos):
+        h = self.trunk(x)
+        qmos_in = torch.cat([h, utmos], dim=1) if self.use_utmos else h
+        qmos = self.qmos(qmos_in)
+        emos = self.emos(torch.cat([h, tgt], dim=1))
+        cat_logits = self.cat(h)
+        vad = self.vad(h)
+        return qmos, emos, cat_logits, vad
+model = FusionMTL6(FEAT_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO, USE_UTMOS_FEAT).to(device)
+TASKS = ["qmos", "emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+params = list(model.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.Adam(params, lr=LR, weight_decay=1e-5)
+mse = nn.MSELoss(reduction="none")
+def soft_ce(logits, target_dist):
+    logq = F.log_softmax(logits, dim=1)
+    return -(target_dist * logq).sum(dim=1)
+def task_losses(qmos_p, emos_p, cat_logits, vad_p, b):
+    L = {}
+    L["qmos"] = mse(qmos_p, qmos_t[b]).mean()
+    L["emos"] = mse(emos_p, emos_t[b]).mean()
+    L["cat"]  = soft_ce(cat_logits, cat_t[b]).mean()
+    if HAS_VAD:
+        L["val"] = mse(vad_p[:, 0:1], vad_t[b, 0:1]).mean()
+        L["aro"] = mse(vad_p[:, 1:2], vad_t[b, 1:2]).mean()
+        L["dom"] = mse(vad_p[:, 2:3], vad_t[b, 2:3]).mean()
+    else:
+        z = torch.zeros((), device=device)
+        L["val"] = L["aro"] = L["dom"] = z
+    return L
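+def combine(L):
+    # Without VAD labels the val/aro/dom losses are constant zeros; skip them so
+    # their log-variances are not pushed toward -inf by the "+ log_var" term.
+    active = [(i, t) for i, t in enumerate(TASKS) if HAS_VAD or t not in ("val", "aro", "dom")]
+    if USE_UNCERTAINTY:
+        tot = 0.0
+        for i, t in active:
+            tot = tot + torch.exp(-log_var[i]) * L[t] + log_var[i]
+        return tot
+    return sum(LOSS_W[t] * L[t] for _, t in active)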
+@torch.no_grad()
+def eval_val():
+    model.eval()
+    qp, ep, cl, vp = model(Xn_t[va_idx], T_t[va_idx], Un_t[va_idx])
+    qp = qp.cpu().numpy().ravel(); ep = ep.cpu().numpy().ravel()
+    out = {"qmos": spearmanr(qp, y_qmos[va_idx]).correlation,
+           "emos": spearmanr(ep, y_emos[va_idx]).correlation}
+    if USE_UTMOS_FEAT:
+        out["qmos_utmos"] = spearmanr(U[va_idx, 0], y_qmos[va_idx]).correlation   # UTMOS-only baseline
+    if HAS_VAD:
+        vp = vp.cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            out[t] = spearmanr(vp[:, j], y_vad[va_idx, j]).correlation
+    q = F.softmax(cl, dim=1).cpu().numpy(); p = y_cat[va_idx]
+    kl = (p * (np.log(p + 1e-9) - np.log(q + 1e-9))).sum(1).mean()
+    out["cat_negkl"] = float(-kl)
+    return out
+def val_score(m):
+    """Early-stopping score = mean SRCC over the continuous tasks (qmos + emos + VAD)."""
+    keys = ["qmos", "emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+best_score, best_state, bad = -1e9, None, 0
+tr_t = torch.tensor(tr_idx, device=device)
+for ep_i in range(1, EPOCHS + 1):
+    model.train()
+    perm = tr_t[torch.randperm(len(tr_t), device=device)]
+    run = 0.0
+    for i in range(0, len(perm), BATCH):
+        b = perm[i:i + BATCH]
+        opt.zero_grad()
+        qmos_p, emos_p, cat_logits, vad_p = model(Xn_t[b], T_t[b], Un_t[b])
+        L = task_losses(qmos_p, emos_p, cat_logits, vad_p, b)
+        loss = combine(L)
+        loss.backward(); opt.step()
+        run += loss.item() * len(b)
+    m = eval_val()
+    sc = val_score(m)
+    if sc > best_score:
+        best_score = sc
+        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
+        bad = 0
+    else:
+        bad += 1
+    if ep_i % 5 == 0 or ep_i == 1:
+        msg = " ".join(f"{k}={m[k]:.3f}" for k in ["qmos", "emos", "val", "aro", "dom"] if k in m)
+        print(f"epoch {ep_i:3d} | loss {run/len(perm):.4f} | {msg} | best {best_score:.4f}")
+    if bad >= PATIENCE:
+        print(f"Early stop at epoch {ep_i}.")
+        break
+model.load_state_dict(best_state)
+final = eval_val()
+print("\n✅ Internal validation, exp07 (fusion + QMOS head):")
+print(f"   QMOS SRCC = {final['qmos']:.4f}", end="")
+if "qmos_utmos" in final:
+    tag = "✅ beats UTMOS" if final["qmos"] > final["qmos_utmos"] else "⚠️ does NOT beat UTMOS yet"
+    print(f"   (UTMOS alone = {final['qmos_utmos']:.4f} → {tag})")
+else:
+    print()
+print(f"   EMOS SRCC = {final['emos']:.4f}   (exp04 baseline = {EXP04['emos']})")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM = {final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f}"
+          f"   (exp04 baseline = {EXP04['val']}/{EXP04['aro']}/{EXP04['dom']})")
+# Negative-transfer warning (adding QMOS hurts the emotion tasks)
+warn = []
+if final["emos"] < EXP04["emos"] - 0.02:
+    warn.append(f"EMOS {final['emos']:.3f} < {EXP04['emos']}")
+if HAS_VAD:
+    for t in ["val", "aro", "dom"]:
+        if final[t] < EXP04[t] - 0.02:
+            warn.append(f"{t.upper()} {final[t]:.3f} < {EXP04[t]}")
+if warn:
+    print("   ⚠️ NEGATIVE TRANSFER? Emotion scores dropped versus exp04:", "; ".join(warn),
+          "\n      → consider keeping exp04 for the 5 emotion columns and taking only QMOS from exp07/exp06.")
+else:
+    print("   ✅ No clear drop in the 5 emotion columns versus exp04.")
+if USE_UNCERTAINTY:
+    print("   log σ² per task:", {t: round(float(log_var[i]), 3) for i, t in enumerate(TASKS)})
+torch.save({"state": best_state, "feat_mean": feat_mean, "feat_std": feat_std,
+            "u_mu": u_mu, "u_sd": u_sd,
+            "qmos_mu": qmos_mu, "qmos_sd": qmos_sd, "emos_mu": emos_mu, "emos_sd": emos_sd,
+            "vad_mu": vad_mu, "vad_sd": vad_sd, "FEAT_DIM": FEAT_DIM,
+            "USE_E2V": USE_E2V, "USE_SAILER": USE_SAILER, "USE_CLASSPROB": USE_CLASSPROB,
+            "USE_UTMOS_FEAT": USE_UTMOS_FEAT, "val_score": best_score},
+           os.path.join(OUT_DIR, "fusion_qmos_mtl.pt"))
+print("Saved", os.path.join(OUT_DIR, "fusion_qmos_mtl.pt"))
+# %% [markdown]
+# ## 6. Predict DEV → `answer.txt` with all 6 columns (QMOS now comes from the HEAD, not from a separate SpeechMOS pass)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+e2v_dev    = extract_e2v(dev_stems, "dev")    if USE_E2V    else {}
+sailer_dev = extract_sailer(dev_stems, "dev") if USE_SAILER else {}
+utmos_dev  = extract_utmos(dev_names, "dev")  if USE_UTMOS_FEAT else {}
+@torch.no_grad()
+def predict_all(sid):
+    f = audio_feature(sid, e2v_dev, sailer_dev)
+    if f is None:
+        return None
+    fn = (f[None, :] - feat_mean) / feat_std
+    tgt = onehot_target(target_map.get(sid))[None, :]
+    u = np.array([[utmos_dev.get(sid, 3.0)]], dtype=np.float32)
+    un = (u - u_mu) / u_sd
+    model.eval()
+    qmos_p, emos_p, cat_logits, vad_p = model(to_t(fn), to_t(tgt), to_t(un))
+    qmos = float(qmos_p.item()) * qmos_sd + qmos_mu
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_logits, dim=1)[0].cpu().numpy()
+    vad3 = vad_p[0].cpu().numpy() * vad_sd + vad_mu
+    return qmos, emos, cat5, vad3
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    from tqdm.auto import tqdm
+    n_real = n_default = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pred = predict_all(sid)
+            if pred is None:
+                # fallback: QMOS = UTMOS if available, defaults for the rest
+                qmos = utmos_dev.get(sid, 3.0)
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])
+                n_default += 1
+            else:
+                qmos, emos, cat5, vad3 = pred
+                n_real += 1
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
+                    f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | model predictions {n_real}, defaults {n_default}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp07_fusion_qmos.zip answer.txt "
+          f"&& unzip -l submission_track2_exp07_fusion_qmos.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp07_fusion_qmos.zip"))
+# %% [markdown]
+# ## Notes
+# - **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20`; once that works, set both to `None`.
+# - **Read the section 5 results against two questions:**
+#   1. Does the QMOS head **beat UTMOS alone (0.414)**? (see the UTMOS comparison printed next to QMOS SRCC)
+#   2. Does adding QMOS **degrade** EMOS/VAD relative to exp04? (see the "NEGATIVE TRANSFER?" line)
+# - **Submission decision:**
+#   - If QMOS↑ and emotion does NOT drop → submit the exp07 answer.txt (one model covering all 6 columns, which reads well in the paper).
+#   - If QMOS↑ but emotion DROPS → keep exp04 for the 5 emotion columns and splice in only the **QMOS column** from exp07/exp06.
+#   - If QMOS does not beat UTMOS → conclude that "quality is orthogonal to emotion" (still a finding for the paper); keep exp04.
+# - **Ablation for the paper**: `USE_UTMOS_FEAT=False` (QMOS from the emotion trunk only) → tests your hypothesis directly.
+# - The cache is SHARED with exp04 in `fusion_cache/` → use **Save Version** to keep it.
+# - Log config → results → remarks in `docs/04_experiments_log.md` (exp07 section).
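The `combine` function in section 5 applies learned uncertainty weighting: each task loss L_i is scaled by exp(-s_i) and the log-variance s_i is added as a penalty. The sketch below reproduces that formula in plain Python, without PyTorch, to show how the weights behave. `uncertainty_weighted_loss`, `optimal_log_var`, and the toy loss values are hypothetical names and numbers chosen for illustration; they are not part of the commit.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Sum of exp(-s_i) * L_i + s_i over tasks, where s_i plays the role of log sigma_i^2."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

def optimal_log_var(loss):
    """For a fixed loss L > 0, exp(-s) * L + s is minimized at s = log(L)."""
    return math.log(loss)

# Toy per-task losses (hypothetical values).
losses = [0.5, 2.0]

# With every s_i = 0 the weights are all 1, so the total is the plain sum.
plain_total = uncertainty_weighted_loss(losses, [0.0, 0.0])

# At the per-task optimum the effective weight exp(-s_i) equals 1 / L_i,
# so a task with a larger loss is down-weighted.
best_log_vars = [optimal_log_var(loss) for loss in losses]
weights = [math.exp(-s) for s in best_log_vars]
best_total = uncertainty_weighted_loss(losses, best_log_vars)
```

During training the s_i are updated by gradient descent alongside the model, so they only drift toward these optima. The closed form also shows why a loss that is identically zero is awkward under this scheme: log(0) is undefined, and the penalty term alone keeps decreasing as s_i goes down.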

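Sections 6 and 7 of the exp07 script write and validate `answer.txt`, a comma-separated file whose CAT field packs five `emotion:probability` pairs joined by `|`. The following standard-library sketch, which assumes file names contain no commas, formats one row the same way and parses it back. `format_row`, `parse_cat`, and the file name `utt_0001.wav` are hypothetical.

```python
import csv
import io

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
HEADER = "wav,QMOS,EMOS,CAT,VAL,ARO,DOM"

def format_row(name, qmos, emos, cat5, vad3):
    """Build one answer.txt row; CAT is 'emotion:prob' pairs joined by '|'."""
    cat = "|".join(f"{e}:{p:.6g}" for e, p in zip(EMOTIONS5, cat5))
    return (f"{name},{qmos:.6g},{emos:.6g},{cat},"
            f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}")

def parse_cat(field):
    """Turn 'angry:0.2|happy:0.2|...' back into a dict of floats."""
    return {k: float(v) for k, v in (item.split(":") for item in field.split("|"))}

# A default row like the fallback branch produces (uniform CAT, mid-scale scores).
row = format_row("utt_0001.wav", 3.25, 3.0, [0.2] * 5, [3.0, 3.0, 3.0])

# Read it back the way a validator would, using the csv module.
rows = list(csv.reader(io.StringIO(HEADER + "\n" + row + "\n")))
header, first = rows[0], rows[1]
cat = parse_cat(first[3])
```

Because the CAT separator is `|` rather than a comma, the whole distribution stays in a single CSV field, which is what lets a column-count check per row work.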
track2/exp08_finetune_emotion.ipynb ADDED

	@@ -0,0 +1,820 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ee3b7231",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp08 (FINE-TUNE WavLM for the 5 emotion columns), Kaggle\n",
+    "\n",
+    "**Unlike every earlier experiment:** exp03–07 all **froze** the backbone (feature extraction only, plus a small head trained on cached features).\n",
+    "exp08 **UNFREEZES (fine-tunes)** WavLM-large so that it relearns features specific to the 2026 emotional MOS task.\n",
+    "\n",
+    "## Design (agreed with the mentor on 5/6)\n",
+    "```\n",
+    " wav ─┬─► WavLM-large (warm-start SAILER, TRAINABLE: top N layers unfrozen)   ─► pool ─► emb_wavlm ┐\n",
+    "      └─► audeering MSP-dim (FROZEN, cache .npz)  ─► [emb_aud | vad3]                                ├─► TRUNK ─┬─► EMOS (+target)\n",
+    "                                                                                                      ┘          ├─► CAT (5)\n",
+    "                                                                                                                 └─► VAD (3)\n",
+    " QMOS: NOT trained here → borrow the QMOS column from exp07 (0.548) or UTMOSv2 (T05, VMC2024 winner).\n",
+    "```\n",
+    "- **Warm-start:** initialize WavLM from the **SAILER** checkpoint (`tiantiaf/wavlm-large-categorical-emotion`,\n",
+    "  already strong on emotion) instead of a \"blank\" WavLM → a much better starting point.\n",
+    "- **Auxiliary (frozen):** audeering, a dimensional model that complements WavLM's categorical view and is expected to lift **VAL**.\n",
+    "- **Partial freezing:** train only the top `UNFREEZE_TOP_LAYERS` Transformer layers and keep the feature extractor frozen\n",
+    "  → saves VRAM on the T4 and limits overfitting (only 12.7k samples).\n",
+    "\n",
+    "## ⚠️ Trade-offs to know up front (versus freeze + head)\n",
+    "- **No more cache advantage:** every epoch reruns the whole WavLM (forward + backward) → slow and burns GPU hours (30 h/week).\n",
+    "  → **On the first run you MUST set `LIMIT_TRAIN=300`, `LIMIT_DEV=20`** to check that everything runs cleanly, then switch to `None`.\n",
+    "- **Prone to overfitting / OOM:** on OOM → lower `BATCH`, raise `ACCUM`, lower `MAX_SECONDS`, lower `UNFREEZE_TOP_LAYERS`.\n",
+    "- **Safety net:** exp07 remains the best submission until exp08 **wins on the internal validation split** (do not waste submission attempts).\n",
+    "\n",
+    "**How to run on Kaggle:** GPU **T4** + Internet **On** → Add Input (Track 2 dataset) → edit `DATA_ROOT` → Run All."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "656d8385",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "38d86264",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/ft_cache\"         # audeering cache (.npz); the WavLM backbone is NOT cached (it is being trained)\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# (Optional) REUSE an old audeering cache: point to a dataset containing aud_train.npz/aud_dev.npz → copied to CACHE_DIR automatically.\n",
+    "# Leave \"\" for a fresh run. /kaggle/input is read-only, so files must be copied to working before writing/appending.\n",
+    "CACHE_INPUT = \"/kaggle/input/datasets/minhtoan2/cache-exp8\"   # << EDIT the slug to match (or \"\")\n",
+    "if CACHE_INPUT and os.path.isdir(CACHE_INPUT):\n",
+    "    import shutil\n",
+    "    _n = 0\n",
+    "    for _fn in os.listdir(CACHE_INPUT):\n",
+    "        if _fn.startswith(\"aud_\") and _fn.endswith(\".npz\"):\n",
+    "            shutil.copy(os.path.join(CACHE_INPUT, _fn), os.path.join(CACHE_DIR, _fn)); _n += 1\n",
+    "    print(f\"📦 Reusing cache: copied {_n} aud_*.npz file(s) from {CACHE_INPUT} → {CACHE_DIR}\")\n",
+    "\n",
+    "# Borrow the QMOS column from exp07 (best so far, 0.548). Point to exp07's answer.txt if available; otherwise use UTMOSv2.\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"   # << (optional) Add Input with exp07's answer.txt; if absent → UTMOSv2\n",
+    "\n",
+    "# ── Fine-tuning / hyperparameters ────────────────────────────────────────────\n",
+    "DEVICE              = \"cuda\"\n",
+    "SR                  = 16000\n",
+    "MAX_SECONDS         = 8           # truncate audio to bound backprop memory; on OOM lower to 6\n",
+    "UNFREEZE_TOP_LAYERS = 6           # number of top Transformer layers to train (0 = freeze everything = back to head-only)\n",
+    "TRUNK_HIDDEN        = 512\n",
+    "HEAD_HIDDEN         = 128\n",
+    "DROPOUT             = 0.3\n",
+    "LR_BACKBONE         = 1e-5        # small LR for the fine-tuned backbone\n",
+    "LR_HEAD             = 1e-3        # larger LR for trunk + heads (trained from scratch)\n",
+    "WEIGHT_DECAY        = 1e-5\n",
+    "EPOCHS              = 12          # CEILING; early stopping decides the actual epoch count (8 is a bit low for a real run)\n",
+    "PATIENCE            = 3           # stop when val SRCC has not improved for 3 epochs; ALWAYS keep best_state\n",
+    "BATCH               = 4           # small because the backbone is large; raise ACCUM to compensate\n",
+    "ACCUM               = 8           # effective batch = BATCH*ACCUM = 32\n",
+    "VAL_FRAC            = 0.10\n",
+    "SEED                = 42\n",
+    "USE_AMP             = True        # fp16 mixed precision; saves VRAM\n",
+    "USE_GRAD_CKPT       = True        # gradient checkpointing; saves VRAM (at the cost of speed)\n",
+    "USE_AUDEERING       = True        # frozen audeering auxiliary branch; False = WavLM only\n",
+    "USE_UNCERTAINTY     = True        # learn the balance of the 5 losses (Kendall-style uncertainty weighting); False = weight 1.0 each\n",
+    "\n",
+    "LIMIT_TRAIN         = 300         # << FIRST RUN: 300; real run: None\n",
+    "LIMIT_DEV           = 20          # << FIRST RUN: 20; real run: None\n",
+    "\n",
+    "# exp07 baselines for comparison (warn if fine-tuning does NOT win → keep exp07)\n",
+    "EXP07 = {\"emos\": 0.795, \"cat_err\": 0.153, \"val\": 0.581, \"aro\": 0.752, \"dom\": 0.705}\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)\n",
+    "print(f\"Fine-tune: unfreezing the top {UNFREEZE_TOP_LAYERS} layers · BATCH {BATCH}×ACCUM {ACCUM} · MAX {MAX_SECONDS}s\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ed538923",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + fetch the SAILER code (clone + sys.path, NO pip install -e .)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f052d016",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"loralib\", \"speechbrain\", \"speechmos\", \"librosa\", \"soundfile\",\n",
+    "            \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0bf41e8c",
+   "metadata": {},
+   "source": [
+    "## 2. Load SAILER → extract the inner WavLM backbone to FINE-TUNE\n",
+    "Instead of calling the wrapper as a black box, we **pull the WavLM-large (HuggingFace) module out of the wrapper**\n",
+    "→ full control over freezing/unfreezing each layer, plus our own pooling. If it cannot be found (unexpected structure) → **fall back**\n",
+    "to loading a blank `microsoft/wavlm-large` (the warm-start is lost and a warning is printed)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "50a7cac6",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (very slow!)\")\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    \"\"\"Find an HF Wav2Vec2/WavLM-style backbone submodule: one that has .feature_extractor and .encoder.layers.\"\"\"\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ SAILER warm-start: took the WavLM backbone inside the wrapper at '.{name}' \"\n",
+    "              f\"({sum(p.numel() for p in wavlm.parameters())/1e6:.0f}M params)\")\n",
+    "    else:\n",
+    "        print(\"⚠️ No HF backbone found inside the SAILER wrapper → falling back to a blank WavLM.\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Error loading the SAILER wrapper:\", repr(e), \"→ falling back to a blank WavLM.\")\n",
+    "\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: loaded microsoft/wavlm-large (NO SAILER warm-start).\")\n",
+    "\n",
+    "wavlm = wavlm.to(device)\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "\n",
+    "# ── Partial freezing: feature extractor + everything except the top UNFREEZE_TOP_LAYERS layers ──\n",
+    "for p in wavlm.parameters():\n",
+    "    p.requires_grad = False\n",
+    "enc_layers = wavlm.encoder.layers\n",
+    "n_layers = len(enc_layers)\n",
+    "for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:\n",
+    "    for p in layer.parameters():\n",
+    "        p.requires_grad = True\n",
+    "n_train = sum(p.numel() for p in wavlm.parameters() if p.requires_grad)\n",
+    "print(f\"WavLM: {n_layers} encoder layers · unfroze the top {min(UNFREEZE_TOP_LAYERS, n_layers)} \"\n",
+    "      f\"→ {n_train/1e6:.1f}M trainable params (hidden dim {WAVLM_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    wavlm.gradient_checkpointing_enable()\n",
+    "    if hasattr(wavlm, \"enable_input_require_grads\"):\n",
+    "        wavlm.enable_input_require_grads()   # needed when grad-ckpt is combined with frozen lower layers\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    \"\"\"Mean-pool over time, ignoring padded frames (keeps gradients).\"\"\"\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "def wavlm_embed(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state   # [B,T,D]\n",
+    "    return masked_mean(out, attn_mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d8b8b8de",
+   "metadata": {},
+   "source": [
+    "## 3. Load audeering MSP-dim (FROZEN): auxiliary features\n",
+    "Extract `[emb_pool(1024) | vad3(1–5)]` for each wav and **cache it as .npz** (computed once). The head is loaded by hand,\n",
+    "exactly as in exp05 (this avoids transformers version errors from subclassing `Wav2Vec2PreTrainedModel`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8731aa54",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "AUD_DIM = 0\n",
+    "aud_backbone = aud_head = aud_proc = None\n",
+    "if USE_AUDEERING:\n",
+    "    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "    from huggingface_hub import hf_hub_download\n",
+    "    AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "    aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "    try:\n",
+    "        _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "            hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "    except Exception:\n",
+    "        _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "    bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "    missing, unexpected = aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "    print(f\"  audeering backbone: {len(missing)} missing / {len(unexpected)} unexpected keys (strict=False)\")\n",
+    "    _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "    _out = _sd[\"classifier.out_proj.weight\"].shape[0]\n",
+    "    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))\n",
+    "    aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "    aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "    aud_backbone = aud_backbone.to(device).eval()\n",
+    "    aud_head = aud_head.to(device).eval()\n",
+    "    AUD_DIM = _hid + 3   # emb_pool + [VAL,ARO,DOM]\n",
+    "    print(f\"✅ audeering frozen (auxiliary features, {AUD_DIM}-D = emb {_hid} + vad 3)\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d12e4737",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import librosa\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "def load_wav(name_or_stem, in_wav_dir=True):\n",
+    "    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(\n",
+    "        WAV_DIR, name_or_stem if str(name_or_stem).endswith(\".wav\") else str(name_or_stem) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: MAX_SECONDS * SR].astype(np.float32)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def extract_audeering(stems, tag):\n",
+    "    \"\"\"Return {stem: float32[AUD_DIM]}; cached in CACHE_DIR/aud_<tag>.npz (saved every 500 items so a run can resume).\"\"\"\n",
+    "    if not USE_AUDEERING:\n",
+    "        return {}\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"aud_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[aud/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    for i, s in enumerate(tqdm(todo, desc=f\"audeering {tag}\")):\n",
+    "        wave = load_wav(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "        h = aud_backbone(x)[0].mean(dim=1)                    # [1, hid]\n",
+    "        out = aud_head(h)[0].cpu().numpy()                    # [arousal, dominance, valence] ∈[0,1]\n",
+    "        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]\n",
+    "        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "        if (i + 1) % 500 == 0:\n",
+    "            np.savez(cache_path, **store)\n",
+    "    if todo:\n",
+    "        np.savez(cache_path, **store)\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3397dbe7",
+   "metadata": {},
+   "source": [
+    "## 4. Read and merge labels by wavID (EMOS / VAD / CAT), as in exp04/07 but without qMOS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "df3b95e3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"eMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (merged): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "48ea29a7",
+   "metadata": {},
+   "source": [
+    "## 5. Dataset / DataLoader (wavs are loaded per batch; WavLM outputs are NOT cached because the backbone is training)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "478f2af9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "train_stems = [s for s in train_df[\"wavID\"] if target_map.get(s) is not None]\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "aud_tr = extract_audeering(train_stems, \"train\")\n",
+    "\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "\n",
+    "# Z-score the continuous labels so the MSE terms share a scale; keep the statistics to decode predictions later.\n",
+    "def _zfit(arr):\n",
+    "    a = np.asarray(arr, dtype=np.float32)\n",
+    "    return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)\n",
+    "\n",
+    "emos_mu, emos_sd = _zfit([lab.loc[s, \"emos\"] for s in train_stems])\n",
+    "if HAS_VAD:\n",
+    "    vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in [\"val\", \"aro\", \"dom\"]], dtype=np.float32)\n",
+    "    vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in [\"val\", \"aro\", \"dom\"]], dtype=np.float32)\n",
+    "else:\n",
+    "    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "class EmoDataset(Dataset):\n",
+    "    def __init__(self, stems):\n",
+    "        self.stems = [s for s in stems if (load_wav(s) is not None) and ((not USE_AUDEERING) or s in aud_tr)]\n",
+    "    def __len__(self):\n",
+    "        return len(self.stems)\n",
+    "    def __getitem__(self, i):\n",
+    "        s = self.stems[i]\n",
+    "        wave = load_wav(s)\n",
+    "        emos = (float(lab.loc[s, \"emos\"]) - emos_mu) / emos_sd\n",
+    "        if HAS_VAD:\n",
+    "            vad = (np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32) - vad_mu) / vad_sd\n",
+    "        else:\n",
+    "            vad = np.zeros(3, dtype=np.float32)\n",
+    "        cat = np.array([lab.loc[s, f\"cat{j}\"] for j in range(len(EMOTIONS5))], dtype=np.float32)\n",
+    "        aud = aud_tr[s] if USE_AUDEERING else np.zeros(0, dtype=np.float32)\n",
+    "        return {\"wave\": wave, \"tgt\": onehot_target(target_map.get(s)), \"aud\": aud,\n",
+    "                \"emos\": np.float32(emos), \"vad\": vad, \"cat\": cat,\n",
+    "                \"emos_raw\": np.float32(lab.loc[s, \"emos\"]),\n",
+    "                \"vad_raw\": np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32)}\n",
+    "\n",
+    "def collate(batch):\n",
+    "    lens = [len(b[\"wave\"]) for b in batch]\n",
+    "    L = max(lens)\n",
+    "    waves = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    mask = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    for i, b in enumerate(batch):\n",
+    "        waves[i, : len(b[\"wave\"])] = b[\"wave\"]; mask[i, : len(b[\"wave\"])] = 1.0\n",
+    "    out = {\n",
+    "        \"input_values\": torch.from_numpy(waves), \"attn_mask\": torch.from_numpy(mask).long(),\n",
+    "        \"tgt\": torch.from_numpy(np.stack([b[\"tgt\"] for b in batch])),\n",
+    "        \"aud\": torch.from_numpy(np.stack([b[\"aud\"] for b in batch])) if USE_AUDEERING else None,\n",
+    "        \"emos\": torch.from_numpy(np.stack([b[\"emos\"] for b in batch])).unsqueeze(1),\n",
+    "        \"vad\": torch.from_numpy(np.stack([b[\"vad\"] for b in batch])),\n",
+    "        \"cat\": torch.from_numpy(np.stack([b[\"cat\"] for b in batch])),\n",
+    "        \"emos_raw\": np.stack([b[\"emos_raw\"] for b in batch]),\n",
+    "        \"vad_raw\": np.stack([b[\"vad_raw\"] for b in batch]),\n",
+    "    }\n",
+    "    return out\n",
+    "\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "ds = EmoDataset(train_stems)\n",
+    "print(\"Valid dataset:\", len(ds), \"wavs\")\n",
+    "tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)\n",
+    "tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)\n",
+    "va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f3342c6f",
+   "metadata": {},
+   "source": [
+    "## 6. Fusion heads (trunk + 3 emotion heads) + training loop (AMP + gradient accumulation)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3671b2da",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "print(f\"Trunk input = {TRUNK_IN} (wavlm {WAVLM_DIM} + aud {AUD_DIM if USE_AUDEERING else 0})\")\n",
+    "\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "bb_params = [p for p in wavlm.parameters() if p.requires_grad]\n",
+    "head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.AdamW([\n",
+    "    {\"params\": bb_params, \"lr\": LR_BACKBONE},\n",
+    "    {\"params\": head_params, \"lr\": LR_HEAD},\n",
+    "], weight_decay=WEIGHT_DECAY)\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()\n",
+    "\n",
+    "def forward_batch(b):\n",
+    "    feat_wavlm = wavlm_embed(b[\"input_values\"].to(device), b[\"attn_mask\"].to(device))\n",
+    "    if USE_AUDEERING:\n",
+    "        feat = torch.cat([feat_wavlm, b[\"aud\"].to(device)], dim=1)\n",
+    "    else:\n",
+    "        feat = feat_wavlm\n",
+    "    return heads(feat, b[\"tgt\"].to(device))\n",
+    "\n",
+    "def compute_loss(emos_p, cat_l, vad_p, b):\n",
+    "    L = {}\n",
+    "    L[\"emos\"] = mse(emos_p, b[\"emos\"].to(device))\n",
+    "    L[\"cat\"] = soft_ce(cat_l, b[\"cat\"].to(device))\n",
+    "    if HAS_VAD:\n",
+    "        vt = b[\"vad\"].to(device)\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L[\"aro\"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L[\"dom\"] = mse(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(L.values())\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def evaluate():\n",
+    "    wavlm.eval(); heads.eval()\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for b in va_loader:\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "        P[\"emos\"] += emos_p.float().cpu().numpy().ravel().tolist(); Y[\"emos\"] += b[\"emos_raw\"].tolist()\n",
+    "        vad_p = vad_p.float().cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t] += vad_p[:, j].tolist(); Y[t] += b[\"vad_raw\"][:, j].tolist()\n",
+    "        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b[\"cat\"])\n",
+    "    out = {}\n",
+    "    for t in [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else []):\n",
+    "        out[t] = spearmanr(P[t], Y[t]).correlation\n",
+    "    q = np.concatenate(catP); p = np.concatenate(catY)\n",
+    "    out[\"cat_err\"] = float(np.abs(q - p).sum(1).mean())   # mean L1 distance between distributions (approximates CAT-ERR)\n",
+    "    return out\n",
+    "\n",
+    "def mean_srcc(m):\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "# Save a FULL checkpoint (including the WavLM backbone). Called at every new best so a file survives if the kernel dies mid-run.\n",
+    "CKPT_PATH = os.path.join(OUT_DIR, \"ft_emotion_full.pt\")\n",
+    "def save_full_ckpt(state, val_emos=float(\"nan\")):\n",
+    "    torch.save({\"wavlm\": state[\"wavlm\"], \"heads\": state[\"heads\"],\n",
+    "                \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "                \"WAVLM_DIM\": WAVLM_DIM, \"AUD_DIM\": AUD_DIM,\n",
+    "                \"UNFREEZE_TOP_LAYERS\": UNFREEZE_TOP_LAYERS, \"val_emos\": float(val_emos)}, CKPT_PATH)\n",
+    "\n",
+    "best, best_state, bad = -1e9, None, 0\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    wavlm.train(); heads.train()\n",
+    "    opt.zero_grad(); run = 0.0; nb = 0\n",
+    "    for step, b in enumerate(tqdm(tr_loader, desc=f\"epoch {ep}\")):\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM\n",
+    "        scaler.scale(loss).backward()\n",
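+    "        # Step on every ACCUM-th batch, and on the last batch so leftover gradients are not discarded.\n",
+    "        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):\n",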
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "        run += loss.item() * ACCUM; nb += 1\n",
+    "    m = evaluate(); sc = mean_srcc(m)\n",
+    "    msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "    print(f\"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc\n",
+    "        best_state = {\"wavlm\": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},\n",
+    "                      \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "        save_full_ckpt(best_state, m[\"emos\"])   # save at every new best, so the file survives a kernel crash\n",
+    "        print(f\"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})\")\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stop at epoch {ep}.\"); break\n",
+    "\n",
+    "if best_state:\n",
+    "    wavlm.load_state_dict(best_state[\"wavlm\"]); heads.load_state_dict(best_state[\"heads\"])\n",
+    "final = evaluate()\n",
+    "print(\"\\n✅ Internal validation, exp08 (WavLM fine-tuned for emotion):\")\n",
+    "print(f\"   EMOS={final['emos']:.4f} (exp07 {EXP07['emos']})\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} \"\n",
+    "          f\"(exp07 {EXP07['val']}/{EXP07['aro']}/{EXP07['dom']})\")\n",
+    "warn = [f\"EMOS {final['emos']:.3f}<{EXP07['emos']}\"] if final[\"emos\"] < EXP07[\"emos\"] - 0.005 else []\n",
+    "if HAS_VAD:\n",
+    "    warn += [f\"{t.upper()} {final[t]:.3f}<{EXP07[t]}\" for t in [\"val\", \"aro\", \"dom\"] if final[t] < EXP07[t] - 0.005]\n",
+    "if warn:\n",
+    "    print(\"   ⚠️ Not yet beating exp07 on:\", \"; \".join(warn), \"→ consider keeping exp07.\")\n",
+    "else:\n",
+    "    print(\"   ✅ Fine-tuning beats or matches exp07 on every compared column → worth submitting.\")\n",
+    "# Final save from the best state (already saved at every best inside the loop; this is a last safety write).\n",
+    "save_full_ckpt(best_state if best_state else\n",
+    "               {\"wavlm\": wavlm.state_dict(), \"heads\": heads.state_dict()}, final[\"emos\"])\n",
+    "print(f\"✅ Saved {CKPT_PATH} (includes WavLM backbone + heads, so training can resume). \"\n",
+    "      f\"Remember to Save Version so the file appears in Output!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b0b1b42",
+   "metadata": {},
+   "source": [
+    "## 7. Predict DEV → answer.txt (5 emotion columns from exp08; QMOS borrowed from exp07 or UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b29f616c",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "aud_dev = extract_audeering(dev_stems, \"dev\")\n",
+    "\n",
+    "# QMOS: prefer borrowing the QMOS column from exp07; if the file is missing, score with UTMOSv2 (T05, VMC2024 winner).\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            r = csv.DictReader(f)\n",
+    "            for row in r:\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Borrowed QMOS from exp07 ({EXP07_ANSWER}): {len(d)//2} wavs\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer.txt found → scoring QMOS with UTMOSv2 (T05, VMC2024 Track 1 winner).\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")   # needs Internet On; the checkpoint downloads automatically\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if not os.path.exists(wav):\n",
+    "            continue\n",
+    "        out = v2.predict(input_path=wav)   # returns a float or a dict {'predicted_mos': ...} depending on the version\n",
+    "        qmos_map[n] = float(out[\"predicted_mos\"]) if isinstance(out, dict) else float(out)\n",
+    "    del v2; torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    wave = load_wav(sid)\n",
+    "    if wave is None or (USE_AUDEERING and sid not in aud_dev):\n",
+    "        return None\n",
+    "    wavlm.eval(); heads.eval()\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        fw = wavlm_embed(iv, am)\n",
+    "        if USE_AUDEERING:\n",
+    "            aud = torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)\n",
+    "            feat = torch.cat([fw, aud], dim=1)\n",
+    "        else:\n",
+    "            feat = fw\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_real = n_def = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pr = predict_emotion(sid)\n",
+    "            if pr is None:\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "            else:\n",
+    "                emos, cat5, vad3 = pr; n_real += 1\n",
+    "            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | predicted emotion {n_real}, defaults {n_def}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e9cab58",
+   "metadata": {},
+   "source": [
+    "## 8. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "88cb0280",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0] and \"EMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp08_ft-emotion.zip answer.txt \"\n",
+    "          f\"&& unzip -l submission_track2_exp08_ft-emotion.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp08_ft-emotion.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2018df3",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` to check that everything runs cleanly (one epoch completes without OOM), then set both to `None`.\n",
+    "- **OOM on a T4?** Reduce in this order: `MAX_SECONDS` (8→6) → `UNFREEZE_TOP_LAYERS` (6→4→2) → `BATCH` (4→2, raising `ACCUM` to compensate).\n",
+    "- **Reading section 6:** compare the internal-validation EMOS/VAD scores with the exp07 baseline (EMOS 0.795 · VAL 0.581 · ARO 0.752 · DOM 0.705).\n",
+    "  - If fine-tuning **wins** → submit the exp08 answer.txt (exp08's 5 emotion columns + QMOS borrowed from exp07).\n",
+    "  - If it **loses** → keep exp07; this is still a result for the paper (\"fine-tuning did not yet beat frozen fusion on small data\").\n",
+    "- **QMOS:** Add Input the exp07 answer.txt at `/kaggle/input/exp07-answer/answer.txt` to borrow its QMOS column (0.548);\n",
+    "  otherwise QMOS is scored with UTMOSv2 (T05, VMC2024 winner; stronger than UTMOS, needs Internet On).\n",
+    "- **Ablation for the paper:** `UNFREEZE_TOP_LAYERS=0` (≈ head-only) vs `=6` (fine-tuned) → a \"frozen vs fine-tuned\" table.\n",
+    "  `USE_AUDEERING=False` → measures the contribution of the auxiliary branch.\n",
+    "- Record config → results → observations in `docs/04_experiments_log.md` (exp08 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

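A note on `evaluate()` in the notebook above: it correlates the heads' z-scored predictions directly against the raw labels (`emos_raw`, `vad_raw`). For SRCC this should be harmless, since Spearman correlation depends only on ranks and decoding with a positive `sd` preserves order. The following is a minimal pure-Python sketch of that property; the helper names `ranks` and `spearman` and all numbers are illustrative assumptions, not part of the notebook.

```python
from statistics import mean

def ranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Made-up values for illustration only (not taken from the dataset).
labels_raw = [1.0, 2.5, 3.0, 4.5, 2.0]   # raw MOS-style labels
mu, sd = 2.6, 1.2                        # hypothetical z-score statistics
preds_z = [-1.1, 0.2, -0.3, 1.4, -0.6]   # predictions in z-score space
preds_raw = [p * sd + mu for p in preds_z]  # decoded back to the raw scale

srcc_z = spearman(preds_z, labels_raw)
srcc_raw = spearman(preds_raw, labels_raw)
print(f"{srcc_z:.3f} {srcc_raw:.3f}")  # 0.900 0.900
```

Decoding still matters wherever absolute values are written out, as in `predict_emotion`, which multiplies by `emos_sd`/`vad_sd` and adds the means before the values go into answer.txt.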
track2/exp08_finetune_emotion_pipeline.py ADDED

	@@ -0,0 +1,673 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp08 (FINE-TUNE WavLM for the 5 emotion columns), Kaggle
+#
+# **What differs from every earlier experiment:** exp03–07 all **freeze** the backbone (they only extract features and train small heads on the cache).
+# exp08 **UNFREEZES (fine-tunes)** WavLM-large so it can relearn features specific to the 2026 emotion-MOS task.
+#
+# ## Design (agreed with the mentor on 5/6)
+# ```
+#  wav ─┬─► WavLM-large (warm-start SAILER, TRAINABLE: only top N layers open)  ─► pool ─► emb_wavlm ┐
+#       └─► audeering MSP-dim (FROZEN, cache .npz)  ─► [emb_aud | vad3]                                ├─► TRUNK ─┬─► EMOS (+target)
+#                                                                                                       ┘          ├─► CAT (5)
+#                                                                                                                  └─► VAD (3)
+#  QMOS: NOT trained here → borrow the QMOS column from exp07 (0.548) or use UTMOSv2 (T05, VMC2024 winner).
+# ```
+# - **Warm-start:** initialize WavLM from the **SAILER** checkpoint (`tiantiaf/wavlm-large-categorical-emotion`,
+#   already strong at emotion) instead of a vanilla WavLM → a much better starting point.
+# - **Auxiliary (frozen):** audeering is dimensional and complements WavLM's categorical view; it is expected to lift **VAL**.
+# - **Partial freezing:** only the top `UNFREEZE_TOP_LAYERS` Transformer layers are trained, and the feature extractor stays frozen
+#   → saves T4 VRAM and limits overfitting (only 12.7k samples).
+#
+# ## ⚠️ Trade-offs to know up front (vs. freeze + heads)
+# - **No caching advantage:** every epoch reruns the whole WavLM (forward + backward) → slow, and it burns GPU hours (30 h/week).
+#   → **On the first run you MUST set `LIMIT_TRAIN=300`, `LIMIT_DEV=20`** to shake out problems, then switch to `None`.
+# - **Prone to overfitting / OOM:** on OOM → lower `BATCH`, raise `ACCUM`, lower `MAX_SECONDS`, lower `UNFREEZE_TOP_LAYERS`.
+# - **Safety net:** exp07 remains the best submission until exp08 **wins on the internal validation split** (don't burn submission attempts).
+#
+# **Running on Kaggle:** GPU **T4** + Internet **On** → Add Input the Track 2 dataset → edit `DATA_ROOT` → Run All.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/ft_cache"         # audeering cache (.npz); the WavLM backbone is NOT cached (it is training)
+os.makedirs(CACHE_DIR, exist_ok=True)
+# (Optional) REUSE an old audeering cache: point to a dataset containing aud_train.npz/aud_dev.npz and it is copied into CACHE_DIR.
+# Leave "" for a completely fresh run. /kaggle/input is read-only, so the files must be copied to working to be written/appended.
+CACHE_INPUT = "/kaggle/input/datasets/minhtoan2/cache-exp8"   # << EDIT the slug to match (or "")
+if CACHE_INPUT and os.path.isdir(CACHE_INPUT):
+    import shutil
+    _n = 0
+    for _fn in os.listdir(CACHE_INPUT):
+        if _fn.startswith("aud_") and _fn.endswith(".npz"):
+            shutil.copy(os.path.join(CACHE_INPUT, _fn), os.path.join(CACHE_DIR, _fn)); _n += 1
+    print(f"📦 Reusing cache: copied {_n} aud_*.npz file(s) from {CACHE_INPUT} → {CACHE_DIR}")
+# Borrow the QMOS column from exp07 (best so far, 0.548). Point to the exp07 answer.txt if available; otherwise UTMOSv2 is used.
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"   # << (optional) Add Input the exp07 answer.txt; if missing → UTMOSv2
+# ── Fine-tuning / hyperparameters ────────────────────────────────────────────
+DEVICE              = "cuda"
+SR                  = 16000
+MAX_SECONDS         = 8           # truncate audio to bound backprop memory; drop to 6 on OOM
+UNFREEZE_TOP_LAYERS = 6           # number of top Transformer layers to train (0 = freeze all = back to head-only)
+TRUNK_HIDDEN        = 512
+HEAD_HIDDEN         = 128
+DROPOUT             = 0.3
+LR_BACKBONE         = 1e-5        # small LR for the fine-tuned backbone
+LR_HEAD             = 1e-3        # larger LR for trunk + heads (trained from scratch)
+WEIGHT_DECAY        = 1e-5
+EPOCHS              = 12          # CEILING; early stopping decides the actual count (8 is a bit low for a real run)
+PATIENCE            = 3           # stop when val SRCC has not improved for 3 epochs; best_state is ALWAYS kept
+BATCH               = 4           # small because the backbone is large; raise ACCUM to compensate
+ACCUM               = 8           # effective batch = BATCH*ACCUM = 32
+VAL_FRAC            = 0.10
+SEED                = 42
+USE_AMP             = True        # fp16 mixed precision, saves VRAM
+USE_GRAD_CKPT       = True        # gradient checkpointing, saves VRAM (at the cost of speed)
+USE_AUDEERING       = True        # frozen audeering auxiliary branch; False = WavLM only
+USE_UNCERTAINTY     = True        # auto-balance the 5 losses (Kendall); False = weight 1.0 each
+LIMIT_TRAIN         = 300         # << FIRST RUN: 300; set None for the real run
+LIMIT_DEV           = 20          # << FIRST RUN: 20; set None for the real run
+# exp07 baseline for comparison (warn if fine-tuning does NOT win → keep exp07)
+EXP07 = {"emos": 0.795, "cat_err": 0.153, "val": 0.581, "aro": 0.752, "dom": 0.705}
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ THIẾU ") + p)
+print(f"Fine-tune: mở băng {UNFREEZE_TOP_LAYERS} lớp trên · BATCH {BATCH}×ACCUM {ACCUM} · MAX {MAX_SECONDS}s")
+# %% [markdown]
+# ## 1. Install dependencies + fetch the SAILER code (clone + sys.path, NOT pip install -e .)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("loralib", "speechbrain", "speechmos", "librosa", "soundfile",
+            "scipy", "scikit-learn", "pandas", "tqdm")
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Load SAILER → extract the inner WavLM backbone for FINE-TUNING
+# Instead of calling the wrapper as a black box, we **pull the WavLM-large (HuggingFace) module out of the wrapper**
+# → full control over freezing/unfreezing individual layers and over pooling. If it cannot be found (unexpected
+# structure) → **fallback**: load the plain `microsoft/wavlm-large` (the warm start is lost; a warning is printed).
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (rất chậm!)")
+def find_hf_backbone(module):
+    """Find an HF Wav2Vec2/WavLM-style backbone submodule: one that has .feature_extractor and .encoder.layers."""
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ Warm-start SAILER: lấy backbone WavLM bên trong wrapper tại '.{name}' "
+              f"({sum(p.numel() for p in wavlm.parameters())/1e6:.0f}M params)")
+    else:
+        print("⚠️ Không tìm thấy backbone HF bên trong wrapper SAILER → sẽ fallback WavLM trắng.")
+except Exception as e:
+    print("⚠️ Lỗi nạp SAILER wrapper:", repr(e), "→ fallback WavLM trắng.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: nạp microsoft/wavlm-large (KHÔNG warm-start SAILER).")
+wavlm = wavlm.to(device)
+WAVLM_DIM = int(wavlm.config.hidden_size)
+# ── Partial freeze: feature extractor + everything except the top UNFREEZE_TOP_LAYERS layers ──
+for p in wavlm.parameters():
+    p.requires_grad = False
+enc_layers = wavlm.encoder.layers
+n_layers = len(enc_layers)
+for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:
+    for p in layer.parameters():
+        p.requires_grad = True
+n_train = sum(p.numel() for p in wavlm.parameters() if p.requires_grad)
+print(f"WavLM: {n_layers} lớp encoder · mở băng {min(UNFREEZE_TOP_LAYERS, n_layers)} lớp trên "
+      f"→ {n_train/1e6:.1f}M param train (trên dim {WAVLM_DIM})")
+if USE_GRAD_CKPT:
+    wavlm.gradient_checkpointing_enable()
+    if hasattr(wavlm, "enable_input_require_grads"):
+        wavlm.enable_input_require_grads()   # needed when combining grad checkpointing with frozen lower layers
+def masked_mean(hidden, attn_mask):
+    """Mean-pool over time, ignoring padded frames (gradients are preserved)."""
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+def wavlm_embed(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state   # [B,T,D]
+    return masked_mean(out, attn_mask)
+# %% [markdown]
+# ## 3. Load audeering MSP-dim (FROZEN) as auxiliary features
+# Extract `[emb_pool(1024) | vad3(1–5)]` for each wav, then **cache to .npz** (runs once). The head is loaded by hand,
+# exactly as in exp05 (this avoids transformers version errors when subclassing `Wav2Vec2PreTrainedModel`).
+# %%
+AUD_DIM = 0
+aud_backbone = aud_head = aud_proc = None
+if USE_AUDEERING:
+    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+    from huggingface_hub import hf_hub_download
+    AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+    aud_backbone = Wav2Vec2Model(aud_cfg)
+    try:
+        _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+            hf_hub_download(AUD_NAME, "model.safetensors"))
+    except Exception:
+        _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+    bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+    missing, unexpected = aud_backbone.load_state_dict(bb_sd, strict=False)
+    print(f"  audeering backbone: thiếu {len(missing)} / dư {len(unexpected)} key (strict=False)")
+    _hid = _sd["classifier.dense.weight"].shape[0]
+    _out = _sd["classifier.out_proj.weight"].shape[0]
+    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))
+    aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+    aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+    aud_backbone = aud_backbone.to(device).eval()
+    aud_head = aud_head.to(device).eval()
+    AUD_DIM = _hid + 3   # emb_pool + [VAL,ARO,DOM]
+    print(f"✅ audeering frozen (đặc trưng phụ {AUD_DIM}-D = emb {_hid} + vad 3)")
+# %%
+import numpy as np
+import librosa
+from tqdm.auto import tqdm
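+def load_wav(name_or_stem, in_wav_dir=True):
+    """Load a wav (absolute path, file name, or stem under WAV_DIR) as mono float32 at SR, cut to MAX_SECONDS.
+    Returns None if the file does not exist. `in_wav_dir` is unused and kept only for call compatibility."""
+    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(
+        WAV_DIR, name_or_stem if str(name_or_stem).endswith(".wav") else str(name_or_stem) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True, duration=MAX_SECONDS)   # decode only the part that is kept
+    return wave[: MAX_SECONDS * SR].astype(np.float32)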
+@torch.no_grad()
+def extract_audeering(stems, tag):
+    """Return dict {stem: float32[AUD_DIM]}; cached at CACHE_DIR/aud_<tag>.npz (saved every 500 items so a run can resume)."""
+    if not USE_AUDEERING:
+        return {}
+    cache_path = os.path.join(CACHE_DIR, f"aud_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[aud/{tag}] nạp cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    for i, s in enumerate(tqdm(todo, desc=f"audeering {tag}")):
+        wave = load_wav(s)
+        if wave is None:
+            continue
+        x = aud_proc(wave, sampling_rate=SR).input_values[0]
+        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+        h = aud_backbone(x)[0].mean(dim=1)                    # [1, hid]
+        out = aud_head(h)[0].cpu().numpy()                    # [arousal, dominance, valence] ∈[0,1]
+        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]
+        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+        if (i + 1) % 500 == 0:
+            np.savez(cache_path, **store)
+    if todo:
+        np.savez(cache_path, **store)
+    return store
+# %% [markdown]
+# ## 4. Read & aggregate labels per wavID (EMOS / VAD / CAT), as in exp04/07 but qMOS is NOT needed
+# %%
+import pandas as pd
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"Không thấy cột eMOS (cột: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target: {len(target_map)} | wav train (gộp): {len(train_df)} | có VAD: {HAS_VAD}")
+# %% [markdown]
+# ## 5. Dataset / DataLoader (wavs are loaded per batch; WavLM features are NOT cached because the backbone is being trained)
+# %%
+from torch.utils.data import Dataset, DataLoader
+train_stems = [s for s in train_df["wavID"] if target_map.get(s) is not None]
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+aud_tr = extract_audeering(train_stems, "train")
+lab = train_df.set_index("wavID")
+# Standardize continuous labels to z-scores (so the MSE terms share one scale); the statistics are kept to de-normalize predictions.
+def _zfit(arr):
+    a = np.asarray(arr, dtype=np.float32)
+    return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)
+emos_mu, emos_sd = _zfit([lab.loc[s, "emos"] for s in train_stems])
+if HAS_VAD:
+    vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in ["val", "aro", "dom"]], dtype=np.float32)
+    vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in ["val", "aro", "dom"]], dtype=np.float32)
+else:
+    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+class EmoDataset(Dataset):
+    def __init__(self, stems):
+        self.stems = [s for s in stems if (load_wav(s) is not None) and ((not USE_AUDEERING) or s in aud_tr)]
+    def __len__(self):
+        return len(self.stems)
+    def __getitem__(self, i):
+        s = self.stems[i]
+        wave = load_wav(s)
+        emos = (float(lab.loc[s, "emos"]) - emos_mu) / emos_sd
+        if HAS_VAD:
+            vad = (np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32) - vad_mu) / vad_sd
+        else:
+            vad = np.zeros(3, dtype=np.float32)
+        cat = np.array([lab.loc[s, f"cat{j}"] for j in range(len(EMOTIONS5))], dtype=np.float32)
+        aud = aud_tr[s] if USE_AUDEERING else np.zeros(0, dtype=np.float32)
+        return {"wave": wave, "tgt": onehot_target(target_map.get(s)), "aud": aud,
+                "emos": np.float32(emos), "vad": vad, "cat": cat,
+                "emos_raw": np.float32(lab.loc[s, "emos"]),
+                "vad_raw": np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32)}
+def collate(batch):
+    lens = [len(b["wave"]) for b in batch]
+    L = max(lens)
+    waves = np.zeros((len(batch), L), dtype=np.float32)
+    mask = np.zeros((len(batch), L), dtype=np.float32)
+    for i, b in enumerate(batch):
+        waves[i, : len(b["wave"])] = b["wave"]; mask[i, : len(b["wave"])] = 1.0
+    out = {
+        "input_values": torch.from_numpy(waves), "attn_mask": torch.from_numpy(mask).long(),
+        "tgt": torch.from_numpy(np.stack([b["tgt"] for b in batch])),
+        "aud": torch.from_numpy(np.stack([b["aud"] for b in batch])) if USE_AUDEERING else None,
+        "emos": torch.from_numpy(np.stack([b["emos"] for b in batch])).unsqueeze(1),
+        "vad": torch.from_numpy(np.stack([b["vad"] for b in batch])),
+        "cat": torch.from_numpy(np.stack([b["cat"] for b in batch])),
+        "emos_raw": np.stack([b["emos_raw"] for b in batch]),
+        "vad_raw": np.stack([b["vad_raw"] for b in batch]),
+    }
+    return out
+from sklearn.model_selection import train_test_split
+ds = EmoDataset(train_stems)
+print("Dataset hợp lệ:", len(ds), "wav")
+tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)
+tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)
+va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)
+# %% [markdown]
+# ## 6. Fusion heads (trunk + 3 emotion heads) + training loop (AMP + gradient accumulation)
+# %%
+from scipy.stats import spearmanr
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+print(f"Trunk input = {TRUNK_IN} (wavlm {WAVLM_DIM} + aud {AUD_DIM if USE_AUDEERING else 0})")
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+bb_params = [p for p in wavlm.parameters() if p.requires_grad]
+head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.AdamW([
+    {"params": bb_params, "lr": LR_BACKBONE},
+    {"params": head_params, "lr": LR_HEAD},
+], weight_decay=WEIGHT_DECAY)
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()
+def forward_batch(b):
+    feat_wavlm = wavlm_embed(b["input_values"].to(device), b["attn_mask"].to(device))
+    if USE_AUDEERING:
+        feat = torch.cat([feat_wavlm, b["aud"].to(device)], dim=1)
+    else:
+        feat = feat_wavlm
+    return heads(feat, b["tgt"].to(device))
+def compute_loss(emos_p, cat_l, vad_p, b):
+    L = {}
+    L["emos"] = mse(emos_p, b["emos"].to(device))
+    L["cat"] = soft_ce(cat_l, b["cat"].to(device))
+    if HAS_VAD:
+        vt = b["vad"].to(device)
+        L["val"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L["aro"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L["dom"] = mse(vad_p[:, 2:3], vt[:, 2:3])
+    # Without VAD labels the val/aro/dom tasks are left out of L entirely. Adding a zero loss plus a bare
+    # log_var[i] term would otherwise let the optimizer push those log-variances down without bound.
+    if USE_UNCERTAINTY:
+        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS) if t in L)
+    return sum(L.values())
+@torch.no_grad()
+def evaluate():
+    wavlm.eval(); heads.eval()
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for b in va_loader:
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+        P["emos"] += emos_p.float().cpu().numpy().ravel().tolist(); Y["emos"] += b["emos_raw"].tolist()
+        vad_p = vad_p.float().cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t] += vad_p[:, j].tolist(); Y[t] += b["vad_raw"][:, j].tolist()
+        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b["cat"])
+    out = {}
+    for t in ["emos"] + (["val", "aro", "dom"] if HAS_VAD else []):
+        out[t] = spearmanr(P[t], Y[t]).correlation
+    q = np.concatenate(catP); p = np.concatenate(catY)
+    out["cat_err"] = float(np.abs(q - p).sum(1).mean())   # ~ tổng |Δ| trung bình (xấp xỉ CAT-ERR)
+    return out
+def mean_srcc(m):
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+# Save the FULL checkpoint (including the WavLM backbone); called IMMEDIATELY on every new best so a file survives if the kernel dies mid-run.
+CKPT_PATH = os.path.join(OUT_DIR, "ft_emotion_full.pt")
+def save_full_ckpt(state, val_emos=float("nan")):
+    torch.save({"wavlm": state["wavlm"], "heads": state["heads"],
+                "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+                "WAVLM_DIM": WAVLM_DIM, "AUD_DIM": AUD_DIM,
+                "UNFREEZE_TOP_LAYERS": UNFREEZE_TOP_LAYERS, "val_emos": float(val_emos)}, CKPT_PATH)
+best, best_state, bad = -1e9, None, 0
+for ep in range(1, EPOCHS + 1):
+    wavlm.train(); heads.train()
+    opt.zero_grad(); run = 0.0; nb = 0
+    for step, b in enumerate(tqdm(tr_loader, desc=f"epoch {ep}")):
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM
+        scaler.scale(loss).backward()
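+        # Step on every ACCUM-th batch and on the final batch, so a trailing partial group of gradients is not discarded.
+        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):
+            scaler.step(opt); scaler.update(); opt.zero_grad()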
+        run += loss.item() * ACCUM; nb += 1
+    m = evaluate(); sc = mean_srcc(m)
+    msg = " ".join(f"{k}={m[k]:.3f}" for k in ["emos", "val", "aro", "dom"] if k in m)
+    print(f"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc
+        best_state = {"wavlm": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},
+                      "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+        save_full_ckpt(best_state, m["emos"])   # SAVE IMMEDIATELY on every new best → safe if the kernel dies
+        print(f"   💾 lưu best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})")
+        bad = 0
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop ở epoch {ep}."); break
+if best_state:
+    wavlm.load_state_dict(best_state["wavlm"]); heads.load_state_dict(best_state["heads"])
+final = evaluate()
+print("\n✅ VAL (nội bộ) — exp08 (fine-tune WavLM cho cảm xúc):")
+print(f"   EMOS={final['emos']:.4f} (exp07 {EXP07['emos']})")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} "
+          f"(exp07 {EXP07['val']}/{EXP07['aro']}/{EXP07['dom']})")
+warn = [f"EMOS {final['emos']:.3f}<{EXP07['emos']}"] if final["emos"] < EXP07["emos"] - 0.005 else []
+if HAS_VAD:
+    warn += [f"{t.upper()} {final[t]:.3f}<{EXP07[t]}" for t in ["val", "aro", "dom"] if final[t] < EXP07[t] - 0.005]
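+# Print the warning only when at least one metric falls short; otherwise report success.
+if warn:
+    print("   ⚠️ NOT yet beating exp07 on:", "; ".join(warn), "→ consider keeping exp07.")
+else:
+    print("   ✅ Fine-tuning beats/matches exp07 on every emotion column → worth submitting.")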
+# Final save from the best state (already saved on every new best inside the loop; this last write is just to be safe).
+save_full_ckpt(best_state if best_state else
+               {"wavlm": wavlm.state_dict(), "heads": heads.state_dict()}, final["emos"])
+print(f"✅ Đã lưu {CKPT_PATH} (CÓ backbone WavLM + heads → resume được). "
+      f"NHỚ Save Version để file ra Output!")
+# %% [markdown]
+# ## 7. Predict DEV → answer.txt (5 emotion columns from exp08; QMOS borrowed from exp07 or scored with UTMOSv2)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "mẫu")
+aud_dev = extract_audeering(dev_stems, "dev")
+# QMOS: prefer borrowing exp07's QMOS column; if the file is missing → score with UTMOSv2 (T05, VMC2024 winner).
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            r = csv.DictReader(f)
+            for row in r:
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Mượn QMOS từ exp07 ({EXP07_ANSWER}): {len(d)//2} wav")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ Không có answer.txt exp07 → chấm QMOS bằng UTMOSv2 (T05, vô địch VMC2024 Track 1).")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")   # cần Internet On, checkpoint tự tải
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if not os.path.exists(wav):
+            continue
+        out = v2.predict(input_path=wav)   # returns a float or a dict {'predicted_mos': ...} depending on the version
+        qmos_map[n] = float(out["predicted_mos"]) if isinstance(out, dict) else float(out)
+    del v2
+    if device == "cuda":
+        torch.cuda.empty_cache()   # release the VRAM held by UTMOSv2 before emotion inference
+@torch.no_grad()
+def predict_emotion(sid):
+    wave = load_wav(sid)
+    if wave is None or (USE_AUDEERING and sid not in aud_dev):
+        return None
+    wavlm.eval(); heads.eval()
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        fw = wavlm_embed(iv, am)
+        if USE_AUDEERING:
+            aud = torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)
+            feat = torch.cat([fw, aud], dim=1)
+        else:
+            feat = fw
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_real = n_def = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pr = predict_emotion(sid)
+            if pr is None:
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+            else:
+                emos, cat5, vad3 = pr; n_real += 1
+            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Ghi {len(dev_names)} dòng → {out_path} | cảm xúc thật {n_real}, mặc định {n_def}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 8. Validate + build the zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0] and "EMOS" in rows[0], "Header sai"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Dòng {i} sai số cột"
+    print(f"OK: {len(rows)-1} dòng, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp08_ft-emotion.zip answer.txt "
+          f"&& unzip -l submission_track2_exp08_ft-emotion.zip")
+print("Sẵn sàng nộp:", os.path.join(OUT_DIR, "submission_track2_exp08_ft-emotion.zip"))
+# %% [markdown]
+# ## Notes
+# - **First run:** keep `LIMIT_TRAIN=300`, `LIMIT_DEV=20` as a smoke test (one epoch completes without OOM); then set both to `None`.
+# - **OOM on a T4?** Reduce in this order: `MAX_SECONDS` (8→6) → `UNFREEZE_TOP_LAYERS` (6→4→2) → `BATCH` (4→2, and raise `ACCUM`).
+# - **Reading section 6:** compare the internal-validation EMOS/VAD scores with the exp07 reference (EMOS 0.795 · VAL 0.581 · ARO 0.752 · DOM 0.705).
+#   - If fine-tuning **wins** → submit exp08's answer.txt (the 5 emotion columns from exp08 + QMOS borrowed from exp07).
+#   - If it **loses** → keep exp07; this is still a result for the paper ("fine-tuning does not yet beat frozen fusion on small data").
+# - **QMOS:** Add Input with exp07's answer.txt at `/kaggle/input/exp07-answer/answer.txt` to borrow its QMOS column (0.548);
+#   if it is absent, QMOS is scored with UTMOSv2 (T05, VMC2024 winner; stronger than UTMOS, requires Internet On).
+# - **Ablation for the paper:** `UNFREEZE_TOP_LAYERS=0` (≈ head-only) vs `=6` (fine-tuned) → a "frozen vs fine-tuned" table.
+#   `USE_AUDEERING=False` → measures the contribution of the auxiliary branch.
+# - Record config → results → remarks in `docs/04_experiments_log.md` (exp08 section).

track2/exp08b_finetune_resume.ipynb ADDED

	@@ -0,0 +1,782 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "ce468400",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2 — exp08-RESUME (fine-tune TIẾP từ checkpoint + cache) — Kaggle\n",
+    "\n",
+    "**Mục đích:** train tiếp model fine-tune cảm xúc (exp08) từ **checkpoint đã lưu** thay vì train lại từ\n",
+    "đầu — tiết kiệm giờ GPU. Tận dụng:\n",
+    "- `ft_emotion_full.pt` (CÓ cả backbone WavLM + heads + thống kê chuẩn hóa) → nạp lại đúng trạng thái.\n",
+    "- **cache audeering** `aud_*.npz` (đặc trưng frozen) → KHÔNG trích lại (~đỡ chục phút).\n",
+    "\n",
+    "> ⚠️ Bắt buộc dùng checkpoint **đủ backbone** (`ft_emotion_full.pt` từ cell \"TRAIN TIẾP\", hoặc bản\n",
+    "> `ft_emotion_meta.pt` MỚI đã vá để lưu cả `wavlm`). Bản `ft_emotion_meta.pt` CŨ chỉ có `heads` → KHÔNG dùng được.\n",
+    "\n",
+    "## Chuẩn bị input trên Kaggle (Add Input)\n",
+    "1. Dataset Track 2 (`vmc2026-track2-full`) — wav + nhãn.\n",
+    "2. **Checkpoint**: upload `ft_emotion_full.pt` thành 1 Dataset → trỏ `RESUME_CKPT`.\n",
+    "3. **Cache** (tùy chọn nhưng nên có): upload thư mục chứa `aud_train.npz`, `aud_dev.npz` → trỏ `CACHE_INPUT`.\n",
+    "4. (tùy chọn) `answer.txt` exp07 để mượn cột QMOS 0.548.\n",
+    "\n",
+    "**Cách chạy:** GPU T4 + Internet On → sửa các slug ở cell 0 → Run All. Lần đầu để `LIMIT_TRAIN=300`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1c6752ee",
+   "metadata": {},
+   "source": [
+    "## 0. Cấu hình — SỬA Ở ĐÂY"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d6317ac",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, shutil\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << SỬA slug\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "# ── Checkpoint + cache để RESUME ─────────────────────────────────────────────\n",
+    "RESUME_CKPT  = \"/kaggle/input/ft-emotion-full/ft_emotion_full.pt\"   # << CHECKPOINT đủ backbone\n",
+    "CACHE_INPUT  = \"/kaggle/input/ft-emotion-cache\"                     # << thư mục chứa aud_*.npz (hoặc \"\" nếu không có)\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"             # << (tùy chọn) mượn QMOS 0.548; không có → UTMOSv2\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/ft_cache\"     # /kaggle/input read-only → copy cache sang đây để ghi/append được\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── Fine-tune / siêu tham số (train TIẾP) ────────────────────────────────────\n",
+    "DEVICE              = \"cuda\"\n",
+    "SR                  = 16000\n",
+    "MAX_SECONDS         = 8\n",
+    "UNFREEZE_TOP_LAYERS = 6           # PHẢI khớp checkpoint (mặc định exp08 = 6)\n",
+    "TRUNK_HIDDEN        = 512          # PHẢI khớp checkpoint\n",
+    "HEAD_HIDDEN         = 128          # PHẢI khớp checkpoint\n",
+    "DROPOUT             = 0.3\n",
+    "LR_BACKBONE         = 1e-5\n",
+    "LR_HEAD             = 1e-3\n",
+    "RESUME_LR_SCALE     = 1.0          # <1.0 để giảm LR khi train tiếp (vd 0.5 nếu val đã chững)\n",
+    "WEIGHT_DECAY        = 1e-5\n",
+    "EPOCHS              = 10           # số epoch train THÊM (run này)\n",
+    "PATIENCE            = 5            # dừng khi val không lên; LUÔN giữ best\n",
+    "BATCH               = 4\n",
+    "ACCUM               = 8           # effective batch = 32\n",
+    "VAL_FRAC            = 0.10\n",
+    "SEED                = 42\n",
+    "USE_AMP             = True\n",
+    "USE_GRAD_CKPT       = True\n",
+    "USE_AUDEERING       = True         # PHẢI khớp checkpoint (exp08 = True)\n",
+    "USE_UNCERTAINTY     = True\n",
+    "\n",
+    "LIMIT_TRAIN         = 300          # << LẦN ĐẦU 300; chạy thật None\n",
+    "LIMIT_DEV           = 20           # << LẦN ĐẦU 20; chạy thật None\n",
+    "\n",
+    "# Mốc exp07 + exp08 để so\n",
+    "EXP07 = {\"emos\": 0.795, \"cat_err\": 0.153, \"val\": 0.581, \"aro\": 0.752, \"dom\": 0.705}\n",
+    "EXP08 = {\"emos\": 0.811, \"cat_err\": 0.133, \"val\": 0.659, \"aro\": 0.793, \"dom\": 0.751}  # bản đã nộp\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP, RESUME_CKPT]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)\n",
+    "\n",
+    "# Copy the cache (aud_*.npz) from the read-only input to the working dir so it can be appended to\n",
+    "if CACHE_INPUT and os.path.isdir(CACHE_INPUT):\n",
+    "    n = 0\n",
+    "    for fn in os.listdir(CACHE_INPUT):\n",
+    "        if fn.startswith(\"aud_\") and fn.endswith(\".npz\"):\n",
+    "            shutil.copy(os.path.join(CACHE_INPUT, fn), os.path.join(CACHE_DIR, fn)); n += 1\n",
+    "    print(f\"📦 Copied {n} audeering cache file(s) from {CACHE_INPUT} → {CACHE_DIR}\")\n",
+    "else:\n",
+    "    print(\"ℹ️ No CACHE_INPUT → audeering features will be extracted here (slower on the first run).\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "57e6416a",
+   "metadata": {},
+   "source": [
+    "## 1. Install + fetch the SAILER code (to build the exact WavLM architecture, then load the checkpoint on top)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "76497a3f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"loralib\", \"speechbrain\", \"speechmos\", \"librosa\", \"soundfile\",\n",
+    "            \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cf6cf213",
+   "metadata": {},
+   "source": [
+    "## 2. Build WavLM (as in exp08) → LOAD the backbone weights from the checkpoint\n",
+    "Build the exact architecture (SAILER wrapper → extract the HF backbone; fall back to a plain WavLM), then call `load_state_dict`\n",
+    "with `ckpt[\"wavlm\"]` → this restores the saved fine-tuned state."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20a2e84b",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (very slow!)\")\n",
+    "\n",
+    "ckpt = torch.load(RESUME_CKPT, map_location=\"cpu\", weights_only=False)   # ckpt holds numpy arrays (vad_mu) → weights_only must be False\n",
+    "assert \"wavlm\" in ckpt, (\"❌ Checkpoint has NO 'wavlm' (backbone). This is the OLD ft_emotion_meta.pt, \"\n",
+    "                         \"which stores only the heads → cannot resume. Use ft_emotion_full.pt instead.\")\n",
+    "print(\"✅ Loaded checkpoint:\", RESUME_CKPT, \"| keys:\", list(ckpt.keys()))\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ Built the WavLM backbone from the SAILER wrapper at '.{name}'\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Failed to load the SAILER wrapper:\", repr(e), \"→ falling back to a plain WavLM.\")\n",
+    "\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: microsoft/wavlm-large.\")\n",
+    "\n",
+    "wavlm = wavlm.to(device)\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "\n",
+    "# Load the fine-tuned weights from the checkpoint (on top of the architecture just built)\n",
+    "miss, unexp = wavlm.load_state_dict(ckpt[\"wavlm\"], strict=False)\n",
+    "print(f\"🔁 loaded wavlm from checkpoint: {len(miss)} missing / {len(unexp)} unexpected key(s) (expected ~0).\")\n",
+    "if len(miss) > 20 or len(unexp) > 20:\n",
+    "    print(\"   ⚠️ Many mismatched keys → the architecture may not match the checkpoint. Check UNFREEZE/USE_AUDEERING.\")\n",
+    "\n",
+    "# Partial freeze: only the top UNFREEZE_TOP_LAYERS encoder layers stay trainable\n",
+    "for p in wavlm.parameters():\n",
+    "    p.requires_grad = False\n",
+    "enc_layers = wavlm.encoder.layers\n",
+    "n_layers = len(enc_layers)\n",
+    "for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:\n",
+    "    for p in layer.parameters():\n",
+    "        p.requires_grad = True\n",
+    "n_train = sum(p.numel() for p in wavlm.parameters() if p.requires_grad)\n",
+    "print(f\"WavLM: {n_layers} layers · unfrozen {min(UNFREEZE_TOP_LAYERS, n_layers)} → {n_train/1e6:.1f}M trainable params (dim {WAVLM_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    wavlm.gradient_checkpointing_enable()\n",
+    "    if hasattr(wavlm, \"enable_input_require_grads\"):\n",
+    "        wavlm.enable_input_require_grads()\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "def wavlm_embed(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    return masked_mean(out, attn_mask)"
+   ]
+  },
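The `masked_mean` helper in the cell above averages WavLM frame embeddings while ignoring padded frames. Below is a minimal pure-Python sketch of that arithmetic for a single utterance. It assumes the frame-level mask has already been derived (the notebook obtains it from the backbone); the name `masked_mean_rows` and the toy data are illustrative only and do not appear in the notebook.

```python
def masked_mean_rows(frames, mask, eps=1e-6):
    """Average the frame vectors whose mask entry is 1; padded frames (0) are ignored.

    frames: list of T vectors (each a list of D floats) for ONE utterance.
    mask:   list of T ints, 1 = real frame, 0 = padding.
    """
    dim = len(frames[0])
    total = [0.0] * dim
    for vec, keep in zip(frames, mask):
        for d in range(dim):
            total[d] += vec[d] * keep
    # Same role as clamp(min=1e-6) in the notebook: no division by zero
    # when every frame is padding.
    count = max(float(sum(mask)), eps)
    return [t / count for t in total]


frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last frame is padding
pooled = masked_mean_rows(frames, [1, 1, 0])
print(pooled)  # -> [2.0, 3.0]
```

Without the mask, the padding frame would be averaged in and pull the utterance embedding away from the real speech frames, which is why batches with mixed lengths need this step.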
+  {
+   "cell_type": "markdown",
+   "id": "156a5f4d",
+   "metadata": {},
+   "source": [
+    "## 3. audeering FROZEN (auxiliary features): use the cache if available"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "670569c7",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import librosa\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "AUD_DIM = 0\n",
+    "aud_backbone = aud_head = aud_proc = None\n",
+    "if USE_AUDEERING:\n",
+    "    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "    from huggingface_hub import hf_hub_download\n",
+    "    AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "    aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "    try:\n",
+    "        _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "            hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "    except Exception:\n",
+    "        _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "    bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "    aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "    _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "    _out = _sd[\"classifier.out_proj.weight\"].shape[0]\n",
+    "    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))\n",
+    "    aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "    aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "    aud_backbone = aud_backbone.to(device).eval()\n",
+    "    aud_head = aud_head.to(device).eval()\n",
+    "    AUD_DIM = _hid + 3\n",
+    "    print(f\"✅ audeering frozen ({AUD_DIM}-D)\")\n",
+    "\n",
+    "def load_wav(name_or_stem):\n",
+    "    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(\n",
+    "        WAV_DIR, name_or_stem if str(name_or_stem).endswith(\".wav\") else str(name_or_stem) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: MAX_SECONDS * SR].astype(np.float32)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def extract_audeering(stems, tag):\n",
+    "    if not USE_AUDEERING:\n",
+    "        return {}\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"aud_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[aud/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    for i, s in enumerate(tqdm(todo, desc=f\"audeering {tag}\")):\n",
+    "        wave = load_wav(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "        h = aud_backbone(x)[0].mean(dim=1)\n",
+    "        out = aud_head(h)[0].cpu().numpy()\n",
+    "        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)\n",
+    "        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "        if (i + 1) % 500 == 0:\n",
+    "            np.savez(cache_path, **store)\n",
+    "    if todo:\n",
+    "        np.savez(cache_path, **store)\n",
+    "    return store"
+   ]
+  },
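In `extract_audeering`, the three head outputs are reordered and rescaled with `1 + 4 * x` before being appended to the pooled embedding. A small sketch of that mapping follows. It assumes, as the index order in the cell implies, that the head emits (arousal, dominance, valence) in roughly [0, 1]; the function name `to_vad_1_5` is hypothetical.

```python
def to_vad_1_5(aud_out):
    """Map (arousal, dominance, valence) in ~[0, 1] to
    (valence, arousal, dominance) on a 1..5 scale via 1 + 4 * x.

    The input order is an assumption taken from the indices used in the notebook.
    """
    arousal, dominance, valence = aud_out
    return [1 + 4 * valence, 1 + 4 * arousal, 1 + 4 * dominance]


print(to_vad_1_5([0.0, 0.5, 1.0]))  # -> [5.0, 1.0, 3.0]
```

The output order (valence, arousal, dominance) matches the `val`/`aro`/`dom` columns used for the labels, so the cached feature and the targets share one scale and one ordering.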
+  {
+   "cell_type": "markdown",
+   "id": "c5ed6f49",
+   "metadata": {},
+   "source": [
+    "## 4. Read & aggregate labels by wavID"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "910d097f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"eMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
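The label cell turns several raters' categorical votes per wav into a soft 5-way distribution, falling back to a uniform vector when no vote is recognised. The sketch below reproduces that idea without pandas or NumPy. The alias table is a shortened, illustrative subset of the notebook's `_EMO_ALIAS`, and `soft_label` is a hypothetical name.

```python
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
# Illustrative subset of the notebook's alias map.
ALIASES = {"anger": "angry", "joy": "happy", "happiness": "happy",
           "calm": "neutral", "sadness": "sad", "surprise": "surprised"}


def soft_label(cells):
    """Count votes over all rater cells and normalise them to a distribution."""
    votes = [0.0] * len(EMOTIONS5)
    for cell in cells:
        # Accept the same separators as the notebook: / ; | and space.
        for sep in "/;| ":
            cell = cell.replace(sep, ",")
        for tok in cell.split(","):
            key = tok.strip().lower()
            key = ALIASES.get(key, key)
            if key in EMOTIONS5:
                votes[EMOTIONS5.index(key)] += 1.0
    total = sum(votes)
    if total == 0:
        # No usable vote: uniform distribution, as in the notebook.
        return [1.0 / len(EMOTIONS5)] * len(EMOTIONS5)
    return [v / total for v in votes]


print(soft_label(["happy", "Joy;neutral", "sad/happy"]))
# -> [0.0, 0.6, 0.2, 0.2, 0.0]
```

A soft target like this is what the `soft_ce` loss later compares against the predicted logits, so disagreement between raters is kept rather than collapsed to a single class.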
+  {
+   "cell_type": "markdown",
+   "id": "1d9509a7",
+   "metadata": {},
+   "source": [
+    "## 5. Dataset/loader: USE the normalization statistics FROM THE CHECKPOINT (to match the trained heads)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4c09387b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "train_stems = [s for s in train_df[\"wavID\"] if target_map.get(s) is not None]\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "aud_tr = extract_audeering(train_stems, \"train\")\n",
+    "\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "\n",
+    "# IMPORTANT: take mean/std from the checkpoint (the heads were trained on this scale) instead of recomputing them.\n",
+    "emos_mu = float(ckpt[\"emos_mu\"]); emos_sd = float(ckpt[\"emos_sd\"])\n",
+    "vad_mu = np.asarray(ckpt[\"vad_mu\"], dtype=np.float32); vad_sd = np.asarray(ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "print(f\"Using normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}\")\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "class EmoDataset(Dataset):\n",
+    "    def __init__(self, stems):\n",
+    "        self.stems = [s for s in stems if (load_wav(s) is not None) and ((not USE_AUDEERING) or s in aud_tr)]\n",
+    "    def __len__(self):\n",
+    "        return len(self.stems)\n",
+    "    def __getitem__(self, i):\n",
+    "        s = self.stems[i]\n",
+    "        wave = load_wav(s)\n",
+    "        emos = (float(lab.loc[s, \"emos\"]) - emos_mu) / emos_sd\n",
+    "        if HAS_VAD:\n",
+    "            vad = (np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32) - vad_mu) / vad_sd\n",
+    "        else:\n",
+    "            vad = np.zeros(3, dtype=np.float32)\n",
+    "        cat = np.array([lab.loc[s, f\"cat{j}\"] for j in range(len(EMOTIONS5))], dtype=np.float32)\n",
+    "        aud = aud_tr[s] if USE_AUDEERING else np.zeros(0, dtype=np.float32)\n",
+    "        return {\"wave\": wave, \"tgt\": onehot_target(target_map.get(s)), \"aud\": aud,\n",
+    "                \"emos\": np.float32(emos), \"vad\": vad, \"cat\": cat,\n",
+    "                \"emos_raw\": np.float32(lab.loc[s, \"emos\"]),\n",
+    "                \"vad_raw\": np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32)}\n",
+    "\n",
+    "def collate(batch):\n",
+    "    lens = [len(b[\"wave\"]) for b in batch]\n",
+    "    L = max(lens)\n",
+    "    waves = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    mask = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    for i, b in enumerate(batch):\n",
+    "        waves[i, : len(b[\"wave\"])] = b[\"wave\"]; mask[i, : len(b[\"wave\"])] = 1.0\n",
+    "    return {\n",
+    "        \"input_values\": torch.from_numpy(waves), \"attn_mask\": torch.from_numpy(mask).long(),\n",
+    "        \"tgt\": torch.from_numpy(np.stack([b[\"tgt\"] for b in batch])),\n",
+    "        \"aud\": torch.from_numpy(np.stack([b[\"aud\"] for b in batch])) if USE_AUDEERING else None,\n",
+    "        \"emos\": torch.from_numpy(np.stack([b[\"emos\"] for b in batch])).unsqueeze(1),\n",
+    "        \"vad\": torch.from_numpy(np.stack([b[\"vad\"] for b in batch])),\n",
+    "        \"cat\": torch.from_numpy(np.stack([b[\"cat\"] for b in batch])),\n",
+    "        \"emos_raw\": np.stack([b[\"emos_raw\"] for b in batch]),\n",
+    "        \"vad_raw\": np.stack([b[\"vad_raw\"] for b in batch]),\n",
+    "    }\n",
+    "\n",
+    "ds = EmoDataset(train_stems)\n",
+    "print(\"Valid dataset:\", len(ds), \"wavs\")\n",
+    "tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)\n",
+    "tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)\n",
+    "va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)"
+   ]
+  },
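The dataset cell z-scores the targets with `emos_mu`/`emos_sd` read from the checkpoint rather than recomputed from the current data. The sketch below shows why that matters, using made-up statistics: every number is hypothetical and chosen only so the arithmetic is easy to follow.

```python
def normalize(x, mu, sd):
    """z-score a raw label with the given statistics."""
    return (x - mu) / sd


def denormalize(z, mu, sd):
    """Map a head output back to the raw label scale."""
    return z * sd + mu


# Hypothetical statistics stored in the checkpoint at training time.
ckpt_mu, ckpt_sd = 3.0, 0.5
# Hypothetical statistics recomputed on a different subset (e.g. another LIMIT_TRAIN).
new_mu, new_sd = 3.5, 1.0

z = normalize(4.0, ckpt_mu, ckpt_sd)      # what the head learned to output: 2.0
print(denormalize(z, ckpt_mu, ckpt_sd))   # -> 4.0 (consistent)
print(denormalize(z, new_mu, new_sd))     # -> 5.5 (wrong scale)
```

Reusing the saved statistics keeps both the training targets and the decoded predictions on the scale the loaded heads already expect.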
+  {
+   "cell_type": "markdown",
+   "id": "5c16d942",
+   "metadata": {},
+   "source": [
+    "## 6. Heads (LOADED from the checkpoint) + optimizer + CONTINUE training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7adfb320",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "hmiss, hunexp = heads.load_state_dict(ckpt[\"heads\"], strict=False)\n",
+    "print(f\"🔁 loaded heads from checkpoint: {len(hmiss)} missing / {len(hunexp)} unexpected key(s) (expected 0).\")\n",
+    "print(f\"Trunk input = {TRUNK_IN} (wavlm {WAVLM_DIM} + aud {AUD_DIM if USE_AUDEERING else 0})\")\n",
+    "\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "bb_params = [p for p in wavlm.parameters() if p.requires_grad]\n",
+    "head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.AdamW([\n",
+    "    {\"params\": bb_params, \"lr\": LR_BACKBONE * RESUME_LR_SCALE},\n",
+    "    {\"params\": head_params, \"lr\": LR_HEAD * RESUME_LR_SCALE},\n",
+    "], weight_decay=WEIGHT_DECAY)\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()\n",
+    "\n",
+    "def forward_batch(b):\n",
+    "    feat_wavlm = wavlm_embed(b[\"input_values\"].to(device), b[\"attn_mask\"].to(device))\n",
+    "    feat = torch.cat([feat_wavlm, b[\"aud\"].to(device)], dim=1) if USE_AUDEERING else feat_wavlm\n",
+    "    return heads(feat, b[\"tgt\"].to(device))\n",
+    "\n",
+    "def compute_loss(emos_p, cat_l, vad_p, b):\n",
+    "    L = {}\n",
+    "    L[\"emos\"] = mse(emos_p, b[\"emos\"].to(device))\n",
+    "    L[\"cat\"] = soft_ce(cat_l, b[\"cat\"].to(device))\n",
+    "    if HAS_VAD:\n",
+    "        vt = b[\"vad\"].to(device)\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L[\"aro\"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L[\"dom\"] = mse(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(L.values())\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def evaluate():\n",
+    "    wavlm.eval(); heads.eval()\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for b in va_loader:\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "        P[\"emos\"] += emos_p.float().cpu().numpy().ravel().tolist(); Y[\"emos\"] += b[\"emos_raw\"].tolist()\n",
+    "        vad_p = vad_p.float().cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t] += vad_p[:, j].tolist(); Y[t] += b[\"vad_raw\"][:, j].tolist()\n",
+    "        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b[\"cat\"])\n",
+    "    out = {}\n",
+    "    for t in [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else []):\n",
+    "        out[t] = spearmanr(P[t], Y[t]).correlation\n",
+    "    q = np.concatenate(catP); p = np.concatenate(catY)\n",
+    "    out[\"cat_err\"] = float(np.abs(q - p).sum(1).mean())\n",
+    "    return out\n",
+    "\n",
+    "def mean_srcc(m):\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "# Initialize best FROM the current checkpoint → replace it only if continued training does BETTER\n",
+    "m0 = evaluate(); best = mean_srcc(m0)\n",
+    "best_state = {\"wavlm\": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},\n",
+    "              \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "print(f\"📍 Current checkpoint: mean SRCC = {best:.4f} | \"\n",
+    "      + \" \".join(f\"{k}={m0[k]:.3f}\" for k in ['emos','val','aro','dom'] if k in m0))\n",
+    "\n",
+    "bad = 0\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    wavlm.train(); heads.train()\n",
+    "    opt.zero_grad(); run = 0.0; nb = 0\n",
+    "    for step, b in enumerate(tqdm(tr_loader, desc=f\"+epoch {ep}\")):\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM\n",
+    "        scaler.scale(loss).backward()\n",
+    "        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):   # also step on the last, partial accumulation group\n",
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "        run += loss.item() * ACCUM; nb += 1\n",
+    "    m = evaluate(); sc = mean_srcc(m)\n",
+    "    msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "    print(f\"+epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc\n",
+    "        best_state = {\"wavlm\": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},\n",
+    "                      \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stop (resume) at +epoch {ep}.\"); break\n",
+    "\n",
+    "wavlm.load_state_dict(best_state[\"wavlm\"]); heads.load_state_dict(best_state[\"heads\"])\n",
+    "final = evaluate()\n",
+    "print(\"\\n✅ VAL after resume:\")\n",
+    "print(f\"   EMOS={final['emos']:.4f} (ckpt {m0['emos']:.3f} · exp08 submitted {EXP08['emos']})\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} \"\n",
+    "          f\"(exp08 submitted {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})\")\n",
+    "print(f\"   mean SRCC: ckpt {mean_srcc(m0):.4f} → after resume {mean_srcc(final):.4f} \"\n",
+    "      + (\"🚀 improved\" if mean_srcc(final) > mean_srcc(m0) + 1e-4 else \"➖ no improvement (keeping the old ckpt)\"))\n",
+    "\n",
+    "torch.save({\"wavlm\": best_state[\"wavlm\"], \"heads\": best_state[\"heads\"],\n",
+    "            \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "            \"WAVLM_DIM\": WAVLM_DIM, \"AUD_DIM\": AUD_DIM, \"UNFREEZE_TOP_LAYERS\": UNFREEZE_TOP_LAYERS,\n",
+    "            \"val_emos\": final[\"emos\"]}, os.path.join(OUT_DIR, \"ft_emotion_full.pt\"))\n",
+    "print(\"Saved FULL (with backbone):\", os.path.join(OUT_DIR, \"ft_emotion_full.pt\"))"
+   ]
+  },
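When `USE_UNCERTAINTY` is on, `compute_loss` combines the five task losses as a sum of `exp(-log_var[i]) * L[i] + log_var[i]`, with one learnable `log_var` per task. The sketch below evaluates that formula on plain floats to show how the weights behave. It is a toy illustration with invented loss values, not the training code.

```python
import math


def uncertainty_weighted(losses, log_vars):
    """Sum of exp(-s_i) * L_i + s_i over tasks (s_i is a learnable log-variance)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# With every log-variance at 0 (the notebook's initial value) this is a plain sum.
print(uncertainty_weighted([0.5, 2.0], [0.0, 0.0]))  # -> 2.5

# Raising s for the noisier task halves its weight but adds a penalty of s,
# so the model cannot shrink every loss to zero by inflating the log-variances.
print(uncertainty_weighted([0.5, 2.0], [0.0, math.log(2)]))
```

One design point worth noting: the notebook re-creates `log_var` at zero on every resume and does not store it in the checkpoint, so the task weights are relearned in each run.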
+  {
+   "cell_type": "markdown",
+   "id": "dd6bb5d8",
+   "metadata": {},
+   "source": [
+    "## 7. Predict DEV → answer.txt (5 emotion columns from the resumed model; QMOS borrowed from exp07 or UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b7753e8",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "aud_dev = extract_audeering(dev_stems, \"dev\")\n",
+    "\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            for row in csv.DictReader(f):\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Borrowed QMOS from exp07 ({EXP07_ANSWER}): {len(d)//2} wavs\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer.txt → scoring QMOS with UTMOSv2 (T05, VMC2024 winner).\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if not os.path.exists(wav):\n",
+    "            continue\n",
+    "        out = v2.predict(input_path=wav)\n",
+    "        qmos_map[n] = float(out[\"predicted_mos\"]) if isinstance(out, dict) else float(out)\n",
+    "    del v2; torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    wave = load_wav(sid)\n",
+    "    if wave is None or (USE_AUDEERING and sid not in aud_dev):\n",
+    "        return None\n",
+    "    wavlm.eval(); heads.eval()\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        fw = wavlm_embed(iv, am)\n",
+    "        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_real = n_def = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pr = predict_emotion(sid)\n",
+    "            if pr is None:\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "            else:\n",
+    "                emos, cat5, vad3 = pr; n_real += 1\n",
+    "            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | real emotion predictions {n_real}, defaults {n_def}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
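Each `answer.txt` row written by `build_answer` has seven comma-separated fields, with the five category probabilities packed into one field as `emotion:prob` pairs joined by `|`. The sketch below formats a single row the same way. The file name `utt_0001.wav` and all scores are invented for illustration, and `answer_row` is a hypothetical helper.

```python
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]


def fmt_cat(p5):
    """Pack five probabilities into one field: 'angry:0.1|happy:0.6|...'."""
    return "|".join(f"{e}:{p:.6g}" for e, p in zip(EMOTIONS5, p5))


def answer_row(name, qmos, emos, cat5, vad3):
    """Build one CSV row in the order wav,QMOS,EMOS,CAT,VAL,ARO,DOM."""
    return (f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
            f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}")


row = answer_row("utt_0001.wav", 3.5, 4.0, [0.1, 0.6, 0.2, 0.1, 0.0], [3.0, 3.5, 2.75])
print(row)
# -> utt_0001.wav,3.5,4,angry:0.1|happy:0.6|neutral:0.2|sad:0.1|surprised:0,3,3.5,2.75
```

Because the category field uses `|` and `:` rather than commas, splitting the row on `,` still yields exactly seven fields, which is what the column-count check in the validation step relies on.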
+  {
+   "cell_type": "markdown",
+   "id": "7dac208f",
+   "metadata": {},
+   "source": [
+    "## 8. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c123c058",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp08_resume.zip answer.txt && unzip -l submission_track2_exp08_resume.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp08_resume.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e5f82cf0",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Required input:** `RESUME_CKPT` = `ft_emotion_full.pt` (WITH the backbone). The old `ft_emotion_meta.pt`\n",
+    "  only holds the heads → cell 2 fails its assert with a reminder to use the full file.\n",
+    "- **Cache:** point `CACHE_INPUT` at the dataset holding `aud_train.npz`/`aud_dev.npz` → audeering is not re-extracted.\n",
+    "  If LIMIT differs from the previous run, any stem missing from the cache is extracted on demand (resume per stem).\n",
+    "- **Normalization comes FROM the checkpoint** (`emos_mu/sd`, `vad_mu/sd`) → matches the scale the heads were trained on (do not recompute).\n",
+    "- **best is initialized from the checkpoint** → a new state replaces it only if continued training is ACTUALLY better (no risk of regressing).\n",
+    "- If validation plateaus: set `RESUME_LR_SCALE=0.5` (lower LR) or raise `UNFREEZE_TOP_LAYERS` (note: newly unfrozen\n",
+    "  layers were not trained in the checkpoint → they need more epochs).\n",
+    "- QMOS: preferably Add Input the exp07 `answer.txt` (0.548). For the standard column mix, see the exp08 results: 5 emotion columns\n",
+    "  from resume + exp07 QMOS → the strongest 6-column system.\n",
+    "- Record config → results → remarks in `docs/04_experiments_log.md`."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

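The `validate` cell in section 8 of the notebook above only checks the header and the column count. A stricter check could also parse the numeric fields and the `CAT` cell. The following is a minimal sketch, not part of the commit: `parse_cat` and `check_rows` are hypothetical helper names, and the `CAT` layout (`label:prob` pairs joined by `|`) is assumed from the notebook's `fmt_cat`. The tolerance `tol` is likewise an assumption.

```python
import csv
import io
import math

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]  # same labels as the notebook
HEADER = ["wav", "QMOS", "EMOS", "CAT", "VAL", "ARO", "DOM"]


def parse_cat(cell):
    """Parse 'angry:0.2|happy:0.2|...' into a dict of label -> probability."""
    probs = {}
    for part in cell.split("|"):
        label, _, value = part.partition(":")
        probs[label] = float(value)  # raises ValueError on a missing or bad value
    return probs


def check_rows(text, tol=1e-3):
    """Return a list of human-readable problems found in answer.txt content."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != HEADER:
        return ["bad header"]
    problems = []
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != len(HEADER):
            problems.append(f"line {lineno}: expected {len(HEADER)} columns, got {len(row)}")
            continue
        try:
            scores = [float(row[i]) for i in (1, 2, 4, 5, 6)]
            cat = parse_cat(row[3])
        except ValueError:
            problems.append(f"line {lineno}: non-numeric field")
            continue
        if not all(math.isfinite(x) for x in scores + list(cat.values())):
            problems.append(f"line {lineno}: NaN/inf value")
        elif sorted(cat) != sorted(EMOTIONS5):
            problems.append(f"line {lineno}: CAT labels do not match EMOTIONS5")
        elif abs(sum(cat.values()) - 1.0) > tol:
            problems.append(f"line {lineno}: CAT probabilities sum to {sum(cat.values()):.4f}")
    return problems


if __name__ == "__main__":
    sample = (
        "wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n"
        "a.wav,3.1,3.5,angry:0.2|happy:0.2|neutral:0.2|sad:0.2|surprised:0.2,3,3,3\n"
        "b.wav,3.1,3.5,angry:0.5|happy:0.5|neutral:0.2|sad:0.2|surprised:0.2,3,3,3\n"
    )
    for problem in check_rows(sample):
        print(problem)
```

Returning a list of problems rather than asserting on the first one lets a single pass report every bad row, which is cheaper than re-running a Kaggle cell once per fix.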
track2/exp08b_finetune_resume_pipeline.py ADDED

	@@ -0,0 +1,642 @@

+# %% [markdown]
+# # VMC2026 Track 2 - exp08-RESUME (CONTINUE fine-tuning from checkpoint + cache) - Kaggle
+#
+# **Purpose:** continue training the emotion fine-tuning model (exp08) from a **saved checkpoint** instead of
+# retraining from scratch, which saves GPU hours. It reuses:
+# - `ft_emotion_full.pt` (contains the WavLM backbone + heads + normalization statistics) → restores the exact state.
+# - the **audeering cache** `aud_*.npz` (frozen features) → NO re-extraction (saves roughly ten minutes).
+#
+# > ⚠️ You must use a checkpoint that **includes the backbone** (`ft_emotion_full.pt` from the "TRAIN TIẾP" (continue training) cell, or the
+# > NEW `ft_emotion_meta.pt` patched to also save `wavlm`). The OLD `ft_emotion_meta.pt` only has `heads` → it CANNOT be used.
+#
+# ## Preparing inputs on Kaggle (Add Input)
+# 1. Track 2 dataset (`vmc2026-track2-full`): wav + labels.
+# 2. **Checkpoint**: upload `ft_emotion_full.pt` as a Dataset → point `RESUME_CKPT` at it.
+# 3. **Cache** (optional but recommended): upload the folder containing `aud_train.npz`, `aud_dev.npz` → point `CACHE_INPUT` at it.
+# 4. (optional) the exp07 `answer.txt`, to borrow its QMOS column (0.548).
+#
+# **How to run:** GPU T4 + Internet On → edit the slugs in cell 0 → Run All. For the first run keep `LIMIT_TRAIN=300`.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os, shutil
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+# ── Checkpoint + cache for RESUME ────────────────────────────────────────────
+RESUME_CKPT  = "/kaggle/input/ft-emotion-full/ft_emotion_full.pt"   # << CHECKPOINT that includes the backbone
+CACHE_INPUT  = "/kaggle/input/ft-emotion-cache"                     # << folder containing aud_*.npz (or "" if none)
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"             # << (optional) borrow QMOS 0.548; if missing → UTMOSv2
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/ft_cache"     # /kaggle/input is read-only → copy the cache here so it can be written/appended
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── Fine-tuning / hyperparameters (CONTINUED training) ───────────────────────
+DEVICE              = "cuda"
+SR                  = 16000
+MAX_SECONDS         = 8
+UNFREEZE_TOP_LAYERS = 6           # MUST match the checkpoint (exp08 default = 6)
+TRUNK_HIDDEN        = 512          # MUST match the checkpoint
+HEAD_HIDDEN         = 128          # MUST match the checkpoint
+DROPOUT             = 0.3
+LR_BACKBONE         = 1e-5
+LR_HEAD             = 1e-3
+RESUME_LR_SCALE     = 1.0          # <1.0 lowers the LR when resuming (e.g. 0.5 if val has plateaued)
+WEIGHT_DECAY        = 1e-5
+EPOCHS              = 10           # number of ADDITIONAL epochs (this run)
+PATIENCE            = 5            # stop when val stops improving; ALWAYS keeps the best
+BATCH               = 4
+ACCUM               = 8           # effective batch = 32
+VAL_FRAC            = 0.10
+SEED                = 42
+USE_AMP             = True
+USE_GRAD_CKPT       = True
+USE_AUDEERING       = True         # MUST match the checkpoint (exp08 = True)
+USE_UNCERTAINTY     = True
+LIMIT_TRAIN         = 300          # << FIRST RUN 300; real run None
+LIMIT_DEV           = 20           # << FIRST RUN 20; real run None
+# exp07 + exp08 reference scores for comparison
+EXP07 = {"emos": 0.795, "cat_err": 0.153, "val": 0.581, "aro": 0.752, "dom": 0.705}
+EXP08 = {"emos": 0.811, "cat_err": 0.133, "val": 0.659, "aro": 0.793, "dom": 0.751}  # the submitted version
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP, RESUME_CKPT]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# Copy the cache (aud_*.npz) from the read-only input to the working dir so it can be appended to
+if CACHE_INPUT and os.path.isdir(CACHE_INPUT):
+    n = 0
+    for fn in os.listdir(CACHE_INPUT):
+        if fn.startswith("aud_") and fn.endswith(".npz"):
+            shutil.copy(os.path.join(CACHE_INPUT, fn), os.path.join(CACHE_DIR, fn)); n += 1
+    print(f"📦 Copied {n} audeering cache file(s) from {CACHE_INPUT} → {CACHE_DIR}")
+else:
+    print("ℹ️ No CACHE_INPUT → audeering features will be extracted from scratch (slower on the first run).")
+# %% [markdown]
+# ## 1. Install + fetch the SAILER code (to build the right WavLM architecture, then load the checkpoint over it)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("loralib", "speechbrain", "speechmos", "librosa", "soundfile",
+            "scipy", "scikit-learn", "pandas", "tqdm")
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Build WavLM (as in exp08) → LOAD the backbone weights from the checkpoint
+# Build the matching architecture (SAILER wrapper → take its HF backbone; fallback: plain pretrained WavLM), then call
+# `load_state_dict` with `ckpt["wavlm"]` → restores the saved fine-tuned state.
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (very slow!)")
+ckpt = torch.load(RESUME_CKPT, map_location="cpu", weights_only=False)   # ckpt holds numpy arrays (vad_mu) → needs False
+assert "wavlm" in ckpt, ("❌ Checkpoint has NO 'wavlm' (backbone). This is the OLD ft_emotion_meta.pt "
+                         "that only saved the heads → cannot resume. Use ft_emotion_full.pt instead.")
+print("✅ Loaded checkpoint:", RESUME_CKPT, "| keys:", list(ckpt.keys()))
+def find_hf_backbone(module):
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ Built the WavLM backbone from the SAILER wrapper at '.{name}'")
+except Exception as e:
+    print("⚠️ Failed to load the SAILER wrapper:", repr(e), "→ falling back to plain WavLM.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: microsoft/wavlm-large.")
+wavlm = wavlm.to(device)
+WAVLM_DIM = int(wavlm.config.hidden_size)
+# Load the fine-tuned weights from the checkpoint (over the architecture just built)
+miss, unexp = wavlm.load_state_dict(ckpt["wavlm"], strict=False)
+print(f"🔁 loaded wavlm from checkpoint: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0).")
+if len(miss) > 20 or len(unexp) > 20:
+    print("   ⚠️ Many mismatched keys → the architecture may not match the checkpoint. Check UNFREEZE/USE_AUDEERING.")
+# Partial freeze: only unfreeze the top UNFREEZE_TOP_LAYERS layers
+for p in wavlm.parameters():
+    p.requires_grad = False
+enc_layers = wavlm.encoder.layers
+n_layers = len(enc_layers)
+for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:
+    for p in layer.parameters():
+        p.requires_grad = True
+n_train = sum(p.numel() for p in wavlm.parameters() if p.requires_grad)
+print(f"WavLM: {n_layers} layers · unfrozen {min(UNFREEZE_TOP_LAYERS, n_layers)} → {n_train/1e6:.1f}M trainable params (dim {WAVLM_DIM})")
+if USE_GRAD_CKPT:
+    wavlm.gradient_checkpointing_enable()
+    if hasattr(wavlm, "enable_input_require_grads"):
+        wavlm.enable_input_require_grads()
+def masked_mean(hidden, attn_mask):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+def wavlm_embed(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state
+    return masked_mean(out, attn_mask)
+# %% [markdown]
+# ## 3. FROZEN audeering (auxiliary features): uses the cache if available
+# %%
+import numpy as np
+import librosa
+from tqdm.auto import tqdm
+AUD_DIM = 0
+aud_backbone = aud_head = aud_proc = None
+if USE_AUDEERING:
+    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+    from huggingface_hub import hf_hub_download
+    AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+    aud_backbone = Wav2Vec2Model(aud_cfg)
+    try:
+        _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+            hf_hub_download(AUD_NAME, "model.safetensors"))
+    except Exception:
+        _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+    bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+    aud_backbone.load_state_dict(bb_sd, strict=False)
+    _hid = _sd["classifier.dense.weight"].shape[0]
+    _out = _sd["classifier.out_proj.weight"].shape[0]
+    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))
+    aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+    aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+    aud_backbone = aud_backbone.to(device).eval()
+    aud_head = aud_head.to(device).eval()
+    AUD_DIM = _hid + 3
+    print(f"✅ audeering frozen ({AUD_DIM}-D)")
+def load_wav(name_or_stem):
+    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(
+        WAV_DIR, name_or_stem if str(name_or_stem).endswith(".wav") else str(name_or_stem) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    return wave[: MAX_SECONDS * SR].astype(np.float32)
+@torch.no_grad()
+def extract_audeering(stems, tag):
+    if not USE_AUDEERING:
+        return {}
+    cache_path = os.path.join(CACHE_DIR, f"aud_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[aud/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    for i, s in enumerate(tqdm(todo, desc=f"audeering {tag}")):
+        wave = load_wav(s)
+        if wave is None:
+            continue
+        x = aud_proc(wave, sampling_rate=SR).input_values[0]
+        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+        h = aud_backbone(x)[0].mean(dim=1)
+        out = aud_head(h)[0].cpu().numpy()
+        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)
+        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+        if (i + 1) % 500 == 0:
+            np.savez(cache_path, **store)
+    if todo:
+        np.savez(cache_path, **store)
+    return store
+# %% [markdown]
+# ## 4. Read & merge labels by wavID
+# %%
+import pandas as pd
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"No eMOS column found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target: {len(target_map)} | train wavs (merged): {len(train_df)} | has VAD: {HAS_VAD}")
+# %% [markdown]
+# ## 5. Dataset/loader: USE the normalization statistics FROM THE CHECKPOINT (to match the trained heads)
+# %%
+from torch.utils.data import Dataset, DataLoader
+from sklearn.model_selection import train_test_split
+train_stems = [s for s in train_df["wavID"] if target_map.get(s) is not None]
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+aud_tr = extract_audeering(train_stems, "train")
+lab = train_df.set_index("wavID")
+# IMPORTANT: take mean/std from the checkpoint (the heads were trained on this scale) instead of recomputing them.
+emos_mu = float(ckpt["emos_mu"]); emos_sd = float(ckpt["emos_sd"])
+vad_mu = np.asarray(ckpt["vad_mu"], dtype=np.float32); vad_sd = np.asarray(ckpt["vad_sd"], dtype=np.float32)
+print(f"Using normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}")
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+class EmoDataset(Dataset):
+    def __init__(self, stems):
+        self.stems = [s for s in stems if (load_wav(s) is not None) and ((not USE_AUDEERING) or s in aud_tr)]
+    def __len__(self):
+        return len(self.stems)
+    def __getitem__(self, i):
+        s = self.stems[i]
+        wave = load_wav(s)
+        emos = (float(lab.loc[s, "emos"]) - emos_mu) / emos_sd
+        if HAS_VAD:
+            vad = (np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32) - vad_mu) / vad_sd
+        else:
+            vad = np.zeros(3, dtype=np.float32)
+        cat = np.array([lab.loc[s, f"cat{j}"] for j in range(len(EMOTIONS5))], dtype=np.float32)
+        aud = aud_tr[s] if USE_AUDEERING else np.zeros(0, dtype=np.float32)
+        return {"wave": wave, "tgt": onehot_target(target_map.get(s)), "aud": aud,
+                "emos": np.float32(emos), "vad": vad, "cat": cat,
+                "emos_raw": np.float32(lab.loc[s, "emos"]),
+                "vad_raw": np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32)}
+def collate(batch):
+    lens = [len(b["wave"]) for b in batch]
+    L = max(lens)
+    waves = np.zeros((len(batch), L), dtype=np.float32)
+    mask = np.zeros((len(batch), L), dtype=np.float32)
+    for i, b in enumerate(batch):
+        waves[i, : len(b["wave"])] = b["wave"]; mask[i, : len(b["wave"])] = 1.0
+    return {
+        "input_values": torch.from_numpy(waves), "attn_mask": torch.from_numpy(mask).long(),
+        "tgt": torch.from_numpy(np.stack([b["tgt"] for b in batch])),
+        "aud": torch.from_numpy(np.stack([b["aud"] for b in batch])) if USE_AUDEERING else None,
+        "emos": torch.from_numpy(np.stack([b["emos"] for b in batch])).unsqueeze(1),
+        "vad": torch.from_numpy(np.stack([b["vad"] for b in batch])),
+        "cat": torch.from_numpy(np.stack([b["cat"] for b in batch])),
+        "emos_raw": np.stack([b["emos_raw"] for b in batch]),
+        "vad_raw": np.stack([b["vad_raw"] for b in batch]),
+    }
+ds = EmoDataset(train_stems)
+print("Valid dataset:", len(ds), "wavs")
+tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)
+tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)
+va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)
+# %% [markdown]
+# ## 6. Heads (LOADED from checkpoint) + optimizer + CONTINUED training
+# %%
+from scipy.stats import spearmanr
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+hmiss, hunexp = heads.load_state_dict(ckpt["heads"], strict=False)
+print(f"🔁 loaded heads from checkpoint: {len(hmiss)} missing / {len(hunexp)} unexpected keys (expected 0).")
+print(f"Trunk input = {TRUNK_IN} (wavlm {WAVLM_DIM} + aud {AUD_DIM if USE_AUDEERING else 0})")
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+bb_params = [p for p in wavlm.parameters() if p.requires_grad]
+head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.AdamW([
+    {"params": bb_params, "lr": LR_BACKBONE * RESUME_LR_SCALE},
+    {"params": head_params, "lr": LR_HEAD * RESUME_LR_SCALE},
+], weight_decay=WEIGHT_DECAY)
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()
+def forward_batch(b):
+    feat_wavlm = wavlm_embed(b["input_values"].to(device), b["attn_mask"].to(device))
+    feat = torch.cat([feat_wavlm, b["aud"].to(device)], dim=1) if USE_AUDEERING else feat_wavlm
+    return heads(feat, b["tgt"].to(device))
+def compute_loss(emos_p, cat_l, vad_p, b):
+    L = {}
+    L["emos"] = mse(emos_p, b["emos"].to(device))
+    L["cat"] = soft_ce(cat_l, b["cat"].to(device))
+    if HAS_VAD:
+        vt = b["vad"].to(device)
+        L["val"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L["aro"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L["dom"] = mse(vad_p[:, 2:3], vt[:, 2:3])
+    else:
+        z = torch.zeros((), device=device); L["val"] = L["aro"] = L["dom"] = z
+    if USE_UNCERTAINTY:
+        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))
+    return sum(L.values())
+@torch.no_grad()
+def evaluate():
+    wavlm.eval(); heads.eval()
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for b in va_loader:
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+        P["emos"] += emos_p.float().cpu().numpy().ravel().tolist(); Y["emos"] += b["emos_raw"].tolist()
+        vad_p = vad_p.float().cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t] += vad_p[:, j].tolist(); Y[t] += b["vad_raw"][:, j].tolist()
+        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b["cat"])
+    out = {}
+    for t in ["emos"] + (["val", "aro", "dom"] if HAS_VAD else []):
+        out[t] = spearmanr(P[t], Y[t]).correlation
+    q = np.concatenate(catP); p = np.concatenate(catY)
+    out["cat_err"] = float(np.abs(q - p).sum(1).mean())
+    return out
+def mean_srcc(m):
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+# Initialize best FROM the current checkpoint → only keep new weights if continued training is BETTER
+m0 = evaluate(); best = mean_srcc(m0)
+best_state = {"wavlm": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},
+              "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+print(f"📍 Current checkpoint: mean SRCC = {best:.4f} | "
+      + " ".join(f"{k}={m0[k]:.3f}" for k in ['emos','val','aro','dom'] if k in m0))
+bad = 0
+for ep in range(1, EPOCHS + 1):
+    wavlm.train(); heads.train()
+    opt.zero_grad(); run = 0.0; nb = 0
+    for step, b in enumerate(tqdm(tr_loader, desc=f"+epoch {ep}")):
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM
+        scaler.scale(loss).backward()
+        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):   # also step on the final, partial accumulation window
+            scaler.step(opt); scaler.update(); opt.zero_grad()
+        run += loss.item() * ACCUM; nb += 1
+    m = evaluate(); sc = mean_srcc(m)
+    msg = " ".join(f"{k}={m[k]:.3f}" for k in ["emos", "val", "aro", "dom"] if k in m)
+    print(f"+epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc
+        best_state = {"wavlm": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},
+                      "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+        bad = 0
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop (resume) at +epoch {ep}."); break
+wavlm.load_state_dict(best_state["wavlm"]); heads.load_state_dict(best_state["heads"])
+final = evaluate()
+print("\n✅ VAL after resume:")
+print(f"   EMOS={final['emos']:.4f} (ckpt {m0['emos']:.3f} · exp08 submitted {EXP08['emos']})")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} "
+          f"(exp08 submitted {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})")
+print(f"   mean SRCC: ckpt {mean_srcc(m0):.4f} → after resume {mean_srcc(final):.4f} "
+      + ("🚀 improved" if mean_srcc(final) > mean_srcc(m0) + 1e-4 else "➖ no improvement (keeping the old ckpt)"))
+torch.save({"wavlm": best_state["wavlm"], "heads": best_state["heads"],
+            "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+            "WAVLM_DIM": WAVLM_DIM, "AUD_DIM": AUD_DIM, "UNFREEZE_TOP_LAYERS": UNFREEZE_TOP_LAYERS,
+            "val_emos": final["emos"]}, os.path.join(OUT_DIR, "ft_emotion_full.pt"))
+print("Saved FULL checkpoint (with backbone):", os.path.join(OUT_DIR, "ft_emotion_full.pt"))
+# %% [markdown]
+# ## 7. Predict DEV → answer.txt (5 emotion columns from resume; QMOS borrowed from exp07 or UTMOSv2)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+aud_dev = extract_audeering(dev_stems, "dev")
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            for row in csv.DictReader(f):
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Borrowed QMOS from exp07 ({EXP07_ANSWER}): {len(d)//2} wav")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ No exp07 answer.txt → scoring QMOS with UTMOSv2 (T05, VMC2024 winner).")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if not os.path.exists(wav):
+            continue
+        out = v2.predict(input_path=wav)
+        qmos_map[n] = float(out["predicted_mos"]) if isinstance(out, dict) else float(out)
+    del v2; torch.cuda.empty_cache() if device == "cuda" else None
+@torch.no_grad()
+def predict_emotion(sid):
+    wave = load_wav(sid)
+    if wave is None or (USE_AUDEERING and sid not in aud_dev):
+        return None
+    wavlm.eval(); heads.eval()
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        fw = wavlm_embed(iv, am)
+        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_real = n_def = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pr = predict_emotion(sid)
+            if pr is None:
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+            else:
+                emos, cat5, vad3 = pr; n_real += 1
+            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | real emotion predictions {n_real}, defaults {n_def}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 8. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Line {i}: wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp08_resume.zip answer.txt && unzip -l submission_track2_exp08_resume.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp08_resume.zip"))
+# %% [markdown]
+# ## Notes
+# - **Required input:** `RESUME_CKPT` = `ft_emotion_full.pt` (includes the backbone). The old `ft_emotion_meta.pt` only
+#   has the heads, so cell 2 fails an assert that tells you to use the full file.
+# - **Cache:** point `CACHE_INPUT` at the dataset containing `aud_train.npz`/`aud_dev.npz` to skip re-extracting audeering features.
+#   If LIMIT differs from the previous run, any stems missing from the cache are extracted automatically (the cache resumes per stem).
+# - **Normalization comes FROM the checkpoint** (`emos_mu/sd`, `vad_mu/sd`) so it matches the scale the heads were trained on (do not recompute it).
+# - **`best` is initialized from the checkpoint**, so new weights are kept only if continued training is ACTUALLY better (no risk of regressing).
+# - If validation plateaus: set `RESUME_LR_SCALE=0.5` (lower LR) or increase `UNFREEZE_TOP_LAYERS` (note: newly unfrozen
+#   layers were not fine-tuned in the checkpoint, so they need more epochs).
+# - QMOS: ideally Add Input the exp07 `answer.txt` (0.548). For the standard column mix, see the exp08 results: the 5 emotion
+#   columns from resume + QMOS from exp07 → the strongest 6-column system.
+# - Record config → results → remarks in `docs/04_experiments_log.md`.

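The training loop in section 6 of `exp08b_finetune_resume_pipeline.py` accumulates gradients over `ACCUM` mini-batches before each optimizer step. When the number of batches per epoch is not a multiple of `ACCUM`, the trailing batches only contribute if a final step is taken on the last batch. The sketch below is a pure-Python illustration of that bookkeeping, not code from the commit: `optimizer_step_indices` and `dropped_batches` are hypothetical helper names, and the batch count of 68 is an assumed figure for `LIMIT_TRAIN=300`, `VAL_FRAC=0.10`, `BATCH=4`.

```python
def optimizer_step_indices(n_batches, accum, flush_tail=True):
    """Return the 0-based batch indices after which the optimizer would step."""
    steps = []
    for step in range(n_batches):
        window_full = (step + 1) % accum == 0   # a full accumulation window just ended
        last_batch = (step + 1) == n_batches    # end of the epoch
        if window_full or (flush_tail and last_batch):
            steps.append(step)
    return steps


def dropped_batches(n_batches, accum):
    """Batches per epoch whose gradients are never applied when the tail is not flushed."""
    return n_batches % accum


if __name__ == "__main__":
    n_batches, accum = 68, 8  # assumed example values, see the note above
    print("steps without tail flush:", optimizer_step_indices(n_batches, accum, flush_tail=False))
    print("steps with tail flush:   ", optimizer_step_indices(n_batches, accum, flush_tail=True))
    print("batches dropped per epoch without flush:", dropped_batches(n_batches, accum))
```

One design caveat: if the loss is always divided by `ACCUM`, a flushed tail window yields a proportionally smaller update than a full window. Dividing by the actual window length instead would be one way to equalize it.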
track2/exp09a_qmos_utmosv2_probe.ipynb ADDED

	@@ -0,0 +1,339 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "0afb9ac3",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2 - exp09a (PROBE: UTMOSv2 vs UTMOS for QMOS) - Kaggle\n",
+    "\n",
+    "**Purpose (cheap, uses NO submission attempts):** before fine-tuning QMOS, check whether\n",
+    "**UTMOSv2** (system **T05, winner of VoiceMOS Challenge 2024 Track 1**, naturalness MOS)\n",
+    "is **stronger than UTMOS 2022** (currently in use) on the Track 2 data.\n",
+    "\n",
+    "## A/B idea that costs no submission attempts\n",
+    "The Track 2 **train** set HAS real `qMOS` labels (`sets/train.csv`). We:\n",
+    "1. Score a sample of train with **UTMOS** (torch.hub `utmos22_strong`), the current baseline.\n",
+    "2. Score the same sample with **UTMOSv2** (`sarulab-speech/UTMOSv2`, MIT).\n",
+    "3. Compare **each model's SRCC against the gold qMOS labels** → shows which model \"ranks\" utterances more like the human raters.\n",
+    "\n",
+    "> SRCC scores **rank order** (scale-invariant) → no need to worry about scale mismatch. A sample of ~2,000 wavs is stable enough.\n",
+    "\n",
+    "## Why it is worth trying\n",
+    "- UTMOSv2 = #1 on 7/16 metrics in VMC2024 Track 1 (far ahead of 3rd place) → the direct successor to UTMOS.\n",
+    "- **Note:** UTMOSv2 was also trained on *non*-emotional speech → a domain mismatch is still possible; this A/B shows\n",
+    "  whether it is **worth** using as a stronger \"anchor\" for the fine-tuned QMOS head (exp09).\n",
+    "\n",
+    "**How to run:** GPU T4 + **Internet On** (UTMOSv2 is installed from git and downloads a checkpoint) → Add Input the\n",
+    "Track 2 dataset → edit `DATA_ROOT` → Run All. For the first run set `PROBE_N=300` for speed; once that works, raise it to `2000`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c4f6fdc3",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration - EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "df72dc96",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/qmos_probe_cache\"\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "DEVICE  = \"cuda\"\n",
+    "PROBE_N = 2000     # number of train wavs for the A/B (use 300 on the first run for speed). SRCC is stable at ~2000 samples.\n",
+    "SEED    = 42\n",
+    "\n",
+    "# (Optional) To ALSO BUILD an answer.txt with the QMOS column swapped to UTMOSv2 for a confirmation submission on DEV:\n",
+    "#   point this at the exp07 answer.txt (the 5 emotion columns are kept, only QMOS is replaced).\n",
+    "#   Leave as None to run only the internal A/B.\n",
+    "EXP07_ANSWER = None    # e.g. \"/kaggle/input/exp07-answer/answer.txt\"\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "for p in [WAV_DIR, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c18d2e88",
+   "metadata": {},
+   "source": [
+    "## 1. Install (UTMOS + UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c21a3cb0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"speechmos\", \"librosa\", \"soundfile\", \"pandas\", \"scipy\", \"scikit-learn\", \"tqdm\")\n",
+    "# UTMOSv2 (T05): installed from git, needs Internet On. The checkpoint downloads automatically on first use.\n",
+    "pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad56ee86",
+   "metadata": {},
+   "source": [
+    "## 2. Gold qMOS labels (averaged per wav)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "66c28d52",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def load_qmos_labels():\n",
+    "    \"\"\"train.csv (sep '|') → dict {stem: mean qMOS per wav}.\"\"\"\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col  = cols.get(\"wavid\") or cols.get(\"wav\") or list(df.columns)[1]\n",
+    "    qmos_col = cols.get(\"qmos\")  or cols.get(\"mos\")\n",
+    "    assert qmos_col, f\"qMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    g = df.groupby(\"_stem\")[qmos_col].mean()\n",
+    "    return {s: float(v) for s, v in g.items()}\n",
+    "\n",
+    "qmos_gold = load_qmos_labels()\n",
+    "print(f\"Train wavs with a qMOS label: {len(qmos_gold)}\")\n",
+    "\n",
+    "# Pick the probe sample (keep only wavs that actually exist on disk)\n",
+    "rng = np.random.default_rng(SEED)\n",
+    "all_stems = [s for s in qmos_gold if os.path.exists(os.path.join(WAV_DIR, s + \".wav\"))]\n",
+    "rng.shuffle(all_stems)\n",
+    "probe_stems = all_stems[:PROBE_N]\n",
+    "print(f\"Probe sample: {len(probe_stems)} / {len(all_stems)} existing wavs\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13d793d1",
+   "metadata": {},
+   "source": [
+    "## 3. Scoring functions: UTMOS (old) and UTMOSv2 (new), both cached as .npz"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "05dd3e90",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from scipy.stats import spearmanr, pearsonr\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU\")\n",
+    "\n",
+    "def score_utmos(stems, tag):\n",
+    "    \"\"\"UTMOS 2022 (torch.hub utmos22_strong). → dict {stem: score}. Cache.\"\"\"\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache = os.path.join(CACHE_DIR, f\"utmos_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache):\n",
+    "        z = np.load(cache, allow_pickle=True)\n",
+    "        store = {k: float(z[k]) for k in z.files}\n",
+    "        print(f\"[utmos/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\",\n",
+    "                                   trust_repo=True).to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, s in enumerate(tqdm(todo, desc=f\"utmos {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                store[s] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device),\n",
+    "                                           sr=16000).mean().item())\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        np.savez(cache, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        del predictor\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "    return store\n",
+    "\n",
+    "def score_utmosv2(stems, tag):\n",
+    "    \"\"\"UTMOSv2 / T05 (sarulab-speech/UTMOSv2). → dict {stem: score}. Cache.\"\"\"\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache = os.path.join(CACHE_DIR, f\"utmosv2_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache):\n",
+    "        z = np.load(cache, allow_pickle=True)\n",
+    "        store = {k: float(z[k]) for k in z.files}\n",
+    "        print(f\"[utmosv2/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        import utmosv2\n",
+    "        model = utmosv2.create_model(pretrained=True)   # ensemble; checkpoint downloads automatically\n",
+    "        for i, s in enumerate(tqdm(todo, desc=f\"utmosv2 {tag}\")):\n",
+    "            wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "            out = model.predict(input_path=wav)\n",
+    "            # predict returns a float (or a dict with 'predicted_mos'), depending on the version\n",
+    "            store[s] = float(out[\"predicted_mos\"]) if isinstance(out, dict) else float(out)\n",
+    "            if (i + 1) % 200 == 0:\n",
+    "                np.savez(cache, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        np.savez(cache, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        del model\n",
+    "        if device == \"cuda\":\n",
+    "            torch.cuda.empty_cache()\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d04d250c",
+   "metadata": {},
+   "source": [
+    "## 4. Run the A/B on the train sample → print each model's SRCC against the gold qMOS labels"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "eb9fe414",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "utmos_s   = score_utmos(probe_stems, \"probe\")\n",
+    "utmosv2_s = score_utmosv2(probe_stems, \"probe\")\n",
+    "\n",
+    "# Compare only on stems that both models scored (for fairness)\n",
+    "common = [s for s in probe_stems if s in utmos_s and s in utmosv2_s and s in qmos_gold]\n",
+    "y_gold = np.array([qmos_gold[s] for s in common])\n",
+    "p_v1   = np.array([utmos_s[s]   for s in common])\n",
+    "p_v2   = np.array([utmosv2_s[s] for s in common])\n",
+    "print(f\"\\nSamples compared in common: {len(common)}\")\n",
+    "\n",
+    "srcc_v1 = spearmanr(p_v1, y_gold).correlation\n",
+    "srcc_v2 = spearmanr(p_v2, y_gold).correlation\n",
+    "lcc_v1  = pearsonr(p_v1, y_gold)[0]\n",
+    "lcc_v2  = pearsonr(p_v2, y_gold)[0]\n",
+    "\n",
+    "print(\"\\n📊 A/B on TRAIN (gold qMOS labels); UTT-SRCC is the primary metric:\")\n",
+    "print(f\"   UTMOS 2022 (current)    : SRCC = {srcc_v1:.4f} | LCC = {lcc_v1:.4f}\")\n",
+    "print(f\"   UTMOSv2 / T05 (new)     : SRCC = {srcc_v2:.4f} | LCC = {lcc_v2:.4f}\")\n",
+    "delta = srcc_v2 - srcc_v1\n",
+    "if delta > 0.01:\n",
+    "    print(f\"   ✅ UTMOSv2 WINS (+{delta:.4f} SRCC) → worth using as the anchor for exp09 / swapping the QMOS column.\")\n",
+    "elif delta < -0.01:\n",
+    "    print(f\"   ⚠️ UTMOSv2 LOSES ({delta:.4f} SRCC) → keep UTMOS; the emotional-domain shift is too strong.\")\n",
+    "else:\n",
+    "    print(f\"   ➖ Roughly equal ({delta:+.4f}) → prefer whichever model is more convenient; settle it with fine-tuning.\")\n",
+    "\n",
+    "# Leaderboard reference points: UTMOS zero-shot DEV = 0.414; exp07 QMOS head = 0.548.\n",
+    "# (Train SRCC ≠ dev SRCC, but they are expected to trend together → use this to pick a direction, not as a submission score.)\n",
+    "print(\"\\nℹ️ DEV leaderboard reference points: UTMOS zero-shot 0.414 · exp07 QMOS head 0.548.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d808b9d6",
+   "metadata": {},
+   "source": [
+    "## 5. (Optional) Build an answer.txt with the QMOS column swapped to UTMOSv2 for a DEV confirmation submission\n",
+    "Runs only if `EXP07_ANSWER` points to the exp07 answer.txt. The 5 emotion columns are kept; only QMOS is replaced."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b08a39f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def build_swapped_answer(exp07_answer_path, out_path):\n",
+    "    \"\"\"Read the exp07 answer.txt (wav,QMOS,EMOS,CAT,VAL,ARO,DOM) and set QMOS = UTMOSv2(dev).\"\"\"\n",
+    "    import csv\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        dev_names = [ln.strip() for ln in f if ln.strip()]\n",
+    "    dev_stems = [stem(n) for n in dev_names]\n",
+    "    utmosv2_dev = score_utmosv2(dev_stems, \"dev\")    # score DEV with UTMOSv2 (separate cache)\n",
+    "\n",
+    "    with open(exp07_answer_path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header, body = rows[0], rows[1:]\n",
+    "    qi = header.index(\"QMOS\")\n",
+    "    n_swap = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\",\".join(header) + \"\\n\")\n",
+    "        for r in body:\n",
+    "            sid = stem(r[0])\n",
+    "            if sid in utmosv2_dev:\n",
+    "                r[qi] = f\"{utmosv2_dev[sid]:.6g}\"\n",
+    "                n_swap += 1\n",
+    "            f.write(\",\".join(r) + \"\\n\")\n",
+    "    print(f\"Wrote {len(body)} rows → {out_path} | QMOS swapped in {n_swap} rows\")\n",
+    "    return out_path\n",
+    "\n",
+    "if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "    out = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "    build_swapped_answer(EXP07_ANSWER, out)\n",
+    "    os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp09a_utmosv2.zip answer.txt \"\n",
+    "              f\"&& unzip -l submission_track2_exp09a_utmosv2.zip\")\n",
+    "    print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp09a_utmosv2.zip\"))\n",
+    "else:\n",
+    "    print(\"Skipping section 5 (EXP07_ANSWER is None or does not exist). Running the internal A/B only.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "adc8ba21",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Reading the section 4 result:** is the UTMOSv2 SRCC higher than the UTMOS SRCC?\n",
+    "  - **Clear win** → use UTMOSv2 instead of UTMOS as the **anchor** for `exp09` (fine-tuning WavLM on the qMOS labels),\n",
+    "    and/or submit the column-swapped answer.txt (section 5) to confirm on the DEV leaderboard.\n",
+    "  - **Loss/tie** → keep UTMOS as the anchor; conclude that \"UTMOSv2 is still off-domain for emotional speech\" (a finding for the paper).\n",
+    "- **Kaggle gotcha:** UTMOSv2 is installed from git and downloads a checkpoint → **Internet On**. An Internet-off submission needs\n",
+    "  the weights pre-downloaded as a Kaggle Dataset.\n",
+    "- UTMOSv2 is a **multi-fold ensemble** → slower than UTMOS. If it takes too long, lower `PROBE_N` or score incrementally (results are cached).\n",
+    "- License: UTMOSv2 **MIT** · UTMOS BSD-3. Record this in `docs/12_system_description.md`.\n",
+    "- Log config → results → observations in `docs/04_experiments_log.md` (section exp09a)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
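
The notebook's claim that SRCC only looks at rank order, so a mismatch in score scale does not matter, can be checked with a small standalone sketch. It uses only the standard library instead of `scipy.stats.spearmanr`, and the five scores are made-up illustration values, not challenge data.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """SRCC is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for illustration only.
gold = [1.5, 2.0, 2.5, 3.0, 4.5]
pred = [2.1, 2.0, 2.9, 3.3, 4.0]
rescaled = [math.exp(p) for p in pred]  # a monotonic change of scale

if __name__ == "__main__":
    print("SRCC:", spearman(pred, gold), spearman(rescaled, gold))
    print("LCC :", pearson(pred, gold), pearson(rescaled, gold))
```

The SRCC is identical before and after the rescaling, while the LCC shifts, which is why the probe treats SRCC as the primary metric and LCC as secondary.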

track2/exp09a_qmos_utmosv2_probe_pipeline.py ADDED

	@@ -0,0 +1,239 @@

+# %% [markdown]
+# # VMC2026 Track 2 - exp09a (PROBE: UTMOSv2 vs UTMOS for QMOS) - Kaggle
+#
+# **Purpose (cheap, costs NO submission):** before fine-tuning QMOS, check whether
+# **UTMOSv2** (system **T05, winner of VoiceMOS Challenge 2024 Track 1**, naturalness MOS)
+# is **stronger than UTMOS 2022** (currently in use) on the Track 2 data.
+#
+# ## A/B idea that costs no submission
+# The Track 2 **train** set HAS real `qMOS` labels (`sets/train.csv`). We:
+# 1. Score a train sample with **UTMOS** (torch.hub `utmos22_strong`), the current baseline.
+# 2. Score the same sample with **UTMOSv2** (`sarulab-speech/UTMOSv2`, MIT).
+# 3. Compare **each model's SRCC against the gold qMOS labels** → see which model "ranks" utterances more like the human raters.
+#
+# > SRCC scores **rank order** (invariant to monotonic rescaling) → no need to worry about scale mismatch. A sample of ~2,000 wavs is stable enough.
+#
+# ## Why it is worth trying
+# - UTMOSv2 = #1 on 7/16 metrics in VMC2024 Track 1 (far ahead of 3rd place) → the direct successor to UTMOS.
+# - **Note:** UTMOSv2 was also trained on *non*-emotional speech → it may still be off-domain; this A/B tells us
+#   whether it is **worth** using as a stronger "anchor" for the fine-tuned QMOS head (exp09).
+#
+# **How to run:** GPU T4 + **Internet On** (UTMOSv2 is installed from git and downloads a checkpoint) → Add Input the Track 2
+# dataset → edit `DATA_ROOT` → Run All. Use `PROBE_N=300` on the first run for speed, then raise it to `2000` once it works.
+# %% [markdown]
+# ## 0. Configuration - EDIT HERE
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/qmos_probe_cache"
+os.makedirs(CACHE_DIR, exist_ok=True)
+DEVICE  = "cuda"
+PROBE_N = 2000     # number of train wavs for the A/B (use 300 on the first run for speed). SRCC is stable at ~2000 samples.
+SEED    = 42
+# (Optional) To ALSO BUILD an answer.txt with the QMOS column swapped to UTMOSv2 for a confirmation submission on DEV:
+#   point this at the exp07 answer.txt (the 5 emotion columns are kept, only QMOS is replaced).
+#   Leave as None to run only the internal A/B.
+EXP07_ANSWER = None    # e.g. "/kaggle/input/exp07-answer/answer.txt"
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+for p in [WAV_DIR, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install (UTMOS + UTMOSv2)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("speechmos", "librosa", "soundfile", "pandas", "scipy", "scikit-learn", "tqdm")
+# UTMOSv2 (T05): installed from git, needs Internet On. The checkpoint downloads automatically on first use.
+pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+# %% [markdown]
+# ## 2. Gold qMOS labels (averaged per wav)
+# %%
+import numpy as np
+import pandas as pd
+def load_qmos_labels():
+    """train.csv (sep '|') → dict {stem: mean qMOS per wav}."""
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col  = cols.get("wavid") or cols.get("wav") or list(df.columns)[1]
+    qmos_col = cols.get("qmos")  or cols.get("mos")
+    assert qmos_col, f"qMOS column not found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    g = df.groupby("_stem")[qmos_col].mean()
+    return {s: float(v) for s, v in g.items()}
+qmos_gold = load_qmos_labels()
+print(f"Train wavs with a qMOS label: {len(qmos_gold)}")
+# Pick the probe sample (keep only wavs that actually exist on disk)
+rng = np.random.default_rng(SEED)
+all_stems = [s for s in qmos_gold if os.path.exists(os.path.join(WAV_DIR, s + ".wav"))]
+rng.shuffle(all_stems)
+probe_stems = all_stems[:PROBE_N]
+print(f"Probe sample: {len(probe_stems)} / {len(all_stems)} existing wavs")
+# %% [markdown]
+# ## 3. Scoring functions: UTMOS (old) and UTMOSv2 (new), both cached as .npz
+# %%
+import torch
+from scipy.stats import spearmanr, pearsonr
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU")
+def score_utmos(stems, tag):
+    """UTMOS 2022 (torch.hub utmos22_strong). → dict {stem: score}. Cache."""
+    import librosa
+    from tqdm.auto import tqdm
+    cache = os.path.join(CACHE_DIR, f"utmos_{tag}.npz")
+    store = {}
+    if os.path.exists(cache):
+        z = np.load(cache, allow_pickle=True)
+        store = {k: float(z[k]) for k in z.files}
+        print(f"[utmos/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
+                                   trust_repo=True).to(device).eval()
+        with torch.no_grad():
+            for i, s in enumerate(tqdm(todo, desc=f"utmos {tag}")):
+                wav = os.path.join(WAV_DIR, s + ".wav")
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                store[s] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device),
+                                           sr=16000).mean().item())
+                if (i + 1) % 500 == 0:
+                    np.savez(cache, **{k: np.float32(v) for k, v in store.items()})
+        np.savez(cache, **{k: np.float32(v) for k, v in store.items()})
+        del predictor
+        if device == "cuda":
+            torch.cuda.empty_cache()
+    return store
+def score_utmosv2(stems, tag):
+    """UTMOSv2 / T05 (sarulab-speech/UTMOSv2). → dict {stem: score}. Cache."""
+    from tqdm.auto import tqdm
+    cache = os.path.join(CACHE_DIR, f"utmosv2_{tag}.npz")
+    store = {}
+    if os.path.exists(cache):
+        z = np.load(cache, allow_pickle=True)
+        store = {k: float(z[k]) for k in z.files}
+        print(f"[utmosv2/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        import utmosv2
+        model = utmosv2.create_model(pretrained=True)   # ensemble; checkpoint downloads automatically
+        for i, s in enumerate(tqdm(todo, desc=f"utmosv2 {tag}")):
+            wav = os.path.join(WAV_DIR, s + ".wav")
+            out = model.predict(input_path=wav)
+            # predict returns a float (or a dict with 'predicted_mos'), depending on the version
+            store[s] = float(out["predicted_mos"]) if isinstance(out, dict) else float(out)
+            if (i + 1) % 200 == 0:
+                np.savez(cache, **{k: np.float32(v) for k, v in store.items()})
+        np.savez(cache, **{k: np.float32(v) for k, v in store.items()})
+        del model
+        if device == "cuda":
+            torch.cuda.empty_cache()
+    return store
+# %% [markdown]
+# ## 4. Run the A/B on the train sample → print each model's SRCC against the gold qMOS labels
+# %%
+utmos_s   = score_utmos(probe_stems, "probe")
+utmosv2_s = score_utmosv2(probe_stems, "probe")
+# Compare only on stems that both models scored (for fairness)
+common = [s for s in probe_stems if s in utmos_s and s in utmosv2_s and s in qmos_gold]
+y_gold = np.array([qmos_gold[s] for s in common])
+p_v1   = np.array([utmos_s[s]   for s in common])
+p_v2   = np.array([utmosv2_s[s] for s in common])
+print(f"\nSamples compared in common: {len(common)}")
+srcc_v1 = spearmanr(p_v1, y_gold).correlation
+srcc_v2 = spearmanr(p_v2, y_gold).correlation
+lcc_v1  = pearsonr(p_v1, y_gold)[0]
+lcc_v2  = pearsonr(p_v2, y_gold)[0]
+print("\n📊 A/B on TRAIN (gold qMOS labels); UTT-SRCC is the primary metric:")
+print(f"   UTMOS 2022 (current)    : SRCC = {srcc_v1:.4f} | LCC = {lcc_v1:.4f}")
+print(f"   UTMOSv2 / T05 (new)     : SRCC = {srcc_v2:.4f} | LCC = {lcc_v2:.4f}")
+delta = srcc_v2 - srcc_v1
+if delta > 0.01:
+    print(f"   ✅ UTMOSv2 WINS (+{delta:.4f} SRCC) → worth using as the anchor for exp09 / swapping the QMOS column.")
+elif delta < -0.01:
+    print(f"   ⚠️ UTMOSv2 LOSES ({delta:.4f} SRCC) → keep UTMOS; the emotional-domain shift is too strong.")
+else:
+    print(f"   ➖ Roughly equal ({delta:+.4f}) → prefer whichever model is more convenient; settle it with fine-tuning.")
+# Leaderboard reference points: UTMOS zero-shot DEV = 0.414; exp07 QMOS head = 0.548.
+# (Train SRCC ≠ dev SRCC, but they are expected to trend together → use this to pick a direction, not as a submission score.)
+print("\nℹ️ DEV leaderboard reference points: UTMOS zero-shot 0.414 · exp07 QMOS head 0.548.")
+# %% [markdown]
+# ## 5. (Optional) Build an answer.txt with the QMOS column swapped to UTMOSv2 for a DEV confirmation submission
+# Runs only if `EXP07_ANSWER` points to the exp07 answer.txt. The 5 emotion columns are kept; only QMOS is replaced.
+# %%
+def build_swapped_answer(exp07_answer_path, out_path):
+    """Read the exp07 answer.txt (wav,QMOS,EMOS,CAT,VAL,ARO,DOM) and set QMOS = UTMOSv2(dev)."""
+    import csv
+    with open(DEV_SCP) as f:
+        dev_names = [ln.strip() for ln in f if ln.strip()]
+    dev_stems = [stem(n) for n in dev_names]
+    utmosv2_dev = score_utmosv2(dev_stems, "dev")    # score DEV with UTMOSv2 (separate cache)
+    with open(exp07_answer_path) as f:
+        rows = list(csv.reader(f))
+    header, body = rows[0], rows[1:]
+    qi = header.index("QMOS")
+    n_swap = 0
+    with open(out_path, "w") as f:
+        f.write(",".join(header) + "\n")
+        for r in body:
+            sid = stem(r[0])
+            if sid in utmosv2_dev:
+                r[qi] = f"{utmosv2_dev[sid]:.6g}"
+                n_swap += 1
+            f.write(",".join(r) + "\n")
+    print(f"Wrote {len(body)} rows → {out_path} | QMOS swapped in {n_swap} rows")
+    return out_path
+if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+    out = os.path.join(OUT_DIR, "answer.txt")
+    build_swapped_answer(EXP07_ANSWER, out)
+    os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp09a_utmosv2.zip answer.txt "
+              f"&& unzip -l submission_track2_exp09a_utmosv2.zip")
+    print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp09a_utmosv2.zip"))
+else:
+    print("Skipping section 5 (EXP07_ANSWER is None or does not exist). Running the internal A/B only.")
+# %% [markdown]
+# ## Notes
+# - **Reading the section 4 result:** is the UTMOSv2 SRCC higher than the UTMOS SRCC?
+#   - **Clear win** → use UTMOSv2 instead of UTMOS as the **anchor** for `exp09` (fine-tuning WavLM on the qMOS labels),
+#     and/or submit the column-swapped answer.txt (section 5) to confirm on the DEV leaderboard.
+#   - **Loss/tie** → keep UTMOS as the anchor; conclude that "UTMOSv2 is still off-domain for emotional speech" (a finding for the paper).
+# - **Kaggle gotcha:** UTMOSv2 is installed from git and downloads a checkpoint → **Internet On**. An Internet-off submission needs
+#   the weights pre-downloaded as a Kaggle Dataset.
+# - UTMOSv2 is a **multi-fold ensemble** → slower than UTMOS. If it takes too long, lower `PROBE_N` or score incrementally (results are cached).
+# - License: UTMOSv2 **MIT** · UTMOS BSD-3. Record this in `docs/12_system_description.md`.
+# - Log config → results → observations in `docs/04_experiments_log.md` (section exp09a).
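
The probe decides between the two models with a fixed ±0.01 SRCC margin on one random sample. One way to judge whether an observed delta is larger than sampling noise is a bootstrap over utterances, sketched below. This is an assumption-laden illustration and not part of the pipeline: `bootstrap_srcc_delta` is a hypothetical helper, the 95% percentile interval is one choice among several, and the code is pure standard library rather than SciPy.

```python
import math
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def bootstrap_srcc_delta(gold, pred_a, pred_b, n_boot=1000, seed=42):
    """95% percentile interval for SRCC(pred_b, gold) - SRCC(pred_a, gold).

    Utterances are resampled with replacement; both models are always
    evaluated on the same resample so the comparison stays paired.
    """
    rng = random.Random(seed)
    n = len(gold)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if min(len(set(g)), len(set(a)), len(set(b))) < 2:
            continue  # constant resample: correlation is undefined
        deltas.append(spearman(b, g) - spearman(a, g))
    if not deltas:
        raise ValueError("no usable bootstrap resamples")
    deltas.sort()
    lo = deltas[int(0.025 * (len(deltas) - 1))]
    hi = deltas[int(0.975 * (len(deltas) - 1))]
    return lo, hi
```

With the notebook's variables this would be called as `bootstrap_srcc_delta(list(y_gold), list(p_v1), list(p_v2))`. An interval that excludes zero suggests the ranking difference is not just an artifact of which `PROBE_N` wavs were drawn. The pure-Python loop is slow for 2,000 samples and 1,000 resamples, so a smaller `n_boot` may be more practical.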

track2/exp10_finetune_audeering.ipynb ADDED

	@@ -0,0 +1,691 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "678096c6",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2 - exp10 (fine-tune AUDEERING separately + ensemble VAD with exp08) - Kaggle T4\n",
+    "\n",
+    "**Idea (Option A, safe for a T4):** instead of cramming 2 large backbones into 1 model (prone to OOM),\n",
+    "we fine-tune **audeering wav2vec2-large** SEPARATELY (1 backbone → fits a T4), then **ensemble the VAD columns**\n",
+    "with exp08 (fine-tuned WavLM). Only 1 backbone is in VRAM at a time, so the two-backbone OOM is avoided.\n",
+    "\n",
+    "```\n",
+    " [exp08]  WavLM fine-tune  ─► VAD_wavlm  ┐\n",
+    "                                         ├─ average ─► final VAD (expected to beat either alone)\n",
+    " [exp10]  audeering fine-tune ─► VAD_aud ┘\n",
+    "```\n",
+    "audeering is a **dimensional model (VAD specialist)** → fine-tune it to complement exp08 on VAD.\n",
+    "\n",
+    "**How to run:** GPU T4 + Internet On → edit the slug in cell 0 → Run All. Use `LIMIT_TRAIN=300` on the first run.\n",
+    "To ensemble: Add Input the exp08 answer.txt → point `EXP08_ANSWER` at it."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "291f23e3",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "41d619c3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/datasets/minhtoan2/vmc2026-track2-full\"   # << EDIT the slug\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "\n",
+    "# QMOS is borrowed from exp07 (0.548); VAD is ensembled with the exp08 answer.txt.\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"     # << (optional) borrow QMOS; if missing → UTMOSv2\n",
+    "EXP08_ANSWER = \"/kaggle/input/exp08-answer/answer.txt\"     # << (optional) for the VAD ENSEMBLE; if missing → only the audeering answer is written\n",
+    "\n",
+    "# ── Fine-tune audeering (1 backbone) ─────────────────────────────────────────\n",
+    "DEVICE              = \"cuda\"\n",
+    "SR                  = 16000\n",
+    "MAX_SECONDS         = 8\n",
+    "UNFREEZE_TOP_LAYERS = 6           # number of top audeering encoder layers to unfreeze (a T4 easily handles 1 backbone)\n",
+    "TRUNK_HIDDEN        = 512\n",
+    "HEAD_HIDDEN         = 128\n",
+    "DROPOUT             = 0.3\n",
+    "LR_BACKBONE         = 1e-5\n",
+    "LR_HEAD             = 1e-3\n",
+    "WEIGHT_DECAY        = 1e-5\n",
+    "EPOCHS              = 12\n",
+    "PATIENCE            = 4\n",
+    "BATCH               = 4\n",
+    "ACCUM               = 8\n",
+    "VAL_FRAC            = 0.10\n",
+    "SEED                = 42\n",
+    "USE_AMP             = True\n",
+    "USE_GRAD_CKPT       = True\n",
+    "USE_UNCERTAINTY     = True\n",
+    "\n",
+    "# Ensemble: which columns take the MEAN of exp08 and exp10; other columns are kept from exp08.\n",
+    "ENSEMBLE_COLS       = [\"VAL\", \"ARO\", \"DOM\"]   # audeering is strong on VAD → ensemble VAD. Add \"EMOS\" if desired.\n",
+    "\n",
+    "LIMIT_TRAIN         = 300         # << FIRST RUN 300; real run None\n",
+    "LIMIT_DEV           = 20          # << FIRST RUN 20; real run None\n",
+    "\n",
+    "EXP08 = {\"emos\": 0.811, \"cat_err\": 0.133, \"val\": 0.659, \"aro\": 0.793, \"dom\": 0.751}\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0adb988b",
+   "metadata": {},
+   "source": [
+    "## 1. Install"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1713d69b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"transformers\", \"huggingface_hub\", \"safetensors\", \"speechmos\",\n",
+    "            \"librosa\", \"soundfile\", \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4fcb8b30",
+   "metadata": {},
+   "source": [
+    "## 2. Load audeering wav2vec2-large as the FINE-TUNE backbone\n",
+    "Load the backbone by hand (avoids the `Wav2Vec2PreTrainedModel` subclass error in recent transformers) and then unfreeze the top layers."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f0a39dab",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "import numpy as np\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (very slow!)\")\n",
+    "\n",
+    "from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "aud = Wav2Vec2Model(aud_cfg)\n",
+    "try:\n",
+    "    _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "        hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "except Exception:\n",
+    "    _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "miss, unexp = aud.load_state_dict(bb_sd, strict=False)\n",
+    "print(f\"audeering backbone: {len(miss)} missing / {len(unexp)} unexpected keys (strict=False)\")\n",
+    "aud = aud.to(device)\n",
+    "AUD_DIM = int(aud.config.hidden_size)\n",
+    "\n",
+    "# Freeze everything, then unfreeze the top UNFREEZE_TOP_LAYERS encoder layers\n",
+    "for p in aud.parameters():\n",
+    "    p.requires_grad = False\n",
+    "enc_layers = aud.encoder.layers\n",
+    "n_layers = len(enc_layers)\n",
+    "for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:\n",
+    "    for p in layer.parameters():\n",
+    "        p.requires_grad = True\n",
+    "n_train = sum(p.numel() for p in aud.parameters() if p.requires_grad)\n",
+    "print(f\"audeering: {n_layers} layers · unfreezing {min(UNFREEZE_TOP_LAYERS, n_layers)} → {n_train/1e6:.1f}M trainable params (dim {AUD_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    aud.gradient_checkpointing_enable()\n",
+    "    if hasattr(aud, \"enable_input_require_grads\"):\n",
+    "        aud.enable_input_require_grads()\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = aud._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "def aud_embed(input_values, attn_mask):\n",
+    "    out = aud(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    return masked_mean(out, attn_mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "54e993b9",
+   "metadata": {},
+   "source": [
+    "## 3. Labels (aggregated by wavID), same as exp08"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "46cc0e42",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import librosa\n",
+    "import pandas as pd\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"eMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0e0bf0bb",
+   "metadata": {},
+   "source": [
+    "## 4. Dataset/loader (input_values via the audeering processor) + label normalization"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65768ef9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "train_stems = [s for s in train_df[\"wavID\"] if target_map.get(s) is not None]\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "\n",
+    "def _zfit(arr):\n",
+    "    a = np.asarray(arr, dtype=np.float32)\n",
+    "    return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)\n",
+    "\n",
+    "emos_mu, emos_sd = _zfit([lab.loc[s, \"emos\"] for s in train_stems])\n",
+    "if HAS_VAD:\n",
+    "    vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in [\"val\", \"aro\", \"dom\"]], dtype=np.float32)\n",
+    "    vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in [\"val\", \"aro\", \"dom\"]], dtype=np.float32)\n",
+    "else:\n",
+    "    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_iv(sid):\n",
+    "    \"\"\"Read wav → normalize with the audeering processor → input_values (1D float32).\"\"\"\n",
+    "    p = os.path.join(WAV_DIR, sid if str(sid).endswith(\".wav\") else str(sid) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    wave = wave[: MAX_SECONDS * SR]\n",
+    "    iv = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "    return np.asarray(iv, dtype=np.float32)\n",
+    "\n",
+    "class AudDataset(Dataset):\n",
+    "    def __init__(self, stems):\n",
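+    "        # Check file existence only; load_iv would decode every wav just to test for None.\n",
+    "        self.stems = [s for s in stems if os.path.exists(os.path.join(WAV_DIR, s if str(s).endswith(\".wav\") else str(s) + \".wav\"))]\n",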
+    "    def __len__(self):\n",
+    "        return len(self.stems)\n",
+    "    def __getitem__(self, i):\n",
+    "        s = self.stems[i]\n",
+    "        iv = load_iv(s)\n",
+    "        emos = (float(lab.loc[s, \"emos\"]) - emos_mu) / emos_sd\n",
+    "        if HAS_VAD:\n",
+    "            vad = (np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32) - vad_mu) / vad_sd\n",
+    "        else:\n",
+    "            vad = np.zeros(3, dtype=np.float32)\n",
+    "        cat = np.array([lab.loc[s, f\"cat{j}\"] for j in range(len(EMOTIONS5))], dtype=np.float32)\n",
+    "        return {\"iv\": iv, \"tgt\": onehot_target(target_map.get(s)),\n",
+    "                \"emos\": np.float32(emos), \"vad\": vad, \"cat\": cat,\n",
+    "                \"emos_raw\": np.float32(lab.loc[s, \"emos\"]),\n",
+    "                \"vad_raw\": np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32)}\n",
+    "\n",
+    "def collate(batch):\n",
+    "    L = max(len(b[\"iv\"]) for b in batch)\n",
+    "    ivs = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    mask = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    for i, b in enumerate(batch):\n",
+    "        ivs[i, : len(b[\"iv\"])] = b[\"iv\"]; mask[i, : len(b[\"iv\"])] = 1.0\n",
+    "    return {\n",
+    "        \"input_values\": torch.from_numpy(ivs), \"attn_mask\": torch.from_numpy(mask).long(),\n",
+    "        \"tgt\": torch.from_numpy(np.stack([b[\"tgt\"] for b in batch])),\n",
+    "        \"emos\": torch.from_numpy(np.stack([b[\"emos\"] for b in batch])).unsqueeze(1),\n",
+    "        \"vad\": torch.from_numpy(np.stack([b[\"vad\"] for b in batch])),\n",
+    "        \"cat\": torch.from_numpy(np.stack([b[\"cat\"] for b in batch])),\n",
+    "        \"emos_raw\": np.stack([b[\"emos_raw\"] for b in batch]),\n",
+    "        \"vad_raw\": np.stack([b[\"vad_raw\"] for b in batch]),\n",
+    "    }\n",
+    "\n",
+    "ds = AudDataset(train_stems)\n",
+    "print(\"Valid dataset:\", len(ds), \"wavs\")\n",
+    "tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)\n",
+    "tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)\n",
+    "va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "697d9ca3",
+   "metadata": {},
+   "source": [
+    "## 5. Heads + train loop (saves ft_audeering_full.pt at every new best)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8fe7ec40",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(AUD_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "bb_params = [p for p in aud.parameters() if p.requires_grad]\n",
+    "head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.AdamW([{\"params\": bb_params, \"lr\": LR_BACKBONE},\n",
+    "                         {\"params\": head_params, \"lr\": LR_HEAD}], weight_decay=WEIGHT_DECAY)\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()\n",
+    "\n",
+    "def forward_batch(b):\n",
+    "    feat = aud_embed(b[\"input_values\"].to(device), b[\"attn_mask\"].to(device))\n",
+    "    return heads(feat, b[\"tgt\"].to(device))\n",
+    "\n",
+    "def compute_loss(emos_p, cat_l, vad_p, b):\n",
+    "    L = {}\n",
+    "    L[\"emos\"] = mse(emos_p, b[\"emos\"].to(device))\n",
+    "    L[\"cat\"] = soft_ce(cat_l, b[\"cat\"].to(device))\n",
+    "    if HAS_VAD:\n",
+    "        vt = b[\"vad\"].to(device)\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L[\"aro\"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L[\"dom\"] = mse(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(L.values())\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def evaluate():\n",
+    "    aud.eval(); heads.eval()\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for b in va_loader:\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "        P[\"emos\"] += emos_p.float().cpu().numpy().ravel().tolist(); Y[\"emos\"] += b[\"emos_raw\"].tolist()\n",
+    "        vad_p = vad_p.float().cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t] += vad_p[:, j].tolist(); Y[t] += b[\"vad_raw\"][:, j].tolist()\n",
+    "        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b[\"cat\"])\n",
+    "    out = {}\n",
+    "    for t in [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else []):\n",
+    "        out[t] = spearmanr(P[t], Y[t]).correlation\n",
+    "    q = np.concatenate(catP); p = np.concatenate(catY)\n",
+    "    out[\"cat_err\"] = float(np.abs(q - p).sum(1).mean())\n",
+    "    return out\n",
+    "\n",
+    "def mean_srcc(m):\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "CKPT_PATH = os.path.join(OUT_DIR, \"ft_audeering_full.pt\")\n",
+    "def save_full_ckpt(state, val_emos=float(\"nan\")):\n",
+    "    torch.save({\"aud\": state[\"aud\"], \"heads\": state[\"heads\"],\n",
+    "                \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "                \"AUD_DIM\": AUD_DIM, \"UNFREEZE_TOP_LAYERS\": UNFREEZE_TOP_LAYERS,\n",
+    "                \"val_emos\": float(val_emos)}, CKPT_PATH)\n",
+    "\n",
+    "best, best_state, bad = -1e9, None, 0\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    aud.train(); heads.train()\n",
+    "    opt.zero_grad(); run = 0.0; nb = 0\n",
+    "    for step, b in enumerate(tqdm(tr_loader, desc=f\"epoch {ep}\")):\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM\n",
+    "        scaler.scale(loss).backward()\n",
+    "        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):  # also step on the last (partial) accumulation window\n",
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "        run += loss.item() * ACCUM; nb += 1\n",
+    "    m = evaluate(); sc = mean_srcc(m)\n",
+    "    msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "    print(f\"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc\n",
+    "        best_state = {\"aud\": {k: v.cpu().clone() for k, v in aud.state_dict().items()},\n",
+    "                      \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "        save_full_ckpt(best_state, m[\"emos\"])\n",
+    "        print(f\"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})\")\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stop at epoch {ep}.\"); break\n",
+    "\n",
+    "if best_state:\n",
+    "    aud.load_state_dict(best_state[\"aud\"]); heads.load_state_dict(best_state[\"heads\"])\n",
+    "final = evaluate()\n",
+    "print(\"\\n✅ Validation (internal), exp10 (fine-tune audeering):\")\n",
+    "print(f\"   EMOS={final['emos']:.4f}\", end=\"\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\" | VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} (exp08 {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})\")\n",
+    "else:\n",
+    "    print()\n",
+    "print(f\"   → vs exp08: audeering is {'stronger' if HAS_VAD and final['val'] > EXP08['val'] else 'weaker/on par'} on VAL. \"\n",
+    "      f\"The ensemble averages the 2 models.\")\n",
+    "save_full_ckpt(best_state if best_state else {\"aud\": aud.state_dict(), \"heads\": heads.state_dict()}, final[\"emos\"])\n",
+    "print(f\"✅ Saved {CKPT_PATH}. REMEMBER to Save Version!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4911ff48",
+   "metadata": {},
+   "source": [
+    "## 6. Predict on DEV → predictions + answer_audeering.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4ed9a022",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            for row in csv.DictReader(f):\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Borrowed QMOS from exp07: {len(d)//2} wavs\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer found → computing QMOS with UTMOSv2.\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if os.path.exists(wav):\n",
+    "            o = v2.predict(input_path=wav)\n",
+    "            qmos_map[n] = float(o[\"predicted_mos\"]) if isinstance(o, dict) else float(o)\n",
+    "    del v2; torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    iv = load_iv(sid)\n",
+    "    if iv is None:\n",
+    "        return None\n",
+    "    aud.eval(); heads.eval()\n",
+    "    ivt = torch.from_numpy(iv).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(iv)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        feat = aud_embed(ivt, am)\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "dev_pred = {}   # name -> (emos, cat5, vad3)\n",
+    "with open(os.path.join(OUT_DIR, \"answer_audeering.txt\"), \"w\") as f:\n",
+    "    f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "    for name in tqdm(dev_names, desc=\"answer_aud\"):\n",
+    "        sid = stem(name)\n",
+    "        pr = predict_emotion(sid)\n",
+    "        if pr is None:\n",
+    "            emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])\n",
+    "        else:\n",
+    "            emos, cat5, vad3 = pr\n",
+    "        dev_pred[name] = (emos, cat5, vad3)\n",
+    "        qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "        f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "print(\"Wrote answer_audeering.txt\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "72f8f313",
+   "metadata": {},
+   "source": [
+    "## 7. ENSEMBLE with exp08 → final answer.txt (average the VAD columns)\n",
+    "Use the exp08 answer.txt as the base; each column in `ENSEMBLE_COLS` becomes the mean of exp08 and exp10. All other columns keep the exp08 values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a33720f9",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "COL_IDX = {\"QMOS\": 1, \"EMOS\": 2, \"VAL\": 4, \"ARO\": 5, \"DOM\": 6}   # column positions in answer.txt\n",
+    "AUD_VAL = {\"EMOS\": lambda p: p[0], \"VAL\": lambda p: p[2][0], \"ARO\": lambda p: p[2][1], \"DOM\": lambda p: p[2][2]}\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "if EXP08_ANSWER and os.path.exists(EXP08_ANSWER):\n",
+    "    with open(EXP08_ANSWER) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header, body = rows[0], rows[1:]\n",
+    "    n_ens = 0\n",
+    "    with open(answer_path, \"w\") as f:\n",
+    "        f.write(\",\".join(header) + \"\\n\")\n",
+    "        for r in body:\n",
+    "            name = r[0]; sid = stem(name)\n",
+    "            pr = dev_pred.get(name) or dev_pred.get(sid)\n",
+    "            if pr is not None:\n",
+    "                for col in ENSEMBLE_COLS:\n",
+    "                    if col in COL_IDX and col in AUD_VAL:\n",
+    "                        v08 = float(r[COL_IDX[col]]); vaud = float(AUD_VAL[col](pr))\n",
+    "                        r[COL_IDX[col]] = f\"{0.5*(v08+vaud):.6g}\"\n",
+    "                n_ens += 1\n",
+    "            f.write(\",\".join(r) + \"\\n\")\n",
+    "    print(f\"✅ Ensemble {ENSEMBLE_COLS}: {n_ens} rows → {answer_path} (exp08 base + averaged with audeering)\")\n",
+    "else:\n",
+    "    print(\"ℹ️ No EXP08_ANSWER → answer.txt = answer_audeering.txt (audeering only, no ensemble).\")\n",
+    "    import shutil\n",
+    "    shutil.copy(os.path.join(OUT_DIR, \"answer_audeering.txt\"), answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5089b7fa",
+   "metadata": {},
+   "source": [
+    "## 8. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "66272528",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Line {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp10_ensemble.zip answer.txt && unzip -l submission_track2_exp10_ensemble.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp10_ensemble.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "994723d2",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Approach A (T4-safe):** fine-tune audeering SEPARATELY (1 backbone) → ensemble VAD with exp08 → NO OOM.\n",
+    "- **Read section 5:** are audeering's VAL/ARO/DOM ≥ exp08? If on par or better → the ensemble is worthwhile.\n",
+    "- **Ensemble (section 7):** averages VAL/ARO/DOM by default. Add \"EMOS\" to `ENSEMBLE_COLS` if audeering's EMOS is good.\n",
+    "- **Checkpoint:** `ft_audeering_full.pt` is saved at every new best (it survives a dead kernel). Save Version when done.\n",
+    "- QMOS is still borrowed from exp07 (0.548). Comparison: submit the ensemble answer.txt vs plain exp08 to see whether the ensemble nudges VAD up.\n",
+    "- Record config → results → remarks in `docs/04_experiments_log.md` (exp10)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
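
The column averaging from section 7 of the notebook above can be sketched in isolation. This is a minimal, stdlib-only illustration, assuming the `wav,QMOS,EMOS,CAT,VAL,ARO,DOM` header that section 6 writes; the function name `ensemble_rows` and the toy rows are hypothetical (made-up values, not results from this experiment), and it reads CSV text rather than the Kaggle paths used in the commit.

```python
import csv
import io

# Column positions, assuming the header written in section 6:
# wav,QMOS,EMOS,CAT,VAL,ARO,DOM
COL_IDX = {"QMOS": 1, "EMOS": 2, "VAL": 4, "ARO": 5, "DOM": 6}


def ensemble_rows(base_csv, other_csv, cols=("VAL", "ARO", "DOM")):
    """Average `cols` between two answer files given as CSV text.

    Rows are matched on the first column (wav name). Rows that have no
    counterpart in `other_csv` pass through unchanged, as do all columns
    not listed in `cols`.
    """
    other = {r[0]: r for r in list(csv.reader(io.StringIO(other_csv)))[1:]}
    rows = list(csv.reader(io.StringIO(base_csv)))
    out = [rows[0]]  # keep the base header as-is
    for r in rows[1:]:
        r = list(r)
        match = other.get(r[0])
        if match is not None:
            for c in cols:
                i = COL_IDX[c]
                r[i] = f"{0.5 * (float(r[i]) + float(match[i])):.6g}"
        out.append(r)
    return out


# Toy inputs (made-up numbers, for illustration only).
base = (
    "wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n"
    "a.wav,3.5,4,happy:1,2,3,4\n"
    "b.wav,3,3,sad:1,1,1,1\n"
)
aud = "wav,QMOS,EMOS,CAT,VAL,ARO,DOM\na.wav,9,9,happy:1,4,5,2\n"

merged = ensemble_rows(base, aud)
for row in merged:
    print(",".join(row))
```

Rows without a counterpart fall through untouched, which mirrors how the notebook keeps the exp08 values when no audeering prediction exists for a file. QMOS and EMOS stay at the base values unless they are added to `cols`.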

track2/exp10_finetune_audeering_pipeline.py ADDED

	@@ -0,0 +1,553 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp10 (fine-tune AUDEERING separately + ensemble VAD with exp08), Kaggle T4
+#
+# **Idea (Approach A, safe for T4):** instead of cramming 2 large backbones into 1 model (prone to OOM),
+# we fine-tune **audeering wav2vec2-large** SEPARATELY (1 backbone → fits on a T4), then **ensemble the VAD columns**
+# with exp08 (fine-tuned WavLM). Only 1 backbone is in VRAM at a time → no OOM.
+#
+# ```
+#  [exp08]  WavLM fine-tune  ─► VAD_wavlm  ┐
+#                                          ├─ average ─► final VAD (expected to beat either alone)
+#  [exp10]  audeering fine-tune ─► VAD_aud ┘
+# ```
+# audeering is natively a **dimensional model (specialized in VAD)** → fine-tune it to complement exp08 on VAD.
+#
+# **How to run:** GPU T4 + Internet On → edit the slug in cell 0 → Run All. For the first run, use `LIMIT_TRAIN=300`.
+# To ensemble: Add Input with the exp08 answer.txt → point `EXP08_ANSWER` at it.
+# %% [markdown]
+# ## 0. Configuration
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/datasets/minhtoan2/vmc2026-track2-full"   # << EDIT the slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+# QMOS is borrowed from exp07 (0.548); VAD is ensembled with the exp08 answer.txt.
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"     # << (optional) borrow QMOS; if absent → UTMOSv2
+EXP08_ANSWER = "/kaggle/input/exp08-answer/answer.txt"     # << (optional) for the VAD ENSEMBLE; if absent → audeering-only answer
+# ── Fine-tune audeering (1 backbone) ─────────────────────────────────────────
+DEVICE              = "cuda"
+SR                  = 16000
+MAX_SECONDS         = 8
+UNFREEZE_TOP_LAYERS = 6           # number of audeering encoder layers to unfreeze (a T4 easily handles 1 backbone)
+TRUNK_HIDDEN        = 512
+HEAD_HIDDEN         = 128
+DROPOUT             = 0.3
+LR_BACKBONE         = 1e-5
+LR_HEAD             = 1e-3
+WEIGHT_DECAY        = 1e-5
+EPOCHS              = 12
+PATIENCE            = 4
+BATCH               = 4
+ACCUM               = 8
+VAL_FRAC            = 0.10
+SEED                = 42
+USE_AMP             = True
+USE_GRAD_CKPT       = True
+USE_UNCERTAINTY     = True
+# Ensemble: which columns take the MEAN of exp08 and exp10; the other columns are kept from exp08.
+ENSEMBLE_COLS       = ["VAL", "ARO", "DOM"]   # audeering is strong on VAD → ensemble VAD. Add "EMOS" if desired.
+LIMIT_TRAIN         = 300         # << FIRST RUN 300; real run None
+LIMIT_DEV           = 20          # << FIRST RUN 20; real run None
+EXP08 = {"emos": 0.811, "cat_err": 0.133, "val": 0.659, "aro": 0.793, "dom": 0.751}
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Installation
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("transformers", "huggingface_hub", "safetensors", "speechmos",
+            "librosa", "soundfile", "scipy", "scikit-learn", "pandas", "tqdm")
+# %% [markdown]
+# ## 2. Load audeering wav2vec2-large as the FINE-TUNE backbone
+# Load the backbone by hand (to avoid the `Wav2Vec2PreTrainedModel` subclass error in newer transformers), then unfreeze the top layers.
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (very slow!)")
+from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+from huggingface_hub import hf_hub_download
+AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+aud = Wav2Vec2Model(aud_cfg)
+try:
+    _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+        hf_hub_download(AUD_NAME, "model.safetensors"))
+except Exception:
+    _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+miss, unexp = aud.load_state_dict(bb_sd, strict=False)
+print(f"audeering backbone: {len(miss)} missing / {len(unexp)} unexpected keys (strict=False)")
+aud = aud.to(device)
+AUD_DIM = int(aud.config.hidden_size)
+# Freeze everything, then unfreeze the top UNFREEZE_TOP_LAYERS encoder layers
+for p in aud.parameters():
+    p.requires_grad = False
+enc_layers = aud.encoder.layers
+n_layers = len(enc_layers)
+for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:
+    for p in layer.parameters():
+        p.requires_grad = True
+n_train = sum(p.numel() for p in aud.parameters() if p.requires_grad)
+print(f"audeering: {n_layers} layers · unfroze {min(UNFREEZE_TOP_LAYERS, n_layers)} → {n_train/1e6:.1f}M trainable params (dim {AUD_DIM})")
+if USE_GRAD_CKPT:
+    aud.gradient_checkpointing_enable()
+    if hasattr(aud, "enable_input_require_grads"):
+        aud.enable_input_require_grads()
+def masked_mean(hidden, attn_mask):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = aud._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+def aud_embed(input_values, attn_mask):
+    out = aud(input_values, attention_mask=attn_mask).last_hidden_state
+    return masked_mean(out, attn_mask)
+# %% [markdown]
+# ## 3. Labels (aggregated by wavID), same as exp08
+# %%
+import librosa
+import pandas as pd
+from tqdm.auto import tqdm
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"eMOS column not found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}")
+# %% [markdown]
+# ## 4. Dataset/loader (input_values via the audeering processor) + label normalization
+# %%
+from torch.utils.data import Dataset, DataLoader
+from sklearn.model_selection import train_test_split
+train_stems = [s for s in train_df["wavID"] if target_map.get(s) is not None]
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+lab = train_df.set_index("wavID")
+def _zfit(arr):
+    a = np.asarray(arr, dtype=np.float32)
+    return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)
+emos_mu, emos_sd = _zfit([lab.loc[s, "emos"] for s in train_stems])
+if HAS_VAD:
+    vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in ["val", "aro", "dom"]], dtype=np.float32)
+    vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in ["val", "aro", "dom"]], dtype=np.float32)
+else:
+    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+def load_iv(sid):
+    """Read a wav → normalize it with the audeering processor → input_values (1D float32)."""
+    p = os.path.join(WAV_DIR, sid if str(sid).endswith(".wav") else str(sid) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    wave = wave[: MAX_SECONDS * SR]
+    iv = aud_proc(wave, sampling_rate=SR).input_values[0]
+    return np.asarray(iv, dtype=np.float32)
+class AudDataset(Dataset):
+    def __init__(self, stems):
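+        # Only check that the file exists; decoding every wav here would make construction very slow.
+        self.stems = [s for s in stems
+                      if os.path.exists(os.path.join(WAV_DIR, s if str(s).endswith(".wav") else str(s) + ".wav"))]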
+    def __len__(self):
+        return len(self.stems)
+    def __getitem__(self, i):
+        s = self.stems[i]
+        iv = load_iv(s)
+        emos = (float(lab.loc[s, "emos"]) - emos_mu) / emos_sd
+        if HAS_VAD:
+            vad = (np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32) - vad_mu) / vad_sd
+        else:
+            vad = np.zeros(3, dtype=np.float32)
+        cat = np.array([lab.loc[s, f"cat{j}"] for j in range(len(EMOTIONS5))], dtype=np.float32)
+        return {"iv": iv, "tgt": onehot_target(target_map.get(s)),
+                "emos": np.float32(emos), "vad": vad, "cat": cat,
+                "emos_raw": np.float32(lab.loc[s, "emos"]),
+                "vad_raw": np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32)}
+def collate(batch):
+    L = max(len(b["iv"]) for b in batch)
+    ivs = np.zeros((len(batch), L), dtype=np.float32)
+    mask = np.zeros((len(batch), L), dtype=np.float32)
+    for i, b in enumerate(batch):
+        ivs[i, : len(b["iv"])] = b["iv"]; mask[i, : len(b["iv"])] = 1.0
+    return {
+        "input_values": torch.from_numpy(ivs), "attn_mask": torch.from_numpy(mask).long(),
+        "tgt": torch.from_numpy(np.stack([b["tgt"] for b in batch])),
+        "emos": torch.from_numpy(np.stack([b["emos"] for b in batch])).unsqueeze(1),
+        "vad": torch.from_numpy(np.stack([b["vad"] for b in batch])),
+        "cat": torch.from_numpy(np.stack([b["cat"] for b in batch])),
+        "emos_raw": np.stack([b["emos_raw"] for b in batch]),
+        "vad_raw": np.stack([b["vad_raw"] for b in batch]),
+    }
+ds = AudDataset(train_stems)
+print("Valid dataset:", len(ds), "wavs")
+tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)
+tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)
+va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)
+# %% [markdown]
+# ## 5. Heads + train loop (saves ft_audeering_full.pt on every new best)
+# %%
+from scipy.stats import spearmanr
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(AUD_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+bb_params = [p for p in aud.parameters() if p.requires_grad]
+head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.AdamW([{"params": bb_params, "lr": LR_BACKBONE},
+                         {"params": head_params, "lr": LR_HEAD}], weight_decay=WEIGHT_DECAY)
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()
+def forward_batch(b):
+    feat = aud_embed(b["input_values"].to(device), b["attn_mask"].to(device))
+    return heads(feat, b["tgt"].to(device))
+def compute_loss(emos_p, cat_l, vad_p, b):
+    L = {}
+    L["emos"] = mse(emos_p, b["emos"].to(device))
+    L["cat"] = soft_ce(cat_l, b["cat"].to(device))
+    if HAS_VAD:
+        vt = b["vad"].to(device)
+        L["val"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L["aro"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L["dom"] = mse(vad_p[:, 2:3], vt[:, 2:3])
+    else:
+        z = torch.zeros((), device=device); L["val"] = L["aro"] = L["dom"] = z
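+    if USE_UNCERTAINTY:
+        # Without VAD labels the VAD losses are constant zero, so their log_var would drift unboundedly; skip them.
+        active = TASKS if HAS_VAD else ["emos", "cat"]
+        return sum(torch.exp(-log_var[TASKS.index(t)]) * L[t] + log_var[TASKS.index(t)] for t in active)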
+    return sum(L.values())
+@torch.no_grad()
+def evaluate():
+    aud.eval(); heads.eval()
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for b in va_loader:
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+        P["emos"] += emos_p.float().cpu().numpy().ravel().tolist(); Y["emos"] += b["emos_raw"].tolist()
+        vad_p = vad_p.float().cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t] += vad_p[:, j].tolist(); Y[t] += b["vad_raw"][:, j].tolist()
+        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b["cat"])
+    out = {}
+    for t in ["emos"] + (["val", "aro", "dom"] if HAS_VAD else []):
+        out[t] = spearmanr(P[t], Y[t]).correlation
+    q = np.concatenate(catP); p = np.concatenate(catY)
+    out["cat_err"] = float(np.abs(q - p).sum(1).mean())
+    return out
+def mean_srcc(m):
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+CKPT_PATH = os.path.join(OUT_DIR, "ft_audeering_full.pt")
+def save_full_ckpt(state, val_emos=float("nan")):
+    torch.save({"aud": state["aud"], "heads": state["heads"],
+                "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+                "AUD_DIM": AUD_DIM, "UNFREEZE_TOP_LAYERS": UNFREEZE_TOP_LAYERS,
+                "val_emos": float(val_emos)}, CKPT_PATH)
+best, best_state, bad = -1e9, None, 0
+for ep in range(1, EPOCHS + 1):
+    aud.train(); heads.train()
+    opt.zero_grad(); run = 0.0; nb = 0
+    for step, b in enumerate(tqdm(tr_loader, desc=f"epoch {ep}")):
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM
+        scaler.scale(loss).backward()
+        if (step + 1) % ACCUM == 0:
+            scaler.step(opt); scaler.update(); opt.zero_grad()
+        run += loss.item() * ACCUM; nb += 1
+    m = evaluate(); sc = mean_srcc(m)
+    msg = " ".join(f"{k}={m[k]:.3f}" for k in ["emos", "val", "aro", "dom"] if k in m)
+    print(f"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc
+        best_state = {"aud": {k: v.cpu().clone() for k, v in aud.state_dict().items()},
+                      "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+        save_full_ckpt(best_state, m["emos"])
+        print(f"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})")
+        bad = 0
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop at epoch {ep}."); break
+if best_state:
+    aud.load_state_dict(best_state["aud"]); heads.load_state_dict(best_state["heads"])
+final = evaluate()
+print("\n✅ VAL (internal), exp10 (fine-tune audeering):")
+print(f"   EMOS={final['emos']:.4f}", end="")
+if HAS_VAD:
+    print(f" | VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} (exp08 {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})")
+else:
+    print()
+print(f"   → vs exp08: audeering is {'stronger' if HAS_VAD and final['val'] > EXP08['val'] else 'weaker/on par'} on VAL. "
+      f"The ensemble will average the two models.")
+save_full_ckpt(best_state if best_state else {"aud": aud.state_dict(), "heads": heads.state_dict()}, final["emos"])
+print(f"✅ Saved {CKPT_PATH}. REMEMBER to Save Version!")
+# %% [markdown]
+# ## 6. Predict DEV → predictions + answer_audeering.txt
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+print("DEV:", len(dev_names), "samples")
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            for row in csv.DictReader(f):
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Borrowed QMOS from exp07: {len(d)//2} wavs")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ No exp07 answer available → computing QMOS with UTMOSv2.")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if os.path.exists(wav):
+            o = v2.predict(input_path=wav)
+            qmos_map[n] = float(o["predicted_mos"]) if isinstance(o, dict) else float(o)
+    del v2
+    if device == "cuda":
+        torch.cuda.empty_cache()
+@torch.no_grad()
+def predict_emotion(sid):
+    iv = load_iv(sid)
+    if iv is None:
+        return None
+    aud.eval(); heads.eval()
+    ivt = torch.from_numpy(iv).unsqueeze(0).to(device)
+    am = torch.ones((1, len(iv)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        feat = aud_embed(ivt, am)
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+dev_pred = {}   # name -> (emos, cat5, vad3)
+with open(os.path.join(OUT_DIR, "answer_audeering.txt"), "w") as f:
+    f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+    for name in tqdm(dev_names, desc="answer_aud"):
+        sid = stem(name)
+        pr = predict_emotion(sid)
+        if pr is None:
+            emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])
+        else:
+            emos, cat5, vad3 = pr
+        dev_pred[name] = (emos, cat5, vad3)
+        qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+        f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+print("Wrote answer_audeering.txt")
+# %% [markdown]
+# ## 7. ENSEMBLE with exp08 → final answer.txt (average the VAD columns)
+# Use exp08's answer.txt as the base; each column in `ENSEMBLE_COLS` becomes the mean of exp08 and exp10. All other columns keep exp08's values.
+# %%
+import csv
+COL_IDX = {"QMOS": 1, "EMOS": 2, "VAL": 4, "ARO": 5, "DOM": 6}   # column positions in answer.txt
+AUD_VAL = {"EMOS": lambda p: p[0], "VAL": lambda p: p[2][0], "ARO": lambda p: p[2][1], "DOM": lambda p: p[2][2]}
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+if EXP08_ANSWER and os.path.exists(EXP08_ANSWER):
+    with open(EXP08_ANSWER) as f:
+        rows = list(csv.reader(f))
+    header, body = rows[0], rows[1:]
+    n_ens = 0
+    with open(answer_path, "w") as f:
+        f.write(",".join(header) + "\n")
+        for r in body:
+            name = r[0]; sid = stem(name)
+            pr = dev_pred.get(name) or dev_pred.get(sid)
+            if pr is not None:
+                for col in ENSEMBLE_COLS:
+                    if col in COL_IDX and col in AUD_VAL:
+                        v08 = float(r[COL_IDX[col]]); vaud = float(AUD_VAL[col](pr))
+                        r[COL_IDX[col]] = f"{0.5*(v08+vaud):.6g}"
+                n_ens += 1
+            f.write(",".join(r) + "\n")
+    print(f"✅ Ensemble {ENSEMBLE_COLS}: {n_ens} rows → {answer_path} (exp08 base + audeering average)")
+else:
+    print("ℹ️ No EXP08_ANSWER → answer.txt = answer_audeering.txt (audeering only, not ensembled).")
+    import shutil
+    shutil.copy(os.path.join(OUT_DIR, "answer_audeering.txt"), answer_path)
+# %% [markdown]
+# ## 8. Validate + zip
+# %%
+def validate(path):
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp10_ensemble.zip answer.txt && unzip -l submission_track2_exp10_ensemble.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp10_ensemble.zip"))
+# %% [markdown]
+# ## Notes
+# - **Approach A (T4-safe):** fine-tune audeering SEPARATELY (1 backbone) → ensemble VAD with exp08 → NO OOM.
+# - **Read section 5:** is audeering's VAL/ARO/DOM ≥ exp08? If it is on par or better → the ensemble is worthwhile.
+# - **Ensemble (section 7):** averages VAL/ARO/DOM by default. Add "EMOS" to `ENSEMBLE_COLS` if audeering's EMOS is good.
+# - **Checkpoint:** `ft_audeering_full.pt` is saved on every new best (it survives a dead kernel). Save Version when done.
+# - QMOS is still borrowed from exp07 (0.548). Comparison: submit the ensemble answer.txt vs. plain exp08 to see whether the ensemble nudges VAD up.
+# - Log config → results → remarks in `docs/04_experiments_log.md` (exp10).
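
The notes above ask whether audeering's VAL/ARO/DOM validation scores match or beat exp08 before a column is ensembled. A minimal sketch of that per-column check follows. The helper `pick_ensemble_cols` and its `margin` tolerance are hypothetical and not part of the notebook; the baseline numbers are copied from the `EXP08` dict in the exp11 config, and the example scores are made up for illustration.

```python
# Hypothetical helper (not in the notebook): decide which columns are worth ensembling.
EXP08 = {"emos": 0.811, "val": 0.659, "aro": 0.793, "dom": 0.751}  # baseline SRCC values


def pick_ensemble_cols(final, baseline=EXP08, margin=0.02):
    """Return upper-case column names whose validation SRCC is within
    `margin` of the baseline or better. `margin` is an assumed tolerance
    for "on par"; it is not a value taken from the experiments."""
    cols = []
    for key in ("emos", "val", "aro", "dom"):
        score = final.get(key)
        if score is None or score != score:  # missing or NaN → skip this column
            continue
        if score >= baseline[key] - margin:
            cols.append(key.upper())
    return cols


# Made-up validation scores, for illustration only.
example_final = {"emos": 0.70, "val": 0.66, "aro": 0.78, "dom": 0.70}
print(pick_ensemble_cols(example_final))
```

The returned list could be assigned to `ENSEMBLE_COLS` before section 7 runs; whether a fixed tolerance is the right criterion is a judgment call, and comparing submissions remains the real test.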

track2/exp11_finetune_joint.ipynb ADDED

	@@ -0,0 +1,805 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "a2dce1b4",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp11 (JOINT FINE-TUNE of WavLM + audeering, single FUSION model), Kaggle T4\n",
+    "\n",
+    "**Difference from exp08:** exp08 fine-tunes only WavLM and keeps audeering **frozen** (cached features). exp11 **UNFREEZES BOTH**\n",
+    "backbones and fuses their features **inside a single model** → both learn jointly for the 2026 emotion MOS task.\n",
+    "\n",
+    "```\n",
+    " wav ─┬─► WavLM-large   (warm-start exp08, TRAINABLE: top N layers tuned) ─► pool ─► emb_wavlm ┐\n",
+    "      └─► audeering MSP (TRAINABLE: top N layers tuned) ─► pool ─► [emb_aud(1024) | vad3] ──────┼─► TRUNK ─┬─► EMOS (+target)\n",
+    "                                                                                                ┘          ├─► CAT (5)\n",
+    "                                                                                                           └─► VAD (3)\n",
+    " QMOS: NOT trained here → borrow the QMOS column from exp07 (0.548) or use UTMOSv2.\n",
+    "```\n",
+    "\n",
+    "## Why \"feature fusion + fine-tune both\" (vs. the exp10 ensemble)\n",
+    "- **exp10 = ensemble:** 2 SEPARATE models → VAD columns averaged at the answer-file level. Safe, but the two models never \"talk\" to each other.\n",
+    "- **exp11 = fusion:** 1 model, 2 backbones fused INSIDE it → the trunk learns to combine both views (WavLM categorical +\n",
+    "  audeering dimensional) → expected to be stronger, provided it does not OOM or overfit.\n",
+    "\n",
+    "## ⚠️ TRADE-OFFS TO KNOW: this is the HEAVIEST configuration (2 large backbones both receiving gradients)\n",
+    "- **High OOM risk on a T4 (16GB).** Every memory-saving measure is already enabled: `BATCH=1` + grad-accum,\n",
+    "  gradient checkpointing on BOTH backbones, AMP fp16, `MAX_SECONDS=6`, FEW unfrozen layers (default 4 per backbone).\n",
+    "- If it still OOMs: lower `UNFREEZE_WAVLM`/`UNFREEZE_AUD` → 2, lower `MAX_SECONDS` → 5, raise `ACCUM`.\n",
+    "- **Slow and burns GPU hours** (forward+backward through 2 backbones, nothing can be cached). **THE FIRST RUN MUST USE `LIMIT_TRAIN=300`,\n",
+    "  `LIMIT_DEV=20`** to get the pipeline running smoothly, and only then switch to `None`.\n",
+    "- **Safety net:** do not waste submission attempts; submit only when exp11 beats exp08 (0.811) ON THE INTERNAL VAL split.\n",
+    "\n",
+    "**How to run:** GPU **T4** + Internet **On** → Add Input (data + exp08 checkpoint + [optional] exp07 answer) →\n",
+    "edit the slug in cell 0 → Run All. Log config → results → remarks in `docs/04_experiments_log.md` (exp11)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6d884a7",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6eca25f7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/datasets/minhtoan2/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug to match Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "# ── Warm-start / RESUME: point to one of two checkpoint types ────────────────\n",
+    "#   • ft_emotion_full_20epoch.pt (exp08): has 'wavlm'+'heads' → WARM-START (audeering from the original pretrained weights).\n",
+    "#   • ft_joint_full.pt (exp11): also has 'aud'+'aud_head'     → FULL RESUME (restores both fine-tuned backbones).\n",
+    "# The notebook AUTO-detects the mode from the checkpoint keys. Set to \"\" to train WavLM from plain SAILER weights.\n",
+    "WARMSTART_CKPT = \"/kaggle/input/ft-joint-full/ft_joint_full.pt\"   # << exp08 ckpt (warm-start) OR exp11 ckpt (resume)\n",
+    "\n",
+    "# Borrow the QMOS column from exp07 (0.548). If unavailable → UTMOSv2.\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"   # << (optional) exp07 answer.txt; if missing → UTMOSv2\n",
+    "\n",
+    "OUT_DIR = \"/kaggle/working\"\n",
+    "\n",
+    "# ── Fine-tune / hyperparameters (HEAVY CONFIG, already tuned for T4) ──────────\n",
+    "DEVICE          = \"cuda\"\n",
+    "SR              = 16000\n",
+    "MAX_SECONDS     = 6            # ↓ from exp08's 8 to save VRAM (2 backbones)\n",
+    "UNFREEZE_WAVLM  = 4            # number of WavLM encoder layers to unfreeze (OOM → 2)\n",
+    "UNFREEZE_AUD    = 4            # number of audeering encoder layers to unfreeze (OOM → 2)\n",
+    "TRUNK_HIDDEN    = 512          # MUST match the exp08 checkpoint when warm-starting the heads\n",
+    "HEAD_HIDDEN     = 128          # MUST match the exp08 checkpoint\n",
+    "DROPOUT         = 0.3\n",
+    "LR_BACKBONE     = 1e-5         # shared LR for both backbones\n",
+    "LR_HEAD         = 1e-3\n",
+    "RESUME_LR_SCALE = 1.0          # <1.0 LOWERS the LR on resume (e.g. 0.5 if val has plateaued); multiplies both LR groups\n",
+    "WEIGHT_DECAY    = 1e-5\n",
+    "EPOCHS          = 12\n",
+    "PATIENCE        = 4            # stop when val stops improving; ALWAYS keep the best\n",
+    "BATCH           = 1            # ⚠️ 2 large backbones → small batch\n",
+    "ACCUM           = 16           # effective batch = 16\n",
+    "VAL_FRAC        = 0.10\n",
+    "SEED            = 42\n",
+    "USE_AMP         = True\n",
+    "USE_GRAD_CKPT   = True\n",
+    "USE_UNCERTAINTY = True\n",
+    "\n",
+    "LIMIT_TRAIN     = 300          # << FIRST RUN 300; real run None\n",
+    "LIMIT_DEV       = 20           # << FIRST RUN 20; real run None\n",
+    "\n",
+    "EXP08 = {\"emos\": 0.811, \"cat_err\": 0.133, \"val\": 0.659, \"aro\": 0.793, \"dom\": 0.751}  # baseline for comparison\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)\n",
+    "print((\"  ✅ \" if (WARMSTART_CKPT and os.path.exists(WARMSTART_CKPT)) else \"  ⚠️ NOT found \") + str(WARMSTART_CKPT)\n",
+    "      + (\"  → warm-start\" if (WARMSTART_CKPT and os.path.exists(WARMSTART_CKPT)) else \"  → train from plain SAILER\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b15a7e01",
+   "metadata": {},
+   "source": [
+    "## 1. Install + fetch the SAILER code (to build the exact WavLM architecture)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2aacc36b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"transformers\", \"huggingface_hub\", \"safetensors\", \"loralib\", \"speechbrain\",\n",
+    "            \"speechmos\", \"librosa\", \"soundfile\", \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3021edeb",
+   "metadata": {},
+   "source": [
+    "## 2A. WavLM TRAINABLE (warm-start from SAILER / the exp08 checkpoint)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2459502",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "import numpy as np\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (very slow!)\")\n",
+    "\n",
+    "# Load the exp08 checkpoint (if any): takes 'wavlm', 'heads', and the normalization statistics\n",
+    "ckpt = None\n",
+    "if WARMSTART_CKPT and os.path.exists(WARMSTART_CKPT):\n",
+    "    ckpt = torch.load(WARMSTART_CKPT, map_location=\"cpu\", weights_only=False)\n",
+    "    print(\"✅ Loaded warm-start checkpoint:\", WARMSTART_CKPT, \"| keys:\", list(ckpt.keys()))\n",
+    "    if \"wavlm\" not in ckpt:\n",
+    "        print(\"   ⚠️ Checkpoint has NO 'wavlm' (heads only?) → still building WavLM from SAILER; heads are warm-started only if they match.\")\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ Built the WavLM backbone from the SAILER wrapper at '.{name}'\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Failed to load the SAILER wrapper:\", repr(e), \"→ falling back to plain WavLM.\")\n",
+    "\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: microsoft/wavlm-large.\")\n",
+    "\n",
+    "wavlm = wavlm.to(device)\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "wavlm.config.layerdrop = 0.0   # ⚠️ disable layerdrop when using gradient checkpointing (avoids CheckpointError)\n",
+    "\n",
+    "# Overwrite with the fine-tuned weights from the exp08 checkpoint (if any)\n",
+    "if ckpt is not None and \"wavlm\" in ckpt:\n",
+    "    miss, unexp = wavlm.load_state_dict(ckpt[\"wavlm\"], strict=False)\n",
+    "    print(f\"🔁 loaded wavlm from the exp08 checkpoint: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0).\")\n",
+    "\n",
+    "# Partial freeze: unfreeze only the top UNFREEZE_WAVLM layers\n",
+    "for p in wavlm.parameters():\n",
+    "    p.requires_grad = False\n",
+    "_wl = wavlm.encoder.layers\n",
+    "for layer in _wl[max(0, len(_wl) - UNFREEZE_WAVLM):]:\n",
+    "    for p in layer.parameters():\n",
+    "        p.requires_grad = True\n",
+    "print(f\"WavLM: {len(_wl)} layers · unfreezing {min(UNFREEZE_WAVLM, len(_wl))} → \"\n",
+    "      f\"{sum(p.numel() for p in wavlm.parameters() if p.requires_grad)/1e6:.1f}M trainable params (dim {WAVLM_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    wavlm.gradient_checkpointing_enable()\n",
+    "    if hasattr(wavlm, \"enable_input_require_grads\"):\n",
+    "        wavlm.enable_input_require_grads()\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask, model):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = model._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "def wavlm_embed(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    return masked_mean(out, attn_mask, wavlm)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "20a0c88d",
+   "metadata": {},
+   "source": [
+    "## 2B. audeering TRAINABLE (unfrozen, unlike exp08 where it is frozen)\n",
+    "Load the backbone manually plus the original dimensional head; unfreeze the top `UNFREEZE_AUD` layers. Fused feature = [hidden(1024) | vad3]."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9360d566",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "from huggingface_hub import hf_hub_download\n",
+    "\n",
+    "AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "aud = Wav2Vec2Model(aud_cfg)\n",
+    "try:\n",
+    "    _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "        hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "except Exception:\n",
+    "    _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "aud.load_state_dict(bb_sd, strict=False)\n",
+    "_hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _sd[\"classifier.out_proj.weight\"].shape[0]))\n",
+    "aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "aud = aud.to(device); aud_head = aud_head.to(device)\n",
+    "aud.config.layerdrop = 0.0   # ⚠️ disable layerdrop when using gradient checkpointing (avoids CheckpointError)\n",
+    "AUD_DIM = _hid + 3   # = 1027 (matches exp08 so the heads can be warm-started)\n",
+    "\n",
+    "# RESUME: if the checkpoint is ft_joint_full.pt (has 'aud') → restore the FINE-TUNED audeering (overwrites pretrained)\n",
+    "if ckpt is not None and \"aud\" in ckpt:\n",
+    "    amiss, aunexp = aud.load_state_dict(ckpt[\"aud\"], strict=False)\n",
+    "    print(f\"🔁 RESUME audeering from checkpoint: {len(amiss)} missing / {len(aunexp)} unexpected keys (expected ~0).\")\n",
+    "    if \"aud_head\" in ckpt:\n",
+    "        aud_head.load_state_dict(ckpt[\"aud_head\"]); print(\"🔁 RESUME aud_head from checkpoint.\")\n",
+    "else:\n",
+    "    print(\"ℹ️ Checkpoint has no 'aud' → audeering starts from the original pretrained weights (exp08 warm-start mode).\")\n",
+    "\n",
+    "# Partial freeze of audeering: unfreeze the top UNFREEZE_AUD layers; the dimensional head is always trainable\n",
+    "for p in aud.parameters():\n",
+    "    p.requires_grad = False\n",
+    "_al = aud.encoder.layers\n",
+    "for layer in _al[max(0, len(_al) - UNFREEZE_AUD):]:\n",
+    "    for p in layer.parameters():\n",
+    "        p.requires_grad = True\n",
+    "for p in aud_head.parameters():\n",
+    "    p.requires_grad = True\n",
+    "print(f\"audeering: {len(_al)} layers · unfreezing {min(UNFREEZE_AUD, len(_al))} → \"\n",
+    "      f\"{sum(p.numel() for p in aud.parameters() if p.requires_grad)/1e6:.1f}M trainable params (hidden {_hid}, fuse dim {AUD_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    aud.gradient_checkpointing_enable()\n",
+    "    if hasattr(aud, \"enable_input_require_grads\"):\n",
+    "        aud.enable_input_require_grads()\n",
+    "\n",
+    "def aud_embed(input_values, attn_mask):\n",
+    "    \"\"\"Return [hidden(1024) | vad3]; vad3 comes from the original dimensional head, in VAL,ARO,DOM order.\"\"\"\n",
+    "    h = masked_mean(aud(input_values, attention_mask=attn_mask).last_hidden_state, attn_mask, aud)\n",
+    "    out = aud_head(h)   # [B,3] in audeering's native order: (arousal, dominance, valence)\n",
+    "    vad = torch.stack([1 + 4 * out[:, 2], 1 + 4 * out[:, 0], 1 + 4 * out[:, 1]], dim=1)  # → VAL,ARO,DOM\n",
+    "    return torch.cat([h, vad], dim=1)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7550bb8e",
+   "metadata": {},
+   "source": [
+    "## 3. Read and merge labels by wavID (as in exp08)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "86f83d8a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import librosa\n",
+    "import pandas as pd\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"eMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "02e003af",
+   "metadata": {},
+   "source": [
+    "## 4. Dataset/loader: returns BOTH the raw waveform (for WavLM) and the audeering input_values\n",
+    "The two backbones need different inputs: WavLM takes the raw waveform; audeering takes the waveform normalized by its processor.\n",
+    "Both have the same length, so they share one sample-level attention mask."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f91d8e80",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "train_stems = [s for s in train_df[\"wavID\"] if target_map.get(s) is not None]\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "\n",
+    "# Normalization: take the statistics FROM the checkpoint when warm-starting (to match the trained heads); otherwise fit them on the data.\n",
+    "if ckpt is not None and \"emos_mu\" in ckpt:\n",
+    "    emos_mu = float(ckpt[\"emos_mu\"]); emos_sd = float(ckpt[\"emos_sd\"])\n",
+    "    vad_mu = np.asarray(ckpt[\"vad_mu\"], dtype=np.float32); vad_sd = np.asarray(ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "    print(f\"Normalization FROM ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}\")\n",
+    "else:\n",
+    "    def _zfit(a):\n",
+    "        a = np.asarray(a, dtype=np.float32); return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)\n",
+    "    emos_mu, emos_sd = _zfit([lab.loc[s, \"emos\"] for s in train_stems])\n",
+    "    if HAS_VAD:\n",
+    "        vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in [\"val\", \"aro\", \"dom\"]], np.float32)\n",
+    "        vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in [\"val\", \"aro\", \"dom\"]], np.float32)\n",
+    "    else:\n",
+    "        vad_mu = np.zeros(3, np.float32); vad_sd = np.ones(3, np.float32)\n",
+    "    print(f\"Normalization fit on data: emos μ={emos_mu:.3f} σ={emos_sd:.3f}\")\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_pair(sid):\n",
+    "    \"\"\"Return (raw_wave, iv_audeering) of equal length; None if the file is missing.\"\"\"\n",
+    "    p = os.path.join(WAV_DIR, sid if str(sid).endswith(\".wav\") else str(sid) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    wave = wave[: MAX_SECONDS * SR].astype(np.float32)\n",
+    "    iv = np.asarray(aud_proc(wave, sampling_rate=SR).input_values[0], dtype=np.float32)\n",
+    "    return wave, iv\n",
+    "\n",
+    "class JointDataset(Dataset):\n",
+    "    def __init__(self, stems):\n",
+    "        self.stems = [s for s in stems if load_pair(s) is not None]\n",
+    "    def __len__(self):\n",
+    "        return len(self.stems)\n",
+    "    def __getitem__(self, i):\n",
+    "        s = self.stems[i]\n",
+    "        wave, iv = load_pair(s)\n",
+    "        emos = (float(lab.loc[s, \"emos\"]) - emos_mu) / emos_sd\n",
+    "        if HAS_VAD:\n",
+    "            vad = (np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32) - vad_mu) / vad_sd\n",
+    "        else:\n",
+    "            vad = np.zeros(3, dtype=np.float32)\n",
+    "        cat = np.array([lab.loc[s, f\"cat{j}\"] for j in range(len(EMOTIONS5))], dtype=np.float32)\n",
+    "        return {\"wave\": wave, \"iv\": iv, \"tgt\": onehot_target(target_map.get(s)),\n",
+    "                \"emos\": np.float32(emos), \"vad\": vad, \"cat\": cat,\n",
+    "                \"emos_raw\": np.float32(lab.loc[s, \"emos\"]),\n",
+    "                \"vad_raw\": np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32)}\n",
+    "\n",
+    "def collate(batch):\n",
+    "    L = max(len(b[\"wave\"]) for b in batch)\n",
+    "    waves = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    ivs = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    mask = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    for i, b in enumerate(batch):\n",
+    "        n = len(b[\"wave\"])\n",
+    "        waves[i, :n] = b[\"wave\"]; ivs[i, :len(b[\"iv\"])] = b[\"iv\"]; mask[i, :n] = 1.0\n",
+    "    return {\n",
+    "        \"wave\": torch.from_numpy(waves), \"iv\": torch.from_numpy(ivs), \"attn_mask\": torch.from_numpy(mask).long(),\n",
+    "        \"tgt\": torch.from_numpy(np.stack([b[\"tgt\"] for b in batch])),\n",
+    "        \"emos\": torch.from_numpy(np.stack([b[\"emos\"] for b in batch])).unsqueeze(1),\n",
+    "        \"vad\": torch.from_numpy(np.stack([b[\"vad\"] for b in batch])),\n",
+    "        \"cat\": torch.from_numpy(np.stack([b[\"cat\"] for b in batch])),\n",
+    "        \"emos_raw\": np.stack([b[\"emos_raw\"] for b in batch]),\n",
+    "        \"vad_raw\": np.stack([b[\"vad_raw\"] for b in batch]),\n",
+    "    }\n",
+    "\n",
+    "ds = JointDataset(train_stems)\n",
+    "print(\"Valid dataset:\", len(ds), \"wav\")\n",
+    "tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)\n",
+    "tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)\n",
+    "va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0f85c871",
+   "metadata": {},
+   "source": [
+    "## 5. Heads (warm-started from exp08 if they match) + two-backbone optimizer + training loop"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b0a71176",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "TRUNK_IN = WAVLM_DIM + AUD_DIM\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "if ckpt is not None and \"heads\" in ckpt:\n",
+    "    hmiss, hunexp = heads.load_state_dict(ckpt[\"heads\"], strict=False)\n",
+    "    if len(hmiss) == 0 and len(hunexp) == 0:\n",
+    "        print(\"🔁 warm-start heads from exp08: EXACT match.\")\n",
+    "    else:\n",
+    "        print(f\"⚠️ exp08 heads mismatch ({len(hmiss)} missing/{len(hunexp)} unexpected) → TRUNK_IN may differ. Mismatched parts keep their fresh init.\")\n",
+    "print(f\"Trunk input = {TRUNK_IN} (wavlm {WAVLM_DIM} + aud {AUD_DIM})\")\n",
+    "\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "bb_params = [p for p in wavlm.parameters() if p.requires_grad] + \\\n",
+    "            [p for p in aud.parameters() if p.requires_grad] + list(aud_head.parameters())\n",
+    "head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.AdamW([{\"params\": bb_params, \"lr\": LR_BACKBONE * RESUME_LR_SCALE},\n",
+    "                         {\"params\": head_params, \"lr\": LR_HEAD * RESUME_LR_SCALE}], weight_decay=WEIGHT_DECAY)\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()\n",
+    "\n",
+    "def forward_batch(b):\n",
+    "    am = b[\"attn_mask\"].to(device)\n",
+    "    fw = wavlm_embed(b[\"wave\"].to(device), am)\n",
+    "    fa = aud_embed(b[\"iv\"].to(device), am)\n",
+    "    return heads(torch.cat([fw, fa], dim=1), b[\"tgt\"].to(device))\n",
+    "\n",
+    "def compute_loss(emos_p, cat_l, vad_p, b):\n",
+    "    L = {}\n",
+    "    L[\"emos\"] = mse(emos_p, b[\"emos\"].to(device))\n",
+    "    L[\"cat\"] = soft_ce(cat_l, b[\"cat\"].to(device))\n",
+    "    if HAS_VAD:\n",
+    "        vt = b[\"vad\"].to(device)\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L[\"aro\"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L[\"dom\"] = mse(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(L.values())\n",
+    "\n",
+    "def set_train(flag):\n",
+    "    wavlm.train(flag); aud.train(flag); aud_head.train(flag); heads.train(flag)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def evaluate():\n",
+    "    set_train(False)\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for b in va_loader:\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "        P[\"emos\"] += emos_p.float().cpu().numpy().ravel().tolist(); Y[\"emos\"] += b[\"emos_raw\"].tolist()\n",
+    "        vad_p = vad_p.float().cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t] += vad_p[:, j].tolist(); Y[t] += b[\"vad_raw\"][:, j].tolist()\n",
+    "        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b[\"cat\"])\n",
+    "    out = {}\n",
+    "    for t in [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else []):\n",
+    "        out[t] = spearmanr(P[t], Y[t]).correlation\n",
+    "    q = np.concatenate(catP); p = np.concatenate(catY)\n",
+    "    out[\"cat_err\"] = float(np.abs(q - p).sum(1).mean())\n",
+    "    return out\n",
+    "\n",
+    "def mean_srcc(m):\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "def snapshot():\n",
+    "    return {\"wavlm\": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},\n",
+    "            \"aud\": {k: v.cpu().clone() for k, v in aud.state_dict().items()},\n",
+    "            \"aud_head\": {k: v.cpu().clone() for k, v in aud_head.state_dict().items()},\n",
+    "            \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "\n",
+    "CKPT_PATH = os.path.join(OUT_DIR, \"ft_joint_full.pt\")\n",
+    "def save_full(state, val_emos=float(\"nan\")):\n",
+    "    torch.save({**state, \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "                \"WAVLM_DIM\": WAVLM_DIM, \"AUD_DIM\": AUD_DIM,\n",
+    "                \"UNFREEZE_WAVLM\": UNFREEZE_WAVLM, \"UNFREEZE_AUD\": UNFREEZE_AUD,\n",
+    "                \"val_emos\": float(val_emos)}, CKPT_PATH)\n",
+    "\n",
+    "# Initialize best from the current warm-start state → overwrite only when training improves on it\n",
+    "m0 = evaluate(); best = mean_srcc(m0); best_state = snapshot(); save_full(best_state, m0.get(\"emos\", float(\"nan\")))\n",
+    "print(f\"📍 Starting point (warm-start): mean SRCC = {best:.4f} | \"\n",
+    "      + \" \".join(f\"{k}={m0[k]:.3f}\" for k in ['emos','val','aro','dom'] if k in m0))\n",
+    "\n",
+    "bad = 0\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    set_train(True)\n",
+    "    opt.zero_grad(); run = 0.0; nb = 0\n",
+    "    for step, b in enumerate(tqdm(tr_loader, desc=f\"epoch {ep}\")):\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM\n",
+    "        scaler.scale(loss).backward()\n",
+    "        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):  # also flush the last partial accumulation window\n",
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "        run += loss.item() * ACCUM; nb += 1\n",
+    "    m = evaluate(); sc = mean_srcc(m)\n",
+    "    msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "    print(f\"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc; best_state = snapshot(); save_full(best_state, m[\"emos\"])\n",
+    "        print(f\"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})\"); bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stop at epoch {ep}.\"); break\n",
+    "\n",
+    "# Reload best\n",
+    "wavlm.load_state_dict(best_state[\"wavlm\"]); aud.load_state_dict(best_state[\"aud\"])\n",
+    "aud_head.load_state_dict(best_state[\"aud_head\"]); heads.load_state_dict(best_state[\"heads\"])\n",
+    "final = evaluate()\n",
+    "print(\"\\n✅ VAL (internal), exp11 (fine-tune BOTH + fusion):\")\n",
+    "print(f\"   EMOS={final['emos']:.4f} (exp08 {EXP08['emos']})\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} \"\n",
+    "          f\"(exp08 {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})\")\n",
+    "print(f\"   mean SRCC: warm-start {mean_srcc(m0):.4f} → exp11 {mean_srcc(final):.4f} \"\n",
+    "      + (\"🚀 improved\" if mean_srcc(final) > mean_srcc(m0) + 1e-4 else \"➖ no improvement\"))\n",
+    "save_full(best_state, final.get(\"emos\", float(\"nan\")))\n",
+    "print(\"Saved FULL:\", CKPT_PATH, \"→ REMEMBER to Save Version!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dcb57395",
+   "metadata": {},
+   "source": [
+    "## 6. Predict DEV → answer.txt (5 emotion columns from exp11; QMOS borrowed from exp07 / UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fcae1d4c",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            for row in csv.DictReader(f):\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Borrowed QMOS from exp07: {len(d)//2} wav\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer found → computing QMOS with UTMOSv2.\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if os.path.exists(wav):\n",
+    "            o = v2.predict(input_path=wav)\n",
+    "            qmos_map[n] = float(o[\"predicted_mos\"]) if isinstance(o, dict) else float(o)\n",
+    "    del v2; torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    pair = load_pair(sid)\n",
+    "    if pair is None:\n",
+    "        return None\n",
+    "    wave, iv = pair\n",
+    "    set_train(False)\n",
+    "    w = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    ivt = torch.from_numpy(iv).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        feat = torch.cat([wavlm_embed(w, am), aud_embed(ivt, am)], dim=1)\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "n_real = n_def = 0\n",
+    "with open(answer_path, \"w\") as f:\n",
+    "    f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "    for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "        sid = stem(name)\n",
+    "        pr = predict_emotion(sid)\n",
+    "        if pr is None:\n",
+    "            emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "        else:\n",
+    "            emos, cat5, vad3 = pr; n_real += 1\n",
+    "        qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "        f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "print(f\"Wrote {len(dev_names)} rows → {answer_path} | real emotion predictions {n_real}, defaults {n_def}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7b04afd9",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0dfcf6ef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp11_joint.zip answer.txt && unzip -l submission_track2_exp11_joint.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp11_joint.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8b7adc9b",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **exp11 = fine-tune BOTH WavLM and audeering, FUSED in 1 model** (unlike exp08, where audeering is frozen, and unlike exp10, which is an ensemble).\n",
+    "- **Warm-start:** WavLM + heads come from `ft_emotion_full_20epoch.pt` (exp08), so training starts from a good point; audeering starts from\n",
+    "  the original pretrained weights and is unfrozen to keep learning. The starting point equals exp08 and the best state is always kept, so the saved internal-val score can only match or improve on it.\n",
+    "- **OOM:** this is the heaviest configuration. On CUDA OOM → lower `UNFREEZE_WAVLM`/`UNFREEZE_AUD` (4→2),\n",
+    "  `MAX_SECONDS` (6→5), keep `BATCH=1` and raise `ACCUM`.\n",
+    "- **Checkpoint:** `ft_joint_full.pt` is saved at every new best (both backbones + heads), so it survives a dead kernel. Save Version!\n",
+    "- **QMOS** is still borrowed from exp07 (0.548). Compare before submitting: exp11 vs exp08 (0.811) vs exp10 (ensemble) → pick the best one.\n",
+    "- Log config → results → remarks in `docs/04_experiments_log.md` (exp11)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
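
The training cell in the notebook above combines its five task losses (`emos`, `cat`, `val`, `aro`, `dom`) with learned log-variance weights when `USE_UNCERTAINTY` is on. The following is a minimal standalone sketch of that weighting only, written in plain Python with `math` instead of PyTorch so it runs anywhere; the helper name `uncertainty_weighted_sum` and the dummy loss values are illustrative assumptions, not part of the notebook.

```python
import math

# Same task order as TASKS in the notebook's training cell.
TASKS = ["emos", "cat", "val", "aro", "dom"]


def uncertainty_weighted_sum(losses, log_var):
    """Return sum_i exp(-s_i) * L_i + s_i, where s_i = log_var[i].

    A large s_i down-weights task i, while the additive +s_i term
    penalizes growing every s_i without bound.
    (Hypothetical helper name; mirrors the expression in compute_loss.)
    """
    return sum(math.exp(-s) * losses[t] + s for t, s in zip(TASKS, log_var))


# Dummy per-task losses, chosen only for illustration.
losses = {"emos": 1.0, "cat": 2.0, "val": 3.0, "aro": 4.0, "dom": 5.0}

# With every log-variance at 0 the result is the plain sum of the losses.
print(uncertainty_weighted_sum(losses, [0.0] * len(TASKS)))

# Raising the log-variance of "dom" to ln(2) halves its weight and adds ln(2).
print(uncertainty_weighted_sum(losses, [0.0, 0.0, 0.0, 0.0, math.log(2)]))
```

In the notebook the log-variances are a single `nn.Parameter` initialized to zeros and optimized together with the heads, so training starts from the unweighted sum and then rebalances the tasks.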

track2/exp11_finetune_joint_pipeline.py ADDED

	@@ -0,0 +1,665 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp11 (JOINT FINE-TUNING of WavLM + audeering, FUSED in 1 model) on Kaggle T4
+#
+# **Difference from exp08:** exp08 fine-tunes only WavLM while audeering stays **frozen** (features cached). exp11 **UNFREEZES BOTH**
+# backbones and fuses their features **inside a single model**, so both learn the 2026 emotion-MOS task together.
+#
+# ```
+#  wav ─┬─► WavLM-large   (warm-start exp08, TRAINABLE: top N layers unfrozen) ─► pool ─► emb_wavlm ┐
+#       └─► audeering MSP (TRAINABLE: top N layers unfrozen) ─► pool ─► [emb_aud(1024) | vad3] ─────┼─► TRUNK ─┬─► EMOS (+target)
+#                                                                                                   ┘          ├─► CAT (5)
+#                                                                                                              └─► VAD (3)
+#  QMOS: NOT trained here → borrow the QMOS column from exp07 (0.548) or use UTMOSv2.
+# ```
+#
+# ## Why "feature fusion + fine-tune both" (unlike the exp10 ensemble)
+# - **exp10 = ensemble:** 2 SEPARATE models → average the VAD columns at the answer level. Safe, but the 2 models never "talk" to each other.
+# - **exp11 = fusion:** 1 model, 2 backbones fused INSIDE → the trunk learns to combine both views (WavLM categorical +
+#   audeering dimensional) → expected to be stronger, provided it neither OOMs nor overfits.
+#
+# ## ⚠️ TRADE-OFFS TO KNOW: this is the HEAVIEST configuration (2 large backbones with gradients at once)
+# - **High OOM risk on T4 (16GB).** Every memory saver is already on: `BATCH=1` + grad-accum,
+#   gradient checkpointing on BOTH backbones, AMP fp16, `MAX_SECONDS=6`, and FEW unfrozen layers (default 4 per backbone).
+# - If it still OOMs: lower `UNFREEZE_WAVLM`/`UNFREEZE_AUD` → 2, lower `MAX_SECONDS` → 5, raise `ACCUM`.
+# - **Slow and GPU-hour hungry** (2 backbones forward+backward, nothing can be cached). **The FIRST RUN MUST use `LIMIT_TRAIN=300`,
+#   `LIMIT_DEV=20`** to smoke-test the pipeline before switching to `None`.
+# - **Safety net:** do not burn submissions; submit only when exp11 beats exp08 (0.811) ON THE INTERNAL VAL split.
+#
+# **How to run:** GPU **T4** + Internet **On** → Add Input (data + exp08 checkpoint + [optional] exp07 answer) →
+# edit the slugs in cell 0 → Run All. Log config → results → remarks in `docs/04_experiments_log.md` (exp11).
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/datasets/minhtoan2/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug to match Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+# ── Warm-start / RESUME: point to one of 2 checkpoint types ──────────────────
+#   • ft_emotion_full_20epoch.pt (exp08): has 'wavlm'+'heads' → WARM-START (audeering from the original pretrained weights).
+#   • ft_joint_full.pt (exp11): also has 'aud'+'aud_head'     → FULL RESUME (restores both fine-tuned backbones).
+# The notebook detects the type AUTOMATICALLY from the checkpoint keys. Set "" to train WavLM from plain SAILER.
+WARMSTART_CKPT = "/kaggle/input/ft-joint-full/ft_joint_full.pt"   # << exp08 ckpt (warm-start) OR exp11 ckpt (resume)
+# Borrow the QMOS column from exp07 (0.548). If missing → UTMOSv2.
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"   # << (optional) exp07 answer.txt; if missing → UTMOSv2
+OUT_DIR = "/kaggle/working"
+# ── Fine-tuning / hyperparameters (HEAVY CONFIG, already tuned for T4) ────────
+DEVICE          = "cuda"
+SR              = 16000
+MAX_SECONDS     = 6            # ↓ vs exp08 (8) to save VRAM (2 backbones)
+UNFREEZE_WAVLM  = 4            # number of WavLM encoder layers to unfreeze (OOM → 2)
+UNFREEZE_AUD    = 4            # number of audeering encoder layers to unfreeze (OOM → 2)
+TRUNK_HIDDEN    = 512          # MUST match the exp08 checkpoint when warm-starting heads
+HEAD_HIDDEN     = 128          # MUST match the exp08 checkpoint
+DROPOUT         = 0.3
+LR_BACKBONE     = 1e-5         # shared LR for both backbones
+LR_HEAD         = 1e-3
+RESUME_LR_SCALE = 1.0          # <1.0 to LOWER the LR on resume (e.g. 0.5 if val has plateaued); multiplies both LR groups
+WEIGHT_DECAY    = 1e-5
+EPOCHS          = 12
+PATIENCE        = 4            # stop when val stops improving; ALWAYS keep best
+BATCH           = 1            # ⚠️ 2 large backbones → small batch
+ACCUM           = 16           # effective batch = 16
+VAL_FRAC        = 0.10
+SEED            = 42
+USE_AMP         = True
+USE_GRAD_CKPT   = True
+USE_UNCERTAINTY = True
+LIMIT_TRAIN     = 300          # << FIRST RUN 300; real run None
+LIMIT_DEV       = 20           # << FIRST RUN 20; real run None
+EXP08 = {"emos": 0.811, "cat_err": 0.133, "val": 0.659, "aro": 0.793, "dom": 0.751}  # baseline for comparison
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+print(("  ✅ " if (WARMSTART_CKPT and os.path.exists(WARMSTART_CKPT)) else "  ⚠️ NOT found ") + str(WARMSTART_CKPT)
+      + ("  → warm-start" if (WARMSTART_CKPT and os.path.exists(WARMSTART_CKPT)) else "  → train from plain SAILER"))
+# %% [markdown]
+# ## 1. Install dependencies + fetch the SAILER code (to build the exact WavLM architecture)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("transformers", "huggingface_hub", "safetensors", "loralib", "speechbrain",
+            "speechmos", "librosa", "soundfile", "scipy", "scikit-learn", "pandas", "tqdm")
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2A. TRAINABLE WavLM (warm-start from SAILER / exp08 checkpoint)
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (very slow!)")
+# Load the exp08 checkpoint (if present): takes 'wavlm', 'heads', and the normalization statistics
+ckpt = None
+if WARMSTART_CKPT and os.path.exists(WARMSTART_CKPT):
+    ckpt = torch.load(WARMSTART_CKPT, map_location="cpu", weights_only=False)
+    print("✅ Loaded warm-start checkpoint:", WARMSTART_CKPT, "| keys:", list(ckpt.keys()))
+    if "wavlm" not in ckpt:
+        print("   ⚠️ Checkpoint has NO 'wavlm' key (heads only?) → WavLM is still built from SAILER; heads are warm-started only if they match.")
+def find_hf_backbone(module):
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ Built the WavLM backbone from the SAILER wrapper at '.{name}'")
+except Exception as e:
+    print("⚠️ Failed to load the SAILER wrapper:", repr(e), "→ falling back to plain WavLM.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: microsoft/wavlm-large.")
+wavlm = wavlm.to(device)
+WAVLM_DIM = int(wavlm.config.hidden_size)
+wavlm.config.layerdrop = 0.0   # ⚠️ disable layerdrop when using gradient checkpointing (avoids CheckpointError)
+# Overwrite with the fine-tuned weights from the exp08 checkpoint (if any)
+if ckpt is not None and "wavlm" in ckpt:
+    miss, unexp = wavlm.load_state_dict(ckpt["wavlm"], strict=False)
+    print(f"🔁 Loaded wavlm from the exp08 checkpoint: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0).")
+# Partial freeze: unfreeze only the top UNFREEZE_WAVLM layers
+for p in wavlm.parameters():
+    p.requires_grad = False
+_wl = wavlm.encoder.layers
+for layer in _wl[max(0, len(_wl) - UNFREEZE_WAVLM):]:
+    for p in layer.parameters():
+        p.requires_grad = True
+print(f"WavLM: {len(_wl)} layers · unfreezing top {min(UNFREEZE_WAVLM, len(_wl))} → "
+      f"{sum(p.numel() for p in wavlm.parameters() if p.requires_grad)/1e6:.1f}M trainable params (dim {WAVLM_DIM})")
+if USE_GRAD_CKPT:
+    wavlm.gradient_checkpointing_enable()
+    if hasattr(wavlm, "enable_input_require_grads"):
+        wavlm.enable_input_require_grads()
+def masked_mean(hidden, attn_mask, model):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = model._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+def wavlm_embed(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state
+    return masked_mean(out, attn_mask, wavlm)
+# %% [markdown]
+# ## 2B. audeering TRAINABLE (unfrozen, unlike exp08 where it was frozen)
+# Load the backbone manually plus the original dimensional head; unfreeze the top `UNFREEZE_AUD` layers. Fused feature = [hidden(1024) | vad3].
+# %%
+from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+from huggingface_hub import hf_hub_download
+AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+aud = Wav2Vec2Model(aud_cfg)
+try:
+    _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+        hf_hub_download(AUD_NAME, "model.safetensors"))
+except Exception:
+    _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+aud.load_state_dict(bb_sd, strict=False)
+_hid = _sd["classifier.dense.weight"].shape[0]
+aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _sd["classifier.out_proj.weight"].shape[0]))
+aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+aud = aud.to(device); aud_head = aud_head.to(device)
+aud.config.layerdrop = 0.0   # ⚠️ disable layerdrop when using gradient checkpointing (avoids CheckpointError)
+AUD_DIM = _hid + 3   # = 1027 (matches exp08 so the heads can be warm-started)
+# RESUME: if the checkpoint is ft_joint_full.pt (has 'aud') → restore the fine-tuned audeering weights (overwriting the pretrained ones)
+if ckpt is not None and "aud" in ckpt:
+    amiss, aunexp = aud.load_state_dict(ckpt["aud"], strict=False)
+    print(f"🔁 RESUMED audeering from the checkpoint: {len(amiss)} missing / {len(aunexp)} unexpected keys (expected ~0).")
+    if "aud_head" in ckpt:
+        aud_head.load_state_dict(ckpt["aud_head"]); print("🔁 RESUMED aud_head from the checkpoint.")
+else:
+    print("ℹ️ No 'aud' in the checkpoint → audeering starts from the original pretrained weights (exp08 warm-start mode).")
+# Partial freeze for audeering: unfreeze the top UNFREEZE_AUD layers; the dimensional head is always trainable
+for p in aud.parameters():
+    p.requires_grad = False
+_al = aud.encoder.layers
+for layer in _al[max(0, len(_al) - UNFREEZE_AUD):]:
+    for p in layer.parameters():
+        p.requires_grad = True
+for p in aud_head.parameters():
+    p.requires_grad = True
+print(f"audeering: {len(_al)} layers · unfreezing top {min(UNFREEZE_AUD, len(_al))} → "
+      f"{sum(p.numel() for p in aud.parameters() if p.requires_grad)/1e6:.1f}M trainable params (hidden {_hid}, fuse dim {AUD_DIM})")
+if USE_GRAD_CKPT:
+    aud.gradient_checkpointing_enable()
+    if hasattr(aud, "enable_input_require_grads"):
+        aud.enable_input_require_grads()
+def aud_embed(input_values, attn_mask):
+    """Return [hidden(1024) | vad3]; vad3 comes from the original dimensional head, reordered to VAL, ARO, DOM."""
+    h = masked_mean(aud(input_values, attention_mask=attn_mask).last_hidden_state, attn_mask, aud)
+    out = aud_head(h)   # [B,3] in audeering's native order: (arousal, dominance, valence)
+    vad = torch.stack([1 + 4 * out[:, 2], 1 + 4 * out[:, 0], 1 + 4 * out[:, 1]], dim=1)  # → VAL,ARO,DOM, mapped from ~[0,1] to [1,5]
+    return torch.cat([h, vad], dim=1)
+# %% [markdown]
+# ## 3. Read and aggregate labels by wavID (as in exp08)
+# %%
+import librosa
+import pandas as pd
+from tqdm.auto import tqdm
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"eMOS column not found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Targets: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}")
+# %% [markdown]
+# ## 4. Dataset/loader: returns BOTH the raw wave (for WavLM) and the audeering input_values
+# The two backbones need different inputs: WavLM takes the raw waveform; audeering takes the waveform normalized by its processor.
+# Both have the same length → they share one sample-level attention mask.
+# %%
+from torch.utils.data import Dataset, DataLoader
+from sklearn.model_selection import train_test_split
+train_stems = [s for s in train_df["wavID"] if target_map.get(s) is not None]
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+lab = train_df.set_index("wavID")
+# Normalization: take the stats FROM the checkpoint when warm-starting (to match the trained heads); otherwise fit them on the data.
+if ckpt is not None and "emos_mu" in ckpt:
+    emos_mu = float(ckpt["emos_mu"]); emos_sd = float(ckpt["emos_sd"])
+    vad_mu = np.asarray(ckpt["vad_mu"], dtype=np.float32); vad_sd = np.asarray(ckpt["vad_sd"], dtype=np.float32)
+    print(f"Normalization FROM ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}")
+else:
+    def _zfit(a):
+        a = np.asarray(a, dtype=np.float32); return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)
+    emos_mu, emos_sd = _zfit([lab.loc[s, "emos"] for s in train_stems])
+    if HAS_VAD:
+        vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in ["val", "aro", "dom"]], np.float32)
+        vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in ["val", "aro", "dom"]], np.float32)
+    else:
+        vad_mu = np.zeros(3, np.float32); vad_sd = np.ones(3, np.float32)
+    print(f"Normalization fit on data: emos μ={emos_mu:.3f} σ={emos_sd:.3f}")
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+def load_pair(sid):
+    """Return (raw_wave, iv_audeering) of equal length; None if the file is missing."""
+    p = os.path.join(WAV_DIR, sid if str(sid).endswith(".wav") else str(sid) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    wave = wave[: MAX_SECONDS * SR].astype(np.float32)
+    iv = np.asarray(aud_proc(wave, sampling_rate=SR).input_values[0], dtype=np.float32)
+    return wave, iv
+class JointDataset(Dataset):
+    def __init__(self, stems):
+        self.stems = [s for s in stems if load_pair(s) is not None]
+    def __len__(self):
+        return len(self.stems)
+    def __getitem__(self, i):
+        s = self.stems[i]
+        wave, iv = load_pair(s)
+        emos = (float(lab.loc[s, "emos"]) - emos_mu) / emos_sd
+        if HAS_VAD:
+            vad = (np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32) - vad_mu) / vad_sd
+        else:
+            vad = np.zeros(3, dtype=np.float32)
+        cat = np.array([lab.loc[s, f"cat{j}"] for j in range(len(EMOTIONS5))], dtype=np.float32)
+        return {"wave": wave, "iv": iv, "tgt": onehot_target(target_map.get(s)),
+                "emos": np.float32(emos), "vad": vad, "cat": cat,
+                "emos_raw": np.float32(lab.loc[s, "emos"]),
+                "vad_raw": np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32)}
+def collate(batch):
+    L = max(len(b["wave"]) for b in batch)
+    waves = np.zeros((len(batch), L), dtype=np.float32)
+    ivs = np.zeros((len(batch), L), dtype=np.float32)
+    mask = np.zeros((len(batch), L), dtype=np.float32)
+    for i, b in enumerate(batch):
+        n = len(b["wave"])
+        waves[i, :n] = b["wave"]; ivs[i, :len(b["iv"])] = b["iv"]; mask[i, :n] = 1.0
+    return {
+        "wave": torch.from_numpy(waves), "iv": torch.from_numpy(ivs), "attn_mask": torch.from_numpy(mask).long(),
+        "tgt": torch.from_numpy(np.stack([b["tgt"] for b in batch])),
+        "emos": torch.from_numpy(np.stack([b["emos"] for b in batch])).unsqueeze(1),
+        "vad": torch.from_numpy(np.stack([b["vad"] for b in batch])),
+        "cat": torch.from_numpy(np.stack([b["cat"] for b in batch])),
+        "emos_raw": np.stack([b["emos_raw"] for b in batch]),
+        "vad_raw": np.stack([b["vad_raw"] for b in batch]),
+    }
+ds = JointDataset(train_stems)
+print("Valid dataset:", len(ds), "wavs")
+tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)
+tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)
+va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)
+# %% [markdown]
+# ## 5. Heads (warm-started from exp08 if they match) + optimizer over both backbones + training loop
+# %%
+from scipy.stats import spearmanr
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+TRUNK_IN = WAVLM_DIM + AUD_DIM
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+if ckpt is not None and "heads" in ckpt:
+    hmiss, hunexp = heads.load_state_dict(ckpt["heads"], strict=False)
+    if len(hmiss) == 0 and len(hunexp) == 0:
+        print("🔁 Heads warm-started from exp08: exact match.")
+    else:
+        print(f"⚠️ exp08 heads mismatch ({len(hmiss)} missing/{len(hunexp)} unexpected) → TRUNK_IN may differ. Mismatched parts keep their fresh init.")
+print(f"Trunk input = {TRUNK_IN} (wavlm {WAVLM_DIM} + aud {AUD_DIM})")
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+bb_params = [p for p in wavlm.parameters() if p.requires_grad] + \
+            [p for p in aud.parameters() if p.requires_grad] + list(aud_head.parameters())
+head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.AdamW([{"params": bb_params, "lr": LR_BACKBONE * RESUME_LR_SCALE},
+                         {"params": head_params, "lr": LR_HEAD * RESUME_LR_SCALE}], weight_decay=WEIGHT_DECAY)
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()
+def forward_batch(b):
+    am = b["attn_mask"].to(device)
+    fw = wavlm_embed(b["wave"].to(device), am)
+    fa = aud_embed(b["iv"].to(device), am)
+    return heads(torch.cat([fw, fa], dim=1), b["tgt"].to(device))
+def compute_loss(emos_p, cat_l, vad_p, b):
+    L = {}
+    L["emos"] = mse(emos_p, b["emos"].to(device))
+    L["cat"] = soft_ce(cat_l, b["cat"].to(device))
+    if HAS_VAD:
+        vt = b["vad"].to(device)
+        L["val"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L["aro"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L["dom"] = mse(vad_p[:, 2:3], vt[:, 2:3])
+    else:
+        z = torch.zeros((), device=device); L["val"] = L["aro"] = L["dom"] = z
+    if USE_UNCERTAINTY:
+        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))
+    return sum(L.values())
+def set_train(flag):
+    wavlm.train(flag); aud.train(flag); aud_head.train(flag); heads.train(flag)
+@torch.no_grad()
+def evaluate():
+    set_train(False)
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for b in va_loader:
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+        P["emos"] += emos_p.float().cpu().numpy().ravel().tolist(); Y["emos"] += b["emos_raw"].tolist()
+        vad_p = vad_p.float().cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t] += vad_p[:, j].tolist(); Y[t] += b["vad_raw"][:, j].tolist()
+        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b["cat"])
+    out = {}
+    for t in ["emos"] + (["val", "aro", "dom"] if HAS_VAD else []):
+        out[t] = spearmanr(P[t], Y[t]).correlation
+    q = np.concatenate(catP); p = np.concatenate(catY)
+    out["cat_err"] = float(np.abs(q - p).sum(1).mean())
+    return out
+def mean_srcc(m):
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+def snapshot():
+    return {"wavlm": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},
+            "aud": {k: v.cpu().clone() for k, v in aud.state_dict().items()},
+            "aud_head": {k: v.cpu().clone() for k, v in aud_head.state_dict().items()},
+            "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+CKPT_PATH = os.path.join(OUT_DIR, "ft_joint_full.pt")
+def save_full(state, val_emos=float("nan")):
+    torch.save({**state, "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+                "WAVLM_DIM": WAVLM_DIM, "AUD_DIM": AUD_DIM,
+                "UNFREEZE_WAVLM": UNFREEZE_WAVLM, "UNFREEZE_AUD": UNFREEZE_AUD,
+                "val_emos": float(val_emos)}, CKPT_PATH)
+# Initialize best from the current warm-start state → save again only if training improves on it
+m0 = evaluate(); best = mean_srcc(m0); best_state = snapshot(); save_full(best_state, m0.get("emos", float("nan")))
+print(f"📍 Starting point (warm-start): mean SRCC = {best:.4f} | "
+      + " ".join(f"{k}={m0[k]:.3f}" for k in ['emos','val','aro','dom'] if k in m0))
+bad = 0
+for ep in range(1, EPOCHS + 1):
+    set_train(True)
+    opt.zero_grad(); run = 0.0; nb = 0
+    for step, b in enumerate(tqdm(tr_loader, desc=f"epoch {ep}")):
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM
+        scaler.scale(loss).backward()
+        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):   # also step on the last, possibly partial, accumulation group
+            scaler.step(opt); scaler.update(); opt.zero_grad()
+        run += loss.item() * ACCUM; nb += 1
+    m = evaluate(); sc = mean_srcc(m)
+    msg = " ".join(f"{k}={m[k]:.3f}" for k in ["emos", "val", "aro", "dom"] if k in m)
+    print(f"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc; best_state = snapshot(); save_full(best_state, m["emos"])
+        print(f"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})"); bad = 0
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop at epoch {ep}."); break
+# Reload best
+wavlm.load_state_dict(best_state["wavlm"]); aud.load_state_dict(best_state["aud"])
+aud_head.load_state_dict(best_state["aud_head"]); heads.load_state_dict(best_state["heads"])
+final = evaluate()
+print("\n✅ VAL (internal), exp11 (fine-tune BOTH + fusion):")
+print(f"   EMOS={final['emos']:.4f} (exp08 {EXP08['emos']})")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} "
+          f"(exp08 {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})")
+print(f"   mean SRCC: warm-start {mean_srcc(m0):.4f} → exp11 {mean_srcc(final):.4f} "
+      + ("🚀 improved" if mean_srcc(final) > mean_srcc(m0) + 1e-4 else "➖ no improvement"))
+save_full(best_state, final.get("emos", float("nan")))
+print("Saved FULL:", CKPT_PATH, "→ REMEMBER to Save Version!")
+# %% [markdown]
+# ## 6. Predict DEV → answer.txt (5 emotion columns from exp11; QMOS borrowed from exp07 / UTMOSv2)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+print("DEV:", len(dev_names), "samples")
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            for row in csv.DictReader(f):
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Borrowed QMOS from exp07: {len(d)//2} wavs")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ exp07 not available → computing QMOS with UTMOSv2.")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if os.path.exists(wav):
+            o = v2.predict(input_path=wav)
+            qmos_map[n] = float(o["predicted_mos"]) if isinstance(o, dict) else float(o)
+    del v2; torch.cuda.empty_cache() if device == "cuda" else None
+@torch.no_grad()
+def predict_emotion(sid):
+    pair = load_pair(sid)
+    if pair is None:
+        return None
+    wave, iv = pair
+    set_train(False)
+    w = torch.from_numpy(wave).unsqueeze(0).to(device)
+    ivt = torch.from_numpy(iv).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        feat = torch.cat([wavlm_embed(w, am), aud_embed(ivt, am)], dim=1)
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+n_real = n_def = 0
+with open(answer_path, "w") as f:
+    f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+    for name in tqdm(dev_names, desc="answer"):
+        sid = stem(name)
+        pr = predict_emotion(sid)
+        if pr is None:
+            emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+        else:
+            emos, cat5, vad3 = pr; n_real += 1
+        qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+        f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+print(f"Wrote {len(dev_names)} rows → {answer_path} | real emotion predictions {n_real}, defaults {n_def}")
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Row {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp11_joint.zip answer.txt && unzip -l submission_track2_exp11_joint.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp11_joint.zip"))
+# %% [markdown]
+# ## Notes
+# - **exp11 = fine-tune BOTH WavLM + audeering, FUSED into 1 model** (unlike exp08, where audeering was frozen, and unlike exp10, which was an ensemble).
+# - **Warm-start:** WavLM + heads come from `ft_emotion_full_20epoch.pt` (exp08) → training starts from a good point; audeering starts from
+#   the original pretrained weights and is unfrozen to learn further. The starting point equals exp08, so the kept-best checkpoint can only match or beat it on the internal validation split.
+# - **OOM:** this is the heaviest configuration. On CUDA OOM → reduce `UNFREEZE_WAVLM`/`UNFREEZE_AUD` (4→2) and
+#   `MAX_SECONDS` (6→5), keep `BATCH=1`, and raise `ACCUM`.
+# - **Checkpoint:** `ft_joint_full.pt` is saved at every new best (both backbones + heads), so it survives a dead kernel. Save Version!
+# - **QMOS** is still borrowed from exp07 (0.548). Submission comparison: exp11 vs exp08 (0.811) vs exp10 (ensemble) → pick the best one.
+# - Log config → results → remarks in `docs/04_experiments_log.md` (exp11).
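
The OOM note trades `BATCH` for `ACCUM`. As a rough illustration only (a toy scalar model, not the pipeline's training loop; `grad_mse` and `accumulated_grad` are hypothetical helpers introduced here), the sketch below checks that dividing each micro-batch loss by `ACCUM` and summing the gradients reproduces the gradient of one large batch, assuming all micro-batches have the same size.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch, accum):
    """Sum the gradients of (loss / accum) over `accum` equal-sized micro-batches."""
    total = 0.0
    for k in range(accum):
        lo, hi = k * micro_batch, (k + 1) * micro_batch
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum
    return total


# Toy data: 8 samples, one scalar weight.
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 3.0, 1.0, -2.0]
ys = [1.0, -2.5, 3.5, 3.0, -1.0, 6.5, 2.0, -4.5]
w = 0.3

full = grad_mse(w, xs, ys)                                   # one batch of 8
small = accumulated_grad(w, xs, ys, micro_batch=1, accum=8)  # BATCH=1, ACCUM=8
print("same gradient:", abs(full - small) < 1e-9)
```

The equivalence only holds for the gradient of a mean loss with equal micro-batch sizes; if the last group of an epoch is shorter than `ACCUM`, its gradient is scaled down accordingly.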

track2/exp12_wavlm_scratch.ipynb ADDED

	@@ -0,0 +1,690 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "73aea642",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2 — exp12 (WavLM: SCRATCH vs BASE vs SAILER — ablation khởi tạo) — Kaggle T4\n",
+    "\n",
+    "**Mục đích:** kiểm chứng giả thuyết của mentor — *\"với 12k data, train from scratch có tốt hơn fine-tune không?\"*\n",
+    "Một notebook, đổi cờ `INIT_MODE` để chạy 3 cách khởi tạo backbone WavLM, so trên CÙNG kiến trúc/data:\n",
+    "\n",
+    "| INIT_MODE | Khởi tạo WavLM | Train gì | Ý nghĩa |\n",
+    "|---|---|---|---|\n",
+    "| `scratch` | **ngẫu nhiên** (không pretrain) | **toàn bộ** backbone | \"from scratch\" đúng nghĩa mentor nói |\n",
+    "| `base`    | microsoft/wavlm-large (pretrain SSL, KHÔNG cảm xúc) | mở băng N lớp trên | đo lợi ích của SAILER warm-start |\n",
+    "| `sailer`  | warm-start cảm xúc (như exp08) | mở băng N lớp trên | bản mạnh hiện tại |\n",
+    "\n",
+    "**Chỉ WavLM** (bỏ audeering) để cô lập đúng biến \"khởi tạo\". QMOS mượn exp07 / UTMOSv2.\n",
+    "\n",
+    "## ⚠️ Kỳ vọng trung thực (để đọc kết quả đúng)\n",
+    "- `scratch` gần như CHẮC CHẮN yếu hơn `base`/`sailer`: 12k mẫu là quá ít để dạy WavLM \"nghe\" từ đầu\n",
+    "  (SSL pretrain dùng ~94.000 GIỜ audio). Đây là ablation để **chứng minh bằng số**, không phải để vượt.\n",
+    "- `scratch` phải mở băng TOÀN BỘ (mới có gì để học) → **nặng + chậm + dễ OOM** trên T4. Dùng LIMIT nhỏ trước.\n",
+    "- So sánh bằng **VAL nội bộ** giữa 3 mode đã đủ kết luận; muốn chắc thì nộp mode tốt nhất lên DEV.\n",
+    "\n",
+    "**Cách chạy:** GPU T4 + Internet On → sửa cell 0 (`INIT_MODE` + slug) → Run All. Chạy 3 lần đổi INIT_MODE."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c0242f5c",
+   "metadata": {},
+   "source": [
+    "## 0. Cấu hình — SỬA Ở ĐÂY"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a167d9d3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "INIT_MODE = \"sailer\"   # << \"scratch\" | \"base\" | \"sailer\"  (đổi rồi chạy lại để so) — \"sailer\" = WavLM warm-start cảm xúc\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/datasets/minhtoan2/vmc2026-track2-full\"   # << SỬA slug cho khớp Add Input\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"   # << (tùy chọn) mượn QMOS 0.548; không có → UTMOSv2\n",
+    "OUT_DIR      = \"/kaggle/working\"\n",
+    "\n",
+    "# ── Siêu tham số ─────────────────────────────────────────────────────────────\n",
+    "DEVICE          = \"cuda\"\n",
+    "SR              = 16000\n",
+    "MAX_SECONDS     = 6\n",
+    "TRUNK_HIDDEN    = 512\n",
+    "HEAD_HIDDEN     = 128\n",
+    "DROPOUT         = 0.3\n",
+    "WEIGHT_DECAY    = 1e-5\n",
+    "EPOCHS          = 15\n",
+    "PATIENCE        = 5\n",
+    "BATCH           = 4\n",
+    "ACCUM           = 8\n",
+    "VAL_FRAC        = 0.10\n",
+    "SEED            = 42\n",
+    "USE_AMP         = True\n",
+    "USE_GRAD_CKPT   = True\n",
+    "USE_UNCERTAINTY = True\n",
+    "\n",
+    "# Khởi tạo & LR & mở băng — TỰ đặt theo INIT_MODE (scratch cần LR lớn + mở băng toàn bộ)\n",
+    "if INIT_MODE == \"scratch\":\n",
+    "    UNFREEZE_TOP_LAYERS = \"all\"     # random init → phải train tất cả mới học được\n",
+    "    LR_BACKBONE = 1e-4              # random init cần bước lớn hơn fine-tune\n",
+    "elif INIT_MODE in (\"base\", \"sailer\"):\n",
+    "    UNFREEZE_TOP_LAYERS = 6         # fine-tune: chỉ mở băng N lớp trên (tiết kiệm VRAM, chống overfit)\n",
+    "    LR_BACKBONE = 1e-5\n",
+    "else:\n",
+    "    raise ValueError(f\"INIT_MODE lạ: {INIT_MODE}\")\n",
+    "LR_HEAD = 1e-3\n",
+    "\n",
+    "LIMIT_TRAIN = 300    # << LẦN ĐẦU 300; chạy thật None\n",
+    "LIMIT_DEV   = 20     # << LẦN ĐẦU 20; chạy thật None\n",
+    "\n",
+    "EXP08 = {\"emos\": 0.811, \"cat_err\": 0.133, \"val\": 0.659, \"aro\": 0.793, \"dom\": 0.751}  # mốc DEV để tham khảo\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(f\"INIT_MODE = {INIT_MODE} | UNFREEZE = {UNFREEZE_TOP_LAYERS} | LR_BACKBONE = {LR_BACKBONE}\")\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ THIẾU \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "46cd2554",
+   "metadata": {},
+   "source": [
+    "## 1. Cài đặt (clone SAILER chỉ khi INIT_MODE='sailer')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "51808707",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "import numpy as _np\n",
+    "\n",
+    "# ⚠️ KHÓA numpy = bản Kaggle đang có → pip KHÔNG được nâng/hạ numpy → tránh \"SystemError: bad call flags\"\n",
+    "# (lỗi import torch do numpy lệch phiên bản với torch đã biên dịch sẵn).\n",
+    "_NPIN = f\"numpy=={_np.__version__}\"\n",
+    "print(\"Khóa numpy ở:\", _NPIN)\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs, _NPIN], check=True)\n",
+    "\n",
+    "# Kaggle đã có sẵn torch/transformers/librosa/scipy/sklearn/pandas/tqdm/huggingface_hub/safetensors.\n",
+    "# Chỉ cài thêm vài gói speech còn thiếu (kèm khóa numpy ở trên).\n",
+    "pip_install(\"loralib\", \"speechmos\", \"soundfile\")\n",
+    "if INIT_MODE == \"sailer\":\n",
+    "    pip_install(\"speechbrain\")\n",
+    "\n",
+    "if INIT_MODE == \"sailer\":\n",
+    "    REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "    if not os.path.exists(REPO_DIR):\n",
+    "        subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                        \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "    if REPO_DIR not in sys.path:\n",
+    "        sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5288727c",
+   "metadata": {},
+   "source": [
+    "## 2. Dựng WavLM theo INIT_MODE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c828dcd3",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "import numpy as np\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (rất chậm!)\")\n",
+    "\n",
+    "from transformers import WavLMModel, WavLMConfig\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "if INIT_MODE == \"scratch\":\n",
+    "    # Random init, but keep the exact large architecture (for a fair comparison with base/sailer)\n",
+    "    cfg = WavLMConfig.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    wavlm = WavLMModel(cfg)   # no weights loaded, so parameters are random\n",
+    "    print(\"🎲 WavLM-large with RANDOM init (from scratch, no pretraining).\")\n",
+    "elif INIT_MODE == \"base\":\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"📦 WavLM-large SSL-pretrained (not yet trained on emotion).\")\n",
+    "elif INIT_MODE == \"sailer\":\n",
+    "    try:\n",
+    "        from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "        _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
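+    "        name, wavlm = find_hf_backbone(_wrapper)\n",
+    "        if wavlm is None:   # no HF-style backbone found: raise so the fallback below runs\n",
+    "            raise RuntimeError(\"no WavLM backbone found inside the SAILER wrapper\")\n",
+    "        print(f\"🔥 WavLM warm-started from SAILER (emotion) at '.{name}'\")\n",
+    "    except Exception as e:\n",
+    "        print(\"⚠️ Failed to load SAILER:\", repr(e), \"→ falling back to base pretrained.\")\n",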
+    "        wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "\n",
+    "wavlm = wavlm.to(device)\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "wavlm.config.layerdrop = 0.0   # ⚠️ REQUIRED with gradient checkpointing (avoids CheckpointError from randomly dropped layers)\n",
+    "\n",
+    "# Unfreeze according to the config\n",
+    "if UNFREEZE_TOP_LAYERS == \"all\":\n",
+    "    for p in wavlm.parameters():\n",
+    "        p.requires_grad = True\n",
+    "    n_open = \"ALL\"\n",
+    "else:\n",
+    "    for p in wavlm.parameters():\n",
+    "        p.requires_grad = False\n",
+    "    _wl = wavlm.encoder.layers\n",
+    "    for layer in _wl[max(0, len(_wl) - UNFREEZE_TOP_LAYERS):]:\n",
+    "        for p in layer.parameters():\n",
+    "            p.requires_grad = True\n",
+    "    n_open = f\"top {min(UNFREEZE_TOP_LAYERS, len(_wl))}/{len(_wl)}\"\n",
+    "print(f\"WavLM unfrozen: {n_open} → {sum(p.numel() for p in wavlm.parameters() if p.requires_grad)/1e6:.1f}M trainable params (dim {WAVLM_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    wavlm.gradient_checkpointing_enable()\n",
+    "    if hasattr(wavlm, \"enable_input_require_grads\"):\n",
+    "        wavlm.enable_input_require_grads()\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "def wavlm_embed(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    return masked_mean(out, attn_mask)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b963eb2",
+   "metadata": {},
+   "source": [
+    "## 3. Read and aggregate labels by wavID"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4ba4667b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import librosa\n",
+    "import pandas as pd\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"eMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7efc0957",
+   "metadata": {},
+   "source": [
+    "## 4. Dataset/loader (raw waveform only, for WavLM)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a0ae0f55",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "train_stems = [s for s in train_df[\"wavID\"] if target_map.get(s) is not None]\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "\n",
+    "def _zfit(a):\n",
+    "    a = np.asarray(a, dtype=np.float32); return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)\n",
+    "emos_mu, emos_sd = _zfit([lab.loc[s, \"emos\"] for s in train_stems])\n",
+    "if HAS_VAD:\n",
+    "    vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in [\"val\", \"aro\", \"dom\"]], np.float32)\n",
+    "    vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in [\"val\", \"aro\", \"dom\"]], np.float32)\n",
+    "else:\n",
+    "    vad_mu = np.zeros(3, np.float32); vad_sd = np.ones(3, np.float32)\n",
+    "print(f\"Normalization: emos μ={emos_mu:.3f} σ={emos_sd:.3f}\")\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_wav(sid):\n",
+    "    p = os.path.join(WAV_DIR, sid if str(sid).endswith(\".wav\") else str(sid) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: MAX_SECONDS * SR].astype(np.float32)\n",
+    "\n",
+    "class EmoDataset(Dataset):\n",
+    "    def __init__(self, stems):\n",
+    "        self.stems = [s for s in stems if load_wav(s) is not None]\n",
+    "    def __len__(self):\n",
+    "        return len(self.stems)\n",
+    "    def __getitem__(self, i):\n",
+    "        s = self.stems[i]\n",
+    "        wave = load_wav(s)\n",
+    "        emos = (float(lab.loc[s, \"emos\"]) - emos_mu) / emos_sd\n",
+    "        if HAS_VAD:\n",
+    "            vad = (np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32) - vad_mu) / vad_sd\n",
+    "        else:\n",
+    "            vad = np.zeros(3, dtype=np.float32)\n",
+    "        cat = np.array([lab.loc[s, f\"cat{j}\"] for j in range(len(EMOTIONS5))], dtype=np.float32)\n",
+    "        return {\"wave\": wave, \"tgt\": onehot_target(target_map.get(s)),\n",
+    "                \"emos\": np.float32(emos), \"vad\": vad, \"cat\": cat,\n",
+    "                \"emos_raw\": np.float32(lab.loc[s, \"emos\"]),\n",
+    "                \"vad_raw\": np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32)}\n",
+    "\n",
+    "def collate(batch):\n",
+    "    L = max(len(b[\"wave\"]) for b in batch)\n",
+    "    waves = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    mask = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    for i, b in enumerate(batch):\n",
+    "        waves[i, : len(b[\"wave\"])] = b[\"wave\"]; mask[i, : len(b[\"wave\"])] = 1.0\n",
+    "    return {\n",
+    "        \"input_values\": torch.from_numpy(waves), \"attn_mask\": torch.from_numpy(mask).long(),\n",
+    "        \"tgt\": torch.from_numpy(np.stack([b[\"tgt\"] for b in batch])),\n",
+    "        \"emos\": torch.from_numpy(np.stack([b[\"emos\"] for b in batch])).unsqueeze(1),\n",
+    "        \"vad\": torch.from_numpy(np.stack([b[\"vad\"] for b in batch])),\n",
+    "        \"cat\": torch.from_numpy(np.stack([b[\"cat\"] for b in batch])),\n",
+    "        \"emos_raw\": np.stack([b[\"emos_raw\"] for b in batch]),\n",
+    "        \"vad_raw\": np.stack([b[\"vad_raw\"] for b in batch]),\n",
+    "    }\n",
+    "\n",
+    "ds = EmoDataset(train_stems)\n",
+    "print(\"Valid dataset:\", len(ds), \"wav\")\n",
+    "tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)\n",
+    "tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)\n",
+    "va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d249653b",
+   "metadata": {},
+   "source": [
+    "## 5. Heads + train loop"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fac929f9",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(WAVLM_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "bb_params = [p for p in wavlm.parameters() if p.requires_grad]\n",
+    "head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.AdamW([{\"params\": bb_params, \"lr\": LR_BACKBONE},\n",
+    "                         {\"params\": head_params, \"lr\": LR_HEAD}], weight_decay=WEIGHT_DECAY)\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()\n",
+    "\n",
+    "def forward_batch(b):\n",
+    "    feat = wavlm_embed(b[\"input_values\"].to(device), b[\"attn_mask\"].to(device))\n",
+    "    return heads(feat, b[\"tgt\"].to(device))\n",
+    "\n",
+    "def compute_loss(emos_p, cat_l, vad_p, b):\n",
+    "    L = {}\n",
+    "    L[\"emos\"] = mse(emos_p, b[\"emos\"].to(device))\n",
+    "    L[\"cat\"] = soft_ce(cat_l, b[\"cat\"].to(device))\n",
+    "    if HAS_VAD:\n",
+    "        vt = b[\"vad\"].to(device)\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L[\"aro\"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L[\"dom\"] = mse(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(L.values())\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def evaluate():\n",
+    "    wavlm.eval(); heads.eval()\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for b in va_loader:\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "        P[\"emos\"] += emos_p.float().cpu().numpy().ravel().tolist(); Y[\"emos\"] += b[\"emos_raw\"].tolist()\n",
+    "        vad_p = vad_p.float().cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t] += vad_p[:, j].tolist(); Y[t] += b[\"vad_raw\"][:, j].tolist()\n",
+    "        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b[\"cat\"])\n",
+    "    out = {}\n",
+    "    for t in [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else []):\n",
+    "        out[t] = spearmanr(P[t], Y[t]).correlation\n",
+    "    q = np.concatenate(catP); p = np.concatenate(catY)\n",
+    "    out[\"cat_err\"] = float(np.abs(q - p).sum(1).mean())\n",
+    "    return out\n",
+    "\n",
+    "def mean_srcc(m):\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "CKPT_PATH = os.path.join(OUT_DIR, f\"ft_wavlm_{INIT_MODE}.pt\")\n",
+    "def save_full(state, val_emos=float(\"nan\")):\n",
+    "    torch.save({\"wavlm\": state[\"wavlm\"], \"heads\": state[\"heads\"], \"INIT_MODE\": INIT_MODE,\n",
+    "                \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "                \"WAVLM_DIM\": WAVLM_DIM, \"val_emos\": float(val_emos)}, CKPT_PATH)\n",
+    "\n",
+    "best, best_state, bad = -1e9, None, 0\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    wavlm.train(); heads.train()\n",
+    "    opt.zero_grad(); run = 0.0; nb = 0\n",
+    "    for step, b in enumerate(tqdm(tr_loader, desc=f\"[{INIT_MODE}] epoch {ep}\")):\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM\n",
+    "        scaler.scale(loss).backward()\n",
+    "        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):   # also flush the last partial accumulation window\n",
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "        run += loss.item() * ACCUM; nb += 1\n",
+    "    m = evaluate(); sc = mean_srcc(m)\n",
+    "    msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "    print(f\"[{INIT_MODE}] epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc\n",
+    "        best_state = {\"wavlm\": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},\n",
+    "                      \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "        save_full(best_state, m[\"emos\"]); bad = 0\n",
+    "        print(f\"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})\")\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stop at epoch {ep}.\"); break\n",
+    "\n",
+    "if best_state:\n",
+    "    wavlm.load_state_dict(best_state[\"wavlm\"]); heads.load_state_dict(best_state[\"heads\"])\n",
+    "final = evaluate()\n",
+    "print(f\"\\n✅ VAL (internal), exp12 INIT_MODE={INIT_MODE}:\")\n",
+    "print(f\"   EMOS={final['emos']:.4f}\", end=\"\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\" | VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f}\")\n",
+    "else:\n",
+    "    print()\n",
+    "print(f\"   cat_err={final['cat_err']:.4f} | mean SRCC={mean_srcc(final):.4f}\")\n",
+    "print(f\"   (exp08 DEV reference: EMOS {EXP08['emos']}, VAD {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})\")\n",
+    "print(\"   ➜ RECORD this number in the ablation table in 04_, then change INIT_MODE and rerun to compare the 3 modes.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0874af79",
+   "metadata": {},
+   "source": [
+    "## 6. Predict DEV → answer.txt (QMOS reused from exp07 / UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b21df9af",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            for row in csv.DictReader(f):\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Reusing exp07 QMOS: {len(d)//2} wav\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer found → computing QMOS with UTMOSv2.\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if os.path.exists(wav):\n",
+    "            o = v2.predict(input_path=wav)\n",
+    "            qmos_map[n] = float(o[\"predicted_mos\"]) if isinstance(o, dict) else float(o)\n",
+    "    del v2; torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    wave = load_wav(sid)\n",
+    "    if wave is None:\n",
+    "        return None\n",
+    "    wavlm.eval(); heads.eval()\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        feat = wavlm_embed(iv, am)\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, f\"answer_{INIT_MODE}.txt\")\n",
+    "n_real = n_def = 0\n",
+    "with open(answer_path, \"w\") as f:\n",
+    "    f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "    for name in tqdm(dev_names, desc=f\"answer[{INIT_MODE}]\"):\n",
+    "        sid = stem(name)\n",
+    "        pr = predict_emotion(sid)\n",
+    "        if pr is None:\n",
+    "            emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "        else:\n",
+    "            emos, cat5, vad3 = pr; n_real += 1\n",
+    "        qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "        f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "print(f\"Wrote {len(dev_names)} rows → {answer_path} | predicted {n_real}, default {n_def}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eaebc2e5",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "43e0440b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && cp answer_{INIT_MODE}.txt answer.txt && zip -j submission_track2_exp12_{INIT_MODE}.zip answer.txt && unzip -l submission_track2_exp12_{INIT_MODE}.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, f\"submission_track2_exp12_{INIT_MODE}.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ade69063",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Run 3 times**, changing `INIT_MODE` (\"scratch\"→\"base\"→\"sailer\"), and record `mean SRCC` from each run in the ABLATION TABLE\n",
+    "  in `docs/04_experiments_log.md` → this answers the mentor with numbers: is from-scratch better than fine-tuning?\n",
+    "- **scratch is heavy:** it unfreezes all of WavLM-large. On OOM → reduce `BATCH` (4→2) or `MAX_SECONDS` (6→5),\n",
+    "  or switch to `microsoft/wavlm-base-plus` (edit cell 2) to make it feasible (note: different architecture → a less fair comparison).\n",
+    "- **scratch is slow and needs more epochs** (random init): keep `EPOCHS=15`, `PATIENCE=5`. It will still most likely score below base/sailer.\n",
+    "- **Do not confuse internal VAL with DEV.** Comparing the 3 modes on internal VAL is enough to draw a conclusion; to be sure, submit the best mode.\n",
+    "- The checkpoint is saved as `ft_wavlm_<mode>.pt`. Use Save Version after each run."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
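Sections 3 and 5 of the notebook above turn per-rater emotion votes into a soft label distribution and train the categorical head with a soft cross-entropy. The sketch below illustrates that idea with plain Python lists instead of NumPy/PyTorch. `votes_to_distribution` and `soft_cross_entropy` are illustrative names, not functions from the repository, and the alias table (`anger` to `angry`, and so on) is left out for brevity.

```python
import math

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]

def votes_to_distribution(cells):
    """Count rater votes per emotion and normalize them to a probability vector."""
    counts = [0.0] * len(EMOTIONS5)
    for cell in cells:
        # Treat "/", ";", "|" and spaces as vote separators, like the notebook does.
        for sep in "/;| ":
            cell = cell.replace(sep, ",")
        for tok in cell.split(","):
            tok = tok.strip().lower()
            if tok in EMOTIONS5:
                counts[EMOTIONS5.index(tok)] += 1.0
    total = sum(counts)
    if total == 0:
        # No usable vote: fall back to a uniform distribution.
        return [1.0 / len(EMOTIONS5)] * len(EMOTIONS5)
    return [c / total for c in counts]

def soft_cross_entropy(logits, target_dist):
    """-sum_k target_k * log_softmax(logits)_k for a single example."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target_dist, logits))

# Three raters: two votes for happy, one for neutral, one for sad.
dist = votes_to_distribution(["happy", "happy/neutral", "sad"])
# With all-zero logits the model predicts a uniform distribution.
loss_uniform = soft_cross_entropy([0.0] * 5, dist)
```

With uniform predictions the loss equals `log(5)` whatever the target distribution is, which gives a sanity baseline for the `cat` loss early in training.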

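When `USE_UNCERTAINTY` is on, the training cell combines the five task losses as `exp(-log_var[i]) * L[i] + log_var[i]`. The sketch below shows that weighting with plain floats, assuming scalar losses. `uncertainty_weighted_loss` is a hypothetical helper name used only for illustration.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Sum of exp(-s_i) * L_i + s_i over tasks, where s_i is a learned log-variance."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

losses = [1.0, 2.0]

# With every log-variance at zero (the notebook's initial value) this is a plain sum.
plain = uncertainty_weighted_loss(losses, [0.0, 0.0])

# For a fixed loss L > 0, each term is minimized at s = log(L), where it equals 1 + log(L).
tuned = uncertainty_weighted_loss(losses, [math.log(l) for l in losses])
```

One consequence to keep in mind: a task whose loss is always zero contributes only `s_i`, so its log-variance has a constant gradient of 1 and keeps drifting downward. That appears to apply to the `val`/`aro`/`dom` terms when the training CSV has no VAD columns.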
track2/exp12_wavlm_scratch_pipeline.py ADDED Viewed

	@@ -0,0 +1,564 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp12 (WavLM: SCRATCH vs BASE vs SAILER, initialization ablation), Kaggle T4
+#
+# **Purpose:** test the mentor's hypothesis: *"with 12k samples, is training from scratch better than fine-tuning?"*
+# One notebook; change the `INIT_MODE` flag to run 3 ways of initializing the WavLM backbone, compared on the SAME architecture/data:
+#
+# | INIT_MODE | WavLM initialization | What is trained | Meaning |
+# |---|---|---|---|
+# | `scratch` | **random** (no pretraining) | the **entire** backbone | "from scratch" in the sense the mentor meant |
+# | `base`    | microsoft/wavlm-large (SSL-pretrained, NO emotion) | top N layers unfrozen | baseline for measuring the benefit of the SAILER warm-start |
+# | `sailer`  | emotion warm-start (as in exp08) | top N layers unfrozen | the current strongest variant |
+#
+# **WavLM only** (audeering dropped) to isolate the "initialization" variable. QMOS is reused from exp07 / UTMOSv2.
+#
+# ## ⚠️ Honest expectations (to read the results correctly)
+# - `scratch` is almost CERTAINLY weaker than `base`/`sailer`: 12k samples are far too few to teach WavLM to "listen" from zero
+#   (SSL pretraining used ~94,000 HOURS of audio). This ablation exists to **prove it with numbers**, not to beat the other modes.
+# - `scratch` must unfreeze EVERYTHING (otherwise there is nothing to learn) → **heavy, slow, and OOM-prone** on a T4. Start with a small LIMIT.
+# - Comparing the 3 modes on **internal VAL** is enough to draw a conclusion; to be sure, submit the best mode to DEV.
+#
+# **How to run:** GPU T4 + Internet On → edit cell 0 (`INIT_MODE` + slug) → Run All. Run 3 times, changing INIT_MODE.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+INIT_MODE = "sailer"   # << "scratch" | "base" | "sailer"  (change and rerun to compare); "sailer" = emotion warm-started WavLM
+DATA_ROOT    = "/kaggle/input/datasets/minhtoan2/vmc2026-track2-full"   # << EDIT the slug to match your Add Input
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"   # << (optional) reuse the exp07 QMOS (0.548); if missing → UTMOSv2
+OUT_DIR      = "/kaggle/working"
+# ── Hyperparameters ──────────────────────────────────────────────────────────
+DEVICE          = "cuda"
+SR              = 16000
+MAX_SECONDS     = 6
+TRUNK_HIDDEN    = 512
+HEAD_HIDDEN     = 128
+DROPOUT         = 0.3
+WEIGHT_DECAY    = 1e-5
+EPOCHS          = 15
+PATIENCE        = 5
+BATCH           = 4
+ACCUM           = 8
+VAL_FRAC        = 0.10
+SEED            = 42
+USE_AMP         = True
+USE_GRAD_CKPT   = True
+USE_UNCERTAINTY = True
+# Initialization, LR, and unfreezing are set AUTOMATICALLY from INIT_MODE (scratch needs a larger LR and a fully unfrozen backbone)
+if INIT_MODE == "scratch":
+    UNFREEZE_TOP_LAYERS = "all"     # random init → everything must be trained to learn anything
+    LR_BACKBONE = 1e-4              # random init needs larger steps than fine-tuning
+elif INIT_MODE in ("base", "sailer"):
+    UNFREEZE_TOP_LAYERS = 6         # fine-tune: unfreeze only the top N layers (saves VRAM, limits overfitting)
+    LR_BACKBONE = 1e-5
+else:
+    raise ValueError(f"Unknown INIT_MODE: {INIT_MODE}")
+LR_HEAD = 1e-3
+LIMIT_TRAIN = 300    # << FIRST RUN 300; full run None
+LIMIT_DEV   = 20     # << FIRST RUN 20; full run None
+EXP08 = {"emos": 0.811, "cat_err": 0.133, "val": 0.659, "aro": 0.793, "dom": 0.751}  # DEV reference numbers
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print(f"INIT_MODE = {INIT_MODE} | UNFREEZE = {UNFREEZE_TOP_LAYERS} | LR_BACKBONE = {LR_BACKBONE}")
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Setup (clone SAILER only when INIT_MODE='sailer')
+# %%
+import sys, subprocess
+import numpy as _np
+# ⚠️ PIN numpy to the version Kaggle already ships → pip must NOT upgrade/downgrade numpy → avoids "SystemError: bad call flags"
+# (a torch import error caused by a numpy version that mismatches the prebuilt torch).
+_NPIN = f"numpy=={_np.__version__}"
+print("Pinning numpy at:", _NPIN)
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs, _NPIN], check=True)
+# Kaggle already provides torch/transformers/librosa/scipy/sklearn/pandas/tqdm/huggingface_hub/safetensors.
+# Only install the few missing speech packages (together with the numpy pin above).
+pip_install("loralib", "speechmos", "soundfile")
+if INIT_MODE == "sailer":
+    pip_install("speechbrain")
+if INIT_MODE == "sailer":
+    REPO_DIR = "/kaggle/working/vox-profile-release"
+    if not os.path.exists(REPO_DIR):
+        subprocess.run(["git", "clone", "--depth", "1",
+                        "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+    if REPO_DIR not in sys.path:
+        sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Build WavLM according to INIT_MODE
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import numpy as np
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (very slow!)")
+from transformers import WavLMModel, WavLMConfig
+def find_hf_backbone(module):
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+if INIT_MODE == "scratch":
+    # Random init, but keep the exact large architecture (for a fair comparison with base/sailer)
+    cfg = WavLMConfig.from_pretrained("microsoft/wavlm-large")
+    wavlm = WavLMModel(cfg)   # no weights loaded, so parameters are random
+    print("🎲 WavLM-large with RANDOM init (from scratch, no pretraining).")
+elif INIT_MODE == "base":
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("📦 WavLM-large SSL-pretrained (not yet trained on emotion).")
+elif INIT_MODE == "sailer":
+    try:
+        from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+        _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
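+        name, wavlm = find_hf_backbone(_wrapper)
+        if wavlm is None:   # no HF-style backbone found: raise so the fallback below runs
+            raise RuntimeError("no WavLM backbone found inside the SAILER wrapper")
+        print(f"🔥 WavLM warm-started from SAILER (emotion) at '.{name}'")
+    except Exception as e:
+        print("⚠️ Failed to load SAILER:", repr(e), "→ falling back to base pretrained.")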
+        wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+wavlm = wavlm.to(device)
+WAVLM_DIM = int(wavlm.config.hidden_size)
+wavlm.config.layerdrop = 0.0   # ⚠️ REQUIRED with gradient checkpointing (avoids CheckpointError from randomly dropped layers)
+# Unfreeze according to the config
+if UNFREEZE_TOP_LAYERS == "all":
+    for p in wavlm.parameters():
+        p.requires_grad = True
+    n_open = "ALL"
+else:
+    for p in wavlm.parameters():
+        p.requires_grad = False
+    _wl = wavlm.encoder.layers
+    for layer in _wl[max(0, len(_wl) - UNFREEZE_TOP_LAYERS):]:
+        for p in layer.parameters():
+            p.requires_grad = True
+    n_open = f"top {min(UNFREEZE_TOP_LAYERS, len(_wl))}/{len(_wl)}"
+print(f"WavLM unfrozen: {n_open} → {sum(p.numel() for p in wavlm.parameters() if p.requires_grad)/1e6:.1f}M trainable params (dim {WAVLM_DIM})")
+if USE_GRAD_CKPT:
+    wavlm.gradient_checkpointing_enable()
+    if hasattr(wavlm, "enable_input_require_grads"):
+        wavlm.enable_input_require_grads()
+def masked_mean(hidden, attn_mask):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+def wavlm_embed(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state
+    return masked_mean(out, attn_mask)
+# %% [markdown]
+# ## 3. Read and aggregate labels by wavID
+# %%
+import librosa
+import pandas as pd
+from tqdm.auto import tqdm
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"eMOS column not found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}")
+# %% [markdown]
+# ## 4. Dataset/loader (raw waveform only, for WavLM)
+# %%
+from torch.utils.data import Dataset, DataLoader
+from sklearn.model_selection import train_test_split
+train_stems = [s for s in train_df["wavID"] if target_map.get(s) is not None]
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+lab = train_df.set_index("wavID")
+def _zfit(a):
+    a = np.asarray(a, dtype=np.float32); return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)
+emos_mu, emos_sd = _zfit([lab.loc[s, "emos"] for s in train_stems])
+if HAS_VAD:
+    vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in ["val", "aro", "dom"]], np.float32)
+    vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in ["val", "aro", "dom"]], np.float32)
+else:
+    vad_mu = np.zeros(3, np.float32); vad_sd = np.ones(3, np.float32)
+print(f"Normalization: emos μ={emos_mu:.3f} σ={emos_sd:.3f}")
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+def load_wav(sid):
+    p = os.path.join(WAV_DIR, sid if str(sid).endswith(".wav") else str(sid) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    return wave[: MAX_SECONDS * SR].astype(np.float32)
+class EmoDataset(Dataset):
+    def __init__(self, stems):
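+        # Existence check only: same filter as `load_wav(s) is not None`, without decoding every wav up front
+        self.stems = [s for s in stems
+                      if os.path.exists(os.path.join(WAV_DIR, s if str(s).endswith(".wav") else str(s) + ".wav"))]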
+    def __len__(self):
+        return len(self.stems)
+    def __getitem__(self, i):
+        s = self.stems[i]
+        wave = load_wav(s)
+        emos = (float(lab.loc[s, "emos"]) - emos_mu) / emos_sd
+        if HAS_VAD:
+            vad = (np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32) - vad_mu) / vad_sd
+        else:
+            vad = np.zeros(3, dtype=np.float32)
+        cat = np.array([lab.loc[s, f"cat{j}"] for j in range(len(EMOTIONS5))], dtype=np.float32)
+        return {"wave": wave, "tgt": onehot_target(target_map.get(s)),
+                "emos": np.float32(emos), "vad": vad, "cat": cat,
+                "emos_raw": np.float32(lab.loc[s, "emos"]),
+                "vad_raw": np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32)}
+def collate(batch):
+    L = max(len(b["wave"]) for b in batch)
+    waves = np.zeros((len(batch), L), dtype=np.float32)
+    mask = np.zeros((len(batch), L), dtype=np.float32)
+    for i, b in enumerate(batch):
+        waves[i, : len(b["wave"])] = b["wave"]; mask[i, : len(b["wave"])] = 1.0
+    return {
+        "input_values": torch.from_numpy(waves), "attn_mask": torch.from_numpy(mask).long(),
+        "tgt": torch.from_numpy(np.stack([b["tgt"] for b in batch])),
+        "emos": torch.from_numpy(np.stack([b["emos"] for b in batch])).unsqueeze(1),
+        "vad": torch.from_numpy(np.stack([b["vad"] for b in batch])),
+        "cat": torch.from_numpy(np.stack([b["cat"] for b in batch])),
+        "emos_raw": np.stack([b["emos_raw"] for b in batch]),
+        "vad_raw": np.stack([b["vad_raw"] for b in batch]),
+    }
+ds = EmoDataset(train_stems)
+print("Valid dataset:", len(ds), "wavs")
+tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)
+tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)
+va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)
+# %% [markdown]
+# ## 5. Heads + train loop
+# %%
+from scipy.stats import spearmanr
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(WAVLM_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+bb_params = [p for p in wavlm.parameters() if p.requires_grad]
+head_params = list(heads.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.AdamW([{"params": bb_params, "lr": LR_BACKBONE},
+                         {"params": head_params, "lr": LR_HEAD}], weight_decay=WEIGHT_DECAY)
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()
+def forward_batch(b):
+    feat = wavlm_embed(b["input_values"].to(device), b["attn_mask"].to(device))
+    return heads(feat, b["tgt"].to(device))
+def compute_loss(emos_p, cat_l, vad_p, b):
+    L = {}
+    L["emos"] = mse(emos_p, b["emos"].to(device))
+    L["cat"] = soft_ce(cat_l, b["cat"].to(device))
+    if HAS_VAD:
+        vt = b["vad"].to(device)
+        L["val"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L["aro"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L["dom"] = mse(vad_p[:, 2:3], vt[:, 2:3])
+    else:
+        z = torch.zeros((), device=device); L["val"] = L["aro"] = L["dom"] = z
+    if USE_UNCERTAINTY:
+        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))
+    return sum(L.values())
+@torch.no_grad()
+def evaluate():
+    wavlm.eval(); heads.eval()
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for b in va_loader:
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+        P["emos"] += emos_p.float().cpu().numpy().ravel().tolist(); Y["emos"] += b["emos_raw"].tolist()
+        vad_p = vad_p.float().cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t] += vad_p[:, j].tolist(); Y[t] += b["vad_raw"][:, j].tolist()
+        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b["cat"])
+    out = {}
+    for t in ["emos"] + (["val", "aro", "dom"] if HAS_VAD else []):
+        out[t] = spearmanr(P[t], Y[t]).correlation
+    q = np.concatenate(catP); p = np.concatenate(catY)
+    out["cat_err"] = float(np.abs(q - p).sum(1).mean())
+    return out
+def mean_srcc(m):
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+CKPT_PATH = os.path.join(OUT_DIR, f"ft_wavlm_{INIT_MODE}.pt")
+def save_full(state, val_emos=float("nan")):
+    torch.save({"wavlm": state["wavlm"], "heads": state["heads"], "INIT_MODE": INIT_MODE,
+                "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+                "WAVLM_DIM": WAVLM_DIM, "val_emos": float(val_emos)}, CKPT_PATH)
+best, best_state, bad = -1e9, None, 0
+for ep in range(1, EPOCHS + 1):
+    wavlm.train(); heads.train()
+    opt.zero_grad(); run = 0.0; nb = 0
+    for step, b in enumerate(tqdm(tr_loader, desc=f"[{INIT_MODE}] epoch {ep}")):
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM
+        scaler.scale(loss).backward()
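+        # Also step on the last batch so a partial accumulation window is not silently discarded
+        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):
+            scaler.step(opt); scaler.update(); opt.zero_grad()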
+        run += loss.item() * ACCUM; nb += 1
+    m = evaluate(); sc = mean_srcc(m)
+    msg = " ".join(f"{k}={m[k]:.3f}" for k in ["emos", "val", "aro", "dom"] if k in m)
+    print(f"[{INIT_MODE}] epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc
+        best_state = {"wavlm": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},
+                      "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+        save_full(best_state, m["emos"]); bad = 0
+        print(f"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})")
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop at epoch {ep}."); break
+if best_state:
+    wavlm.load_state_dict(best_state["wavlm"]); heads.load_state_dict(best_state["heads"])
+final = evaluate()
+print(f"\n✅ Internal VAL, exp12 INIT_MODE={INIT_MODE}:")
+print(f"   EMOS={final['emos']:.4f}", end="")
+if HAS_VAD:
+    print(f" | VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f}")
+else:
+    print()
+print(f"   cat_err={final['cat_err']:.4f} | mean SRCC={mean_srcc(final):.4f}")
+print(f"   (exp08 DEV reference: EMOS {EXP08['emos']}, VAD {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})")
+print("   ➜ RECORD this number in the ablation table in 04_, then change INIT_MODE and rerun to compare the 3 modes.")
+# %% [markdown]
+# ## 6. Predict DEV → answer.txt (QMOS borrowed from exp07 / UTMOSv2)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+print("DEV:", len(dev_names), "samples")
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            for row in csv.DictReader(f):
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Borrowed QMOS from exp07: {len(d)//2} wavs")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ No exp07 answer found → computing QMOS with UTMOSv2.")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if os.path.exists(wav):
+            o = v2.predict(input_path=wav)
+            qmos_map[n] = float(o["predicted_mos"]) if isinstance(o, dict) else float(o)
+    del v2; torch.cuda.empty_cache() if device == "cuda" else None
+@torch.no_grad()
+def predict_emotion(sid):
+    wave = load_wav(sid)
+    if wave is None:
+        return None
+    wavlm.eval(); heads.eval()
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        feat = wavlm_embed(iv, am)
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+answer_path = os.path.join(OUT_DIR, f"answer_{INIT_MODE}.txt")
+n_real = n_def = 0
+with open(answer_path, "w") as f:
+    f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+    for name in tqdm(dev_names, desc=f"answer[{INIT_MODE}]"):
+        sid = stem(name)
+        pr = predict_emotion(sid)
+        if pr is None:
+            emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+        else:
+            emos, cat5, vad3 = pr; n_real += 1
+        qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+        f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+print(f"Wrote {len(dev_names)} rows → {answer_path} | real {n_real}, default {n_def}")
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Row {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && cp answer_{INIT_MODE}.txt answer.txt && zip -j submission_track2_exp12_{INIT_MODE}.zip answer.txt && unzip -l submission_track2_exp12_{INIT_MODE}.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, f"submission_track2_exp12_{INIT_MODE}.zip"))
+# %% [markdown]
+# ## Notes
+# - **Run 3 times**, changing `INIT_MODE` ("scratch"→"base"→"sailer"), and record the `mean SRCC` of each run in the ABLATION TABLE
+#   in `docs/04_experiments_log.md` → this answers the mentor's question with numbers: is from-scratch better than fine-tuning?
+# - **scratch is heavy:** it unfreezes all of WavLM-large. On OOM → reduce `BATCH` (4→2) or `MAX_SECONDS` (6→5),
+#   or switch to `microsoft/wavlm-base-plus` (edit cell 2) to make it feasible (note: different architecture → a less fair comparison).
+# - **scratch is slow and needs more epochs** (random init): set `EPOCHS=15`, `PATIENCE=5`. It will still most likely score < base/sailer.
+# - **Do not confuse internal VAL with DEV.** Comparing the 3 modes on internal VAL is enough to draw a conclusion; to be sure, submit the best mode.
+# - Checkpoints are saved as `ft_wavlm_<mode>.pt`. Save Version after every run.
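The three-run comparison described in the notes can be logged with a small helper. The following is only a minimal sketch and not part of the pipeline: `ablation_mean_srcc`, `ablation_row` and `best_mode` are hypothetical names, the metric dictionaries are assumed to carry the keys that `evaluate()` returns (`emos`, plus `val`/`aro`/`dom` when VAD labels exist), and the row layout is an assumption about the table in `docs/04_experiments_log.md`.

```python
import math

# Hypothetical helpers (not in the pipeline) for logging one ablation row per INIT_MODE.
SRCC_KEYS = ("emos", "val", "aro", "dom")


def ablation_mean_srcc(metrics, has_vad=True):
    """Average SRCC over the same tasks that mean_srcc() averages in the script."""
    keys = SRCC_KEYS if has_vad else SRCC_KEYS[:1]
    values = [float(metrics[k]) for k in keys]
    # A NaN correlation (e.g. constant predictions) propagates to the mean on purpose,
    # so a broken run cannot be mistaken for a good one.
    return sum(values) / len(values)


def ablation_row(init_mode, metrics, has_vad=True):
    """Format one markdown table row: mode, per-task SRCC, mean SRCC."""
    keys = SRCC_KEYS if has_vad else SRCC_KEYS[:1]
    cells = [f"{float(metrics[k]):.4f}" for k in keys]
    mean = ablation_mean_srcc(metrics, has_vad)
    return f"| {init_mode} | " + " | ".join(cells) + f" | {mean:.4f} |"


def best_mode(results, has_vad=True):
    """Return the INIT_MODE with the highest mean SRCC, ignoring runs whose mean is NaN."""
    scored = {mode: ablation_mean_srcc(m, has_vad) for mode, m in results.items()}
    valid = {mode: s for mode, s in scored.items() if not math.isnan(s)}
    return max(valid, key=valid.get) if valid else None
```

At the end of each run, `print(ablation_row(INIT_MODE, final, HAS_VAD))` would produce a line that can be pasted into the table; once all three runs are recorded, `best_mode` picks the mode worth submitting.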

track2/exp13_finetune_qmos.ipynb ADDED

	@@ -0,0 +1,733 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d3c827cb",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp13 (FINE-TUNE UTMOS for QMOS) + 6-column answer (Kaggle)\n",
+    "\n",
+    "**Goal:** the best QMOS so far is 0.548 (exp07, FROZEN head + UTMOS anchor). exp13 tries **UNFREEZING\n",
+    "(fine-tuning) UTMOS directly** on the real `qMOS` labels of Track 2 → pulling the quality model into the\n",
+    "emotional-speech domain. It then **borrows the 5 emotion columns from the exp08 checkpoint** (`ft_emotion_full_20epoch.pt`, the BEST version)\n",
+    "→ and assembles the 6-column `answer.txt`.\n",
+    "\n",
+    "## Why fine-tune UTMOS (and not UTMOSv2)\n",
+    "- UTMOS (`utmos22_strong`, tarepan/SpeechMOS) is **a single model**, loaded via `torch.hub`, that **already predicts\n",
+    "  QMOS by itself** → an ideal warm start for the quality column (unlike UTMOSv2, an ensemble of several folds + 2 streams → hard to train).\n",
+    "- forward: `model(wave[B,T], sr) -> MOS[B]`; it is a standard `nn.Module` → the whole model can be backpropagated through.\n",
+    "- **No separate UTMOS anchor** (decided): when fine-tuning UTMOS itself, the \"anchor\" already lives in the warm-start\n",
+    "  weights → an external head/anchor is redundant.\n",
+    "\n",
+    "## Design\n",
+    "```\n",
+    " [PART A] wav ─► UTMOS (utmos22_strong, TRAINABLE, warm-start pretrained) ─► QMOS    (trained on gold qMOS)\n",
+    " [PART B] wav ─► WavLM(exp08 ft) + audeering(frozen) ─► EMOS/CAT/VAD               (LOAD ckpt, inference only)\n",
+    " [PART C] merge QMOS(A) + 5 emotion columns(B) ─► 6-column answer.txt ─► validate ─► zip\n",
+    "```\n",
+    "\n",
+    "## ⚠️ Know before you start\n",
+    "- Fine-tuning = **no caching** (every epoch reruns the UTMOS forward+backward pass) → costs GPU hours. **The first run MUST use\n",
+    "  `LIMIT_TRAIN=300`, `LIMIT_DEV=20`** to shake out problems before switching to `None`.\n",
+    "- Safety net: submit the fine-tuned QMOS only if its **internal-val SRCC > zero-shot UTMOS** (Part A prints both numbers).\n",
+    "- **Save the `ft_qmos_utmos.pt` checkpoint at every new best + Save Version IMMEDIATELY** (lesson from exp08: if the kernel dies, everything is lost).\n",
+    "\n",
+    "**How to run on Kaggle:** GPU **T4** + Internet **On** → Add Input (1) the Track 2 dataset, (2) the dataset containing\n",
+    "`ft_emotion_full_20epoch.pt` (exp08), (3) optionally the `aud_dev.npz` cache → edit the slugs in cell 0 → Run All."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6a78806d",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1374fa7d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, shutil, glob\n",
+    "\n",
+    "# ── AUTO-DETECT DATA_ROOT (scan /kaggle/input for a directory with sets/train.csv + wav/ + metadata.csv) ──\n",
+    "def find_data_root(search_root=\"/kaggle/input\"):\n",
+    "    cands = []\n",
+    "    for train_csv in glob.glob(os.path.join(search_root, \"**\", \"sets\", \"train.csv\"), recursive=True):\n",
+    "        root = os.path.dirname(os.path.dirname(train_csv))          # .../<root>/sets/train.csv → <root>\n",
+    "        score = os.path.isdir(os.path.join(root, \"wav\")) + os.path.exists(os.path.join(root, \"metadata.csv\"))\n",
+    "        cands.append((score, root))\n",
+    "    cands.sort(reverse=True)                                        # prefer directories that have both wav + metadata\n",
+    "    return cands\n",
+    "\n",
+    "_cands = find_data_root(\"/kaggle/input\")\n",
+    "if _cands:\n",
+    "    print(\"🔎 DATA_ROOT candidates (higher score = has wav+metadata):\")\n",
+    "    for sc, r in _cands:\n",
+    "        print(f\"   [{sc}/2] {r}\")\n",
+    "    DATA_ROOT = _cands[0][1]\n",
+    "    print(f\"👉 Auto-selected DATA_ROOT = {DATA_ROOT}\")\n",
+    "else:\n",
+    "    DATA_ROOT = \"/kaggle/input/datasets/minhtoan2\"   # fallback; edit by hand if auto-detection finds nothing\n",
+    "    print(f\"❌ sets/train.csv not found under /kaggle/input → using fallback {DATA_ROOT} (did you Add Input?)\")\n",
+    "\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (for the emotion columns)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "# ── exp08 emotion checkpoint (to produce the 5 EMOS/CAT/VAD columns) ─────────────────────\n",
+    "# ⭐ BEST = ft_emotion_full_20epoch.pt (the 20-epoch version); use this one, NOT ft_emotion_full.pt.\n",
+    "EMO_CKPT     = \"/kaggle/input/ft-emotion-full/ft_emotion_full_20epoch.pt\"   # << exp08 20-epoch ckpt (INCLUDES the WavLM backbone)\n",
+    "CACHE_INPUT  = \"/kaggle/input/ft-emotion-cache\"                     # << (optional) directory containing aud_dev.npz; \"\" if none\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/ft_cache\"     # /kaggle/input is read-only → copy the audeering cache here\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── PART A: fine-tune UTMOS (QMOS) ───────────────────────────────────────────\n",
+    "DEVICE          = \"cuda\"\n",
+    "SR              = 16000\n",
+    "QMOS_MAX_SEC    = 12          # crop audio to bound backprop memory (UTMOS); on OOM reduce to 10/8\n",
+    "LR              = 1e-5        # small LR for fine-tuning (the warm start is already good)\n",
+    "WEIGHT_DECAY    = 1e-5\n",
+    "EPOCHS          = 10          # CEILING; early stopping decides the actual number of epochs\n",
+    "PATIENCE        = 3\n",
+    "BATCH           = 1          # UTMOS forward takes NO attention mask → BATCH=1 is safe (zero padding would skew the pooling)\n",
+    "ACCUM           = 16         # effective batch = BATCH*ACCUM = 16\n",
+    "VAL_FRAC        = 0.10\n",
+    "SEED            = 42\n",
+    "USE_AMP         = True\n",
+    "RANK_LAMBDA     = 0.0         # 0 = MSE only. >0 (e.g. 0.3) = add a pairwise ranking loss (optimizes rank order, i.e. SRCC, directly)\n",
+    "FREEZE_FEAT_EXT = True        # freeze the UTMOS feature extractor (CNN conv) → saves VRAM + limits overfitting\n",
+    "\n",
+    "# ── PART B: emotion inference (MUST match the exp08 architecture) ─────────────────────\n",
+    "EMO_MAX_SEC         = 8\n",
+    "UNFREEZE_TOP_LAYERS = 6       # matches the exp08 ckpt\n",
+    "TRUNK_HIDDEN        = 512\n",
+    "HEAD_HIDDEN         = 128\n",
+    "DROPOUT             = 0.3\n",
+    "USE_AUDEERING       = True    # matches the exp08 ckpt\n",
+    "\n",
+    "LIMIT_TRAIN = 300            # << FIRST RUN 300; real run None\n",
+    "LIMIT_DEV   = 20             # << FIRST RUN 20; real run None\n",
+    "\n",
+    "# QMOS reference points for comparison (DEV leaderboard)\n",
+    "QMOS_BASELINE = {\"utmos_zeroshot\": 0.414, \"exp07_head\": 0.548}\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP, EMO_CKPT]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)\n",
+    "print(f\"Fine-tune UTMOS: LR {LR} · BATCH {BATCH}×ACCUM {ACCUM} · MAX {QMOS_MAX_SEC}s · rank λ {RANK_LAMBDA}\")\n",
+    "\n",
+    "# Copy the audeering cache (aud_dev.npz) from the read-only input to working (so the emotion columns need no re-extraction)\n",
+    "if CACHE_INPUT and os.path.isdir(CACHE_INPUT):\n",
+    "    n = 0\n",
+    "    for fn in os.listdir(CACHE_INPUT):\n",
+    "        if fn.startswith(\"aud_\") and fn.endswith(\".npz\"):\n",
+    "            shutil.copy(os.path.join(CACHE_INPUT, fn), os.path.join(CACHE_DIR, fn)); n += 1\n",
+    "    print(f\"📦 Copied {n} audeering cache file(s) from {CACHE_INPUT} → {CACHE_DIR}\")\n",
+    "else:\n",
+    "    print(\"ℹ️ No CACHE_INPUT → audeering features for DEV will be extracted from scratch (slower on the first run).\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5e568431",
+   "metadata": {},
+   "source": [
+    "## 1. Installation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "731d1056",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"speechmos\", \"loralib\", \"speechbrain\", \"librosa\", \"soundfile\",\n",
+    "            \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "# SAILER code (to build the exact WavLM architecture, then load the exp08 ckpt over it); only needed for PART B\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "732b81f8",
+   "metadata": {},
+   "source": [
+    "## 2. Gold qMOS labels (averaged per wav), as in exp06/exp09a"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8cfb94af",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def load_qmos_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col  = cols.get(\"wavid\") or cols.get(\"wav\") or list(df.columns)[1]\n",
+    "    qmos_col = cols.get(\"qmos\")  or cols.get(\"mos\")\n",
+    "    assert qmos_col, f\"qMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    g = df.groupby(\"_stem\")[qmos_col].mean()\n",
+    "    return {s: float(v) for s, v in g.items()}\n",
+    "\n",
+    "qmos_gold = load_qmos_labels()\n",
+    "print(f\"Train wavs with qMOS labels: {len(qmos_gold)}\")\n",
+    "_vals = np.array(list(qmos_gold.values()))\n",
+    "print(f\"qMOS gold: mean {_vals.mean():.3f} · std {_vals.std():.3f} · min {_vals.min():.2f} · max {_vals.max():.2f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "374534a0",
+   "metadata": {},
+   "source": [
+    "## 3. PART A: Fine-tune UTMOS on qMOS\n",
+    "UTMOS outputs MOS on a ~1–5 scale (already warm-started) → train with MSE on the ORIGINAL scale (no z-scoring, to preserve the meaning of the warm start).\n",
+    "`BATCH=1` + gradient accumulation: avoids padding (the UTMOS forward takes no attention mask)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4636d35c",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import librosa\n",
+    "from tqdm.auto import tqdm\n",
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (very slow!)\")\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "\n",
+    "# Load UTMOS (torch.hub): an nn.Module with forward(wave[B,T], sr) -> MOS[B]\n",
+    "utmos = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\", trust_repo=True).to(device)\n",
+    "n_all = sum(p.numel() for p in utmos.parameters())\n",
+    "\n",
+    "# (optional) freeze the feature extractor (the conv layers that extract features) → saves VRAM + limits overfitting\n",
+    "if FREEZE_FEAT_EXT:\n",
+    "    n_frozen = 0\n",
+    "    for name, p in utmos.named_parameters():\n",
+    "        if \"feature_extractor\" in name or \"feature_projection\" in name or \"conv\" in name.lower():\n",
+    "            p.requires_grad = False; n_frozen += p.numel()\n",
+    "    print(f\"❄️ Froze feature extractor: {n_frozen/1e6:.1f}M / {n_all/1e6:.1f}M params\")\n",
+    "n_train = sum(p.numel() for p in utmos.parameters() if p.requires_grad)\n",
+    "print(f\"UTMOS: {n_all/1e6:.1f}M total params · {n_train/1e6:.1f}M trainable params\")\n",
+    "\n",
+    "def load_wav_qmos(sid):\n",
+    "    p = os.path.join(WAV_DIR, sid + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: QMOS_MAX_SEC * SR].astype(np.float32)\n",
+    "\n",
+    "# QMOS train set: only wavs that exist on disk\n",
+    "train_stems_q = [s for s in qmos_gold if os.path.exists(os.path.join(WAV_DIR, s + \".wav\"))]\n",
+    "np.random.shuffle(train_stems_q)\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems_q = train_stems_q[:LIMIT_TRAIN]\n",
+    "tr_q, va_q = train_test_split(train_stems_q, test_size=VAL_FRAC, random_state=SEED)\n",
+    "print(f\"QMOS train: {len(tr_q)} · internal val: {len(va_q)}\")\n",
+    "\n",
+    "opt = torch.optim.AdamW([p for p in utmos.parameters() if p.requires_grad],\n",
+    "                        lr=LR, weight_decay=WEIGHT_DECAY)\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def utmos_forward(wave_np):\n",
+    "    \"\"\"One numpy wav -> scalar MOS tensor (keeps grad).\"\"\"\n",
+    "    x = torch.from_numpy(wave_np).unsqueeze(0).to(device)   # [1, T]\n",
+    "    out = utmos(x, SR)                                       # [1] (or [1,?])\n",
+    "    return out.reshape(-1).mean()                            # scalar, safe for any output shape\n",
+    "\n",
+    "def pairwise_rank_loss(preds, targets):\n",
+    "    \"\"\"Hinge ranking over the pairs within one group (rewards correct rank order, i.e. optimizes SRCC).\"\"\"\n",
+    "    p = torch.stack(preds); t = torch.tensor(targets, device=device, dtype=torch.float32)\n",
+    "    if len(p) < 2:\n",
+    "        return torch.zeros((), device=device)\n",
+    "    sign = torch.sign(t.unsqueeze(0) - t.unsqueeze(1))\n",
+    "    diff = p.unsqueeze(0) - p.unsqueeze(1)\n",
+    "    return torch.relu(-sign * diff).mean()\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_qmos_val():\n",
+    "    utmos.eval()\n",
+    "    preds, gts = [], []\n",
+    "    for s in va_q:\n",
+    "        wave = load_wav_qmos(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            preds.append(float(utmos_forward(wave).item()))\n",
+    "        gts.append(qmos_gold[s])\n",
+    "    return float(spearmanr(preds, gts).correlation)\n",
+    "\n",
+    "# ZERO-SHOT baseline (before training) on the SAME val split → the bar to beat\n",
+    "srcc_zeroshot = eval_qmos_val()\n",
+    "print(f\"\\n📍 UTMOS zero-shot (internal val): SRCC = {srcc_zeroshot:.4f}  \"\n",
+    "      f\"(leaderboard DEV ~{QMOS_BASELINE['utmos_zeroshot']}; exp07 head {QMOS_BASELINE['exp07_head']})\")\n",
+    "\n",
+    "CKPT_QMOS = os.path.join(OUT_DIR, \"ft_qmos_utmos.pt\")\n",
+    "def save_qmos_ckpt(srcc):\n",
+    "    torch.save({\"utmos_state\": {k: v.cpu() for k, v in utmos.state_dict().items()},\n",
+    "                \"val_srcc\": float(srcc), \"raw_scale\": True,\n",
+    "                \"QMOS_MAX_SEC\": QMOS_MAX_SEC, \"FREEZE_FEAT_EXT\": FREEZE_FEAT_EXT}, CKPT_QMOS)\n",
+    "\n",
+    "best, best_state, bad = srcc_zeroshot, {k: v.cpu().clone() for k, v in utmos.state_dict().items()}, 0\n",
+    "save_qmos_ckpt(best)   # save the zero-shot weights up front (worst case still equals the baseline)\n",
+    "\n",
+    "# Samples are grouped into WINDOWS of ACCUM valid samples (counted by `micro`). Two backward modes:\n",
+    "#   • RANK off (default) → backward immediately per sample → graph is freed right away → low VRAM.\n",
+    "#   • RANK on  → ranking compares the predictions INSIDE a window → the graph of the whole window\n",
+    "#                must be kept: accumulate MSE (win_loss) and preds (buf_p), then backward ONCE (MSE_mean + λ·rank).\n",
+    "#   ⚠️ Earlier bug: a per-step MSE backward freed the graph, so the later rank_loss.backward()\n",
+    "#      raised \"backward through the graph a second time\". Accumulating and calling backward once avoids it.\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    utmos.train()\n",
+    "    opt.zero_grad()\n",
+    "    np.random.shuffle(tr_q)\n",
+    "    run = 0.0; nb = 0\n",
+    "    micro = 0; win_loss = None; buf_p, buf_t = [], []\n",
+    "    for s in tqdm(tr_q, desc=f\"epoch {ep}\"):\n",
+    "        wave = load_wav_qmos(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            pred = utmos_forward(wave)\n",
+    "            loss = mse(pred, torch.tensor(qmos_gold[s], device=device, dtype=pred.dtype))\n",
+    "        run += float(loss.item()); nb += 1\n",
+    "        if RANK_LAMBDA > 0:\n",
+    "            win_loss = loss if win_loss is None else win_loss + loss   # KEEP the graph (no backward yet)\n",
+    "            buf_p.append(pred); buf_t.append(qmos_gold[s]); micro += 1\n",
+    "        else:\n",
+    "            scaler.scale(loss / ACCUM).backward(); micro += 1            # backward right away → low VRAM\n",
+    "        if micro == ACCUM:\n",
+    "            if RANK_LAMBDA > 0:\n",
+    "                total = win_loss / micro\n",
+    "                if len(buf_p) >= 2:\n",
+    "                    total = total + RANK_LAMBDA * pairwise_rank_loss(buf_p, buf_t)\n",
+    "                scaler.scale(total).backward()\n",
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "            micro = 0; win_loss = None; buf_p, buf_t = [], []\n",
+    "    # flush the leftover window at the end of the epoch (sample count not divisible by ACCUM)\n",
+    "    if micro > 0:\n",
+    "        if RANK_LAMBDA > 0:\n",
+    "            total = win_loss / micro\n",
+    "            if len(buf_p) >= 2:\n",
+    "                total = total + RANK_LAMBDA * pairwise_rank_loss(buf_p, buf_t)\n",
+    "            scaler.scale(total).backward()\n",
+    "        scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "    sc = eval_qmos_val()\n",
+    "    print(f\"epoch {ep:2d} | loss {run/max(nb,1):.4f} | val SRCC {sc:.4f} \"\n",
+    "          f\"(zero-shot {srcc_zeroshot:.4f} · best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc\n",
+    "        best_state = {k: v.cpu().clone() for k, v in utmos.state_dict().items()}\n",
+    "        save_qmos_ckpt(best)\n",
+    "        print(f\"   💾 saved best → {CKPT_QMOS} (epoch {ep}, SRCC {sc:.4f})\")\n",
+    "        bad = 0\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stop at epoch {ep}.\"); break\n",
+    "\n",
+    "utmos.load_state_dict(best_state)\n",
+    "print(f\"\\n✅ PART A done. Internal-val QMOS: zero-shot {srcc_zeroshot:.4f} → fine-tuned {best:.4f} \"\n",
+    "      + (\"🚀 improved\" if best > srcc_zeroshot + 1e-4 else \"➖ did NOT beat zero-shot\"))\n",
+    "if best <= srcc_zeroshot + 1e-4:\n",
+    "    print(\"   ⚠️ Fine-tuning has not beaten zero-shot → consider raising EPOCHS / setting RANK_LAMBDA=0.3 / \"\n",
+    "          \"unfreezing the feature extractor (FREEZE_FEAT_EXT=False); or keep the exp07 QMOS (0.548).\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d2448fa6",
+   "metadata": {},
+   "source": [
+    "## 4. PART A (cont.): Predict QMOS for DEV with the fine-tuned UTMOS"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f8463ca5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_qmos(name):\n",
+    "    p = os.path.join(WAV_DIR, name if str(name).endswith(\".wav\") else str(name) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    wave = wave[: QMOS_MAX_SEC * SR].astype(np.float32)\n",
+    "    utmos.eval()\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        v = float(utmos_forward(wave).item())\n",
+    "    return float(np.clip(v, 1.0, 5.0))\n",
+    "\n",
+    "qmos_pred = {}\n",
+    "n_real = n_def = 0\n",
+    "for name in tqdm(dev_names, desc=\"QMOS dev\"):\n",
+    "    v = predict_qmos(name)\n",
+    "    if v is None:\n",
+    "        v = 3.0; n_def += 1\n",
+    "    else:\n",
+    "        n_real += 1\n",
+    "    qmos_pred[name] = v\n",
+    "print(f\"QMOS predictions: real {n_real}, default {n_def}\")\n",
+    "\n",
+    "# Free UTMOS before loading the emotion backbone (saves VRAM on the T4)\n",
+    "del utmos, opt, scaler\n",
+    "if device == \"cuda\":\n",
+    "    torch.cuda.empty_cache()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b67b5d6b",
+   "metadata": {},
+   "source": [
+    "## 5. PART B: Load the exp08 checkpoint (fine-tuned WavLM + audeering) → 5 emotion columns for DEV\n",
+    "Reuses the exp08b loading mechanism unchanged: build the architecture → `load_state_dict` from `ft_emotion_full_20epoch.pt`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f98eca99",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch.nn.functional as F\n",
+    "\n",
+    "ckpt = torch.load(EMO_CKPT, map_location=\"cpu\", weights_only=False)   # ckpt contains numpy objects → weights_only=False\n",
+    "assert \"wavlm\" in ckpt, (\"❌ EMO_CKPT has no 'wavlm' (backbone) entry. Use ft_emotion_full_20epoch.pt (includes the backbone), \"\n",
+    "                         \"NOT the old ft_emotion_meta.pt.\")\n",
+    "print(\"✅ Loaded emotion ckpt:\", EMO_CKPT, \"| keys:\", list(ckpt.keys()))\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    cands = []\n",
+    "    for nm, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((nm, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda x: sum(p.numel() for p in x[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    _name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ Built the WavLM backbone from the SAILER wrapper at '.{_name}'\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Failed to load the SAILER wrapper:\", repr(e), \"→ falling back to plain WavLM.\")\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: microsoft/wavlm-large.\")\n",
+    "\n",
+    "wavlm = wavlm.to(device).eval()\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "miss, unexp = wavlm.load_state_dict(ckpt[\"wavlm\"], strict=False)\n",
+    "print(f\"🔁 loaded wavlm from ckpt: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0).\")\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    try:\n",
+    "        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)\n",
+    "    except Exception:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = fm.unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def wavlm_embed(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    return masked_mean(out, attn_mask)\n",
+    "\n",
+    "# ── audeering FROZEN (auxiliary features), same as exp08 ──\n",
+    "AUD_DIM = 0\n",
+    "aud_backbone = aud_head = aud_proc = None\n",
+    "if USE_AUDEERING:\n",
+    "    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "    from huggingface_hub import hf_hub_download\n",
+    "    AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "    aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "    try:\n",
+    "        _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "            hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "    except Exception:\n",
+    "        _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "    bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "    aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "    _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "    _out = _sd[\"classifier.out_proj.weight\"].shape[0]\n",
+    "    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))\n",
+    "    aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "    aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "    aud_backbone = aud_backbone.to(device).eval()\n",
+    "    aud_head = aud_head.to(device).eval()\n",
+    "    AUD_DIM = _hid + 3\n",
+    "    print(f\"✅ audeering frozen ({AUD_DIM}-D)\")\n",
+    "\n",
+    "def load_wav_emo(sid):\n",
+    "    p = os.path.join(WAV_DIR, sid + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: EMO_MAX_SEC * SR].astype(np.float32)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def extract_audeering(stems, tag):\n",
+    "    if not USE_AUDEERING:\n",
+    "        return {}\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"aud_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[aud/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    for i, s in enumerate(tqdm(todo, desc=f\"audeering {tag}\")):\n",
+    "        wave = load_wav_emo(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "        h = aud_backbone(x)[0].mean(dim=1)\n",
+    "        out = aud_head(h)[0].cpu().numpy()\n",
+    "        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]\n",
+    "        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "        if (i + 1) % 500 == 0:\n",
+    "            np.savez(cache_path, **store)\n",
+    "    if todo:\n",
+    "        np.savez(cache_path, **store)\n",
+    "    return store\n",
+    "\n",
+    "# ── EmoHeads (matches exp08) + load head weights and normalization statistics from the ckpt ──\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device).eval()\n",
+    "hmiss, hunexp = heads.load_state_dict(ckpt[\"heads\"], strict=False)\n",
+    "print(f\"🔁 loaded heads from ckpt: {len(hmiss)} missing / {len(hunexp)} unexpected keys (expected 0).\")\n",
+    "\n",
+    "emos_mu = float(ckpt[\"emos_mu\"]); emos_sd = float(ckpt[\"emos_sd\"])\n",
+    "vad_mu = np.asarray(ckpt[\"vad_mu\"], dtype=np.float32); vad_sd = np.asarray(ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "print(f\"Normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}\")\n",
+    "\n",
+    "# Target emotions (for the EMOS head) from the metadata\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(N_EMO, dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "aud_dev = extract_audeering(dev_stems, \"dev\")\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    wave = load_wav_emo(sid)\n",
+    "    if wave is None or (USE_AUDEERING and sid not in aud_dev):\n",
+    "        return None\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        fw = wavlm_embed(iv, am)\n",
+    "        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f813bfaf",
+   "metadata": {},
+   "source": [
+    "## 6. PART C: Merge QMOS (fine-tuned) + 5 emotion columns (exp08) → 6-column answer.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f9fd3208",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_real = n_def = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pr = predict_emotion(sid)\n",
+    "            if pr is None:\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "            else:\n",
+    "                emos, cat5, vad3 = pr; n_real += 1\n",
+    "            qmos = qmos_pred.get(name, qmos_pred.get(sid, 3.0))\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | emotion: real {n_real}, default {n_def}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3f78d7fe",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d873783b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0] and \"EMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Line {i}: wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp13_ft-qmos.zip answer.txt \"\n",
+    "          f\"&& unzip -l submission_track2_exp13_ft-qmos.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp13_ft-qmos.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "33290a62",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` for a smoke test (no OOM, one epoch completes); then set both to `None`.\n",
+    "- **OOM on the T4?** Lower `QMOS_MAX_SEC` (12→10→8); keep `FREEZE_FEAT_EXT=True`; `BATCH=1` is already the minimum.\n",
+    "  ⚠️ **`RANK_LAMBDA>0` uses more VRAM** because the graph of the whole ACCUM window (=16) must be kept to compare ranks →\n",
+    "  if ranking triggers OOM: lower `ACCUM` (e.g. 8, which is also the ranking group size) or `QMOS_MAX_SEC`.\n",
+    "- **Reading Part A:** compare the fine-tuned `val SRCC` with `zero-shot`. Submit the fine-tuned QMOS only if it **beats zero-shot**\n",
+    "  (ideally also exp07's 0.548); otherwise keep the exp07 QMOS (Add Input the exp07 answer.txt and swap in its QMOS column).\n",
+    "- If it does not beat zero-shot: raise `EPOCHS`, set `RANK_LAMBDA=0.3` (optimizes ranking directly), or set `FREEZE_FEAT_EXT=False`\n",
+    "  (unfreezes the feature extractor: stronger, but prone to overfitting and heavier on VRAM).\n",
+    "- **Checkpointing:** `ft_qmos_utmos.pt` is saved at every new best → **Save Version IMMEDIATELY** after the run (lesson from exp08).\n",
+    "- **QMOS license:** UTMOS/SpeechMOS (check the tarepan/SpeechMOS license) and declare it in `docs/12_system_description.md`.\n",
+    "- Record config → results → observations in `docs/04_experiments_log.md` (exp13 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
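
Reviewer note on `pairwise_rank_loss` in the notebook above: the following is a minimal pure-Python sketch of the same hinge ranking computation, useful for checking the loss by hand without PyTorch. The function names `sign` and `pairwise_hinge_rank_loss` and the sample values are illustrative assumptions, not part of the commit.

```python
def sign(x):
    """Return -1, 0 or 1, mirroring torch.sign for a scalar."""
    return (x > 0) - (x < 0)


def pairwise_hinge_rank_loss(preds, targets):
    """Mean over all (i, j) of relu(-sign(t_j - t_i) * (p_j - p_i)).

    Hypothetical pure-Python mirror of the notebook's pairwise_rank_loss:
    a pair costs nothing when the predictions are ordered like the targets
    and costs |p_j - p_i| when the order is flipped. Like the tensor
    version, the mean runs over the full n x n grid, diagonal included.
    """
    n = len(preds)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            s = sign(targets[j] - targets[i])
            d = preds[j] - preds[i]
            total += max(0.0, -s * d)
    return total / (n * n)


if __name__ == "__main__":
    gold = [1.5, 2.5, 4.0]
    print(pairwise_hinge_rank_loss([1.0, 2.0, 3.0], gold))  # same order as gold
    print(pairwise_hinge_rank_loss([3.0, 2.0, 1.0], gold))  # fully reversed order
```

Two properties follow from the formula: pairs with tied targets contribute nothing (their sign is 0), and the hinge has no margin, so the loss reaches zero as soon as every pair is ordered correctly, however small the gap between predictions.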

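Reviewer note on the accumulation windows in the notebook's training loop: the sketch below reproduces only the bookkeeping of the `micro` counter (skipped samples, full windows, leftover flush) in plain Python, with no model involved. The helper names `accumulation_windows` and `window_loss_weight` are hypothetical and exist only for this illustration.

```python
def accumulation_windows(valid, accum):
    """Group valid samples into optimizer-step windows of size `accum`.

    `valid` is an iterable of booleans; False stands for a sample whose wav
    failed to load and is skipped. Returns the number of valid samples
    behind each optimizer step, including the leftover flush.
    """
    windows, micro = [], 0
    for ok in valid:
        if not ok:
            continue                 # skipped samples do not advance the window
        micro += 1
        if micro == accum:
            windows.append(micro)    # full window: one optimizer step
            micro = 0
    if micro > 0:
        windows.append(micro)        # leftover flush at the end of the epoch
    return windows


def window_loss_weight(window_size, accum):
    """Total loss weight of a window when every sample is scaled by 1/accum."""
    return window_size / accum


if __name__ == "__main__":
    flags = [True] * 35 + [False] * 2
    sizes = accumulation_windows(flags, accum=16)
    print(sizes)
    print([window_loss_weight(s, 16) for s in sizes])
```

In the RANK-off path each sample's loss is divided by `ACCUM`, so a leftover window of 3 samples enters the optimizer step with total weight 3/16. The RANK-on path divides by `micro` instead, so its leftover window is a true mean.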
track2/exp13_finetune_qmos_pipeline.py ADDED

	@@ -0,0 +1,607 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp13 (fine-tune UTMOS for QMOS) + 6-column answer (Kaggle)
+#
+# **Goal:** the best QMOS so far is 0.548 (exp07: FROZEN head + UTMOS anchor). exp13 tries **UNFREEZING
+# (fine-tuning) UTMOS itself** on the real Track 2 `qMOS` labels → pulling the quality model into the
+# emotional-speech domain. It then **borrows the 5 emotion columns from the exp08 checkpoint**
+# (`ft_emotion_full_20epoch.pt`, the BEST version) → assembles the 6-column `answer.txt`.
+#
+# ## Why fine-tune UTMOS (rather than UTMOSv2)
+# - UTMOS (`utmos22_strong`, tarepan/SpeechMOS) is **a single model**, loaded via `torch.hub`, that **already
+#   predicts QMOS** → an ideal warm start for the quality column (unlike UTMOSv2, an ensemble of several folds + 2 streams → hard to train).
+# - forward: `model(wave[B,T], sr) -> MOS[B]`; it is a standard `nn.Module` → the whole model can be backpropagated.
+# - **No separate UTMOS anchor** (decided): when UTMOS itself is fine-tuned, the "anchor" already lives in the
+#   warm-start weights → an external head/anchor is redundant.
+#
+# ## Design
+# ```
+#  [PART A] wav ─► UTMOS (utmos22_strong, TRAINABLE, warm-start pretrained) ─► QMOS    (trained on gold qMOS)
+#  [PART B] wav ─► WavLM(exp08 ft) + audeering(frozen) ─► EMOS/CAT/VAD               (LOAD ckpt, inference only)
+#  [PART C] merge QMOS(A) + 5 emotion columns(B) ─► 6-column answer.txt ─► validate ─► zip
+# ```
+#
+# ## ⚠️ Know before you run
+# - Fine-tuning means **no caching** (every epoch reruns UTMOS forward+backward) → costs GPU hours. **The first run MUST use
+#   `LIMIT_TRAIN=300`, `LIMIT_DEV=20`** as a smoke test before switching to `None`.
+# - Safety net: submit the fine-tuned QMOS only if **internal-val SRCC > zero-shot UTMOS** (Part A prints both numbers).
+# - **Save the `ft_qmos_utmos.pt` checkpoint at every new best + Save Version IMMEDIATELY** (lesson from exp08: if the kernel dies, it is lost).
+#
+# **How to run on Kaggle:** GPU **T4** + Internet **On** → Add Input (1) the Track 2 dataset, (2) the dataset containing
+# `ft_emotion_full_20epoch.pt` (exp08), (3) optionally the `aud_dev.npz` cache → edit the slugs in cell 0 → Run All.
+# %% [markdown]
+# ## 0. Configuration (EDIT HERE)
+# %%
+import os, shutil, glob
+# ── AUTO-DETECT DATA_ROOT (scan /kaggle/input for a folder containing sets/train.csv + wav/ + metadata.csv) ──
+def find_data_root(search_root="/kaggle/input"):
+    cands = []
+    for train_csv in glob.glob(os.path.join(search_root, "**", "sets", "train.csv"), recursive=True):
+        root = os.path.dirname(os.path.dirname(train_csv))          # .../<root>/sets/train.csv → <root>
+        score = os.path.isdir(os.path.join(root, "wav")) + os.path.exists(os.path.join(root, "metadata.csv"))
+        cands.append((score, root))
+    cands.sort(reverse=True)                                        # prefer folders that have both wav/ and metadata.csv
+    return cands
+_cands = find_data_root("/kaggle/input")
+if _cands:
+    print("🔎 DATA_ROOT candidates (higher score = has both wav/ and metadata.csv):")
+    for sc, r in _cands:
+        print(f"   [{sc}/2] {r}")
+    DATA_ROOT = _cands[0][1]
+    print(f"👉 Auto-selected DATA_ROOT = {DATA_ROOT}")
+else:
+    DATA_ROOT = "/kaggle/input/datasets/minhtoan2"   # fallback: edit by hand if auto-detection finds nothing
+    print(f"❌ No sets/train.csv found under /kaggle/input → using fallback {DATA_ROOT} (did you Add Input?)")
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (for the emotion columns)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+# ── exp08 emotion checkpoint (produces the 5 EMOS/CAT/VAD columns) ────────────
+# ⭐ BEST = ft_emotion_full_20epoch.pt (the 20-epoch run): use this one, NOT ft_emotion_full.pt.
+EMO_CKPT     = "/kaggle/input/ft-emotion-full/ft_emotion_full_20epoch.pt"   # << exp08 20-epoch ckpt (INCLUDES the WavLM backbone)
+CACHE_INPUT  = "/kaggle/input/ft-emotion-cache"                     # << (optional) folder containing aud_dev.npz; "" if absent
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/ft_cache"     # /kaggle/input is read-only → the audeering cache is copied here
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── PART A: fine-tune UTMOS (QMOS) ───────────────────────────────────────────
+DEVICE          = "cuda"
+SR              = 16000
+QMOS_MAX_SEC    = 12          # truncate audio to bound backprop memory (UTMOS); on OOM lower to 10 or 8
+LR              = 1e-5        # small LR for fine-tuning (the warm start is already good)
+WEIGHT_DECAY    = 1e-5
+EPOCHS          = 10          # UPPER BOUND; early stopping decides the actual epoch count
+PATIENCE        = 3
+BATCH           = 1          # UTMOS forward has NO attention mask → BATCH=1 is safe (zero padding would skew the pooling)
+ACCUM           = 16         # effective batch = BATCH*ACCUM = 16
+VAL_FRAC        = 0.10
+SEED            = 42
+USE_AMP         = True
+RANK_LAMBDA     = 0.0         # 0 = MSE only. >0 (e.g. 0.3) = add the pairwise ranking loss (optimizes ranking, i.e. SRCC, directly)
+FREEZE_FEAT_EXT = True        # freeze the UTMOS feature extractor (CNN conv) → less VRAM + less overfitting
+# ── PART B: emotion inference (MUST match the exp08 architecture) ─────────────
+EMO_MAX_SEC         = 8
+UNFREEZE_TOP_LAYERS = 6       # matches the exp08 ckpt
+TRUNK_HIDDEN        = 512
+HEAD_HIDDEN         = 128
+DROPOUT             = 0.3
+USE_AUDEERING       = True    # matches the exp08 ckpt
+LIMIT_TRAIN = 300            # << FIRST RUN 300; real run None
+LIMIT_DEV   = 20             # << FIRST RUN 20; real run None
+# QMOS reference scores to compare against (leaderboard DEV)
+QMOS_BASELINE = {"utmos_zeroshot": 0.414, "exp07_head": 0.548}
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP, EMO_CKPT]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+print(f"Fine-tune UTMOS: LR {LR} · BATCH {BATCH}×ACCUM {ACCUM} · MAX {QMOS_MAX_SEC}s · rank λ {RANK_LAMBDA}")
+# Copy the audeering cache (aud_dev.npz) from the read-only input to working (so the emotion columns need no re-extraction)
+if CACHE_INPUT and os.path.isdir(CACHE_INPUT):
+    n = 0
+    for fn in os.listdir(CACHE_INPUT):
+        if fn.startswith("aud_") and fn.endswith(".npz"):
+            shutil.copy(os.path.join(CACHE_INPUT, fn), os.path.join(CACHE_DIR, fn)); n += 1
+    print(f"📦 Copied {n} audeering cache file(s) from {CACHE_INPUT} → {CACHE_DIR}")
+else:
+    print("ℹ️ No CACHE_INPUT → audeering features for DEV will be extracted here (slower on the first run).")
+# %% [markdown]
+# ## 1. Installation
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("speechmos", "loralib", "speechbrain", "librosa", "soundfile",
+            "scipy", "scikit-learn", "pandas", "tqdm")
+# SAILER code (to build the exact WavLM architecture, then load the exp08 ckpt on top); only needed for PART B
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Gold qMOS labels (averaged per wav), same as exp06/exp09a
+# %%
+import numpy as np
+import pandas as pd
+def load_qmos_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col  = cols.get("wavid") or cols.get("wav") or list(df.columns)[1]
+    qmos_col = cols.get("qmos")  or cols.get("mos")
+    assert qmos_col, f"qMOS column not found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    g = df.groupby("_stem")[qmos_col].mean()
+    return {s: float(v) for s, v in g.items()}
+qmos_gold = load_qmos_labels()
+print(f"Train wavs with a qMOS label: {len(qmos_gold)}")
+_vals = np.array(list(qmos_gold.values()))
+print(f"qMOS gold: mean {_vals.mean():.3f} · std {_vals.std():.3f} · min {_vals.min():.2f} · max {_vals.max():.2f}")
+# %% [markdown]
+# ## 3. PART A: Fine-tune UTMOS on qMOS
+# UTMOS outputs MOS on a ~1–5 scale (already warm-started) → train with MSE on the RAW scale (no z-scoring, so the warm start keeps its meaning).
+# `BATCH=1` + gradient accumulation: avoids padding (the UTMOS forward takes no attention mask).
+# %%
+import torch
+import torch.nn as nn
+import librosa
+from tqdm.auto import tqdm
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (very slow!)")
+torch.manual_seed(SEED); np.random.seed(SEED)
+# Load UTMOS (torch.hub): an nn.Module with forward(wave[B,T], sr) -> MOS[B]
+utmos = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True).to(device)
+n_all = sum(p.numel() for p in utmos.parameters())
+# (optional) freeze the feature extractor (the conv feature-extraction layers) → less VRAM + less overfitting
+if FREEZE_FEAT_EXT:
+    n_frozen = 0
+    for name, p in utmos.named_parameters():
+        # name-based match; the "conv" substring may also catch other conv layers if the model has any
+        if "feature_extractor" in name or "feature_projection" in name or "conv" in name.lower():
+            p.requires_grad = False; n_frozen += p.numel()
+    print(f"❄️ Froze feature extractor: {n_frozen/1e6:.1f}M / {n_all/1e6:.1f}M params")
+n_train = sum(p.numel() for p in utmos.parameters() if p.requires_grad)
+print(f"UTMOS: {n_all/1e6:.1f}M params total · {n_train/1e6:.1f}M params trainable")
+def load_wav_qmos(sid):
+    p = os.path.join(WAV_DIR, sid + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    return wave[: QMOS_MAX_SEC * SR].astype(np.float32)
+# QMOS training set: only wavs that exist on disk
+train_stems_q = [s for s in qmos_gold if os.path.exists(os.path.join(WAV_DIR, s + ".wav"))]
+np.random.shuffle(train_stems_q)
+if LIMIT_TRAIN:
+    train_stems_q = train_stems_q[:LIMIT_TRAIN]
+tr_q, va_q = train_test_split(train_stems_q, test_size=VAL_FRAC, random_state=SEED)
+print(f"QMOS train: {len(tr_q)} · internal val: {len(va_q)}")
+opt = torch.optim.AdamW([p for p in utmos.parameters() if p.requires_grad],
+                        lr=LR, weight_decay=WEIGHT_DECAY)
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def utmos_forward(wave_np):
+    """One numpy waveform -> scalar MOS tensor (gradient is kept)."""
+    x = torch.from_numpy(wave_np).unsqueeze(0).to(device)   # [1, T]
+    out = utmos(x, SR)                                       # [1] (or [1, ?])
+    return out.reshape(-1).mean()                            # scalar regardless of output shape
+def pairwise_rank_loss(preds, targets):
+    """Pairwise hinge ranking loss within one group (rewards correct ordering, a proxy for SRCC)."""
+    p = torch.stack(preds); t = torch.tensor(targets, device=device, dtype=torch.float32)
+    if len(p) < 2:
+        return torch.zeros((), device=device)
+    sign = torch.sign(t.unsqueeze(0) - t.unsqueeze(1))
+    diff = p.unsqueeze(0) - p.unsqueeze(1)
+    return torch.relu(-sign * diff).mean()
+@torch.no_grad()
+def eval_qmos_val():
+    utmos.eval()
+    preds, gts = [], []
+    for s in va_q:
+        wave = load_wav_qmos(s)
+        if wave is None:
+            continue
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            preds.append(float(utmos_forward(wave).item()))
+        gts.append(qmos_gold[s])
+    return float(spearmanr(preds, gts).correlation)
+# ZERO-SHOT baseline (before training) on the SAME val split → the bar to beat
+srcc_zeroshot = eval_qmos_val()
+print(f"\n📍 UTMOS zero-shot (internal val): SRCC = {srcc_zeroshot:.4f}  "
+      f"(leaderboard DEV ~{QMOS_BASELINE['utmos_zeroshot']}; exp07 head {QMOS_BASELINE['exp07_head']})")
+CKPT_QMOS = os.path.join(OUT_DIR, "ft_qmos_utmos.pt")
+def save_qmos_ckpt(srcc):
+    torch.save({"utmos_state": {k: v.cpu() for k, v in utmos.state_dict().items()},
+                "val_srcc": float(srcc), "raw_scale": True,
+                "QMOS_MAX_SEC": QMOS_MAX_SEC, "FREEZE_FEAT_EXT": FREEZE_FEAT_EXT}, CKPT_QMOS)
+best, best_state, bad = srcc_zeroshot, {k: v.cpu().clone() for k, v in utmos.state_dict().items()}, 0
+save_qmos_ckpt(best)   # save the zero-shot weights up front (worst case still equals the baseline)
+# Group samples into WINDOWS of ACCUM VALID samples (counted by micro). Two backward modes:
+#   • RANK off (default) → backward IMMEDIATELY per sample → graph is freed right away → low VRAM.
+#   • RANK on  → ranking must COMPARE the preds WITHIN a window → the graph MUST be kept for the whole window →
+#                accumulate the MSE (win_loss) + preds (buf_p), then backward ONCE (MSE_mean + λ·rank).
+#   ⚠️ Old bug: calling backward on the MSE at every step freed the graph → a later rank_loss.backward()
+#      failed with "backward through the graph a second time". This version accumulates, then calls backward once, which avoids the error.
+for ep in range(1, EPOCHS + 1):
+    utmos.train()
+    opt.zero_grad()
+    np.random.shuffle(tr_q)
+    run = 0.0; nb = 0
+    micro = 0; win_loss = None; buf_p, buf_t = [], []
+    for s in tqdm(tr_q, desc=f"epoch {ep}"):
+        wave = load_wav_qmos(s)
+        if wave is None:
+            continue
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            pred = utmos_forward(wave)
+            loss = mse(pred, torch.tensor(qmos_gold[s], device=device, dtype=pred.dtype))
+        run += float(loss.item()); nb += 1
+        if RANK_LAMBDA > 0:
+            win_loss = loss if win_loss is None else win_loss + loss   # KEEP the graph (no backward yet)
+            buf_p.append(pred); buf_t.append(qmos_gold[s]); micro += 1
+        else:
+            scaler.scale(loss / ACCUM).backward(); micro += 1            # backward immediately → low VRAM
+        if micro == ACCUM:
+            if RANK_LAMBDA > 0:
+                total = win_loss / micro
+                if len(buf_p) >= 2:
+                    total = total + RANK_LAMBDA * pairwise_rank_loss(buf_p, buf_t)
+                scaler.scale(total).backward()
+            scaler.step(opt); scaler.update(); opt.zero_grad()
+            micro = 0; win_loss = None; buf_p, buf_t = [], []
+    # flush the leftover window at the end of the epoch (sample count not divisible by ACCUM)
+    if micro > 0:
+        if RANK_LAMBDA > 0:
+            total = win_loss / micro
+            if len(buf_p) >= 2:
+                total = total + RANK_LAMBDA * pairwise_rank_loss(buf_p, buf_t)
+            scaler.scale(total).backward()
+        scaler.step(opt); scaler.update(); opt.zero_grad()
+    sc = eval_qmos_val()
+    print(f"epoch {ep:2d} | loss {run/max(nb,1):.4f} | val SRCC {sc:.4f} "
+          f"(zero-shot {srcc_zeroshot:.4f} · best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc
+        best_state = {k: v.cpu().clone() for k, v in utmos.state_dict().items()}
+        save_qmos_ckpt(best)
+        print(f"   💾 saved best → {CKPT_QMOS} (epoch {ep}, SRCC {sc:.4f})")
+        bad = 0
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop at epoch {ep}."); break
+utmos.load_state_dict(best_state)
+print(f"\n✅ PART A done. QMOS internal val: zero-shot {srcc_zeroshot:.4f} → fine-tuned {best:.4f} "
+      + ("🚀 improved" if best > srcc_zeroshot + 1e-4 else "➖ did NOT beat zero-shot"))
+if best <= srcc_zeroshot + 1e-4:
+    print("   ⚠️ Fine-tuning has not beaten zero-shot → consider raising EPOCHS / setting RANK_LAMBDA=0.3 / "
+          "unfreezing the feature extractor (FREEZE_FEAT_EXT=False); or keep the exp07 QMOS (0.548).")
+# %% [markdown]
+# ## 4. PART A (cont.): Predict QMOS for DEV with the fine-tuned UTMOS
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+print("DEV:", len(dev_names), "samples")
+@torch.no_grad()
+def predict_qmos(name):
+    p = os.path.join(WAV_DIR, name if str(name).endswith(".wav") else str(name) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    wave = wave[: QMOS_MAX_SEC * SR].astype(np.float32)
+    utmos.eval()
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        v = float(utmos_forward(wave).item())
+    return float(np.clip(v, 1.0, 5.0))
+qmos_pred = {}
+n_real = n_def = 0
+for name in tqdm(dev_names, desc="QMOS dev"):
+    v = predict_qmos(name)
+    if v is None:
+        v = 3.0; n_def += 1
+    else:
+        n_real += 1
+    qmos_pred[name] = v
+print(f"QMOS predictions: real {n_real}, default {n_def}")
+# Free UTMOS before loading the emotion backbone (saves VRAM on a T4)
+del utmos, opt, scaler
+if device == "cuda":
+    torch.cuda.empty_cache()
+# %% [markdown]
+# ## 5. PART B: Load the exp08 checkpoint (fine-tuned WavLM + audeering) → 5 emotion columns for DEV
+# Reuses exp08b's loading mechanism as-is: build the architecture → `load_state_dict` from `ft_emotion_full_20epoch.pt`.
+# %%
+import torch.nn.functional as F
+ckpt = torch.load(EMO_CKPT, map_location="cpu", weights_only=False)   # ckpt contains numpy objects → weights_only=False
+assert "wavlm" in ckpt, ("❌ EMO_CKPT has no 'wavlm' entry (backbone). Use ft_emotion_full_20epoch.pt (the version with the full backbone), "
+                         "NOT the old ft_emotion_meta.pt.")
+print("✅ Loaded emotion ckpt:", EMO_CKPT, "| keys:", list(ckpt.keys()))
+def find_hf_backbone(module):
+    cands = []
+    for nm, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((nm, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda x: sum(p.numel() for p in x[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    _name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ Built the WavLM backbone from the SAILER wrapper at '.{_name}'")
+except Exception as e:
+    print("⚠️ Failed to load the SAILER wrapper:", repr(e), "→ falling back to plain WavLM.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: microsoft/wavlm-large.")
+wavlm = wavlm.to(device).eval()
+WAVLM_DIM = int(wavlm.config.hidden_size)
+miss, unexp = wavlm.load_state_dict(ckpt["wavlm"], strict=False)
+print(f"🔁 loaded wavlm from ckpt: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0).")
+def masked_mean(hidden, attn_mask):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    try:
+        fm = wavlm._get_feature_vector_attention_mask(hidden.shape[1], attn_mask)
+    except Exception:
+        return hidden.mean(dim=1)
+    fm = fm.unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+@torch.no_grad()
+def wavlm_embed(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state
+    return masked_mean(out, attn_mask)
+# ── audeering FROZEN (auxiliary features), as in exp08 ──
+AUD_DIM = 0
+aud_backbone = aud_head = aud_proc = None
+if USE_AUDEERING:
+    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+    from huggingface_hub import hf_hub_download
+    AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+    aud_backbone = Wav2Vec2Model(aud_cfg)
+    try:
+        _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+            hf_hub_download(AUD_NAME, "model.safetensors"))
+    except Exception:
+        _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+    bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+    aud_backbone.load_state_dict(bb_sd, strict=False)
+    _hid = _sd["classifier.dense.weight"].shape[0]
+    _out = _sd["classifier.out_proj.weight"].shape[0]
+    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))
+    aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+    aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+    aud_backbone = aud_backbone.to(device).eval()
+    aud_head = aud_head.to(device).eval()
+    AUD_DIM = _hid + 3
+    print(f"✅ audeering frozen ({AUD_DIM}-D)")
+def load_wav_emo(sid):
+    p = os.path.join(WAV_DIR, sid + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    return wave[: EMO_MAX_SEC * SR].astype(np.float32)
+@torch.no_grad()
+def extract_audeering(stems, tag):
+    if not USE_AUDEERING:
+        return {}
+    cache_path = os.path.join(CACHE_DIR, f"aud_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[aud/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    for i, s in enumerate(tqdm(todo, desc=f"audeering {tag}")):
+        wave = load_wav_emo(s)
+        if wave is None:
+            continue
+        x = aud_proc(wave, sampling_rate=SR).input_values[0]
+        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+        h = aud_backbone(x)[0].mean(dim=1)
+        out = aud_head(h)[0].cpu().numpy()
+        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]
+        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+        if (i + 1) % 500 == 0:
+            np.savez(cache_path, **store)
+    if todo:
+        np.savez(cache_path, **store)
+    return store
+# ── EmoHeads (matching exp08) + load head weights + normalization stats from ckpt ──
+N_EMO = len(EMOTIONS5)
+TRUNK_IN = WAVLM_DIM + (AUD_DIM if USE_AUDEERING else 0)
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device).eval()
+hmiss, hunexp = heads.load_state_dict(ckpt["heads"], strict=False)
+print(f"🔁 loaded heads from ckpt: {len(hmiss)} missing / {len(hunexp)} unexpected keys (expected 0).")
+emos_mu = float(ckpt["emos_mu"]); emos_sd = float(ckpt["emos_sd"])
+vad_mu = np.asarray(ckpt["vad_mu"], dtype=np.float32); vad_sd = np.asarray(ckpt["vad_sd"], dtype=np.float32)
+print(f"Normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}")
+# Target emotion (input to the EMOS head) from metadata
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+target_map = load_target_emotions()
+def onehot_target(tgt):
+    v = np.zeros(N_EMO, dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+dev_stems = [stem(n) for n in dev_names]
+aud_dev = extract_audeering(dev_stems, "dev")
+@torch.no_grad()
+def predict_emotion(sid):
+    wave = load_wav_emo(sid)
+    if wave is None or (USE_AUDEERING and sid not in aud_dev):
+        return None
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        fw = wavlm_embed(iv, am)
+        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+# %% [markdown]
+# ## 6. PART C: Merge QMOS (fine-tuned) + 5 emotion columns (exp08) → 6-column answer.txt
+# %%
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_real = n_def = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pr = predict_emotion(sid)
+            if pr is None:
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+            else:
+                emos, cat5, vad3 = pr; n_real += 1
+            qmos = qmos_pred.get(name, qmos_pred.get(sid, 3.0))
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | emotion: real {n_real}, default {n_def}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0] and "EMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp13_ft-qmos.zip answer.txt "
+          f"&& unzip -l submission_track2_exp13_ft-qmos.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp13_ft-qmos.zip"))
+# %% [markdown]
+# ## Notes
+# - **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` for a smoke test (no OOM, one epoch completes); then set both to `None`.
+# - **OOM on a T4?** Lower `QMOS_MAX_SEC` (12→10→8); keep `FREEZE_FEAT_EXT=True`; `BATCH=1` is already the minimum.
+#   ⚠️ **Setting `RANK_LAMBDA>0` uses more VRAM** because the graph for the whole ACCUM window (=16) must be KEPT to compare ranks →
+#   if ranking causes OOM: lower `ACCUM` (e.g. 8; it is also the ranking group size) or `QMOS_MAX_SEC`.
+# - **Reading Part A:** compare the fine-tuned `val SRCC` with `zero-shot`. Submit the fine-tuned QMOS only if it **beats zero-shot**
+#   (ideally also exp07's 0.548); otherwise keep the exp07 QMOS (Add Input the exp07 answer.txt and swap in its QMOS column).
+# - If it has not beaten zero-shot: raise `EPOCHS`, set `RANK_LAMBDA=0.3` (optimizes ranking directly), or set `FREEZE_FEAT_EXT=False`
+#   (unfreezes the feature extractor: stronger, but prone to overfitting and heavier on VRAM).
+# - **Checkpointing:** `ft_qmos_utmos.pt` is saved on every new best → **Save Version IMMEDIATELY** after the run (lesson from exp08).
+# - **QMOS license:** UTMOS/SpeechMOS (check the tarepan/SpeechMOS license); declare it in `docs/12_system_description.md`.
+# - Log config → results → remarks in `docs/04_experiments_log.md` (exp13 section).
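
Reviewer sketch for the `RANK_LAMBDA` note above: a framework-free illustration of what the pairwise hinge term in `pairwise_rank_loss` appears to compute. The helper name `hinge_rank_loss` and the toy numbers are assumptions made for illustration; the real function operates on torch tensors inside one ACCUM window.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def hinge_rank_loss(preds, targets):
    """Mean hinge penalty over all ordered pairs (i, j), diagonal included.

    Hypothetical plain-Python mirror of the tensor version: a pair is
    penalised only when the predicted gap has the opposite sign to the
    gap between the gold scores.
    """
    n = len(preds)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            s = sign(targets[j] - targets[i])   # desired ordering of the pair
            d = preds[j] - preds[i]             # predicted gap
            total += max(0.0, -s * d)           # zero when the order agrees
    return total / (n * n)


# Toy window of three samples (made-up scores).
gold = [2.5, 3.0, 4.5]
correct = hinge_rank_loss([1.0, 2.0, 3.0], gold)   # same order as gold
swapped = hinge_rank_loss([3.0, 2.0, 1.0], gold)   # fully reversed order
print(correct, swapped)
```

Because the penalty depends only on the ordering inside the window, a window of size 1 contributes nothing, which is consistent with the note that `ACCUM` doubles as the ranking group size.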

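Reviewer sketch for the `answer.txt` layout written in Part C: the `validate` step above only checks the header and the column count, so a slightly stricter check could look like the following. The helper names `parse_cat` and `check_rows`, the sample rows, and the 1 to 5 score range are assumptions for illustration, not part of the committed pipeline.

```python
import csv
import io

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
HEADER = ["wav", "QMOS", "EMOS", "CAT", "VAL", "ARO", "DOM"]


def parse_cat(cell):
    """Turn 'angry:0.1|happy:0.6|...' into a {label: prob} dict."""
    out = {}
    for item in cell.split("|"):
        label, prob = item.split(":")
        out[label] = float(prob)
    return out


def check_rows(text):
    """Return a list of (line_number, problem) tuples for a CSV string."""
    rows = list(csv.reader(io.StringIO(text)))
    assert rows[0] == HEADER, "unexpected header"
    problems = []
    for line_no, row in enumerate(rows[1:], 2):
        if len(row) != len(HEADER):
            problems.append((line_no, "column count"))
            continue
        cat = parse_cat(row[3])
        if sorted(cat) != sorted(EMOTIONS5):
            problems.append((line_no, "CAT labels"))
        elif abs(sum(cat.values()) - 1.0) > 1e-3:
            problems.append((line_no, "CAT sum"))
        # Assumption: every scalar score lives on a 1-5 scale.
        scores = [float(row[i]) for i in (1, 2, 4, 5, 6)]
        if not all(1.0 <= s <= 5.0 for s in scores):
            problems.append((line_no, "score range"))
    return problems


# Two made-up rows: the second has CAT probabilities that do not sum to 1.
sample = (
    "wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n"
    "a.wav,3.2,3.9,angry:0.1|happy:0.6|neutral:0.1|sad:0.1|surprised:0.1,3.5,3.1,2.9\n"
    "b.wav,3.0,3.0,angry:0.5|happy:0.6|neutral:0.1|sad:0.1|surprised:0.1,3,3,3\n"
)
problems = check_rows(sample)
print(problems)
```

The CAT cell uses `|` and `:` as inner separators, so the row stays a valid comma-separated record and `csv.reader` sees exactly seven fields.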
track2/exp14_mamba_head.ipynb ADDED

	@@ -0,0 +1,952 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "63b4bfa4",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp14 (MAMBA temporal head, ADDED to the 6-column FUSION), Kaggle\n",
+    "\n",
+    "**Idea (following the mentor's suggestion to \"try Mamba\"):** exp04/exp07 both **mean-pool** the SSL features →\n",
+    "each wav becomes a single vector → all **temporal dynamics** are lost (rising/falling intonation, pauses, tremor).\n",
+    "**Mamba** is a State Space Model (SSM) that processes **sequences** with linear complexity → feed it the **frame sequence**\n",
+    "(unpooled) so it learns temporal dynamics, and pool only afterwards. Reference: MambaRate (AudioMOS 2025), arXiv:2507.12090.\n",
+    "\n",
+    "## exp14 = exp07 + 1 Mamba branch (ADDED, not a replacement)\n",
+    "```\n",
+    "           ┌─ POOLED features [e2v_emb|e2v_p5|sailer_emb|sailer_p9|sailer_vad3]  (identical to exp07 → REUSES the cache)\n",
+    " each wav ─┤\n",
+    "           └─ WavLM frame-level (T×1024 sequence) ─► Mamba (2 layers, bidirectional) ─► attn-pool ─► z_seq (Z dims)\n",
+    "                   │\n",
+    "        concat ──► shared TRUNK ──► 6 heads: QMOS · EMOS · CAT · VAL · ARO · DOM\n",
+    "```\n",
+    "- **`USE_MAMBA` flag:** `False` → reproduces **exactly exp07** (checks reproducibility, ~0.548/0.795). `True` → enables the Mamba branch.\n",
+    "  This IS the **\"with/without Mamba\" ablation** for the paper.\n",
+    "- WavLM is **frozen** (feature extraction only) → the Mamba head is small → trains fast and fits on a T4.\n",
+    "\n",
+    "## 2 Kaggle gotchas already handled in this file\n",
+    "1. `mamba-ssm` often fails its CUDA build → **a pure-PyTorch Mamba is bundled** (no pip needed); `mamba-ssm` is used automatically if it imports.\n",
+    "2. The frame-level cache is VERY heavy → **cap `MAX_FRAMES`** + store as **fp16**. Estimate: MAX_FRAMES=256, 1024 dims, fp16\n",
+    "   ≈ 0.5 MB/wav → train ~12k ≈ 6 GB, dev ~2.7k ≈ 1.4 GB (fits in /kaggle/working). **Save Version** to keep the cache.\n",
+    "\n",
+    "**How to run:** GPU T4 + Internet On → Add Input the Track 2 dataset → edit `DATA_ROOT` → Run All.\n",
+    "For the first run set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` for a quick check; once it works, set both to `None`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3fe243f8",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd2e582a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/fusion_cache\"     # SHARED with exp04/exp07 (e2v_*, sailer_*, utmos_*)\n",
+    "SEQ_DIR   = \"/kaggle/working/wavlm_seq_cache\"  # NEW: frame-level WavLM cache (fp16)\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "os.makedirs(SEQ_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── Toggle the Mamba branch (main ablation) ──────────────────────────────────\n",
+    "USE_MAMBA       = True         # False → reproduces EXACTLY exp07 (sanity check). True → enables the Mamba branch.\n",
+    "\n",
+    "# ── Mamba branch hyperparameters ─────────────────────────────────────────────\n",
+    "WAVLM_NAME      = \"microsoft/wavlm-large\"   # frame-level backbone (frozen). Returns a (T, 1024) sequence.\n",
+    "MAX_FRAMES      = 256          # cap on sequence length (256 frames ≈ 5.1s @ 50Hz). Lower it if disk runs out.\n",
+    "MAMBA_DMODEL    = 256          # hidden size of the Mamba block (proj 1024→256 before Mamba)\n",
+    "MAMBA_LAYERS    = 2            # number of stacked Mamba blocks\n",
+    "MAMBA_DSTATE    = 16           # SSM state dimension\n",
+    "BIDIRECTIONAL   = True         # run Mamba in both directions (forward + backward), then sum\n",
+    "Z_DIM           = 128          # size of the z_seq vector after attentive pooling, concatenated into the fusion\n",
+    "\n",
+    "# ── Fusion hyperparameters (same as exp07) ───────────────────────────────────\n",
+    "DEVICE          = \"cuda\"\n",
+    "TRUNK_HIDDEN    = 512\n",
+    "HEAD_HIDDEN     = 128\n",
+    "DROPOUT         = 0.3\n",
+    "LR              = 1e-3\n",
+    "EPOCHS          = 80\n",
+    "BATCH           = 32           # smaller than exp07 (64) because the Mamba branch uses more memory\n",
+    "VAL_FRAC        = 0.10\n",
+    "PATIENCE        = 15\n",
+    "SEED            = 42\n",
+    "\n",
+    "USE_UNCERTAINTY = True\n",
+    "LOSS_W          = {\"qmos\": 1.0, \"emos\": 1.0, \"cat\": 1.0, \"val\": 1.0, \"aro\": 1.0, \"dom\": 1.0}\n",
+    "USE_E2V         = True\n",
+    "USE_SAILER      = True\n",
+    "USE_CLASSPROB   = True\n",
+    "USE_UTMOS_FEAT  = True\n",
+    "\n",
+    "LIMIT_TRAIN     = None\n",
+    "LIMIT_DEV       = None\n",
+    "\n",
+    "# exp07 reference scores for comparison (currently the best system)\n",
+    "EXP07 = {\"qmos\": 0.548, \"emos\": 0.795, \"cat_err\": 0.153, \"val\": 0.581, \"aro\": 0.752, \"dom\": 0.705}\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "SAILER9 = [\"Anger\", \"Contempt\", \"Disgust\", \"Fear\", \"Happiness\", \"Neutral\", \"Sadness\", \"Surprise\", \"Other\"]\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "assert USE_E2V or USE_SAILER, \"At least one pooled backbone must be enabled.\"\n",
+    "print(\"USE_MAMBA =\", USE_MAMBA, \"| if False → reproduces exp07 exactly\")\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ad58750",
+   "metadata": {},
+   "source": [
+    "## 1. Setup + fetch the SAILER code\n",
+    "Install only the missing packages (Kaggle ships torch/transformers). Do NOT touch numpy (avoids a torch ABI mismatch, a lesson from exp12)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3260eb06",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"speechmos\", \"funasr\", \"librosa\", \"soundfile\", \"pandas\", \"scipy\", \"scikit-learn\", \"tqdm\")\n",
+    "\n",
+    "if USE_SAILER:\n",
+    "    pip_install(\"loralib\", \"speechbrain\")\n",
+    "    REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "    if not os.path.exists(REPO_DIR):\n",
+    "        subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                        \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "    if REPO_DIR not in sys.path:\n",
+    "        sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f92c0e17",
+   "metadata": {},
+   "source": [
+    "## 2. Read & aggregate labels by wavID (same as exp07)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bab3f8d5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, default_idx=None, df=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col  = _col(cols, \"wavid\", \"wav\", default_idx=1, df=df)\n",
+    "    qmos_col = _col(cols, \"qmos\", \"mos\")\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col  = _col(cols, \"val\", \"valence\")\n",
+    "    aro_col  = _col(cols, \"aro\", \"arousal\")\n",
+    "    dom_col  = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col  = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert qmos_col and emos_col, f\"Missing qMOS/eMOS column (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"qmos\": float(g[qmos_col].mean()), \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 1.0 / len(EMOTIONS5), dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a5cd1ff1",
+   "metadata": {},
+   "source": [
+    "## 3. POOLED features (e2v + sailer + UTMOS): REUSES the exp04/exp07 cache\n",
+    "(Identical to exp07; if exp07 has already been run, the `fusion_cache/` cache is intact → nothing is recomputed.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8c31b6a4",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU\")\n",
+    "\n",
+    "def extract_e2v(stems, tag):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"e2v_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[e2v/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from funasr import AutoModel\n",
+    "        m = AutoModel(model=\"iic/emotion2vec_plus_large\", hub=\"hf\", device=device)\n",
+    "        for i, s in enumerate(tqdm(todo, desc=f\"e2v {tag}\")):\n",
+    "            wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "            if not os.path.exists(wav):\n",
+    "                continue\n",
+    "            r = m.generate(wav, granularity=\"utterance\", extract_embedding=True)[0]\n",
+    "            emb = np.asarray(r[\"feats\"], dtype=np.float32).reshape(-1)\n",
+    "            probs = {e: 0.0 for e in EMOTIONS5}\n",
+    "            for lab, sc in zip(r[\"labels\"], r[\"scores\"]):\n",
+    "                name = lab.split(\"/\")[-1]\n",
+    "                if name in probs:\n",
+    "                    probs[name] = float(sc)\n",
+    "            tot = sum(probs.values())\n",
+    "            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)\n",
+    "            store[s] = np.concatenate([emb, p5]).astype(np.float32)\n",
+    "            if (i + 1) % 500 == 0:\n",
+    "                np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del m\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return {s: (v[:-5], v[-5:]) for s, v in store.items()}\n",
+    "\n",
+    "def _pool_feat(features):\n",
+    "    f = features.detach().cpu().numpy()\n",
+    "    if f.ndim <= 1:\n",
+    "        return f.reshape(-1).astype(np.float32)\n",
+    "    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)\n",
+    "\n",
+    "def extract_sailer(stems, tag):\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"sailer_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[sailer/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    if todo:\n",
+    "        from src.model.emotion.wavlm_emotion import WavLMWrapper\n",
+    "        sailer = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\").to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, s in enumerate(tqdm(todo, desc=f\"sailer {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                wave = wave[: 15 * 16000]\n",
+    "                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)\n",
+    "                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)\n",
+    "                emb = _pool_feat(feat)\n",
+    "                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)\n",
+    "                vad3 = np.array([1 + 4 * float(valence.item()),\n",
+    "                                 1 + 4 * float(arousal.item()),\n",
+    "                                 1 + 4 * float(dominance.item())], dtype=np.float32)\n",
+    "                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **store)\n",
+    "        np.savez(cache_path, **store)\n",
+    "        del sailer\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return {s: (v[:-12], v[-12:-3], v[-3:]) for s, v in store.items()}\n",
+    "\n",
+    "def extract_utmos(names, tag):\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"utmos_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: float(z[k]) for k in z.files}\n",
+    "        print(f\"[utmos/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [n for n in names if stem(n) not in store]\n",
+    "    if todo:\n",
+    "        predictor = torch.hub.load(\"tarepan/SpeechMOS:v1.2.0\", \"utmos22_strong\",\n",
+    "                                   trust_repo=True).to(device).eval()\n",
+    "        with torch.no_grad():\n",
+    "            for i, n in enumerate(tqdm(todo, desc=f\"utmos {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else n + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                store[stem(n)] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device),\n",
+    "                                                 sr=16000).mean().item())\n",
+    "                if (i + 1) % 500 == 0:\n",
+    "                    np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})\n",
+    "        del predictor\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a6a9dfc9",
+   "metadata": {},
+   "source": [
+    "## 3b. FRAME-LEVEL WavLM features (T×1024 sequence) for the Mamba branch, cached as fp16\n",
+    "Each wav is stored as its own `.npy` file in `SEQ_DIR` (fp16 array [T, 1024], T ≤ MAX_FRAMES).\n",
+    "WavLM is **frozen** (eval, no_grad) → LayerDrop is switched off automatically in eval mode, so the checkpoint gotcha does not apply."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "60c9e86e",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "_wavlm = None\n",
+    "def _get_wavlm():\n",
+    "    \"\"\"Lazy-load microsoft/wavlm-large (frozen). Returns (model, feature_extractor).\"\"\"\n",
+    "    global _wavlm\n",
+    "    if _wavlm is None:\n",
+    "        from transformers import WavLMModel, AutoFeatureExtractor\n",
+    "        fe = AutoFeatureExtractor.from_pretrained(WAVLM_NAME)\n",
+    "        mdl = WavLMModel.from_pretrained(WAVLM_NAME).to(device).eval()\n",
+    "        for p in mdl.parameters():\n",
+    "            p.requires_grad = False\n",
+    "        _wavlm = (mdl, fe)\n",
+    "    return _wavlm\n",
+    "\n",
+    "def seq_path(sid):\n",
+    "    return os.path.join(SEQ_DIR, sid + \".npy\")\n",
+    "\n",
+    "def extract_wavlm_seq(stems, tag):\n",
+    "    \"\"\"Extract frame-level WavLM features for each wav and cache them as fp16 .npy. Returns the set of stems that have a cache file.\"\"\"\n",
+    "    if not USE_MAMBA:\n",
+    "        return set()\n",
+    "    import librosa\n",
+    "    from tqdm.auto import tqdm\n",
+    "    todo = [s for s in stems if not os.path.exists(seq_path(s))]\n",
+    "    if todo:\n",
+    "        mdl, fe = _get_wavlm()\n",
+    "        with torch.no_grad():\n",
+    "            for i, s in enumerate(tqdm(todo, desc=f\"wavlm-seq {tag}\")):\n",
+    "                wav = os.path.join(WAV_DIR, s + \".wav\")\n",
+    "                if not os.path.exists(wav):\n",
+    "                    continue\n",
+    "                wave, _ = librosa.load(wav, sr=16000, mono=True)\n",
+    "                wave = wave[: 15 * 16000]\n",
+    "                inp = fe(wave, sampling_rate=16000, return_tensors=\"pt\").input_values.to(device)\n",
+    "                hs = mdl(inp).last_hidden_state[0]          # (T, 1024)\n",
+    "                if hs.shape[0] > MAX_FRAMES:                 # cap the length (uniform subsampling over time)\n",
+    "                    idx = torch.linspace(0, hs.shape[0] - 1, MAX_FRAMES).long()\n",
+    "                    hs = hs[idx]\n",
+    "                np.save(seq_path(s), hs.cpu().numpy().astype(np.float16))\n",
+    "        torch.cuda.empty_cache() if device == \"cuda\" else None\n",
+    "    return {s for s in stems if os.path.exists(seq_path(s))}\n",
+    "\n",
+    "def load_seq(sid):\n",
+    "    \"\"\"Read an fp16 sequence → float32 tensor (T, 1024). Returns None if the file is missing.\"\"\"\n",
+    "    p = seq_path(sid)\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    return torch.from_numpy(np.load(p).astype(np.float32))\n",
+    "\n",
+    "def collate_seqs(sids):\n",
+    "    \"\"\"Pad variable-length sequences → (B, Lmax, 1024) plus a bool mask (B, Lmax) (True = real frame).\"\"\"\n",
+    "    seqs = [load_seq(s) for s in sids]\n",
+    "    lens = [t.shape[0] for t in seqs]\n",
+    "    Lmax = max(lens)\n",
+    "    B = len(seqs)\n",
+    "    x = torch.zeros(B, Lmax, seqs[0].shape[1], dtype=torch.float32)\n",
+    "    mask = torch.zeros(B, Lmax, dtype=torch.bool)\n",
+    "    for i, t in enumerate(seqs):\n",
+    "        x[i, : t.shape[0]] = t\n",
+    "        mask[i, : t.shape[0]] = True\n",
+    "    return x, mask"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "328a5f30",
+   "metadata": {},
+   "source": [
+    "## 4. Build pooled features + labels for training (keep only wavs present in every source)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4449a153",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "train_stems = list(train_df[\"wavID\"])\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "\n",
+    "e2v_tr    = extract_e2v(train_stems, \"train\")    if USE_E2V    else {}\n",
+    "sailer_tr = extract_sailer(train_stems, \"train\") if USE_SAILER else {}\n",
+    "utmos_tr  = extract_utmos(train_stems, \"train\")  if USE_UTMOS_FEAT else {}\n",
+    "seq_tr    = extract_wavlm_seq(train_stems, \"train\")\n",
+    "\n",
+    "def audio_feature(sid, e2v_map, sailer_map):\n",
+    "    parts = []\n",
+    "    if USE_E2V:\n",
+    "        pk = e2v_map.get(sid)\n",
+    "        if pk is None:\n",
+    "            return None\n",
+    "        emb, p5 = pk\n",
+    "        parts.append(emb)\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(p5)\n",
+    "    if USE_SAILER:\n",
+    "        pk = sailer_map.get(sid)\n",
+    "        if pk is None:\n",
+    "            return None\n",
+    "        emb, p9, vad3 = pk\n",
+    "        parts.append(emb)\n",
+    "        if USE_CLASSPROB:\n",
+    "            parts.append(p9); parts.append(vad3)\n",
+    "    return np.concatenate(parts).astype(np.float32)\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "keep_sids, X, T, U = [], [], [], []\n",
+    "y_qmos, y_emos, y_vad, y_cat = [], [], [], []\n",
+    "for s in train_stems:\n",
+    "    f = audio_feature(s, e2v_tr, sailer_tr)\n",
+    "    tgt = target_map.get(s)\n",
+    "    if f is None or tgt is None or s not in lab.index:\n",
+    "        continue\n",
+    "    if USE_UTMOS_FEAT and s not in utmos_tr:\n",
+    "        continue\n",
+    "    if USE_MAMBA and s not in seq_tr:        # a WavLM sequence is required when Mamba is on\n",
+    "        continue\n",
+    "    keep_sids.append(s)\n",
+    "    X.append(f)\n",
+    "    T.append(onehot_target(tgt))\n",
+    "    U.append(utmos_tr.get(s, 3.0) if USE_UTMOS_FEAT else 0.0)\n",
+    "    y_qmos.append(lab.loc[s, \"qmos\"]); y_emos.append(lab.loc[s, \"emos\"])\n",
+    "    y_vad.append([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]])\n",
+    "    y_cat.append([lab.loc[s, f\"cat{i}\"] for i in range(len(EMOTIONS5))])\n",
+    "\n",
+    "X = np.stack(X).astype(np.float32)\n",
+    "T = np.stack(T).astype(np.float32)\n",
+    "U = np.array(U, dtype=np.float32).reshape(-1, 1)\n",
+    "y_qmos = np.array(y_qmos, dtype=np.float32); y_emos = np.array(y_emos, dtype=np.float32)\n",
+    "y_vad  = np.array(y_vad,  dtype=np.float32); y_cat  = np.array(y_cat,  dtype=np.float32)\n",
+    "FEAT_DIM = X.shape[1]\n",
+    "print(f\"Train kept: {len(keep_sids)} wavs | X={X.shape} | Mamba={'ON' if USE_MAMBA else 'OFF'}\")\n",
+    "\n",
+    "# Normalize pooled features + UTMOS + continuous labels (z-score)\n",
+    "feat_mean = X.mean(0, keepdims=True); feat_std = X.std(0, keepdims=True) + 1e-6\n",
+    "Xn = (X - feat_mean) / feat_std\n",
+    "u_mu, u_sd = float(U.mean()), float(U.std() + 1e-6); Un = (U - u_mu) / u_sd\n",
+    "qmos_mu, qmos_sd = float(y_qmos.mean()), float(y_qmos.std() + 1e-6); y_qmos_z = (y_qmos - qmos_mu) / qmos_sd\n",
+    "emos_mu, emos_sd = float(y_emos.mean()), float(y_emos.std() + 1e-6); y_emos_z = (y_emos - emos_mu) / emos_sd\n",
+    "if HAS_VAD:\n",
+    "    vad_mu = np.nanmean(y_vad, axis=0); vad_sd = np.nanstd(y_vad, axis=0) + 1e-6\n",
+    "    y_vad_z = (y_vad - vad_mu) / vad_sd\n",
+    "else:\n",
+    "    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32); y_vad_z = np.zeros_like(y_vad)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f0a94ff",
+   "metadata": {},
+   "source": [
+    "## 5a. MAMBA block (pure PyTorch, no `mamba-ssm` required)\n",
+    "Uses `mamba-ssm` automatically if it can be imported (faster); otherwise falls back to the pure-PyTorch version (selective scan as a loop over time).\n",
+    "This version follows \"mamba-minimal\" (johnma2006): same formulas, only slower than the CUDA kernel, but the head is small so it is fine on a T4."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "535fcd63",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "import torch.nn as nn\n",
+    "\n",
+    "try:\n",
+    "    from mamba_ssm import Mamba as _OfficialMamba   # used if it is installed (optional)\n",
+    "    _HAS_MAMBA_SSM = True\n",
+    "    print(\"✅ Using mamba-ssm (CUDA kernel)\")\n",
+    "except Exception:\n",
+    "    _HAS_MAMBA_SSM = False\n",
+    "    print(\"ℹ️ mamba-ssm not available → using the bundled pure-PyTorch Mamba\")\n",
+    "\n",
+    "class MambaBlockTorch(nn.Module):\n",
+    "    \"\"\"A single Mamba block (selective SSM) in pure PyTorch. d_model = hidden size.\"\"\"\n",
+    "    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):\n",
+    "        super().__init__()\n",
+    "        self.d_inner = expand * d_model\n",
+    "        self.dt_rank = math.ceil(d_model / 16)\n",
+    "        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)\n",
+    "        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=d_conv,\n",
+    "                                groups=self.d_inner, padding=d_conv - 1, bias=True)\n",
+    "        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)\n",
+    "        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)\n",
+    "        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)\n",
+    "        self.A_log = nn.Parameter(torch.log(A))           # (d_inner, d_state)\n",
+    "        self.D = nn.Parameter(torch.ones(self.d_inner))\n",
+    "        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)\n",
+    "        self.d_state = d_state\n",
+    "\n",
+    "    def forward(self, x):                                 # x: (B, L, d_model)\n",
+    "        B, L, _ = x.shape\n",
+    "        xz = self.in_proj(x)                              # (B, L, 2*d_inner)\n",
+    "        xin, z = xz.chunk(2, dim=-1)\n",
+    "        xin = xin.transpose(1, 2)                         # (B, d_inner, L)\n",
+    "        xin = self.conv1d(xin)[..., :L].transpose(1, 2)   # (B, L, d_inner) causal conv\n",
+    "        xin = F.silu(xin)\n",
+    "        y = self._ssm(xin)                                # (B, L, d_inner)\n",
+    "        y = y * F.silu(z)\n",
+    "        return self.out_proj(y)\n",
+    "\n",
+    "    def _ssm(self, x):                                    # x: (B, L, d_inner)\n",
+    "        A = -torch.exp(self.A_log)                        # (d_inner, d_state)\n",
+    "        x_dbl = self.x_proj(x)                            # (B, L, dt_rank + 2*d_state)\n",
+    "        delta, Bm, Cm = torch.split(x_dbl, [self.dt_rank, self.d_state, self.d_state], dim=-1)\n",
+    "        delta = F.softplus(self.dt_proj(delta))           # (B, L, d_inner)\n",
+    "        dA = torch.exp(delta.unsqueeze(-1) * A)           # (B, L, d_inner, d_state)\n",
+    "        dB_x = delta.unsqueeze(-1) * Bm.unsqueeze(2) * x.unsqueeze(-1)  # (B, L, d_inner, d_state)\n",
+    "        h = torch.zeros(x.shape[0], self.d_inner, self.d_state, device=x.device, dtype=x.dtype)\n",
+    "        ys = []\n",
+    "        for t in range(x.shape[1]):                       # selective scan over time\n",
+    "            h = dA[:, t] * h + dB_x[:, t]\n",
+    "            ys.append((h * Cm[:, t].unsqueeze(1)).sum(-1))   # (B, d_inner)\n",
+    "        y = torch.stack(ys, dim=1)                        # (B, L, d_inner)\n",
+    "        return y + x * self.D\n",
+    "\n",
+    "class MambaLayer(nn.Module):\n",
+    "    \"\"\"Pre-norm residual around one Mamba block (uses the official one if available).\"\"\"\n",
+    "    def __init__(self, d_model, d_state):\n",
+    "        super().__init__()\n",
+    "        self.norm = nn.LayerNorm(d_model)\n",
+    "        if _HAS_MAMBA_SSM:\n",
+    "            self.mix = _OfficialMamba(d_model=d_model, d_state=d_state, d_conv=4, expand=2)\n",
+    "        else:\n",
+    "            self.mix = MambaBlockTorch(d_model, d_state=d_state)\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        return x + self.mix(self.norm(x))\n",
+    "\n",
+    "class MambaEncoder(nn.Module):\n",
+    "    \"\"\"1024 → d_model → [Mamba ×L] (both directions if BIDIRECTIONAL) → attentive-pool → Z_DIM.\"\"\"\n",
+    "    def __init__(self, d_in, d_model, n_layers, d_state, z_dim, bidir):\n",
+    "        super().__init__()\n",
+    "        self.bidir = bidir\n",
+    "        self.proj = nn.Linear(d_in, d_model)\n",
+    "        self.fwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])\n",
+    "        if bidir:\n",
+    "            self.bwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])\n",
+    "        self.attn = nn.Linear(d_model, 1)                 # attentive pooling\n",
+    "        self.out = nn.Linear(d_model, z_dim)\n",
+    "\n",
+    "    def _run(self, layers, h):\n",
+    "        for L in layers:\n",
+    "            h = L(h)\n",
+    "        return h\n",
+    "\n",
+    "    def forward(self, x, mask):                           # x: (B, L, 1024), mask: (B, L) bool\n",
+    "        h = self.proj(x)\n",
+    "        out = self._run(self.fwd, h)\n",
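+    "        if self.bidir:\n",
+    "            # NOTE: batches are right-padded, so after the flip the backward stack sees pad frames first.\n",
+    "            rev = torch.flip(h, dims=[1])\n",
+    "            out = out + torch.flip(self._run(self.bwd, rev), dims=[1])\n",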
+    "        a = self.attn(out).squeeze(-1)                    # (B, L)\n",
+    "        a = a.masked_fill(~mask, float(\"-inf\"))\n",
+    "        w = torch.softmax(a, dim=1).unsqueeze(-1)         # (B, L, 1)\n",
+    "        pooled = (out * w).sum(1)                          # (B, d_model)\n",
+    "        return self.out(pooled)                            # (B, z_dim)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1a3026b",
+   "metadata": {},
+   "source": [
+    "## 5b. Fusion model with 6 heads + Mamba branch + training loop"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e5f743ef",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "idx_all = np.arange(X.shape[0])\n",
+    "tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)\n",
+    "\n",
+    "def to_t(a):\n",
+    "    return torch.tensor(a, dtype=torch.float32, device=device)\n",
+    "\n",
+    "Xn_t, T_t, Un_t = to_t(Xn), to_t(T), to_t(Un)\n",
+    "qmos_t = to_t(y_qmos_z).unsqueeze(1); emos_t = to_t(y_emos_z).unsqueeze(1)\n",
+    "vad_t  = to_t(y_vad_z); cat_t = to_t(y_cat)\n",
+    "\n",
+    "class FusionMamba6(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo, use_utmos, use_mamba):\n",
+    "        super().__init__()\n",
+    "        self.use_utmos = use_utmos\n",
+    "        self.use_mamba = use_mamba\n",
+    "        z_extra = Z_DIM if use_mamba else 0\n",
+    "        if use_mamba:\n",
+    "            self.enc = MambaEncoder(1024, MAMBA_DMODEL, MAMBA_LAYERS, MAMBA_DSTATE, Z_DIM, BIDIRECTIONAL)\n",
+    "        self.trunk = nn.Sequential(\n",
+    "            nn.Linear(d_in + z_extra, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "            nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.qmos = nn.Sequential(\n",
+    "            nn.Linear(trunk_h + (1 if use_utmos else 0), head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.emos = nn.Sequential(\n",
+    "            nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(\n",
+    "            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(\n",
+    "            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "\n",
+    "    def forward(self, x, tgt, utmos, seq=None, mask=None):\n",
+    "        if self.use_mamba:\n",
+    "            z = self.enc(seq, mask)\n",
+    "            x = torch.cat([x, z], dim=1)\n",
+    "        h = self.trunk(x)\n",
+    "        qmos_in = torch.cat([h, utmos], dim=1) if self.use_utmos else h\n",
+    "        return self.qmos(qmos_in), self.emos(torch.cat([h, tgt], dim=1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "model = FusionMamba6(FEAT_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO, USE_UTMOS_FEAT, USE_MAMBA).to(device)\n",
+    "n_par = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "print(f\"Trainable parameters: {n_par/1e6:.2f} M\")\n",
+    "\n",
+    "TASKS = [\"qmos\", \"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "params = list(model.parameters()) + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "opt = torch.optim.Adam(params, lr=LR, weight_decay=1e-5)\n",
+    "mse = nn.MSELoss(reduction=\"none\")\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(dim=1)\n",
+    "\n",
+    "def task_losses(qmos_p, emos_p, cat_logits, vad_p, b):\n",
+    "    L = {\"qmos\": mse(qmos_p, qmos_t[b]).mean(),\n",
+    "         \"emos\": mse(emos_p, emos_t[b]).mean(),\n",
+    "         \"cat\":  soft_ce(cat_logits, cat_t[b]).mean()}\n",
+    "    if HAS_VAD:\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vad_t[b, 0:1]).mean()\n",
+    "        L[\"aro\"] = mse(vad_p[:, 1:2], vad_t[b, 1:2]).mean()\n",
+    "        L[\"dom\"] = mse(vad_p[:, 2:3], vad_t[b, 2:3]).mean()\n",
+    "    else:\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    return L\n",
+    "\n",
+    "def combine(L):\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(LOSS_W[t] * L[t] for t in TASKS)\n",
+    "\n",
+    "# batch by INDEX (the Mamba branch reads sequences by sid → dynamic collate)\n",
+    "sids_arr = np.array(keep_sids)\n",
+    "\n",
+    "def forward_batch(bidx):\n",
+    "    \"\"\"bidx: numpy index array. Returns the model outputs for the batch (collates sequences itself when Mamba is on).\"\"\"\n",
+    "    bt = torch.tensor(bidx, device=device)\n",
+    "    if USE_MAMBA:\n",
+    "        seq, mask = collate_seqs(list(sids_arr[bidx]))\n",
+    "        seq, mask = seq.to(device), mask.to(device)\n",
+    "        return model(Xn_t[bt], T_t[bt], Un_t[bt], seq, mask)\n",
+    "    return model(Xn_t[bt], T_t[bt], Un_t[bt])\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def eval_val():\n",
+    "    model.eval()\n",
+    "    qp, ep, vp = [], [], []\n",
+    "    for i in range(0, len(va_idx), BATCH):\n",
+    "        b = va_idx[i:i + BATCH]\n",
+    "        q, e, _cl, v = forward_batch(b)\n",
+    "        qp.append(q.cpu().numpy().ravel()); ep.append(e.cpu().numpy().ravel()); vp.append(v.cpu().numpy())\n",
+    "    qp = np.concatenate(qp); ep = np.concatenate(ep); vp = np.concatenate(vp)\n",
+    "    out = {\"qmos\": spearmanr(qp, y_qmos[va_idx]).correlation,\n",
+    "           \"emos\": spearmanr(ep, y_emos[va_idx]).correlation}\n",
+    "    if USE_UTMOS_FEAT:\n",
+    "        out[\"qmos_utmos\"] = spearmanr(U[va_idx, 0], y_qmos[va_idx]).correlation\n",
+    "    if HAS_VAD:\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            out[t] = spearmanr(vp[:, j], y_vad[va_idx, j]).correlation\n",
+    "    return out\n",
+    "\n",
+    "def val_score(m):\n",
+    "    keys = [\"qmos\", \"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "best_score, best_state, bad = -1e9, None, 0\n",
+    "for ep_i in range(1, EPOCHS + 1):\n",
+    "    model.train()\n",
+    "    perm = np.random.permutation(tr_idx)\n",
+    "    run = 0.0\n",
+    "    for i in range(0, len(perm), BATCH):\n",
+    "        b = perm[i:i + BATCH]\n",
+    "        opt.zero_grad()\n",
+    "        q, e, cl, v = forward_batch(b)\n",
+    "        loss = combine(task_losses(q, e, cl, v, torch.tensor(b, device=device)))\n",
+    "        loss.backward(); opt.step()\n",
+    "        run += loss.item() * len(b)\n",
+    "    m = eval_val(); sc = val_score(m)\n",
+    "    if sc > best_score:\n",
+    "        best_score = sc; bad = 0\n",
+    "        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "    if ep_i % 2 == 0 or ep_i == 1:\n",
+    "        msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"qmos\", \"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "        print(f\"epoch {ep_i:3d} | loss {run/len(perm):.4f} | {msg} | best {best_score:.4f}\")\n",
+    "    if bad >= PATIENCE:\n",
+    "        print(f\"Early stop at epoch {ep_i}.\"); break\n",
+    "\n",
+    "model.load_state_dict(best_state)\n",
+    "final = eval_val()\n",
+    "print(f\"\\n✅ VAL (internal), exp14 (Mamba={'ON' if USE_MAMBA else 'OFF'}):\")\n",
+    "print(f\"   QMOS={final['qmos']:.4f} (exp07 {EXP07['qmos']}) | EMOS={final['emos']:.4f} (exp07 {EXP07['emos']})\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f}\"\n",
+    "          f\" (exp07 {EXP07['val']}/{EXP07['aro']}/{EXP07['dom']})\")\n",
+    "print(\"   → Comparing USE_MAMBA True vs False = the Mamba ablation for the paper.\")\n",
+    "\n",
+    "torch.save({\"state\": best_state, \"feat_mean\": feat_mean, \"feat_std\": feat_std,\n",
+    "            \"u_mu\": u_mu, \"u_sd\": u_sd, \"qmos_mu\": qmos_mu, \"qmos_sd\": qmos_sd,\n",
+    "            \"emos_mu\": emos_mu, \"emos_sd\": emos_sd, \"vad_mu\": vad_mu, \"vad_sd\": vad_sd,\n",
+    "            \"FEAT_DIM\": FEAT_DIM, \"USE_MAMBA\": USE_MAMBA, \"val_score\": best_score},\n",
+    "           os.path.join(OUT_DIR, \"fusion_mamba_mtl.pt\"))\n",
+    "print(\"Saved\", os.path.join(OUT_DIR, \"fusion_mamba_mtl.pt\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ea38383a",
+   "metadata": {},
+   "source": [
+    "## 6. Predict DEV → `answer.txt` with all 6 columns"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6e774431",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "\n",
+    "e2v_dev    = extract_e2v(dev_stems, \"dev\")    if USE_E2V    else {}\n",
+    "sailer_dev = extract_sailer(dev_stems, \"dev\") if USE_SAILER else {}\n",
+    "utmos_dev  = extract_utmos(dev_names, \"dev\")  if USE_UTMOS_FEAT else {}\n",
+    "seq_dev    = extract_wavlm_seq(dev_stems, \"dev\")\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_all(sid):\n",
+    "    f = audio_feature(sid, e2v_dev, sailer_dev)\n",
+    "    if f is None:\n",
+    "        return None\n",
+    "    if USE_MAMBA and not os.path.exists(seq_path(sid)):\n",
+    "        return None\n",
+    "    fn = (f[None, :] - feat_mean) / feat_std\n",
+    "    tgt = onehot_target(target_map.get(sid))[None, :]\n",
+    "    u = np.array([[utmos_dev.get(sid, 3.0)]], dtype=np.float32); un = (u - u_mu) / u_sd\n",
+    "    model.eval()\n",
+    "    if USE_MAMBA:\n",
+    "        seq, mask = collate_seqs([sid]); seq, mask = seq.to(device), mask.to(device)\n",
+    "        q, e, cl, v = model(to_t(fn), to_t(tgt), to_t(un), seq, mask)\n",
+    "    else:\n",
+    "        q, e, cl, v = model(to_t(fn), to_t(tgt), to_t(un))\n",
+    "    qmos = float(q.item()) * qmos_sd + qmos_mu\n",
+    "    emos = float(e.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cl, dim=1)[0].cpu().numpy()\n",
+    "    vad3 = v[0].cpu().numpy() * vad_sd + vad_mu\n",
+    "    return qmos, emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    from tqdm.auto import tqdm\n",
+    "    n_real = n_default = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pred = predict_all(sid)\n",
+    "            if pred is None:\n",
+    "                qmos = utmos_dev.get(sid, 3.0)\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])\n",
+    "                n_default += 1\n",
+    "            else:\n",
+    "                qmos, emos, cat5, vad3 = pred; n_real += 1\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},\"\n",
+    "                    f\"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | model predictions {n_real}, defaults {n_default}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bcab20d3",
+   "metadata": {},
+   "source": [
+    "## 7. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e9b2e0ab",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp14_mamba.zip answer.txt \"\n",
+    "          f\"&& unzip -l submission_track2_exp14_mamba.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp14_mamba.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7604df81",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **Main ablation for the paper:** run twice, with `USE_MAMBA=False` (= exp07, the baseline) and `USE_MAMBA=True`.\n",
+    "  Compare the internal QMOS/EMOS/VAD scores → answers \"does the Mamba temporal encoder beat mean-pooling?\".\n",
+    "- **If the disk fills up while caching sequences:** lower `MAX_FRAMES` (256→160) or delete `wavlm_seq_cache/` after the run finishes.\n",
+    "- **If Mamba is slow:** try `pip install mamba-ssm causal-conv1d` (the file uses it automatically if it imports); or reduce\n",
+    "  `MAMBA_LAYERS`/`MAX_FRAMES`. The pure-PyTorch version loops over time steps, so it is slower than the CUDA kernel.\n",
+    "- **Save Version** to keep the `fusion_cache/` + `wavlm_seq_cache/` caches for next time.\n",
+    "- Record config → results → remarks in `docs/04_experiments_log.md` (section exp14)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
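
The notebook's notes suggest lowering `MAX_FRAMES` (256→160) if the sequence cache fills the disk. A minimal sketch of that sizing arithmetic follows, assuming every wav reaches the frame cap and ignoring the small `.npy` header. The helper names `seq_cache_bytes` and `gib` are hypothetical, and the clip counts (about 12k train, about 2.7k dev) are the rough figures quoted in the pipeline file's header, not measured values.

```python
def seq_cache_bytes(n_wavs, max_frames=256, feat_dim=1024, bytes_per_value=2):
    """Upper bound on the fp16 cache size: assumes every wav hits the frame cap."""
    return n_wavs * max_frames * feat_dim * bytes_per_value


def gib(n_bytes):
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Approximate dataset sizes taken from the pipeline header (assumption).
    n_train, n_dev = 12_000, 2_700
    for max_frames in (256, 160):
        per_wav_mib = seq_cache_bytes(1, max_frames) / 2**20
        train_gib = gib(seq_cache_bytes(n_train, max_frames))
        dev_gib = gib(seq_cache_bytes(n_dev, max_frames))
        print(f"MAX_FRAMES={max_frames}: {per_wav_mib:.4f} MiB/wav, "
              f"train <= {train_gib:.2f} GiB, dev <= {dev_gib:.2f} GiB")
```

Because clips shorter than the cap produce fewer frames, the real cache should come out somewhat below these bounds.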

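
The pure-PyTorch fallback in `MambaBlockTorch._ssm` unrolls the recurrence h_t = exp(Δ_t·A)·h_{t-1} + Δ_t·B_t·x_t with output y_t = C_t·h_t + D·x_t. The sketch below reproduces that recurrence for a single channel using only the standard library, which may help when checking the loop by hand. The function name `selective_scan_1d` and the list-based layout are illustrative assumptions, not part of the notebook.

```python
import math


def selective_scan_1d(x, delta, B, C, A, D):
    """Single-channel selective scan (illustrative, not the notebook's code).

    x, delta: lists of length L (input and positive step size per time step)
    B, C:     lists of length L, each entry a list of d_state floats
    A:        list of d_state negative floats (state decay rates)
    D:        float weight of the skip connection
    """
    h = [0.0] * len(A)                      # hidden state starts at zero
    ys = []
    for t in range(len(x)):
        for n in range(len(A)):
            dA = math.exp(delta[t] * A[n])  # discretized state transition
            h[n] = dA * h[n] + delta[t] * B[t][n] * x[t]
        # read out the state with C_t and add the skip term D * x_t
        ys.append(sum(h[n] * C[t][n] for n in range(len(A))) + D * x[t])
    return ys


if __name__ == "__main__":
    # Impulse input with one state dimension: the output decays geometrically.
    ys = selective_scan_1d(
        x=[1.0, 0.0, 0.0], delta=[1.0, 1.0, 1.0],
        B=[[1.0]] * 3, C=[[1.0]] * 3, A=[-1.0], D=0.0,
    )
    print([round(y, 4) for y in ys])
```

With a unit impulse, Δ = 1, and A = -1, each later output is the previous one scaled by exp(-1), which matches the decay the batched loop applies per `(d_inner, d_state)` entry.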
track2/exp14_mamba_head_pipeline.py ADDED

	@@ -0,0 +1,798 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp14 (MAMBA temporal head, ADDED to the 6-column FUSION), Kaggle
+#
+# **Idea (following the mentor's suggestion to "try Mamba"):** exp04/exp07 both **mean-pool** the SSL features →
+# each wav becomes a single vector → all **temporal dynamics** are lost (rising/falling intonation, pauses, tremor).
+# **Mamba** is a State Space Model (SSM) that processes **sequences** with linear complexity → feed it the **frame sequence**
+# (unpooled) to learn temporal dynamics, and only then pool. Reference: MambaRate (AudioMOS 2025), arXiv:2507.12090.
+#
+# ## exp14 = exp07 + one Mamba branch (ADDED, not a replacement)
+# ```
+#            ┌─ POOLED features [e2v_emb|e2v_p5|sailer_emb|sailer_p9|sailer_vad3]  (identical to exp07 → REUSES the cache)
+#  each wav ─┤
+#            └─ frame-level WavLM (T×1024 sequence) ─► Mamba (2 layers, bidirectional) ─► attn-pool ─► z_seq (Z dims)
+#                    │
+#         concat ──► shared TRUNK ──► 6 heads: QMOS · EMOS · CAT · VAL · ARO · DOM
+# ```
+# - **`USE_MAMBA` flag:** `False` → reproduces **exactly exp07** (reproducibility check, ~0.548/0.795). `True` → enables the Mamba branch.
+#   This IS the **"with/without Mamba" ablation** for the paper.
+# - WavLM is **frozen** (feature extraction only) → the Mamba head is small → fast to train, fits a T4.
+#
+# ## 2 Kaggle gotchas already handled in this file
+# 1. `mamba-ssm` often fails to build its CUDA extension → a **pure-PyTorch Mamba is bundled** (no pip needed); `mamba-ssm` is used automatically if it imports.
+# 2. The frame-level cache is VERY heavy → **cap `MAX_FRAMES`** + store as **fp16**. Estimate: MAX_FRAMES=256, 1024 dims, fp16
+#    ≈ 0.5 MB/wav → train ~12k ≈ 6 GB, dev ~2.7k ≈ 1.4 GB (fits /kaggle/working). **Save Version** to keep the cache.
+#
+# **How to run:** GPU T4 + Internet On → Add Input with the Track 2 dataset → edit `DATA_ROOT` → Run All.
+# For the first run set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` for a quick check; once that works, set them to `None`.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/fusion_cache"     # SHARED with exp04/exp07 (e2v_*, sailer_*, utmos_*)
+SEQ_DIR   = "/kaggle/working/wavlm_seq_cache"  # NEW: frame-level WavLM cache (fp16)
+os.makedirs(CACHE_DIR, exist_ok=True)
+os.makedirs(SEQ_DIR, exist_ok=True)
+# ── Enable/disable the Mamba branch (main ablation) ──────────────────────────
+USE_MAMBA       = True         # False → reproduces exp07 EXACTLY (sanity check). True → enables the Mamba branch.
+# ── Mamba-branch hyperparameters ─────────────────────────────────────────────
+WAVLM_NAME      = "microsoft/wavlm-large"   # frame-level backbone (frozen). Returns a (T, 1024) sequence.
+MAX_FRAMES      = 256          # cap on sequence length (256 frames ≈ 5.1 s @ 50 Hz); longer clips are uniformly subsampled to 256 frames, not cropped. Lower it if disk runs out.
+MAMBA_DMODEL    = 256          # hidden size of the Mamba block (proj 1024→256 before Mamba)
+MAMBA_LAYERS    = 2            # number of stacked Mamba blocks
+MAMBA_DSTATE    = 16           # SSM state size
+BIDIRECTIONAL   = True         # run Mamba in both directions (forward + backward), then sum
+Z_DIM           = 128          # size of the z_seq vector after attentive pooling, concatenated into the fusion
+# ── Fusion hyperparameters (same as exp07) ───────────────────────────────────
+DEVICE          = "cuda"
+TRUNK_HIDDEN    = 512
+HEAD_HIDDEN     = 128
+DROPOUT         = 0.3
+LR              = 1e-3
+EPOCHS          = 80
+BATCH           = 32           # smaller than exp07 (64) because the Mamba branch uses more memory
+VAL_FRAC        = 0.10
+PATIENCE        = 15
+SEED            = 42
+USE_UNCERTAINTY = True
+LOSS_W          = {"qmos": 1.0, "emos": 1.0, "cat": 1.0, "val": 1.0, "aro": 1.0, "dom": 1.0}
+USE_E2V         = True
+USE_SAILER      = True
+USE_CLASSPROB   = True
+USE_UTMOS_FEAT  = True
+LIMIT_TRAIN     = None
+LIMIT_DEV       = None
+# exp07 reference scores for comparison (currently the best system)
+EXP07 = {"qmos": 0.548, "emos": 0.795, "cat_err": 0.153, "val": 0.581, "aro": 0.752, "dom": 0.705}
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+SAILER9 = ["Anger", "Contempt", "Disgust", "Fear", "Happiness", "Neutral", "Sadness", "Surprise", "Other"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+assert USE_E2V or USE_SAILER, "At least one pooled backbone must be enabled."
+print("USE_MAMBA =", USE_MAMBA, "| if False → reproduces exp07")
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install + fetch the SAILER code
+# Install only the missing packages (Kaggle ships torch/transformers). Do NOT touch numpy (avoids a torch ABI mismatch, a lesson from exp12).
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("speechmos", "funasr", "librosa", "soundfile", "pandas", "scipy", "scikit-learn", "tqdm")
+if USE_SAILER:
+    pip_install("loralib", "speechbrain")
+    REPO_DIR = "/kaggle/working/vox-profile-release"
+    if not os.path.exists(REPO_DIR):
+        subprocess.run(["git", "clone", "--depth", "1",
+                        "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+    if REPO_DIR not in sys.path:
+        sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Read & aggregate labels by wavID (same as exp07)
+# %%
+import numpy as np
+import pandas as pd
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, default_idx=None, df=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col  = _col(cols, "wavid", "wav", default_idx=1, df=df)
+    qmos_col = _col(cols, "qmos", "mos")
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col  = _col(cols, "val", "valence")
+    aro_col  = _col(cols, "aro", "arousal")
+    dom_col  = _col(cols, "dom", "dominance")
+    cat_col  = _col(cols, "emocat", "cat", "emotion")
+    assert qmos_col and emos_col, f"Missing qMOS/eMOS column (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "qmos": float(g[qmos_col].mean()), "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 1.0 / len(EMOTIONS5), dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Targets: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}")
+# %% [markdown]
+# ## 3. POOLED features (e2v + sailer + UTMOS): REUSES the exp04/exp07 cache
+# (Identical to exp07; if exp07 has already run, the `fusion_cache/` cache is still intact → nothing is recomputed.)
+# %%
+import torch
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU")
+def extract_e2v(stems, tag):
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"e2v_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[e2v/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from funasr import AutoModel
+        m = AutoModel(model="iic/emotion2vec_plus_large", hub="hf", device=device)
+        for i, s in enumerate(tqdm(todo, desc=f"e2v {tag}")):
+            wav = os.path.join(WAV_DIR, s + ".wav")
+            if not os.path.exists(wav):
+                continue
+            r = m.generate(wav, granularity="utterance", extract_embedding=True)[0]
+            emb = np.asarray(r["feats"], dtype=np.float32).reshape(-1)
+            probs = {e: 0.0 for e in EMOTIONS5}
+            for lab, sc in zip(r["labels"], r["scores"]):
+                name = lab.split("/")[-1]
+                if name in probs:
+                    probs[name] = float(sc)
+            tot = sum(probs.values())
+            p5 = np.array([probs[e] / tot if tot > 0 else 0.2 for e in EMOTIONS5], dtype=np.float32)
+            store[s] = np.concatenate([emb, p5]).astype(np.float32)
+            if (i + 1) % 500 == 0:
+                np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del m
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return {s: (v[:-5], v[-5:]) for s, v in store.items()}
+def _pool_feat(features):
+    f = features.detach().cpu().numpy()
+    if f.ndim <= 1:
+        return f.reshape(-1).astype(np.float32)
+    return f.mean(axis=tuple(range(f.ndim - 1))).reshape(-1).astype(np.float32)
+def extract_sailer(stems, tag):
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"sailer_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[sailer/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    if todo:
+        from src.model.emotion.wavlm_emotion import WavLMWrapper
+        sailer = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion").to(device).eval()
+        with torch.no_grad():
+            for i, s in enumerate(tqdm(todo, desc=f"sailer {tag}")):
+                wav = os.path.join(WAV_DIR, s + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                wave = wave[: 15 * 16000]
+                data = torch.from_numpy(wave).float().unsqueeze(0).to(device)
+                logits, feat, _det, arousal, valence, dominance = sailer(data, return_feature=True)
+                emb = _pool_feat(feat)
+                p9 = F.softmax(logits, dim=1)[0].detach().cpu().numpy().astype(np.float32)
+                vad3 = np.array([1 + 4 * float(valence.item()),
+                                 1 + 4 * float(arousal.item()),
+                                 1 + 4 * float(dominance.item())], dtype=np.float32)
+                store[s] = np.concatenate([emb, p9, vad3]).astype(np.float32)
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **store)
+        np.savez(cache_path, **store)
+        del sailer
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return {s: (v[:-12], v[-12:-3], v[-3:]) for s, v in store.items()}
+def extract_utmos(names, tag):
+    import librosa
+    from tqdm.auto import tqdm
+    cache_path = os.path.join(CACHE_DIR, f"utmos_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: float(z[k]) for k in z.files}
+        print(f"[utmos/{tag}] loaded cache: {len(store)}")
+    todo = [n for n in names if stem(n) not in store]
+    if todo:
+        predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
+                                   trust_repo=True).to(device).eval()
+        with torch.no_grad():
+            for i, n in enumerate(tqdm(todo, desc=f"utmos {tag}")):
+                wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else n + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                store[stem(n)] = float(predictor(torch.from_numpy(wave).unsqueeze(0).to(device),
+                                                 sr=16000).mean().item())
+                if (i + 1) % 500 == 0:
+                    np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})
+        np.savez(cache_path, **{k: np.float32(v) for k, v in store.items()})
+        del predictor
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return store
+# %% [markdown]
+# ## 3b. FRAME-LEVEL WavLM features (T×1024 sequence) for the Mamba branch, cached as fp16
+# Each wav is stored as its own `.npy` file in `SEQ_DIR` (fp16 array [T, 1024], T ≤ MAX_FRAMES).
+# WavLM is **frozen** (eval, no_grad) → layerdrop is disabled automatically in eval mode, so the checkpoint gotcha is not triggered.
+# %%
+_wavlm = None
+def _get_wavlm():
+    """Lazy-load microsoft/wavlm-large (frozen). Returns (model, feature_extractor)."""
+    global _wavlm
+    if _wavlm is None:
+        from transformers import WavLMModel, AutoFeatureExtractor
+        fe = AutoFeatureExtractor.from_pretrained(WAVLM_NAME)
+        mdl = WavLMModel.from_pretrained(WAVLM_NAME).to(device).eval()
+        for p in mdl.parameters():
+            p.requires_grad = False
+        _wavlm = (mdl, fe)
+    return _wavlm
+def seq_path(sid):
+    return os.path.join(SEQ_DIR, sid + ".npy")
+def extract_wavlm_seq(stems, tag):
+    """Extract frame-level WavLM features per wav and cache them as fp16 .npy. Returns the set of stems that have a cached file."""
+    if not USE_MAMBA:
+        return set()
+    import librosa
+    from tqdm.auto import tqdm
+    todo = [s for s in stems if not os.path.exists(seq_path(s))]
+    if todo:
+        mdl, fe = _get_wavlm()
+        with torch.no_grad():
+            for i, s in enumerate(tqdm(todo, desc=f"wavlm-seq {tag}")):
+                wav = os.path.join(WAV_DIR, s + ".wav")
+                if not os.path.exists(wav):
+                    continue
+                wave, _ = librosa.load(wav, sr=16000, mono=True)
+                wave = wave[: 15 * 16000]
+                inp = fe(wave, sampling_rate=16000, return_tensors="pt").input_values.to(device)
+                hs = mdl(inp).last_hidden_state[0]          # (T, 1024)
+                if hs.shape[0] > MAX_FRAMES:                 # cap length (uniform subsampling over time)
+                    idx = torch.linspace(0, hs.shape[0] - 1, MAX_FRAMES).long()
+                    hs = hs[idx]
+                np.save(seq_path(s), hs.cpu().numpy().astype(np.float16))
+        torch.cuda.empty_cache() if device == "cuda" else None
+    return {s for s in stems if os.path.exists(seq_path(s))}
+def load_seq(sid):
+    """Read the fp16 sequence → float32 tensor (T, 1024). Missing file → None."""
+    p = seq_path(sid)
+    if not os.path.exists(p):
+        return None
+    return torch.from_numpy(np.load(p).astype(np.float32))
+def collate_seqs(sids):
+    """Pad a list of variable-length sequences → (B, Lmax, 1024) + bool mask (B, Lmax) (True = real frame)."""
+    seqs = [load_seq(s) for s in sids]
+    lens = [t.shape[0] for t in seqs]
+    Lmax = max(lens)
+    B = len(seqs)
+    x = torch.zeros(B, Lmax, seqs[0].shape[1], dtype=torch.float32)
+    mask = torch.zeros(B, Lmax, dtype=torch.bool)
+    for i, t in enumerate(seqs):
+        x[i, : t.shape[0]] = t
+        mask[i, : t.shape[0]] = True
+    return x, mask
+# %% [markdown]
+# ## 4. Build pooled features + labels for training (keep only wavs that have every source)
+# %%
+train_stems = list(train_df["wavID"])
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+e2v_tr    = extract_e2v(train_stems, "train")    if USE_E2V    else {}
+sailer_tr = extract_sailer(train_stems, "train") if USE_SAILER else {}
+utmos_tr  = extract_utmos(train_stems, "train")  if USE_UTMOS_FEAT else {}
+seq_tr    = extract_wavlm_seq(train_stems, "train")
+def audio_feature(sid, e2v_map, sailer_map):
+    parts = []
+    if USE_E2V:
+        pk = e2v_map.get(sid)
+        if pk is None:
+            return None
+        emb, p5 = pk
+        parts.append(emb)
+        if USE_CLASSPROB:
+            parts.append(p5)
+    if USE_SAILER:
+        pk = sailer_map.get(sid)
+        if pk is None:
+            return None
+        emb, p9, vad3 = pk
+        parts.append(emb)
+        if USE_CLASSPROB:
+            parts.append(p9); parts.append(vad3)
+    return np.concatenate(parts).astype(np.float32)
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+lab = train_df.set_index("wavID")
+keep_sids, X, T, U = [], [], [], []
+y_qmos, y_emos, y_vad, y_cat = [], [], [], []
+for s in train_stems:
+    f = audio_feature(s, e2v_tr, sailer_tr)
+    tgt = target_map.get(s)
+    if f is None or tgt is None or s not in lab.index:
+        continue
+    if USE_UTMOS_FEAT and s not in utmos_tr:
+        continue
+    if USE_MAMBA and s not in seq_tr:        # the WavLM sequence is required when Mamba is enabled
+        continue
+    keep_sids.append(s)
+    X.append(f)
+    T.append(onehot_target(tgt))
+    U.append(utmos_tr.get(s, 3.0) if USE_UTMOS_FEAT else 0.0)
+    y_qmos.append(lab.loc[s, "qmos"]); y_emos.append(lab.loc[s, "emos"])
+    y_vad.append([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]])
+    y_cat.append([lab.loc[s, f"cat{i}"] for i in range(len(EMOTIONS5))])
+X = np.stack(X).astype(np.float32)
+T = np.stack(T).astype(np.float32)
+U = np.array(U, dtype=np.float32).reshape(-1, 1)
+y_qmos = np.array(y_qmos, dtype=np.float32); y_emos = np.array(y_emos, dtype=np.float32)
+y_vad  = np.array(y_vad,  dtype=np.float32); y_cat  = np.array(y_cat,  dtype=np.float32)
+FEAT_DIM = X.shape[1]
+print(f"Train kept: {len(keep_sids)} wavs | X={X.shape} | Mamba={'ON' if USE_MAMBA else 'OFF'}")
+# Normalize pooled features + UTMOS + continuous labels (z-score)
+feat_mean = X.mean(0, keepdims=True); feat_std = X.std(0, keepdims=True) + 1e-6
+Xn = (X - feat_mean) / feat_std
+u_mu, u_sd = float(U.mean()), float(U.std() + 1e-6); Un = (U - u_mu) / u_sd
+qmos_mu, qmos_sd = float(y_qmos.mean()), float(y_qmos.std() + 1e-6); y_qmos_z = (y_qmos - qmos_mu) / qmos_sd
+emos_mu, emos_sd = float(y_emos.mean()), float(y_emos.std() + 1e-6); y_emos_z = (y_emos - emos_mu) / emos_sd
+if HAS_VAD:
+    vad_mu = np.nanmean(y_vad, axis=0); vad_sd = np.nanstd(y_vad, axis=0) + 1e-6
+    y_vad_z = (y_vad - vad_mu) / vad_sd
+else:
+    vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32); y_vad_z = np.zeros_like(y_vad)
+# %% [markdown]
+# ## 5a. MAMBA block (pure PyTorch, `mamba-ssm` not required)
+# Uses `mamba-ssm` automatically if it imports (faster); otherwise → the pure-PyTorch version (selective scan as a loop over time).
+# This version follows "mamba-minimal" (johnma2006): same formulas, just slower than the CUDA kernel, but the head is small so it is fine on a T4.
+# %%
+import math
+import torch.nn as nn
+try:
+    from mamba_ssm import Mamba as _OfficialMamba   # optional: used if it is installed
+    _HAS_MAMBA_SSM = True
+    print("✅ Using mamba-ssm (CUDA kernel)")
+except Exception:
+    _HAS_MAMBA_SSM = False
+    print("ℹ️ mamba-ssm not available → using the embedded pure-PyTorch Mamba")
+class MambaBlockTorch(nn.Module):
+    """A single Mamba block (selective SSM) in pure PyTorch. d_model = hidden size."""
+    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
+        super().__init__()
+        self.d_inner = expand * d_model
+        self.dt_rank = math.ceil(d_model / 16)
+        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)
+        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=d_conv,
+                                groups=self.d_inner, padding=d_conv - 1, bias=True)
+        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)
+        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)
+        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)
+        self.A_log = nn.Parameter(torch.log(A))           # (d_inner, d_state)
+        self.D = nn.Parameter(torch.ones(self.d_inner))
+        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)
+        self.d_state = d_state
+    def forward(self, x):                                 # x: (B, L, d_model)
+        B, L, _ = x.shape
+        xz = self.in_proj(x)                              # (B, L, 2*d_inner)
+        xin, z = xz.chunk(2, dim=-1)
+        xin = xin.transpose(1, 2)                         # (B, d_inner, L)
+        xin = self.conv1d(xin)[..., :L].transpose(1, 2)   # (B, L, d_inner) causal conv
+        xin = F.silu(xin)
+        y = self._ssm(xin)                                # (B, L, d_inner)
+        y = y * F.silu(z)
+        return self.out_proj(y)
+    def _ssm(self, x):                                    # x: (B, L, d_inner)
+        A = -torch.exp(self.A_log)                        # (d_inner, d_state)
+        x_dbl = self.x_proj(x)                            # (B, L, dt_rank + 2*d_state)
+        delta, Bm, Cm = torch.split(x_dbl, [self.dt_rank, self.d_state, self.d_state], dim=-1)
+        delta = F.softplus(self.dt_proj(delta))           # (B, L, d_inner)
+        dA = torch.exp(delta.unsqueeze(-1) * A)           # (B, L, d_inner, d_state)
+        dB_x = delta.unsqueeze(-1) * Bm.unsqueeze(2) * x.unsqueeze(-1)  # (B, L, d_inner, d_state)
+        h = torch.zeros(x.shape[0], self.d_inner, self.d_state, device=x.device, dtype=x.dtype)
+        ys = []
+        for t in range(x.shape[1]):                       # selective scan over time
+            h = dA[:, t] * h + dB_x[:, t]
+            ys.append((h * Cm[:, t].unsqueeze(1)).sum(-1))   # (B, d_inner)
+        y = torch.stack(ys, dim=1)                        # (B, L, d_inner)
+        return y + x * self.D
+class MambaLayer(nn.Module):
+    """Pre-norm residual around one Mamba block (uses the official one if available)."""
+    def __init__(self, d_model, d_state):
+        super().__init__()
+        self.norm = nn.LayerNorm(d_model)
+        if _HAS_MAMBA_SSM:
+            self.mix = _OfficialMamba(d_model=d_model, d_state=d_state, d_conv=4, expand=2)
+        else:
+            self.mix = MambaBlockTorch(d_model, d_state=d_state)
+    def forward(self, x):
+        return x + self.mix(self.norm(x))
+class MambaEncoder(nn.Module):
+    """1024 → d_model → [Mamba ×L] (bidirectional if BIDIRECTIONAL) → attentive-pool → Z_DIM."""
+    def __init__(self, d_in, d_model, n_layers, d_state, z_dim, bidir):
+        super().__init__()
+        self.bidir = bidir
+        self.proj = nn.Linear(d_in, d_model)
+        self.fwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])
+        if bidir:
+            self.bwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])
+        self.attn = nn.Linear(d_model, 1)                 # attentive pooling
+        self.out = nn.Linear(d_model, z_dim)
+    def _run(self, layers, h):
+        for L in layers:
+            h = L(h)
+        return h
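+    @staticmethod
+    def _flip_valid(t, mask):
+        # Reverse only the real (unpadded) prefix of each right-padded sequence and leave the
+        # padding at the end, so the backward scan never reads pad frames before real ones.
+        lens = mask.sum(1, keepdim=True)                                  # (B, 1)
+        pos = torch.arange(t.shape[1], device=t.device).unsqueeze(0)      # (1, L)
+        idx = torch.where(pos < lens, lens - 1 - pos, pos)                # (B, L)
+        return t.gather(1, idx.unsqueeze(-1).expand_as(t))
+    def forward(self, x, mask):                           # x: (B, L, 1024), mask: (B, L) bool
+        h = self.proj(x)
+        out = self._run(self.fwd, h)
+        if self.bidir:
+            rev = self._flip_valid(h, mask)
+            out = out + self._flip_valid(self._run(self.bwd, rev), mask)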
+        a = self.attn(out).squeeze(-1)                    # (B, L)
+        a = a.masked_fill(~mask, float("-inf"))
+        w = torch.softmax(a, dim=1).unsqueeze(-1)         # (B, L, 1)
+        pooled = (out * w).sum(1)                          # (B, d_model)
+        return self.out(pooled)                            # (B, z_dim)
+# %% [markdown]
+# ## 5b. 6-head fusion model + Mamba branch + training loop
+# %%
+from scipy.stats import spearmanr
+from sklearn.model_selection import train_test_split
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+idx_all = np.arange(X.shape[0])
+tr_idx, va_idx = train_test_split(idx_all, test_size=VAL_FRAC, random_state=SEED)
+def to_t(a):
+    return torch.tensor(a, dtype=torch.float32, device=device)
+Xn_t, T_t, Un_t = to_t(Xn), to_t(T), to_t(Un)
+qmos_t = to_t(y_qmos_z).unsqueeze(1); emos_t = to_t(y_emos_z).unsqueeze(1)
+vad_t  = to_t(y_vad_z); cat_t = to_t(y_cat)
+class FusionMamba6(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo, use_utmos, use_mamba):
+        super().__init__()
+        self.use_utmos = use_utmos
+        self.use_mamba = use_mamba
+        z_extra = Z_DIM if use_mamba else 0
+        if use_mamba:
+            self.enc = MambaEncoder(1024, MAMBA_DMODEL, MAMBA_LAYERS, MAMBA_DSTATE, Z_DIM, BIDIRECTIONAL)
+        self.trunk = nn.Sequential(
+            nn.Linear(d_in + z_extra, trunk_h), nn.ReLU(), nn.Dropout(p),
+            nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.qmos = nn.Sequential(
+            nn.Linear(trunk_h + (1 if use_utmos else 0), head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.emos = nn.Sequential(
+            nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(
+            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(
+            nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, x, tgt, utmos, seq=None, mask=None):
+        if self.use_mamba:
+            z = self.enc(seq, mask)
+            x = torch.cat([x, z], dim=1)
+        h = self.trunk(x)
+        qmos_in = torch.cat([h, utmos], dim=1) if self.use_utmos else h
+        return self.qmos(qmos_in), self.emos(torch.cat([h, tgt], dim=1)), self.cat(h), self.vad(h)
+model = FusionMamba6(FEAT_DIM, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO, USE_UTMOS_FEAT, USE_MAMBA).to(device)
+n_par = sum(p.numel() for p in model.parameters() if p.requires_grad)
+print(f"Trainable parameters: {n_par/1e6:.2f} M")
+TASKS = ["qmos", "emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+params = list(model.parameters()) + ([log_var] if USE_UNCERTAINTY else [])
+opt = torch.optim.Adam(params, lr=LR, weight_decay=1e-5)
+mse = nn.MSELoss(reduction="none")
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(dim=1)
+def task_losses(qmos_p, emos_p, cat_logits, vad_p, b):
+    L = {"qmos": mse(qmos_p, qmos_t[b]).mean(),
+         "emos": mse(emos_p, emos_t[b]).mean(),
+         "cat":  soft_ce(cat_logits, cat_t[b]).mean()}
+    if HAS_VAD:
+        L["val"] = mse(vad_p[:, 0:1], vad_t[b, 0:1]).mean()
+        L["aro"] = mse(vad_p[:, 1:2], vad_t[b, 1:2]).mean()
+        L["dom"] = mse(vad_p[:, 2:3], vad_t[b, 2:3]).mean()
+    else:
+        z = torch.zeros((), device=device); L["val"] = L["aro"] = L["dom"] = z
+    return L
+def combine(L):
+    if USE_UNCERTAINTY:
+        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))
+    return sum(LOSS_W[t] * L[t] for t in TASKS)
+# batch by INDEX (the Mamba branch reads sequences by sid → dynamic collate)
+sids_arr = np.array(keep_sids)
+def forward_batch(bidx):
+    """bidx: numpy indices. Returns the model outputs for the batch (collates the sequences when Mamba is enabled)."""
+    bt = torch.tensor(bidx, device=device)
+    if USE_MAMBA:
+        seq, mask = collate_seqs(list(sids_arr[bidx]))
+        seq, mask = seq.to(device), mask.to(device)
+        return model(Xn_t[bt], T_t[bt], Un_t[bt], seq, mask)
+    return model(Xn_t[bt], T_t[bt], Un_t[bt])
+@torch.no_grad()
+def eval_val():
+    model.eval()
+    qp, ep, vp = [], [], []
+    for i in range(0, len(va_idx), BATCH):
+        b = va_idx[i:i + BATCH]
+        q, e, _cl, v = forward_batch(b)
+        qp.append(q.cpu().numpy().ravel()); ep.append(e.cpu().numpy().ravel()); vp.append(v.cpu().numpy())
+    qp = np.concatenate(qp); ep = np.concatenate(ep); vp = np.concatenate(vp)
+    out = {"qmos": spearmanr(qp, y_qmos[va_idx]).correlation,
+           "emos": spearmanr(ep, y_emos[va_idx]).correlation}
+    if USE_UTMOS_FEAT:
+        out["qmos_utmos"] = spearmanr(U[va_idx, 0], y_qmos[va_idx]).correlation
+    if HAS_VAD:
+        for j, t in enumerate(["val", "aro", "dom"]):
+            out[t] = spearmanr(vp[:, j], y_vad[va_idx, j]).correlation
+    return out
+def val_score(m):
+    keys = ["qmos", "emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+best_score, best_state, bad = -1e9, None, 0
+for ep_i in range(1, EPOCHS + 1):
+    model.train()
+    perm = np.random.permutation(tr_idx)
+    run = 0.0
+    for i in range(0, len(perm), BATCH):
+        b = perm[i:i + BATCH]
+        opt.zero_grad()
+        q, e, cl, v = forward_batch(b)
+        loss = combine(task_losses(q, e, cl, v, torch.tensor(b, device=device)))
+        loss.backward(); opt.step()
+        run += loss.item() * len(b)
+    m = eval_val(); sc = val_score(m)
+    if sc > best_score:
+        best_score = sc; bad = 0
+        best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
+    else:
+        bad += 1
+    if ep_i % 2 == 0 or ep_i == 1:
+        msg = " ".join(f"{k}={m[k]:.3f}" for k in ["qmos", "emos", "val", "aro", "dom"] if k in m)
+        print(f"epoch {ep_i:3d} | loss {run/len(perm):.4f} | {msg} | best {best_score:.4f}")
+    if bad >= PATIENCE:
+        print(f"Early stop at epoch {ep_i}."); break
+model.load_state_dict(best_state)
+final = eval_val()
+print(f"\n✅ VAL (internal), exp14 (Mamba={'ON' if USE_MAMBA else 'OFF'}):")
+print(f"   QMOS={final['qmos']:.4f} (exp07 {EXP07['qmos']}) | EMOS={final['emos']:.4f} (exp07 {EXP07['emos']})")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f}"
+          f" (exp07 {EXP07['val']}/{EXP07['aro']}/{EXP07['dom']})")
+print("   → Comparing USE_MAMBA True vs False = the Mamba ablation for the paper.")
+torch.save({"state": best_state, "feat_mean": feat_mean, "feat_std": feat_std,
+            "u_mu": u_mu, "u_sd": u_sd, "qmos_mu": qmos_mu, "qmos_sd": qmos_sd,
+            "emos_mu": emos_mu, "emos_sd": emos_sd, "vad_mu": vad_mu, "vad_sd": vad_sd,
+            "FEAT_DIM": FEAT_DIM, "USE_MAMBA": USE_MAMBA, "val_score": best_score},
+           os.path.join(OUT_DIR, "fusion_mamba_mtl.pt"))
+print("Saved", os.path.join(OUT_DIR, "fusion_mamba_mtl.pt"))
+# %% [markdown]
+# ## 6. Predict DEV → `answer.txt` with all 6 columns
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+e2v_dev    = extract_e2v(dev_stems, "dev")    if USE_E2V    else {}
+sailer_dev = extract_sailer(dev_stems, "dev") if USE_SAILER else {}
+utmos_dev  = extract_utmos(dev_names, "dev")  if USE_UTMOS_FEAT else {}
+seq_dev    = extract_wavlm_seq(dev_stems, "dev")
+@torch.no_grad()
+def predict_all(sid):
+    f = audio_feature(sid, e2v_dev, sailer_dev)
+    if f is None:
+        return None
+    if USE_MAMBA and not os.path.exists(seq_path(sid)):
+        return None
+    fn = (f[None, :] - feat_mean) / feat_std
+    tgt = onehot_target(target_map.get(sid))[None, :]
+    u = np.array([[utmos_dev.get(sid, 3.0)]], dtype=np.float32); un = (u - u_mu) / u_sd
+    model.eval()
+    if USE_MAMBA:
+        seq, mask = collate_seqs([sid]); seq, mask = seq.to(device), mask.to(device)
+        q, e, cl, v = model(to_t(fn), to_t(tgt), to_t(un), seq, mask)
+    else:
+        q, e, cl, v = model(to_t(fn), to_t(tgt), to_t(un))
+    qmos = float(q.item()) * qmos_sd + qmos_mu
+    emos = float(e.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cl, dim=1)[0].cpu().numpy()
+    vad3 = v[0].cpu().numpy() * vad_sd + vad_mu
+    return qmos, emos, cat5, vad3
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    from tqdm.auto import tqdm
+    n_real = n_default = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pred = predict_all(sid)
+            if pred is None:
+                qmos = utmos_dev.get(sid, 3.0)
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0])
+                n_default += 1
+            else:
+                qmos, emos, cat5, vad3 = pred; n_real += 1
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
+                    f"{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | model predictions {n_real}, defaults {n_default}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 7. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Row {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp14_mamba.zip answer.txt "
+          f"&& unzip -l submission_track2_exp14_mamba.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp14_mamba.zip"))
+# %% [markdown]
+# ## Notes
+# - **Main ablation for the paper:** run twice, once with `USE_MAMBA=False` (= exp07, the baseline) and once with `USE_MAMBA=True`.
+#   Compare the internal QMOS/EMOS/VAD scores to answer "does the Mamba temporal encoder beat mean-pooling?".
+# - **If the disk fills up while caching sequences:** lower `MAX_FRAMES` (256→160) or delete `wavlm_seq_cache/` once the run finishes.
+# - **If Mamba is slow:** try `pip install mamba-ssm causal-conv1d` (the script uses it automatically when the import succeeds), or lower
+#   `MAMBA_LAYERS`/`MAX_FRAMES`. The pure-PyTorch version loops over time steps, so it is slower than the CUDA kernel.
+# - **Save Version** to keep the `fusion_cache/` + `wavlm_seq_cache/` caches for next time.
+# - Record config → results → observations in `docs/04_experiments_log.md` (exp14 section).
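The `validate` step in section 7 only checks the header and the column count. A stricter check, sketched here under the assumption that every `CAT` field has the `emotion:prob|emotion:prob|...` layout written by `fmt_cat`, also parses the probabilities and confirms they sum to roughly 1. The helper names `parse_cat` and `check_rows` are illustrative and are not part of the pipeline.

```python
import csv
import io
import math

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]

def parse_cat(field):
    """Parse 'angry:0.1|happy:0.2|...' into a {label: probability} dict."""
    probs = {}
    for item in field.split("|"):
        name, value = item.split(":")   # ValueError if the item is malformed
        probs[name] = float(value)
    return probs

def check_rows(text, tol=1e-3):
    """Return a list of (line_number, problem); an empty list means OK."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):   # line 1 is the header
        try:
            probs = parse_cat(row["CAT"])
        except (ValueError, KeyError, AttributeError):
            problems.append((line_no, "unparseable CAT"))
            continue
        if sorted(probs) != sorted(EMOTIONS5):
            problems.append((line_no, "unexpected emotion labels"))
        elif not math.isclose(sum(probs.values()), 1.0, abs_tol=tol):
            problems.append((line_no, "CAT probabilities do not sum to 1"))
    return problems
```

With the real file this could be called as `check_rows(open(answer_path).read())`; the tolerance is loose on purpose because the values are written with `.6g` formatting.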

track2/exp15_predict.ipynb ADDED Viewed

	@@ -0,0 +1,698 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "9f2c52f2",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp15 PREDICT-ONLY (load checkpoint → score DEV, NO training), Kaggle\n",
+    "\n",
+    "**Purpose:** you ALREADY have an exp15 checkpoint (`ft_mamba_emotion_full*.pt`, which stores the WavLM backbone + Mamba encoder + heads).\n",
+    "This notebook does **inference only**: rebuild the exact architecture → load the weights + normalization statistics FROM the ckpt →\n",
+    "predict the 5 emotion columns on the DEV set → merge in QMOS (exp07/UTMOSv2) → `answer.txt` → zip for submission.\n",
+    "**NO** training and **NO** train.csv needed (only the DEV wavs + metadata.csv, which supplies the target emotion for EMOS).\n",
+    "\n",
+    "## Why it is fast\n",
+    "- There is no training loop, only a single forward pass over DEV (~2730 samples). The slowest step is extracting audeering features for DEV\n",
+    "  (a few minutes; nearly instant with a cache).\n",
+    "\n",
+    "## Preparing inputs on Kaggle (Add Input)\n",
+    "1. Track 2 dataset (wav + `metadata.csv` + `sets/dev.scp`).\n",
+    "2. exp15 **checkpoint**: a dataset containing `ft_mamba_emotion_full*.pt` (e.g. `cache_exp8`). Auto-detected, or set `CKPT_PATH`.\n",
+    "3. (optional) audeering cache `aud_dev.npz` to skip re-extraction.\n",
+    "4. (optional) exp07 `answer.txt` to borrow its QMOS column (0.548).\n",
+    "\n",
+    "**How to run:** GPU **T4** + Internet **On** → Add Input → Run All."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "adbc7c65",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7eb066d5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, glob\n",
+    "\n",
+    "# ── AUTO-DETECT DATA_ROOT (scan /kaggle/input for a folder with sets/ + wav/ + metadata.csv) ──\n",
+    "def find_data_root(search_root=\"/kaggle/input\"):\n",
+    "    cands = []\n",
+    "    for dev_scp in glob.glob(os.path.join(search_root, \"**\", \"sets\", \"dev.scp\"), recursive=True):\n",
+    "        root = os.path.dirname(os.path.dirname(dev_scp))\n",
+    "        score = os.path.isdir(os.path.join(root, \"wav\")) + os.path.exists(os.path.join(root, \"metadata.csv\"))\n",
+    "        cands.append((score, root))\n",
+    "    cands.sort(reverse=True)\n",
+    "    return cands\n",
+    "\n",
+    "_cands = find_data_root(\"/kaggle/input\")\n",
+    "if _cands:\n",
+    "    print(\"🔎 DATA_ROOT candidates:\")\n",
+    "    for sc, r in _cands:\n",
+    "        print(f\"   [{sc}/2] {r}\")\n",
+    "    DATA_ROOT = _cands[0][1]\n",
+    "    print(f\"👉 Auto-selected DATA_ROOT = {DATA_ROOT}\")\n",
+    "else:\n",
+    "    DATA_ROOT = \"/kaggle/input/datasets/minhtoan2\"   # fallback; edit manually\n",
+    "    print(f\"❌ sets/dev.scp not found → using fallback {DATA_ROOT} (did you Add Input?)\")\n",
+    "\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header); supplies the target emotion for EMOS\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/ft_cache\"\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── exp15 CHECKPOINT (full: backbone + Mamba + heads) ────────────────────────\n",
+    "CKPT_PATH = \"\"    # << \"\" = auto-detect ft_mamba_emotion_full*.pt; or \"/kaggle/input/<slug>/ft_mamba_emotion_full (2).pt\"\n",
+    "\n",
+    "def find_ckpt(explicit):\n",
+    "    \"\"\"Find the exp15 checkpoint. Also matches names with a duplicate suffix, e.g. 'ft_mamba_emotion_full (2).pt'.\"\"\"\n",
+    "    if explicit and os.path.exists(explicit):\n",
+    "        return explicit\n",
+    "    for base in [\"/kaggle/input\", \"/kaggle/working\"]:\n",
+    "        hits = sorted(glob.glob(os.path.join(base, \"**\", \"ft_mamba_emotion_full*.pt\"), recursive=True))\n",
+    "        if hits:\n",
+    "            return hits[0]\n",
+    "    return \"\"\n",
+    "\n",
+    "CKPT_PATH = find_ckpt(CKPT_PATH)\n",
+    "assert CKPT_PATH, \"❌ Checkpoint ft_mamba_emotion_full*.pt not found. Did you Add Input the dataset containing the ckpt?\"\n",
+    "print(\"✅ Using checkpoint:\", CKPT_PATH)\n",
+    "\n",
+    "# (Optional) reuse the audeering DEV cache; recursive scan (files may sit inside archive/)\n",
+    "CACHE_INPUT = \"/kaggle/input/cache-exp8\"   # << EDIT the slug (or \"\")\n",
+    "if CACHE_INPUT and os.path.isdir(CACHE_INPUT):\n",
+    "    import shutil\n",
+    "    _n = 0\n",
+    "    for _fp in glob.glob(os.path.join(CACHE_INPUT, \"**\", \"aud_*.npz\"), recursive=True):\n",
+    "        shutil.copy(_fp, os.path.join(CACHE_DIR, os.path.basename(_fp))); _n += 1\n",
+    "    print(f\"📦 Copied {_n} aud_*.npz file(s) from {CACHE_INPUT}\")\n",
+    "\n",
+    "# Borrow the exp07 QMOS column (0.548). Point to the exp07 answer.txt if available; otherwise UTMOSv2 is used.\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"   # << (optional)\n",
+    "\n",
+    "# ── Hyperparameters MUST MATCH exp15 training (the ckpt does not store these Mamba values) ──\n",
+    "MAMBA_DMODEL  = 256\n",
+    "MAMBA_LAYERS  = 2\n",
+    "MAMBA_DSTATE  = 16\n",
+    "BIDIRECTIONAL = True\n",
+    "TRUNK_HIDDEN  = 512\n",
+    "HEAD_HIDDEN   = 128\n",
+    "DROPOUT       = 0.3       # no effect at eval time (model.eval() disables dropout); needed only to rebuild the same module layout\n",
+    "\n",
+    "DEVICE       = \"cuda\"\n",
+    "SR           = 16000\n",
+    "MAX_SECONDS  = 6          # matches training (exp15 = 6)\n",
+    "USE_AMP      = True\n",
+    "LIMIT_DEV    = None       # << keep None to score ALL 2730; set 20 for a quick smoke test\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, DEV_SCP, CKPT_PATH]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "febe8bdc",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + fetch the SAILER code (to rebuild the exact WavLM architecture, then overwrite it with the ckpt weights)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7732e245",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"loralib\", \"speechbrain\", \"speechmos\", \"librosa\", \"soundfile\",\n",
+    "            \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "# Mamba CUDA kernel (optional; without it the pure-PyTorch Mamba is used, which is fine for inference since it is a single forward pass)\n",
+    "INSTALL_MAMBA_SSM = True\n",
+    "if INSTALL_MAMBA_SSM:\n",
+    "    try:\n",
+    "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"ninja\"], check=True)\n",
+    "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"--no-build-isolation\", \"causal-conv1d>=1.2.0\"], check=True)\n",
+    "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"--no-build-isolation\", \"mamba-ssm\"], check=True)\n",
+    "        print(\"✅ mamba-ssm installed (the CUDA kernel is used if the import succeeds).\")\n",
+    "    except Exception as e:\n",
+    "        print(\"⚠️ mamba-ssm install failed:\", repr(e), \"→ pure-PyTorch Mamba (inference still runs).\")\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fba12581",
+   "metadata": {},
+   "source": [
+    "## 2. Load the checkpoint → build WavLM → load the fine-tuned backbone weights"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "61199736",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (slow)\")\n",
+    "\n",
+    "ckpt = torch.load(CKPT_PATH, map_location=\"cpu\", weights_only=False)  # the ckpt holds numpy objects → needs weights_only=False\n",
+    "assert \"wavlm\" in ckpt, \"❌ Checkpoint has NO 'wavlm' (backbone) entry → cannot run inference. A full ft_mamba_emotion_full*.pt is required.\"\n",
+    "print(\"✅ Loaded ckpt | keys:\", list(ckpt.keys()))\n",
+    "\n",
+    "# Read the ARCHITECTURE config from the ckpt (to build the heads with the right shapes)\n",
+    "USE_MAMBA  = bool(ckpt.get(\"USE_MAMBA\", True))\n",
+    "Z_DIM      = int(ckpt.get(\"Z_DIM\", 256))\n",
+    "AUD_DIM    = int(ckpt.get(\"AUD_DIM\", 0))\n",
+    "USE_AUDEERING = AUD_DIM > 0\n",
+    "UNFREEZE_TOP_LAYERS = int(ckpt.get(\"UNFREEZE_TOP_LAYERS\", 6))\n",
+    "print(f\"From ckpt: USE_MAMBA={USE_MAMBA} · Z_DIM={Z_DIM} · AUD_DIM={AUD_DIM} (audeering={'ON' if USE_AUDEERING else 'OFF'})\")\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ Built the WavLM backbone from the SAILER wrapper at '.{name}'\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Failed to load the SAILER wrapper:\", repr(e), \"→ falling back to plain WavLM.\")\n",
+    "\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: microsoft/wavlm-large.\")\n",
+    "\n",
+    "wavlm = wavlm.to(device)\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "wavlm.config.layerdrop = 0.0\n",
+    "\n",
+    "miss, unexp = wavlm.load_state_dict(ckpt[\"wavlm\"], strict=False)\n",
+    "print(f\"🔁 loaded wavlm from ckpt: {len(miss)} missing / {len(unexp)} unexpected key(s) (expected ~0)\")\n",
+    "if len(miss) > 20 or len(unexp) > 20:\n",
+    "    print(\"   ⚠️ Many mismatched keys → check that the backbone matches the ckpt.\")\n",
+    "wavlm.eval()\n",
+    "\n",
+    "def frame_mask(T, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return torch.ones((1, T), dtype=torch.bool, device=device)\n",
+    "    try:\n",
+    "        return wavlm._get_feature_vector_attention_mask(T, attn_mask).bool()\n",
+    "    except Exception:\n",
+    "        return torch.ones((attn_mask.shape[0], T), dtype=torch.bool, device=attn_mask.device)\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = frame_mask(hidden.shape[1], attn_mask).unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "421e0b6a",
+   "metadata": {},
+   "source": [
+    "## 3. audeering MSP-dim (FROZEN): built only if the ckpt uses it (AUD_DIM>0)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d37d3d53",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import librosa\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "aud_backbone = aud_head = aud_proc = None\n",
+    "if USE_AUDEERING:\n",
+    "    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "    from huggingface_hub import hf_hub_download\n",
+    "    AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "    aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "    try:\n",
+    "        _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "            hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "    except Exception:\n",
+    "        _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "    bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "    aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "    _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _sd[\"classifier.out_proj.weight\"].shape[0]))\n",
+    "    aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "    aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "    aud_backbone = aud_backbone.to(device).eval()\n",
+    "    aud_head = aud_head.to(device).eval()\n",
+    "    assert _hid + 3 == AUD_DIM, f\"⚠️ Built AUD_DIM ({_hid+3}) ≠ ckpt ({AUD_DIM}) → audeering mismatch!\"\n",
+    "    print(f\"✅ audeering frozen ({AUD_DIM}-D)\")\n",
+    "\n",
+    "def load_wav(name_or_stem):\n",
+    "    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(\n",
+    "        WAV_DIR, name_or_stem if str(name_or_stem).endswith(\".wav\") else str(name_or_stem) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: MAX_SECONDS * SR].astype(np.float32)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def extract_audeering(stems, tag):\n",
+    "    if not USE_AUDEERING:\n",
+    "        return {}\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"aud_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[aud/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    for i, s in enumerate(tqdm(todo, desc=f\"audeering {tag}\")):\n",
+    "        wave = load_wav(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "        h = aud_backbone(x)[0].mean(dim=1)\n",
+    "        out = aud_head(h)[0].cpu().numpy()\n",
+    "        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]\n",
+    "        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "        if (i + 1) % 500 == 0:\n",
+    "            np.savez(cache_path, **store)\n",
+    "    if todo:\n",
+    "        np.savez(cache_path, **store)\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0a04ef30",
+   "metadata": {},
+   "source": [
+    "## 4. Target emotion per wavID (one-hot conditioning for the EMOS head)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c092318",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "print(\"Target emotions:\", len(target_map), \"wav\")\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a0d7021a",
+   "metadata": {},
+   "source": [
+    "## 5. Mamba block (same as exp15) + MambaEncoder"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d8c31f88",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "\n",
+    "try:\n",
+    "    from mamba_ssm import Mamba as _OfficialMamba\n",
+    "    _HAS_MAMBA_SSM = True\n",
+    "    print(\"✅ Using mamba-ssm (CUDA kernel)\")\n",
+    "except Exception:\n",
+    "    _HAS_MAMBA_SSM = False\n",
+    "    print(\"ℹ️ mamba-ssm not available → pure-PyTorch Mamba\")\n",
+    "\n",
+    "class MambaBlockTorch(nn.Module):\n",
+    "    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):\n",
+    "        super().__init__()\n",
+    "        self.d_inner = expand * d_model\n",
+    "        self.dt_rank = math.ceil(d_model / 16)\n",
+    "        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)\n",
+    "        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=d_conv,\n",
+    "                                groups=self.d_inner, padding=d_conv - 1, bias=True)\n",
+    "        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)\n",
+    "        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)\n",
+    "        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)\n",
+    "        self.A_log = nn.Parameter(torch.log(A))\n",
+    "        self.D = nn.Parameter(torch.ones(self.d_inner))\n",
+    "        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)\n",
+    "        self.d_state = d_state\n",
+    "\n",
+    "    def forward(self, x):\n",
+    "        B, L, _ = x.shape\n",
+    "        xin, z = self.in_proj(x).chunk(2, dim=-1)\n",
+    "        xin = xin.transpose(1, 2)\n",
+    "        xin = self.conv1d(xin)[..., :L].transpose(1, 2)\n",
+    "        xin = F.silu(xin)\n",
+    "        y = self._ssm(xin) * F.silu(z)\n",
+    "        return self.out_proj(y)\n",
+    "\n",
+    "    def _ssm(self, x):\n",
+    "        A = -torch.exp(self.A_log)\n",
+    "        delta, Bm, Cm = torch.split(self.x_proj(x), [self.dt_rank, self.d_state, self.d_state], dim=-1)\n",
+    "        delta = F.softplus(self.dt_proj(delta))\n",
+    "        dA = torch.exp(delta.unsqueeze(-1) * A)\n",
+    "        dB_x = delta.unsqueeze(-1) * Bm.unsqueeze(2) * x.unsqueeze(-1)\n",
+    "        h = torch.zeros(x.shape[0], self.d_inner, self.d_state, device=x.device, dtype=x.dtype)\n",
+    "        ys = []\n",
+    "        for t in range(x.shape[1]):\n",
+    "            h = dA[:, t] * h + dB_x[:, t]\n",
+    "            ys.append((h * Cm[:, t].unsqueeze(1)).sum(-1))\n",
+    "        return torch.stack(ys, dim=1) + x * self.D\n",
+    "\n",
+    "class MambaLayer(nn.Module):\n",
+    "    def __init__(self, d_model, d_state):\n",
+    "        super().__init__()\n",
+    "        self.norm = nn.LayerNorm(d_model)\n",
+    "        self.mix = _OfficialMamba(d_model=d_model, d_state=d_state, d_conv=4, expand=2) \\\n",
+    "            if _HAS_MAMBA_SSM else MambaBlockTorch(d_model, d_state=d_state)\n",
+    "    def forward(self, x):\n",
+    "        return x + self.mix(self.norm(x))\n",
+    "\n",
+    "class MambaEncoder(nn.Module):\n",
+    "    def __init__(self, d_in, d_model, n_layers, d_state, z_dim, bidir):\n",
+    "        super().__init__()\n",
+    "        self.bidir = bidir\n",
+    "        self.proj = nn.Linear(d_in, d_model)\n",
+    "        self.fwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])\n",
+    "        if bidir:\n",
+    "            self.bwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])\n",
+    "        self.attn = nn.Linear(d_model, 1)\n",
+    "        self.out = nn.Linear(d_model, z_dim)\n",
+    "\n",
+    "    @staticmethod\n",
+    "    def _run(layers, h):\n",
+    "        for L in layers:\n",
+    "            h = L(h)\n",
+    "        return h\n",
+    "\n",
+    "    def forward(self, x, mask):\n",
+    "        with torch.cuda.amp.autocast(enabled=False):\n",
+    "            x = x.float()\n",
+    "            h = self.proj(x)\n",
+    "            out = self._run(self.fwd, h)\n",
+    "            if self.bidir:\n",
+    "                out = out + torch.flip(self._run(self.bwd, torch.flip(h, dims=[1])), dims=[1])\n",
+    "            a = self.attn(out).squeeze(-1).masked_fill(~mask, float(\"-inf\"))\n",
+    "            w = torch.softmax(a, dim=1).unsqueeze(-1)\n",
+    "            return self.out((out * w).sum(1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8369a6b",
+   "metadata": {},
+   "source": [
+    "## 6. Build encoder + heads → load weights from the ckpt + read normalization stats from the ckpt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1c5e8556",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "N_EMO = len(EMOTIONS5)\n",
+    "WAVLM_BRANCH = Z_DIM if USE_MAMBA else WAVLM_DIM\n",
+    "TRUNK_IN = WAVLM_BRANCH + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "enc = MambaEncoder(WAVLM_DIM, MAMBA_DMODEL, MAMBA_LAYERS, MAMBA_DSTATE, Z_DIM, BIDIRECTIONAL).to(device) \\\n",
+    "    if USE_MAMBA else None\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "hm, hu = heads.load_state_dict(ckpt[\"heads\"], strict=False)\n",
+    "print(f\"🔁 loaded heads from ckpt: {len(hm)} missing / {len(hu)} unexpected key(s) (expected 0)\")\n",
+    "if USE_MAMBA:\n",
+    "    assert ckpt.get(\"enc\") is not None, \"❌ ckpt has USE_MAMBA=True but NO 'enc' entry → inference would be wrong.\"\n",
+    "    em, eu = enc.load_state_dict(ckpt[\"enc\"], strict=False)\n",
+    "    print(f\"🔁 loaded Mamba enc from ckpt: {len(em)} missing / {len(eu)} unexpected key(s) (expected 0)\")\n",
+    "heads.eval()\n",
+    "if USE_MAMBA:\n",
+    "    enc.eval()\n",
+    "\n",
+    "# Normalization stats COME FROM the ckpt (the heads predict on this z-score scale → de-normalize with the same stats)\n",
+    "emos_mu = float(ckpt[\"emos_mu\"]); emos_sd = float(ckpt[\"emos_sd\"])\n",
+    "vad_mu = np.asarray(ckpt[\"vad_mu\"], dtype=np.float32); vad_sd = np.asarray(ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "print(f\"Normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}\")\n",
+    "\n",
+    "def wavlm_branch(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state\n",
+    "    if USE_MAMBA:\n",
+    "        return enc(out, frame_mask(out.shape[1], attn_mask))\n",
+    "    return masked_mean(out, attn_mask)\n",
+    "\n",
+    "print(f\"Trunk input = {TRUNK_IN} (wavlm-branch {WAVLM_BRANCH} [{'Mamba' if USE_MAMBA else 'mean-pool'}] + aud {AUD_DIM if USE_AUDEERING else 0})\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fdcf05c2",
+   "metadata": {},
+   "source": [
+    "## 7. Predict DEV → answer.txt (5 emotion columns; QMOS borrowed from exp07/UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4d225f54",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "aud_dev = extract_audeering(dev_stems, \"dev\")\n",
+    "\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            for row in csv.DictReader(f):\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Borrowed exp07 QMOS ({EXP07_ANSWER}): {len(d)//2} wav\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer.txt → scoring QMOS with UTMOSv2 (T05, VMC2024 winner).\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if not os.path.exists(wav):\n",
+    "            continue\n",
+    "        out = v2.predict(input_path=wav)\n",
+    "        qmos_map[n] = float(out[\"predicted_mos\"]) if isinstance(out, dict) else float(out)\n",
+    "    del v2\n",
+    "    if device == \"cuda\":\n",
+    "        torch.cuda.empty_cache()\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    wave = load_wav(sid)\n",
+    "    if wave is None or (USE_AUDEERING and sid not in aud_dev):\n",
+    "        return None\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        fw = wavlm_branch(iv, am)\n",
+    "        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_real = n_def = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pr = predict_emotion(sid)\n",
+    "            if pr is None:\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "            else:\n",
+    "                emos, cat5, vad3 = pr; n_real += 1\n",
+    "            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | real emotion predictions {n_real}, defaults {n_def}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42503595",
+   "metadata": {},
+   "source": [
+    "## 8. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "42dec31f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0] and \"EMOS\" in rows[0], \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp15_predict.zip answer.txt \"\n",
+    "          f\"&& unzip -l submission_track2_exp15_predict.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp15_predict.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fbef2a21",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- This file is **inference only**: no training, no train.csv needed. Use it once you have `ft_mamba_emotion_full*.pt`.\n",
+    "- ⚠️ **The Mamba/head hyperparameters (MAMBA_DMODEL/LAYERS/DSTATE, TRUNK_HIDDEN, HEAD_HIDDEN) MUST match training**\n",
+    "  (the ckpt does not store them). If you changed them when training exp15, update cell 0 to match; otherwise load_state_dict\n",
+    "  will skip mismatched keys (strict=False) or raise on a shape mismatch.\n",
+    "- `USE_MAMBA`, `Z_DIM`, `AUD_DIM`, `UNFREEZE_TOP_LAYERS` are **read automatically from the ckpt**.\n",
+    "- QMOS: preferably Add Input the exp07 `answer.txt` (0.548); otherwise it is scored with UTMOSv2.\n",
+    "- Smoke test: set `LIMIT_DEV=20` for a quick trial run, then set it back to `None` to score all 2730 samples."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
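
The `validate` cell at the end of this notebook only checks the header and the column count. A stricter per-row check could look like the minimal sketch below. It assumes the `answer.txt` layout written by `build_answer` (`wav,QMOS,EMOS,CAT,VAL,ARO,DOM`, with `CAT` formatted as `emotion:prob|...` in `EMOTIONS5` order). The helper name `check_row` and the tolerance `tol` are illustrative assumptions, not part of the notebook or of the challenge rules.

```python
import math

# Same label order as EMOTIONS5 in the notebook.
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]

def check_row(line, tol=1e-3):
    """Return a list of problems found in one data row of answer.txt (empty list = OK)."""
    problems = []
    fields = line.rstrip("\n").split(",")
    if len(fields) != 7:
        return [f"expected 7 columns, got {len(fields)}"]
    wav, qmos, emos, cat, val, aro, dom = fields
    # Numeric columns must parse as finite floats.
    for col, raw in [("QMOS", qmos), ("EMOS", emos), ("VAL", val), ("ARO", aro), ("DOM", dom)]:
        try:
            if not math.isfinite(float(raw)):
                problems.append(f"{col} is not finite: {raw}")
        except ValueError:
            problems.append(f"{col} is not a number: {raw!r}")
    # CAT must list the five emotions in order, with probabilities summing to ~1.
    pairs = [item.split(":", 1) for item in cat.split("|")]
    labels = [p[0] for p in pairs]
    if labels != EMOTIONS5 or any(len(p) != 2 for p in pairs):
        problems.append(f"CAT labels {labels} do not match {EMOTIONS5}")
    else:
        try:
            total = sum(float(p[1]) for p in pairs)
            if not math.isfinite(total) or abs(total - 1.0) > tol:
                problems.append(f"CAT probabilities sum to {total:.4f}")
        except ValueError:
            problems.append("CAT contains a non-numeric probability")
    return problems

# Hypothetical row, for illustration only.
example = "a.wav,3.2,3.5,angry:0.1|happy:0.6|neutral:0.1|sad:0.1|surprised:0.1,3,3,3"
print(check_row(example))
```

Because `fmt_cat` writes probabilities with `.6g` precision, the sum can drift slightly from 1, which is why the check uses a tolerance rather than exact equality.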

track2/exp15_predict_pipeline.py ADDED

	@@ -0,0 +1,554 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp15 PREDICT-ONLY (load checkpoint → score DEV, NO training), Kaggle
+#
+# **Purpose:** you ALREADY have an exp15 checkpoint (`ft_mamba_emotion_full*.pt`, which stores the WavLM backbone + Mamba enc + heads).
+# This file is **inference only**: rebuild the exact architecture → load the weights + normalization statistics FROM the ckpt →
+# predict the 5 emotion columns on the DEV set → merge in QMOS (exp07/UTMOSv2) → `answer.txt` → zip for submission.
+# **NO** training, **NO** train.csv needed (only the DEV wavs + metadata.csv, which supplies the target emotion for EMOS).
+#
+# ## Why it is fast
+# - There is no training loop → just one forward pass over DEV (~2730 samples). The slowest step is extracting the audeering
+#   features for DEV (a few minutes; nearly instant with a cache).
+#
+# ## Preparing inputs on Kaggle (Add Input)
+# 1. The Track 2 dataset (wav + `metadata.csv` + `sets/dev.scp`).
+# 2. The exp15 **checkpoint**: a dataset containing `ft_mamba_emotion_full*.pt` (e.g. `cache_exp8`). Auto-detected; or set `CKPT_PATH`.
+# 3. (optional) the audeering cache `aud_dev.npz`, to skip re-extraction.
+# 4. (optional) the exp07 `answer.txt`, to borrow its QMOS column (0.548).
+#
+# **How to run:** GPU **T4** + Internet **On** → Add Input → Run All.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os, glob
+# ── AUTO-DETECT DATA_ROOT (scan /kaggle/input for a directory with sets/dev.scp + wav/ + metadata.csv) ──
+def find_data_root(search_root="/kaggle/input"):
+    cands = []
+    for dev_scp in glob.glob(os.path.join(search_root, "**", "sets", "dev.scp"), recursive=True):
+        root = os.path.dirname(os.path.dirname(dev_scp))
+        score = os.path.isdir(os.path.join(root, "wav")) + os.path.exists(os.path.join(root, "metadata.csv"))
+        cands.append((score, root))
+    cands.sort(reverse=True)
+    return cands
+_cands = find_data_root("/kaggle/input")
+if _cands:
+    print("🔎 DATA_ROOT candidates:")
+    for sc, r in _cands:
+        print(f"   [{sc}/2] {r}")
+    DATA_ROOT = _cands[0][1]
+    print(f"👉 Auto-selected DATA_ROOT = {DATA_ROOT}")
+else:
+    DATA_ROOT = "/kaggle/input/datasets/minhtoan2"   # fallback: edit by hand
+    print(f"❌ sets/dev.scp not found → using fallback {DATA_ROOT} (did you Add Input?)")
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header); supplies the target emotion for EMOS
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/ft_cache"
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── exp15 CHECKPOINT (full backbone + Mamba + heads) ─────────────────────────
+CKPT_PATH = ""    # << "" = auto-detect ft_mamba_emotion_full*.pt; or "/kaggle/input/<slug>/ft_mamba_emotion_full (2).pt"
+def find_ckpt(explicit):
+    """Find the exp15 checkpoint. Also matches names with a duplicate suffix, e.g. 'ft_mamba_emotion_full (2).pt'."""
+    if explicit and os.path.exists(explicit):
+        return explicit
+    for base in ["/kaggle/input", "/kaggle/working"]:
+        hits = sorted(glob.glob(os.path.join(base, "**", "ft_mamba_emotion_full*.pt"), recursive=True))
+        if hits:
+            return hits[0]
+    return ""
+CKPT_PATH = find_ckpt(CKPT_PATH)
+assert CKPT_PATH, "❌ Checkpoint ft_mamba_emotion_full*.pt not found. Did you Add Input the dataset containing the ckpt?"
+print("✅ Using checkpoint:", CKPT_PATH)
+# (Optional) reuse the audeering DEV cache; scanned recursively (the file may sit inside archive/)
+CACHE_INPUT = "/kaggle/input/cache-exp8"   # << EDIT the slug (or "")
+if CACHE_INPUT and os.path.isdir(CACHE_INPUT):
+    import shutil
+    _n = 0
+    for _fp in glob.glob(os.path.join(CACHE_INPUT, "**", "aud_*.npz"), recursive=True):
+        shutil.copy(_fp, os.path.join(CACHE_DIR, os.path.basename(_fp))); _n += 1
+    print(f"📦 Copied {_n} aud_*.npz file(s) from {CACHE_INPUT}")
+# Borrow the exp07 QMOS column (0.548). Point to the exp07 answer.txt if available; otherwise use UTMOSv2.
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"   # << (optional)
+# ── Hyperparameters that MUST MATCH exp15 training (the ckpt does not store these Mamba values) ──
+MAMBA_DMODEL  = 256
+MAMBA_LAYERS  = 2
+MAMBA_DSTATE  = 16
+BIDIRECTIONAL = True
+TRUNK_HIDDEN  = 512
+HEAD_HIDDEN   = 128
+DROPOUT       = 0.3       # no effect at eval time (model.eval() disables dropout); only needed to build the modules
+DEVICE       = "cuda"
+SR           = 16000
+MAX_SECONDS  = 6          # must match training (exp15 = 6)
+USE_AMP      = True
+LIMIT_DEV    = None       # << keep None to score ALL 2730; set 20 for a quick smoke test
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, DEV_SCP, CKPT_PATH]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+# %% [markdown]
+# ## 1. Install + fetch the SAILER code (to build the exact WavLM architecture, then load the ckpt over it)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("loralib", "speechbrain", "speechmos", "librosa", "soundfile",
+            "scipy", "scikit-learn", "pandas", "tqdm")
+# Mamba CUDA kernel (optional; without it the pure-PyTorch Mamba is used, which is fine for inference since it is a single forward pass)
+INSTALL_MAMBA_SSM = True
+if INSTALL_MAMBA_SSM:
+    try:
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ninja"], check=True)
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "--no-build-isolation", "causal-conv1d>=1.2.0"], check=True)
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "--no-build-isolation", "mamba-ssm"], check=True)
+        print("✅ Installed mamba-ssm (the CUDA kernel is used if it imports).")
+    except Exception as e:
+        print("⚠️ mamba-ssm install failed:", repr(e), "→ pure-PyTorch Mamba (inference still runs).")
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Load the checkpoint → build WavLM → load the fine-tuned backbone weights
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (slow)")
+ckpt = torch.load(CKPT_PATH, map_location="cpu", weights_only=False)  # the ckpt contains numpy objects → weights_only must be False
+assert "wavlm" in ckpt, "❌ Checkpoint has NO 'wavlm' (backbone) → cannot run inference. A complete ft_mamba_emotion_full*.pt is required."
+print("✅ Loaded ckpt | keys:", list(ckpt.keys()))
+# Read the ARCHITECTURE config from the ckpt (to build the heads with the right shapes)
+USE_MAMBA  = bool(ckpt.get("USE_MAMBA", True))
+Z_DIM      = int(ckpt.get("Z_DIM", 256))
+AUD_DIM    = int(ckpt.get("AUD_DIM", 0))
+USE_AUDEERING = AUD_DIM > 0
+UNFREEZE_TOP_LAYERS = int(ckpt.get("UNFREEZE_TOP_LAYERS", 6))
+print(f"From ckpt: USE_MAMBA={USE_MAMBA} · Z_DIM={Z_DIM} · AUD_DIM={AUD_DIM} (audeering={'ON' if USE_AUDEERING else 'OFF'})")
+def find_hf_backbone(module):
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ Built the WavLM backbone from the SAILER wrapper at '.{name}'")
+except Exception as e:
+    print("⚠️ Failed to load the SAILER wrapper:", repr(e), "→ falling back to plain WavLM.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: microsoft/wavlm-large.")
+wavlm = wavlm.to(device)
+WAVLM_DIM = int(wavlm.config.hidden_size)
+wavlm.config.layerdrop = 0.0
+miss, unexp = wavlm.load_state_dict(ckpt["wavlm"], strict=False)
+print(f"🔁 loaded wavlm from ckpt: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0)")
+if len(miss) > 20 or len(unexp) > 20:
+    print("   ⚠️ Many mismatched keys → check that the backbone matches the ckpt.")
+wavlm.eval()
+def frame_mask(T, attn_mask):
+    if attn_mask is None:
+        return torch.ones((1, T), dtype=torch.bool, device=device)
+    try:
+        return wavlm._get_feature_vector_attention_mask(T, attn_mask).bool()
+    except Exception:
+        return torch.ones((attn_mask.shape[0], T), dtype=torch.bool, device=attn_mask.device)
+def masked_mean(hidden, attn_mask):
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    fm = frame_mask(hidden.shape[1], attn_mask).unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
+# %% [markdown]
+# ## 3. audeering MSP-dim (FROZEN): built only if the ckpt uses it (AUD_DIM>0)
+# %%
+import numpy as np
+import librosa
+from tqdm.auto import tqdm
+aud_backbone = aud_head = aud_proc = None
+if USE_AUDEERING:
+    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+    from huggingface_hub import hf_hub_download
+    AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+    aud_backbone = Wav2Vec2Model(aud_cfg)
+    try:
+        _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+            hf_hub_download(AUD_NAME, "model.safetensors"))
+    except Exception:
+        _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+    bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+    aud_backbone.load_state_dict(bb_sd, strict=False)
+    _hid = _sd["classifier.dense.weight"].shape[0]
+    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _sd["classifier.out_proj.weight"].shape[0]))
+    aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+    aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+    aud_backbone = aud_backbone.to(device).eval()
+    aud_head = aud_head.to(device).eval()
+    assert _hid + 3 == AUD_DIM, f"⚠️ Built AUD_DIM ({_hid+3}) ≠ ckpt ({AUD_DIM}) → audeering mismatch!"
+    print(f"✅ audeering frozen ({AUD_DIM}-D)")
+def load_wav(name_or_stem):
+    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(
+        WAV_DIR, name_or_stem if str(name_or_stem).endswith(".wav") else str(name_or_stem) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    return wave[: MAX_SECONDS * SR].astype(np.float32)
+@torch.no_grad()
+def extract_audeering(stems, tag):
+    if not USE_AUDEERING:
+        return {}
+    cache_path = os.path.join(CACHE_DIR, f"aud_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[aud/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    for i, s in enumerate(tqdm(todo, desc=f"audeering {tag}")):
+        wave = load_wav(s)
+        if wave is None:
+            continue
+        x = aud_proc(wave, sampling_rate=SR).input_values[0]
+        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+        h = aud_backbone(x)[0].mean(dim=1)
+        out = aud_head(h)[0].cpu().numpy()
+        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]
+        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+        if (i + 1) % 500 == 0:
+            np.savez(cache_path, **store)
+    if todo:
+        np.savez(cache_path, **store)
+    return store
+# %% [markdown]
+# ## 4. Target emotion per wavID (for the one-hot conditioning of the EMOS head)
+# %%
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+target_map = load_target_emotions()
+print("Target emotions:", len(target_map), "wav")
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+# %% [markdown]
+# ## 5. Mamba block (same as exp15) + MambaEncoder
+# %%
+import math
+try:
+    from mamba_ssm import Mamba as _OfficialMamba
+    _HAS_MAMBA_SSM = True
+    print("✅ Using mamba-ssm (CUDA kernel)")
+except Exception:
+    _HAS_MAMBA_SSM = False
+    print("ℹ️ mamba-ssm not available → pure-PyTorch Mamba")
+class MambaBlockTorch(nn.Module):
+    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
+        super().__init__()
+        self.d_inner = expand * d_model
+        self.dt_rank = math.ceil(d_model / 16)
+        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)
+        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=d_conv,
+                                groups=self.d_inner, padding=d_conv - 1, bias=True)
+        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)
+        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)
+        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)
+        self.A_log = nn.Parameter(torch.log(A))
+        self.D = nn.Parameter(torch.ones(self.d_inner))
+        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)
+        self.d_state = d_state
+    def forward(self, x):
+        B, L, _ = x.shape
+        xin, z = self.in_proj(x).chunk(2, dim=-1)
+        xin = xin.transpose(1, 2)
+        xin = self.conv1d(xin)[..., :L].transpose(1, 2)
+        xin = F.silu(xin)
+        y = self._ssm(xin) * F.silu(z)
+        return self.out_proj(y)
+    def _ssm(self, x):
+        A = -torch.exp(self.A_log)
+        delta, Bm, Cm = torch.split(self.x_proj(x), [self.dt_rank, self.d_state, self.d_state], dim=-1)
+        delta = F.softplus(self.dt_proj(delta))
+        dA = torch.exp(delta.unsqueeze(-1) * A)
+        dB_x = delta.unsqueeze(-1) * Bm.unsqueeze(2) * x.unsqueeze(-1)
+        h = torch.zeros(x.shape[0], self.d_inner, self.d_state, device=x.device, dtype=x.dtype)
+        ys = []
+        for t in range(x.shape[1]):
+            h = dA[:, t] * h + dB_x[:, t]
+            ys.append((h * Cm[:, t].unsqueeze(1)).sum(-1))
+        return torch.stack(ys, dim=1) + x * self.D
+class MambaLayer(nn.Module):
+    def __init__(self, d_model, d_state):
+        super().__init__()
+        self.norm = nn.LayerNorm(d_model)
+        self.mix = _OfficialMamba(d_model=d_model, d_state=d_state, d_conv=4, expand=2) \
+            if _HAS_MAMBA_SSM else MambaBlockTorch(d_model, d_state=d_state)
+    def forward(self, x):
+        return x + self.mix(self.norm(x))
+class MambaEncoder(nn.Module):
+    def __init__(self, d_in, d_model, n_layers, d_state, z_dim, bidir):
+        super().__init__()
+        self.bidir = bidir
+        self.proj = nn.Linear(d_in, d_model)
+        self.fwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])
+        if bidir:
+            self.bwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])
+        self.attn = nn.Linear(d_model, 1)
+        self.out = nn.Linear(d_model, z_dim)
+    @staticmethod
+    def _run(layers, h):
+        for L in layers:
+            h = L(h)
+        return h
+    def forward(self, x, mask):
+        with torch.cuda.amp.autocast(enabled=False):
+            x = x.float()
+            h = self.proj(x)
+            out = self._run(self.fwd, h)
+            if self.bidir:
+                out = out + torch.flip(self._run(self.bwd, torch.flip(h, dims=[1])), dims=[1])
+            a = self.attn(out).squeeze(-1).masked_fill(~mask, float("-inf"))
+            w = torch.softmax(a, dim=1).unsqueeze(-1)
+            return self.out((out * w).sum(1))
+# %% [markdown]
+# ## 6. Build enc + heads → load weights from the ckpt + take the normalization from the ckpt
+# %%
+N_EMO = len(EMOTIONS5)
+WAVLM_BRANCH = Z_DIM if USE_MAMBA else WAVLM_DIM
+TRUNK_IN = WAVLM_BRANCH + (AUD_DIM if USE_AUDEERING else 0)
+enc = MambaEncoder(WAVLM_DIM, MAMBA_DMODEL, MAMBA_LAYERS, MAMBA_DSTATE, Z_DIM, BIDIRECTIONAL).to(device) \
+    if USE_MAMBA else None
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+hm, hu = heads.load_state_dict(ckpt["heads"], strict=False)
+print(f"🔁 loaded heads from ckpt: {len(hm)} missing / {len(hu)} unexpected keys (expected 0)")
+if USE_MAMBA:
+    assert ckpt.get("enc") is not None, "❌ ckpt has USE_MAMBA=True but NO 'enc' → cannot run inference correctly."
+    em, eu = enc.load_state_dict(ckpt["enc"], strict=False)
+    print(f"🔁 loaded Mamba enc from ckpt: {len(em)} missing / {len(eu)} unexpected keys (expected 0)")
+heads.eval()
+if USE_MAMBA:
+    enc.eval()
+# Normalization TAKEN FROM the ckpt (the heads predict on this z-score scale → de-normalize with the same statistics)
+emos_mu = float(ckpt["emos_mu"]); emos_sd = float(ckpt["emos_sd"])
+vad_mu = np.asarray(ckpt["vad_mu"], dtype=np.float32); vad_sd = np.asarray(ckpt["vad_sd"], dtype=np.float32)
+print(f"Normalization from ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}")
+def wavlm_branch(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state
+    if USE_MAMBA:
+        return enc(out, frame_mask(out.shape[1], attn_mask))
+    return masked_mean(out, attn_mask)
+print(f"Trunk input = {TRUNK_IN} (wavlm-branch {WAVLM_BRANCH} [{'Mamba' if USE_MAMBA else 'mean-pool'}] + aud {AUD_DIM if USE_AUDEERING else 0})")
+# %% [markdown]
+# ## 7. Predict DEV → answer.txt (5 emotion columns; QMOS borrowed from exp07/UTMOSv2)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+aud_dev = extract_audeering(dev_stems, "dev")
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            for row in csv.DictReader(f):
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Borrowed exp07 QMOS ({EXP07_ANSWER}): {len(d)//2} wav")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ No exp07 answer.txt → scoring QMOS with UTMOSv2 (T05, VMC2024 winner).")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if not os.path.exists(wav):
+            continue
+        out = v2.predict(input_path=wav)
+        qmos_map[n] = float(out["predicted_mos"]) if isinstance(out, dict) else float(out)
+    del v2; torch.cuda.empty_cache() if device == "cuda" else None
+@torch.no_grad()
+def predict_emotion(sid):
+    wave = load_wav(sid)
+    if wave is None or (USE_AUDEERING and sid not in aud_dev):
+        return None
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        fw = wavlm_branch(iv, am)
+        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_real = n_def = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pr = predict_emotion(sid)
+            if pr is None:
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+            else:
+                emos, cat5, vad3 = pr; n_real += 1
+            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | predicted emotion {n_real}, default {n_def}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 8. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0] and "EMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Row {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp15_predict.zip answer.txt "
+          f"&& unzip -l submission_track2_exp15_predict.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp15_predict.zip"))
+# %% [markdown]
+# ## Notes
+# - This file is **inference only**: no training, no train.csv needed. Use it once you have `ft_mamba_emotion_full*.pt`.
+# - ⚠️ **The Mamba/head hyperparameters (MAMBA_DMODEL/LAYERS/DSTATE, TRUNK_HIDDEN, HEAD_HIDDEN) MUST match training**
+#   (the ckpt does not store them). If you changed them when training exp15, update cell 0 to match; otherwise load_state_dict
+#   will skip mismatched keys (strict=False) or raise on a shape mismatch.
+# - `USE_MAMBA`, `Z_DIM`, `AUD_DIM`, `UNFREEZE_TOP_LAYERS` are **read automatically from the ckpt**.
+# - QMOS: preferably Add Input the exp07 `answer.txt` (0.548); otherwise it is scored with UTMOSv2.
+# - Smoke test: set `LIMIT_DEV=20` for a quick trial run, then set it back to `None` to score all 2730 samples.
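
The notes at the end of this file say that the Mamba/head sizes are not stored in the ckpt and must be set by hand. Because `load_state_dict(..., strict=False)` does not raise on missing or unexpected keys, one option is to read the sizes back from the tensor shapes before building the modules. The following is a minimal sketch under stated assumptions: the state-dict key names are the ones the `MambaEncoder` and `EmoHeads` classes in this file would produce (`proj.weight`, `fwd.<i>.…`, `bwd.<i>.…`, `fwd.0.mix.A_log`, `trunk.0.weight`, `cat.0.weight`), and `infer_hparams` is a hypothetical helper that is not part of the pipeline. It only needs objects exposing a `.shape` attribute, so it is shown here with stand-in shapes rather than a real checkpoint.

```python
import re
from types import SimpleNamespace

def infer_hparams(enc_sd, heads_sd):
    """Recover architecture sizes from state-dict tensor shapes.

    enc_sd / heads_sd map parameter names to objects with a .shape attribute
    (for example ckpt["enc"] and ckpt["heads"]). Key names are ASSUMED to
    follow the MambaEncoder / EmoHeads definitions in this file.
    """
    # proj is Linear(d_in, d_model), so its weight has shape (d_model, d_in).
    d_model = enc_sd["proj.weight"].shape[0]
    # Each forward Mamba layer contributes keys with a "fwd.<i>." prefix.
    layer_ids = {int(m.group(1)) for k in enc_sd if (m := re.match(r"fwd\.(\d+)\.", k))}
    # A_log has shape (d_inner, d_state).
    d_state = enc_sd["fwd.0.mix.A_log"].shape[1]
    return {
        "MAMBA_DMODEL": d_model,
        "MAMBA_LAYERS": len(layer_ids),
        "MAMBA_DSTATE": d_state,
        # The backward stack only exists when the encoder was bidirectional.
        "BIDIRECTIONAL": any(k.startswith("bwd.") for k in enc_sd),
        # trunk[0] is Linear(d_in, trunk_h), so its weight is (trunk_h, d_in).
        "TRUNK_HIDDEN": heads_sd["trunk.0.weight"].shape[0],
        # cat[0] is Linear(trunk_h, head_h), so its weight is (head_h, trunk_h).
        "HEAD_HIDDEN": heads_sd["cat.0.weight"].shape[0],
    }

def _t(*shape):
    """Stand-in for a tensor: only the .shape attribute is used."""
    return SimpleNamespace(shape=shape)

# Stand-in shapes for illustration only (not read from a real checkpoint).
fake_enc = {
    "proj.weight": _t(256, 1024),
    "fwd.0.mix.A_log": _t(512, 16), "fwd.1.mix.A_log": _t(512, 16),
    "bwd.0.mix.A_log": _t(512, 16), "bwd.1.mix.A_log": _t(512, 16),
}
fake_heads = {"trunk.0.weight": _t(512, 1283), "cat.0.weight": _t(128, 512)}
print(infer_hparams(fake_enc, fake_heads))
```

With a real checkpoint this would be called as `infer_hparams(ckpt["enc"], ckpt["heads"])`, and the result compared against the constants in cell 0 before calling `load_state_dict`.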

track2/exp15_wavlm_mamba_emotion.ipynb ADDED

	@@ -0,0 +1,1081 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "5b4b651f",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: exp15 (WavLM FINE-TUNE + MAMBA head for the 5 emotion columns), Kaggle\n",
+    "\n",
+    "**Idea:** exp08 fine-tunes WavLM but still **mean-pools** the features over time → 1 vector/wav\n",
+    "(discarding temporal dynamics: rising/falling pitch, pauses, voice tremor, all of which matter a lot for emotion).\n",
+    "exp15 **replaces mean-pooling with a MAMBA head** (a learnable sequence encoder with linear complexity) → expected\n",
+    "to capture temporal dynamics better. Reference: MambaRate (AudioMOS 2025, arXiv:2507.12090).\n",
+    "\n",
+    "## Architecture (= exp08 with exactly one change: pool → Mamba)\n",
+    "```\n",
+    " wav ─► WavLM-large (SAILER warm-start, top N unfrozen, TRAINABLE) ─► hidden states (B, T, 1024)\n",
+    "                                                                           │  (NO mean-pool)\n",
+    "                                                             MambaEncoder (proj 1024→d, bidirectional Mamba×L,\n",
+    "                                                             masked attentive-pool) ─► z (B, Z_DIM)\n",
+    "                                                                           │\n",
+    "      (optional) audeering MSP-dim FROZEN [emb|vad3] ──concat──► TRUNK ─┬─► EMOS (+ one-hot target)\n",
+    "                                                                         ├─► CAT (5, soft-CE)\n",
+    "                                                                         └─► VAD (3)\n",
+    " QMOS: NOT trained here → borrow the exp07 QMOS column (0.548) or use UTMOSv2.\n",
+    "```\n",
+    "- **`USE_MAMBA` flag:** True = Mamba head; False = falls back to `masked_mean` = **exactly exp08**\n",
+    "  → this is the **main ablation for the paper** (\"Mamba temporal head vs mean-pooling\", SAME fine-tuned backbone).\n",
+    "\n",
+    "## ⚠️ Trade-offs / gotchas (already guarded against in the code)\n",
+    "- Fine-tuning = re-running WavLM every epoch (cannot be cached) → **the first run MUST use `LIMIT_TRAIN=300`, `LIMIT_DEV=20`**.\n",
+    "- `mamba-ssm` is hard to install on Kaggle → automatic fallback to **pure-PyTorch Mamba** (a loop over time steps). When fine-tuning,\n",
+    "  this version is **slower and uses more memory** → cap `MAX_SECONDS=6`, `BATCH=2`. On OOM or if too slow → lower MAX_SECONDS→5, MAMBA_LAYERS→1,\n",
+    "  or try installing `mamba-ssm causal-conv1d`.\n",
+    "- `layerdrop=0` (avoids CheckpointError with grad-ckpt, a lesson from exp12). Do NOT touch numpy (ABI mismatch).\n",
+    "- **The checkpoint saves the backbone + Mamba + heads at every new best** (a lesson from exp08, which lost the backbone).\n",
+    "\n",
+    "## 🔁 RESUME (user request): \"if a checkpoint exists, CONTINUE training instead of starting over\"\n",
+    "- The notebook **auto-detects** `ft_mamba_emotion_full.pt` in `/kaggle/input` and `/kaggle/working` (or set `RESUME_CKPT` by hand).\n",
+    "- A complete ckpt (WavLM backbone + Mamba enc + heads) exists → **restore the state + normalization statistics FROM the ckpt**, then continue training;\n",
+    "  `best` is initialized to the ckpt's VAL score → it is overwritten only when continued training does **BETTER** (no risk of regressing). Set `RESUME_LR_SCALE<1` to lower the LR.\n",
+    "- NO ckpt → train from the SAILER warm-start as before (the original exp15 behavior is unchanged).\n",
+    "\n",
+    "**How to run on Kaggle:** GPU **T4** + Internet **On** → Add Input the Track 2 dataset (+ Add Input an old checkpoint to resume)\n",
+    "→ edit `DATA_ROOT` → Run All."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "194bcd01",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8ed47b3b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, glob\n",
+    "\n",
+    "# ── AUTO-DETECT DATA_ROOT (scan /kaggle/input for a directory with sets/train.csv + wav/ + metadata.csv) ──\n",
+    "def find_data_root(search_root=\"/kaggle/input\"):\n",
+    "    cands = []\n",
+    "    for train_csv in glob.glob(os.path.join(search_root, \"**\", \"sets\", \"train.csv\"), recursive=True):\n",
+    "        root = os.path.dirname(os.path.dirname(train_csv))          # .../<root>/sets/train.csv → <root>\n",
+    "        score = os.path.isdir(os.path.join(root, \"wav\")) + os.path.exists(os.path.join(root, \"metadata.csv\"))\n",
+    "        cands.append((score, root))\n",
+    "    cands.sort(reverse=True)                                        # prefer directories that have both wav + metadata\n",
+    "    return cands\n",
+    "\n",
+    "_cands = find_data_root(\"/kaggle/input\")\n",
+    "if _cands:\n",
+    "    print(\"🔎 DATA_ROOT candidates (higher score = has wav+metadata):\")\n",
+    "    for sc, r in _cands:\n",
+    "        print(f\"   [{sc}/2] {r}\")\n",
+    "    DATA_ROOT = _cands[0][1]\n",
+    "    print(f\"👉 Auto-selected DATA_ROOT = {DATA_ROOT}\")\n",
+    "else:\n",
+    "    DATA_ROOT = \"/kaggle/input/datasets/minhtoan2\"   # fallback: edit by hand if auto-detection finds nothing\n",
+    "    print(f\"❌ sets/train.csv not found in /kaggle/input → using fallback {DATA_ROOT} (did you Add Input?)\")\n",
+    "\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"     # wavID|emotion|transcript (NO header)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/ft_cache\"         # audeering cache (.npz); WavLM/Mamba are NOT cached (they are being trained)\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# (Optional) reuse an old audeering cache (/kaggle/input is read-only → copy to working)\n",
+    "# Dataset cache_exp8: aud_*.npz sits in the archive/ subdirectory → scan RECURSIVELY to catch any location.\n",
+    "CACHE_INPUT = \"/kaggle/input/cache-exp8\"   # << EDIT the slug (dataset cache_exp8 → Kaggle turns _ into -); or \"\"\n",
+    "if CACHE_INPUT and os.path.isdir(CACHE_INPUT):\n",
+    "    import shutil\n",
+    "    _n = 0\n",
+    "    for _fp in glob.glob(os.path.join(CACHE_INPUT, \"**\", \"aud_*.npz\"), recursive=True):\n",
+    "        shutil.copy(_fp, os.path.join(CACHE_DIR, os.path.basename(_fp))); _n += 1\n",
+    "    print(f\"📦 Reusing cache: copied {_n} aud_*.npz file(s) (recursive scan of {CACHE_INPUT})\")\n",
+    "else:\n",
+    "    print(f\"ℹ️ CACHE_INPUT={CACHE_INPUT} not found → audeering features will be extracted.\")\n",
+    "\n",
+    "# Borrow the exp07 QMOS column (0.548). Point to the exp07 answer.txt if available; otherwise use UTMOSv2.\n",
+    "EXP07_ANSWER = \"/kaggle/input/exp07-answer/answer.txt\"   # << (optional)\n",
+    "\n",
+    "# ── Cờ Mamba (ablation chính) ────────────────────────────────────────────────\n",
+    "USE_MAMBA           = True        # True = Mamba head; False = mean-pool = ĐÚNG exp08\n",
+    "\n",
+    "# ── Mamba head hyperparameters ───────────────────────────────────────────────\n",
+    "MAMBA_DMODEL        = 256\n",
+    "MAMBA_LAYERS        = 2\n",
+    "MAMBA_DSTATE        = 16\n",
+    "BIDIRECTIONAL       = True\n",
+    "Z_DIM               = 256         # output vector size after attentive pooling; replaces the mean-pooled WavLM embedding\n",
+    "\n",
+    "# ── Fine-tuning hyperparameters (inherited from exp08) ───────────────────────\n",
+    "DEVICE              = \"cuda\"\n",
+    "SR                  = 16000\n",
+    "MAX_SECONDS         = 6           # reduced from 8 (exp08) because Mamba backprop-through-time uses more memory\n",
+    "UNFREEZE_TOP_LAYERS = 6           # number of top Transformer layers to train (0 = freeze all)\n",
+    "TRUNK_HIDDEN        = 512\n",
+    "HEAD_HIDDEN         = 128\n",
+    "DROPOUT             = 0.3\n",
+    "LR_BACKBONE         = 1e-5\n",
+    "LR_HEAD             = 1e-3        # for Mamba + trunk + heads (trained from scratch)\n",
+    "WEIGHT_DECAY        = 1e-5\n",
+    "EPOCHS              = 12\n",
+    "PATIENCE            = 3\n",
+    "BATCH               = 2           # small (large backbone + Mamba); compensated by ACCUM\n",
+    "ACCUM               = 16          # effective batch = 32\n",
+    "VAL_FRAC            = 0.10\n",
+    "SEED                = 42\n",
+    "USE_AMP             = True\n",
+    "USE_GRAD_CKPT       = True\n",
+    "USE_AUDEERING       = True\n",
+    "USE_UNCERTAINTY     = True\n",
+    "RANK_LAMBDA         = 0.3         # 0 = MSE only (old behavior). >0 = add a pairwise ranking loss (targets SRCC directly) for emos/val/aro/dom\n",
+    "                                  # ⚠️ ranking needs MANY pairs per batch to be effective → with a small BATCH (2) the effect is weak (see Notes)\n",
+    "\n",
+    "LIMIT_TRAIN         = 300         # << FIRST RUN 300; real run None\n",
+    "LIMIT_DEV           = 20          # << FIRST RUN 20; real run None\n",
+    "\n",
+    "# ── RESUME: CONTINUE training from a checkpoint, do NOT restart from scratch ──\n",
+    "# Leave \"\" for auto-detection: if `ft_mamba_emotion_full.pt` (full backbone+Mamba+heads) is found in /kaggle/input\n",
+    "# or /kaggle/working → it is loaded and training continues. Set RESUME_CKPT by hand to pick a specific file.\n",
+    "RESUME_CKPT         = \"\"          # << \"\" = auto-detect; or \"/kaggle/input/<slug>/ft_mamba_emotion_full.pt\"\n",
+    "RESUME_LR_SCALE     = 1.0         # <1.0 lowers the LR when resuming (e.g. 0.5 if validation has plateaued)\n",
+    "\n",
+    "def find_resume_ckpt(explicit):\n",
+    "    \"\"\"Find the exp15 checkpoint to resume from. Prefer the user-supplied path; otherwise auto-detect.\n",
+    "    Also matches names with a duplicate suffix added by Kaggle/Windows, e.g. 'ft_mamba_emotion_full (2).pt'.\"\"\"\n",
+    "    if explicit and os.path.exists(explicit):\n",
+    "        return explicit\n",
+    "    for base in [\"/kaggle/input\", \"/kaggle/working\"]:\n",
+    "        hits = sorted(glob.glob(os.path.join(base, \"**\", \"ft_mamba_emotion_full*.pt\"), recursive=True))\n",
+    "        if hits:\n",
+    "            return hits[0]\n",
+    "    return \"\"\n",
+    "\n",
+    "RESUME_CKPT = find_resume_ckpt(RESUME_CKPT)\n",
+    "RESUME      = bool(RESUME_CKPT)\n",
+    "print(\"🔁 RESUME =\", RESUME, (\"→ resuming from: \" + RESUME_CKPT) if RESUME else \"(no ckpt found → training FROM SCRATCH)\")\n",
+    "\n",
+    "# Baseline (exp08 fine-tune + mean-pool, the direct competitor of the Mamba head)\n",
+    "EXP08 = {\"emos\": 0.811, \"val\": 0.659, \"aro\": 0.793, \"dom\": 0.751}\n",
+    "\n",
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(p):\n",
+    "    return os.path.splitext(os.path.basename(str(p)))[0]\n",
+    "\n",
+    "print(\"USE_MAMBA =\", USE_MAMBA, \"(False → reproduces exp08)\")\n",
+    "print(\"DATA_ROOT:\", DATA_ROOT)\n",
+    "for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:\n",
+    "    print((\"  ✅ \" if os.path.exists(p) else \"  ❌ MISSING \") + p)\n",
+    "print(f\"Fine-tune: unfreezing {UNFREEZE_TOP_LAYERS} layers · BATCH {BATCH}×ACCUM {ACCUM} · MAX {MAX_SECONDS}s\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c8010473",
+   "metadata": {},
+   "source": [
+    "## 1. Install dependencies + fetch the SAILER code (clone + sys.path)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1b8d9fad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sys, subprocess\n",
+    "\n",
+    "def pip_install(*pkgs):\n",
+    "    subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", *pkgs], check=True)\n",
+    "\n",
+    "pip_install(\"loralib\", \"speechbrain\", \"speechmos\", \"librosa\", \"soundfile\",\n",
+    "            \"scipy\", \"scikit-learn\", \"pandas\", \"tqdm\")\n",
+    "\n",
+    "# Install the Mamba CUDA kernels (much faster and lighter on memory than the pure-PyTorch version). The build often fails or is slow on Kaggle\n",
+    "# → wrapped in try/except: on failure it is SKIPPED and section 6a falls back to pure-PyTorch Mamba. The notebook must NOT crash.\n",
+    "INSTALL_MAMBA_SSM = True   # set False to SKIP this and use pure-PyTorch Mamba directly\n",
+    "if INSTALL_MAMBA_SSM and USE_MAMBA:\n",
+    "    try:\n",
+    "        # --no-build-isolation for BOTH → compile against Kaggle's preinstalled torch+CUDA (do not pull in another torch).\n",
+    "        # ninja is needed for a fast build. -q hides the log, so this step may appear to \"hang\" for several minutes while compiling; that is normal.\n",
+    "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\", \"ninja\"], check=True)\n",
+    "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n",
+    "                        \"--no-build-isolation\", \"causal-conv1d>=1.2.0\"], check=True)\n",
+    "        subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"-q\",\n",
+    "                        \"--no-build-isolation\", \"mamba-ssm\"], check=True)\n",
+    "        print(\"✅ Installed mamba-ssm + causal-conv1d (the CUDA kernels will be used if the import succeeds).\")\n",
+    "    except Exception as e:\n",
+    "        print(\"⚠️ mamba-ssm install failed:\", repr(e), \"→ using pure-PyTorch Mamba (slower).\")\n",
+    "        print(\"   ℹ️ The notebook still runs normally. If the REAL run (LIMIT=None) is too slow → see the Notes at the end of the notebook.\")\n",
+    "\n",
+    "REPO_DIR = \"/kaggle/working/vox-profile-release\"\n",
+    "if not os.path.exists(REPO_DIR):\n",
+    "    subprocess.run([\"git\", \"clone\", \"--depth\", \"1\",\n",
+    "                    \"https://github.com/tiantiaf0627/vox-profile-release.git\", REPO_DIR], check=True)\n",
+    "if REPO_DIR not in sys.path:\n",
+    "    sys.path.insert(0, REPO_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "598d74d9",
+   "metadata": {},
+   "source": [
+    "## 2. Load SAILER → extract its internal WavLM backbone for FINE-TUNING (warm start)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5346a63d",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "import torch.nn.functional as F\n",
+    "\n",
+    "device = DEVICE if torch.cuda.is_available() else \"cpu\"\n",
+    "print(\"Device:\", device, (\"✅ \" + torch.cuda.get_device_name(0)) if device == \"cuda\" else \"⚠️ CPU (very slow!)\")\n",
+    "\n",
+    "def find_hf_backbone(module):\n",
+    "    \"\"\"Find a submodule that looks like an HF WavLM backbone: it has .feature_extractor and .encoder.layers.\"\"\"\n",
+    "    cands = []\n",
+    "    for name, m in module.named_modules():\n",
+    "        enc = getattr(m, \"encoder\", None)\n",
+    "        if getattr(m, \"feature_extractor\", None) is not None and enc is not None \\\n",
+    "                and getattr(enc, \"layers\", None) is not None:\n",
+    "            cands.append((name, m))\n",
+    "    if not cands:\n",
+    "        return None, None\n",
+    "    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)\n",
+    "    return cands[0]\n",
+    "\n",
+    "wavlm = None\n",
+    "try:\n",
+    "    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402\n",
+    "    _wrapper = WavLMWrapper.from_pretrained(\"tiantiaf/wavlm-large-categorical-emotion\")\n",
+    "    name, wavlm = find_hf_backbone(_wrapper)\n",
+    "    if wavlm is not None:\n",
+    "        print(f\"✅ SAILER warm start: WavLM backbone at '.{name}' \"\n",
+    "              f\"({sum(p.numel() for p in wavlm.parameters())/1e6:.0f}M params)\")\n",
+    "    else:\n",
+    "        print(\"⚠️ No HF backbone found inside the SAILER wrapper → falling back to vanilla WavLM.\")\n",
+    "except Exception as e:\n",
+    "    print(\"⚠️ Failed to load the SAILER wrapper:\", repr(e), \"→ falling back to vanilla WavLM.\")\n",
+    "\n",
+    "if wavlm is None:\n",
+    "    from transformers import WavLMModel\n",
+    "    wavlm = WavLMModel.from_pretrained(\"microsoft/wavlm-large\")\n",
+    "    print(\"ℹ️ Fallback: microsoft/wavlm-large (NO SAILER warm start).\")\n",
+    "\n",
+    "wavlm = wavlm.to(device)\n",
+    "WAVLM_DIM = int(wavlm.config.hidden_size)\n",
+    "wavlm.config.layerdrop = 0.0   # ⚠️ avoids CheckpointError with gradient checkpointing (lesson from exp12)\n",
+    "\n",
+    "# ── RESUME: load fine-tuned backbone weights from the checkpoint (overrides the SAILER warm start) ──\n",
+    "resume_ckpt = None\n",
+    "if RESUME:\n",
+    "    resume_ckpt = torch.load(RESUME_CKPT, map_location=\"cpu\", weights_only=False)  # ckpt contains numpy objects → needs False\n",
+    "    assert \"wavlm\" in resume_ckpt, (\"❌ Checkpoint has NO 'wavlm' (backbone) entry → cannot resume. \"\n",
+    "                                    \"Use the ft_mamba_emotion_full.pt file saved by exp15.\")\n",
+    "    if resume_ckpt.get(\"USE_MAMBA\", USE_MAMBA) != USE_MAMBA:\n",
+    "        print(f\"   ⚠️ ckpt USE_MAMBA={resume_ckpt.get('USE_MAMBA')} ≠ current config {USE_MAMBA} → architecture MISMATCH! \"\n",
+    "              \"Set USE_MAMBA to match the ckpt.\")\n",
+    "    miss, unexp = wavlm.load_state_dict(resume_ckpt[\"wavlm\"], strict=False)\n",
+    "    print(f\"🔁 RESUME loaded wavlm from ckpt: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0). ckpt keys:\", list(resume_ckpt.keys()))\n",
+    "    if len(miss) > 20 or len(unexp) > 20:\n",
+    "        print(\"   ⚠️ Many mismatched keys → check that UNFREEZE_TOP_LAYERS / the backbone match the ckpt.\")\n",
+    "\n",
+    "# ── Partial freeze: feature extractor + everything except the top UNFREEZE_TOP_LAYERS layers ──\n",
+    "for p in wavlm.parameters():\n",
+    "    p.requires_grad = False\n",
+    "enc_layers = wavlm.encoder.layers\n",
+    "n_layers = len(enc_layers)\n",
+    "for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:\n",
+    "    for p in layer.parameters():\n",
+    "        p.requires_grad = True\n",
+    "n_train = sum(p.numel() for p in wavlm.parameters() if p.requires_grad)\n",
+    "print(f\"WavLM: {n_layers} layers · unfreezing {min(UNFREEZE_TOP_LAYERS, n_layers)} → {n_train/1e6:.1f}M trainable params (dim {WAVLM_DIM})\")\n",
+    "\n",
+    "if USE_GRAD_CKPT:\n",
+    "    wavlm.gradient_checkpointing_enable()\n",
+    "    if hasattr(wavlm, \"enable_input_require_grads\"):\n",
+    "        wavlm.enable_input_require_grads()\n",
+    "\n",
+    "def frame_mask(T, attn_mask):\n",
+    "    \"\"\"attn_mask (B, Lwav) → frame-mask (B, T) bool (True = real frame). Matches WavLM's downsampling.\"\"\"\n",
+    "    if attn_mask is None:\n",
+    "        return torch.ones((1, T), dtype=torch.bool, device=device)\n",
+    "    try:\n",
+    "        fm = wavlm._get_feature_vector_attention_mask(T, attn_mask)\n",
+    "        return fm.bool()\n",
+    "    except Exception:\n",
+    "        return torch.ones((attn_mask.shape[0], T), dtype=torch.bool, device=attn_mask.device)\n",
+    "\n",
+    "def masked_mean(hidden, attn_mask):\n",
+    "    \"\"\"Mean-pool over time, ignoring padding (the exp08 path when USE_MAMBA=False).\"\"\"\n",
+    "    if attn_mask is None:\n",
+    "        return hidden.mean(dim=1)\n",
+    "    fm = frame_mask(hidden.shape[1], attn_mask).unsqueeze(-1).to(hidden.dtype)\n",
+    "    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c72d5983",
+   "metadata": {},
+   "source": [
+    "## 3. Load audeering MSP-dim (FROZEN) as auxiliary features (same as exp08)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d967397d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "AUD_DIM = 0\n",
+    "aud_backbone = aud_head = aud_proc = None\n",
+    "if USE_AUDEERING:\n",
+    "    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor\n",
+    "    from huggingface_hub import hf_hub_download\n",
+    "    AUD_NAME = \"audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim\"\n",
+    "    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)\n",
+    "    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)\n",
+    "    aud_backbone = Wav2Vec2Model(aud_cfg)\n",
+    "    try:\n",
+    "        _sd = __import__(\"safetensors.torch\", fromlist=[\"load_file\"]).load_file(\n",
+    "            hf_hub_download(AUD_NAME, \"model.safetensors\"))\n",
+    "    except Exception:\n",
+    "        _sd = torch.load(hf_hub_download(AUD_NAME, \"pytorch_model.bin\"), map_location=\"cpu\")\n",
+    "    bb_sd = {k[len(\"wav2vec2.\"):]: v for k, v in _sd.items() if k.startswith(\"wav2vec2.\")}\n",
+    "    aud_backbone.load_state_dict(bb_sd, strict=False)\n",
+    "    _hid = _sd[\"classifier.dense.weight\"].shape[0]\n",
+    "    _out = _sd[\"classifier.out_proj.weight\"].shape[0]\n",
+    "    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))\n",
+    "    aud_head[0].weight.data.copy_(_sd[\"classifier.dense.weight\"]); aud_head[0].bias.data.copy_(_sd[\"classifier.dense.bias\"])\n",
+    "    aud_head[2].weight.data.copy_(_sd[\"classifier.out_proj.weight\"]); aud_head[2].bias.data.copy_(_sd[\"classifier.out_proj.bias\"])\n",
+    "    aud_backbone = aud_backbone.to(device).eval()\n",
+    "    aud_head = aud_head.to(device).eval()\n",
+    "    AUD_DIM = _hid + 3\n",
+    "    print(f\"✅ audeering frozen (auxiliary features {AUD_DIM}-D = emb {_hid} + vad 3)\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1a5f1592",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import librosa\n",
+    "from tqdm.auto import tqdm\n",
+    "\n",
+    "def load_wav(name_or_stem):\n",
+    "    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(\n",
+    "        WAV_DIR, name_or_stem if str(name_or_stem).endswith(\".wav\") else str(name_or_stem) + \".wav\")\n",
+    "    if not os.path.exists(p):\n",
+    "        return None\n",
+    "    wave, _ = librosa.load(p, sr=SR, mono=True)\n",
+    "    return wave[: MAX_SECONDS * SR].astype(np.float32)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def extract_audeering(stems, tag):\n",
+    "    if not USE_AUDEERING:\n",
+    "        return {}\n",
+    "    cache_path = os.path.join(CACHE_DIR, f\"aud_{tag}.npz\")\n",
+    "    store = {}\n",
+    "    if os.path.exists(cache_path):\n",
+    "        z = np.load(cache_path, allow_pickle=True)\n",
+    "        store = {k: z[k] for k in z.files}\n",
+    "        print(f\"[aud/{tag}] loaded cache: {len(store)}\")\n",
+    "    todo = [s for s in stems if s not in store]\n",
+    "    for i, s in enumerate(tqdm(todo, desc=f\"audeering {tag}\")):\n",
+    "        wave = load_wav(s)\n",
+    "        if wave is None:\n",
+    "            continue\n",
+    "        x = aud_proc(wave, sampling_rate=SR).input_values[0]\n",
+    "        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)\n",
+    "        h = aud_backbone(x)[0].mean(dim=1)\n",
+    "        out = aud_head(h)[0].cpu().numpy()                  # [arousal, dominance, valence] ∈[0,1]\n",
+    "        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]\n",
+    "        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)\n",
+    "        if (i + 1) % 500 == 0:\n",
+    "            np.savez(cache_path, **store)\n",
+    "    if todo:\n",
+    "        np.savez(cache_path, **store)\n",
+    "    return store"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "50717e09",
+   "metadata": {},
+   "source": [
+    "## 4. Read and aggregate labels per wavID (EMOS / VAD / CAT), same as exp08"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b5c3e935",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    tgt = {}\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) >= 2:\n",
+    "                tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "def _col(cols_map, *names, df=None, default_idx=None):\n",
+    "    for n in names:\n",
+    "        if n in cols_map:\n",
+    "            return cols_map[n]\n",
+    "    return list(df.columns)[default_idx] if default_idx is not None else None\n",
+    "\n",
+    "def parse_emocat_votes(cell):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    for tok in str(cell).replace(\"/\", \",\").replace(\";\", \",\").replace(\"|\", \",\").replace(\" \", \",\").split(\",\"):\n",
+    "        e = norm_emotion(tok)\n",
+    "        if e in EMOTIONS5:\n",
+    "            v[EMOTIONS5.index(e)] += 1.0\n",
+    "    return v\n",
+    "\n",
+    "def load_train_labels():\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    cols = {c.lower().strip(): c for c in df.columns}\n",
+    "    wav_col = _col(cols, \"wavid\", \"wav\", df=df, default_idx=1)\n",
+    "    emos_col = _col(cols, \"emos\", \"emo\", \"emomos\")\n",
+    "    val_col = _col(cols, \"val\", \"valence\"); aro_col = _col(cols, \"aro\", \"arousal\"); dom_col = _col(cols, \"dom\", \"dominance\")\n",
+    "    cat_col = _col(cols, \"emocat\", \"cat\", \"emotion\")\n",
+    "    assert emos_col, f\"eMOS column not found (columns: {list(df.columns)})\"\n",
+    "    df[\"_stem\"] = df[wav_col].map(stem)\n",
+    "    rows = []\n",
+    "    for sid, g in df.groupby(\"_stem\"):\n",
+    "        rec = {\"wavID\": sid, \"emos\": float(g[emos_col].mean())}\n",
+    "        rec[\"val\"] = float(g[val_col].mean()) if val_col else np.nan\n",
+    "        rec[\"aro\"] = float(g[aro_col].mean()) if aro_col else np.nan\n",
+    "        rec[\"dom\"] = float(g[dom_col].mean()) if dom_col else np.nan\n",
+    "        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "        if cat_col:\n",
+    "            for cell in g[cat_col]:\n",
+    "                votes += parse_emocat_votes(cell)\n",
+    "        s = votes.sum()\n",
+    "        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)\n",
+    "        for i in range(len(EMOTIONS5)):\n",
+    "            rec[f\"cat{i}\"] = float(cat[i])\n",
+    "        rows.append(rec)\n",
+    "    return pd.DataFrame(rows)\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "train_df = load_train_labels()\n",
+    "HAS_VAD = bool(train_df[\"val\"].notna().any())\n",
+    "print(f\"Target: {len(target_map)} | train wavs (aggregated): {len(train_df)} | has VAD: {HAS_VAD}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b5ab79d3",
+   "metadata": {},
+   "source": [
+    "## 5. Dataset / DataLoader (wavs loaded per batch; WavLM outputs are NOT cached because the backbone is being trained)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9989f142",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from torch.utils.data import Dataset, DataLoader\n",
+    "\n",
+    "train_stems = [s for s in train_df[\"wavID\"] if target_map.get(s) is not None]\n",
+    "if LIMIT_TRAIN:\n",
+    "    train_stems = train_stems[:LIMIT_TRAIN]\n",
+    "aud_tr = extract_audeering(train_stems, \"train\")\n",
+    "\n",
+    "lab = train_df.set_index(\"wavID\")\n",
+    "\n",
+    "def _zfit(arr):\n",
+    "    a = np.asarray(arr, dtype=np.float32)\n",
+    "    return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)\n",
+    "\n",
+    "if RESUME and resume_ckpt is not None:\n",
+    "    # IMPORTANT: take the normalization stats FROM the ckpt (the heads were trained on this scale); do NOT recompute them, to avoid a scale mismatch\n",
+    "    emos_mu = float(resume_ckpt[\"emos_mu\"]); emos_sd = float(resume_ckpt[\"emos_sd\"])\n",
+    "    vad_mu = np.asarray(resume_ckpt[\"vad_mu\"], dtype=np.float32)\n",
+    "    vad_sd = np.asarray(resume_ckpt[\"vad_sd\"], dtype=np.float32)\n",
+    "    print(f\"🔁 RESUME: using normalization FROM ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}\")\n",
+    "else:\n",
+    "    emos_mu, emos_sd = _zfit([lab.loc[s, \"emos\"] for s in train_stems])\n",
+    "    if HAS_VAD:\n",
+    "        vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in [\"val\", \"aro\", \"dom\"]], dtype=np.float32)\n",
+    "        vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in [\"val\", \"aro\", \"dom\"]], dtype=np.float32)\n",
+    "    else:\n",
+    "        vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)\n",
+    "\n",
+    "def onehot_target(tgt):\n",
+    "    v = np.zeros(len(EMOTIONS5), dtype=np.float32)\n",
+    "    if tgt in EMOTIONS5:\n",
+    "        v[EMOTIONS5.index(tgt)] = 1.0\n",
+    "    return v\n",
+    "\n",
+    "class EmoDataset(Dataset):\n",
+    "    def __init__(self, stems):\n",
+    "        self.stems = [s for s in stems if (load_wav(s) is not None) and ((not USE_AUDEERING) or s in aud_tr)]\n",
+    "    def __len__(self):\n",
+    "        return len(self.stems)\n",
+    "    def __getitem__(self, i):\n",
+    "        s = self.stems[i]\n",
+    "        wave = load_wav(s)\n",
+    "        emos = (float(lab.loc[s, \"emos\"]) - emos_mu) / emos_sd\n",
+    "        if HAS_VAD:\n",
+    "            vad = (np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32) - vad_mu) / vad_sd\n",
+    "        else:\n",
+    "            vad = np.zeros(3, dtype=np.float32)\n",
+    "        cat = np.array([lab.loc[s, f\"cat{j}\"] for j in range(len(EMOTIONS5))], dtype=np.float32)\n",
+    "        aud = aud_tr[s] if USE_AUDEERING else np.zeros(0, dtype=np.float32)\n",
+    "        return {\"wave\": wave, \"tgt\": onehot_target(target_map.get(s)), \"aud\": aud,\n",
+    "                \"emos\": np.float32(emos), \"vad\": vad, \"cat\": cat,\n",
+    "                \"emos_raw\": np.float32(lab.loc[s, \"emos\"]),\n",
+    "                \"vad_raw\": np.array([lab.loc[s, \"val\"], lab.loc[s, \"aro\"], lab.loc[s, \"dom\"]], np.float32)}\n",
+    "\n",
+    "def collate(batch):\n",
+    "    L = max(len(b[\"wave\"]) for b in batch)\n",
+    "    waves = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    mask = np.zeros((len(batch), L), dtype=np.float32)\n",
+    "    for i, b in enumerate(batch):\n",
+    "        waves[i, : len(b[\"wave\"])] = b[\"wave\"]; mask[i, : len(b[\"wave\"])] = 1.0\n",
+    "    return {\n",
+    "        \"input_values\": torch.from_numpy(waves), \"attn_mask\": torch.from_numpy(mask).long(),\n",
+    "        \"tgt\": torch.from_numpy(np.stack([b[\"tgt\"] for b in batch])),\n",
+    "        \"aud\": torch.from_numpy(np.stack([b[\"aud\"] for b in batch])) if USE_AUDEERING else None,\n",
+    "        \"emos\": torch.from_numpy(np.stack([b[\"emos\"] for b in batch])).unsqueeze(1),\n",
+    "        \"vad\": torch.from_numpy(np.stack([b[\"vad\"] for b in batch])),\n",
+    "        \"cat\": torch.from_numpy(np.stack([b[\"cat\"] for b in batch])),\n",
+    "        \"emos_raw\": np.stack([b[\"emos_raw\"] for b in batch]),\n",
+    "        \"vad_raw\": np.stack([b[\"vad_raw\"] for b in batch]),\n",
+    "    }\n",
+    "\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "ds = EmoDataset(train_stems)\n",
+    "print(\"Valid dataset:\", len(ds), \"wavs\")\n",
+    "tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)\n",
+    "tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)\n",
+    "va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6006ec6c",
+   "metadata": {},
+   "source": [
+    "## 6a. MAMBA block (pure PyTorch, fallback when `mamba-ssm` is unavailable)\n",
+    "Follows \"mamba-minimal\": the same selective SSM formulation, only slower than the CUDA kernels. Runs in fp32 for stability."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9089952",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "\n",
+    "try:\n",
+    "    from mamba_ssm import Mamba as _OfficialMamba\n",
+    "    _HAS_MAMBA_SSM = True\n",
+    "    print(\"✅ Using mamba-ssm (CUDA kernels)\")\n",
+    "except Exception:\n",
+    "    _HAS_MAMBA_SSM = False\n",
+    "    print(\"ℹ️ mamba-ssm not available → pure-PyTorch Mamba (slower during fine-tuning)\")\n",
+    "\n",
+    "class MambaBlockTorch(nn.Module):\n",
+    "    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):\n",
+    "        super().__init__()\n",
+    "        self.d_inner = expand * d_model\n",
+    "        self.dt_rank = math.ceil(d_model / 16)\n",
+    "        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)\n",
+    "        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=d_conv,\n",
+    "                                groups=self.d_inner, padding=d_conv - 1, bias=True)\n",
+    "        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)\n",
+    "        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)\n",
+    "        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)\n",
+    "        self.A_log = nn.Parameter(torch.log(A))\n",
+    "        self.D = nn.Parameter(torch.ones(self.d_inner))\n",
+    "        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)\n",
+    "        self.d_state = d_state\n",
+    "\n",
+    "    def forward(self, x):                                 # x: (B, L, d_model)\n",
+    "        B, L, _ = x.shape\n",
+    "        xin, z = self.in_proj(x).chunk(2, dim=-1)\n",
+    "        xin = xin.transpose(1, 2)\n",
+    "        xin = self.conv1d(xin)[..., :L].transpose(1, 2)\n",
+    "        xin = F.silu(xin)\n",
+    "        y = self._ssm(xin) * F.silu(z)\n",
+    "        return self.out_proj(y)\n",
+    "\n",
+    "    def _ssm(self, x):\n",
+    "        A = -torch.exp(self.A_log)\n",
+    "        delta, Bm, Cm = torch.split(self.x_proj(x), [self.dt_rank, self.d_state, self.d_state], dim=-1)\n",
+    "        delta = F.softplus(self.dt_proj(delta))\n",
+    "        dA = torch.exp(delta.unsqueeze(-1) * A)\n",
+    "        dB_x = delta.unsqueeze(-1) * Bm.unsqueeze(2) * x.unsqueeze(-1)\n",
+    "        h = torch.zeros(x.shape[0], self.d_inner, self.d_state, device=x.device, dtype=x.dtype)\n",
+    "        ys = []\n",
+    "        for t in range(x.shape[1]):\n",
+    "            h = dA[:, t] * h + dB_x[:, t]\n",
+    "            ys.append((h * Cm[:, t].unsqueeze(1)).sum(-1))\n",
+    "        return torch.stack(ys, dim=1) + x * self.D\n",
+    "\n",
+    "class MambaLayer(nn.Module):\n",
+    "    def __init__(self, d_model, d_state):\n",
+    "        super().__init__()\n",
+    "        self.norm = nn.LayerNorm(d_model)\n",
+    "        self.mix = _OfficialMamba(d_model=d_model, d_state=d_state, d_conv=4, expand=2) \\\n",
+    "            if _HAS_MAMBA_SSM else MambaBlockTorch(d_model, d_state=d_state)\n",
+    "    def forward(self, x):\n",
+    "        return x + self.mix(self.norm(x))\n",
+    "\n",
+    "class MambaEncoder(nn.Module):\n",
+    "    \"\"\"1024 → d_model → [Mamba ×L] (bidirectional) → attentive pooling (masked) → Z_DIM.\"\"\"\n",
+    "    def __init__(self, d_in, d_model, n_layers, d_state, z_dim, bidir):\n",
+    "        super().__init__()\n",
+    "        self.bidir = bidir\n",
+    "        self.proj = nn.Linear(d_in, d_model)\n",
+    "        self.fwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])\n",
+    "        if bidir:\n",
+    "            self.bwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])\n",
+    "        self.attn = nn.Linear(d_model, 1)\n",
+    "        self.out = nn.Linear(d_model, z_dim)\n",
+    "\n",
+    "    @staticmethod\n",
+    "    def _run(layers, h):\n",
+    "        for L in layers:\n",
+    "            h = L(h)\n",
+    "        return h\n",
+    "\n",
+    "    def forward(self, x, mask):                           # x:(B,L,1024) mask:(B,L) bool\n",
+    "        with torch.cuda.amp.autocast(enabled=False):      # run the SSM in fp32 for stability\n",
+    "            x = x.float()\n",
+    "            h = self.proj(x)\n",
+    "            out = self._run(self.fwd, h)\n",
+    "            if self.bidir:\n",
+    "                out = out + torch.flip(self._run(self.bwd, torch.flip(h, dims=[1])), dims=[1])\n",
+    "            a = self.attn(out).squeeze(-1).masked_fill(~mask, float(\"-inf\"))\n",
+    "            w = torch.softmax(a, dim=1).unsqueeze(-1)\n",
+    "            return self.out((out * w).sum(1))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff1cec20",
+   "metadata": {},
+   "source": [
+    "## 6b. Emotion heads + training loop (AMP + grad-accum + uncertainty weighting)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c414e504",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "from scipy.stats import spearmanr\n",
+    "\n",
+    "torch.manual_seed(SEED); np.random.seed(SEED)\n",
+    "N_EMO = len(EMOTIONS5)\n",
+    "WAVLM_BRANCH = Z_DIM if USE_MAMBA else WAVLM_DIM\n",
+    "TRUNK_IN = WAVLM_BRANCH + (AUD_DIM if USE_AUDEERING else 0)\n",
+    "\n",
+    "enc = MambaEncoder(WAVLM_DIM, MAMBA_DMODEL, MAMBA_LAYERS, MAMBA_DSTATE, Z_DIM, BIDIRECTIONAL).to(device) \\\n",
+    "    if USE_MAMBA else None\n",
+    "\n",
+    "class EmoHeads(nn.Module):\n",
+    "    def __init__(self, d_in, trunk_h, head_h, p, n_emo):\n",
+    "        super().__init__()\n",
+    "        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),\n",
+    "                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))\n",
+    "        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))\n",
+    "        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))\n",
+    "        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))\n",
+    "    def forward(self, feat, tgt):\n",
+    "        h = self.trunk(feat)\n",
+    "        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)\n",
+    "\n",
+    "heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)\n",
+    "print(f\"Trunk input = {TRUNK_IN} (wavlm-branch {WAVLM_BRANCH} [{'Mamba' if USE_MAMBA else 'mean-pool'}] + aud {AUD_DIM if USE_AUDEERING else 0})\")\n",
+    "if USE_MAMBA:\n",
+    "    print(f\"Mamba encoder: {sum(p.numel() for p in enc.parameters())/1e6:.2f}M param\")\n",
+    "\n",
+    "# ── RESUME: load heads (+ Mamba enc) from the checkpoint ──\n",
+    "if RESUME and resume_ckpt is not None:\n",
+    "    hm, hu = heads.load_state_dict(resume_ckpt[\"heads\"], strict=False)\n",
+    "    print(f\"🔁 RESUME loaded heads from ckpt: {len(hm)} missing / {len(hu)} unexpected keys (expected 0)\")\n",
+    "    if USE_MAMBA and resume_ckpt.get(\"enc\") is not None:\n",
+    "        em, eu = enc.load_state_dict(resume_ckpt[\"enc\"], strict=False)\n",
+    "        print(f\"🔁 RESUME loaded Mamba enc from ckpt: {len(em)} missing / {len(eu)} unexpected keys (expected 0)\")\n",
+    "    elif USE_MAMBA:\n",
+    "        print(\"   ⚠️ ckpt has NO 'enc' (Mamba) → the Mamba head trains from scratch (only backbone + heads are resumed).\")\n",
+    "\n",
+    "TASKS = [\"emos\", \"cat\", \"val\", \"aro\", \"dom\"]\n",
+    "log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))\n",
+    "bb_params = [p for p in wavlm.parameters() if p.requires_grad]\n",
+    "head_params = list(heads.parameters()) + (list(enc.parameters()) if USE_MAMBA else []) \\\n",
+    "    + ([log_var] if USE_UNCERTAINTY else [])\n",
+    "_lr_scale = RESUME_LR_SCALE if RESUME else 1.0\n",
+    "opt = torch.optim.AdamW([\n",
+    "    {\"params\": bb_params, \"lr\": LR_BACKBONE * _lr_scale},\n",
+    "    {\"params\": head_params, \"lr\": LR_HEAD * _lr_scale},\n",
+    "], weight_decay=WEIGHT_DECAY)\n",
+    "if RESUME and _lr_scale != 1.0:\n",
+    "    print(f\"🔁 RESUME: LR ×{_lr_scale} → backbone {LR_BACKBONE*_lr_scale:.1e} · head {LR_HEAD*_lr_scale:.1e}\")\n",
+    "scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == \"cuda\")\n",
+    "mse = nn.MSELoss()\n",
+    "\n",
+    "def soft_ce(logits, target_dist):\n",
+    "    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()\n",
+    "\n",
+    "def wavlm_branch(input_values, attn_mask):\n",
+    "    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state    # (B,T,D)\n",
+    "    if USE_MAMBA:\n",
+    "        return enc(out, frame_mask(out.shape[1], attn_mask))                  # (B, Z_DIM)\n",
+    "    return masked_mean(out, attn_mask)                                        # (B, D)\n",
+    "\n",
+    "def forward_batch(b):\n",
+    "    fw = wavlm_branch(b[\"input_values\"].to(device), b[\"attn_mask\"].to(device))\n",
+    "    feat = torch.cat([fw, b[\"aud\"].to(device)], dim=1) if USE_AUDEERING else fw\n",
+    "    return heads(feat, b[\"tgt\"].to(device))\n",
+    "\n",
+    "def pairwise_rank_loss(pred, target):\n",
+    "    \"\"\"Zero-margin hinge ranking over ALL pairs in the batch → directly optimizes rank order (≈ SRCC). Differentiable (backprop works).\n",
+    "    Needs ≥2 samples per batch to form a pair; the larger the batch, the more pairs and the stronger the signal.\"\"\"\n",
+    "    p = pred.reshape(-1); t = target.reshape(-1)\n",
+    "    if p.numel() < 2:\n",
+    "        return torch.zeros((), device=p.device)\n",
+    "    sign = torch.sign(t.unsqueeze(0) - t.unsqueeze(1))      # [i, j] = +1 if item j SHOULD rank above item i\n",
+    "    diff = p.unsqueeze(0) - p.unsqueeze(1)                  # [i, j] = predicted gap p[j] - p[i]\n",
+    "    return torch.relu(-sign * diff).mean()                  # penalize pairs ranked in the wrong order\n",
+    "\n",
+    "def compute_loss(emos_p, cat_l, vad_p, b):\n",
+    "    L = {\"emos\": mse(emos_p, b[\"emos\"].to(device)), \"cat\": soft_ce(cat_l, b[\"cat\"].to(device))}\n",
+    "    if HAS_VAD:\n",
+    "        vt = b[\"vad\"].to(device)\n",
+    "        L[\"val\"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L[\"aro\"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L[\"dom\"] = mse(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    else:\n",
+    "        vt = None\n",
+    "        z = torch.zeros((), device=device); L[\"val\"] = L[\"aro\"] = L[\"dom\"] = z\n",
+    "    # Ranking loss ONLY for the SRCC-scored columns (emos/val/aro/dom). CAT is scored by a distribution error → keep soft-CE.\n",
+    "    if RANK_LAMBDA > 0:\n",
+    "        L[\"emos\"] = L[\"emos\"] + RANK_LAMBDA * pairwise_rank_loss(emos_p, b[\"emos\"].to(device))\n",
+    "        if HAS_VAD:\n",
+    "            L[\"val\"] = L[\"val\"] + RANK_LAMBDA * pairwise_rank_loss(vad_p[:, 0:1], vt[:, 0:1])\n",
+    "            L[\"aro\"] = L[\"aro\"] + RANK_LAMBDA * pairwise_rank_loss(vad_p[:, 1:2], vt[:, 1:2])\n",
+    "            L[\"dom\"] = L[\"dom\"] + RANK_LAMBDA * pairwise_rank_loss(vad_p[:, 2:3], vt[:, 2:3])\n",
+    "    if USE_UNCERTAINTY:\n",
+    "        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))\n",
+    "    return sum(L.values())\n",
+    "\n",
+    "def set_mode(train):\n",
+    "    wavlm.train(train); heads.train(train)\n",
+    "    if USE_MAMBA:\n",
+    "        enc.train(train)\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def evaluate():\n",
+    "    set_mode(False)\n",
+    "    P = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}; Y = {\"emos\": [], \"val\": [], \"aro\": [], \"dom\": []}\n",
+    "    catP, catY = [], []\n",
+    "    for b in va_loader:\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "        P[\"emos\"] += emos_p.float().cpu().numpy().ravel().tolist(); Y[\"emos\"] += b[\"emos_raw\"].tolist()\n",
+    "        vad_p = vad_p.float().cpu().numpy()\n",
+    "        for j, t in enumerate([\"val\", \"aro\", \"dom\"]):\n",
+    "            P[t] += vad_p[:, j].tolist(); Y[t] += b[\"vad_raw\"][:, j].tolist()\n",
+    "        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b[\"cat\"])\n",
+    "    out = {t: spearmanr(P[t], Y[t]).correlation for t in [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])}\n",
+    "    q = np.concatenate(catP); p = np.concatenate(catY)\n",
+    "    out[\"cat_err\"] = float(np.abs(q - p).sum(1).mean())\n",
+    "    return out\n",
+    "\n",
+    "def mean_srcc(m):\n",
+    "    keys = [\"emos\"] + ([\"val\", \"aro\", \"dom\"] if HAS_VAD else [])\n",
+    "    return float(np.mean([m[k] for k in keys]))\n",
+    "\n",
+    "CKPT_PATH = os.path.join(OUT_DIR, \"ft_mamba_emotion_full.pt\")\n",
+    "def save_full_ckpt(state, val_emos=float(\"nan\")):\n",
+    "    torch.save({\"wavlm\": state[\"wavlm\"], \"heads\": state[\"heads\"], \"enc\": state.get(\"enc\"),\n",
+    "                \"USE_MAMBA\": USE_MAMBA, \"emos_mu\": emos_mu, \"emos_sd\": emos_sd,\n",
+    "                \"vad_mu\": vad_mu, \"vad_sd\": vad_sd, \"WAVLM_DIM\": WAVLM_DIM, \"AUD_DIM\": AUD_DIM,\n",
+    "                \"Z_DIM\": Z_DIM, \"UNFREEZE_TOP_LAYERS\": UNFREEZE_TOP_LAYERS,\n",
+    "                \"val_emos\": float(val_emos)}, CKPT_PATH)\n",
+    "\n",
+    "def snapshot():\n",
+    "    s = {\"wavlm\": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},\n",
+    "         \"heads\": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}\n",
+    "    if USE_MAMBA:\n",
+    "        s[\"enc\"] = {k: v.cpu().clone() for k, v in enc.state_dict().items()}\n",
+    "    return s\n",
+    "\n",
+    "# RESUME: init best = the current ckpt's VAL score → overwrite only if continued training does BETTER (no risk of regressing)\n",
+    "if RESUME and resume_ckpt is not None:\n",
+    "    m0 = evaluate(); best = mean_srcc(m0); best_state = snapshot(); bad = 0\n",
+    "    print(f\"📍 RESUME, current checkpoint: mean SRCC={best:.4f} | \"\n",
+    "          + \" \".join(f\"{k}={m0[k]:.3f}\" for k in ['emos', 'val', 'aro', 'dom'] if k in m0))\n",
+    "else:\n",
+    "    m0 = None\n",
+    "    best, best_state, bad = -1e9, None, 0\n",
+    "for ep in range(1, EPOCHS + 1):\n",
+    "    set_mode(True)\n",
+    "    opt.zero_grad(); run = 0.0; nb = 0\n",
+    "    for step, b in enumerate(tqdm(tr_loader, desc=f\"epoch {ep}\")):\n",
+    "        with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "            emos_p, cat_l, vad_p = forward_batch(b)\n",
+    "            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM\n",
+    "        scaler.scale(loss).backward()\n",
+    "        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):   # also step on the last, partial accumulation window\n",
+    "            scaler.step(opt); scaler.update(); opt.zero_grad()\n",
+    "        run += loss.item() * ACCUM; nb += 1\n",
+    "    m = evaluate(); sc = mean_srcc(m)\n",
+    "    msg = \" \".join(f\"{k}={m[k]:.3f}\" for k in [\"emos\", \"val\", \"aro\", \"dom\"] if k in m)\n",
+    "    print(f\"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})\")\n",
+    "    if sc > best:\n",
+    "        best = sc; bad = 0\n",
+    "        best_state = snapshot()\n",
+    "        save_full_ckpt(best_state, m[\"emos\"])\n",
+    "        print(f\"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})\")\n",
+    "    else:\n",
+    "        bad += 1\n",
+    "        if bad >= PATIENCE:\n",
+    "            print(f\"Early stopping at epoch {ep}.\"); break\n",
+    "\n",
+    "if best_state:\n",
+    "    wavlm.load_state_dict(best_state[\"wavlm\"]); heads.load_state_dict(best_state[\"heads\"])\n",
+    "    if USE_MAMBA:\n",
+    "        enc.load_state_dict(best_state[\"enc\"])\n",
+    "final = evaluate()\n",
+    "if RESUME and m0 is not None:\n",
+    "    print(f\"\\n🔁 RESUME: mean SRCC ckpt {mean_srcc(m0):.4f} → after continued training {mean_srcc(final):.4f} \"\n",
+    "          + (\"🚀 improved → ckpt overwritten\" if mean_srcc(final) > mean_srcc(m0) + 1e-4 else \"➖ no improvement (previous best kept)\"))\n",
+    "print(f\"\\n✅ VAL (internal), exp15 (Mamba={'ON' if USE_MAMBA else 'OFF'}):\")\n",
+    "print(f\"   EMOS={final['emos']:.4f} (exp08 {EXP08['emos']})\")\n",
+    "if HAS_VAD:\n",
+    "    print(f\"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} \"\n",
+    "          f\"(exp08 {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})\")\n",
+    "warn = [f\"EMOS {final['emos']:.3f}<{EXP08['emos']}\"] if final[\"emos\"] < EXP08[\"emos\"] - 0.005 else []\n",
+    "if HAS_VAD:\n",
+    "    warn += [f\"{t.upper()} {final[t]:.3f}<{EXP08[t]}\" for t in [\"val\", \"aro\", \"dom\"] if final[t] < EXP08[t] - 0.005]\n",
+    "if warn:\n",
+    "    print(\"   ⚠️ Mamba head does NOT yet beat exp08 on:\", \"; \".join(warn), \"(still a result for the paper)\")\n",
+    "else:\n",
+    "    print(\"   ✅ Mamba head beats/matches exp08 on every column → temporal modeling helps!\")\n",
+    "save_full_ckpt(best_state if best_state else\n",
+    "               {\"wavlm\": wavlm.state_dict(), \"heads\": heads.state_dict(),\n",
+    "                \"enc\": enc.state_dict() if USE_MAMBA else None}, final[\"emos\"])\n",
+    "print(f\"✅ Saved {CKPT_PATH} (INCLUDES backbone + Mamba + heads). REMEMBER to Save Version!\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9c748af2",
+   "metadata": {},
+   "source": [
+    "## 7. Predict DEV → answer.txt (5 emotion columns from exp15; QMOS borrowed from exp07/UTMOSv2)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "92d43e56",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT_DEV:\n",
+    "    dev_names = dev_names[:LIMIT_DEV]\n",
+    "dev_stems = [stem(n) for n in dev_names]\n",
+    "print(\"DEV:\", len(dev_names), \"samples\")\n",
+    "aud_dev = extract_audeering(dev_stems, \"dev\")\n",
+    "\n",
+    "def load_exp07_qmos():\n",
+    "    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):\n",
+    "        import csv\n",
+    "        d = {}\n",
+    "        with open(EXP07_ANSWER) as f:\n",
+    "            for row in csv.DictReader(f):\n",
+    "                d[row[\"wav\"]] = float(row[\"QMOS\"]); d[stem(row[\"wav\"])] = float(row[\"QMOS\"])\n",
+    "        print(f\"✅ Borrowed QMOS from exp07 ({EXP07_ANSWER}): {len(d)//2} wav\")\n",
+    "        return d\n",
+    "    return None\n",
+    "\n",
+    "qmos_map = load_exp07_qmos()\n",
+    "if qmos_map is None:\n",
+    "    print(\"ℹ️ No exp07 answer.txt found → scoring QMOS with UTMOSv2 (T05, VMC2024 winner).\")\n",
+    "    pip_install(\"git+https://github.com/sarulab-speech/UTMOSv2.git\")\n",
+    "    import utmosv2\n",
+    "    v2 = utmosv2.create_model(pretrained=True)\n",
+    "    qmos_map = {}\n",
+    "    for n in tqdm(dev_names, desc=\"UTMOSv2\"):\n",
+    "        wav = os.path.join(WAV_DIR, n if str(n).endswith(\".wav\") else str(n) + \".wav\")\n",
+    "        if not os.path.exists(wav):\n",
+    "            continue\n",
+    "        out = v2.predict(input_path=wav)\n",
+    "        qmos_map[n] = float(out[\"predicted_mos\"]) if isinstance(out, dict) else float(out)\n",
+    "    del v2\n",
+    "    if device == \"cuda\":\n",
+    "        torch.cuda.empty_cache()\n",
+    "\n",
+    "@torch.no_grad()\n",
+    "def predict_emotion(sid):\n",
+    "    wave = load_wav(sid)\n",
+    "    if wave is None or (USE_AUDEERING and sid not in aud_dev):\n",
+    "        return None\n",
+    "    set_mode(False)\n",
+    "    iv = torch.from_numpy(wave).unsqueeze(0).to(device)\n",
+    "    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)\n",
+    "    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)\n",
+    "    with torch.cuda.amp.autocast(enabled=USE_AMP and device == \"cuda\"):\n",
+    "        fw = wavlm_branch(iv, am)\n",
+    "        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw\n",
+    "        emos_p, cat_l, vad_p = heads(feat, tgt)\n",
+    "    emos = float(emos_p.item()) * emos_sd + emos_mu\n",
+    "    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()\n",
+    "    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu\n",
+    "    return emos, cat5, vad3\n",
+    "\n",
+    "def fmt_cat(p5):\n",
+    "    return \"|\".join(f\"{e}:{p5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_real = n_def = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in tqdm(dev_names, desc=\"answer\"):\n",
+    "            sid = stem(name)\n",
+    "            pr = predict_emotion(sid)\n",
+    "            if pr is None:\n",
+    "                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1\n",
+    "            else:\n",
+    "                emos, cat5, vad3 = pr; n_real += 1\n",
+    "            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | predicted emotion {n_real}, default fallback {n_def}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "20ec4343",
+   "metadata": {},
+   "source": [
+    "## 8. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e289ea27",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    assert rows[0][0] == \"wav\" and \"QMOS\" in rows[0] and \"EMOS\" in rows[0], \"Invalid header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(rows[0]), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {rows[0]}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "os.system(f\"cd {OUT_DIR} && zip -j submission_track2_exp15_mamba-emotion.zip answer.txt \"\n",
+    "          f\"&& unzip -l submission_track2_exp15_mamba-emotion.zip\")\n",
+    "print(\"Ready to submit:\", os.path.join(OUT_DIR, \"submission_track2_exp15_mamba-emotion.zip\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7aeeb9ea",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- **🔁 RESUME (continue training instead of restarting):** Add an Input dataset containing the `ft_mamba_emotion_full.pt` from the\n",
+    "  previous run (or leave it in `/kaggle/working` when continuing in the same session) → the notebook finds it and continues training.\n",
+    "  `EPOCHS` then means the **number of ADDITIONAL epochs**. If validation plateaus, set `RESUME_LR_SCALE=0.5`. To force a fresh run,\n",
+    "  remove the ckpt from the inputs and from `/kaggle/working`: `find_resume_ckpt` falls back to auto-detection when `RESUME_CKPT`\n",
+    "  points to a nonexistent path, so that alone does not disable resuming. ⚠️ `USE_MAMBA` must MATCH the ckpt (the code warns on a mismatch).\n",
+    "- **First run:** `LIMIT_TRAIN=300`, `LIMIT_DEV=20` → check that 1 epoch runs without OOM / CheckpointError; then set both to `None`.\n",
+    "- **Main ablation for the paper:** run `USE_MAMBA=True` vs `USE_MAMBA=False` (= exp08) → compare internal EMOS/VAL/ARO/DOM\n",
+    "  → answers \"does the Mamba temporal head beat mean-pooling?\".\n",
+    "- **OOM / too slow on T4 (especially with the pure-PyTorch Mamba):** reduce, in this order,\n",
+    "  `MAX_SECONDS` (6→5) → `MAMBA_LAYERS` (2→1) → `UNFREEZE_TOP_LAYERS` (6→4) → `BATCH` (2→1, raise `ACCUM`).\n",
+    "  Or try installing `mamba-ssm causal-conv1d` (much faster and lighter on memory); the code uses it automatically if the import succeeds.\n",
+    "- **Ranking loss (`RANK_LAMBDA`):** adds a pairwise ranking term for the 4 SRCC columns (emos/val/aro/dom) → matches the\n",
+    "  UTT-SRCC metric better than MSE. ⚠️ **Weakness:** ranking is computed over pairs WITHIN one mini-batch; `BATCH=2` → each forward pass\n",
+    "  yields only 1 pair → a WEAK signal. For stronger ranking: raise `BATCH` (4→8 if VRAM allows). In the FROZEN-backbone head\n",
+    "  experiments (exp06/07, BATCH=64) ranking is much stronger. A/B `RANK_LAMBDA=0` vs `0.3` → ablation table for the paper.\n",
+    "- **QMOS:** Add the exp07 answer.txt as an Input at `/kaggle/input/exp07-answer/answer.txt` to borrow its QMOS (0.548);\n",
+    "  otherwise the notebook scores QMOS with UTMOSv2 (requires Internet On).\n",
+    "- Log config → results → observations in `docs/04_experiments_log.md` (exp15 section)."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
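
The Notes cell above says a pairwise ranking loss gets only one pair per forward pass at `BATCH=2`, but many more at `BATCH=64`. The sketch below is a minimal pure-Python illustration of that point, not the notebook's code: `n_pairs` and `pairwise_hinge` are hypothetical helper names, and `pairwise_hinge` only mirrors the arithmetic of the tensor-based `pairwise_rank_loss` (zero-margin hinge, averaged over the full B×B matrix) on plain lists.

```python
def n_pairs(batch_size: int) -> int:
    """Number of unordered sample pairs a pairwise loss can use in one mini-batch."""
    return batch_size * (batch_size - 1) // 2


def pairwise_hinge(pred, target):
    """Zero-margin hinge over all ordered (i, j) entries, averaged over B*B.

    Assumption: this mirrors relu(-sign * diff).mean() on a B x B matrix,
    where sign[i][j] = sign(target[j] - target[i]) and diff[i][j] = pred[j] - pred[i].
    """
    b = len(pred)
    if b < 2:
        return 0.0  # no pair can be formed
    total = 0.0
    for i in range(b):
        for j in range(b):
            # +1 if j should rank above i, -1 if below, 0 for ties and the diagonal
            sign = (target[j] > target[i]) - (target[j] < target[i])
            diff = pred[j] - pred[i]
            total += max(0.0, -sign * diff)
    return total / (b * b)


if __name__ == "__main__":
    for bs in (2, 4, 8, 64):
        print(f"BATCH={bs:2d} -> {n_pairs(bs)} pairs")
    print(pairwise_hinge([0.0, 1.0], [1.0, 2.0]))  # predictions in the right order
    print(pairwise_hinge([1.0, 0.0], [1.0, 2.0]))  # predictions in the wrong order
```

The pair count grows quadratically with the batch size, which is why gradient accumulation (`ACCUM`) does not compensate for a small `BATCH` here: pairs are only formed inside a single forward pass.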

track2/exp15_wavlm_mamba_emotion_pipeline.py ADDED

	@@ -0,0 +1,920 @@

+# %% [markdown]
+# # VMC2026 Track 2: exp15 (WavLM FINE-TUNE + MAMBA head for the 5 emotion columns), Kaggle
+#
+# **Idea:** exp08 fine-tunes WavLM but still **mean-pools** the features over time → 1 vector per wav
+# (discarding temporal dynamics: pitch rises/falls, pauses, voice tremor, all very important for emotion).
+# exp15 **replaces mean-pooling with a MAMBA head** (a learned sequence encoder with linear complexity) → expected
+# to capture temporal dynamics better. Reference: MambaRate (AudioMOS 2025, arXiv:2507.12090).
+#
+# ## Architecture (= exp08 with exactly 1 change: pool → Mamba)
+# ```
+#  wav ─► WavLM-large (SAILER warm-start, top N unfrozen, TRAINABLE) ─► hidden states (B, T, 1024)
+#                                                                             │  (NO mean-pool)
+#                                                               MambaEncoder (proj 1024→d, Mamba×L bidirectional,
+#                                                               masked attentive-pool) ─► z (B, Z_DIM)
+#                                                                             │
+#       (optional) audeering MSP-dim FROZEN [emb|vad3] ──concat──► TRUNK ─┬─► EMOS (+ one-hot target)
+#                                                                          ├─► CAT (5, soft-CE)
+#                                                                          └─► VAD (3)
+#  QMOS: NOT trained here → borrow the exp07 QMOS column (0.548) or use UTMOSv2.
+# ```
+# - **`USE_MAMBA` flag:** True = Mamba head; False = fall back to `masked_mean` = **exactly exp08**
+#   → this is the **main ablation for the paper** ("Mamba temporal head vs mean-pooling", SAME fine-tuned backbone).
+#
+# ## ⚠️ Trade-offs / gotchas (already handled in the code)
+# - Fine-tuning reruns WavLM every epoch (it cannot be cached) → **the first run MUST use `LIMIT_TRAIN=300`, `LIMIT_DEV=20`**.
+# - `mamba-ssm` is hard to install on Kaggle → automatic fallback to a **pure-PyTorch Mamba** (a loop over time steps). When fine-tuning, that version
+#   is **slower and uses more memory** → cap `MAX_SECONDS=6`, `BATCH=2`. If OOM/too slow → lower MAX_SECONDS→5, MAMBA_LAYERS→1,
+#   or try installing `mamba-ssm causal-conv1d`.
+# - `layerdrop=0` (avoids CheckpointError with gradient checkpointing; a lesson from exp12). Do NOT touch numpy (ABI mismatch).
+# - **The checkpoint stores backbone + Mamba + heads TOGETHER at every new best** (lesson from exp08, which lost the backbone).
+#
+# ## 🔁 RESUME (user request): "if a checkpoint exists, CONTINUE training instead of restarting"
+# - The notebook **auto-detects** `ft_mamba_emotion_full.pt` in `/kaggle/input` and `/kaggle/working` (or set `RESUME_CKPT` manually).
+# - With a complete ckpt (WavLM backbone + Mamba enc + heads) → **restore the weights + normalization statistics FROM the ckpt**, then continue training;
+#   `best` starts at the ckpt's VAL score → overwritten only when continued training does **BETTER** (no risk of regressing). Set `RESUME_LR_SCALE<1` to lower the LR.
+# - NO ckpt → train from the SAILER warm-start as before (original exp15 behavior unchanged).
+#
+# **How to run on Kaggle:** GPU **T4** + Internet **On** → Add the Track 2 Input dataset (+ Add the old checkpoint as an Input to resume)
+# → adjust `DATA_ROOT` if auto-detection fails → Run All.
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os, glob
+# ── AUTO-DETECT DATA_ROOT (scan /kaggle/input for a folder with sets/train.csv + wav/ + metadata.csv) ──
+def find_data_root(search_root="/kaggle/input"):
+    cands = []
+    for train_csv in glob.glob(os.path.join(search_root, "**", "sets", "train.csv"), recursive=True):
+        root = os.path.dirname(os.path.dirname(train_csv))          # .../<root>/sets/train.csv → <root>
+        score = os.path.isdir(os.path.join(root, "wav")) + os.path.exists(os.path.join(root, "metadata.csv"))
+        cands.append((score, root))
+    cands.sort(reverse=True)                                        # prefer folders that have both wav/ and metadata.csv
+    return cands
+_cands = find_data_root("/kaggle/input")
+if _cands:
+    print("🔎 DATA_ROOT candidates (higher score = has wav/ + metadata.csv):")
+    for sc, r in _cands:
+        print(f"   [{sc}/2] {r}")
+    DATA_ROOT = _cands[0][1]
+    print(f"👉 Auto-selected DATA_ROOT = {DATA_ROOT}")
+else:
+    DATA_ROOT = "/kaggle/input/datasets/minhtoan2"   # fallback: edit manually if auto-detection finds nothing
+    print(f"❌ No sets/train.csv found under /kaggle/input → using fallback {DATA_ROOT} (did you Add Input?)")
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"     # wavID|emotion|transcript (NO header)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"   # lisID|wavID|qMOS|emoCat|eMOS|val|dom|aro
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/ft_cache"         # audeering cache (.npz); WavLM/Mamba are NOT cached (they are being trained)
+os.makedirs(CACHE_DIR, exist_ok=True)
+# (Optional) reuse an old audeering cache (read-only /kaggle/input → copied to working)
+# Dataset cache_exp8: aud_*.npz sits in the archive/ subfolder → scan RECURSIVELY to catch any location.
+CACHE_INPUT = "/kaggle/input/cache-exp8"   # << EDIT the slug (dataset cache_exp8 → Kaggle turns _ into -); or ""
+if CACHE_INPUT and os.path.isdir(CACHE_INPUT):
+    import shutil
+    _n = 0
+    for _fp in glob.glob(os.path.join(CACHE_INPUT, "**", "aud_*.npz"), recursive=True):
+        shutil.copy(_fp, os.path.join(CACHE_DIR, os.path.basename(_fp))); _n += 1
+    print(f"📦 Reusing cache: copied {_n} aud_*.npz files (recursive scan of {CACHE_INPUT})")
+else:
+    print(f"ℹ️ CACHE_INPUT={CACHE_INPUT} not found → audeering features will be extracted from scratch.")
+# Borrow the exp07 QMOS column (0.548). Point to the exp07 answer.txt if available; otherwise UTMOSv2 is used.
+EXP07_ANSWER = "/kaggle/input/exp07-answer/answer.txt"   # << (optional)
+# ── Mamba flag (main ablation) ───────────────────────────────────────────────
+USE_MAMBA           = True        # True = Mamba head; False = mean-pool = EXACTLY exp08
+# ── Mamba head hyperparameters ───────────────────────────────────────────────
+MAMBA_DMODEL        = 256
+MAMBA_LAYERS        = 2
+MAMBA_DSTATE        = 16
+BIDIRECTIONAL       = True
+Z_DIM               = 256         # output vector size after attentive pooling; replaces the mean-pooled WavLM embedding
+# ── Fine-tuning hyperparameters (inherited from exp08) ───────────────────────
+DEVICE              = "cuda"
+SR                  = 16000
+MAX_SECONDS         = 6           # reduced from 8 (exp08) because Mamba backprop-through-time uses more memory
+UNFREEZE_TOP_LAYERS = 6           # number of top Transformer layers to train (0 = freeze all)
+TRUNK_HIDDEN        = 512
+HEAD_HIDDEN         = 128
+DROPOUT             = 0.3
+LR_BACKBONE         = 1e-5
+LR_HEAD             = 1e-3        # for Mamba + trunk + heads (trained from scratch)
+WEIGHT_DECAY        = 1e-5
+EPOCHS              = 12
+PATIENCE            = 3
+BATCH               = 2           # small (large backbone + Mamba); compensated by ACCUM
+ACCUM               = 16          # effective batch = 32
+VAL_FRAC            = 0.10
+SEED                = 42
+USE_AMP             = True
+USE_GRAD_CKPT       = True
+USE_AUDEERING       = True
+USE_UNCERTAINTY     = True
+RANK_LAMBDA         = 0.3         # 0 = MSE only (old behavior). >0 = add a pairwise ranking loss (directly targets SRCC) for emos/val/aro/dom
+                                  # ⚠️ ranking needs MANY pairs per batch to be effective → with a small BATCH (2) the effect is weak (see Notes)
+LIMIT_TRAIN         = 300         # << FIRST RUN 300; full run None
+LIMIT_DEV           = 20          # << FIRST RUN 20; full run None
+# ── RESUME: CONTINUE training from a checkpoint, do NOT restart from scratch ──
+# Leave "" for auto-detection: if `ft_mamba_emotion_full.pt` (complete: backbone+Mamba+heads) is found in /kaggle/input
+# or /kaggle/working → it is loaded and training continues. Set RESUME_CKPT manually to pick a specific file.
+RESUME_CKPT         = ""          # << "" = auto-detect; or "/kaggle/input/<slug>/ft_mamba_emotion_full.pt"
+RESUME_LR_SCALE     = 1.0         # <1.0 lowers the LR when continuing (e.g. 0.5 if validation has plateaued)
+def find_resume_ckpt(explicit):
+    """Find an exp15 checkpoint to continue training from. Prefer the user-specified path; otherwise auto-detect.
+    Also matches names with a duplicate suffix added by Kaggle/Windows, e.g. 'ft_mamba_emotion_full (2).pt'."""
+    if explicit and os.path.exists(explicit):
+        return explicit
+    for base in ["/kaggle/input", "/kaggle/working"]:
+        hits = sorted(glob.glob(os.path.join(base, "**", "ft_mamba_emotion_full*.pt"), recursive=True))
+        if hits:
+            return hits[0]
+    return ""
+RESUME_CKPT = find_resume_ckpt(RESUME_CKPT)
+RESUME      = bool(RESUME_CKPT)
+print("🔁 RESUME =", RESUME, ("→ continuing from: " + RESUME_CKPT) if RESUME else "(no ckpt found → training from SCRATCH)")
+# Baseline (exp08 fine-tune + mean-pool: the Mamba head's direct competitor)
+EXP08 = {"emos": 0.811, "val": 0.659, "aro": 0.793, "dom": 0.751}
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(p):
+    return os.path.splitext(os.path.basename(str(p)))[0]
+print("USE_MAMBA =", USE_MAMBA, "(False → reproduces exp08 exactly)")
+print("DATA_ROOT:", DATA_ROOT)
+for p in [WAV_DIR, METADATA_CSV, TRAIN_CSV, DEV_SCP]:
+    print(("  ✅ " if os.path.exists(p) else "  ❌ MISSING ") + p)
+print(f"Fine-tune: top {UNFREEZE_TOP_LAYERS} layers unfrozen · BATCH {BATCH}×ACCUM {ACCUM} · MAX {MAX_SECONDS}s")
+# %% [markdown]
+# ## 1. Install dependencies + fetch the SAILER code (clone + sys.path)
+# %%
+import sys, subprocess
+def pip_install(*pkgs):
+    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)
+pip_install("loralib", "speechbrain", "speechmos", "librosa", "soundfile",
+            "scipy", "scikit-learn", "pandas", "tqdm")
+# Install the Mamba CUDA kernels (much faster and lighter on memory than the pure-PyTorch version). The build often fails or is slow on Kaggle
+# → wrapped in try/except: on failure it is SKIPPED and section 6a falls back to the pure-PyTorch Mamba. This must NOT kill the notebook.
+INSTALL_MAMBA_SSM = True   # set False to SKIP this and use the pure-PyTorch Mamba directly
+if INSTALL_MAMBA_SSM and USE_MAMBA:
+    try:
+        # --no-build-isolation for BOTH → compile against Kaggle's existing torch+CUDA (do not pull in another torch).
+        # ninja speeds up the build. -q hides the log, so this step may appear to "hang" for a few minutes while compiling; that is normal.
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q", "ninja"], check=True)
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q",
+                        "--no-build-isolation", "causal-conv1d>=1.2.0"], check=True)
+        subprocess.run([sys.executable, "-m", "pip", "install", "-q",
+                        "--no-build-isolation", "mamba-ssm"], check=True)
+        print("✅ Installed mamba-ssm + causal-conv1d (the CUDA kernels are used if the import succeeds).")
+    except Exception as e:
+        print("⚠️ Installing mamba-ssm failed:", repr(e), "→ using the pure-PyTorch Mamba (slower).")
+        print("   ℹ️ The notebook still runs normally. If a FULL run (LIMIT=None) is too slow → see the Notes at the end of the notebook.")
+REPO_DIR = "/kaggle/working/vox-profile-release"
+if not os.path.exists(REPO_DIR):
+    subprocess.run(["git", "clone", "--depth", "1",
+                    "https://github.com/tiantiaf0627/vox-profile-release.git", REPO_DIR], check=True)
+if REPO_DIR not in sys.path:
+    sys.path.insert(0, REPO_DIR)
+# %% [markdown]
+# ## 2. Load SAILER → extract its inner WavLM backbone to FINE-TUNE (warm-start)
+# %%
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+device = DEVICE if torch.cuda.is_available() else "cpu"
+print("Device:", device, ("✅ " + torch.cuda.get_device_name(0)) if device == "cuda" else "⚠️ CPU (very slow!)")
+def find_hf_backbone(module):
+    """Find the HF-style WavLM backbone submodule: one that has .feature_extractor and .encoder.layers."""
+    cands = []
+    for name, m in module.named_modules():
+        enc = getattr(m, "encoder", None)
+        if getattr(m, "feature_extractor", None) is not None and enc is not None \
+                and getattr(enc, "layers", None) is not None:
+            cands.append((name, m))
+    if not cands:
+        return None, None
+    cands.sort(key=lambda nm: sum(p.numel() for p in nm[1].parameters()), reverse=True)
+    return cands[0]
+wavlm = None
+try:
+    from src.model.emotion.wavlm_emotion import WavLMWrapper   # noqa: E402
+    _wrapper = WavLMWrapper.from_pretrained("tiantiaf/wavlm-large-categorical-emotion")
+    name, wavlm = find_hf_backbone(_wrapper)
+    if wavlm is not None:
+        print(f"✅ SAILER warm-start: WavLM backbone at '.{name}' "
+              f"({sum(p.numel() for p in wavlm.parameters())/1e6:.0f}M params)")
+    else:
+        print("⚠️ No HF backbone found in the SAILER wrapper → falling back to vanilla WavLM.")
+except Exception as e:
+    print("⚠️ Failed to load the SAILER wrapper:", repr(e), "→ falling back to vanilla WavLM.")
+if wavlm is None:
+    from transformers import WavLMModel
+    wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large")
+    print("ℹ️ Fallback: microsoft/wavlm-large (NO SAILER warm-start).")
+wavlm = wavlm.to(device)
+WAVLM_DIM = int(wavlm.config.hidden_size)
+wavlm.config.layerdrop = 0.0   # ⚠️ avoids CheckpointError with gradient checkpointing (lesson from exp12)
+# ── RESUME: load the fine-tuned backbone weights from the checkpoint (overrides the SAILER warm-start) ──
+resume_ckpt = None
+if RESUME:
+    resume_ckpt = torch.load(RESUME_CKPT, map_location="cpu", weights_only=False)  # ckpt contains numpy objects → weights_only must be False
+    assert "wavlm" in resume_ckpt, ("❌ Checkpoint has NO 'wavlm' (backbone) → cannot resume. "
+                                    "Use the ft_mamba_emotion_full.pt file saved by exp15.")
+    if resume_ckpt.get("USE_MAMBA", USE_MAMBA) != USE_MAMBA:
+        print(f"   ⚠️ ckpt USE_MAMBA={resume_ckpt.get('USE_MAMBA')} ≠ current config {USE_MAMBA} → architecture MISMATCH! "
+              "Set USE_MAMBA to match the ckpt.")
+    miss, unexp = wavlm.load_state_dict(resume_ckpt["wavlm"], strict=False)
+    print(f"🔁 RESUME loaded wavlm from ckpt: {len(miss)} missing / {len(unexp)} unexpected keys (expected ~0). ckpt keys:", list(resume_ckpt.keys()))
+    if len(miss) > 20 or len(unexp) > 20:
+        print("   ⚠️ Many mismatched keys → check that UNFREEZE_TOP_LAYERS / the backbone match the ckpt.")
+# ── Partial freeze: feature extractor + everything except the top UNFREEZE_TOP_LAYERS layers ──
+for p in wavlm.parameters():
+    p.requires_grad = False
+enc_layers = wavlm.encoder.layers
+n_layers = len(enc_layers)
+for layer in enc_layers[max(0, n_layers - UNFREEZE_TOP_LAYERS):]:
+    for p in layer.parameters():
+        p.requires_grad = True
+n_train = sum(p.numel() for p in wavlm.parameters() if p.requires_grad)
+print(f"WavLM: {n_layers} layers · unfreezing top {min(UNFREEZE_TOP_LAYERS, n_layers)} → {n_train/1e6:.1f}M trainable params (dim {WAVLM_DIM})")
+if USE_GRAD_CKPT:
+    wavlm.gradient_checkpointing_enable()
+    if hasattr(wavlm, "enable_input_require_grads"):
+        wavlm.enable_input_require_grads()
+def frame_mask(T, attn_mask):
+    """attn_mask (B, Lwav) → frame mask (B, T) bool (True = real frame). Matches WavLM's downsampling."""
+    if attn_mask is None:
+        return torch.ones((1, T), dtype=torch.bool, device=device)
+    try:
+        fm = wavlm._get_feature_vector_attention_mask(T, attn_mask)
+        return fm.bool()
+    except Exception:
+        return torch.ones((attn_mask.shape[0], T), dtype=torch.bool, device=attn_mask.device)
+def masked_mean(hidden, attn_mask):
+    """Mean-pool over time, ignoring padding (the exp08 path when USE_MAMBA=False)."""
+    if attn_mask is None:
+        return hidden.mean(dim=1)
+    fm = frame_mask(hidden.shape[1], attn_mask).unsqueeze(-1).to(hidden.dtype)
+    return (hidden * fm).sum(1) / fm.sum(1).clamp(min=1e-6)
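The masked mean pooling above can be sanity-checked on a toy input without loading WavLM. The following is a minimal, framework-free sketch: it uses plain Python lists instead of tensors, and `masked_mean_py` is a hypothetical name introduced only for illustration, not part of the notebook.

```python
def masked_mean_py(hidden, frame_mask):
    """Average frame vectors over time, skipping padded frames.

    hidden:     list of T frames, each a list of D floats
    frame_mask: list of T booleans (True = real frame, False = padding)
    """
    dim = len(hidden[0])
    total = [0.0] * dim
    n_real = 0
    for frame, keep in zip(hidden, frame_mask):
        if keep:
            n_real += 1
            for d in range(dim):
                total[d] += frame[d]
    # Same idea as clamp(min=1e-6) in the notebook: never divide by zero.
    denom = max(n_real, 1e-6)
    return [v / denom for v in total]


frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]   # last frame plays the role of padding
print(masked_mean_py(frames, [True, True, False]))   # padding is ignored
print(masked_mean_py(frames, [True, True, True]))    # padding skews the mean
```

The second call shows why the mask matters: a plain mean over a padded batch is pulled towards whatever the encoder emits at padded positions.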
+# %% [markdown]
+# ## 3. Load audeering MSP-dim (FROZEN) as auxiliary features (as in exp08)
+# %%
+AUD_DIM = 0
+aud_backbone = aud_head = aud_proc = None
+if USE_AUDEERING:
+    from transformers import Wav2Vec2Model, Wav2Vec2Config, Wav2Vec2Processor
+    from huggingface_hub import hf_hub_download
+    AUD_NAME = "audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim"
+    aud_proc = Wav2Vec2Processor.from_pretrained(AUD_NAME)
+    aud_cfg = Wav2Vec2Config.from_pretrained(AUD_NAME)
+    aud_backbone = Wav2Vec2Model(aud_cfg)
+    try:
+        _sd = __import__("safetensors.torch", fromlist=["load_file"]).load_file(
+            hf_hub_download(AUD_NAME, "model.safetensors"))
+    except Exception:
+        _sd = torch.load(hf_hub_download(AUD_NAME, "pytorch_model.bin"), map_location="cpu")
+    bb_sd = {k[len("wav2vec2."):]: v for k, v in _sd.items() if k.startswith("wav2vec2.")}
+    aud_backbone.load_state_dict(bb_sd, strict=False)
+    _hid = _sd["classifier.dense.weight"].shape[0]
+    _out = _sd["classifier.out_proj.weight"].shape[0]
+    aud_head = nn.Sequential(nn.Linear(_hid, _hid), nn.Tanh(), nn.Linear(_hid, _out))
+    aud_head[0].weight.data.copy_(_sd["classifier.dense.weight"]); aud_head[0].bias.data.copy_(_sd["classifier.dense.bias"])
+    aud_head[2].weight.data.copy_(_sd["classifier.out_proj.weight"]); aud_head[2].bias.data.copy_(_sd["classifier.out_proj.bias"])
+    aud_backbone = aud_backbone.to(device).eval()
+    aud_head = aud_head.to(device).eval()
+    AUD_DIM = _hid + 3
+    print(f"✅ audeering frozen (auxiliary features {AUD_DIM}-D = emb {_hid} + vad 3)")
+# %%
+import numpy as np
+import librosa
+from tqdm.auto import tqdm
+def load_wav(name_or_stem):
+    p = name_or_stem if os.path.isabs(str(name_or_stem)) else os.path.join(
+        WAV_DIR, name_or_stem if str(name_or_stem).endswith(".wav") else str(name_or_stem) + ".wav")
+    if not os.path.exists(p):
+        return None
+    wave, _ = librosa.load(p, sr=SR, mono=True)
+    return wave[: MAX_SECONDS * SR].astype(np.float32)
+@torch.no_grad()
+def extract_audeering(stems, tag):
+    if not USE_AUDEERING:
+        return {}
+    cache_path = os.path.join(CACHE_DIR, f"aud_{tag}.npz")
+    store = {}
+    if os.path.exists(cache_path):
+        z = np.load(cache_path, allow_pickle=True)
+        store = {k: z[k] for k in z.files}
+        print(f"[aud/{tag}] loaded cache: {len(store)}")
+    todo = [s for s in stems if s not in store]
+    for i, s in enumerate(tqdm(todo, desc=f"audeering {tag}")):
+        wave = load_wav(s)
+        if wave is None:
+            continue
+        x = aud_proc(wave, sampling_rate=SR).input_values[0]
+        x = torch.from_numpy(np.asarray(x, dtype=np.float32)).unsqueeze(0).to(device)
+        h = aud_backbone(x)[0].mean(dim=1)
+        out = aud_head(h)[0].cpu().numpy()                  # [arousal, dominance, valence] ∈[0,1]
+        vad = np.array([1 + 4 * out[2], 1 + 4 * out[0], 1 + 4 * out[1]], dtype=np.float32)  # [VAL,ARO,DOM]
+        store[s] = np.concatenate([h[0].cpu().numpy(), vad]).astype(np.float32)
+        if (i + 1) % 500 == 0:
+            np.savez(cache_path, **store)
+    if todo:
+        np.savez(cache_path, **store)
+    return store
+# %% [markdown]
+# ## 4. Read and merge labels by wavID (EMOS / VAD / CAT), as in exp08
+# %%
+import pandas as pd
+def load_target_emotions():
+    tgt = {}
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) >= 2:
+                tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+def _col(cols_map, *names, df=None, default_idx=None):
+    for n in names:
+        if n in cols_map:
+            return cols_map[n]
+    return list(df.columns)[default_idx] if default_idx is not None else None
+def parse_emocat_votes(cell):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    for tok in str(cell).replace("/", ",").replace(";", ",").replace("|", ",").replace(" ", ",").split(","):
+        e = norm_emotion(tok)
+        if e in EMOTIONS5:
+            v[EMOTIONS5.index(e)] += 1.0
+    return v
+def load_train_labels():
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    cols = {c.lower().strip(): c for c in df.columns}
+    wav_col = _col(cols, "wavid", "wav", df=df, default_idx=1)
+    emos_col = _col(cols, "emos", "emo", "emomos")
+    val_col = _col(cols, "val", "valence"); aro_col = _col(cols, "aro", "arousal"); dom_col = _col(cols, "dom", "dominance")
+    cat_col = _col(cols, "emocat", "cat", "emotion")
+    assert emos_col, f"eMOS column not found (columns: {list(df.columns)})"
+    df["_stem"] = df[wav_col].map(stem)
+    rows = []
+    for sid, g in df.groupby("_stem"):
+        rec = {"wavID": sid, "emos": float(g[emos_col].mean())}
+        rec["val"] = float(g[val_col].mean()) if val_col else np.nan
+        rec["aro"] = float(g[aro_col].mean()) if aro_col else np.nan
+        rec["dom"] = float(g[dom_col].mean()) if dom_col else np.nan
+        votes = np.zeros(len(EMOTIONS5), dtype=np.float32)
+        if cat_col:
+            for cell in g[cat_col]:
+                votes += parse_emocat_votes(cell)
+        s = votes.sum()
+        cat = votes / s if s > 0 else np.full(len(EMOTIONS5), 0.2, dtype=np.float32)
+        for i in range(len(EMOTIONS5)):
+            rec[f"cat{i}"] = float(cat[i])
+        rows.append(rec)
+    return pd.DataFrame(rows)
+target_map = load_target_emotions()
+train_df = load_train_labels()
+HAS_VAD = bool(train_df["val"].notna().any())
+print(f"Target: {len(target_map)} | train wavs (merged): {len(train_df)} | has VAD: {HAS_VAD}")
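The vote aggregation in `parse_emocat_votes` and `load_train_labels` turns several raters' category cells into one soft label per utterance. The sketch below reproduces that idea with the standard library only. The label names in `LABELS` are an assumption (the real `EMOTIONS5` and `norm_emotion` are defined outside this hunk), and `votes_to_distribution` is a hypothetical helper name.

```python
from collections import Counter

# Assumed label set for illustration; the notebook's EMOTIONS5 may differ.
LABELS = ["neutral", "happy", "sad", "angry", "surprise"]


def votes_to_distribution(cells, labels=LABELS):
    """Turn raw rater cells such as 'happy/sad' into a probability vector."""
    counts = Counter()
    for cell in cells:
        text = str(cell).lower()
        # Treat '/', ';', '|' and spaces as separators, like the notebook does.
        for sep in "/;| ":
            text = text.replace(sep, ",")
        for token in text.split(","):
            if token in labels:
                counts[token] += 1
    total = sum(counts.values())
    if total == 0:
        # No usable vote: fall back to a uniform distribution.
        return [1.0 / len(labels)] * len(labels)
    return [counts[name] / total for name in labels]


print(votes_to_distribution(["happy", "happy/sad", "unknown"]))
```

Training the category head against this distribution with soft cross-entropy keeps rater disagreement in the target instead of collapsing it to a single majority label.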
+# %% [markdown]
+# ## 5. Dataset / DataLoader (load wavs per batch; WavLM outputs are NOT cached because the backbone is being trained)
+# %%
+from torch.utils.data import Dataset, DataLoader
+train_stems = [s for s in train_df["wavID"] if target_map.get(s) is not None]
+if LIMIT_TRAIN:
+    train_stems = train_stems[:LIMIT_TRAIN]
+aud_tr = extract_audeering(train_stems, "train")
+lab = train_df.set_index("wavID")
+def _zfit(arr):
+    a = np.asarray(arr, dtype=np.float32)
+    return float(np.nanmean(a)), float(np.nanstd(a) + 1e-6)
+if RESUME and resume_ckpt is not None:
+    # IMPORTANT: take the normalization stats FROM the ckpt (the heads were trained on this scale); do NOT recompute them, to avoid a scale mismatch
+    emos_mu = float(resume_ckpt["emos_mu"]); emos_sd = float(resume_ckpt["emos_sd"])
+    vad_mu = np.asarray(resume_ckpt["vad_mu"], dtype=np.float32)
+    vad_sd = np.asarray(resume_ckpt["vad_sd"], dtype=np.float32)
+    print(f"🔁 RESUME: using normalization FROM ckpt: emos μ={emos_mu:.3f} σ={emos_sd:.3f} | vad μ={np.round(vad_mu,2)}")
+else:
+    emos_mu, emos_sd = _zfit([lab.loc[s, "emos"] for s in train_stems])
+    if HAS_VAD:
+        vad_mu = np.array([_zfit([lab.loc[s, c] for s in train_stems])[0] for c in ["val", "aro", "dom"]], dtype=np.float32)
+        vad_sd = np.array([_zfit([lab.loc[s, c] for s in train_stems])[1] for c in ["val", "aro", "dom"]], dtype=np.float32)
+    else:
+        vad_mu = np.zeros(3, dtype=np.float32); vad_sd = np.ones(3, dtype=np.float32)
+def onehot_target(tgt):
+    v = np.zeros(len(EMOTIONS5), dtype=np.float32)
+    if tgt in EMOTIONS5:
+        v[EMOTIONS5.index(tgt)] = 1.0
+    return v
+class EmoDataset(Dataset):
+    def __init__(self, stems):
+        self.stems = [s for s in stems if (load_wav(s) is not None) and ((not USE_AUDEERING) or s in aud_tr)]
+    def __len__(self):
+        return len(self.stems)
+    def __getitem__(self, i):
+        s = self.stems[i]
+        wave = load_wav(s)
+        emos = (float(lab.loc[s, "emos"]) - emos_mu) / emos_sd
+        if HAS_VAD:
+            vad = (np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32) - vad_mu) / vad_sd
+        else:
+            vad = np.zeros(3, dtype=np.float32)
+        cat = np.array([lab.loc[s, f"cat{j}"] for j in range(len(EMOTIONS5))], dtype=np.float32)
+        aud = aud_tr[s] if USE_AUDEERING else np.zeros(0, dtype=np.float32)
+        return {"wave": wave, "tgt": onehot_target(target_map.get(s)), "aud": aud,
+                "emos": np.float32(emos), "vad": vad, "cat": cat,
+                "emos_raw": np.float32(lab.loc[s, "emos"]),
+                "vad_raw": np.array([lab.loc[s, "val"], lab.loc[s, "aro"], lab.loc[s, "dom"]], np.float32)}
+def collate(batch):
+    L = max(len(b["wave"]) for b in batch)
+    waves = np.zeros((len(batch), L), dtype=np.float32)
+    mask = np.zeros((len(batch), L), dtype=np.float32)
+    for i, b in enumerate(batch):
+        waves[i, : len(b["wave"])] = b["wave"]; mask[i, : len(b["wave"])] = 1.0
+    return {
+        "input_values": torch.from_numpy(waves), "attn_mask": torch.from_numpy(mask).long(),
+        "tgt": torch.from_numpy(np.stack([b["tgt"] for b in batch])),
+        "aud": torch.from_numpy(np.stack([b["aud"] for b in batch])) if USE_AUDEERING else None,
+        "emos": torch.from_numpy(np.stack([b["emos"] for b in batch])).unsqueeze(1),
+        "vad": torch.from_numpy(np.stack([b["vad"] for b in batch])),
+        "cat": torch.from_numpy(np.stack([b["cat"] for b in batch])),
+        "emos_raw": np.stack([b["emos_raw"] for b in batch]),
+        "vad_raw": np.stack([b["vad_raw"] for b in batch]),
+    }
+from sklearn.model_selection import train_test_split
+ds = EmoDataset(train_stems)
+print("Valid dataset:", len(ds), "wavs")
+tr_i, va_i = train_test_split(np.arange(len(ds)), test_size=VAL_FRAC, random_state=SEED)
+tr_loader = DataLoader(torch.utils.data.Subset(ds, tr_i), batch_size=BATCH, shuffle=True, collate_fn=collate, num_workers=2)
+va_loader = DataLoader(torch.utils.data.Subset(ds, va_i), batch_size=BATCH, shuffle=False, collate_fn=collate, num_workers=2)
+# %% [markdown]
+# ## 6a. MAMBA block (pure PyTorch, fallback when `mamba-ssm` is unavailable)
+# Follows "mamba-minimal": the same selective-SSM formulation, just slower than the CUDA kernel. Runs in fp32 for stability.
+# %%
+import math
+try:
+    from mamba_ssm import Mamba as _OfficialMamba
+    _HAS_MAMBA_SSM = True
+    print("✅ Using mamba-ssm (CUDA kernel)")
+except Exception:
+    _HAS_MAMBA_SSM = False
+    print("ℹ️ mamba-ssm not available → pure-PyTorch Mamba (slower when fine-tuning)")
+class MambaBlockTorch(nn.Module):
+    def __init__(self, d_model, d_state=16, d_conv=4, expand=2):
+        super().__init__()
+        self.d_inner = expand * d_model
+        self.dt_rank = math.ceil(d_model / 16)
+        self.in_proj = nn.Linear(d_model, self.d_inner * 2, bias=False)
+        self.conv1d = nn.Conv1d(self.d_inner, self.d_inner, kernel_size=d_conv,
+                                groups=self.d_inner, padding=d_conv - 1, bias=True)
+        self.x_proj = nn.Linear(self.d_inner, self.dt_rank + d_state * 2, bias=False)
+        self.dt_proj = nn.Linear(self.dt_rank, self.d_inner, bias=True)
+        A = torch.arange(1, d_state + 1, dtype=torch.float32).repeat(self.d_inner, 1)
+        self.A_log = nn.Parameter(torch.log(A))
+        self.D = nn.Parameter(torch.ones(self.d_inner))
+        self.out_proj = nn.Linear(self.d_inner, d_model, bias=False)
+        self.d_state = d_state
+    def forward(self, x):                                 # x: (B, L, d_model)
+        B, L, _ = x.shape
+        xin, z = self.in_proj(x).chunk(2, dim=-1)
+        xin = xin.transpose(1, 2)
+        xin = self.conv1d(xin)[..., :L].transpose(1, 2)
+        xin = F.silu(xin)
+        y = self._ssm(xin) * F.silu(z)
+        return self.out_proj(y)
+    def _ssm(self, x):
+        A = -torch.exp(self.A_log)
+        delta, Bm, Cm = torch.split(self.x_proj(x), [self.dt_rank, self.d_state, self.d_state], dim=-1)
+        delta = F.softplus(self.dt_proj(delta))
+        dA = torch.exp(delta.unsqueeze(-1) * A)
+        dB_x = delta.unsqueeze(-1) * Bm.unsqueeze(2) * x.unsqueeze(-1)
+        h = torch.zeros(x.shape[0], self.d_inner, self.d_state, device=x.device, dtype=x.dtype)
+        ys = []
+        for t in range(x.shape[1]):
+            h = dA[:, t] * h + dB_x[:, t]
+            ys.append((h * Cm[:, t].unsqueeze(1)).sum(-1))
+        return torch.stack(ys, dim=1) + x * self.D
+class MambaLayer(nn.Module):
+    def __init__(self, d_model, d_state):
+        super().__init__()
+        self.norm = nn.LayerNorm(d_model)
+        self.mix = _OfficialMamba(d_model=d_model, d_state=d_state, d_conv=4, expand=2) \
+            if _HAS_MAMBA_SSM else MambaBlockTorch(d_model, d_state=d_state)
+    def forward(self, x):
+        return x + self.mix(self.norm(x))
+class MambaEncoder(nn.Module):
+    """1024 → d_model → [Mamba ×L] (bidirectional) → attentive pooling (masked) → Z_DIM."""
+    def __init__(self, d_in, d_model, n_layers, d_state, z_dim, bidir):
+        super().__init__()
+        self.bidir = bidir
+        self.proj = nn.Linear(d_in, d_model)
+        self.fwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])
+        if bidir:
+            self.bwd = nn.ModuleList([MambaLayer(d_model, d_state) for _ in range(n_layers)])
+        self.attn = nn.Linear(d_model, 1)
+        self.out = nn.Linear(d_model, z_dim)
+    @staticmethod
+    def _run(layers, h):
+        for L in layers:
+            h = L(h)
+        return h
+    def forward(self, x, mask):                           # x:(B,L,1024) mask:(B,L) bool
+        with torch.cuda.amp.autocast(enabled=False):      # run the SSM in fp32 for stability
+            x = x.float()
+            h = self.proj(x)
+            out = self._run(self.fwd, h)
+            if self.bidir:
+                out = out + torch.flip(self._run(self.bwd, torch.flip(h, dims=[1])), dims=[1])
+            a = self.attn(out).squeeze(-1).masked_fill(~mask, float("-inf"))
+            w = torch.softmax(a, dim=1).unsqueeze(-1)
+            return self.out((out * w).sum(1))
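The recurrence inside `MambaBlockTorch._ssm` is easier to follow for a single channel with a single state dimension. The sketch below writes out that scalar case in plain Python. It is an illustration of the discretised update under simplifying assumptions (scalar `A` and `D`, inputs given as lists), and `ssm_scan_scalar` is a hypothetical name, not a function from the notebook or from `mamba-ssm`.

```python
import math


def ssm_scan_scalar(x, delta, B, C, A=-1.0, D=1.0):
    """One channel, one state dimension of the discretised selective SSM.

    x, delta, B, C are equal-length lists; delta[t] > 0 is the step size
    and B[t], C[t] are the input-dependent ("selective") projections.
    """
    h = 0.0
    ys = []
    for t in range(len(x)):
        decay = math.exp(delta[t] * A)            # dA[:, t] in the notebook
        h = decay * h + delta[t] * B[t] * x[t]    # state update with dB_x[:, t]
        ys.append(C[t] * h + D * x[t])            # readout plus the D skip term
    return ys


# An impulse at t=0 followed by silence: the state decays by exp(delta * A).
print(ssm_scan_scalar([1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]))
```

Because `A` is negative, a larger `delta[t]` makes the state forget faster and weight the current input more, which is how the input-dependent step size controls what the block retains.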
+# %% [markdown]
+# ## 6b. Emotion heads + training loop (AMP + gradient accumulation + uncertainty weighting)
+# %%
+from scipy.stats import spearmanr
+torch.manual_seed(SEED); np.random.seed(SEED)
+N_EMO = len(EMOTIONS5)
+WAVLM_BRANCH = Z_DIM if USE_MAMBA else WAVLM_DIM
+TRUNK_IN = WAVLM_BRANCH + (AUD_DIM if USE_AUDEERING else 0)
+enc = MambaEncoder(WAVLM_DIM, MAMBA_DMODEL, MAMBA_LAYERS, MAMBA_DSTATE, Z_DIM, BIDIRECTIONAL).to(device) \
+    if USE_MAMBA else None
+class EmoHeads(nn.Module):
+    def __init__(self, d_in, trunk_h, head_h, p, n_emo):
+        super().__init__()
+        self.trunk = nn.Sequential(nn.Linear(d_in, trunk_h), nn.ReLU(), nn.Dropout(p),
+                                   nn.Linear(trunk_h, trunk_h), nn.ReLU(), nn.Dropout(p))
+        self.emos = nn.Sequential(nn.Linear(trunk_h + n_emo, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 1))
+        self.cat = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, n_emo))
+        self.vad = nn.Sequential(nn.Linear(trunk_h, head_h), nn.ReLU(), nn.Dropout(p), nn.Linear(head_h, 3))
+    def forward(self, feat, tgt):
+        h = self.trunk(feat)
+        return self.emos(torch.cat([h, tgt], 1)), self.cat(h), self.vad(h)
+heads = EmoHeads(TRUNK_IN, TRUNK_HIDDEN, HEAD_HIDDEN, DROPOUT, N_EMO).to(device)
+print(f"Trunk input = {TRUNK_IN} (wavlm-branch {WAVLM_BRANCH} [{'Mamba' if USE_MAMBA else 'mean-pool'}] + aud {AUD_DIM if USE_AUDEERING else 0})")
+if USE_MAMBA:
+    print(f"Mamba encoder: {sum(p.numel() for p in enc.parameters())/1e6:.2f}M param")
+# ── RESUME: load heads (+ Mamba enc) from the checkpoint ──
+if RESUME and resume_ckpt is not None:
+    hm, hu = heads.load_state_dict(resume_ckpt["heads"], strict=False)
+    print(f"🔁 RESUME loaded heads from ckpt: {len(hm)} missing / {len(hu)} unexpected keys (expected 0)")
+    if USE_MAMBA and resume_ckpt.get("enc") is not None:
+        em, eu = enc.load_state_dict(resume_ckpt["enc"], strict=False)
+        print(f"🔁 RESUME loaded Mamba enc from ckpt: {len(em)} missing / {len(eu)} unexpected keys (expected 0)")
+    elif USE_MAMBA:
+        print("   ⚠️ ckpt has NO 'enc' (Mamba) → the Mamba head trains from scratch (only backbone+heads are resumed).")
+TASKS = ["emos", "cat", "val", "aro", "dom"]
+log_var = nn.Parameter(torch.zeros(len(TASKS), device=device))
+bb_params = [p for p in wavlm.parameters() if p.requires_grad]
+head_params = list(heads.parameters()) + (list(enc.parameters()) if USE_MAMBA else []) \
+    + ([log_var] if USE_UNCERTAINTY else [])
+_lr_scale = RESUME_LR_SCALE if RESUME else 1.0
+opt = torch.optim.AdamW([
+    {"params": bb_params, "lr": LR_BACKBONE * _lr_scale},
+    {"params": head_params, "lr": LR_HEAD * _lr_scale},
+], weight_decay=WEIGHT_DECAY)
+if RESUME and _lr_scale != 1.0:
+    print(f"🔁 RESUME: LR ×{_lr_scale} → backbone {LR_BACKBONE*_lr_scale:.1e} · head {LR_HEAD*_lr_scale:.1e}")
+scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP and device == "cuda")
+mse = nn.MSELoss()
+def soft_ce(logits, target_dist):
+    return -(target_dist * F.log_softmax(logits, dim=1)).sum(1).mean()
+def wavlm_branch(input_values, attn_mask):
+    out = wavlm(input_values, attention_mask=attn_mask).last_hidden_state    # (B,T,D)
+    if USE_MAMBA:
+        return enc(out, frame_mask(out.shape[1], attn_mask))                  # (B, Z_DIM)
+    return masked_mean(out, attn_mask)                                        # (B, D)
+def forward_batch(b):
+    fw = wavlm_branch(b["input_values"].to(device), b["attn_mask"].to(device))
+    feat = torch.cat([fw, b["aud"].to(device)], dim=1) if USE_AUDEERING else fw
+    return heads(feat, b["tgt"].to(device))
+def pairwise_rank_loss(pred, target):
+    """Hinge ranking over EVERY pair in the batch → directly optimizes rank order (≈ SRCC). Differentiable (supports backprop).
+    Needs ≥2 samples per batch to form a pair; larger batches give more pairs → a stronger signal."""
+    p = pred.reshape(-1); t = target.reshape(-1)
+    if p.numel() < 2:
+        return torch.zeros((), device=p.device)
+    sign = torch.sign(t.unsqueeze(0) - t.unsqueeze(1))      # sign[i, j] = +1 if item j SHOULD rank above item i
+    diff = p.unsqueeze(0) - p.unsqueeze(1)                  # diff[i, j] = the model's predicted difference p[j] - p[i]
+    return torch.relu(-sign * diff).mean()                  # penalize pairs ranked in the wrong order
+def compute_loss(emos_p, cat_l, vad_p, b):
+    L = {"emos": mse(emos_p, b["emos"].to(device)), "cat": soft_ce(cat_l, b["cat"].to(device))}
+    if HAS_VAD:
+        vt = b["vad"].to(device)
+        L["val"] = mse(vad_p[:, 0:1], vt[:, 0:1]); L["aro"] = mse(vad_p[:, 1:2], vt[:, 1:2]); L["dom"] = mse(vad_p[:, 2:3], vt[:, 2:3])
+    else:
+        vt = None
+        z = torch.zeros((), device=device); L["val"] = L["aro"] = L["dom"] = z
+    # Ranking loss ONLY for the SRCC-scored columns (emos/val/aro/dom). CAT is scored as a distribution error (ERR) → keep soft-CE.
+    if RANK_LAMBDA > 0:
+        L["emos"] = L["emos"] + RANK_LAMBDA * pairwise_rank_loss(emos_p, b["emos"].to(device))
+        if HAS_VAD:
+            L["val"] = L["val"] + RANK_LAMBDA * pairwise_rank_loss(vad_p[:, 0:1], vt[:, 0:1])
+            L["aro"] = L["aro"] + RANK_LAMBDA * pairwise_rank_loss(vad_p[:, 1:2], vt[:, 1:2])
+            L["dom"] = L["dom"] + RANK_LAMBDA * pairwise_rank_loss(vad_p[:, 2:3], vt[:, 2:3])
+    if USE_UNCERTAINTY:
+        return sum(torch.exp(-log_var[i]) * L[t] + log_var[i] for i, t in enumerate(TASKS))
+    return sum(L.values())
+def set_mode(train):
+    wavlm.train(train); heads.train(train)
+    if USE_MAMBA:
+        enc.train(train)
+@torch.no_grad()
+def evaluate():
+    set_mode(False)
+    P = {"emos": [], "val": [], "aro": [], "dom": []}; Y = {"emos": [], "val": [], "aro": [], "dom": []}
+    catP, catY = [], []
+    for b in va_loader:
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+        P["emos"] += emos_p.float().cpu().numpy().ravel().tolist(); Y["emos"] += b["emos_raw"].tolist()
+        vad_p = vad_p.float().cpu().numpy()
+        for j, t in enumerate(["val", "aro", "dom"]):
+            P[t] += vad_p[:, j].tolist(); Y[t] += b["vad_raw"][:, j].tolist()
+        catP.append(F.softmax(cat_l, 1).float().cpu().numpy()); catY.append(b["cat"])
+    out = {t: spearmanr(P[t], Y[t]).correlation for t in ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])}
+    q = np.concatenate(catP); p = np.concatenate(catY)
+    out["cat_err"] = float(np.abs(q - p).sum(1).mean())
+    return out
+def mean_srcc(m):
+    keys = ["emos"] + (["val", "aro", "dom"] if HAS_VAD else [])
+    return float(np.mean([m[k] for k in keys]))
+CKPT_PATH = os.path.join(OUT_DIR, "ft_mamba_emotion_full.pt")
+def save_full_ckpt(state, val_emos=float("nan")):
+    torch.save({"wavlm": state["wavlm"], "heads": state["heads"], "enc": state.get("enc"),
+                "USE_MAMBA": USE_MAMBA, "emos_mu": emos_mu, "emos_sd": emos_sd,
+                "vad_mu": vad_mu, "vad_sd": vad_sd, "WAVLM_DIM": WAVLM_DIM, "AUD_DIM": AUD_DIM,
+                "Z_DIM": Z_DIM, "UNFREEZE_TOP_LAYERS": UNFREEZE_TOP_LAYERS,
+                "val_emos": float(val_emos)}, CKPT_PATH)
+def snapshot():
+    s = {"wavlm": {k: v.cpu().clone() for k, v in wavlm.state_dict().items()},
+         "heads": {k: v.cpu().clone() for k, v in heads.state_dict().items()}}
+    if USE_MAMBA:
+        s["enc"] = {k: v.cpu().clone() for k, v in enc.state_dict().items()}
+    return s
+# RESUME: initialise best with the current ckpt's VAL score → only overwrite if further training does BETTER (no risk of regressing)
+if RESUME and resume_ckpt is not None:
+    m0 = evaluate(); best = mean_srcc(m0); best_state = snapshot(); bad = 0
+    print(f"📍 RESUME: current checkpoint: mean SRCC={best:.4f} | "
+          + " ".join(f"{k}={m0[k]:.3f}" for k in ['emos', 'val', 'aro', 'dom'] if k in m0))
+else:
+    m0 = None
+    best, best_state, bad = -1e9, None, 0
+for ep in range(1, EPOCHS + 1):
+    set_mode(True)
+    opt.zero_grad(); run = 0.0; nb = 0
+    for step, b in enumerate(tqdm(tr_loader, desc=f"epoch {ep}")):
+        with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+            emos_p, cat_l, vad_p = forward_batch(b)
+            loss = compute_loss(emos_p, cat_l, vad_p, b) / ACCUM
+        scaler.scale(loss).backward()
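+        if (step + 1) % ACCUM == 0 or (step + 1) == len(tr_loader):   # also step on the last batch so a partial accumulation group is not discarded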
+            scaler.step(opt); scaler.update(); opt.zero_grad()
+        run += loss.item() * ACCUM; nb += 1
+    m = evaluate(); sc = mean_srcc(m)
+    msg = " ".join(f"{k}={m[k]:.3f}" for k in ["emos", "val", "aro", "dom"] if k in m)
+    print(f"epoch {ep:2d} | loss {run/max(nb,1):.4f} | {msg} | cat_err {m['cat_err']:.3f} | mean {sc:.4f} (best {max(best,sc):.4f})")
+    if sc > best:
+        best = sc; bad = 0
+        best_state = snapshot()
+        save_full_ckpt(best_state, m["emos"])
+        print(f"   💾 saved best → {CKPT_PATH} (epoch {ep}, mean {sc:.4f})")
+    else:
+        bad += 1
+        if bad >= PATIENCE:
+            print(f"Early stop at epoch {ep}."); break
+if best_state:
+    wavlm.load_state_dict(best_state["wavlm"]); heads.load_state_dict(best_state["heads"])
+    if USE_MAMBA:
+        enc.load_state_dict(best_state["enc"])
+final = evaluate()
+if RESUME and m0 is not None:
+    print(f"\n🔁 RESUME: mean SRCC ckpt {mean_srcc(m0):.4f} → after further training {mean_srcc(final):.4f} "
+          + ("🚀 improved → ckpt overwritten" if mean_srcc(final) > mean_srcc(m0) + 1e-4 else "➖ no improvement (keeping the previous best)"))
+print(f"\n✅ VAL (internal): exp15 (Mamba={'ON' if USE_MAMBA else 'OFF'}):")
+print(f"   EMOS={final['emos']:.4f} (exp08 {EXP08['emos']})")
+if HAS_VAD:
+    print(f"   VAL/ARO/DOM={final['val']:.4f}/{final['aro']:.4f}/{final['dom']:.4f} "
+          f"(exp08 {EXP08['val']}/{EXP08['aro']}/{EXP08['dom']})")
+warn = [f"EMOS {final['emos']:.3f}<{EXP08['emos']}"] if final["emos"] < EXP08["emos"] - 0.005 else []
+if HAS_VAD:
+    warn += [f"{t.upper()} {final[t]:.3f}<{EXP08[t]}" for t in ["val", "aro", "dom"] if final[t] < EXP08[t] - 0.005]
+if warn:   # print the warning only when at least one column is below exp08
+    print("   ⚠️ Mamba head does NOT yet beat exp08 on:", "; ".join(warn), "(still a result for the paper)")
+else:
+    print("   ✅ Mamba head beats/matches exp08 on every column → temporal modeling helps!")
+save_full_ckpt(best_state if best_state else
+               {"wavlm": wavlm.state_dict(), "heads": heads.state_dict(),
+                "enc": enc.state_dict() if USE_MAMBA else None}, final["emos"])
+print(f"✅ Saved {CKPT_PATH} (includes backbone + Mamba + heads). REMEMBER to Save Version!")
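The `USE_UNCERTAINTY` branch of `compute_loss` combines the five task losses with learnable log-variances. A minimal numeric sketch of that formula, using only the standard library, is given below. `uncertainty_weighted_total` is a hypothetical name, and the losses here are plain floats rather than tensors.

```python
import math


def uncertainty_weighted_total(task_losses, log_vars):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i.

    task_losses: list of non-negative floats, one per task
    log_vars:    list of log-variances s_i (learnable in the notebook)
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need one log-variance per task")
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))


# With s = 0 every task has weight 1, so this is the plain sum of the losses.
print(uncertainty_weighted_total([0.5, 2.0], [0.0, 0.0]))

# For a fixed loss L, exp(-s) * L + s is smallest at s = log(L), so a task
# with a persistently large loss ends up with a smaller weight exp(-s).
loss = 4.0
print(uncertainty_weighted_total([loss], [math.log(loss)]))
```

The `+ s` term is what stops the optimizer from driving every weight to zero: lowering a task's weight always costs something.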
+# %% [markdown]
+# ## 7. Predict DEV → answer.txt (5 emotion columns from exp15; QMOS borrowed from exp07/UTMOSv2)
+# %%
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT_DEV:
+    dev_names = dev_names[:LIMIT_DEV]
+dev_stems = [stem(n) for n in dev_names]
+print("DEV:", len(dev_names), "samples")
+aud_dev = extract_audeering(dev_stems, "dev")
+def load_exp07_qmos():
+    if EXP07_ANSWER and os.path.exists(EXP07_ANSWER):
+        import csv
+        d = {}
+        with open(EXP07_ANSWER) as f:
+            for row in csv.DictReader(f):
+                d[row["wav"]] = float(row["QMOS"]); d[stem(row["wav"])] = float(row["QMOS"])
+        print(f"✅ Borrowed QMOS from exp07 ({EXP07_ANSWER}): {len(d)//2} wavs")
+        return d
+    return None
+qmos_map = load_exp07_qmos()
+if qmos_map is None:
+    print("ℹ️ No exp07 answer.txt → scoring QMOS with UTMOSv2 (T05, VMC2024 winner).")
+    pip_install("git+https://github.com/sarulab-speech/UTMOSv2.git")
+    import utmosv2
+    v2 = utmosv2.create_model(pretrained=True)
+    qmos_map = {}
+    for n in tqdm(dev_names, desc="UTMOSv2"):
+        wav = os.path.join(WAV_DIR, n if str(n).endswith(".wav") else str(n) + ".wav")
+        if not os.path.exists(wav):
+            continue
+        out = v2.predict(input_path=wav)
+        qmos_map[n] = float(out["predicted_mos"]) if isinstance(out, dict) else float(out)
+    del v2; torch.cuda.empty_cache() if device == "cuda" else None
+@torch.no_grad()
+def predict_emotion(sid):
+    wave = load_wav(sid)
+    if wave is None or (USE_AUDEERING and sid not in aud_dev):
+        return None
+    set_mode(False)
+    iv = torch.from_numpy(wave).unsqueeze(0).to(device)
+    am = torch.ones((1, len(wave)), dtype=torch.long, device=device)
+    tgt = torch.from_numpy(onehot_target(target_map.get(sid))).unsqueeze(0).to(device)
+    with torch.cuda.amp.autocast(enabled=USE_AMP and device == "cuda"):
+        fw = wavlm_branch(iv, am)
+        feat = torch.cat([fw, torch.from_numpy(aud_dev[sid]).unsqueeze(0).to(device)], dim=1) if USE_AUDEERING else fw
+        emos_p, cat_l, vad_p = heads(feat, tgt)
+    emos = float(emos_p.item()) * emos_sd + emos_mu
+    cat5 = F.softmax(cat_l, 1)[0].float().cpu().numpy()
+    vad3 = vad_p[0].float().cpu().numpy() * vad_sd + vad_mu
+    return emos, cat5, vad3
+def fmt_cat(p5):
+    return "|".join(f"{e}:{p5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_real = n_def = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in tqdm(dev_names, desc="answer"):
+            sid = stem(name)
+            pr = predict_emotion(sid)
+            if pr is None:
+                emos, cat5, vad3 = 3.0, np.full(5, 0.2, np.float32), np.array([3.0, 3.0, 3.0]); n_def += 1
+            else:
+                emos, cat5, vad3 = pr; n_real += 1
+            qmos = qmos_map.get(name, qmos_map.get(sid, 3.0))
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},{vad3[0]:.6g},{vad3[1]:.6g},{vad3[2]:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | real emotion predictions {n_real}, defaults {n_def}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+# %% [markdown]
+# ## 8. Validate + build the zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    assert rows[0][0] == "wav" and "QMOS" in rows[0] and "EMOS" in rows[0], "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(rows[0]), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {rows[0]}")
+validate(answer_path)
+os.system(f"cd {OUT_DIR} && zip -j submission_track2_exp15_mamba-emotion.zip answer.txt "
+          f"&& unzip -l submission_track2_exp15_mamba-emotion.zip")
+print("Ready to submit:", os.path.join(OUT_DIR, "submission_track2_exp15_mamba-emotion.zip"))
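The row layout written by `build_answer` and checked by `validate` can be exercised in memory before anything is written to disk. The sketch below is a standard-library-only illustration under stated assumptions: `LABELS` stands in for the notebook's `EMOTIONS5` (whose actual values are not shown in this hunk), and `format_row` / `check_answer` are hypothetical helper names. It goes one step further than the notebook's `validate` by also checking that the numeric columns parse as floats.

```python
import csv
import io

HEADER = ["wav", "QMOS", "EMOS", "CAT", "VAL", "ARO", "DOM"]
# Assumed label set for illustration; the notebook's EMOTIONS5 may differ.
LABELS = ["neutral", "happy", "sad", "angry", "surprise"]


def format_row(wav, qmos, emos, cat_probs, vad, labels=LABELS):
    """Build one answer.txt line; CAT is 'label:prob' pairs joined by '|'."""
    cat = "|".join(f"{name}:{p:.6g}" for name, p in zip(labels, cat_probs))
    fields = [wav, f"{qmos:.6g}", f"{emos:.6g}", cat] + [f"{v:.6g}" for v in vad]
    return ",".join(fields)


def check_answer(text):
    """Return the number of data rows, raising ValueError on a malformed file."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != HEADER:
        raise ValueError("bad header")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(HEADER):
            raise ValueError(f"line {line_no}: wrong number of columns")
        for cell in row[1:3] + row[4:]:
            float(cell)  # QMOS, EMOS, VAL, ARO, DOM must be numeric
    return len(rows) - 1


row = format_row("a.wav", 3.5, 4.0, [0.2] * 5, [3.0, 2.5, 3.1])
print(row)
print(check_answer(",".join(HEADER) + "\n" + row + "\n"))
```

Using `|` and `:` inside the CAT cell keeps it free of commas, so the file stays parseable as plain CSV; a wav name that itself contains a comma would break this layout.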
+# %% [markdown]
+# ## Notes
+# - **🔁 RESUME (continue training instead of starting over):** Add an Input dataset containing the previous run's
+#   `ft_mamba_emotion_full.pt` (or leave it in `/kaggle/working` when continuing a session) → the notebook detects it and resumes.
+#   `EPOCHS` then means the number of **ADDITIONAL** epochs. If validation plateaus, set `RESUME_LR_SCALE=0.5`. To force a fresh run: `RESUME_CKPT="—"`
+#   (a nonexistent path) or remove the checkpoint from the input. ⚠️ `USE_MAMBA` must MATCH the checkpoint (the code warns on a mismatch).
+# - **First run:** set `LIMIT_TRAIN=300`, `LIMIT_DEV=20` → check that 1 epoch finishes without OOM or CheckpointError; then set both to `None`.
+# - **Main ablation for the paper:** run `USE_MAMBA=True` vs `USE_MAMBA=False` (=exp08) → compare EMOS/VAL/ARO/DOM internally
+#   → answers "Does the Mamba temporal head beat mean-pooling?".
+# - **OOM / too slow on a T4 (especially with the pure-PyTorch Mamba):** reduce, in this order,
+#   `MAX_SECONDS` (6→5) → `MAMBA_LAYERS` (2→1) → `UNFREEZE_TOP_LAYERS` (6→4) → `BATCH` (2→1, raising `ACCUM`).
+#   Or try installing `mamba-ssm causal-conv1d` (much faster and lighter on RAM); the code uses it automatically if the import succeeds.
+# - **Ranking loss (`RANK_LAMBDA`):** adds a pairwise ranking term for the 4 SRCC columns (emos/val/aro/dom) → matches the
+#   UTT-SRCC metric better than MSE. ⚠️ **Weakness:** ranking is computed over pairs WITHIN 1 mini-batch; with `BATCH=2` each forward
+#   pass yields only 1 pair → the signal is WEAK. For a stronger ranking signal, raise `BATCH` (4→8 if VRAM allows). In the FROZEN
+#   head experiments (exp06/07, BATCH=64) ranking is much stronger. A/B `RANK_LAMBDA=0` vs `0.3` → ablation table for the paper.
+# - **QMOS:** Add the exp07 answer.txt as an Input at `/kaggle/input/exp07-answer/answer.txt` to borrow its QMOS (0.548);
+#   without it, the notebook scores with UTMOSv2 itself (needs Internet On).
+# - Record config → results → remarks in `docs/04_experiments_log.md` (exp15 section).
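
The pair-count argument in the ranking-loss note can be made concrete with a small sketch. It assumes a logistic pairwise loss as a stand-in, since the loss actually used in exp15 is not shown in this section; `pairwise_rank_loss` is a hypothetical helper, not a function from the notebook.

```python
import math
from itertools import combinations


def pairwise_rank_loss(pred, gold):
    """Mean logistic ranking loss over the pairs of ONE mini-batch.

    Assumption: a logistic pairwise loss; exp15 may use a different form.
    Returns (loss, number_of_pairs). Pairs with equal gold scores are skipped
    because they carry no ordering information.
    """
    pairs = [(i, j) for i, j in combinations(range(len(pred)), 2)
             if gold[i] != gold[j]]
    if not pairs:
        return 0.0, 0
    total = 0.0
    for i, j in pairs:
        # sign = +1 when item i should rank above item j, else -1
        sign = 1.0 if gold[i] > gold[j] else -1.0
        total += math.log1p(math.exp(-sign * (pred[i] - pred[j])))
    return total / len(pairs), len(pairs)


if __name__ == "__main__":
    for batch in (2, 8, 64):
        scores = [float(k) for k in range(batch)]
        _, n_pairs = pairwise_rank_loss(scores, scores)
        print(f"BATCH={batch}: {n_pairs} pairs per forward pass")
```

A batch of B items with distinct gold scores gives B(B-1)/2 pairs, so the number of ranking comparisons per step grows quadratically with `BATCH`.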

track2/exp16_llm_judge.ipynb ADDED

	@@ -0,0 +1,650 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7bae4e03",
+   "metadata": {},
+   "source": [
+    "# exp16: Audio-LLM-as-Judge for emotional MOS (Track 2)\n",
+    "\n",
+    "**Idea:** send the audio directly to an **audio-LLM** (Gemini / GPT-4o-audio) through its **API** with a\n",
+    "structured prompt → have it score all 6 columns (`QMOS, EMOS, CAT, VAL, ARO, DOM`) → assemble `answer.txt` → submit to CodaBench.\n",
+    "\n",
+    "**Main goal = NOVELTY for the paper** (a systematic study of audio-LLM-as-judge for emotional MOS),\n",
+    "compared against the trained SSL systems (exp07 QMOS 0.548 · exp08 EMOS 0.811…). NO GPU needed: API calls only.\n",
+    "\n",
+    "| Property | Value |\n",
+    "|---|---|\n",
+    "| GPU | ❌ not needed (network I/O only) |\n",
+    "| Cost | ✅ the API bills per token/audio → **cache + resume are mandatory** |\n",
+    "| Provider | `gemini` (default, billing already set up) · `openai` (GPT-4o-audio, to compare 2 LLMs) |\n",
+    "| Output | `answer.txt` with 6 score columns, same as exp07 |\n",
+    "\n",
+    "**Kaggle usage:** Internet = **On**; Add-ons → Secrets: `GEMINI_API_KEY` (and `OPENAI_API_KEY`\n",
+    "if running the openai provider). **No GPU** is needed in Settings. Edit `DATA_ROOT` to match the dataset slug, then Run All.\n",
+    "\n",
+    "⚠️ **Model IDs may have changed** over time → check that `GEMINI_MODEL` / `OPENAI_MODEL` still accept\n",
+    "audio before a full run (see section 1)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "720a7dc2",
+   "metadata": {},
+   "source": [
+    "## 0. Configuration: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c583b4dc",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, io, re, json, time, base64, glob\n",
+    "\n",
+    "# ── Track 2 data on Kaggle ──────────────────────────────────────────────────\n",
+    "DATA_ROOT    = \"/kaggle/input/vmc2026-track2-full/vmc2026-track2\"   # << EDIT the slug\n",
+    "WAV_DIR      = f\"{DATA_ROOT}/wav\"\n",
+    "METADATA_CSV = f\"{DATA_ROOT}/metadata.csv\"      # wavID|emotion|transcript (NO header): target emotion labels\n",
+    "DEV_SCP      = f\"{DATA_ROOT}/sets/dev.scp\"      # list of DEV wavs to submit (train phase)\n",
+    "TRAIN_CSV    = f\"{DATA_ROOT}/sets/train.csv\"    # only needed when SHOT_MODE=\"few_shot\"\n",
+    "\n",
+    "OUT_DIR   = \"/kaggle/working\"\n",
+    "CACHE_DIR = \"/kaggle/working/exp16_llm_cache\"   # Save Version / store as a Dataset so the API is NOT called again\n",
+    "os.makedirs(CACHE_DIR, exist_ok=True)\n",
+    "\n",
+    "# ── Provider & model ────────────────────────────────────────────────────────\n",
+    "PROVIDER     = \"gemini\"                  # \"gemini\" | \"openai\"\n",
+    "GEMINI_MODEL = \"gemini-2.5-flash\"        # << confirm the current audio-capable model (the baseline uses the gemini-*-flash family)\n",
+    "OPENAI_MODEL = \"gpt-4o-audio-preview\"    # << OpenAI's audio model; needs OPENAI_API_KEY\n",
+    "TEMPERATURE  = 0.0                       # fixed for REPRODUCIBILITY (paper)\n",
+    "\n",
+    "# ── Run mode ────────────────────────────────────────────────────────────────\n",
+    "SHOT_MODE    = \"zero_shot\"   # \"zero_shot\" | \"few_shot\" (prepend K labeled audio examples from train.csv)\n",
+    "FEW_K        = 2             # number of few-shot examples (each = 1 audio clip + gold labels); costs extra tokens!\n",
+    "LIMIT        = 20           # << small value (20) for a smoke test; None = full DEV (~2730). DO A TRIAL RUN FIRST\n",
+    "MAX_SECONDS  = 12           # trim audio to keep calls cheap and fast\n",
+    "WORKERS      = 4            # parallel call threads (lower if rate-limited)\n",
+    "MAX_RETRY    = 3            # max attempts per wav on network errors / broken JSON\n",
+    "RETRY_SLEEP  = 2.0          # base pause in seconds between attempts\n",
+    "\n",
+    "TAG = f\"{PROVIDER}_{(GEMINI_MODEL if PROVIDER=='gemini' else OPENAI_MODEL)}_{SHOT_MODE}\".replace(\"/\", \"-\")\n",
+    "CACHE_PATH = os.path.join(CACHE_DIR, f\"{TAG}.jsonl\")   # 1 JSON line per wav (raw + parsed) → resume\n",
+    "print(\"TAG:\", TAG, \"| cache:\", CACHE_PATH)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f2cc876e",
+   "metadata": {},
+   "source": [
+    "## 0b. Target emotion labels + class normalization (reusing the baseline convention)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5b5c7f92",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "EMOTIONS5 = [\"angry\", \"happy\", \"neutral\", \"sad\", \"surprised\"]   # canonical ORDER for the CAT column\n",
+    "\n",
+    "_EMO_ALIAS = {\n",
+    "    \"angry\": \"angry\", \"anger\": \"angry\",\n",
+    "    \"happy\": \"happy\", \"happiness\": \"happy\", \"joy\": \"happy\",\n",
+    "    \"neutral\": \"neutral\", \"calm\": \"neutral\",\n",
+    "    \"sad\": \"sad\", \"sadness\": \"sad\",\n",
+    "    \"surprise\": \"surprised\", \"surprised\": \"surprised\", \"surprising\": \"surprised\",\n",
+    "}\n",
+    "\n",
+    "def norm_emotion(label):\n",
+    "    key = str(label).strip().lower()\n",
+    "    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n",
+    "\n",
+    "def stem(name):\n",
+    "    return os.path.splitext(os.path.basename(name))[0]\n",
+    "\n",
+    "def load_target_emotions():\n",
+    "    \"\"\"metadata.csv (wavID|emotion|transcript, no header) → {stem: normalized_emotion}.\"\"\"\n",
+    "    tgt = {}\n",
+    "    if not (METADATA_CSV and os.path.exists(METADATA_CSV)):\n",
+    "        print(\"⚠️ metadata.csv not found → EMOS prompts will lack the target emotion.\")\n",
+    "        return tgt\n",
+    "    with open(METADATA_CSV, encoding=\"utf-8\") as f:\n",
+    "        for ln in f:\n",
+    "            parts = ln.strip().split(\"|\")\n",
+    "            if len(parts) < 2:\n",
+    "                continue\n",
+    "            tgt[stem(parts[0])] = norm_emotion(parts[1])\n",
+    "    return tgt\n",
+    "\n",
+    "target_map = load_target_emotions()\n",
+    "print(\"Target emotion labels:\", len(target_map))\n",
+    "\n",
+    "def list_dev():\n",
+    "    with open(DEV_SCP) as f:\n",
+    "        return [ln.strip() for ln in f if ln.strip()]\n",
+    "\n",
+    "dev_names = list_dev()\n",
+    "if LIMIT:\n",
+    "    dev_names = dev_names[:LIMIT]\n",
+    "print(\"DEV to score:\", len(dev_names), \"samples\", \"| LIMIT =\", LIMIT)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "881a37a5",
+   "metadata": {},
+   "source": [
+    "## 1. Install the SDKs + load keys\n",
+    "\n",
+    "Gemini uses the new `google-genai` SDK; OpenAI uses `openai`. On Kaggle, **Internet must be On**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a1d0c66b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip -q install google-genai openai soundfile librosa\n",
+    "\n",
+    "def setup_keys():\n",
+    "    \"\"\"Load API keys from Kaggle Secrets (fallback: environment variables that are already set).\"\"\"\n",
+    "    try:\n",
+    "        from kaggle_secrets import UserSecretsClient\n",
+    "        sec = UserSecretsClient()\n",
+    "        for k in [\"GEMINI_API_KEY\", \"OPENAI_API_KEY\"]:\n",
+    "            try:\n",
+    "                os.environ[k] = sec.get_secret(k)\n",
+    "                print(f\"Loaded {k} from Secrets\")\n",
+    "            except Exception:\n",
+    "                pass\n",
+    "    except Exception as e:\n",
+    "        print(\"Kaggle Secrets unavailable:\", e, \"→ set os.environ[...] manually if needed\")\n",
+    "\n",
+    "setup_keys()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a4ceeacf",
+   "metadata": {},
+   "source": [
+    "## 2. Read + normalize audio (16 kHz mono, trimmed to MAX_SECONDS) → WAV bytes in memory"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "68d431ff",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def load_wav_bytes(path, sr=16000, max_seconds=MAX_SECONDS):\n",
+    "    \"\"\"Return (wav_bytes, base64_str). Resample to 16 kHz mono, trim to ≤ max_seconds, encode as PCM16 WAV.\"\"\"\n",
+    "    import soundfile as sf\n",
+    "    try:\n",
+    "        import librosa\n",
+    "        y, _ = librosa.load(path, sr=sr, mono=True)\n",
+    "    except Exception:\n",
+    "        y, in_sr = sf.read(path)\n",
+    "        if y.ndim > 1:\n",
+    "            y = y.mean(axis=1)\n",
+    "        if in_sr != sr:   # fallback: linear-interpolation resample when librosa is unavailable\n",
+    "            idx = np.linspace(0, len(y) - 1, int(len(y) * sr / in_sr))\n",
+    "            y = np.interp(idx, np.arange(len(y)), y)\n",
+    "    if max_seconds:\n",
+    "        y = y[: int(sr * max_seconds)]\n",
+    "    buf = io.BytesIO()\n",
+    "    sf.write(buf, y.astype(np.float32), sr, format=\"WAV\", subtype=\"PCM_16\")\n",
+    "    raw = buf.getvalue()\n",
+    "    return raw, base64.b64encode(raw).decode(\"ascii\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d846428",
+   "metadata": {},
+   "source": [
+    "## 3. Prompt: defines the 6 metrics + enforces strict JSON\n",
+    "\n",
+    "QMOS = quality/naturalness (clean, not distorted or robotic). EMOS = how well the audio MATCHES the **target emotion**.\n",
+    "CAT = vote distribution over 5 classes. VAD = Valence/Arousal/Dominance. Everything is on a **1–5** scale (CAT holds proportions in 0–1)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7046919",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "SYSTEM_INSTRUCTION = (\n",
+    "    \"You are an expert evaluator of emotional text-to-speech. \"\n",
+    "    \"Listen to the audio and rate it. Respond with ONLY a compact JSON object, no prose.\"\n",
+    ")\n",
+    "\n",
+    "def build_prompt(target_emo):\n",
+    "    tgt = target_emo if target_emo else \"unknown\"\n",
+    "    return (\n",
+    "        \"Rate this speech utterance. The INTENDED (target) emotion is: \"\n",
+    "        f\"\\\"{tgt}\\\".\\n\\n\"\n",
+    "        \"Return a JSON object with EXACTLY these keys (numbers on a 1-5 scale unless stated):\\n\"\n",
+    "        \"  \\\"qmos\\\": overall audio QUALITY / naturalness (1=very unnatural/robotic/distorted, 5=clean & human-like).\\n\"\n",
+    "        \"  \\\"emos\\\": how well the emotion expressed MATCHES the target emotion above \"\n",
+    "        \"(1=not matching at all, 5=perfectly matching).\\n\"\n",
+    "        \"  \\\"cat\\\": an object with probabilities (summing to 1.0) over the 5 perceived emotions: \"\n",
+    "        \"{\\\"neutral\\\":_, \\\"happy\\\":_, \\\"sad\\\":_, \\\"angry\\\":_, \\\"surprised\\\":_}.\\n\"\n",
+    "        \"  \\\"val\\\": valence (1=very negative, 5=very positive).\\n\"\n",
+    "        \"  \\\"aro\\\": arousal (1=very calm, 5=very excited).\\n\"\n",
+    "        \"  \\\"dom\\\": dominance (1=very submissive, 5=very dominant).\\n\\n\"\n",
+    "        \"Example format: \"\n",
+    "        \"{\\\"qmos\\\":3.5,\\\"emos\\\":4.0,\"\n",
+    "        \"\\\"cat\\\":{\\\"neutral\\\":0.1,\\\"happy\\\":0.7,\\\"sad\\\":0.0,\\\"angry\\\":0.1,\\\"surprised\\\":0.1},\"\n",
+    "        \"\\\"val\\\":4.0,\\\"aro\\\":3.5,\\\"dom\\\":3.0}\\n\"\n",
+    "        \"Respond with ONLY the JSON.\"\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fe8d1303",
+   "metadata": {},
+   "source": [
+    "## 3b. (optional) Few-shot: take K gold-labeled audio examples from train.csv\n",
+    "\n",
+    "Enabled when `SHOT_MODE=\"few_shot\"`. Each example = 1 training audio clip + gold labels (averaged per wav). Costs extra tokens."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5fdf89c7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "few_shot_examples = []   # list[(audio_b64, audio_bytes, gold_json_str)]\n",
+    "\n",
+    "def _agg_train_labels():\n",
+    "    \"\"\"Aggregate train.csv (sep='|') by wavID → mean gold labels; CAT = vote proportions.\"\"\"\n",
+    "    import pandas as pd\n",
+    "    df = pd.read_csv(TRAIN_CSV, sep=\"|\")\n",
+    "    rows = {}\n",
+    "    for wav, g in df.groupby(\"wavID\"):\n",
+    "        votes = np.zeros(5, np.float32)\n",
+    "        for cell in g[\"emoCat\"].astype(str):\n",
+    "            for tok in cell.split(\",\"):\n",
+    "                e = norm_emotion(tok)\n",
+    "                if e in EMOTIONS5:\n",
+    "                    votes[EMOTIONS5.index(e)] += 1\n",
+    "        s = votes.sum()\n",
+    "        cat = (votes / s) if s > 0 else np.full(5, 0.2, np.float32)\n",
+    "        rows[stem(wav)] = dict(\n",
+    "            qmos=float(g[\"qMOS\"].mean()), emos=float(g[\"eMOS\"].mean()),\n",
+    "            val=float(g[\"val\"].mean()), aro=float(g[\"aro\"].mean()), dom=float(g[\"dom\"].mean()),\n",
+    "            cat={EMOTIONS5[i]: round(float(cat[i]), 4) for i in range(5)},\n",
+    "        )\n",
+    "    return rows\n",
+    "\n",
+    "def build_few_shot():\n",
+    "    if SHOT_MODE != \"few_shot\":\n",
+    "        return\n",
+    "    labels = _agg_train_labels()\n",
+    "    picked = list(labels.keys())[:FEW_K]\n",
+    "    for sid in picked:\n",
+    "        wavp = os.path.join(WAV_DIR, sid + \".wav\")\n",
+    "        if not os.path.exists(wavp):\n",
+    "            continue\n",
+    "        raw, b64 = load_wav_bytes(wavp)\n",
+    "        gold = labels[sid]\n",
+    "        gold_json = json.dumps({\n",
+    "            \"qmos\": round(gold[\"qmos\"], 2), \"emos\": round(gold[\"emos\"], 2),\n",
+    "            \"cat\": gold[\"cat\"], \"val\": round(gold[\"val\"], 2),\n",
+    "            \"aro\": round(gold[\"aro\"], 2), \"dom\": round(gold[\"dom\"], 2),\n",
+    "        })\n",
+    "        few_shot_examples.append((b64, raw, gold_json))\n",
+    "    print(f\"Few-shot: {len(few_shot_examples)} examples\")\n",
+    "\n",
+    "build_few_shot()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4d3c1fef",
+   "metadata": {},
+   "source": [
+    "## 4. API calls: provider abstraction (gemini / openai)\n",
+    "\n",
+    "Each provider builds its own messages (including few-shot examples, if any) and returns the **raw text**, which section 5 parses."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ae85c4bf",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "_client = {\"gemini\": None, \"openai\": None}\n",
+    "\n",
+    "def _gemini_client():\n",
+    "    if _client[\"gemini\"] is None:\n",
+    "        from google import genai\n",
+    "        _client[\"gemini\"] = genai.Client(api_key=os.environ[\"GEMINI_API_KEY\"])\n",
+    "    return _client[\"gemini\"]\n",
+    "\n",
+    "def _openai_client():\n",
+    "    if _client[\"openai\"] is None:\n",
+    "        from openai import OpenAI\n",
+    "        _client[\"openai\"] = OpenAI(api_key=os.environ[\"OPENAI_API_KEY\"])\n",
+    "    return _client[\"openai\"]\n",
+    "\n",
+    "def call_gemini(audio_b64, audio_bytes, prompt):\n",
+    "    from google.genai import types\n",
+    "    client = _gemini_client()\n",
+    "    contents = []\n",
+    "    for ex_b64, ex_bytes, ex_gold in few_shot_examples:   # few-shot: example audio + gold labels\n",
+    "        contents.append(types.Content(role=\"user\", parts=[\n",
+    "            types.Part.from_bytes(data=ex_bytes, mime_type=\"audio/wav\"),\n",
+    "            types.Part.from_text(text=build_prompt(None)),\n",
+    "        ]))\n",
+    "        contents.append(types.Content(role=\"model\", parts=[types.Part.from_text(text=ex_gold)]))\n",
+    "    contents.append(types.Content(role=\"user\", parts=[\n",
+    "        types.Part.from_bytes(data=audio_bytes, mime_type=\"audio/wav\"),\n",
+    "        types.Part.from_text(text=prompt),\n",
+    "    ]))\n",
+    "    resp = client.models.generate_content(\n",
+    "        model=GEMINI_MODEL, contents=contents,\n",
+    "        config=types.GenerateContentConfig(\n",
+    "            system_instruction=SYSTEM_INSTRUCTION, temperature=TEMPERATURE),\n",
+    "    )\n",
+    "    return resp.text\n",
+    "\n",
+    "def call_openai(audio_b64, audio_bytes, prompt):\n",
+    "    client = _openai_client()\n",
+    "    messages = [{\"role\": \"system\", \"content\": SYSTEM_INSTRUCTION}]\n",
+    "    for ex_b64, ex_bytes, ex_gold in few_shot_examples:\n",
+    "        messages.append({\"role\": \"user\", \"content\": [\n",
+    "            {\"type\": \"text\", \"text\": build_prompt(None)},\n",
+    "            {\"type\": \"input_audio\", \"input_audio\": {\"data\": ex_b64, \"format\": \"wav\"}},\n",
+    "        ]})\n",
+    "        messages.append({\"role\": \"assistant\", \"content\": ex_gold})\n",
+    "    messages.append({\"role\": \"user\", \"content\": [\n",
+    "        {\"type\": \"text\", \"text\": prompt},\n",
+    "        {\"type\": \"input_audio\", \"input_audio\": {\"data\": audio_b64, \"format\": \"wav\"}},\n",
+    "    ]})\n",
+    "    resp = client.chat.completions.create(\n",
+    "        model=OPENAI_MODEL, messages=messages, temperature=TEMPERATURE,\n",
+    "        modalities=[\"text\"],\n",
+    "    )\n",
+    "    return resp.choices[0].message.content\n",
+    "\n",
+    "def call_llm(audio_b64, audio_bytes, prompt):\n",
+    "    return call_gemini(audio_b64, audio_bytes, prompt) if PROVIDER == \"gemini\" \\\n",
+    "        else call_openai(audio_b64, audio_bytes, prompt)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f6ef0abc",
+   "metadata": {},
+   "source": [
+    "## 5. Fault-tolerant JSON parsing → 6 columns; clamp to [1,5]; normalize CAT"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "74507a4a",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def _clamp(x, lo=1.0, hi=5.0, default=3.0):\n",
+    "    try:\n",
+    "        v = float(x)\n",
+    "    except Exception:\n",
+    "        return default\n",
+    "    return max(lo, min(hi, v))\n",
+    "\n",
+    "def parse_response(text):\n",
+    "    \"\"\"Raw LLM text → dict {qmos,emos,cat5 (list in EMOTIONS5 order),val,aro,dom}, or None if unparseable.\"\"\"\n",
+    "    if not text:\n",
+    "        return None\n",
+    "    m = re.search(r\"\\{.*\\}\", text, re.DOTALL)   # greedy: from the first { to the last }, so the nested cat object is included\n",
+    "    if not m:\n",
+    "        return None\n",
+    "    try:\n",
+    "        d = json.loads(m.group(0))\n",
+    "    except Exception:\n",
+    "        return None\n",
+    "    cat_in = d.get(\"cat\", {}) or {}\n",
+    "    cat = np.zeros(5, np.float32)\n",
+    "    for k, v in cat_in.items():\n",
+    "        e = norm_emotion(k)\n",
+    "        if e in EMOTIONS5:\n",
+    "            try:\n",
+    "                cat[EMOTIONS5.index(e)] = max(0.0, float(v))\n",
+    "            except Exception:\n",
+    "                pass\n",
+    "    cat = cat / cat.sum() if cat.sum() > 0 else np.full(5, 0.2, np.float32)\n",
+    "    return dict(\n",
+    "        qmos=_clamp(d.get(\"qmos\")), emos=_clamp(d.get(\"emos\")),\n",
+    "        cat5=cat.tolist(),\n",
+    "        val=_clamp(d.get(\"val\")), aro=_clamp(d.get(\"aro\")), dom=_clamp(d.get(\"dom\")),\n",
+    "    )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "462449d1",
+   "metadata": {},
+   "source": [
+    "## 6. Scoring loop with CACHE + RESUME (wavs already in the cache are NOT called again)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ee30edbc",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def load_cache():\n",
+    "    done = {}\n",
+    "    if os.path.exists(CACHE_PATH):\n",
+    "        with open(CACHE_PATH, encoding=\"utf-8\") as f:\n",
+    "            for ln in f:\n",
+    "                try:\n",
+    "                    r = json.loads(ln)\n",
+    "                    done[r[\"stem\"]] = r\n",
+    "                except Exception:\n",
+    "                    continue\n",
+    "    return done\n",
+    "\n",
+    "def score_one(name):\n",
+    "    \"\"\"Call the LLM for 1 wav, with retries; return a record dict {stem,name,raw,parsed,ok}.\"\"\"\n",
+    "    sid = stem(name)\n",
+    "    wavp = os.path.join(WAV_DIR, name if name.endswith(\".wav\") else name + \".wav\")\n",
+    "    tgt = target_map.get(sid)\n",
+    "    prompt = build_prompt(tgt)\n",
+    "    last_err = None\n",
+    "    for attempt in range(MAX_RETRY):\n",
+    "        try:\n",
+    "            raw_bytes, b64 = load_wav_bytes(wavp)\n",
+    "            text = call_llm(b64, raw_bytes, prompt)\n",
+    "            parsed = parse_response(text)\n",
+    "            if parsed is not None:\n",
+    "                return dict(stem=sid, name=name, raw=text, parsed=parsed, ok=True)\n",
+    "            last_err = \"parse_fail\"\n",
+    "        except Exception as e:\n",
+    "            last_err = str(e)\n",
+    "        time.sleep(RETRY_SLEEP * (attempt + 1))\n",
+    "    return dict(stem=sid, name=name, raw=None, parsed=None, ok=False, err=last_err)\n",
+    "\n",
+    "def run_scoring():\n",
+    "    from concurrent.futures import ThreadPoolExecutor, as_completed\n",
+    "    done = load_cache()\n",
+    "    todo = [n for n in dev_names if stem(n) not in done]\n",
+    "    print(f\"Cache has {len(done)} | {len(todo)} left to score | about {len(todo)} API calls\")\n",
+    "    if not todo:\n",
+    "        return done\n",
+    "    n_ok = n_bad = 0\n",
+    "    with open(CACHE_PATH, \"a\", encoding=\"utf-8\") as fout, \\\n",
+    "         ThreadPoolExecutor(max_workers=WORKERS) as ex:\n",
+    "        futs = {ex.submit(score_one, n): n for n in todo}\n",
+    "        for i, fut in enumerate(as_completed(futs), 1):\n",
+    "            rec = fut.result()\n",
+    "            fout.write(json.dumps(rec, ensure_ascii=False) + \"\\n\")\n",
+    "            fout.flush()\n",
+    "            done[rec[\"stem\"]] = rec\n",
+    "            n_ok += int(rec[\"ok\"]); n_bad += int(not rec[\"ok\"])\n",
+    "            if i % 50 == 0 or i == len(todo):\n",
+    "                print(f\"  {i}/{len(todo)} | ok={n_ok} bad={n_bad}\")\n",
+    "    if n_bad:\n",
+    "        print(f\"⚠️ {n_bad} wavs failed (parse/API) → build_answer will fill in defaults.\")\n",
+    "    return done\n",
+    "\n",
+    "records = run_scoring()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eda27285",
+   "metadata": {},
+   "source": [
+    "## 7. Assemble the 6-column `answer.txt` (same as exp07) + validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b9a4bd65",
+   "metadata": {
+    "lines_to_next_cell": 1
+   },
+   "outputs": [],
+   "source": [
+    "def fmt_cat(probs5):\n",
+    "    return \"|\".join(f\"{e}:{probs5[i]:.6g}\" for i, e in enumerate(EMOTIONS5))\n",
+    "\n",
+    "def build_answer(out_path):\n",
+    "    n_real = n_default = 0\n",
+    "    with open(out_path, \"w\") as f:\n",
+    "        f.write(\"wav,QMOS,EMOS,CAT,VAL,ARO,DOM\\n\")\n",
+    "        for name in dev_names:\n",
+    "            sid = stem(name)\n",
+    "            rec = records.get(sid)\n",
+    "            p = rec[\"parsed\"] if (rec and rec.get(\"parsed\")) else None\n",
+    "            if p is None:\n",
+    "                qmos = emos = val = aro = dom = 3.0\n",
+    "                cat5 = [0.2] * 5\n",
+    "                n_default += 1\n",
+    "            else:\n",
+    "                qmos, emos = p[\"qmos\"], p[\"emos\"]\n",
+    "                val, aro, dom = p[\"val\"], p[\"aro\"], p[\"dom\"]\n",
+    "                cat5 = p[\"cat5\"]; n_real += 1\n",
+    "            f.write(f\"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},\"\n",
+    "                    f\"{val:.6g},{aro:.6g},{dom:.6g}\\n\")\n",
+    "    print(f\"Wrote {len(dev_names)} rows → {out_path} | real LLM scores {n_real}, defaults {n_default}\")\n",
+    "\n",
+    "answer_path = os.path.join(OUT_DIR, \"answer.txt\")\n",
+    "build_answer(answer_path)\n",
+    "\n",
+    "def validate(path):\n",
+    "    import csv\n",
+    "    with open(path) as f:\n",
+    "        rows = list(csv.reader(f))\n",
+    "    header = rows[0]\n",
+    "    assert header[0] == \"wav\" and \"QMOS\" in header and \"EMOS\" in header, \"Bad header\"\n",
+    "    for i, r in enumerate(rows[1:], 2):\n",
+    "        assert len(r) == len(header), f\"Row {i} has the wrong number of columns\"\n",
+    "    print(f\"OK: {len(rows)-1} rows, header = {header}\")\n",
+    "\n",
+    "validate(answer_path)\n",
+    "!cd /kaggle/working && zip -j submission_track2_exp16.zip answer.txt && unzip -l submission_track2_exp16.zip\n",
+    "print(\"Ready to submit: /kaggle/working/submission_track2_exp16.zip\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f8eeafb8",
+   "metadata": {},
+   "source": [
+    "## 8. (optional) Late ensemble: blend the RANKS of the LLM scores with a trained system\n",
+    "\n",
+    "Averages the ranks of exp16 with an existing `answer.txt` (e.g. the exp07+exp08 column mix) for each numeric column.\n",
+    "Diverse sources → may reduce noise. Run ONLY when the other file is available (set the path, then uncomment)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d1a15c1f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def ensemble_rank_average(answer_a, answer_b, out_path):\n",
+    "    \"\"\"Blend 2 answer.txt files by RANK AVERAGING over the 5 numeric columns (QMOS/EMOS/VAL/ARO/DOM); CAT is taken from A.\"\"\"\n",
+    "    import pandas as pd\n",
+    "    num_cols = [\"QMOS\", \"EMOS\", \"VAL\", \"ARO\", \"DOM\"]\n",
+    "    A = pd.read_csv(answer_a); B = pd.read_csv(answer_b)\n",
+    "    A = A.set_index(\"wav\"); B = B.set_index(\"wav\").reindex(A.index)\n",
+    "    out = A.copy()\n",
+    "    for c in num_cols:\n",
+    "        if c in A.columns and c in B.columns:\n",
+    "            ra = A[c].rank(); rb = B[c].rank()\n",
+    "            out[c] = ((ra + rb) / 2.0)        # SRCC depends only on rank order → write the averaged ranks as-is\n",
+    "    out.reset_index().to_csv(out_path, index=False)\n",
+    "    print(\"Ensemble →\", out_path)\n",
+    "\n",
+    "# ensemble_rank_average(answer_path,\n",
+    "#     \"/kaggle/input/.../exp_mix_q07_emo08/answer.txt\",\n",
+    "#     os.path.join(OUT_DIR, \"answer_ens.txt\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14172d70",
+   "metadata": {},
+   "source": [
+    "## Submission & paper notes\n",
+    "- Submit: My Submissions → **Track 2** (deselect the other tracks) → `submission_track2_exp16.zip` → read the SRCC for the 6 columns.\n",
+    "- **Table A (paper):** put the exp16 SRCC (gemini/openai, zero-shot) next to exp07 (QMOS 0.548) + exp08\n",
+    "  (EMOS 0.811 · CAT 0.133 · VAD 0.659/0.793/0.751). Expectation: the LLM does fairly well on EMOS/CAT and poorly on QMOS.\n",
+    "- **Table B:** rerun with `SHOT_MODE=\"few_shot\"` (1 provider) → compare zero-shot vs few-shot.\n",
+    "- **Cache:** Save Version to keep `exp16_llm_cache/*.jsonl` (so the API is not paid for again). Store it as a Kaggle\n",
+    "  Dataset if you want to reuse it in the eval phase.\n",
+    "- **Declare the external resources** (the commercial Gemini/OpenAI APIs) in `12_system_description.md`."
+   ]
+  }
+ ],
+ "metadata": {
+  "jupytext": {
+   "cell_metadata_filter": "-all",
+   "main_language": "python",
+   "notebook_metadata_filter": "-all"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
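
The rank-averaging ensemble in section 8 relies on SRCC depending only on rank order. The sketch below illustrates that in plain Python, under the assumption that ties receive average ranks (the default of pandas `rank()`). `average_ranks`, `spearman`, and `rank_average` are hypothetical helper names, and the numbers are made up for illustration.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


def rank_average(scores_a, scores_b):
    """Late ensemble: mean of the two systems' ranks, per utterance."""
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    return [(a + b) / 2.0 for a, b in zip(ra, rb)]


if __name__ == "__main__":
    gold = [1.0, 2.0, 3.0, 4.0]
    llm = [2.0, 1.0, 4.0, 3.0]      # illustrative scores only
    trained = [1.5, 2.5, 2.0, 4.5]  # illustrative scores only
    print("SRCC of raw scores :", spearman(llm, gold))
    print("SRCC of their ranks:", spearman(average_ranks(llm), gold))
    print("SRCC of ensemble   :", spearman(rank_average(llm, trained), gold))
```

Replacing scores by their ranks leaves SRCC unchanged, which is why the ensemble can write averaged ranks into the numeric columns even though they fall outside the 1–5 scale.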

track2/exp16_llm_judge_pipeline.py ADDED

	@@ -0,0 +1,480 @@

+# %% [markdown]
+# # exp16: Audio-LLM-as-Judge for emotional MOS (Track 2)
+#
+# **Idea:** send the audio directly to an **audio-LLM** (Gemini / GPT-4o-audio) through its **API** with a
+# structured prompt → have it score all 6 columns (`QMOS, EMOS, CAT, VAL, ARO, DOM`) → assemble `answer.txt` → submit to CodaBench.
+#
+# **Main goal = NOVELTY for the paper** (a systematic study of audio-LLM-as-judge for emotional MOS),
+# compared against the trained SSL systems (exp07 QMOS 0.548 · exp08 EMOS 0.811…). NO GPU needed: API calls only.
+#
+# | Property | Value |
+# |---|---|
+# | GPU | ❌ not needed (network I/O only) |
+# | Cost | ✅ the API bills per token/audio → **cache + resume are mandatory** |
+# | Provider | `gemini` (default, billing already set up) · `openai` (GPT-4o-audio, to compare 2 LLMs) |
+# | Output | `answer.txt` with 6 score columns, same as exp07 |
+#
+# **Kaggle usage:** Internet = **On**; Add-ons → Secrets: `GEMINI_API_KEY` (and `OPENAI_API_KEY`
+# if running the openai provider). **No GPU** is needed in Settings. Edit `DATA_ROOT` to match the dataset slug, then Run All.
+#
+# ⚠️ **Model IDs may have changed** over time → check that `GEMINI_MODEL` / `OPENAI_MODEL` still accept
+# audio before a full run (see section 1).
+# %% [markdown]
+# ## 0. Configuration: EDIT HERE
+# %%
+import os, io, re, json, time, base64, glob
+# ── Track 2 data on Kaggle ──────────────────────────────────────────────────
+DATA_ROOT    = "/kaggle/input/vmc2026-track2-full/vmc2026-track2"   # << EDIT the slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"      # wavID|emotion|transcript (NO header): target emotion labels
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"      # list of DEV wavs to submit (train phase)
+TRAIN_CSV    = f"{DATA_ROOT}/sets/train.csv"    # only needed when SHOT_MODE="few_shot"
+OUT_DIR   = "/kaggle/working"
+CACHE_DIR = "/kaggle/working/exp16_llm_cache"   # Save Version / store as a Dataset so the API is NOT called again
+os.makedirs(CACHE_DIR, exist_ok=True)
+# ── Provider & model ────────────────────────────────────────────────────────
+PROVIDER     = "gemini"                  # "gemini" | "openai"
+GEMINI_MODEL = "gemini-2.5-flash"        # << confirm the current audio-capable model (the baseline uses the gemini-*-flash family)
+OPENAI_MODEL = "gpt-4o-audio-preview"    # << OpenAI's audio model; needs OPENAI_API_KEY
+TEMPERATURE  = 0.0                       # fixed for REPRODUCIBILITY (paper)
+# ── Run mode ────────────────────────────────────────────────────────────────
+SHOT_MODE    = "zero_shot"   # "zero_shot" | "few_shot" (prepend K labeled audio examples from train.csv)
+FEW_K        = 2             # number of few-shot examples (each = 1 audio clip + gold labels); costs extra tokens!
+LIMIT        = 20           # << small value (20) for a smoke test; None = full DEV (~2730). DO A TRIAL RUN FIRST
+MAX_SECONDS  = 12           # trim audio to keep calls cheap and fast
+WORKERS      = 4            # parallel call threads (lower if rate-limited)
+MAX_RETRY    = 3            # max attempts per wav on network errors / broken JSON
+RETRY_SLEEP  = 2.0          # base pause in seconds between attempts
+TAG = f"{PROVIDER}_{(GEMINI_MODEL if PROVIDER=='gemini' else OPENAI_MODEL)}_{SHOT_MODE}".replace("/", "-")
+CACHE_PATH = os.path.join(CACHE_DIR, f"{TAG}.jsonl")   # 1 JSON line per wav (raw + parsed) → resume
+print("TAG:", TAG, "| cache:", CACHE_PATH)
+# %% [markdown]
+# ## 0b. Target emotion labels + class normalization (reusing the baseline convention)
+# %%
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]   # canonical ORDER for the CAT column
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def stem(name):
+    return os.path.splitext(os.path.basename(name))[0]
+def load_target_emotions():
+    """metadata.csv (wavID|emotion|transcript, no header) → {stem: normalized_emotion}."""
+    tgt = {}
+    if not (METADATA_CSV and os.path.exists(METADATA_CSV)):
+        print("⚠️ metadata.csv not found → EMOS prompts will lack the target emotion.")
+        return tgt
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            tgt[stem(parts[0])] = norm_emotion(parts[1])
+    return tgt
+target_map = load_target_emotions()
+print("Target emotion labels:", len(target_map))
+def list_dev():
+    with open(DEV_SCP) as f:
+        return [ln.strip() for ln in f if ln.strip()]
+dev_names = list_dev()
+if LIMIT:
+    dev_names = dev_names[:LIMIT]
+print("DEV to score:", len(dev_names), "samples", "| LIMIT =", LIMIT)
+# %% [markdown]
+# ## 1. Install SDKs + load keys
+#
+# Gemini uses the new `google-genai` SDK; OpenAI uses `openai`. On Kaggle, **Internet must be On**.
+# %%
+# !pip -q install google-genai openai soundfile librosa
+def setup_keys():
+    """Load API keys from Kaggle Secrets (fallback: environment variables that are already set)."""
+    try:
+        from kaggle_secrets import UserSecretsClient
+        sec = UserSecretsClient()
+        for k in ["GEMINI_API_KEY", "OPENAI_API_KEY"]:
+            try:
+                os.environ[k] = sec.get_secret(k)
+                print(f"Loaded {k} from Secrets")
+            except Exception:
+                pass
+    except Exception as e:
+        print("Kaggle Secrets unavailable:", e, "→ set os.environ[...] manually if needed")
+setup_keys()
+# %% [markdown]
+# ## 2. Read + normalize audio (16 kHz mono, truncated to MAX_SECONDS) → WAV bytes in memory
+# %%
+import numpy as np
+def load_wav_bytes(path, sr=16000, max_seconds=MAX_SECONDS):
+    """Return (wav_bytes, base64_str). Resample to 16 kHz mono, truncate to ≤ max_seconds, encode as PCM16 WAV."""
+    import soundfile as sf
+    try:
+        import librosa
+        y, _ = librosa.load(path, sr=sr, mono=True)
+    except Exception:
+        y, in_sr = sf.read(path)
+        if y.ndim > 1:
+            y = y.mean(axis=1)
+        if in_sr != sr:   # fallback: linear-interpolation resample when librosa is unavailable
+            idx = np.linspace(0, len(y) - 1, int(len(y) * sr / in_sr))
+            y = np.interp(idx, np.arange(len(y)), y)
+    if max_seconds:
+        y = y[: int(sr * max_seconds)]
+    buf = io.BytesIO()
+    sf.write(buf, y.astype(np.float32), sr, format="WAV", subtype="PCM_16")
+    raw = buf.getvalue()
+    return raw, base64.b64encode(raw).decode("ascii")
+# %% [markdown]
+# ## 3. Prompt: define the 6 metrics + enforce strict JSON
+#
+# QMOS = quality/naturalness (clean, no distortion or robotic artifacts). EMOS = how well the audio MATCHES the **target emotion**.
+# CAT = vote distribution over 5 classes. VAD = Valence/Arousal/Dominance. Everything is on a **1–5** scale (CAT is a proportion in 0–1).
+# %%
+SYSTEM_INSTRUCTION = (
+    "You are an expert evaluator of emotional text-to-speech. "
+    "Listen to the audio and rate it. Respond with ONLY a compact JSON object, no prose."
+)
+def build_prompt(target_emo):
+    tgt = target_emo if target_emo else "unknown"
+    return (
+        "Rate this speech utterance. The INTENDED (target) emotion is: "
+        f"\"{tgt}\".\n\n"
+        "Return a JSON object with EXACTLY these keys (numbers on a 1-5 scale unless stated):\n"
+        "  \"qmos\": overall audio QUALITY / naturalness (1=very unnatural/robotic/distorted, 5=clean & human-like).\n"
+        "  \"emos\": how well the emotion expressed MATCHES the target emotion above "
+        "(1=not matching at all, 5=perfectly matching).\n"
+        "  \"cat\": an object with probabilities (summing to 1.0) over the 5 perceived emotions: "
+        "{\"neutral\":_, \"happy\":_, \"sad\":_, \"angry\":_, \"surprised\":_}.\n"
+        "  \"val\": valence (1=very negative, 5=very positive).\n"
+        "  \"aro\": arousal (1=very calm, 5=very excited).\n"
+        "  \"dom\": dominance (1=very submissive, 5=very dominant).\n\n"
+        "Example format: "
+        "{\"qmos\":3.5,\"emos\":4.0,"
+        "\"cat\":{\"neutral\":0.1,\"happy\":0.7,\"sad\":0.0,\"angry\":0.1,\"surprised\":0.1},"
+        "\"val\":4.0,\"aro\":3.5,\"dom\":3.0}\n"
+        "Respond with ONLY the JSON."
+    )
+# %% [markdown]
+# ## 3b. (optional) Few-shot: take K gold-labeled audio examples from train.csv
+#
+# Enabled when `SHOT_MODE="few_shot"`. Each example = 1 training clip + its gold labels (averaged per wav). Costs extra tokens.
+# %%
+few_shot_examples = []   # list[(audio_b64, audio_bytes, gold_json_str)]
+def _agg_train_labels():
+    """Aggregate train.csv (sep='|') by wavID → mean gold labels; CAT = vote proportions."""
+    import pandas as pd
+    df = pd.read_csv(TRAIN_CSV, sep="|")
+    rows = {}
+    for wav, g in df.groupby("wavID"):
+        votes = np.zeros(5, np.float32)
+        for cell in g["emoCat"].astype(str):
+            for tok in cell.split(","):
+                e = norm_emotion(tok)
+                if e in EMOTIONS5:
+                    votes[EMOTIONS5.index(e)] += 1
+        s = votes.sum()
+        cat = (votes / s) if s > 0 else np.full(5, 0.2, np.float32)
+        rows[stem(wav)] = dict(
+            qmos=float(g["qMOS"].mean()), emos=float(g["eMOS"].mean()),
+            val=float(g["val"].mean()), aro=float(g["aro"].mean()), dom=float(g["dom"].mean()),
+            cat={EMOTIONS5[i]: round(float(cat[i]), 4) for i in range(5)},
+        )
+    return rows
+def build_few_shot():
+    if SHOT_MODE != "few_shot":
+        return
+    labels = _agg_train_labels()
+    for sid, gold in labels.items():
+        if len(few_shot_examples) >= FEW_K:   # stop once K usable examples are collected
+            break
+        wavp = os.path.join(WAV_DIR, sid + ".wav")
+        if not os.path.exists(wavp):          # skip missing audio and keep looking for another example
+            continue
+        raw, b64 = load_wav_bytes(wavp)
+        gold_json = json.dumps({
+            "qmos": round(gold["qmos"], 2), "emos": round(gold["emos"], 2),
+            "cat": gold["cat"], "val": round(gold["val"], 2),
+            "aro": round(gold["aro"], 2), "dom": round(gold["dom"], 2),
+        })
+        few_shot_examples.append((b64, raw, gold_json))
+    print(f"Few-shot: {len(few_shot_examples)} examples")
+build_few_shot()
+# %% [markdown]
+# ## 4. API calls: provider abstraction (gemini / openai)
+#
+# Each provider builds its own message format (with few-shot examples if any). Returns **raw text**, which is parsed in section 5.
+# %%
+_client = {"gemini": None, "openai": None}
+def _gemini_client():
+    if _client["gemini"] is None:
+        from google import genai
+        _client["gemini"] = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
+    return _client["gemini"]
+def _openai_client():
+    if _client["openai"] is None:
+        from openai import OpenAI
+        _client["openai"] = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
+    return _client["openai"]
+def call_gemini(audio_b64, audio_bytes, prompt):
+    from google.genai import types
+    client = _gemini_client()
+    contents = []
+    for ex_b64, ex_bytes, ex_gold in few_shot_examples:   # few-shot: example audio + gold labels
+        contents.append(types.Content(role="user", parts=[
+            types.Part.from_bytes(data=ex_bytes, mime_type="audio/wav"),
+            types.Part.from_text(text=build_prompt(None)),
+        ]))
+        contents.append(types.Content(role="model", parts=[types.Part.from_text(text=ex_gold)]))
+    contents.append(types.Content(role="user", parts=[
+        types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav"),
+        types.Part.from_text(text=prompt),
+    ]))
+    resp = client.models.generate_content(
+        model=GEMINI_MODEL, contents=contents,
+        config=types.GenerateContentConfig(
+            system_instruction=SYSTEM_INSTRUCTION, temperature=TEMPERATURE),
+    )
+    return resp.text
+def call_openai(audio_b64, audio_bytes, prompt):
+    client = _openai_client()
+    messages = [{"role": "system", "content": SYSTEM_INSTRUCTION}]
+    for ex_b64, ex_bytes, ex_gold in few_shot_examples:
+        messages.append({"role": "user", "content": [
+            {"type": "text", "text": build_prompt(None)},
+            {"type": "input_audio", "input_audio": {"data": ex_b64, "format": "wav"}},
+        ]})
+        messages.append({"role": "assistant", "content": ex_gold})
+    messages.append({"role": "user", "content": [
+        {"type": "text", "text": prompt},
+        {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
+    ]})
+    resp = client.chat.completions.create(
+        model=OPENAI_MODEL, messages=messages, temperature=TEMPERATURE,
+        modalities=["text"],
+    )
+    return resp.choices[0].message.content
+def call_llm(audio_b64, audio_bytes, prompt):
+    return call_gemini(audio_b64, audio_bytes, prompt) if PROVIDER == "gemini" \
+        else call_openai(audio_b64, audio_bytes, prompt)
+# %% [markdown]
+# ## 5. Fault-tolerant JSON parsing → 6 columns; clamp to [1,5]; normalize CAT
+# %%
+def _clamp(x, lo=1.0, hi=5.0, default=3.0):
+    try:
+        v = float(x)
+    except Exception:
+        return default
+    if v != v:   # NaN (json.loads accepts it) → fall back to the default instead of clamping to hi
+        return default
+    return max(lo, min(hi, v))
+def parse_response(text):
+    """Raw LLM text → dict {qmos,emos,cat5(list ordered as EMOTIONS5),val,aro,dom}, or None if unparseable."""
+    if not text:
+        return None
+    m = re.search(r"\{.*\}", text, re.DOTALL)   # greedy: first "{" to last "}", so the nested "cat" object stays intact
+    if not m:
+        return None
+    try:
+        d = json.loads(m.group(0))
+    except Exception:
+        return None
+    cat_in = d.get("cat")
+    if not isinstance(cat_in, dict):            # missing or malformed "cat" → uniform fallback below
+        cat_in = {}
+    cat = np.zeros(5, np.float32)
+    for k, v in cat_in.items():
+        e = norm_emotion(k)
+        if e in EMOTIONS5:
+            try:
+                cat[EMOTIONS5.index(e)] = max(0.0, float(v))
+            except Exception:
+                pass
+    cat = cat / cat.sum() if cat.sum() > 0 else np.full(5, 0.2, np.float32)
+    return dict(
+        qmos=_clamp(d.get("qmos")), emos=_clamp(d.get("emos")),
+        cat5=cat.tolist(),
+        val=_clamp(d.get("val")), aro=_clamp(d.get("aro")), dom=_clamp(d.get("dom")),
+    )
+# %% [markdown]
+# ## 6. Scoring loop with CACHE + RESUME (wavs already scored in the cache are NOT called again)
+# %%
+def load_cache():
+    done = {}
+    if os.path.exists(CACHE_PATH):
+        with open(CACHE_PATH, encoding="utf-8") as f:
+            for ln in f:
+                try:
+                    r = json.loads(ln)
+                    done[r["stem"]] = r
+                except Exception:
+                    continue
+    return done
+def score_one(name):
+    """Call the LLM for one wav with retries; return a record dict {stem,name,raw,parsed,ok}."""
+    sid = stem(name)
+    wavp = os.path.join(WAV_DIR, name if name.endswith(".wav") else name + ".wav")
+    tgt = target_map.get(sid)
+    prompt = build_prompt(tgt)
+    last_err = None
+    for attempt in range(MAX_RETRY):
+        try:
+            raw_bytes, b64 = load_wav_bytes(wavp)
+            text = call_llm(b64, raw_bytes, prompt)
+            parsed = parse_response(text)
+            if parsed is not None:
+                return dict(stem=sid, name=name, raw=text, parsed=parsed, ok=True)
+            last_err = "parse_fail"
+        except Exception as e:
+            last_err = str(e)
+        if attempt < MAX_RETRY - 1:                    # no need to wait after the final attempt
+            time.sleep(RETRY_SLEEP * (attempt + 1))    # linear backoff
+    return dict(stem=sid, name=name, raw=None, parsed=None, ok=False, err=last_err)
+def run_scoring():
+    from concurrent.futures import ThreadPoolExecutor, as_completed
+    done = load_cache()
+    # Only successful records count as done, so wavs that failed in an earlier run are retried on resume.
+    todo = [n for n in dev_names if not done.get(stem(n), {}).get("ok")]
+    print(f"Cache has {len(done)} | to score {len(todo)} | about {len(todo)} API calls (more with retries)")
+    if not todo:
+        return done
+    n_ok = n_bad = 0
+    with open(CACHE_PATH, "a", encoding="utf-8") as fout, \
+         ThreadPoolExecutor(max_workers=WORKERS) as ex:
+        futs = {ex.submit(score_one, n): n for n in todo}
+        for i, fut in enumerate(as_completed(futs), 1):
+            rec = fut.result()
+            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")
+            fout.flush()
+            done[rec["stem"]] = rec
+            n_ok += int(rec["ok"]); n_bad += int(not rec["ok"])
+            if i % 50 == 0 or i == len(todo):
+                print(f"  {i}/{len(todo)} | ok={n_ok} bad={n_bad}")
+    if n_bad:
+        print(f"⚠️ {n_bad} wavs failed (parse/API) → build_answer will fill in defaults.")
+    return done
+records = run_scoring()
+# %% [markdown]
+# ## 7. Assemble the 6-column `answer.txt` (same as exp07) + validate + zip
+# %%
+def fmt_cat(probs5):
+    return "|".join(f"{e}:{probs5[i]:.6g}" for i, e in enumerate(EMOTIONS5))
+def build_answer(out_path):
+    n_real = n_default = 0
+    with open(out_path, "w") as f:
+        f.write("wav,QMOS,EMOS,CAT,VAL,ARO,DOM\n")
+        for name in dev_names:
+            sid = stem(name)
+            rec = records.get(sid)
+            p = rec["parsed"] if (rec and rec.get("parsed")) else None
+            if p is None:
+                qmos = emos = val = aro = dom = 3.0
+                cat5 = [0.2] * 5
+                n_default += 1
+            else:
+                qmos, emos = p["qmos"], p["emos"]
+                val, aro, dom = p["val"], p["aro"], p["dom"]
+                cat5 = p["cat5"]; n_real += 1
+            f.write(f"{name},{qmos:.6g},{emos:.6g},{fmt_cat(cat5)},"
+                    f"{val:.6g},{aro:.6g},{dom:.6g}\n")
+    print(f"Wrote {len(dev_names)} rows → {out_path} | real LLM {n_real}, default {n_default}")
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+build_answer(answer_path)
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Row {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+# !cd /kaggle/working && zip -j submission_track2_exp16.zip answer.txt && unzip -l submission_track2_exp16.zip
+print("Ready to submit: /kaggle/working/submission_track2_exp16.zip")
+# %% [markdown]
+# ## 8. (optional) Late ensemble: blend the RANKS of the LLM scores + the trained system
+#
+# Average exp16's ranks with an existing `answer.txt` (e.g. the exp07+exp08 column mix) for each numeric column.
+# Diverse sources → may reduce noise. ONLY run this when the other file exists (set the path, then uncomment).
+# %%
+def ensemble_rank_average(answer_a, answer_b, out_path):
+    """Blend two answer.txt files by RANK AVERAGE over the 5 numeric columns (QMOS/EMOS/VAL/ARO/DOM); CAT is taken from A."""
+    import pandas as pd
+    num_cols = ["QMOS", "EMOS", "VAL", "ARO", "DOM"]
+    A = pd.read_csv(answer_a); B = pd.read_csv(answer_b)
+    A = A.set_index("wav"); B = B.set_index("wav").reindex(A.index)
+    out = A.copy()
+    for c in num_cols:
+        if c in A.columns and c in B.columns:
+            ra = A[c].rank(); rb = B[c].rank()
+            rb = rb.fillna(ra)                # wavs missing from B keep A's rank instead of becoming NaN
+            out[c] = ((ra + rb) / 2.0)        # SRCC is invariant to monotonic rescaling → keep the mean rank (values are ranks, not 1-5 scores)
+    out.reset_index().to_csv(out_path, index=False)
+    print("Ensemble →", out_path)
+# ensemble_rank_average(answer_path,
+#     "/kaggle/input/.../exp_mix_q07_emo08/answer.txt",
+#     os.path.join(OUT_DIR, "answer_ens.txt"))
+# %% [markdown]
+# ## Submission & paper notes
+# - Submit: My Submissions → **Track 2** (deselect the other tracks) → `submission_track2_exp16.zip` → read the SRCC of the 6 columns.
+# - **Table A (paper):** place exp16's SRCC (gemini/openai, zero-shot) next to exp07 (QMOS 0.548) + exp08
+#   (EMOS 0.811 · CAT 0.133 · VAD 0.659/0.793/0.751). Expectation: the LLM does fairly well on EMOS/CAT and poorly on QMOS.
+# - **Table B:** rerun with `SHOT_MODE="few_shot"` (1 provider) → compare zero-shot vs few-shot.
+# - **Cache:** Save Version to keep `exp16_llm_cache/*.jsonl` (so you do not pay again). Save it as a Kaggle
+#   Dataset if you want to reuse it in the eval phase.
+# - **Declare the external resource** (commercial Gemini/OpenAI APIs) in `12_system_description.md`.
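The tolerant parsing described in section 5 of exp16 can be tried without any API call. The sketch below is a standard-library reduction of that idea, not the pipeline's own code: `extract_json` and `normalize_cat` are hypothetical helper names, and the reply string is made up for illustration.

```python
import json
import re

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]


def extract_json(text):
    """Return the dict between the first '{' and the last '}', or None."""
    match = re.search(r"\{.*\}", text or "", re.DOTALL)  # greedy on purpose
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return None


def normalize_cat(cat):
    """Map a {label: weight} dict to probabilities ordered as EMOTIONS5."""
    if not isinstance(cat, dict):
        cat = {}
    weights = []
    for emo in EMOTIONS5:
        try:
            weights.append(max(0.0, float(cat.get(emo, 0.0))))
        except (TypeError, ValueError):
            weights.append(0.0)
    total = sum(weights)
    if total <= 0:
        return [0.2] * 5  # no usable votes → uniform fallback
    return [w / total for w in weights]


# Made-up reply: the JSON is wrapped in chatter, and "cat" holds raw votes.
reply = 'Sure! Here is the rating:\n{"qmos": 3.5, "cat": {"happy": 3, "sad": 1}}\nHope this helps.'
parsed = extract_json(reply)
probs = normalize_cat(parsed["cat"])
print(parsed["qmos"], probs)
```

The greedy pattern matters here: a non-greedy `\{.*?\}` would stop at the first closing brace and cut the nested `cat` object in half.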

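The rank-average ensemble from section 8 of exp16 can be checked by hand on a toy case. This is a pandas-free sketch under stated assumptions: `average_ranks` is a hypothetical helper that follows the same tie rule as the default `Series.rank()` (tied values share the mean of their positions), and the scores are invented.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_average(scores_a, scores_b):
    """Blend two score lists (same wav order) by averaging their ranks."""
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    return [(x + y) / 2.0 for x, y in zip(ra, rb)]


# Invented EMOS scores for four wavs from two systems on different scales.
llm_emos = [4.0, 2.0, 4.0, 1.0]       # LLM judge, 1-5 scale, with a tie
trained_emos = [0.9, 0.4, 0.7, 0.1]   # trained model, arbitrary 0-1 scale

blended = rank_average(llm_emos, trained_emos)
print(blended)
```

Because only ranks are averaged, the two systems do not need a shared scale. The blended values are themselves ranks, which is fine for a rank-based metric such as SRCC but would not be for a metric that compares absolute scores.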
track2/track2_baseline.ipynb ADDED Viewed

	@@ -0,0 +1,130 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# VMC2026 Track 2: Baseline Pipeline (Kaggle)\n\nQMOS (SpeechMOS) + EmoCat (emotion2vec) + **EMOS (emotion2vec target-prob, offline by default)** → merged into `answer.txt`.\n\n**Before running:** Accelerator = **GPU T4**, Internet = **On**.\n- **+ Add Input** → Datasets tab → the Track 2 dataset you uploaded (Kaggle extracts it automatically → you get a `vmc2026-track2/` folder).\n- With the default `EMOS_METHOD='emotion2vec'`: `GEMINI_API_KEY` is **NOT needed**. Secrets are only required when switching to `'gemini'` (to also get VAD).\n\nRuns out of the box: **QMOS + EmoCat + EMOS** (only needs the wavs + a `metadata.csv` containing the target emotions).\n\n> ⚠️ Training phase: predict on the **DEV** set (`sets/dev.scp`, ~2730 samples). The `wav/` folder holds both train and dev, so do NOT glob everything; take exactly the files listed in dev.scp."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Config: EDIT HERE"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import os, glob\n\n# ── Track 2 data on Kaggle (uploaded dataset, NO nested subfolder) ──\nDATA_ROOT    = '/kaggle/input/vmc2026-track2-full'   # << slug of the dataset you uploaded\nWAV_DIR      = f'{DATA_ROOT}/wav'\nMETADATA_CSV = f'{DATA_ROOT}/metadata.csv'      # wavID|emotion|transcript (NO header)\nDEV_SCP      = f'{DATA_ROOT}/sets/dev.scp'      # wav list of the DEV set (the set to submit in the training phase)\n\n# Quick test on ESD: point WAV_DIR at ESD, set DEV_SCP=None and METADATA_CSV=None.\n# WAV_DIR = '/kaggle/input/datasets/nguyenthanhlim/emotional-speech-dataset-esd/Emotion Speech Dataset'\n# DEV_SCP = None; METADATA_CSV = None\n\nLIMIT = 20   # << 20 = quick TRIAL run. Set to None to run the WHOLE DEV set before submitting.\n\n# ── How EMOS is computed ────────────────────────────────────────────────────\n# 'emotion2vec': OFFLINE, FREE (exp01, recommended): P(target emotion) from emotion2vec → scaled to 1–5.\n# 'gemini'     : LLM-as-judge via the Gemini API (needs GEMINI_API_KEY, costs money). Only this method yields VAD.\nEMOS_METHOD = 'emotion2vec'\n\nOUT_DIR = '/kaggle/working'\nRUN_QMOS, RUN_EMOCAT = True, True\n_have_meta = bool(METADATA_CSV) and os.path.exists(METADATA_CSV)\nRUN_EMOS = _have_meta                                # both methods need the target from metadata\nRUN_VAD  = _have_meta and EMOS_METHOD == 'gemini'    # VAD is only available with Gemini\nEMOTIONS5 = ['angry', 'happy', 'neutral', 'sad', 'surprised']\n\n# Normalize target emotion labels (metadata) → exactly one of emotion2vec's 5 classes.\n_EMO_ALIAS = {'angry':'angry','anger':'angry','happy':'happy','happiness':'happy','joy':'happy',\n              'neutral':'neutral','calm':'neutral','sad':'sad','sadness':'sad',\n              'surprise':'surprised','surprised':'surprised','surprising':'surprised'}\ndef norm_emotion(label):\n    key = str(label).strip().lower()\n    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)\n\ndef list_wavs(d):\n    # DEV_SCP present → read the DEV file-name list (wavs sit flat inside wav/).\n    
# DEV_SCP absent  → recursively scan every .wav (ESD test mode, nested speaker/emotion).\n    if DEV_SCP and os.path.exists(DEV_SCP):\n        with open(DEV_SCP) as f:\n            names = [ln.strip() for ln in f if ln.strip()]\n        wavs = [os.path.join(d, n) for n in names]\n    else:\n        wavs = sorted(glob.glob(os.path.join(d, '**', '*.wav'), recursive=True))\n    return wavs[:LIMIT] if LIMIT else wavs\n\nprint('WAV_DIR:', WAV_DIR, '| EMOS_METHOD:', EMOS_METHOD)\nprint('Number of wavs:', len(list_wavs(WAV_DIR)) if os.path.isdir(WAV_DIR) else '(directory not found)')\nprint('DEV mode (dev.scp):', bool(DEV_SCP and os.path.exists(DEV_SCP)))\nif METADATA_CSV and os.path.exists(METADATA_CSV) and DEV_SCP and os.path.exists(DEV_SCP):\n    n_meta = sum(1 for _ in open(METADATA_CSV))\n    n_dev  = sum(1 for _ in open(DEV_SCP))\n    print(f'metadata.csv: {n_meta} lines | dev.scp: {n_dev} lines')"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Installation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q speechmos funasr librosa soundfile pandas google-genai loguru tqdm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. QMOS: SpeechMOS (UTMOS, no fairseq needed)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "def run_qmos(wav_dir):\n    import torch, librosa\n    dev = 'cuda' if torch.cuda.is_available() else 'cpu'\n    predictor = torch.hub.load('tarepan/SpeechMOS:v1.2.0', 'utmos22_strong', trust_repo=True).to(dev)  # << GPU\n    print('QMOS device:', dev)\n    scores, missing = {}, 0\n    for w in list_wavs(wav_dir):              # w is the full path\n        if not os.path.exists(w):             # ESD/DailyTalk samples not fetched yet → skip, do not crash\n            missing += 1; continue\n        wave, _ = librosa.load(w, sr=16000, mono=True)\n        wave_t = torch.from_numpy(wave).unsqueeze(0).to(dev)   # << move the input to the GPU\n        scores[w] = float(predictor(wave_t, sr=16000).mean().item())\n    if missing: print(f'[QMOS] Skipped {missing} missing files (ESD/DailyTalk not available) → default score.')\n    return scores\n\nqmos_scores = run_qmos(WAV_DIR) if RUN_QMOS else {}\nprint('QMOS done:', len(qmos_scores))\nlist(qmos_scores.items())[:3]"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. EmoCat: emotion2vec+ large\n",
+    "Fixes the bug in the original version + filters to the 5 classes + normalizes the sum to 1."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "def run_emocat(wav_dir):\n    import torch\n    from funasr import AutoModel\n    dev = 'cuda:0' if torch.cuda.is_available() else 'cpu'\n    model = AutoModel(model='iic/emotion2vec_plus_large', hub='hf', device=dev)  # << runs on the GPU\n    print('EmoCat device:', dev)\n    results, missing = {}, 0\n    for w in list_wavs(wav_dir):              # w is the full path\n        if not os.path.exists(w):             # ESD/DailyTalk samples not fetched yet → skip\n            missing += 1; continue\n        rec = model.generate(w, granularity='utterance', extract_embedding=False)\n        probs = {e: 0.0 for e in EMOTIONS5}\n        for lab, sc in zip(rec[0]['labels'], rec[0]['scores']):\n            name = lab.split('/')[-1]\n            if name in probs:\n                probs[name] = float(sc)\n        total = sum(probs.values())\n        if total > 0:\n            probs = {k: v / total for k, v in probs.items()}\n        results[w] = probs\n    if missing: print(f'[EmoCat] Skipped {missing} missing files (ESD/DailyTalk not available) → default distribution.')\n    return results\n\nemocat_probs = run_emocat(WAV_DIR) if RUN_EMOCAT else {}\nprint('EmoCat done:', len(emocat_probs))\nlist(emocat_probs.items())[:2]"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "## 4. EMOS: emotion2vec target-prob (default) or Gemini\n**emotion2vec (exp01, offline):** take P(target emotion) from emotion2vec (already computed in the EmoCat cell) and scale [0,1]→[1,5]. Scores all 2,730 samples with NO API needed. SRCC depends only on rank order, so an increasing linear rescaling leaves the correlation unchanged.\n\n**Gemini (`EMOS_METHOD='gemini'`):** LLM-as-judge, needs `GEMINI_API_KEY` + credit; automatically filters the metadata down to DEV to save cost. Only this method yields VAD."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "emos_scores, vad_scores = {}, {}   # key = wav FILE NAME (uttID, with .wav)\n\n# Read target emotions from metadata.csv → {stem: normalized_emotion}\ndef load_target_emotions():\n    tgt = {}\n    if not (METADATA_CSV and os.path.exists(METADATA_CSV)):\n        return tgt\n    with open(METADATA_CSV, encoding='utf-8') as f:\n        for ln in f:\n            parts = ln.strip().split('|')\n            if len(parts) >= 2:\n                stem = os.path.splitext(os.path.basename(parts[0]))[0]\n                tgt[stem] = norm_emotion(parts[1])\n    return tgt\n\ntarget_map = load_target_emotions()\nprint('Target emotion labels read:', len(target_map))\n\nif RUN_EMOS and EMOS_METHOD == 'emotion2vec':\n    # ── OFFLINE EMOS: P(target emotion) from emotion2vec (EmoCat cell), scaled [0,1]→[1,5] ──\n    assert RUN_EMOCAT and emocat_probs, 'emotion2vec-based EMOS needs the EmoCat cell (section 3) to run first.'\n    miss_t = miss_p = 0\n    for w in list_wavs(WAV_DIR):\n        name = os.path.basename(w)\n        tgt = target_map.get(os.path.splitext(name)[0])\n        probs = emocat_probs.get(w)\n        if tgt is None:\n            miss_t += 1; continue\n        if not probs:\n            miss_p += 1; continue\n        emos_scores[name] = 1.0 + 4.0 * probs.get(tgt, 0.0)   # p=0 → score 1, p=1 → score 5\n    if miss_t: print(f'[EMOS-e2v] {miss_t} samples lack a target label → default 3.')\n    if miss_p: print(f'[EMOS-e2v] {miss_p} samples lack emotion2vec probs → default 3.')\n    print(f'✅ EMOS (emotion2vec) for {len(emos_scores)} samples, no API needed.')\n\nelif RUN_EMOS or RUN_VAD:   # EMOS_METHOD == 'gemini'\n    try:\n        from kaggle_secrets import UserSecretsClient\n        os.environ['GEMINI_API_KEY'] = UserSecretsClient().get_secret('GEMINI_API_KEY')\n        print('Loaded GEMINI_API_KEY from Secrets')\n    except Exception as e:\n        print('Could not load the key:', e)\n\n    # ── Filter metadata.csv → keep ONLY DEV samples (avoid paying Gemini for train samples) ──\n    dev_stems = 
{os.path.splitext(n.strip())[0] for n in open(DEV_SCP) if n.strip()}\n    META_DEV = '/kaggle/working/metadata_dev.csv'\n    kept = 0\n    with open(METADATA_CSV) as fin, open(META_DEV, 'w') as fout:\n        for line in fin:\n            if not line.strip():\n                continue\n            stem = os.path.splitext(os.path.basename(line.split('|')[0].strip()))[0]\n            if stem in dev_stems:\n                fout.write(line); kept += 1\n    print(f'metadata_dev.csv: {kept} rows (expected ~{len(dev_stems)})')\n\n    GEMINI_ROWS = f'--end-row {LIMIT}' if LIMIT else ''\n    !git clone -q https://github.com/voicemos-challenge/vmc2026-baselines.git /kaggle/working/vmc2026-baselines\n    !cd /kaggle/working/vmc2026-baselines/track2/EMOS && python Gemini_EMOS.py --metadata-path $META_DEV --base-path $WAV_DIR --output-file /kaggle/working/emos.csv --workers 4 --resume $GEMINI_ROWS\n    !cd /kaggle/working/vmc2026-baselines/track2/VAD && python Gemini_VAD.py --metadata-path $META_DEV --base-path $WAV_DIR --output-file /kaggle/working/vad.csv --workers 4 --resume $GEMINI_ROWS\n\n    import pandas as pd\n    if os.path.exists('/kaggle/working/emos.csv'):\n        d = pd.read_csv('/kaggle/working/emos.csv'); emos_scores = dict(zip(d['uttID'], d['emos']))\n    if os.path.exists('/kaggle/working/vad.csv'):\n        d = pd.read_csv('/kaggle/working/vad.csv')   # expected columns: uttID, val, aro, dom\n        for _, r in d.iterrows():\n            vad_scores[r['uttID']] = (r['val'], r['aro'], r['dom'])\n\n    if emos_scores:\n        dev_bases = {os.path.basename(w) for w in list_wavs(WAV_DIR)}\n        if not (set(emos_scores) & dev_bases):\n            print('⚠️ KEY MISMATCH: uttID does not match the dev file names → EMOS/VAD will fall back to defaults!')\n        else:\n            print('✅ Keys match: EMOS/VAD will merge correctly.')\n\nprint('EMOS:', len(emos_scores), '| VAD:', len(vad_scores))"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Merge into answer.txt (missing columns are dropped automatically)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "def fmt_cat(p):\n    return '|'.join(f'{e}:{p[e]:.6g}' for e in EMOTIONS5)\n\ndef build_answer(out_path):\n    wavs = list_wavs(WAV_DIR)\n    have_cat = RUN_EMOCAT and len(emocat_probs) > 0\n    have_vad = RUN_VAD and len(vad_scores) > 0\n    cols = ['wav', 'QMOS', 'EMOS']\n    if have_cat: cols.append('CAT')\n    if have_vad: cols += ['VAL', 'ARO', 'DOM']\n    with open(out_path, 'w') as f:\n        f.write(','.join(cols) + '\\n')\n        for w in wavs:\n            name = os.path.basename(w)            # file name = wav column & key of emos/vad\n            row = [name, f\"{qmos_scores.get(w, 3.0):.6g}\", str(emos_scores.get(name, 3))]\n            if have_cat: row.append(fmt_cat(emocat_probs.get(w, {e: 0.2 for e in EMOTIONS5})))\n            if have_vad:\n                v = vad_scores.get(name, (3, 3, 3)); row += [str(v[0]), str(v[1]), str(v[2])]\n            f.write(','.join(row) + '\\n')\n    print(f'Wrote {len(wavs)} rows → {out_path} | columns: {cols}')\n\nanswer_path = os.path.join(OUT_DIR, 'answer.txt')\nbuild_answer(answer_path)\n!head -3 {answer_path}"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Validate + zip"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import csv\n",
+    "with open(answer_path) as f:\n",
+    "    rows = list(csv.reader(f))\n",
+    "header = rows[0]\n",
+    "assert header[0] == 'wav' and 'QMOS' in header and 'EMOS' in header, 'Bad header'\n",
+    "for i, r in enumerate(rows[1:], 2):\n",
+    "    assert len(r) == len(header), f'Row {i} has the wrong number of columns'\n",
+    "print(f'OK: {len(rows)-1} rows, header = {header}')\n",
+    "!cd /kaggle/working && zip -j submission_track2.zip answer.txt && unzip -l submission_track2.zip"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
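Section 4 of the baseline notebook relies on the fact that the map `1 + 4p` does not change SRCC. A small standard-library check makes that concrete. It is only a sketch: `ranks` and `spearman` are hypothetical helpers written for this illustration, and the probabilities and human scores are invented.

```python
def ranks(values):
    """1-based average ranks (tied values share the mean position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation = Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented values: P(target emotion) per wav and human EMOS ratings.
target_prob = [0.10, 0.85, 0.40, 0.95, 0.55]
human_emos = [1.5, 4.0, 3.5, 4.5, 2.0]

emos_pred = [1.0 + 4.0 * p for p in target_prob]   # p=0 → 1, p=1 → 5

rho_raw = spearman(target_prob, human_emos)
rho_scaled = spearman(emos_pred, human_emos)
print(rho_raw, rho_scaled)
```

The rescaling is strictly increasing, so the ranks of `emos_pred` equal the ranks of `target_prob` and both correlations coincide. The scale only matters for metrics that compare absolute values.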

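The `answer.txt` layout built in sections 5 and 6 of the baseline notebook packs the CAT distribution into one CSV field by joining `emotion:prob` pairs with `|`. The sketch below round-trips one row in memory to show why the column count stays stable. `parse_cat` and the file name `sample_0001.wav` are hypothetical and exist only for this illustration.

```python
import csv
import io

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]


def fmt_cat(probs):
    """Serialize {emotion: prob} as 'angry:0.1|happy:0.7|...' (no commas)."""
    return "|".join(f"{e}:{probs[e]:.6g}" for e in EMOTIONS5)


def parse_cat(cell):
    """Inverse of fmt_cat: 'angry:0.1|...' → {emotion: prob}."""
    pairs = (item.split(":") for item in cell.split("|"))
    return {name: float(value) for name, value in pairs}


header = ["wav", "QMOS", "EMOS", "CAT"]
probs = {"angry": 0.1, "happy": 0.7, "neutral": 0.1, "sad": 0.0, "surprised": 0.1}
row = ["sample_0001.wav", f"{3.4567891:.6g}", f"{4.0:.6g}", fmt_cat(probs)]

# Write header + row the same way the notebook does, then read them back.
buf = io.StringIO()
buf.write(",".join(header) + "\n")
buf.write(",".join(row) + "\n")
rows = list(csv.reader(io.StringIO(buf.getvalue())))

assert all(len(r) == len(header) for r in rows)   # same check as the validate step
print(rows[1])
```

Because the CAT field contains only `:` and `|`, a plain `','.join(...)` is safe here. A wav name containing a comma would break this, in which case `csv.writer` would be the safer way to write rows.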
track2/track2_baseline_pipeline.py ADDED Viewed

	@@ -0,0 +1,321 @@

+# %% [markdown]
+# # VMC2026 Track 2 — Baseline Pipeline (Kaggle)
+#
+# Runs 4 baselines → merges them into an `answer.txt` that follows the CodaBench submission format.
+#
+# | Sub-task | Baseline | GPU | Labels needed? |
+# |---|---|---|---|
+# | QMOS | SpeechMOS (UTMOS, pip version) | yes (light) | no (wavs only) |
+# | EmoCat (CAT) | emotion2vec+ large (funasr) | yes (light) | no (wavs only) |
+# | EMOS | **emotion2vec target-prob** (default, offline) OR Gemini | yes (light) | needs `metadata.csv` (target labels) |
+# | VAD | Gemini LLM-as-judge (only when EMOS_METHOD="gemini") | no | needs `metadata.csv` + API key |
+#
+# **How to use on Kaggle:**
+# 1. Create a Notebook, Settings → Accelerator = **GPU T4**, Internet = **On** (requires phone verification).
+# 2. **+ Add Input** → pick the Track 2 dataset you uploaded (Kaggle extracts it automatically → you get a `vmc2026-track2/` folder).
+# 3. Add-ons → Secrets: add `GEMINI_API_KEY` (only needed when `EMOS_METHOD="gemini"`, i.e. for Gemini EMOS/VAD).
+# 4. Edit `DATA_ROOT` in cell 0 to match your dataset slug, then run the cells in order.
+#
+# Target `answer.txt` format: `wav,QMOS,EMOS,CAT,VAL,ARO,DOM` (see `08_track2_spec.md`).
+# QMOS & EMOS are required; CAT/VAD are optional. A subset of the columns may be submitted.
+#
+# > ⚠️ In the **training phase** we predict on the **DEV** set (`sets/dev.scp`, ~2730 samples) and submit that.
+# > The `wav/` folder contains both train and dev, so do NOT glob everything; take exactly the dev.scp list.
+# %% [markdown]
+# ## 0. Path configuration: EDIT HERE
+# %%
+import os, glob
+# ── Track 2 data on Kaggle ──────────────────────────────────────────────────
+# Kaggle extracts the .tar.gz AUTOMATICALLY when the Dataset is created, so `vmc2026-track2/` already exists.
+# Replace <track2-data> with your dataset slug (see the path panel on the right after Add Input).
+DATA_ROOT    = "/kaggle/input/<track2-data>/vmc2026-track2"   # << EDIT the slug
+WAV_DIR      = f"{DATA_ROOT}/wav"
+METADATA_CSV = f"{DATA_ROOT}/metadata.csv"      # format: wavID|emotion|transcript (NO header)
+DEV_SCP      = f"{DATA_ROOT}/sets/dev.scp"      # wav list of the DEV set (the set to submit in the training phase)
+# For a quick test on ESD (real data not available yet): point WAV_DIR at ESD and set DEV_SCP=None, METADATA_CSV=None.
+# WAV_DIR = "/kaggle/input/datasets/nguyenthanhlim/emotional-speech-dataset-esd/Emotion Speech Dataset"
+# DEV_SCP = None; METADATA_CSV = None
+LIMIT = None   # << a small number (e.g. 20) for a quick trial run; None = run the whole DEV set
+OUT_DIR = "/kaggle/working"
+# ── How EMOS is computed ────────────────────────────────────────────────────
+# "emotion2vec": OFFLINE and FREE (exp01, recommended). Takes P(target emotion) from
+#                emotion2vec (the model already run for CAT) and scales it to the 1-5 range. No API needed.
+# "gemini"     : LLM-as-judge through the Gemini API (needs GEMINI_API_KEY, costs money). Only this method yields VAD.
+EMOS_METHOD = "emotion2vec"
+_have_meta  = bool(METADATA_CSV) and os.path.exists(METADATA_CSV)   # the target emotion labels are required
+RUN_QMOS    = True
+RUN_EMOCAT  = True
+RUN_EMOS    = _have_meta                                  # both methods need the target from metadata
+RUN_VAD     = _have_meta and EMOS_METHOD == "gemini"      # VAD is only available with Gemini
+EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]
+# Normalize the target emotion label (from metadata) to exactly one of the 5 emotion2vec classes used here.
+_EMO_ALIAS = {
+    "angry": "angry", "anger": "angry",
+    "happy": "happy", "happiness": "happy", "joy": "happy",
+    "neutral": "neutral", "calm": "neutral",
+    "sad": "sad", "sadness": "sad",
+    "surprise": "surprised", "surprised": "surprised", "surprising": "surprised",
+}
+def norm_emotion(label):
+    """Map an arbitrary emotion label to one of EMOTIONS5; return None if nothing matches."""
+    key = str(label).strip().lower()
+    return _EMO_ALIAS.get(key, key if key in EMOTIONS5 else None)
+def list_wavs(d):
+    """Return the list of full .wav paths to predict on.
+    - DEV_SCP present → read the DEV file names (the exact set to submit; wavs sit flat in wav/).
+    - DEV_SCP absent  → recursively scan every .wav in the folder (ESD test mode, nested speaker/emotion)."""
+    if DEV_SCP and os.path.exists(DEV_SCP):
+        with open(DEV_SCP) as f:
+            names = [ln.strip() for ln in f if ln.strip()]
+        wavs = [os.path.join(d, n) for n in names]
+    else:
+        wavs = sorted(glob.glob(os.path.join(d, "**", "*.wav"), recursive=True))
+    return wavs[:LIMIT] if LIMIT else wavs
+print("WAV_DIR:", WAV_DIR)
+print("Number of wavs:", len(list_wavs(WAV_DIR)) if os.path.isdir(WAV_DIR) else "(folder not found yet)")
+print("DEV mode (dev.scp):", bool(DEV_SCP and os.path.exists(DEV_SCP)))
+# %% [markdown]
+# ## 1. Installation
+# %%
+# !pip install -q speechmos funasr librosa soundfile pandas google-genai loguru tqdm
+# %% [markdown]
+# ## 2. QMOS — SpeechMOS (UTMOS)
+# Uses SpeechMOS through torch.hub (no fairseq needed). Output: dict {wav: score 1-5}.
+# %%
+def run_qmos(wav_dir):
+    import torch, librosa
+    # SpeechMOS expects 16 kHz audio; input shape is (Batch, Time); sr is passed as a keyword.
+    predictor = torch.hub.load(
+        "tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True
+    )
+    scores = {}
+    missing = 0
+    for w in list_wavs(wav_dir):
+        if not os.path.exists(w):       # ESD/DailyTalk sample not merged in yet → skip instead of crashing
+            missing += 1
+            continue
+        wave, _ = librosa.load(w, sr=16000, mono=True)
+        wave_t = torch.from_numpy(wave).unsqueeze(0)   # (1, Time)
+        with torch.no_grad():                          # inference only, no gradients needed
+            score = predictor(wave_t, sr=16000)        # → tensor of shape (1,)
+        scores[w] = float(score.mean().item())
+    if missing:
+        print(f"[QMOS] Skipped {missing} missing files (ESD/DailyTalk not merged yet) → they will get the default score.")
+    return scores
+qmos_scores = run_qmos(WAV_DIR) if RUN_QMOS else {}
+print("QMOS done:", len(qmos_scores), "samples")
+list(qmos_scores.items())[:3]
+# %% [markdown]
+# ## 3. EmoCat — emotion2vec+ large (funasr)
+# Fixes a bug in the original version, keeps only the 5 classes, and **renormalizes them to sum to 1** (the correct CAT format).
+# Output: dict {wav: {angry:p, happy:p, neutral:p, sad:p, surprised:p}}.
+# %%
+def run_emocat(wav_dir):
+    from funasr import AutoModel
+    model = AutoModel(model="iic/emotion2vec_plus_large", hub="hf")
+    results = {}
+    missing = 0
+    for w in list_wavs(wav_dir):
+        if not os.path.exists(w):       # ESD/DailyTalk sample not merged in yet → skip
+            missing += 1
+            continue
+        rec = model.generate(
+            w,
+            granularity="utterance",
+            extract_embedding=False,
+        )
+        labels = rec[0]["labels"]
+        scores = rec[0]["scores"]
+        # collect the scores of the 5 classes of interest (a label may look like "xx/angry")
+        probs = {e: 0.0 for e in EMOTIONS5}
+        for lab, sc in zip(labels, scores):
+            name = lab.split("/")[-1]
+            if name in probs:
+                probs[name] = float(sc)
+        total = sum(probs.values())
+        if total > 0:                      # renormalize over the 5 classes
+            probs = {k: v / total for k, v in probs.items()}
+        else:                              # no mass on any of the 5 classes → fall back to uniform
+            probs = {e: 1.0 / len(EMOTIONS5) for e in EMOTIONS5}
+        results[w] = probs
+    if missing:
+        print(f"[EmoCat] Skipped {missing} missing files (ESD/DailyTalk not merged yet) → they will get the default distribution.")
+    return results
+emocat_probs = run_emocat(WAV_DIR) if RUN_EMOCAT else {}
+print("EmoCat done:", len(emocat_probs), "samples")
+list(emocat_probs.items())[:2]
+# %% [markdown]
+# ## 4. EMOS: emotion2vec target-prob (default) or Gemini
+# **emotion2vec method (exp01, offline):** for each wav, take the probability emotion2vec assigns to the EXACT target
+# emotion (read from `metadata.csv`) and scale it from [0,1] to [1,5]. EMOS is scored with SRCC (rank-based), so an
+# increasing linear rescaling leaves the correlation unchanged; only the ordering has to be right. NO training, NO API.
+# **Gemini method:** calls the original baseline scripts (needs `GEMINI_API_KEY`); only this method produces VAD.
+# `metadata.csv` has the form `wavID|emotion|transcript` (no header); `emotion` is the target emotion.
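The rank-invariance argument above can be checked numerically. The following is a minimal sketch, assuming made-up probabilities and listener scores (none of these numbers come from the challenge data); `average_ranks` and `spearman` are hypothetical helpers written in plain Python so no extra dependency is needed.

```python
def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers, for illustration only.
target_probs = [0.10, 0.85, 0.40, 0.40, 0.95]   # P(target emotion) per utterance
human_emos = [1.5, 4.0, 3.5, 2.0, 4.5]          # made-up listener scores
scaled = [1.0 + 4.0 * p for p in target_probs]  # same [0,1] → [1,5] mapping

srcc_raw = spearman(target_probs, human_emos)
srcc_scaled = spearman(scaled, human_emos)
print(srcc_raw, srcc_scaled)                    # the two values coincide
```

Because the mapping is strictly increasing, both inputs produce the same ranks, so the two correlations are identical; any other increasing rescaling would behave the same way.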
+# %%
+def load_target_emotions():
+    """Read metadata.csv (wavID|emotion|transcript, no header) → {stem: normalized_emotion}."""
+    tgt = {}
+    if not (METADATA_CSV and os.path.exists(METADATA_CSV)):
+        return tgt
+    with open(METADATA_CSV, encoding="utf-8") as f:
+        for ln in f:
+            parts = ln.strip().split("|")
+            if len(parts) < 2:
+                continue
+            stem = os.path.splitext(os.path.basename(parts[0]))[0]
+            tgt[stem] = norm_emotion(parts[1])
+    return tgt
+def run_emos_emotion2vec(wav_dir, target_map):
+    """Offline EMOS = P(target emotion) from emotion2vec, scaled from [0,1] to [1,5].
+    Reuses emocat_probs (computed in section 3), so it costs NO extra computation."""
+    out, miss_tgt, miss_prob = {}, 0, 0
+    for w in list_wavs(wav_dir):
+        name = os.path.basename(w)
+        tgt = target_map.get(os.path.splitext(name)[0])
+        probs = emocat_probs.get(w)
+        if tgt is None:
+            miss_tgt += 1; continue
+        if not probs:
+            miss_prob += 1; continue
+        out[name] = 1.0 + 4.0 * probs.get(tgt, 0.0)   # p=0 → score 1, p=1 → score 5
+    if miss_tgt:  print(f"[EMOS-e2v] {miss_tgt} samples lack a target label → default 3.")
+    if miss_prob: print(f"[EMOS-e2v] {miss_prob} samples lack emotion2vec probabilities → default 3.")
+    return out
+def setup_gemini_key():
+    try:
+        from kaggle_secrets import UserSecretsClient
+        os.environ["GEMINI_API_KEY"] = UserSecretsClient().get_secret("GEMINI_API_KEY")
+        print("Loaded GEMINI_API_KEY from Kaggle Secrets")
+    except Exception as e:
+        print("Could not load the key from Secrets:", e, "→ set os.environ['GEMINI_API_KEY'] manually")
+emos_scores = {}   # {uttID (.wav file name): EMOS score}
+vad_scores = {}    # {uttID: (val, aro, dom)}, only filled when Gemini is used
+target_map = load_target_emotions()
+print("Target emotion labels read:", len(target_map))
+if RUN_EMOS and EMOS_METHOD == "emotion2vec":
+    assert RUN_EMOCAT and emocat_probs, "EMOS via emotion2vec needs EmoCat to run first (RUN_EMOCAT=True)."
+    emos_scores = run_emos_emotion2vec(WAV_DIR, target_map)
+elif RUN_EMOS and EMOS_METHOD == "gemini":
+    setup_gemini_key()
+    # !git clone -q https://github.com/voicemos-challenge/vmc2026-baselines.git /kaggle/working/vmc2026-baselines
+    # Run the scripts below (row indices are 1-based, inclusive). For a large eval set, split into batches and lower --workers because of the free-tier quota.
+    # Gemini only needs to score the metadata.csv rows that are in DEV; the simple approach is to run everything and filter later in build_answer.
+    # !cd /kaggle/working/vmc2026-baselines/track2/EMOS && python Gemini_EMOS.py \
+    #     --metadata-path {METADATA_CSV} --base-path {WAV_DIR} \
+    #     --output-file /kaggle/working/emos.csv --workers 4 --resume
+    # !cd /kaggle/working/vmc2026-baselines/track2/VAD && python Gemini_VAD.py \
+    #     --metadata-path {METADATA_CSV} --base-path {WAV_DIR} \
+    #     --output-file /kaggle/working/vad.csv --workers 4 --resume
+    import pandas as pd
+    if os.path.exists("/kaggle/working/emos.csv"):
+        df = pd.read_csv("/kaggle/working/emos.csv")
+        emos_scores = dict(zip(df["uttID"], df["emos"]))
+    if os.path.exists("/kaggle/working/vad.csv"):
+        df = pd.read_csv("/kaggle/working/vad.csv")
+        # standard output columns of Gemini_VAD.py: uttID, val, aro, dom
+        for _, r in df.iterrows():
+            vad_scores[r["uttID"]] = (r["val"], r["aro"], r["dom"])
+print("EMOS:", len(emos_scores), "| VAD:", len(vad_scores))
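As a worked example of the emotion2vec EMOS path in this section, the sketch below parses one metadata row and applies the `1 + 4p` mapping. The row, the probability distribution, and the helper names `parse_metadata_line` and `emos_from_probs` are all hypothetical and exist only to illustrate the arithmetic.

```python
import os


def parse_metadata_line(line):
    """Split one 'wavID|emotion|transcript' row into (stem, lower-cased emotion)."""
    parts = line.strip().split("|")
    stem = os.path.splitext(os.path.basename(parts[0]))[0]
    return stem, parts[1].strip().lower()


def emos_from_probs(probs, target):
    """Linear map of P(target) from [0, 1] to [1, 5]; a missing class counts as p = 0."""
    return 1.0 + 4.0 * probs.get(target, 0.0)


# Hypothetical row and distribution, for illustration only.
stem, target = parse_metadata_line("utt_0001.wav|Happy|hello there")
probs = {"angry": 0.05, "happy": 0.75, "neutral": 0.10, "sad": 0.05, "surprised": 0.05}
score = emos_from_probs(probs, target)
print(stem, target, score)   # utt_0001 happy 4.0
```

With `p = 0.75` the score is `1 + 4 × 0.75 = 4.0`; a target the model never predicts maps to the floor value 1.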
+# %% [markdown]
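+# ## 5. Merge into `answer.txt`
+# QMOS and EMOS are required. The optional columns (CAT, VAL/ARO/DOM) are dropped automatically when their data is missing (a column subset is a valid submission).
+# Note on keys: qmos/emocat are keyed by full path; emos/vad are keyed by FILE NAME → look them up with basename.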
+# %%
+def fmt_cat(p):
+    return "|".join(f"{e}:{p[e]:.6g}" for e in EMOTIONS5)
+def build_answer(out_path):
+    wavs = list_wavs(WAV_DIR)
+    have_emos = RUN_EMOS and len(emos_scores) > 0
+    have_cat  = RUN_EMOCAT and len(emocat_probs) > 0
+    have_vad  = RUN_VAD and len(vad_scores) > 0
+    if not have_emos:
+        print("[warn] No EMOS scores available → the EMOS column will hold the default value 3.")
+    cols = ["wav", "QMOS", "EMOS"]          # QMOS+EMOS are required
+    if have_cat:  cols.append("CAT")
+    if have_vad:  cols += ["VAL", "ARO", "DOM"]
+    n = 0
+    with open(out_path, "w") as f:
+        f.write(",".join(cols) + "\n")
+        for w in wavs:
+            name = os.path.basename(w)              # file name = key of emos/vad, and the value of the wav column
+            row = [name,
+                   f"{qmos_scores.get(w, 3.0):.6g}",
+                   str(emos_scores.get(name, 3))]
+            if have_cat:
+                row.append(fmt_cat(emocat_probs.get(w, {e: 0.2 for e in EMOTIONS5})))
+            if have_vad:
+                v = vad_scores.get(name, (3, 3, 3))
+                row += [str(v[0]), str(v[1]), str(v[2])]
+            f.write(",".join(row) + "\n")
+            n += 1
+    print(f"Wrote {n} rows → {out_path} | columns: {cols}")
+    return cols
+answer_path = os.path.join(OUT_DIR, "answer.txt")
+cols = build_answer(answer_path)
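To make the row layout concrete, here is a small self-contained sketch that formats a single `answer.txt` line the same way `fmt_cat` and `build_answer` do. The file name and scores are invented, and `make_row` is a hypothetical helper, not part of the notebook.

```python
EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]


def fmt_cat(p):
    """Serialize the 5-class distribution as 'angry:p|happy:p|...'."""
    return "|".join(f"{e}:{p[e]:.6g}" for e in EMOTIONS5)


def make_row(name, qmos, emos, cat=None, vad=None):
    """Build one CSV row; the optional CAT and VAD fields are appended only when given."""
    row = [name, f"{qmos:.6g}", str(emos)]
    if cat is not None:
        row.append(fmt_cat(cat))
    if vad is not None:
        row += [str(v) for v in vad]
    return ",".join(row)


# Hypothetical values, for illustration only.
cat = {"angry": 0.05, "happy": 0.75, "neutral": 0.1, "sad": 0.05, "surprised": 0.05}
row = make_row("utt_0001.wav", 3.5, 4.0, cat=cat)
print(row)
# utt_0001.wav,3.5,4.0,angry:0.05|happy:0.75|neutral:0.1|sad:0.05|surprised:0.05
```

The CAT cell uses `|` and `:` as separators, so it never collides with the commas that delimit the columns.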
+# %% [markdown]
+# ## 6. Validate + zip
+# %%
+def validate(path):
+    import csv
+    with open(path) as f:
+        rows = list(csv.reader(f))
+    header = rows[0]
+    assert header[0] == "wav" and "QMOS" in header and "EMOS" in header, "Bad header"
+    for i, r in enumerate(rows[1:], 2):
+        assert len(r) == len(header), f"Line {i} has the wrong number of columns"
+    print(f"OK: {len(rows)-1} rows, header = {header}")
+validate(answer_path)
+# !cd /kaggle/working && zip -j submission_track2.zip answer.txt && unzip -l submission_track2.zip
+print("Ready to submit: /kaggle/working/submission_track2.zip (contains answer.txt)")
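The validator above checks only the header and the column count. If the CAT column is present, one could also confirm that each distribution covers the five classes and sums to roughly 1. The sketch below is one possible extra check, not part of the notebook; `parse_cat`, `check_cat_column`, and the tolerance `1e-4` are assumptions chosen to absorb the rounding introduced by the `.6g` formatting.

```python
import csv
import io

EMOTIONS5 = ["angry", "happy", "neutral", "sad", "surprised"]


def parse_cat(cell):
    """Parse 'angry:0.1|happy:0.6|...' into a dict of floats."""
    out = {}
    for item in cell.split("|"):
        name, value = item.split(":")
        out[name] = float(value)
    return out


def check_cat_column(text, tol=1e-4):
    """Return the line numbers (header = line 1) whose CAT cell has wrong classes or a sum far from 1."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    if "CAT" not in header:
        return []
    idx = header.index("CAT")
    bad = []
    for lineno, row in enumerate(rows[1:], 2):
        probs = parse_cat(row[idx])
        if set(probs) != set(EMOTIONS5) or abs(sum(probs.values()) - 1.0) > tol:
            bad.append(lineno)
    return bad


# Hypothetical file content, for illustration only.
sample = (
    "wav,QMOS,EMOS,CAT\n"
    "a.wav,3.5,4.0,angry:0.05|happy:0.75|neutral:0.1|sad:0.05|surprised:0.05\n"
    "b.wav,3.0,3,angry:0.5|happy:0.5|neutral:0.5|sad:0|surprised:0\n"
)
bad_lines = check_cat_column(sample)
print(bad_lines)   # [3]
```

Line 3 is flagged because its probabilities sum to 1.5; a file without a CAT column passes trivially.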
+# %% [markdown]
+# ## Notes
+# - Submit: My Submissions → select **Track 2**, **deselect** the other tracks → upload `submission_track2.zip`.
+# - `metadata.csv` (wavID|emotion|transcript, no header) holds the target emotion labels used for EMOS and for Gemini VAD.
+# - Training phase: predict on the DEV set (`sets/dev.scp`). `sets/train.csv` has the listener labels for training your own model.
+# - The Gemini free-tier quota runs out easily on a large eval set → split into batches with `--start-row/--end-row`, lower `--workers`, and use `--resume`.
+# - Once the real data is available: edit `DATA_ROOT` in cell 0 and rerun from the top.

track2/track2_prepare_data.ipynb ADDED

	@@ -0,0 +1,249 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# VMC2026 Track 2: data preparation (merging ESD + DailyTalk) on Kaggle\n",
+    "\n",
+    "The Track 2 package is missing **1,417 real-speech samples** (distributed separately for licensing reasons):\n",
+    "- **sys006** = ESD (1,379 files) · **sys001** = DailyTalk (38 files)\n",
+    "\n",
+    "This notebook installs **SoX** and builds **sv56** → gathers the right utterances from ESD/DailyTalk →\n",
+    "**normalizes their loudness** (to match the TTS samples) → merges them into `wav/` for a total of **15,477 files**.\n",
+    "\n",
+    "### How to use\n",
+    "1. Settings → **Internet = On** (needed to download and compile sv56). A GPU is not required.\n",
+    "2. **+ Add Input** with 3 datasets:\n",
+    "   - The Track 2 package (`vmc2026_track2_..._v3.tar.gz`; Kaggle extracts it to `vmc2026-track2/`).\n",
+    "   - ESD: `Emotional Speech Dataset (ESD).zip` (Kaggle extracts it to `Emotion Speech Dataset/`).\n",
+    "   - DailyTalk: `dailytalk.zip` (extracts to `dailytalk/data/...`).\n",
+    "3. **Run All**. When it finishes → **Save Version** (Commit) to save `wav/` to the output.\n",
+    "4. From that output → **Create Dataset** → use it as input for the train/baseline notebooks.\n",
+    "\n",
+    "> The notebook locates ESD/DailyTalk automatically, whatever folder name Kaggle extracts them to."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 0. Locate the Track 2 package + copy it to a writable folder"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os, glob, shutil, subprocess\n",
+    "\n",
+    "# Auto-detect the vmc2026-track2 folder among all added inputs.\n",
+    "_cands = glob.glob(\"/kaggle/input/*/vmc2026-track2\") + glob.glob(\"/kaggle/input/**/vmc2026-track2\", recursive=True)\n",
+    "TRACK2_SRC = _cands[0] if _cands else None\n",
+    "assert TRACK2_SRC, \"vmc2026-track2 folder not found. Did you Add Input the Track 2 package?\"\n",
+    "\n",
+    "WORK = \"/kaggle/working/vmc2026-track2\"     # writable copy (inputs are read-only)\n",
+    "print(\"Track2 source :\", TRACK2_SRC)\n",
+    "\n",
+    "# Copy the whole package to working (wav/ with 14060 files + scripts + csv). Takes a few minutes.\n",
+    "if not os.path.exists(WORK):\n",
+    "    print(\"Copying the Track 2 package to working (a few minutes)...\")\n",
+    "    shutil.copytree(TRACK2_SRC, WORK)\n",
+    "print(\"Current number of wavs:\", len(os.listdir(f\"{WORK}/wav\")))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install SoX + build sv56 (requires Internet = On)\n",
+    "sv56 is the ITU-T speech level (loudness) normalization tool, built from the openitu/STL sources."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def sh(cmd):\n",
+    "    \"\"\"Run a shell command and print the tail of its output (stdout and stderr merged).\"\"\"\n",
+    "    print(\"$\", cmd)\n",
+    "    res = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)\n",
+    "    print(res.stdout[-2000:])\n",
+    "\n",
+    "# SoX + compiler toolchain (to build sv56)\n",
+    "sh(\"apt-get -qq update && apt-get -qq install -y sox make gcc\")\n",
+    "sh(\"which sox && sox --version\")\n",
+    "\n",
+    "# sv56demo\n",
+    "SV56_DIR = \"/kaggle/working/STL-2009\"\n",
+    "SV56_BIN_DIR = f\"{SV56_DIR}/src/sv56\"\n",
+    "if not os.path.exists(f\"{SV56_BIN_DIR}/sv56demo\"):\n",
+    "    sh(\"cd /kaggle/working && wget -q https://github.com/openitu/STL/archive/refs/tags/v2009.tar.gz\")\n",
+    "    sh(\"cd /kaggle/working && tar -xf v2009.tar.gz\")\n",
+    "    sh(f\"cd {SV56_BIN_DIR} && make -f makefile.unx\")\n",
+    "assert os.path.exists(f\"{SV56_BIN_DIR}/sv56demo\"), \"sv56 build failed. Check Internet=On and the make log.\"\n",
+    "\n",
+    "# Put sv56demo on PATH (sox is installed system-wide) so the .sh scripts can find both\n",
+    "os.environ[\"PATH\"] = SV56_BIN_DIR + \":\" + os.environ[\"PATH\"]\n",
+    "sh(\"which sv56demo\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Locate ESD + DailyTalk (found automatically even if the folder names differ)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def find_root(rel_path):\n",
+    "    \"\"\"Find the ROOT folder under /kaggle/input such that ROOT/rel_path exists.\"\"\"\n",
+    "    base = os.path.basename(rel_path)\n",
+    "    for hit in glob.glob(f\"/kaggle/input/**/{base}\", recursive=True):\n",
+    "        if hit.endswith(\"/\" + rel_path):             # match whole path components only\n",
+    "            return hit[: -len(rel_path)].rstrip(\"/\")\n",
+    "    return None\n",
+    "\n",
+    "# ESD: the CSV row \"0014/Angry/000381.wav\" maps to the real file \"0014/Angry/0014_000381.wav\"\n",
+    "_esd_first = open(f\"{WORK}/ESD_utts_train_dev.csv\").readline().strip().split(\",\")[0]\n",
+    "_p = _esd_first.split(\"/\")                       # [spk, emo, uttID.wav]\n",
+    "ESD_REL = f\"{_p[0]}/{_p[1]}/{_p[0]}_{_p[2]}\"\n",
+    "ESD_ROOT = find_root(ESD_REL)\n",
+    "\n",
+    "# DailyTalk: the CSV row \"1020/0_1_d1020.wav\" maps to the real file \".../data/1020/0_1_d1020.wav\"\n",
+    "_dt_first = open(f\"{WORK}/DT_utts_train_dev.csv\").readline().strip().split(\",\")[0]\n",
+    "DT_ROOT = find_root(_dt_first)                   # ROOT such that ROOT/1020/0_1_d1020.wav exists\n",
+    "\n",
+    "print(\"ESD_REL  :\", ESD_REL)\n",
+    "print(\"ESD_ROOT :\", ESD_ROOT)\n",
+    "print(\"DT_ROOT  :\", DT_ROOT)\n",
+    "assert ESD_ROOT, \"ESD not found. Did you Add Input 'Emotional Speech Dataset (ESD).zip'?\"\n",
+    "assert DT_ROOT, \"DailyTalk not found. Did you Add Input 'dailytalk.zip'?\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Gather the required utterances → gathered/ folder"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "GATHERED = f\"{WORK}/gathered\"\n",
+    "os.makedirs(GATHERED, exist_ok=True)\n",
+    "\n",
+    "# ESD: copy ROOT/spk/emo/spk_uttID  →  gathered/<vmc2026 name>\n",
+    "n_esd = 0\n",
+    "for line in open(f\"{WORK}/ESD_utts_train_dev.csv\"):\n",
+    "    src_rel, dst = line.strip().split(\",\")\n",
+    "    p = src_rel.split(\"/\")\n",
+    "    src = f\"{ESD_ROOT}/{p[0]}/{p[1]}/{p[0]}_{p[2]}\"\n",
+    "    if os.path.exists(src):\n",
+    "        shutil.copy(src, f\"{GATHERED}/{dst}\")\n",
+    "        n_esd += 1\n",
+    "    else:\n",
+    "        print(\"ESD missing:\", src)\n",
+    "\n",
+    "# DailyTalk: copy ROOT/src_rel  →  gathered/<vmc2026 name>\n",
+    "n_dt = 0\n",
+    "for line in open(f\"{WORK}/DT_utts_train_dev.csv\"):\n",
+    "    src_rel, dst = line.strip().split(\",\")\n",
+    "    src = f\"{DT_ROOT}/{src_rel}\"\n",
+    "    if os.path.exists(src):\n",
+    "        shutil.copy(src, f\"{GATHERED}/{dst}\")\n",
+    "        n_dt += 1\n",
+    "    else:\n",
+    "        print(\"DailyTalk missing:\", src)\n",
+    "\n",
+    "print(f\"Gathered: ESD {n_esd}/1379 · DailyTalk {n_dt}/38 · total {len(os.listdir(GATHERED))}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Loudness normalization with sv56 (level -26 dB, sample rate unchanged)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Use the original script shipped in the package: batch_normRMSE.sh → creates *_norm.wav in gathered/\n",
+    "sh(f\"bash {WORK}/sv56scripts/batch_normRMSE.sh {GATHERED}\")\n",
+    "\n",
+    "# Move the normalized files into wav/ under their proper names (drop the _norm suffix)\n",
+    "moved = 0\n",
+    "for n in os.listdir(GATHERED):\n",
+    "    if n.endswith(\"_norm.wav\"):\n",
+    "        final = \"_\".join(n.split(\"_\")[:-1]) + \".wav\"   # drop \"_norm\"\n",
+    "        shutil.move(f\"{GATHERED}/{n}\", f\"{WORK}/wav/{final}\")\n",
+    "        moved += 1\n",
+    "print(\"Normalized and moved into wav/:\", moved, \"files\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Check for all 15,477 files + clean up"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "total = len(glob.glob(f\"{WORK}/wav/*.wav\"))\n",
+    "print(\"Total wavs in wav/:\", total)\n",
+    "if total == 15477:\n",
+    "    shutil.rmtree(GATHERED, ignore_errors=True)\n",
+    "    # Remove build artifacts so the saved dataset stays SMALL (keep only vmc2026-track2/)\n",
+    "    shutil.rmtree(SV56_DIR, ignore_errors=True)\n",
+    "    for f in glob.glob(\"/kaggle/working/v2009.tar.gz\"):\n",
+    "        os.remove(f)\n",
+    "    print(\"✅ All 15,477 files present. Ready for Save Version → Create Dataset.\")\n",
+    "    print(\"   Use this dataset in the baseline notebook: DATA_ROOT = '/kaggle/input/<new-dataset>/vmc2026-track2'\")\n",
+    "else:\n",
+    "    print(f\"⚠️ Incomplete (currently {total}). Check the logs of steps 3-4 for missing ESD/DailyTalk files.\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Notes\n",
+    "- The output is large (~2-3 GB for 15,477 wavs). `wav/` already holds train+dev, so it works for both fine-tuning and inference.\n",
+    "- sv56 normalization brings the ESD/DailyTalk samples to the same loudness level as the TTS samples, so the model is not confounded by loudness.\n",
+    "- With Internet Off, SoX may already be available but sv56 CANNOT be built, so Internet must be on."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

track2/track2_prepare_data_pipeline.py ADDED

	@@ -0,0 +1,164 @@

+# %% [markdown]
+# # VMC2026 Track 2: data preparation (merging ESD + DailyTalk) on Kaggle
+#
+# The Track 2 package is missing **1,417 real-speech samples** (distributed separately for licensing reasons):
+# - **sys006** = ESD (1,379 files) · **sys001** = DailyTalk (38 files)
+#
+# This notebook installs **SoX** and builds **sv56** → gathers the right utterances from ESD/DailyTalk →
+# **normalizes their loudness** (to match the TTS samples) → merges them into `wav/` for a total of **15,477 files**.
+#
+# ### How to use
+# 1. Settings → **Internet = On** (needed to download and compile sv56). A GPU is not required.
+# 2. **+ Add Input** with 3 datasets:
+#    - The Track 2 package (`vmc2026_track2_..._v3.tar.gz`; Kaggle extracts it to `vmc2026-track2/`).
+#    - ESD: `Emotional Speech Dataset (ESD).zip` (Kaggle extracts it to `Emotion Speech Dataset/`).
+#    - DailyTalk: `dailytalk.zip` (extracts to `dailytalk/data/...`).
+# 3. **Run All**. When it finishes → **Save Version** (Commit) to save `wav/` to the output.
+# 4. From that output → **Create Dataset** → use it as input for the train/baseline notebooks.
+#
+# > The notebook locates ESD/DailyTalk automatically, whatever folder name Kaggle extracts them to.
+# %% [markdown]
+# ## 0. Locate the Track 2 package + copy it to a writable folder
+# %%
+import os, glob, shutil, subprocess
+# Auto-detect the vmc2026-track2 folder among all added inputs.
+_cands = glob.glob("/kaggle/input/*/vmc2026-track2") + glob.glob("/kaggle/input/**/vmc2026-track2", recursive=True)
+TRACK2_SRC = _cands[0] if _cands else None
+assert TRACK2_SRC, "vmc2026-track2 folder not found. Did you Add Input the Track 2 package?"
+WORK = "/kaggle/working/vmc2026-track2"     # writable copy (inputs are read-only)
+print("Track2 source :", TRACK2_SRC)
+# Copy the whole package to working (wav/ with 14060 files + scripts + csv). Takes a few minutes.
+if not os.path.exists(WORK):
+    print("Copying the Track 2 package to working (a few minutes)...")
+    shutil.copytree(TRACK2_SRC, WORK)
+print("Current number of wavs:", len(os.listdir(f"{WORK}/wav")))
+# %% [markdown]
+# ## 1. Install SoX + build sv56 (requires Internet = On)
+# sv56 is the ITU-T speech level (loudness) normalization tool, built from the openitu/STL sources.
+# %%
+def sh(cmd):
+    """Run a shell command and print the tail of its output (stdout and stderr merged)."""
+    print("$", cmd)
+    res = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
+    print(res.stdout[-2000:])
+# SoX + compiler toolchain (to build sv56)
+sh("apt-get -qq update && apt-get -qq install -y sox make gcc")
+sh("which sox && sox --version")
+# sv56demo
+SV56_DIR = "/kaggle/working/STL-2009"
+SV56_BIN_DIR = f"{SV56_DIR}/src/sv56"
+if not os.path.exists(f"{SV56_BIN_DIR}/sv56demo"):
+    sh("cd /kaggle/working && wget -q https://github.com/openitu/STL/archive/refs/tags/v2009.tar.gz")
+    sh("cd /kaggle/working && tar -xf v2009.tar.gz")
+    sh(f"cd {SV56_BIN_DIR} && make -f makefile.unx")
+assert os.path.exists(f"{SV56_BIN_DIR}/sv56demo"), "sv56 build failed. Check Internet=On and the make log."
+# Put sv56demo on PATH (sox is installed system-wide) so the .sh scripts can find both
+os.environ["PATH"] = SV56_BIN_DIR + ":" + os.environ["PATH"]
+sh("which sv56demo")
+# %% [markdown]
+# ## 2. Locate ESD + DailyTalk (found automatically even if the folder names differ)
+# %%
+def find_root(rel_path):
+    """Find the ROOT folder under /kaggle/input such that ROOT/rel_path exists."""
+    base = os.path.basename(rel_path)
+    for hit in glob.glob(f"/kaggle/input/**/{base}", recursive=True):
+        if hit.endswith("/" + rel_path):             # match whole path components only
+            return hit[: -len(rel_path)].rstrip("/")
+    return None
+# ESD: the CSV row "0014/Angry/000381.wav" maps to the real file "0014/Angry/0014_000381.wav"
+_esd_first = open(f"{WORK}/ESD_utts_train_dev.csv").readline().strip().split(",")[0]
+_p = _esd_first.split("/")                       # [spk, emo, uttID.wav]
+ESD_REL = f"{_p[0]}/{_p[1]}/{_p[0]}_{_p[2]}"
+ESD_ROOT = find_root(ESD_REL)
+# DailyTalk: the CSV row "1020/0_1_d1020.wav" maps to the real file ".../data/1020/0_1_d1020.wav"
+_dt_first = open(f"{WORK}/DT_utts_train_dev.csv").readline().strip().split(",")[0]
+DT_ROOT = find_root(_dt_first)                   # ROOT such that ROOT/1020/0_1_d1020.wav exists
+print("ESD_REL  :", ESD_REL)
+print("ESD_ROOT :", ESD_ROOT)
+print("DT_ROOT  :", DT_ROOT)
+assert ESD_ROOT, "ESD not found. Did you Add Input 'Emotional Speech Dataset (ESD).zip'?"
+assert DT_ROOT, "DailyTalk not found. Did you Add Input 'dailytalk.zip'?"
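The path handling in this cell can be illustrated in isolation. The sketch below reproduces the ESD name mapping from the comment (`0014/Angry/000381.wav` → `0014/Angry/0014_000381.wav`) and the root-stripping step, using a made-up location on disk; `esd_real_path` and `root_of` are hypothetical helper names, and `root_of` checks for a leading separator so that only whole path components match.

```python
def esd_real_path(csv_path):
    """'spk/emo/utt.wav' (as listed in the CSV) → 'spk/emo/spk_utt.wav' (as stored in ESD)."""
    spk, emo, utt = csv_path.split("/")
    return f"{spk}/{emo}/{spk}_{utt}"


def root_of(hit, rel_path):
    """Strip rel_path from the end of an absolute hit to recover ROOT; None if it does not match."""
    if not hit.endswith("/" + rel_path):
        return None
    return hit[: -len(rel_path)].rstrip("/")


rel = esd_real_path("0014/Angry/000381.wav")
print(rel)   # 0014/Angry/0014_000381.wav

# Hypothetical location on disk, for illustration only.
hit = "/kaggle/input/some-esd-slug/Emotion Speech Dataset/0014/Angry/0014_000381.wav"
root = root_of(hit, rel)
print(root)  # /kaggle/input/some-esd-slug/Emotion Speech Dataset
```

Requiring the separator avoids a false match when a sibling folder name merely ends with the same characters, for example `x0014/Angry/...`.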
+# %% [markdown]
+# ## 3. Gather the required utterances → gathered/ folder
+# %%
+GATHERED = f"{WORK}/gathered"
+os.makedirs(GATHERED, exist_ok=True)
+# ESD: copy ROOT/spk/emo/spk_uttID  →  gathered/<vmc2026 name>
+n_esd = 0
+for line in open(f"{WORK}/ESD_utts_train_dev.csv"):
+    src_rel, dst = line.strip().split(",")
+    p = src_rel.split("/")
+    src = f"{ESD_ROOT}/{p[0]}/{p[1]}/{p[0]}_{p[2]}"
+    if os.path.exists(src):
+        shutil.copy(src, f"{GATHERED}/{dst}")
+        n_esd += 1
+    else:
+        print("ESD missing:", src)
+# DailyTalk: copy ROOT/src_rel  →  gathered/<vmc2026 name>
+n_dt = 0
+for line in open(f"{WORK}/DT_utts_train_dev.csv"):
+    src_rel, dst = line.strip().split(",")
+    src = f"{DT_ROOT}/{src_rel}"
+    if os.path.exists(src):
+        shutil.copy(src, f"{GATHERED}/{dst}")
+        n_dt += 1
+    else:
+        print("DailyTalk missing:", src)
+print(f"Gathered: ESD {n_esd}/1379 · DailyTalk {n_dt}/38 · total {len(os.listdir(GATHERED))}")
+# %% [markdown]
+# ## 4. Loudness normalization with sv56 (level -26 dB, sample rate unchanged)
+# %%
+# Use the original script shipped in the package: batch_normRMSE.sh → creates *_norm.wav in gathered/
+sh(f"bash {WORK}/sv56scripts/batch_normRMSE.sh {GATHERED}")
+# Move the normalized files into wav/ under their proper names (drop the _norm suffix)
+moved = 0
+for n in os.listdir(GATHERED):
+    if n.endswith("_norm.wav"):
+        final = "_".join(n.split("_")[:-1]) + ".wav"   # drop "_norm"
+        shutil.move(f"{GATHERED}/{n}", f"{WORK}/wav/{final}")
+        moved += 1
+print("Normalized and moved into wav/:", moved, "files")
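The renaming step relies on splitting at underscores, which works because the loop only handles names ending in `_norm.wav`. As a quick sanity check, the sketch below compares that split/join form with an explicit suffix strip on invented file names; both helper names are hypothetical.

```python
def strip_norm_suffix(name):
    """'x_y_norm.wav' → 'x_y.wav'; names without the suffix are returned unchanged."""
    suffix = "_norm.wav"
    return name[: -len(suffix)] + ".wav" if name.endswith(suffix) else name


def strip_norm_by_split(name):
    """The split/join form from the cell above; valid only for names ending in '_norm.wav'."""
    return "_".join(name.split("_")[:-1]) + ".wav"


# Hypothetical file names, for illustration only.
for n in ["sys006_utt_0001_norm.wav", "a_norm.wav"]:
    assert strip_norm_suffix(n) == strip_norm_by_split(n)

print(strip_norm_suffix("sys006_utt_0001_norm.wav"))   # sys006_utt_0001.wav
print(strip_norm_suffix("sys006_utt_0001.wav"))        # sys006_utt_0001.wav
```

The explicit suffix form is a little safer if the function is ever called on a name that was not normalized, since the split/join form would then drop the last underscore-separated field.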
+# %% [markdown]
+# ## 5. Check for all 15,477 files + clean up
+# %%
+total = len(glob.glob(f"{WORK}/wav/*.wav"))
+print("Total wavs in wav/:", total)
+if total == 15477:
+    shutil.rmtree(GATHERED, ignore_errors=True)
+    # Remove build artifacts so the saved dataset stays SMALL (keep only vmc2026-track2/)
+    shutil.rmtree(SV56_DIR, ignore_errors=True)
+    for f in glob.glob("/kaggle/working/v2009.tar.gz"):
+        os.remove(f)
+    print("✅ All 15,477 files present. Ready for Save Version → Create Dataset.")
+    print("   Use this dataset in the baseline notebook: DATA_ROOT = '/kaggle/input/<new-dataset>/vmc2026-track2'")
+else:
+    print(f"⚠️ Incomplete (currently {total}). Check the logs of steps 3-4 for missing ESD/DailyTalk files.")
+# %% [markdown]
+# ## Notes
+# - The output is large (~2-3 GB for 15,477 wavs). `wav/` already holds train+dev, so it works for both fine-tuning and inference.
+# - sv56 normalization brings the ESD/DailyTalk samples to the same loudness level as the TTS samples, so the model is not confounded by loudness.
+# - With Internet Off, SoX may already be available but sv56 CANNOT be built, so Internet must be on.