---
license: mit
datasets:
- Helsinki-NLP/open_subtitles
language:
- zh
base_model:
- hfl/chinese-macbert-base
pipeline_tag: text-classification
tags:
- agent
- nlp
- chinese
- sentiment-analysis
- emotion
- regression
- vad
- valence-arousal-dominance
- transformers
- bert
- macbert
---
<div align="center">
<h1>vad-macbert</h1>
<p>Chinese VAD (valence/arousal/dominance) regression on top of chinese-macbert-base.</p>
<p>
<a href="https://huggingface.co/Pectics/vad-macbert">
<img alt="HF Model" src="https://img.shields.io/badge/Hugging%20Face-Model-yellow">
</a>
<img alt="Task" src="https://img.shields.io/badge/task-VAD%20regression-1f6feb">
<img alt="Backbone" src="https://img.shields.io/badge/backbone-chinese--macbert--base-4b8bbe">
</p>
</div>
The model predicts three continuous values (valence, arousal, dominance), aligned to the VAD scale produced by the teacher model `RobroKools/vad-bert`.
## Quickstart
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "这部电影让我很感动。"  # "This movie really moved me."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# The three regression logits are the VAD predictions.
vad = outputs.logits.squeeze().tolist()
print("VAD:", vad)  # [valence, arousal, dominance]
```
## Model Details
- Base model: `hfl/chinese-macbert-base`
- Task: VAD regression (3 outputs: valence, arousal, dominance)
- Head: `AutoModelForSequenceClassification` with `num_labels=3` and `problem_type="regression"`, as sketched below
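A minimal sketch of how such a head is instantiated from the base model (standard `transformers` usage, not the authors' training script):

```python
from transformers import AutoModelForSequenceClassification

# Regression head on top of chinese-macbert-base: three continuous outputs
# (valence, arousal, dominance) instead of class logits.
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-base",
    num_labels=3,
    problem_type="regression",
)
```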
## Data Sources & Labeling
### en-zh_cn_vad_clean.csv
- Source: OpenSubtitles EN-ZH parallel corpus.
- Labeling: the English side was fed into `RobroKools/vad-bert` to obtain VAD values, which were then assigned to the paired Chinese text (sketched below).
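A hedged sketch of that pseudo-labeling step, assuming the teacher exposes the same three-logit regression head (the actual labeling script was not published):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teacher_id = "RobroKools/vad-bert"
teacher_tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_id).eval()

def label_pair(en_text: str, zh_text: str) -> dict:
    """Score the English side with the teacher; attach the VAD values to the Chinese side."""
    inputs = teacher_tok(en_text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        v, a, d = teacher(**inputs).logits.squeeze().tolist()
    return {"text": zh_text, "valence": v, "arousal": a, "dominance": d}
```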
### en-zh_cn_vad_long.csv
- Derived from `en-zh_cn_vad_clean.csv` by filtering for longer texts with a
  length threshold (the original threshold was not recorded).
- Inferred from the statistics: the minimum observed length is 32 characters,
  so the filter likely kept samples of length >= 32 chars (see the sketch below).
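Under that reading, the filter reduces to a one-line length check; a reconstruction, since the original threshold was not recorded:

```python
def filter_long(rows: list[dict], min_chars: int = 32) -> list[dict]:
    # Threshold inferred from the dataset's minimum observed length of 32 chars.
    return [row for row in rows if len(row["text"]) >= min_chars]
```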
### en-zh_cn_vad_long_clean.csv
- Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle formatting noise
  (a regex sketch follows this list):
  - ASS/SSA tag blocks like `{\fs..\pos(..)}`, including unclosed `{` blocks
  - HTML-like tags (e.g. `<i>...</i>`)
  - Escape codes such as `\N`, `\n`, `\h`, `\t`
  - Extra whitespace normalization
- Non-CJK rows were dropped.
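A hedged regex sketch of that cleanup (the exact patterns used to build the file were not published):

```python
import re

ASS_TAG = re.compile(r"\{\\[^}]*\}|\{[^}]*$")  # {\fs..\pos(..)} blocks, incl. unclosed {
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")    # <i>...</i> and similar
ESCAPES = re.compile(r"\\[Nnht]")              # \N, \n, \h, \t escape codes
HAS_CJK = re.compile(r"[\u4e00-\u9fff]")       # at least one CJK character

def clean_subtitle(text: str) -> str | None:
    text = ASS_TAG.sub("", text)
    text = HTML_TAG.sub("", text)
    text = ESCAPES.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()   # whitespace normalization
    return text if HAS_CJK.search(text) else None  # drop non-CJK rows
```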
### en-zh_cn_vad_mix.csv
- Mixed dataset created for replay training (a sketch follows this list):
- 200k samples from `en-zh_cn_vad_clean.csv`
- 200k samples from `en-zh_cn_vad_long_clean.csv`
- Shuffled after sampling
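A plausible pandas reconstruction of the mix (file paths and the sampling seed are illustrative; the recorded command only fixes the training seed):

```python
import pandas as pd

clean = pd.read_csv("en-zh_cn_vad_clean.csv").sample(n=200_000, random_state=42)
long_clean = pd.read_csv("en-zh_cn_vad_long_clean.csv").sample(n=200_000, random_state=42)

# Concatenate the two 200k draws, then shuffle the combined result.
mix = pd.concat([clean, long_clean]).sample(frac=1, random_state=42).reset_index(drop=True)
mix.to_csv("en-zh_cn_vad_mix.csv", index=False)
```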
## Training Summary
The final model (`vad-macbert-mix/best`) was obtained in three stages:
1. **Base training** on `en-zh_cn_vad_clean.csv`
2. **Long-text adaptation** on `en-zh_cn_vad_long_clean.csv`
3. **Replay mix** on `en-zh_cn_vad_mix.csv` (resume from stage 2)
### Final-stage Command (Replay Mix)
```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```
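For reference, `--loss huber` with `--huber_delta 1.0` corresponds to PyTorch's built-in Huber loss, which is quadratic for small residuals and linear for large ones, limiting the influence of noisy teacher labels:

```python
import torch

criterion = torch.nn.HuberLoss(delta=1.0)  # matches --huber_delta 1.0
predictions = torch.zeros(4, 3)            # a (batch, 3) VAD output, for illustration
targets = torch.rand(4, 3)
loss = criterion(predictions, targets)
print(loss.item())
```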
Training environment (conda `llm`):
- Python 3.10.19
- torch 2.9.1+cu130
- transformers 4.57.6
## Evaluation
Benchmark script: `train/vad_benchmark.py`
- Evaluation uses a fixed stride derived from `eval_ratio=0.01`
  (roughly 1 out of 100 samples; see the sketch after this list).
- Length buckets by character count: 0–20, 20–40, 40–80, 80–120, 120–200,
200–400, 400+
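A hedged reading of that fixed-stride selection (a reconstruction, not the exact logic of `vad_benchmark.py`):

```python
rows = list(range(1_000))   # placeholder for the dataset rows
stride = round(1 / 0.01)    # eval_ratio=0.01 -> stride of 100
eval_rows = rows[::stride]  # roughly 1 out of every 100 samples
```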
### Results (vad-macbert-mix/best)
| Dataset | MSE (mean) | MAE (mean) | Pearson (mean) |
|---|---|---|---|
| `en-zh_cn_vad_clean.csv` | 0.043734 | 0.149322 | 0.7335 |
| `en-zh_cn_vad_long_clean.csv` | 0.031895 | 0.131320 | 0.7565 |
Notes:
- `400+` bucket Pearson is unstable due to small sample size; interpret with care.
## Limitations
- Labels are derived from an English VAD teacher and transferred via parallel
alignment, so they reflect the teacher’s bias and may not match human Chinese
annotations.
- Subtitle corpora include translation artifacts and formatting noise; cleaned
versions mitigate but do not fully remove this.
- Extreme-length sentences are under-represented; performance on 400+ chars
is not reliable.
## Files in This Repo
- `config.json`
- `model.safetensors`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
- `training_args.json`