---
license: mit
datasets:
- Helsinki-NLP/open_subtitles
language:
- zh
base_model:
- hfl/chinese-macbert-base
pipeline_tag: text-classification
tags:
- agent
- nlp
- chinese
- sentiment-analysis
- emotion
- regression
- vad
- valence-arousal-dominance
- transformers
- bert
- macbert
---
|
|
|
|
|
<div align="center">
  <h1>vad-macbert</h1>
  <p>Chinese VAD (valence/arousal/dominance) regression on top of chinese-macbert-base.</p>
  <p>
    <a href="https://huggingface.co/Pectics/vad-macbert">
      <img alt="HF Model" src="https://img.shields.io/badge/Hugging%20Face-Model-yellow">
    </a>
    <img alt="Task" src="https://img.shields.io/badge/task-VAD%20regression-1f6feb">
    <img alt="Backbone" src="https://img.shields.io/badge/backbone-chinese--macbert--base-4b8bbe">
  </p>
</div>
|
|
|
|
|
The model predicts three continuous values, one each for valence, arousal, and dominance, aligned to the VAD scale produced by the teacher model `RobroKools/vad-bert`.
|
|
|
|
|
## Quickstart |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "这部电影让我很感动。"  # "This movie really moved me."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)

# The three regression logits are valence, arousal, dominance, in that order.
vad = outputs.logits.squeeze().tolist()
print("VAD:", vad)
```
|
|
|
|
|
## Model Details |
|
|
|
|
|
- Base model: `hfl/chinese-macbert-base`
- Task: VAD regression (3 outputs: valence, arousal, dominance)
- Head: `AutoModelForSequenceClassification` with `num_labels=3` and `problem_type="regression"` (see the sketch below)
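For reference, a minimal sketch of how such a head can be instantiated on top of the backbone; this is illustrative only, since the released checkpoint already ships with this configuration:

```python
from transformers import AutoModelForSequenceClassification

# Attach a 3-output regression head to the MacBERT backbone.
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-base",
    num_labels=3,               # valence, arousal, dominance
    problem_type="regression",  # regression loss instead of cross-entropy
)
```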
|
|
|
|
|
## Data Sources & Labeling |
|
|
|
|
|
### en-zh_cn_vad_clean.csv

- Source: OpenSubtitles EN-ZH parallel corpus.
- Labeling: the English side was fed into `RobroKools/vad-bert` to obtain VAD values, which were then assigned to the paired Chinese text (sketched below).
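A minimal sketch of that labeling step, assuming the teacher loads as a standard sequence-classification regressor; the tokenization settings and output column names are illustrative assumptions, not the recorded pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teacher_name = "RobroKools/vad-bert"
teacher_tok = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_name).eval()

def label_pair(en_text: str, zh_text: str) -> dict:
    # Score the English side with the teacher...
    inputs = teacher_tok(en_text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        v, a, d = teacher(**inputs).logits.squeeze().tolist()
    # ...and transfer the VAD values to the paired Chinese text.
    return {"text": zh_text, "valence": v, "arousal": a, "dominance": d}
```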
|
|
|
|
|
### en-zh_cn_vad_long.csv

- Derived from `en-zh_cn_vad_clean.csv` by filtering for longer texts with a length threshold (the original threshold was not recorded).
- Inferred from the dataset statistics: the minimum length is 32 characters, so the filter likely kept samples of length >= 32 characters (see the sketch below).
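A reconstruction of that filter under the inferred threshold; the use of pandas and the `text` column name are assumptions:

```python
import pandas as pd

df = pd.read_csv("en-zh_cn_vad_clean.csv")
# Inferred, not recorded: keep samples with at least 32 characters.
long_df = df[df["text"].str.len() >= 32]
long_df.to_csv("en-zh_cn_vad_long.csv", index=False)
```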
|
|
|
|
|
### en-zh_cn_vad_long_clean.csv

- Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle formatting noise (a reconstruction of the cleaning pass is sketched below):
  - ASS/SSA tag blocks like `{\\fs..\\pos(..)}` (including broken `{` blocks)
  - HTML-like tags (e.g. `<i>...</i>`)
  - Escape codes like `\\N`, `\\n`, `\\h`, `\\t`
  - Extra whitespace normalization
- Non-CJK rows were dropped.
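A sketch of this cleaning pass as regex substitutions; the exact patterns used during dataset construction were not recorded, so these are reconstructions:

```python
import re

ASS_TAG = re.compile(r"\{[^}]*\}?")          # {\fs..\pos(..)} blocks, incl. a broken "{"
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # HTML-like tags such as <i>...</i>
ESCAPE = re.compile(r"\\[Nnht]")             # \N, \n, \h, \t escape codes
CJK = re.compile(r"[\u4e00-\u9fff]")         # at least one CJK ideograph

def clean_subtitle(text: str) -> str | None:
    text = ASS_TAG.sub("", text)
    text = HTML_TAG.sub("", text)
    text = ESCAPE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()  # normalize extra whitespace
    if not CJK.search(text):
        return None  # drop non-CJK rows
    return text
```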
|
|
|
|
|
### en-zh_cn_vad_mix.csv

- Mixed dataset created for replay training (see the sketch below):
  - 200k samples from `en-zh_cn_vad_clean.csv`
  - 200k samples from `en-zh_cn_vad_long_clean.csv`
  - Shuffled after sampling
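A sketch of the mixing step; the random seed here is illustrative, since the seed used for dataset construction was not recorded:

```python
import pandas as pd

clean = pd.read_csv("en-zh_cn_vad_clean.csv").sample(n=200_000, random_state=42)
long_clean = pd.read_csv("en-zh_cn_vad_long_clean.csv").sample(n=200_000, random_state=42)
# Concatenate the two 200k samples, then shuffle the combined frame.
mix = pd.concat([clean, long_clean], ignore_index=True).sample(frac=1.0, random_state=42)
mix.to_csv("en-zh_cn_vad_mix.csv", index=False)
```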
|
|
|
|
|
## Training Summary |
|
|
|
|
|
The final model (`vad-macbert-mix/best`) was obtained in three stages:

1. **Base training** on `en-zh_cn_vad_clean.csv`
2. **Long-text adaptation** on `en-zh_cn_vad_long_clean.csv`
3. **Replay mix** on `en-zh_cn_vad_mix.csv` (resumed from the stage 2 checkpoint)
|
|
|
|
|
### Final-stage Command (Replay Mix) |
|
|
|
|
|
```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```
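The `--loss huber --huber_delta 1.0` flags select a Huber objective over the three outputs. A minimal sketch of what that loss computes (the training script itself is not part of this repo):

```python
import torch

# Huber loss: quadratic near zero, linear beyond |error| = delta.
loss_fn = torch.nn.HuberLoss(delta=1.0)
pred = torch.randn(8, 3)    # a batch of predicted VAD triples
target = torch.randn(8, 3)  # teacher-derived VAD targets
loss = loss_fn(pred, target)
```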
|
|
|
|
|
Training environment (conda `llm`):

- Python 3.10.19
- torch 2.9.1+cu130
- transformers 4.57.6
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Benchmark script: `train/vad_benchmark.py`

- Evaluation uses a fixed stride derived from `eval_ratio=0.01`, i.e. roughly 1 out of every 100 samples (see the sketch below).
- Length buckets by character count: 0–20, 20–40, 40–80, 80–120, 120–200, 200–400, 400+
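A sketch of the stride-based split that `eval_ratio=0.01` implies; the actual selection logic lives in `train/vad_benchmark.py` and may differ:

```python
eval_ratio = 0.01
stride = round(1 / eval_ratio)  # 100: keep roughly 1 out of every 100 samples
rows = list(range(10_000))      # placeholder for dataset rows
eval_rows = rows[::stride]      # fixed-stride evaluation subset
```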
|
|
|
|
|
### Results (vad-macbert-mix/best) |
|
|
|
|
|
| Dataset | MSE (mean) | MAE (mean) | Pearson (mean) |
|---|---|---|---|
| `en-zh_cn_vad_clean.csv` | 0.043734 | 0.149322 | 0.7335 |
| `en-zh_cn_vad_long_clean.csv` | 0.031895 | 0.131320 | 0.7565 |
|
|
|
|
|
Notes:

- `400+` bucket Pearson is unstable due to small sample size; interpret with care.
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Labels are derived from an English VAD teacher and transferred via parallel |
|
|
alignment, so they reflect the teacher’s bias and may not match human Chinese |
|
|
annotations. |
|
|
- Subtitle corpora include translation artifacts and formatting noise; cleaned |
|
|
versions mitigate but do not fully remove this. |
|
|
- Extreme-length sentences are under-represented; performance on 400+ chars |
|
|
is not reliable. |
|
|
|
|
|
## Files in This Repo

- `config.json`
- `model.safetensors`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
- `training_args.json`
|
|
|