---
license: cc-by-nc-4.0
language:
- zh
- en
tags:
- speech
- asr
- sensevoice
- paralinguistic
- nonverbal-vocalization
datasets:
- NV-Bench
- amphion/Emilia-NV
- nonverbalspeech/nonverbalspeech38k
- deepvk/NonverbalTTS
- xunyi/SMIIP-NV
pipeline_tag: automatic-speech-recognition
metrics:
- cer
base_model:
- FunAudioLLM/SenseVoiceSmall
---

# Multi-lingual NVASR

**Multi-lingual Nonverbal Vocalization Automatic Speech Recognition**

[![Demo Page](https://img.shields.io/badge/Demo-Page-blue)](https://nvbench.github.io)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/AnonyData/NV-Bench)
[![Model](https://img.shields.io/badge/Model-HuggingFace-yellow)](https://huggingface.co/AnonyData/Multilingual-NVASR)

Multi-lingual NVASR is a speech recognition model fine-tuned from [SenseVoice-Small](https://github.com/FunAudioLLM/SenseVoice) to transcribe both regular speech and **nonverbal vocalizations (NVVs)** under a unified paralinguistic label taxonomy. It is a core component of the [NV-Bench](https://nvbench.github.io) evaluation pipeline.

## Highlights

- 🗣️ **Multi-lingual Support** — Chinese (zh), English (en)
- 🎯 **NVV-Aware Transcription** — Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within the text
- 📊 **High-Quality General ASR** — Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
- 🏷️ **Unified Label Taxonomy** — Consistent paralinguistic labels across all supported languages

## NVV Taxonomy

NVVs are organized into three functional levels:

| Function | Categories |
|----------|------------|
| Vegetative | `[Cough]`, `[Sigh]`, `[Breathing]` |
| Affect Burst | `[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]` |
| Conversational Grunt | `[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]` |

> [!NOTE]
> Mandarin supports 13 NVV categories; English supports 7 categories.

## Usage

### Quick Start with FunASR

```python
from funasr import AutoModel

model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single-file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",
    use_itn=True,
)
print(res[0]["text"])
```

## Evaluation Metrics

Multi-lingual NVASR supports the following evaluation metrics used in the NV-Bench pipeline:

| Metric | Description |
|--------|-------------|
| **OCER / OWER** | Overall Character/Word Error Rate (text + NVV tags) |
| **PCER / PWER** | Paralinguistic CER/WER (NVV tags only) |
| **CER / WER** | Text-only error rate (NVV tags removed) |

> Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks.
>
> — *NV-Bench*
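The three-way split in the table comes down to how NVV tags are separated from a transcript before scoring. Below is a minimal sketch of that separation, assuming the bracketed tag format shown in the taxonomy above and using the third-party `jiwer` package for the error-rate computation; the exact NV-Bench scoring recipe may differ (e.g., in text normalization or tag tokenization).

```python
import re

import jiwer  # assumed helper (pip install jiwer), not part of this repo

NVV_TAG = re.compile(r"\[[^\]]+\]")  # matches bracketed tags such as [Laughter]

def split_transcript(text: str) -> tuple[str, str]:
    """Separate a transcript into its NVV tag sequence and its plain text."""
    tags = " ".join(NVV_TAG.findall(text))
    plain = re.sub(r"\s+", " ", NVV_TAG.sub(" ", text)).strip()
    return tags, plain

def nvbench_style_scores(ref: str, hyp: str) -> dict:
    """Illustrative OWER / PWER / WER split (empty tag sequences left unguarded)."""
    ref_tags, ref_text = split_transcript(ref)
    hyp_tags, hyp_text = split_transcript(hyp)
    return {
        "OWER": jiwer.wer(ref, hyp),            # full transcript: text + NVV tags
        "PWER": jiwer.wer(ref_tags, hyp_tags),  # NVV tags only
        "WER":  jiwer.wer(ref_text, hyp_text),  # NVV tags removed
    }

print(nvbench_style_scores(
    ref="that is great [Laughter] see you tomorrow",
    hyp="that is great [Cough] see you tomorrow",
))
# -> {'OWER': 0.1428..., 'PWER': 1.0, 'WER': 0.0}
```

For Chinese, the same split would feed character-level scoring (e.g., `jiwer.cer`) to produce OCER / PCER / CER.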
## File Structure

```
Multi-lingual NVASR/
├── model.pt                        # Model weights (~2.8 GB)
├── config.yaml                     # Model architecture configuration
├── configuration.json              # FunASR pipeline configuration
├── am.mvn                          # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model   # SentencePiece tokenizer with NVV vocabulary
└── example/                        # Example audio files
    ├── zh.mp3                      # Chinese example
    └── en.mp3                      # English example
```

## Related Resources

- **NV-Bench Project Page**: [https://nvbench.github.io](https://nvbench.github.io)
- **NV-Bench Dataset**: [Hugging Face](https://huggingface.co/datasets/AnonyData/NV-Bench)
- **SenseVoice**: [GitHub](https://github.com/FunAudioLLM/SenseVoice)

## Citation

If you use this model, please cite:

```bibtex
Coming soon
```

## License

This project is licensed under the [CC BY-NC-4.0 License](LICENSE).