---
license: cc-by-nc-4.0
language:
- zh
- en
tags:
- speech
- asr
- sensevoice
- paralinguistic
- nonverbal-vocalization
datasets:
- NV-Bench
- amphion/Emilia-NV
- nonverbalspeech/nonverbalspeech38k
- deepvk/NonverbalTTS
- xunyi/SMIIP-NV
pipeline_tag: automatic-speech-recognition
metrics:
- cer
base_model:
- FunAudioLLM/SenseVoiceSmall
---
# Multi-lingual NVASR
**Multi-lingual Nonverbal Vocalization Automatic Speech Recognition**
[![Demo Page](https://img.shields.io/badge/Demo-Page-blue)](https://nvbench.github.io)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/AnonyData/NV-Bench)
[![Model](https://img.shields.io/badge/Model-HuggingFace-yellow)](https://huggingface.co/AnonyData/Multilingual-NVASR)
Multi-lingual NVASR is a speech recognition model fine-tuned from [SenseVoice-Small](https://github.com/FunAudioLLM/SenseVoice) for transcribing both regular speech and **nonverbal vocalizations (NVVs)** with a unified paralinguistic label taxonomy. It is a core component of the [NV-Bench](https://nvbench.github.io) evaluation pipeline.
## Highlights
- πŸ—£οΈ **Multi-lingual Support** β€” Chinese (zh), English (en)
- 🎯 **NVV-Aware Transcription** β€” Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within text
- πŸ“Š **High-Quality General ASR** β€” Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
- 🏷️ **Unified Label Taxonomy** β€” Consistent paralinguistic labels across all supported languages
## NVV Taxonomy
NVVs are organized into three functional levels:
| Function | Categories |
|----------|------------|
| Vegetative | `[Cough]`, `[Sigh]`, `[Breathing]` |
| Affect Burst | `[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]` |
| Conversational Grunt | `[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]` |
> [!NOTE]
> Mandarin supports 13 NVV categories; English supports 7 categories.
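
Downstream code often needs to separate the spoken text from the NVV tags. The sketch below shows one way to do that with a regular expression. It assumes tags appear inline in the transcript exactly as written in the table above (for example `[Laughter]` or `[Surprise-oh]`); the helper name `split_transcript` is hypothetical and not part of this repository.

```python
import re

# Assumption: NVV tags are square-bracketed labels such as "[Laughter]"
# or "[Question-en]", matching the taxonomy table above.
NVV_TAG = re.compile(r"\[[A-Za-z]+(?:-[A-Za-z]+)?\]")


def split_transcript(transcript: str) -> tuple[str, list[str]]:
    """Return (text with tags removed, tags in order of appearance)."""
    tags = NVV_TAG.findall(transcript)
    # Replace each tag with a space, then collapse runs of whitespace.
    text = " ".join(NVV_TAG.sub(" ", transcript).split())
    return text, tags


text, tags = split_transcript("[Laughter] that is so funny [Surprise-oh] really")
print(text)
print(tags)
```

For Chinese transcripts, where words are not separated by spaces, you may prefer to substitute an empty string instead of a space.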
## Usage
### Quick Start with FunASR
```python
from funasr import AutoModel

model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",
    use_itn=True,
)
print(res[0]["text"])
```
## Evaluation Metrics
Transcripts produced by Multi-lingual NVASR are scored with the following metrics in the NV-Bench pipeline:
| Metric | Description |
|--------|-------------|
| **OCER / OWER** | Overall Character/Word Error Rate (text + NVV tags) |
| **PCER / PWER** | Paralinguistic CER/WER (NVV tags only) |
| **CER / WER** | Text-only error rate (NVV tags removed) |
> Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks. (*NV-Bench*)
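
To make the three character-level metrics concrete, here is a minimal sketch that computes them with a plain Levenshtein distance. It is not the NV-Bench scoring code. It assumes that each NVV tag counts as a single token, that whitespace is ignored, and that the error rate is the edit distance divided by the reference length; the function names are illustrative.

```python
import re

# Assumption: NVV tags are square-bracketed labels such as "[Laughter]".
NVV_TAG = re.compile(r"\[[^\[\]\s]+\]")


def tokenize(s, keep_text=True, keep_tags=True):
    """Split into single characters and whole NVV tags, dropping whitespace."""
    tokens, pos = [], 0
    for m in NVV_TAG.finditer(s):
        if keep_text:
            tokens.extend(c for c in s[pos:m.start()] if not c.isspace())
        if keep_tags:
            tokens.append(m.group())
        pos = m.end()
    if keep_text:
        tokens.extend(c for c in s[pos:] if not c.isspace())
    return tokens


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp, **kwargs):
    r, h = tokenize(ref, **kwargs), tokenize(hyp, **kwargs)
    if not r:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # if the hypothesis is also empty, otherwise 1.0.
        return 0.0 if not h else 1.0
    return edit_distance(r, h) / len(r)


def ocer(ref, hyp):  # text + NVV tags
    return error_rate(ref, hyp)


def pcer(ref, hyp):  # NVV tags only
    return error_rate(ref, hyp, keep_text=False)


def cer(ref, hyp):  # text only, NVV tags removed
    return error_rate(ref, hyp, keep_tags=False)


ref = "你好[Laughter]今天天气真好"
hyp = "你好今天天气真好[Sigh]"
print(ocer(ref, hyp), pcer(ref, hyp), cer(ref, hyp))
```

The word-level variants (OWER, PWER, WER) would follow the same pattern with whitespace-separated words in place of single characters.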
## File Structure
```
Multi-lingual NVASR/
├── model.pt                        # Model weights (~2.8 GB)
├── config.yaml                     # Model architecture configuration
├── configuration.json              # FunASR pipeline configuration
├── am.mvn                          # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model   # SentencePiece tokenizer with NVV vocabulary
└── example/                        # Example audio files
    ├── zh.mp3                      # Chinese example
    └── en.mp3                      # English example
```
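
Before loading the model, it can help to confirm that a local copy contains the top-level files listed above. The following is a small sketch using only the standard library; the helper name `missing_files` is hypothetical, and the file list is copied from the tree above (including the tokenizer filename as spelled there).

```python
from pathlib import Path

# File names taken from the tree above.
EXPECTED_FILES = [
    "model.pt",
    "config.yaml",
    "configuration.json",
    "am.mvn",
    "paralingustic_tokenizer.model",
]


def missing_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example: report anything missing from a local copy of the repository.
print(missing_files("path/to/Multi-lingual-NVASR"))
```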
## Related Resources
- **NV-Bench Project Page**: [https://nvbench.github.io](https://nvbench.github.io)
- **NV-Bench Dataset**: [Hugging Face](https://huggingface.co/datasets/AnonyData/NV-Bench)
- **SenseVoice**: [GitHub](https://github.com/FunAudioLLM/SenseVoice)
## Citation
If you use this model, please cite:
```bibtex
Coming soon
```
## License
This project is licensed under the [CC BY-NC 4.0 License](LICENSE).