---
license: cc-by-nc-4.0
language:
- zh
- en
tags:
- speech
- asr
- sensevoice
- paralinguistic
- nonverbal-vocalization
datasets:
- NV-Bench
- amphion/Emilia-NV
- nonverbalspeech/nonverbalspeech38k
- deepvk/NonverbalTTS
- xunyi/SMIIP-NV
pipeline_tag: automatic-speech-recognition
metrics:
- cer
base_model:
- FunAudioLLM/SenseVoiceSmall
---

# Multi-lingual NVASR

**Multi-lingual Nonverbal Vocalization Automatic Speech Recognition**

[![Demo Page](https://img.shields.io/badge/Demo-Page-blue)](https://nvbench.github.io)
[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-yellow)](https://huggingface.co/datasets/AnonyData/NV-Bench)
[![Model](https://img.shields.io/badge/Model-HuggingFace-yellow)](https://huggingface.co/AnonyData/Multilingual-NVASR)

Multi-lingual NVASR is a speech recognition model fine-tuned from [SenseVoice-Small](https://github.com/FunAudioLLM/SenseVoice) to transcribe both regular speech and **nonverbal vocalizations (NVVs)** under a unified paralinguistic label taxonomy. It is a core component of the [NV-Bench](https://nvbench.github.io) evaluation pipeline.

## Highlights

- 🗣️ **Multi-lingual Support** — Chinese (zh), English (en)
- 🎯 **NVV-Aware Transcription** — Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within the text
- 📊 **High-Quality General ASR** — Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
- 🏷️ **Unified Label Taxonomy** — Consistent paralinguistic labels across all supported languages

## NVV Taxonomy

NVVs are organized into three functional levels:

| Function | Categories |
|----------|------------|
| Vegetative | `[Cough]`, `[Sigh]`, `[Breathing]` |
| Affect Burst | `[Surprise-oh]`, `[Surprise-ah]`, `[Dissatisfaction-hnn]`, `[Laughter]` |
| Conversational Grunt | `[Uhm]`, `[Question-en/oh/ah/ei/huh]`, `[Confirmation-en]` |

> [!NOTE]
> Mandarin supports 13 NVV categories; English supports 7 categories.

## Usage

### Quick Start with FunASR

```python
from funasr import AutoModel

model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single-file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",
    use_itn=True,
)
print(res[0]["text"])
```

## Evaluation Metrics

Multi-lingual NVASR supports the following evaluation metrics used in the NV-Bench pipeline:

| Metric | Description |
|--------|-------------|
| **OCER / OWER** | Overall Character/Word Error Rate (text + NVV tags) |
| **PCER / PWER** | Paralinguistic CER/WER (NVV tags only) |
| **CER / WER** | Text-only error rate (NVV tags removed) |

> Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks.
>
> — *NV-Bench*
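The three-way split in the table comes down to how NVV tags are separated from a transcript before scoring. Below is a minimal sketch of that separation, assuming the bracketed tag format shown in the taxonomy above and using the third-party `jiwer` package for the error-rate computation; the exact NV-Bench scoring recipe may differ (e.g., in text normalization or tag tokenization).

```python
import re

import jiwer  # assumed helper (pip install jiwer), not part of this repo

NVV_TAG = re.compile(r"\[[^\]]+\]")  # matches bracketed tags such as [Laughter]

def split_transcript(text: str) -> tuple[str, str]:
    """Separate a transcript into its NVV tag sequence and its plain text."""
    tags = " ".join(NVV_TAG.findall(text))
    plain = re.sub(r"\s+", " ", NVV_TAG.sub(" ", text)).strip()
    return tags, plain

def nvbench_style_scores(ref: str, hyp: str) -> dict:
    """Illustrative OWER / PWER / WER split (empty tag sequences left unguarded)."""
    ref_tags, ref_text = split_transcript(ref)
    hyp_tags, hyp_text = split_transcript(hyp)
    return {
        "OWER": jiwer.wer(ref, hyp),            # full transcript: text + NVV tags
        "PWER": jiwer.wer(ref_tags, hyp_tags),  # NVV tags only
        "WER":  jiwer.wer(ref_text, hyp_text),  # NVV tags removed
    }

print(nvbench_style_scores(
    ref="that is great [Laughter] see you tomorrow",
    hyp="that is great [Cough] see you tomorrow",
))
# -> {'OWER': 0.1428..., 'PWER': 1.0, 'WER': 0.0}
```

For Chinese, the same split would feed character-level scoring (e.g., `jiwer.cer`) to produce OCER / PCER / CER.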
## File Structure

```
Multi-lingual NVASR/
├── model.pt                        # Model weights (~2.8 GB)
├── config.yaml                     # Model architecture configuration
├── configuration.json              # FunASR pipeline configuration
├── am.mvn                          # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model   # SentencePiece tokenizer with NVV vocabulary
└── example/                        # Example audio files
    ├── zh.mp3                      # Chinese example
    └── en.mp3                      # English example
```

## Related Resources

- **NV-Bench Project Page**: [https://nvbench.github.io](https://nvbench.github.io)
- **NV-Bench Dataset**: [Hugging Face](https://huggingface.co/datasets/AnonyData/NV-Bench)
- **SenseVoice**: [GitHub](https://github.com/FunAudioLLM/SenseVoice)

## Citation

If you use this model, please cite:

```bibtex
Coming soon
```

## License

This project is licensed under the [CC BY-NC-4.0 License](LICENSE).