metadata
license: cc-by-nc-4.0
language:
  - zh
  - en
tags:
  - speech
  - asr
  - sensevoice
  - paralinguistic
  - nonverbal-vocalization
datasets:
  - NV-Bench
  - amphion/Emilia-NV
  - nonverbalspeech/nonverbalspeech38k
  - deepvk/NonverbalTTS
  - xunyi/SMIIP-NV
pipeline_tag: automatic-speech-recognition
metrics:
  - cer
base_model:
  - FunAudioLLM/SenseVoiceSmall

Multi-lingual NVASR

Multi-lingual Nonverbal Vocalization Automatic Speech Recognition


Multi-lingual NVASR is a speech recognition model fine-tuned from SenseVoice-Small for transcribing both regular speech and nonverbal vocalizations (NVVs) with a unified paralinguistic label taxonomy. It is a core component of the NV-Bench evaluation pipeline.

Highlights

  • 🗣️ Multi-lingual Support: Chinese (zh), English (en)
  • 🎯 NVV-Aware Transcription: Accurately transcribes nonverbal vocalizations (laughter, coughs, sighs, etc.) as structured tags within text
  • 📊 High-Quality General ASR: Maintains competitive CER on standard ASR benchmarks while significantly outperforming baselines on NVV-specific tasks
  • 🏷️ Unified Label Taxonomy: Consistent paralinguistic labels across all supported languages

NVV Taxonomy

NVVs are organized into three functional levels:

| Function | Categories |
| --- | --- |
| Vegetative | [Cough], [Sigh], [Breathing] |
| Affect Burst | [Surprise-oh], [Surprise-ah], [Dissatisfaction-hnn], [Laughter] |
| Conversational Grunt | [Uhm], [Question-en/oh/ah/ei/huh], [Confirmation-en] |

Mandarin supports 13 NVV categories; English supports 7 categories.
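
The taxonomy above can be written down as a small lookup table, which is convenient when grouping recognized tags by function. The following is a minimal sketch and not part of the released code: the names `NVV_TAXONOMY`, `TAG_TO_FUNCTION`, and `function_of` are hypothetical, and it assumes that `[Question-en/oh/ah/ei/huh]` abbreviates five separate tags (`[Question-en]`, `[Question-oh]`, and so on), which the table does not state explicitly.

```python
from typing import Optional

# Hypothetical lookup table built from the taxonomy table above.
# Assumption: "[Question-en/oh/ah/ei/huh]" abbreviates five distinct tags.
NVV_TAXONOMY = {
    "Vegetative": ["Cough", "Sigh", "Breathing"],
    "Affect Burst": ["Surprise-oh", "Surprise-ah", "Dissatisfaction-hnn", "Laughter"],
    "Conversational Grunt": [
        "Uhm",
        "Question-en", "Question-oh", "Question-ah", "Question-ei", "Question-huh",
        "Confirmation-en",
    ],
}

# Invert the table: tag name -> function category.
TAG_TO_FUNCTION = {
    tag: function for function, tags in NVV_TAXONOMY.items() for tag in tags
}


def function_of(tag: str) -> Optional[str]:
    """Return the function category for a tag such as "[Laughter]" or "Laughter"."""
    return TAG_TO_FUNCTION.get(tag.strip().strip("[]"))


print(function_of("[Laughter]"))  # Affect Burst
print(function_of("[Uhm]"))       # Conversational Grunt
```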

Usage

Quick Start with FunASR

from funasr import AutoModel

model = AutoModel(model="path/to/Multi-lingual-NVASR")

# Single file inference
res = model.generate(
    input="example/zh.mp3",
    language="auto",
    use_itn=True,
)
print(res[0]["text"])
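
Because NVVs are transcribed as bracketed tags inside the text, downstream code often needs to separate them from the spoken words. The following is a minimal sketch, assuming tags appear inline exactly as written in the taxonomy (for example `[Laughter]`). `split_transcript` is a hypothetical helper, not part of FunASR, and the sample string is made up for illustration rather than real model output.

```python
import re

# Assumption: NVV tags appear inline as "[Laughter]" or "[Question-huh]".
NVV_TAG = re.compile(r"\[([A-Za-z]+(?:-[A-Za-z]+)?)\]")


def split_transcript(text):
    """Return (plain_text, tags) for a transcript with inline NVV tags."""
    tags = NVV_TAG.findall(text)
    # Replace each tag with a space, then collapse repeated whitespace.
    plain = re.sub(r"\s+", " ", NVV_TAG.sub(" ", text)).strip()
    return plain, tags


# Illustrative input only; in practice pass res[0]["text"] from model.generate().
plain, tags = split_transcript("[Laughter] that is so funny [Sigh]")
print(plain)  # that is so funny
print(tags)   # ['Laughter', 'Sigh']
```

For Chinese transcripts you may prefer to substitute an empty string instead of a space, so that no spaces are introduced between characters.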

Evaluation Metrics

Multi-lingual NVASR supports the following evaluation metrics used in the NV-Bench pipeline:

| Metric | Description |
| --- | --- |
| OCER / OWER | Overall Character/Word Error Rate (text + NVV tags) |
| PCER / PWER | Paralinguistic CER/WER (NVV tags only) |
| CER / WER | Text-only error rate (NVV tags removed) |
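
The three metric families differ only in which tokens are kept before an edit-distance error rate is computed. The sketch below illustrates that idea at the character level; it is not the NV-Bench implementation. It assumes that each bracketed tag counts as a single token and that whitespace is ignored, and the function names (`tokenize`, `error_rate`, `nvv_metrics`) are hypothetical. The actual NV-Bench normalization and tokenization may differ.

```python
import re

# Assumption: an NVV tag such as "[Laughter]" is one token, every other
# non-space character is one token, and whitespace is dropped.
TOKEN = re.compile(r"\[[^\[\]]+\]|\S")


def tokenize(text):
    return TOKEN.findall(text)


def is_tag(token):
    return len(token) > 2 and token.startswith("[") and token.endswith("]")


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else float("inf")
    return edit_distance(ref, hyp) / len(ref)


def nvv_metrics(reference, hypothesis):
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    return {
        "OCER": error_rate(ref, hyp),  # text + NVV tags
        "PCER": error_rate([t for t in ref if is_tag(t)],
                           [t for t in hyp if is_tag(t)]),  # NVV tags only
        "CER": error_rate([t for t in ref if not is_tag(t)],
                          [t for t in hyp if not is_tag(t)]),  # tags removed
    }


# Made-up example: the text is correct but the wrong NVV tag was predicted.
print(nvv_metrics("[Laughter] 你好世界", "[Sigh] 你好世界"))
# {'OCER': 0.2, 'PCER': 1.0, 'CER': 0.0}
```

The word-level variants (OWER, PWER, WER) would follow the same pattern with a tokenizer that splits on words instead of characters.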

"Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks." (NV-Bench)

File Structure

Multi-lingual NVASR/
├── model.pt                        # Model weights (~2.8 GB)
├── config.yaml                     # Model architecture configuration
├── configuration.json              # FunASR pipeline configuration
├── am.mvn                          # Acoustic model mean-variance normalization
├── paralingustic_tokenizer.model   # SentencePiece tokenizer with NVV vocabulary
├── example/                        # Example audio files
│   ├── zh.mp3                      # Chinese example
│   ├── en.mp3                      # English example

Related Resources

Citation

If you use this model, please cite:

Coming soon

License

This project is licensed under the CC BY-NC 4.0 license.