---
language:
- en
- hi
- or
- bn
- ta
- te
- kn
- ml
- mr
- gu
- pa
- as
license: apache-2.0
pipeline_tag: audio-classification
library_name: transformers
tags:
- language-identification
- indian-languages
- multilingual
- speech
- asr-preprocessing
- callcenter-ai
- speech-analytics
- audio-classification
- wav2vec2
- transformers
- pytorch
- huggingface
---

# **Vakgyata**

**Language Identification for Indian Languages from Speech**

---

## **Model Overview**

`vakgyata` is an open-source language identification model that classifies Indian languages from raw speech audio. It is built on the pretrained [`Harveenchadha/wav2vec2-pretrained-clsril-23-10k`](https://huggingface.co/Harveenchadha/wav2vec2-pretrained-clsril-23-10k) checkpoint, with added **Layer Normalization** to improve stability and performance on audio classification tasks.

---

## **Variants and Model Sizes**

| Variant          | Parameters | Accuracy |
| ---------------- | ---------- | -------- |
| `vakgyata-base`  | 95M        | 95.88%   |
| `vakgyata-small` | 52M        | 95.06%   |
| `vakgyata-mini`  | 38M        | 95.06%   |
| `vakgyata-tiny`  | 24M        | 93.63%   |

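
The size and accuracy trade-off in the table can be turned into a simple selection rule. The sketch below is illustrative only: the figures are copied from the table, while the `VARIANTS` list and the `smallest_variant` helper are hypothetical names introduced here, not part of the model's API.

```python
from typing import Optional

# Figures copied from the table above; the helper itself is hypothetical.
VARIANTS = [
    # (name, parameters in millions, reported accuracy in %)
    ("vakgyata-base", 95, 95.88),
    ("vakgyata-small", 52, 95.06),
    ("vakgyata-mini", 38, 95.06),
    ("vakgyata-tiny", 24, 93.63),
]


def smallest_variant(min_accuracy: float) -> Optional[str]:
    """Return the smallest variant whose reported accuracy meets the threshold."""
    candidates = [v for v in VARIANTS if v[2] >= min_accuracy]
    if not candidates:
        return None  # no variant reaches the requested accuracy
    return min(candidates, key=lambda v: v[1])[0]


print(smallest_variant(95.0))  # vakgyata-mini
print(smallest_variant(93.0))  # vakgyata-tiny
```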

---

## **Supported Languages**

| Language        | Code  |
| --------------- | ----- |
| English (India) | en-IN |
| Hindi           | hi-IN |
| Odia            | or-IN |
| Bengali         | bn-IN |
| Tamil           | ta-IN |
| Telugu          | te-IN |
| Kannada         | kn-IN |
| Malayalam       | ml-IN |
| Marathi         | mr-IN |
| Gujarati        | gu-IN |
| Punjabi         | pa-IN |
| Assamese        | as-IN |

---

## **Specifications**

* **Supported Sampling Rate:** 16000 Hz
* **Recommended Audio Format:** 16 kHz, 16-bit PCM, mono

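
Before running inference, it can help to confirm that a file matches the recommended format. The following is a minimal sketch using Python's standard `wave` module, which reads PCM WAV files only. The `wav_format_issues` helper and the `EXPECTED_*` constants are hypothetical names introduced for illustration.

```python
import wave
from typing import List

EXPECTED_RATE = 16000   # Hz
EXPECTED_WIDTH = 2      # bytes per sample, i.e. 16-bit PCM
EXPECTED_CHANNELS = 1   # mono


def wav_format_issues(path: str) -> List[str]:
    """List mismatches against 16 kHz, 16-bit, mono (empty list if none)."""
    issues = []
    # wave.open raises wave.Error for non-PCM or malformed WAV files.
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            issues.append(f"sample rate is {wav.getframerate()} Hz")
        if wav.getsampwidth() != EXPECTED_WIDTH:
            issues.append(f"sample width is {8 * wav.getsampwidth()} bits")
        if wav.getnchannels() != EXPECTED_CHANNELS:
            issues.append(f"channel count is {wav.getnchannels()}")
    return issues
```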

---

## **Installation**

```bash
pip install transformers torchaudio
```

---

## **Usage**

```python
from transformers import Wav2Vec2ForSequenceClassification, AutoFeatureExtractor
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "onecxi/vakgyata-tiny"

processor = AutoFeatureExtractor.from_pretrained(model_id)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_id).to(device)
```

---

## **Inference Example**

```python
import torchaudio

# Uses `torch`, `processor`, `model`, and `device` from the Usage section.

# Load the audio; the model expects 16 kHz mono input
audio, sr = torchaudio.load("path/to/audio.wav")  # shape: (channels, samples)
audio = audio.mean(dim=0)  # downmix to mono
if sr != 16000:
    # Resample so the sampling rate matches what the feature extractor expects
    audio = torchaudio.functional.resample(audio, orig_freq=sr, new_freq=16000)
    sr = 16000

# Preprocess
inputs = processor(audio.numpy(), sampling_rate=sr, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax to get probabilities
probs = logits.softmax(dim=-1).cpu().numpy()

# Predicted language (cast to a plain int for the id2label lookup)
language = model.config.id2label[int(probs.argmax())]
print("Predicted Language:", language)
```
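
When the top prediction is uncertain, it can be useful to inspect the few most likely languages rather than only the argmax. The sketch below is a dependency-free illustration of that idea. The `top_k_languages` helper is hypothetical, and the logits and label strings in the example are made up (the actual label format in `model.config.id2label` is not shown here). With the real model, something like `top_k_languages(logits[0].tolist(), model.config.id2label)` should work under the same assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_k_languages(
    logits: Sequence[float], id2label: Dict[int, str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort class indices by descending probability and keep the first k.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Made-up logits and labels, for illustration only.
example_labels = {0: "hi-IN", 1: "ta-IN", 2: "bn-IN"}
for label, prob in top_k_languages([2.0, 0.5, 1.0], example_labels, k=2):
    print(f"{label}: {prob:.3f}")
```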

---

## **Citation**

If you use this model in your research or application, please consider citing the model and its base source:

```
@misc{vakgyata2024,
  title={vakgyata: Language Identification for Indian Speech},
  author={OneCXI},
  year={2024},
  url={https://huggingface.co/onecxi/vakgyata-tiny}
}
```

---