---
language:
  - fa
license: cc-by-sa-4.0
tags:
  - automatic-speech-recognition
  - persian
  - farsi
  - conformer
  - ctc
  - attention-decoder
  - acoustic-sentinel
  - asr
  - speech
datasets:
  - GinkgoQ/Qoqnus
metrics:
  - wer
  - cer
model-index:
  - name: QoQnus-DARA
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: GinkgoQ/Qoqnus (hezarai_cv13_test)
          type: GinkgoQ/Qoqnus
          split: hezarai_cv13_test
        metrics:
          - type: wer
            value: 17.3
            name: WER
          - type: cer
            value: 7.51
            name: CER
---

# QoQnus-DARA

Dual-head ASR with Robust Acoustics  ·  Persian (فارسی)  ·  195.8M params  ·  by GinkgoQ



## Overview

QoQnus-DARA is a Persian ASR model built from scratch by GinkgoQ, part of the QoQnus SpeechLine — a family of Persian speech models trained on the GinkgoQ/Qoqnus corpus (~2,000 hours, 1.7M+ utterances).

The architecture introduces an Acoustic Sentinel — a lightweight speech/non-speech classifier that gates the attention decoder, preventing hallucination on silent or noisy inputs. Combined with a causal Conformer encoder and a dual CTC+attention training objective, DARA achieves strong accuracy with streaming-capable inference.

| Property | Value |
|---|---|
| Language | Persian · فارسی |
| Architecture | Causal Conformer + Gated Decoder + Acoustic Sentinel |
| Parameters | 195.8M |
| Vocabulary | 4,096 SentencePiece Unigram tokens |
| Input | 16 kHz mono, RMS-normalized to −20 dB |
| Mel features | 80-band log-mel · N_FFT=1024 · HOP=256 · Slaney norm |
| Decoding | CTC greedy + Attention beam search (beam=5) |
| Training data | GinkgoQ/Qoqnus |
| License | CC BY-SA 4.0 |

## Benchmark Results

Evaluated on 5,867 samples from gpt_informal_train (informal conversational Persian) and 1,000 samples from hezarai_cv13_test (Common Voice Persian test set). All models use identical text normalization.

### Informal Persian (gpt_informal_train · 5,867 samples)

| Model | WER ↓ | CER ↓ | Median WER | Perfect (WER=0) | Speed |
|---|---|---|---|---|---|
| QoQnus-DARA (ours) | 34.41% | 17.88% | 27.27% | 949/5867 (16.2%) | 6.5 ms |
| QoQnus-Moonshine (ours) | 34.93% | 10.28% | 25.00% | 1790/5867 (30.5%) | 12.3 ms |
| vhdm/whisper-large-fa-v1 | 31.81% | 15.42% | 25.00% | 879/5867 (15.0%) | 58.7 ms |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | 37.81% | 12.49% | 30.00% | 664/5867 (11.3%) | 17.4 ms |
| m3hrdadfi/wav2vec2-large-xlsr-persian-v3 | 40.51% | 14.65% | 28.57% | 666/5867 (11.4%) | 17.4 ms |
| nvidia/stt_fa_fastconformer_hybrid_large | 46.62% | 23.39% | 41.67% | 297/5867 (5.1%) | 9.6 ms |

QoQnus-DARA is 9× faster than Whisper-large-fa while achieving competitive WER. On formal speech (hezarai_cv13_test), QoQnus-DARA achieves 17.30% WER / 7.51% CER.

### Common Voice Persian Test (hezarai_cv13_test · 1,000 samples)

| Metric | Value |
|---|---|
| Average WER | 17.30% |
| Average CER | 7.51% |
| Median WER | 0.00% |
| Median CER | 0.00% |
| Perfect predictions (WER=0) | 595/1000 (59.5%) |
| Throughput | ~154 samples/sec |

A median WER of 0% means that more than half of the test utterances are transcribed perfectly.

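The reported scores can be reproduced with any standard WER/CER implementation. Below is a minimal sketch using the `jiwer` package (an assumption; the exact evaluation script is not included here), applied to reference/hypothesis pairs that have gone through the same text normalization described later in this card:

```python
# Minimal WER/CER sketch using jiwer (assumed; not the official evaluation script).
# Both references and hypotheses should use the normalization from the
# "Text Normalization" section below.
import jiwer

refs = ["سلام دنیا", "امروز هوا خوب است"]   # reference transcripts (normalized)
hyps = ["سلام دنیا", "امروز هوا خوبه"]       # model outputs (normalized)

print(f"WER: {100 * jiwer.wer(refs, hyps):.2f}%")
print(f"CER: {100 * jiwer.cer(refs, hyps):.2f}%")
```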

## Quick Start

A Hugging Face `transformers`-native wrapper is coming soon. Until then, run the full DARA pipeline with the inference script below.

### Using the inference script

```bash
pip install torch torchaudio sentencepiece transformers
git clone https://huggingface.co/GinkgoQ/QoQnus-DARA
cd QoQnus-DARA
```

### Load model

```python
from huggingface_hub import hf_hub_download
import torch, sys
from pathlib import Path

# download what you need
ckpt_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
model_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "dara_model.py")
tok_path   = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer.model")

# add the downloaded dir to the path so dara_model.py is importable
sys.path.insert(0, str(Path(model_path).parent))
from dara_model import build_dara_small

# load weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)
print("model loaded")
```

### Use model

```python
import torch, torchaudio
import torchaudio.transforms as T
import sentencepiece as spm
from huggingface_hub import hf_hub_download
from dara_model import build_dara_small

# ── download model weights ────────────────────────────────────────────────────
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
tok_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer/fa_unigram_4096/tokenizer.model")

# ── load ──────────────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)

sp = spm.SentencePieceProcessor()
sp.load(tok_path)

# ── mel preprocessing (must match training) ───────────────────────────────────
SAMPLE_RATE, N_FFT, HOP, N_MEL = 16000, 1024, 256, 80
_RMS = 10 ** (-20.0 / 20.0)  # target RMS for −20 dB normalization

mel_fn = T.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, win_length=N_FFT,
    hop_length=HOP, f_min=0, f_max=8000, n_mels=N_MEL,
    power=1.0, norm="slaney", mel_scale="slaney",
)

def audio_to_mel(path):
    wav, sr = torchaudio.load(path)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)                    # downmix to mono
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav * (_RMS / wav.pow(2).mean().sqrt().clamp(min=1e-9))  # RMS-normalize to −20 dB
    return torch.log(mel_fn(wav.squeeze(0)).clamp(min=1e-5)).T     # (T, 80)

def enc_len(t):
    return ((t + 1) // 2 + 1) // 2                         # 4× conv subsampling

# ── inference ─────────────────────────────────────────────────────────────────
mel = audio_to_mel("audio.wav").unsqueeze(0).to(device, dtype=torch.bfloat16)

with torch.no_grad():
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        h               = model.encoder(model.subsampler(mel))
        sentinel_logits = model.sentinel(h)
        mask            = model.masker(sentinel_logits)
        ctc_logits      = model.ctc(h)

# CTC greedy decoding (fast): collapse repeats, then drop blanks (blank id = 4096)
ids = ctc_logits[0, :enc_len(mel.shape[1])].argmax(-1).tolist()
col = [ids[0]] if ids else []
for t in ids[1:]:
    if t != col[-1]:
        col.append(t)
ctc_text = sp.decode([t for t in col if t != 4096])
print(f"CTC:  {ctc_text}")

# Attention decoding (more accurate). Greedy decoding is shown here for brevity;
# the reported results use beam search with beam=5.
BOS, EOS, PAD = 2, 3, 1
tokens = torch.tensor([[BOS]], dtype=torch.long, device=device)
for _ in range(200):
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits = model.decoder(tokens, h, mask)
    next_id = logits[0, -1].float().argmax(-1).item()
    if next_id == EOS:
        break
    tokens = torch.cat([tokens, torch.tensor([[next_id]], device=device)], dim=1)

attn_ids  = [t for t in tokens[0, 1:].tolist() if t not in (BOS, EOS, PAD)]
attn_text = sp.decode(attn_ids)
print(f"ATTN: {attn_text}")
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        QoQnus-DARA                              │
│                                                                 │
│  Audio (16kHz)                                                  │
│      │                                                          │
│      ▼                                                          │
│  RMS Normalize (−20 dB)                                         │
│      │                                                          │
│      ▼                                                          │
│  Log-Mel Spectrogram (80 bands, N_FFT=1024, Slaney)             │
│      │                                                          │
│      ▼                                                          │
│  ConvSubsampler  ──  2× causal conv (stride=2) → 4× reduction  │
│      │                                                          │
│      ▼                                                          │
│  Conformer Encoder  ──  12 layers, d=512, 8 heads, causal       │
│      │                                                          │
│      ├──────────────────────┬──────────────────────┐           │
│      ▼                      ▼                      ▼           │
│  CTC Head              Acoustic Sentinel      Gated Decoder     │
│  (fast decode)         (speech/noise gate)    (4 layers, d=512) │
│                              │                      │           │
│                              └──── mask ────────────┘           │
│                                                                 │
│                         Transcription                           │
└─────────────────────────────────────────────────────────────────┘

```

### Component Details

#### ConvSubsampler

Two causal strided convolutions (kernel=3, stride=2 each) reduce time resolution by 4×. Input: (T, 80) → Output: (T/4, 512). GELU activation + LayerNorm.

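A minimal sketch of this module, assuming standard PyTorch layers; the class name and the exact placement of activation and normalization are illustrative, and the real implementation lives in `dara_model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvSubsamplerSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=512, kernel=3, stride=2):
        super().__init__()
        self.kernel = kernel
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel, stride=stride)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel, stride=stride)
        self.norm  = nn.LayerNorm(d_model)

    def _causal(self, conv, x):
        # left-pad only, so each output frame depends on current and past frames
        return F.gelu(conv(F.pad(x, (self.kernel - 1, 0))))

    def forward(self, mel):                       # mel: (B, T, 80)
        x = mel.transpose(1, 2)                   # (B, 80, T)
        x = self._causal(self.conv1, x)           # (B, 512, ~T/2)
        x = self._causal(self.conv2, x)           # (B, 512, ~T/4)
        return self.norm(x.transpose(1, 2))       # (B, ~T/4, 512)
```

With kernel 3, stride 2, and left-only padding, each stage maps a length `t` to `(t + 1) // 2`, so two stages reproduce the `enc_len(t) = ((t + 1) // 2 + 1) // 2` formula used in the inference script.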
#### Conformer Encoder — 12 layers

Each block:

- Feed-forward (half-step, 512→2048→512, SiLU, dropout=0.1)
- Multi-head self-attention with RoPE (8 heads, causal, dropout=0.1)
- Depthwise conv module (kernel=31, causal, dropout=0.1)
- Feed-forward (half-step)
- Post-block LayerNorm

Causal masking enables streaming inference. A simplified sketch of the block ordering follows.

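The sketch below uses stock PyTorch modules and is a simplification for illustration only: RoPE is omitted in favor of plain multi-head attention, and the convolution module is reduced to a single causal depthwise convolution. See `dara_model.py` for the actual layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    def __init__(self, d=512, heads=8, ff_mult=4, conv_kernel=31, p=0.1):
        super().__init__()
        def ff():
            return nn.Sequential(
                nn.LayerNorm(d), nn.Linear(d, ff_mult * d), nn.SiLU(),
                nn.Dropout(p), nn.Linear(ff_mult * d, d), nn.Dropout(p))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Conv1d(d, d, conv_kernel, groups=d)  # depthwise conv
        self.k = conv_kernel
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                                   # x: (B, T, d)
        x = x + 0.5 * self.ff1(x)                           # half-step FFN
        q = self.attn_norm(x)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        a, _ = self.attn(q, q, q, attn_mask=causal, need_weights=False)
        x = x + a                                           # causal self-attention
        c = F.pad(self.conv_norm(x).transpose(1, 2), (self.k - 1, 0))  # left-pad → causal
        x = x + self.conv(c).transpose(1, 2)                # depthwise conv module
        x = x + 0.5 * self.ff2(x)                           # half-step FFN
        return self.out_norm(x)                             # post-block LayerNorm
```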
#### Acoustic Sentinel — the novel component

A 2-layer Transformer classifier (d=256, 4 heads) that predicts per-frame speech probability. It is trained with a BCE loss to distinguish speech from non-speech (music and noise from MUSAN).

The sentinel output gates the decoder's cross-attention — non-speech frames are masked to zero, preventing the decoder from hallucinating on silence or background noise.

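A minimal sketch of the gating idea, assuming the sentinel emits one logit per encoder frame; the threshold, tensor shapes, and function name are illustrative rather than the model's exact mechanism:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def gated_cross_attention(dec_states, enc_states, sentinel_logits, threshold=0.5):
    # dec_states: (B, U, d) decoder queries; enc_states: (B, T, d) encoder output
    # sentinel_logits: (B, T, 1) per-frame speech logits (assumed shape)
    speech_prob = torch.sigmoid(sentinel_logits.squeeze(-1))   # (B, T)
    non_speech  = speech_prob < threshold                      # True → frame is ignored
    # If every frame looks like non-speech, keep the mask empty so attention stays
    # defined; in that case the decoder should produce an empty transcript anyway.
    all_masked = non_speech.all(dim=1, keepdim=True)
    non_speech = non_speech & ~all_masked
    out, _ = cross_attn(dec_states, enc_states, enc_states,
                        key_padding_mask=non_speech)
    return out
```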
#### Gated Decoder — 4 layers

Autoregressive decoder:

- Embedding dropout (p=0.1)
- Causal self-attention with RoPE (8 heads, dropout=0.1)
- Gated cross-attention over sentinel-masked encoder output (8 heads)
- Feed-forward (512→2048→512, GELU, dropout=0.1)
- Pre-norm (LayerNorm before each sub-layer)

#### Parameter breakdown

| Component | Parameters |
|---|---|
| ConvSubsampler | ~0.8M |
| Conformer Encoder (12 layers) | ~85M |
| CTC Head | ~2.1M |
| Acoustic Sentinel | ~2.5M |
| Gated Decoder (4 layers) | ~105M |
| Total | ~195.8M |

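A quick way to verify the total, assuming `dara_model.py` has been downloaded as in the Quick Start section:

```python
from dara_model import build_dara_small  # downloaded in the Quick Start section

model = build_dara_small()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected ≈ 195.8M
```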
## QoQnus SpeechLine

| Model | Description | Size · Architecture | Status |
|---|---|---|---|
| QoQnus-DARA | Dual-head ASR · Robust Acoustics | 195.8M · Conformer + Sentinel | 🤗 This model |
| QoQnus-Moonshine | Streaming ASR · Informal Persian | 108M · Moonshine Streaming | Coming soon |
| QoQnus-DARA-Base | Dual-head ASR · Larger capacity | ~350M · Conformer + Sentinel | In development |

All models in the SpeechLine are trained on GinkgoQ/Qoqnus — a 3,000+ hour Persian speech corpus curated and released by GinkgoQ.


## Dataset — GinkgoQ/Qoqnus

| Bucket | Utterances | Hours | SNR |
|---|---|---|---|
| train_high | ~1,083,351 | ~1,260h | ≥ 40 dB |
| train_medium | ~631,573 | ~740h | 30–40 dB |
| val | ~15,262 | ~18h | |
| test | ~33,000 | ~39h | |

Sources: Common Voice 13/17, VHDM, Pourmand YouTube, Srezas (YouTube/Fleurs/Yazdi), KiarashQ, Mana-TTS, GPTInformal, Mshojaei, Thomcles, PERTTS.

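The corpus can be browsed with the 🤗 `datasets` library. A hedged sketch, assuming the split name used in the evaluation above is exposed as-is (the dataset's actual configuration and split names may differ):

```python
from datasets import load_dataset

# split name taken from the evaluation tables above; adjust if the dataset
# exposes different configuration/split names
ds = load_dataset("GinkgoQ/Qoqnus", split="hezarai_cv13_test", streaming=True)
print(next(iter(ds)).keys())   # expect an audio field plus a Persian transcript
```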
## Training Phases

| Phase | Data | Steps | Peak LR | Augmentation |
|---|---|---|---|---|
| 1 | train_high | 150,000 | 1e-3 | None |
| 2 | train_high + 5% MUSAN | 50,000 | 3e-4 | None |
| 3a | train_high + train_medium | 20,000 | 5e-5 | None |
| Noise v2 | train_high + train_medium | 50,000 | 3e-5 | Curriculum noise |

## Loss Function

```
L_total = 0.3 × L_CTC + 0.7 × L_attention + 0.3 × L_sentinel

L_CTC:       CTC loss (blank = vocab_size = 4096)
L_attention: Cross-entropy with label_smoothing=0.1
L_sentinel:  BCE with frame-level speech/non-speech mask
```
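A minimal sketch of this objective in PyTorch. Tensor shapes, the padding id used for `ignore_index`, and the helper name are assumptions for illustration; the actual training code may differ:

```python
import torch
import torch.nn as nn

VOCAB = 4096
BLANK = VOCAB                    # blank id = vocab_size, as in the CTC decoding above
PAD   = 1                        # padding id, as in the inference script (assumed)

ctc_criterion  = nn.CTCLoss(blank=BLANK, zero_infinity=True)
attn_criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD)
sent_criterion = nn.BCEWithLogitsLoss()

def total_loss(ctc_logits, enc_lens, targets, target_lens,
               attn_logits, attn_targets, sentinel_logits, speech_mask):
    # ctc_logits: (B, T, VOCAB + 1) → CTCLoss wants (T, B, C) log-probabilities
    l_ctc = ctc_criterion(ctc_logits.log_softmax(-1).transpose(0, 1),
                          targets, enc_lens, target_lens)
    # attn_logits: (B, U, VOCAB), attn_targets: (B, U) next-token labels
    l_att = attn_criterion(attn_logits.reshape(-1, attn_logits.size(-1)),
                           attn_targets.reshape(-1))
    # sentinel_logits: (B, T, 1), speech_mask: (B, T) with 1 = speech frame
    l_sen = sent_criterion(sentinel_logits.squeeze(-1), speech_mask.float())
    return 0.3 * l_ctc + 0.7 * l_att + 0.3 * l_sen
```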

## Key Hyperparameters

```python
optimizer    = AdamW(betas=(0.9, 0.98), weight_decay=1e-2)
# weight decay applied to weight matrices only — biases and LayerNorm excluded
scheduler    = cosine_decay_with_warmup
precision    = bfloat16
grad_clip    = 1.0
dropout      = 0.1  # all encoder and decoder modules
label_smooth = 0.1
```
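A minimal sketch of the weight-decay grouping described in the comment above, using standard `torch.optim.AdamW`; the `ndim`-based heuristic is illustrative rather than the exact rule used in training:

```python
import torch

def build_optimizer(model, lr=1e-3, weight_decay=1e-2):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters are biases and LayerNorm scales → excluded from weight decay
        (no_decay if param.ndim == 1 or name.endswith(".bias") else decay).append(param)
    return torch.optim.AdamW(
        [{"params": decay,    "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.98),
    )
```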

## Text Normalization

Applied to both training targets and inference output (a sketch follows the list below):

  1. NFKC Unicode normalization
  2. Arabic → Persian substitution (ك→ک, ي→ی, ة→ه, أ/إ/آ→ا, ؤ→و, ئ→ی)
  3. Diacritic removal (harakat, shadda, tanwin)
  4. ASCII digits → Persian digits (0→۰ … 9→۹)
  5. Lowercase ASCII (Latin loanwords)
  6. Whitespace normalization

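A minimal sketch of these steps with Python's `unicodedata` and `re`; the exact character tables and diacritic ranges used in training may differ:

```python
import re
import unicodedata

ARABIC_TO_PERSIAN = str.maketrans({
    "ك": "ک", "ي": "ی", "ة": "ه", "أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "و", "ئ": "ی",
})
ASCII_TO_PERSIAN_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tanwin, harakat, shadda, sukun

def normalize_fa(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)       # 1. NFKC normalization
    text = text.translate(ARABIC_TO_PERSIAN)         # 2. Arabic → Persian letters
    text = DIACRITICS.sub("", text)                  # 3. remove diacritics
    text = text.translate(ASCII_TO_PERSIAN_DIGITS)   # 4. ASCII → Persian digits
    text = text.lower()                              # 5. lowercase Latin loanwords
    return re.sub(r"\s+", " ", text).strip()         # 6. whitespace normalization
```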
## Limitations

**Known limitations:**

- **Noise robustness:** Trained primarily on clean speech (SNR ≥ 30 dB). Performance degrades below ~20 dB SNR. Noise-robust variants are under active development.
- **Informal/dialectal speech:** Handles colloquial Persian but was primarily trained on formal/broadcast speech.
- **Proper nouns:** Rare names and places may be substituted with phonetically similar common words.
- **Long-form audio:** Optimized for utterances up to ~15 seconds. Use VAD segmentation for longer audio.
- **Causal encoder:** Left-context-only attention enables streaming but is slightly less accurate than bidirectional models on offline tasks.

## Hardware & Training Time

Trained on a single NVIDIA GeForce RTX 4090 Laptop GPU (16 GB VRAM).

| Phase | Duration |
|---|---|
| Phase 1 (150k steps) | ~33 hours |
| Phase 2 (50k steps) | ~11 hours |
| Phase 3a (20k steps) | ~3 hours |
| Noise training (50k steps) | ~28 hours |
| Total | ~75+ hours |

## About GinkgoQ


GinkgoQ is an AI research initiative focused on Persian language technology. The QoQnus (ققنوس) brand — named after the Persian phoenix — represents a family of speech and language models built from the ground up for the Persian-speaking community.

| Resource | Link |
|---|---|
| 🤗 Organization | huggingface.co/GinkgoQ |
| 📦 Training Dataset | GinkgoQ/Qoqnus |
| 🔊 This Model | GinkgoQ/QoQnus-DARA |

## Citation

```bibtex
@misc{QoQnus-DARA-2026,
  title        = {QoQnus-DARA: Dual-head ASR with Robust Acoustics for Persian},
  author       = {GinkgoQ},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GinkgoQ/QoQnus-DARA}},
  note         = {Trained on GinkgoQ/Qoqnus Persian speech corpus}
}
```

QoQnus SpeechLine · Built by GinkgoQ

QoQnus (ققنوس) — the Persian phoenix · rising from silence into speech