---
language:
  - fa
license: cc-by-sa-4.0
tags:
  - automatic-speech-recognition
  - persian
  - farsi
  - conformer
  - ctc
  - attention-decoder
  - acoustic-sentinel
  - asr
  - speech
datasets:
  - GinkgoQ/Qoqnus
metrics:
  - wer
  - cer
model-index:
  - name: QoQnus-DARA
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: GinkgoQ/Qoqnus (hezarai_cv13_test)
          type: GinkgoQ/Qoqnus
          split: hezarai_cv13_test
        metrics:
          - type: wer
            value: 17.3
            name: WER
          - type: cer
            value: 7.51
            name: CER
---

# QoQnus-DARA

Dual-head ASR with Robust Acoustics  ·  Persian (فارسی)  ·  195.8M params  ·  by GinkgoQ



## Overview

QoQnus-DARA is a Persian ASR model built from scratch by GinkgoQ, part of the QoQnus SpeechLine — a family of Persian speech models trained on the GinkgoQ/Qoqnus corpus (~2,000 hours, 1.7M+ utterances).

The architecture introduces an Acoustic Sentinel — a lightweight speech/non-speech classifier that gates the attention decoder, preventing hallucination on silent or noisy inputs. Combined with a causal Conformer encoder and a dual CTC+attention training objective, DARA achieves strong accuracy with streaming-capable inference.

| Property | Value |
|---|---|
| Language | Persian · فارسی |
| Architecture | Causal Conformer + Gated Decoder + Acoustic Sentinel |
| Parameters | 195.8M |
| Vocabulary | 4,096 SentencePiece Unigram tokens |
| Input | 16 kHz mono, RMS-normalized to −20 dB |
| Mel features | 80-band log-mel · N_FFT=1024 · HOP=256 · Slaney norm |
| Decoding | CTC greedy + Attention beam search (beam=5) |
| Training data | GinkgoQ/Qoqnus |
| License | CC BY-SA 4.0 |

## Benchmark Results

Evaluated on 5,867 samples from gpt_informal_train (informal conversational Persian) and 1,000 samples from hezarai_cv13_test (Common Voice Persian test set). All models use identical text normalization.

### Informal Persian (gpt_informal_train · 5,867 samples)

| Model | WER ↓ | CER ↓ | Median WER | Perfect (WER=0) | Speed |
|---|---|---|---|---|---|
| QoQnus-DARA (ours) | 34.41% | 17.88% | 27.27% | 949/5867 (16.2%) | 6.5 ms |
| QoQnus-Moonshine (ours) | 34.93% | 10.28% | 25.00% | 1790/5867 (30.5%) | 12.3 ms |
| vhdm/whisper-large-fa-v1 | 31.81% | 15.42% | 25.00% | 879/5867 (15.0%) | 58.7 ms |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | 37.81% | 12.49% | 30.00% | 664/5867 (11.3%) | 17.4 ms |
| m3hrdadfi/wav2vec2-large-xlsr-persian-v3 | 40.51% | 14.65% | 28.57% | 666/5867 (11.4%) | 17.4 ms |
| nvidia/stt_fa_fastconformer_hybrid_large | 46.62% | 23.39% | 41.67% | 297/5867 (5.1%) | 9.6 ms |

QoQnus-DARA is 9× faster than Whisper-large-fa while achieving competitive WER. On formal speech (hezarai_cv13_test), QoQnus-DARA achieves 17.30% WER / 7.51% CER.

### Common Voice Persian Test (hezarai_cv13_test · 1,000 samples)

| Metric | Value |
|---|---|
| Average WER | 17.30% |
| Average CER | 7.51% |
| Median WER | 0.00% |
| Median CER | 0.00% |
| Perfect predictions (WER=0) | 595/1000 (59.5%) |
| Throughput | ~154 samples/sec |

A median WER of 0% means that more than half of the test utterances are transcribed perfectly.

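The reported scores can be reproduced with any standard WER/CER implementation. Below is a minimal sketch using the `jiwer` package (an assumption; the exact evaluation script is not included here), applied to reference/hypothesis pairs that have gone through the same text normalization described later in this card:

```python
# Minimal WER/CER sketch using jiwer (assumed; not the official evaluation script).
# Both references and hypotheses should use the normalization from the
# "Text Normalization" section below.
import jiwer

refs = ["سلام دنیا", "امروز هوا خوب است"]   # reference transcripts (normalized)
hyps = ["سلام دنیا", "امروز هوا خوبه"]       # model outputs (normalized)

print(f"WER: {100 * jiwer.wer(refs, hyps):.2f}%")
print(f"CER: {100 * jiwer.cer(refs, hyps):.2f}%")
```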

## Quick Start

A Hugging Face `transformers`-native wrapper is coming soon. Until then, run the full DARA pipeline with the inference script below.

### Using the inference script

```bash
pip install torch torchaudio sentencepiece transformers
git clone https://huggingface.co/GinkgoQ/QoQnus-DARA
cd QoQnus-DARA
```

### Load model

```python
from huggingface_hub import hf_hub_download
import torch, sys
from pathlib import Path

# download what you need
ckpt_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
model_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "dara_model.py")
tok_path   = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer.model")

# add the downloaded dir to the path so dara_model.py is importable
sys.path.insert(0, str(Path(model_path).parent))
from dara_model import build_dara_small

# load weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)
print("model loaded")
```

### Use model

```python
import torch, torchaudio
import torchaudio.transforms as T
import sentencepiece as spm
from huggingface_hub import hf_hub_download
from dara_model import build_dara_small

# ── download model weights ────────────────────────────────────────────────────
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
tok_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer/fa_unigram_4096/tokenizer.model")

# ── load ──────────────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)

sp = spm.SentencePieceProcessor()
sp.load(tok_path)

# ── mel preprocessing (must match training) ───────────────────────────────────
SAMPLE_RATE, N_FFT, HOP, N_MEL = 16000, 1024, 256, 80
_RMS = 10 ** (-20.0 / 20.0)  # target RMS for −20 dB normalization

mel_fn = T.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, win_length=N_FFT,
    hop_length=HOP, f_min=0, f_max=8000, n_mels=N_MEL,
    power=1.0, norm="slaney", mel_scale="slaney",
)

def audio_to_mel(path):
    wav, sr = torchaudio.load(path)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)                    # downmix to mono
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav * (_RMS / wav.pow(2).mean().sqrt().clamp(min=1e-9))  # RMS-normalize to −20 dB
    return torch.log(mel_fn(wav.squeeze(0)).clamp(min=1e-5)).T     # (T, 80)

def enc_len(t):
    return ((t + 1) // 2 + 1) // 2                         # 4× conv subsampling

# ── inference ─────────────────────────────────────────────────────────────────
mel = audio_to_mel("audio.wav").unsqueeze(0).to(device, dtype=torch.bfloat16)

with torch.no_grad():
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        h               = model.encoder(model.subsampler(mel))
        sentinel_logits = model.sentinel(h)
        mask            = model.masker(sentinel_logits)
        ctc_logits      = model.ctc(h)

# CTC greedy decoding (fast): collapse repeats, then drop blanks (blank id = 4096)
ids = ctc_logits[0, :enc_len(mel.shape[1])].argmax(-1).tolist()
col = [ids[0]] if ids else []
for t in ids[1:]:
    if t != col[-1]:
        col.append(t)
ctc_text = sp.decode([t for t in col if t != 4096])
print(f"CTC:  {ctc_text}")

# Attention decoding (more accurate). Greedy decoding is shown here for brevity;
# the reported results use beam search with beam=5.
BOS, EOS, PAD = 2, 3, 1
tokens = torch.tensor([[BOS]], dtype=torch.long, device=device)
for _ in range(200):
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits = model.decoder(tokens, h, mask)
    next_id = logits[0, -1].float().argmax(-1).item()
    if next_id == EOS:
        break
    tokens = torch.cat([tokens, torch.tensor([[next_id]], device=device)], dim=1)

attn_ids  = [t for t in tokens[0, 1:].tolist() if t not in (BOS, EOS, PAD)]
attn_text = sp.decode(attn_ids)
print(f"ATTN: {attn_text}")
```

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        QoQnus-DARA                              │
│                                                                 │
│  Audio (16kHz)                                                  │
│      │                                                          │
│      ▼                                                          │
│  RMS Normalize (−20 dB)                                         │
│      │                                                          │
│      ▼                                                          │
│  Log-Mel Spectrogram (80 bands, N_FFT=1024, Slaney)             │
│      │                                                          │
│      ▼                                                          │
│  ConvSubsampler  ──  2× causal conv (stride=2) → 4× reduction  │
│      │                                                          │
│      ▼                                                          │
│  Conformer Encoder  ──  12 layers, d=512, 8 heads, causal       │
│      │                                                          │
│      ├──────────────────────┬──────────────────────┐           │
│      ▼                      ▼                      ▼           │
│  CTC Head              Acoustic Sentinel      Gated Decoder     │
│  (fast decode)         (speech/noise gate)    (4 layers, d=512) │
│                              │                      │           │
│                              └──── mask ────────────┘           │
│                                                                 │
│                         Transcription                           │
└─────────────────────────────────────────────────────────────────┘

```

### Component Details

#### ConvSubsampler

Two causal strided convolutions (kernel=3, stride=2 each) reduce time resolution by 4×. Input: (T, 80) → Output: (T/4, 512). GELU activation + LayerNorm.

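A minimal sketch of this module, assuming standard PyTorch layers; the class name and the exact placement of activation and normalization are illustrative, and the real implementation lives in `dara_model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvSubsamplerSketch(nn.Module):
    def __init__(self, n_mels=80, d_model=512, kernel=3, stride=2):
        super().__init__()
        self.kernel = kernel
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel, stride=stride)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel, stride=stride)
        self.norm  = nn.LayerNorm(d_model)

    def _causal(self, conv, x):
        # left-pad only, so each output frame depends on current and past frames
        return F.gelu(conv(F.pad(x, (self.kernel - 1, 0))))

    def forward(self, mel):                       # mel: (B, T, 80)
        x = mel.transpose(1, 2)                   # (B, 80, T)
        x = self._causal(self.conv1, x)           # (B, 512, ~T/2)
        x = self._causal(self.conv2, x)           # (B, 512, ~T/4)
        return self.norm(x.transpose(1, 2))       # (B, ~T/4, 512)
```

With kernel 3, stride 2, and left-only padding, each stage maps a length `t` to `(t + 1) // 2`, so two stages reproduce the `enc_len(t) = ((t + 1) // 2 + 1) // 2` formula used in the inference script.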
#### Conformer Encoder — 12 layers

Each block:

- Feed-forward (half-step, 512→2048→512, SiLU, dropout=0.1)
- Multi-head self-attention with RoPE (8 heads, causal, dropout=0.1)
- Depthwise conv module (kernel=31, causal, dropout=0.1)
- Feed-forward (half-step)
- Post-block LayerNorm

Causal masking enables streaming inference. A simplified sketch of the block ordering follows.

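The sketch below uses stock PyTorch modules and is a simplification for illustration only: RoPE is omitted in favor of plain multi-head attention, and the convolution module is reduced to a single causal depthwise convolution. See `dara_model.py` for the actual layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    def __init__(self, d=512, heads=8, ff_mult=4, conv_kernel=31, p=0.1):
        super().__init__()
        def ff():
            return nn.Sequential(
                nn.LayerNorm(d), nn.Linear(d, ff_mult * d), nn.SiLU(),
                nn.Dropout(p), nn.Linear(ff_mult * d, d), nn.Dropout(p))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, dropout=p, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Conv1d(d, d, conv_kernel, groups=d)  # depthwise conv
        self.k = conv_kernel
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                                   # x: (B, T, d)
        x = x + 0.5 * self.ff1(x)                           # half-step FFN
        q = self.attn_norm(x)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
        a, _ = self.attn(q, q, q, attn_mask=causal, need_weights=False)
        x = x + a                                           # causal self-attention
        c = F.pad(self.conv_norm(x).transpose(1, 2), (self.k - 1, 0))  # left-pad → causal
        x = x + self.conv(c).transpose(1, 2)                # depthwise conv module
        x = x + 0.5 * self.ff2(x)                           # half-step FFN
        return self.out_norm(x)                             # post-block LayerNorm
```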
#### Acoustic Sentinel — the novel component

A 2-layer Transformer classifier (d=256, 4 heads) that predicts per-frame speech probability. It is trained with a BCE loss to distinguish speech from non-speech (music and noise from MUSAN).

The sentinel output gates the decoder's cross-attention — non-speech frames are masked to zero, preventing the decoder from hallucinating on silence or background noise.

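A minimal sketch of the gating idea, assuming the sentinel emits one logit per encoder frame; the threshold, tensor shapes, and function name are illustrative rather than the model's exact mechanism:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

def gated_cross_attention(dec_states, enc_states, sentinel_logits, threshold=0.5):
    # dec_states: (B, U, d) decoder queries; enc_states: (B, T, d) encoder output
    # sentinel_logits: (B, T, 1) per-frame speech logits (assumed shape)
    speech_prob = torch.sigmoid(sentinel_logits.squeeze(-1))   # (B, T)
    non_speech  = speech_prob < threshold                      # True → frame is ignored
    # If every frame looks like non-speech, keep the mask empty so attention stays
    # defined; in that case the decoder should produce an empty transcript anyway.
    all_masked = non_speech.all(dim=1, keepdim=True)
    non_speech = non_speech & ~all_masked
    out, _ = cross_attn(dec_states, enc_states, enc_states,
                        key_padding_mask=non_speech)
    return out
```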
#### Gated Decoder — 4 layers

Autoregressive decoder:

- Embedding dropout (p=0.1)
- Causal self-attention with RoPE (8 heads, dropout=0.1)
- Gated cross-attention over sentinel-masked encoder output (8 heads)
- Feed-forward (512→2048→512, GELU, dropout=0.1)
- Pre-norm (LayerNorm before each sub-layer)

#### Parameter breakdown

| Component | Parameters |
|---|---|
| ConvSubsampler | ~0.8M |
| Conformer Encoder (12 layers) | ~85M |
| CTC Head | ~2.1M |
| Acoustic Sentinel | ~2.5M |
| Gated Decoder (4 layers) | ~105M |
| Total | ~195.8M |

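A quick way to verify the total, assuming `dara_model.py` has been downloaded as in the Quick Start section:

```python
from dara_model import build_dara_small  # downloaded in the Quick Start section

model = build_dara_small()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # expected ≈ 195.8M
```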
## QoQnus SpeechLine

| Model | Description | Size · Architecture | Status |
|---|---|---|---|
| QoQnus-DARA | Dual-head ASR · Robust Acoustics | 195.8M · Conformer + Sentinel | 🤗 This model |
| QoQnus-Moonshine | Streaming ASR · Informal Persian | 108M · Moonshine Streaming | Coming soon |
| QoQnus-DARA-Base | Dual-head ASR · Larger capacity | ~350M · Conformer + Sentinel | In development |

All models in the SpeechLine are trained on GinkgoQ/Qoqnus — a 3,000+ hour Persian speech corpus curated and released by GinkgoQ.


## Dataset — GinkgoQ/Qoqnus

| Bucket | Utterances | Hours | SNR |
|---|---|---|---|
| train_high | ~1,083,351 | ~1,260h | ≥ 40 dB |
| train_medium | ~631,573 | ~740h | 30–40 dB |
| val | ~15,262 | ~18h | |
| test | ~33,000 | ~39h | |

Sources: Common Voice 13/17, VHDM, Pourmand YouTube, Srezas (YouTube/Fleurs/Yazdi), KiarashQ, Mana-TTS, GPTInformal, Mshojaei, Thomcles, PERTTS.

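The corpus can be browsed with the 🤗 `datasets` library. A hedged sketch, assuming the split name used in the evaluation above is exposed as-is (the dataset's actual configuration and split names may differ):

```python
from datasets import load_dataset

# split name taken from the evaluation tables above; adjust if the dataset
# exposes different configuration/split names
ds = load_dataset("GinkgoQ/Qoqnus", split="hezarai_cv13_test", streaming=True)
print(next(iter(ds)).keys())   # expect an audio field plus a Persian transcript
```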
## Training Phases

| Phase | Data | Steps | Peak LR | Augmentation |
|---|---|---|---|---|
| 1 | train_high | 150,000 | 1e-3 | None |
| 2 | train_high + 5% MUSAN | 50,000 | 3e-4 | None |
| 3a | train_high + train_medium | 20,000 | 5e-5 | None |
| Noise v2 | train_high + train_medium | 50,000 | 3e-5 | Curriculum noise |

## Loss Function

```
L_total = 0.3 × L_CTC + 0.7 × L_attention + 0.3 × L_sentinel

L_CTC:       CTC loss (blank = vocab_size = 4096)
L_attention: Cross-entropy with label_smoothing=0.1
L_sentinel:  BCE with frame-level speech/non-speech mask
```
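A minimal sketch of this objective in PyTorch. Tensor shapes, the padding id used for `ignore_index`, and the helper name are assumptions for illustration; the actual training code may differ:

```python
import torch
import torch.nn as nn

VOCAB = 4096
BLANK = VOCAB                    # blank id = vocab_size, as in the CTC decoding above
PAD   = 1                        # padding id, as in the inference script (assumed)

ctc_criterion  = nn.CTCLoss(blank=BLANK, zero_infinity=True)
attn_criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD)
sent_criterion = nn.BCEWithLogitsLoss()

def total_loss(ctc_logits, enc_lens, targets, target_lens,
               attn_logits, attn_targets, sentinel_logits, speech_mask):
    # ctc_logits: (B, T, VOCAB + 1) → CTCLoss wants (T, B, C) log-probabilities
    l_ctc = ctc_criterion(ctc_logits.log_softmax(-1).transpose(0, 1),
                          targets, enc_lens, target_lens)
    # attn_logits: (B, U, VOCAB), attn_targets: (B, U) next-token labels
    l_att = attn_criterion(attn_logits.reshape(-1, attn_logits.size(-1)),
                           attn_targets.reshape(-1))
    # sentinel_logits: (B, T, 1), speech_mask: (B, T) with 1 = speech frame
    l_sen = sent_criterion(sentinel_logits.squeeze(-1), speech_mask.float())
    return 0.3 * l_ctc + 0.7 * l_att + 0.3 * l_sen
```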

## Key Hyperparameters

```python
optimizer    = AdamW(betas=(0.9, 0.98), weight_decay=1e-2)
# weight decay applied to weight matrices only — biases and LayerNorm excluded
scheduler    = cosine_decay_with_warmup
precision    = bfloat16
grad_clip    = 1.0
dropout      = 0.1  # all encoder and decoder modules
label_smooth = 0.1
```
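A minimal sketch of the weight-decay grouping described in the comment above, using standard `torch.optim.AdamW`; the `ndim`-based heuristic is illustrative rather than the exact rule used in training:

```python
import torch

def build_optimizer(model, lr=1e-3, weight_decay=1e-2):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # 1-D parameters are biases and LayerNorm scales → excluded from weight decay
        (no_decay if param.ndim == 1 or name.endswith(".bias") else decay).append(param)
    return torch.optim.AdamW(
        [{"params": decay,    "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.98),
    )
```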

## Text Normalization

Applied to both training targets and inference output (a sketch follows the list below):

  1. NFKC Unicode normalization
  2. Arabic → Persian substitution (ك→ک, ي→ی, ة→ه, أ/إ/آ→ا, ؤ→و, ئ→ی)
  3. Diacritic removal (harakat, shadda, tanwin)
  4. ASCII digits → Persian digits (0→۰ … 9→۹)
  5. Lowercase ASCII (Latin loanwords)
  6. Whitespace normalization

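A minimal sketch of these steps with Python's `unicodedata` and `re`; the exact character tables and diacritic ranges used in training may differ:

```python
import re
import unicodedata

ARABIC_TO_PERSIAN = str.maketrans({
    "ك": "ک", "ي": "ی", "ة": "ه", "أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "و", "ئ": "ی",
})
ASCII_TO_PERSIAN_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tanwin, harakat, shadda, sukun

def normalize_fa(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)       # 1. NFKC normalization
    text = text.translate(ARABIC_TO_PERSIAN)         # 2. Arabic → Persian letters
    text = DIACRITICS.sub("", text)                  # 3. remove diacritics
    text = text.translate(ASCII_TO_PERSIAN_DIGITS)   # 4. ASCII → Persian digits
    text = text.lower()                              # 5. lowercase Latin loanwords
    return re.sub(r"\s+", " ", text).strip()         # 6. whitespace normalization
```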
## Limitations

**Known limitations:**

- **Noise robustness:** Trained primarily on clean speech (SNR ≥ 30 dB). Performance degrades below ~20 dB SNR. Noise-robust variants are under active development.
- **Informal/dialectal speech:** Handles colloquial Persian but was primarily trained on formal/broadcast speech.
- **Proper nouns:** Rare names and places may be substituted with phonetically similar common words.
- **Long-form audio:** Optimized for utterances up to ~15 seconds. Use VAD segmentation for longer audio.
- **Causal encoder:** Left-context-only attention enables streaming but is slightly less accurate than bidirectional models on offline tasks.

## Hardware & Training Time

Trained on a single NVIDIA GeForce RTX 4090 Laptop GPU (16 GB VRAM).

| Phase | Duration |
|---|---|
| Phase 1 (150k steps) | ~33 hours |
| Phase 2 (50k steps) | ~11 hours |
| Phase 3a (20k steps) | ~3 hours |
| Noise training (50k steps) | ~28 hours |
| Total | ~75+ hours |

## About GinkgoQ


GinkgoQ is an AI research initiative focused on Persian language technology. The QoQnus (ققنوس) brand — named after the Persian phoenix — represents a family of speech and language models built from the ground up for the Persian-speaking community.

| Resource | Link |
|---|---|
| 🤗 Organization | huggingface.co/GinkgoQ |
| 📦 Training Dataset | GinkgoQ/Qoqnus |
| 🔊 This Model | GinkgoQ/QoQnus-DARA |

## Citation

```bibtex
@misc{QoQnus-DARA-2026,
  title        = {QoQnus-DARA: Dual-head ASR with Robust Acoustics for Persian},
  author       = {GinkgoQ},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GinkgoQ/QoQnus-DARA}},
  note         = {Trained on GinkgoQ/Qoqnus Persian speech corpus}
}
```

QoQnus SpeechLine · Built by GinkgoQ

QoQnus (ققنوس) — the Persian phoenix · rising from silence into speech