
QoQnus-DARA

Dual-head ASR with Robust Acoustics  ·  Persian (فارسی)  ·  195.8M params  ·  by GinkgoQ



Overview

QoQnus-DARA is a Persian ASR model built from scratch by GinkgoQ, part of the QoQnus SpeechLine — a family of Persian speech models trained on the GinkgoQ/Qoqnus corpus (~2,000 hours, 1.7M+ utterances).

The architecture introduces an Acoustic Sentinel — a lightweight speech/non-speech classifier that gates the attention decoder, preventing hallucination on silent or noisy inputs. Combined with a causal Conformer encoder and a dual CTC+attention training objective, DARA achieves strong accuracy with streaming-capable inference.

| Property | Value |
|---|---|
| Language | Persian · فارسی |
| Architecture | Causal Conformer + Gated Decoder + Acoustic Sentinel |
| Parameters | 195.8M |
| Vocabulary | 4,096 SentencePiece Unigram tokens |
| Input | 16 kHz mono, RMS-normalized to −20 dB |
| Mel features | 80-band log-mel · N_FFT=1024 · HOP=256 · Slaney norm |
| Decoding | CTC greedy + Attention beam search (beam=5) |
| Training data | GinkgoQ/Qoqnus |
| License | CC BY-SA 4.0 |

Benchmark Results

Evaluated on 5,867 samples from gpt_informal_train (informal conversational Persian) and 1,000 samples from hezarai_cv13_test (Common Voice Persian test set). All models use identical text normalization.

Informal Persian (gpt_informal_train · 5,867 samples)

| Model | WER ↓ | CER ↓ | Median WER | Perfect (WER=0) | Speed |
|---|---|---|---|---|---|
| QoQnus-DARA (ours) | 34.41% | 17.88% | 27.27% | 949/5867 (16.2%) | 6.5 ms |
| QoQnus-Moonshine (ours) | 34.93% | 10.28% | 25.00% | 1790/5867 (30.5%) | 12.3 ms |
| vhdm/whisper-large-fa-v1 | 31.81% | 15.42% | 25.00% | 879/5867 (15.0%) | 58.7 ms |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | 37.81% | 12.49% | 30.00% | 664/5867 (11.3%) | 17.4 ms |
| m3hrdadfi/wav2vec2-large-xlsr-persian-v3 | 40.51% | 14.65% | 28.57% | 666/5867 (11.4%) | 17.4 ms |
| nvidia/stt_fa_fastconformer_hybrid_large | 46.62% | 23.39% | 41.67% | 297/5867 (5.1%) | 9.6 ms |

QoQnus-DARA is 9× faster than Whisper-large-fa while achieving competitive WER. On formal speech (hezarai_cv13_test), QoQnus-DARA achieves 17.30% WER / 7.51% CER.

Common Voice Persian Test (hezarai_cv13_test · 1,000 samples)

| Metric | Value |
|---|---|
| Average WER | 17.30% |
| Average CER | 7.51% |
| Median WER | 0.00% |
| Median CER | 0.00% |
| Perfect predictions (WER=0) | 595/1000 (59.5%) |
| Throughput | ~154 samples/sec |

Median WER of 0% — the majority of utterances are transcribed perfectly.


Quick Start

# QoQnus-DARA cannot yet be loaded through transformers' AutoProcessor /
# AutoModel classes; a HuggingFace-native wrapper is coming soon.
# For now, run the full DARA pipeline with the inference script below.

Using the inference script

pip install torch torchaudio sentencepiece transformers
git clone https://huggingface.co/GinkgoQ/QoQnus-DARA
cd QoQnus-DARA

Load model

from huggingface_hub import hf_hub_download
import torch, sys
from pathlib import Path

# download what you need
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
model_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "dara_model.py")
tok_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer.model")

# add the downloaded dir to path so dara_model.py is importable
sys.path.insert(0, str(Path(model_path).parent))
from dara_model import build_dara_small

# load weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)
print("model loaded")

Use model

import torch, torchaudio, unicodedata, re
import sentencepiece as spm
from huggingface_hub import hf_hub_download
from dara_model import build_dara_small  # requires the sys.path setup from the snippet above

# ── download model weights ────────────────────────────────────────────────────
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
tok_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer/fa_unigram_4096/tokenizer.model")

# ── load ──────────────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)

sp = spm.SentencePieceProcessor()
sp.load(tok_path)

# ── mel preprocessing (must match training) ───────────────────────────────────
import torchaudio.transforms as T
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MEL = 16000, 1024, 256, 80
_RMS = 10 ** (-20.0 / 20.0)

mel_fn = T.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, win_length=N_FFT,
    hop_length=HOP, f_min=0, f_max=8000, n_mels=N_MEL,
    power=1.0, norm="slaney", mel_scale="slaney",
)

def audio_to_mel(path):
    wav, sr = torchaudio.load(path)
    if wav.shape[0] > 1: wav = wav.mean(0, keepdim=True)
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav * (_RMS / wav.pow(2).mean().sqrt().clamp(min=1e-9))
    return torch.log(mel_fn(wav.squeeze(0)).clamp(min=1e-5)).T  # (T, 80)

def enc_len(t): return ((t + 1) // 2 + 1) // 2  # frame count after 4× conv subsampling

# ── inference ─────────────────────────────────────────────────────────────────
mel = audio_to_mel("audio.wav").unsqueeze(0).to(device, dtype=torch.bfloat16)

with torch.no_grad():
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        h               = model.encoder(model.subsampler(mel))
        sentinel_logits = model.sentinel(h)
        mask            = model.masker(sentinel_logits)
        ctc_logits      = model.ctc(h)

# CTC greedy (fast): collapse repeats, then drop the blank token (id 4096)
ids = ctc_logits[0, :enc_len(mel.shape[1])].argmax(-1).tolist()
col = [ids[0]] if ids else []
for t in ids[1:]:
    if t != col[-1]: col.append(t)
ctc_text = sp.decode([t for t in col if t != 4096])
print(f"CTC:  {ctc_text}")

# Attention decoding (greedy shown here for simplicity; reported results use beam=5)
BOS, EOS, PAD = 2, 3, 1
tokens = torch.tensor([[BOS]], dtype=torch.long, device=device)
for _ in range(200):
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits = model.decoder(tokens, h, mask)
    next_id = logits[0, -1].float().argmax(-1).item()
    if next_id == EOS:
        break
    tokens = torch.cat([tokens, torch.tensor([[next_id]], device=device)], dim=1)
attn_ids = [t for t in tokens[0, 1:].tolist() if t not in (BOS, EOS, PAD)]
attn_text = sp.decode(attn_ids)
print(f"ATTN: {attn_text}")

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        QoQnus-DARA                              │
│                                                                 │
│  Audio (16kHz)                                                  │
│      │                                                          │
│      ▼                                                          │
│  RMS Normalize (−20 dB)                                         │
│      │                                                          │
│      ▼                                                          │
│  Log-Mel Spectrogram (80 bands, N_FFT=1024, Slaney)             │
│      │                                                          │
│      ▼                                                          │
│  ConvSubsampler  ──  2× causal conv (stride=2) → 4× reduction  │
│      │                                                          │
│      ▼                                                          │
│  Conformer Encoder  ──  12 layers, d=512, 8 heads, causal       │
│      │                                                          │
│      ├──────────────────────┬──────────────────────┐           │
│      ▼                      ▼                      ▼           │
│  CTC Head              Acoustic Sentinel      Gated Decoder     │
│  (fast decode)         (speech/noise gate)    (4 layers, d=512) │
│                              │                      │           │
│                              └──── mask ────────────┘           │
│                                                                 │
│                         Transcription                           │
└─────────────────────────────────────────────────────────────────┘

Component Details

ConvSubsampler

Two causal strided convolutions (kernel=3, stride=2 each) reduce time resolution by 4×. Input: (T, 80) → Output: (T/4, 512). GELU activation + LayerNorm.
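
The subsampler described above can be sketched as follows. This is an illustrative re-implementation, not the released code: layer names and the left-padding detail are assumptions, chosen so that each output frame depends only on past input. For an input of T frames the output length is ((T + 1) // 2 + 1) // 2, which matches the `enc_len` helper in the inference script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSubsampler(nn.Module):
    """Sketch: two causal strided convs reducing time resolution by 4x."""
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2)
        self.norm = nn.LayerNorm(d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, n_mels)
        x = x.transpose(1, 2)                       # (B, n_mels, T)
        # left-pad by kernel-1 so no output frame sees future input
        x = self.act(self.conv1(F.pad(x, (2, 0))))  # (B, d_model, ~T/2)
        x = self.act(self.conv2(F.pad(x, (2, 0))))  # (B, d_model, ~T/4)
        return self.norm(x.transpose(1, 2))         # (B, ~T/4, d_model)

out = CausalSubsampler()(torch.randn(1, 100, 80))
# 100 mel frames -> 25 encoder frames, matching enc_len(100) == 25
```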

Conformer Encoder — 12 layers

Each block:

  • Feed-forward (half-step, 512→2048→512, SiLU, dropout=0.1)
  • Multi-head self-attention with RoPE (8 heads, causal, dropout=0.1)
  • Depthwise conv module (kernel=31, causal, dropout=0.1)
  • Feed-forward (half-step)
  • Post-block LayerNorm

Causal masking enables streaming inference.
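
A left-context-only mask of the kind described above can be built in a few lines; this is a minimal illustration (not the model's internal code), using the boolean-mask convention of PyTorch's attention APIs, where `True` marks a blocked position.

```python
import torch

def causal_mask(T: int) -> torch.Tensor:
    """Frame t may attend only to frames <= t; True = blocked."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
# row t blocks every frame after t:
# [[False,  True,  True,  True],
#  [False, False,  True,  True],
#  [False, False, False,  True],
#  [False, False, False, False]]
```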

Acoustic Sentinel — the novel component

A 2-layer Transformer classifier (d=256, 4 heads) that predicts per-frame speech probability. Trained with BCE loss to distinguish speech from non-speech (MUSAN music noise).

The sentinel output gates the decoder's cross-attention — non-speech frames are masked to zero, preventing the decoder from hallucinating on silence or background noise.
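
The gating step can be sketched as below. This is a hedged illustration of the idea, not DARA's actual `masker`: the 0.5 threshold and the hard (rather than soft) gate are assumptions.

```python
import torch

def sentinel_to_mask(sentinel_logits: torch.Tensor,
                     threshold: float = 0.5) -> torch.Tensor:
    """sentinel_logits: (B, T_enc, 1) per-frame speech logits.
    Returns a (B, T_enc) bool mask; True = attendable speech frame."""
    speech_prob = torch.sigmoid(sentinel_logits.squeeze(-1))
    return speech_prob >= threshold

# a confident-speech, confident-noise, speech sequence:
logits = torch.tensor([[[3.0], [-4.0], [2.0]]])
mask = sentinel_to_mask(logits)
# [[True, False, True]] — the noise frame is excluded from cross-attention
```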

Gated Decoder — 4 layers

Autoregressive decoder:

  • Embedding dropout (p=0.1)
  • Causal self-attention with RoPE (8 heads, dropout=0.1)
  • Gated cross-attention over sentinel-masked encoder output (8 heads)
  • Feed-forward (512→2048→512, GELU, dropout=0.1)
  • Pre-norm (LayerNorm before each sub-layer)
Parameter breakdown

| Component | Parameters |
|---|---|
| ConvSubsampler | ~0.8M |
| Conformer Encoder (12 layers) | ~85M |
| CTC Head | ~2.1M |
| Acoustic Sentinel | ~2.5M |
| Gated Decoder (4 layers) | ~105M |
| Total | ~195.8M |

QoQnus SpeechLine

QoQnus-DARA
Dual-head ASR · Robust Acoustics
195.8M · Conformer + Sentinel
🤗 This model
QoQnus-Moonshine
Streaming ASR · Informal Persian
108M · Moonshine Streaming
Coming soon
QoQnus-DARA-Base
Dual-head ASR · Larger capacity
~350M · Conformer + Sentinel
In development

All models in the SpeechLine are trained on GinkgoQ/Qoqnus, a ~2,000-hour Persian speech corpus curated and released by GinkgoQ.


Dataset — GinkgoQ/Qoqnus

| Bucket | Utterances | Hours | SNR |
|---|---|---|---|
| train_high | ~1,083,351 | ~1,260h | ≥ 40 dB |
| train_medium | ~631,573 | ~740h | 30–40 dB |
| val | ~15,262 | ~18h | |
| test | ~33,000 | ~39h | |

Sources: Common Voice 13/17, VHDM, Pourmand YouTube, Srezas (YouTube/Fleurs/Yazdi), KiarashQ, Mana-TTS, GPTInformal, Mshojaei, Thomcles, PERTTS.

Training Phases

| Phase | Data | Steps | Peak LR | Augmentation |
|---|---|---|---|---|
| 1 | train_high | 150,000 | 1e-3 | None |
| 2 | train_high + 5% MUSAN | 50,000 | 3e-4 | None |
| 3a | train_high + train_medium | 20,000 | 5e-5 | None |
| Noise v2 | train_high + train_medium | 50,000 | 3e-5 | Curriculum noise |

Loss Function

L_total = 0.3 × L_CTC  +  0.7 × L_attention  +  0.3 × L_sentinel

L_CTC:       CTC loss (blank = vocab_size = 4096)
L_attention: Cross-entropy with label_smoothing=0.1
L_sentinel:  BCE with frame-level speech/non-speech mask
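
The weighted sum above can be wired up roughly as follows. The blank id (4096), label smoothing (0.1), PAD id (1), and loss weights come from this card; the tensor shapes, reduction, and `zero_infinity` flag are assumptions for the sketch.

```python
import torch
import torch.nn as nn

ctc_loss      = nn.CTCLoss(blank=4096, zero_infinity=True)
attn_loss     = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=1)
sentinel_loss = nn.BCEWithLogitsLoss()

def total_loss(ctc_logits, ctc_targets, in_lens, tgt_lens,
               attn_logits, attn_targets,
               sentinel_logits, speech_mask):
    # CTCLoss wants (T, B, V) log-probs
    l_ctc  = ctc_loss(ctc_logits.log_softmax(-1).transpose(0, 1),
                      ctc_targets, in_lens, tgt_lens)
    l_attn = attn_loss(attn_logits.flatten(0, 1), attn_targets.flatten())
    l_sent = sentinel_loss(sentinel_logits.squeeze(-1), speech_mask.float())
    return 0.3 * l_ctc + 0.7 * l_attn + 0.3 * l_sent
```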

Key Hyperparameters

optimizer    = AdamW(betas=(0.9, 0.98), weight_decay=1e-2)
# weight decay applied to weight matrices only — biases and LayerNorm excluded
scheduler    = cosine_decay_with_warmup
precision    = bfloat16
grad_clip    = 1.0
dropout      = 0.1  # all encoder and decoder modules
label_smooth = 0.1
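
The cosine-decay-with-warmup schedule named above looks roughly like this; the peak LR matches Phase 1, but `warmup_steps` and `total_steps` are assumed values for illustration.

```python
import math

def lr_at(step: int, peak_lr: float = 1e-3, warmup_steps: int = 2000,
          total_steps: int = 150_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr_at(0) == 0.0, lr_at(2000) == 1e-3, lr_at(150_000) == 0.0
```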

Text Normalization

Applied to both training targets and inference output:

  1. NFKC Unicode normalization
  2. Arabic → Persian substitution (ك→ک, ي→ی, ة→ه, أ/إ/آ→ا, ؤ→و, ئ→ی)
  3. Diacritic removal (harakat, shadda, tanwin)
  4. ASCII digits → Persian digits (0→۰ … 9→۹)
  5. Lowercase ASCII (Latin loanwords)
  6. Whitespace normalization
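
The six steps can be approximated with the sketch below. This is a hypothetical re-implementation for illustration; the exact mapping tables used in training may differ.

```python
import re
import unicodedata

ARABIC_TO_PERSIAN = str.maketrans({
    "ك": "ک", "ي": "ی", "ة": "ه",
    "أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "و", "ئ": "ی",
})
ASCII_TO_PERSIAN_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
# harakat, shadda, tanwin all fall in U+064B..U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize_fa(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)        # 1. NFKC
    text = text.translate(ARABIC_TO_PERSIAN)          # 2. Arabic -> Persian
    text = DIACRITICS.sub("", text)                   # 3. drop diacritics
    text = text.translate(ASCII_TO_PERSIAN_DIGITS)    # 4. ASCII -> Persian digits
    text = text.lower()                               # 5. lowercase Latin
    return re.sub(r"\s+", " ", text).strip()          # 6. whitespace
```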

Limitations

  • Noise robustness: Trained primarily on clean speech (SNR ≥ 30 dB). Performance degrades below ~20 dB SNR. Noise-robust variants under active development.
  • Informal/dialectal speech: Handles colloquial Persian but was primarily trained on formal/broadcast speech.
  • Proper nouns: Rare names and places may be substituted with phonetically similar common words.
  • Long-form audio: Optimized for utterances up to ~15 seconds. Use VAD segmentation for longer audio.
  • Causal encoder: Left-context-only attention enables streaming but is slightly less accurate than bidirectional models on offline tasks.
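
For long-form audio, the "segment, then transcribe each chunk" workflow can be sketched with a naive energy-based splitter; a proper VAD (e.g. Silero) is preferable in practice, and the frame size, threshold, and 15-second cap below are illustrative assumptions.

```python
import torch

def split_on_silence(wav: torch.Tensor, sr: int = 16000, frame_ms: int = 30,
                     thresh: float = 1e-4, max_len_s: float = 15.0):
    """Return (start, end) sample ranges of voiced regions, each <= max_len_s."""
    frame = int(sr * frame_ms / 1000)
    n = wav.numel() // frame
    energy = wav[: n * frame].view(n, frame).pow(2).mean(-1)
    voiced = (energy > thresh).tolist() + [False]  # sentinel closes last segment
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame, i * frame))
            start = None
    # split anything longer than the model's ~15 s sweet spot
    out, max_len = [], int(max_len_s * sr)
    for s, e in segments:
        while e - s > max_len:
            out.append((s, s + max_len))
            s += max_len
        out.append((s, e))
    return out
```

Each returned range can then be fed through `audio_to_mel` and decoded independently.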

Hardware & Training Time

Trained on a single NVIDIA GeForce RTX 4090 Laptop GPU (16 GB VRAM).

Phase Duration
Phase 1 (150k steps) ~33 hours
Phase 2 (50k steps) ~11 hours
Phase 3a (20k steps) ~3 hours
Noise training (50k steps) ~28 hours
Total ~75+ hours

About GinkgoQ


GinkgoQ is an AI research initiative focused on Persian language technology. The QoQnus (ققنوس) brand — named after the Persian phoenix — represents a family of speech and language models built from the ground up for the Persian-speaking community.

| Resource | Link |
|---|---|
| 🤗 Organization | huggingface.co/GinkgoQ |
| 📦 Training Dataset | GinkgoQ/Qoqnus |
| 🔊 This Model | GinkgoQ/QoQnus-DARA |

Citation

@misc{QoQnus-DARA-2026,
  title        = {QoQnus-DARA: Dual-head ASR with Robust Acoustics for Persian},
  author       = {GinkgoQ},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GinkgoQ/QoQnus-DARA}},
  note         = {Trained on GinkgoQ/Qoqnus Persian speech corpus}
}

QoQnus SpeechLine · Built by GinkgoQ

QoQnus (ققنوس) — the Persian phoenix · rising from silence into speech
