Overview
QoQnus-DARA is a Persian ASR model built from scratch by GinkgoQ, part of the QoQnus SpeechLine — a family of Persian speech models trained on the GinkgoQ/Qoqnus corpus (~2,000 hours, 1.7M+ utterances).
The architecture introduces an Acoustic Sentinel — a lightweight speech/non-speech classifier that gates the attention decoder, preventing hallucination on silent or noisy inputs. Combined with a causal Conformer encoder and a dual CTC+attention training objective, DARA achieves strong accuracy with streaming-capable inference.
| Property | Value |
|---|---|
| Language | Persian · فارسی |
| Architecture | Causal Conformer + Gated Decoder + Acoustic Sentinel |
| Parameters | 195.8M |
| Vocabulary | 4,096 SentencePiece Unigram tokens |
| Input | 16 kHz mono, RMS-normalized to −20 dB |
| Mel features | 80-band log-mel · N_FFT=1024 · HOP=256 · Slaney norm |
| Decoding | CTC greedy + Attention beam search (beam=5) |
| Training data | GinkgoQ/Qoqnus |
| License | CC BY-SA 4.0 |
Benchmark Results
Evaluated on 5,867 samples from gpt_informal_train (informal conversational Persian) and 1,000 samples from hezarai_cv13_test (Common Voice Persian test set). All models use identical text normalization.
Informal Persian (gpt_informal_train · 5,867 samples)
| Model | WER ↓ | CER ↓ | Median WER | Perfect (WER=0) | Speed |
|---|---|---|---|---|---|
| QoQnus-DARA (ours) | 34.41% | 17.88% | 27.27% | 949/5867 (16.2%) | 6.5 ms |
| QoQnus-Moonshine (ours) | 34.93% | 10.28% | 25.00% | 1790/5867 (30.5%) | 12.3 ms |
| vhdm/whisper-large-fa-v1 | 31.81% | 15.42% | 25.00% | 879/5867 (15.0%) | 58.7 ms |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | 37.81% | 12.49% | 30.00% | 664/5867 (11.3%) | 17.4 ms |
| m3hrdadfi/wav2vec2-large-xlsr-persian-v3 | 40.51% | 14.65% | 28.57% | 666/5867 (11.4%) | 17.4 ms |
| nvidia/stt_fa_fastconformer_hybrid_large | 46.62% | 23.39% | 41.67% | 297/5867 (5.1%) | 9.6 ms |
QoQnus-DARA is roughly 9× faster than vhdm/whisper-large-fa-v1 (6.5 ms vs 58.7 ms per sample) while achieving competitive WER. On formal speech (hezarai_cv13_test), QoQnus-DARA achieves 17.30% WER / 7.51% CER.
Common Voice Persian Test (hezarai_cv13_test · 1,000 samples)
| Metric | Value |
|---|---|
| Average WER | 17.30% |
| Average CER | 7.51% |
| Median WER | 0.00% |
| Median CER | 0.00% |
| Perfect predictions (WER=0) | 595/1000 (59.5%) |
| Throughput | ~154 samples/sec |
A median WER of 0% means the majority of utterances are transcribed with no word errors at all.
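For reference, the WER and CER figures above follow the standard edit-distance definition. A minimal sketch (not the project's evaluation harness, which also applies the text normalization described later):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the diagonal (i-1, j-1) value
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```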
Quick Start
A HuggingFace-native (`transformers` AutoModel) wrapper is coming soon. For now, use the inference script below.
Using the inference script
```bash
pip install torch torchaudio sentencepiece transformers
git clone https://huggingface.co/GinkgoQ/QoQnus-DARA
cd QoQnus-DARA
```
Load model
```python
from huggingface_hub import hf_hub_download
from pathlib import Path
import torch, sys

# download what you need
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
model_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "dara_model.py")
tok_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer.model")

# add the downloaded dir to path so dara_model.py is importable
sys.path.insert(0, str(Path(model_path).parent))
from dara_model import build_dara_small

# load weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_dara_small()
ckpt = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)
print("model loaded")
```
Use model
```python
import torch, torchaudio
import torchaudio.transforms as T
import sentencepiece as spm
from huggingface_hub import hf_hub_download
from dara_model import build_dara_small

# ── download model weights ────────────────────────────────────────────────────
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
tok_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer/fa_unigram_4096/tokenizer.model")

# ── load ──────────────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_dara_small()
ckpt = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)

sp = spm.SentencePieceProcessor()
sp.load(tok_path)

# ── mel preprocessing (must match training) ───────────────────────────────────
SAMPLE_RATE, N_FFT, HOP, N_MEL = 16000, 1024, 256, 80
_RMS = 10 ** (-20.0 / 20.0)

mel_fn = T.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, win_length=N_FFT,
    hop_length=HOP, f_min=0, f_max=8000, n_mels=N_MEL,
    power=1.0, norm="slaney", mel_scale="slaney",
)

def audio_to_mel(path):
    wav, sr = torchaudio.load(path)
    if wav.shape[0] > 1:
        wav = wav.mean(0, keepdim=True)
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav * (_RMS / wav.pow(2).mean().sqrt().clamp(min=1e-9))
    return torch.log(mel_fn(wav.squeeze(0)).clamp(min=1e-5)).T  # (T, 80)

def enc_len(t):  # encoder frames after 4× subsampling
    return ((t + 1) // 2 + 1) // 2

# ── inference ─────────────────────────────────────────────────────────────────
mel = audio_to_mel("audio.wav").unsqueeze(0).to(device, dtype=torch.bfloat16)

with torch.no_grad():
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        h = model.encoder(model.subsampler(mel))
        sentinel_logits = model.sentinel(h)
        mask = model.masker(sentinel_logits)
        ctc_logits = model.ctc(h)

# CTC greedy (fast): collapse repeats, then drop blanks (blank id = 4096)
ids = ctc_logits[0, :enc_len(mel.shape[1])].argmax(-1).tolist()
col = [ids[0]] if ids else []
for t in ids[1:]:
    if t != col[-1]:
        col.append(t)
ctc_text = sp.decode([t for t in col if t != 4096])
print(f"CTC: {ctc_text}")

# Attention decoding (greedy shown here; the full inference script uses beam=5)
BOS, EOS, PAD = 2, 3, 1
tokens = torch.tensor([[BOS]], dtype=torch.long, device=device)
for _ in range(200):
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits = model.decoder(tokens, h, mask)
    next_id = logits[0, -1].float().argmax(-1).item()
    if next_id == EOS:
        break
    tokens = torch.cat([tokens, torch.tensor([[next_id]], device=device)], dim=1)

attn_ids = [t for t in tokens[0, 1:].tolist() if t not in (BOS, EOS, PAD)]
attn_text = sp.decode(attn_ids)
print(f"ATTN: {attn_text}")
```
Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                           QoQnus-DARA                           │
│                                                                 │
│  Audio (16 kHz)                                                 │
│        │                                                        │
│        ▼                                                        │
│  RMS Normalize (−20 dB)                                         │
│        │                                                        │
│        ▼                                                        │
│  Log-Mel Spectrogram (80 bands, N_FFT=1024, Slaney)             │
│        │                                                        │
│        ▼                                                        │
│  ConvSubsampler ── 2× causal conv (stride=2) → 4× reduction     │
│        │                                                        │
│        ▼                                                        │
│  Conformer Encoder ── 12 layers, d=512, 8 heads, causal         │
│        │                                                        │
│        ├──────────────────────┬──────────────────────┐          │
│        ▼                      ▼                      ▼          │
│    CTC Head           Acoustic Sentinel      Gated Decoder      │
│  (fast decode)       (speech/noise gate)    (4 layers, d=512)   │
│        │                      │                      │          │
│        │                      └──── mask ────────────┘          │
│                                                                 │
│                          Transcription                          │
└─────────────────────────────────────────────────────────────────┘
```
Component Details
ConvSubsampler
Two causal strided convolutions (kernel=3, stride=2 each) reduce time resolution by 4×.
Input: (T, 80) → Output: (T/4, 512). GELU activation + LayerNorm.
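The subsampler described above can be sketched as follows. This is an illustrative reconstruction, not the repository's `dara_model.py`; layer names and the exact causal-padding scheme are assumptions, but the 4× time reduction matches the `enc_len` formula used in the inference example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvSubsampler(nn.Module):
    """Illustrative 4× time subsampler: two strided 1-D convolutions made
    causal by padding only on the left (kernel=3, stride=2 → pad 2)."""
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = mel.transpose(1, 2)                      # (B, T, 80) → (B, 80, T)
        x = self.act(self.conv1(F.pad(x, (2, 0))))   # left-pad keeps causality
        x = self.act(self.conv2(F.pad(x, (2, 0))))
        return self.norm(x.transpose(1, 2))          # (B, ~T/4, d_model)
```

With this padding, an input of T frames yields `((T + 1) // 2 + 1) // 2` output frames, i.e. exactly `enc_len(T)` from the inference snippet.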
Conformer Encoder — 12 layers
Each block:
- Feed-forward (half-step, 512→2048→512, SiLU, dropout=0.1)
- Multi-head self-attention with RoPE (8 heads, causal, dropout=0.1)
- Depthwise conv module (kernel=31, causal, dropout=0.1)
- Feed-forward (half-step)
- Post-block LayerNorm
Causal masking enables streaming inference.
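The causal (left-context-only) attention pattern that makes streaming possible boils down to a lower-triangular mask; a minimal sketch:

```python
import torch

def causal_mask(T: int) -> torch.Tensor:
    """Boolean (T, T) mask where True marks allowed attention: each frame
    attends to itself and the past, never to future frames."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

# typical use: set disallowed scores to -inf before softmax
# scores = scores.masked_fill(~causal_mask(T), float("-inf"))
```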
Acoustic Sentinel — the novel component
A 2-layer Transformer classifier (d=256, 4 heads) that predicts per-frame speech probability. Trained with BCE loss to distinguish speech from non-speech (MUSAN music noise).
The sentinel output gates the decoder's cross-attention — non-speech frames are masked to zero, preventing the decoder from hallucinating on silence or background noise.
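The gating step can be illustrated as follows; the real masker may use a different threshold or soft gating, so the names, shapes, and 0.5 cutoff here are assumptions:

```python
import torch

def sentinel_gate(h: torch.Tensor, sentinel_logits: torch.Tensor,
                  threshold: float = 0.5):
    """Zero out encoder frames the sentinel scores as non-speech, and return
    the boolean keep-mask for the decoder's cross-attention.
    h: (B, T, d) encoder output; sentinel_logits: (B, T) speech logits."""
    keep = torch.sigmoid(sentinel_logits) >= threshold  # True = speech frame
    return h * keep.unsqueeze(-1), keep
```

On pure silence or noise every frame is masked, so the decoder has nothing to cross-attend to and cannot hallucinate text.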
Gated Decoder — 4 layers
Autoregressive decoder:
- Embedding dropout (p=0.1)
- Causal self-attention with RoPE (8 heads, dropout=0.1)
- Gated cross-attention over sentinel-masked encoder output (8 heads)
- Feed-forward (512→2048→512, GELU, dropout=0.1)
- Pre-norm (LayerNorm before each sub-layer)
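How a sentinel keep-mask could feed the decoder's cross-attention, sketched with PyTorch's stock `nn.MultiheadAttention` as a stand-in (DARA's gated-attention module is custom and not reproduced here):

```python
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Illustrative cross-attention that ignores non-speech encoder frames:
    the inverted keep-mask is passed as key_padding_mask, so masked frames
    receive zero attention weight."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, dec_states, enc_states, keep_mask):
        # key_padding_mask expects True where keys should be IGNORED
        out, _ = self.attn(dec_states, enc_states, enc_states,
                           key_padding_mask=~keep_mask)
        return out
```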
Parameter breakdown
| Component | Parameters |
|---|---|
| ConvSubsampler | ~0.8M |
| Conformer Encoder (12 layers) | ~85M |
| CTC Head | ~2.1M |
| Acoustic Sentinel | ~2.5M |
| Gated Decoder (4 layers) | ~105M |
| Total | ~195.8M |
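The per-component figures above can be reproduced on any loaded module with a one-liner:

```python
import torch.nn as nn

def count_params(module: nn.Module) -> float:
    """Trainable parameter count in millions."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad) / 1e6

# e.g. count_params(model.encoder), count_params(model.decoder), ...
```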
QoQnus SpeechLine
| Model | Description | Size · Architecture | Status |
|---|---|---|---|
| QoQnus-DARA | Dual-head ASR · Robust Acoustics | 195.8M · Conformer + Sentinel | 🤗 This model |
| QoQnus-Moonshine | Streaming ASR · Informal Persian | 108M · Moonshine Streaming | Coming soon |
| QoQnus-DARA-Base | Dual-head ASR · Larger capacity | ~350M · Conformer + Sentinel | In development |
All models in the SpeechLine are trained on GinkgoQ/Qoqnus — a ~2,000-hour Persian speech corpus curated and released by GinkgoQ.
Dataset — GinkgoQ/Qoqnus
| Bucket | Utterances | Hours | SNR |
|---|---|---|---|
| train_high | ~1,083,351 | ~1,260h | ≥ 40 dB |
| train_medium | ~631,573 | ~740h | 30–40 dB |
| val | ~15,262 | ~18h | — |
| test | ~33,000 | ~39h | — |
Sources: Common Voice 13/17, VHDM, Pourmand YouTube, Srezas (YouTube/Fleurs/Yazdi), KiarashQ, Mana-TTS, GPTInformal, Mshojaei, Thomcles, PERTTS.
Training Phases
| Phase | Data | Steps | Peak LR | Augmentation |
|---|---|---|---|---|
| 1 | train_high | 150,000 | 1e-3 | None |
| 2 | train_high + 5% MUSAN | 50,000 | 3e-4 | None |
| 3a | train_high + train_medium | 20,000 | 5e-5 | None |
| Noise v2 | train_high + train_medium | 50,000 | 3e-5 | Curriculum noise |
Loss Function
```
L_total = 0.3 × L_CTC + 0.7 × L_attention + 0.3 × L_sentinel
```

- `L_CTC`: CTC loss (blank id = vocab_size = 4096)
- `L_attention`: cross-entropy with label_smoothing=0.1
- `L_sentinel`: BCE against the frame-level speech/non-speech mask
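Put together, the training objective looks roughly like this. The weights come from the card; tensor shapes and the PAD/blank conventions are taken from the inference example and are otherwise assumptions:

```python
import torch
import torch.nn.functional as F

def dara_loss(ctc_log_probs, targets, input_lens, target_lens,
              attn_logits, attn_targets, sentinel_logits, speech_mask,
              vocab_size=4096):
    """Weighted sum of the three objectives (0.3 / 0.7 / 0.3 per the card).
    Assumed shapes: ctc_log_probs (T, B, V+1), attn_logits (B, U, V),
    sentinel_logits and speech_mask (B, T_enc), speech_mask in {0, 1}."""
    l_ctc = F.ctc_loss(ctc_log_probs, targets, input_lens, target_lens,
                       blank=vocab_size)                       # blank = vocab size
    l_attn = F.cross_entropy(attn_logits.flatten(0, 1), attn_targets.flatten(),
                             label_smoothing=0.1, ignore_index=1)  # PAD = 1
    l_sent = F.binary_cross_entropy_with_logits(sentinel_logits, speech_mask)
    return 0.3 * l_ctc + 0.7 * l_attn + 0.3 * l_sent
```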
Key Hyperparameters
```python
optimizer = AdamW(betas=(0.9, 0.98), weight_decay=1e-2)
# weight decay applied to weight matrices only — biases and LayerNorm excluded
scheduler = cosine_decay_with_warmup
precision = bfloat16
grad_clip = 1.0
dropout = 0.1   # all encoder and decoder modules
label_smooth = 0.1
```
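The decay-exclusion rule in the comment above is typically implemented with two parameter groups; a sketch using the simple `ndim` heuristic (1-D parameters are biases and norm scales):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, lr: float = 1e-3):
    """AdamW with weight decay on weight matrices only: biases and
    LayerNorm (all 1-D parameters) are excluded from decay."""
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim < 2 else decay).append(p)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": 1e-2},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.98))
```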
Text Normalization
Applied to both training targets and inference output:
- NFKC Unicode normalization
- Arabic → Persian substitution (ك→ک, ي→ی, ة→ه, أ/إ/آ→ا, ؤ→و, ئ→ی)
- Diacritic removal (harakat, shadda, tanwin)
- ASCII digits → Persian digits (0→۰ … 9→۹)
- Lowercase ASCII (Latin loanwords)
- Whitespace normalization
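The pipeline above, as a sketch. The character tables follow the list directly; the project's exact normalizer may differ in edge cases:

```python
import re
import unicodedata

ARABIC_TO_PERSIAN = str.maketrans({
    "ك": "ک", "ي": "ی", "ة": "ه",
    "أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "و", "ئ": "ی",
})
ASCII_TO_PERSIAN_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # tanwin, harakat, shadda, sukun

def normalize_fa(text: str) -> str:
    """Persian text normalization following the steps listed above."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(ARABIC_TO_PERSIAN)
    text = DIACRITICS.sub("", text)
    text = text.translate(ASCII_TO_PERSIAN_DIGITS).lower()
    return " ".join(text.split())
```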
Limitations
Known limitations
- Noise robustness: Trained primarily on clean speech (SNR ≥ 30 dB). Performance degrades below ~20 dB SNR. Noise-robust variants under active development.
- Informal/dialectal speech: Handles colloquial Persian but was primarily trained on formal/broadcast speech.
- Proper nouns: Rare names and places may be substituted with phonetically similar common words.
- Long-form audio: Optimized for utterances up to ~15 seconds. Use VAD segmentation for longer audio.
- Causal encoder: Left-context-only attention enables streaming but is slightly less accurate than bidirectional models on offline tasks.
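For long-form audio the card recommends VAD segmentation. A naive energy-based stand-in (a real VAD such as silero-vad would do better) that cuts at the quietest point near each 15 s mark:

```python
import torch

def split_long_audio(wav: torch.Tensor, sr: int = 16000,
                     max_sec: float = 15.0, frame_ms: int = 30):
    """Split a 1-D waveform into chunks of at most max_sec seconds, cutting
    at the lowest-energy frame in the last second of each window."""
    frame = int(sr * frame_ms / 1000)
    max_len = int(sr * max_sec)
    chunks, start = [], 0
    while len(wav) - start > max_len:
        # search the last second of the window for the quietest frame
        window = wav[start + max_len - sr: start + max_len]
        energies = window.unfold(0, frame, frame).pow(2).mean(-1)
        cut = start + max_len - sr + int(energies.argmin()) * frame
        chunks.append(wav[start:cut])
        start = cut
    chunks.append(wav[start:])
    return chunks
```

Each chunk can then be fed to `audio_to_mel` and decoded independently.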
Hardware & Training Time
Trained on a single NVIDIA GeForce RTX 4090 Laptop GPU (16 GB VRAM).
| Phase | Duration |
|---|---|
| Phase 1 (150k steps) | ~33 hours |
| Phase 2 (50k steps) | ~11 hours |
| Phase 3a (20k steps) | ~3 hours |
| Noise training (50k steps) | ~28 hours |
| Total | ~75 hours |
About GinkgoQ
GinkgoQ is an AI research initiative focused on Persian language technology. The QoQnus (ققنوس) brand — named after the Persian phoenix — represents a family of speech and language models built from the ground up for the Persian-speaking community.
| Resource | Link |
|---|---|
| 🤗 Organization | huggingface.co/GinkgoQ |
| 📦 Training Dataset | GinkgoQ/Qoqnus |
| 🔊 This Model | GinkgoQ/QoQnus-DARA |
Citation
```bibtex
@misc{QoQnus-DARA-2026,
  title        = {QoQnus-DARA: Dual-head ASR with Robust Acoustics for Persian},
  author       = {GinkgoQ},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GinkgoQ/QoQnus-DARA}},
  note         = {Trained on GinkgoQ/Qoqnus Persian speech corpus}
}
```
QoQnus SpeechLine · Built by GinkgoQ
QoQnus (ققنوس) — the Persian phoenix · rising from silence into speech