
QoQnus-DARA

Dual-head ASR with Robust Acoustics  ·  Persian (فارسی)  ·  195.8M params  ·  by GinkgoQ



Overview

QoQnus-DARA is a Persian ASR model built from scratch by GinkgoQ, part of the QoQnus SpeechLine — a family of Persian speech models trained on the GinkgoQ/Qoqnus corpus (~2,000 hours, 1.7M+ utterances).

The architecture introduces an Acoustic Sentinel — a lightweight speech/non-speech classifier that gates the attention decoder, preventing hallucination on silent or noisy inputs. Combined with a causal Conformer encoder and a dual CTC+attention training objective, DARA achieves strong accuracy with streaming-capable inference.

| Property | Value |
|---|---|
| Language | Persian · فارسی |
| Architecture | Causal Conformer + Gated Decoder + Acoustic Sentinel |
| Parameters | 195.8M |
| Vocabulary | 4,096 SentencePiece Unigram tokens |
| Input | 16 kHz mono, RMS-normalized to −20 dB |
| Mel features | 80-band log-mel · N_FFT=1024 · HOP=256 · Slaney norm |
| Decoding | CTC greedy + Attention beam search (beam=5) |
| Training data | GinkgoQ/Qoqnus |
| License | CC BY-SA 4.0 |

Benchmark Results

Evaluated on 5,867 samples from gpt_informal_train (informal conversational Persian) and 1,000 samples from hezarai_cv13_test (Common Voice Persian test set). All models use identical text normalization.

Informal Persian (gpt_informal_train · 5,867 samples)

| Model | WER ↓ | CER ↓ | Median WER | Perfect (WER=0) | Speed |
|---|---|---|---|---|---|
| QoQnus-DARA (ours) | 34.41% | 17.88% | 27.27% | 949/5867 (16.2%) | 6.5 ms |
| QoQnus-Moonshine (ours) | 34.93% | 10.28% | 25.00% | 1790/5867 (30.5%) | 12.3 ms |
| vhdm/whisper-large-fa-v1 | 31.81% | 15.42% | 25.00% | 879/5867 (15.0%) | 58.7 ms |
| jonatasgrosman/wav2vec2-large-xlsr-53-persian | 37.81% | 12.49% | 30.00% | 664/5867 (11.3%) | 17.4 ms |
| m3hrdadfi/wav2vec2-large-xlsr-persian-v3 | 40.51% | 14.65% | 28.57% | 666/5867 (11.4%) | 17.4 ms |
| nvidia/stt_fa_fastconformer_hybrid_large | 46.62% | 23.39% | 41.67% | 297/5867 (5.1%) | 9.6 ms |

QoQnus-DARA is 9× faster than Whisper-large-fa while achieving competitive WER. On formal speech (hezarai_cv13_test), QoQnus-DARA achieves 17.30% WER / 7.51% CER.

Common Voice Persian Test (hezarai_cv13_test · 1,000 samples)

| Metric | Value |
|---|---|
| Average WER | 17.30% |
| Average CER | 7.51% |
| Median WER | 0.00% |
| Median CER | 0.00% |
| Perfect predictions (WER=0) | 595/1000 (59.5%) |
| Throughput | ~154 samples/sec |

Median WER of 0% — the majority of utterances are transcribed perfectly.


Quick Start

# QoQnus-DARA cannot yet be loaded through transformers' AutoProcessor /
# AutoModel classes; a HuggingFace-native wrapper is coming soon.
# For now, run the full DARA pipeline with the inference script below.

Using the inference script

pip install torch torchaudio sentencepiece transformers
git clone https://huggingface.co/GinkgoQ/QoQnus-DARA
cd QoQnus-DARA

Load model

from huggingface_hub import hf_hub_download
import torch, sys
from pathlib import Path

# download what you need
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
model_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "dara_model.py")
tok_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer.model")

# add the downloaded dir to path so dara_model.py is importable
sys.path.insert(0, str(Path(model_path).parent))
from dara_model import build_dara_small

# load weights
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)
print("model loaded")

Use model

import torch, torchaudio, unicodedata, re
import sentencepiece as spm
from huggingface_hub import hf_hub_download
from dara_model import build_dara_small  # requires the sys.path setup from the snippet above

# ── download model weights ────────────────────────────────────────────────────
ckpt_path = hf_hub_download("GinkgoQ/QoQnus-DARA", "best.pt")
tok_path  = hf_hub_download("GinkgoQ/QoQnus-DARA", "tokenizer/fa_unigram_4096/tokenizer.model")

# ── load ──────────────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = build_dara_small()
ckpt   = torch.load(ckpt_path, map_location="cpu")
model.load_state_dict(ckpt["model_state"])
model.eval().to(device).to(torch.bfloat16)

sp = spm.SentencePieceProcessor()
sp.load(tok_path)

# ── mel preprocessing (must match training) ───────────────────────────────────
import torchaudio.transforms as T
import numpy as np

SAMPLE_RATE, N_FFT, HOP, N_MEL = 16000, 1024, 256, 80
_RMS = 10 ** (-20.0 / 20.0)

mel_fn = T.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, win_length=N_FFT,
    hop_length=HOP, f_min=0, f_max=8000, n_mels=N_MEL,
    power=1.0, norm="slaney", mel_scale="slaney",
)

def audio_to_mel(path):
    wav, sr = torchaudio.load(path)
    if wav.shape[0] > 1: wav = wav.mean(0, keepdim=True)
    if sr != SAMPLE_RATE:
        wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
    wav = wav * (_RMS / wav.pow(2).mean().sqrt().clamp(min=1e-9))
    return torch.log(mel_fn(wav.squeeze(0)).clamp(min=1e-5)).T  # (T, 80)

def enc_len(t): return ((t + 1) // 2 + 1) // 2  # frame count after 4× conv subsampling

# ── inference ─────────────────────────────────────────────────────────────────
mel = audio_to_mel("audio.wav").unsqueeze(0).to(device, dtype=torch.bfloat16)

with torch.no_grad():
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        h               = model.encoder(model.subsampler(mel))
        sentinel_logits = model.sentinel(h)
        mask            = model.masker(sentinel_logits)
        ctc_logits      = model.ctc(h)

# CTC greedy (fast): collapse repeats, then drop the blank token (id 4096)
ids = ctc_logits[0, :enc_len(mel.shape[1])].argmax(-1).tolist()
col = [ids[0]] if ids else []
for t in ids[1:]:
    if t != col[-1]: col.append(t)
ctc_text = sp.decode([t for t in col if t != 4096])
print(f"CTC:  {ctc_text}")

# Attention decoding (greedy shown here for simplicity; reported results use beam=5)
BOS, EOS, PAD = 2, 3, 1
tokens = torch.tensor([[BOS]], dtype=torch.long, device=device)
for _ in range(200):
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits = model.decoder(tokens, h, mask)
    next_id = logits[0, -1].float().argmax(-1).item()
    if next_id == EOS:
        break
    tokens = torch.cat([tokens, torch.tensor([[next_id]], device=device)], dim=1)
attn_ids = [t for t in tokens[0, 1:].tolist() if t not in (BOS, EOS, PAD)]
attn_text = sp.decode(attn_ids)
print(f"ATTN: {attn_text}")

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        QoQnus-DARA                              │
│                                                                 │
│  Audio (16kHz)                                                  │
│      │                                                          │
│      ▼                                                          │
│  RMS Normalize (−20 dB)                                         │
│      │                                                          │
│      ▼                                                          │
│  Log-Mel Spectrogram (80 bands, N_FFT=1024, Slaney)             │
│      │                                                          │
│      ▼                                                          │
│  ConvSubsampler  ──  2× causal conv (stride=2) → 4× reduction  │
│      │                                                          │
│      ▼                                                          │
│  Conformer Encoder  ──  12 layers, d=512, 8 heads, causal       │
│      │                                                          │
│      ├──────────────────────┬──────────────────────┐           │
│      ▼                      ▼                      ▼           │
│  CTC Head              Acoustic Sentinel      Gated Decoder     │
│  (fast decode)         (speech/noise gate)    (4 layers, d=512) │
│                              │                      │           │
│                              └──── mask ────────────┘           │
│                                                                 │
│                         Transcription                           │
└─────────────────────────────────────────────────────────────────┘

Component Details

ConvSubsampler

Two causal strided convolutions (kernel=3, stride=2 each) reduce time resolution by 4×. Input: (T, 80) → Output: (T/4, 512). GELU activation + LayerNorm.
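
The subsampler described above can be sketched as follows. This is an illustrative re-implementation, not the released code: layer names and the left-padding detail are assumptions, chosen so that each output frame depends only on past input. For an input of T frames the output length is ((T + 1) // 2 + 1) // 2, which matches the `enc_len` helper in the inference script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSubsampler(nn.Module):
    """Sketch: two causal strided convs reducing time resolution by 4x."""
    def __init__(self, n_mels: int = 80, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2)
        self.norm = nn.LayerNorm(d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, n_mels)
        x = x.transpose(1, 2)                       # (B, n_mels, T)
        # left-pad by kernel-1 so no output frame sees future input
        x = self.act(self.conv1(F.pad(x, (2, 0))))  # (B, d_model, ~T/2)
        x = self.act(self.conv2(F.pad(x, (2, 0))))  # (B, d_model, ~T/4)
        return self.norm(x.transpose(1, 2))         # (B, ~T/4, d_model)

out = CausalSubsampler()(torch.randn(1, 100, 80))
# 100 mel frames -> 25 encoder frames, matching enc_len(100) == 25
```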

Conformer Encoder — 12 layers

Each block:

  • Feed-forward (half-step, 512→2048→512, SiLU, dropout=0.1)
  • Multi-head self-attention with RoPE (8 heads, causal, dropout=0.1)
  • Depthwise conv module (kernel=31, causal, dropout=0.1)
  • Feed-forward (half-step)
  • Post-block LayerNorm

Causal masking enables streaming inference.
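
A left-context-only mask of the kind described above can be built in a few lines; this is a minimal illustration (not the model's internal code), using the boolean-mask convention of PyTorch's attention APIs, where `True` marks a blocked position.

```python
import torch

def causal_mask(T: int) -> torch.Tensor:
    """Frame t may attend only to frames <= t; True = blocked."""
    return torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)
# row t blocks every frame after t:
# [[False,  True,  True,  True],
#  [False, False,  True,  True],
#  [False, False, False,  True],
#  [False, False, False, False]]
```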

Acoustic Sentinel — the novel component

A 2-layer Transformer classifier (d=256, 4 heads) that predicts per-frame speech probability. Trained with BCE loss to distinguish speech from non-speech (MUSAN music noise).

The sentinel output gates the decoder's cross-attention — non-speech frames are masked to zero, preventing the decoder from hallucinating on silence or background noise.
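
The gating step can be sketched as below. This is a hedged illustration of the idea, not DARA's actual `masker`: the 0.5 threshold and the hard (rather than soft) gate are assumptions.

```python
import torch

def sentinel_to_mask(sentinel_logits: torch.Tensor,
                     threshold: float = 0.5) -> torch.Tensor:
    """sentinel_logits: (B, T_enc, 1) per-frame speech logits.
    Returns a (B, T_enc) bool mask; True = attendable speech frame."""
    speech_prob = torch.sigmoid(sentinel_logits.squeeze(-1))
    return speech_prob >= threshold

# a confident-speech, confident-noise, speech sequence:
logits = torch.tensor([[[3.0], [-4.0], [2.0]]])
mask = sentinel_to_mask(logits)
# [[True, False, True]] — the noise frame is excluded from cross-attention
```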

Gated Decoder — 4 layers

Autoregressive decoder:

  • Embedding dropout (p=0.1)
  • Causal self-attention with RoPE (8 heads, dropout=0.1)
  • Gated cross-attention over sentinel-masked encoder output (8 heads)
  • Feed-forward (512→2048→512, GELU, dropout=0.1)
  • Pre-norm (LayerNorm before each sub-layer)
Parameter breakdown

| Component | Parameters |
|---|---|
| ConvSubsampler | ~0.8M |
| Conformer Encoder (12 layers) | ~85M |
| CTC Head | ~2.1M |
| Acoustic Sentinel | ~2.5M |
| Gated Decoder (4 layers) | ~105M |
| Total | ~195.8M |

QoQnus SpeechLine

QoQnus-DARA
Dual-head ASR · Robust Acoustics
195.8M · Conformer + Sentinel
🤗 This model
QoQnus-Moonshine
Streaming ASR · Informal Persian
108M · Moonshine Streaming
Coming soon
QoQnus-DARA-Base
Dual-head ASR · Larger capacity
~350M · Conformer + Sentinel
In development

All models in the SpeechLine are trained on GinkgoQ/Qoqnus, a ~2,000-hour Persian speech corpus curated and released by GinkgoQ.


Dataset — GinkgoQ/Qoqnus

| Bucket | Utterances | Hours | SNR |
|---|---|---|---|
| train_high | ~1,083,351 | ~1,260h | ≥ 40 dB |
| train_medium | ~631,573 | ~740h | 30–40 dB |
| val | ~15,262 | ~18h | |
| test | ~33,000 | ~39h | |

Sources: Common Voice 13/17, VHDM, Pourmand YouTube, Srezas (YouTube/Fleurs/Yazdi), KiarashQ, Mana-TTS, GPTInformal, Mshojaei, Thomcles, PERTTS.

Training Phases

| Phase | Data | Steps | Peak LR | Augmentation |
|---|---|---|---|---|
| 1 | train_high | 150,000 | 1e-3 | None |
| 2 | train_high + 5% MUSAN | 50,000 | 3e-4 | None |
| 3a | train_high + train_medium | 20,000 | 5e-5 | None |
| Noise v2 | train_high + train_medium | 50,000 | 3e-5 | Curriculum noise |

Loss Function

L_total = 0.3 × L_CTC  +  0.7 × L_attention  +  0.3 × L_sentinel

L_CTC:       CTC loss (blank = vocab_size = 4096)
L_attention: Cross-entropy with label_smoothing=0.1
L_sentinel:  BCE with frame-level speech/non-speech mask
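
The weighted sum above can be wired up roughly as follows. The blank id (4096), label smoothing (0.1), PAD id (1), and loss weights come from this card; the tensor shapes, reduction, and `zero_infinity` flag are assumptions for the sketch.

```python
import torch
import torch.nn as nn

ctc_loss      = nn.CTCLoss(blank=4096, zero_infinity=True)
attn_loss     = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=1)
sentinel_loss = nn.BCEWithLogitsLoss()

def total_loss(ctc_logits, ctc_targets, in_lens, tgt_lens,
               attn_logits, attn_targets,
               sentinel_logits, speech_mask):
    # CTCLoss wants (T, B, V) log-probs
    l_ctc  = ctc_loss(ctc_logits.log_softmax(-1).transpose(0, 1),
                      ctc_targets, in_lens, tgt_lens)
    l_attn = attn_loss(attn_logits.flatten(0, 1), attn_targets.flatten())
    l_sent = sentinel_loss(sentinel_logits.squeeze(-1), speech_mask.float())
    return 0.3 * l_ctc + 0.7 * l_attn + 0.3 * l_sent
```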

Key Hyperparameters

optimizer    = AdamW(betas=(0.9, 0.98), weight_decay=1e-2)
# weight decay applied to weight matrices only — biases and LayerNorm excluded
scheduler    = cosine_decay_with_warmup
precision    = bfloat16
grad_clip    = 1.0
dropout      = 0.1  # all encoder and decoder modules
label_smooth = 0.1
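
The cosine-decay-with-warmup schedule named above looks roughly like this; the peak LR matches Phase 1, but `warmup_steps` and `total_steps` are assumed values for illustration.

```python
import math

def lr_at(step: int, peak_lr: float = 1e-3, warmup_steps: int = 2000,
          total_steps: int = 150_000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# lr_at(0) == 0.0, lr_at(2000) == 1e-3, lr_at(150_000) == 0.0
```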

Text Normalization

Applied to both training targets and inference output:

  1. NFKC Unicode normalization
  2. Arabic → Persian substitution (ك→ک, ي→ی, ة→ه, أ/إ/آ→ا, ؤ→و, ئ→ی)
  3. Diacritic removal (harakat, shadda, tanwin)
  4. ASCII digits → Persian digits (0→۰ … 9→۹)
  5. Lowercase ASCII (Latin loanwords)
  6. Whitespace normalization
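
The six steps can be approximated with the sketch below. This is a hypothetical re-implementation for illustration; the exact mapping tables used in training may differ.

```python
import re
import unicodedata

ARABIC_TO_PERSIAN = str.maketrans({
    "ك": "ک", "ي": "ی", "ة": "ه",
    "أ": "ا", "إ": "ا", "آ": "ا", "ؤ": "و", "ئ": "ی",
})
ASCII_TO_PERSIAN_DIGITS = str.maketrans("0123456789", "۰۱۲۳۴۵۶۷۸۹")
# harakat, shadda, tanwin all fall in U+064B..U+0652
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize_fa(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)        # 1. NFKC
    text = text.translate(ARABIC_TO_PERSIAN)          # 2. Arabic -> Persian
    text = DIACRITICS.sub("", text)                   # 3. drop diacritics
    text = text.translate(ASCII_TO_PERSIAN_DIGITS)    # 4. ASCII -> Persian digits
    text = text.lower()                               # 5. lowercase Latin
    return re.sub(r"\s+", " ", text).strip()          # 6. whitespace
```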

Limitations

  • Noise robustness: Trained primarily on clean speech (SNR ≥ 30 dB). Performance degrades below ~20 dB SNR. Noise-robust variants under active development.
  • Informal/dialectal speech: Handles colloquial Persian but was primarily trained on formal/broadcast speech.
  • Proper nouns: Rare names and places may be substituted with phonetically similar common words.
  • Long-form audio: Optimized for utterances up to ~15 seconds. Use VAD segmentation for longer audio.
  • Causal encoder: Left-context-only attention enables streaming but is slightly less accurate than bidirectional models on offline tasks.
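
For long-form audio, the "segment, then transcribe each chunk" workflow can be sketched with a naive energy-based splitter; a proper VAD (e.g. Silero) is preferable in practice, and the frame size, threshold, and 15-second cap below are illustrative assumptions.

```python
import torch

def split_on_silence(wav: torch.Tensor, sr: int = 16000, frame_ms: int = 30,
                     thresh: float = 1e-4, max_len_s: float = 15.0):
    """Return (start, end) sample ranges of voiced regions, each <= max_len_s."""
    frame = int(sr * frame_ms / 1000)
    n = wav.numel() // frame
    energy = wav[: n * frame].view(n, frame).pow(2).mean(-1)
    voiced = (energy > thresh).tolist() + [False]  # sentinel closes last segment
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            segments.append((start * frame, i * frame))
            start = None
    # split anything longer than the model's ~15 s sweet spot
    out, max_len = [], int(max_len_s * sr)
    for s, e in segments:
        while e - s > max_len:
            out.append((s, s + max_len))
            s += max_len
        out.append((s, e))
    return out
```

Each returned range can then be fed through `audio_to_mel` and decoded independently.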

Hardware & Training Time

Trained on a single NVIDIA GeForce RTX 4090 Laptop GPU (16 GB VRAM).

Phase Duration
Phase 1 (150k steps) ~33 hours
Phase 2 (50k steps) ~11 hours
Phase 3a (20k steps) ~3 hours
Noise training (50k steps) ~28 hours
Total ~75+ hours

About GinkgoQ


GinkgoQ is an AI research initiative focused on Persian language technology. The QoQnus (ققنوس) brand — named after the Persian phoenix — represents a family of speech and language models built from the ground up for the Persian-speaking community.

| Resource | Link |
|---|---|
| 🤗 Organization | huggingface.co/GinkgoQ |
| 📦 Training Dataset | GinkgoQ/Qoqnus |
| 🔊 This Model | GinkgoQ/QoQnus-DARA |

Citation

@misc{QoQnus-DARA-2026,
  title        = {QoQnus-DARA: Dual-head ASR with Robust Acoustics for Persian},
  author       = {GinkgoQ},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GinkgoQ/QoQnus-DARA}},
  note         = {Trained on GinkgoQ/Qoqnus Persian speech corpus}
}

QoQnus SpeechLine · Built by GinkgoQ

QoQnus (ققنوس) — the Persian phoenix · rising from silence into speech
