🧪 Benchmarks
Define fixed test sets, metrics, and leaderboard generation scripts.
📦 Result Storage
- Result format guide: results/README.md
- JSON schema: schema/benchmark_result.schema.json
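For illustration, a stored result might look like the following minimal sketch. All field names here (`benchmark`, `metrics`, `normalization_policy`, and so on) are assumptions for the example and must be checked against `schema/benchmark_result.schema.json` before use:

```python
import json

# Hypothetical result record; field names are illustrative assumptions,
# not taken from schema/benchmark_result.schema.json.
record = {
    "benchmark": "fleurs",
    "language": "ps_af",
    "split": "test",
    "model": "whisper-small",
    "checkpoint": "step-20000",
    "metrics": {"wer": 0.41, "cer": 0.18},
    "normalization_policy": "v1",
}

# Serialize for storage under results/; JSON round-trips this losslessly.
serialized = json.dumps(record, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
```

Validating each record against the schema at write time keeps the leaderboard scripts simple, since they can then trust every file under `results/`.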
✅ Verified Benchmark Sources
🔸 FLEURS (Pashto speech benchmark)
- Dataset: huggingface.co/datasets/google/fleurs
- Pashto validation: fleurs.py includes `ps_af`.
- Primary use: multilingual ASR benchmark with fixed split conventions.
📘 Belebele (Pashto reading benchmark)
- Dataset: huggingface.co/datasets/facebook/belebele
- Pashto validation: subset includes `pbt_Arab`.
- Primary use: comprehension benchmark for multilingual NLP models.
🌍 FLORES-200 (Pashto translation benchmark)
- Dataset/language list: github.com/facebookresearch/flores/tree/main/flores200
- Pashto validation: language list includes `pbt_Arab`.
- Primary use: fixed-reference MT evaluation for Pashto translation experiments.
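To make the fixed-reference MT evaluation concrete, here is a minimal sentence-level chrF sketch (character n-gram F-beta averaged over n = 1..6). It is an illustration only; reportable scores should come from a standard implementation such as sacreBLEU:

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams over the string with all whitespace removed.
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level chrF: average character n-gram
    F-beta over n = 1..max_n. A sketch, not a sacreBLEU replacement."""
    f_scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # sentence too short for this order
        prec, rec = overlap / hyp_total, overlap / ref_total
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100.0 * sum(f_scores) / len(f_scores) if f_scores else 0.0
```

Because chrF operates on characters rather than words, it is comparatively robust to Pashto's orthographic and spacing variation, which is one reason it is listed alongside BLEU and COMET.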
🗣️ Common Voice Pashto v24
- Dataset: Mozilla Data Collective - Common Voice Pashto 24.0
- Primary use: ASR train/dev/test experiments and project baseline tracking.
📊 Recommended Metrics
- ASR: WER, CER
- TTS: MCD/objective proxies + human MOS-style scoring
- NLP: task-specific accuracy/F1 with fixed test set
- MT: BLEU, chrF, COMET
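The ASR metrics above can both be derived from a Levenshtein edit distance, as in this small sketch; for reportable numbers, an established library (e.g. `jiwer`) with the project's normalization policy applied first is preferable:

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    # Word error rate: edit distance over whitespace tokens / reference length.
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference, hypothesis):
    # Character error rate: the same distance computed over characters.
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```

Note that WER and CER are only comparable across runs when the same normalization policy version is applied to both reference and hypothesis, which is why that version is part of the reporting template.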
🧾 Reporting Template
- Benchmark dataset + version
- Model + checkpoint version
- Normalization policy version
- Metrics and error analysis summary
- Reproducible command/config reference