Committed by: musaw
Commit: fb472d7
Parent(s): 5bb5a63
Add validated Pashto resources across datasets models and benchmarks
- benchmarks/README.md +6 -0
- data/README.md +18 -0
- docs/resource_catalog.md +8 -1
- resources/benchmarks/README.md +1 -0
- resources/datasets/README.md +3 -0
- resources/models/README.md +5 -3
benchmarks/README.md
CHANGED
@@ -18,6 +18,11 @@ Define fixed test sets, metrics, and leaderboard generation scripts.
 - Pashto validation: subset includes `pbt_Arab`.
 - Primary use: comprehension benchmark for multilingual NLP models.
 
+### FLORES-200 (Pashto translation benchmark)
+- Dataset/language list: [facebookresearch/flores/tree/main/flores200](https://github.com/facebookresearch/flores/tree/main/flores200)
+- Pashto validation: language list includes `pbt_Arab`.
+- Primary use: fixed-reference MT evaluation for Pashto translation experiments.
+
 ### 🗣️ Common Voice Pashto v24
 - Dataset: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
 - Primary use: ASR train/dev/test experiments and project baseline tracking.
@@ -26,6 +31,7 @@ Define fixed test sets, metrics, and leaderboard generation scripts.
 - ASR: `WER`, `CER`
 - TTS: `MCD`/objective proxies + human MOS-style scoring
 - NLP: task-specific accuracy/F1 with fixed test set
+- MT: `BLEU`, `chrF`, `COMET`
 
 ## 🧾 Reporting Template
 - Benchmark dataset + version
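Review note: the metrics list in the benchmarks/README.md diff above names `WER` and `CER` for ASR without defining them. As a minimal sketch, assuming the usual definition (Levenshtein edit distance divided by reference length, over words for WER and over characters for CER), they could be computed as below. The function names `edit_distance`, `wer`, and `cer` are illustrative and are not part of this repository; any text normalization applied before scoring is left out and would need to follow the project's own policy.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    # prev[j] holds the distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion from the reference
                    curr[j - 1] + 1,     # insertion into the reference
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because both rates divide by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.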
data/README.md
CHANGED
@@ -35,6 +35,24 @@
 - Why useful: useful downstream benchmark for comprehension-oriented NLP progress in Pashto.
 - How to use here: benchmark multilingual encoders and track improvements in [benchmarks/](../benchmarks/README.md).
 
+### OPUS-100 (parallel text, en-ps)
+- Link: [huggingface.co/datasets/Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
+- Pashto validation: dataset viewer includes `en-ps` subset.
+- Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
+- How to use here: keep in external eval/training split plans and log subset/version in run cards.
+
+### Pashto Isolated Words Speech Dataset (Kaggle)
+- Link: [kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset)
+- Pashto validation: dataset title is explicitly Pashto isolated-word speech.
+- Why useful: useful for small-footprint ASR or keyword-spotting experiments.
+- How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.
+
+### Pashto Word Embeddings (Kaggle)
+- Link: [kaggle.com/datasets/drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings)
+- Pashto validation: dataset description states pretrained Pashto embeddings.
+- Why useful: quick-start lexical semantics baseline for NLP experiments.
+- How to use here: benchmark against transformer encoders in downstream Pashto tasks.
+
 ## First Contribution (Normalization Starter)
 - [processed/normalization_seed_v0.1.tsv](processed/normalization_seed_v0.1.tsv) starter normalization examples
 - [docs/pashto_normalization_v0.1.md](../docs/pashto_normalization_v0.1.md) baseline normalization policy
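Review note: the OPUS-100 entry in the data/README.md diff above asks contributors to "log subset/version in run cards", but the diff does not show what a run card looks like. The sketch below is one hypothetical shape for such a record. The `RunCard` class, its field names, and the placeholder values for `revision` and `split` are assumptions made for illustration, not a format defined by this repository.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunCard:
    """Hypothetical record of which external data a run used."""

    dataset: str   # dataset identifier as listed in the resource catalog
    subset: str    # language-pair or language subset
    revision: str  # dataset version or commit; placeholder until pinned
    split: str     # which split the run consumed
    purpose: str   # short free-text description of the run


def to_json(card: RunCard) -> str:
    """Serialize a run card to stable, diff-friendly JSON."""
    return json.dumps(asdict(card), indent=2, sort_keys=True)


card = RunCard(
    dataset="Helsinki-NLP/opus-100",
    subset="en-ps",
    revision="unpinned",  # assumption: replace with the revision actually used
    split="test",         # assumption: illustrative value only
    purpose="MT baseline evaluation",
)
print(to_json(card))
```

Sorting keys keeps the serialized card stable across runs, so changes to a subset or revision show up as one-line diffs.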
docs/resource_catalog.md
CHANGED
@@ -1,6 +1,6 @@
 # Verified Pashto Resource Catalog
 
-Last updated: `2026-02-
+Last updated: `2026-02-15`
 
 This index points to validated Pashto-related resources tracked in structured files.
 
@@ -27,3 +27,10 @@ Before each release:
 - Confirm links still resolve.
 - Confirm Pashto support markers remain valid.
 - Confirm license/usage terms are still compatible.
+
+## New Additions (2026-02-15)
+- `OPUS-100` dataset with `en-ps` subset support.
+- `FLORES-200` benchmark reference with `pbt_Arab` language code coverage.
+- `facebook/mms-1b-all` ASR model reference for multilingual Pashto transfer.
+- `mdarhri/pashto-bert` model for Pashto NLP baseline work.
+- Two Kaggle resources: Pashto isolated-word speech and Pashto word embeddings.
resources/benchmarks/README.md
CHANGED
@@ -7,6 +7,7 @@
 | FLEURS (Pashto subset) | [Hugging Face - google/fleurs](https://huggingface.co/datasets/google/fleurs) | WER, CER |
 | Common Voice Pashto v24 | [Mozilla Data Collective](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | WER, CER |
 | Belebele (`pbt_Arab`) | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Accuracy, F1 |
+| FLORES-200 (`pbt_Arab`) | [FLORES language list](https://github.com/facebookresearch/flores/tree/main/flores200) | BLEU, chrF, COMET |
 
 ## Integration Paths
 - Benchmark workspace: [../../benchmarks/README.md](../../benchmarks/README.md)
resources/datasets/README.md
CHANGED
@@ -9,6 +9,9 @@
 | OSCAR Corpus | [Hugging Face - oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) | Includes `unshuffled_deduplicated_ps` | NLP language modeling |
 | Wikimedia Wikipedia | [Hugging Face - wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | Includes `20231101.ps` | Clean text corpus |
 | Belebele | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Includes `pbt_Arab` | Reading comprehension benchmark |
+| OPUS-100 | [Hugging Face - Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes `en-ps` subset](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Parallel corpus for Pashto-English translation |
+| Pashto Isolated Words Speech Dataset | [Kaggle - engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Dataset card title explicitly marks Pashto speech data | Keyword spotting and limited-vocabulary ASR |
+| Pashto Word Embeddings | [Kaggle - drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Dataset description states pretrained Pashto embeddings | NLP baselines and lexical experiments |
 
 ## Integration Paths
 - Data workspace: [../../data/README.md](../../data/README.md)
resources/models/README.md
CHANGED
@@ -1,4 +1,4 @@
-#
+# Models
 
 ## Pashto-Relevant Models
 
@@ -6,10 +6,12 @@
 |---|---|---|---|
 | Whisper Large v3 | [Hugging Face - openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [Tokenizer map includes `ps`](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR baseline |
 | MMS Coverage Table | [Meta MMS language coverage](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Includes `pus` with ASR/TTS support | Multilingual transfer |
+| MMS 1B All (ASR) | [Hugging Face - facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) | [Coverage table includes `pus` with ASR support](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Multilingual ASR transfer baseline |
 | MMS TTS | [Hugging Face - facebook/mms-tts](https://huggingface.co/facebook/mms-tts) | Aligned with MMS coverage table | TTS baseline |
 | NLLB-200 Distilled 600M | [Hugging Face - facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | [`special_tokens_map.json` includes `pbt_Arab`](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | Translation baseline |
-| OPUS MT en
-| OPUS MT mul
+| OPUS MT en->mul | [Hugging Face - opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | Model language list includes `pus` | English->Pashto path |
+| OPUS MT mul->en | [Hugging Face - opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Model language list includes `pus` | Pashto->English path |
+| PashtoBERT | [Hugging Face - mdarhri/pashto-bert](https://huggingface.co/mdarhri/pashto-bert) | Model card states it is trained on Pashto corpus data | Pashto NLP encoder baseline |
 
 ## Integration Paths
 - ASR workspace: [../../asr/README.md](../../asr/README.md)