musaw committed on
Commit
fb472d7
·
1 Parent(s): 5bb5a63

Add validated Pashto resources across datasets models and benchmarks

benchmarks/README.md CHANGED
@@ -18,6 +18,11 @@ Define fixed test sets, metrics, and leaderboard generation scripts.
18
  - Pashto validation: subset includes `pbt_Arab`.
19
  - Primary use: comprehension benchmark for multilingual NLP models.
20
 
21
  ### 🗣️ Common Voice Pashto v24
22
  - Dataset: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
23
  - Primary use: ASR train/dev/test experiments and project baseline tracking.
@@ -26,6 +31,7 @@ Define fixed test sets, metrics, and leaderboard generation scripts.
26
  - ASR: `WER`, `CER`
27
  - TTS: `MCD`/objective proxies + human MOS-style scoring
28
  - NLP: task-specific accuracy/F1 with fixed test set
29
 
30
  ## 🧾 Reporting Template
31
  - Benchmark dataset + version
 
18
  - Pashto validation: subset includes `pbt_Arab`.
19
  - Primary use: comprehension benchmark for multilingual NLP models.
20
 
21
+ ### 🌍 FLORES-200 (Pashto translation benchmark)
22
+ - Dataset/language list: [facebookresearch/flores/tree/main/flores200](https://github.com/facebookresearch/flores/tree/main/flores200)
23
+ - Pashto validation: language list includes `pbt_Arab`.
24
+ - Primary use: fixed-reference MT evaluation for Pashto translation experiments.
25
+
26
  ### 🗣️ Common Voice Pashto v24
27
  - Dataset: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
28
  - Primary use: ASR train/dev/test experiments and project baseline tracking.
 
31
  - ASR: `WER`, `CER`
32
  - TTS: `MCD`/objective proxies + human MOS-style scoring
33
  - NLP: task-specific accuracy/F1 with fixed test set
34
+ - MT: `BLEU`, `chrF`, `COMET`
35
 
36
  ## 🧾 Reporting Template
37
  - Benchmark dataset + version
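
The ASR metrics named in the list above (`WER`, `CER`) can be illustrated with a minimal, dependency-free sketch. This is not the project's scoring script: the function names `edit_distance`, `wer`, and `cer` are hypothetical, whitespace tokenization is assumed for words, and spaces are counted as characters in `cer`. Any text normalization is assumed to have been applied to both strings beforehand.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Placeholder strings; real runs would use fixed test-set transcripts.
    print(wer("a b c d", "a x c"))
    print(cer("abcd", "abed"))
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, so reports should state the normalization and tokenization used alongside the scores.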
data/README.md CHANGED
@@ -35,6 +35,24 @@
35
  - Why useful: useful downstream benchmark for comprehension-oriented NLP progress in Pashto.
36
  - How to use here: benchmark multilingual encoders and track improvements in [benchmarks/](../benchmarks/README.md).
37
 
38
  ## First Contribution (Normalization Starter)
39
  - [processed/normalization_seed_v0.1.tsv](processed/normalization_seed_v0.1.tsv) starter normalization examples
40
  - [docs/pashto_normalization_v0.1.md](../docs/pashto_normalization_v0.1.md) baseline normalization policy
 
35
  - Why useful: useful downstream benchmark for comprehension-oriented NLP progress in Pashto.
36
  - How to use here: benchmark multilingual encoders and track improvements in [benchmarks/](../benchmarks/README.md).
37
 
38
+ ### 🌐 OPUS-100 (parallel text, en-ps)
39
+ - Link: [huggingface.co/datasets/Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
40
+ - Pashto validation: dataset viewer includes `en-ps` subset.
41
+ - Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
42
+ - How to use here: keep in external eval/training split plans and log subset/version in run cards.
43
+
44
+ ### 🎤 Pashto Isolated Words Speech Dataset (Kaggle)
45
+ - Link: [kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset)
46
+ - Pashto validation: dataset title is explicitly Pashto isolated-word speech.
47
+ - Why useful: useful for small-footprint ASR or keyword-spotting experiments.
48
+ - How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.
49
+
50
+ ### 🧠 Pashto Word Embeddings (Kaggle)
51
+ - Link: [kaggle.com/datasets/drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings)
52
+ - Pashto validation: dataset description states pretrained Pashto embeddings.
53
+ - Why useful: quick-start lexical semantics baseline for NLP experiments.
54
+ - How to use here: benchmark against transformer encoders in downstream Pashto tasks.
55
+
56
  ## First Contribution (Normalization Starter)
57
  - [processed/normalization_seed_v0.1.tsv](processed/normalization_seed_v0.1.tsv) starter normalization examples
58
  - [docs/pashto_normalization_v0.1.md](../docs/pashto_normalization_v0.1.md) baseline normalization policy
docs/resource_catalog.md CHANGED
@@ -1,6 +1,6 @@
1
  # 📚 Verified Pashto Resource Catalog
2
 
3
- Last updated: `2026-02-14`
4
 
5
  This index points to validated Pashto-related resources tracked in structured files.
6
 
@@ -27,3 +27,10 @@ Before each release:
27
  - Confirm links still resolve.
28
  - Confirm Pashto support markers remain valid.
29
  - Confirm license/usage terms are still compatible.
 
1
  # 📚 Verified Pashto Resource Catalog
2
 
3
+ Last updated: `2026-02-15`
4
 
5
  This index points to validated Pashto-related resources tracked in structured files.
6
 
 
27
  - Confirm links still resolve.
28
  - Confirm Pashto support markers remain valid.
29
  - Confirm license/usage terms are still compatible.
30
+
31
+ ## New Additions (2026-02-15)
32
+ - `OPUS-100` dataset with `en-ps` subset support.
33
+ - `FLORES-200` benchmark reference with `pbt_Arab` language code coverage.
34
+ - `facebook/mms-1b-all` ASR model reference for multilingual Pashto transfer.
35
+ - `mdarhri/pashto-bert` model for Pashto NLP baseline work.
36
+ - Two Kaggle resources: Pashto isolated-word speech and Pashto word embeddings.
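
The release check "Confirm links still resolve" in the catalog above could be partly automated. The sketch below is an assumption-laden illustration, not an existing script in this repository: `extract_links` and `check_link` are hypothetical names, only absolute `http(s)` Markdown links are matched (relative paths such as `../benchmarks/README.md` are skipped), and a `HEAD` request is used even though some hosts reject it.

```python
import re
import urllib.error
import urllib.request

# Matches inline Markdown links of the form [label](http(s)://url).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def extract_links(markdown_text):
    """Return (label, url) pairs for every absolute link in the text."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(markdown_text)]


def check_link(url, timeout=10.0):
    """Return the HTTP status code for url, or None if the request failed."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return None  # DNS failure, timeout, refused connection, etc.


if __name__ == "__main__":
    row = ("| Belebele | [Hugging Face - facebook/belebele]"
           "(https://huggingface.co/datasets/facebook/belebele) "
           "| Includes `pbt_Arab` |")
    for label, url in extract_links(row):
        print(label, url, check_link(url))
```

A status of `None`, or a 4xx/5xx code, only flags a link for manual review; a resolving link does not confirm that Pashto support markers or license terms are unchanged.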
resources/benchmarks/README.md CHANGED
@@ -7,6 +7,7 @@
7
  | FLEURS (Pashto subset) | [Hugging Face - google/fleurs](https://huggingface.co/datasets/google/fleurs) | WER, CER |
8
  | Common Voice Pashto v24 | [Mozilla Data Collective](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | WER, CER |
9
  | Belebele (`pbt_Arab`) | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Accuracy, F1 |
10
 
11
  ## Integration Paths
12
  - Benchmark workspace: [../../benchmarks/README.md](../../benchmarks/README.md)
 
7
  | FLEURS (Pashto subset) | [Hugging Face - google/fleurs](https://huggingface.co/datasets/google/fleurs) | WER, CER |
8
  | Common Voice Pashto v24 | [Mozilla Data Collective](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | WER, CER |
9
  | Belebele (`pbt_Arab`) | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Accuracy, F1 |
10
+ | FLORES-200 (`pbt_Arab`) | [FLORES language list](https://github.com/facebookresearch/flores/tree/main/flores200) | BLEU, chrF, COMET |
11
 
12
  ## Integration Paths
13
  - Benchmark workspace: [../../benchmarks/README.md](../../benchmarks/README.md)
resources/datasets/README.md CHANGED
@@ -9,6 +9,9 @@
9
  | OSCAR Corpus | [Hugging Face - oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) | Includes `unshuffled_deduplicated_ps` | NLP language modeling |
10
  | Wikimedia Wikipedia | [Hugging Face - wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | Includes `20231101.ps` | Clean text corpus |
11
  | Belebele | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Includes `pbt_Arab` | Reading comprehension benchmark |
12
 
13
  ## Integration Paths
14
  - Data workspace: [../../data/README.md](../../data/README.md)
 
9
  | OSCAR Corpus | [Hugging Face - oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) | Includes `unshuffled_deduplicated_ps` | NLP language modeling |
10
  | Wikimedia Wikipedia | [Hugging Face - wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | Includes `20231101.ps` | Clean text corpus |
11
  | Belebele | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Includes `pbt_Arab` | Reading comprehension benchmark |
12
+ | OPUS-100 | [Hugging Face - Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes `en-ps` subset](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Parallel corpus for Pashto-English translation |
13
+ | Pashto Isolated Words Speech Dataset | [Kaggle - engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Dataset card title explicitly marks Pashto speech data | Keyword spotting and limited-vocabulary ASR |
14
+ | Pashto Word Embeddings | [Kaggle - drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Dataset description states pretrained Pashto embeddings | NLP baselines and lexical experiments |
15
 
16
  ## Integration Paths
17
  - Data workspace: [../../data/README.md](../../data/README.md)
resources/models/README.md CHANGED
@@ -1,4 +1,4 @@
1
- # 🤖 Models
2
 
3
  ## Pashto-Relevant Models
4
 
@@ -6,10 +6,12 @@
6
  |---|---|---|---|
7
  | Whisper Large v3 | [Hugging Face - openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [Tokenizer map includes `ps`](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR baseline |
8
  | MMS Coverage Table | [Meta MMS language coverage](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Includes `pus` with ASR/TTS support | Multilingual transfer |
9
  | MMS TTS | [Hugging Face - facebook/mms-tts](https://huggingface.co/facebook/mms-tts) | Aligned with MMS coverage table | TTS baseline |
10
  | NLLB-200 Distilled 600M | [Hugging Face - facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | [`special_tokens_map.json` includes `pbt_Arab`](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | Translation baseline |
11
- | OPUS MT en→mul | [Hugging Face - opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | Model language list includes `pus` | English→Pashto path |
12
- | OPUS MT mul→en | [Hugging Face - opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Model language list includes `pus` | Pashto→English path |
13
 
14
  ## Integration Paths
15
  - ASR workspace: [../../asr/README.md](../../asr/README.md)
 
1
+ # Models
2
 
3
  ## Pashto-Relevant Models
4
 
 
6
  |---|---|---|---|
7
  | Whisper Large v3 | [Hugging Face - openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [Tokenizer map includes `ps`](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR baseline |
8
  | MMS Coverage Table | [Meta MMS language coverage](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Includes `pus` with ASR/TTS support | Multilingual transfer |
9
+ | MMS 1B All (ASR) | [Hugging Face - facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) | [Coverage table includes `pus` with ASR support](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Multilingual ASR transfer baseline |
10
  | MMS TTS | [Hugging Face - facebook/mms-tts](https://huggingface.co/facebook/mms-tts) | Aligned with MMS coverage table | TTS baseline |
11
  | NLLB-200 Distilled 600M | [Hugging Face - facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | [`special_tokens_map.json` includes `pbt_Arab`](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | Translation baseline |
12
+ | OPUS MT en->mul | [Hugging Face - opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | Model language list includes `pus` | English->Pashto path |
13
+ | OPUS MT mul->en | [Hugging Face - opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Model language list includes `pus` | Pashto->English path |
14
+ | PashtoBERT | [Hugging Face - mdarhri/pashto-bert](https://huggingface.co/mdarhri/pashto-bert) | Model card states it is trained on Pashto corpus data | Pashto NLP encoder baseline |
15
 
16
  ## Integration Paths
17
  - ASR workspace: [../../asr/README.md](../../asr/README.md)