Add validated Pashto resources across datasets, models, and benchmarks

Files changed (6)

benchmarks/README.md +6 -0
data/README.md +18 -0
docs/resource_catalog.md +8 -1
resources/benchmarks/README.md +1 -0
resources/datasets/README.md +3 -0
resources/models/README.md +5 -3

benchmarks/README.md

@@ -18,6 +18,11 @@ Define fixed test sets, metrics, and leaderboard generation scripts.
 - Pashto validation: subset includes `pbt_Arab`.
 - Primary use: comprehension benchmark for multilingual NLP models.
+### 🌍 FLORES-200 (Pashto translation benchmark)
+- Dataset/language list: [facebookresearch/flores/tree/main/flores200](https://github.com/facebookresearch/flores/tree/main/flores200)
+- Pashto validation: language list includes `pbt_Arab`.
+- Primary use: fixed-reference MT evaluation for Pashto translation experiments.
 ### 🗣️ Common Voice Pashto v24
 - Dataset: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
 - Primary use: ASR train/dev/test experiments and project baseline tracking.
@@ -26,6 +31,7 @@ Define fixed test sets, metrics, and leaderboard generation scripts.
 - ASR: `WER`, `CER`
 - TTS: `MCD`/objective proxies + human MOS-style scoring
 - NLP: task-specific accuracy/F1 with fixed test set
+- MT: `BLEU`, `chrF`, `COMET`
 ## 🧾 Reporting Template
 - Benchmark dataset + version
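
For readers unfamiliar with the ASR metrics listed in this file, the following is a minimal sketch of how `WER` and `CER` are conventionally computed from edit distance. It is not the project's scoring script; the function names `edit_distance`, `wer`, and `cer` are illustrative assumptions, and real runs would presumably apply the project's normalization policy to both strings before scoring.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Both rates are normalized by the reference length, so they can exceed 1.0 when the hypothesis contains many insertions.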

data/README.md

@@ -35,6 +35,24 @@
 - Why useful: useful downstream benchmark for comprehension-oriented NLP progress in Pashto.
 - How to use here: benchmark multilingual encoders and track improvements in [benchmarks/](../benchmarks/README.md).
+### 🌐 OPUS-100 (parallel text, en-ps)
+- Link: [huggingface.co/datasets/Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
+- Pashto validation: dataset viewer includes `en-ps` subset.
+- Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
+- How to use here: keep in external eval/training split plans and log subset/version in run cards.
+### 🎤 Pashto Isolated Words Speech Dataset (Kaggle)
+- Link: [kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset)
+- Pashto validation: dataset title is explicitly Pashto isolated-word speech.
+- Why useful: useful for small-footprint ASR or keyword-spotting experiments.
+- How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.
+### 🧠 Pashto Word Embeddings (Kaggle)
+- Link: [kaggle.com/datasets/drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings)
+- Pashto validation: dataset description states pretrained Pashto embeddings.
+- Why useful: quick-start lexical semantics baseline for NLP experiments.
+- How to use here: benchmark against transformer encoders in downstream Pashto tasks.
 ## First Contribution (Normalization Starter)
 - [processed/normalization_seed_v0.1.tsv](processed/normalization_seed_v0.1.tsv) starter normalization examples
 - [docs/pashto_normalization_v0.1.md](../docs/pashto_normalization_v0.1.md) baseline normalization policy
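
The OPUS-100 entry asks contributors to "log subset/version in run cards". The repository's actual run-card format is not shown in this diff, so the sketch below is only one possible shape: the `DatasetRef` class, its field names, and the `run_card_entry` helper are hypothetical, and the version string is a placeholder rather than a real revision.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical record of which external dataset a run used."""
    name: str        # dataset identifier as listed in the catalog
    subset: str      # language-pair or config name, e.g. "en-ps"
    version: str     # pinned revision or download date (placeholder below)
    split_role: str  # how the data is used in this run, e.g. "external-eval"


def run_card_entry(ref: DatasetRef) -> str:
    """Serialize a dataset reference as one JSON line for a run card."""
    if not ref.subset or not ref.version:
        raise ValueError("subset and version must both be recorded")
    return json.dumps(asdict(ref), ensure_ascii=False, sort_keys=True)


opus = DatasetRef(
    name="Helsinki-NLP/opus-100",
    subset="en-ps",
    version="REPLACE_WITH_PINNED_REVISION",  # placeholder, not a real revision
    split_role="external-eval",
)
print(run_card_entry(opus))
```

Rejecting empty `subset` or `version` values is a design choice for this sketch: it makes an unpinned dataset fail loudly instead of producing a run card that cannot be reproduced.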

docs/resource_catalog.md

@@ -1,6 +1,6 @@
 # 📚 Verified Pashto Resource Catalog
-Last updated: `2026-02-14`
+Last updated: `2026-02-15`
 This index points to validated Pashto-related resources tracked in structured files.
@@ -27,3 +27,10 @@ Before each release:
 - Confirm links still resolve.
 - Confirm Pashto support markers remain valid.
 - Confirm license/usage terms are still compatible.
+## New Additions (2026-02-15)
+- `OPUS-100` dataset with `en-ps` subset support.
+- `FLORES-200` benchmark reference with `pbt_Arab` language code coverage.
+- `facebook/mms-1b-all` ASR model reference for multilingual Pashto transfer.
+- `mdarhri/pashto-bert` model for Pashto NLP baseline work.
+- Two Kaggle resources: Pashto isolated-word speech and Pashto word embeddings.

resources/benchmarks/README.md

@@ -7,6 +7,7 @@
 | FLEURS (Pashto subset) | [Hugging Face - google/fleurs](https://huggingface.co/datasets/google/fleurs) | WER, CER |
 | Common Voice Pashto v24 | [Mozilla Data Collective](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | WER, CER |
 | Belebele (`pbt_Arab`) | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Accuracy, F1 |
+| FLORES-200 (`pbt_Arab`) | [FLORES language list](https://github.com/facebookresearch/flores/tree/main/flores200) | BLEU, chrF, COMET |
 ## Integration Paths
 - Benchmark workspace: [../../benchmarks/README.md](../../benchmarks/README.md)

resources/datasets/README.md

@@ -9,6 +9,9 @@
 | OSCAR Corpus | [Hugging Face - oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) | Includes `unshuffled_deduplicated_ps` | NLP language modeling |
 | Wikimedia Wikipedia | [Hugging Face - wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | Includes `20231101.ps` | Clean text corpus |
 | Belebele | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Includes `pbt_Arab` | Reading comprehension benchmark |
+| OPUS-100 | [Hugging Face - Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes `en-ps` subset](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Parallel corpus for Pashto-English translation |
+| Pashto Isolated Words Speech Dataset | [Kaggle - engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Dataset card title explicitly marks Pashto speech data | Keyword spotting and limited-vocabulary ASR |
+| Pashto Word Embeddings | [Kaggle - drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Dataset description states pretrained Pashto embeddings | NLP baselines and lexical experiments |
 ## Integration Paths
 - Data workspace: [../../data/README.md](../../data/README.md)

resources/models/README.md

@@ -1,4 +1,4 @@
-# 🤖 Models
+# Models
 ## Pashto-Relevant Models
@@ -6,10 +6,12 @@
 |---|---|---|---|
 | Whisper Large v3 | [Hugging Face - openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [Tokenizer map includes `ps`](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR baseline |
 | MMS Coverage Table | [Meta MMS language coverage](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Includes `pus` with ASR/TTS support | Multilingual transfer |
+| MMS 1B All (ASR) | [Hugging Face - facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) | [Coverage table includes `pus` with ASR support](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Multilingual ASR transfer baseline |
 | MMS TTS | [Hugging Face - facebook/mms-tts](https://huggingface.co/facebook/mms-tts) | Aligned with MMS coverage table | TTS baseline |
 | NLLB-200 Distilled 600M | [Hugging Face - facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | [`special_tokens_map.json` includes `pbt_Arab`](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | Translation baseline |
-| OPUS MT en→mul | [Hugging Face - opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | Model language list includes `pus` | English→Pashto path |
-| OPUS MT mul→en | [Hugging Face - opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Model language list includes `pus` | Pashto→English path |
+| OPUS MT en->mul | [Hugging Face - opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | Model language list includes `pus` | English->Pashto path |
+| OPUS MT mul->en | [Hugging Face - opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Model language list includes `pus` | Pashto->English path |
+| PashtoBERT | [Hugging Face - mdarhri/pashto-bert](https://huggingface.co/mdarhri/pashto-bert) | Model card states it is trained on Pashto corpus data | Pashto NLP encoder baseline |
 ## Integration Paths
 - ASR workspace: [../../asr/README.md](../../asr/README.md)
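
The validation notes across these files rely on several different Pashto markers: `ps` (ISO 639-1), `pus` (ISO 639-3 macrolanguage), and `pbt_Arab` (ISO 639-3 Southern Pashto plus the ISO 15924 script code for Arabic). The sketch below shows one way the "Confirm Pashto support markers remain valid" check might be expressed in code. The `SUPPORT_MARKERS` mapping only restates markers quoted in this diff, and `has_pashto_code` is a hypothetical helper rather than part of the repository.

```python
import re

# Pashto language codes that appear in the catalog's validation notes.
PASHTO_CODES = frozenset({"ps", "pus", "pbt"})

# Resource name -> support marker, copied from the tables in this diff.
SUPPORT_MARKERS = {
    "OSCAR Corpus": "unshuffled_deduplicated_ps",
    "Wikimedia Wikipedia": "20231101.ps",
    "Belebele": "pbt_Arab",
    "FLORES-200": "pbt_Arab",
    "OPUS-100": "en-ps",
    "Whisper Large v3": "ps",
    "MMS Coverage Table": "pus",
    "NLLB-200 Distilled 600M": "pbt_Arab",
}


def has_pashto_code(marker: str) -> bool:
    """True if any alphabetic token in the marker is a known Pashto code."""
    tokens = re.split(r"[^a-z]+", marker.lower())
    return any(token in PASHTO_CODES for token in tokens)


# Resources whose recorded marker no longer contains a Pashto code.
missing = [name for name, marker in SUPPORT_MARKERS.items()
           if not has_pashto_code(marker)]
```

This only checks that the recorded marker string contains a Pashto code; it does not confirm that the upstream resource still ships that subset, which still needs a manual or network-based check.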