Musawer14/pashto-language-resources


pashto-language-resources / data /README.md

musaw

Add validated Pashto resources across datasets models and benchmarks

fb472d7 19 days ago


🗂️ Data Workspace

raw/: incoming source files
processed/: cleaned/aligned artifacts
metadata/: manifests, speaker/dialect info, QA reports

✅ Verified External Datasets

🎙️ Common Voice Scripted Speech 24.0 - Pashto

Link: Mozilla Data Collective - Common Voice Pashto 24.0
Why useful: largest open community Pashto speech source for ASR training and evaluation.
How to use here: download to data/raw/common_voice_scripted_ps_v24/ and follow docs/common_voice_pashto_24.md.

🌸 Google FLEURS (Pashto config)

Link: huggingface.co/datasets/google/fleurs
Pashto validation: fleurs.py includes "ps_af".
Why useful: standardized multilingual speech benchmark split for comparable ASR scores.
How to use here: use only as an external evaluation set for benchmarks/; do not train on it, so that train/eval leakage is avoided.
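One way to guard against leakage is to compare transcripts from any training source against the evaluation transcripts before a benchmark run. The following is a minimal sketch, not part of this repository: the function names are hypothetical, and the normalization is a generic placeholder rather than the policy in docs/pashto_normalization_v0.1.md.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Light, generic normalization so trivially different strings compare equal.

    Assumption: Unicode NFC plus whitespace collapsing is enough for an
    overlap check. This is NOT the project's normalization policy.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def find_leaked_transcripts(train_texts, eval_texts):
    """Return normalized eval transcripts that also occur in the training data."""
    train_set = {normalize_transcript(t) for t in train_texts}
    eval_set = {normalize_transcript(t) for t in eval_texts}
    return sorted(train_set & eval_set)


if __name__ == "__main__":
    # Toy strings stand in for real training and evaluation transcripts.
    train = ["سلام  نړۍ", "ښه راغلاست"]
    held_out = ["سلام نړۍ", "مننه"]
    leaked = find_leaked_transcripts(train, held_out)
    print(f"{len(leaked)} overlapping transcript(s)")
```

An exact-match check like this only catches verbatim overlap; near-duplicates would need fuzzier matching.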

📖 OSCAR Corpus (Pashto web text)

Link: huggingface.co/datasets/oscar-corpus/oscar
Pashto validation: dataset includes unshuffled_deduplicated_ps.
Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
How to use here: normalize, then sample into processed/ as text for NLP work and ASR language models.
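A web corpus of this size is usually too large to load whole, so the normalize-then-sample step can be done in a single streaming pass. The sketch below uses reservoir sampling over any iterable of lines. The function names are hypothetical, and `normalize_line` is a generic placeholder (Unicode NFC plus whitespace collapsing), not the rules in docs/pashto_normalization_v0.1.md.

```python
import random
import unicodedata
from typing import Iterable, List


def normalize_line(line: str) -> str:
    """Placeholder normalization: Unicode NFC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Uniformly sample up to k non-empty normalized lines in one pass."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample: List[str] = []
    seen = 0  # number of non-empty lines processed so far
    for raw in lines:
        line = normalize_line(raw)
        if not line:
            continue
        seen += 1
        if len(sample) < k:
            sample.append(line)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = line
    return sample


if __name__ == "__main__":
    toy_lines = ["first   line", "", "second line", "  third line  "]
    print(len(reservoir_sample(toy_lines, k=2, seed=42)))
```

Recording the seed and `k` alongside the output file keeps the sample reproducible.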

📰 Wikimedia Wikipedia (Pashto dump)

Link: huggingface.co/datasets/wikimedia/wikipedia
Pashto validation: subset includes 20231101.ps.
Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
How to use here: include as a high-quality text source in normalization and glossary workflows.

📘 Belebele (reading-comprehension benchmark)

Link: huggingface.co/datasets/facebook/belebele
Pashto validation: subset includes pbt_Arab.
Why useful: downstream benchmark for tracking reading-comprehension progress in Pashto NLP.
How to use here: benchmark multilingual encoders and track improvements in benchmarks/.
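Belebele is multiple choice with four options per question, so the tracked score can be plain accuracy. The sketch below assumes each row carries `flores_passage`, `question`, `mc_answer1` to `mc_answer4`, and a 1-based `correct_answer_num`; those field names are my reading of the dataset and should be verified against the dataset viewer. `belebele_accuracy` and the `predict` callback are hypothetical names.

```python
from typing import Callable, Iterable, List, Mapping


def belebele_accuracy(
    rows: Iterable[Mapping[str, str]],
    predict: Callable[[str, str, List[str]], int],
) -> float:
    """Accuracy of `predict` over Belebele-style rows.

    `predict(passage, question, choices)` must return a 1-based choice index.
    Field names below are assumptions about the dataset schema.
    """
    correct = 0
    total = 0
    for row in rows:
        choices = [row[f"mc_answer{i}"] for i in range(1, 5)]
        predicted = predict(row["flores_passage"], row["question"], choices)
        correct += int(predicted == int(row["correct_answer_num"]))
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy rows stand in for real benchmark items.
    toy_rows = [
        {"flores_passage": "p", "question": "q", "mc_answer1": "a",
         "mc_answer2": "b", "mc_answer3": "c", "mc_answer4": "d",
         "correct_answer_num": "1"},
    ]
    print(belebele_accuracy(toy_rows, lambda passage, question, choices: 1))
```

A predictor that always returns the same option gives a floor to compare real encoders against.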

🌐 OPUS-100 (parallel text, en-ps)

Link: huggingface.co/datasets/Helsinki-NLP/opus-100
Pashto validation: dataset viewer includes en-ps subset.
Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
How to use here: include in split plans as an external training or evaluation source, and log the subset and version in run cards.
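The run-card format is not defined in this README, so the following is only a sketch of what logging a subset and version could look like. The field names, the `build_run_card` and `save_run_card` helpers, and the JSON layout are all assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def build_run_card(dataset: str, subset: str, version: str,
                   source_url: str, role: str) -> dict:
    """Build a run-card record for one external dataset (hypothetical schema)."""
    if role not in {"train", "eval"}:
        raise ValueError("role must be 'train' or 'eval'")
    return {
        "dataset": dataset,
        "subset": subset,
        "version": version,        # e.g. a dataset revision or release tag
        "source_url": source_url,
        "role": role,              # which side of the split this data is on
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


def save_run_card(card: dict, path: str) -> None:
    """Write the run card as UTF-8 JSON, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(card, ensure_ascii=False, indent=2),
                      encoding="utf-8")


if __name__ == "__main__":
    card = build_run_card(
        dataset="Helsinki-NLP/opus-100",
        subset="en-ps",
        version="REPLACE_WITH_REVISION",  # placeholder, not a real revision
        source_url="https://huggingface.co/datasets/Helsinki-NLP/opus-100",
        role="eval",
    )
    print(json.dumps(card, indent=2))
```

Storing the role next to the subset makes it easy to confirm later that a given source was never used on both sides of a split.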

🎤 Pashto Isolated Words Speech Dataset (Kaggle)

Link: kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset
Pashto validation: dataset title is explicitly Pashto isolated-word speech.
Why useful: suited to small-footprint ASR or keyword-spotting experiments.
How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.

🧠 Pashto Word Embeddings (Kaggle)

Link: kaggle.com/datasets/drijaz/pashto-word-embeddings
Pashto validation: dataset description states pretrained Pashto embeddings.
Why useful: quick-start lexical semantics baseline for NLP experiments.
How to use here: benchmark against transformer encoders in downstream Pashto tasks.

First Contribution (Normalization Starter)

processed/normalization_seed_v0.1.tsv: starter normalization examples
docs/pashto_normalization_v0.1.md: baseline normalization policy
scripts/validate_normalization.py: basic file validator

🧪 Validate Seed File

python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv

📝 Notes

Keep raw downloaded dataset files out of git.
Track source URL + version in experiment notes for reproducibility.
Re-check external links before every milestone release.
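The link re-check can be scripted. The following is a minimal sketch using only the standard library; `http_status` and `find_broken_links` are hypothetical helpers, not scripts in this repository. Some hosts reject HEAD requests or automated clients, so a reported failure should be confirmed by hand before a link is treated as dead.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_status(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with an error status
    except OSError:
        return None      # DNS failure, timeout, refused connection, etc.


def find_broken_links(
    urls: Iterable[str],
    get_status: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, Optional[int]]:
    """Map each failing URL to its status (None means no response)."""
    broken: Dict[str, Optional[int]] = {}
    for url in urls:
        status = get_status(url)
        if status is None or status >= 400:
            broken[url] = status
    return broken
```

Calling `find_broken_links` with the dataset URLs listed above (with `https://` prepended) returns an empty dict when every link responds. The `get_status` parameter allows a stub in place of real network access when testing.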