
πŸ—‚οΈ Data Workspace

  • raw/: incoming source files
  • processed/: cleaned and aligned artifacts
  • metadata/: manifests, speaker/dialect info, QA reports

✅ Verified External Datasets

πŸŽ™οΈ Common Voice Scripted Speech 24.0 - Pashto

🌸 Google FLEURS (Pashto config)

📖 OSCAR Corpus (Pashto web text)

  • Link: huggingface.co/datasets/oscar-corpus/oscar
  • Pashto validation: dataset includes unshuffled_deduplicated_ps.
  • Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
  • How to use here: normalize and sample into processed/ for NLP/ASR language model support.
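
The normalize-and-sample step could look roughly like the sketch below. It is not this repository's pipeline: the function names, the specific rules (NFC, Arabic kaf to keheh, tatweel removal, whitespace collapsing), and the sample size are assumptions for illustration, and the rules should be checked against the project's own normalization seed before use.

```python
import random
import re
import unicodedata
from typing import Iterable, List

# Assumed rules, not taken from the repository: map Arabic kaf (U+0643) to
# keheh (U+06A9) and drop tatweel (U+0640). Yeh variants are left untouched
# because Pashto distinguishes several of them.
_CHAR_MAP = str.maketrans({"\u0643": "\u06A9", "\u0640": None})
_WHITESPACE = re.compile(r"\s+")


def normalize_line(text: str) -> str:
    """Apply NFC, the assumed character map, and whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_CHAR_MAP)
    return _WHITESPACE.sub(" ", text).strip()


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> List[str]:
    """Keep a uniform random sample of up to k non-empty normalized lines."""
    rng = random.Random(seed)
    sample: List[str] = []
    seen = 0
    for raw in lines:
        line = normalize_line(raw)
        if not line:
            continue
        seen += 1
        if len(sample) < k:
            sample.append(line)
        else:
            # Replace an existing slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                sample[j] = line
    return sample
```

Reservoir sampling reads the corpus once and holds only k lines in memory, which suits a web-scale text source. Fixing the seed makes the sample written to processed/ reproducible.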

📰 Wikimedia Wikipedia (Pashto dump)

  • Link: huggingface.co/datasets/wikimedia/wikipedia
  • Pashto validation: subset includes 20231101.ps.
  • Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
  • How to use here: include as a high-quality text source in normalization and glossary workflows.

📘 Belebele (reading-comprehension benchmark)

  • Link: huggingface.co/datasets/facebook/belebele
  • Pashto validation: subset includes pbt_Arab.
  • Why useful: downstream benchmark for measuring comprehension-oriented NLP progress in Pashto.
  • How to use here: benchmark multilingual encoders and record results in benchmarks/ to track improvements.
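
A minimal sketch of the scoring and tracking step. It assumes each Belebele row carries a `correct_answer_num` field with the 1-based index of the right choice stored as a string (verify against the dataset card), and that model predictions are 1-based integers. The helper names and the JSON Lines layout under benchmarks/ are hypothetical.

```python
import json
from pathlib import Path
from typing import Mapping, Sequence


def multiple_choice_accuracy(
    rows: Sequence[Mapping[str, str]], predictions: Sequence[int]
) -> float:
    """Fraction of rows whose predicted choice matches correct_answer_num."""
    if len(rows) != len(predictions):
        raise ValueError("rows and predictions must have the same length")
    if not rows:
        return 0.0
    correct = sum(
        int(row["correct_answer_num"]) == pred
        for row, pred in zip(rows, predictions)
    )
    return correct / len(rows)


def append_result(path: Path, model_name: str, subset: str, accuracy: float) -> None:
    """Append one JSON line per evaluation run so history stays diffable."""
    path.parent.mkdir(parents=True, exist_ok=True)
    record = {"model": model_name, "subset": subset, "accuracy": accuracy}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending rather than overwriting keeps earlier scores in place, so improvements between runs can be read directly from the file.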

🌐 OPUS-100 (parallel text, en-ps)

  • Link: huggingface.co/datasets/Helsinki-NLP/opus-100
  • Pashto validation: dataset viewer includes en-ps subset.
  • Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
  • How to use here: keep in external eval/training split plans and log subset/version in run cards.
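
One way to log subset and version in a run card is sketched below. The `DatasetRef` fields and the card layout are assumptions, not a format defined by this repository. The revision should be whatever identifier pins the data, such as a dataset commit hash, and "unknown" when nothing was pinned.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Sequence


@dataclass(frozen=True)
class DatasetRef:
    name: str      # e.g. "Helsinki-NLP/opus-100"
    subset: str    # e.g. "en-ps"
    revision: str  # pinned commit hash or tag; "unknown" if not pinned
    role: str      # "train" or "eval"


def build_run_card(run_id: str, datasets: Sequence[DatasetRef]) -> str:
    """Serialize the run id, a UTC timestamp, and dataset references as JSON."""
    card = {
        "run_id": run_id,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "datasets": [asdict(d) for d in datasets],
    }
    return json.dumps(card, indent=2, ensure_ascii=False)
```

Recording the role next to each dataset makes it easier to spot a run where the same subset was used for both training and evaluation.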

🎀 Pashto Isolated Words Speech Dataset (Kaggle)

  • Link: kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset
  • Pashto validation: dataset title is explicitly Pashto isolated-word speech.
  • Why useful: suited to small-footprint ASR or keyword-spotting experiments.
  • How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.

🧠 Pashto Word Embeddings (Kaggle)

  • Link: kaggle.com/datasets/drijaz/pashto-word-embeddings
  • Pashto validation: dataset description states pretrained Pashto embeddings.
  • Why useful: quick-start lexical semantics baseline for NLP experiments.
  • How to use here: benchmark against transformer encoders in downstream Pashto tasks.

First Contribution (Normalization Starter)

🧪 Validate Seed File

python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
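
The checks performed by scripts/validate_normalization.py are not shown here. As an illustration only, a validator for a tab-separated seed file might look like the sketch below. It assumes a header row and makes no assumption about column names; the specific checks (consistent column count, no empty cells, NFC-normalized text, no duplicate rows) are hypothetical.

```python
import csv
import unicodedata
from typing import List


def check_tsv(path: str) -> List[str]:
    """Return a list of human-readable problems; empty means the file passed."""
    problems: List[str] = []
    with open(path, encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        seen = set()
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                problems.append(
                    f"line {line_no}: expected {len(header)} columns, found {len(row)}"
                )
                continue
            if any(not cell.strip() for cell in row):
                problems.append(f"line {line_no}: empty cell")
            if any(unicodedata.normalize("NFC", cell) != cell for cell in row):
                problems.append(f"line {line_no}: text is not NFC-normalized")
            key = tuple(row)
            if key in seen:
                problems.append(f"line {line_no}: duplicate row")
            seen.add(key)
    return problems
```

Calling `check_tsv("data/processed/normalization_seed_v0.1.tsv")` and exiting non-zero when the list is non-empty would let the same check run in CI.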

πŸ“ Notes

  • Keep raw downloaded dataset files out of git.
  • Track source URL + version in experiment notes for reproducibility.
  • Re-check external links before every milestone release.
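
The link re-check can be scripted. The sketch below uses only the Python standard library and is an assumption about how such a check might be done, not an existing script in this repository. URLs need an explicit `https://` scheme, and some hosts reject HEAD requests or automated clients, so a failure here is a prompt for a manual look rather than proof that a link is dead.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the request failed."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return 0


def check_links(
    urls: Iterable[str], fetch: Callable[[str], int] = http_status
) -> Dict[str, int]:
    """Map each URL that did not return a 2xx status to the status observed."""
    failures: Dict[str, int] = {}
    for url in urls:
        status = fetch(url)
        if not 200 <= status < 300:
            failures[url] = status
    return failures
```

For example, `check_links(["https://huggingface.co/datasets/facebook/belebele"])` returns an empty dict when the page responds successfully. The `fetch` parameter exists so the logic can be tested without network access.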