Data Workspace
- raw/: incoming source files
- processed/: cleaned and aligned artifacts
- metadata/: manifests, speaker/dialect info, QA reports
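The layout above can be created with a few lines of Python. This is a minimal sketch: the three subdirectory names come from the list, while the helper name `init_workspace` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Subdirectory names taken from the workspace list above.
SUBDIRS = ("raw", "processed", "metadata")


def init_workspace(root: Path) -> list[Path]:
    """Create the workspace subdirectories under `root` and return their paths."""
    created = []
    for name in SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)  # idempotent, safe to re-run
        created.append(path)
    return created


# Example (hypothetical call): init_workspace(Path("data"))
```

Because `mkdir` is called with `exist_ok=True`, running the helper on an existing workspace leaves its contents untouched.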
Verified External Datasets
Common Voice Scripted Speech 24.0 - Pashto
- Link: Mozilla Data Collective - Common Voice Pashto 24.0
- Why useful: largest open community Pashto speech source for ASR training and evaluation.
- How to use here: download to data/raw/common_voice_scripted_ps_v24/ and follow docs/common_voice_pashto_24.md.
Google FLEURS (Pashto config)
- Link: huggingface.co/datasets/google/fleurs
- Pashto validation: fleurs.py includes "ps_af".
- Why useful: standardized multilingual speech benchmark split for comparable ASR scores.
- How to use here: treat as external eval set for benchmarks/ and avoid training/eval leakage.
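One way to act on the leakage warning is to compare transcripts between the training pool and the external eval set before a benchmark run. The sketch below is an assumption about how such a check could look, not a script in this repository; `transcript_key` and `find_leakage` are hypothetical names. It only detects text overlap, so speaker or recording overlap would need a separate check.

```python
import unicodedata


def transcript_key(text: str) -> str:
    """Canonical form used only for comparison: NFC plus collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def find_leakage(train_texts, eval_texts):
    """Return eval transcripts whose canonical form also occurs in the training text."""
    train_keys = {transcript_key(t) for t in train_texts}
    return [t for t in eval_texts if transcript_key(t) in train_keys]


if __name__ == "__main__":
    train = ["سلام  نړۍ", "ښه راغلاست"]   # note the doubled space in the first item
    eval_set = ["سلام نړۍ", "مننه"]
    print(find_leakage(train, eval_set))
```

Any transcript the function returns is a candidate for removal from the training side, so the eval set stays fixed and scores remain comparable.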
OSCAR Corpus (Pashto web text)
- Link: huggingface.co/datasets/oscar-corpus/oscar
- Pashto validation: dataset includes unshuffled_deduplicated_ps.
- Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
- How to use here: normalize and sample into processed/ for NLP/ASR language model support.
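For the "normalize and sample" step, a reproducible fixed-size sample can be drawn in one pass without holding the corpus in memory. The sketch below uses reservoir sampling (Algorithm R) with a seeded generator. The function name and the `normalize` parameter are hypothetical; the default normalizer only strips whitespace and stands in for whatever the project's normalization policy specifies.

```python
import random
from typing import Callable, Iterable


def sample_lines(
    lines: Iterable[str],
    k: int,
    normalize: Callable[[str], str] = str.strip,  # placeholder normalizer (assumption)
    seed: int = 0,
) -> list[str]:
    """Reservoir-sample up to k non-empty normalized lines from a stream."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    reservoir: list[str] = []
    seen = 0
    for raw in lines:
        line = normalize(raw)
        if not line:
            continue  # skip lines that are empty after normalization
        seen += 1
        if len(reservoir) < k:
            reservoir.append(line)
        else:
            j = rng.randrange(seen)  # uniform over all lines seen so far
            if j < k:
                reservoir[j] = line
    return reservoir
```

Passing an open file handle as `lines` streams the corpus line by line, and logging `k` and `seed` alongside the output makes the sample reproducible.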
Wikimedia Wikipedia (Pashto dump)
- Link: huggingface.co/datasets/wikimedia/wikipedia
- Pashto validation: subset includes 20231101.ps.
- Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
- How to use here: include as a high-quality text source in normalization and glossary workflows.
Belebele (reading-comprehension benchmark)
- Link: huggingface.co/datasets/facebook/belebele
- Pashto validation: subset includes pbt_Arab.
- Why useful: downstream benchmark for tracking comprehension-oriented NLP progress in Pashto.
- How to use here: benchmark multilingual encoders and track improvements in benchmarks/.
OPUS-100 (parallel text, en-ps)
- Link: huggingface.co/datasets/Helsinki-NLP/opus-100
- Pashto validation: dataset viewer includes the en-ps subset.
- Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
- How to use here: keep in external eval/training split plans and log subset/version in run cards.
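The Hugging Face entries listed so far can be kept in one small registry so that every run names the same repository and config. This is a sketch under assumptions: the repository IDs and config names are the ones given in this README, while `HF_SOURCES` and `load_external` are hypothetical names. It relies on `datasets.load_dataset(path, name, **kwargs)` from the Hugging Face `datasets` package. Repositories that ship a loading script (FLEURS ships `fleurs.py`) may not load with every `datasets` release, and some repositories require accepting access terms first.

```python
# Repository IDs and config names as listed in this README.
HF_SOURCES = {
    "fleurs": ("google/fleurs", "ps_af"),
    "oscar": ("oscar-corpus/oscar", "unshuffled_deduplicated_ps"),
    "wikipedia": ("wikimedia/wikipedia", "20231101.ps"),
    "belebele": ("facebook/belebele", "pbt_Arab"),
    "opus100": ("Helsinki-NLP/opus-100", "en-ps"),
}


def load_external(key: str, **kwargs):
    """Load one listed dataset; extra kwargs (e.g. split=...) go to load_dataset."""
    # Imported lazily so the registry is usable without the package installed.
    from datasets import load_dataset  # requires `pip install datasets`

    path, config = HF_SOURCES[key]
    return load_dataset(path, config, **kwargs)


# Example (needs network access): ds = load_external("belebele")
```

Recording the registry key together with the resolved repository and config in a run card is enough to identify which subset a score was computed on.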
Pashto Isolated Words Speech Dataset (Kaggle)
- Link: kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset
- Pashto validation: dataset title is explicitly Pashto isolated-word speech.
- Why useful: supports small-footprint ASR or keyword-spotting experiments.
- How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.
Pashto Word Embeddings (Kaggle)
- Link: kaggle.com/datasets/drijaz/pashto-word-embeddings
- Pashto validation: dataset description states pretrained Pashto embeddings.
- Why useful: quick-start lexical semantics baseline for NLP experiments.
- How to use here: benchmark against transformer encoders in downstream Pashto tasks.
First Contribution (Normalization Starter)
- processed/normalization_seed_v0.1.tsv: starter normalization examples
- docs/pashto_normalization_v0.1.md: baseline normalization policy
- scripts/validate_normalization.py: basic file validator
Validate Seed File
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
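The checks performed by scripts/validate_normalization.py are not described here. As an illustration only, a basic TSV validator might look like the sketch below. It assumes the first row is a header and that every later row should have the same number of non-empty tab-separated fields; both assumptions and the name `validate_tsv` are hypothetical and may not match the real script.

```python
import csv


def validate_tsv(path: str) -> list[str]:
    """Return a list of problems found in a TSV file (an empty list means OK)."""
    with open(path, encoding="utf-8", newline="") as fh:
        # QUOTE_NONE treats quote characters as ordinary text.
        rows = list(csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE))
    if not rows:
        return ["file is empty"]
    width = len(rows[0])  # assumption: first row is the header
    problems = []
    for lineno, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"line {lineno}: expected {width} fields, got {len(row)}")
        elif any(not cell.strip() for cell in row):
            problems.append(f"line {lineno}: empty field")
    return problems
```

Returning a list of messages instead of raising on the first error lets a caller print every problem in one pass and exit non-zero if the list is non-empty.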
Notes
- Keep raw downloaded dataset files out of git.
- Track source URL + version in experiment notes for reproducibility.
- Re-check external links before every milestone release.
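Tracking source URL and version can be as simple as writing one small record per dataset into the experiment notes. The sketch below is one possible shape for such a record; the field names and the `source_record` helper are assumptions, not an established format in this repository.

```python
import hashlib
import json
from datetime import date


def source_record(name: str, url: str, version: str, payload: bytes = b"") -> dict:
    """Build a provenance record; a SHA-256 digest is added when file bytes are given."""
    record = {
        "name": name,
        "url": url,
        "version": version,
        "recorded_on": date.today().isoformat(),
    }
    if payload:
        record["sha256"] = hashlib.sha256(payload).hexdigest()
    return record


if __name__ == "__main__":
    rec = source_record(
        "wikipedia",
        "huggingface.co/datasets/wikimedia/wikipedia",
        "20231101.ps",
    )
    print(json.dumps(rec, ensure_ascii=False, indent=2))
```

Storing these records under metadata/ keeps them in version control while the raw files themselves stay out of git.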