# 🗂️ Data Workspace

- [raw/](raw/): incoming source files
- [processed/](processed/): cleaned/aligned artifacts
- [metadata/](metadata/): manifests, speaker/dialect info, QA reports

## ✅ Verified External Datasets

### 🎙️ Common Voice Scripted Speech 24.0 - Pashto
- Link: [Mozilla Data Collective - Common Voice Pashto 24.0](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14)
- Why useful: largest open community Pashto speech source for ASR training and evaluation.
- How to use here: download to `data/raw/common_voice_scripted_ps_v24/` and follow [docs/common_voice_pashto_24.md](../docs/common_voice_pashto_24.md).

### 🌸 Google FLEURS (Pashto config)
- Link: [huggingface.co/datasets/google/fleurs](https://huggingface.co/datasets/google/fleurs)
- Pashto validation: [`fleurs.py` includes `"ps_af"`](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py).
- Why useful: standardized multilingual speech benchmark split for comparable ASR scores.
- How to use here: treat as external eval set for [benchmarks/](../benchmarks/README.md) and avoid training/eval leakage.
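
One way to guard against training/eval leakage is to compare normalized transcripts across the two sets before training. The sketch below is illustrative only: `normalize_transcript` and `find_leaked_transcripts` are hypothetical helper names, and the normalization (NFC plus whitespace collapsing) is an assumption for comparison purposes, not the project's normalization policy.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Build a comparison key: NFC-normalize, collapse whitespace, casefold."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def find_leaked_transcripts(train_texts, eval_texts):
    """Return eval transcripts whose comparison key also occurs in the training set."""
    train_keys = {normalize_transcript(t) for t in train_texts}
    return [t for t in eval_texts if normalize_transcript(t) in train_keys]


if __name__ == "__main__":
    # Toy transcripts; the first training item differs from the first
    # eval item only by an extra space.
    train = ["سلام  نړۍ", "ښه راغلاست"]
    evaluation = ["سلام نړۍ", "مننه"]
    for leaked in find_leaked_transcripts(train, evaluation):
        print("overlaps with training data:", leaked)
```

Exact-match checks like this only catch duplicated sentences; speaker or recording overlap would need metadata-based checks.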

### 📖 OSCAR Corpus (Pashto web text)
- Link: [huggingface.co/datasets/oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar)
- Pashto validation: dataset includes `unshuffled_deduplicated_ps`.
- Why useful: large-scale Pashto text for LM pretraining and lexicon expansion.
- How to use here: normalize and sample into [processed/](processed/) for NLP/ASR language model support.
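
The "normalize and sample" step could look roughly like the sketch below. It assumes the text is already available as an iterable of lines. `clean_line` and `sample_lines` are hypothetical names, and the cleaning shown (NFC plus whitespace collapsing) is a placeholder, not the rules in the project's normalization policy.

```python
import random
import unicodedata


def clean_line(line: str) -> str:
    """Placeholder cleaning: NFC-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())


def sample_lines(lines, k, seed=0):
    """Clean lines, drop empties and exact duplicates, then draw a seeded sample."""
    seen = set()
    unique = []
    for line in lines:
        cleaned = clean_line(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(unique, min(k, len(unique)))
```

Recording the seed and sample size next to the output file makes it possible to regenerate the same sample later.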

### 📰 Wikimedia Wikipedia (Pashto dump)
- Link: [huggingface.co/datasets/wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia)
- Pashto validation: subset includes `20231101.ps`.
- Why useful: cleaner encyclopedia-style Pashto text for terminology and style balance.
- How to use here: include as a high-quality text source in normalization and glossary workflows.

### 📘 Belebele (reading-comprehension benchmark)
- Link: [huggingface.co/datasets/facebook/belebele](https://huggingface.co/datasets/facebook/belebele)
- Pashto validation: subset includes `pbt_Arab`.
- Why useful: downstream benchmark for tracking reading-comprehension progress in Pashto NLP.
- How to use here: benchmark multilingual encoders and track improvements in [benchmarks/](../benchmarks/README.md).

### 🌐 OPUS-100 (parallel text, en-ps)
- Link: [huggingface.co/datasets/Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100)
- Pashto validation: dataset viewer includes `en-ps` subset.
- Why useful: parallel Pashto-English bitext for translation baselines and text normalization cross-checks.
- How to use here: keep in external eval/training split plans and log subset/version in run cards.

### 🎀 Pashto Isolated Words Speech Dataset (Kaggle)
- Link: [kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset)
- Pashto validation: dataset title is explicitly Pashto isolated-word speech.
- Why useful: suits small-footprint ASR or keyword-spotting experiments.
- How to use here: treat as task-specific speech data and document licensing/collection assumptions before use.

### 🧠 Pashto Word Embeddings (Kaggle)
- Link: [kaggle.com/datasets/drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings)
- Pashto validation: dataset description states pretrained Pashto embeddings.
- Why useful: quick-start lexical semantics baseline for NLP experiments.
- How to use here: benchmark against transformer encoders in downstream Pashto tasks.

## First Contribution (Normalization Starter)
- [processed/normalization_seed_v0.1.tsv](processed/normalization_seed_v0.1.tsv): starter normalization examples
- [docs/pashto_normalization_v0.1.md](../docs/pashto_normalization_v0.1.md): baseline normalization policy
- [scripts/validate_normalization.py](../scripts/validate_normalization.py): basic file validator

## 🧪 Validate Seed File
Run from the repository root:
```bash
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
```
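
The checks performed by `scripts/validate_normalization.py` are not described in this README. Purely as an illustration of the kind of checks a basic TSV validator might run, here is a minimal sketch. It assumes a hypothetical two-column layout with no header row; `check_tsv_rows` is an invented name and the actual script may behave differently.

```python
import unicodedata


def check_tsv_rows(lines, expected_columns=2):
    """Return (line_number, message) pairs for rows that fail basic checks.

    The two-column default is an assumption for illustration only.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row.strip():
            problems.append((number, "empty row"))
            continue
        cells = row.split("\t")
        if len(cells) != expected_columns:
            problems.append(
                (number, f"expected {expected_columns} columns, found {len(cells)}")
            )
            continue
        if any(not cell.strip() for cell in cells):
            problems.append((number, "empty cell"))
        if unicodedata.normalize("NFC", row) != row:
            problems.append((number, "text is not NFC-normalized"))
    return problems
```

A caller could pass an open file handle as `lines` and exit with a non-zero status when the returned list is non-empty.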

## 📝 Notes
- Keep raw downloaded dataset files out of git.
- Track source URL + version in experiment notes for reproducibility.
- Re-check external links before every milestone release.
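
To make the source URL and version easy to track, each experiment note could carry a small structured record. The sketch below is one possible shape, not an established project format: the `DatasetSource` class, its field names, and the `to_note` helper are all assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DatasetSource:
    """Hypothetical provenance record for one external dataset."""
    name: str
    url: str
    version: str     # release number, dump date, or subset/config name
    checked_on: str  # ISO date when the link was last verified


def to_note(source: DatasetSource) -> str:
    """Serialize the record as one JSON line, keeping non-ASCII text readable."""
    return json.dumps(asdict(source), ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    wiki = DatasetSource(
        name="Wikimedia Wikipedia (Pashto dump)",
        url="https://huggingface.co/datasets/wikimedia/wikipedia",
        version="20231101.ps",
        checked_on="YYYY-MM-DD",  # placeholder: fill in the actual check date
    )
    print(to_note(wiki))
```

Updating `checked_on` whenever links are re-verified before a milestone release keeps both notes in one record.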