musaw committed on
Commit
6f1c8bd
·
1 Parent(s): 2c6aada

sync(hf): snapshot main content without binary history

.github/workflows/resource_sync.yml CHANGED
@@ -29,9 +29,15 @@ jobs:
29
  - name: Sync candidate resources
30
  run: python scripts/sync_resources.py --limit 20
31
 
32
  - name: Validate catalog
33
  run: python scripts/validate_resource_catalog.py
34
 
35
  - name: Ensure labels exist
36
  uses: actions/github-script@v7
37
  with:
@@ -67,16 +73,27 @@ jobs:
67
  with:
68
  branch: bot/resource-sync
69
  delete-branch: true
70
- commit-message: "chore(resources): sync candidate feed"
71
- title: "chore(resources): sync Pashto resource candidates"
72
  body: |
73
- Automated daily candidate sync.
74
 
75
  Scope:
76
  - Updates `resources/catalog/pending_candidates.json`
77
- - Leaves verified catalog unchanged for maintainer review
 
78
  labels: |
79
  resource-update
80
  needs-review
81
  add-paths: |
82
  resources/catalog/pending_candidates.json
 
29
  - name: Sync candidate resources
30
  run: python scripts/sync_resources.py --limit 20
31
 
32
+ - name: Auto-promote valid candidates
33
+ run: python scripts/promote_candidates.py
34
+
35
  - name: Validate catalog
36
  run: python scripts/validate_resource_catalog.py
37
 
38
+ - name: Generate resource views
39
+ run: python scripts/generate_resource_views.py
40
+
41
  - name: Ensure labels exist
42
  uses: actions/github-script@v7
43
  with:
 
73
  with:
74
  branch: bot/resource-sync
75
  delete-branch: true
76
+ commit-message: "chore(resources): sync candidate feed and auto-promote valid entries"
77
+ title: "chore(resources): sync and auto-promote Pashto resources"
78
  body: |
79
+ Automated daily resource sync.
80
 
81
  Scope:
82
  - Updates `resources/catalog/pending_candidates.json`
83
+ - Auto-promotes valid non-duplicate candidates into `resources/catalog/resources.json`
84
+ - Regenerates resource indexes and search payload
85
  labels: |
86
  resource-update
87
  needs-review
88
  add-paths: |
89
  resources/catalog/pending_candidates.json
90
+ resources/catalog/resources.json
91
+ resources/README.md
92
+ resources/datasets/README.md
93
+ resources/models/README.md
94
+ resources/benchmarks/README.md
95
+ resources/tools/README.md
96
+ resources/papers/README.md
97
+ resources/projects/README.md
98
+ resources/codes/README.md
99
+ docs/search/resources.json
CHANGELOG.md CHANGED
@@ -12,10 +12,11 @@ and this project uses semantic version tags with a fixed role per figure:
12
 
13
  ## [Unreleased]
14
  ### Added
15
- - None yet.
16
 
17
  ### Changed
18
- - None yet.
 
19
 
20
  ### Fixed
21
  - None yet.
 
12
 
13
  ## [Unreleased]
14
  ### Added
15
+ - Added `scripts/promote_candidates.py` to auto-promote valid non-duplicate candidates into the verified catalog.
16
 
17
  ### Changed
18
+ - Updated `.github/workflows/resource_sync.yml` to auto-promote valid candidates, regenerate resource views, and include verified catalog changes in bot PRs.
19
+ - Updated resource workflow docs and runbook to reflect automated promotion behavior.
20
 
21
  ### Fixed
22
  - None yet.
README.md CHANGED
@@ -49,8 +49,8 @@ This repository curates verified Pashto resources and keeps validation and publi
49
  ## Resource Workflow
50
 
51
  1. Discovery job (`.github/workflows/resource_sync.yml`) updates candidate feed.
52
- 2. Maintainer review promotes high-quality entries to `resources/catalog/resources.json`.
53
- 3. Regeneration and validation updates derived views and search index.
54
 
55
  Core commands:
56
 
@@ -82,4 +82,3 @@ python -m pytest -q
82
  - Community communication: [community/COMMUNICATION.md](community/COMMUNICATION.md)
83
  - Resource guidelines: [docs/dataset_guidelines.md](docs/dataset_guidelines.md)
84
 
85
-
 
49
  ## Resource Workflow
50
 
51
  1. Discovery job (`.github/workflows/resource_sync.yml`) updates candidate feed.
52
+ 2. Automation promotes valid non-duplicate candidates into `resources/catalog/resources.json`.
53
+ 3. Regeneration and validation update derived views and search index.
54
 
55
  Core commands:
56
 
 
82
  - Community communication: [community/COMMUNICATION.md](community/COMMUNICATION.md)
83
  - Resource guidelines: [docs/dataset_guidelines.md](docs/dataset_guidelines.md)
84
 
 
docs/resource_automation.md CHANGED
@@ -1,11 +1,11 @@
1
  # Resource Automation
2
 
3
- This repository uses a semi-automated process to keep Pashto resources current while preserving human review.
4
 
5
  ## Goals
6
  - Discover new Pashto-relevant resources from trusted public endpoints.
7
  - Keep a machine-readable canonical catalog.
8
- - Prevent unreviewed low-confidence resources from directly entering verified lists.
9
 
10
  ## Covered source types
11
  - Kaggle datasets
@@ -29,6 +29,7 @@ This repository uses a semi-automated process to keep Pashto resources current w
29
  - Validate catalog: `python scripts/validate_resource_catalog.py`
30
  - Generate markdown and search index: `python scripts/generate_resource_views.py`
31
  - Sync new candidates: `python scripts/sync_resources.py --limit 20`
 
32
  - Full run wrapper: `python scripts/run_resource_cycle.py --limit 25`
33
 
34
  ## GitHub Actions
@@ -37,16 +38,15 @@ This repository uses a semi-automated process to keep Pashto resources current w
37
  - generated file consistency
38
  - markdown link checks
39
  - tests
40
- - Resource Sync (`.github/workflows/resource_sync.yml`) runs daily and opens a PR with candidate updates.
41
 
42
- ## Review flow
43
- 1. Inspect candidate entries in `resources/catalog/pending_candidates.json`.
44
- 2. Select useful items and move them into `resources/catalog/resources.json`.
45
- 3. Set `status` to `verified` only after checking evidence and license.
46
- 4. Run:
47
  - `python scripts/validate_resource_catalog.py`
48
  - `python scripts/generate_resource_views.py`
49
- 5. Commit and open PR.
50
 
51
  ## Runbook
52
  - Reusable process guide: [resource_cycle_runbook.md](resource_cycle_runbook.md)
 
1
  # Resource Automation
2
 
3
+ This repository uses automated discovery and promotion to keep Pashto resources current while preserving validation guardrails.
4
 
5
  ## Goals
6
  - Discover new Pashto-relevant resources from trusted public endpoints.
7
  - Keep a machine-readable canonical catalog.
8
+ - Auto-promote only candidates that pass strict validation and deduplication checks.
9
 
10
  ## Covered source types
11
  - Kaggle datasets
 
29
  - Validate catalog: `python scripts/validate_resource_catalog.py`
30
  - Generate markdown and search index: `python scripts/generate_resource_views.py`
31
  - Sync new candidates: `python scripts/sync_resources.py --limit 20`
32
+ - Auto-promote valid candidates: `python scripts/promote_candidates.py`
33
  - Full run wrapper: `python scripts/run_resource_cycle.py --limit 25`
34
 
35
  ## GitHub Actions
 
38
  - generated file consistency
39
  - markdown link checks
40
  - tests
41
+ - Resource Sync (`.github/workflows/resource_sync.yml`) runs daily, syncs candidates, auto-promotes valid non-duplicate entries, regenerates views, and opens a PR.
42
 
43
+ ## Promotion flow
44
+ 1. Sync candidates into `resources/catalog/pending_candidates.json`.
45
+ 2. Auto-promote valid, non-duplicate entries into `resources/catalog/resources.json`.
46
+ 3. Run:
 
47
  - `python scripts/validate_resource_catalog.py`
48
  - `python scripts/generate_resource_views.py`
49
+ 4. Review PR and merge.
50
 
51
  ## Runbook
52
  - Reusable process guide: [resource_cycle_runbook.md](resource_cycle_runbook.md)
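
The promotion flow above can be illustrated with a minimal sketch. The source of `scripts/promote_candidates.py` is not shown in this diff, so everything below is an assumption: the function name `promote_candidates`, the list of required fields, and the URL-normalization rule used for deduplication are all hypothetical. Only the `pashto_evidence` field and the `status` values come from the JSON shown in this commit.

```python
from typing import Any


def promote_candidates(
    catalog: list[dict[str, Any]],
    candidates: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (updated_catalog, remaining_candidates).

    Hypothetical rules, not the repository's actual checks:
    - valid: all assumed-required fields are non-empty and the candidate
      carries at least one marker under `pashto_evidence`
    - duplicate: same `id` or same normalized `url` as an existing entry
    """
    required = ("id", "title", "url", "category", "source")  # assumed field set
    seen_ids = {entry["id"] for entry in catalog}
    seen_urls = {entry["url"].rstrip("/").lower() for entry in catalog}

    promoted: list[dict[str, Any]] = []
    remaining: list[dict[str, Any]] = []
    for cand in candidates:
        evidence = cand.get("pashto_evidence") or {}
        url_key = str(cand.get("url", "")).rstrip("/").lower()
        is_valid = all(cand.get(key) for key in required) and bool(evidence.get("markers"))
        is_duplicate = cand.get("id") in seen_ids or url_key in seen_urls
        if is_valid and not is_duplicate:
            # Promoted entries switch from "candidate" to "verified".
            promoted.append({**cand, "status": "verified"})
            seen_ids.add(cand["id"])
            seen_urls.add(url_key)
        else:
            # Anything invalid or duplicated stays in the pending feed.
            remaining.append(cand)
    return catalog + promoted, remaining
```

Returning the remaining candidates separately keeps the pending feed and the catalog disjoint, which is one way to explain why promoted entries disappear from `pending_candidates.json` in this commit.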
docs/resource_cycle_runbook.md CHANGED
@@ -5,7 +5,7 @@ Use this runbook whenever you want to repeat the resource update process without
5
  ## Daily automation (already enabled)
6
  - Workflow: [../.github/workflows/resource_sync.yml](../.github/workflows/resource_sync.yml)
7
  - Schedule: every day at 04:00 UTC via GitHub Actions cron.
8
- - Output: updates [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json) and opens a review PR.
9
 
10
  ## Manual run (single command)
11
  Run from repository root:
@@ -16,32 +16,25 @@ python scripts/run_resource_cycle.py --limit 25
16
 
17
  What it executes:
18
  1. `python scripts/sync_resources.py --limit 25`
19
- 2. `python scripts/validate_resource_catalog.py`
20
- 3. `python scripts/generate_resource_views.py`
21
- 4. `python scripts/check_links.py`
22
- 5. `python -m pytest -q`
 
23
 
24
  Candidate sources in the sync step include Kaggle datasets, Hugging Face datasets/models/spaces, GitHub repositories, GitLab repositories, Zenodo records, Dataverse datasets, DataCite DOI records, and paper endpoints (arXiv, Semantic Scholar, OpenAlex, Crossref).
25
 
26
- ## Discovery-only mode
27
- If you only want fresh candidates:
28
-
29
- ```bash
30
- python scripts/run_resource_cycle.py --discover-only --limit 25
31
- ```
32
-
33
- ## Promotion step (manual review)
34
- After discovery, promote only approved resources:
35
- 1. Open [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json).
36
- 2. Copy selected entries into [../resources/catalog/resources.json](../resources/catalog/resources.json).
37
- 3. Ensure unique `id` and valid evidence fields.
38
- 4. Re-run:
39
- - `python scripts/run_resource_cycle.py --skip-pytest`
40
  5. Commit and push.
41
 
42
  ## Guardrails
43
- - Do not auto-promote candidates without evidence and license review.
44
- - Keep `status: verified` only for reviewed entries.
45
  - Do not promote "reference-only" resources where Pashto is incidental; only Pashto-centric resources are eligible.
46
  - Treat spelling variants as valid Pashto markers during review (`pashto`, `pukhto`, `pushto`, `pakhto`, `pashto-script`).
47
  - Generated files must be committed after catalog updates.
 
5
  ## Daily automation (already enabled)
6
  - Workflow: [../.github/workflows/resource_sync.yml](../.github/workflows/resource_sync.yml)
7
  - Schedule: every day at 04:00 UTC via GitHub Actions cron.
8
+ - Output: updates [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json), auto-promotes valid non-duplicate entries into [../resources/catalog/resources.json](../resources/catalog/resources.json), regenerates views, and opens a review PR.
9
 
10
  ## Manual run (single command)
11
  Run from repository root:
 
16
 
17
  What it executes:
18
  1. `python scripts/sync_resources.py --limit 25`
19
+ 2. `python scripts/promote_candidates.py`
20
+ 3. `python scripts/validate_resource_catalog.py`
21
+ 4. `python scripts/generate_resource_views.py`
22
+ 5. `python scripts/check_links.py`
23
+ 6. `python -m pytest -q`
24
 
25
  Candidate sources in the sync step include Kaggle datasets, Hugging Face datasets/models/spaces, GitHub repositories, GitLab repositories, Zenodo records, Dataverse datasets, DataCite DOI records, and paper endpoints (arXiv, Semantic Scholar, OpenAlex, Crossref).
26
 
27
+ ## Discovery-only mode + manual promotion
28
+ If you want fresh candidates without auto-promotion:
29
+ 1. Run `python scripts/run_resource_cycle.py --discover-only --limit 25`.
30
+ 2. Review [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json).
31
+ 3. Manually move selected entries into [../resources/catalog/resources.json](../resources/catalog/resources.json).
32
+ 4. Re-run `python scripts/run_resource_cycle.py --skip-pytest`.
33
  5. Commit and push.
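
The step order listed in this runbook can be sketched as a command builder. This is not the code of `scripts/run_resource_cycle.py`, which this diff does not show. The function names `build_cycle_steps` and `run_cycle` are hypothetical, and the behavior assumed for `--discover-only` (stop after the sync step) and `--skip-pytest` (drop the final test step) is inferred from the runbook text only.

```python
import subprocess
import sys


def build_cycle_steps(
    limit: int = 25,
    discover_only: bool = False,
    skip_pytest: bool = False,
) -> list[list[str]]:
    """Build the ordered command list for one resource cycle (assumed behavior)."""
    py = sys.executable
    steps = [[py, "scripts/sync_resources.py", "--limit", str(limit)]]
    if discover_only:
        # Assumption: discovery-only mode stops before auto-promotion.
        return steps
    steps += [
        [py, "scripts/promote_candidates.py"],
        [py, "scripts/validate_resource_catalog.py"],
        [py, "scripts/generate_resource_views.py"],
        [py, "scripts/check_links.py"],
    ]
    if not skip_pytest:
        steps.append([py, "-m", "pytest", "-q"])
    return steps


def run_cycle(steps: list[list[str]]) -> None:
    """Run each step in order and stop at the first failing command."""
    for cmd in steps:
        subprocess.run(cmd, check=True)
```

Separating the command list from its execution makes the ordering easy to inspect or test without invoking any script.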
34
 
35
  ## Guardrails
36
+ - Auto-promotion accepts only entries that pass dedupe and catalog validation checks.
37
+ - Keep `status: verified` for entries that pass automation checks and repository review.
38
  - Do not promote "reference-only" resources where Pashto is incidental; only Pashto-centric resources are eligible.
39
  - Treat spelling variants as valid Pashto markers during review (`pashto`, `pukhto`, `pushto`, `pakhto`, `pashto-script`).
40
  - Generated files must be committed after catalog updates.
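
The spelling-variant guardrail above can be sketched as a small matcher. The marker list comes from the guardrail itself; the function name `find_pashto_markers` and the choice to match only where a marker is not surrounded by other letters are assumptions, not the repository's implementation. A keyword match alone cannot tell whether a resource is Pashto-centric, so a check like this would only support the "reference-only" guardrail, not replace it.

```python
import re

# Spelling variants named in the guardrail.
PASHTO_MARKERS = ("pashto", "pukhto", "pushto", "pakhto", "pashto-script")


def find_pashto_markers(text: str) -> list[str]:
    """Return the markers found in `text`, in the order listed above.

    A marker counts only when it is not directly preceded or followed by
    another ASCII letter (an assumed rule), so "pashto_corpus" matches
    but "pashtoish" does not.
    """
    lowered = text.lower()
    found = []
    for marker in PASHTO_MARKERS:
        pattern = r"(?<![a-z])" + re.escape(marker) + r"(?![a-z])"
        if re.search(pattern, lowered):
            found.append(marker)
    return found
```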
docs/search/resources.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
- "generated_on": "2026-02-18T00:00:00Z",
3
- "count": 101,
4
  "resources": [
5
  {
6
  "id": "dataset-common-voice-ps-v24",
@@ -2522,6 +2522,243 @@
2522
  "Pashto",
2523
  "parts-of-speech"
2524
  ]
2525
  }
2526
  ]
2527
  }
 
1
  {
2
+ "generated_on": "2026-02-20T00:00:00Z",
3
+ "count": 112,
4
  "resources": [
5
  {
6
  "id": "dataset-common-voice-ps-v24",
 
2522
  "Pashto",
2523
  "parts-of-speech"
2524
  ]
2525
+ },
2526
+ {
2527
+ "id": "candidate-hf-dataset-aamirhs-pashto",
2528
+ "title": "aamirhs/pashto",
2529
+ "url": "https://huggingface.co/datasets/aamirhs/pashto",
2530
+ "category": "dataset",
2531
+ "source": "huggingface",
2532
+ "status": "verified",
2533
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2534
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2535
+ "tasks": [],
2536
+ "tags": [
2537
+ "pashto",
2538
+ "candidate",
2539
+ "dataset"
2540
+ ],
2541
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2542
+ "evidence_url": "https://huggingface.co/datasets/aamirhs/pashto",
2543
+ "markers": [
2544
+ "pashto"
2545
+ ]
2546
+ },
2547
+ {
2548
+ "id": "candidate-hf-dataset-arsalagrey-pashto",
2549
+ "title": "arsalagrey/pashto",
2550
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto",
2551
+ "category": "dataset",
2552
+ "source": "huggingface",
2553
+ "status": "verified",
2554
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2555
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2556
+ "tasks": [],
2557
+ "tags": [
2558
+ "pashto",
2559
+ "candidate",
2560
+ "dataset"
2561
+ ],
2562
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2563
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto",
2564
+ "markers": [
2565
+ "pashto"
2566
+ ]
2567
+ },
2568
+ {
2569
+ "id": "candidate-hf-dataset-arsalagrey-pashto-books",
2570
+ "title": "arsalagrey/pashto-books",
2571
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto-books",
2572
+ "category": "dataset",
2573
+ "source": "huggingface",
2574
+ "status": "verified",
2575
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2576
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2577
+ "tasks": [],
2578
+ "tags": [
2579
+ "pashto",
2580
+ "candidate",
2581
+ "dataset"
2582
+ ],
2583
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2584
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto-books",
2585
+ "markers": [
2586
+ "pashto"
2587
+ ]
2588
+ },
2589
+ {
2590
+ "id": "candidate-hf-dataset-arsalagrey-pashto-books-json",
2591
+ "title": "arsalagrey/pashto-books-json",
2592
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto-books-json",
2593
+ "category": "dataset",
2594
+ "source": "huggingface",
2595
+ "status": "verified",
2596
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2597
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2598
+ "tasks": [],
2599
+ "tags": [
2600
+ "pashto",
2601
+ "candidate",
2602
+ "dataset"
2603
+ ],
2604
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2605
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto-books-json",
2606
+ "markers": [
2607
+ "pashto"
2608
+ ]
2609
+ },
2610
+ {
2611
+ "id": "candidate-hf-model-jawaria-wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2612
+ "title": "Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2613
+ "url": "https://huggingface.co/Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2614
+ "category": "model",
2615
+ "source": "huggingface",
2616
+ "status": "verified",
2617
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
2618
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2619
+ "tasks": [],
2620
+ "tags": [
2621
+ "pashto",
2622
+ "candidate",
2623
+ "model"
2624
+ ],
2625
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2626
+ "evidence_url": "https://huggingface.co/Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2627
+ "markers": [
2628
+ "pashto"
2629
+ ]
2630
+ },
2631
+ {
2632
+ "id": "candidate-zenodo-dataset-oped-open-pashto-english-dictionary-preliminary-version-30-october-2025",
2633
+ "title": "OPED (Open Pashto-English Dictionary): Preliminary version, 30 October 2025",
2634
+ "url": "https://zenodo.org/records/17487678",
2635
+ "category": "dataset",
2636
+ "source": "zenodo",
2637
+ "status": "verified",
2638
+ "summary": "Candidate resource returned from Zenodo search for Pashto.",
2639
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2640
+ "tasks": [],
2641
+ "tags": [
2642
+ "pashto",
2643
+ "candidate",
2644
+ "dataset",
2645
+ "zenodo"
2646
+ ],
2647
+ "evidence_text": "Zenodo metadata includes Pashto markers in title or description.",
2648
+ "evidence_url": "https://zenodo.org/records/17487678",
2649
+ "markers": [
2650
+ "pashto"
2651
+ ]
2652
+ },
2653
+ {
2654
+ "id": "candidate-kaggle-dataset-abdulbasitkh-pashto-isolated-alphabets-and-numerals",
2655
+ "title": "Pashto Isolated Alphabets and Numerals",
2656
+ "url": "https://www.kaggle.com/datasets/abdulbasitkh/pashto-isolated-alphabetss-and-numerals",
2657
+ "category": "dataset",
2658
+ "source": "kaggle",
2659
+ "status": "verified",
2660
+ "summary": "Pashto Islated Alphabets and Numerals Handwritten and Printed",
2661
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2662
+ "tasks": [],
2663
+ "tags": [
2664
+ "pashto",
2665
+ "candidate",
2666
+ "dataset",
2667
+ "kaggle"
2668
+ ],
2669
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2670
+ "evidence_url": "https://www.kaggle.com/datasets/abdulbasitkh/pashto-isolated-alphabetss-and-numerals",
2671
+ "markers": [
2672
+ "Pashto"
2673
+ ]
2674
+ },
2675
+ {
2676
+ "id": "candidate-kaggle-dataset-alimuhammadasad-pashto-poetry",
2677
+ "title": "Pashto Poetry",
2678
+ "url": "https://www.kaggle.com/datasets/alimuhammadasad/pashto-poetry",
2679
+ "category": "dataset",
2680
+ "source": "kaggle",
2681
+ "status": "verified",
2682
+ "summary": "Candidate Kaggle dataset returned from Pashto search.",
2683
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2684
+ "tasks": [],
2685
+ "tags": [
2686
+ "pashto",
2687
+ "candidate",
2688
+ "dataset",
2689
+ "kaggle"
2690
+ ],
2691
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2692
+ "evidence_url": "https://www.kaggle.com/datasets/alimuhammadasad/pashto-poetry",
2693
+ "markers": [
2694
+ "Pashto"
2695
+ ]
2696
+ },
2697
+ {
2698
+ "id": "candidate-kaggle-dataset-mahibullahmudaser-pashto-text-characters-sample",
2699
+ "title": "Pashto text characters sample",
2700
+ "url": "https://www.kaggle.com/datasets/mahibullahmudaser/pashtochracterssample",
2701
+ "category": "dataset",
2702
+ "source": "kaggle",
2703
+ "status": "verified",
2704
+ "summary": "Pashto text characters sample",
2705
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2706
+ "tasks": [],
2707
+ "tags": [
2708
+ "pashto",
2709
+ "candidate",
2710
+ "dataset",
2711
+ "kaggle"
2712
+ ],
2713
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2714
+ "evidence_url": "https://www.kaggle.com/datasets/mahibullahmudaser/pashtochracterssample",
2715
+ "markers": [
2716
+ "Pashto"
2717
+ ]
2718
+ },
2719
+ {
2720
+ "id": "candidate-kaggle-dataset-ahmadferozafshar-pashto-language-alphabets",
2721
+ "title": "pashto_language_alphabets",
2722
+ "url": "https://www.kaggle.com/datasets/ahmadferozafshar/pashto-language-alphabets",
2723
+ "category": "dataset",
2724
+ "source": "kaggle",
2725
+ "status": "verified",
2726
+ "summary": "Candidate Kaggle dataset returned from Pashto search.",
2727
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2728
+ "tasks": [],
2729
+ "tags": [
2730
+ "pashto",
2731
+ "candidate",
2732
+ "dataset",
2733
+ "kaggle"
2734
+ ],
2735
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2736
+ "evidence_url": "https://www.kaggle.com/datasets/ahmadferozafshar/pashto-language-alphabets",
2737
+ "markers": [
2738
+ "Pashto"
2739
+ ]
2740
+ },
2741
+ {
2742
+ "id": "candidate-kaggle-dataset-aimalrezvan-pashto-language-characters",
2743
+ "title": "Pashto_language_characters",
2744
+ "url": "https://www.kaggle.com/datasets/aimalrezvan/pashto-language-characters",
2745
+ "category": "dataset",
2746
+ "source": "kaggle",
2747
+ "status": "verified",
2748
+ "summary": "Pashto_language_characters are Pashto lanugage full and semi characters.",
2749
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2750
+ "tasks": [],
2751
+ "tags": [
2752
+ "pashto",
2753
+ "candidate",
2754
+ "dataset",
2755
+ "kaggle"
2756
+ ],
2757
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2758
+ "evidence_url": "https://www.kaggle.com/datasets/aimalrezvan/pashto-language-characters",
2759
+ "markers": [
2760
+ "Pashto"
2761
+ ]
2762
  }
2763
  ]
2764
  }
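
Because the `count` field changes together with the appended entries in this payload, a consistency check is one way to catch a stale header. The sketch below is hypothetical: `check_search_payload` is not a function from this repository, and whether `scripts/validate_resource_catalog.py` performs a similar check is not shown in the diff.

```python
from collections import Counter
from typing import Any


def check_search_payload(payload: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    resources = payload.get("resources", [])

    # The header count should match the number of listed resources.
    if payload.get("count") != len(resources):
        problems.append(
            f"count is {payload.get('count')} but {len(resources)} resources are listed"
        )

    # Every resource id should appear exactly once.
    id_counts = Counter(entry.get("id") for entry in resources)
    for resource_id, seen in id_counts.items():
        if seen > 1:
            problems.append(f"duplicate id: {resource_id}")
    return problems
```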
resources/README.md CHANGED
@@ -3,8 +3,8 @@
3
  Structured, Pashto-focused resource tracking lives in this folder.
4
 
5
  ## Sections
6
- - Datasets (38): [datasets/README.md](datasets/README.md)
7
- - Models (17): [models/README.md](models/README.md)
8
  - Benchmarks (4): [benchmarks/README.md](benchmarks/README.md)
9
  - Tools (0): [tools/README.md](tools/README.md)
10
  - Papers (24): [papers/README.md](papers/README.md)
@@ -22,4 +22,4 @@ Structured, Pashto-focused resource tracking lives in this folder.
22
  - Run `python scripts/validate_resource_catalog.py` before opening a PR.
23
  - Run `python scripts/generate_resource_views.py` after catalog changes.
24
 
25
- Verified resource count: `101`
 
3
  Structured, Pashto-focused resource tracking lives in this folder.
4
 
5
  ## Sections
6
+ - Datasets (48): [datasets/README.md](datasets/README.md)
7
+ - Models (18): [models/README.md](models/README.md)
8
  - Benchmarks (4): [benchmarks/README.md](benchmarks/README.md)
9
  - Tools (0): [tools/README.md](tools/README.md)
10
  - Papers (24): [papers/README.md](papers/README.md)
 
22
  - Run `python scripts/validate_resource_catalog.py` before opening a PR.
23
  - Run `python scripts/generate_resource_views.py` after catalog changes.
24
 
25
+ Verified resource count: `112`
resources/catalog/README.md CHANGED
@@ -1,18 +1,20 @@
1
- # Resource Catalog
2
 
3
  This folder holds machine-readable resource data used by docs and GitHub Pages search.
4
 
5
  ## Files
6
  - `resources.json`: canonical Pashto resource catalog (source of truth).
7
- - `pending_candidates.json`: automation output for candidate resources requiring review.
8
  - `resource.template.json`: starter template for adding a new resource entry.
9
 
10
  ## Required workflow
11
- 1. Update `resources.json`.
12
- 2. Run `python scripts/validate_resource_catalog.py`.
13
- 3. Run `python scripts/generate_resource_views.py`.
14
- 4. Commit both catalog and generated markdown/search files.
 
15
 
16
  ## Promotion guardrail
17
- - Promote only Pashto-centric resources. Exclude entries where Pashto appears only as a side reference.
18
- - Accept Pashto naming variants during review (`pashto`, `pukhto`, `pushto`, `pakhto`, `پښتو`).
 
 
1
+ # Resource Catalog
2
 
3
  This folder holds machine-readable resource data used by docs and GitHub Pages search.
4
 
5
  ## Files
6
  - `resources.json`: canonical Pashto resource catalog (source of truth).
7
+ - `pending_candidates.json`: automation output for discovered candidate resources.
8
  - `resource.template.json`: starter template for adding a new resource entry.
9
 
10
  ## Required workflow
11
+ 1. Sync candidates: `python scripts/sync_resources.py --limit 20`.
12
+ 2. Auto-promote valid entries: `python scripts/promote_candidates.py`.
13
+ 3. Run `python scripts/validate_resource_catalog.py`.
14
+ 4. Run `python scripts/generate_resource_views.py`.
15
+ 5. Commit catalog and generated markdown/search files.
16
 
17
  ## Promotion guardrail
18
+ - Auto-promotion accepts only valid non-duplicate entries that pass catalog validation.
19
+ - Keep only Pashto-centric resources. Exclude entries where Pashto appears only as a side reference.
20
+ - Accept Pashto naming variants (`pashto`, `pukhto`, `pushto`, `pakhto`, `pashto-script`).
resources/catalog/pending_candidates.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "generated_on": "2026-02-18T11:04:12.305454+00:00",
3
  "sources": [
4
  "kaggle-datasets",
5
  "huggingface-datasets",
@@ -15,7 +15,7 @@
15
  "arxiv",
16
  "semantic-scholar"
17
  ],
18
- "candidate_count": 137,
19
  "candidates": [
20
  {
21
  "id": "candidate-s2-a-comparison-of-pashto-and-turkmen-languages-vowel",
@@ -40,6 +40,29 @@
40
  "paper"
41
  ]
42
  },
43
  {
44
  "id": "candidate-openalex-a-new-etymological-vocabulary-of-pashto",
45
  "title": "A New Etymological Vocabulary of Pashto",
@@ -159,30 +182,6 @@
159
  "space"
160
  ]
161
  },
162
- {
163
- "id": "candidate-hf-project-afaaaak-urdu-pashto-translator",
164
- "title": "afaaaak/urdu_pashto_translator",
165
- "url": "https://huggingface.co/spaces/afaaaak/urdu_pashto_translator",
166
- "category": "project",
167
- "source": "huggingface",
168
- "status": "candidate",
169
- "summary": "Candidate project app returned from Hugging Face Spaces Pashto search.",
170
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
171
- "tasks": [],
172
- "pashto_evidence": {
173
- "evidence_text": "Matched by Pashto keyword in Hugging Face Spaces search.",
174
- "evidence_url": "https://huggingface.co/spaces/afaaaak/urdu_pashto_translator",
175
- "markers": [
176
- "pashto"
177
- ]
178
- },
179
- "tags": [
180
- "pashto",
181
- "candidate",
182
- "project",
183
- "space"
184
- ]
185
- },
186
  {
187
  "id": "candidate-gh-project-amirajorloo-jira-auto-direction-chrome-extension",
188
  "title": "amirajorloo/jira-auto-direction-chrome-extension",
@@ -210,29 +209,6 @@
210
  "farsi"
211
  ]
212
  },
213
- {
214
- "id": "candidate-s2-an-acoustic-analysis-of-consonants-of-khattak-dialect-of-pashto",
215
- "title": "An Acoustic Analysis of consonants of Khattak Dialect of Pashto",
216
- "url": "https://www.semanticscholar.org/paper/ed06d206e60a62c2bebdd487b4f8dea253a9a0a8",
217
- "category": "paper",
218
- "source": "other",
219
- "status": "candidate",
220
- "summary": "Pashto, an ancient language written in Perso-Arabic script, is predominantly spoken in Pakistan's Khyber Pakhtunkhwa Province and Afghanistan. Despite its wide usage, more research is needed on the consonantal sounds of the Khattak dialect.",
221
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
222
- "tasks": [],
223
- "pashto_evidence": {
224
- "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
225
- "evidence_url": "https://www.semanticscholar.org/paper/ed06d206e60a62c2bebdd487b4f8dea253a9a0a8",
226
- "markers": [
227
- "pashto"
228
- ]
229
- },
230
- "tags": [
231
- "pashto",
232
- "candidate",
233
- "paper"
234
- ]
235
- },
236
  {
237
  "id": "candidate-zenodo-paper-an-analysis-of-freudian-concept-of-mourning-in-pashto-tappas-on-the-theme-of-mig",
238
  "title": "AN ANALYSIS OF FREUDIAN CONCEPT OF MOURNING IN PASHTO TAPPAS ON THE THEME OF MIGRATION",
@@ -467,6 +443,29 @@
467
  "zenodo"
468
  ]
469
  },
470
  {
471
  "id": "candidate-zenodo-paper-critical-study-of-the-travelogues-of-dr-altaf-yousafzai-in-the-context-of-thaila",
472
  "title": "Critical study of the travelogues of Dr Altaf Yousafzai (In The Context of \"Thailand kay Rang\", \"Nile kay Sang\" and \"Bakhal-e-Hinduwush Bakhsham\")",
@@ -515,18 +514,18 @@
515
  ]
516
  },
517
  {
518
- "id": "candidate-s2-deictic-field-time-of-action-in-the-semantics-of-the-pashto-language-the-time-fi",
519
- "title": "DEICTIC FIELD “TIME OF ACTION” IN THE SEMANTICS OF THE PASHTO LANGUAGE, THE “TIME” FIELD: BACKGROUND OF THE PROBLEM",
520
- "url": "https://www.semanticscholar.org/paper/3358d828c2ff07a45d614fd1d81cf44d5c55cad8",
521
  "category": "paper",
522
  "source": "other",
523
  "status": "candidate",
524
- "summary": "The article examines the semantic modeling of the category of time in language through the lens of deictic field theory, with a focus on Pashto adverbs. It outlines four major approaches to modeling semantic fields - phenomenological, lexic",
525
  "primary_use": "Needs maintainer review before promotion to verified catalog.",
526
  "tasks": [],
527
  "pashto_evidence": {
528
  "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
529
- "evidence_url": "https://www.semanticscholar.org/paper/3358d828c2ff07a45d614fd1d81cf44d5c55cad8",
530
  "markers": [
531
  "pashto"
532
  ]
@@ -633,34 +632,10 @@
633
  "crossref"
634
  ]
635
  },
636
- {
637
- "id": "candidate-hf-project-drsaqlainhassan-pashtotokenixer",
638
- "title": "DrSaqlainHassan/PashtoTokenixer",
639
- "url": "https://huggingface.co/spaces/DrSaqlainHassan/PashtoTokenixer",
640
- "category": "project",
641
- "source": "huggingface",
642
- "status": "candidate",
643
- "summary": "Candidate project app returned from Hugging Face Spaces Pashto search.",
644
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
645
- "tasks": [],
646
- "pashto_evidence": {
647
- "evidence_text": "Matched by Pashto keyword in Hugging Face Spaces search.",
648
- "evidence_url": "https://huggingface.co/spaces/DrSaqlainHassan/PashtoTokenixer",
649
- "markers": [
650
- "pashto"
651
- ]
652
- },
653
- "tags": [
654
- "pashto",
655
- "candidate",
656
- "project",
657
- "space"
658
- ]
659
- },
660
  {
661
  "id": "candidate-datacite-project-early-pregnancy-loss-pashto",
662
  "title": "Early Pregnancy Loss [Pashto]",
663
- "url": "https://zenodo.org/doi/10.5281/zenodo.18325729",
664
  "category": "project",
665
  "source": "datacite",
666
  "status": "candidate",
@@ -669,7 +644,7 @@
669
  "tasks": [],
670
  "pashto_evidence": {
671
  "evidence_text": "DataCite metadata includes Pashto markers in title or description.",
672
- "evidence_url": "https://zenodo.org/doi/10.5281/zenodo.18325729",
673
  "markers": [
674
  "pashto"
675
  ]
@@ -896,29 +871,6 @@
896
  "zenodo"
897
  ]
898
  },
899
- {
900
- "id": "candidate-s2-exploring-the-impacts-of-emotion-through-language-learning-on-pashto-speakers-yo",
901
- "title": "Exploring the Impacts of Emotion through Language Learning on Pashto Speakers Young Adulthood in District Peshawar",
902
- "url": "https://www.semanticscholar.org/paper/4549649112553aabccfac8b918c7e98cdbdd0f09",
903
- "category": "paper",
904
- "source": "other",
905
- "status": "candidate",
906
- "summary": "The current study explores the emotional experiences of Pashto speakers learning a second language, with a focus on how emotions are expressed, understood, and influenced by cultural and linguistic factors. While language learning is often",
907
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
908
- "tasks": [],
909
- "pashto_evidence": {
910
- "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
911
- "evidence_url": "https://www.semanticscholar.org/paper/4549649112553aabccfac8b918c7e98cdbdd0f09",
912
- "markers": [
913
- "pashto"
914
- ]
915
- },
916
- "tags": [
917
- "pashto",
918
- "candidate",
919
- "paper"
920
- ]
921
- },
922
  {
923
  "id": "candidate-datacite-paper-fairness-evaluation-and-inference-level-mitigation-in-llms",
924
  "title": "Fairness Evaluation and Inference Level Mitigation in LLMs",
@@ -973,7 +925,7 @@
973
  {
974
  "id": "candidate-datacite-project-female-birth-control-part-i-pashto",
975
  "title": "Female Birth Control Part I [Pashto]",
976
- "url": "https://zenodo.org/doi/10.5281/zenodo.18325040",
977
  "category": "project",
978
  "source": "datacite",
979
  "status": "candidate",
@@ -982,7 +934,7 @@
982
  "tasks": [],
983
  "pashto_evidence": {
984
  "evidence_text": "DataCite metadata includes Pashto markers in title or description.",
985
- "evidence_url": "https://zenodo.org/doi/10.5281/zenodo.18325040",
986
  "markers": [
987
  "pashto"
988
  ]
@@ -1114,18 +1066,18 @@
1114
  ]
1115
  },
1116
  {
1117
- "id": "candidate-s2-gemination-in-pashto",
1118
- "title": "Gemination in Pashto",
1119
- "url": "https://www.semanticscholar.org/paper/ccf72dc1bcd0a0cd3a4b97cc7fe1830c37922c64",
1120
  "category": "paper",
1121
  "source": "other",
1122
  "status": "candidate",
1123
- "summary": "The purpose of the present study was to analyze gemination in Pashto. For this purpose, first, data was collected generally from elder native speakers who speak the Yousafzai dialect. The collected data then was verified and discussed sever",
1124
  "primary_use": "Needs maintainer review before promotion to verified catalog.",
1125
  "tasks": [],
1126
  "pashto_evidence": {
1127
  "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
1128
- "evidence_url": "https://www.semanticscholar.org/paper/ccf72dc1bcd0a0cd3a4b97cc7fe1830c37922c64",
1129
  "markers": [
1130
  "pashto"
1131
  ]
@@ -1160,6 +1112,30 @@
1160
  "github"
1161
  ]
1162
  },
1163
  {
1164
  "id": "candidate-gh-project-haseebjanhamraz-pashtofonts",
1165
  "title": "haseebjanhamraz/PashtoFonts",
@@ -1234,52 +1210,6 @@
1234
  "paper"
1235
  ]
1236
  },
1237
- {
1238
- "id": "candidate-hf-dataset-ihanif-pashto-speech-2k",
1239
- "title": "ihanif/pashto_speech_2k",
1240
- "url": "https://huggingface.co/datasets/ihanif/pashto_speech_2k",
1241
- "category": "dataset",
1242
- "source": "huggingface",
1243
- "status": "candidate",
1244
- "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
1245
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
1246
- "tasks": [],
1247
- "pashto_evidence": {
1248
- "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
1249
- "evidence_url": "https://huggingface.co/datasets/ihanif/pashto_speech_2k",
1250
- "markers": [
1251
- "pashto"
1252
- ]
1253
- },
1254
- "tags": [
1255
- "pashto",
1256
- "candidate",
1257
- "dataset"
1258
- ]
1259
- },
1260
- {
1261
- "id": "candidate-hf-dataset-ihanif-pashto-speech-3k",
1262
- "title": "ihanif/pashto_speech_3k",
1263
- "url": "https://huggingface.co/datasets/ihanif/pashto_speech_3k",
1264
- "category": "dataset",
1265
- "source": "huggingface",
1266
- "status": "candidate",
1267
- "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
1268
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
1269
- "tasks": [],
1270
- "pashto_evidence": {
1271
- "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
1272
- "evidence_url": "https://huggingface.co/datasets/ihanif/pashto_speech_3k",
1273
- "markers": [
1274
- "pashto"
1275
- ]
1276
- },
1277
- "tags": [
1278
- "pashto",
1279
- "candidate",
1280
- "dataset"
1281
- ]
1282
- },
1283
  {
1284
  "id": "candidate-hf-project-ihanif-whisper-medium-pashto",
1285
  "title": "ihanif/whisper-medium-pashto",
@@ -1473,52 +1403,6 @@
1473
  "openalex"
1474
  ]
1475
  },
1476
- {
1477
- "id": "candidate-hf-dataset-koochikoo25-pashto-concatenated",
1478
- "title": "koochikoo25/Pashto-Concatenated",
1479
- "url": "https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated",
1480
- "category": "dataset",
1481
- "source": "huggingface",
1482
- "status": "candidate",
1483
- "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
1484
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
1485
- "tasks": [],
1486
- "pashto_evidence": {
1487
- "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
1488
- "evidence_url": "https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated",
1489
- "markers": [
1490
- "pashto"
1491
- ]
1492
- },
1493
- "tags": [
1494
- "pashto",
1495
- "candidate",
1496
- "dataset"
1497
- ]
1498
- },
1499
- {
1500
- "id": "candidate-hf-model-koochikoo25-whisper-medium-pashto",
1501
- "title": "koochikoo25/Whisper-medium-pashto",
1502
- "url": "https://huggingface.co/koochikoo25/Whisper-medium-pashto",
1503
- "category": "model",
1504
- "source": "huggingface",
1505
- "status": "candidate",
1506
- "summary": "Candidate model returned from Hugging Face search for Pashto.",
1507
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
1508
- "tasks": [],
1509
- "pashto_evidence": {
1510
- "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
1511
- "evidence_url": "https://huggingface.co/koochikoo25/Whisper-medium-pashto",
1512
- "markers": [
1513
- "pashto"
1514
- ]
1515
- },
1516
- "tags": [
1517
- "pashto",
1518
- "candidate",
1519
- "model"
1520
- ]
1521
- },
1522
  {
1523
  "id": "candidate-zenodo-paper-language-barrier-and-its-effect-on-learning-at-the-public-primary-school-level-i",
1524
  "title": "Language Barrier and its Effect on Learning at the Public Primary School Level in Lahore",
@@ -1543,29 +1427,6 @@
1543
  "zenodo"
1544
  ]
1545
  },
1546
- {
1547
- "id": "candidate-s2-language-of-resistance-in-pashto-poetry-during-the-war-on-terror",
1548
- "title": "Language of Resistance in Pashto Poetry during the War on Terror",
1549
- "url": "https://www.semanticscholar.org/paper/23dbf301cdadbb3e1e309ed232baf5cfb2b6414b",
1550
- "category": "paper",
1551
- "source": "other",
1552
- "status": "candidate",
1553
- "summary": "The paper explores the compelling nature of Pashto poetry as a weapon of resistance in the War on Terror, how it has been used to reveal Pashtun identity, political protest, and cultural strength. With military activities dismantling the Pa",
1554
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
1555
- "tasks": [],
1556
- "pashto_evidence": {
1557
- "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
1558
- "evidence_url": "https://www.semanticscholar.org/paper/23dbf301cdadbb3e1e309ed232baf5cfb2b6414b",
1559
- "markers": [
1560
- "pashto"
1561
- ]
1562
- },
1563
- "tags": [
1564
- "pashto",
1565
- "candidate",
1566
- "paper"
1567
- ]
1568
- },
1569
  {
1570
  "id": "candidate-crossref-le-verbe-pashto",
1571
  "title": "Le verbe pashto",
@@ -1796,29 +1657,6 @@
1796
  "dataverse"
1797
  ]
1798
  },
1799
- {
1800
- "id": "candidate-s2-multilingual-interplay-and-the-influence-of-the-official-languages-on-the-use-an",
1801
- "title": "Multilingual interplay and the influence of the official languages on the use and transmission of the regional language Pashto: a case study of a Pashtun family in Pakistan",
1802
- "url": "https://www.semanticscholar.org/paper/2b42be99fa7ad002efd3cf1d1c75834b69108a07",
1803
- "category": "paper",
1804
- "source": "other",
1805
- "status": "candidate",
1806
- "summary": "ABSTRACT The impact of English and Urdu in Pakistan on the intergenerational transmission and use of the regional language, Pashto, in the family domain is not well known. This paper, therefore, examines language use patterns in a middle-cl",
1807
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
1808
- "tasks": [],
1809
- "pashto_evidence": {
1810
- "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
1811
- "evidence_url": "https://www.semanticscholar.org/paper/2b42be99fa7ad002efd3cf1d1c75834b69108a07",
1812
- "markers": [
1813
- "pashto"
1814
- ]
1815
- },
1816
- "tags": [
1817
- "pashto",
1818
- "candidate",
1819
- "paper"
1820
- ]
1821
- },
1822
  {
1823
  "id": "candidate-gh-project-nabeelest-pakhtoodle",
1824
  "title": "nabeelest/pakhtoodle",
@@ -2235,6 +2073,52 @@
2235
  "kaggle"
2236
  ]
2237
  },
2238
  {
2239
  "id": "candidate-crossref-pashto-tappa",
2240
  "title": "Pashto Tappa",
@@ -2646,29 +2530,6 @@
2646
  "dataverse"
2647
  ]
2648
  },
2649
- {
2650
- "id": "candidate-s2-resolution-of-ellipses-in-wh-constructions-in-pashto-language",
2651
- "title": "Resolution of Ellipses in WH-constructions in Pashto Language",
2652
- "url": "https://www.semanticscholar.org/paper/b9d84d79be0e90e026bbd596276697eeca5d9474",
2653
- "category": "paper",
2654
- "source": "other",
2655
- "status": "candidate",
2656
- "summary": "The Pashto language has a question structure consisting of a WH-word and an answer to the question, this is called WH-structure. The resolution of ellipsis occurs in most cases in both written and spoken language in its WH construction. In",
2657
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
2658
- "tasks": [],
2659
- "pashto_evidence": {
2660
- "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
2661
- "evidence_url": "https://www.semanticscholar.org/paper/b9d84d79be0e90e026bbd596276697eeca5d9474",
2662
- "markers": [
2663
- "pashto"
2664
- ]
2665
- },
2666
- "tags": [
2667
- "pashto",
2668
- "candidate",
2669
- "paper"
2670
- ]
2671
- },
2672
  {
2673
  "id": "candidate-openalex-scale-and-rotation-invariant-recognition-of-cursive-pashto-script-using-sift-fea",
2674
  "title": "Scale and rotation invariant recognition of cursive Pashto script using SIFT features",
@@ -2985,29 +2846,6 @@
2985
  "openalex"
2986
  ]
2987
  },
2988
- {
2989
- "id": "candidate-s2-the-development-and-evaluation-of-an-automatic-clitic-generator-for-pashto-langu",
2990
- "title": "The development and evaluation of an automatic clitic generator for Pashto language",
2991
- "url": "https://www.semanticscholar.org/paper/3d95449d67799fcac83f855984cb0c29cc500d7b",
2992
- "category": "paper",
2993
- "source": "other",
2994
- "status": "candidate",
2995
- "summary": "Candidate paper returned from Semantic Scholar search for Pashto.",
2996
- "primary_use": "Needs maintainer review before promotion to verified catalog.",
2997
- "tasks": [],
2998
- "pashto_evidence": {
2999
- "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
3000
- "evidence_url": "https://www.semanticscholar.org/paper/3d95449d67799fcac83f855984cb0c29cc500d7b",
3001
- "markers": [
3002
- "pashto"
3003
- ]
3004
- },
3005
- "tags": [
3006
- "pashto",
3007
- "candidate",
3008
- "paper"
3009
- ]
3010
- },
3011
  {
3012
  "id": "candidate-openalex-the-grammar-of-clitics-evidence-from-pashto-and-other-languages",
3013
  "title": "The grammar of clitics : evidence from Pashto and other languages",
@@ -3104,6 +2942,29 @@
3104
  "datacite"
3105
  ]
3106
  },
3107
  {
3108
  "id": "candidate-crossref-topicalization-in-pashto",
3109
  "title": "Topicalization in Pashto",
@@ -3248,6 +3109,29 @@
3248
  "dataverse"
3249
  ]
3250
  },
3251
  {
3252
  "id": "candidate-s2-validation-of-the-pashto-version-of-the-premature-ejaculation-diagnostic-tool-pe",
3253
  "title": "Validation of the Pashto Version of the Premature Ejaculation Diagnostic Tool (PEDT)",
 
1
  {
2
+ "generated_on": "2026-02-19T16:58:23.791370+00:00",
3
  "sources": [
4
  "kaggle-datasets",
5
  "huggingface-datasets",
 
15
  "arxiv",
16
  "semantic-scholar"
17
  ],
18
+ "candidate_count": 132,
19
  "candidates": [
20
  {
21
  "id": "candidate-s2-a-comparison-of-pashto-and-turkmen-languages-vowel",
 
40
  "paper"
41
  ]
42
  },
43
+ {
44
+ "id": "candidate-s2-a-lexical-analysis-of-pashto-language",
45
+ "title": "A Lexical Analysis of Pashto Language",
46
+ "url": "https://www.semanticscholar.org/paper/6a1422eaca906a6657aa667b30dcb5575d25f8f8",
47
+ "category": "paper",
48
+ "source": "other",
49
+ "status": "candidate",
50
+ "summary": "Language changes over time. Apart from many other reasons, some words become dormant and remain no more in use. In this research, an attempt has been made to show language change in Pashto language. For this purpose, images of different cul",
51
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
52
+ "tasks": [],
53
+ "pashto_evidence": {
54
+ "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
55
+ "evidence_url": "https://www.semanticscholar.org/paper/6a1422eaca906a6657aa667b30dcb5575d25f8f8",
56
+ "markers": [
57
+ "pashto"
58
+ ]
59
+ },
60
+ "tags": [
61
+ "pashto",
62
+ "candidate",
63
+ "paper"
64
+ ]
65
+ },
66
  {
67
  "id": "candidate-openalex-a-new-etymological-vocabulary-of-pashto",
68
  "title": "A New Etymological Vocabulary of Pashto",
 
182
  "space"
183
  ]
184
  },
185
  {
186
  "id": "candidate-gh-project-amirajorloo-jira-auto-direction-chrome-extension",
187
  "title": "amirajorloo/jira-auto-direction-chrome-extension",
 
209
  "farsi"
210
  ]
211
  },
212
  {
213
  "id": "candidate-zenodo-paper-an-analysis-of-freudian-concept-of-mourning-in-pashto-tappas-on-the-theme-of-mig",
214
  "title": "AN ANALYSIS OF FREUDIAN CONCEPT OF MOURNING IN PASHTO TAPPAS ON THE THEME OF MIGRATION",
 
443
  "zenodo"
444
  ]
445
  },
446
+ {
447
+ "id": "candidate-s2-comprehensive-socio-phonetic-study-of-the-plosive-p-and-fricative-f-merger-among",
448
+ "title": "Comprehensive Socio-phonetic Study of the Plosive /p/ and Fricative /f/ Merger among Pashto Speakers in Khyber Pakhtunkhwa",
449
+ "url": "https://www.semanticscholar.org/paper/4f01f2250c897dc53099f76a2455471b480f22cf",
450
+ "category": "paper",
451
+ "source": "other",
452
+ "status": "candidate",
453
+ "summary": "Introduction: The phonological systems of a first language (L1) can fundamentally constrain the acquisition of a second language (L2), particularly in speech sound perception and production. In Pashto-English bilinguals, the absence of the",
454
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
455
+ "tasks": [],
456
+ "pashto_evidence": {
457
+ "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
458
+ "evidence_url": "https://www.semanticscholar.org/paper/4f01f2250c897dc53099f76a2455471b480f22cf",
459
+ "markers": [
460
+ "pashto"
461
+ ]
462
+ },
463
+ "tags": [
464
+ "pashto",
465
+ "candidate",
466
+ "paper"
467
+ ]
468
+ },
469
  {
470
  "id": "candidate-zenodo-paper-critical-study-of-the-travelogues-of-dr-altaf-yousafzai-in-the-context-of-thaila",
471
  "title": "Critical study of the travelogues of Dr Altaf Yousafzai (In The Context of \"Thailand kay Rang\", \"Nile kay Sang\" and \"Bakhal-e-Hinduwush Bakhsham\")",
 
514
  ]
515
  },
516
  {
517
+ "id": "candidate-s2-cultural-identity-and-pragmatic-competence-a-cross-cultural-analysis-of-punjabi-",
518
+ "title": "Cultural Identity and Pragmatic Competence: A Cross-Cultural Analysis of Punjabi and Pashto Learners of English in Pakistan",
519
+ "url": "https://www.semanticscholar.org/paper/85c80a8f97b12a1e3238126cbee321a219ff87e4",
520
  "category": "paper",
521
  "source": "other",
522
  "status": "candidate",
523
+ "summary": "This study examines the influence of cultural identity on pragmatic competence among Punjabi and Pashto speakers learning English in Pakistan. It explores how learners perform speech acts such as requests, refusals, and apologies, focusing",
524
  "primary_use": "Needs maintainer review before promotion to verified catalog.",
525
  "tasks": [],
526
  "pashto_evidence": {
527
  "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
528
+ "evidence_url": "https://www.semanticscholar.org/paper/85c80a8f97b12a1e3238126cbee321a219ff87e4",
529
  "markers": [
530
  "pashto"
531
  ]
 
632
  "crossref"
633
  ]
634
  },
635
  {
636
  "id": "candidate-datacite-project-early-pregnancy-loss-pashto",
637
  "title": "Early Pregnancy Loss [Pashto]",
638
+ "url": "https://zenodo.org/doi/10.5281/zenodo.18325728",
639
  "category": "project",
640
  "source": "datacite",
641
  "status": "candidate",
 
644
  "tasks": [],
645
  "pashto_evidence": {
646
  "evidence_text": "DataCite metadata includes Pashto markers in title or description.",
647
+ "evidence_url": "https://zenodo.org/doi/10.5281/zenodo.18325728",
648
  "markers": [
649
  "pashto"
650
  ]
 
871
  "zenodo"
872
  ]
873
  },
874
  {
875
  "id": "candidate-datacite-paper-fairness-evaluation-and-inference-level-mitigation-in-llms",
876
  "title": "Fairness Evaluation and Inference Level Mitigation in LLMs",
 
925
  {
926
  "id": "candidate-datacite-project-female-birth-control-part-i-pashto",
927
  "title": "Female Birth Control Part I [Pashto]",
928
+ "url": "https://zenodo.org/doi/10.5281/zenodo.18325041",
929
  "category": "project",
930
  "source": "datacite",
931
  "status": "candidate",
 
934
  "tasks": [],
935
  "pashto_evidence": {
936
  "evidence_text": "DataCite metadata includes Pashto markers in title or description.",
937
+ "evidence_url": "https://zenodo.org/doi/10.5281/zenodo.18325041",
938
  "markers": [
939
  "pashto"
940
  ]
 
1066
  ]
1067
  },
1068
  {
1069
+ "id": "candidate-s2-gender-classification-from-pashto-handwritten-text-images",
1070
+ "title": "Gender Classification From Pashto Handwritten Text Images",
1071
+ "url": "https://www.semanticscholar.org/paper/2d70fffa9224d71f67ad3c1943b8a71b18164eeb",
1072
  "category": "paper",
1073
  "source": "other",
1074
  "status": "candidate",
1075
+ "summary": "Computer vision (CV) is a subfield of computer science that enables machines to perceive, interpret, and understand visual data. It combines image processing, analysis, and machine learning to extract meaningful insights from images and vid",
1076
  "primary_use": "Needs maintainer review before promotion to verified catalog.",
1077
  "tasks": [],
1078
  "pashto_evidence": {
1079
  "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
1080
+ "evidence_url": "https://www.semanticscholar.org/paper/2d70fffa9224d71f67ad3c1943b8a71b18164eeb",
1081
  "markers": [
1082
  "pashto"
1083
  ]
 
1112
  "github"
1113
  ]
1114
  },
1115
+ {
1116
+ "id": "candidate-hf-project-haseeb-007-pashto-sekho",
1117
+ "title": "Haseeb-007/Pashto-sekho",
1118
+ "url": "https://huggingface.co/spaces/Haseeb-007/Pashto-sekho",
1119
+ "category": "project",
1120
+ "source": "huggingface",
1121
+ "status": "candidate",
1122
+ "summary": "Candidate project app returned from Hugging Face Spaces Pashto search.",
1123
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
1124
+ "tasks": [],
1125
+ "pashto_evidence": {
1126
+ "evidence_text": "Matched by Pashto keyword in Hugging Face Spaces search.",
1127
+ "evidence_url": "https://huggingface.co/spaces/Haseeb-007/Pashto-sekho",
1128
+ "markers": [
1129
+ "pashto"
1130
+ ]
1131
+ },
1132
+ "tags": [
1133
+ "pashto",
1134
+ "candidate",
1135
+ "project",
1136
+ "space"
1137
+ ]
1138
+ },
1139
  {
1140
  "id": "candidate-gh-project-haseebjanhamraz-pashtofonts",
1141
  "title": "haseebjanhamraz/PashtoFonts",
 
1210
  "paper"
1211
  ]
1212
  },
1213
  {
1214
  "id": "candidate-hf-project-ihanif-whisper-medium-pashto",
1215
  "title": "ihanif/whisper-medium-pashto",
 
1403
  "openalex"
1404
  ]
1405
  },
1406
  {
1407
  "id": "candidate-zenodo-paper-language-barrier-and-its-effect-on-learning-at-the-public-primary-school-level-i",
1408
  "title": "Language Barrier and its Effect on Learning at the Public Primary School Level in Lahore",
 
1427
  "zenodo"
1428
  ]
1429
  },
1430
  {
1431
  "id": "candidate-crossref-le-verbe-pashto",
1432
  "title": "Le verbe pashto",
 
1657
  "dataverse"
1658
  ]
1659
  },
1660
  {
1661
  "id": "candidate-gh-project-nabeelest-pakhtoodle",
1662
  "title": "nabeelest/pakhtoodle",
 
2073
  "kaggle"
2074
  ]
2075
  },
2076
+ {
2077
+ "id": "candidate-s2-pashto-poetry-attribution-using-deep-learning-techniques",
2078
+ "title": "Pashto Poetry Attribution using Deep Learning Techniques",
2079
+ "url": "https://www.semanticscholar.org/paper/e08cbd095d80dea85b91e31f4d6d81e96bc556a1",
2080
+ "category": "paper",
2081
+ "source": "other",
2082
+ "status": "candidate",
2083
+ "summary": "Pashto poetry, a rich tradition dating back to the 8th century, has been a cornerstone of cultural and literary heritage in the region. However, Pashto remains a low-resource language in computational linguistics, with limited annotated dat",
2084
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
2085
+ "tasks": [],
2086
+ "pashto_evidence": {
2087
+ "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
2088
+ "evidence_url": "https://www.semanticscholar.org/paper/e08cbd095d80dea85b91e31f4d6d81e96bc556a1",
2089
+ "markers": [
2090
+ "pashto"
2091
+ ]
2092
+ },
2093
+ "tags": [
2094
+ "pashto",
2095
+ "candidate",
2096
+ "paper"
2097
+ ]
2098
+ },
2099
+ {
2100
+ "id": "candidate-s2-pashto-preverbs-v",
2101
+ "title": "Pashto preverbs V",
2102
+ "url": "https://www.semanticscholar.org/paper/1f59f22ae99379106b417186f3053c00b5fe391f",
2103
+ "category": "paper",
2104
+ "source": "other",
2105
+ "status": "candidate",
2106
+ "summary": "Abstract This article deals with the perfective preverb wə́-. Pashto wə́- cannot be studied separately from aspectual oppositions: in fact, wə́- characterizes the “perfective” of simple verbs. Therefore, a quick review of aspect in Pashto w",
2107
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
2108
+ "tasks": [],
2109
+ "pashto_evidence": {
2110
+ "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
2111
+ "evidence_url": "https://www.semanticscholar.org/paper/1f59f22ae99379106b417186f3053c00b5fe391f",
2112
+ "markers": [
2113
+ "pashto"
2114
+ ]
2115
+ },
2116
+ "tags": [
2117
+ "pashto",
2118
+ "candidate",
2119
+ "paper"
2120
+ ]
2121
+ },
2122
  {
2123
  "id": "candidate-crossref-pashto-tappa",
2124
  "title": "Pashto Tappa",
 
2530
  "dataverse"
2531
  ]
2532
  },
2533
  {
2534
  "id": "candidate-openalex-scale-and-rotation-invariant-recognition-of-cursive-pashto-script-using-sift-fea",
2535
  "title": "Scale and rotation invariant recognition of cursive Pashto script using SIFT features",
 
2846
  "openalex"
2847
  ]
2848
  },
2849
  {
2850
  "id": "candidate-openalex-the-grammar-of-clitics-evidence-from-pashto-and-other-languages",
2851
  "title": "The grammar of clitics : evidence from Pashto and other languages",
 
2942
  "datacite"
2943
  ]
2944
  },
2945
+ {
2946
+ "id": "candidate-s2-the-roshani-movement-literary-services-and-the-contribution-of-this-movement-in-",
2947
+ "title": "The Roshani Movement literary services and the contribution of this Movement in the development of Pashto Literature",
2948
+ "url": "https://www.semanticscholar.org/paper/88a3cd1ec497844c5997ae1795f8e72bbb314112",
2949
+ "category": "paper",
2950
+ "source": "other",
2951
+ "status": "candidate",
2952
+ "summary": "Literature is the mirror of society. The purpose of this article was to review the achievements and literary services of the Roshani Movement, in order to use their positive points in the development of Pashto language and literature. The r",
2953
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
2954
+ "tasks": [],
2955
+ "pashto_evidence": {
2956
+ "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
2957
+ "evidence_url": "https://www.semanticscholar.org/paper/88a3cd1ec497844c5997ae1795f8e72bbb314112",
2958
+ "markers": [
2959
+ "pashto"
2960
+ ]
2961
+ },
2962
+ "tags": [
2963
+ "pashto",
2964
+ "candidate",
2965
+ "paper"
2966
+ ]
2967
+ },
2968
  {
2969
  "id": "candidate-crossref-topicalization-in-pashto",
2970
  "title": "Topicalization in Pashto",
 
3109
  "dataverse"
3110
  ]
3111
  },
3112
+ {
3113
+ "id": "candidate-s2-transformer-based-title-generation-for-pashto-texts-using-topic-modeling-techniq",
3114
+ "title": "Transformer-Based Title Generation for Pashto Texts Using Topic Modeling Techniques",
3115
+ "url": "https://www.semanticscholar.org/paper/1b675f8b6e3683677cfa3bf3fec40ae9649d8d8c",
3116
+ "category": "paper",
3117
+ "source": "other",
3118
+ "status": "candidate",
3119
+ "summary": "This study proposes a transformer-assisted framework for automatic title generation in Pashto, a low-resource language with limited NLP resources. The approach combines topic modeling techniques with a generative transformer to produce cohe",
3120
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
3121
+ "tasks": [],
3122
+ "pashto_evidence": {
3123
+ "evidence_text": "Matched by explicit Pashto marker in paper title from Semantic Scholar search.",
3124
+ "evidence_url": "https://www.semanticscholar.org/paper/1b675f8b6e3683677cfa3bf3fec40ae9649d8d8c",
3125
+ "markers": [
3126
+ "pashto"
3127
+ ]
3128
+ },
3129
+ "tags": [
3130
+ "pashto",
3131
+ "candidate",
3132
+ "paper"
3133
+ ]
3134
+ },
3135
  {
3136
  "id": "candidate-s2-validation-of-the-pashto-version-of-the-premature-ejaculation-diagnostic-tool-pe",
3137
  "title": "Validation of the Pashto Version of the Premature Ejaculation Diagnostic Tool (PEDT)",
resources/catalog/resources.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "version": "1.0.1",
3
- "updated_on": "2026-02-18",
4
  "resources": [
5
  {
6
  "id": "dataset-common-voice-ps-v24",
@@ -2742,6 +2742,265 @@
2742
  "nlp",
2743
  "demo"
2744
  ]
2745
  }
2746
  ]
2747
  }
 
1
  {
2
  "version": "1.0.1",
3
+ "updated_on": "2026-02-20",
4
  "resources": [
5
  {
6
  "id": "dataset-common-voice-ps-v24",
 
2742
  "nlp",
2743
  "demo"
2744
  ]
2745
+ },
2746
+ {
2747
+ "id": "candidate-hf-dataset-aamirhs-pashto",
2748
+ "title": "aamirhs/pashto",
2749
+ "url": "https://huggingface.co/datasets/aamirhs/pashto",
2750
+ "category": "dataset",
2751
+ "source": "huggingface",
2752
+ "status": "verified",
2753
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2754
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2755
+ "tasks": [],
2756
+ "pashto_evidence": {
2757
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2758
+ "evidence_url": "https://huggingface.co/datasets/aamirhs/pashto",
2759
+ "markers": [
2760
+ "pashto"
2761
+ ]
2762
+ },
2763
+ "tags": [
2764
+ "pashto",
2765
+ "candidate",
2766
+ "dataset"
2767
+ ]
2768
+ },
2769
+ {
2770
+ "id": "candidate-hf-dataset-arsalagrey-pashto",
2771
+ "title": "arsalagrey/pashto",
2772
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto",
2773
+ "category": "dataset",
2774
+ "source": "huggingface",
2775
+ "status": "verified",
2776
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2777
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2778
+ "tasks": [],
2779
+ "pashto_evidence": {
2780
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2781
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto",
2782
+ "markers": [
2783
+ "pashto"
2784
+ ]
2785
+ },
2786
+ "tags": [
2787
+ "pashto",
2788
+ "candidate",
2789
+ "dataset"
2790
+ ]
2791
+ },
2792
+ {
2793
+ "id": "candidate-hf-dataset-arsalagrey-pashto-books",
2794
+ "title": "arsalagrey/pashto-books",
2795
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto-books",
2796
+ "category": "dataset",
2797
+ "source": "huggingface",
2798
+ "status": "verified",
2799
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2800
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2801
+ "tasks": [],
2802
+ "pashto_evidence": {
2803
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2804
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto-books",
2805
+ "markers": [
2806
+ "pashto"
2807
+ ]
2808
+ },
2809
+ "tags": [
2810
+ "pashto",
2811
+ "candidate",
2812
+ "dataset"
2813
+ ]
2814
+ },
2815
+ {
2816
+ "id": "candidate-hf-dataset-arsalagrey-pashto-books-json",
2817
+ "title": "arsalagrey/pashto-books-json",
2818
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto-books-json",
2819
+ "category": "dataset",
2820
+ "source": "huggingface",
2821
+ "status": "verified",
2822
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
2823
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2824
+ "tasks": [],
2825
+ "pashto_evidence": {
2826
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2827
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto-books-json",
2828
+ "markers": [
2829
+ "pashto"
2830
+ ]
2831
+ },
2832
+ "tags": [
2833
+ "pashto",
2834
+ "candidate",
2835
+ "dataset"
2836
+ ]
2837
+ },
2838
+ {
2839
+ "id": "candidate-hf-model-jawaria-wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2840
+ "title": "Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2841
+ "url": "https://huggingface.co/Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2842
+ "category": "model",
2843
+ "source": "huggingface",
2844
+ "status": "verified",
2845
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
2846
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2847
+ "tasks": [],
2848
+ "pashto_evidence": {
2849
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
2850
+ "evidence_url": "https://huggingface.co/Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1",
2851
+ "markers": [
2852
+ "pashto"
2853
+ ]
2854
+ },
2855
+ "tags": [
2856
+ "pashto",
2857
+ "candidate",
2858
+ "model"
2859
+ ]
2860
+ },
2861
+ {
2862
+ "id": "candidate-zenodo-dataset-oped-open-pashto-english-dictionary-preliminary-version-30-october-2025",
2863
+ "title": "OPED (Open Pashto-English Dictionary): Preliminary version, 30 October 2025",
2864
+ "url": "https://zenodo.org/records/17487678",
2865
+ "category": "dataset",
2866
+ "source": "zenodo",
2867
+ "status": "verified",
2868
+ "summary": "Candidate resource returned from Zenodo search for Pashto.",
2869
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2870
+ "tasks": [],
2871
+ "pashto_evidence": {
2872
+ "evidence_text": "Zenodo metadata includes Pashto markers in title or description.",
2873
+ "evidence_url": "https://zenodo.org/records/17487678",
2874
+ "markers": [
2875
+ "pashto"
2876
+ ]
2877
+ },
2878
+ "tags": [
2879
+ "pashto",
2880
+ "candidate",
2881
+ "dataset",
2882
+ "zenodo"
2883
+ ]
2884
+ },
2885
+ {
2886
+ "id": "candidate-kaggle-dataset-abdulbasitkh-pashto-isolated-alphabets-and-numerals",
2887
+ "title": "Pashto Isolated Alphabets and Numerals",
2888
+ "url": "https://www.kaggle.com/datasets/abdulbasitkh/pashto-isolated-alphabetss-and-numerals",
2889
+ "category": "dataset",
2890
+ "source": "kaggle",
2891
+ "status": "verified",
2892
+ "summary": "Pashto Isolated Alphabets and Numerals Handwritten and Printed",
2893
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2894
+ "tasks": [],
2895
+ "pashto_evidence": {
2896
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2897
+ "evidence_url": "https://www.kaggle.com/datasets/abdulbasitkh/pashto-isolated-alphabetss-and-numerals",
2898
+ "markers": [
2899
+ "Pashto"
2900
+ ]
2901
+ },
2902
+ "tags": [
2903
+ "pashto",
2904
+ "candidate",
2905
+ "dataset",
2906
+ "kaggle"
2907
+ ]
2908
+ },
2909
+ {
2910
+ "id": "candidate-kaggle-dataset-alimuhammadasad-pashto-poetry",
2911
+ "title": "Pashto Poetry",
2912
+ "url": "https://www.kaggle.com/datasets/alimuhammadasad/pashto-poetry",
2913
+ "category": "dataset",
2914
+ "source": "kaggle",
2915
+ "status": "verified",
2916
+ "summary": "Candidate Kaggle dataset returned from Pashto search.",
2917
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2918
+ "tasks": [],
2919
+ "pashto_evidence": {
2920
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2921
+ "evidence_url": "https://www.kaggle.com/datasets/alimuhammadasad/pashto-poetry",
2922
+ "markers": [
2923
+ "Pashto"
2924
+ ]
2925
+ },
2926
+ "tags": [
2927
+ "pashto",
2928
+ "candidate",
2929
+ "dataset",
2930
+ "kaggle"
2931
+ ]
2932
+ },
2933
+ {
2934
+ "id": "candidate-kaggle-dataset-mahibullahmudaser-pashto-text-characters-sample",
2935
+ "title": "Pashto text characters sample",
2936
+ "url": "https://www.kaggle.com/datasets/mahibullahmudaser/pashtochracterssample",
2937
+ "category": "dataset",
2938
+ "source": "kaggle",
2939
+ "status": "verified",
2940
+ "summary": "Pashto text characters sample",
2941
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2942
+ "tasks": [],
2943
+ "pashto_evidence": {
2944
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2945
+ "evidence_url": "https://www.kaggle.com/datasets/mahibullahmudaser/pashtochracterssample",
2946
+ "markers": [
2947
+ "Pashto"
2948
+ ]
2949
+ },
2950
+ "tags": [
2951
+ "pashto",
2952
+ "candidate",
2953
+ "dataset",
2954
+ "kaggle"
2955
+ ]
2956
+ },
2957
+ {
2958
+ "id": "candidate-kaggle-dataset-ahmadferozafshar-pashto-language-alphabets",
2959
+ "title": "pashto_language_alphabets",
2960
+ "url": "https://www.kaggle.com/datasets/ahmadferozafshar/pashto-language-alphabets",
2961
+ "category": "dataset",
2962
+ "source": "kaggle",
2963
+ "status": "verified",
2964
+ "summary": "Candidate Kaggle dataset returned from Pashto search.",
2965
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2966
+ "tasks": [],
2967
+ "pashto_evidence": {
2968
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2969
+ "evidence_url": "https://www.kaggle.com/datasets/ahmadferozafshar/pashto-language-alphabets",
2970
+ "markers": [
2971
+ "Pashto"
2972
+ ]
2973
+ },
2974
+ "tags": [
2975
+ "pashto",
2976
+ "candidate",
2977
+ "dataset",
2978
+ "kaggle"
2979
+ ]
2980
+ },
2981
+ {
2982
+ "id": "candidate-kaggle-dataset-aimalrezvan-pashto-language-characters",
2983
+ "title": "Pashto_language_characters",
2984
+ "url": "https://www.kaggle.com/datasets/aimalrezvan/pashto-language-characters",
2985
+ "category": "dataset",
2986
+ "source": "kaggle",
2987
+ "status": "verified",
2988
+ "summary": "Pashto_language_characters are Pashto language full and semi characters.",
2989
+ "primary_use": "Automated discovery entry for Pashto resource tracking.",
2990
+ "tasks": [],
2991
+ "pashto_evidence": {
2992
+ "evidence_text": "Kaggle dataset title/subtitle includes Pashto keyword.",
2993
+ "evidence_url": "https://www.kaggle.com/datasets/aimalrezvan/pashto-language-characters",
2994
+ "markers": [
2995
+ "Pashto"
2996
+ ]
2997
+ },
2998
+ "tags": [
2999
+ "pashto",
3000
+ "candidate",
3001
+ "dataset",
3002
+ "kaggle"
3003
+ ]
3004
  }
3005
  ]
3006
  }
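Reviewer note: the promoted entries in this hunk carry `"status": "verified"` but keep the `candidate` tag. The commit does not say whether that is intended. If it is not, a check along the following lines could list the affected entries. This is a minimal sketch: `stale_candidate_tags` is a hypothetical helper, not part of the repository, and it assumes only the top-level `resources` list that `promote_candidates.py` reads.

```python
from __future__ import annotations

from typing import Any


def stale_candidate_tags(catalog: dict[str, Any]) -> list[str]:
    """Return ids of verified entries that are still tagged 'candidate'."""
    flagged: list[str] = []
    for resource in catalog.get("resources", []):
        if not isinstance(resource, dict):
            continue
        tags = resource.get("tags", [])
        # A verified entry that still carries the discovery tag is flagged.
        if resource.get("status") == "verified" and "candidate" in tags:
            flagged.append(str(resource.get("id")))
    return flagged


# Hypothetical sample data shaped like the entries in this diff.
sample = {
    "resources": [
        {"id": "a", "status": "verified", "tags": ["pashto", "candidate"]},
        {"id": "b", "status": "verified", "tags": ["pashto", "dataset"]},
    ]
}
flagged_ids = stale_candidate_tags(sample)
```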
resources/datasets/README.md CHANGED
@@ -5,10 +5,14 @@
5
  | Resource | Link | Pashto Evidence | Primary Use |
6
  |---|---|---|---|
7
  | 99 Hours Pashto Spontaneous Dialogue Smartphone Speech Dataset | [huggingface](https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset) | [Dataset title explicitly includes Pashto and API metadata marks audio and text modalities. (`Pashto`)](https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset) | Spontaneous speech ASR training and robustness evaluation |
 
8
  | aamirhs/pashto-audio-wav2vec | [huggingface](https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec) | Pashto ASR data exploration and baseline training |
9
  | adnankhan769/proper_dataset_english_2_pashto | [huggingface](https://huggingface.co/datasets/adnankhan769/proper_dataset_english_2_pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/adnankhan769/proper_dataset_english_2_pashto) | Machine translation and bilingual corpus development |
10
  | AliMuhammad73/Pashto-Poetry | [huggingface](https://huggingface.co/datasets/AliMuhammad73/Pashto-Poetry) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/AliMuhammad73/Pashto-Poetry) | Pashto poetry corpus for language modeling and text analysis |
11
  | alpaca-pashto-cleaned | [huggingface](https://huggingface.co/datasets/saillab/alpaca-pashto-cleaned) | [Dataset metadata includes language:ps and dataset name includes Pashto. (`ps`, `Pashto`)](https://huggingface.co/api/datasets/saillab/alpaca-pashto-cleaned) | Pashto instruction tuning and conversational NLP experiments |
 
 
 
12
  | Belebele | [huggingface](https://huggingface.co/datasets/facebook/belebele) | [Dataset includes pbt_Arab subset. (`pbt_Arab`)](https://huggingface.co/datasets/facebook/belebele) | Comprehension and multilingual NLP benchmark |
13
  | Common Voice 24.0: Pashto Speech Dataset | [kaggle](https://www.kaggle.com/datasets/ataullahaali/common-voice-scripted-speech-24-0-pashto) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/ataullahaali/common-voice-scripted-speech-24-0-pashto) | ASR training and evaluation data source |
14
  | Common Voice Scripted Speech 24.0 - Pashto | [mozilla](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | [Official dataset page is for Pashto. (`Pashto`)](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | ASR training and evaluation |
@@ -26,13 +30,19 @@
26
  | Katib's Pashto Text Imagebase (KPTI) | [kaggle](https://www.kaggle.com/datasets/hassanamin/katibs-pashto-text-imagebase-kpti) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/hassanamin/katibs-pashto-text-imagebase-kpti) | OCR training and evaluation data source |
27
  | koochikoo25/Pashto-Concatenated | [huggingface](https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated) | [Dataset title explicitly states Pashto and card metadata exposes audio-text features and splits. (`Pashto`, `audio`, `transcription`)](https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated) | ASR dataset preparation and split-based benchmark experiments |
28
  | oowais/pushto-text-to-speech-dataset | [huggingface](https://huggingface.co/datasets/oowais/pushto-text-to-speech-dataset) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/oowais/pushto-text-to-speech-dataset) | ASR training and evaluation data source |
 
29
  | OPUS-100 | [huggingface](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes en-ps split. (`en-ps`)](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Machine translation training and evaluation |
30
  | OSCAR Corpus | [huggingface](https://huggingface.co/datasets/oscar-corpus/oscar) | [Dataset includes unshuffled_deduplicated_ps split. (`unshuffled_deduplicated_ps`)](https://huggingface.co/datasets/oscar-corpus/oscar) | Language modeling and lexicon expansion |
31
  | Pashto English Bilingual Sentiment Corpus | [kaggle](https://www.kaggle.com/datasets/farhadkhan66/pashto-translated-corpus) | [Kaggle dataset title and description identify the corpus as Pashto-English sentiment data. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/farhadkhan66/pashto-translated-corpus) | Sentiment analysis and bilingual NLP experiments |
 
32
  | Pashto Isolated Words Speech Dataset | [kaggle](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | [Dataset title explicitly states Pashto speech dataset. (`Pashto`)](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Keyword spotting and constrained ASR experiments |
33
  | Pashto OCR | [kaggle](https://www.kaggle.com/datasets/hassanamin/pashto-ocr) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/hassanamin/pashto-ocr) | OCR training and evaluation data source |
 
 
34
  | Pashto Wikipedia Corpus | [huggingface](https://huggingface.co/datasets/ihanif/pashto-wikipedia-corpus) | [Dataset metadata includes language:ps and the title specifies Pashto corpus. (`ps`, `Pashto`)](https://huggingface.co/datasets/ihanif/pashto-wikipedia-corpus) | Pashto text corpus for NLP baselines |
35
  | Pashto Word Embeddings | [kaggle](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | [Dataset description states pretrained Pashto embeddings. (`Pashto`)](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Lexical semantics and lightweight NLP baselines |
 
 
36
  | PashtoOCR (Kaggle) | [kaggle](https://www.kaggle.com/datasets/drijaz/pashtoocr) | [Kaggle dataset title and subtitle explicitly identify a Pashto OCR dataset. (`Pashto`, `OCR`)](https://www.kaggle.com/api/v1/datasets/view/drijaz/pashtoocr) | Pashto OCR dataset benchmarking and training |
37
  | POLD - Pashto Offensive Language Dataset | [kaggle](https://www.kaggle.com/datasets/drijaz/pold-pashto-offensive-language-dataset) | [Kaggle title and description explicitly state Pashto offensive language benchmark dataset. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/drijaz/pold-pashto-offensive-language-dataset) | Pashto toxicity and moderation NLP benchmarks |
38
  | saillab/alpaca_pashto_taco | [huggingface](https://huggingface.co/datasets/saillab/alpaca_pashto_taco) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/saillab/alpaca_pashto_taco) | Instruction tuning and LLM adaptation data source |
 
5
  | Resource | Link | Pashto Evidence | Primary Use |
6
  |---|---|---|---|
7
  | 99 Hours Pashto Spontaneous Dialogue Smartphone Speech Dataset | [huggingface](https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset) | [Dataset title explicitly includes Pashto and API metadata marks audio and text modalities. (`Pashto`)](https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset) | Spontaneous speech ASR training and robustness evaluation |
8
+ | aamirhs/pashto | [huggingface](https://huggingface.co/datasets/aamirhs/pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/aamirhs/pashto) | Automated discovery entry for Pashto resource tracking. |
9
  | aamirhs/pashto-audio-wav2vec | [huggingface](https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec) | Pashto ASR data exploration and baseline training |
10
  | adnankhan769/proper_dataset_english_2_pashto | [huggingface](https://huggingface.co/datasets/adnankhan769/proper_dataset_english_2_pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/adnankhan769/proper_dataset_english_2_pashto) | Machine translation and bilingual corpus development |
11
  | AliMuhammad73/Pashto-Poetry | [huggingface](https://huggingface.co/datasets/AliMuhammad73/Pashto-Poetry) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/AliMuhammad73/Pashto-Poetry) | Pashto poetry corpus for language modeling and text analysis |
12
  | alpaca-pashto-cleaned | [huggingface](https://huggingface.co/datasets/saillab/alpaca-pashto-cleaned) | [Dataset metadata includes language:ps and dataset name includes Pashto. (`ps`, `Pashto`)](https://huggingface.co/api/datasets/saillab/alpaca-pashto-cleaned) | Pashto instruction tuning and conversational NLP experiments |
13
+ | arsalagrey/pashto | [huggingface](https://huggingface.co/datasets/arsalagrey/pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/arsalagrey/pashto) | Automated discovery entry for Pashto resource tracking. |
14
+ | arsalagrey/pashto-books | [huggingface](https://huggingface.co/datasets/arsalagrey/pashto-books) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/arsalagrey/pashto-books) | Automated discovery entry for Pashto resource tracking. |
15
+ | arsalagrey/pashto-books-json | [huggingface](https://huggingface.co/datasets/arsalagrey/pashto-books-json) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/arsalagrey/pashto-books-json) | Automated discovery entry for Pashto resource tracking. |
16
  | Belebele | [huggingface](https://huggingface.co/datasets/facebook/belebele) | [Dataset includes pbt_Arab subset. (`pbt_Arab`)](https://huggingface.co/datasets/facebook/belebele) | Comprehension and multilingual NLP benchmark |
17
  | Common Voice 24.0: Pashto Speech Dataset | [kaggle](https://www.kaggle.com/datasets/ataullahaali/common-voice-scripted-speech-24-0-pashto) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/ataullahaali/common-voice-scripted-speech-24-0-pashto) | ASR training and evaluation data source |
18
  | Common Voice Scripted Speech 24.0 - Pashto | [mozilla](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | [Official dataset page is for Pashto. (`Pashto`)](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | ASR training and evaluation |
 
30
  | Katib's Pashto Text Imagebase (KPTI) | [kaggle](https://www.kaggle.com/datasets/hassanamin/katibs-pashto-text-imagebase-kpti) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/hassanamin/katibs-pashto-text-imagebase-kpti) | OCR training and evaluation data source |
31
  | koochikoo25/Pashto-Concatenated | [huggingface](https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated) | [Dataset title explicitly states Pashto and card metadata exposes audio-text features and splits. (`Pashto`, `audio`, `transcription`)](https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated) | ASR dataset preparation and split-based benchmark experiments |
32
  | oowais/pushto-text-to-speech-dataset | [huggingface](https://huggingface.co/datasets/oowais/pushto-text-to-speech-dataset) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/oowais/pushto-text-to-speech-dataset) | ASR training and evaluation data source |
33
+ | OPED (Open Pashto-English Dictionary): Preliminary version, 30 October 2025 | [zenodo](https://zenodo.org/records/17487678) | [Zenodo metadata includes Pashto markers in title or description. (`pashto`)](https://zenodo.org/records/17487678) | Automated discovery entry for Pashto resource tracking. |
34
  | OPUS-100 | [huggingface](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes en-ps split. (`en-ps`)](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Machine translation training and evaluation |
35
  | OSCAR Corpus | [huggingface](https://huggingface.co/datasets/oscar-corpus/oscar) | [Dataset includes unshuffled_deduplicated_ps split. (`unshuffled_deduplicated_ps`)](https://huggingface.co/datasets/oscar-corpus/oscar) | Language modeling and lexicon expansion |
36
  | Pashto English Bilingual Sentiment Corpus | [kaggle](https://www.kaggle.com/datasets/farhadkhan66/pashto-translated-corpus) | [Kaggle dataset title and description identify the corpus as Pashto-English sentiment data. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/farhadkhan66/pashto-translated-corpus) | Sentiment analysis and bilingual NLP experiments |
37
+ | Pashto Isolated Alphabets and Numerals | [kaggle](https://www.kaggle.com/datasets/abdulbasitkh/pashto-isolated-alphabetss-and-numerals) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/abdulbasitkh/pashto-isolated-alphabetss-and-numerals) | Automated discovery entry for Pashto resource tracking. |
38
  | Pashto Isolated Words Speech Dataset | [kaggle](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | [Dataset title explicitly states Pashto speech dataset. (`Pashto`)](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Keyword spotting and constrained ASR experiments |
39
  | Pashto OCR | [kaggle](https://www.kaggle.com/datasets/hassanamin/pashto-ocr) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/hassanamin/pashto-ocr) | OCR training and evaluation data source |
40
+ | Pashto Poetry | [kaggle](https://www.kaggle.com/datasets/alimuhammadasad/pashto-poetry) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/alimuhammadasad/pashto-poetry) | Automated discovery entry for Pashto resource tracking. |
41
+ | Pashto text characters sample | [kaggle](https://www.kaggle.com/datasets/mahibullahmudaser/pashtochracterssample) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/mahibullahmudaser/pashtochracterssample) | Automated discovery entry for Pashto resource tracking. |
42
  | Pashto Wikipedia Corpus | [huggingface](https://huggingface.co/datasets/ihanif/pashto-wikipedia-corpus) | [Dataset metadata includes language:ps and the title specifies Pashto corpus. (`ps`, `Pashto`)](https://huggingface.co/datasets/ihanif/pashto-wikipedia-corpus) | Pashto text corpus for NLP baselines |
43
  | Pashto Word Embeddings | [kaggle](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | [Dataset description states pretrained Pashto embeddings. (`Pashto`)](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Lexical semantics and lightweight NLP baselines |
44
+ | pashto_language_alphabets | [kaggle](https://www.kaggle.com/datasets/ahmadferozafshar/pashto-language-alphabets) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/ahmadferozafshar/pashto-language-alphabets) | Automated discovery entry for Pashto resource tracking. |
45
+ | Pashto_language_characters | [kaggle](https://www.kaggle.com/datasets/aimalrezvan/pashto-language-characters) | [Kaggle dataset title/subtitle includes Pashto keyword. (`Pashto`)](https://www.kaggle.com/datasets/aimalrezvan/pashto-language-characters) | Automated discovery entry for Pashto resource tracking. |
46
  | PashtoOCR (Kaggle) | [kaggle](https://www.kaggle.com/datasets/drijaz/pashtoocr) | [Kaggle dataset title and subtitle explicitly identify a Pashto OCR dataset. (`Pashto`, `OCR`)](https://www.kaggle.com/api/v1/datasets/view/drijaz/pashtoocr) | Pashto OCR dataset benchmarking and training |
47
  | POLD - Pashto Offensive Language Dataset | [kaggle](https://www.kaggle.com/datasets/drijaz/pold-pashto-offensive-language-dataset) | [Kaggle title and description explicitly state Pashto offensive language benchmark dataset. (`Pashto`)](https://www.kaggle.com/api/v1/datasets/view/drijaz/pold-pashto-offensive-language-dataset) | Pashto toxicity and moderation NLP benchmarks |
48
  | saillab/alpaca_pashto_taco | [huggingface](https://huggingface.co/datasets/saillab/alpaca_pashto_taco) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/datasets/saillab/alpaca_pashto_taco) | Instruction tuning and LLM adaptation data source |
resources/models/README.md CHANGED
@@ -14,6 +14,7 @@
14
  | ihanif/xls-r-1b-pashto | [huggingface](https://huggingface.co/ihanif/xls-r-1b-pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/ihanif/xls-r-1b-pashto) | Pashto ASR baseline and model comparison |
15
  | ijazulhaq/bert-base-pashto | [huggingface](https://huggingface.co/ijazulhaq/bert-base-pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/ijazulhaq/bert-base-pashto) | Pashto model baseline for downstream NLP tasks |
16
  | ijazulhaq/bert-base-pashto-v1 | [huggingface](https://huggingface.co/ijazulhaq/bert-base-pashto-v1) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/ijazulhaq/bert-base-pashto-v1) | Pashto model baseline for downstream NLP tasks |
 
17
  | koochikoo25/pashto-whisper-large | [huggingface](https://huggingface.co/koochikoo25/pashto-whisper-large) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/koochikoo25/pashto-whisper-large) | Pashto ASR baseline and model comparison |
18
  | koochikoo25/Whisper-medium-pashto | [huggingface](https://huggingface.co/koochikoo25/Whisper-medium-pashto) | [Model tags include ps and automatic-speech-recognition with a Pashto model name. (`ps`, `automatic-speech-recognition`, `pashto`)](https://huggingface.co/koochikoo25/Whisper-medium-pashto) | Pashto ASR baseline modeling and transcription comparison |
19
  | PashtoBERT | [huggingface](https://huggingface.co/mdarhri/pashto-bert) | [Model card states training on Pashto corpus data. (`Pashto`)](https://huggingface.co/mdarhri/pashto-bert) | Pashto NLP baseline encoder |
 
14
  | ihanif/xls-r-1b-pashto | [huggingface](https://huggingface.co/ihanif/xls-r-1b-pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/ihanif/xls-r-1b-pashto) | Pashto ASR baseline and model comparison |
15
  | ijazulhaq/bert-base-pashto | [huggingface](https://huggingface.co/ijazulhaq/bert-base-pashto) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/ijazulhaq/bert-base-pashto) | Pashto model baseline for downstream NLP tasks |
16
  | ijazulhaq/bert-base-pashto-v1 | [huggingface](https://huggingface.co/ijazulhaq/bert-base-pashto-v1) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/ijazulhaq/bert-base-pashto-v1) | Pashto model baseline for downstream NLP tasks |
17
+ | Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1 | [huggingface](https://huggingface.co/Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/Jawaria/wav2vec2-large-xls-r-300m-pashto-colab-final-1) | Automated discovery entry for Pashto resource tracking. |
18
  | koochikoo25/pashto-whisper-large | [huggingface](https://huggingface.co/koochikoo25/pashto-whisper-large) | [Matched by Pashto keyword in Hugging Face search results. (`pashto`)](https://huggingface.co/koochikoo25/pashto-whisper-large) | Pashto ASR baseline and model comparison |
19
  | koochikoo25/Whisper-medium-pashto | [huggingface](https://huggingface.co/koochikoo25/Whisper-medium-pashto) | [Model tags include ps and automatic-speech-recognition with a Pashto model name. (`ps`, `automatic-speech-recognition`, `pashto`)](https://huggingface.co/koochikoo25/Whisper-medium-pashto) | Pashto ASR baseline modeling and transcription comparison |
20
  | PashtoBERT | [huggingface](https://huggingface.co/mdarhri/pashto-bert) | [Model card states training on Pashto corpus data. (`Pashto`)](https://huggingface.co/mdarhri/pashto-bert) | Pashto NLP baseline encoder |
scripts/README.md CHANGED
@@ -8,6 +8,7 @@ Automation scripts for quality checks, resource catalog validation, and search i
8
  - `validate_resource_catalog.py`: validate `resources/catalog/resources.json`.
9
  - `generate_resource_views.py`: generate `resources/*/README.md`, `resources/README.md`, and `docs/search/resources.json` from the catalog.
10
  - `sync_resources.py`: collect new candidate Pashto resources from Kaggle, Hugging Face (datasets/models/spaces), GitHub, GitLab, OpenAlex, Crossref, Zenodo, Dataverse, DataCite, arXiv, and Semantic Scholar into `resources/catalog/pending_candidates.json`.
 
11
  - `run_resource_cycle.py`: run the full repeatable resource cycle with one command.
12
 
13
  ## Usage
@@ -32,6 +33,11 @@ Sync candidate resources for maintainer review:
32
  python scripts/sync_resources.py --limit 20
33
  ```
34
 
 
 
 
 
 
35
  Run full repeatable cycle:
36
  ```bash
37
  python scripts/run_resource_cycle.py --limit 25
 
8
  - `validate_resource_catalog.py`: validate `resources/catalog/resources.json`.
9
  - `generate_resource_views.py`: generate `resources/*/README.md`, `resources/README.md`, and `docs/search/resources.json` from the catalog.
10
  - `sync_resources.py`: collect new candidate Pashto resources from Kaggle, Hugging Face (datasets/models/spaces), GitHub, GitLab, OpenAlex, Crossref, Zenodo, Dataverse, DataCite, arXiv, and Semantic Scholar into `resources/catalog/pending_candidates.json`.
11
+ - `promote_candidates.py`: auto-promote valid non-duplicate entries from `pending_candidates.json` into `resources/catalog/resources.json`.
12
  - `run_resource_cycle.py`: run the full repeatable resource cycle with one command.
13
 
14
  ## Usage
 
33
  python scripts/sync_resources.py --limit 20
34
  ```
35
 
36
+ Auto-promote valid candidates into verified catalog:
37
+ ```bash
38
+ python scripts/promote_candidates.py
39
+ ```
40
+
41
  Run full repeatable cycle:
42
  ```bash
43
  python scripts/run_resource_cycle.py --limit 25
scripts/promote_candidates.py ADDED
@@ -0,0 +1,155 @@
1
+ """Promote valid candidate resources into the verified catalog.
2
+
3
+ Usage:
4
+ python scripts/promote_candidates.py
5
+ python scripts/promote_candidates.py --max-promotions 10
6
+ """
7
+
8
+ from __future__ import annotations
9
+
10
+ import argparse
11
+ import json
12
+ from datetime import date
13
+ from pathlib import Path
14
+ from typing import Any
15
+
16
+ try:
17
+ from scripts.validate_resource_catalog import validate_resource
18
+ except ModuleNotFoundError:
19
+ from validate_resource_catalog import validate_resource
20
+
21
+
22
+ PLACEHOLDER_PRIMARY_USE = "Needs maintainer review before promotion to verified catalog."
23
+
24
+
25
+ def _canonical_url(value: str) -> str:
26
+ return value.rstrip("/")
27
+
28
+
29
+ def _normalized_tasks(value: Any) -> list[str]:
30
+ if isinstance(value, list):
31
+ return [item for item in value if isinstance(item, str) and item.strip()]
32
+ return []
33
+
34
+
35
+ def _prepare_candidate(candidate: dict[str, Any]) -> dict[str, Any]:
36
+ promoted = dict(candidate)
37
+ promoted["status"] = "verified"
38
+ promoted["tasks"] = _normalized_tasks(promoted.get("tasks"))
39
+
40
+ primary_use = str(promoted.get("primary_use", "")).strip()
41
+ if primary_use == PLACEHOLDER_PRIMARY_USE:
42
+ promoted["primary_use"] = "Automated discovery entry for Pashto resource tracking."
43
+ return promoted
44
+
45
+
46
+ def promote_candidates(
47
+ catalog: dict[str, Any],
48
+ pending_payload: dict[str, Any],
49
+ *,
50
+ max_promotions: int | None = None,
51
+ ) -> tuple[list[dict[str, Any]], dict[str, int]]:
52
+ resources = catalog.get("resources")
53
+ if not isinstance(resources, list):
54
+ raise ValueError("catalog.resources must be a list")
55
+
56
+ candidates = pending_payload.get("candidates", [])
57
+ if not isinstance(candidates, list):
58
+ raise ValueError("pending candidates payload must include a 'candidates' list")
59
+
60
+ seen_ids = {
61
+        resource.get("id")
+        for resource in resources
+        if isinstance(resource, dict) and isinstance(resource.get("id"), str)
+    }
+    seen_urls = {
+        _canonical_url(resource.get("url", ""))
+        for resource in resources
+        if isinstance(resource, dict) and isinstance(resource.get("url"), str)
+    }
+
+    promoted: list[dict[str, Any]] = []
+    stats = {"total": len(candidates), "promoted": 0, "duplicate": 0, "invalid": 0}
+
+    for candidate in candidates:
+        if max_promotions is not None and len(promoted) >= max_promotions:
+            break
+        if not isinstance(candidate, dict):
+            stats["invalid"] += 1
+            continue
+
+        resource = _prepare_candidate(candidate)
+        rid = resource.get("id")
+        url = resource.get("url")
+        if not isinstance(rid, str) or not isinstance(url, str):
+            stats["invalid"] += 1
+            continue
+
+        canonical_url = _canonical_url(url)
+        if rid in seen_ids or canonical_url in seen_urls:
+            stats["duplicate"] += 1
+            continue
+
+        errors = validate_resource(resource, len(resources) + len(promoted))
+        if errors:
+            stats["invalid"] += 1
+            continue
+
+        seen_ids.add(rid)
+        seen_urls.add(canonical_url)
+        promoted.append(resource)
+
+    if promoted:
+        resources.extend(promoted)
+        catalog["resources"] = resources
+        catalog["updated_on"] = date.today().isoformat()
+    stats["promoted"] = len(promoted)
+    return promoted, stats
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--catalog", default="resources/catalog/resources.json")
+    parser.add_argument("--candidates", default="resources/catalog/pending_candidates.json")
+    parser.add_argument("--max-promotions", type=int, default=None)
+    args = parser.parse_args()
+
+    catalog_path = Path(args.catalog)
+    candidates_path = Path(args.candidates)
+
+    if not catalog_path.exists():
+        print(f"Missing catalog file: {catalog_path}")
+        return 1
+    if not candidates_path.exists():
+        print(f"Missing candidates file: {candidates_path}")
+        return 1
+
+    try:
+        catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
+        pending_payload = json.loads(candidates_path.read_text(encoding="utf-8"))
+    except json.JSONDecodeError as exc:
+        print(f"Invalid JSON input: {exc}")
+        return 1
+
+    promoted, stats = promote_candidates(
+        catalog,
+        pending_payload,
+        max_promotions=args.max_promotions,
+    )
+    if not promoted:
+        print(
+            "Promotion complete: no new verified resources "
+            f"(duplicates={stats['duplicate']}, invalid={stats['invalid']})"
+        )
+        return 0
+
+    catalog_path.write_text(json.dumps(catalog, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
+    print(
+        "Promotion complete: "
+        f"promoted={stats['promoted']} duplicate={stats['duplicate']} invalid={stats['invalid']}"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
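Reviewer note: the duplicate check in `promote_candidates` compares `_canonical_url(url)` values, but the body of that helper is not visible in this part of the diff. The sketch below shows one way such a helper might normalize URLs so that trivially different spellings of the same address collide. The name `canonical_url_sketch` is hypothetical, and the normalization rules (lowercased host, no trailing slash, no fragment) are assumptions, not the repository's actual behavior.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url_sketch(url: str) -> str:
    """Hypothetical stand-in for `_canonical_url`; the rules here are assumed."""
    parts = urlsplit(url.strip())
    # Drop a trailing slash so "/pashto-new/" and "/pashto-new" compare equal.
    path = parts.path.rstrip("/")
    # Lowercase scheme and host, keep the query string, discard the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


same_a = canonical_url_sketch("HTTPS://Example.org/pashto-new/")
same_b = canonical_url_sketch("https://example.org/pashto-new#readme")
print(same_a == same_b)
```

If the real helper is stricter or looser than this, the `duplicate` counter reported by the script will differ accordingly, so the exact rules are worth confirming against the full file.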
scripts/run_resource_cycle.py CHANGED
@@ -8,6 +8,7 @@ Usage:
   python scripts/run_resource_cycle.py --limit 30
   python scripts/run_resource_cycle.py --skip-pytest
   python scripts/run_resource_cycle.py --discover-only
+  python scripts/run_resource_cycle.py --max-promotions 10
 """
 
 from __future__ import annotations
@@ -30,6 +31,12 @@ def main() -> int:
     parser.add_argument("--limit", type=int, default=25, help="Candidate fetch limit per source")
     parser.add_argument("--skip-pytest", action="store_true", help="Skip pytest step")
     parser.add_argument("--discover-only", action="store_true", help="Only sync candidates and stop")
+    parser.add_argument(
+        "--max-promotions",
+        type=int,
+        default=None,
+        help="Optional cap for auto-promotion count from pending candidates",
+    )
     args = parser.parse_args()
 
     repo_root = Path(__file__).resolve().parents[1]
@@ -38,8 +45,12 @@ def main() -> int:
     ]
 
     if not args.discover_only:
+        promote_step = ["python", "scripts/promote_candidates.py"]
+        if args.max_promotions is not None:
+            promote_step.extend(["--max-promotions", str(args.max_promotions)])
         steps.extend(
             [
+                promote_step,
                 ["python", "scripts/validate_resource_catalog.py"],
                 ["python", "scripts/generate_resource_views.py"],
                 ["python", "scripts/check_links.py"],
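Reviewer note: the last hunk forwards `--max-promotions` to `scripts/promote_candidates.py` only when the flag was supplied. A minimal sketch of that pattern, pulled into a helper for illustration; `build_promote_step` is a hypothetical name, since the diff builds the list inline inside `main()`.

```python
from __future__ import annotations


def build_promote_step(max_promotions: int | None = None) -> list[str]:
    """Hypothetical helper mirroring the inline logic in the hunk."""
    step = ["python", "scripts/promote_candidates.py"]
    # Compare against None, not truthiness, so an explicit 0 is still forwarded.
    if max_promotions is not None:
        step.extend(["--max-promotions", str(max_promotions)])
    return step


print(build_promote_step())
print(build_promote_step(10))
```

Because the check is `is not None`, `--max-promotions 0` is passed through rather than silently dropped. In `promote_candidates` a cap of 0 makes the loop break on its first iteration, so nothing is promoted.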
tests/test_promote_candidates.py ADDED
@@ -0,0 +1,134 @@
+from datetime import date
+
+from scripts.promote_candidates import PLACEHOLDER_PRIMARY_USE, promote_candidates
+
+
+def _catalog() -> dict:
+    return {
+        "version": "1.0.1",
+        "updated_on": "2026-02-18",
+        "resources": [
+            {
+                "id": "dataset-existing",
+                "title": "Pashto Existing Dataset",
+                "url": "https://example.org/pashto-existing",
+                "category": "dataset",
+                "source": "other",
+                "status": "verified",
+                "summary": "Existing Pashto dataset used as baseline for dedupe checks.",
+                "primary_use": "Testing",
+                "tasks": ["asr"],
+                "pashto_evidence": {
+                    "evidence_text": "Includes Pashto split.",
+                    "evidence_url": "https://example.org/pashto-existing",
+                    "markers": ["pashto"],
+                },
+                "tags": ["pashto", "dataset"],
+            }
+        ],
+    }
+
+
+def _candidate(*, rid: str, title: str, url: str, category: str = "dataset") -> dict:
+    return {
+        "id": rid,
+        "title": title,
+        "url": url,
+        "category": category,
+        "source": "other",
+        "status": "candidate",
+        "summary": "Candidate entry for automated promotion tests.",
+        "primary_use": PLACEHOLDER_PRIMARY_USE,
+        "tasks": [],
+        "pashto_evidence": {
+            "evidence_text": "Contains explicit Pashto marker in evidence text.",
+            "evidence_url": url,
+            "markers": ["pashto"],
+        },
+        "tags": ["pashto", "candidate"],
+    }
+
+
+def test_promote_candidates_promotes_valid_non_duplicate_entries() -> None:
+    catalog = _catalog()
+    pending = {
+        "candidate_count": 1,
+        "candidates": [
+            _candidate(
+                rid="dataset-new",
+                title="Pashto New Dataset",
+                url="https://example.org/pashto-new",
+            )
+        ],
+    }
+
+    promoted, stats = promote_candidates(catalog, pending)
+
+    assert len(promoted) == 1
+    assert stats["promoted"] == 1
+    assert catalog["updated_on"] == date.today().isoformat()
+    assert catalog["resources"][-1]["status"] == "verified"
+    assert catalog["resources"][-1]["primary_use"] == "Automated discovery entry for Pashto resource tracking."
+
+
+def test_promote_candidates_skips_duplicates_and_invalid_entries() -> None:
+    catalog = _catalog()
+    invalid = _candidate(
+        rid="model-invalid",
+        title="Generic Multilingual Model",
+        url="https://example.org/model-invalid",
+        category="model",
+    )
+    invalid["pashto_evidence"]["markers"] = ["multilingual"]
+    invalid["pashto_evidence"]["evidence_text"] = "Language support listed in docs."
+
+    pending = {
+        "candidate_count": 3,
+        "candidates": [
+            _candidate(
+                rid="dataset-existing",
+                title="Pashto Duplicate ID",
+                url="https://example.org/new-url",
+            ),
+            _candidate(
+                rid="dataset-url-duplicate",
+                title="Pashto Duplicate URL",
+                url="https://example.org/pashto-existing",
+            ),
+            invalid,
+        ],
+    }
+
+    promoted, stats = promote_candidates(catalog, pending)
+
+    assert promoted == []
+    assert stats["promoted"] == 0
+    assert stats["duplicate"] == 2
+    assert stats["invalid"] == 1
+    assert catalog["updated_on"] == "2026-02-18"
+    assert len(catalog["resources"]) == 1
+
+
+def test_promote_candidates_respects_max_promotions() -> None:
+    catalog = _catalog()
+    pending = {
+        "candidate_count": 2,
+        "candidates": [
+            _candidate(
+                rid="dataset-new-a",
+                title="Pashto New Dataset A",
+                url="https://example.org/pashto-new-a",
+            ),
+            _candidate(
+                rid="dataset-new-b",
+                title="Pashto New Dataset B",
+                url="https://example.org/pashto-new-b",
+            ),
+        ],
+    }
+
+    promoted, stats = promote_candidates(catalog, pending, max_promotions=1)
+
+    assert len(promoted) == 1
+    assert stats["promoted"] == 1
+    assert len(catalog["resources"]) == 2
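Reviewer note: the three tests cover duplicates against the existing catalog, invalid evidence, and the promotion cap. Since `promote_candidates` adds each promoted id and URL to its seen sets, two identical candidates inside one pending payload should also collapse to a single promotion, and no test in this file exercises that path. The sketch below illustrates only the id half of that rule with a simplified, self-contained stand-in; `dedupe_batch` is a hypothetical name and is not the real function.

```python
from __future__ import annotations


def dedupe_batch(existing_ids: set[str], candidates: list[dict]) -> tuple[list[dict], int]:
    """Simplified stand-in for the seen-id bookkeeping in promote_candidates."""
    seen = set(existing_ids)  # copy so the caller's set is not mutated
    kept: list[dict] = []
    duplicates = 0
    for candidate in candidates:
        rid = candidate["id"]
        if rid in seen:
            duplicates += 1
            continue
        # Record the id immediately so a later repeat in the same batch is caught.
        seen.add(rid)
        kept.append(candidate)
    return kept, duplicates


kept, duplicates = dedupe_batch(
    {"dataset-existing"},
    [{"id": "dataset-new"}, {"id": "dataset-new"}, {"id": "dataset-existing"}],
)
print(len(kept), duplicates)
```

A matching test against the real function could pass the same `_candidate(...)` twice in one payload and assert `stats["promoted"] == 1` and `stats["duplicate"] == 1`, assuming validation accepts the first copy.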