Commit f13fd7c by musaw · 1 parent: fb472d7

Add automated Pashto resource catalog, search UI, and sync workflow
.github/ISSUE_TEMPLATE/resource_addition.md ADDED
@@ -0,0 +1,34 @@
+---
+name: Resource addition
+about: Propose a new Pashto-related dataset/model/tool/paper for the catalog
+title: "[resource] "
+labels: ["docs", "help wanted"]
+assignees: []
+---
+
+## Resource type
+- [ ] Dataset
+- [ ] Model
+- [ ] Benchmark
+- [ ] Tool
+- [ ] Paper
+
+## Resource URL
+<!-- Add one canonical URL -->
+
+## Why this is Pashto-relevant
+<!-- Include explicit markers such as Pashto, ps, pus, pbt_Arab, ps_af -->
+
+## Pashto evidence link
+<!-- Link to the exact line/page/model card proving Pashto support -->
+
+## Suggested primary use in this repository
+<!-- Example: ASR baseline, MT benchmark, NLP pretraining -->
+
+## License and usage notes
+<!-- Include known license terms and restrictions -->
+
+## Checklist
+- [ ] Link is clickable markdown format
+- [ ] Evidence is explicit and verifiable
+- [ ] Not already present in `resources/catalog/resources.json`
.github/workflows/ci.yml CHANGED
@@ -23,6 +23,15 @@ jobs:
           python -m pip install --upgrade pip
           python -m pip install -e ".[dev]"
 
+      - name: Validate resource catalog
+        run: python scripts/validate_resource_catalog.py
+
+      - name: Generate resource views
+        run: python scripts/generate_resource_views.py
+
+      - name: Ensure generated files are committed
+        run: git diff --exit-code
+
       - name: Check markdown links format
         run: python scripts/check_links.py
 
.github/workflows/resource_sync.yml ADDED
@@ -0,0 +1,52 @@
+name: Resource Sync
+
+on:
+  schedule:
+    - cron: "0 4 * * 1"
+  workflow_dispatch:
+
+permissions:
+  contents: write
+  pull-requests: write
+
+jobs:
+  sync:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e ".[dev]"
+
+      - name: Sync candidate resources
+        run: python scripts/sync_resources.py --limit 20
+
+      - name: Validate catalog
+        run: python scripts/validate_resource_catalog.py
+
+      - name: Create review PR
+        uses: peter-evans/create-pull-request@v6
+        with:
+          branch: bot/resource-sync
+          delete-branch: true
+          commit-message: "chore(resources): sync candidate feed"
+          title: "chore(resources): sync Pashto resource candidates"
+          body: |
+            Automated weekly candidate sync.
+
+            Scope:
+            - Updates `resources/catalog/pending_candidates.json`
+            - Leaves verified catalog unchanged for maintainer review
+          labels: |
+            resource-update
+            needs-review
+          add-paths: |
+            resources/catalog/pending_candidates.json
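The sync step above appends newly discovered entries to the candidate feed without touching the verified catalog. The body of `scripts/sync_resources.py` is not part of this diff; the sketch below only illustrates the deduplication such a script would need, using the `id`/`url` fields from the search export. The sample entries are hypothetical.

```python
def merge_candidates(verified, pending, discovered, limit=20):
    """Return the pending feed plus new discoveries, deduplicated by id/url.

    Entries already present in either the verified catalog or the pending
    feed are skipped, and at most `limit` new entries are appended.
    """
    known = set()
    for entry in verified + pending:
        known.add(entry["id"])
        known.add(entry["url"])
    out = list(pending)
    added = 0
    for entry in discovered:
        if entry["id"] in known or entry["url"] in known or added >= limit:
            continue
        out.append(entry)
        known.update((entry["id"], entry["url"]))
        added += 1
    return out

verified = [{"id": "dataset-google-fleurs",
             "url": "https://huggingface.co/datasets/google/fleurs"}]
pending = []
discovered = [
    # Already in the verified catalog, so it should be skipped.
    {"id": "dataset-google-fleurs",
     "url": "https://huggingface.co/datasets/google/fleurs"},
    # Hypothetical new find that should land in the pending feed.
    {"id": "dataset-new", "url": "https://example.org/new"},
]
print([e["id"] for e in merge_candidates(verified, pending, discovered)])
# → ['dataset-new']
```

Keeping discoveries out of the verified list until a maintainer promotes them is what makes the weekly bot PR safe to auto-open.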
CONTRIBUTING.md CHANGED
@@ -24,12 +24,10 @@ Then contribute here by opening an issue/PR with:
 - what concrete follow-up is needed in this repository.
 
 ## 🔍 External Resource Contribution Rules
-- Add links in the correct workspace README (`data`, `asr`, `tts`, `benchmarks`, `apps`).
-- Update [docs/resource_catalog.md](docs/resource_catalog.md) with:
-  - what the resource is,
-  - explicit Pashto support evidence,
-  - how it can be used in this repository,
-  - practical applications.
+- Add or update entries in [resources/catalog/resources.json](resources/catalog/resources.json) using [resources/catalog/resource.template.json](resources/catalog/resource.template.json).
+- Validate catalog changes with `python scripts/validate_resource_catalog.py`.
+- Regenerate resource docs and search data with `python scripts/generate_resource_views.py`.
+- Use [docs/resource_catalog.md](docs/resource_catalog.md) and [docs/resource_automation.md](docs/resource_automation.md) for full rules.
 - Prefer official pages and model/dataset cards over third-party reposts.
 
 ## 🔄 Contribution Flow
README.md CHANGED
@@ -17,6 +17,7 @@ Community-led open-source project to make Pashto a first-class language in AI sp
 - GitHub: [Pukhto_Pashto](https://github.com/Musawer1214/Pukhto_Pashto)
 - Hugging Face: [Musawer14/Pukhto_Pashto](https://huggingface.co/Musawer14/Pukhto_Pashto)
 - GitHub Pages (About): [Pukhto_Pashto Site](https://musawer1214.github.io/Pukhto_Pashto/)
+- GitHub Pages (Resource Search): [Pashto Resource Search](https://musawer1214.github.io/Pukhto_Pashto/search/)
 
 ## 🎯 Core Goal
 - Build open datasets, benchmarks, and models for Pashto ASR, TTS, and NLP.
@@ -41,10 +42,13 @@ Community-led open-source project to make Pashto a first-class language in AI sp
 ## 📚 Verified Resource Catalog
 The project tracks validated external resources in:
 - [docs/resource_catalog.md](docs/resource_catalog.md) (master index)
+- [resources/catalog/resources.json](resources/catalog/resources.json) (canonical machine-readable catalog)
+- [resources/schema/resource.schema.json](resources/schema/resource.schema.json) (catalog schema)
 - [resources/datasets/README.md](resources/datasets/README.md)
 - [resources/models/README.md](resources/models/README.md)
 - [resources/benchmarks/README.md](resources/benchmarks/README.md)
 - [resources/tools/README.md](resources/tools/README.md)
+- [resources/papers/README.md](resources/papers/README.md)
 
 ## 🎙️ Featured Dataset: Common Voice Pashto
 - Dataset: Common Voice Scripted Speech 24.0 - Pashto
docs/README.md CHANGED
@@ -1,8 +1,8 @@
-# 📘 Documentation Hub
+# Documentation Hub
 
 This folder is the main documentation entry point for contributors.
 
-## 🚀 Start Here
+## Start here
 - Project purpose: [../PROJECT_PURPOSE.md](../PROJECT_PURPOSE.md)
 - Contributing guide: [../CONTRIBUTING.md](../CONTRIBUTING.md)
 - Governance: [../GOVERNANCE.md](../GOVERNANCE.md)
@@ -10,7 +10,7 @@ This folder is the main documentation entry point for contributors.
 - Roadmap: [../ROADMAP.md](../ROADMAP.md)
 - Changelog: [../CHANGELOG.md](../CHANGELOG.md)
 
-## 🧭 Core Documentation
+## Core documentation
 - Workstreams: [workstreams.md](workstreams.md)
 - Dataset guidelines: [dataset_guidelines.md](dataset_guidelines.md)
 - Pashto normalization policy: [pashto_normalization_v0.1.md](pashto_normalization_v0.1.md)
@@ -19,17 +19,22 @@ This folder is the main documentation entry point for contributors.
 - Release checklist: [release_checklist.md](release_checklist.md)
 - Platforms and publish flow: [platforms.md](platforms.md)
 - GitHub operations: [github_operations.md](github_operations.md)
+- Resource automation: [resource_automation.md](resource_automation.md)
 
-## 📚 Resource Tracking
+## Resource tracking
 - Master resource index: [resource_catalog.md](resource_catalog.md)
+- GitHub Pages search: [search/index.html](search/index.html)
 - Structured resources folder: [../resources/README.md](../resources/README.md)
 
-## 🛠️ Tooling
+## Tooling
 - Scripts overview: [../scripts/README.md](../scripts/README.md)
 - Link checker: [../scripts/check_links.py](../scripts/check_links.py)
+- Resource catalog validator: [../scripts/validate_resource_catalog.py](../scripts/validate_resource_catalog.py)
+- Resource view generator: [../scripts/generate_resource_views.py](../scripts/generate_resource_views.py)
+- Candidate sync script: [../scripts/sync_resources.py](../scripts/sync_resources.py)
 - Normalization validator: [../scripts/validate_normalization.py](../scripts/validate_normalization.py)
 
-## 📈 Evaluation and Experiments
+## Evaluation and experiments
 - Benchmark result format: [../benchmarks/results/README.md](../benchmarks/results/README.md)
 - Benchmark schema: [../benchmarks/schema/benchmark_result.schema.json](../benchmarks/schema/benchmark_result.schema.json)
 - Experiment run cards: [../experiments/README.md](../experiments/README.md)
docs/index.md CHANGED
@@ -21,7 +21,13 @@ title: About Pukhto Pashto
 - `benchmarks/`: benchmark schema, result format, and metric guidance.
 - `experiments/`: reproducible run cards and experiment records.
 - `docs/`: policies, roadmap, release process, and operating guides.
-- `resources/`: verified external Pashto datasets, models, tools, and benchmarks.
+- `resources/`: verified external Pashto datasets, models, tools, benchmarks, and papers.
+
+## Search Resources
+
+- Search UI: [Pashto Resource Search](search/)
+- Resource index docs: [resource_catalog.md](resource_catalog.md)
+- Machine-readable catalog: [../resources/catalog/resources.json](../resources/catalog/resources.json)
 
 ## Project References
 
docs/resource_automation.md ADDED
@@ -0,0 +1,36 @@
+# Resource Automation
+
+This repository uses a semi-automated process to keep Pashto resources current while preserving human review.
+
+## Goals
+- Discover new Pashto-relevant resources from trusted public endpoints.
+- Keep a machine-readable canonical catalog.
+- Prevent unreviewed low-confidence resources from directly entering verified lists.
+
+## Files involved
+- Canonical verified catalog: [../resources/catalog/resources.json](../resources/catalog/resources.json)
+- Candidate feed: [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json)
+- Catalog schema: [../resources/schema/resource.schema.json](../resources/schema/resource.schema.json)
+- Search export: [search/resources.json](search/resources.json)
+
+## Scripts
+- Validate catalog: `python scripts/validate_resource_catalog.py`
+- Generate markdown and search index: `python scripts/generate_resource_views.py`
+- Sync new candidates: `python scripts/sync_resources.py --limit 20`
+
+## GitHub Actions
+- CI (`.github/workflows/ci.yml`) enforces:
+  - catalog validation
+  - generated file consistency
+  - markdown link checks
+  - tests
+- Resource Sync (`.github/workflows/resource_sync.yml`) runs weekly and opens a PR with candidate updates.
+
+## Review flow
+1. Inspect candidate entries in `resources/catalog/pending_candidates.json`.
+2. Select useful items and move them into `resources/catalog/resources.json`.
+3. Set `status` to `verified` only after checking evidence and license.
+4. Run:
+   - `python scripts/validate_resource_catalog.py`
+   - `python scripts/generate_resource_views.py`
+5. Commit and open PR.
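The validation step gates the review flow above. The actual contents of `scripts/validate_resource_catalog.py` are not shown in this commit; the sketch below is only a minimal illustration of the kind of checks it might run, with required fields and status values inferred from the `docs/search/resources.json` export.

```python
import json

# Field names mirror docs/search/resources.json; the real script's rules may differ.
REQUIRED_KEYS = {
    "id", "title", "url", "category", "source",
    "status", "summary", "primary_use", "evidence_url",
}
ALLOWED_STATUS = {"verified", "candidate"}

def validate_catalog(entries):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    seen_ids = set()
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        if entry.get("status") not in ALLOWED_STATUS:
            problems.append(f"entry {i}: unexpected status {entry.get('status')!r}")
        if entry.get("id") in seen_ids:
            problems.append(f"entry {i}: duplicate id {entry.get('id')!r}")
        seen_ids.add(entry.get("id"))
    return problems

# One well-formed entry, taken from the catalog's FLEURS record.
entries = json.loads("""[
  {"id": "dataset-google-fleurs", "title": "Google FLEURS",
   "url": "https://huggingface.co/datasets/google/fleurs",
   "category": "dataset", "source": "huggingface", "status": "verified",
   "summary": "Multilingual speech benchmark with a Pashto subset.",
   "primary_use": "Speech benchmark and external evaluation",
   "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py"}
]""")
print(validate_catalog(entries))  # → []
```

In CI this kind of check runs before `generate_resource_views.py`, so a malformed entry fails the build before any markdown or search data is regenerated from it.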
docs/resource_catalog.md CHANGED
@@ -1,36 +1,43 @@
-# 📚 Verified Pashto Resource Catalog
+# Verified Pashto Resource Catalog
 
 Last updated: `2026-02-15`
 
 This index points to validated Pashto-related resources tracked in structured files.
 
-## Validation Method
-- Verify that each source URL resolves to an official page.
-- Verify explicit Pashto support markers (`ps`, `ps_af`, `pbt_Arab`, `pus`) where available.
+## Validation method
+- Verify source URL resolves to official page or canonical repository.
+- Verify explicit Pashto support markers (`Pashto`, `ps`, `ps_af`, `pus`, `pbt_Arab`) where possible.
 - Include only resources with practical use for this repository.
 
-## 🧭 Structured Catalog
+## Structured catalog
+- Canonical JSON: [../resources/catalog/resources.json](../resources/catalog/resources.json)
+- Candidate feed: [../resources/catalog/pending_candidates.json](../resources/catalog/pending_candidates.json)
+- JSON schema: [../resources/schema/resource.schema.json](../resources/schema/resource.schema.json)
+
+## Generated markdown views
 - Datasets: [../resources/datasets/README.md](../resources/datasets/README.md)
 - Models: [../resources/models/README.md](../resources/models/README.md)
 - Benchmarks: [../resources/benchmarks/README.md](../resources/benchmarks/README.md)
-- Tools and applications: [../resources/tools/README.md](../resources/tools/README.md)
+- Tools: [../resources/tools/README.md](../resources/tools/README.md)
+- Papers: [../resources/papers/README.md](../resources/papers/README.md)
+
+## Search page
+- GitHub Pages search UI: [search/index.html](search/index.html)
+- Search data export: [search/resources.json](search/resources.json)
+- Automation guide: [resource_automation.md](resource_automation.md)
 
-## 🧩 Workspace Mapping
+## Workspace mapping
 - Data workspace: [../data/README.md](../data/README.md)
 - ASR workspace: [../asr/README.md](../asr/README.md)
 - TTS workspace: [../tts/README.md](../tts/README.md)
 - Benchmarks workspace: [../benchmarks/README.md](../benchmarks/README.md)
 - Applications workspace: [../apps/desktop/README.md](../apps/desktop/README.md)
 
-## 🔄 Maintenance Rule
+## Maintenance rule
 Before each release:
 - Confirm links still resolve.
 - Confirm Pashto support markers remain valid.
-- Confirm license/usage terms are still compatible.
-
-## New Additions (2026-02-15)
-- `OPUS-100` dataset with `en-ps` subset support.
-- `FLORES-200` benchmark reference with `pbt_Arab` language code coverage.
-- `facebook/mms-1b-all` ASR model reference for multilingual Pashto transfer.
-- `mdarhri/pashto-bert` model for Pashto NLP baseline work.
-- Two Kaggle resources: Pashto isolated-word speech and Pashto word embeddings.
+- Confirm license or usage terms are still compatible.
+- Run:
+  - `python scripts/validate_resource_catalog.py`
+  - `python scripts/generate_resource_views.py`
docs/search/index.html ADDED
@@ -0,0 +1,418 @@
+<!doctype html>
+<html lang="en">
+<head>
+  <meta charset="utf-8">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+  <title>Pashto Resource Search</title>
+  <link rel="preconnect" href="https://fonts.googleapis.com">
+  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+  <link href="https://fonts.googleapis.com/css2?family=IBM+Plex+Sans+Arabic:wght@400;500;700&family=Space+Grotesk:wght@500;700&display=swap" rel="stylesheet">
+  <style>
+    :root {
+      --bg: #f6f4ec;
+      --panel: #fffef9;
+      --ink: #1d2a24;
+      --muted: #4c6158;
+      --line: #d6ddd7;
+      --brand: #106b53;
+      --brand-soft: #e0f0ea;
+      --accent: #c76a1a;
+      --accent-soft: #f7e9d8;
+      --shadow: 0 12px 30px rgba(29, 42, 36, 0.08);
+    }
+
+    * { box-sizing: border-box; }
+
+    body {
+      margin: 0;
+      font-family: "IBM Plex Sans Arabic", "Segoe UI", sans-serif;
+      color: var(--ink);
+      background:
+        radial-gradient(circle at 8% 12%, #fce4c4 0, rgba(252, 228, 196, 0) 35%),
+        radial-gradient(circle at 92% 6%, #d8ece5 0, rgba(216, 236, 229, 0) 38%),
+        var(--bg);
+      min-height: 100vh;
+    }
+
+    .wrap {
+      max-width: 1100px;
+      margin: 0 auto;
+      padding: 24px 18px 44px;
+    }
+
+    .hero {
+      background: linear-gradient(120deg, #fff, #f7fbf9);
+      border: 1px solid var(--line);
+      box-shadow: var(--shadow);
+      border-radius: 18px;
+      padding: 20px 18px;
+      margin-bottom: 16px;
+      transform: translateY(8px);
+      opacity: 0;
+      animation: rise 500ms ease forwards;
+    }
+
+    .eyebrow {
+      letter-spacing: 0.07em;
+      text-transform: uppercase;
+      color: var(--accent);
+      font-family: "Space Grotesk", sans-serif;
+      font-weight: 700;
+      font-size: 12px;
+      margin-bottom: 6px;
+    }
+
+    h1 {
+      margin: 0 0 8px;
+      font-family: "Space Grotesk", sans-serif;
+      font-size: 33px;
+      line-height: 1.1;
+    }
+
+    .hero p {
+      margin: 0;
+      color: var(--muted);
+      line-height: 1.5;
+    }
+
+    .controls {
+      margin-top: 14px;
+      display: grid;
+      grid-template-columns: 2.2fr 1fr 1fr 1fr 1fr;
+      gap: 10px;
+    }
+
+    .field {
+      display: flex;
+      flex-direction: column;
+      gap: 5px;
+    }
+
+    .field label {
+      font-size: 12px;
+      font-weight: 700;
+      color: var(--muted);
+      letter-spacing: 0.04em;
+      text-transform: uppercase;
+      font-family: "Space Grotesk", sans-serif;
+    }
+
+    input, select {
+      width: 100%;
+      border: 1px solid var(--line);
+      background: var(--panel);
+      color: var(--ink);
+      border-radius: 10px;
+      padding: 10px 11px;
+      font: inherit;
+    }
+
+    input:focus, select:focus {
+      outline: 2px solid #8cc9b5;
+      border-color: #8cc9b5;
+    }
+
+    .summary {
+      margin: 14px 2px 6px;
+      display: flex;
+      justify-content: space-between;
+      align-items: center;
+      gap: 12px;
+      color: var(--muted);
+      font-size: 14px;
+    }
+
+    .badge {
+      background: var(--brand-soft);
+      color: var(--brand);
+      border: 1px solid #b7dccc;
+      padding: 3px 9px;
+      border-radius: 999px;
+      font-weight: 600;
+    }
+
+    .grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fill, minmax(260px, 1fr));
+      gap: 12px;
+      list-style: none;
+      padding: 0;
+      margin: 0;
+    }
+
+    .card {
+      border: 1px solid var(--line);
+      border-radius: 14px;
+      background: var(--panel);
+      box-shadow: var(--shadow);
+      padding: 13px 12px;
+      display: flex;
+      flex-direction: column;
+      gap: 9px;
+      opacity: 0;
+      animation: rise 420ms ease forwards;
+    }
+
+    .chips {
+      display: flex;
+      gap: 6px;
+      flex-wrap: wrap;
+    }
+
+    .chip {
+      border-radius: 999px;
+      padding: 2px 8px;
+      font-size: 11px;
+      font-weight: 700;
+      font-family: "Space Grotesk", sans-serif;
+      letter-spacing: 0.03em;
+      text-transform: uppercase;
+      border: 1px solid transparent;
+    }
+
+    .chip.category { background: var(--brand-soft); color: var(--brand); border-color: #b7dccc; }
+    .chip.source { background: var(--accent-soft); color: #955016; border-color: #efc89f; }
+    .chip.status { background: #eef3ff; color: #3e4f86; border-color: #d2dbf6; }
+
+    .title {
+      margin: 0;
+      font-size: 17px;
+      line-height: 1.3;
+    }
+
+    .title a {
+      color: var(--ink);
+      text-decoration: none;
+      border-bottom: 1px solid transparent;
+    }
+
+    .title a:hover {
+      border-bottom-color: currentColor;
+    }
+
+    .meta {
+      color: var(--muted);
+      font-size: 13px;
+      line-height: 1.45;
+      margin: 0;
+    }
+
+    .card footer {
+      margin-top: auto;
+      font-size: 12px;
+      color: var(--muted);
+    }
+
+    .empty {
+      border: 1px dashed #b9c4bd;
+      border-radius: 12px;
+      padding: 24px;
+      background: #fcfcfa;
+      color: var(--muted);
+      text-align: center;
+    }
+
+    @keyframes rise {
+      from { opacity: 0; transform: translateY(8px); }
+      to { opacity: 1; transform: translateY(0); }
+    }
+
+    @media (max-width: 900px) {
+      .controls {
+        grid-template-columns: 1fr 1fr;
+      }
+
+      .controls .field:first-child {
+        grid-column: span 2;
+      }
+    }
+
+    @media (max-width: 560px) {
+      h1 { font-size: 28px; }
+      .controls { grid-template-columns: 1fr; }
+      .controls .field:first-child { grid-column: span 1; }
+      .summary { flex-direction: column; align-items: flex-start; }
+    }
+  </style>
+</head>
+<body>
+  <main class="wrap">
+    <section class="hero">
+      <div class="eyebrow">Pukhto Pashto</div>
+      <h1>Pashto Technology Resource Search</h1>
+      <p>
+        Search and filter verified and candidate resources that support Pashto in ASR, TTS, NLP, translation, tooling, and research.
+      </p>
+
+      <div class="controls">
+        <div class="field">
+          <label for="q">Search</label>
+          <input id="q" type="search" placeholder="Try: ASR, pbt_Arab, translation, speech" />
+        </div>
+        <div class="field">
+          <label for="category">Category</label>
+          <select id="category"></select>
+        </div>
+        <div class="field">
+          <label for="source">Source</label>
+          <select id="source"></select>
+        </div>
+        <div class="field">
+          <label for="task">Task</label>
+          <select id="task"></select>
+        </div>
+        <div class="field">
+          <label for="status">Status</label>
+          <select id="status"></select>
+        </div>
+      </div>
+    </section>
+
+    <div class="summary">
+      <span id="countText">Loading resources...</span>
+      <span class="badge" id="generatedAt">Catalog timestamp: -</span>
+    </div>
+
+    <ul id="results" class="grid"></ul>
+  </main>
+
+  <script>
+    const state = {
+      all: [],
+      filtered: []
+    };
+
+    const els = {
+      q: document.getElementById("q"),
+      category: document.getElementById("category"),
+      source: document.getElementById("source"),
+      task: document.getElementById("task"),
+      status: document.getElementById("status"),
+      results: document.getElementById("results"),
+      countText: document.getElementById("countText"),
+      generatedAt: document.getElementById("generatedAt")
+    };
+
+    function uniqSorted(values) {
+      return [...new Set(values.filter(Boolean))].sort((a, b) => a.localeCompare(b));
+    }
+
+    function fillSelect(select, options, allLabel) {
+      select.innerHTML = "";
+      const allOption = document.createElement("option");
+      allOption.value = "";
+      allOption.textContent = allLabel;
+      select.appendChild(allOption);
+      for (const opt of options) {
+        const el = document.createElement("option");
+        el.value = opt;
+        el.textContent = opt;
+        select.appendChild(el);
+      }
+    }
+
+    function matchesQuery(resource, query) {
+      if (!query) return true;
+      const hay = [
+        resource.title,
+        resource.summary,
+        resource.primary_use,
+        resource.category,
+        resource.source,
+        resource.status,
+        ...(resource.tags || []),
+        ...(resource.tasks || []),
+        ...(resource.markers || []),
+        resource.evidence_text
+      ].join(" ").toLowerCase();
+      return hay.includes(query);
+    }
+
+    function applyFilters() {
+      const q = els.q.value.trim().toLowerCase();
+      const category = els.category.value;
+      const source = els.source.value;
+      const task = els.task.value;
+      const status = els.status.value;
+
+      state.filtered = state.all.filter((resource) => {
+        if (!matchesQuery(resource, q)) return false;
+        if (category && resource.category !== category) return false;
+        if (source && resource.source !== source) return false;
+        if (status && resource.status !== status) return false;
+        if (task && !(resource.tasks || []).includes(task)) return false;
+        return true;
+      });
+
+      renderResults();
+    }
+
+    function chip(label, cls) {
+      return `<span class="chip ${cls}">${label}</span>`;
+    }
+
+    function renderResults() {
+      const items = state.filtered;
+      els.countText.textContent = `${items.length} result${items.length === 1 ? "" : "s"} of ${state.all.length}`;
+
+      if (!items.length) {
+        els.results.innerHTML = `<li class="empty">No matches. Try broadening filters or changing keywords.</li>`;
+        return;
+      }
+
+      els.results.innerHTML = items.map((resource, idx) => `
+        <li class="card" style="animation-delay:${Math.min(idx * 28, 320)}ms">
+          <div class="chips">
+            ${chip(resource.category, "category")}
+            ${chip(resource.source, "source")}
+            ${chip(resource.status, "status")}
+          </div>
+          <h2 class="title"><a href="${resource.url}" target="_blank" rel="noreferrer">${resource.title}</a></h2>
+          <p class="meta">${resource.summary}</p>
+          <p class="meta"><strong>Primary use:</strong> ${resource.primary_use}</p>
+          <p class="meta"><strong>Pashto evidence:</strong> <a href="${resource.evidence_url}" target="_blank" rel="noreferrer">${resource.evidence_text}</a></p>
+          <footer>
+            ${(resource.tasks || []).length ? `Tasks: ${resource.tasks.join(", ")}` : "Tasks: n/a"}
+          </footer>
+        </li>
+      `).join("");
+    }
+
+    async function load() {
+      try {
+        const res = await fetch("./resources.json", { cache: "no-store" });
+        if (!res.ok) throw new Error(`HTTP ${res.status}`);
+        const payload = await res.json();
+        state.all = payload.resources || [];
+        state.filtered = [...state.all];
+
+        fillSelect(els.category, uniqSorted(state.all.map((r) => r.category)), "All categories");
+        fillSelect(els.source, uniqSorted(state.all.map((r) => r.source)), "All sources");
+        fillSelect(els.status, uniqSorted(state.all.map((r) => r.status)), "All statuses");
+        fillSelect(
+          els.task,
+          uniqSorted(state.all.flatMap((r) => r.tasks || [])),
+          "All tasks"
+        );
+
+        const generated = payload.generated_on ? new Date(payload.generated_on) : null;
+        els.generatedAt.textContent = generated && !Number.isNaN(generated.getTime())
+          ? `Catalog timestamp: ${generated.toISOString()}`
+          : "Catalog timestamp: unknown";
+
+        applyFilters();
+      } catch (err) {
+        els.countText.textContent = "Failed to load resources";
+        els.results.innerHTML = `<li class="empty">Could not load search data. ${String(err)}</li>`;
+      }
+    }
+
+    [els.q, els.category, els.source, els.task, els.status].forEach((el) => {
+      el.addEventListener("input", applyFilters);
+      el.addEventListener("change", applyFilters);
+    });
+
+    load();
+  </script>
+</body>
+</html>
docs/search/resources.json ADDED
@@ -0,0 +1,595 @@
+ {
+ "generated_on": "2026-02-15T00:00:00Z",
+ "count": 25,
+ "resources": [
+ {
+ "id": "dataset-common-voice-ps-v24",
+ "title": "Common Voice Scripted Speech 24.0 - Pashto",
+ "url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+ "category": "dataset",
+ "source": "mozilla",
+ "status": "verified",
+ "summary": "Large open Pashto speech dataset for ASR training and evaluation.",
+ "primary_use": "ASR training and evaluation",
+ "tasks": [
+ "asr"
+ ],
+ "tags": [
+ "pashto",
+ "speech",
+ "asr"
+ ],
+ "evidence_text": "Official dataset page is for Pashto.",
+ "evidence_url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+ "markers": [
+ "Pashto"
+ ]
+ },
+ {
+ "id": "dataset-google-fleurs",
+ "title": "Google FLEURS",
+ "url": "https://huggingface.co/datasets/google/fleurs",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Standard multilingual speech benchmark dataset with Pashto subset.",
+ "primary_use": "Speech benchmark and external evaluation",
+ "tasks": [
+ "asr",
+ "benchmarking"
+ ],
+ "tags": [
+ "pashto",
+ "speech",
+ "benchmark"
+ ],
+ "evidence_text": "Dataset config includes ps_af.",
+ "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
+ "markers": [
+ "ps_af"
+ ]
+ },
+ {
+ "id": "dataset-oscar-ps",
+ "title": "OSCAR Corpus",
+ "url": "https://huggingface.co/datasets/oscar-corpus/oscar",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Large web text corpus that includes Pashto text split.",
+ "primary_use": "Language modeling and lexicon expansion",
+ "tasks": [
+ "nlp"
+ ],
+ "tags": [
+ "pashto",
+ "text",
+ "nlp"
+ ],
+ "evidence_text": "Dataset includes unshuffled_deduplicated_ps split.",
+ "evidence_url": "https://huggingface.co/datasets/oscar-corpus/oscar",
+ "markers": [
+ "unshuffled_deduplicated_ps"
+ ]
+ },
+ {
+ "id": "dataset-wikipedia-ps",
+ "title": "Wikimedia Wikipedia",
+ "url": "https://huggingface.co/datasets/wikimedia/wikipedia",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Wikipedia corpus with Pashto edition for cleaner text resources.",
+ "primary_use": "Terminology and balanced text corpus",
+ "tasks": [
+ "nlp"
+ ],
+ "tags": [
+ "pashto",
+ "text",
+ "nlp"
+ ],
+ "evidence_text": "Dataset includes 20231101.ps subset.",
+ "evidence_url": "https://huggingface.co/datasets/wikimedia/wikipedia",
+ "markers": [
+ "20231101.ps"
+ ]
+ },
+ {
+ "id": "dataset-belebele-pbt-arab",
+ "title": "Belebele",
+ "url": "https://huggingface.co/datasets/facebook/belebele",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Reading comprehension dataset with Pashto script subset.",
+ "primary_use": "Comprehension and multilingual NLP benchmark",
+ "tasks": [
+ "nlp",
+ "benchmarking"
+ ],
+ "tags": [
+ "pashto",
+ "nlp",
+ "benchmark"
+ ],
+ "evidence_text": "Dataset includes pbt_Arab subset.",
+ "evidence_url": "https://huggingface.co/datasets/facebook/belebele",
+ "markers": [
+ "pbt_Arab"
+ ]
+ },
+ {
+ "id": "dataset-opus100-en-ps",
+ "title": "OPUS-100",
+ "url": "https://huggingface.co/datasets/Helsinki-NLP/opus-100",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Parallel corpus with English to Pashto split for MT tasks.",
+ "primary_use": "Machine translation training and evaluation",
+ "tasks": [
+ "mt",
+ "nlp"
+ ],
+ "tags": [
+ "pashto",
+ "mt",
+ "parallel-corpus"
+ ],
+ "evidence_text": "Dataset viewer includes en-ps split.",
+ "evidence_url": "https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps",
+ "markers": [
+ "en-ps"
+ ]
+ },
+ {
+ "id": "dataset-kaggle-pashto-isolated-words",
+ "title": "Pashto Isolated Words Speech Dataset",
+ "url": "https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset",
+ "category": "dataset",
+ "source": "kaggle",
+ "status": "verified",
+ "summary": "Speech dataset focused on isolated Pashto words.",
+ "primary_use": "Keyword spotting and constrained ASR experiments",
+ "tasks": [
+ "asr"
+ ],
+ "tags": [
+ "pashto",
+ "speech",
+ "kaggle"
+ ],
+ "evidence_text": "Dataset title explicitly states Pashto speech dataset.",
+ "evidence_url": "https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset",
+ "markers": [
+ "Pashto"
+ ]
+ },
+ {
+ "id": "dataset-kaggle-pashto-word-embeddings",
+ "title": "Pashto Word Embeddings",
+ "url": "https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings",
+ "category": "dataset",
+ "source": "kaggle",
+ "status": "verified",
+ "summary": "Pretrained Pashto word vectors for classic NLP baselines.",
+ "primary_use": "Lexical semantics and lightweight NLP baselines",
+ "tasks": [
+ "nlp"
+ ],
+ "tags": [
+ "pashto",
+ "nlp",
+ "embeddings",
+ "kaggle"
+ ],
+ "evidence_text": "Dataset description states pretrained Pashto embeddings.",
+ "evidence_url": "https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings",
+ "markers": [
+ "Pashto"
+ ]
+ },
+ {
+ "id": "model-whisper-large-v3",
+ "title": "Whisper Large v3",
+ "url": "https://huggingface.co/openai/whisper-large-v3",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Strong multilingual ASR baseline suitable for Pashto bootstrapping.",
+ "primary_use": "ASR baseline and pseudo-labeling",
+ "tasks": [
+ "asr"
+ ],
+ "tags": [
+ "pashto",
+ "asr",
+ "whisper"
+ ],
+ "evidence_text": "Whisper tokenizer map includes ps language key.",
+ "evidence_url": "https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py",
+ "markers": [
+ "ps"
+ ]
+ },
+ {
+ "id": "model-mms-1b-all",
+ "title": "MMS 1B All",
+ "url": "https://huggingface.co/facebook/mms-1b-all",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Multilingual ASR model from MMS for low-resource transfer.",
+ "primary_use": "ASR transfer baseline",
+ "tasks": [
+ "asr"
+ ],
+ "tags": [
+ "pashto",
+ "asr",
+ "mms"
+ ],
+ "evidence_text": "MMS coverage table includes pus with ASR support.",
+ "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+ "markers": [
+ "pus"
+ ]
+ },
+ {
+ "id": "model-mms-tts",
+ "title": "MMS TTS",
+ "url": "https://huggingface.co/facebook/mms-tts",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Multilingual TTS checkpoints useful for Pashto voice synthesis.",
+ "primary_use": "TTS baseline and transfer",
+ "tasks": [
+ "tts"
+ ],
+ "tags": [
+ "pashto",
+ "tts",
+ "mms"
+ ],
+ "evidence_text": "MMS coverage table includes pus with TTS support.",
+ "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+ "markers": [
+ "pus"
+ ]
+ },
+ {
+ "id": "model-nllb-200-distilled-600m",
+ "title": "NLLB-200 Distilled 600M",
+ "url": "https://huggingface.co/facebook/nllb-200-distilled-600M",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "General multilingual translation model with Pashto script token support.",
+ "primary_use": "Pashto translation baseline",
+ "tasks": [
+ "mt"
+ ],
+ "tags": [
+ "pashto",
+ "mt",
+ "nllb"
+ ],
+ "evidence_text": "Model special token map includes pbt_Arab.",
+ "evidence_url": "https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json",
+ "markers": [
+ "pbt_Arab"
+ ]
+ },
+ {
+ "id": "model-opus-mt-en-mul",
+ "title": "OPUS MT en-mul",
+ "url": "https://huggingface.co/Helsinki-NLP/opus-mt-en-mul",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Translation model that can route English into Pashto via multilingual set.",
+ "primary_use": "English to Pashto translation path",
+ "tasks": [
+ "mt"
+ ],
+ "tags": [
+ "pashto",
+ "mt",
+ "opus"
+ ],
+ "evidence_text": "Language list includes pus code.",
+ "evidence_url": "https://huggingface.co/Helsinki-NLP/opus-mt-en-mul",
+ "markers": [
+ "pus"
+ ]
+ },
+ {
+ "id": "model-opus-mt-mul-en",
+ "title": "OPUS MT mul-en",
+ "url": "https://huggingface.co/Helsinki-NLP/opus-mt-mul-en",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Translation model for Pashto to English via multilingual encoder.",
+ "primary_use": "Pashto to English translation path",
+ "tasks": [
+ "mt"
+ ],
+ "tags": [
+ "pashto",
+ "mt",
+ "opus"
+ ],
+ "evidence_text": "Language list includes pus code.",
+ "evidence_url": "https://huggingface.co/Helsinki-NLP/opus-mt-mul-en",
+ "markers": [
+ "pus"
+ ]
+ },
+ {
+ "id": "model-pashto-bert",
+ "title": "PashtoBERT",
+ "url": "https://huggingface.co/mdarhri/pashto-bert",
+ "category": "model",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Pashto-specific encoder model for NLP transfer tasks.",
+ "primary_use": "Pashto NLP baseline encoder",
+ "tasks": [
+ "nlp"
+ ],
+ "tags": [
+ "pashto",
+ "nlp",
+ "bert"
+ ],
+ "evidence_text": "Model card states training on Pashto corpus data.",
+ "evidence_url": "https://huggingface.co/mdarhri/pashto-bert",
+ "markers": [
+ "Pashto"
+ ]
+ },
+ {
+ "id": "benchmark-fleurs-ps-af",
+ "title": "FLEURS Pashto Benchmark",
+ "url": "https://huggingface.co/datasets/google/fleurs",
+ "category": "benchmark",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Fixed multilingual speech benchmark with Pashto subset for WER and CER.",
+ "primary_use": "ASR benchmark reporting",
+ "tasks": [
+ "asr",
+ "benchmarking"
+ ],
+ "tags": [
+ "pashto",
+ "benchmark",
+ "asr"
+ ],
+ "evidence_text": "Dataset includes ps_af split.",
+ "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
+ "markers": [
+ "ps_af"
+ ]
+ },
+ {
+ "id": "benchmark-common-voice-ps-v24",
+ "title": "Common Voice Pashto v24 Benchmark",
+ "url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+ "category": "benchmark",
+ "source": "mozilla",
+ "status": "verified",
+ "summary": "Core benchmark reference for project-level Pashto ASR tracking.",
+ "primary_use": "ASR baseline tracking",
+ "tasks": [
+ "asr",
+ "benchmarking"
+ ],
+ "tags": [
+ "pashto",
+ "benchmark",
+ "asr"
+ ],
+ "evidence_text": "Official Pashto split and versioned release.",
+ "evidence_url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+ "markers": [
+ "Pashto"
+ ]
+ },
+ {
+ "id": "benchmark-belebele-pbt-arab",
+ "title": "Belebele Pashto Benchmark",
+ "url": "https://huggingface.co/datasets/facebook/belebele",
+ "category": "benchmark",
+ "source": "huggingface",
+ "status": "verified",
+ "summary": "Comprehension benchmark for multilingual NLP with Pashto variant.",
+ "primary_use": "NLP benchmark reporting",
+ "tasks": [
+ "nlp",
+ "benchmarking"
+ ],
+ "tags": [
+ "pashto",
+ "benchmark",
+ "nlp"
+ ],
+ "evidence_text": "Includes pbt_Arab language variant.",
+ "evidence_url": "https://huggingface.co/datasets/facebook/belebele",
+ "markers": [
+ "pbt_Arab"
+ ]
+ },
+ {
+ "id": "benchmark-flores-200-pbt-arab",
+ "title": "FLORES-200 Pashto Benchmark",
+ "url": "https://github.com/facebookresearch/flores/tree/main/flores200",
+ "category": "benchmark",
+ "source": "github",
+ "status": "verified",
+ "summary": "Translation benchmark language inventory including Pashto script variant.",
+ "primary_use": "MT benchmark with BLEU and chrF",
+ "tasks": [
+ "mt",
+ "benchmarking"
+ ],
+ "tags": [
+ "pashto",
+ "benchmark",
+ "mt"
+ ],
+ "evidence_text": "Language list includes pbt_Arab.",
+ "evidence_url": "https://raw.githubusercontent.com/facebookresearch/flores/main/flores200/README.md",
+ "markers": [
+ "pbt_Arab"
+ ]
+ },
+ {
+ "id": "tool-faster-whisper",
+ "title": "Faster-Whisper",
+ "url": "https://github.com/SYSTRAN/faster-whisper",
+ "category": "tool",
+ "source": "github",
+ "status": "verified",
+ "summary": "Optimized Whisper inference runtime for faster Pashto ASR experiments.",
+ "primary_use": "ASR inference acceleration",
+ "tasks": [
+ "asr"
+ ],
+ "tags": [
+ "pashto",
+ "tooling",
+ "asr"
+ ],
+ "evidence_text": "Whisper tokenizer includes ps and tool runs Whisper models.",
+ "evidence_url": "https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py",
+ "markers": [
+ "ps"
+ ]
+ },
+ {
+ "id": "tool-coqui-tts",
+ "title": "Coqui TTS",
+ "url": "https://github.com/coqui-ai/TTS",
+ "category": "tool",
+ "source": "github",
+ "status": "verified",
+ "summary": "Open toolkit for TTS training and inference used for Pashto experiments.",
+ "primary_use": "TTS training and inference",
+ "tasks": [
+ "tts"
+ ],
+ "tags": [
+ "pashto",
+ "tooling",
+ "tts"
+ ],
+ "evidence_text": "Can be paired with Pashto-supporting MMS checkpoints.",
+ "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+ "markers": [
+ "pus"
+ ]
+ },
+ {
+ "id": "paper-whisper-2212-04356",
+ "title": "Robust Speech Recognition via Large-Scale Weak Supervision",
+ "url": "https://arxiv.org/abs/2212.04356",
+ "category": "paper",
+ "source": "arxiv",
+ "status": "verified",
+ "summary": "Whisper paper used as a foundational ASR reference for Pashto baselines.",
+ "primary_use": "ASR methodology reference",
+ "tasks": [
+ "asr",
+ "research"
+ ],
+ "tags": [
+ "pashto",
+ "paper",
+ "asr"
+ ],
+ "evidence_text": "Paired with tokenizer language map containing ps.",
+ "evidence_url": "https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py",
+ "markers": [
+ "ps"
+ ]
+ },
+ {
+ "id": "paper-mms-2305-13516",
+ "title": "Scaling Speech Technology to 1,000+ Languages",
+ "url": "https://arxiv.org/abs/2305.13516",
+ "category": "paper",
+ "source": "arxiv",
+ "status": "verified",
+ "summary": "MMS paper covering multilingual speech scaling and low-resource transfer.",
+ "primary_use": "ASR and TTS transfer reference",
+ "tasks": [
+ "asr",
+ "tts",
+ "research"
+ ],
+ "tags": [
+ "pashto",
+ "paper",
+ "speech"
+ ],
+ "evidence_text": "Coverage table marks pus support in MMS release.",
+ "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+ "markers": [
+ "pus"
+ ]
+ },
+ {
+ "id": "paper-nllb-2207-04672",
+ "title": "No Language Left Behind",
+ "url": "https://arxiv.org/abs/2207.04672",
+ "category": "paper",
+ "source": "arxiv",
+ "status": "verified",
+ "summary": "NLLB paper supporting multilingual MT strategy for Pashto integration.",
+ "primary_use": "MT research reference",
+ "tasks": [
+ "mt",
+ "research"
+ ],
+ "tags": [
+ "pashto",
+ "paper",
+ "mt"
+ ],
+ "evidence_text": "Model usage in repo references pbt_Arab token support.",
+ "evidence_url": "https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json",
+ "markers": [
+ "pbt_Arab"
+ ]
+ },
+ {
+ "id": "paper-fleurs-2205-12446",
+ "title": "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech",
+ "url": "https://arxiv.org/abs/2205.12446",
+ "category": "paper",
+ "source": "arxiv",
+ "status": "verified",
+ "summary": "FLEURS benchmark paper supporting multilingual speech evaluation including Pashto.",
+ "primary_use": "Speech benchmark methodology reference",
+ "tasks": [
+ "asr",
+ "benchmarking",
+ "research"
+ ],
+ "tags": [
+ "pashto",
+ "paper",
+ "benchmark"
+ ],
+ "evidence_text": "Dataset implementation includes ps_af language code.",
+ "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
+ "markers": [
+ "ps_af"
+ ]
+ }
+ ]
+ }
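The search page fetches this file and assumes a few invariants (top-level `count` agrees with the `resources` array, every entry carries the keys the renderer reads, ids are unique). As a minimal sketch, those invariants can be checked like this; the function name and the exact rule set are illustrative, not the repository's actual checks:

```python
def check_search_payload(payload):
    """Return a list of consistency problems in a search payload dict."""
    errors = []
    resources = payload.get("resources")
    if not isinstance(resources, list):
        return ["'resources' must be a list"]
    # The UI displays payload["count"]; it should agree with the array.
    if payload.get("count") != len(resources):
        errors.append("'count' does not match len(resources)")
    seen = set()
    for r in resources:
        rid = r.get("id", "<missing id>")
        for key in ("id", "title", "url", "category", "source", "status"):
            if key not in r:
                errors.append(f"{rid}: missing '{key}'")
        if rid in seen:
            errors.append(f"{rid}: duplicate id")
        seen.add(rid)
    return errors
```

An empty return value means the payload is safe for the filter/facet code above.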
resources/README.md CHANGED
@@ -1,14 +1,23 @@
- # 📚 Resources

  Structured, Pashto-focused resource tracking lives in this folder.

  ## Sections
- - Datasets: [datasets/README.md](datasets/README.md)
- - Models: [models/README.md](models/README.md)
- - Benchmarks: [benchmarks/README.md](benchmarks/README.md)
- - Tools and applications: [tools/README.md](tools/README.md)

  ## Update Rule
  - Add only validated resources with explicit Pashto relevance.
  - Keep every external reference clickable using markdown links.
- - Mirror high-level updates in [../docs/resource_catalog.md](../docs/resource_catalog.md).

+ # Resources

  Structured, Pashto-focused resource tracking lives in this folder.

  ## Sections
+ - Datasets (8): [datasets/README.md](datasets/README.md)
+ - Models (7): [models/README.md](models/README.md)
+ - Benchmarks (4): [benchmarks/README.md](benchmarks/README.md)
+ - Tools (2): [tools/README.md](tools/README.md)
+ - Papers (4): [papers/README.md](papers/README.md)
+
+ ## Machine-Readable Catalog
+ - Canonical catalog: [catalog/resources.json](catalog/resources.json)
+ - Candidate feed: [catalog/pending_candidates.json](catalog/pending_candidates.json)
+ - Schema: [schema/resource.schema.json](schema/resource.schema.json)

  ## Update Rule
  - Add only validated resources with explicit Pashto relevance.
  - Keep every external reference clickable using markdown links.
+ - Run `python scripts/validate_resource_catalog.py` before opening a PR.
+ - Run `python scripts/generate_resource_views.py` after catalog changes.
+
+ Verified resource count: `25`
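The per-section counts and the trailing `Verified resource count` line in this README now duplicate information in `catalog/resources.json`, so they can be cross-checked mechanically. A small sketch (the helper name is hypothetical; the real generator is `scripts/generate_resource_views.py`, whose code is not in this diff):

```python
from collections import Counter

def category_counts(resources):
    """Tally catalog entries per category so the README section counts
    (Datasets, Models, Benchmarks, Tools, Papers) can be verified."""
    return Counter(r["category"] for r in resources)
```

For the committed catalog this should yield dataset=8, model=7, benchmark=4, tool=2, paper=4, summing to the advertised 25.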
resources/benchmarks/README.md CHANGED
@@ -1,14 +1,15 @@
- # 🧪 Benchmarks

- ## Recommended Benchmarks

- | Benchmark | Link | Metrics |
- |---|---|---|
- | FLEURS (Pashto subset) | [Hugging Face - google/fleurs](https://huggingface.co/datasets/google/fleurs) | WER, CER |
- | Common Voice Pashto v24 | [Mozilla Data Collective](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | WER, CER |
- | Belebele (`pbt_Arab`) | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Accuracy, F1 |
- | FLORES-200 (`pbt_Arab`) | [FLORES language list](https://github.com/facebookresearch/flores/tree/main/flores200) | BLEU, chrF, COMET |

- ## Integration Paths
- - Benchmark workspace: [../../benchmarks/README.md](../../benchmarks/README.md)
- - Resource index: [../../docs/resource_catalog.md](../../docs/resource_catalog.md)

+ # Benchmarks

+ ## Verified Pashto Resources

+ | Resource | Link | Pashto Evidence | Primary Use |
+ |---|---|---|---|
+ | Belebele Pashto Benchmark | [huggingface](https://huggingface.co/datasets/facebook/belebele) | [Includes pbt_Arab language variant. (`pbt_Arab`)](https://huggingface.co/datasets/facebook/belebele) | NLP benchmark reporting |
+ | Common Voice Pashto v24 Benchmark | [mozilla](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | [Official Pashto split and versioned release. (`Pashto`)](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | ASR baseline tracking |
+ | FLEURS Pashto Benchmark | [huggingface](https://huggingface.co/datasets/google/fleurs) | [Dataset includes ps_af split. (`ps_af`)](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py) | ASR benchmark reporting |
+ | FLORES-200 Pashto Benchmark | [github](https://github.com/facebookresearch/flores/tree/main/flores200) | [Language list includes pbt_Arab. (`pbt_Arab`)](https://raw.githubusercontent.com/facebookresearch/flores/main/flores200/README.md) | MT benchmark with BLEU and chrF |

+ ## Maintenance
+ - Source of truth: [../catalog/resources.json](../catalog/resources.json)
+ - Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)
+ - Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)
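Each row of the table above follows a fixed shape that a generator can emit from one catalog entry: title, a `[source](url)` link, an evidence cell linking `evidence_text` plus backticked markers to `evidence_url`, and the primary use. A sketch of that rendering (the function name is hypothetical; the real code lives in `scripts/generate_resource_views.py`, not shown in this diff):

```python
def table_row(entry):
    """Render one markdown table row in the shape used by the
    generated resources/*/README.md tables."""
    markers = ", ".join(f"`{m}`" for m in entry.get("markers", []))
    evidence = f"[{entry['evidence_text']} ({markers})]({entry['evidence_url']})"
    return (f"| {entry['title']} | [{entry['source']}]({entry['url']}) | "
            f"{evidence} | {entry['primary_use']} |")
```

Keeping the row format in one place is what makes `git diff --exit-code` in CI a meaningful check: regenerating the views must be byte-identical to what was committed.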
resources/catalog/README.md ADDED
@@ -0,0 +1,14 @@
+ # Resource Catalog
+
+ This folder holds machine-readable resource data used by docs and GitHub Pages search.
+
+ ## Files
+ - `resources.json`: canonical Pashto resource catalog (source of truth).
+ - `pending_candidates.json`: automation output for candidate resources requiring review.
+ - `resource.template.json`: starter template for adding a new resource entry.
+
+ ## Required workflow
+ 1. Update `resources.json`.
+ 2. Run `python scripts/validate_resource_catalog.py`.
+ 3. Run `python scripts/generate_resource_views.py`.
+ 4. Commit both catalog and generated markdown/search files.
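The validation step in the workflow above can be pictured as a set of per-entry rules: required keys present, a known category, at least one explicit Pashto marker, and unique ids. The following is a sketch only, assuming those rules; the authoritative checks are whatever `scripts/validate_resource_catalog.py` (not shown in this diff) actually implements:

```python
import json

# Assumed rule set for illustration; the real validator may differ.
REQUIRED_KEYS = ("id", "title", "url", "category", "source", "status",
                 "summary", "primary_use", "markers")
ALLOWED_CATEGORIES = {"dataset", "model", "benchmark", "tool", "paper"}

def validate_catalog(text):
    """Parse a catalog JSON string and return a list of rule violations."""
    catalog = json.loads(text)
    errors = []
    seen = set()
    for entry in catalog.get("resources", []):
        eid = entry.get("id", "<missing id>")
        for key in REQUIRED_KEYS:
            if key not in entry:
                errors.append(f"{eid}: missing '{key}'")
        if entry.get("category") not in ALLOWED_CATEGORIES:
            errors.append(f"{eid}: unknown category {entry.get('category')!r}")
        if not entry.get("markers"):
            errors.append(f"{eid}: no explicit Pashto marker")
        if eid in seen:
            errors.append(f"{eid}: duplicate id")
        seen.add(eid)
    return errors
```

Run from CI, a non-empty error list would fail the "Validate resource catalog" step before the view generation runs.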
resources/catalog/pending_candidates.json ADDED
@@ -0,0 +1,474 @@
+ {
+ "generated_on": "2026-02-15T09:45:32.641403+00:00",
+ "sources": [
+ "huggingface-datasets",
+ "huggingface-models"
+ ],
+ "candidate_count": 20,
+ "candidates": [
+ {
+ "id": "candidate-hf-dataset-aamirhs-pashto",
+ "title": "aamirhs/pashto",
+ "url": "https://huggingface.co/datasets/aamirhs/pashto",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/datasets/aamirhs/pashto",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "dataset"
+ ]
+ },
+ {
+ "id": "candidate-hf-dataset-aamirhs-pashto-audio-wav2vec",
+ "title": "aamirhs/pashto-audio-wav2vec",
+ "url": "https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/datasets/aamirhs/pashto-audio-wav2vec",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "dataset"
+ ]
+ },
+ {
+ "id": "candidate-hf-dataset-aamirhs-pashto-test-1",
+ "title": "aamirhs/pashto_test_1",
+ "url": "https://huggingface.co/datasets/aamirhs/pashto_test_1",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/datasets/aamirhs/pashto_test_1",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "dataset"
+ ]
+ },
+ {
+ "id": "candidate-hf-dataset-arsalagrey-pashto",
+ "title": "arsalagrey/pashto",
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "dataset"
+ ]
+ },
+ {
+ "id": "candidate-hf-dataset-arsalagrey-pashto-books",
+ "title": "arsalagrey/pashto-books",
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto-books",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto-books",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "dataset"
+ ]
+ },
+ {
+ "id": "candidate-hf-dataset-arsalagrey-pashto-books-json",
+ "title": "arsalagrey/pashto-books-json",
+ "url": "https://huggingface.co/datasets/arsalagrey/pashto-books-json",
+ "category": "dataset",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/datasets/arsalagrey/pashto-books-json",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "dataset"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-wav2vec2-xls-r-300m-pashto",
+ "title": "ihanif/wav2vec2-xls-r-300m-pashto",
+ "url": "https://huggingface.co/ihanif/wav2vec2-xls-r-300m-pashto",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/ihanif/wav2vec2-xls-r-300m-pashto",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "model"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-wav2vec2-xls-r-300m-pashto-lm",
+ "title": "ihanif/wav2vec2-xls-r-300m-pashto-lm",
+ "url": "https://huggingface.co/ihanif/wav2vec2-xls-r-300m-pashto-lm",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/ihanif/wav2vec2-xls-r-300m-pashto-lm",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "model"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-whisper-base-pashto",
+ "title": "ihanif/whisper-base-pashto",
+ "url": "https://huggingface.co/ihanif/whisper-base-pashto",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/ihanif/whisper-base-pashto",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "model"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-whisper-large-pashto",
+ "title": "ihanif/whisper-large-pashto",
+ "url": "https://huggingface.co/ihanif/whisper-large-pashto",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/ihanif/whisper-large-pashto",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "model"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-whisper-medium-pashto",
+ "title": "ihanif/whisper-medium-pashto",
+ "url": "https://huggingface.co/ihanif/whisper-medium-pashto",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/ihanif/whisper-medium-pashto",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "model"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-whisper-medium-pashto-3e-7",
+ "title": "ihanif/whisper-medium-pashto-3e-7",
+ "url": "https://huggingface.co/ihanif/whisper-medium-pashto-3e-7",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
+ "tasks": [],
+ "pashto_evidence": {
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
+ "evidence_url": "https://huggingface.co/ihanif/whisper-medium-pashto-3e-7",
+ "markers": [
+ "pashto"
+ ]
+ },
+ "tags": [
+ "pashto",
+ "candidate",
+ "model"
+ ]
+ },
+ {
+ "id": "candidate-hf-model-ihanif-whisper-small-pashto",
+ "title": "ihanif/whisper-small-pashto",
+ "url": "https://huggingface.co/ihanif/whisper-small-pashto",
+ "category": "model",
+ "source": "huggingface",
+ "status": "candidate",
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
293
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
294
+ "tasks": [],
295
+ "pashto_evidence": {
296
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
297
+ "evidence_url": "https://huggingface.co/ihanif/whisper-small-pashto",
298
+ "markers": [
299
+ "pashto"
300
+ ]
301
+ },
302
+ "tags": [
303
+ "pashto",
304
+ "candidate",
305
+ "model"
306
+ ]
307
+ },
308
+ {
309
+ "id": "candidate-hf-model-ihanif-whisper-small-pashto-dropout",
310
+ "title": "ihanif/whisper-small-pashto-dropout",
311
+ "url": "https://huggingface.co/ihanif/whisper-small-pashto-dropout",
312
+ "category": "model",
313
+ "source": "huggingface",
314
+ "status": "candidate",
315
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
316
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
317
+ "tasks": [],
318
+ "pashto_evidence": {
319
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
320
+ "evidence_url": "https://huggingface.co/ihanif/whisper-small-pashto-dropout",
321
+ "markers": [
322
+ "pashto"
323
+ ]
324
+ },
325
+ "tags": [
326
+ "pashto",
327
+ "candidate",
328
+ "model"
329
+ ]
330
+ },
331
+ {
332
+ "id": "candidate-hf-model-ihanif-xls-r-1b-pashto",
333
+ "title": "ihanif/xls-r-1b-pashto",
334
+ "url": "https://huggingface.co/ihanif/xls-r-1b-pashto",
335
+ "category": "model",
336
+ "source": "huggingface",
337
+ "status": "candidate",
338
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
339
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
340
+ "tasks": [],
341
+ "pashto_evidence": {
342
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
343
+ "evidence_url": "https://huggingface.co/ihanif/xls-r-1b-pashto",
344
+ "markers": [
345
+ "pashto"
346
+ ]
347
+ },
348
+ "tags": [
349
+ "pashto",
350
+ "candidate",
351
+ "model"
352
+ ]
353
+ },
354
+ {
355
+ "id": "candidate-hf-dataset-koochikoo25-pashto-concatenated",
356
+ "title": "koochikoo25/Pashto-Concatenated",
357
+ "url": "https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated",
358
+ "category": "dataset",
359
+ "source": "huggingface",
360
+ "status": "candidate",
361
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
362
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
363
+ "tasks": [],
364
+ "pashto_evidence": {
365
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
366
+ "evidence_url": "https://huggingface.co/datasets/koochikoo25/Pashto-Concatenated",
367
+ "markers": [
368
+ "pashto"
369
+ ]
370
+ },
371
+ "tags": [
372
+ "pashto",
373
+ "candidate",
374
+ "dataset"
375
+ ]
376
+ },
377
+ {
378
+ "id": "candidate-hf-dataset-nexdata-99-hours-pashto-spontaneous-dialogue-smartphone-speech-dataset",
379
+ "title": "Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset",
380
+ "url": "https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset",
381
+ "category": "dataset",
382
+ "source": "huggingface",
383
+ "status": "candidate",
384
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
385
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
386
+ "tasks": [],
387
+ "pashto_evidence": {
388
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
389
+ "evidence_url": "https://huggingface.co/datasets/Nexdata/99_Hours_Pashto_Spontaneous_Dialogue_Smartphone_speech_dataset",
390
+ "markers": [
391
+ "pashto"
392
+ ]
393
+ },
394
+ "tags": [
395
+ "pashto",
396
+ "candidate",
397
+ "dataset"
398
+ ]
399
+ },
400
+ {
401
+ "id": "candidate-hf-dataset-saillab-alpaca-pashto-taco",
402
+ "title": "saillab/alpaca_pashto_taco",
403
+ "url": "https://huggingface.co/datasets/saillab/alpaca_pashto_taco",
404
+ "category": "dataset",
405
+ "source": "huggingface",
406
+ "status": "candidate",
407
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
408
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
409
+ "tasks": [],
410
+ "pashto_evidence": {
411
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
412
+ "evidence_url": "https://huggingface.co/datasets/saillab/alpaca_pashto_taco",
413
+ "markers": [
414
+ "pashto"
415
+ ]
416
+ },
417
+ "tags": [
418
+ "pashto",
419
+ "candidate",
420
+ "dataset"
421
+ ]
422
+ },
423
+ {
424
+ "id": "candidate-hf-model-zirak-ai-pashto-bert-v1",
425
+ "title": "zirak-ai/pashto-bert-v1",
426
+ "url": "https://huggingface.co/zirak-ai/pashto-bert-v1",
427
+ "category": "model",
428
+ "source": "huggingface",
429
+ "status": "candidate",
430
+ "summary": "Candidate model returned from Hugging Face search for Pashto.",
431
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
432
+ "tasks": [],
433
+ "pashto_evidence": {
434
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
435
+ "evidence_url": "https://huggingface.co/zirak-ai/pashto-bert-v1",
436
+ "markers": [
437
+ "pashto"
438
+ ]
439
+ },
440
+ "tags": [
441
+ "pashto",
442
+ "candidate",
443
+ "model"
444
+ ]
445
+ },
446
+ {
447
+ "id": "candidate-hf-dataset-zirak-ai-pashtoocr",
448
+ "title": "zirak-ai/PashtoOCR",
449
+ "url": "https://huggingface.co/datasets/zirak-ai/PashtoOCR",
450
+ "category": "dataset",
451
+ "source": "huggingface",
452
+ "status": "candidate",
453
+ "summary": "Candidate dataset returned from Hugging Face search for Pashto.",
454
+ "primary_use": "Needs maintainer review before promotion to verified catalog.",
455
+ "tasks": [],
456
+ "pashto_evidence": {
457
+ "evidence_text": "Matched by Pashto keyword in Hugging Face search results.",
458
+ "evidence_url": "https://huggingface.co/datasets/zirak-ai/PashtoOCR",
459
+ "markers": [
460
+ "pashto"
461
+ ]
462
+ },
463
+ "tags": [
464
+ "pashto",
465
+ "candidate",
466
+ "dataset"
467
+ ]
468
+ }
469
+ ],
470
+ "errors": [
471
+ "arxiv: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1007)>",
472
+ "semantic-scholar: HTTP Error 429: "
473
+ ]
474
+ }
resources/catalog/resource.template.json ADDED
@@ -0,0 +1,25 @@
+{
+  "id": "example-resource-id",
+  "title": "Example Resource Title",
+  "url": "https://example.org/resource",
+  "category": "dataset",
+  "source": "other",
+  "status": "verified",
+  "summary": "One-line summary explaining why this resource matters for Pashto in technology.",
+  "primary_use": "ASR baseline",
+  "license": "Unknown",
+  "tasks": [
+    "asr"
+  ],
+  "pashto_evidence": {
+    "evidence_text": "Resource page explicitly lists Pashto support.",
+    "evidence_url": "https://example.org/resource",
+    "markers": [
+      "Pashto"
+    ]
+  },
+  "tags": [
+    "pashto",
+    "speech"
+  ]
+}
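The template above fixes the fields every catalog entry must carry, including the non-empty `pashto_evidence.markers` list. A minimal validation sketch is shown below; this is only an illustration of the schema, not the repository's actual check, which lives in `scripts/validate_resource_catalog.py` and is not reproduced in this commit view:

```python
REQUIRED_FIELDS = {
    "id", "title", "url", "category", "source", "status",
    "summary", "primary_use", "tasks", "pashto_evidence", "tags",
}
EVIDENCE_FIELDS = {"evidence_text", "evidence_url", "markers"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of human-readable problems found in one catalog entry."""
    # Top-level fields required by resource.template.json.
    problems = sorted(f"missing field: {f}" for f in REQUIRED_FIELDS - entry.keys())
    # Nested pashto_evidence object must carry explicit, verifiable markers.
    evidence = entry.get("pashto_evidence") or {}
    problems += sorted(f"missing evidence field: {f}" for f in EVIDENCE_FIELDS - evidence.keys())
    if not evidence.get("markers"):
        problems.append("pashto_evidence.markers must not be empty")
    if not str(entry.get("url", "")).startswith("https://"):
        problems.append("url must use https")
    return problems


# An entry shaped like the template passes cleanly.
template_entry = {
    "id": "example-resource-id",
    "title": "Example Resource Title",
    "url": "https://example.org/resource",
    "category": "dataset",
    "source": "other",
    "status": "verified",
    "summary": "One-line summary.",
    "primary_use": "ASR baseline",
    "tasks": ["asr"],
    "pashto_evidence": {
        "evidence_text": "Resource page explicitly lists Pashto support.",
        "evidence_url": "https://example.org/resource",
        "markers": ["Pashto"],
    },
    "tags": ["pashto", "speech"],
}
print(validate_entry(template_entry))  # → []
print(validate_entry({"id": "x"}))     # flags every missing field
```

Keeping the check as a pure function over plain dicts means the same logic can run in CI and in an editor, with no dependency beyond the standard library.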
resources/catalog/resources.json ADDED
@@ -0,0 +1,645 @@
+{
+  "version": "1.0.0",
+  "updated_on": "2026-02-15",
+  "resources": [
+    {
+      "id": "dataset-common-voice-ps-v24",
+      "title": "Common Voice Scripted Speech 24.0 - Pashto",
+      "url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+      "category": "dataset",
+      "source": "mozilla",
+      "status": "verified",
+      "summary": "Large open Pashto speech dataset for ASR training and evaluation.",
+      "primary_use": "ASR training and evaluation",
+      "tasks": [
+        "asr"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Official dataset page is for Pashto.",
+        "evidence_url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+        "markers": [
+          "Pashto"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "speech",
+        "asr"
+      ]
+    },
+    {
+      "id": "dataset-google-fleurs",
+      "title": "Google FLEURS",
+      "url": "https://huggingface.co/datasets/google/fleurs",
+      "category": "dataset",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Standard multilingual speech benchmark dataset with Pashto subset.",
+      "primary_use": "Speech benchmark and external evaluation",
+      "tasks": [
+        "asr",
+        "benchmarking"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset config includes ps_af.",
+        "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
+        "markers": [
+          "ps_af"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "speech",
+        "benchmark"
+      ]
+    },
+    {
+      "id": "dataset-oscar-ps",
+      "title": "OSCAR Corpus",
+      "url": "https://huggingface.co/datasets/oscar-corpus/oscar",
+      "category": "dataset",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Large web text corpus that includes Pashto text split.",
+      "primary_use": "Language modeling and lexicon expansion",
+      "tasks": [
+        "nlp"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset includes unshuffled_deduplicated_ps split.",
+        "evidence_url": "https://huggingface.co/datasets/oscar-corpus/oscar",
+        "markers": [
+          "unshuffled_deduplicated_ps"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "text",
+        "nlp"
+      ]
+    },
+    {
+      "id": "dataset-wikipedia-ps",
+      "title": "Wikimedia Wikipedia",
+      "url": "https://huggingface.co/datasets/wikimedia/wikipedia",
+      "category": "dataset",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Wikipedia corpus with Pashto edition for cleaner text resources.",
+      "primary_use": "Terminology and balanced text corpus",
+      "tasks": [
+        "nlp"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset includes 20231101.ps subset.",
+        "evidence_url": "https://huggingface.co/datasets/wikimedia/wikipedia",
+        "markers": [
+          "20231101.ps"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "text",
+        "nlp"
+      ]
+    },
+    {
+      "id": "dataset-belebele-pbt-arab",
+      "title": "Belebele",
+      "url": "https://huggingface.co/datasets/facebook/belebele",
+      "category": "dataset",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Reading comprehension dataset with Pashto script subset.",
+      "primary_use": "Comprehension and multilingual NLP benchmark",
+      "tasks": [
+        "nlp",
+        "benchmarking"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset includes pbt_Arab subset.",
+        "evidence_url": "https://huggingface.co/datasets/facebook/belebele",
+        "markers": [
+          "pbt_Arab"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "nlp",
+        "benchmark"
+      ]
+    },
+    {
+      "id": "dataset-opus100-en-ps",
+      "title": "OPUS-100",
+      "url": "https://huggingface.co/datasets/Helsinki-NLP/opus-100",
+      "category": "dataset",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Parallel corpus with English to Pashto split for MT tasks.",
+      "primary_use": "Machine translation training and evaluation",
+      "tasks": [
+        "mt",
+        "nlp"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset viewer includes en-ps split.",
+        "evidence_url": "https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps",
+        "markers": [
+          "en-ps"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "mt",
+        "parallel-corpus"
+      ]
+    },
+    {
+      "id": "dataset-kaggle-pashto-isolated-words",
+      "title": "Pashto Isolated Words Speech Dataset",
+      "url": "https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset",
+      "category": "dataset",
+      "source": "kaggle",
+      "status": "verified",
+      "summary": "Speech dataset focused on isolated Pashto words.",
+      "primary_use": "Keyword spotting and constrained ASR experiments",
+      "tasks": [
+        "asr"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset title explicitly states Pashto speech dataset.",
+        "evidence_url": "https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset",
+        "markers": [
+          "Pashto"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "speech",
+        "kaggle"
+      ]
+    },
+    {
+      "id": "dataset-kaggle-pashto-word-embeddings",
+      "title": "Pashto Word Embeddings",
+      "url": "https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings",
+      "category": "dataset",
+      "source": "kaggle",
+      "status": "verified",
+      "summary": "Pretrained Pashto word vectors for classic NLP baselines.",
+      "primary_use": "Lexical semantics and lightweight NLP baselines",
+      "tasks": [
+        "nlp"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset description states pretrained Pashto embeddings.",
+        "evidence_url": "https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings",
+        "markers": [
+          "Pashto"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "nlp",
+        "embeddings",
+        "kaggle"
+      ]
+    },
+    {
+      "id": "model-whisper-large-v3",
+      "title": "Whisper Large v3",
+      "url": "https://huggingface.co/openai/whisper-large-v3",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Strong multilingual ASR baseline suitable for Pashto bootstrapping.",
+      "primary_use": "ASR baseline and pseudo-labeling",
+      "tasks": [
+        "asr"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Whisper tokenizer map includes ps language key.",
+        "evidence_url": "https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py",
+        "markers": [
+          "ps"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "asr",
+        "whisper"
+      ]
+    },
+    {
+      "id": "model-mms-1b-all",
+      "title": "MMS 1B All",
+      "url": "https://huggingface.co/facebook/mms-1b-all",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Multilingual ASR model from MMS for low-resource transfer.",
+      "primary_use": "ASR transfer baseline",
+      "tasks": [
+        "asr"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "MMS coverage table includes pus with ASR support.",
+        "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+        "markers": [
+          "pus"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "asr",
+        "mms"
+      ]
+    },
+    {
+      "id": "model-mms-tts",
+      "title": "MMS TTS",
+      "url": "https://huggingface.co/facebook/mms-tts",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Multilingual TTS checkpoints useful for Pashto voice synthesis.",
+      "primary_use": "TTS baseline and transfer",
+      "tasks": [
+        "tts"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "MMS coverage table includes pus with TTS support.",
+        "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+        "markers": [
+          "pus"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "tts",
+        "mms"
+      ]
+    },
+    {
+      "id": "model-nllb-200-distilled-600m",
+      "title": "NLLB-200 Distilled 600M",
+      "url": "https://huggingface.co/facebook/nllb-200-distilled-600M",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "General multilingual translation model with Pashto script token support.",
+      "primary_use": "Pashto translation baseline",
+      "tasks": [
+        "mt"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Model special token map includes pbt_Arab.",
+        "evidence_url": "https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json",
+        "markers": [
+          "pbt_Arab"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "mt",
+        "nllb"
+      ]
+    },
+    {
+      "id": "model-opus-mt-en-mul",
+      "title": "OPUS MT en-mul",
+      "url": "https://huggingface.co/Helsinki-NLP/opus-mt-en-mul",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Translation model that can route English into Pashto via multilingual set.",
+      "primary_use": "English to Pashto translation path",
+      "tasks": [
+        "mt"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Language list includes pus code.",
+        "evidence_url": "https://huggingface.co/Helsinki-NLP/opus-mt-en-mul",
+        "markers": [
+          "pus"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "mt",
+        "opus"
+      ]
+    },
+    {
+      "id": "model-opus-mt-mul-en",
+      "title": "OPUS MT mul-en",
+      "url": "https://huggingface.co/Helsinki-NLP/opus-mt-mul-en",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Translation model for Pashto to English via multilingual encoder.",
+      "primary_use": "Pashto to English translation path",
+      "tasks": [
+        "mt"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Language list includes pus code.",
+        "evidence_url": "https://huggingface.co/Helsinki-NLP/opus-mt-mul-en",
+        "markers": [
+          "pus"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "mt",
+        "opus"
+      ]
+    },
+    {
+      "id": "model-pashto-bert",
+      "title": "PashtoBERT",
+      "url": "https://huggingface.co/mdarhri/pashto-bert",
+      "category": "model",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Pashto-specific encoder model for NLP transfer tasks.",
+      "primary_use": "Pashto NLP baseline encoder",
+      "tasks": [
+        "nlp"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Model card states training on Pashto corpus data.",
+        "evidence_url": "https://huggingface.co/mdarhri/pashto-bert",
+        "markers": [
+          "Pashto"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "nlp",
+        "bert"
+      ]
+    },
+    {
+      "id": "benchmark-fleurs-ps-af",
+      "title": "FLEURS Pashto Benchmark",
+      "url": "https://huggingface.co/datasets/google/fleurs",
+      "category": "benchmark",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Fixed multilingual speech benchmark with Pashto subset for WER and CER.",
+      "primary_use": "ASR benchmark reporting",
+      "tasks": [
+        "asr",
+        "benchmarking"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset includes ps_af split.",
+        "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
+        "markers": [
+          "ps_af"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "benchmark",
+        "asr"
+      ]
+    },
+    {
+      "id": "benchmark-common-voice-ps-v24",
+      "title": "Common Voice Pashto v24 Benchmark",
+      "url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+      "category": "benchmark",
+      "source": "mozilla",
+      "status": "verified",
+      "summary": "Core benchmark reference for project-level Pashto ASR tracking.",
+      "primary_use": "ASR baseline tracking",
+      "tasks": [
+        "asr",
+        "benchmarking"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Official Pashto split and versioned release.",
+        "evidence_url": "https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14",
+        "markers": [
+          "Pashto"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "benchmark",
+        "asr"
+      ]
+    },
+    {
+      "id": "benchmark-belebele-pbt-arab",
+      "title": "Belebele Pashto Benchmark",
+      "url": "https://huggingface.co/datasets/facebook/belebele",
+      "category": "benchmark",
+      "source": "huggingface",
+      "status": "verified",
+      "summary": "Comprehension benchmark for multilingual NLP with Pashto variant.",
+      "primary_use": "NLP benchmark reporting",
+      "tasks": [
+        "nlp",
+        "benchmarking"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Includes pbt_Arab language variant.",
+        "evidence_url": "https://huggingface.co/datasets/facebook/belebele",
+        "markers": [
+          "pbt_Arab"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "benchmark",
+        "nlp"
+      ]
+    },
+    {
+      "id": "benchmark-flores-200-pbt-arab",
+      "title": "FLORES-200 Pashto Benchmark",
+      "url": "https://github.com/facebookresearch/flores/tree/main/flores200",
+      "category": "benchmark",
+      "source": "github",
+      "status": "verified",
+      "summary": "Translation benchmark language inventory including Pashto script variant.",
+      "primary_use": "MT benchmark with BLEU and chrF",
+      "tasks": [
+        "mt",
+        "benchmarking"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Language list includes pbt_Arab.",
+        "evidence_url": "https://raw.githubusercontent.com/facebookresearch/flores/main/flores200/README.md",
+        "markers": [
+          "pbt_Arab"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "benchmark",
+        "mt"
+      ]
+    },
+    {
+      "id": "tool-faster-whisper",
+      "title": "Faster-Whisper",
+      "url": "https://github.com/SYSTRAN/faster-whisper",
+      "category": "tool",
+      "source": "github",
+      "status": "verified",
+      "summary": "Optimized Whisper inference runtime for faster Pashto ASR experiments.",
+      "primary_use": "ASR inference acceleration",
+      "tasks": [
+        "asr"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Whisper tokenizer includes ps and tool runs Whisper models.",
+        "evidence_url": "https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py",
+        "markers": [
+          "ps"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "tooling",
+        "asr"
+      ]
+    },
+    {
+      "id": "tool-coqui-tts",
+      "title": "Coqui TTS",
+      "url": "https://github.com/coqui-ai/TTS",
+      "category": "tool",
+      "source": "github",
+      "status": "verified",
+      "summary": "Open toolkit for TTS training and inference used for Pashto experiments.",
+      "primary_use": "TTS training and inference",
+      "tasks": [
+        "tts"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Can be paired with Pashto-supporting MMS checkpoints.",
+        "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+        "markers": [
+          "pus"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "tooling",
+        "tts"
+      ]
+    },
+    {
+      "id": "paper-whisper-2212-04356",
+      "title": "Robust Speech Recognition via Large-Scale Weak Supervision",
+      "url": "https://arxiv.org/abs/2212.04356",
+      "category": "paper",
+      "source": "arxiv",
+      "status": "verified",
+      "summary": "Whisper paper used as a foundational ASR reference for Pashto baselines.",
+      "primary_use": "ASR methodology reference",
+      "tasks": [
+        "asr",
+        "research"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Paired with tokenizer language map containing ps.",
+        "evidence_url": "https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py",
+        "markers": [
+          "ps"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "paper",
+        "asr"
+      ]
+    },
+    {
+      "id": "paper-mms-2305-13516",
+      "title": "Scaling Speech Technology to 1,000+ Languages",
+      "url": "https://arxiv.org/abs/2305.13516",
+      "category": "paper",
+      "source": "arxiv",
+      "status": "verified",
+      "summary": "MMS paper covering multilingual speech scaling and low-resource transfer.",
+      "primary_use": "ASR and TTS transfer reference",
+      "tasks": [
+        "asr",
+        "tts",
+        "research"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Coverage table marks pus support in MMS release.",
+        "evidence_url": "https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html",
+        "markers": [
+          "pus"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "paper",
+        "speech"
+      ]
+    },
+    {
+      "id": "paper-nllb-2207-04672",
+      "title": "No Language Left Behind",
+      "url": "https://arxiv.org/abs/2207.04672",
+      "category": "paper",
+      "source": "arxiv",
+      "status": "verified",
+      "summary": "NLLB paper supporting multilingual MT strategy for Pashto integration.",
+      "primary_use": "MT research reference",
+      "tasks": [
+        "mt",
+        "research"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Model usage in repo references pbt_Arab token support.",
+        "evidence_url": "https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json",
+        "markers": [
+          "pbt_Arab"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "paper",
+        "mt"
+      ]
+    },
+    {
+      "id": "paper-fleurs-2205-12446",
+      "title": "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech",
+      "url": "https://arxiv.org/abs/2205.12446",
+      "category": "paper",
+      "source": "arxiv",
+      "status": "verified",
+      "summary": "FLEURS benchmark paper supporting multilingual speech evaluation including Pashto.",
+      "primary_use": "Speech benchmark methodology reference",
+      "tasks": [
+        "asr",
+        "benchmarking",
+        "research"
+      ],
+      "pashto_evidence": {
+        "evidence_text": "Dataset implementation includes ps_af language code.",
+        "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
+        "markers": [
+          "ps_af"
+        ]
+      },
+      "tags": [
+        "pashto",
+        "paper",
+        "benchmark"
+      ]
+    }
+  ]
+}
resources/datasets/README.md CHANGED
@@ -1,18 +1,19 @@
-# 🗂️ Datasets
+# Datasets
 
-## Pashto-Related Datasets
+## Verified Pashto Resources
 
 | Resource | Link | Pashto Evidence | Primary Use |
 |---|---|---|---|
-| Common Voice Scripted Speech 24.0 - Pashto | [Mozilla Data Collective](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | Official Pashto dataset page | ASR training/evaluation |
-| Google FLEURS | [Hugging Face - google/fleurs](https://huggingface.co/datasets/google/fleurs) | [`fleurs.py` includes `ps_af`](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py) | Speech benchmark |
-| OSCAR Corpus | [Hugging Face - oscar-corpus/oscar](https://huggingface.co/datasets/oscar-corpus/oscar) | Includes `unshuffled_deduplicated_ps` | NLP language modeling |
-| Wikimedia Wikipedia | [Hugging Face - wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) | Includes `20231101.ps` | Clean text corpus |
-| Belebele | [Hugging Face - facebook/belebele](https://huggingface.co/datasets/facebook/belebele) | Includes `pbt_Arab` | Reading comprehension benchmark |
-| OPUS-100 | [Hugging Face - Helsinki-NLP/opus-100](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes `en-ps` subset](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Parallel corpus for Pashto-English translation |
-| Pashto Isolated Words Speech Dataset | [Kaggle - engrirf/pashto-isolated-words-speech-dataset](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Dataset card title explicitly marks Pashto speech data | Keyword spotting and limited-vocabulary ASR |
-| Pashto Word Embeddings | [Kaggle - drijaz/pashto-word-embeddings](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Dataset description states pretrained Pashto embeddings | NLP baselines and lexical experiments |
+| Belebele | [huggingface](https://huggingface.co/datasets/facebook/belebele) | [Dataset includes pbt_Arab subset. (`pbt_Arab`)](https://huggingface.co/datasets/facebook/belebele) | Comprehension and multilingual NLP benchmark |
+| Common Voice Scripted Speech 24.0 - Pashto | [mozilla](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | [Official dataset page is for Pashto. (`Pashto`)](https://datacollective.mozillafoundation.org/datasets/cmj8u3pnb00llnxxbfvxo3b14) | ASR training and evaluation |
+| Google FLEURS | [huggingface](https://huggingface.co/datasets/google/fleurs) | [Dataset config includes ps_af. (`ps_af`)](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py) | Speech benchmark and external evaluation |
+| OPUS-100 | [huggingface](https://huggingface.co/datasets/Helsinki-NLP/opus-100) | [Dataset viewer includes en-ps split. (`en-ps`)](https://huggingface.co/datasets/Helsinki-NLP/opus-100/viewer/en-ps) | Machine translation training and evaluation |
+| OSCAR Corpus | [huggingface](https://huggingface.co/datasets/oscar-corpus/oscar) | [Dataset includes unshuffled_deduplicated_ps split. (`unshuffled_deduplicated_ps`)](https://huggingface.co/datasets/oscar-corpus/oscar) | Language modeling and lexicon expansion |
+| Pashto Isolated Words Speech Dataset | [kaggle](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | [Dataset title explicitly states Pashto speech dataset. (`Pashto`)](https://www.kaggle.com/datasets/engrirf/pashto-isolated-words-speech-dataset) | Keyword spotting and constrained ASR experiments |
+| Pashto Word Embeddings | [kaggle](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | [Dataset description states pretrained Pashto embeddings. (`Pashto`)](https://www.kaggle.com/datasets/drijaz/pashto-word-embeddings) | Lexical semantics and lightweight NLP baselines |
+| Wikimedia Wikipedia | [huggingface](https://huggingface.co/datasets/wikimedia/wikipedia) | [Dataset includes 20231101.ps subset. (`20231101.ps`)](https://huggingface.co/datasets/wikimedia/wikipedia) | Terminology and balanced text corpus |
 
-## Integration Paths
-- Data workspace: [../../data/README.md](../../data/README.md)
-- Benchmark workspace: [../../benchmarks/README.md](../../benchmarks/README.md)
+## Maintenance
+- Source of truth: [../catalog/resources.json](../catalog/resources.json)
+- Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)
+- Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)
resources/models/README.md CHANGED
@@ -1,19 +1,18 @@
  # Models
 
- ## Pashto-Relevant Models
 
  | Resource | Link | Pashto Evidence | Primary Use |
  |---|---|---|---|
- | Whisper Large v3 | [Hugging Face - openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | [Tokenizer map includes `ps`](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR baseline |
- | MMS Coverage Table | [Meta MMS language coverage](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Includes `pus` with ASR/TTS support | Multilingual transfer |
- | MMS 1B All (ASR) | [Hugging Face - facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all) | [Coverage table includes `pus` with ASR support](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | Multilingual ASR transfer baseline |
- | MMS TTS | [Hugging Face - facebook/mms-tts](https://huggingface.co/facebook/mms-tts) | Aligned with MMS coverage table | TTS baseline |
- | NLLB-200 Distilled 600M | [Hugging Face - facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | [`special_tokens_map.json` includes `pbt_Arab`](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | Translation baseline |
- | OPUS MT en->mul | [Hugging Face - opus-mt-en-mul](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | Model language list includes `pus` | English->Pashto path |
- | OPUS MT mul->en | [Hugging Face - opus-mt-mul-en](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Model language list includes `pus` | Pashto->English path |
- | PashtoBERT | [Hugging Face - mdarhri/pashto-bert](https://huggingface.co/mdarhri/pashto-bert) | Model card states it is trained on Pashto corpus data | Pashto NLP encoder baseline |
 
- ## Integration Paths
- - ASR workspace: [../../asr/README.md](../../asr/README.md)
- - TTS workspace: [../../tts/README.md](../../tts/README.md)
- - Apps workspace: [../../apps/desktop/README.md](../../apps/desktop/README.md)

  # Models
 
+ ## Verified Pashto Resources
 
  | Resource | Link | Pashto Evidence | Primary Use |
  |---|---|---|---|
+ | MMS 1B All | [huggingface](https://huggingface.co/facebook/mms-1b-all) | [MMS coverage table includes pus with ASR support. (`pus`)](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | ASR transfer baseline |
+ | MMS TTS | [huggingface](https://huggingface.co/facebook/mms-tts) | [MMS coverage table includes pus with TTS support. (`pus`)](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | TTS baseline and transfer |
+ | NLLB-200 Distilled 600M | [huggingface](https://huggingface.co/facebook/nllb-200-distilled-600M) | [Model special token map includes pbt_Arab. (`pbt_Arab`)](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | Pashto translation baseline |
+ | OPUS MT en-mul | [huggingface](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | [Language list includes pus code. (`pus`)](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) | English to Pashto translation path |
+ | OPUS MT mul-en | [huggingface](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | [Language list includes pus code. (`pus`)](https://huggingface.co/Helsinki-NLP/opus-mt-mul-en) | Pashto to English translation path |
+ | PashtoBERT | [huggingface](https://huggingface.co/mdarhri/pashto-bert) | [Model card states training on Pashto corpus data. (`Pashto`)](https://huggingface.co/mdarhri/pashto-bert) | Pashto NLP baseline encoder |
+ | Whisper Large v3 | [huggingface](https://huggingface.co/openai/whisper-large-v3) | [Whisper tokenizer map includes ps language key. (`ps`)](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR baseline and pseudo-labeling |
 
+ ## Maintenance
+ - Source of truth: [../catalog/resources.json](../catalog/resources.json)
+ - Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)
+ - Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)
resources/papers/README.md ADDED
@@ -0,0 +1,15 @@
+ # Papers
+
+ ## Verified Pashto Resources
+
+ | Resource | Link | Pashto Evidence | Primary Use |
+ |---|---|---|---|
+ | FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech | [arxiv](https://arxiv.org/abs/2205.12446) | [Dataset implementation includes ps_af language code. (`ps_af`)](https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py) | Speech benchmark methodology reference |
+ | No Language Left Behind | [arxiv](https://arxiv.org/abs/2207.04672) | [Model usage in repo references pbt_Arab token support. (`pbt_Arab`)](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/special_tokens_map.json) | MT research reference |
+ | Robust Speech Recognition via Large-Scale Weak Supervision | [arxiv](https://arxiv.org/abs/2212.04356) | [Paired with tokenizer language map containing ps. (`ps`)](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR methodology reference |
+ | Scaling Speech Technology to 1,000+ Languages | [arxiv](https://arxiv.org/abs/2305.13516) | [Coverage table marks pus support in MMS release. (`pus`)](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | ASR and TTS transfer reference |
+
+ ## Maintenance
+ - Source of truth: [../catalog/resources.json](../catalog/resources.json)
+ - Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)
+ - Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)
resources/schema/resource.schema.json ADDED
@@ -0,0 +1,142 @@
+ {
+   "$schema": "https://json-schema.org/draft/2020-12/schema",
+   "$id": "https://musawer1214.github.io/Pukhto_Pashto/resources/schema/resource.schema.json",
+   "title": "Pashto Resource Catalog",
+   "type": "object",
+   "additionalProperties": false,
+   "required": [
+     "version",
+     "updated_on",
+     "resources"
+   ],
+   "properties": {
+     "version": {
+       "type": "string",
+       "pattern": "^\\d+\\.\\d+\\.\\d+$"
+     },
+     "updated_on": {
+       "type": "string",
+       "format": "date"
+     },
+     "resources": {
+       "type": "array",
+       "items": {
+         "$ref": "#/$defs/resource"
+       }
+     }
+   },
+   "$defs": {
+     "resource": {
+       "type": "object",
+       "additionalProperties": false,
+       "required": [
+         "id",
+         "title",
+         "url",
+         "category",
+         "source",
+         "status",
+         "summary",
+         "primary_use",
+         "pashto_evidence",
+         "tags"
+       ],
+       "properties": {
+         "id": {
+           "type": "string",
+           "pattern": "^[a-z0-9][a-z0-9._-]*$"
+         },
+         "title": {
+           "type": "string",
+           "minLength": 3
+         },
+         "url": {
+           "type": "string",
+           "format": "uri",
+           "pattern": "^https?://"
+         },
+         "category": {
+           "type": "string",
+           "enum": [
+             "dataset",
+             "model",
+             "benchmark",
+             "tool",
+             "paper"
+           ]
+         },
+         "source": {
+           "type": "string",
+           "enum": [
+             "huggingface",
+             "mozilla",
+             "kaggle",
+             "github",
+             "arxiv",
+             "meta",
+             "other"
+           ]
+         },
+         "status": {
+           "type": "string",
+           "enum": [
+             "verified",
+             "candidate"
+           ]
+         },
+         "summary": {
+           "type": "string",
+           "minLength": 10
+         },
+         "primary_use": {
+           "type": "string",
+           "minLength": 3
+         },
+         "license": {
+           "type": "string"
+         },
+         "tasks": {
+           "type": "array",
+           "items": {
+             "type": "string"
+           }
+         },
+         "pashto_evidence": {
+           "type": "object",
+           "additionalProperties": false,
+           "required": [
+             "evidence_text",
+             "evidence_url",
+             "markers"
+           ],
+           "properties": {
+             "evidence_text": {
+               "type": "string",
+               "minLength": 3
+             },
+             "evidence_url": {
+               "type": "string",
+               "format": "uri",
+               "pattern": "^https?://"
+             },
+             "markers": {
+               "type": "array",
+               "minItems": 1,
+               "items": {
+                 "type": "string",
+                 "minLength": 1
+               }
+             }
+           }
+         },
+         "tags": {
+           "type": "array",
+           "minItems": 1,
+           "items": {
+             "type": "string"
+           }
+         }
+       }
+     }
+   }
+ }
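Read alongside the schema above, a minimal catalog entry satisfying its required fields could look like the sketch below. The entry values and the `looks_valid` helper are illustrative only, not part of the repository; authoritative validation is done by `scripts/validate_resource_catalog.py` against the JSON Schema.

```python
# Illustrative catalog entry shaped to satisfy the schema's required fields.
entry = {
    "id": "dataset-google-fleurs",
    "title": "Google FLEURS",
    "url": "https://huggingface.co/datasets/google/fleurs",
    "category": "dataset",
    "source": "huggingface",
    "status": "verified",
    "summary": "Multilingual speech benchmark with a Pashto (ps_af) configuration.",
    "primary_use": "Speech benchmark and external evaluation",
    "pashto_evidence": {
        "evidence_text": "Dataset config includes ps_af.",
        "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
        "markers": ["ps_af"],
    },
    "tags": ["pashto", "speech", "benchmark"],
}

REQUIRED = {"id", "title", "url", "category", "source", "status",
            "summary", "primary_use", "pashto_evidence", "tags"}


def looks_valid(resource: dict) -> bool:
    """Cheap structural check mirroring a few of the schema constraints."""
    if not REQUIRED <= resource.keys():
        return False
    if resource["category"] not in {"dataset", "model", "benchmark", "tool", "paper"}:
        return False
    evidence = resource["pashto_evidence"]
    # Markers must be non-empty and evidence must be an http(s) URL.
    return bool(evidence.get("markers")) and evidence["evidence_url"].startswith("http")


print(looks_valid(entry))  # prints True
```

Note that this check is deliberately shallow; length constraints (`minLength`, `minItems`) and URI formats are enforced only by the schema-backed validator.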
resources/tools/README.md CHANGED
@@ -1,17 +1,13 @@
- # 🛠️ Tools and Applications
 
- ## Practical Tools
 
- | Tool | Link | Use |
- |---|---|---|
- | Faster-Whisper | [github.com/SYSTRAN/faster-whisper](https://github.com/SYSTRAN/faster-whisper) | Faster ASR inference |
- | Coqui TTS | [github.com/coqui-ai/TTS](https://github.com/coqui-ai/TTS) | TTS training/inference |
 
- ## Research Anchors
- - Whisper paper: [arXiv:2212.04356](https://arxiv.org/abs/2212.04356)
- - MMS paper: [arXiv:2305.13516](https://arxiv.org/abs/2305.13516)
- - NLLB paper: [arXiv:2207.04672](https://arxiv.org/abs/2207.04672)
- - FLEURS paper: [arXiv:2205.12446](https://arxiv.org/abs/2205.12446)
-
- ## Integration Path
- - Desktop integration: [../../apps/desktop/README.md](../../apps/desktop/README.md)

+ # Tools
 
+ ## Verified Pashto Resources
 
+ | Resource | Link | Pashto Evidence | Primary Use |
+ |---|---|---|---|
+ | Coqui TTS | [github](https://github.com/coqui-ai/TTS) | [Can be paired with Pashto-supporting MMS checkpoints. (`pus`)](https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html) | TTS training and inference |
+ | Faster-Whisper | [github](https://github.com/SYSTRAN/faster-whisper) | [Whisper tokenizer includes ps and tool runs Whisper models. (`ps`)](https://raw.githubusercontent.com/openai/whisper/main/whisper/tokenizer.py) | ASR inference acceleration |
 
+ ## Maintenance
+ - Source of truth: [../catalog/resources.json](../catalog/resources.json)
+ - Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)
+ - Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)
scripts/README.md CHANGED
@@ -1,10 +1,13 @@
- # ⚙️ Scripts
 
- Automation scripts for data checks and documentation hygiene.
 
- ## Available Scripts
- - Normalization validator: [validate_normalization.py](validate_normalization.py)
- - Markdown link checker: [check_links.py](check_links.py)
+ # Scripts
 
+ Automation scripts for quality checks, resource catalog validation, and search index generation.
 
+ ## Available scripts
+ - `validate_normalization.py`: validate normalization seed TSV format and rules.
+ - `check_links.py`: ensure markdown links are clickable (optional online reachability check).
+ - `validate_resource_catalog.py`: validate `resources/catalog/resources.json`.
+ - `generate_resource_views.py`: generate `resources/*/README.md`, `resources/README.md`, and `docs/search/resources.json` from the catalog.
+ - `sync_resources.py`: collect new candidate Pashto resources from public endpoints into `resources/catalog/pending_candidates.json`.
 
  ## Usage
 
@@ -13,7 +16,22 @@ Validate normalization seed file:
  python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
  ```
 
- Check markdown links are clickable-format links:
+ Validate resource catalog:
+ ```bash
+ python scripts/validate_resource_catalog.py
+ ```
+
+ Generate markdown and search index from catalog:
+ ```bash
+ python scripts/generate_resource_views.py
+ ```
+
+ Sync candidate resources for maintainer review:
+ ```bash
+ python scripts/sync_resources.py --limit 20
+ ```
+
+ Check markdown links format:
  ```bash
  python scripts/check_links.py
  ```
scripts/generate_resource_views.py ADDED
@@ -0,0 +1,174 @@
+ """Generate markdown resource views and search index from catalog JSON.
+
+ Usage:
+     python scripts/generate_resource_views.py
+ """
+
+ from __future__ import annotations
+
+ import json
+ from pathlib import Path
+ from typing import Any
+
+
+ CATEGORY_CONFIG = {
+     "dataset": ("resources/datasets/README.md", "Datasets"),
+     "model": ("resources/models/README.md", "Models"),
+     "benchmark": ("resources/benchmarks/README.md", "Benchmarks"),
+     "tool": ("resources/tools/README.md", "Tools"),
+     "paper": ("resources/papers/README.md", "Papers"),
+ }
+
+
+ def _load_catalog(path: Path) -> dict[str, Any]:
+     return json.loads(path.read_text(encoding="utf-8"))
+
+
+ def _escape_cell(value: str) -> str:
+     return value.replace("|", "\\|").strip()
+
+
+ def _marker_text(markers: list[str]) -> str:
+     return ", ".join(f"`{marker}`" for marker in markers)
+
+
+ def _resource_row(resource: dict[str, Any]) -> str:
+     evidence = resource["pashto_evidence"]
+     evidence_text = _escape_cell(evidence["evidence_text"])
+     markers = _marker_text(evidence["markers"])
+     if markers:
+         evidence_text = f"{evidence_text} ({markers})"
+     return (
+         f"| {_escape_cell(resource['title'])} | "
+         f"[{resource['source']}]({resource['url']}) | "
+         f"[{evidence_text}]({evidence['evidence_url']}) | "
+         f"{_escape_cell(resource['primary_use'])} |"
+     )
+
+
+ def _write_markdown_table(path: Path, title: str, resources: list[dict[str, Any]]) -> None:
+     lines = [
+         f"# {title}",
+         "",
+         "## Verified Pashto Resources",
+         "",
+         "| Resource | Link | Pashto Evidence | Primary Use |",
+         "|---|---|---|---|",
+     ]
+
+     if resources:
+         lines.extend(_resource_row(resource) for resource in resources)
+     else:
+         lines.append("| _None yet_ | - | - | - |")
+
+     lines.extend(
+         [
+             "",
+             "## Maintenance",
+             "- Source of truth: [../catalog/resources.json](../catalog/resources.json)",
+             "- Validation: [../../scripts/validate_resource_catalog.py](../../scripts/validate_resource_catalog.py)",
+             "- Generated by: [../../scripts/generate_resource_views.py](../../scripts/generate_resource_views.py)",
+             "",
+         ]
+     )
+     path.write_text("\n".join(lines), encoding="utf-8")
+
+
+ def _write_resources_home(path: Path, counts: dict[str, int], total_verified: int) -> None:
+     lines = [
+         "# Resources",
+         "",
+         "Structured, Pashto-focused resource tracking lives in this folder.",
+         "",
+         "## Sections",
+         f"- Datasets ({counts.get('dataset', 0)}): [datasets/README.md](datasets/README.md)",
+         f"- Models ({counts.get('model', 0)}): [models/README.md](models/README.md)",
+         f"- Benchmarks ({counts.get('benchmark', 0)}): [benchmarks/README.md](benchmarks/README.md)",
+         f"- Tools ({counts.get('tool', 0)}): [tools/README.md](tools/README.md)",
+         f"- Papers ({counts.get('paper', 0)}): [papers/README.md](papers/README.md)",
+         "",
+         "## Machine-Readable Catalog",
+         "- Canonical catalog: [catalog/resources.json](catalog/resources.json)",
+         "- Candidate feed: [catalog/pending_candidates.json](catalog/pending_candidates.json)",
+         "- Schema: [schema/resource.schema.json](schema/resource.schema.json)",
+         "",
+         "## Update Rule",
+         "- Add only validated resources with explicit Pashto relevance.",
+         "- Keep every external reference clickable using markdown links.",
+         "- Run `python scripts/validate_resource_catalog.py` before opening a PR.",
+         "- Run `python scripts/generate_resource_views.py` after catalog changes.",
+         "",
+         f"Verified resource count: `{total_verified}`",
+         "",
+     ]
+     path.write_text("\n".join(lines), encoding="utf-8")
+
+
+ def _build_search_payload(resources: list[dict[str, Any]], updated_on: str) -> dict[str, Any]:
+     search_items: list[dict[str, Any]] = []
+     for resource in resources:
+         evidence = resource["pashto_evidence"]
+         search_items.append(
+             {
+                 "id": resource["id"],
+                 "title": resource["title"],
+                 "url": resource["url"],
+                 "category": resource["category"],
+                 "source": resource["source"],
+                 "status": resource["status"],
+                 "summary": resource["summary"],
+                 "primary_use": resource["primary_use"],
+                 "tasks": resource.get("tasks", []),
+                 "tags": resource["tags"],
+                 "evidence_text": evidence["evidence_text"],
+                 "evidence_url": evidence["evidence_url"],
+                 "markers": evidence["markers"],
+             }
+         )
+
+     return {
+         "generated_on": f"{updated_on}T00:00:00Z",
+         "count": len(search_items),
+         "resources": search_items,
+     }
+
+
+ def main() -> int:
+     catalog_path = Path("resources/catalog/resources.json")
+     catalog = _load_catalog(catalog_path)
+     resources: list[dict[str, Any]] = catalog.get("resources", [])
+     updated_on = catalog.get("updated_on", "1970-01-01")
+     verified = [resource for resource in resources if resource.get("status") == "verified"]
+
+     grouped: dict[str, list[dict[str, Any]]] = {category: [] for category in CATEGORY_CONFIG}
+     for resource in verified:
+         category = resource.get("category")
+         if category in grouped:
+             grouped[category].append(resource)
+
+     for category, (file_path, title) in CATEGORY_CONFIG.items():
+         output_path = Path(file_path)
+         output_path.parent.mkdir(parents=True, exist_ok=True)
+         rows = sorted(grouped[category], key=lambda item: item["title"].lower())
+         _write_markdown_table(output_path, title, rows)
+
+     counts = {category: len(items) for category, items in grouped.items()}
+     _write_resources_home(Path("resources/README.md"), counts, len(verified))
+
+     search_payload = _build_search_payload(resources, updated_on)
+     search_json_path = Path("docs/search/resources.json")
+     search_json_path.parent.mkdir(parents=True, exist_ok=True)
+     search_json_path.write_text(
+         json.dumps(search_payload, ensure_ascii=False, indent=2) + "\n",
+         encoding="utf-8",
+     )
+
+     print(
+         "Generated resources markdown and search index: "
+         f"{len(verified)} verified resources, {len(resources)} total resources"
+     )
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
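As a standalone sketch of what the row renderer in this script emits, the following mirrors the `_escape_cell` and `_resource_row` logic on one illustrative entry (the `resource` dict here is sample data, not a real catalog record):

```python
def escape_cell(value: str) -> str:
    # Same idea as _escape_cell: a raw "|" would break the markdown table.
    return value.replace("|", "\\|").strip()


resource = {  # illustrative entry
    "title": "Google FLEURS",
    "source": "huggingface",
    "url": "https://huggingface.co/datasets/google/fleurs",
    "primary_use": "Speech benchmark",
    "pashto_evidence": {
        "evidence_text": "Dataset config includes ps_af.",
        "evidence_url": "https://huggingface.co/datasets/google/fleurs/blob/main/fleurs.py",
        "markers": ["ps_af"],
    },
}

evidence = resource["pashto_evidence"]
markers = ", ".join(f"`{m}`" for m in evidence["markers"])
cell = f"{escape_cell(evidence['evidence_text'])} ({markers})"
row = (
    f"| {escape_cell(resource['title'])} | "
    f"[{resource['source']}]({resource['url']}) | "
    f"[{cell}]({evidence['evidence_url']}) | "
    f"{escape_cell(resource['primary_use'])} |"
)
print(row)
```

The evidence cell becomes a markdown link whose text carries the backticked markers, which is exactly the shape seen in the generated tables above.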
scripts/sync_resources.py ADDED
@@ -0,0 +1,283 @@
+ """Discover new Pashto-related resource candidates from public endpoints.
+
+ This script does not auto-merge into the main catalog. It writes candidates to
+ `resources/catalog/pending_candidates.json` for maintainer review.
+
+ Usage:
+     python scripts/sync_resources.py
+     python scripts/sync_resources.py --limit 20 --output resources/catalog/pending_candidates.json
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import re
+ import urllib.parse
+ import urllib.request
+ import xml.etree.ElementTree as ET
+ from datetime import datetime, timezone
+ from pathlib import Path
+ from typing import Any
+
+
+ USER_AGENT = "pashto-resource-sync/1.0"
+
+
+ def _slug(value: str) -> str:
+     value = value.lower()
+     value = re.sub(r"[^a-z0-9]+", "-", value)
+     value = re.sub(r"-+", "-", value).strip("-")
+     return value[:80] if value else "resource"
+
+
+ def _fetch_json(url: str, timeout: float = 20.0) -> Any:
+     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
+     with urllib.request.urlopen(req, timeout=timeout) as response:
+         return json.loads(response.read().decode("utf-8"))
+
+
+ def _fetch_text(url: str, timeout: float = 20.0) -> str:
+     req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
+     with urllib.request.urlopen(req, timeout=timeout) as response:
+         return response.read().decode("utf-8", errors="replace")
+
+
+ def _candidate(
+     *,
+     rid: str,
+     title: str,
+     url: str,
+     category: str,
+     source: str,
+     summary: str,
+     evidence_text: str,
+     evidence_url: str,
+     markers: list[str],
+     tags: list[str],
+ ) -> dict[str, Any]:
+     return {
+         "id": rid,
+         "title": title.strip(),
+         "url": url.strip(),
+         "category": category,
+         "source": source,
+         "status": "candidate",
+         "summary": summary.strip(),
+         "primary_use": "Needs maintainer review before promotion to verified catalog.",
+         "tasks": [],
+         "pashto_evidence": {
+             "evidence_text": evidence_text.strip(),
+             "evidence_url": evidence_url.strip(),
+             "markers": markers,
+         },
+         "tags": tags,
+     }
+
+
+ def fetch_huggingface(kind: str, limit: int) -> list[dict[str, Any]]:
+     if kind not in {"datasets", "models"}:
+         return []
+
+     query = urllib.parse.urlencode({"search": "pashto", "limit": str(limit)})
+     url = f"https://huggingface.co/api/{kind}?{query}"
+     payload = _fetch_json(url)
+
+     category = "dataset" if kind == "datasets" else "model"
+     out: list[dict[str, Any]] = []
+     for item in payload:
+         repo_id = item.get("id") or item.get("modelId")
+         if not repo_id:
+             continue
+         repo_url = f"https://huggingface.co/{'datasets/' if kind == 'datasets' else ''}{repo_id}"
+         rid = f"candidate-hf-{kind[:-1]}-{_slug(repo_id)}"
+         out.append(
+             _candidate(
+                 rid=rid,
+                 title=repo_id,
+                 url=repo_url,
+                 category=category,
+                 source="huggingface",
+                 summary=f"Candidate {category} returned from Hugging Face search for Pashto.",
+                 evidence_text="Matched by Pashto keyword in Hugging Face search results.",
+                 evidence_url=repo_url,
+                 markers=["pashto"],
+                 tags=["pashto", "candidate", category],
+             )
+         )
+     return out
+
+
+ def fetch_arxiv(limit: int) -> list[dict[str, Any]]:
+     query = urllib.parse.urlencode(
+         {"search_query": "all:pashto", "start": "0", "max_results": str(limit)}
+     )
+     url = f"http://export.arxiv.org/api/query?{query}"
+     xml_text = _fetch_text(url)
+     root = ET.fromstring(xml_text)
+     ns = {"atom": "http://www.w3.org/2005/Atom"}
+
+     out: list[dict[str, Any]] = []
+     for entry in root.findall("atom:entry", ns):
+         title = (entry.findtext("atom:title", default="", namespaces=ns) or "").strip()
+         link = (entry.findtext("atom:id", default="", namespaces=ns) or "").strip()
+         summary = (entry.findtext("atom:summary", default="", namespaces=ns) or "").strip()
+         if not title or not link:
+             continue
+
+         rid = f"candidate-arxiv-{_slug(title)}"
+         out.append(
+             _candidate(
+                 rid=rid,
+                 title=title,
+                 url=link,
+                 category="paper",
+                 source="arxiv",
+                 summary=summary[:240] if summary else "Candidate paper returned from arXiv query for Pashto.",
+                 evidence_text="Matched by arXiv query: all:pashto.",
+                 evidence_url=link,
+                 markers=["pashto"],
+                 tags=["pashto", "candidate", "paper"],
+             )
+         )
+     return out
+
+
+ def fetch_semantic_scholar(limit: int) -> list[dict[str, Any]]:
+     fields = "title,url,abstract,year,externalIds"
+     query = urllib.parse.urlencode(
+         {"query": "pashto", "limit": str(limit), "fields": fields}
+     )
+     url = f"https://api.semanticscholar.org/graph/v1/paper/search?{query}"
+     payload = _fetch_json(url)
+
+     out: list[dict[str, Any]] = []
+     for item in payload.get("data", []):
+         title = (item.get("title") or "").strip()
+         if not title:
+             continue
+         paper_url = (item.get("url") or "").strip()
+         if not paper_url:
+             ext = item.get("externalIds") or {}
+             arxiv_id = ext.get("ArXiv")
+             if arxiv_id:
+                 paper_url = f"https://arxiv.org/abs/{arxiv_id}"
+         if not paper_url:
+             continue
+
+         summary = (item.get("abstract") or "").strip()
+         rid = f"candidate-s2-{_slug(title)}"
+         out.append(
+             _candidate(
+                 rid=rid,
+                 title=title,
+                 url=paper_url,
+                 category="paper",
+                 source="other",
+                 summary=summary[:240] if summary else "Candidate paper returned from Semantic Scholar search for Pashto.",
+                 evidence_text="Matched by Semantic Scholar query: pashto.",
+                 evidence_url=paper_url,
+                 markers=["pashto"],
+                 tags=["pashto", "candidate", "paper"],
+             )
+         )
+     return out
+
+
+ def _dedupe_candidates(
+     candidates: list[dict[str, Any]],
+     existing_ids: set[str],
+     existing_urls: set[str],
+ ) -> list[dict[str, Any]]:
+     unique: list[dict[str, Any]] = []
+     seen_ids = set(existing_ids)
+     seen_urls = set(existing_urls)
+
+     for item in candidates:
+         rid = item["id"]
+         url = item["url"].rstrip("/")
+         if rid in seen_ids or url in seen_urls:
+             continue
+         seen_ids.add(rid)
+         seen_urls.add(url)
+         unique.append(item)
+     return unique
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--catalog", default="resources/catalog/resources.json")
+     parser.add_argument("--output", default="resources/catalog/pending_candidates.json")
+     parser.add_argument("--limit", type=int, default=15)
+     args = parser.parse_args()
+
+     catalog_path = Path(args.catalog)
+     output_path = Path(args.output)
+
+     catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
+     resources = catalog.get("resources", [])
+     existing_ids = {resource.get("id", "") for resource in resources if isinstance(resource, dict)}
+     existing_urls = {
+         resource.get("url", "").rstrip("/")
+         for resource in resources
+         if isinstance(resource, dict) and isinstance(resource.get("url"), str)
+     }
+
+     all_candidates: list[dict[str, Any]] = []
+     source_errors: list[str] = []
+     sources_used: list[str] = []
+
+     fetch_steps = [
+         ("huggingface-datasets", lambda: fetch_huggingface("datasets", args.limit)),
+         ("huggingface-models", lambda: fetch_huggingface("models", args.limit)),
+         ("arxiv", lambda: fetch_arxiv(args.limit)),
+         ("semantic-scholar", lambda: fetch_semantic_scholar(args.limit)),
+     ]
+
+     for source_name, step in fetch_steps:
+         try:
+             results = step()
+             all_candidates.extend(results)
+             sources_used.append(source_name)
+         except Exception as exc:  # noqa: BLE001
+             source_errors.append(f"{source_name}: {exc}")
+
+     unique_candidates = _dedupe_candidates(all_candidates, existing_ids, existing_urls)
+     unique_candidates = sorted(unique_candidates, key=lambda item: item["title"].lower())
+
+     payload: dict[str, Any] = {
+         "generated_on": datetime.now(timezone.utc).isoformat(),
+         "sources": sources_used,
+         "candidate_count": len(unique_candidates),
+         "candidates": unique_candidates,
+     }
+     if source_errors:
+         payload["errors"] = source_errors
+
+     output_path.parent.mkdir(parents=True, exist_ok=True)
+     if output_path.exists():
+         try:
+             old_payload = json.loads(output_path.read_text(encoding="utf-8"))
+         except json.JSONDecodeError:
+             old_payload = None
+         if isinstance(old_payload, dict):
+             old_compare = {key: value for key, value in old_payload.items() if key != "generated_on"}
+             new_compare = {key: value for key, value in payload.items() if key != "generated_on"}
+             if old_compare == new_compare:
+                 print(
+                     f"Candidate sync complete: {len(unique_candidates)} new candidates, "
+                     f"{len(source_errors)} source errors, no file changes"
+                 )
+                 return 0
+
+     output_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2) + "\n", encoding="utf-8")
+
+     print(
+         f"Candidate sync complete: {len(unique_candidates)} new candidates, "
+         f"{len(source_errors)} source errors"
+     )
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
scripts/validate_resource_catalog.py ADDED
@@ -0,0 +1,207 @@
+ """Validate the machine-readable Pashto resource catalog.
+
+ Usage:
+     python scripts/validate_resource_catalog.py
+     python scripts/validate_resource_catalog.py --catalog resources/catalog/resources.json
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import re
+ from datetime import date
+ from pathlib import Path
+ from typing import Any
+ from urllib.parse import urlparse
+
+
+ ALLOWED_CATEGORIES = {"dataset", "model", "benchmark", "tool", "paper"}
+ ALLOWED_SOURCES = {"huggingface", "mozilla", "kaggle", "github", "arxiv", "meta", "other"}
+ ALLOWED_STATUS = {"verified", "candidate"}
+ RESOURCE_ID_RE = re.compile(r"^[a-z0-9][a-z0-9._-]*$")
+
+
+ def _load_json(path: Path) -> dict[str, Any]:
+     return json.loads(path.read_text(encoding="utf-8"))
+
+
+ def _is_valid_http_url(value: str) -> bool:
+     parsed = urlparse(value)
+     return parsed.scheme in {"http", "https"} and bool(parsed.netloc)
+
+
+ def _validate_iso_date(value: str) -> bool:
+     try:
+         date.fromisoformat(value)
+     except ValueError:
+         return False
+     return True
+
+
+ def validate_resource(resource: dict[str, Any], index: int) -> list[str]:
+     errors: list[str] = []
+     prefix = f"resource[{index}]"
+
+     required_fields = {
+         "id",
+         "title",
+         "url",
+         "category",
+         "source",
+         "status",
+         "summary",
+         "primary_use",
+         "pashto_evidence",
+         "tags",
+     }
+     missing = sorted(required_fields - resource.keys())
+     if missing:
+         errors.append(f"{prefix} missing required fields: {', '.join(missing)}")
+         return errors
+
+     rid = resource["id"]
+     if not isinstance(rid, str) or not RESOURCE_ID_RE.fullmatch(rid):
+         errors.append(f"{prefix}.id must match {RESOURCE_ID_RE.pattern}")
+
+     title = resource["title"]
+     if not isinstance(title, str) or len(title.strip()) < 3:
+         errors.append(f"{prefix}.title must be a non-empty string")
+
+     url = resource["url"]
+     if not isinstance(url, str) or not _is_valid_http_url(url):
+         errors.append(f"{prefix}.url must be a valid http/https URL")
+
+     category = resource["category"]
+     if category not in ALLOWED_CATEGORIES:
+         errors.append(f"{prefix}.category must be one of {sorted(ALLOWED_CATEGORIES)}")
+
+     source = resource["source"]
+     if source not in ALLOWED_SOURCES:
+         errors.append(f"{prefix}.source must be one of {sorted(ALLOWED_SOURCES)}")
+
+     status = resource["status"]
+     if status not in ALLOWED_STATUS:
+         errors.append(f"{prefix}.status must be one of {sorted(ALLOWED_STATUS)}")
+
+     summary = resource["summary"]
+     if not isinstance(summary, str) or len(summary.strip()) < 10:
+         errors.append(f"{prefix}.summary must be at least 10 characters")
+
+     primary_use = resource["primary_use"]
+     if not isinstance(primary_use, str) or len(primary_use.strip()) < 3:
+         errors.append(f"{prefix}.primary_use must be a non-empty string")
+
+     if "tasks" in resource and not (
+         isinstance(resource["tasks"], list)
+         and all(isinstance(item, str) and item.strip() for item in resource["tasks"])
+     ):
+         errors.append(f"{prefix}.tasks must be a list of strings")
+
+     tags = resource["tags"]
+     if not (isinstance(tags, list) and tags and all(isinstance(tag, str) and tag.strip() for tag in tags)):
+         errors.append(f"{prefix}.tags must be a non-empty list of strings")
+
+     evidence = resource["pashto_evidence"]
+     if not isinstance(evidence, dict):
+         errors.append(f"{prefix}.pashto_evidence must be an object")
+         return errors
+
+     for key in ("evidence_text", "evidence_url", "markers"):
+         if key not in evidence:
+             errors.append(f"{prefix}.pashto_evidence missing '{key}'")
+
+     evidence_text = evidence.get("evidence_text")
+     if not isinstance(evidence_text, str) or len(evidence_text.strip()) < 3:
+         errors.append(f"{prefix}.pashto_evidence.evidence_text must be a non-empty string")
+
+     evidence_url = evidence.get("evidence_url")
+     if not isinstance(evidence_url, str) or not _is_valid_http_url(evidence_url):
+         errors.append(f"{prefix}.pashto_evidence.evidence_url must be a valid http/https URL")
+
+     markers = evidence.get("markers")
+     if not (isinstance(markers, list) and markers and all(isinstance(marker, str) and marker.strip() for marker in markers)):
+         errors.append(f"{prefix}.pashto_evidence.markers must be a non-empty list of strings")
+
+     return errors
+
+
+ def validate_catalog(catalog: dict[str, Any]) -> list[str]:
+     errors: list[str] = []
+
+     for key in ("version", "updated_on", "resources"):
+         if key not in catalog:
+             errors.append(f"catalog missing required top-level key: {key}")
+
+     if errors:
+         return errors
+
+     version = catalog["version"]
+     if not isinstance(version, str) or not re.fullmatch(r"\d+\.\d+\.\d+", version):
+         errors.append("catalog.version must look like '1.0.0'")
+
+     updated_on = catalog["updated_on"]
+     if not isinstance(updated_on, str) or not _validate_iso_date(updated_on):
+         errors.append("catalog.updated_on must be a valid ISO date (YYYY-MM-DD)")
+
+     resources = catalog["resources"]
+     if not isinstance(resources, list):
+         errors.append("catalog.resources must be a list")
+         return errors
+
+     seen_ids: set[str] = set()
+     for index, resource in enumerate(resources):
+         if not isinstance(resource, dict):
+             errors.append(f"resource[{index}] must be an object")
+             continue
+         errors.extend(validate_resource(resource, index))
+         resource_id = resource.get("id")
+         if isinstance(resource_id, str):
+             if resource_id in seen_ids:
+                 errors.append(f"duplicate resource id: {resource_id}")
+             seen_ids.add(resource_id)
+
+     return errors
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--catalog", default="resources/catalog/resources.json")
+     parser.add_argument("--schema", default="resources/schema/resource.schema.json")
+     args = parser.parse_args()
+
+     catalog_path = Path(args.catalog)
+     schema_path = Path(args.schema)
+
+     if not catalog_path.exists():
+         print(f"Missing catalog file: {catalog_path}")
+         return 1
+     if not schema_path.exists():
+         print(f"Missing schema file: {schema_path}")
+         return 1
+
+     try:
+         schema = _load_json(schema_path)
+         catalog = _load_json(catalog_path)
+     except json.JSONDecodeError as exc:
+         print(f"Invalid JSON: {exc}")
+         return 1
+
+     # Basic schema sanity check (this script enforces the validation rules directly).
+     if not isinstance(schema, dict) or "$schema" not in schema:
+         print("Schema file must be a JSON object with a '$schema' key")
+         return 1
+
+     errors = validate_catalog(catalog)
+     if errors:
+         print("Resource catalog validation failed:")
+         for error in errors:
+             print(f"- {error}")
+         return 1
+
+     print(f"Resource catalog valid: {len(catalog['resources'])} resources")
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
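The URL rule applied by `_is_valid_http_url` (and reused for `pashto_evidence.evidence_url`) can be sketched standalone; this reimplements the same check for illustration, with a name of my choosing:

```python
from urllib.parse import urlparse


def is_valid_http_url(value: str) -> bool:
    # Same rule as _is_valid_http_url in the validator: accept only an
    # http/https scheme with a non-empty network location.
    parsed = urlparse(value)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


print(is_valid_http_url("https://example.org/dataset"))  # True
print(is_valid_http_url("not-a-url"))                    # False
print(is_valid_http_url("ftp://example.org/file"))       # False
```

Note that `urlparse` never raises on malformed input; it simply yields empty `scheme`/`netloc` components, which is why the membership and truthiness checks are sufficient.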
tests/test_validate_resource_catalog.py ADDED
@@ -0,0 +1,45 @@
+ from scripts.validate_resource_catalog import validate_catalog
+
+
+ def _minimal_catalog() -> dict:
+     return {
+         "version": "1.0.0",
+         "updated_on": "2026-02-15",
+         "resources": [
+             {
+                 "id": "dataset-example",
+                 "title": "Example Dataset",
+                 "url": "https://example.org/dataset",
+                 "category": "dataset",
+                 "source": "other",
+                 "status": "verified",
+                 "summary": "Useful Pashto example dataset for testing the validator.",
+                 "primary_use": "Testing",
+                 "pashto_evidence": {
+                     "evidence_text": "Mentions Pashto in title.",
+                     "evidence_url": "https://example.org/dataset",
+                     "markers": ["Pashto"],
+                 },
+                 "tags": ["pashto", "test"],
+             }
+         ],
+     }
+
+
+ def test_validate_catalog_passes_for_minimal_valid_catalog() -> None:
+     errors = validate_catalog(_minimal_catalog())
+     assert errors == []
+
+
+ def test_validate_catalog_fails_for_duplicate_ids() -> None:
+     catalog = _minimal_catalog()
+     catalog["resources"].append(dict(catalog["resources"][0]))
+     errors = validate_catalog(catalog)
+     assert any("duplicate resource id" in error for error in errors)
+
+
+ def test_validate_catalog_fails_for_invalid_evidence_url() -> None:
+     catalog = _minimal_catalog()
+     catalog["resources"][0]["pashto_evidence"]["evidence_url"] = "not-a-url"
+     errors = validate_catalog(catalog)
+     assert any("evidence_url" in error for error in errors)