
Scripts

Automation scripts for quality checks, resource catalog validation, and search index generation.

Available scripts

  • validate_normalization.py: validate normalization seed TSV format and rules.
  • check_links.py: check that markdown links are well-formed, with an optional online reachability check.
  • validate_resource_catalog.py: validate resources/catalog/resources.json.
  • generate_resource_views.py: generate resources/*/README.md, resources/README.md, and docs/search/resources.json from the catalog.
  • sync_resources.py: collect new candidate Pashto resources from Kaggle, Hugging Face (datasets/models/spaces), GitHub, GitLab, OpenAlex, Crossref, Zenodo, Dataverse, DataCite, arXiv, and Semantic Scholar into resources/catalog/pending_candidates.json.
  • promote_candidates.py: auto-promote valid non-duplicate entries from pending_candidates.json into resources/catalog/resources.json.
  • review_existing_resources.py: review current catalog resources, remove stale or deleted entries only when there is a strong reason, and log each removal in resources/catalog/removal_log.json.
  • run_resource_cycle.py: run the full repeatable resource cycle with one command.

Usage

Validate normalization seed file:

python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
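
The exact format rules enforced by validate_normalization.py are not listed here. As a rough illustration only, a structural TSV check might look like the sketch below, assuming the first row is a header and every later row must have the same number of non-empty tab-separated cells. The function name `validate_tsv` and the sample column names are hypothetical.

```python
import csv
import io


def validate_tsv(text: str) -> list[str]:
    """Return a list of human-readable problems found in TSV text."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return ["file is empty"]

    problems = []
    width = len(rows[0])  # the header row defines the expected column count
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(
                f"line {line_no}: expected {width} columns, got {len(row)}"
            )
        elif any(not cell.strip() for cell in row):
            problems.append(f"line {line_no}: empty cell")
    return problems


# Hypothetical sample: the real seed file's columns may differ.
sample = "source\tnormalized\nfoo\tbar\nbaz\n"
for problem in validate_tsv(sample):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets a maintainer fix every bad row in a single pass.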

Validate resource catalog:

python scripts/validate_resource_catalog.py
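
The catalog schema is not described in this README. The sketch below shows the general shape such a validator could take, assuming the catalog is a JSON list of objects and that `id`, `title`, and `url` are required. Those field names and the `validate_catalog` helper are assumptions for illustration, not the script's actual rules.

```python
import json

# Hypothetical required fields; the real schema may differ.
REQUIRED_FIELDS = ("id", "title", "url")


def validate_catalog(entries: list[dict]) -> list[str]:
    """Report missing required fields and duplicate ids."""
    problems = []
    seen_ids = set()
    for index, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"entry {index}: missing {field}")
        entry_id = entry.get("id")
        if entry_id in seen_ids:
            problems.append(f"entry {index}: duplicate id {entry_id!r}")
        elif entry_id:
            seen_ids.add(entry_id)
    return problems


catalog = json.loads(
    '[{"id": "a", "title": "A", "url": "https://example.org/a"},'
    ' {"id": "a", "title": "", "url": "https://example.org/b"}]'
)
for problem in validate_catalog(catalog):
    print(problem)
```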

Generate markdown and search index from catalog:

python scripts/generate_resource_views.py

Sync candidate resources for maintainer review:

python scripts/sync_resources.py --limit 20

Review existing resources and remove stale entries before discovery:

python scripts/review_existing_resources.py

Run stricter relevance cleanup mode:

python scripts/review_existing_resources.py --enforce-pashto-relevance
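
How --enforce-pashto-relevance decides relevance is not documented here. One simple approach, shown purely as an assumption, is a keyword match over an entry's text fields. The marker list, the field names, and the function `looks_pashto_related` are all hypothetical.

```python
# Hypothetical keyword list; the real script may use different signals.
PASHTO_MARKERS = ("pashto", "pushto", "pukhto", "پښتو")


def looks_pashto_related(entry: dict) -> bool:
    """Return True if the title or description mentions a Pashto marker."""
    text = " ".join(
        str(entry.get(key, "")) for key in ("title", "description")
    ).lower()
    return any(marker in text for marker in PASHTO_MARKERS)


print(looks_pashto_related({"title": "Pashto ASR corpus"}))
print(looks_pashto_related({"title": "Urdu news articles"}))
```

A keyword match like this is cheap but coarse, which fits the rule stated above that entries are removed only with strong reasons: a miss should prompt manual review, not automatic deletion.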

Auto-promote valid candidates into the verified catalog:

python scripts/promote_candidates.py
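
Promotion is described as moving valid, non-duplicate entries into the catalog. A minimal sketch of that idea follows, assuming entries are dictionaries with a `url` key and that duplicates are detected by a normalized URL. Both are assumptions, and `normalize_url` and `promote` are hypothetical helpers, not the script's real functions.

```python
def normalize_url(url: str) -> str:
    """Crude normalization for duplicate detection (lossy by design)."""
    return url.strip().lower().rstrip("/")


def promote(catalog: list[dict], candidates: list[dict]) -> list[dict]:
    """Return the catalog plus candidates that are valid and not already known."""
    known = {normalize_url(entry["url"]) for entry in catalog}
    promoted = []
    for candidate in candidates:
        url = candidate.get("url", "")
        key = normalize_url(url)
        # Skip anything that is not an http(s) URL or is already known.
        if not url.startswith(("http://", "https://")) or key in known:
            continue
        known.add(key)
        promoted.append(candidate)
    return catalog + promoted


catalog = [{"url": "https://example.org/a"}]
candidates = [
    {"url": "https://example.org/a/"},  # duplicate after normalization
    {"url": "https://example.org/b"},   # new
    {"url": "ftp://example.org/c"},     # rejected: not http(s)
]
print(promote(catalog, candidates))
```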

Auto-promote while skipping online URL availability checks:

python scripts/promote_candidates.py --skip-url-check

Run full repeatable cycle:

python scripts/run_resource_cycle.py --limit 25

Run discovery only:

python scripts/run_resource_cycle.py --discover-only --limit 25
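
The internals of run_resource_cycle.py are not shown here. Conceptually, a full cycle could chain the individual scripts documented above. The sketch below only builds the command list. The step order after the review step, and the assumption that --discover-only runs just the sync step, are guesses. `build_cycle_commands` is a hypothetical helper.

```python
import sys


def build_cycle_commands(limit: int, discover_only: bool = False) -> list[list[str]]:
    """Build argv lists for each step of an assumed resource cycle."""
    python = sys.executable
    sync = [python, "scripts/sync_resources.py", "--limit", str(limit)]
    if discover_only:
        # Assumption: discovery-only mode runs just the sync step.
        return [sync]
    return [
        [python, "scripts/review_existing_resources.py"],  # review before discovery
        sync,
        [python, "scripts/promote_candidates.py"],
        [python, "scripts/validate_resource_catalog.py"],
        [python, "scripts/generate_resource_views.py"],
    ]


for command in build_cycle_commands(limit=25):
    print(" ".join(command[1:]))
```

Each command could then be executed with `subprocess.run(command, check=True)` so that a failing step stops the cycle before later steps touch the catalog.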

Check markdown link format:

python scripts/check_links.py
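
What check_links.py treats as a malformed link is not spelled out here. As an illustration, an offline format check could scan for inline links with empty text or an empty target. The regular expression and the `find_bad_links` helper below are assumptions, and the pattern deliberately ignores images and reference-style links.

```python
import re

# Inline links of the form [text](target); the lookbehind skips images (![alt](src)).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)]*)\)")


def find_bad_links(markdown: str) -> list[str]:
    """Report inline markdown links with empty text or an empty target."""
    problems = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        for match in LINK_PATTERN.finditer(line):
            text, target = match.group(1).strip(), match.group(2).strip()
            if not text:
                problems.append(f"line {line_no}: link has no text")
            if not target:
                problems.append(f"line {line_no}: link has no target")
    return problems


sample = "[ok](https://example.org)\n[](https://example.org)\n[x]()"
for problem in find_bad_links(sample):
    print(problem)
```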

Check markdown links and verify URLs online:

python scripts/check_links.py --online
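
The reachability test behind --online is not documented here. One standard-library way to perform such a check, offered as a sketch and not as the script's method, is an HTTP HEAD request with a timeout. `is_reachable` and its `opener` parameter are hypothetical. The opener is injectable so the function can be exercised without network access.

```python
import urllib.request


def is_reachable(url: str, timeout: float = 10.0, opener=urllib.request.urlopen) -> bool:
    """Return True if a HEAD request to the URL succeeds with a 2xx or 3xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts;
        # ValueError covers strings that are not valid URLs.
        return False


# Malformed input is rejected before any network call is made.
print(is_reachable("not a url"))
```

Some servers reject HEAD requests, so a real checker may need to fall back to GET before declaring a link dead.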