# Scripts

Automation scripts for quality checks, resource catalog validation, and search index generation.

## Available scripts
- `validate_normalization.py`: validate normalization seed TSV format and rules.
- `check_links.py`: check that markdown links are well-formed and clickable; pass `--online` to also verify that URLs are reachable.
- `validate_resource_catalog.py`: validate `resources/catalog/resources.json`.
- `generate_resource_views.py`: generate `resources/*/README.md`, `resources/README.md`, and `docs/search/resources.json` from the catalog.
- `sync_resources.py`: collect new candidate Pashto resources from Kaggle, Hugging Face (datasets/models/spaces), GitHub, GitLab, OpenAlex, Crossref, Zenodo, Dataverse, DataCite, arXiv, and Semantic Scholar into `resources/catalog/pending_candidates.json`.
- `promote_candidates.py`: auto-promote valid non-duplicate entries from `pending_candidates.json` into `resources/catalog/resources.json`.
- `review_existing_resources.py`: review current catalog resources, remove stale or no-longer-available entries only when there is a strong reason, and log each removal in `resources/catalog/removal_log.json`.
- `run_resource_cycle.py`: run the full repeatable resource cycle with one command.

## Usage

Validate normalization seed file:
```bash
python scripts/validate_normalization.py data/processed/normalization_seed_v0.1.tsv
```
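
The normalization rules themselves live in `validate_normalization.py` and are not documented here. As a rough sketch of the kind of format check such a validator might perform, the snippet below verifies that every row of a TSV has as many columns as the header and contains no empty cells. The function name `find_tsv_format_errors` and both rules are assumptions made for illustration, not the script's actual logic.

```python
import csv
import io


def find_tsv_format_errors(text):
    """Return a list of (line_number, message) for malformed TSV rows.

    Hypothetical rules (assumptions, not the real script's rules):
    every row has as many columns as the header, and no cell is empty.
    """
    errors = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if header is None:
        return [(1, "file is empty")]
    # Data rows start on line 2, right after the header.
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            errors.append(
                (line_number, f"expected {len(header)} columns, got {len(row)}")
            )
        elif any(not cell.strip() for cell in row):
            errors.append((line_number, "empty cell"))
    return errors


# Made-up sample: the third line is missing its second column.
sample = "source\ttarget\nfoo\tbar\nbaz\n"
print(find_tsv_format_errors(sample))
```

Returning a list of errors instead of raising on the first one lets a maintainer fix a whole seed file in a single pass.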

Validate resource catalog:
```bash
python scripts/validate_resource_catalog.py
```

Generate markdown and search index from catalog:
```bash
python scripts/generate_resource_views.py
```

Sync candidate resources for maintainer review:
```bash
python scripts/sync_resources.py --limit 20
```

Review existing resources and remove stale entries before discovery:
```bash
python scripts/review_existing_resources.py
```

Run stricter relevance cleanup mode:
```bash
python scripts/review_existing_resources.py --enforce-pashto-relevance
```

Auto-promote valid candidates into verified catalog:
```bash
python scripts/promote_candidates.py
```

Auto-promote while skipping online URL availability checks:
```bash
python scripts/promote_candidates.py --skip-url-check
```
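
Promoting "valid non-duplicate entries" can be pictured as a filter over the pending list. The sketch below is an assumption about how such a filter might work, not the logic of `promote_candidates.py`: it treats a candidate as valid when it has a non-empty `title` and an `http(s)` `url`, and as a duplicate when its normalized URL is already known. The field names `title` and `url` and both helper names are hypothetical.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Build a comparison key: lower-cased scheme and host plus the path
    without a trailing slash. Query string and fragment are ignored."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def select_promotable(candidates, catalog):
    """Return candidates that look valid and are not already in the catalog."""
    seen = {normalize_url(entry["url"]) for entry in catalog}
    promoted = []
    for candidate in candidates:
        url = candidate.get("url", "")
        # Hypothetical validity rule: a title and an http(s) URL are required.
        if not candidate.get("title") or urlsplit(url).scheme not in ("http", "https"):
            continue
        key = normalize_url(url)
        if key in seen:
            continue  # duplicate of a catalog entry or of an earlier candidate
        seen.add(key)
        promoted.append(candidate)
    return promoted
```

This sketch performs no network requests, which roughly corresponds to running with `--skip-url-check`; an online availability check would be an additional step after this filter.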

Run full repeatable cycle:
```bash
python scripts/run_resource_cycle.py --limit 25
```

Run discovery only:
```bash
python scripts/run_resource_cycle.py --discover-only --limit 25
```
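
For orientation, the cycle can be thought of as running the individual scripts in sequence. The sketch below only builds the command lists. Apart from review happening before discovery, which this README states, the step order and the behaviour of `--discover-only` are assumptions and are not taken from the source of `run_resource_cycle.py`. The function name `build_cycle_commands` is hypothetical.

```python
import sys


def build_cycle_commands(limit=25, discover_only=False):
    """Return the subprocess argument lists for one cycle, in order.

    The ordering is an assumption based on this README, not on the
    source of run_resource_cycle.py.
    """
    python = sys.executable
    commands = [
        [python, "scripts/review_existing_resources.py"],
        [python, "scripts/sync_resources.py", "--limit", str(limit)],
    ]
    if discover_only:
        return commands  # assumed: stop once candidates are collected
    commands += [
        [python, "scripts/promote_candidates.py"],
        [python, "scripts/validate_resource_catalog.py"],
        [python, "scripts/generate_resource_views.py"],
    ]
    return commands


# Print the planned steps without executing anything.
for command in build_cycle_commands(limit=25, discover_only=True):
    print(" ".join(command[1:]))
```

To execute the steps, each list could be passed to `subprocess.run(command, check=True)` so that the cycle stops at the first failing script.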

Check markdown link format:
```bash
python scripts/check_links.py
```

Check markdown links and verify URLs online:
```bash
python scripts/check_links.py --online
```
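
As an illustration of what an offline format check could look for, the sketch below scans markdown text for inline links whose target is empty or lacks a scheme (for example `www.example.org`), since such links typically do not render as clickable. The pattern, the two rules, and the name `find_suspect_links` are assumptions for this example and may differ from what `check_links.py` actually does.

```python
import re

# Inline links of the form [text](target). Links that carry a quoted title
# after the target are not matched by this simplified pattern.
LINK_PATTERN = re.compile(r"\[([^\]]*)\]\(([^)\s]*)\)")


def find_suspect_links(markdown):
    """Return (line_number, target) pairs for links that look unclickable.

    Hypothetical rules: the target is empty, or it starts with "www."
    and therefore has no URL scheme.
    """
    suspects = []
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        for match in LINK_PATTERN.finditer(line):
            target = match.group(2)
            if not target or target.startswith("www."):
                suspects.append((line_number, target))
    return suspects


# Made-up sample text with one empty target and one scheme-less target.
text = (
    "[ok](https://example.org)\n"
    "[bad]()\n"
    "see [site](www.example.org) and [rel](docs/a.md)\n"
)
print(find_suspect_links(text))
```

An online mode would additionally request each `http(s)` target and report failures; that part is left out here so the example stays runnable without network access.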