NorthernTribe-Research commited on
Commit
6853460
·
verified ·
1 Parent(s): 6ebeddc

Update model card with project overview and parameter visualization.

Browse files
Files changed (1) hide show
  1. README.md +463 -3
README.md CHANGED
@@ -7,6 +7,7 @@ tags:
7
  - reasoning
8
  - peft
9
  - lora
 
10
  base_model: Qwen/Qwen2.5-0.5B-Instruct
11
  base_model_relation: adapter
12
  ---
@@ -24,7 +25,466 @@ base_model_relation: adapter
24
 
25
  `[#---------------------------]` trainable share
26
 
27
- ## Notes
28
 
29
- - Parameter counts are generated from the latest training summary.
30
- - See `training_summary.json` for full run telemetry.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7
  - reasoning
8
  - peft
9
  - lora
10
+ - lean
11
  base_model: Qwen/Qwen2.5-0.5B-Instruct
12
  base_model_relation: adapter
13
  ---
 
25
 
26
  `[#---------------------------]` trainable share
27
 
28
+ ## Training Reference
29
 
30
+ - Summary source: `workspace/runs/math-conjecture-sota-050b-quick/training_summary.json`
31
+ - This card is generated from the repository README plus the latest training summary.
32
+
33
+ ## Project Documentation
34
+
35
+ <details>
36
+ <summary>Expand full project README context</summary>
37
+
38
+ This repository builds a merged dataset for training math AI systems aimed at
39
+ **unsolved conjecture reasoning**. The v1 pipeline combines local conjecture
40
+ data with curated open-license Hugging Face datasets (competition, structured
41
+ reasoning, and formal proof corpora).
42
+
43
+ It now also reaches into broader open math sources such as OpenR1-Math-220k,
44
+ FineProofs-SFT, LeanStatement_CoT, NuminaMath-LEAN, and DeepSeek-Prover-V1 to
45
+ better cover proof traces, theorem formalization, and reasoning-heavy
46
+ competition data.
47
+
48
+ ## Repository Layout
49
+
50
+ ```text
51
+ configs/
52
+ source_registry.yaml
53
+ data/
54
+ raw/
55
+ unsolved_conjectures.jsonl
56
+ processed/
57
+ train.jsonl
58
+ validation.jsonl
59
+ test.jsonl
60
+ manifest.json
61
+ interim/
62
+ discovery.json
63
+ pull_report.json
64
+ normalized_rows.jsonl
65
+ merged_train.jsonl
66
+ merged_validation.jsonl
67
+ merged_test.jsonl
68
+ normalize_stats.json
69
+ merge_stats.json
70
+ validation_report.json
71
+ releases/
72
+ v1/
73
+ train.parquet
74
+ validation.parquet
75
+ test.parquet
76
+ manifest.json
77
+ excluded_sources.json
78
+ dataset_card.md
79
+ size_report.json
80
+ push_report.json
81
+ schemas/
82
+ conjecture_record.schema.json
83
+ training_example.schema.json
84
+ normalized_training_row.schema.json
85
+ scripts/
86
+ build_dataset.py
87
+ validate_dataset.py
88
+ pipeline.py
89
+ manage_hf_bucket.py
90
+ release_and_space_run.py
91
+ model_development/
92
+ configs/
93
+ math_conjecture_sota.yaml
94
+ math_conjecture_sota_state_of_art.yaml
95
+ math_conjecture_sft.yaml
96
+ math_conjecture_scratch.yaml
97
+ math_conjecture_scratch_smoke.yaml
98
+ qwen25_math_sota.yaml
99
+ scripts/
100
+ train_sft.py
101
+ train_sota.py
102
+ train_scratch.py
103
+ eval_sota.py
104
+ generate_rft_data.py
105
+ merge_and_push.py
106
+ requirements.txt
107
+ README.md
108
+ space_trainer/
109
+ app.py
110
+ configs/
111
+ math_conjecture_sota.yaml
112
+ math_conjecture_sota_state_of_art.yaml
113
+ qwen25_math_sota.yaml
114
+ scripts/
115
+ train_sota.py
116
+ eval_sota.py
117
+ requirements.txt
118
+ README.md
119
+ space_conjecture_lab/
120
+ app.py
121
+ requirements.txt
122
+ README.md
123
+ docs/
124
+ math_conjecture_lean_ai_rollout_runbook.md
125
+ model_sota_strategy_2026-03-23.md
126
+ state_of_art_math_blueprint_2026-03-25.md
127
+ ```
128
+
129
+ ## Existing Local Dataset Build
130
+
131
+ ```bash
132
+ python scripts/build_dataset.py \
133
+ --seed-path data/raw/unsolved_conjectures.jsonl \
134
+ --output-dir data/processed \
135
+ --split-seed 17
136
+
137
+ python scripts/validate_dataset.py \
138
+ --seed-path data/raw/unsolved_conjectures.jsonl \
139
+ --processed-dir data/processed
140
+ ```
141
+
142
+ ## Merged Corpus Pipeline (v1)
143
+
144
+ Use the project virtualenv:
145
+
146
+ ```bash
147
+ .venv/bin/python scripts/pipeline.py discover
148
+ .venv/bin/python scripts/pipeline.py pull
149
+ .venv/bin/python scripts/pipeline.py normalize
150
+ .venv/bin/python scripts/pipeline.py merge
151
+ .venv/bin/python scripts/pipeline.py validate
152
+ .venv/bin/python scripts/pipeline.py pack
153
+ .venv/bin/python scripts/pipeline.py push
154
+ ```
155
+
156
+ Default publish target:
157
+ - HF account: from env (`HF_USERNAME`/`HF_NAMESPACE`) or `huggingface-api-key.json`
158
+ - dataset repo: `HF_DATASET_REPO_ID` or fallback `<username>/math-conjecture-training-corpus`
159
+ - visibility: public dataset repo
160
+
161
+ The registry keeps these broader sources capped and split-aware so the corpus
162
+ can grow materially without assuming unrealistic storage or training budgets.
163
+ License policy checks now evaluate both dataset card fields (`license` and
164
+ `license_name`), handle list-valued license metadata, and still block unresolved
165
+ custom/unknown licenses.
166
+
167
+ ## Hugging Face Buckets
168
+
169
+ Current Hugging Face Hub docs expose storage buckets through both the CLI and
170
+ Python API.
171
+
172
+ CLI commands documented today:
173
+ - `hf buckets create BUCKET_ID [--private] [--exist-ok]`
174
+ - `hf buckets info BUCKET_ID`
175
+ - `hf buckets list [NAMESPACE|BUCKET_ID]`
176
+ - `hf buckets delete BUCKET_ID`
177
+ - `hf buckets move FROM_ID TO_ID`
178
+ - `hf buckets remove ...`
179
+ - `hf buckets sync ...`
180
+ - `hf buckets cp ...`
181
+
182
+ Python bucket helpers are documented as available since `huggingface_hub`
183
+ `v1.5.0`, including `create_bucket`, `list_bucket_tree`, and `sync_bucket`.
184
+
185
+ This checkout currently has `huggingface_hub 1.4.1`, so the bucket CLI/API is
186
+ not available here yet. The helper script below detects that mismatch and
187
+ prints a clear upgrade/permission error instead of failing silently.
188
+
189
+ Examples:
190
+
191
+ ```bash
192
+ # Check whether a bucket exists in the current namespace
193
+ python scripts/manage_hf_bucket.py status my-bucket --namespace NorthernTribe-Research
194
+
195
+ # Create a private bucket
196
+ python scripts/manage_hf_bucket.py create my-bucket --namespace NorthernTribe-Research --private
197
+
198
+ # Ensure a bucket exists, creating it if needed
199
+ python scripts/manage_hf_bucket.py ensure my-bucket --namespace NorthernTribe-Research --private
200
+
201
+ # Upstream CLI, when using a newer huggingface_hub install
202
+ hf buckets create NorthernTribe-Research/my-bucket --private
203
+ hf buckets list NorthernTribe-Research
204
+ hf buckets info NorthernTribe-Research/my-bucket
205
+ ```
206
+
207
+ If you want the bucket commands in the CLI on this machine, upgrade the Hub
208
+ package to a version at or above `huggingface_hub>=1.5.0`.
209
+
210
+ ## hf-mount Setup (Standard Workflow)
211
+
212
+ Use `scripts/hf_mount_setup.sh` as the project-standard entrypoint for
213
+ installing and operating `hf-mount`.
214
+
215
+ One-time bootstrap (installs binaries, persists PATH, writes runtime env):
216
+
217
+ ```bash
218
+ scripts/hf_mount_setup.sh bootstrap --persist-path
219
+ ```
220
+
221
+ This writes helper defaults to:
222
+ - `workspace/runtime/hf_mount.env`
223
+
224
+ In new shells, load project defaults:
225
+
226
+ ```bash
227
+ source workspace/runtime/hf_mount.env
228
+ ```
229
+
230
+ `hf_mount_setup.sh` now writes and maintains these Hugging Face project defaults:
231
+
232
+ - `HF_NAMESPACE`
233
+ - `HF_DATASET_REPO_ID`
234
+ - `HF_MODEL_REPO_ID`
235
+ - `HF_TRAINER_SPACE_REPO_ID`
236
+ - `HF_LAB_SPACE_REPO_ID`
237
+ - `HF_OPS_BUCKET_ID`
238
+
239
+ Common operations:
240
+
241
+ ```bash
242
+ # Mount the dataset repo (read-only)
243
+ scripts/hf_mount_setup.sh mount-repo \
244
+ --repo "datasets/${HF_DATASET_REPO_ID}" \
245
+ --target workspace/hf_mounts/dataset
246
+
247
+ # Mount the model repo (read-only)
248
+ scripts/hf_mount_setup.sh mount-repo \
249
+ --repo "${HF_MODEL_REPO_ID}" \
250
+ --target workspace/hf_mounts/model
251
+
252
+ # Mount a Space repo (read-only)
253
+ scripts/hf_mount_setup.sh mount-repo \
254
+ --repo "spaces/${HF_TRAINER_SPACE_REPO_ID}" \
255
+ --target workspace/hf_mounts/space_trainer
256
+
257
+ # Mount a bucket (read-write by default)
258
+ scripts/hf_mount_setup.sh mount-bucket \
259
+ --bucket "${HF_OPS_BUCKET_ID}" \
260
+ --target workspace/hf_mounts/math_conjecture_ops
261
+
262
+ # Inspect and stop mounts
263
+ scripts/hf_mount_setup.sh status
264
+ scripts/hf_mount_setup.sh stop --target workspace/hf_mounts/dataset
265
+ ```
266
+
267
+ If your host requires privileged NFS/FUSE mounts, add `--sudo` to mount/stop
268
+ commands.
269
+
270
+ ## Push-And-Run Orchestrator
271
+
272
+ Use `scripts/release_and_space_run.py` to execute the full promotion flow for:
273
+
274
+ - dataset repo: `HF_DATASET_REPO_ID` (default: `NorthernTribe-Research/math-conjecture-training-corpus`)
275
+ - space repo: `HF_TRAINER_SPACE_REPO_ID` (default: `NorthernTribe-Research/math_trainer`)
276
+ - model repo: `HF_MODEL_REPO_ID` (default: `NorthernTribe-Research/math-conjecture-model`)
277
+ - ops bucket: `HF_OPS_BUCKET_ID` (default: `NorthernTribe-Research/math-conjecture-ops`)
278
+
279
+ Install/upgrade tooling in the virtualenv:
280
+
281
+ ```bash
282
+ .venv/bin/python -m pip install -r model_development/requirements.txt
283
+ ```
284
+
285
+ Run the rollout sequence:
286
+
287
+ ```bash
288
+ .venv/bin/python scripts/release_and_space_run.py prepare
289
+ .venv/bin/python scripts/release_and_space_run.py bucket
290
+ .venv/bin/python scripts/release_and_space_run.py publish-dataset
291
+ .venv/bin/python scripts/release_and_space_run.py deploy-space
292
+ .venv/bin/python scripts/release_and_space_run.py run-space
293
+ .venv/bin/python scripts/release_and_space_run.py verify
294
+ ```
295
+
296
+ `run-space` now retries transient client/runtime failures by default. Tune with
297
+ `--max-retries` and `--retry-sleep-seconds`.
298
+
299
+ Non-destructive Space API probe (preflight only):
300
+
301
+ ```bash
302
+ .venv/bin/python scripts/release_and_space_run.py run-space \
303
+ --preflight-only \
304
+ --no-push-to-hub \
305
+ --no-run-eval \
306
+ --max-stages 1 \
307
+ --allow-failed-result
308
+ ```
309
+
310
+ Optional safety pins for dataset publish/verify:
311
+
312
+ ```bash
313
+ .venv/bin/python scripts/release_and_space_run.py publish-dataset \
314
+ --expected-created-at 2026-03-23T11:03:54+00:00 \
315
+ --expected-total-rows 473349
316
+ ```
317
+
318
+ `verify` is strict by default and expects a successful `run-space` report plus
319
+ model artifacts. For baseline infrastructure checks without a completed run:
320
+
321
+ ```bash
322
+ .venv/bin/python scripts/release_and_space_run.py verify \
323
+ --no-require-space-run-success \
324
+ --no-require-model-artifacts
325
+ ```
326
+
327
+ Generated reports are written to `data/releases/v1/`:
328
+
329
+ - `promotion_prepare_report.json`
330
+ - `promotion_bucket_report.json`
331
+ - `promotion_dataset_publish_report.json`
332
+ - `promotion_space_deploy_report.json`
333
+ - `promotion_space_run_report.json`
334
+ - `promotion_verify_report.json`
335
+
336
+ ## Model Development (Lean + SOTA Math Profiles)
337
+
338
+ The model fine-tuning workspace is under `model_development/`.
339
+ The SOTA curriculum now profiles responses across simple, intermediate,
340
+ advanced, and Lean-formalized bands.
341
+
342
+ ```bash
343
+ .venv/bin/python -m pip install -r model_development/requirements.txt
344
+
345
+ .venv/bin/python model_development/scripts/train_sft.py \
346
+ --config model_development/configs/math_conjecture_sft.yaml
347
+
348
+ .venv/bin/python model_development/scripts/train_sft.py \
349
+ --config model_development/configs/math_conjecture_sft.yaml \
350
+ --max-train-samples 120000
351
+
352
+ .venv/bin/python model_development/scripts/train_sota.py \
353
+ --config model_development/configs/math_conjecture_sota.yaml
354
+
355
+ .venv/bin/python model_development/scripts/train_sota.py \
356
+ --config model_development/configs/qwen25_math_sota.yaml
357
+
358
+ .venv/bin/python model_development/scripts/train_sota.py \
359
+ --config model_development/configs/math_conjecture_sota_state_of_art.yaml
360
+
361
+ .venv/bin/python model_development/scripts/train_scratch.py \
362
+ --config model_development/configs/math_conjecture_scratch.yaml \
363
+ --init-only
364
+
365
+ .venv/bin/python model_development/scripts/train_scratch.py \
366
+ --config model_development/configs/math_conjecture_scratch_smoke.yaml \
367
+ --dry-run
368
+ ```
369
+
370
+ The SOTA eval report now includes `difficulty_band_metrics`,
371
+ `response_profile_metrics`, and `simple_to_lean`.
372
+ The training summary now records model parameter stats under
373
+ `model.parameter_counts` (total/trainable/frozen + ratio).
374
+ When `push_to_hub` is enabled, `train_sota.py` now also updates model
375
+ `README.md` on Hugging Face with a parameter-count visualization table and
376
+ trainable-share bar.
377
+ The eval flow also supports consensus/verifier-aware selection metrics such as
378
+ `selected_pass_at_k`, `consensus_rate`, and `consensus_pass_at_k`.
379
+ State-of-art eval now also supports stricter grading controls:
380
+ `--allow-substring-match` (off by default) and optional SymPy symbolic checks
381
+ (`symbolic_verifier_enabled` reported in eval output).
382
+
383
+ Self-improvement (rejection-sampling) data generation:
384
+
385
+ ```bash
386
+ .venv/bin/python model_development/scripts/generate_rft_data.py \
387
+ --config model_development/configs/math_conjecture_sota_state_of_art.yaml \
388
+ --adapter-path model_development/runs/math-conjecture-sota-state-of-art/final_adapter \
389
+ --input-file data/releases/v1/train.parquet \
390
+ --output-file model_development/runs/math-conjecture-rft/rft_train.parquet \
391
+ --k 8 \
392
+ --max-samples 2000
393
+ ```
394
+
395
+ SOTA blueprint and implementation notes:
396
+ - `docs/state_of_art_math_blueprint_2026-03-25.md`
397
+
398
+ Optional adapter merge and model publish:
399
+
400
+ ```bash
401
+ .venv/bin/python model_development/scripts/merge_and_push.py \
402
+ --adapter-path model_development/runs/math-conjecture-sota/final_adapter \
403
+ --output-dir model_development/merged/math-conjecture-model \
404
+ --push-to-hub \
405
+ --repo-id NorthernTribe-Research/math-conjecture-model
406
+ ```
407
+
408
+ The publish flow now auto-generates a Hugging Face model card `README.md` with
409
+ parameter-count visualization (total/trainable/frozen + trainable-share bar),
410
+ so model size is visible directly on the Hub page.
411
+
412
+ ### llama.cpp Inference (GGUF)
413
+
414
+ `llama.cpp` inference is validated in this workspace for the trained
415
+ math-conjecture adapter flow.
416
+
417
+ Build `llama.cpp`:
418
+
419
+ ```bash
420
+ git clone --depth 1 https://github.com/ggml-org/llama.cpp workspace/llama.cpp
421
+ cmake -S workspace/llama.cpp -B workspace/llama.cpp/build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
422
+ cmake --build workspace/llama.cpp/build -j
423
+ ```
424
+
425
+ Merge adapter to full model, then convert to GGUF:
426
+
427
+ ```bash
428
+ .venv/bin/python model_development/scripts/merge_and_push.py \
429
+ --adapter-path workspace/runs/math-conjecture-sota-050b-quick/final_adapter \
430
+ --output-dir workspace/runs/math-conjecture-sota-050b-quick/merged_model
431
+
432
+ .venv/bin/python workspace/llama.cpp/convert_hf_to_gguf.py \
433
+ workspace/runs/math-conjecture-sota-050b-quick/merged_model \
434
+ --outfile workspace/runs/math-conjecture-sota-050b-quick/math-conjecture-sota-050b-f16.gguf \
435
+ --outtype f16
436
+ ```
437
+
438
+ Optional quantization:
439
+
440
+ ```bash
441
+ ./workspace/llama.cpp/build/bin/llama-quantize \
442
+ workspace/runs/math-conjecture-sota-050b-quick/math-conjecture-sota-050b-f16.gguf \
443
+ workspace/runs/math-conjecture-sota-050b-quick/math-conjecture-sota-050b-q4_k_m.gguf \
444
+ Q4_K_M
445
+ ```
446
+
447
+ Run one-shot inference via helper script:
448
+
449
+ ```bash
450
+ scripts/llama_cpp_infer.sh \
451
+ --model workspace/runs/math-conjecture-sota-050b-quick/math-conjecture-sota-050b-q4_k_m.gguf \
452
+ --prompt "2+2=" \
453
+ --n-predict 8
454
+ ```
455
+
456
+ You can also save output:
457
+
458
+ ```bash
459
+ scripts/llama_cpp_infer.sh \
460
+ --model workspace/runs/math-conjecture-sota-050b-quick/math-conjecture-sota-050b-f16.gguf \
461
+ --prompt "Solve: a+b=10 and a-b=4. Return JSON with keys a and b only." \
462
+ --n-predict 64 \
463
+ --out workspace/runs/math-conjecture-sota-050b-quick/llama_cpp_inference.json.txt
464
+ ```
465
+
466
+ ## Hugging Face Space Trainer
467
+
468
+ `space_trainer/` contains a GPU Space app that runs staged training and pushes
469
+ artifacts to `NorthernTribe-Research/math-conjecture-model`.
470
+
471
+ ## Conjecture Showcase Space
472
+
473
+ `space_conjecture_lab/` contains a separate Lean+AI conjecture analysis Space
474
+ that:
475
+
476
+ - loads evidence from `NorthernTribe-Research/math-conjecture-training-corpus`,
477
+ - attempts conjecture analysis with `NorthernTribe-Research/math-conjecture-model`,
478
+ - falls back to a base model when needed,
479
+ - always displays explicit work traces, prompt text, and a Lean stub in UI.
480
+
481
+ ## Policy
482
+
483
+ - Include only open, known-license, non-gated datasets from the registry.
484
+ - Exclude unknown-license and gated datasets in v1.
485
+ - Prefer capped, high-signal subsets for very large sources so training stays
486
+ practical while coverage expands.
487
+ - Keep release compact with deterministic filtering, de-duplication, and
488
+ split assignment by hashed prompt.
489
+
490
+ </details>