Upload 4 files

Files changed (4)

LICENSE ADDED

+MIT License
+Copyright (c) 2025
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,3 +1,31 @@
----
-license: apache-2.0
----

+# IB-Physics-Mini-GPT (from-scratch tiny GPT-2)
+A small GPT-2–style causal LM trained from scratch on a compact IB Physics HL corpus,
+then lightly instruction-tuned for short Q&A. The goal is to demonstrate the full pipeline end to end
+(tokenizer → pretrain → SFT → eval → deploy on a HF Space).
+**Why small?** It fits a student budget. **Why physics?** A narrow domain can be covered reasonably well with little data.
+## Quickstart
+```bash
+pip install -r requirements.txt
+# 1) prepare data
+python train/prepare_corpus.py
+python train/build_tokenizer.py
+# 2) pretrain (tiny)
+python train/pretrain.py
+# 3) sft
+python train/sft.py
+# 4) sample
+python train/gen_sample.py --prompt "Explain inertia in one sentence."
+# 5) push to Hugging Face
+python scripts/push_to_hf.py --repo your-username/ib-physics-mini-gpt
+```
+## Demo Space
+This repo includes a Gradio app (`space_app/app.py`). To host it, create a Hugging Face Space,
+point it at this folder, and set the Space SDK to Gradio (Python).
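The contents of `space_app/app.py` are not part of this commit, so the following is only a minimal sketch of what such a Gradio app could look like. The prompt template, the helper names `build_prompt` and `make_answer_fn`, the `max_new_tokens` value, and the model id (taken from the placeholder in the Quickstart) are all assumptions made for illustration.

```python
def build_prompt(question: str) -> str:
    # Hypothetical prompt template; the real SFT format is not shown in this commit.
    return f"Question: {question.strip()}\nAnswer:"


def make_answer_fn(generate):
    """Wrap a text-generation callable (prompt -> full text) as a Q&A function."""
    def answer(question: str) -> str:
        prompt = build_prompt(question)
        text = generate(prompt)
        # Text-generation pipelines return prompt + continuation by default,
        # so drop the prompt if it is echoed back.
        if text.startswith(prompt):
            text = text[len(prompt):]
        return text.strip()
    return answer


def main():
    # Heavy imports are kept inside main() so the helpers above stay testable.
    import gradio as gr
    from transformers import pipeline

    # Placeholder repo id from the Quickstart; replace with your own.
    pipe = pipeline("text-generation", model="your-username/ib-physics-mini-gpt")

    def generate(prompt: str) -> str:
        return pipe(prompt, max_new_tokens=64)[0]["generated_text"]

    gr.Interface(fn=make_answer_fn(generate), inputs="text", outputs="text").launch()

# In a Space, app.py would finish by calling main().
```

Keeping the generation callable injectable means the prompt handling can be checked without downloading a model.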
+## Notes
+- Educational demo; not for safety-critical use.
+- Inspired by classic GPT papers and hands-on books/videos.

model_card.md ADDED

+# IB-Physics-Mini-GPT (from scratch)
+- **Model type:** small GPT-2–style decoder-only LM
+- **Params:** ~30M (n_layer=6, n_head=6, n_embed=384)
+- **Context length:** 256
+- **Training:** tiny pretrain on physics notes → SFT on instruction pairs
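As a rough cross-check of the parameter figure, the count for a standard GPT-2 layout can be estimated by hand. The sketch below assumes several things the card does not state: a vocabulary of exactly 16,000 tokens, a 4x MLP expansion, biases on every linear layer, learned positional embeddings, and an output head tied to the token embedding. `gpt2_param_count` is an illustrative helper, not part of this repo.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size, n_ctx, tie_embeddings=True):
    """Estimate the parameter count of a GPT-2-style decoder (assumed layout)."""
    d = n_embd
    # Token embedding plus learned positional embedding.
    embeddings = vocab_size * d + n_ctx * d
    # Attention: fused QKV projection and output projection, both with bias.
    attn = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: d -> 4d -> d, both with bias.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two LayerNorms per block, each with a weight and a bias vector.
    layer_norms = 2 * (2 * d)
    block = attn + mlp + layer_norms
    final_ln = 2 * d
    # A tied output head reuses the token embedding matrix.
    lm_head = 0 if tie_embeddings else vocab_size * d
    return embeddings + n_layer * block + final_ln + lm_head


tied = gpt2_param_count(n_layer=6, n_embd=384, vocab_size=16_000, n_ctx=256)
untied = gpt2_param_count(6, 384, 16_000, 256, tie_embeddings=False)
print(f"tied: {tied / 1e6:.1f}M, untied: {untied / 1e6:.1f}M")
```

Under these assumptions the configuration gives about 16.9M parameters (about 23.0M with an untied output head), which is below the ~30M quoted, so the trained model may differ from this layout in vocabulary size or layer shape.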
+## Intended Use
+Educational demo and concept explainer for IB Physics HL topics.
+## Limitations
+Small context window (256 tokens) and a tiny training set; not a reliable source of facts. Verify any answer independently.
+## How Trained
+1) Tokenizer: BPE (vocab 16k) on `corpus_raw.txt`.
+2) Pretrain: next-token prediction.
+3) Finetune: instruction-style Q&A (short).
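For readers unfamiliar with step 1, the core of BPE training is a loop that repeatedly fuses the most frequent adjacent symbol pair. The toy below is a pure-Python illustration of that loop on whole words; it is not the code in `train/build_tokenizer.py` (which is not shown here and presumably uses the `tokenizers` library), and the function name and tie-breaking rule are assumptions.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically (arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["force"] * 3 + ["forces"] * 2 + ["mass"]
print(learn_bpe_merges(corpus, num_merges=4))
```

A production tokenizer would typically work on bytes, respect pre-tokenization boundaries, and stop at a target vocabulary size (16k here) rather than a fixed number of merges.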
+## Eval
+- Perplexity on held-out notes (see `eval/` scripts).
+- Manual Q&A sanity checks.
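Perplexity is the exponential of the mean per-token negative log-likelihood on the held-out text. The `eval/` scripts are not included in this commit, so the helper below is only a sketch of that definition; it assumes the per-token losses are already available in nats (for example, from a cross-entropy loss).

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# A model that is uniformly unsure among 4 tokens has perplexity 4.
print(perplexity([math.log(4)] * 10))
```

When averaging over batches, weight by the number of tokens rather than averaging per-batch losses, otherwise short batches are over-counted.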
+## License
+The code is released under the MIT License. Licensing of any dataset you train on is your responsibility.

requirements.txt ADDED

+torch>=2.2
+tokenizers>=0.15
+transformers>=4.43
+datasets>=2.20
+accelerate>=0.33
+peft>=0.12
+tqdm>=4.66
+numpy>=1.26
+gradio>=4.44
+huggingface_hub>=0.23
+pyyaml>=6.0