adityashisharma committed on
Commit bbbfdda · verified · 1 Parent(s): 6fccc16

Upload 4 files

Files changed (4)
  1. LICENSE +21 -0
  2. README.md +31 -3
  3. model_card.md +24 -0
  4. requirements.txt +11 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to do so, subject to the
+ following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,31 @@
- ---
- license: apache-2.0
- ---
+ # IB-Physics-Mini-GPT (from-scratch tiny GPT-2)
+
+ A small GPT-2–style causal LM trained from scratch on a compact IB Physics HL corpus,
+ then lightly instruction-tuned for short Q&A. Purpose: show end-to-end skill
+ (tokenizer → pretrain → SFT → eval → deploy on a HF Space).
+
+ **Why small?** It fits a student budget. **Why physics?** A narrow domain gives good coverage with little data.
+
+ ## Quickstart
+ ```bash
+ pip install -r requirements.txt
+ # 1) prepare data
+ python train/prepare_corpus.py
+ python train/build_tokenizer.py
+ # 2) pretrain (tiny)
+ python train/pretrain.py
+ # 3) SFT
+ python train/sft.py
+ # 4) sample
+ python train/gen_sample.py --prompt "Explain inertia in one sentence."
+ # 5) push to Hugging Face
+ python scripts/push_to_hf.py --repo your-username/ib-physics-mini-gpt
+ ```
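The sampling in step 4 boils down to a temperature-scaled softmax over the model's logits followed by a categorical draw. A minimal pure-Python sketch of that idea (`sample_next` is a hypothetical helper for illustration, not the actual code in `train/gen_sample.py`):

```python
import math
import random

def sample_next(logits, temperature=0.8, seed=None):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = random.Random(seed).random(), 0.0
    for i, p in enumerate(probs):             # inverse-CDF categorical draw
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1                     # guard against float rounding
```

Lower temperatures concentrate the distribution on the argmax; higher temperatures flatten it toward uniform sampling.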
+
+ ## Demo Space
+ This repo includes a Gradio app (`space_app/app.py`). Create a Hugging Face Space,
+ point it at this folder, and set the Space SDK to Gradio with a Python backend.
+
+ ## Notes
+ - Educational demo; not for safety-critical use.
+ - Inspired by classic GPT papers and hands-on books/videos.
model_card.md ADDED
@@ -0,0 +1,24 @@
+ # IB-Physics-Mini-GPT (from scratch)
+
+ **Model type:** small GPT-2–style decoder-only LM
+ **Params:** ~30M (n_layer=6, n_head=6, n_embed=384)
+ **Context length:** 256
+ **Training:** tiny pretrain on physics notes → SFT on instruction pairs
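The parameter count can be sanity-checked from these hyperparameters. A back-of-the-envelope sketch, assuming the standard GPT-2 parameterization with tied input/output embeddings (`gpt2_params` is an illustrative helper, not part of this repo); note that ~30M matches GPT-2's full 50k vocabulary, while a 16k BPE vocab gives closer to 17M:

```python
def gpt2_params(vocab, n_ctx, n_layer, n_embd):
    """Rough GPT-2 parameter count with tied input/output embeddings."""
    d = n_embd
    emb = vocab * d + n_ctx * d                   # token + position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused qkv proj + output proj
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # 4x-expansion feed-forward
    ln = 2 * (2 * d)                              # two LayerNorms per block
    block = attn + mlp + ln
    return emb + n_layer * block + 2 * d          # + final LayerNorm

print(gpt2_params(16_000, 256, 6, 384))   # ≈ 16.9M with a 16k vocab
print(gpt2_params(50_257, 256, 6, 384))   # ≈ 30.0M with GPT-2's 50k vocab
```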
+
+ ## Intended Use
+ Educational demo and concept explainer for IB Physics HL topics.
+
+ ## Limitations
+ Small context, tiny dataset, not a fact oracle. Double-check results.
+
+ ## How Trained
+ 1) Tokenizer: BPE (vocab 16k) on `corpus_raw.txt`.
+ 2) Pretrain: next-token prediction.
+ 3) Finetune: instruction-style Q&A (short).
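The BPE step can be illustrated with a toy merge loop: repeatedly find the most frequent adjacent symbol pair and merge it into one token. This is a pure-Python sketch of the algorithm on a two-word physics corpus, not the `tokenizers` library presumably used by `train/build_tokenizer.py`:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over a {space-joined word: frequency} dict."""
    counts = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(pair, words):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    a, b = pair
    new = {}
    for word, freq in words.items():
        syms = word.split()
        out, i = [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        new[" ".join(out)] = freq
    return new

# Toy corpus: word -> frequency, pre-split into characters
words = {"i n e r t i a": 5, "i n e l a s t i c": 3}
merges = []
for _ in range(4):
    best = max(get_pair_counts(words), key=get_pair_counts(words).get)
    merges.append(best)
    words = merge_pair(best, words)

print(merges)  # [('i', 'n'), ('in', 'e'), ('t', 'i'), ('ine', 'r')]
```

A real run simply repeats this until the vocabulary reaches the target size (16k here), with byte-level pre-tokenization handling whitespace and unknown characters.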
+
+ ## Eval
+ - Perplexity on held-out notes (see `eval/` scripts).
+ - Manual Q&A sanity checks.
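Perplexity is the exponentiated mean per-token negative log-likelihood. A minimal sketch of the metric the `eval/` scripts presumably compute (the helper name is illustrative):

```python
import math

def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods in nats.

    PPL = exp(mean(NLL)). Lower is better; a uniform guesser over a
    16k-token vocabulary would score ~16000.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning probability 1/4 to every held-out token has PPL 4
print(perplexity([math.log(4)] * 10))  # → 4.0 (up to float rounding)
```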
+
+ ## License
+ MIT for code. Dataset licensing is your responsibility.
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ torch>=2.2
+ tokenizers>=0.15
+ transformers>=4.43
+ datasets>=2.20
+ accelerate>=0.33
+ peft>=0.12
+ tqdm>=4.66
+ numpy>=1.26
+ gradio>=4.44
+ huggingface_hub>=0.23
+ pyyaml>=6.0