	---
	language:
	- en
	base_model:
	- CodeSM/John
	license: cc-by-nd-4.0
	tags:
	- LLM
	- Large_Language_Model
	datasets:
	- databricks/databricks-dolly-15k
	---
	# John LLM

	## Setup (15 min)
	```bash
	pip install -r requirements.txt
	```

	Place your text corpus at `data/raw/english.md`.
	- Minimum recommended size: 1 MB of plain text for meaningful training
	- Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
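Before starting the pipeline, it can help to confirm that the corpus is in place and large enough. The following is a minimal sketch, not part of the repository: the helper name `corpus_report` is hypothetical, and it reads "1 MB" as 1,000,000 bytes, which is an assumption.

```python
from pathlib import Path

# Threshold from the recommendation above; 1 MB is read as 1,000,000 bytes (assumption).
MIN_CORPUS_BYTES = 1_000_000


def corpus_report(path: str, min_bytes: int = MIN_CORPUS_BYTES) -> dict:
    """Return size information for the corpus file at `path` (hypothetical helper)."""
    size = Path(path).stat().st_size  # raises FileNotFoundError if the corpus is missing
    return {
        "bytes": size,
        "megabytes": round(size / 1_000_000, 2),
        "meets_minimum": size >= min_bytes,
    }
```

Calling `corpus_report("data/raw/english.md")` from the repository root returns the file size and whether it clears the recommended minimum.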

	## Execution Steps

	### STEP 0 — Data Prep:
	```bash
	python utils/clean_wiki.py
	python data/download_sft.py
	```
	> Outputs: `data/raw/english_clean.txt`, `data/sft_data.jsonl`
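A quick way to sanity-check the fine-tuning file is to confirm that every non-blank line parses as JSON. This sketch assumes only that `data/sft_data.jsonl` holds one JSON value per line; it makes no assumption about field names, and `check_jsonl` is a hypothetical helper rather than something shipped in the repository.

```python
import json


def check_jsonl(path: str) -> tuple[int, list[int]]:
    """Count parseable records and collect the 1-based numbers of bad lines."""
    good, bad_lines = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                good += 1
            except json.JSONDecodeError:
                bad_lines.append(lineno)
    return good, bad_lines
```

An empty `bad_lines` list together with a non-zero record count suggests the download step completed cleanly.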

	### STEP 1 — Train tokenizer:
	```bash
	python tokenizer/train_tokenizer.py
	```
	> Outputs: `tokenizer/spm.model`, `tokenizer/spm.vocab`
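The `.vocab` file that SentencePiece writes is plain text with one `piece<TAB>score` pair per line, so it can be inspected without loading the model. The sketch below uses only the standard library; `read_spm_vocab` is a hypothetical helper name.

```python
def read_spm_vocab(path: str) -> list[tuple[str, float]]:
    """Parse a SentencePiece .vocab file: one `piece<TAB>score` pair per line."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab so the score is always the final field.
            piece, _, score = line.rpartition("\t")
            entries.append((piece, float(score)))
    return entries
```

`len(read_spm_vocab("tokenizer/spm.vocab"))` gives the vocabulary size the tokenizer was actually trained with.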

	### STEP 2 — Prepare dataset:
	```bash
	python training/dataset.py --prepare
	```
	> Outputs: `data/processed/train.bin`, `data/processed/val.bin`
	>
	> Prints the token count and the train/val split.
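The preparation step can be pictured as splitting one long list of token ids and writing each part as a flat binary file. The sketch below is an illustration under loud assumptions: the on-disk format (unsigned 16-bit integers), the 90/10 split, and the function name `split_and_write` are all hypothetical and may differ from what `training/dataset.py` does.

```python
from array import array


def split_and_write(token_ids, train_path, val_path, val_fraction=0.1):
    """Write the head of `token_ids` to train_path and the tail to val_path.

    Ids are stored with array type "H" (unsigned short, 2 bytes on common
    platforms), so every id must be below 65536. Returns (n_train, n_val).
    """
    n_val = int(len(token_ids) * val_fraction)  # assumed 10% validation share
    n_train = len(token_ids) - n_val
    parts = ((train_path, token_ids[:n_train]), (val_path, token_ids[n_train:]))
    for path, chunk in parts:
        with open(path, "wb") as f:
            array("H", chunk).tofile(f)
    return n_train, n_val
```

Taking the validation set from the tail rather than sampling at random keeps it contiguous, which avoids leaking neighbouring text between the two files.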

	### STEP 3 — Pretrain:
	```bash
	python training/pretrain.py
	```
	> Expected: validation loss should drop below roughly 3.5.
	>
	> Checkpoints are saved to `checkpoints/` whenever validation loss improves.
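The "save when val loss improves" rule can be sketched as a small tracker that remembers the best loss seen so far. This is an illustration, not the code in `training/pretrain.py`: the class name `BestCheckpoint` and the `save_fn` callback are hypothetical, and in practice the callback would wrap whatever the script uses to write a checkpoint.

```python
class BestCheckpoint:
    """Call `save_fn` only when the validation loss beats the best seen so far."""

    def __init__(self, save_fn):
        self.save_fn = save_fn      # hypothetical callback: save_fn(step, val_loss)
        self.best = float("inf")    # no loss seen yet

    def update(self, step: int, val_loss: float) -> bool:
        """Record `val_loss`; return True if a checkpoint was saved."""
        if val_loss < self.best:
            self.best = val_loss
            self.save_fn(step, val_loss)
            return True
        return False
```

With this rule the newest file in `checkpoints/` is always the best model so far, not simply the most recent one.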

	### STEP 4 — Fine-tune:
	```bash
	python training/sft.py
	```
	> Outputs: `checkpoints/sft_final.pt`
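Fine-tuning on instruction data usually means turning each record into a prompt and a target response. The sketch below assumes the records keep the field names used by `databricks/databricks-dolly-15k` (`instruction`, `context`, `response`); the template text and the function name `format_example` are placeholders, and the format that `training/sft.py` actually uses may differ.

```python
def format_example(record: dict) -> tuple[str, str]:
    """Turn one instruction record into a (prompt, response) pair.

    Field names follow databricks-dolly-15k; the template is a placeholder.
    """
    prompt = f"### Instruction:\n{record['instruction']}\n"
    if record.get("context"):  # context is optional and often empty
        prompt += f"### Context:\n{record['context']}\n"
    prompt += "### Response:\n"
    return prompt, record["response"]
```

Keeping the prompt and response separate makes it straightforward to compute the loss on response tokens only, if the training script chooses to do so.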

	### STEP 5 — Chat:
	```bash
	python inference/chat.py --checkpoint checkpoints/sft_final.pt
	```

	## Expected Behavior
	- With less than 1 MB of data: the model will overfit, and responses will mostly be memorized text.
	- With 5-20 MB of data: the model should start to generalize and produce novel sentences.
	- With 50 MB or more: the model should behave like a real, if small, language model.

	## Troubleshooting
	- OOM error: reduce `BATCH_SIZE` to 4 or `context_len` to 256 in scripts/config.
	- Loss stuck at ~9.0: the tokenizer was probably not trained; check that `tokenizer/spm.model` exists.
	- Gibberish output: the model needs more data or more training steps.
	- CUDA not found: install a CUDA build of torch with `pip install torch --index-url https://download.pytorch.org/whl/cu124`.
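One way to read the ~9.0 figure in the list above: a model that guesses uniformly over a vocabulary of V tokens has a cross-entropy of ln V, and a loss that never moves from that value suggests the model is learning nothing from its inputs. The sketch below works through the arithmetic. The vocabulary size of 8,000 is an assumption for illustration; the README does not state the size used by `tokenizer/train_tokenizer.py`.

```python
import math


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform guess over `vocab_size` tokens."""
    return math.log(vocab_size)


def implied_vocab_size(loss: float) -> float:
    """Vocabulary size for which a uniform guess would score `loss`."""
    return math.exp(loss)
```

`uniform_loss(8000)` is about 8.99, and `implied_vocab_size(9.0)` is about 8,100, so a loss pinned near 9.0 matches chance-level prediction for a vocabulary of that order.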