metadata
language:
- en
base_model:
- CodeSM/John
license: cc-by-nd-4.0
tags:
- LLM
- Large_Language_Model
datasets:
- databricks/databricks-dolly-15k
John LLM
Setup (15 min)
pip install -r requirements.txt
Place your text corpus at data/raw/english.md.
- Minimum recommended size: 1MB of plain text for meaningful training
- Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
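A quick way to check the corpus against the 1MB recommendation before starting is sketched below. The helper name `corpus_size_ok` and the reading of "1MB" as 1,000,000 bytes are assumptions made for illustration; they are not part of the repository.

```python
import os

MIN_BYTES = 1_000_000  # assumption: "1MB" read as 10^6 bytes


def corpus_size_ok(path, min_bytes=MIN_BYTES):
    """Return True if the file at `path` is at least `min_bytes` long."""
    return os.path.getsize(path) >= min_bytes


if __name__ == "__main__":
    corpus = "data/raw/english.md"  # path from the setup instructions
    if not os.path.exists(corpus):
        print(f"{corpus} not found")
    elif corpus_size_ok(corpus):
        print("corpus meets the recommended minimum size")
    else:
        print("corpus is below 1MB; expect overfitting")
```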
Execution Steps
STEP 0 — Data Prep:
python utils/clean_wiki.py
python data/download_sft.py
Outputs:
data/raw/english_clean.txt, data/sft_data.jsonl
STEP 1 — Train tokenizer:
python tokenizer/train_tokenizer.py
Outputs:
tokenizer/spm.model, tokenizer/spm.vocab
STEP 2 — Prepare dataset:
python training/dataset.py --prepare
Outputs:
data/processed/train.bin, data/processed/val.bin
Prints token count and train/val split.
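The contents of training/dataset.py are not shown in this README, so the following is only a sketch of what a prepare step of this kind commonly does: split a sequence of token ids into train and validation parts and write each as a flat binary file. The 90/10 split, the unsigned 16-bit storage, and all function names are assumptions.

```python
from array import array


def split_tokens(token_ids, val_fraction=0.1):
    """Split token ids into (train, val), keeping the tail for validation."""
    n_val = int(len(token_ids) * val_fraction)
    n_train = len(token_ids) - n_val
    return token_ids[:n_train], token_ids[n_train:]


def write_bin(token_ids, path):
    """Write token ids as unsigned 16-bit integers (ids must be < 65536)."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)


def read_bin(path):
    """Read a file written by write_bin back into a list of ints."""
    data = array("H")
    with open(path, "rb") as f:
        data.frombytes(f.read())
    return data.tolist()


if __name__ == "__main__":
    ids = list(range(1000))  # stand-in for real tokenizer output
    train, val = split_tokens(ids)
    print(f"tokens: {len(ids)}  train: {len(train)}  val: {len(val)}")
```

Taking the validation set from the tail rather than sampling at random keeps both files contiguous, which is convenient when training reads fixed-length windows from the file.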
STEP 3 — Pretrain:
python training/pretrain.py
Expected: val loss should drop below ~3.5.
Checkpoints are saved to checkpoints/ when val loss improves.
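The "save when val loss improves" rule can be sketched as a small tracker. This is a hypothetical illustration of the idea, not the code in training/pretrain.py; the class name `BestCheckpoint` and the loss values are made up.

```python
import math


class BestCheckpoint:
    """Track the best validation loss and report when to save."""

    def __init__(self):
        self.best = math.inf

    def should_save(self, val_loss):
        """Return True and update the record if val_loss is a new best."""
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False


if __name__ == "__main__":
    tracker = BestCheckpoint()
    for step, loss in enumerate([5.2, 4.1, 4.3, 3.4]):  # made-up losses
        if tracker.should_save(loss):
            # a real training script would write the model to disk here
            print(f"step {step}: val loss {loss} improved, saving checkpoint")
```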
STEP 4 — Fine-tune:
python training/sft.py
Outputs:
checkpoints/sft_final.pt
STEP 5 — Chat:
python inference/chat.py --checkpoint checkpoints/sft_final.pt
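For convenience, steps 0 to 4 could be chained in a small runner that stops at the first failing script. The script paths are the ones listed above; the `run_pipeline` function and its `runner` parameter are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Commands copied from the steps above, in order. The chat step is
# left out because it is interactive.
STEPS = [
    ["utils/clean_wiki.py"],
    ["data/download_sft.py"],
    ["tokenizer/train_tokenizer.py"],
    ["training/dataset.py", "--prepare"],
    ["training/pretrain.py"],
    ["training/sft.py"],
]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each step with the current interpreter; stop at the first failure."""
    for args in steps:
        cmd = [sys.executable, *args]
        print("running:", " ".join(cmd))
        result = runner(cmd)
        if result.returncode != 0:
            return False
    return True
```

Calling `run_pipeline()` from the repository root runs the scripts in order and returns False as soon as one exits with a non-zero status. The `runner` argument exists only so the function can be exercised without launching real training.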
Expected Behavior
- With <1MB data: the model will overfit and responses will mostly be memorized text.
- With 5-20MB data: the model should generalize and produce novel sentences.
- With 50MB+ data: the model will feel like a real (small) language model.
Troubleshooting
- OOM error: reduce BATCH_SIZE to 4 or context_len to 256 in scripts/config.
- Loss stuck at ~9.0: tokenizer not trained, check spm.model exists.
- Gibberish output: need more data or more training steps.
- CUDA not found: install torch with pip install torch --index-url https://download.pytorch.org/whl/cu124
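One possible reading of the ~9.0 figure in the list above: a model that assigns equal probability to every token has a cross-entropy of ln(V) for a vocabulary of V tokens, so a loss pinned near that value suggests nothing is being learned. The vocabulary size is not stated in this README; the 8000 below is purely an assumption, chosen because ln(8000) is about 8.99. The helper names are also hypothetical.

```python
import math
import os


def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)


def tokenizer_present(path="tokenizer/spm.model"):
    """Check that the trained tokenizer file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


if __name__ == "__main__":
    print(f"uniform loss for V=8000: {uniform_loss(8000):.2f}")
    print("tokenizer found:", tokenizer_present())
```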