barygeferson

Update README.md

ed55da7 verified about 2 months ago

metadata

language:
  - en
base_model:
  - CodeSM/John
license: cc-by-nd-4.0
tags:
  - LLM
  - Large_Language_Model
datasets:
  - databricks/databricks-dolly-15k

John LLM

Setup (15 min)

pip install -r requirements.txt

Place your text corpus at data/raw/english.md.

Minimum recommended size: 1MB of plain text for meaningful training
Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
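Before running the pipeline, it can help to confirm that the corpus meets the size guideline above. The following is a minimal sketch, not part of the repo: the helper names `corpus_size_mb` and `check_corpus` are hypothetical, and 1MB is taken to mean 1,000,000 bytes.

```python
import os

MIN_CORPUS_MB = 1.0  # guideline from this README

def corpus_size_mb(path):
    """Return the file size in megabytes (assuming 1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000

def check_corpus(path="data/raw/english.md"):
    """Return True if the corpus file exists and meets the size guideline."""
    if not os.path.isfile(path):
        print(f"Missing corpus: {path}")
        return False
    size = corpus_size_mb(path)
    if size < MIN_CORPUS_MB:
        print(f"Corpus is only {size:.2f} MB; expect overfitting below {MIN_CORPUS_MB} MB")
        return False
    print(f"Corpus OK: {size:.2f} MB")
    return True
```

Calling `check_corpus()` from the repo root checks the default path; pass a different path if your corpus lives elsewhere.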

Execution Steps

STEP 0 — Data Prep:

python utils/clean_wiki.py
python data/download_sft.py

Outputs: data/raw/english_clean.txt, data/sft_data.jsonl
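To inspect data/sft_data.jsonl after this step, something like the sketch below may be useful. It assumes each line is a JSON object that keeps the databricks-dolly-15k field names (`instruction`, `context`, `response`); whether data/download_sft.py preserves them is not confirmed here. The prompt template and both function names are hypothetical, not the format used by training/sft.py.

```python
import json

def format_example(record):
    """Turn one instruction record into a single training string.

    Field names follow databricks-dolly-15k; the template itself is
    an illustrative assumption, not taken from this repo.
    """
    parts = [f"### Instruction:\n{record['instruction']}"]
    context = record.get("context", "")
    if context:  # dolly records often have an empty context
        parts.append(f"### Context:\n{context}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)

def load_sft_examples(path="data/sft_data.jsonl"):
    """Read a JSONL file (one JSON object per line) into formatted strings."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                examples.append(format_example(json.loads(line)))
    return examples
```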

STEP 1 — Train tokenizer:

python tokenizer/train_tokenizer.py

Outputs: tokenizer/spm.model, tokenizer/spm.vocab

STEP 2 — Prepare dataset:

python training/dataset.py --prepare

Outputs: data/processed/train.bin, data/processed/val.bin. Also prints the token count and the train/val split.
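The on-disk layout of the .bin files is not documented here. As an illustration of what a prepare step like this typically does, the sketch below splits a token list into contiguous train and validation parts and stores them as raw unsigned 16-bit integers. The uint16 format, the split fraction, and all function names are assumptions, so check training/dataset.py for the actual behavior.

```python
from array import array

def split_tokens(token_ids, val_fraction=0.1):
    """Split a token list into contiguous train and validation parts."""
    n_val = int(len(token_ids) * val_fraction)
    n_train = len(token_ids) - n_val
    return token_ids[:n_train], token_ids[n_train:]

def write_bin(token_ids, path):
    """Write token IDs as unsigned 16-bit ints (assumes vocab size < 65536)."""
    with open(path, "wb") as f:
        array("H", token_ids).tofile(f)

def read_bin(path):
    """Read a file written by write_bin back into a list of ints."""
    with open(path, "rb") as f:
        data = f.read()
    tokens = array("H")
    tokens.frombytes(data)
    return tokens.tolist()
```

Holding out the tail of the token stream, rather than sampling random tokens, keeps validation text contiguous and separate from the training text.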

STEP 3 — Pretrain:

python training/pretrain.py

Expected: val loss should drop below ~3.5. Checkpoints are saved to checkpoints/ when val loss improves.
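The "save only when val loss improves" rule can be sketched as follows. This is an illustration of the idea rather than the code in training/pretrain.py; the class name is hypothetical and the loss values are made up for the demo.

```python
class BestCheckpointTracker:
    """Track the lowest validation loss seen so far."""

    def __init__(self):
        self.best_val_loss = float("inf")

    def should_save(self, val_loss):
        """Return True (and record the loss) only when val_loss is a new best."""
        if val_loss < self.best_val_loss:
            self.best_val_loss = val_loss
            return True
        return False

tracker = BestCheckpointTracker()
# Made-up (step, val_loss) pairs, for illustration only
for step, val_loss in [(100, 6.2), (200, 4.8), (300, 5.0), (400, 3.4)]:
    if tracker.should_save(val_loss):
        # A real training loop would write the model to checkpoints/ here
        print(f"step {step}: val loss {val_loss:.2f} improved, saving checkpoint")
```

In this demo, step 300 is skipped because 5.0 is worse than the best loss seen so far (4.8).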

STEP 4 — Fine-tune:

python training/sft.py

Outputs: checkpoints/sft_final.pt

STEP 5 — Chat:

python inference/chat.py --checkpoint checkpoints/sft_final.pt

Expected Behavior

With <1MB data: model will overfit, responses will be memorized text.
With 5-20MB data: model will generalize and produce novel sentences.
With 50MB+ data: model will feel like a real (small) language model.

Troubleshooting

OOM error: reduce BATCH_SIZE to 4 or context_len to 256 in scripts/config.
Loss stuck at ~9.0: the tokenizer was probably not trained; check that tokenizer/spm.model exists.
Gibberish output: need more data or more training steps.
CUDA not found: install torch with pip install torch --index-url https://download.pytorch.org/whl/cu124
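One way to read the ~9.0 figure in the list above: a model that assigns equal probability to every token has a cross-entropy loss of ln(V), where V is the vocabulary size. A loss stuck near that value suggests the model is learning nothing from its inputs. The vocabulary sizes below are assumptions for illustration; the real value is whatever tokenizer/train_tokenizer.py is configured to use.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)

# Hypothetical vocabulary sizes; 8000 gives a loss of about 8.99
for vocab_size in (4000, 8000, 16000):
    print(vocab_size, round(uniform_loss(vocab_size), 2))
```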