| --- |
| language: |
| - en |
| base_model: |
| - CodeSM/John |
| license: cc-by-nd-4.0 |
| tags: |
| - LLM |
| - Large_Language_Model |
| datasets: |
| - databricks/databricks-dolly-15k |
| --- |
| # John LLM |
|
|
| ## Setup (15 min) |
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| Place your text corpus at `data/raw/english.md`. |
| - Minimum recommended size: 1MB of plain text for meaningful training |
| - Good sources: Project Gutenberg books, Wikipedia dumps, personal notes |
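Before training, it can help to confirm that the corpus meets the size recommendation above. The following is a minimal sketch and not part of the repository's scripts: the helper name `corpus_size_ok` is hypothetical, and reading "1MB" as 1,000,000 bytes is an assumption.

```python
import os

MIN_BYTES = 1_000_000  # assumption: "1MB" interpreted as 10**6 bytes


def corpus_size_ok(path, min_bytes=MIN_BYTES):
    """Return (size_in_bytes, meets_minimum) for the file at `path`."""
    size = os.path.getsize(path)
    return size, size >= min_bytes


if __name__ == "__main__":
    path = "data/raw/english.md"
    if os.path.exists(path):
        size, ok = corpus_size_ok(path)
        status = "OK" if ok else "below the recommended minimum"
        print(f"{path}: {size / 1e6:.2f} MB, {status}")
    else:
        print(f"{path} not found")
```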
|
|
| ## Execution Steps |
|
|
| ### STEP 0: Data Prep |
| ```bash |
| python utils/clean_wiki.py |
| python data/download_sft.py |
| ``` |
| > Outputs: `data/raw/english_clean.txt`, `data/sft_data.jsonl` |
|
|
| ### STEP 1: Train tokenizer |
| ```bash |
| python tokenizer/train_tokenizer.py |
| ``` |
| > Outputs: `tokenizer/spm.model`, `tokenizer/spm.vocab` |
|
|
| ### STEP 2: Prepare dataset |
| ```bash |
| python training/dataset.py --prepare |
| ``` |
| > Outputs: `data/processed/train.bin`, `data/processed/val.bin`. |
| > Also prints the token count and the train/val split. |
|
|
| ### STEP 3: Pretrain |
| ```bash |
| python training/pretrain.py |
| ``` |
| > Expected: validation loss should drop below ~3.5. |
| > Checkpoints are saved to `checkpoints/` whenever validation loss improves. |
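The "save only when validation loss improves" rule can be sketched as follows. This illustrates the general pattern and is not the code in `training/pretrain.py`; the class name `BestLossTracker` and the loss values are hypothetical.

```python
class BestLossTracker:
    """Track the lowest validation loss seen so far."""

    def __init__(self):
        self.best = float("inf")

    def update(self, val_loss):
        """Return True (and record the loss) if `val_loss` is a new best."""
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False


tracker = BestLossTracker()
saved_at = []
for step, val_loss in enumerate([5.2, 4.1, 4.3, 3.4]):  # made-up values
    if tracker.update(val_loss):
        # A real training loop would write the model state to checkpoints/ here.
        saved_at.append(step)
print(saved_at)  # [0, 1, 3]
```

The evaluation at step 2 (loss 4.3) is skipped because it is worse than the best loss so far (4.1), so no checkpoint would be written there.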
|
|
| ### STEP 4: Fine-tune |
| ```bash |
| python training/sft.py |
| ``` |
| > Outputs: `checkpoints/sft_final.pt` |
| |
| ### STEP 5: Chat |
| ```bash |
| python inference/chat.py --checkpoint checkpoints/sft_final.pt |
| ``` |
| |
| ## Expected Behavior |
| - With **<1MB data**: the model will overfit, and responses will mostly be memorized training text. |
| - With **5-20MB data**: the model will start to generalize and produce novel sentences. |
| - With **50MB+ data**: the model will behave like a real (small) language model. |
| |
| ## Troubleshooting |
| - **OOM error**: reduce `BATCH_SIZE` to 4 or `context_len` to 256 in scripts/config. |
| - **Loss stuck at ~9.0**: the tokenizer was probably not trained; check that `tokenizer/spm.model` exists. |
| - **Gibberish output**: the model needs more data or more training steps. |
| - **CUDA not found**: install torch with `pip install torch --index-url https://download.pytorch.org/whl/cu124` |
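On the "loss stuck at ~9.0" symptom: a model that assigns equal probability to every token has a cross-entropy of ln(V), where V is the vocabulary size, so a loss that never moves from that value suggests the model is not learning anything from its inputs. The vocabulary size is not stated in this README; the sizes below are assumptions chosen for illustration, and `uniform_loss` is a hypothetical helper.

```python
import math


def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over `vocab_size` tokens."""
    return math.log(vocab_size)


for vocab_size in (4000, 8000, 16000):  # assumed sizes, not taken from this repo
    print(vocab_size, round(uniform_loss(vocab_size), 2))
```

For an assumed vocabulary of 8,000 tokens this gives about 8.99, which is in line with the ~9.0 figure above.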