---
language:
- en
base_model:
- CodeSM/John
license: cc-by-nd-4.0
tags:
- LLM
- Large_Language_Model
datasets:
- databricks/databricks-dolly-15k
---
# John LLM
## Setup (15 min)
```bash
pip install -r requirements.txt
```
Place your text corpus at `data/raw/english.md`.
- Minimum recommended size: 1 MB of plain text for meaningful training
- Good sources: Project Gutenberg books, Wikipedia dumps, personal notes
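
Before starting, it can help to confirm that the corpus meets the size guideline above. A minimal sketch, assuming the corpus path given in this README and reading "1 MB" as 1,000,000 bytes; the `corpus_size_ok` helper is hypothetical and not part of the repository:

```python
from pathlib import Path

MIN_BYTES = 1_000_000  # assumption: "1 MB" read as 10^6 bytes


def corpus_size_ok(path, min_bytes=MIN_BYTES):
    """Return (ok, size_in_bytes) for the corpus file at `path`."""
    p = Path(path)
    if not p.is_file():
        # A missing file is reported as size 0 rather than raising
        return False, 0
    size = p.stat().st_size
    return size >= min_bytes, size


if __name__ == "__main__":
    ok, size = corpus_size_ok("data/raw/english.md")
    print(f"corpus size: {size / 1e6:.2f} MB, meets minimum: {ok}")
```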
## Execution Steps
### STEP 0: Data Prep
```bash
python utils/clean_wiki.py
python data/download_sft.py
```
> Outputs: `data/raw/english_clean.txt`, `data/sft_data.jsonl`
### STEP 1: Train tokenizer
```bash
python tokenizer/train_tokenizer.py
```
> Outputs: `tokenizer/spm.model`, `tokenizer/spm.vocab`
### STEP 2: Prepare dataset
```bash
python training/dataset.py --prepare
```
> Outputs: `data/processed/train.bin`, `data/processed/val.bin`
> Prints token count and train/val split
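
The on-disk format of `train.bin` and `val.bin` is not documented here. As an illustration of what a prepare step of this kind typically does, the sketch below splits a list of token ids 90/10 and writes each part as raw unsigned 16-bit integers. The split ratio, the 16-bit width, and the `split_and_write` helper are all assumptions, not the actual behavior of `training/dataset.py`:

```python
from array import array
from pathlib import Path


def split_and_write(token_ids, out_dir, val_fraction=0.1):
    """Split token ids into train/val and write raw uint16 files.

    Returns (n_train, n_val). Assumes every id fits in 16 bits.
    """
    if any(t < 0 or t > 65535 for t in token_ids):
        raise ValueError("token id out of uint16 range")
    n_val = int(len(token_ids) * val_fraction)
    n_train = len(token_ids) - n_val
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # 'H' is an unsigned short (2 bytes on common platforms), native byte order
    for name, ids in (("train.bin", token_ids[:n_train]),
                      ("val.bin", token_ids[n_train:])):
        with open(out / name, "wb") as f:
            array("H", ids).tofile(f)
    return n_train, n_val
```

The returned counts correspond to the token count and train/val split that the real script prints.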
### STEP 3: Pretrain
```bash
python training/pretrain.py
```
> Expected: val loss should drop below ~3.5
> Checkpoints saved to `checkpoints/` when val loss improves
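
The "save when val loss improves" rule can be sketched as a small tracker. This is a hedged illustration, not the code in `training/pretrain.py`; the `BestCheckpointTracker` class is hypothetical and the loss values in the demo are made up for illustration:

```python
class BestCheckpointTracker:
    """Tracks the lowest validation loss seen so far."""

    def __init__(self):
        self.best = float("inf")

    def should_save(self, val_loss):
        """Return True and update the record if val_loss is a new best."""
        if val_loss < self.best:
            self.best = val_loss
            return True
        return False


if __name__ == "__main__":
    tracker = BestCheckpointTracker()
    # Made-up (step, val_loss) pairs, for illustration only
    for step, loss in [(100, 5.2), (200, 4.1), (300, 4.3), (400, 3.4)]:
        if tracker.should_save(loss):
            # A real training loop would write the model state to checkpoints/ here
            print(f"step {step}: val loss {loss} improved, saving")
```

With this rule, an evaluation that is worse than the best so far leaves the previously saved checkpoint untouched.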
### STEP 4: Fine-tune
```bash
python training/sft.py
```
> Outputs: `checkpoints/sft_final.pt`
### STEP 5: Chat
```bash
python inference/chat.py --checkpoint checkpoints/sft_final.pt
```
## Expected Behavior
- With **<1 MB of data**: the model will overfit, and responses will be memorized text.
- With **5-20 MB of data**: the model will generalize and produce novel sentences.
- With **50 MB+ of data**: the model will feel like a real (small) language model.
## Troubleshooting
- **OOM error**: reduce `BATCH_SIZE` to 4 or `context_len` to 256 in scripts/config.
- **Loss stuck at ~9.0**: the tokenizer was probably not trained; check that `tokenizer/spm.model` exists.
- **Gibberish output**: the model needs more data or more training steps.
- **CUDA not found**: install torch with `pip install torch --index-url https://download.pytorch.org/whl/cu124`
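
One way to read the "loss stuck at ~9.0" symptom: a model that assigns equal probability to every token has cross-entropy ln(V) for vocabulary size V, so a loss that never leaves that value suggests the model is not learning from the tokens. This README does not state the vocabulary size; the 8,000 used below is an assumption, chosen because ln(8000) is about 8.99. The helpers are hypothetical and only check the tokenizer outputs listed in STEP 1:

```python
import math
from pathlib import Path


def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Assumed vocabulary size; replace with the tokenizer's real value
    print(f"uniform baseline for V=8000: {uniform_loss(8000):.2f}")
    for p in missing_files(["tokenizer/spm.model", "tokenizer/spm.vocab"]):
        print(f"missing: {p} (run STEP 1 first)")
```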