# SAGE Commands
This is the repo's current command reference for data preparation, tokenizer training, model training, serving, browser control, and validation.
## Install
```bash
pip install -r requirements.txt
```
## Run tests
```bash
pytest -q
```
## 1. Create a starter dataset
This repo does not ship a large training corpus. The fastest way to unblock the pipeline is to generate the built-in smoke dataset first:
```bash
python -m data.bootstrap --output-dir data/raw --overwrite
```
That writes JSONL files like:
```text
data/raw/general_web.jsonl
data/raw/code.jsonl
data/raw/math_science.jsonl
data/raw/multilingual.jsonl
data/raw/synthetic.jsonl
```
To use your own corpus, place JSONL files in the same folder (`data/raw`). Each record needs at least a `text` field:
```json
{ "text": "your training sample here" }
```
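Before feeding a custom corpus into the pipeline, it can help to check that every line parses and carries a usable `text` field. The following is a minimal sketch, not part of the repo: `write_jsonl` and `find_bad_lines` are hypothetical helper names, and the rule that `text` must be a non-empty string is an assumption.

```python
import json
from pathlib import Path


def write_jsonl(path, texts):
    """Write one {"text": ...} record per line (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def find_bad_lines(path):
    """Return 1-based line numbers that are not JSON objects with a non-empty string `text`."""
    bad = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)
                continue
            # Assumption: a valid record is a JSON object whose `text` is a non-empty string.
            text = record.get("text") if isinstance(record, dict) else None
            if not isinstance(text, str) or not text.strip():
                bad.append(number)
    return bad
```

For example, `find_bad_lines("data/raw/my_corpus.jsonl")` returns an empty list when every record is usable; the file name here is only illustrative.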
## 2. Train the tokenizer
The tokenizer trainer accepts plain text files and JSONL files.
```bash
python -m tokenizer.train_tokenizer \
--input data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl \
--model-prefix tokenizer/tokenizer \
--vocab-size 4096 \
--training-text tokenizer/training_corpus.txt
```
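To illustrate what mixing plain text and JSONL inputs can look like, here is a rough sketch that flattens both kinds of file into a single training text file. It is not the repo's implementation: `iter_texts` and `build_training_text` are hypothetical names, and choosing the format by the `.jsonl` suffix and writing one sample per line are assumptions.

```python
import json
from pathlib import Path


def iter_texts(path):
    """Yield samples from a .jsonl file (`text` field) or a plain text file (one per line)."""
    path = Path(path)
    is_jsonl = path.suffix == ".jsonl"  # assumption: format is chosen by file suffix
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines in either format
            if is_jsonl:
                text = str(json.loads(line).get("text", ""))
            else:
                text = line
            if text.strip():
                yield text


def build_training_text(inputs, output):
    """Write every sample to `output`, one per line, and return the sample count."""
    count = 0
    with Path(output).open("w", encoding="utf-8") as out:
        for source in inputs:
            for text in iter_texts(source):
                # Collapse internal whitespace so each sample stays on one line.
                out.write(" ".join(text.split()) + "\n")
                count += 1
    return count
```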
## 3. Validate the tokenizer
```bash
python -m tokenizer.validate_tokenizer tokenizer/tokenizer.model
```
## 4. Build parquet shards
```bash
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 128
```
For a short smoke run:
```bash
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 32 \
--limit-per-source 4
```
The shell helper runs the same data pipeline:
```bash
bash scripts/run_data_pipeline.sh --tokenizer-model tokenizer/tokenizer.model --output-dir data/processed
```
## 5. Start training
Smoke run:
```bash
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet \
--validation-shards data/processed/shard-00001.parquet \
--output-dir runs/smoke \
--steps 20 \
--disable-wandb
```
Longer run:
```bash
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet data/processed/shard-00001.parquet \
--validation-shards data/processed/shard-00002.parquet \
--output-dir runs/sage-1b
```
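When there are many shards, typing every path by hand gets tedious. The sketch below collects shard files and assembles the trainer command shown above. It assumes shards follow the `shard-NNNNN.parquet` naming used in the examples and that holding out the last shard for validation is acceptable; `split_shards` and `trainer_command` are hypothetical helpers, not part of the repo.

```python
from pathlib import Path


def split_shards(processed_dir, validation_count=1):
    """Return (train, validation) shard lists, holding out the last shards for validation."""
    if validation_count < 1:
        raise ValueError("validation_count must be at least 1")
    # Zero-padded names sort correctly as plain strings.
    shards = sorted(Path(processed_dir).glob("shard-*.parquet"))
    if len(shards) <= validation_count:
        raise ValueError("not enough shards to split into train and validation")
    return shards[:-validation_count], shards[-validation_count:]


def trainer_command(train, validation, output_dir):
    """Build the argument list for the trainer using the flags from the commands above."""
    return [
        "python", "-m", "train.trainer",
        "--model-config", "configs/model/1b.yaml",
        "--schedule-config", "configs/train/schedule.yaml",
        "--train-shards", *map(str, train),
        "--validation-shards", *map(str, validation),
        "--output-dir", output_dir,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repo root.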
## 6. Serve the model
GPU/PyTorch server:
```bash
python -m serve.start --host 0.0.0.0 --port 8000
```
CPU control-plane server:
```bash
python -m serve.start --cpu --host 0.0.0.0 --port 8001
```
Helper scripts:
```bash
bash scripts/run_serve.sh
bash scripts/run_serve_cpu.sh
```
## 7. Browser control panel
Open the server root:
```text
http://127.0.0.1:8000/
```
The browser UI supports:
- login with the random 12-character password printed in the terminal at server startup
- dataset bootstrap preset
- shard-building preset
- tokenizer/train/eval/server presets
- raw shell commands
- live job logs
- direct model chat through `/chat`
## 8. API commands
Health:
```bash
curl http://127.0.0.1:8000/health
```
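In scripts, it is often useful to wait for the server to come up before sending requests. The following sketch polls `/health` with the standard library. It assumes a healthy server answers with HTTP 200; the response body is not inspected, and `is_healthy` and `wait_for_health` are hypothetical names.

```python
import time
import urllib.request


def is_healthy(url, timeout=2.0):
    """Return True when a GET to `url` answers with HTTP 200 (assumed to mean healthy)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are OSError subclasses: refused, timed out, or non-2xx.
        return False


def wait_for_health(url="http://127.0.0.1:8000/health", attempts=30, delay=1.0, probe=is_healthy):
    """Poll until the probe succeeds or the attempts run out; return the final result."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # no sleep after the final attempt
    return False
```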
Generate from token ids:
```bash
curl -X POST http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d "{\"input_ids\": [1, 42, 99], \"max_new_tokens\": 8}"
```
Chat from text:
```bash
curl -X POST http://127.0.0.1:8000/chat \
-H "Content-Type: application/json" \
-d "{\"prompt\": \"Explain the training flow in this repo.\", \"max_new_tokens\": 64}"
```
Chat status:
```bash
curl http://127.0.0.1:8000/chat/status
```
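The same calls can be made from Python without extra dependencies. This is a minimal client sketch using `urllib.request`; the endpoint paths and payload fields come from the curl examples above, while the helper names and the assumption that responses are JSON are not confirmed by this document.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_post(path, payload, base_url=BASE_URL):
    """Create a JSON POST request without sending it."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_request(prompt, max_new_tokens=64, base_url=BASE_URL):
    """Mirror the `/chat` curl call."""
    return build_post("/chat", {"prompt": prompt, "max_new_tokens": max_new_tokens}, base_url)


def generate_request(input_ids, max_new_tokens=8, base_url=BASE_URL):
    """Mirror the `/generate` curl call."""
    payload = {"input_ids": list(input_ids), "max_new_tokens": max_new_tokens}
    return build_post("/generate", payload, base_url)


def send(request, timeout=30.0):
    """Send a request and decode the body as JSON (assumption: the server replies with JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(chat_request("Explain the training flow in this repo."))` would issue the same request as the curl example.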
## 9. Evaluation
```bash
python -m eval.run_benchmarks
```
Or use the helper:
```bash
bash scripts/run_eval.sh
```
## 10. Hugging Face sync
```bash
python hf_push.py
```