# SAGE Commands
This is the repo's current command reference for data preparation, tokenizer training, model training, serving, browser control, and validation.
## Install
```bash
pip install -r requirements.txt
```
## Run tests
```bash
pytest -q
```
## 1. Create a starter dataset
This repo does not ship a large training corpus. The fastest way to unblock the pipeline is to generate the built-in smoke dataset first:
```bash
python -m data.bootstrap --output-dir data/raw --overwrite
```
This writes JSONL files such as:
```text
data/raw/general_web.jsonl
data/raw/code.jsonl
data/raw/math_science.jsonl
data/raw/multilingual.jsonl
data/raw/synthetic.jsonl
```
To use your own corpus, place JSONL files in the same folder. Each record needs at least a `text` field:
```json
{ "text": "your training sample here" }
```
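Before feeding a custom corpus into the pipeline, it can help to confirm that every record carries a usable `text` field. The following is a minimal sketch, not part of the repo: `check_jsonl` is a hypothetical helper, and it assumes that blank lines are harmless and that `text` must be a string.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (line_number, problem) pairs for records without a string `text` field."""
    problems = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # assumption: blank lines are ignored
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                problems.append((number, "missing string `text` field"))
    return problems
```

An empty list means every non-blank line parsed and had a string `text` value.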
## 2. Train the tokenizer
The tokenizer trainer accepts plain text files or JSONL files.
```bash
python -m tokenizer.train_tokenizer \
--input data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl \
--model-prefix tokenizer/tokenizer \
--vocab-size 4096 \
--training-text tokenizer/training_corpus.txt
```
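How the trainer merges mixed inputs internally is not shown in this reference. As a rough illustration only, the sketch below flattens `.jsonl` and plain text inputs into a single one-sample-per-line file, which is what the `--training-text` path appears to hold. `iter_texts` and `write_corpus` are hypothetical names, and the sketch assumes each JSONL line is an object with a `text` field.

```python
import json
from pathlib import Path


def iter_texts(path):
    """Yield samples: `text` fields from a .jsonl file, or non-empty lines from a text file."""
    path = Path(path)
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if path.suffix == ".jsonl":
                # Assumption: every JSONL line is an object with a `text` field.
                text = json.loads(line).get("text", "")
                if text:
                    yield text
            else:
                yield line


def write_corpus(inputs, output):
    """Write all samples to `output`, one per line, and return how many were written."""
    count = 0
    with Path(output).open("w", encoding="utf-8") as out:
        for source in inputs:
            for text in iter_texts(source):
                # Collapse embedded newlines so each sample stays on one line.
                out.write(text.replace("\n", " ") + "\n")
                count += 1
    return count
```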
## 3. Validate the tokenizer
```bash
python -m tokenizer.validate_tokenizer tokenizer/tokenizer.model
```
## 4. Build parquet shards
```bash
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 128
```
For a short smoke run:
```bash
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 32 \
--limit-per-source 4
```
The shell helper runs the same data pipeline:
```bash
bash scripts/run_data_pipeline.sh --tokenizer-model tokenizer/tokenizer.model --output-dir data/processed
```
## 5. Start training
Smoke run:
```bash
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet \
--validation-shards data/processed/shard-00001.parquet \
--output-dir runs/smoke \
--steps 20 \
--disable-wandb
```
Longer run:
```bash
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet data/processed/shard-00001.parquet \
--validation-shards data/processed/shard-00002.parquet \
--output-dir runs/sage-1b
```
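With more than a few shards, typing each path by hand gets tedious. The sketch below is one possible way to assemble the trainer command: it collects `shard-*.parquet` files, holds out the last ones for validation, and builds an argument list with the flags shown above. `split_shards` and `trainer_command` are hypothetical helpers, and holding out the final shard is an assumption, not a repo convention.

```python
from pathlib import Path


def split_shards(directory, validation_count=1):
    """Return (train, validation) path lists, holding out the last shards for validation."""
    if validation_count < 1:
        raise ValueError("validation_count must be at least 1")
    shards = sorted(str(p) for p in Path(directory).glob("shard-*.parquet"))
    if len(shards) <= validation_count:
        raise ValueError("not enough shards to leave any for training")
    return shards[:-validation_count], shards[-validation_count:]


def trainer_command(train, validation, output_dir):
    """Build the argument list for `python -m train.trainer` using the flags from this section."""
    return [
        "python", "-m", "train.trainer",
        "--model-config", "configs/model/1b.yaml",
        "--schedule-config", "configs/train/schedule.yaml",
        "--train-shards", *train,
        "--validation-shards", *validation,
        "--output-dir", output_dir,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repo root.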
## 6. Serve the model
GPU/PyTorch server:
```bash
python -m serve.start --host 0.0.0.0 --port 8000
```
CPU control-plane server:
```bash
python -m serve.start --cpu --host 0.0.0.0 --port 8001
```
Helper scripts:
```bash
bash scripts/run_serve.sh
bash scripts/run_serve_cpu.sh
```
## 7. Browser control panel
Open the server root:
```text
http://127.0.0.1:8000/
```
The browser UI supports:
- login with the random 12-character password printed in the terminal at server startup
- dataset bootstrap preset
- shard-building preset
- tokenizer/train/eval/server presets
- raw shell commands
- live job logs
- direct model chat through `/chat`
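The server's actual password code is not reproduced in this reference. For readers curious how a random startup password of this kind can work, here is a minimal sketch using only the standard library. `make_password` and `password_matches` are hypothetical names, and the alphabet (letters and digits) is an assumption.

```python
import hmac
import secrets
import string

# Assumption: letters and digits only; the real server may use a different alphabet.
ALPHABET = string.ascii_letters + string.digits


def make_password(length=12):
    """Generate a random password from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def password_matches(supplied, expected):
    """Compare in constant time to avoid leaking information through timing."""
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

Using `secrets` rather than `random` matters here because the password guards a UI that can run raw shell commands.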
## 8. API commands
Health:
```bash
curl http://127.0.0.1:8000/health
```
Generate from token ids:
```bash
curl -X POST http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d '{"input_ids": [1, 42, 99], "max_new_tokens": 8}'
```
Chat from text:
```bash
curl -X POST http://127.0.0.1:8000/chat \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain the training flow in this repo.", "max_new_tokens": 64}'
```
Chat status:
```bash
curl http://127.0.0.1:8000/chat/status
```
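The same endpoints can be called from Python without extra dependencies. The sketch below mirrors the `curl` calls using `urllib`. The helper names are hypothetical, and it assumes responses are JSON, which this reference does not confirm. Whether these endpoints require the UI login is also not stated here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request for the given endpoint path."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_request(prompt, max_new_tokens=64):
    """Request body matching the /chat example above."""
    return build_post("/chat", {"prompt": prompt, "max_new_tokens": max_new_tokens})


def generate_request(input_ids, max_new_tokens=8):
    """Request body matching the /generate example above."""
    return build_post(
        "/generate", {"input_ids": list(input_ids), "max_new_tokens": max_new_tokens}
    )


def send(request, timeout=30):
    """Send the request; assumes the server answers with a JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(chat_request("Explain the training flow in this repo."))` performs the same call as the `curl` example.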
## 9. Evaluation
```bash
python -m eval.run_benchmarks
```
Or use the helper:
```bash
bash scripts/run_eval.sh
```
## 10. Hugging Face sync
```bash
python hf_push.py
```