sage / docs /COMMANDS.md

sage002

feat: add authenticated remote control UI and ngrok launcher

4c64fd6 verified about 1 month ago

preview code

raw

history blame contribute delete

3.87 kB

SAGE Commands

This is the repo's current command reference for data preparation, tokenizer training, model training, serving, browser control, and validation.

Install

pip install -r requirements.txt

Run tests

pytest -q

1. Create a starter dataset

This repo does not ship a large training corpus. The fastest way to unblock the pipeline is to generate the built-in smoke dataset first:

python -m data.bootstrap --output-dir data/raw --overwrite

That writes JSONL files like:

data/raw/general_web.jsonl
data/raw/code.jsonl
data/raw/math_science.jsonl
data/raw/multilingual.jsonl
data/raw/synthetic.jsonl

If you want to use your own corpus, put JSONL records in the same folder with at least a text field:

{ "text": "your training sample here" }

2. Train the tokenizer

The tokenizer trainer now accepts plain text files or JSONL files.

python -m tokenizer.train_tokenizer \
  --input data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl \
  --model-prefix tokenizer/tokenizer \
  --vocab-size 4096 \
  --training-text tokenizer/training_corpus.txt

3. Validate the tokenizer

python -m tokenizer.validate_tokenizer tokenizer/tokenizer.model

4. Build parquet shards

python -m data.pipeline \
  --tokenizer-model tokenizer/tokenizer.model \
  --output-dir data/processed \
  --shard-size 128

For a short smoke run:

python -m data.pipeline \
  --tokenizer-model tokenizer/tokenizer.model \
  --output-dir data/processed \
  --shard-size 32 \
  --limit-per-source 4

The shell helper now points to the real data pipeline:

bash scripts/run_data_pipeline.sh --tokenizer-model tokenizer/tokenizer.model --output-dir data/processed

5. Start training

Smoke run:

python -m train.trainer \
  --model-config configs/model/1b.yaml \
  --schedule-config configs/train/schedule.yaml \
  --train-shards data/processed/shard-00000.parquet \
  --validation-shards data/processed/shard-00001.parquet \
  --output-dir runs/smoke \
  --steps 20 \
  --disable-wandb

Longer run:

python -m train.trainer \
  --model-config configs/model/1b.yaml \
  --schedule-config configs/train/schedule.yaml \
  --train-shards data/processed/shard-00000.parquet data/processed/shard-00001.parquet \
  --validation-shards data/processed/shard-00002.parquet \
  --output-dir runs/sage-1b

6. Serve the model

GPU/PyTorch server:

python -m serve.start --host 0.0.0.0 --port 8000

CPU control-plane server:

python -m serve.start --cpu --host 0.0.0.0 --port 8001

Helper scripts:

bash scripts/run_serve.sh
bash scripts/run_serve_cpu.sh

7. Browser control panel

Open the server root:

http://127.0.0.1:8000/

The browser UI now supports:

login with the random 12-character password printed in the terminal at server startup
dataset bootstrap preset
shard-building preset
tokenizer/train/eval/server presets
raw shell commands
live job logs
direct model chat through /chat

8. API commands

Health:

curl http://127.0.0.1:8000/health

Generate from token ids:

curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d "{\"input_ids\": [1, 42, 99], \"max_new_tokens\": 8}"

Chat from text:

curl -X POST http://127.0.0.1:8000/chat \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"Explain the training flow in this repo.\", \"max_new_tokens\": 64}"

Chat status:

curl http://127.0.0.1:8000/chat/status

9. Evaluation

python -m eval.run_benchmarks

Or use the helper:

bash scripts/run_eval.sh

10. Hugging Face sync

python hf_push.py