SAGE Commands
This is the repo's current command reference for data preparation, tokenizer training, model training, serving, browser control, and validation.
Install
pip install -r requirements.txt
Run tests
pytest -q
1. Create a starter dataset
This repo does not ship a large training corpus. The fastest way to unblock the pipeline is to generate the built-in smoke dataset first:
python -m data.bootstrap --output-dir data/raw --overwrite
That writes JSONL files like:
data/raw/general_web.jsonl
data/raw/code.jsonl
data/raw/math_science.jsonl
data/raw/multilingual.jsonl
data/raw/synthetic.jsonl
If you want to use your own corpus, put JSONL records in the same folder with at least a text field:
{ "text": "your training sample here" }
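If you are preparing your own corpus, the record format above can be produced and checked with the standard library alone. This is a minimal sketch: the helper names `write_jsonl` and `check_jsonl` are illustrative and not part of this repo, and the only format requirement assumed is the `text` field described above.

```python
import json
from pathlib import Path


def write_jsonl(path, texts):
    """Write one {"text": ...} record per line and return the record count."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for text in texts:
            fh.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
            count += 1
    return count


def check_jsonl(path):
    """Return 1-based line numbers of records without a non-empty string "text" field."""
    bad = []
    with Path(path).open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # blank lines are ignored, not reported
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(lineno)
                continue
            text = record.get("text") if isinstance(record, dict) else None
            if not isinstance(text, str) or not text.strip():
                bad.append(lineno)
    return bad
```

For example, `write_jsonl("data/raw/my_corpus.jsonl", samples)` followed by `check_jsonl("data/raw/my_corpus.jsonl")` should return an empty list before you move on to tokenizer training. The file name `my_corpus.jsonl` is a placeholder.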
2. Train the tokenizer
The tokenizer trainer now accepts plain text files or JSONL files.
python -m tokenizer.train_tokenizer \
--input data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl \
--model-prefix tokenizer/tokenizer \
--vocab-size 4096 \
--training-text tokenizer/training_corpus.txt
3. Validate the tokenizer
python -m tokenizer.validate_tokenizer tokenizer/tokenizer.model
4. Build parquet shards
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 128
For a short smoke run:
python -m data.pipeline \
--tokenizer-model tokenizer/tokenizer.model \
--output-dir data/processed \
--shard-size 32 \
--limit-per-source 4
The shell helper now points to the real data pipeline:
bash scripts/run_data_pipeline.sh --tokenizer-model tokenizer/tokenizer.model --output-dir data/processed
5. Start training
Smoke run:
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet \
--validation-shards data/processed/shard-00001.parquet \
--output-dir runs/smoke \
--steps 20 \
--disable-wandb
Longer run:
python -m train.trainer \
--model-config configs/model/1b.yaml \
--schedule-config configs/train/schedule.yaml \
--train-shards data/processed/shard-00000.parquet data/processed/shard-00001.parquet \
--validation-shards data/processed/shard-00002.parquet \
--output-dir runs/sage-1b
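Once there are more than a handful of shards, typing each path by hand gets error-prone. The sketch below collects the shard files and builds the same trainer command as above. It assumes the pipeline names its outputs `shard-NNNNN.parquet`, as in the examples here, and it holds out the last shard(s) for validation, which is a choice made for illustration rather than something the repo prescribes. The helper names `split_shards` and `trainer_argv` are hypothetical.

```python
from pathlib import Path


def split_shards(processed_dir, validation_count=1):
    """Sort shard-*.parquet by name and hold out the last ones for validation."""
    if validation_count < 1:
        raise ValueError("validation_count must be at least 1")
    shards = sorted(Path(processed_dir).glob("shard-*.parquet"))
    if len(shards) <= validation_count:
        raise ValueError("need more shards than validation_count")
    return shards[:-validation_count], shards[-validation_count:]


def trainer_argv(train, validation, output_dir):
    """Build the argument list for `python -m train.trainer` with the flags shown above."""
    return [
        "python", "-m", "train.trainer",
        "--model-config", "configs/model/1b.yaml",
        "--schedule-config", "configs/train/schedule.yaml",
        "--train-shards", *[str(p) for p in train],
        "--validation-shards", *[str(p) for p in validation],
        "--output-dir", str(output_dir),
    ]
```

The resulting list can be passed to `subprocess.run(argv, check=True)` from the repo root, for example with `train, val = split_shards("data/processed")` and `argv = trainer_argv(train, val, "runs/sage-1b")`.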
6. Serve the model
GPU/PyTorch server:
python -m serve.start --host 0.0.0.0 --port 8000
CPU control-plane server:
python -m serve.start --cpu --host 0.0.0.0 --port 8001
Helper scripts:
bash scripts/run_serve.sh
bash scripts/run_serve_cpu.sh
7. Browser control panel
Open the server root:
http://127.0.0.1:8000/
The browser UI now supports:
- login with the random 12-character password printed in the terminal at server startup
- dataset bootstrap preset
- shard-building preset
- tokenizer/train/eval/server presets
- raw shell commands
- live job logs
- direct model chat through /chat
8. API commands
Health:
curl http://127.0.0.1:8000/health
Generate from token ids:
curl -X POST http://127.0.0.1:8000/generate \
-H "Content-Type: application/json" \
-d "{\"input_ids\": [1, 42, 99], \"max_new_tokens\": 8}"
Chat from text:
curl -X POST http://127.0.0.1:8000/chat \
-H "Content-Type: application/json" \
-d "{\"prompt\": \"Explain the training flow in this repo.\", \"max_new_tokens\": 64}"
Chat status:
curl http://127.0.0.1:8000/chat/status
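The same endpoints can be called from Python without extra dependencies. This is a minimal sketch using `urllib` from the standard library. The helper names `build_request` and `call` are illustrative, and because the response schema is not documented here, the sketch returns the raw response body as text instead of assuming any particular fields.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # the GPU/PyTorch server address used above


def build_request(path, payload=None, base_url=BASE_URL):
    """Return a GET request when payload is None, otherwise a JSON POST."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path, payload=None, base_url=BASE_URL, timeout=30):
    """Send the request and return the raw response body as text."""
    request = build_request(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `call("/health")` mirrors the health check, and `call("/chat", {"prompt": "Explain the training flow in this repo.", "max_new_tokens": 64})` mirrors the chat request.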
9. Evaluation
python -m eval.run_benchmarks
Or use the helper:
bash scripts/run_eval.sh
10. Hugging Face sync
python hf_push.py