# SAGE Commands This is the repo's current command reference for data preparation, tokenizer training, model training, serving, browser control, and validation. ## Install ```bash pip install -r requirements.txt ``` ## Run tests ```bash pytest -q ``` ## 1. Create a starter dataset This repo does not ship a large training corpus. The fastest way to unblock the pipeline is to generate the built-in smoke dataset first: ```bash python -m data.bootstrap --output-dir data/raw --overwrite ``` That writes JSONL files like: ```text data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl ``` If you want to use your own corpus, put JSONL records in the same folder with at least a `text` field: ```json { "text": "your training sample here" } ``` ## 2. Train the tokenizer The tokenizer trainer now accepts plain text files or JSONL files. ```bash python -m tokenizer.train_tokenizer \ --input data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl \ --model-prefix tokenizer/tokenizer \ --vocab-size 4096 \ --training-text tokenizer/training_corpus.txt ``` ## 3. Validate the tokenizer ```bash python -m tokenizer.validate_tokenizer tokenizer/tokenizer.model ``` ## 4. Build parquet shards ```bash python -m data.pipeline \ --tokenizer-model tokenizer/tokenizer.model \ --output-dir data/processed \ --shard-size 128 ``` For a short smoke run: ```bash python -m data.pipeline \ --tokenizer-model tokenizer/tokenizer.model \ --output-dir data/processed \ --shard-size 32 \ --limit-per-source 4 ``` The shell helper now points to the real data pipeline: ```bash bash scripts/run_data_pipeline.sh --tokenizer-model tokenizer/tokenizer.model --output-dir data/processed ``` ## 5. Start training Smoke run: ```bash python -m train.trainer \ --model-config configs/model/1b.yaml \ --schedule-config configs/train/schedule.yaml \ --train-shards data/processed/shard-00000.parquet \ --validation-shards data/processed/shard-00001.parquet \ --output-dir runs/smoke \ --steps 20 \ --disable-wandb ``` Longer run: ```bash python -m train.trainer \ --model-config configs/model/1b.yaml \ --schedule-config configs/train/schedule.yaml \ --train-shards data/processed/shard-00000.parquet data/processed/shard-00001.parquet \ --validation-shards data/processed/shard-00002.parquet \ --output-dir runs/sage-1b ``` ## 6. Serve the model GPU/PyTorch server: ```bash python -m serve.start --host 0.0.0.0 --port 8000 ``` CPU control-plane server: ```bash python -m serve.start --cpu --host 0.0.0.0 --port 8001 ``` Helper scripts: ```bash bash scripts/run_serve.sh bash scripts/run_serve_cpu.sh ``` ## 7. Browser control panel Open the server root: ```text http://127.0.0.1:8000/ ``` The browser UI now supports: - login with the random 12-character password printed in the terminal at server startup - dataset bootstrap preset - shard-building preset - tokenizer/train/eval/server presets - raw shell commands - live job logs - direct model chat through `/chat` ## 8. API commands Health: ```bash curl http://127.0.0.1:8000/health ``` Generate from token ids: ```bash curl -X POST http://127.0.0.1:8000/generate \ -H "Content-Type: application/json" \ -d "{\"input_ids\": [1, 42, 99], \"max_new_tokens\": 8}" ``` Chat from text: ```bash curl -X POST http://127.0.0.1:8000/chat \ -H "Content-Type: application/json" \ -d "{\"prompt\": \"Explain the training flow in this repo.\", \"max_new_tokens\": 64}" ``` Chat status: ```bash curl http://127.0.0.1:8000/chat/status ``` ## 9. Evaluation ```bash python -m eval.run_benchmarks ``` Or use the helper: ```bash bash scripts/run_eval.sh ``` ## 10. Hugging Face sync ```bash python hf_push.py ```