# SAGE Commands

This is the command reference for the repo: data preparation, tokenizer training, model training, serving, browser control, and validation.

## Install

```bash
pip install -r requirements.txt
```

## Run tests

```bash
pytest -q
```

## 1. Create a starter dataset

This repo does not ship a large training corpus. The fastest way to unblock the pipeline is to generate the built-in smoke dataset first:

```bash
python -m data.bootstrap --output-dir data/raw --overwrite
```

This writes JSONL files such as:

```text
data/raw/general_web.jsonl
data/raw/code.jsonl
data/raw/math_science.jsonl
data/raw/multilingual.jsonl
data/raw/synthetic.jsonl
```

To use your own corpus, place JSONL files in the same folder (`data/raw`). Each line must be a JSON record with at least a `text` field:

```json
{ "text": "your training sample here" }
```
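If you are preparing a custom corpus, a small script can write records in this shape and flag malformed lines before the tokenizer step. The following is a minimal sketch, not part of the repo: the helper names `write_jsonl` and `invalid_lines` and the file name `custom.jsonl` are hypothetical, and the only check applied is that each line parses as a JSON object with a string `text` field.

```python
import json
from pathlib import Path


def write_jsonl(path, texts):
    """Write one {"text": ...} record per line (hypothetical helper)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def invalid_lines(path):
    """Return 1-based line numbers that are not JSON objects with a string `text`."""
    bad = []
    with Path(path).open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)
                continue
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                bad.append(number)
    return bad


if __name__ == "__main__":
    # Assumed file name; any *.jsonl file in data/raw follows the same shape.
    target = "data/raw/custom.jsonl"
    write_jsonl(target, ["your training sample here"])
    print(invalid_lines(target))
```

Running the check before training the tokenizer makes it easier to catch a stray non-JSON line early, rather than partway through the pipeline.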

## 2. Train the tokenizer

The tokenizer trainer accepts plain-text files or JSONL files.

```bash
python -m tokenizer.train_tokenizer \
  --input data/raw/general_web.jsonl data/raw/code.jsonl data/raw/math_science.jsonl data/raw/multilingual.jsonl data/raw/synthetic.jsonl \
  --model-prefix tokenizer/tokenizer \
  --vocab-size 4096 \
  --training-text tokenizer/training_corpus.txt
```

## 3. Validate the tokenizer

```bash
python -m tokenizer.validate_tokenizer tokenizer/tokenizer.model
```

## 4. Build parquet shards

```bash
python -m data.pipeline \
  --tokenizer-model tokenizer/tokenizer.model \
  --output-dir data/processed \
  --shard-size 128
```

For a short smoke run:

```bash
python -m data.pipeline \
  --tokenizer-model tokenizer/tokenizer.model \
  --output-dir data/processed \
  --shard-size 32 \
  --limit-per-source 4
```

The shell helper runs the same data pipeline:

```bash
bash scripts/run_data_pipeline.sh --tokenizer-model tokenizer/tokenizer.model --output-dir data/processed
```

## 5. Start training

Smoke run:

```bash
python -m train.trainer \
  --model-config configs/model/1b.yaml \
  --schedule-config configs/train/schedule.yaml \
  --train-shards data/processed/shard-00000.parquet \
  --validation-shards data/processed/shard-00001.parquet \
  --output-dir runs/smoke \
  --steps 20 \
  --disable-wandb
```

Longer run:

```bash
python -m train.trainer \
  --model-config configs/model/1b.yaml \
  --schedule-config configs/train/schedule.yaml \
  --train-shards data/processed/shard-00000.parquet data/processed/shard-00001.parquet \
  --validation-shards data/processed/shard-00002.parquet \
  --output-dir runs/sage-1b
```
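With more than a handful of shards, typing each path by hand gets error-prone. The sketch below builds the same trainer command from shard indices. It assumes shards follow the zero-padded `shard-NNNNN.parquet` pattern seen in the commands above; the helper names `shard_paths` and `trainer_command` are hypothetical and not part of the repo.

```python
import shlex
from pathlib import PurePosixPath


def shard_paths(indices, processed_dir="data/processed"):
    """Map shard indices to paths, assuming the shard-NNNNN.parquet naming."""
    base = PurePosixPath(processed_dir)
    return [str(base / f"shard-{index:05d}.parquet") for index in indices]


def trainer_command(train_indices, validation_indices, output_dir):
    """Assemble the argument list for `python -m train.trainer`."""
    return [
        "python", "-m", "train.trainer",
        "--model-config", "configs/model/1b.yaml",
        "--schedule-config", "configs/train/schedule.yaml",
        "--train-shards", *shard_paths(train_indices),
        "--validation-shards", *shard_paths(validation_indices),
        "--output-dir", output_dir,
    ]


if __name__ == "__main__":
    # Print a copy-pasteable command; pass the list to subprocess.run to execute it.
    print(shlex.join(trainer_command([0, 1], [2], "runs/sage-1b")))
```

Keeping the command as a list of arguments avoids shell-quoting problems if it is later passed to `subprocess.run`.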

## 6. Serve the model

GPU/PyTorch server:

```bash
python -m serve.start --host 0.0.0.0 --port 8000
```

CPU control-plane server:

```bash
python -m serve.start --cpu --host 0.0.0.0 --port 8001
```

Helper scripts:

```bash
bash scripts/run_serve.sh
bash scripts/run_serve_cpu.sh
```

## 7. Browser control panel

Open the server root:

```text
http://127.0.0.1:8000/
```

The browser UI supports:

- login with the random 12-character password printed in the terminal at server startup
- dataset bootstrap preset
- shard-building preset
- tokenizer/train/eval/server presets
- raw shell commands
- live job logs
- direct model chat through `/chat`

## 8. API commands

Health:

```bash
curl http://127.0.0.1:8000/health
```
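In scripts, it can help to wait for the server before sending requests. The sketch below polls the health endpoint until it answers. It assumes `/health` returns HTTP 200 once the server is ready, which this reference does not state; `probe_health` and `wait_until_healthy` are hypothetical names.

```python
import time
import urllib.error
import urllib.request


def probe_health(url="http://127.0.0.1:8000/health", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200 (assumed ready signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False


def wait_until_healthy(probe=probe_health, attempts=30, delay=1.0):
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_healthy() else "server did not come up")
```

The probe is passed in as a parameter so the retry loop can be exercised without a running server.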

Generate from token ids:

```bash
curl -X POST http://127.0.0.1:8000/generate \
  -H "Content-Type: application/json" \
  -d "{\"input_ids\": [1, 42, 99], \"max_new_tokens\": 8}"
```

Chat from text:

```bash
curl -X POST http://127.0.0.1:8000/chat \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": \"Explain the training flow in this repo.\", \"max_new_tokens\": 64}"
```

Chat status:

```bash
curl http://127.0.0.1:8000/chat/status
```
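The same POST requests can be sent from Python using only the standard library. This is a rough sketch under a few assumptions: the server runs at the address used in the curl examples, responses are JSON, and no authentication header is needed for these endpoints. None of these is confirmed here, and `build_post` and `post_json` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path, payload, base_url=BASE_URL, timeout=30):
    """Send the request and decode the body, assuming the server replies with JSON."""
    request = build_post(path, payload, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Mirrors the /generate and /chat curl examples.
    print(post_json("/generate", {"input_ids": [1, 42, 99], "max_new_tokens": 8}))
    print(post_json("/chat", {
        "prompt": "Explain the training flow in this repo.",
        "max_new_tokens": 64,
    }))
```

Separating request construction from sending keeps the payload easy to inspect or log before anything goes over the network.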

## 9. Evaluation

```bash
python -m eval.run_benchmarks
```

Or use the helper:

```bash
bash scripts/run_eval.sh
```

## 10. Hugging Face sync

```bash
python hf_push.py
```