Initial upload of fine‑tuned Gemma + custom tokenizer
README.md CHANGED

@@ -5,6 +5,7 @@ The following is a model trained by [...suspense...] that is meant to:

- be a really good, approximately Bayesian in-context learner;
- fit a data-generating process
- be calibrated over distributions of possible outputs with respect to a population or epistemic uncertainty
- also act as a chat model, hopefully with more diverse outputs!

**Description:** From gemma-3-12b-it; keeps full chat format.
**Pros:** drop-in for chat template · works on original logs

@@ -56,7 +57,6 @@ There are three variants of the model for now:

| **Example w/o inputs** | ```text\nDESCRIPTION\n<start_of_turn>OUTPUT1<end_of_turn>\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |

This model/repo is a work in progress - expect updates.

Loading model example:

@@ -281,7 +281,85 @@ Output:

A few tips and tricks:

- If all you want is one reasonable answer, then a chat model is likely a better fit. However, if you want to generate many reasonable answers / diverse examples, this model is a better fit.
- The model is quite good at perspective taking / steering if you provide many examples.
- The model is reasonably good at expressing epistemic uncertainty over uncertain outputs when you sample it several times.
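
The "Example w/o inputs" template shown in the table above can be assembled programmatically. A minimal sketch (the helper name and the example description are mine, not part of the repo):

```python
# Build the base-variant prompt without inputs, following the template
# above: DESCRIPTION, then each prior output wrapped in turn markers.
def build_prompt_without_inputs(description, outputs):
    parts = [description]
    for out in outputs:
        parts.append(f"<start_of_turn>{out}<end_of_turn>")
    return "\n".join(parts)

prompt = build_prompt_without_inputs(
    "Names parents give to baby girls in the US",  # hypothetical description
    ["Olivia", "Emma"],                            # outputs sampled so far
)
print(prompt)
```

Each newly sampled output can be appended in the same turn-marker format, so later samples condition on earlier ones as roughly exchangeable data points.
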
### Chat-specific

Additionally, the model can also be used directly as a chat model, with some initial evidence that it behaves similarly to the original chat model but with slightly more diverse outputs. For example, here are two prompts, along with next-token probabilities for tsor13/chat12b vs. google/gemma-3-12b-it:


User message: `Let's play rock paper scissors! I'll play at the same time — try to beat me. Return just rock, paper, or scissors`

Top 10 probabilities for google/gemma-3-12b-it:

```
1. 'paper' -> 0.8609
2. 'scissors' -> 0.1098
3. 'Scissors' -> 0.0164
4. 'Paper' -> 0.0129
5. 'Rock' -> 0.0000
6. 'rock' -> 0.0000
7. ' scissors' -> 0.0000
8. ' paper' -> 0.0000
9. '纸' -> 0.0000
10. '纸' -> 0.0000
```

Top 10 probabilities for the first tsor13/chat12b token:

```
1. 'scissors' -> 0.6375
2. 'rock' -> 0.2188
3. 'paper' -> 0.1354
4. 'scissor' -> 0.0017
5. 'Scissors' -> 0.0017
6. 'Rock' -> 0.0015
7. 'Paper' -> 0.0005
8. 'sc' -> 0.0003
9. 'stone' -> 0.0002
10. 'I' -> 0.0001
```

It's not perfect, but as you can see, the chat12b model puts at least 13% probability on each of rock, paper, and scissors, while the original model always chooses scissors or paper.

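
Rankings like these can be reproduced by softmaxing the logits at the first generated position and sorting. A minimal numpy sketch of the ranking step (toy logits stand in for a real forward pass; with transformers you would use something like `model(input_ids).logits[0, -1]` instead):

```python
import numpy as np

def top_k_probs(logits, k=10):
    """Return the k most likely token ids with their softmax probabilities."""
    z = logits - logits.max()        # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
    top = np.argsort(p)[::-1][:k]    # indices by descending probability
    return [(int(i), float(p[i])) for i in top]

# Toy logits over a 5-token "vocabulary"
toy = np.array([2.0, 1.0, 0.0, -1.0, -2.0])
for tok_id, prob in top_k_probs(toy, k=3):
    print(tok_id, round(prob, 4))
```
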

User message: `What should I name my baby? Return just the name`

Top 10 probabilities for google/gemma-3-12b-it:

```
1. 'Ele' -> 0.5388
2. 'Hazel' -> 0.1768
3. 'Aurora' -> 0.1122
4. 'El' -> 0.0687
5. 'Olivia' -> 0.0380
6. 'The' -> 0.0148
7. 'E' -> 0.0123
8. 'Am' -> 0.0109
9. 'Willow' -> 0.0082
10. 'Leo' -> 0.0033
```

Top 10 probabilities for the first tsor13/chat12b token:

```
1. 'Leo' -> 0.0477
2. 'Olivia' -> 0.0411
3. 'Liam' -> 0.0347
4. 'Oliver' -> 0.0280
5. 'E' -> 0.0257
6. 'James' -> 0.0239
7. 'Alice' -> 0.0221
8. 'A' -> 0.0214
9. 'Henry' -> 0.0214
10. 'Luna' -> 0.0206
```

Again, not perfect, but the chat model spreads out probability mass over many more names (unlike the original instruct model, which puts 50% chance on a name starting with "Ele").

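
One way to quantify "spreads out probability mass" is the entropy of each top-10 list, renormalized so each list sums to 1 (the two top-10s cover very different amounts of total mass). A quick check using the numbers above:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a renormalized probability list."""
    s = sum(probs)
    return -sum(p / s * math.log(p / s) for p in probs)

gemma_it = [0.5388, 0.1768, 0.1122, 0.0687, 0.0380,
            0.0148, 0.0123, 0.0109, 0.0082, 0.0033]
chat12b = [0.0477, 0.0411, 0.0347, 0.0280, 0.0257,
           0.0239, 0.0221, 0.0214, 0.0214, 0.0206]

print(entropy(gemma_it))  # roughly 1.4 nats: concentrated on a few names
print(entropy(chat12b))   # roughly 2.3 nats: near-uniform (the max for 10 items is ln 10, about 2.30)
```
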

Finally, the chat model's tokenizer also has a function to convert from the description/input/output format to the system/user/assistant format, which can be used to chat directly with the model. For example:

```python
messages = [
    {"role": "description", "content": "You are a helpful assistant who outputs the requested content."},
    {"role": "input", "content": "A poem about a shark"},
]
tokenizer.messages_to_chat_messages(messages)
```
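
The role mapping that `messages_to_chat_messages` performs is, per the description above, description → system, input → user, output → assistant. A rough standalone illustration of that mapping (my sketch, not the tokenizer's actual implementation):

```python
# Hypothetical re-implementation of the described role mapping; the real
# conversion is the custom tokenizer's messages_to_chat_messages method.
ROLE_MAP = {"description": "system", "input": "user", "output": "assistant"}

def to_chat_messages(messages):
    # Rewrite each role via the mapping, leaving unknown roles untouched.
    return [{"role": ROLE_MAP.get(m["role"], m["role"]), "content": m["content"]}
            for m in messages]

messages = [
    {"role": "description", "content": "You are a helpful assistant who outputs the requested content."},
    {"role": "input", "content": "A poem about a shark"},
]
print(to_chat_messages(messages))
```
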