Initial upload of fine‑tuned Gemma + custom tokenizer
- README.md +52 -44
- config.json +4 -0
- model-00001-of-00005.safetensors +1 -1
- model-00002-of-00005.safetensors +1 -1
- model-00003-of-00005.safetensors +1 -1
- model-00004-of-00005.safetensors +1 -1
- model-00005-of-00005.safetensors +1 -1
- training_args.bin +1 -1
README.md
CHANGED
@@ -10,7 +10,7 @@ It is initialized from `google/gemma-3-12b-pt`.
 **Pros:** distinguishes description/input · closer to chat · best generations(?)
 **Cons:** more tokens than *special*

-<details><summary>Example
+<details><summary>Example w/ inputs</summary>

 ```text
 <start_of_turn>description
@@ -26,7 +26,7 @@ OUTPUT2<end_of_turn>
 ```
 </details>

-<details><summary>Example
+<details><summary>Example w/o inputs</summary>

 ```text
 <start_of_turn>description
@@ -43,11 +43,11 @@ There are three variants of the model for now:
 | **Field** | **special** | **extra** | **chat** |
 |-----------|-------------|-----------|----------|
 | **Model card** | [`tsor13/special12b`](https://huggingface.co/tsor13/special12b) | [`tsor13/extra12b`](https://huggingface.co/tsor13/extra12b) | [`tsor13/chat12b`](https://huggingface.co/tsor13/chat12b) |
-| **Description** | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-it`, trained to preserve
-| **Pros** | • Most token‑efficient (only tags around the output) | • Distinguishes description vs first input<br>• Closer to chat format<br>• Best generations (?) | • Drop‑in for Gemma‑chat template<br>• Works on original chat logs, even
+| **Description** | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-it`, trained to preserve & assume chat format |
+| **Pros** | • Most token‑efficient (only tags around the output) | • Distinguishes description vs first input<br>• Closer to chat format<br>• Best generations (?) | • Drop‑in for Gemma‑chat template<br>• Works on original chat logs, even OOD |
 | **Cons** | • May not tell description from first input<br>• Formatting farther from Gemma chat template | • More tokens than *special* | • Many extra tokens |
-| **Example
-| **Example
+| **Example w/ inputs** | ```text\nDESCRIPTION\nINPUT1\n<start_of_turn>OUTPUT1<end_of_turn>\nINPUT2\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>input\nINPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>input\nINPUT2<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nINPUT1<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nINPUT2<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |
+| **Example w/o inputs** | ```text\nDESCRIPTION\n<start_of_turn>OUTPUT1<end_of_turn>\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |

 At the moment, I recommend:
 - [special](https://huggingface.co/tsor13/special12b) for most use cases (token-efficient and gets best loss on training data)
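The table's templates can be rendered mechanically. As a rough sketch (the helper names here are illustrative, not from the model card), building *special*- and *extra*-format prompts from a description plus input/output pairs might look like:

```python
# Sketch: render a description plus (input, output) example pairs into the
# "special" and "extra" formats from the table above. Helper names are
# illustrative, not part of the released code.

def to_special(description, pairs, next_input=None):
    # special: bare description, bare inputs, tags only around outputs
    parts = [description]
    for inp, out in pairs:
        if inp:
            parts.append(inp)
        parts.append(f"<start_of_turn>{out}<end_of_turn>")
    if next_input is not None:
        parts.append(next_input)
    parts.append("<start_of_turn>")  # model continues with the next output
    return "\n".join(parts)

def to_extra(description, pairs, next_input=None):
    # extra: description, each input, and each output get their own turn block
    parts = [f"<start_of_turn>description\n{description}<end_of_turn>"]
    for inp, out in pairs:
        if inp:
            parts.append(f"<start_of_turn>input\n{inp}<end_of_turn>")
        parts.append(f"<start_of_turn>output\n{out}<end_of_turn>")
    if next_input is not None:
        parts.append(f"<start_of_turn>input\n{next_input}<end_of_turn>")
    parts.append("<start_of_turn>output\n")
    return "\n".join(parts)

prompt = to_extra("Capital cities of countries.", [("France", "Paris")], "Japan")
```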
@@ -121,17 +121,17 @@ with torch.no_grad():

 Output:
 ```
-Top 10 probabilities for first output token:
-1. 'Tokyo' -> 0.
-2. '
-3. '
-4. '
-5. '
-6. '
-7. '
-8. '
-9. '
-10. '
+Top 10 probabilities for first output token:
+1. 'Tokyo' -> 0.9904
+2. 'Tok' -> 0.0027
+3. 'tok' -> 0.0007
+4. 'To' -> 0.0006
+5. 'Toy' -> 0.0006
+6. '東京' -> 0.0005
+7. 'Ky' -> 0.0005
+8. 'T' -> 0.0002
+9. 'Washington' -> 0.0001
+10. 'N' -> 0.0001
 ```

 Great! Almost all of the probability mass is on the correct answer, Tokyo.
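The snippet producing this output is only partially visible in the diff (the hunk header shows `with torch.no_grad():`). A minimal sketch of such a readout with standard `transformers`/`torch` APIs, assuming the *extra* format and using `tsor13/extra12b` as the checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: the "extra" checkpoint and stock transformers APIs; the
# README's own snippet is not fully shown in this diff.
model_id = "tsor13/extra12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = (
    "<start_of_turn>description\nCapital cities of countries.<end_of_turn>\n"
    "<start_of_turn>input\nJapan<end_of_turn>\n"
    "<start_of_turn>output\n"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the first output token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=10)

print("Top 10 probabilities for first output token:")
for rank, (p, idx) in enumerate(zip(top.values, top.indices), start=1):
    print(f"{rank}. '{tokenizer.decode(idx)}' -> {p.item():.4f}")
```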
@@ -161,10 +161,10 @@ for i in range(n_gens):

 Outputs:
 ```
-
-
-
-
+Root
+Dune: Imperium
+Gloomhaven
+Spirit Island
 ```
 Not too bad!

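The hunk headers around here reference a `for i in range(n_gens):` loop that is not fully shown. A sketch of what such a sampling loop could look like (the description text and sampling parameters are assumptions; model and tokenizer come from the previous sketch):

```python
# Sketch of an n_gens-style sampling loop; prompt text and parameters assumed.
prompt = (
    "<start_of_turn>description\nBoard games.<end_of_turn>\n"
    "<start_of_turn>output\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

n_gens = 4
for i in range(n_gens):
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=32,
        eos_token_id=end_of_turn_id,  # stop at <end_of_turn>; cf. the config.json change below
    )
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```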
@@ -186,10 +186,10 @@ for i in range(n_gens):
 ```
 Output:
 ```
-
-
-
-
+light steel blue
+A descriptive list of colors
+green
+Blue
 ```

 By default, the model is only trained to do 1) either emulate outputs if examples are provided, or 2) generate data based on the description. Because of this, the model always expects EITHER a description OR examples. If you want it to act slightly more like an instruction following chat model, you can add a description such as the following:
@@ -215,23 +215,31 @@ for i in range(n_gens):
 Some example generations:
 ```
 Generation 0:
-
-
-
-
-With fins that
+A shark is a fish that lives in the sea
+They come in different shapes and sizes, you see
+Some are small and cute, some are huge and mean
+But all sharks are dangerous,
 Generation 1:
-
+A master of the ocean blue,
+A silent hunter, strong and true.
+With fins that glide and teeth so sharp,
+A fearsome creature, dark and carp.
+
+It swims through
 Generation 2:
-
-
-
-
+A sleek grey shadow, a silent stride,
+The ocean's hunter, full of pride.
+With fins that cut the water deep,
+And eyes of darkness, secrets keep.
+
+A
 Generation 3:
-
-
-
-
+With fins so sharp and eyes so keen,
+A shark patrols the ocean scene.
+A silent hunter, sleek and gray,
+It glides through waters, come what may.
+
+Its teeth
 ```


@@ -268,14 +276,14 @@ for i in range(n_gens):
 ```
 Output:
 ```
-{"situation": "
-{"situation": "
-{"situation": "
-{"situation": "
+{"situation": "You spilled coffee on your new shoes.", "is_awkward": true}
+{"situation": "While at a restaurant, the waiter asks if you're ready to order.", "is_awkward": false}
+{"situation": "Going out for coffee with a good friend.", "is_awkward": false}
+{"situation": "Having to tell your mother that you didn't go to college.", "is_awkward": true}
 ```

 A few tips and tricks:
 - Do not expect the model to do multi-turn chats. It is designed to be stateless and to treat each data point as "exchangeable" (roughly iid).
 - If all you want is one reasonable answer, then a chat model is likely a better fit. However, if you want to generate many reasonable answers / diverse examples, this model is a better fit.
 - The model is quite good at perspective taking / steering if you provide many examples.
-- The model is reasonably good at expressing epistemic uncertainty over unsure outputs by sampling several times.
+- The model is reasonably good at expressing epistemic uncertainty over unsure outputs by sampling several times.
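The last tip pairs naturally with the JSON example above: sampling the same input several times and tallying the parsed answers gives a crude uncertainty estimate. A sketch, where `generate_one` is a hypothetical callable returning one sampled JSON line (e.g. built from a `generate()` loop like the one sketched earlier):

```python
import json
from collections import Counter

# Sketch: sample the same input several times and tally the parsed answers.
# `generate_one` is a hypothetical callable returning one sampled JSON line.
def tally_is_awkward(generate_one, n_samples=32):
    counts = Counter()
    for _ in range(n_samples):
        try:
            counts[json.loads(generate_one())["is_awkward"]] += 1
        except (json.JSONDecodeError, KeyError):
            continue  # skip malformed generations
    return counts  # e.g. Counter({True: 20, False: 12})
```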
config.json
CHANGED
@@ -4,6 +4,10 @@
   ],
   "boi_token_index": 255999,
   "eoi_token_index": 256000,
+  "eos_token_id": [
+    1,
+    106
+  ],
   "image_token_index": 262144,
   "initializer_range": 0.02,
   "mm_tokens_per_image": 256,
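For context on this change: in the Gemma-3 tokenizer, id 1 is `<eos>` and id 106 is `<end_of_turn>` (worth verifying against the uploaded tokenizer rather than taking on faith). Registering both as `eos_token_id` lets `generate()` stop at `<end_of_turn>` without callers passing it explicitly:

```python
# Sketch: check what the new eos_token_id entries decode to.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tsor13/extra12b")  # any of the three variants
print(tokenizer.convert_ids_to_tokens([1, 106]))  # expected: ['<eos>', '<end_of_turn>']
```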
model-00001-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:aef1debfc6bbefdf97063e789ddd39b8b9d5707b5b5c0eb5cf86cfb0d6875ae5
 size 4979902192
model-00002-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:a212cda2596393ed61ea7430b3cddcca5927fa4f6a4fcbd44647ecb36d70a25c
 size 4931296592
model-00003-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:04e4c52313e4f7f15f8b946fb3dd95b76686bd7319aa61cab159e03956d3abf8
 size 4931296656
model-00004-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:b5d9e4ffcf70a992d2eb53f98905cbd4a22002c716a0f4355cfa0c2c4c572560
 size 4931296656
model-00005-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:2dd992a0fb2de5547f823fc03d09701dbab11f7dd424abbd531f1f970d06155f
 size 4601000928
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:f1b4fda9c4cd88dc413281e68f83ed28b7c00babeb4e12badd5cafb0609fa18b
 size 7377
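The `.safetensors` and `training_args.bin` entries above are Git LFS pointer files: a version line, a `sha256` oid, and a byte size. A downloaded shard can be checked against its pointer with the standard library alone (a sketch, not part of the repo):

```python
import hashlib

# Sketch: recompute a shard's sha256 and size and compare to its LFS pointer.
def verify_lfs_pointer(path, expected_oid, expected_size, chunk=1 << 20):
    digest, size = hashlib.sha256(), 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            digest.update(block)
            size += len(block)
    return digest.hexdigest() == expected_oid and size == expected_size

# verify_lfs_pointer("model-00001-of-00005.safetensors",
#                    "aef1debfc6bbefdf97063e789ddd39b8b9d5707b5b5c0eb5cf86cfb0d6875ae5",
#                    4979902192)
```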