tsor13 committed
Commit 468abc9 · verified · 1 parent: bdad2f2

Initial upload of fine‑tuned Gemma + custom tokenizer
README.md CHANGED
@@ -10,7 +10,7 @@ It is initialized from `google/gemma-3-12b-pt`.
 **Pros:** distinguishes description/input · closer to chat · best generations(?)
 **Cons:** more tokens than *special*
 
- <details><summary>Example w/ inputs</summary>
+ <details><summary>Example w/ inputs</summary>
 
 ```text
 <start_of_turn>description
@@ -26,7 +26,7 @@ OUTPUT2<end_of_turn>
 ```
 </details>
 
- <details><summary>Example w/o inputs</summary>
+ <details><summary>Example w/o inputs</summary>
 
 ```text
 <start_of_turn>description
@@ -43,11 +43,11 @@ There are three variants of the model for now:
 | **Field** | **special** | **extra** | **chat** |
 |-----------|-------------|-----------|----------|
 | **Model card** | [`tsor13/special12b`](https://huggingface.co/tsor13/special12b) | [`tsor13/extra12b`](https://huggingface.co/tsor13/extra12b) | [`tsor13/chat12b`](https://huggingface.co/tsor13/chat12b) |
- | **Description** | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-it`, trained to preserve & assume chat format |
- | **Pros** | • Most token‑efficient (only tags around the output) | • Distinguishes description vs first input<br>• Closer to chat format<br>• Best generations (?) | • Drop‑in for Gemma‑chat template<br>• Works on original chat logs, even OOD |
+ | **Description** | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-it`, trained to preserve & assume chat format |
+ | **Pros** | • Most token‑efficient (only tags around the output) | • Distinguishes description vs first input<br>• Closer to chat format<br>• Best generations (?) | • Drop‑in for Gemma‑chat template<br>• Works on original chat logs, even OOD |
 | **Cons** | • May not tell description from first input<br>• Formatting farther from Gemma chat template | • More tokens than *special* | • Many extra tokens |
- | **Example w/ inputs** | ```text\nDESCRIPTION\nINPUT1\n<start_of_turn>OUTPUT1<end_of_turn>\nINPUT2\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>input\nINPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>input\nINPUT2<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nINPUT1<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nINPUT2<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |
- | **Example w/o inputs** | ```text\nDESCRIPTION\n<start_of_turn>OUTPUT1<end_of_turn>\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |
+ | **Example w/ inputs** | ```text\nDESCRIPTION\nINPUT1\n<start_of_turn>OUTPUT1<end_of_turn>\nINPUT2\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>input\nINPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>input\nINPUT2<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nINPUT1<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nINPUT2<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |
+ | **Example w/o inputs** | ```text\nDESCRIPTION\n<start_of_turn>OUTPUT1<end_of_turn>\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |
 
 At the moment, I recommend:
 - [special](https://huggingface.co/tsor13/special12b) for most use cases (token-efficient and gets best loss on training data)
@@ -121,17 +121,17 @@ with torch.no_grad():
 
 Output:
 ```
- Top 10 probabilities for first output token:
- 1. 'Tokyo' -> 0.9846
- 2. '東京' -> 0.0032
- 3. 'Tok' -> 0.0023
- 4. 'Ky' -> 0.0023
- 5. 'tok' -> 0.0012
- 6. 'TO' -> 0.0011
- 7. 'T' -> 0.0009
- 8. 'To' -> 0.0005
- 9. 'Toy' -> 0.0004
- 10. '东京' -> 0.0002
+ Top 10 probabilities for first output token:
+ 1. 'Tokyo' -> 0.9904
+ 2. 'Tok' -> 0.0027
+ 3. 'tok' -> 0.0007
+ 4. 'To' -> 0.0006
+ 5. 'Toy' -> 0.0006
+ 6. '東京' -> 0.0005
+ 7. 'Ky' -> 0.0005
+ 8. 'T' -> 0.0002
+ 9. 'Washington' -> 0.0001
+ 10. 'N' -> 0.0001
 ```
 
 Great! Almost all of the probability mass is on the correct answer, Tokyo.
@@ -161,10 +161,10 @@ for i in range(n_gens):
 
 Outputs:
 ```
- Twilight Struggle
- Ark Nova
- Bardsung
- Carcassonne
+ Root
+ Dune: Imperium
+ Gloomhaven
+ Spirit Island
 ```
 Not too bad!
 
@@ -186,10 +186,10 @@ for i in range(n_gens):
 ```
 Output:
 ```
- Light Green
- #ffff00
- yellow
- Bordeaux
+ light steel blue
+ A descriptive list of colors
+ green
+ Blue
 ```
 
 By default, the model is only trained to do 1) either emulate outputs if examples are provided, or 2) generate data based on the description. Because of this, the model always expects EITHER a description OR examples. If you want it to act slightly more like an instruction following chat model, you can add a description such as the following:
@@ -215,23 +215,31 @@ for i in range(n_gens):
 Some example generations:
 ```
 Generation 0:
- Swimming gracefully in the deep blue sea,
- A shark glides by so elegantly.
- Its sleek body moves with precision,
- As it searches for its next meal with attention.
- With fins that
+ A shark is a fish that lives in the sea
+ They come in different shapes and sizes, you see
+ Some are small and cute, some are huge and mean
+ But all sharks are dangerous,
 Generation 1:
- There once was a shark named Sandy, With a smile that would make you drooly. He swam in the deep blue sea, With friends who loved him oh so free.
+ A master of the ocean blue,
+ A silent hunter, strong and true.
+ With fins that glide and teeth so sharp,
+ A fearsome creature, dark and carp.
+
+ It swims through
 Generation 2:
- I was walking through the forest
- and came across a shark.
- I was terrified and ran away,
- leaving the shark alone.
+ A sleek grey shadow, a silent stride,
+ The ocean's hunter, full of pride.
+ With fins that cut the water deep,
+ And eyes of darkness, secrets keep.
+
+ A
 Generation 3:
- In the depths of the ocean, where shadows roam,
- Lurks a creature of ancient lore,
- With jaws that snap and teeth like razors,
- A shark swims through the deep,
+ With fins so sharp and eyes so keen,
+ A shark patrols the ocean scene.
+ A silent hunter, sleek and gray,
+ It glides through waters, come what may.
+
+ Its teeth
 ```
 
 
@@ -268,14 +276,14 @@ for i in range(n_gens):
 ```
 Output:
 ```
- {"situation": "Your family surprised you with tickets to your favorite bands concert.", "is_awkward": false}
- {"situation": "Your boss is talking about how she is going on a trip and she starts to talk about how she is going to a fancy restaurant.", "is_awkward": false}
- {"situation": "You show up at the hospital for an appointment and realize you are one day late.", "is_awkward": true}
- {"situation": "Your crush tells you that they had a terrible day.", "is_awkward": true}
+ {"situation": "You spilled coffee on your new shoes.", "is_awkward": true}
+ {"situation": "While at a restaurant, the waiter asks if you're ready to order.", "is_awkward": false}
+ {"situation": "Going out for coffee with a good friend.", "is_awkward": false}
+ {"situation": "Having to tell your mother that you didn't go to college.", "is_awkward": true}
 ```
 
 A few tips and tricks:
 - Do not expect the model to do multi-turn chats. It is designed to be stateless and to treat each data point as "exchangeable" (roughly iid).
 - If all you want is one reasonable answer, then a chat model is likely a better fit. However, if you want to generate many reasonable answers / diverse examples, this model is a better fit.
 - The model is quite good at perspective taking / steering if you provide many examples.
- - The model is reasonably good at expressing epistemic uncertainty over unsure outputs by sampling several times.
+ - The model is reasonably good at expressing epistemic uncertainty over unsure outputs by sampling several times.
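The *extra* turn layout documented in the README diff above can be assembled with a small helper. This is an illustrative sketch based only on the format shown in the table (description turn, then alternating input/output turns), not the repository's actual tokenizer chat template; `build_extra_prompt` is a hypothetical name:

```python
def build_extra_prompt(description: str, pairs: list[tuple[str, str]]) -> str:
    """Assemble a prompt in the *extra* variant's documented format:
    <start_of_turn>description ... <end_of_turn>, then alternating
    input/output turns, ending with an open output turn to sample from."""
    turns = [f"<start_of_turn>description\n{description}<end_of_turn>"]
    for inp, out in pairs:
        turns.append(f"<start_of_turn>input\n{inp}<end_of_turn>")
        turns.append(f"<start_of_turn>output\n{out}<end_of_turn>")
    # Leave the final output turn open so the model generates its content.
    turns.append("<start_of_turn>output\n")
    return "\n".join(turns)
```

For the *special* variant the same idea applies with bare text and `<start_of_turn>`/`<end_of_turn>` only around outputs, per the table.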
config.json CHANGED
@@ -4,6 +4,10 @@
 ],
 "boi_token_index": 255999,
 "eoi_token_index": 256000,
+ "eos_token_id": [
+ 1,
+ 106
+ ],
 "image_token_index": 262144,
 "initializer_range": 0.02,
 "mm_tokens_per_image": 256,
model-00001-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:9009aab445c6ba6a124ad1d6c87398c489a430a4143f40b15dfc2bec4500f948
+ oid sha256:aef1debfc6bbefdf97063e789ddd39b8b9d5707b5b5c0eb5cf86cfb0d6875ae5
 size 4979902192
model-00002-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:f05f6d6222735068da19c46ec89d833bf9d0784f720b3cf32c7f12c0e4493132
+ oid sha256:a212cda2596393ed61ea7430b3cddcca5927fa4f6a4fcbd44647ecb36d70a25c
 size 4931296592
model-00003-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:5c64456979a4364c4d2f6a837cb0e9ab4776d7461d20bec673681add550d9e37
+ oid sha256:04e4c52313e4f7f15f8b946fb3dd95b76686bd7319aa61cab159e03956d3abf8
 size 4931296656
model-00004-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:0ef908ea7c00422e3f55fa998bdd48f28a230713e2b8f367f8c51ad844b4044b
+ oid sha256:b5d9e4ffcf70a992d2eb53f98905cbd4a22002c716a0f4355cfa0c2c4c572560
 size 4931296656
model-00005-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:9a450e609e69fd370b621763eec0bdcb65c540d6068c4ba6397a5f4d7225241f
+ oid sha256:2dd992a0fb2de5547f823fc03d09701dbab11f7dd424abbd531f1f970d06155f
 size 4601000928
training_args.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
- oid sha256:0527ebae19ec13ceb401f198680edf9141e647089308237a1034fb95ee356b6d
+ oid sha256:f1b4fda9c4cd88dc413281e68f83ed28b7c00babeb4e12badd5cafb0609fa18b
 size 7377