Initial upload of fine‑tuned Gemma + custom tokenizer
- README.md +52 -44
- config.json +4 -0
- model-00001-of-00005.safetensors +1 -1
- model-00002-of-00005.safetensors +1 -1
- model-00003-of-00005.safetensors +1 -1
- model-00004-of-00005.safetensors +1 -1
- model-00005-of-00005.safetensors +1 -1
- training_args.bin +1 -1
README.md
CHANGED
@@ -10,7 +10,7 @@ It is initialized from `google/gemma-3-12b-pt`.
 **Pros:** distinguishes description/input · closer to chat · best generations(?)
 **Cons:** more tokens than *special*

-<details><summary>Example
+<details><summary>Example w/ inputs</summary>

 ```text
 <start_of_turn>description
@@ -26,7 +26,7 @@ OUTPUT2<end_of_turn>
 ```
 </details>

-<details><summary>Example
+<details><summary>Example w/o inputs</summary>

 ```text
 <start_of_turn>description
@@ -43,11 +43,11 @@ There are three variants of the model for now:
 | **Field** | **special** | **extra** | **chat** |
 |-----------|-------------|-----------|----------|
 | **Model card** | [`tsor13/special12b`](https://huggingface.co/tsor13/special12b) | [`tsor13/extra12b`](https://huggingface.co/tsor13/extra12b) | [`tsor13/chat12b`](https://huggingface.co/tsor13/chat12b) |
-| **Description** | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-it`, trained to preserve
-| **Pros** | • Most token‑efficient (only tags around the output) | • Distinguishes description vs first input<br>• Closer to chat format<br>• Best generations (?) | • Drop‑in for Gemma‑chat template<br>• Works on original chat logs, even
+| **Description** | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-pt`, but with chat‑token embeddings copied over | From `gemma-3-12b-it`, trained to preserve & assume chat format |
+| **Pros** | • Most token‑efficient (only tags around the output) | • Distinguishes description vs first input<br>• Closer to chat format<br>• Best generations (?) | • Drop‑in for Gemma‑chat template<br>• Works on original chat logs, even OOD |
 | **Cons** | • May not tell description from first input<br>• Formatting farther from Gemma chat template | • More tokens than *special* | • Many extra tokens |
-| **Example
-| **Example
+| **Example w/ inputs** | ```text\nDESCRIPTION\nINPUT1\n<start_of_turn>OUTPUT1<end_of_turn>\nINPUT2\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>input\nINPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>input\nINPUT2<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nINPUT1<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nINPUT2<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |
+| **Example w/o inputs** | ```text\nDESCRIPTION\n<start_of_turn>OUTPUT1<end_of_turn>\n<start_of_turn>OUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>description\nDESCRIPTION<end_of_turn>\n<start_of_turn>output\nOUTPUT1<end_of_turn>\n<start_of_turn>output\nOUTPUT2<end_of_turn>``` | ```text\n<start_of_turn>user\nGenerate …\nDescription: DESCRIPTION\n\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT1<end_of_turn>\n<start_of_turn>user\nGenerate.<end_of_turn>\n<start_of_turn>model\nOUTPUT2<end_of_turn>``` |

 At the moment, I recommend:
 - [special](https://huggingface.co/tsor13/special12b) for most use cases (token-efficient and gets best loss on training data)
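The table's templates can be rendered mechanically. As a rough sketch (the helper names here are illustrative, not from the model card), building *special*- and *extra*-format prompts from a description plus input/output pairs might look like:

```python
# Sketch: render a description plus (input, output) example pairs into the
# "special" and "extra" formats from the table above. Helper names are
# illustrative, not part of the released code.

def to_special(description, pairs, next_input=None):
    # special: bare description, bare inputs, tags only around outputs
    parts = [description]
    for inp, out in pairs:
        if inp:
            parts.append(inp)
        parts.append(f"<start_of_turn>{out}<end_of_turn>")
    if next_input is not None:
        parts.append(next_input)
    parts.append("<start_of_turn>")  # model continues with the next output
    return "\n".join(parts)

def to_extra(description, pairs, next_input=None):
    # extra: description, each input, and each output get their own turn block
    parts = [f"<start_of_turn>description\n{description}<end_of_turn>"]
    for inp, out in pairs:
        if inp:
            parts.append(f"<start_of_turn>input\n{inp}<end_of_turn>")
        parts.append(f"<start_of_turn>output\n{out}<end_of_turn>")
    if next_input is not None:
        parts.append(f"<start_of_turn>input\n{next_input}<end_of_turn>")
    parts.append("<start_of_turn>output\n")
    return "\n".join(parts)

prompt = to_extra("Capital cities of countries.", [("France", "Paris")], "Japan")
```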
@@ -121,17 +121,17 @@ with torch.no_grad():

 Output:
 ```
-Top 10 probabilities for first output token:
-1. 'Tokyo' -> 0.
-2. '
-3. '
-4. '
-5. '
-6. '
-7. '
-8. '
-9. '
-10. '
+Top 10 probabilities for first output token:
+1. 'Tokyo' -> 0.9904
+2. 'Tok' -> 0.0027
+3. 'tok' -> 0.0007
+4. 'To' -> 0.0006
+5. 'Toy' -> 0.0006
+6. '東京' -> 0.0005
+7. 'Ky' -> 0.0005
+8. 'T' -> 0.0002
+9. 'Washington' -> 0.0001
+10. 'N' -> 0.0001
 ```

 Great! Almost all of the probability mass is on the correct answer, Tokyo.
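The snippet producing this output is only partially visible in the diff (the hunk header shows `with torch.no_grad():`). A minimal sketch of such a readout with standard `transformers`/`torch` APIs, assuming the *extra* format and using `tsor13/extra12b` as the checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: the "extra" checkpoint and stock transformers APIs; the
# README's own snippet is not fully shown in this diff.
model_id = "tsor13/extra12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

prompt = (
    "<start_of_turn>description\nCapital cities of countries.<end_of_turn>\n"
    "<start_of_turn>input\nJapan<end_of_turn>\n"
    "<start_of_turn>output\n"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # distribution over the first output token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=10)

print("Top 10 probabilities for first output token:")
for rank, (p, idx) in enumerate(zip(top.values, top.indices), start=1):
    print(f"{rank}. '{tokenizer.decode(idx)}' -> {p.item():.4f}")
```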
@@ -161,10 +161,10 @@ for i in range(n_gens):

 Outputs:
 ```
-
-
-
-
+Root
+Dune: Imperium
+Gloomhaven
+Spirit Island
 ```
 Not too bad!

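The hunk headers around here reference a `for i in range(n_gens):` loop that is not fully shown. A sketch of what such a sampling loop could look like (the description text and sampling parameters are assumptions; model and tokenizer come from the previous sketch):

```python
# Sketch of an n_gens-style sampling loop; prompt text and parameters assumed.
prompt = (
    "<start_of_turn>description\nBoard games.<end_of_turn>\n"
    "<start_of_turn>output\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

n_gens = 4
for i in range(n_gens):
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=32,
        eos_token_id=end_of_turn_id,  # stop at <end_of_turn>; cf. the config.json change below
    )
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```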
@@ -186,10 +186,10 @@ for i in range(n_gens):
 ```
 Output:
 ```
-
-
-
-
+light steel blue
+A descriptive list of colors
+green
+Blue
 ```

 By default, the model is only trained to do 1) either emulate outputs if examples are provided, or 2) generate data based on the description. Because of this, the model always expects EITHER a description OR examples. If you want it to act slightly more like an instruction following chat model, you can add a description such as the following:
@@ -215,23 +215,31 @@ for i in range(n_gens):
 Some example generations:
 ```
 Generation 0:
-
-
-
-
-With fins that
+A shark is a fish that lives in the sea
+They come in different shapes and sizes, you see
+Some are small and cute, some are huge and mean
+But all sharks are dangerous,
 Generation 1:
-
+A master of the ocean blue,
+A silent hunter, strong and true.
+With fins that glide and teeth so sharp,
+A fearsome creature, dark and carp.
+
+It swims through
 Generation 2:
-
-
-
-
+A sleek grey shadow, a silent stride,
+The ocean's hunter, full of pride.
+With fins that cut the water deep,
+And eyes of darkness, secrets keep.
+
+A
 Generation 3:
-
-
-
-
+With fins so sharp and eyes so keen,
+A shark patrols the ocean scene.
+A silent hunter, sleek and gray,
+It glides through waters, come what may.
+
+Its teeth
 ```


@@ -268,14 +276,14 @@ for i in range(n_gens):
 ```
 Output:
 ```
-{"situation": "
-{"situation": "
-{"situation": "
-{"situation": "
+{"situation": "You spilled coffee on your new shoes.", "is_awkward": true}
+{"situation": "While at a restaurant, the waiter asks if you're ready to order.", "is_awkward": false}
+{"situation": "Going out for coffee with a good friend.", "is_awkward": false}
+{"situation": "Having to tell your mother that you didn't go to college.", "is_awkward": true}
 ```

 A few tips and tricks:
 - Do not expect the model to do multi-turn chats. It is designed to be stateless and to treat each data point as "exchangeable" (roughly iid).
 - If all you want is one reasonable answer, then a chat model is likely a better fit. However, if you want to generate many reasonable answers / diverse examples, this model is a better fit.
 - The model is quite good at perspective taking / steering if you provide many examples.
-- The model is reasonably good at expressing epistemic uncertainty over unsure outputs by sampling several times.
+- The model is reasonably good at expressing epistemic uncertainty over unsure outputs by sampling several times.
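The last tip pairs naturally with the JSON example above: sampling the same input several times and tallying the parsed answers gives a crude uncertainty estimate. A sketch, where `generate_one` is a hypothetical callable returning one sampled JSON line (e.g. built from a `generate()` loop like the one sketched earlier):

```python
import json
from collections import Counter

# Sketch: sample the same input several times and tally the parsed answers.
# `generate_one` is a hypothetical callable returning one sampled JSON line.
def tally_is_awkward(generate_one, n_samples=32):
    counts = Counter()
    for _ in range(n_samples):
        try:
            counts[json.loads(generate_one())["is_awkward"]] += 1
        except (json.JSONDecodeError, KeyError):
            continue  # skip malformed generations
    return counts  # e.g. Counter({True: 20, False: 12})
```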
config.json
CHANGED
@@ -4,6 +4,10 @@
   ],
   "boi_token_index": 255999,
   "eoi_token_index": 256000,
+  "eos_token_id": [
+    1,
+    106
+  ],
   "image_token_index": 262144,
   "initializer_range": 0.02,
   "mm_tokens_per_image": 256,
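For context on this change: in the Gemma-3 tokenizer, id 1 is `<eos>` and id 106 is `<end_of_turn>` (worth verifying against the uploaded tokenizer rather than taking on faith). Registering both as `eos_token_id` lets `generate()` stop at `<end_of_turn>` without callers passing it explicitly:

```python
# Sketch: check what the new eos_token_id entries decode to.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tsor13/extra12b")  # any of the three variants
print(tokenizer.convert_ids_to_tokens([1, 106]))  # expected: ['<eos>', '<end_of_turn>']
```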
model-00001-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:aef1debfc6bbefdf97063e789ddd39b8b9d5707b5b5c0eb5cf86cfb0d6875ae5
 size 4979902192
model-00002-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:a212cda2596393ed61ea7430b3cddcca5927fa4f6a4fcbd44647ecb36d70a25c
 size 4931296592
model-00003-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:04e4c52313e4f7f15f8b946fb3dd95b76686bd7319aa61cab159e03956d3abf8
 size 4931296656
model-00004-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:b5d9e4ffcf70a992d2eb53f98905cbd4a22002c716a0f4355cfa0c2c4c572560
 size 4931296656
model-00005-of-00005.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:2dd992a0fb2de5547f823fc03d09701dbab11f7dd424abbd531f1f970d06155f
 size 4601000928
training_args.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:f1b4fda9c4cd88dc413281e68f83ed28b7c00babeb4e12badd5cafb0609fa18b
 size 7377
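The `.safetensors` and `training_args.bin` entries above are Git LFS pointer files: a version line, a `sha256` oid, and a byte size. A downloaded shard can be checked against its pointer with the standard library alone (a sketch, not part of the repo):

```python
import hashlib

# Sketch: recompute a shard's sha256 and size and compare to its LFS pointer.
def verify_lfs_pointer(path, expected_oid, expected_size, chunk=1 << 20):
    digest, size = hashlib.sha256(), 0
    with open(path, "rb") as f:
        while block := f.read(chunk):
            digest.update(block)
            size += len(block)
    return digest.hexdigest() == expected_oid and size == expected_size

# verify_lfs_pointer("model-00001-of-00005.safetensors",
#                    "aef1debfc6bbefdf97063e789ddd39b8b9d5707b5b5c0eb5cf86cfb0d6875ae5",
#                    4979902192)
```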