Initial upload of fine-tuned Gemma + custom tokenizer

Files changed:
- README.md: +46 −31
- gemma_chat_tokenizer.py: +14 −0

README.md
CHANGED
@@ -37,15 +37,19 @@ print(formatted_prompt) # start_generation adds the <start_of_turn> token to con
 ```
 Output:
 ```
-…
-…
-…
-…
-<…
+<start_of_turn>user
+Generate something that fits this description. Don't generate anything else, just the desired generation output.
+Description: Capitals
+
+France<end_of_turn>
+<start_of_turn>model
+Paris<end_of_turn>
+<start_of_turn>user
+Japan<end_of_turn>
+<start_of_turn>model
+
 ```
 The data for the model to emulate / generate is wrapped in `<start_of_turn>` / `<end_of_turn>` tokens.
-Description and input is not wrapped in anything. Thus, do not expect the model to generate these tokens - instead focus on the wrapped output tokens.
-Messages are separated by newlines.

 In training, loss is ONLY calculated on the output tokens and the `<end_of_turn>` token. Thus, the model is only designed to generate / predict probabilities after `<start_of_turn>` and until `<end_of_turn>` - everything else is out of distribution for the model and not recommended.
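Since this masking rule is central to how the model behaves, here is a minimal sketch of how output-only loss masking of this kind is typically implemented. The function, its arguments, and the single-span simplification are illustrative assumptions, not this repo's actual training code.

```python
# Minimal sketch of output-only loss masking (illustrative; not this repo's
# actual training code). Tokens outside <start_of_turn>...<end_of_turn> spans
# get the label -100, which torch's CrossEntropyLoss ignores.
import torch

def mask_labels(input_ids: torch.Tensor, start_id: int, end_id: int) -> torch.Tensor:
    labels = torch.full_like(input_ids, -100)
    in_output = False
    for i, tok in enumerate(input_ids.tolist()):
        if in_output:
            labels[i] = tok    # supervise span content, including <end_of_turn>
        if tok == start_id:
            in_output = True   # start supervising after <start_of_turn>
        elif tok == end_id:
            in_output = False  # <end_of_turn> itself was supervised above
    return labels
```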
@@ -70,17 +74,17 @@ with torch.no_grad():

 Output:
 ```
-Top 10 probabilities for first output token:
-1. 'Tokyo' -> 0.…
-2. 'Tok' -> 0.…
-3. '…
-4. '…
-5. '…
-6. '…
-7. '…
-8. '…
-9. '…
-10. '…
+Top 10 probabilities for first output token:
+1. 'Tokyo' -> 0.9330
+2. 'Tok' -> 0.0114
+3. 'Ky' -> 0.0064
+4. 'Washington' -> 0.0025
+5. 'To' -> 0.0019
+6. 'Japan' -> 0.0016
+7. 'tok' -> 0.0014
+8. 'N' -> 0.0013
+9. 'K' -> 0.0012
+10. 'Toy' -> 0.0011
 ```

 Great! Almost all of the probability mass is on the correct answer, Tokyo.
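For reference, a readout in this shape can be produced with a few lines of torch. Here `model`, `tokenizer`, and `inputs` are assumed to be the fine-tuned model, this repo's tokenizer, and the encoded prompt from the snippet above.

```python
import torch

with torch.no_grad():
    logits = model(**inputs).logits           # (batch, seq_len, vocab_size)

probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over the next token
top = torch.topk(probs, k=10)

print("Top 10 probabilities for first output token:")
for rank, (p, idx) in enumerate(zip(top.values, top.indices), start=1):
    print(f"{rank}. '{tokenizer.decode(idx)}' -> {p.item():.4f}")
```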
@@ -110,10 +114,10 @@ for i in range(n_gens):

 Outputs:
 ```
-…
-…
-…
-…
+Terraforming Mars
+Scythe
+Concordia
+7 Wonders
 ```
 Not too bad!
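A sampling loop in this shape can produce such varied outputs. The decoding parameters below are assumptions for illustration, not the repo's exact settings.

```python
# Sketch of a sampling loop consistent with the outputs above.
end_of_turn_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

for i in range(n_gens):
    out = model.generate(
        **inputs,
        do_sample=True,               # sample so repeated calls differ
        max_new_tokens=32,
        eos_token_id=end_of_turn_id,  # stop once the output turn closes
    )
    new_tokens = out[0, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```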
@@ -164,14 +168,25 @@ for i in range(n_gens):
 Some example generations:
 ```
 Generation 0:
-…
-…of power and danger,
+No content
 Generation 1:
-…
+An underwater menace,
+With a wide, dark mouth.
+Silent in the deep sea,
+A toothy and a fearsome south.
 Generation 2:
-…
+Shivers of ocean, a silent dread,
+Shadowed fin above your head.
+Eyes of black, a piercing stare,
+Hunting through the depths with care.
+Jaws of power,
 Generation 3:
-…
+Gleaming through the ocean blue,
+A silent hunter, strong and true.
+Sharp teeth and eyes of ancient might,
+A shadow moving in the light.
+
+With graceful fins it gl
 ```
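Generations 2 and 3 cut off mid-phrase, the pattern you see when the token budget runs out before `<end_of_turn>` is sampled. A quick check, reusing `out` and `end_of_turn_id` from the loop above (the cause is an inference here, not something stated in the README):

```python
# Heuristic: did the generation end with <end_of_turn>, or was it truncated?
if out[0, -1].item() != end_of_turn_id:
    print("Truncated by max_new_tokens; raise the budget for complete poems.")
```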
@@ -208,10 +223,10 @@ for i in range(n_gens):
 ```
 Output:
 ```
-{"situation": "…
-{"situation": "…
-{"situation": "…
-{"situation": "…
+{"situation": "You're in the cafeteria at school and your professor is behind you in the line.", "is_awkward": false}
+{"situation": "During your walk home, you notice someone has lost their wallet and pick it up.", "is_awkward": false}
+{"situation": "You're at the bar and someone approaches you.", "is_awkward": false}
+{"situation": "Your friend reveals a secret you already knew but they didn't realize you did.", "is_awkward": false}
 ```

 A few tips and tricks:
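On the structured output shown in this hunk: since the generations are JSON lines, a light validation pass helps before using them downstream. `parse_situation` is a hypothetical helper, not part of this repo.

```python
import json

def parse_situation(line: str):
    """Return the parsed dict, or None if the generation isn't valid JSON."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        return None  # drop malformed generations
    if isinstance(obj, dict) and {"situation", "is_awkward"} <= obj.keys():
        return obj
    return None
```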
gemma_chat_tokenizer.py
CHANGED
@@ -160,6 +160,15 @@ class GemmaChatTokenizer(GemmaTokenizerFast):
         if start_generation and chat_messages[-1]["role"] == "assistant":
             chat_messages.append({"role": "user", "content": default_user_message})

+        # if len(chat_messages) == 1:
+        #     # change to user
+        #     chat_messages[0]["role"] = "user"
+        #     # add
+        # # TAYLOR - manual for now because of the way gemma handles only having a system prompt
+        if not has_input and len(chat_messages) == 1:
+            # add a default user message
+            chat_messages.append({"role": "user", "content": default_user_message})
+
         # Apply chat template
         full_text = self.apply_chat_template(chat_messages, tokenize=False, add_generation_prompt=start_generation)
         # replace <bos> with nothing
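The effect of the new branch: a description-only conversation picks up a default user turn before Gemma's chat template runs, since (per the inline comment) the template does not cope with a system-style message on its own. A sketch of the before/after, with `default_user_message` as defined elsewhere in the file:

```python
# Before the branch (has_input is False, a single description-style message):
messages = [{"role": "description", "content": "Pick a number between 1 and 100"}]

# After the branch, the list the chat template actually sees:
# [
#     {"role": "description", "content": "Pick a number between 1 and 100"},
#     {"role": "user", "content": default_user_message},
# ]
```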
|
@@ -259,6 +268,7 @@ class GemmaChatTokenizer(GemmaTokenizerFast):
         all_texts = ''
         example_inds = []
         dataset_inds = []
+

         for i, item in enumerate(texts):
             processed = self(text=item["text"])
|
@@ -398,6 +408,10 @@ if __name__ == "__main__":

     # Test messages in role/content format
     test_messages = [
+        [
+            {"role": "description", "content": "Pick a number between 1 and 100"},
+        ],
+
         [
             {"role": "description", "content": "This is a test task"},
             {"role": "input", "content": "What is 2+2?"},
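The new first entry in `test_messages` exercises the description-only branch added above. A usage sketch; `format_chat` is a placeholder name, since the class's real entry point that wraps this logic (the one that produced `formatted_prompt` in the README, with its `start_generation` flag) isn't visible in these hunks:

```python
# Sketch: run the new description-only test case through the tokenizer.
# `format_chat` is a placeholder; substitute the class's actual method.
tok = GemmaChatTokenizer.from_pretrained(".")
for messages in test_messages:
    print(tok.format_chat(messages, start_generation=True))
```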