# Ahmet-Gemma3-4B
A Turkish-optimized version of Google's Gemma 3 4B model with a custom Turkish tokenizer.
## Model Description
This model is based on google/gemma-3-4b-it with the embedding layer adapted to use a Turkish-optimized tokenizer (AhmetSemih/tr-gemma-128k-processor).
## Key Features
- Base Model: Google Gemma 3 4B Instruct
- Tokenizer: Custom Turkish tokenizer with 128K vocabulary
- Embedding Transfer: Mean pooling strategy for new tokens
- Precision: bfloat16
## How It Was Created
The model was created by:

1. Loading the original Gemma 3 4B model and tokenizer
2. Mapping token embeddings from the original vocabulary to the new Turkish vocabulary:
   - Direct matches: tokens that exist in both vocabularies keep their original embeddings
   - New tokens: each token is re-tokenized with the original tokenizer and the embeddings of the resulting pieces are averaged (mean pooling)
   - Unmapped tokens: fall back to the UNK token embedding
3. Resizing the embedding layer to match the new vocabulary size
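The mapping step above can be sketched in a few lines of plain Python. This is an illustrative toy, not the repository's actual code: `transfer_embeddings`, `tokenize_old`, and the toy vocabularies are hypothetical names, and embeddings are plain lists rather than real model weights.

```python
def transfer_embeddings(old_vocab, old_emb, new_vocab, tokenize_old, unk_token="<unk>"):
    """Build an embedding row for every token in new_vocab.

    old_vocab / new_vocab: dicts mapping token string -> row index.
    old_emb: list of embedding vectors (one per old-vocab token).
    tokenize_old: splits a string into old-vocab pieces (toy stand-in
    for the original tokenizer).
    """
    unk_vec = old_emb[old_vocab[unk_token]]
    new_emb = [None] * len(new_vocab)
    for token, idx in new_vocab.items():
        if token in old_vocab:
            # Direct match: copy the original embedding unchanged.
            new_emb[idx] = list(old_emb[old_vocab[token]])
        else:
            # New token: average the embeddings of its old-tokenizer pieces.
            piece_vecs = [old_emb[old_vocab[p]] for p in tokenize_old(token) if p in old_vocab]
            if piece_vecs:
                new_emb[idx] = [sum(col) / len(piece_vecs) for col in zip(*piece_vecs)]
            else:
                # Unmapped token: fall back to the UNK embedding.
                new_emb[idx] = list(unk_vec)
    return new_emb
```

In the real pipeline the "old tokenizer" is the stock Gemma tokenizer and the rows are copied into the resized `embed_tokens` (and tied output) weights; the fallback ordering is the same as in the list above.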
## Code
https://github.com/malibayram/embedding-trainer
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AhmetSemih/ahmet-gemma3-4b", torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/ahmet-gemma3-4b")

text = "Merhaba, nasılsın?"  # "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Recommendations

The mean-pooled embeddings are only an initialization, so for best results continue pretraining this model on Turkish text before downstream use.