# Ahmet-Gemma3-4B
A Turkish-optimized version of Google's Gemma 3 4B model with a custom Turkish tokenizer.
## Model Description
This model is based on google/gemma-3-4b-it with the embedding layer adapted to use a Turkish-optimized tokenizer (AhmetSemih/tr-gemma-128k-processor).
## Key Features
- Base Model: Google Gemma 3 4B Instruct
- Tokenizer: Custom Turkish tokenizer with 128K vocabulary
- Embedding Transfer: Mean pooling strategy for new tokens
- Precision: bfloat16
## How It Was Created
The model was created by:

1. Loading the original Gemma 3 4B model and tokenizer
2. Mapping token embeddings from the original vocabulary to the new Turkish vocabulary:
   - Direct matches: tokens that exist in both vocabularies keep their original embeddings
   - New tokens: each token is re-tokenized with the original tokenizer and the embeddings of the resulting pieces are averaged (mean pooling)
   - Unmapped tokens: fall back to the UNK token embedding
3. Resizing the embedding layer to match the new vocabulary size
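The mapping step above can be sketched in a few lines of plain Python. This is an illustrative toy, not the repository's actual code: `transfer_embeddings`, `tokenize_old`, and the toy vocabularies are hypothetical names, and embeddings are plain lists rather than real model weights.

```python
def transfer_embeddings(old_vocab, old_emb, new_vocab, tokenize_old, unk_token="<unk>"):
    """Build an embedding row for every token in new_vocab.

    old_vocab / new_vocab: dicts mapping token string -> row index.
    old_emb: list of embedding vectors (one per old-vocab token).
    tokenize_old: splits a string into old-vocab pieces (toy stand-in
    for the original tokenizer).
    """
    unk_vec = old_emb[old_vocab[unk_token]]
    new_emb = [None] * len(new_vocab)
    for token, idx in new_vocab.items():
        if token in old_vocab:
            # Direct match: copy the original embedding unchanged.
            new_emb[idx] = list(old_emb[old_vocab[token]])
        else:
            # New token: average the embeddings of its old-tokenizer pieces.
            piece_vecs = [old_emb[old_vocab[p]] for p in tokenize_old(token) if p in old_vocab]
            if piece_vecs:
                new_emb[idx] = [sum(col) / len(piece_vecs) for col in zip(*piece_vecs)]
            else:
                # Unmapped token: fall back to the UNK embedding.
                new_emb[idx] = list(unk_vec)
    return new_emb
```

In the real pipeline the "old tokenizer" is the stock Gemma tokenizer and the rows are copied into the resized `embed_tokens` (and tied output) weights; the fallback ordering is the same as in the list above.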
## Code
https://github.com/malibayram/embedding-trainer
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AhmetSemih/ahmet-gemma3-4b", torch_dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/ahmet-gemma3-4b")

text = "Merhaba, nasılsın?"  # "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Recommendations

The mean-pooled embeddings are only an initialization, so for best results continue pretraining this model on Turkish text before downstream use.