Ahmet-Gemma3-4B

A Turkish-optimized version of Google's Gemma 3 4B model with a custom Turkish tokenizer.

Model Description

This model is based on google/gemma-3-4b-it with the embedding layer adapted to use a Turkish-optimized tokenizer (AhmetSemih/tr-gemma-128k-processor).

Key Features

  • Base Model: Google Gemma 3 4B Instruct
  • Tokenizer: Custom Turkish tokenizer with 128K vocabulary
  • Embedding Transfer: Mean pooling strategy for new tokens
  • Precision: bfloat16

How It Was Created

The model was created by:

  1. Loading the original Gemma 3 4B model and tokenizer
  2. Mapping token embeddings from the original vocabulary to the new Turkish vocabulary:
    • Direct matches: Tokens existing in both vocabularies keep their original embeddings
    • New tokens: Tokenized using the original tokenizer, then embeddings are averaged
    • Unmapped tokens: Fall back to UNK token embedding
  3. Resizing the embedding layer to match the new vocabulary size
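The mapping steps above can be sketched in plain Python. Everything here is illustrative: the toy vocabularies, the `tokenize_old` stand-in, and the function name are assumptions for the sketch, not the actual implementation in the linked repository, which operates on the real Gemma and Turkish tokenizers.

```python
# Sketch of the embedding-transfer strategy described above, on toy data.
# Vocabularies and the tokenizer stand-in are illustrative only.

def transfer_embeddings(old_vocab, old_embeddings, new_vocab, tokenize_old,
                        unk_token="<unk>"):
    """Build an embedding row for every token in new_vocab.

    old_vocab:      token -> row index into old_embeddings
    old_embeddings: list of embedding vectors (lists of floats)
    new_vocab:      list of tokens in the new (Turkish) vocabulary
    tokenize_old:   splits a string into tokens of the ORIGINAL vocabulary
    """
    new_embeddings = []
    for token in new_vocab:
        if token in old_vocab:
            # Direct match: keep the original embedding.
            new_embeddings.append(old_embeddings[old_vocab[token]])
            continue
        pieces = [p for p in tokenize_old(token) if p in old_vocab]
        if pieces:
            # New token: mean-pool the embeddings of its original-tokenizer pieces.
            rows = [old_embeddings[old_vocab[p]] for p in pieces]
            new_embeddings.append([sum(vals) / len(rows) for vals in zip(*rows)])
        else:
            # Unmapped token: fall back to the UNK embedding.
            new_embeddings.append(old_embeddings[old_vocab[unk_token]])
    return new_embeddings


# Toy example with 2-dimensional embeddings.
old_vocab = {"<unk>": 0, "mer": 1, "haba": 2, "gün": 3}
old_emb = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tokenize_old = lambda t: [t[:3], t[3:]] if len(t) > 3 else [t]

new_emb = transfer_embeddings(old_vocab, old_emb,
                              ["gün", "merhaba", "xyz"], tokenize_old)
print(new_emb[0])  # "gün" exists in both vocabularies -> [5.0, 6.0]
print(new_emb[1])  # "merhaba" -> mean of "mer" and "haba" -> [2.0, 3.0]
print(new_emb[2])  # "xyz" maps to nothing -> UNK embedding [0.0, 0.0]
```

After building the new table, the real pipeline resizes the model's embedding layer (step 3) and copies these rows in, so every token in the 128K Turkish vocabulary starts from a reasonable initialization rather than random noise.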

Code

https://github.com/malibayram/embedding-trainer

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AhmetSemih/ahmet-gemma3-4b", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("AhmetSemih/ahmet-gemma3-4b")

text = "Merhaba, nasılsın?"  # "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Recommendations

For best results, continue pretraining this model on Turkish text so the newly initialized (mean-pooled) embeddings can adapt to the new vocabulary.
