@@ -1,39 +1,48 @@
 ---
 tags:
-- sentence-transformers
-- sentence-similarity
-- feature-extraction
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
 ---
-# SentenceTransformer
-This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-## Model Details
-### Model Description
-- **Model Type:** Sentence Transformer
-<!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
-- **Maximum Sequence Length:** 2048 tokens
-- **Output Dimensionality:** 768 dimensions
-- **Similarity Function:** Cosine Similarity
-<!-- - **Training Dataset:** Unknown -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
-### Model Sources
-- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
-- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
-- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-### Full Model Architecture
-```
 SentenceTransformer(
-  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False}) with Transformer model: Gemma3TextModel
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
   (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
@@ -41,103 +50,357 @@ SentenceTransformer(
 )
 ```
 ## Usage
-### Direct Usage (Sentence Transformers)
-First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
-Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
-# Download from the 🤗 Hub
-model = SentenceTransformer("alibayram/cloned_sentence_transformer")
-# Run inference
 sentences = [
-    'The weather is lovely today.',
-    "It's so sunny outside!",
-    'He drove to the stadium.',
 ]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 768]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
 ```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
-<!--
-## Bias, Risks and Limitations
-*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
--->
-<!--
-### Recommendations
-*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
--->
-## Training Details
-### Framework Versions
-- Python: 3.13.3
-- Sentence Transformers: 4.1.0
-- Transformers: 5.0.0rc1
-- PyTorch: 2.6.0
-- Accelerate: 1.8.1
-- Datasets: 3.6.0
-- Tokenizers: 0.22.1
 ## Citation
-### BibTeX
-<!--
-## Glossary
-*Clearly define terms in order to be accessible across audiences.*
--->
-<!--
-## Model Card Authors
-*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
--->
-<!--
-## Model Card Contact
-*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
--->

 ---
 tags:
+  - sentence-transformers
+  - sentence-similarity
+  - feature-extraction
+  - text-embeddings
+  - turkish
+  - tr
+  - distillation
+  - gemma
 pipeline_tag: sentence-similarity
 library_name: sentence-transformers
+license: mit
 ---
+![embeddingmagibu-152m.png](embeddingmagibu-152m.png)
+# embeddingmagibu-152m
+This is a **Turkish-focused, long-context (2048 tokens)** **SentenceTransformer** model trained to produce sentence embeddings. It projects text into a 768-dimensional normalized vector space and is particularly suited to tasks such as:
+- Semantic textual similarity (STS)
+- Semantic search / retrieval
+- Clustering
+- Classification (embedding-based)
+Rather than being randomly initialized from scratch, the model was built with a three-stage approach:
+1. **Tokenizer retraining** (a BPE tokenizer with a 2^16 vocabulary for Turkish)
+2. **Transformer cloning** (copy the teacher model's weights and compute the embedding table for the new vocabulary)
+3. **Embedding distillation** (precompute the teacher embeddings, then bring the student close to them in a short training run)
+The goal is quality competitive with much larger models while keeping the **parameter count** at roughly **152M**.
+---
+## Model Architecture
+This model uses the following pipeline in the SentenceTransformers format:
+```text
 SentenceTransformer(
+  (0): Transformer({'max_seq_length': 2048, 'do_lower_case': False}) with Transformer model: Gemma3TextModel
   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
   (2): Dense({'in_features': 768, 'out_features': 3072, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
   (3): Dense({'in_features': 3072, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
 )
 ```
+---
+## Training Process and Build Details
+This section describes how the model was produced, in as reproducible and technical a way as possible.
+### 1) Tokenizer: 2^16 vocab BPE (SentencePiece)
+- **Tokenizer type:** BPE
+- **Vocab:** $2^{16} = 65{,}536$ tokens
+- **Training library:** SentencePiece
+- **Tokenizer training data:** [ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0](https://huggingface.co/datasets/ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0)
+  - Per the dataset card: a Turkish pretraining corpus of ~15B tokens, drawn from a wide range of sources, with URL-based deduplication applied.
+  - License: CC-BY-4.0
+Goal: obtain a subword distribution better suited to Turkish text, which lets the **vocabulary shrink** and reduces the **embedding-table parameters**.
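To make the parameter saving concrete, the sketch below computes embedding-table sizes as vocab_size × hidden_size. The hidden size (768) and the new vocabulary (2^16) come from this card; `TEACHER_VOCAB` is an assumption about the teacher tokenizer and is not stated here, so treat the "saved" figure as illustrative only.

```python
# Embedding-table size = vocab_size x hidden_size.
HIDDEN_SIZE = 768        # from the Pooling/Dense configuration in this card
NEW_VOCAB = 2 ** 16      # 65,536 tokens, as stated above
TEACHER_VOCAB = 262_144  # ASSUMPTION: teacher vocabulary size, not stated in this card


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a (vocab_size x hidden_size) embedding table."""
    return vocab_size * hidden_size


new_table = embedding_params(NEW_VOCAB, HIDDEN_SIZE)
teacher_table = embedding_params(TEACHER_VOCAB, HIDDEN_SIZE)

print(f"new table:     {new_table:,}")                  # 50,331,648
print(f"teacher table: {teacher_table:,}")              # 201,326,592 (under the assumption)
print(f"saved:         {teacher_table - new_table:,}")  # 150,994,944 (under the assumption)
```

The transformer layers are untouched by this calculation; only the token embedding table scales with the vocabulary.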
+### 2) Model Cloning: `transformer-cloner`
+Instead of initializing a model from scratch, a Python library called `transformer-cloner` was used to adapt the teacher model (EmbeddingGemma) to the new tokenizer while **preserving** as many of its weights as possible:
+- PyPI: [transformer-cloner](https://pypi.org/project/transformer-cloner/)
+- GitHub: [malibayram/transformer-cloner](https://github.com/malibayram/transformer-cloner)
+#### Core idea
+The teacher's transformer layers (attention/MLP/LayerNorm, etc.) are kept as they are; the real problem is that changing the tokenizer makes the **token embedding table** (vocab_size × hidden_size) incompatible.
+`transformer-cloner` builds a **token-id mapping** from each token in the new tokenizer to the teacher tokenizer:
+- `build_token_id_map()` returns, for every `target_token_id` in the new vocabulary, one or more `source_token_id`s on the teacher side: `{target_id: [source_ids...]}`
+- The `clone()` step then transfers the embeddings by **computing** them from this map.
+#### Embedding transfer strategy
+When a new token splits into several teacher tokens, the teacher embeddings are combined. `transformer-cloner` defines this combination through `EmbeddingStrategy` (MEAN, WEIGHTED, FIRST, LAST, etc.).
+Approach used in this project:
+- Each new token's embedding is computed from the embeddings of its matching teacher tokens with an averaging-style strategy (in practice MEAN/WEIGHTED).
+As a result:
+- The student model starts from embeddings derived from the teacher rather than from a **random init**.
+- Because the vocabulary is smaller, the embedding table and the total parameter count shrink substantially.
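The mapping-then-averaging idea can be illustrated with a toy example. This is a minimal sketch, not the `transformer-cloner` implementation: the vocabularies, the greedy `source_tokenize` helper, and `build_toy_token_id_map` are all hypothetical stand-ins, and only the MEAN strategy is shown.

```python
from statistics import fmean

# Toy teacher vocabulary and embeddings (hidden size 2 for readability).
source_vocab = {"ki": 0, "tap": 1, "lar": 2, "kitap": 3}
source_emb = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [2.0, 2.0], 3: [4.0, 0.0]}

# Toy new vocabulary; "kitaplar" does not exist in the teacher vocabulary.
target_vocab = {"kitap": 0, "kitaplar": 1}


def source_tokenize(text):
    """Greedy longest-match tokenization against the toy teacher vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in source_vocab:
                ids.append(source_vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


def build_toy_token_id_map():
    """Map every target id to the teacher ids that spell the same string."""
    return {tid: source_tokenize(tok) for tok, tid in target_vocab.items()}


def mean_embedding(source_ids):
    """MEAN strategy: average the teacher vectors dimension by dimension."""
    return [fmean(dim) for dim in zip(*(source_emb[s] for s in source_ids))]


token_id_map = build_toy_token_id_map()
target_emb = {tid: mean_embedding(src) for tid, src in token_id_map.items()}
print(token_id_map)  # {0: [3], 1: [3, 2]}
print(target_emb)    # {0: [4.0, 0.0], 1: [3.0, 1.0]}
```

A token that exists in both vocabularies ("kitap") keeps its teacher vector unchanged, while a token that splits ("kitaplar") receives the average of its pieces.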
+### 3) Precomputing teacher embeddings: `distil-trainer` (dataset generation)
+So that the teacher model does not have to run at every distillation step, its embeddings were computed ahead of time and saved as a Hugging Face dataset.
+- PyPI: [distil-trainer](https://pypi.org/project/distil-trainer/)
+- GitHub: [malibayram/distil-trainer](https://github.com/malibayram/distil-trainer)
+This stage uses `TeacherEmbeddingsGenerator`:
+- Code: [src/distil_trainer/data/embeddings_generator.py](https://github.com/malibayram/distil-trainer/blob/main/src/distil_trainer/data/embeddings_generator.py)
+The generator can produce the following outputs:
+- `teacher_embedding_final`: the final embedding of the SentenceTransformer pipeline (the `encode` output, including Normalize)
+- `teacher_embedding_pre_dense`: the embedding before the Dense layers (if Dense layers are present)
+In this project, teacher embeddings were computed for **300,000 examples** and written to this dataset:
+- [alibayram/cosmos-corpus-0-05-with-embeddings](https://huggingface.co/datasets/alibayram/cosmos-corpus-0-05-with-embeddings)
+  - Number of rows: 300,000
+  - Columns in practice: `text`, `teacher_embedding_final`, and, if Dense layers are present, `teacher_embedding_pre_dense`
+> Note: In the dataset preview, the embedding columns appear as arrays of floats.
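The "compute once, store, reuse" pattern behind this step can be sketched as follows. This does not use `TeacherEmbeddingsGenerator`; `fake_teacher_encode` is a hypothetical stand-in for the teacher model, and only the column names `text` and `teacher_embedding_final` are taken from the card.

```python
import hashlib
import json
import math


def fake_teacher_encode(text, dim=8):
    """Stand-in for the teacher: a deterministic, L2-normalized vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    vec = [b - 127.5 for b in digest[:dim]]  # never all zeros, so the norm is > 0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def build_rows(texts):
    """Pair each text with its precomputed teacher embedding."""
    return [{"text": t, "teacher_embedding_final": fake_teacher_encode(t)} for t in texts]


rows = build_rows(["Bugün hava çok güzel.", "Stadyuma arabayla gitti."])

# One JSON object per line; a training loop can later read these rows
# without ever loading the teacher model.
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
print(len(rows), len(rows[0]["teacher_embedding_final"]))  # 2 8
```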
+### 4) Embedding distillation training (A100 80GB, ~4 hours)
+The student model was trained to approximate the teacher's embedding space.
+Main components used in this training run:
+- `EmbeddingDistillationTrainer` and `EmbeddingTrainerConfig`
+- Example script reference: [distil-trainer/train.py](https://github.com/malibayram/distil-trainer/blob/main/train.py)
+Reported settings for this project (summary):
+- **target_type:** `final` (the teacher's final embedding is the target)
+- **loss:** cosine
+- **num_epochs:** 1
+- **batch_size:** 256
+- **learning_rate:** 5e-5
+- **warmup_ratio:** 0.01
+- **weight_decay:** 0.01
+- **max_grad_norm:** 1.0
+- **precision:** bf16
+- **gradient_checkpointing:** enabled
+- **torch.compile:** enabled
+- **checkpointing:** save every 100 steps
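As an illustration of these settings, the sketch below shows one common form of a cosine distillation loss (mean of 1 − cosine similarity) and the step counts implied by 300,000 examples, one epoch, and batch size 256. Whether `distil-trainer` defines its cosine loss exactly this way is an assumption, as are the lack of gradient accumulation, keeping the last partial batch, and rounding the warmup up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_distillation_loss(student_batch, teacher_batch):
    """Mean of (1 - cosine similarity) over the batch."""
    losses = [1.0 - cosine_similarity(s, t) for s, t in zip(student_batch, teacher_batch)]
    return sum(losses) / len(losses)


# Schedule arithmetic from the reported settings (assumptions noted above).
num_examples, batch_size, warmup_ratio, save_every = 300_000, 256, 0.01, 100
total_steps = math.ceil(num_examples / batch_size)
warmup_steps = math.ceil(total_steps * warmup_ratio)
checkpoints = total_steps // save_every

# Toy batch: the first pair is perfectly aligned, the second is orthogonal.
student = [[1.0, 0.0], [0.0, 2.0]]
teacher = [[1.0, 0.0], [3.0, 0.0]]
print(cosine_distillation_loss(student, teacher))  # 0.5
print(total_steps, warmup_steps, checkpoints)      # 1172 12 11
```

Because the loss depends only on direction, it is insensitive to vector length, which fits a target that is already normalized.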
+Training hardware:
+- **GPU:** NVIDIA A100 80GB (rented)
+- **Duration:** ~4 hours
+Experiment tracking:
+- Weights & Biases run: https://api.wandb.ai/links/alibayram-ytu/srxzzhof
+---
+## Evaluation
+Results are reported at two levels in this section:
+1. Periodic checks during training: comparison on STSbTR
+2. After training: the TR-MTEB (Turkish MTEB) benchmark
+### 1) STSbTR (figenfikri/stsb_tr) comparison
+After each checkpoint (every 100 steps), Pearson/Spearman correlations were tracked on [figenfikri/stsb_tr](https://huggingface.co/datasets/figenfikri/stsb_tr).
+Example record (timestamp: 2026-01-02):
+| Model                                                       | Pearson | Spearman |
+| ----------------------------------------------------------- | ------: | -------: |
+| intfloat/multilingual-e5-large-instruct                     |  0.8275 |   0.8129 |
+| trmteb/turkish-embedding-model-fine-tuned                   |  0.8215 |   0.8061 |
+| ytu-ce-cosmos/turkish-e5-large                              |  0.8090 |   0.7906 |
+| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |  0.7884 |   0.7659 |
+| **embeddingmagibu-152m (this model)**                       |  0.7512 |   0.7305 |
+| google/embeddinggemma-300m (teacher)                        |  0.7391 |   0.7194 |
+This observation shows that, as distillation progresses, the student can surpass the teacher on this benchmark (0.7512 vs. 0.7391 Pearson) while approaching strong Turkish embedding models.
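For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how Pearson and Spearman correlations are computed between model similarity scores and human ratings. The `predicted` and `gold` values are hypothetical and do not come from STSbTR; a real evaluation would typically rely on a library such as SciPy.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical values: model cosine similarities vs. human scores (0-5).
predicted = [0.10, 0.40, 0.35, 0.90]
gold = [0.5, 2.0, 3.0, 5.0]
print(round(spearman(predicted, gold), 4))  # 0.8
print(round(pearson(predicted, gold), 4))
```

Pearson rewards a linear relationship between scores, whereas Spearman only checks that pairs are ordered the same way, which is why the two columns in the table differ.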
+### 2) TR-MTEB (MTEB-TR) benchmark
+TR-MTEB is a benchmark that evaluates Turkish embedding models across multiple tasks:
+- Reference repo: [selmanbaysan/mteb_tr](https://github.com/selmanbaysan/mteb_tr)
+- For this project, a Space built from a fork plus a UI has also been published: https://huggingface.co/spaces/alibayram/mteb-turkish
+#### Average over the 15 tasks used by EmbeddingGemma (reported)
+| Model                                           |   Average |
+| ----------------------------------------------- | --------: |
+| intfloat/multilingual-e5-large-instruct         |     72.77 |
+| intfloat/multilingual-e5-large                  |     72.28 |
+| ytu-ce-cosmos/turkish-e5-large                  |     72.22 |
+| google/embeddinggemma-300m                      |     70.97 |
+| selmanbaysan/turkish embedding model fine tuned |     70.47 |
+| **embeddingmagibu-152m (this model)**           | **69.68** |
+#### Average over all 24 TR-MTEB tasks (reported)
+| Model                                           |   Average |
+| ----------------------------------------------- | --------: |
+| intfloat/multilingual-e5-large                  |     65.59 |
+| ytu-ce-cosmos/turkish-e5-large                  |     64.84 |
+| intfloat/multilingual-e5-large-instruct         |     64.72 |
+| Alibaba-NLP/gte-multilingual-base               |     63.25 |
+| intfloat/multilingual-e5-base                   |     63.00 |
+| **embeddingmagibu-152m (this model)**           | **62.57** |
+| selmanbaysan/turkish embedding model fine tuned |     62.17 |
+> Comment: On some tasks this model comes close to much larger, far more expensively trained models, while its smaller size and 2048-token context length aim for a practical balance.
+---
 ## Usage
+### Sentence Transformers
+Installation:
 ```bash
 pip install -U sentence-transformers
 ```
+Basic usage:
 ```python
 from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("alibayram/embeddingmagibu-152m", trust_remote_code=True)
 sentences = [
+    "Bugün hava çok güzel.",
+    "Dışarısı güneşli.",
+    "Stadyuma arabayla gitti.",
 ]
+embeddings = model.encode(sentences, normalize_embeddings=True)
+print(embeddings.shape)  # (3, 768)
+```
+Computing similarity:
+```python
+import numpy as np
+sim = embeddings @ embeddings.T  # for normalized embeddings, cosine similarity == dot product
+print(sim)
 ```
+### Query/document mode (with prompts)
+The EmbeddingGemma family supports prompt formats that distinguish queries from documents. This model's Pooling configuration sets `include_prompt=True`, and the model can be used with `encode_query` / `encode_document` in SentenceTransformers (both methods are available in Sentence Transformers v5.0 and later).
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("alibayram/embeddingmagibu-152m", trust_remote_code=True)
+query = "Mars neden kırmızı gezegen olarak bilinir?"
+docs = [
+    "Mars, yüzeyindeki demir oksit nedeniyle kırmızı görünür.",
+    "Venüs atmosferi çok yoğundur.",
+]
+q = model.encode_query(query)
+d = model.encode_document(docs)
+scores = model.similarity(q, d)
+print(scores)
+```
+### Precision note
+EmbeddingGemma activations can run into problems with `float16`; prefer `bfloat16` or `float32` where possible.
+---
+## Export (GGUF / Ollama)
+Because of its small size and long-context support, the model has been converted to GGUF in bf16 format and packaged for distribution through Ollama.
+[Ollama Hub page](https://ollama.com/alibayram/embeddingmagibu-152m)
+---
+## Intended Uses
+- Turkish semantic search / RAG indexing
+- Turkish STS / duplicate detection
+- Clustering / topic discovery
+- Embedding-based classification
+## Out-of-Scope Use
+- This model is not intended to be a chat / instruction-following LLM.
+- It is not designed to make safety-critical decisions automatically on its own.
+---
+## Biases and Limitations
+- **Data-driven biases:** Web data inevitably may contain social/cultural biases.
+- **Generalization:** Performance may drop on languages other than Turkish.
+- **Long texts:** Although up to 2048 tokens are supported, very long content may need chunking + pooling strategies for best results.
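One possible chunking + pooling strategy for long texts is sketched below: split the token sequence into windows of at most 2048 tokens, embed each window, then average and re-normalize. The helper names, the optional overlap, and the unweighted mean are illustrative assumptions rather than a recommendation from the model authors; the per-chunk vectors here are hypothetical placeholders for real model outputs.

```python
import math


def chunk_tokens(tokens, max_len=2048, overlap=0):
    """Split a token list into windows of at most max_len tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, max(len(tokens) - overlap, 1), step)]


def mean_pool(chunk_embeddings):
    """Average the chunk vectors, then L2-normalize the result."""
    dim = len(chunk_embeddings[0])
    mean = [sum(e[d] for e in chunk_embeddings) / len(chunk_embeddings) for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean]


tokens = list(range(5000))  # stand-in for real token ids
chunks = chunk_tokens(tokens, max_len=2048)
print([len(c) for c in chunks])  # [2048, 2048, 904]

# Hypothetical per-chunk embeddings; in practice each chunk is encoded by the model.
doc_vector = mean_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(doc_vector)
```

A length-weighted mean, or keeping one vector per chunk in the index, are common alternatives when chunks differ a lot in size.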
+---
+## Model Production Pipeline
+The production pipeline for this work:
+1. Train a SentencePiece BPE tokenizer on the Cosmos Turkish corpus
+2. Use `transformer-cloner` to map the teacher EmbeddingGemma embeddings onto the new vocabulary and clone the model
+3. Use `distil-trainer` to write the teacher embeddings to a dataset (`teacher_embedding_final`, `teacher_embedding_pre_dense`)
+4. Distill the student on this dataset with a cosine loss
+5. Evaluate with STSbTR and TR-MTEB
+---
 ## Citation
+If you use this model or its training pipeline in academic work, please consider citing the references below.
+### This model
+```bibtex
+@misc{embeddingmagibu_152m_2025,
+  title={embeddingmagibu-152m: A Turkish-Focused Long-Context Sentence Embedding Model},
+  author={Bayram, M. Ali},
+  year={2025},
+  url={https://huggingface.co/alibayram/embeddingmagibu-152m}
+}
+```
+### EmbeddingGemma
+BibTeX as given in the EmbeddingGemma model card:
+```bibtex
+@article{embedding_gemma_2025,
+  title={EmbeddingGemma: Powerful and Lightweight Text Representations},
+  author={Schechter Vera, Henrique* and Dua, Sahil* and Zhang, Biao and Salz, Daniel and Mullins, Ryan and Raghuram Panyam, Sindhu and others},
+  year={2025},
+  url={https://arxiv.org/abs/2509.20354}
+}
+```
+### distil-trainer
+Reference suggested on the PyPI page:
+```bibtex
+@software{distil_trainer,
+  title = {Distil Trainer: A Comprehensive Knowledge Distillation Framework},
+  author = {Bayram, M. Ali},
+  year = {2025},
+  url = {https://github.com/malibayram/distil-trainer}
+}
+```
+### transformer-cloner
+```bibtex
+@software{transformer_cloner,
+  title = {Transformer Cloner: Clone and prune transformer models with new tokenizers},
+  author = {Bayram, M. Ali},
+  year = {2025},
+  url = {https://github.com/malibayram/transformer-cloner}
+}
+```
+### MTEB
+```bibtex
+@article{muennighoff2022mteb,
+  title = {MTEB: Massive Text Embedding Benchmark},
+  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loic and Reimers, Nils},
+  year = {2022},
+  url = {https://arxiv.org/abs/2210.07316}
+}
+```
+### MTEB-TR
+```bibtex
+@inproceedings{baysan-gungor-2025-tr,
+    title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
+    author = "Baysan, Mehmet Selman and Gungor, Tunga",
+    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
+    month = nov,
+    year = "2025",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2025.findings-emnlp.471/",
+    doi = "10.18653/v1/2025.findings-emnlp.471",
+    pages = "8867--8887",
+    ISBN = "979-8-89176-335-7",
+}
+```
+---
+## Model Card Authors / Contact
+- Ali Bayram (alibayram)