turkish-sentence-encoder / README.md

Basar2004

metadata

language:
  - tr
license: apache-2.0
library_name: sentence-transformers
tags:
  - sentence-embeddings
  - sentence-similarity
  - sentence-transformers
  - feature-extraction
  - turkish
  - contrastive-learning
  - mteb
pipeline_tag: sentence-similarity
model-index:
  - name: turkish-sentence-encoder
    results:
      - task:
          type: Classification
        dataset:
          name: MTEB MassiveIntentClassification (tr)
          type: mteb/amazon_massive_intent
          config: tr
          split: test
        metrics:
          - type: accuracy
            value: 0
      - task:
          type: Classification
        dataset:
          name: MTEB MassiveScenarioClassification (tr)
          type: mteb/amazon_massive_scenario
          config: tr
          split: test
        metrics:
          - type: accuracy
            value: 0
      - task:
          type: STS
        dataset:
          name: MTEB STS22 (tr)
          type: mteb/sts22-crosslingual-sts
          config: tr
          split: test
        metrics:
          - type: cosine_spearman
            value: 0

Turkish Sentence Encoder

A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs.

Model Description

This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:

Semantic similarity
Semantic search / retrieval
Clustering
Paraphrase detection

Usage

Using with custom code

import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer.
# trust_remote_code=True allows transformers to run the custom model code shipped in the repository.
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")
model.eval()  # disable dropout for inference

# Encode sentences (inputs longer than 64 tokens are truncated)
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]

inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    # One 512-dimensional vector per sentence, per the model description
    embeddings = model(**inputs)

# Compute cosine similarity between the two sentence embeddings
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")

Using with Sentence-Transformers (after installing custom wrapper)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
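
Either route above yields one vector per sentence. As a minimal sketch of the semantic search use case, the snippet below ranks a small corpus against a query by cosine similarity. It is dependency-free and uses toy 3-dimensional vectors in place of real 512-dimensional embeddings; the helper names cosine and rank_by_similarity are illustrative and not part of this model's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, corpus_vecs, top_k=3):
    """Return (index, score) pairs for the top_k most similar corpus vectors, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for sentence embeddings (hypothetical values).
corpus = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

results = rank_by_similarity(query, corpus, top_k=2)
for index, score in results:
    print(f"corpus[{index}] score={score:.4f}")
```

For a large corpus, the same ranking would normally be done with batched matrix operations or an approximate nearest-neighbour index rather than a Python loop.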

Evaluation Results

Metric	Score
Spearman Correlation	0.7315
Pearson Correlation	0.8593
Paraphrase Accuracy	0.9695
MRR	0.9172
Recall@1	0.87
Recall@5	0.97
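
The table does not state the evaluation set or protocol. Assuming the retrieval metrics follow their standard definitions, with exactly one correct paraphrase per query, MRR and Recall@k can be computed from the 1-based rank at which each query's correct match appears. The ranks below are made up for illustration and are not this model's results.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (rank is 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct match appears within the top k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct paraphrase for five queries.
ranks = [1, 1, 2, 5, 10]

mrr = mean_reciprocal_rank(ranks)
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
print(f"MRR={mrr:.2f} Recall@1={r_at_1:.2f} Recall@5={r_at_5:.2f}")
```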

Training Details

Training Data: Turkish paraphrase pairs (200K pairs)
Loss Function: InfoNCE (contrastive loss)
Temperature: 0.05
Batch Size: 32
Base Model: Custom Transformer encoder pretrained with MLM on Turkish text
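
To make the loss concrete, here is a dependency-free sketch of InfoNCE with temperature 0.05. It assumes in-batch negatives (for each anchor, the other pairs' positives in the batch act as negatives) and L2-normalised embeddings, so the dot product equals cosine similarity. Neither assumption is confirmed by the details above, and the function name info_nce is illustrative rather than taken from the training code.

```python
import math

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch of paraphrase pairs.

    anchors[i] and positives[i] form a pair; positives[j] with j != i are
    treated as negatives for anchors[i] (assumption: in-batch negatives).
    Vectors are assumed to be L2-normalised.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Similarity of this anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, p)) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the true pair as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-dimensional unit vectors (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])  # pairs match
swapped_loss = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])  # pairs mismatched
print(f"aligned={aligned_loss:.6f} swapped={swapped_loss:.2f}")
```

A low temperature such as 0.05 sharpens the softmax: a similarity gap of 1.0 becomes a logit gap of 20, so the loss concentrates on the hardest negatives in the batch.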

Architecture

Hidden Size: 512
Layers: 12
Attention Heads: 8
Max Sequence Length: 64
Vocab Size: 32,000 (Unigram tokenizer)
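
The listed hyperparameters can be collected in a small config object to derive quantities that follow directly from them. This is a sketch only: the class and field names are hypothetical and do not mirror the repository's actual configuration file, and the embedding count assumes a standard, unfactorised token embedding table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the architecture values listed above."""
    hidden_size: int = 512
    num_layers: int = 12
    num_heads: int = 8
    max_seq_len: int = 64
    vocab_size: int = 32_000

    def __post_init__(self):
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        """Dimension of each attention head."""
        return self.hidden_size // self.num_heads

    @property
    def token_embedding_params(self) -> int:
        """Parameters in a vocab_size x hidden_size embedding table (assumed layout)."""
        return self.vocab_size * self.hidden_size

cfg = EncoderConfig()
print(cfg.head_dim)                # 64
print(cfg.token_embedding_params)  # 16384000
```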

Limitations

Optimized for Turkish only
Max sequence length is 64 tokens; longer inputs are truncated
Best suited for sentence-level (not document-level) embeddings
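
One common workaround for the 64-token limit is to split a longer text into overlapping windows, embed each window, and average the resulting vectors. The sketch below shows only the windowing and averaging steps. It is not part of this model and has not been evaluated with it; the function names and the stride value are assumptions, and any special tokens the tokenizer adds would need to be budgeted within max_len.

```python
def chunk_token_ids(token_ids, max_len=64, stride=48):
    """Split a token-id sequence into overlapping windows of at most max_len tokens."""
    if stride <= 0 or stride > max_len:
        raise ValueError("stride must be in the range 1..max_len")
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_len])
        # Stop once a window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return chunks

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors, e.g. one embedding per chunk."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# 150 placeholder token ids standing in for a tokenized paragraph.
token_ids = list(range(150))
chunks = chunk_token_ids(token_ids)
print([len(c) for c in chunks])  # [64, 64, 54]

# Each chunk would be embedded by the model; toy vectors are used here.
pooled = mean_vector([[1.0, 2.0], [3.0, 4.0]])
print(pooled)  # [2.0, 3.0]
```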

License

Apache 2.0