---
language:
- tr
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-embeddings
- sentence-similarity
- sentence-transformers
- feature-extraction
- turkish
- contrastive-learning
- mteb
pipeline_tag: sentence-similarity
model-index:
- name: turkish-sentence-encoder
  results:
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveIntentClassification (tr)
      type: mteb/amazon_massive_intent
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveScenarioClassification (tr)
      type: mteb/amazon_massive_scenario
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: STS
    dataset:
      name: MTEB STS22 (tr)
      type: mteb/sts22-crosslingual-sts
      config: tr
      split: test
    metrics:
    - type: cosine_spearman
      value: 0.0
---
# Turkish Sentence Encoder
A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs.
## Model Description
This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:
- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection
## Usage
### Using with custom code
```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required for the custom encoder)
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")
model.eval()

# Encode sentences
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)  # the custom forward returns pooled sentence embeddings

# Compute cosine similarity between the two sentence vectors
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
```
### Using with Sentence-Transformers (after installing custom wrapper)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
```
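Once sentences are embedded, semantic search reduces to ranking corpus vectors by cosine similarity to a query vector. A minimal sketch with NumPy, using toy low-dimensional vectors in place of real 512-dimensional model outputs (the helper name and data are illustrative, not part of the model's API):

```python
import numpy as np

def cosine_top_k(query_vec, corpus_vecs, k=2):
    """Rank corpus vectors by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy 4-d embeddings standing in for real model outputs
corpus = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
idx, scores = cosine_top_k(query, corpus, k=2)
print(idx, scores)
```

In practice the corpus embeddings would come from `model.encode(...)` and could be precomputed once and cached.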
## Evaluation Results
| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |
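The Spearman score above measures how well the rank order of predicted cosine similarities matches the rank order of gold similarity labels. A minimal rank-based implementation (the toy scores below are illustrative and unrelated to the reported numbers; this simple ranking does not handle ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no tie handling)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a), dtype=float)
        r[order] = np.arange(len(a))
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Toy example: model similarities vs. gold STS labels
pred = [0.91, 0.15, 0.55, 0.78]
gold = [5.0, 0.5, 2.5, 4.0]
print(spearman(pred, gold))  # ranks agree perfectly -> 1.0
```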
## Training Details
- **Training Data**: Turkish paraphrase pairs (200K pairs)
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.05
- **Batch Size**: 32
- **Base Model**: Custom Transformer encoder pretrained with MLM on Turkish text
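The InfoNCE objective treats each paraphrase pair as a positive and every other sentence in the batch as an in-batch negative. A minimal PyTorch sketch of this loss under the stated temperature; the standard in-batch-negatives formulation is assumed here, and the toy embeddings are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `anchors` pairs with row i of `positives`;
    every other row in the batch serves as a negative."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch of 4 embedding pairs (positives are noisy copies of the anchors)
torch.manual_seed(0)
anchors = torch.randn(4, 512)
loss = info_nce(anchors, anchors + 0.01 * torch.randn(4, 512))
print(loss.item())
```

With a low temperature such as 0.05, the softmax sharpens the contrast between the positive pair and the in-batch negatives.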
## Architecture
- **Hidden Size**: 512
- **Layers**: 12
- **Attention Heads**: 8
- **Max Sequence Length**: 64
- **Vocab Size**: 32,000 (Unigram tokenizer)
## Limitations
- Optimized for Turkish language only
- Max sequence length is 64 tokens
- Best suited for sentence-level (not document-level) embeddings
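Texts longer than the 64-token limit must be segmented before encoding. One common workaround (not part of the model itself) is to split the document into chunks, encode each chunk, and mean-pool the chunk vectors. A sketch using a hypothetical `encode_fn` callable and crude character-based splitting as a stand-in for proper sentence splitting:

```python
import numpy as np

def embed_long_text(text, encode_fn, max_chars=200):
    """Hypothetical helper: chunk a long text, encode each chunk with
    `encode_fn`, and average the chunk vectors into one embedding."""
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    vecs = np.stack([encode_fn(c) for c in chunks])
    return vecs.mean(axis=0)

# Demo with a stand-in encoder that buckets characters into an 8-d vector
def fake_encode(s):
    v = np.zeros(8)
    for ch in s:
        v[ord(ch) % 8] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

vec = embed_long_text("a" * 500, fake_encode)
print(vec.shape)  # (8,)
```

Mean-pooling chunk embeddings loses word order across chunks, so sentence-level inputs remain the intended use.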
## License
Apache 2.0