---
language:
- tr
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-embeddings
- sentence-similarity
- sentence-transformers
- feature-extraction
- turkish
- contrastive-learning
- mteb
pipeline_tag: sentence-similarity
model-index:
- name: turkish-sentence-encoder
  results:
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveIntentClassification (tr)
      type: mteb/amazon_massive_intent
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveScenarioClassification (tr)
      type: mteb/amazon_massive_scenario
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: STS
    dataset:
      name: MTEB STS22 (tr)
      type: mteb/sts22-crosslingual-sts
      config: tr
      split: test
    metrics:
    - type: cosine_spearman
      value: 0.0
---

# Turkish Sentence Encoder

A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs.

## Model Description

This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:
- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection
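As a sketch of the retrieval use case, the hypothetical `semantic_search` helper below ranks a corpus by cosine similarity against a query embedding. The random 512-dimensional tensors stand in for real model outputs; with the actual model, the embeddings would come from encoding sentences as shown in the Usage section.

```python
import torch
import torch.nn.functional as F

def semantic_search(query_emb: torch.Tensor, corpus_embs: torch.Tensor, top_k: int = 3):
    """Rank corpus embeddings by cosine similarity to a query embedding."""
    # Normalize so the dot product equals cosine similarity
    query = F.normalize(query_emb, dim=-1)
    corpus = F.normalize(corpus_embs, dim=-1)
    scores = corpus @ query                                    # (num_corpus,)
    values, indices = torch.topk(scores, k=min(top_k, corpus.size(0)))
    return indices.tolist(), values.tolist()

# Placeholder 512-dim embeddings standing in for real model outputs
torch.manual_seed(0)
corpus_embs = torch.randn(10, 512)
query_emb = corpus_embs[4] + 0.01 * torch.randn(512)  # near-duplicate of item 4

indices, scores = semantic_search(query_emb, corpus_embs)
print(indices[0])  # item 4 should rank first
```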

## Usage

### Using with custom code

```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")

# Encode sentences ("The weather is very nice today." / "The weather is quite pleasant today.")
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]

inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)

# Compute cosine similarity between the two sentence embeddings
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
```
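The snippet above assumes the remote code returns pooled sentence vectors directly. If the model instead returns per-token hidden states (this depends on the custom model code, which is not shown here), masked mean pooling is the usual sentence-transformers convention; a generic sketch:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts

# Toy check with random tensors in place of real model outputs
hidden = torch.randn(2, 5, 512)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 512])
```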

### Using with Sentence-Transformers (after installing custom wrapper)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
```

## Evaluation Results

| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |
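For reference, the retrieval metrics in the table (MRR, Recall@K) can be computed from the 1-based rank of each query's gold paraphrase; the ranks below are toy values, not the model's actual outputs:

```python
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based positions of the correct item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy 1-based ranks of the gold paraphrase for five queries
ranks = [1, 1, 2, 1, 5]
print(f"MRR: {mrr(ranks):.4f}")          # (1 + 1 + 0.5 + 1 + 0.2) / 5 = 0.7400
print(f"Recall@1: {recall_at_k(ranks, 1):.2f}")  # 3/5 = 0.60
print(f"Recall@5: {recall_at_k(ranks, 5):.2f}")  # 5/5 = 1.00
```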

## Training Details

- **Training Data**: ~200K Turkish paraphrase pairs
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.05
- **Batch Size**: 32
- **Base Model**: Custom Transformer encoder pretrained with MLM on Turkish text
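The InfoNCE objective with in-batch negatives and temperature 0.05 can be sketched as below. This is a generic implementation of the technique, not the model's exact training code: each anchor's paraphrase is its positive, and the other positives in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch act as in-batch negatives."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(anchors.size(0))        # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy batch of 32 paraphrase-pair embeddings (random stand-ins for model outputs)
torch.manual_seed(0)
a = torch.randn(32, 512)
p = a + 0.1 * torch.randn(32, 512)  # positives lie close to their anchors
loss = info_nce_loss(a, p)
print(f"{loss.item():.4f}")
```

With near-duplicate positives the diagonal logits dominate, so the loss is close to zero; random unrelated pairs would drive it toward `log(batch_size)`.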

## Architecture

- **Hidden Size**: 512
- **Layers**: 12
- **Attention Heads**: 8
- **Max Sequence Length**: 64
- **Vocab Size**: 32,000 (Unigram tokenizer)

## Limitations

- Trained for Turkish only; quality on other languages is untested
- Inputs longer than 64 tokens are truncated
- Best suited for sentence-level (not document-level) embeddings

## License

Apache 2.0