---
language:
- tr
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-embeddings
- sentence-similarity
- sentence-transformers
- feature-extraction
- turkish
- contrastive-learning
- mteb
pipeline_tag: sentence-similarity
model-index:
- name: turkish-sentence-encoder
  results:
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveIntentClassification (tr)
      type: mteb/amazon_massive_intent
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveScenarioClassification (tr)
      type: mteb/amazon_massive_scenario
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: STS
    dataset:
      name: MTEB STS22 (tr)
      type: mteb/sts22-crosslingual-sts
      config: tr
      split: test
    metrics:
    - type: cosine_spearman
      value: 0.0
---
# Turkish Sentence Encoder
A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs.
## Model Description
This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:
- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection
## Usage
### Using with custom code
```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required for the custom encoder)
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")
model.eval()

# Encode sentences
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)  # the custom forward returns pooled sentence embeddings

# Compute cosine similarity between the two sentence vectors
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
```
### Using with Sentence-Transformers (after installing custom wrapper)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
```
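Once sentences are embedded, semantic search reduces to ranking corpus vectors by cosine similarity to a query vector. A minimal sketch with NumPy, using toy low-dimensional vectors in place of real 512-dimensional model outputs (the helper name and data are illustrative, not part of the model's API):

```python
import numpy as np

def cosine_top_k(query_vec, corpus_vecs, k=2):
    """Rank corpus vectors by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy 4-d embeddings standing in for real model outputs
corpus = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
idx, scores = cosine_top_k(query, corpus, k=2)
print(idx, scores)
```

In practice the corpus embeddings would come from `model.encode(...)` and could be precomputed once and cached.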
## Evaluation Results
| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |
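The Spearman score above measures how well the rank order of predicted cosine similarities matches the rank order of gold similarity labels. A minimal rank-based implementation (the toy scores below are illustrative and unrelated to the reported numbers; this simple ranking does not handle ties):

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no tie handling)."""
    def ranks(a):
        order = np.argsort(a)
        r = np.empty(len(a), dtype=float)
        r[order] = np.arange(len(a))
        return r
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Toy example: model similarities vs. gold STS labels
pred = [0.91, 0.15, 0.55, 0.78]
gold = [5.0, 0.5, 2.5, 4.0]
print(spearman(pred, gold))  # ranks agree perfectly -> 1.0
```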
## Training Details
- **Training Data**: Turkish paraphrase pairs (200K pairs)
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.05
- **Batch Size**: 32
- **Base Model**: Custom Transformer encoder pretrained with MLM on Turkish text
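The InfoNCE objective treats each paraphrase pair as a positive and every other sentence in the batch as an in-batch negative. A minimal PyTorch sketch of this loss under the stated temperature; the standard in-batch-negatives formulation is assumed here, and the toy embeddings are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `anchors` pairs with row i of `positives`;
    every other row in the batch serves as a negative."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy batch of 4 embedding pairs (positives are noisy copies of the anchors)
torch.manual_seed(0)
anchors = torch.randn(4, 512)
loss = info_nce(anchors, anchors + 0.01 * torch.randn(4, 512))
print(loss.item())
```

With a low temperature such as 0.05, the softmax sharpens the contrast between the positive pair and the in-batch negatives.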
## Architecture
- **Hidden Size**: 512
- **Layers**: 12
- **Attention Heads**: 8
- **Max Sequence Length**: 64
- **Vocab Size**: 32,000 (Unigram tokenizer)
## Limitations
- Optimized for Turkish language only
- Max sequence length is 64 tokens
- Best suited for sentence-level (not document-level) embeddings
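Texts longer than the 64-token limit must be segmented before encoding. One common workaround (not part of the model itself) is to split the document into chunks, encode each chunk, and mean-pool the chunk vectors. A sketch using a hypothetical `encode_fn` callable and crude character-based splitting as a stand-in for proper sentence splitting:

```python
import numpy as np

def embed_long_text(text, encode_fn, max_chars=200):
    """Hypothetical helper: chunk a long text, encode each chunk with
    `encode_fn`, and average the chunk vectors into one embedding."""
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    vecs = np.stack([encode_fn(c) for c in chunks])
    return vecs.mean(axis=0)

# Demo with a stand-in encoder that buckets characters into an 8-d vector
def fake_encode(s):
    v = np.zeros(8)
    for ch in s:
        v[ord(ch) % 8] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

vec = embed_long_text("a" * 500, fake_encode)
print(vec.shape)  # (8,)
```

Mean-pooling chunk embeddings loses word order across chunks, so sentence-level inputs remain the intended use.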
## License
Apache 2.0