---
language:
- tr
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-embeddings
- sentence-similarity
- sentence-transformers
- feature-extraction
- turkish
- contrastive-learning
- mteb
pipeline_tag: sentence-similarity
model-index:
- name: turkish-sentence-encoder
  results:
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveIntentClassification (tr)
      type: mteb/amazon_massive_intent
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveScenarioClassification (tr)
      type: mteb/amazon_massive_scenario
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: STS
    dataset:
      name: MTEB STS22 (tr)
      type: mteb/sts22-crosslingual-sts
      config: tr
      split: test
    metrics:
    - type: cosine_spearman
      value: 0.0
---
|
|
|
|
|
# Turkish Sentence Encoder |
|
|
|
|
|
A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:

- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection
|
|
|
|
|
## Usage |
|
|
|
|
|
### Using with custom code |
|
|
|
|
|
```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")

# Encode sentences
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)

# Compute cosine similarity between the two sentence embeddings
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
```
|
|
|
|
|
### Using with Sentence-Transformers (after installing custom wrapper) |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
```
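Beyond pairwise similarity, the embeddings can drive semantic search over a small corpus. The sketch below is illustrative rather than code from this repository: random tensors stand in for the output of `model.encode(...)`, and retrieval is plain cosine similarity followed by top-k selection.

```python
import torch
from torch.nn.functional import normalize

# Placeholder embeddings standing in for model.encode() output;
# the real model produces one 512-dimensional vector per sentence.
torch.manual_seed(0)
corpus_embeddings = torch.randn(5, 512)  # 5 corpus sentences
query_embedding = torch.randn(1, 512)    # 1 query sentence

# Cosine similarity = dot product of L2-normalized vectors
corpus_norm = normalize(corpus_embeddings, dim=1)
query_norm = normalize(query_embedding, dim=1)
scores = query_norm @ corpus_norm.T      # shape (1, 5)

# Rank corpus sentences by similarity to the query
top_k = torch.topk(scores, k=3, dim=1)
print(top_k.indices.tolist())            # indices of the 3 best matches
```

With real embeddings, the returned indices point at the corpus sentences most semantically similar to the query.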
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |
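For reference, Recall@k and MRR are typically computed from a query-by-candidate similarity matrix in which the correct candidate for query *i* sits in column *i*. The helper below is an illustrative sketch of that computation, not code from this repository:

```python
import torch

def retrieval_metrics(scores: torch.Tensor) -> dict:
    """Given a (queries x candidates) similarity matrix where the correct
    candidate for query i is column i, compute Recall@1, Recall@5, and MRR."""
    order = scores.argsort(dim=1, descending=True)                 # candidate indices, best first
    correct = torch.arange(scores.size(0)).unsqueeze(1)            # gold column per query
    ranks = (order == correct).float().argmax(dim=1) + 1           # 1-based rank of the gold candidate
    return {
        "recall@1": (ranks <= 1).float().mean().item(),
        "recall@5": (ranks <= 5).float().mean().item(),
        "mrr": (1.0 / ranks).mean().item(),
    }

# Toy check: an identity matrix ranks every correct candidate first
metrics = retrieval_metrics(torch.eye(10))
print(metrics)  # all three metrics equal 1.0 for a perfect ranking
```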
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Training Data**: ~200K Turkish paraphrase pairs
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.05
- **Batch Size**: 32
- **Base Model**: custom Transformer encoder pretrained with MLM on Turkish text
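With in-batch InfoNCE, each paraphrase pair supplies the positive and every other sentence in the batch serves as a negative. A minimal PyTorch sketch of this objective, using the temperature of 0.05 listed above (random tensors stand in for encoder outputs; this is not the actual training code):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: row i of `anchors` is paired with row i of
    `positives`; all other rows in the batch act as negatives."""
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    logits = anchors @ positives.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(anchors.size(0))         # matching pair sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch of paraphrase-pair embeddings
torch.manual_seed(0)
a = torch.randn(32, 512)
b = torch.randn(32, 512)
loss = info_nce_loss(a, b)
print(loss.item())
```

Lower temperatures sharpen the softmax over in-batch candidates, penalizing hard negatives more strongly.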
|
|
|
|
|
## Architecture |
|
|
|
|
|
- **Hidden Size**: 512
- **Layers**: 12
- **Attention Heads**: 8
- **Max Sequence Length**: 64
- **Vocab Size**: 32,000 (Unigram tokenizer)
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Optimized for Turkish only
- Inputs longer than 64 tokens are truncated
- Best suited for sentence-level (not document-level) embeddings
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|