---
language:
- tr
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-embeddings
- sentence-similarity
- sentence-transformers
- feature-extraction
- turkish
- contrastive-learning
- mteb
pipeline_tag: sentence-similarity
model-index:
- name: turkish-sentence-encoder
  results:
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveIntentClassification (tr)
      type: mteb/amazon_massive_intent
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveScenarioClassification (tr)
      type: mteb/amazon_massive_scenario
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: STS
    dataset:
      name: MTEB STS22 (tr)
      type: mteb/sts22-crosslingual-sts
      config: tr
      split: test
    metrics:
    - type: cosine_spearman
      value: 0.0
---

# Turkish Sentence Encoder

A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs.

## Model Description

This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:

- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection

## Usage

### Using with custom code

```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")

# Encode sentences ("The weather is very nice today." / "The weather is quite pleasant today.")
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)

# Compute cosine similarity between the two sentence embeddings
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
```

### Using with Sentence-Transformers (after installing the custom wrapper)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])  # "Hello world!" / "Hi world!"
```

## Evaluation Results

| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |

## Training Details

- **Training Data**: Turkish paraphrase pairs (200K pairs)
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.05
- **Batch Size**: 32
- **Base Model**: Custom Transformer encoder pretrained with MLM on Turkish text

## Architecture

- **Hidden Size**: 512
- **Layers**: 12
- **Attention Heads**: 8
- **Max Sequence Length**: 64
- **Vocab Size**: 32,000 (Unigram tokenizer)

## Limitations

- Optimized for Turkish only
- Maximum sequence length is 64 tokens
- Best suited for sentence-level (not document-level) embeddings

## License

Apache 2.0
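## Appendix: InfoNCE Objective (Illustrative)

The in-batch InfoNCE loss named under Training Details can be sketched as below. This is an illustrative re-implementation under common assumptions (in-batch negatives, cosine similarity, temperature 0.05), not the actual training code; the function name `info_nce_loss` and the toy batch are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.05):
    """In-batch InfoNCE: row i of `positives` is the paraphrase of row i of
    `anchors`; every other row in the batch serves as a negative."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    # (B, B) matrix of cosine similarities, scaled by temperature
    logits = anchors @ positives.T / temperature
    # The matching pair sits on the diagonal, so target class i is index i
    targets = torch.arange(anchors.size(0))
    return F.cross_entropy(logits, targets)

# Toy batch: 4 pairs of 512-dim embeddings, positives are near-duplicates
a = torch.randn(4, 512)
p = a + 0.01 * torch.randn(4, 512)
loss = info_nce_loss(a, p)
```

A low temperature (0.05 here) sharpens the softmax over similarities, so the model is penalized heavily unless the true paraphrase clearly outscores the in-batch negatives.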
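## Appendix: Retrieval Example (Illustrative)

For the semantic search / retrieval use case (the setting behind the MRR and Recall@k figures), a minimal cosine-similarity top-k search over precomputed embeddings can be sketched as below. The `retrieve` helper and the random stand-in vectors are hypothetical; in practice the vectors would come from the model's encode step shown in Usage.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, corpus_embs, k=5):
    """Return indices of the k corpus vectors most similar to the query,
    ranked by cosine similarity."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(corpus_embs, dim=-1)
    scores = c @ q                      # (N,) cosine similarities
    k = min(k, corpus_embs.size(0))
    return torch.topk(scores, k).indices.tolist()

# Stand-in corpus of 10 sentence embeddings (512-dim, matching Hidden Size)
corpus = torch.randn(10, 512)
# Query that is a near-duplicate of document 3, so it should rank first
query = corpus[3] + 0.01 * torch.randn(512)
top = retrieve(query, corpus, k=3)
```

For larger corpora the same ranking is usually delegated to a vector index (e.g. FAISS) rather than a dense matrix product.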