---
language:
- tr
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-embeddings
- sentence-similarity
- sentence-transformers
- feature-extraction
- turkish
- contrastive-learning
- mteb
pipeline_tag: sentence-similarity
model-index:
- name: turkish-sentence-encoder
  results:
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveIntentClassification (tr)
      type: mteb/amazon_massive_intent
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: Classification
    dataset:
      name: MTEB MassiveScenarioClassification (tr)
      type: mteb/amazon_massive_scenario
      config: tr
      split: test
    metrics:
    - type: accuracy
      value: 0.0
  - task:
      type: STS
    dataset:
      name: MTEB STS22 (tr)
      type: mteb/sts22-crosslingual-sts
      config: tr
      split: test
    metrics:
    - type: cosine_spearman
      value: 0.0
---
|
|
|
|
|
# Turkish Sentence Encoder |
|
|
|
|
|
A Turkish sentence embedding model trained with contrastive learning (InfoNCE loss) on Turkish paraphrase pairs. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model encodes Turkish sentences into 512-dimensional dense vectors that can be used for:

- Semantic similarity
- Semantic search / retrieval
- Clustering
- Paraphrase detection
|
|
|
|
|
## Usage |
|
|
|
|
|
### Using with custom code |
|
|
|
|
|
```python
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer
model = AutoModel.from_pretrained("Basar2004/turkish-sentence-encoder", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Basar2004/turkish-sentence-encoder")

# Encode sentences
sentences = ["Bugün hava çok güzel.", "Hava bugün oldukça hoş."]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs)

# Compute cosine similarity between the two sentence embeddings
similarity = cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1].unsqueeze(0))
print(f"Similarity: {similarity.item():.4f}")
```
|
|
|
|
|
### Using with Sentence-Transformers (after installing custom wrapper) |
|
|
|
|
|
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Basar2004/turkish-sentence-encoder")
embeddings = model.encode(["Merhaba dünya!", "Selam dünya!"])
```
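Beyond pairwise similarity, the embeddings can drive semantic search over a small corpus. The sketch below is illustrative rather than code from this repository: random tensors stand in for the output of `model.encode(...)`, and retrieval is plain cosine similarity followed by top-k selection.

```python
import torch
from torch.nn.functional import normalize

# Placeholder embeddings standing in for model.encode() output;
# the real model produces one 512-dimensional vector per sentence.
torch.manual_seed(0)
corpus_embeddings = torch.randn(5, 512)  # 5 corpus sentences
query_embedding = torch.randn(1, 512)    # 1 query sentence

# Cosine similarity = dot product of L2-normalized vectors
corpus_norm = normalize(corpus_embeddings, dim=1)
query_norm = normalize(query_embedding, dim=1)
scores = query_norm @ corpus_norm.T      # shape (1, 5)

# Rank corpus sentences by similarity to the query
top_k = torch.topk(scores, k=3, dim=1)
print(top_k.indices.tolist())            # indices of the 3 best matches
```

With real embeddings, the returned indices point at the corpus sentences most semantically similar to the query.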
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
| Metric | Score |
|--------|-------|
| Spearman Correlation | 0.7315 |
| Pearson Correlation | 0.8593 |
| Paraphrase Accuracy | 0.9695 |
| MRR | 0.9172 |
| Recall@1 | 0.87 |
| Recall@5 | 0.97 |
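For reference, Recall@k and MRR are typically computed from a query-by-candidate similarity matrix in which the correct candidate for query *i* sits in column *i*. The helper below is an illustrative sketch of that computation, not code from this repository:

```python
import torch

def retrieval_metrics(scores: torch.Tensor) -> dict:
    """Given a (queries x candidates) similarity matrix where the correct
    candidate for query i is column i, compute Recall@1, Recall@5, and MRR."""
    order = scores.argsort(dim=1, descending=True)                 # candidate indices, best first
    correct = torch.arange(scores.size(0)).unsqueeze(1)            # gold column per query
    ranks = (order == correct).float().argmax(dim=1) + 1           # 1-based rank of the gold candidate
    return {
        "recall@1": (ranks <= 1).float().mean().item(),
        "recall@5": (ranks <= 5).float().mean().item(),
        "mrr": (1.0 / ranks).mean().item(),
    }

# Toy check: an identity matrix ranks every correct candidate first
metrics = retrieval_metrics(torch.eye(10))
print(metrics)  # all three metrics equal 1.0 for a perfect ranking
```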
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Training Data**: ~200K Turkish paraphrase pairs
- **Loss Function**: InfoNCE (contrastive loss)
- **Temperature**: 0.05
- **Batch Size**: 32
- **Base Model**: custom Transformer encoder pretrained with MLM on Turkish text
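With in-batch InfoNCE, each paraphrase pair supplies the positive and every other sentence in the batch serves as a negative. A minimal PyTorch sketch of this objective, using the temperature of 0.05 listed above (random tensors stand in for encoder outputs; this is not the actual training code):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: row i of `anchors` is paired with row i of
    `positives`; all other rows in the batch act as negatives."""
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    logits = anchors @ positives.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(anchors.size(0))         # matching pair sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch of paraphrase-pair embeddings
torch.manual_seed(0)
a = torch.randn(32, 512)
b = torch.randn(32, 512)
loss = info_nce_loss(a, b)
print(loss.item())
```

Lower temperatures sharpen the softmax over in-batch candidates, penalizing hard negatives more strongly.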
|
|
|
|
|
## Architecture |
|
|
|
|
|
- **Hidden Size**: 512
- **Layers**: 12
- **Attention Heads**: 8
- **Max Sequence Length**: 64
- **Vocab Size**: 32,000 (Unigram tokenizer)
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Optimized for Turkish only
- Inputs longer than 64 tokens are truncated
- Best suited for sentence-level (not document-level) embeddings
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 |
|
|
|