Tawkeed-Embedding
Tawkeed's 568M-parameter multilingual embedding model, optimized for Arabic semantic search and retrieval. It produces dense vector representations of Arabic text for use in semantic search, RAG pipelines, document clustering, and similarity matching.
Developed by: Tawkeed — Saudi-native enterprise Edge AI platform.
Use Cases
- Arabic semantic search — find relevant Arabic documents by meaning, not just keywords
- RAG pipelines — power Retrieval-Augmented Generation systems with Arabic document retrieval
- Document clustering — automatically group Arabic documents by topic and similarity
- Similarity matching — compare Arabic texts to find duplicates, paraphrases, or related content
- Cross-lingual retrieval — search across Arabic and English documents simultaneously
- FAQ matching — match user questions to the most relevant answers in Arabic knowledge bases
Usage
Sentence Transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("tawkeed-sa/tawkeed-embedding")
# Encode Arabic texts
sentences = [
"الذكاء الاصطناعي يغير مستقبل التعليم",
"تطبيقات التعلم الآلي في الطب",
"رؤية 2030 والتحول الرقمي في السعودية",
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}") # (3, embedding_dim)
Semantic Search
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("tawkeed-sa/tawkeed-embedding")
# Index documents
documents = [
"يعد الذكاء الاصطناعي من أهم التقنيات الحديثة",
"تقع المملكة العربية السعودية في شبه الجزيرة العربية",
"يستخدم التعلم العميق في تحليل الصور والنصوص",
"الرياض هي عاصمة المملكة العربية السعودية",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)
# Search
query = "ما هي عاصمة السعودية؟"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_idx = scores.argmax().item()
print(f"Query: {query}")
print(f"Best match: {documents[top_idx]} (score: {scores[top_idx]:.3f})")
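The search above returns only the single best match. Retrieval and RAG setups usually need a ranked top-k list, so here is a minimal sketch of that step. It is an illustration, not part of the Tawkeed tooling: the helper name `top_k_cosine` is hypothetical, and the toy 3-dimensional vectors are made-up stand-ins for real embeddings.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    q = np.asarray(query_vec, dtype=np.float64)
    d = np.asarray(doc_matrix, dtype=np.float64)
    # L2-normalize so that the dot product equals cosine similarity
    q = q / max(np.linalg.norm(q), 1e-12)
    d = d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), 1e-12)
    scores = d @ q
    k = min(k, len(scores))
    # Sort by descending score and keep the first k indices
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors (assumption: stand-ins for real model embeddings)
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.1, 0.0])
results = top_k_cosine(toy_query, toy_docs, k=2)
for idx, score in results:
    print(f"doc {idx}: {score:.3f}")
```

With the real model, `model.encode(query)` and `model.encode(documents)` return NumPy arrays by default and can be passed in directly.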
Transformers (Direct)
import torch
from transformers import AutoTokenizer, AutoModel
model_id = "tawkeed-sa/tawkeed-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
texts = ["الذكاء الاصطناعي في السعودية", "Saudi AI initiatives"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
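# Mean pooling over real tokens only (the attention mask excludes padding)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)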
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
similarity = (embeddings[0] @ embeddings[1]).item()
print(f"Cross-lingual similarity: {similarity:.3f}")
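For the similarity-matching use case (finding duplicates or paraphrases), one possible approach is to compare every pair of embeddings and keep the pairs whose cosine similarity exceeds a threshold. The sketch below makes several assumptions: the helper name `near_duplicate_pairs` is hypothetical, the 0.9 threshold is an arbitrary starting point that would need tuning on real data, and the toy 2-dimensional vectors stand in for model output.

```python
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.9):
    """Return (i, j, score) for every pair with cosine similarity >= threshold."""
    emb = np.asarray(embeddings, dtype=np.float64)
    # L2-normalize rows so the Gram matrix holds cosine similarities
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    sim = emb @ emb.T
    # Upper triangle only (k=1 skips the diagonal) so each pair appears once
    rows, cols = np.triu_indices(len(emb), k=1)
    return [
        (int(i), int(j), float(sim[i, j]))
        for i, j in zip(rows, cols)
        if sim[i, j] >= threshold
    ]

# Toy vectors (assumption: stand-ins for real model embeddings)
toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
pairs = near_duplicate_pairs(toy, threshold=0.9)
for i, j, score in pairs:
    print(f"texts {i} and {j} look similar (score: {score:.3f})")
```

The full pairwise matrix grows quadratically with the number of texts, so this form suits small collections; larger corpora would typically use an approximate nearest-neighbor index instead.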
Model Details
| Property | Value |
|---|---|
| Parameters | 568M |
| Type | Embedding Model |
| Languages | Arabic, English, Multilingual |
| License | Apache 2.0 |
| Output | Dense vector embeddings |
About Tawkeed
Tawkeed is a Saudi-native enterprise Edge AI platform building Arabic-first AI models and tools. Our models cover 18 Arabic dialects and are optimized for on-premises deployment with complete data privacy.