Tawkeed-Embedding

Tawkeed-Embedding is a 568M-parameter multilingual embedding model optimized for Arabic semantic search and retrieval. It generates dense vector representations of Arabic text for use in semantic search, RAG pipelines, document clustering, and similarity matching.

Developed by: Tawkeed — Saudi-native enterprise Edge AI platform.

Use Cases

  • Arabic semantic search — find relevant Arabic documents by meaning, not just keywords
  • RAG pipelines — power Retrieval-Augmented Generation systems with Arabic document retrieval
  • Document clustering — automatically group Arabic documents by topic and similarity
  • Similarity matching — compare Arabic texts to find duplicates, paraphrases, or related content
  • Cross-lingual retrieval — search across Arabic and English documents simultaneously
  • FAQ matching — match user questions to the most relevant answers in Arabic knowledge bases

Usage

Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("tawkeed-sa/tawkeed-embedding")

# Encode Arabic texts
sentences = [
    "الذكاء الاصطناعي يغير مستقبل التعليم",
    "تطبيقات التعلم الآلي في الطب",
    "رؤية 2030 والتحول الرقمي في السعودية",
]
embeddings = model.encode(sentences)
print(f"Shape: {embeddings.shape}")  # (3, embedding_dim)
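
Once texts are encoded, similarity matching comes down to comparing vectors pairwise. The sketch below is a minimal plain-Python illustration and not part of any Tawkeed API: `cosine` and `find_near_duplicates` are hypothetical helper names, the 0.9 threshold is an arbitrary assumption that would need tuning on real data, and the toy vectors stand in for something like `model.encode(sentences).tolist()`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def find_near_duplicates(vectors, threshold=0.9):
    """Return (i, j, score) for every pair i < j with cosine similarity >= threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Toy 3-dimensional vectors standing in for real embeddings (assumption).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
duplicates = find_near_duplicates(toy_vectors)
print(duplicates)
```

The pairwise loop is quadratic in the number of texts, so it suits small collections; larger corpora would likely need a vector index instead.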

Semantic Search

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("tawkeed-sa/tawkeed-embedding")

# Index documents
documents = [
    "يعد الذكاء الاصطناعي من أهم التقنيات الحديثة",
    "تقع المملكة العربية السعودية في شبه الجزيرة العربية",
    "يستخدم التعلم العميق في تحليل الصور والنصوص",
    "الرياض هي عاصمة المملكة العربية السعودية",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Search
query = "ما هي عاصمة السعودية؟"
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0]
top_idx = scores.argmax().item()
print(f"Query: {query}")
print(f"Best match: {documents[top_idx]} (score: {scores[top_idx]:.3f})")
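
The search above returns only the single best match, while a RAG pipeline usually passes several candidates to the generator. The following is a minimal plain-Python sketch of top-k ranking under stated assumptions: `cosine` and `top_k` are hypothetical helper names, and the 2-dimensional toy vectors stand in for real query and document embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Rank documents by cosine similarity to the query; return the k best as (index, score)."""
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for document and query embeddings (assumption).
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
toy_query = [0.9, 0.1]

results = top_k(toy_query, toy_docs, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

When working with tensors from `model.encode(..., convert_to_tensor=True)`, the `util.semantic_search` helper in Sentence Transformers offers a similar top-k ranking without the manual loop.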

Transformers (Direct)

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "tawkeed-sa/tawkeed-embedding"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["الذكاء الاصطناعي في السعودية", "Saudi AI initiatives"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

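with torch.no_grad():
    outputs = model(**inputs)
    # Mean pooling over real tokens only; padding positions are masked out
    mask = inputs["attention_mask"].unsqueeze(-1).to(outputs.last_hidden_state.dtype)
    embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)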

similarity = (embeddings[0] @ embeddings[1]).item()
print(f"Cross-lingual similarity: {similarity:.3f}")
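
The direct Transformers example pools token states into one vector per text with mean pooling. When a batch is padded to a common length, the padding positions should be left out of the average, otherwise shorter texts are skewed by the pad vectors. The sketch below shows the idea in plain Python on toy values; `masked_mean_pool` is a hypothetical helper name, and the numbers are illustrative assumptions rather than real model outputs.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1; ignore padding (mask 0)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    if count == 0:
        return total  # no real tokens: return the zero vector
    return [value / count for value in total]


# Two real tokens followed by two padding positions (toy values, assumption).
tokens = [[2.0, 4.0], [4.0, 8.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean_pool(tokens, mask)
naive = [sum(col) / len(tokens) for col in zip(*tokens)]

print(pooled)  # [3.0, 6.0]
print(naive)   # [6.0, 7.5]
```

The unmasked average is pulled toward the padding values, which is why the attention mask matters whenever texts of different lengths share a batch.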

Model Details

Property      Value
Parameters    568M
Type          Embedding model
Languages     Arabic, English, multilingual
License       Apache 2.0
Output        Dense vector embeddings

About Tawkeed

Tawkeed is a Saudi-native enterprise Edge AI platform building Arabic-first AI models and tools. Our models cover 18 Arabic dialects and are optimized for on-premises deployment with complete data privacy.

All Tawkeed models  |  Arabic LLM Benchmark
