Embeddings for Hebrew + Aramaic (Otzaria Embeddings)

Summary

This repo contains word vectors (embeddings) trained on a large corpus of Torah texts in Hebrew and Aramaic (Tanakh, Mishnah, Talmud Bavli/Yerushalmi, and more), after cleaning and export from the database.

Type: Word Embeddings (Skip-gram + Negative Sampling)
Languages: Hebrew (he), Jewish Aramaic (arc)
Recommended uses: semantic similarity, nearest neighbors, semantic search, clustering, features for retrieval models

Files in the repo

vocab.json: token→id mapping + frequencies
embeddings_last.npy: embedding matrix [vocab_size, dim]
ckpt_last.pt: latest training checkpoint (PyTorch state_dict)
ckpt_mid.pt: older backup checkpoint (if present)

Usage example (NumPy)

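```python
import json

import numpy as np

with open("vocab.json", "r", encoding="utf-8") as f:
    meta = json.load(f)
vocab = meta["vocab"]  # token -> id

emb = np.load("embeddings_last.npy")  # [V, D]
# L2-normalize rows so that a dot product equals cosine similarity
embn = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)

id2word = [None] * len(vocab)
for w, i in vocab.items():
    id2word[i] = w

def nearest(word, k=10):
    """Return up to k (token, cosine) pairs, or None if word is out of vocabulary."""
    i = vocab.get(word)
    if i is None:
        return None
    sims = embn @ embn[i]
    k = min(k, len(sims) - 1)  # guard against k >= vocabulary size
    # Take k+1 candidates, because the query word itself is normally among them
    top = np.argpartition(-sims, k)[:k + 1]
    top = top[np.argsort(-sims[top])]
    return [(id2word[j], float(sims[j])) for j in top if j != i][:k]

print(nearest("שבת", 10))
```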

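Known limitations

The model learns contextual similarity (words that appear in similar contexts), not necessarily dictionary synonymy.
Because Hebrew and Aramaic are mixed and words occur in inflected forms, the nearest neighbors are sometimes morphological variants of the query (for example: שבת/בשבת/השבת).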
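
One possible way to reduce the share of purely morphological neighbors is to drop results that differ from the query only by attached prefix letters. The sketch below is a rough heuristic, not part of the repository: the helper names `strip_prefixes` and `drop_variants` are hypothetical, the prefix list covers only the common one-letter proclitics, and it will sometimes strip a letter that actually belongs to the root.

```python
# Hypothetical post-filter for nearest-neighbor results (not part of the repo).
# One-letter Hebrew proclitics that attach to the following word.
PREFIX_LETTERS = set("והבכלמש")

def strip_prefixes(token, max_strip=2, min_stem=3):
    """Return the token plus the candidate stems left after peeling prefix letters."""
    stems = {token}
    current = token
    for _ in range(max_strip):
        # Stop if the remaining stem would be too short or no prefix letter leads.
        if len(current) - 1 < min_stem or current[0] not in PREFIX_LETTERS:
            break
        current = current[1:]
        stems.add(current)
    return stems

def drop_variants(word, neighbors):
    """Keep only (token, score) pairs that share no candidate stem with word."""
    query_stems = strip_prefixes(word)
    return [(w, s) for w, s in neighbors if not (strip_prefixes(w) & query_stems)]
```

It would be applied to the output of a neighbor search, for example `drop_variants("שבת", nearest("שבת", 30))[:10]`, asking for more neighbors than needed because some are discarded.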

Otzaria Embeddings (Hebrew + Aramaic)

Model summary

This repository contains word embeddings trained on a large Hebrew/Aramaic Jewish-text corpus (including Tanakh, Mishnah, Bavli/Yerushalmi, and related sources), exported from an SQLite database.

Type: Word embeddings (Skip-gram + Negative Sampling)
Languages: Hebrew (he), Jewish Aramaic (arc)
Intended use: semantic similarity, nearest-neighbors, lexical exploration, downstream features for retrieval / clustering
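
For readers unfamiliar with the objective named in the Type line, here is a minimal NumPy sketch of the standard skip-gram negative-sampling loss for a single (center, context) pair. It illustrates the textbook formulation only; the repository's actual training code is not shown here and may differ in details. The function name `sgns_loss` is an assumption made for this example.

```python
import numpy as np

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair.

    center:    [D] input vector of the center word
    context:   [D] output vector of the observed context word
    negatives: [K, D] output vectors of K sampled noise words
    """
    def log_sigmoid(x):
        # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
        return -np.logaddexp(0.0, -x)

    # Pull the observed pair together ...
    pos = log_sigmoid(context @ center)
    # ... and push the sampled noise words away from the center word.
    neg = log_sigmoid(-(negatives @ center)).sum()
    return float(-(pos + neg))
```

The loss shrinks when the center vector aligns with the true context vector and points away from the noise vectors, which is why words that share contexts end up close together.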

Files

vocab.json — token → id mapping + token frequencies
embeddings_last.npy (or embeddings.npy) — embedding matrix [vocab_size, dim]
ckpt_last.pt — latest training checkpoint (PyTorch state_dict)
ckpt_mid.pt — backup checkpoint (older)
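
Before using the files together, it can help to confirm that the vocabulary and the matrix line up. The sketch below assumes that `meta["vocab"]` maps each token to an integer row index and that the ids are exactly 0..V-1, which is what the loading code in this README implies; `check_consistency` is a hypothetical helper, not part of the repository.

```python
import numpy as np

def check_consistency(vocab, emb):
    """Raise ValueError if vocab ids do not match the rows of emb; return emb.shape otherwise."""
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got shape {emb.shape}")
    # Every row must be claimed by exactly one token.
    if sorted(vocab.values()) != list(range(emb.shape[0])):
        raise ValueError("vocab ids are not exactly 0..V-1 for the V rows of emb")
    if not np.isfinite(emb).all():
        raise ValueError("embedding matrix contains NaN or inf")
    return emb.shape
```

Called as `check_consistency(vocab, emb)` right after loading, it returns the `(V, D)` shape on success.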

How to use

Load embeddings (NumPy)

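```python
import json

import numpy as np

with open("vocab.json", "r", encoding="utf-8") as f:
    meta = json.load(f)
vocab = meta["vocab"]  # token -> id

emb = np.load("embeddings_last.npy")  # shape: [V, D]
# L2-normalize rows so that a dot product equals cosine similarity
embn = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)

id2word = [None] * len(vocab)
for w, i in vocab.items():
    id2word[i] = w

def nearest(word, k=10):
    """Return up to k (token, cosine) pairs, or None if word is out of vocabulary."""
    i = vocab.get(word)
    if i is None:
        return None
    sims = embn @ embn[i]
    k = min(k, len(sims) - 1)  # guard against k >= vocabulary size
    # Take k+1 candidates, because the query word itself is normally among them
    top = np.argpartition(-sims, k)[:k + 1]
    top = top[np.argsort(-sims[top])]
    return [(id2word[j], float(sims[j])) for j in top if j != i][:k]

print(nearest("שבת", 10))
```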
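
The intended uses include features for retrieval. A common baseline for that is to average word vectors into a phrase vector and compare phrases by cosine similarity. The sketch below assumes `vocab` (token to row index) and `embn` (row-normalized matrix) as built in the loading code; `phrase_vector` and `phrase_similarity` are hypothetical helper names, not part of the repository.

```python
import numpy as np

def phrase_vector(tokens, vocab, embn):
    """Unit-length mean of the vectors of in-vocabulary tokens; None if no token is known."""
    ids = [vocab[t] for t in tokens if t in vocab]
    if not ids:
        return None
    v = embn[ids].mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-9)

def phrase_similarity(a, b, vocab, embn):
    """Cosine similarity between two token lists, or None if either has no known token."""
    va = phrase_vector(a, vocab, embn)
    vb = phrase_vector(b, vocab, embn)
    if va is None or vb is None:
        return None
    return float(va @ vb)
```

Tokens should be cleaned the same way as the training text (no niqqud, no punctuation), since anything outside the vocabulary is silently skipped.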

Load checkpoint (PyTorch)

```python
import torch

ckpt = torch.load("ckpt_last.pt", map_location="cpu")
# ckpt contains: in/out embedding matrices + step
```
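
The exact key names inside the checkpoint are not documented here, so a first step is usually to list them. The helper below is a hypothetical sketch that works on any dict-like checkpoint and reports the shape of every value that has one; it does not import PyTorch itself.

```python
def _describe(value):
    """Shape tuple for array-like values, type name for everything else."""
    shape = getattr(value, "shape", None)
    return tuple(shape) if shape is not None else type(value).__name__

def summarize_checkpoint(ckpt):
    """Return {key: shape or type name}, flattening one level of nested dicts."""
    summary = {}
    for key, value in ckpt.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                summary[f"{key}.{sub_key}"] = _describe(sub_value)
        else:
            summary[key] = _describe(value)
    return summary
```

For example, `summarize_checkpoint(ckpt)` on the loaded checkpoint should reveal which entries hold the input and output embedding matrices and which holds the step counter.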

Training details

Corpus source: Otzaria SQLite database export (table line.content)
Preprocessing: removed niqqud/cantillation marks and HTML-like tags; normalized punctuation and whitespace
Algorithm: Skip-gram with negative sampling
Dim: 100
Window: 4
Neg samples: 5
Min token length: 3
Min frequency: 10
Subsampling: enabled (to reduce dominance of high-frequency function words)
Hardware: Google Colab (GPU)
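
Queries work best when they are cleaned like the training text. The sketch below approximates the preprocessing listed above under stated assumptions: it is not the repository's actual pipeline, the name `preprocess` is hypothetical, "normalized punctuation" is interpreted here as replacing every non-Hebrew-letter character with a space, and the default minimum length follows the listed value of 3.

```python
import re

# Cantillation marks and niqqud in the Unicode Hebrew block. The punctuation
# code points U+05BE, U+05C0, U+05C3 and U+05C6 are deliberately left out so
# that they become spaces in the next step instead of being deleted.
MARKS = re.compile("[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]")
TAGS = re.compile(r"<[^>]+>")                   # HTML-like tags
NON_LETTERS = re.compile("[^\u05D0-\u05EA]+")   # anything but Hebrew letters

def preprocess(line, min_len=3):
    """Return cleaned tokens of at least min_len letters from one line of text."""
    line = TAGS.sub(" ", line)         # drop markup first
    line = MARKS.sub("", line)         # strip niqqud and cantillation in place
    line = NON_LETTERS.sub(" ", line)  # punctuation, digits, Latin -> space
    return [t for t in line.split() if len(t) >= min_len]
```

One consequence of this interpretation is that abbreviations written with gershayim are split at the mark, so their pieces may fall below the length threshold.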

License

GPL-3.0
