SicariusSicariiStuff committed on
Commit 83349f8 · verified · 1 Parent(s): a336a61

Update README.md

Files changed (1)
  1. README.md +238 -3
README.md CHANGED
@@ -1,3 +1,238 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - he
4
+ - en
5
+ license: apache-2.0
6
+ tags:
7
+ - mistral
8
+ - nemo
9
+ - hebrew
10
+ - llm
11
+ - text-generation
12
+ - instruction-tuned
13
+ - chat
14
+ pipeline_tag: text-generation
15
+ base_model: mistralai/Mistral-Nemo-Base-2407
16
+ library_name: transformers
17
+ ---
18
+
19
+ # Hebrew_Nemo: State-of-the-Art Hebrew Language Model
20
+
21
+ ---
22
+
23
+ <div align="center">
24
+ <b style="font-size: 50px;">Hebrew_Nemo</b>
25
+
26
+
27
+ </div>
28
+
29
+
30
+ <div align="center">
31
+ <b style="font-size: 80px;">12B</b>
32
+
33
+
34
+ </div>
35
+
36
+
37
+ ---
38
+
39
+ <div align="center" style="font-size: 18px; margin-top: 20px;">
40
+ <b>Developed by:</b> <a href="https://huggingface.co/SicariusSicariiStuff">SicariusSicariiStuff</a>
41
+ </div>
42
+
43
+ ---
44
+
45
+ **Hebrew_Nemo** is a state-of-the-art (SOTA) **Hebrew large language model** optimized for Hebrew understanding and generation. Built on the Mistral Nemo architecture, it combines Mistral Nemo's multilingual foundations with extensive Hebrew-specific fine-tuning.
46
+
47
+ As part of [SicariusSicariiStuff](https://huggingface.co/SicariusSicariiStuff)'s efforts to democratize AI, [Hebrew_Nemo](https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo) is released under the permissive **Apache 2.0** license. The model demonstrates performance competitive with **Gemma3-27B**, one of the world's leading open-source models for multilingual capabilities, even though Gemma3-27B is **more than twice its size**. This result highlights Hebrew_Nemo's efficiency and makes SOTA capabilities widely available to consumers and corporations alike.
48
+
49
+ ### Technical Overview
50
+
51
+ - **Model Type:** Causal Language Model (Decoder-only Transformer)
52
+ - **Base Architecture:** Mistral Nemo
53
+ - **Language Focus:** Hebrew (עברית) with maintained multilingual capabilities
54
+ - **License:** Apache 2.0
55
+ - **Parameters:** 12B
56
+ - **Context Length:** 128K tokens
57
+ - **Layers:** 40
58
+ - **Dim:** 5,120
59
+ - **Head dim:** 128
60
+ - **Hidden dim:** 14,336
61
+ - **Activation Function:** SwiGLU
62
+ - **Number of heads:** 32
63
+ - **Number of kv-heads:** 8 (GQA)
64
+ - **Vocabulary size:** 2**17 = 131,072 (~128k)
65
+ - **Rotary embeddings:** theta = 1M
66
+
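The hyperparameters listed above are enough for some back-of-the-envelope sizing. The following is a minimal sketch, not an official calculation. It assumes untied input and output embeddings, a three-matrix SwiGLU feed-forward block, two norm weight vectors per layer, no bias terms, 2-byte (bf16/fp16) cache entries, and that "128K" means 131,072 tokens. `NemoConfig`, `approx_param_count`, and `kv_cache_bytes` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NemoConfig:
    """Hyperparameters copied from the Technical Overview list."""
    n_layers: int = 40
    dim: int = 5120
    head_dim: int = 128
    hidden_dim: int = 14336
    n_heads: int = 32
    n_kv_heads: int = 8
    vocab_size: int = 2 ** 17


def approx_param_count(cfg: NemoConfig) -> int:
    """Rough parameter count under the assumptions stated in the lead-in."""
    q_out = cfg.n_heads * cfg.head_dim       # 4096, which differs from dim (5120)
    kv_out = cfg.n_kv_heads * cfg.head_dim   # 1024, shared across query groups (GQA)
    # Query, key, value, and output projections
    attention = cfg.dim * q_out + 2 * cfg.dim * kv_out + q_out * cfg.dim
    feed_forward = 3 * cfg.dim * cfg.hidden_dim  # assumed: gate, up, down projections
    norms = 2 * cfg.dim                          # assumed: two norm vectors per layer
    per_layer = attention + feed_forward + norms
    embeddings = 2 * cfg.vocab_size * cfg.dim    # assumed: untied input and output
    return cfg.n_layers * per_layer + embeddings + cfg.dim  # plus final norm


def kv_cache_bytes(cfg: NemoConfig, n_tokens: int, bytes_per_value: int = 2) -> int:
    """Key/value cache size in bytes for a single sequence of n_tokens tokens."""
    # Factor 2 covers keys and values; only the KV heads are cached
    per_token = 2 * cfg.n_layers * cfg.n_kv_heads * cfg.head_dim * bytes_per_value
    return per_token * n_tokens


cfg = NemoConfig()
print(f"approx. parameters: {approx_param_count(cfg) / 1e9:.2f}B")
print(f"KV cache per token: {kv_cache_bytes(cfg, 1) / 1024:.0f} KiB")
print(f"KV cache at 131,072 tokens: {kv_cache_bytes(cfg, 131072) / 2**30:.0f} GiB")
```

Under these assumptions the estimate lands close to the advertised 12B. Because only 8 of the 32 heads carry keys and values, the cache is a quarter of what full multi-head attention with the same head size would need.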
67
+ ### Primary Use Cases
68
+
69
+ - **Hebrew Text Generation:** High-quality content creation in modern Hebrew
70
+ - **Translation:** Bidirectional translation between Hebrew and other languages
71
+ - **Question Answering:** Advanced reasoning and comprehension in Hebrew contexts
72
+ - **Dialogue Systems:** Conversational AI applications for Hebrew speakers
73
+ - **Text Classification:** Sentiment analysis, topic modeling, and categorization of Hebrew content
74
+ - **Named Entity Recognition:** Extraction of entities from Hebrew text
75
+ - **Summarization:** Concise summaries of Hebrew documents and articles
76
+
77
+ ### Out-of-Scope Uses
78
+
79
+ - Real-time critical decision-making systems (medical, legal, financial) without human oversight
80
+ - Generation of content intended to deceive or manipulate
81
+ - Applications requiring 100% factual accuracy without verification
82
+
83
+
84
+ ## Training Data and Training Methodology
85
+
86
+ Hebrew_Nemo was trained on a diverse corpus including:
87
+
88
+ | Source Type | Description | Language Coverage |
89
+ |--------------|--------------|------------------|
90
+ | Hebrew Wikipedia | Encyclopedia-style text | 100% Hebrew |
91
+ | Hebrew Literature & Proverbs | Classic and modern | 100% Hebrew |
92
+ | Hebrew-English Code-Mix | Social media & dialogue | 70% Hebrew / 30% English |
93
+ | Synthetic Data | Instruction-following & reasoning | Mixed |
94
+
95
+ Data was filtered, normalized, and token-balanced to reduce bias and improve generalization across dialects.
96
+
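The model card does not say which normalization steps were applied. Purely as an illustration, the sketch below shows one step that is common when preparing Hebrew corpora: optionally removing niqqud (vowel points) and cantillation marks, then collapsing whitespace. `normalize_hebrew` is a hypothetical helper and should not be read as the author's actual pipeline.

```python
import re
import unicodedata


def normalize_hebrew(text: str, strip_niqqud: bool = True) -> str:
    """Illustrative cleanup: optional niqqud removal plus whitespace collapsing."""
    # NFD splits precomposed presentation forms into base letter + combining marks
    text = unicodedata.normalize("NFD", text)
    if strip_niqqud:
        # Drop combining marks (category Mn) inside the Hebrew block only;
        # punctuation such as maqaf (U+05BE) is not a combining mark and is kept
        text = "".join(
            ch for ch in text
            if not ("\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn")
        )
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()


print(normalize_hebrew("שָׁלוֹם   עוֹלָם"))
```

With the default settings the vowel points are removed and the run of spaces collapses to one. Passing `strip_niqqud=False` keeps the marks, which may be preferable for vocalized texts such as Biblical Hebrew.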
97
+ Additional training data:
98
+
99
+ - Modern Hebrew web text and news articles
100
+ - Hebrew literature and academic publications
101
+ - Biblical and Rabbinic Hebrew texts for cultural depth
102
+ - Hebrew social media and conversational data
103
+ - Technical documentation in Hebrew
104
+ - Parallel corpora for translation capabilities
105
+
106
+ ---
107
+
108
+ **The training process involved:**
109
+
110
+ 1. Continued pre-training on Hebrew-rich datasets
111
+ 2. Instruction fine-tuning on Hebrew task-specific data
112
+ 3. Alignment through RLHF/DPO for Hebrew linguistic preferences
113
+
114
+ ---
115
+
116
+ ## 🚀 Key Features
117
+
118
+ - **Native Hebrew Understanding:** Trained on millions of high-quality Hebrew documents spanning literature, news, Wikipedia, academic, and colloquial domains.
119
+ - **Contextual Mastery:** Handles complex anaphora, idiomatic expressions, and mixed Hebrew-English text with high fidelity.
120
+ - **Instruction-Tuned:** Aligned for chat, Q&A, summarization, and reasoning use cases.
121
+ - **Cultural Awareness:** Sensitive to Hebrew cultural, religious, and social nuances.
122
+ - **Optimized Inference:** Benefits from Mistral Nemo's memory-efficient grouped-query attention (8 KV heads) and 128K-token context window.
123
+
124
+ ---
125
+
126
+ ## Out-of-Scope Usage
127
+ * Generating disinformation or biased political content
128
+ * Automated decision-making without human oversight
129
+
130
+ ---
131
+
132
+ ## ⚙️ Limitations
133
+
134
+ * May reflect **training corpus biases** (e.g., urban dialect prevalence, widespread opinions in Israeli social media)
135
+ * Limited performance on **rare biblical or archaic Hebrew**
136
+ * Occasionally mixes Hebrew and English when the context is ambiguous
137
+ * Does not include alignment for safety moderation out of the box
138
+
139
+ ---
140
+
141
+ ## 🗣️ Example Usage
142
+
143
+ ### Basic Inference
144
+
145
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "SicariusSicariiStuff/Hebrew_Nemo"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ # Load the weights in their stored dtype and place them on the available device(s)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+
+ prompt = "מהי בינה מלאכותית?"  # "What is artificial intelligence?"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ # temperature only takes effect when sampling is enabled
+ outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
161
+
162
+ ---
163
+
164
+ ### Chat Format
165
+
166
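+ ```python
+ # Reuses `tokenizer` and `model` from the Basic Inference example
+ messages = [
+     {"role": "user", "content": "ספר לי על ההיסטוריה של ירושלים"}  # "Tell me about the history of Jerusalem"
+ ]
+
+ # Render the conversation with the model's chat template, ending with the assistant turn marker
+ formatted_prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Chat templates usually insert special tokens (such as BOS) already, so avoid adding them twice
+ inputs = tokenizer(formatted_prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```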
181
+
182
+ ### Quantization (for lower VRAM)
183
+
184
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+ # 4-bit loading requires the bitsandbytes package; `model_name` is defined in the Basic Inference example
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_compute_dtype=torch.bfloat16
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     quantization_config=quantization_config,
+     device_map="auto"
+ )
+ ```
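
As a rough guide to why 4-bit loading helps, the sketch below estimates the memory taken by the weights alone at several precisions. It assumes a nominal 12e9 parameters (from the "12B" figure in the Technical Overview) and ignores quantization metadata, activations, and the KV cache, so real usage will be higher. `weight_memory_gib` is a hypothetical helper, not part of transformers or bitsandbytes.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB (no activations, KV cache, or metadata)."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 12e9  # nominal figure; the exact parameter count is an assumption

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Going from 16-bit to 4-bit weights cuts this figure by a factor of four, which is what brings a 12B model within reach of consumer GPUs.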
198
+
199
+ ---
200
+
201
+ ## Available Quantizations
202
+
203
+ - Original: [FP16](https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo)
204
+ - GGUF: [Static Quants](https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo_GGUF)
205
+ - Specialized: [FP8](https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo_FP8)
206
+ - Mobile (ARM): [Q4_0](https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo_ARM)
207
+
208
+ ---
209
+
210
+
211
+ ## Citation
212
+
213
+ ```bibtex
214
+ @misc{hebrew_nemo_2025,
215
+ author = {SicariusSicariiStuff},
216
+ title = {Hebrew_Nemo: State-of-the-Art Hebrew Language Model},
217
+ year = {2025},
218
+ publisher = {Hugging Face},
219
+ url = {https://huggingface.co/SicariusSicariiStuff/Hebrew_Nemo}
220
+ }
221
+ ```
222
+
223
+
224
+ ## 🧰 Acknowledgements
225
+
226
+ * [Mistral](https://mistral.ai/) for the base architecture
227
+ * [NVIDIA NeMo](https://developer.nvidia.com/nemo) for framework inspiration
228
+ * Employee#11 for her unwavering support
229
+
230
+ ## Contact
231
+
232
+ For questions, issues, or collaboration opportunities:
233
+ - **HuggingFace:** [@SicariusSicariiStuff](https://huggingface.co/SicariusSicariiStuff)
234
+ - **Issues:** Report technical issues on the model repository
235
+
236
+
237
+ ### Model Card Authors
238
+ - [@SicariusSicariiStuff](https://huggingface.co/SicariusSicariiStuff)