+---
+license: gemma
+library_name: transformers
+pipeline_tag: text-generation
+extra_gated_button_content: Acknowledge license
+tags:
+- conversational
+language:
+- ar
+- en
+---
+![](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)
+# QuantFactory/SILMA-9B-Instruct-v1.0-GGUF
+This is a quantized version of [silma-ai/SILMA-9B-Instruct-v1.0](https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0), created using llama.cpp.
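Because this repository ships GGUF files rather than Transformers weights, one way to run them is through the `llama-cpp-python` bindings. The sketch below is not part of the original card. It assumes `pip install llama-cpp-python huggingface_hub`, and the `*Q4_K_M.gguf` filename pattern is a guess at one of the quantization levels in this repository, so check the file list and adjust it. `build_messages` and `run_gguf_chat` are hypothetical helper names.

```python
def build_messages(user_text, system_text=None):
    """Build an OpenAI-style message list for a single-turn chat."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


def run_gguf_chat(user_text, filename_pattern="*Q4_K_M.gguf", max_tokens=256):
    """Download a GGUF file from the Hub and generate one reply.

    The filename pattern is an assumption; pick a file that exists in the repo.
    """
    # Imported lazily so build_messages works without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="QuantFactory/SILMA-9B-Instruct-v1.0-GGUF",
        filename=filename_pattern,
        n_ctx=4096,  # context window; an arbitrary choice for this sketch
        verbose=False,
    )
    result = llm.create_chat_completion(
        messages=build_messages(user_text),
        max_tokens=max_tokens,
    )
    return result["choices"][0]["message"]["content"]


# Example call (downloads a multi-gigabyte file on first use):
# print(run_gguf_chat("أيهما أبعد عن الأرض، الشمس أم القمر؟"))
```

Lower-bit files trade some output quality for a smaller memory footprint, so the right pattern depends on your hardware.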
+# Original Model Card
+# SILMA AI
+SILMA.AI is a leading Generative AI startup dedicated to empowering Arabic speakers with state-of-the-art AI solutions.
+## 🚀 Our Flagship Model: SILMA 1.0 🚀
+* **SILMA 1.0** is the **TOP-RANKED** open-weights Arabic LLM with an impressive **9 billion parameter size**, surpassing models that are over seven times larger 🏆
+## What makes SILMA exceptional?
+* SILMA is a small language model that outperforms 72B models on most Arabic language tasks, making it more practical for business use cases
+* SILMA is built over the robust foundational models of Google Gemma, combining the strengths of both to provide you with unparalleled performance
+* SILMA is an open-weight model, free to use in accordance with our open license
+## 👥 Our Team
+We are a team of seasoned **Arabic AI experts** who understand the nuances of the language and cultural considerations, enabling us to build solutions that truly resonate with Arabic users.
+**Authors**: [silma.ai](https://silma.ai)
+### Usage
+Below we share some code snippets on how to get quickly started with running the model. First, install the Transformers library with:
+```sh
+pip install -U transformers sentencepiece
+```
+Then, copy the snippet from the section that is relevant to your use case.
+#### Running with the `pipeline` API
+```python
+import torch
+from transformers import pipeline
+pipe = pipeline(
+    "text-generation",
+    model="silma-ai/SILMA-9B-Instruct-v1.0",
+    model_kwargs={"torch_dtype": torch.bfloat16},
+    device="cuda",  # replace with "mps" to run on a Mac device
+)
+messages = [
+    {"role": "user", "content": "اكتب رسالة تعتذر فيها لمديري في العمل عن الحضور اليوم لأسباب مرضية."},
+]
+outputs = pipe(messages, max_new_tokens=256)
+assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
+print(assistant_response)
+```
+- Response:
+```text
+السلام عليكم ورحمة الله وبركاته
+أودّ أن أعتذر عن عدم الحضور إلى العمل اليوم بسبب مرضي. أشعر بالسوء الشديد وأحتاج إلى الراحة. سأعود إلى العمل فور تعافيي.
+شكراً لتفهمكم.
+مع تحياتي،
+[اسمك]
+```
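The indexing in `outputs[0]["generated_text"][-1]["content"]` works because, for chat-style input, the pipeline returns the whole conversation as a list of message dictionaries with the model's reply appended last. The sketch below illustrates that shape with a hand-written stand-in for the pipeline output. `last_reply` is a hypothetical helper name and the sample data is made up, not real model output.

```python
def last_reply(outputs):
    """Return the text of the final message in the first generated conversation."""
    conversation = outputs[0]["generated_text"]
    final_message = conversation[-1]
    if final_message["role"] != "assistant":
        raise ValueError("the last message is not an assistant reply")
    return final_message["content"].strip()


# Hand-written stand-in shaped like the pipeline's return value for chat input.
mock_outputs = [
    {
        "generated_text": [
            {"role": "user", "content": "مرحبا"},
            {"role": "assistant", "content": " أهلاً بك! \n"},
        ]
    }
]

print(last_reply(mock_outputs))
```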
+#### Running the model on a single GPU or multiple GPUs
+```sh
+pip install accelerate
+```
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+messages = [
+    {"role": "system", "content": "أنت مساعد ذكي للإجابة عن أسئلة المستخدمين."},
+    {"role": "user", "content": "أيهما أبعد عن الأرض, الشمس أم القمر؟"},
+]
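+# add_generation_prompt=True opens the model turn so generation starts with the reply
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)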
+outputs = model.generate(**input_ids, max_new_tokens=256)
+print(tokenizer.decode(outputs[0]))
+```
+- Response:
+```text
+الشمس
+```
+You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+messages = [
+    {"role": "system", "content": "أنت مساعد ذكي للإجابة عن أسئلة المستخدمين."},
+    {"role": "user", "content": "اكتب كود بايثون لتوليد متسلسلة أرقام زوجية."},
+]
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
+outputs = model.generate(**input_ids, max_new_tokens=256)
+print(tokenizer.decode(outputs[0]).split("<start_of_turn>model")[-1])
+```
+- Response:
+```python
+def generate_even_numbers(n):
+	"""
+	This function generates a list of even numbers from 1 to n.
+	Args:
+		n: The upper limit of the range.
+	Returns:
+		A list of even numbers.
+	"""
+	return [i for i in range(1, n + 1) if i % 2 == 0]
+# Example usage
+n = 10
+even_numbers = generate_even_numbers(n)
+print(f"The first {n} even numbers are: {even_numbers}")
+```
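The `split("<start_of_turn>model")[-1]` step in the snippet above keeps everything after the last model-turn marker, which can still include trailing special tokens. A slightly more thorough clean-up might look like the sketch below. `extract_model_reply` is a hypothetical helper, and the token strings `<end_of_turn>` and `<eos>` are assumed from the Gemma chat format this model inherits.

```python
def extract_model_reply(decoded):
    """Return the text of the last model turn, without trailing special tokens."""
    marker = "<start_of_turn>model"
    # Keep only what follows the last model-turn marker (the whole string if absent).
    reply = decoded.split(marker)[-1]
    # Cut at the first end-of-turn or end-of-sequence token, if present.
    for stop_token in ("<end_of_turn>", "<eos>"):
        reply = reply.split(stop_token)[0]
    return reply.strip()


decoded = (
    "<bos><start_of_turn>user\n"
    "أيهما أبعد عن الأرض، الشمس أم القمر؟<end_of_turn>\n"
    "<start_of_turn>model\n"
    "الشمس<end_of_turn><eos>"
)
print(extract_model_reply(decoded))
```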
+#### Quantized Versions through `bitsandbytes`
+<details>
+  <summary>
+    Using 8-bit precision (int8)
+  </summary>
+```sh
+pip install bitsandbytes accelerate
+```
+```python
+# pip install bitsandbytes accelerate
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
+quantization_config = BitsAndBytesConfig(load_in_8bit=True)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=quantization_config,
+)
+messages = [
+    {"role": "system", "content": "أنت مساعد ذكي للإجابة عن أسئلة المستخدمين."},
+    {"role": "user", "content": "اذكر خمس انواع فواكه بها نسب عالية من فيتامين ج."},
+]
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
+outputs = model.generate(**input_ids, max_new_tokens=256)
+print(tokenizer.decode(outputs[0]).split("<start_of_turn>model")[-1])
+```
+- Response:
+```text
+الليمون، البرتقال، الموز، الكيوي، الفراولة
+```
+</details>
+<details>
+  <summary>
+    Using 4-bit precision
+  </summary>
+```python
+# pip install bitsandbytes accelerate
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
+quantization_config = BitsAndBytesConfig(load_in_4bit=True)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=quantization_config,
+)
+messages = [
+    {"role": "system", "content": "أنت مساعد ذكي للإجابة عن أسئلة المستخدمين."},
+    {"role": "user", "content": "في أي عام توفى صلاح الدين الأيوبي؟"},
+]
+input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to(model.device)
+outputs = model.generate(**input_ids, max_new_tokens=256)
+print(tokenizer.decode(outputs[0]).split("<start_of_turn>model")[-1])
+```
+- Response:
+```text
+1193
+```
+</details>
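As a rough guide to why the 8-bit and 4-bit options matter, weight memory scales with the number of bytes stored per parameter. The back-of-the-envelope sketch below assumes a nominal 9 billion parameters and counts weights only. It ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
NOMINAL_PARAMS = 9_000_000_000  # "9B" taken at face value; the exact count differs

BITS_PER_PARAM = {"bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone come to roughly 18 GB in bfloat16, 9 GB in int8, and 4.5 GB in 4-bit.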
+#### Advanced Usage
+<details>
+  <summary>
+    Torch compile
+  </summary>
+[Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding up the
+inference of PyTorch modules. The SILMA model can run up to 6x faster by leveraging torch compile.
+Note that two warm-up steps are required before the full inference speed is reached:
+```python
+import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+from transformers import AutoTokenizer, Gemma2ForCausalLM
+from transformers.cache_utils import HybridCache
+import torch
+torch.set_float32_matmul_precision("high")
+# load the model + tokenizer
+model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = Gemma2ForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+model.to("cuda")
+# apply the torch compile transformation
+model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
+# pre-process inputs
+messages = [
+    {"role": "system", "content": "أنت مساعد ذكي للإجابة عن أسئلة المستخدمين."},
+    {"role": "user", "content": "من الرئيس الذي تولى المنصب في أمريكا بعد دونالد ترامب؟"},
+]
+# tokenize the chat-formatted conversation (a later raw-text tokenization would discard the template)
+model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
+prompt_length = model_inputs.input_ids.shape[1]
+# set-up k/v cache
+past_key_values = HybridCache(
+    config=model.config,
+    max_batch_size=1,
+    max_cache_len=model.config.max_position_embeddings,
+    device=model.device,
+    dtype=model.dtype
+)
+# enable passing kv cache to generate
+model._supports_cache_class = True
+model.generation_config.cache_implementation = None
+# two warm-up steps
+for idx in range(2):
+    outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+    past_key_values.reset()
+# fast run
+outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+# decode only the newly generated tokens, skipping the prompt
+print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
+```
+- Response:
+```text
+جو بايدن
+```
+For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).
+</details>
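To observe the warm-up effect described in the Torch compile section, you can time each call separately rather than averaging. The helper below is a generic sketch that uses only the standard library. `time_calls` is a hypothetical name, and `fake_generate` merely simulates a slow first call so the example runs without a GPU. With a real model you would pass a function that calls `model.generate` and, on CUDA, synchronizes before returning.

```python
import time


def time_calls(fn, repeats=4):
    """Call fn() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a compiled model: the first call is slow, later calls are fast.
_state = {"calls": 0}


def fake_generate():
    _state["calls"] += 1
    time.sleep(0.05 if _state["calls"] == 1 else 0.001)


durations = time_calls(fake_generate)
for i, seconds in enumerate(durations):
    print(f"call {i}: {seconds * 1000:.1f} ms")
```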
+### Chat Template
+The instruction-tuned models use a chat template that must be adhered to for conversational use.
+The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.
+Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_id = "silma-ai/SILMA-9B-Instruct-v1.0"
+dtype = torch.bfloat16
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="cuda",
+    torch_dtype=dtype,
+)
+chat = [
+    { "role": "user", "content": "ما اشهر اطارات العمل في البايثون لبناء نماذج الذكاء الاصطناعي؟" },
+]
+prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+```
+At this point, the prompt contains the following text:
+```
+<bos><start_of_turn>user
+ما اشهر اطارات العمل في البايثون لبناء نماذج الذكاء الاصطناعي؟<end_of_turn>
+<start_of_turn>model
+```
+As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
+(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
+the `<end_of_turn>` token.
+You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
+chat template.
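A manual builder following the format above could look like this sketch. `build_gemma_prompt` is a hypothetical helper. It covers only the `user` and `model` turns described here and does not attempt to reproduce how the tokenizer's template treats other roles such as `system`.

```python
def build_gemma_prompt(turns, add_generation_prompt=True):
    """Render (role, text) turns into the <start_of_turn>/<end_of_turn> format."""
    parts = ["<bos>"]
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text.strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the model's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "ما عاصمة مصر؟")])
print(prompt)
```

Because the resulting string already starts with `<bos>`, it should be tokenized with `add_special_tokens=False` so the token is not added twice.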
+After the prompt is ready, generation can be performed like this:
+```python
+inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
+print(tokenizer.decode(outputs[0]))
+```
+### Inputs and outputs
+*   **Input:** Text string, such as a question, a prompt, or a document to be
+    summarized.
+*   **Output:** Generated Arabic or English text in response to the input, such
+    as an answer to a question, or a summary of a document.
+### Citation
+```none
+@article{silma_01_2024,
+    title={Silma},
+    url={https://www.silma.ai},
+    publisher={Silma},
+    author={Silma Team},
+    year={2024}
+}
+```
+## Usage and Limitations
+These models have certain limitations that users should be aware of.
+### Intended Usage
+Open Large Language Models (LLMs) have a wide range of applications across
+various industries and domains. The following list of potential uses is not
+comprehensive. The purpose of this list is to provide contextual information
+about the possible use-cases that the model creators considered as part of model
+training and development.
+* Content Creation and Communication
+  * Text Generation: These models can be used to generate creative text formats
+    such as poems, scripts, code, marketing copy, and email drafts.
+  * Chatbots and Conversational AI: Power conversational interfaces for customer
+    service, virtual assistants, or interactive applications.
+  * Text Summarization: Generate concise summaries of a text corpus, research
+    papers, or reports.
+* Research and Education
+  * Natural Language Processing (NLP) Research: These models can serve as a
+    foundation for researchers to experiment with NLP techniques, develop
+    algorithms, and contribute to the advancement of the field.
+  * Language Learning Tools: Support interactive language learning experiences,
+    aiding in grammar correction or providing writing practice.
+  * Knowledge Exploration: Assist researchers in exploring large bodies of text
+    by generating summaries or answering questions about specific topics.
+### Limitations
+* Training Data
+  * The quality and diversity of the training data significantly influence the
+    model's capabilities. Biases or gaps in the training data can lead to
+    limitations in the model's responses.
+  * The scope of the training dataset determines the subject areas the model can
+    handle effectively.
+* Context and Task Complexity
+  * LLMs are better at tasks that can be framed with clear prompts and
+    instructions. Open-ended or highly complex tasks might be challenging.
+  * A model's performance can be influenced by the amount of context provided
+    (longer context generally leads to better outputs, up to a certain point).
+* Language Ambiguity and Nuance
+  * Natural language is inherently complex. LLMs might struggle to grasp subtle
+    nuances, sarcasm, or figurative language.
+* Factual Accuracy
+  * LLMs generate responses based on information they learned from their
+    training datasets, but they are not knowledge bases. They may generate
+    incorrect or outdated factual statements.
+* Common Sense
+  * LLMs rely on statistical patterns in language. They might lack the ability
+    to apply common sense reasoning in certain situations.
+### Ethical Considerations and Risks
+The development of large language models (LLMs) raises several ethical concerns.
+In creating an open model, we have carefully considered the following:
+* Bias and Fairness
+  * LLMs trained on large-scale, real-world text data can reflect socio-cultural
+    biases embedded in the training material.
+* Misinformation and Misuse
+  * LLMs can be misused to generate text that is false, misleading, or harmful.
+  * Guidelines for responsible use of the model are provided in the
+    Responsible Generative AI Toolkit.
+* Transparency and Accountability
+  * This model card summarizes details on the models' architecture,
+    capabilities, limitations, and evaluation processes.
+  * A responsibly developed open model offers the opportunity to share
+    innovation by making LLM technology accessible to developers and researchers
+    across the AI ecosystem.
+Risks identified and mitigations:
+* Perpetuation of biases: Continuous monitoring (using evaluation metrics and
+  human review) and the exploration of de-biasing techniques are encouraged
+  during model training, fine-tuning, and other use cases.
+* Generation of harmful content: Mechanisms and guidelines for content safety
+  are essential. Developers are encouraged to exercise caution and implement
+  appropriate content safety safeguards based on their specific product policies
+  and application use cases.
+* Privacy violations: Models were trained on data filtered for removal of PII
+  (Personally Identifiable Information). Developers are encouraged to adhere to
+  privacy regulations with privacy-preserving techniques.