|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
--- |
|
|
|
|
|
<p align="center"> |
|
|
<picture> |
|
|
<!-- Dark mode --> |
|
|
<source media="(prefers-color-scheme: dark)" srcset="https://cdn-uploads.huggingface.co/production/uploads/65604648d69284e31fed02b0/iDADKNbWL17MTB-bB34gV.png"> |
|
|
<!-- Light mode --> |
|
|
<source media="(prefers-color-scheme: light)" srcset="https://cdn-uploads.huggingface.co/production/uploads/65604648d69284e31fed02b0/AO9bjIkbM0zFs67oE3-it.png"> |
|
|
<!-- Fallback --> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/65604648d69284e31fed02b0/AO9bjIkbM0zFs67oE3-it.png" alt="Jais2 Logo" width="400"> |
|
|
</picture> |
|
|
</p> |
|
|
|
|
|
# Jais-2: The Next Generation of Arabic Frontier LLMs |
|
|
|
|
|
## Model Overview |
|
|
Jais-2-70B-Chat is a high-capacity bilingual Arabic–English language model developed by MBZUAI, Inception, and Cerebras. |
|
|
Trained from scratch on Arabic and English data and powered by a custom Arabic-centric vocabulary, the model efficiently captures Modern Standard Arabic, regional dialects, and mixed Arabic–English code-switching.
|
|
The model is openly available under the Apache 2.0 license and is also deployed as a fast, production-ready chat experience running on Cerebras hardware.
|
|
Visit the [Jais-2 Web App](https://jaischat.ai). |
|
|
|
|
|
## Key Technical Specifications |
|
|
- **Model Developers**: MBZUAI, Inception, Cerebras. |
|
|
- **Languages**: Arabic (MSA & dialects) and English |
|
|
- **Architecture**: Transformer-based, decoder-only architecture with multi-head self-attention.
|
|
- **Parameters**: 70 Billion |
|
|
- **Context Length**: 8,192 tokens
|
|
- **Vocabulary Size**: 150,272 |
|
|
- **Training Infrastructure**: Optimized for Cerebras CS-2 and Condor Galaxy clusters |
|
|
- **Key Design Choices**: Rotary Position Embeddings (RoPE), Squared-ReLU activation, custom μP parameterization, and 8:1 filter-to-hidden size ratio. |
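
For intuition, the sketch below shows how two of these design choices fit together: the feed-forward (filter) width follows from the 8:1 filter-to-hidden-size ratio, with squared-ReLU applied between the up- and down-projections. The `hidden_size` here is a small placeholder for illustration, not the model's actual width.

```python
import torch

# Illustrative sketch of the feed-forward design notes above.
# hidden_size is a small placeholder, NOT the model's actual width.
hidden_size = 512
filter_size = 8 * hidden_size  # 8:1 filter-to-hidden-size ratio

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) ** 2  # squared-ReLU: max(x, 0)^2

up = torch.nn.Linear(hidden_size, filter_size)    # up-projection
down = torch.nn.Linear(filter_size, hidden_size)  # down-projection

x = torch.randn(2, 16, hidden_size)  # (batch, sequence, hidden)
y = down(squared_relu(up(x)))
print(y.shape)  # torch.Size([2, 16, 512])
```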
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## How to Use the Model |
|
|
|
|
|
### Using Transformers
|
|
#### 1. Clone the Jais 2–compatible Transformers fork
|
|
|
|
|
```bash |
|
|
# Pending PR merge to the official package |
|
|
git clone --branch jais2 --single-branch \ |
|
|
https://github.com/inceptionai-abudhabi/transformers.git |
|
|
cd transformers |
|
|
uv pip install -e . |
|
|
``` |
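
After installation, a quick import check confirms that Python resolves `transformers` to the cloned fork rather than a previously installed copy (a minimal sanity check, assuming only that the editable install succeeded):

```python
# The reported path should point inside the cloned "transformers" directory.
import transformers
print(transformers.__version__)
print(transformers.__file__)
```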
|
|
#### 2. Load the Model and Run Inference
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
|
|
# Load the model and tokenizer |
|
|
model_name = "inceptionai/Jais-2-70B-Chat" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")  # load in the checkpoint's native precision
|
|
|
|
|
# Example Arabic prompt |
|
|
system_prompt = "أجب باللغة العربية بطريقة رسمية وواضحة."  # "Answer in Arabic in a formal and clear manner."
|
|
user_input = "ما هي عاصمة الإمارات؟"  # "What is the capital of the UAE?"
|
|
|
|
|
# Apply the chat template (always required for chat inference)
|
|
chat_text = tokenizer.apply_chat_template( |
|
|
[ |
|
|
{"role": "system", "content": system_prompt}, |
|
|
{"role": "user", "content": user_input} |
|
|
], |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
|
|
|
# Tokenize and generate |
|
|
inputs = tokenizer(chat_text, return_tensors="pt").to(model.device) |
|
|
outputs = model.generate(**inputs, max_new_tokens=8192, do_sample=False)  # greedy decoding; the 8,192-token context includes the prompt
|
|
|
|
|
# Decode and print only the newly generated tokens (skip the echoed prompt)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# عاصمة الإمارات العربية المتحدة هي أبوظبي. ("The capital of the United Arab Emirates is Abu Dhabi.")
|
|
``` |
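
The same chat-template flow extends to multi-turn conversations. The follow-up turn below is an illustrative addition (not from the original example), reusing the `tokenizer` and `model` loaded above:

```python
# Multi-turn chat: append the assistant's reply, then ask a follow-up.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input},
    {"role": "assistant", "content": "عاصمة الإمارات العربية المتحدة هي أبوظبي."},
    {"role": "user", "content": "ما هو عدد سكانها؟"},  # "What is its population?"
]
chat_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Print only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```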
|
|
### Using vLLM
|
|
|
|
|
#### 1. Clone the Jais 2–compatible vLLM fork
|
|
|
|
|
```bash |
|
|
# Pending PR merge to the official package |
|
|
git clone --branch jais2 --single-branch \ |
|
|
https://github.com/inceptionai-abudhabi/vllm.git |
|
|
cd vllm |
|
|
uv pip install -e .
# Note: if you install vLLM after transformers, re-install transformers from the
# Jais 2 fork: https://github.com/inceptionai-abudhabi/transformers.git
|
|
``` |
|
|
|
|
|
#### 2. Load the Model and Run Inference
|
|
```python |
|
|
from vllm import LLM, SamplingParams |
|
|
|
|
|
# Load model and tokenizer |
|
|
model_name = "inceptionai/Jais-2-70B-Chat" |
|
|
llm = LLM(model=model_name, tensor_parallel_size=2)  # set to your GPU count; a 70B model needs multiple GPUs
|
|
tokenizer = llm.get_tokenizer() |
|
|
|
|
|
# Example Arabic prompt |
|
|
system_prompt = "أجب باللغة العربية بطريقة رسمية وواضحة."  # "Answer in Arabic in a formal and clear manner."
|
|
user_input = "ما هي عاصمة الإمارات؟"  # "What is the capital of the UAE?"
|
|
|
|
|
# Apply the chat template (always required for chat inference)
|
|
chat_text = tokenizer.apply_chat_template( |
|
|
[ |
|
|
{"role": "system", "content": system_prompt}, |
|
|
{"role": "user", "content": user_input} |
|
|
], |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
|
|
|
# Run generation |
|
|
sampling_params = SamplingParams(max_tokens=8192, temperature=0)  # temperature=0 selects greedy decoding
|
|
outputs = llm.generate([chat_text], sampling_params) |
|
|
|
|
|
# Print the output
|
|
print(outputs[0].outputs[0].text) |
|
|
# عاصمة الإمارات العربية المتحدة هي أبوظبي. ("The capital of the United Arab Emirates is Abu Dhabi.")
|
|
``` |
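
Recent vLLM releases also provide an `LLM.chat` helper that applies the chat template internally; assuming the Jais 2 fork tracks such a release, the prompt-construction step can be dropped:

```python
# Equivalent generation via vLLM's chat helper (assumes the fork
# includes LLM.chat, which applies the chat template internally).
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input},
]
chat_outputs = llm.chat(messages, sampling_params)
print(chat_outputs[0].outputs[0].text)
```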
|
|
|
|
|
Alternatively, serve the model from the command line (CLI):
|
|
|
|
|
```shell |
|
|
vllm serve inceptionai/Jais-2-70B-Chat \ |
|
|
--served-model-name inceptionai/Jais-2-70B-Chat-Local --dtype bfloat16 \ |
|
|
--tensor-parallel-size 2 --max-model-len 8192 --max-num-seqs 256 \ |
|
|
--host 0.0.0.0 --port 8042 --api-key "Optional" |
|
|
``` |
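
The server exposes an OpenAI-compatible API, so it can be queried with the official `openai` Python client; the `base_url`, API key, and served model name below mirror the flags in the command above:

```python
# Query the vLLM server via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8042/v1", api_key="Optional")
response = client.chat.completions.create(
    model="inceptionai/Jais-2-70B-Chat-Local",
    messages=[
        {"role": "system", "content": "أجب باللغة العربية بطريقة رسمية وواضحة."},
        {"role": "user", "content": "ما هي عاصمة الإمارات؟"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```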
|
|
--- |
|
|
## Evaluation |
|
|
### Performance Overview |
|
|
We evaluate **Jais-2-70B** on two key benchmarks that capture both *instruction-following* and *generative* Arabic ability: **IFEval** (English and Arabic) and **AraGen-12-24 (3C3H)**.
|
|
|
|
|
### IFEval Results (Strict 0-shot) |
|
|
|
|
|
| Model Name | En-Strict-Prompt-lvl | En-Strict-Instruction-lvl | Ar-Strict-Prompt-lvl | Ar-Strict-Instruction-lvl | |
|
|
|------------|-----------------------|----------------------------|------------------------|----------------------------| |
|
|
| **Qwen2.5-72B-Instruct** | 83.53 | 88.51 | **67.33** | 74.05 |
|
|
| **Llama-3.3-70B-Instruct** | **88.20** | **92.10** | 58.17 | 63.13 | |
|
|
| **Jais-2-70B (ours)** | 70.78 | 78.93 | 66.58 | **74.53** |
|
|
|
|
|
--- |
|
|
|
|
|
### AraGen-12-24 (3C3H) Results |
|
|
|
|
|
| Model Name | 3C3H Score (%) | Correctness | Completeness | Conciseness | Helpfulness | Honesty | Harmlessness | |
|
|
|------------|----------------|-------------|--------------|-------------|-------------|---------|-------------- | |
|
|
| **Qwen2.5-72B-Instruct** | 62.58 | 71.92 | 71.80 | 19.06 | 69.86 | 70.94 | 71.92 | |
|
|
| **Llama-3.3-70B-Instruct** | 61.29 | 68.58 | 65.11 | **34.50** | 63.50 | 67.47 | 68.58 | |
|
|
| **Jais-2-70B (ours)** | **70.71** | **80.53** | **79.09** | 25.48 | **78.43** | **80.23** | **80.53** | |
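
As a sanity check on the table, each reported 3C3H score matches the arithmetic mean of its six dimension scores:

```python
# Reproduce the 3C3H column as the mean of the six dimension scores.
rows = {
    "Qwen2.5-72B-Instruct":   [71.92, 71.80, 19.06, 69.86, 70.94, 71.92],
    "Llama-3.3-70B-Instruct": [68.58, 65.11, 34.50, 63.50, 67.47, 68.58],
    "Jais-2-70B (ours)":      [80.53, 79.09, 25.48, 78.43, 80.23, 80.53],
}
for name, dims in rows.items():
    print(f"{name}: {sum(dims) / len(dims):.2f}")
# -> 62.58, 61.29, 70.71, matching the 3C3H Score column
```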
|
|
|
|
|
|
|
|
Overall, our results show that: |
|
|
- Jais-2-70B delivers competitive Arabic and English instruction-following performance across IFEval metrics. |
|
|
- Jais-2-70B achieves the highest scores across nearly all AraGen metrics, outperforming Qwen2.5-72B and Llama-3.3-70B on Arabic generative tasks. |
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Target Audiences |
|
|
- **Academics**: Researchers focusing on Arabic NLP, multilingual modeling, or cultural alignment |
|
|
- **Businesses**: Companies targeting Arabic-speaking markets |
|
|
- **Developers and ML Engineers**: Integrating Arabic language capabilities into applications and workflows |
|
|
|
|
|
### Appropriate Use Cases |
|
|
- **Research**: |
|
|
- Natural language understanding and generation tasks |
|
|
- Conducting interpretability or cross-lingual alignment analyses |
|
|
- Investigating Arabic linguistic or cultural patterns |
|
|
|
|
|
- **Commercial Use**: |
|
|
- Building chat assistants for Arabic-speaking audiences |
|
|
- Performing sentiment and market analysis in regional contexts |
|
|
- Summarizing or processing bilingual Arabic–English documents |
|
|
- Creating culturally resonant Arabic marketing and entertainment content for regional audiences |
|
|
|
|
|
### Inappropriate Use Cases |
|
|
- **Harmful or Malicious Use**: |
|
|
- Producing hate speech, extremist content, or discriminatory language |
|
|
- Creating or spreading misinformation or deceptive content |
|
|
- Engaging in or promoting illegal activities |
|
|
|
|
|
- **Sensitive Information**: |
|
|
- Handling or generating personal, confidential, or sensitive information |
|
|
- Attempting to infer, reconstruct, or guess sensitive information about individuals or organizations |
|
|
|
|
|
- **Language Limitations**: |
|
|
- Applications requiring strong performance in languages other than Arabic or English
|
|
|
|
|
- **High-Stakes Decisions**: |
|
|
- Making medical, legal, financial, or safety-critical decisions without human oversight |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, please cite it:
|
|
|
|
|
```bibtex
|
|
@techreport{jais2_2025, |
|
|
title = {Jais 2: {A} Family of {A}rabic-Centric Open Large Language Models}, |
|
|
author = { |
|
|
Anwar, Mohamed and |
|
|
Freihat, Abdelhakim and |
|
|
Ibrahim, George and |
|
|
Awad, Mostafa and |
|
|
Sadallah, Abdelrahman Atef Mohamed Ali and |
|
|
Gosal, Gurpreet and |
|
|
Ramakrishnan, Gokul and |
|
|
Hestness, Joel and |
|
|
Mishra, Biswajit and |
|
|
Joshi, Rituraj and |
|
|
Chandran, Sarath and |
|
|
Frikha, Ahmed and |
|
|
Goffinet, Etienne and |
|
|
Maiti, Abhishek and |
|
|
El Filali, Ali and |
|
|
Al Barri, Sarah and |
|
|
Ghosh, Samujjwal and |
|
|
Pal, Rahul and |
|
|
Mullah, Parvez and |
|
|
Shukla, Awantika and |
|
|
Siddiki, Sajid and |
|
|
Kamboj, Samta and |
|
|
Pandit, Onkar and |
|
|
Sahu, Sunil and |
|
|
El Badawy, Abelrahman and |
|
|
Mohamed, Amr and |
|
|
Chamma, Ahmad and |
|
|
Dufraisse, Evan and |
|
|
Bounhar, Abdelaziz and |
|
|
Bouch, Dani and |
|
|
Abdine, Hadi and |
|
|
Shang, Guokan and |
|
|
Koto, Fajri and |
|
|
Wang, Yuxia and |
|
|
Xie, Zhuohan and |
|
|
Mekky, Ali and |
|
|
Elbadry, Rania Hossam Elmohamady and |
|
|
Ahmad, Sarfraz and |
|
|
Ahsan, Momina and |
|
|
El-Herraoui, Omar Emad Mohamed and |
|
|
Orel, Daniil and |
|
|
Iqbal, Hasan and |
|
|
Elzeky, Kareem Mohamed Naguib Abdelmohsen Fahmy and |
|
|
Abassy, Mervat and |
|
|
Ali, Kareem and |
|
|
Eletter, Saadeldine and |
|
|
Atif, Farah and |
|
|
Mukhituly, Nurdaulet and |
|
|
Li, Haonan and |
|
|
Han, Xudong and |
|
|
Singh, Aaryamonvikram and |
|
|
Quraishi, Zain and |
|
|
Sengupta, Neha and |
|
|
Murray, Larry and |
|
|
Sheinin, Avraham and |
|
|
Vassilieva, Natalia and |
|
|
Ren, Hector and |
|
|
Liu, Zhengzhong and |
|
|
Vazirgiannis, Michalis and |
|
|
Nakov, Preslav |
|
|
}, |
|
|
institution = {IFM}, |
|
|
type = {Technical Report}, |
|
|
year = {2025}, |
|
|
month = dec, |
|
|
day = {09}, |
|
|
} |
|
|
``` |