We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944).

## Model Training

#### Pretraining

Fanar-1-9B-Instruct was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset; 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources; and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
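As a quick sanity check, the stated mix adds up to roughly the 1T-token budget. The labels below are shorthand for the sources named above:

```python
# Pretraining data mix as stated above, in billions of tokens
mix = {
    "English (Dolma subset)": 515,  # approximate (~515B)
    "Arabic (in-house)": 410,
    "Code (The Stack)": 102,
}

total = sum(mix.values())
for source, tokens in mix.items():
    print(f"{source}: {tokens}B ({tokens / total:.0%})")
print(f"Total: {total}B tokens (~1T)")
```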

#### Post-training

Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:

| Phase | Size |
|-------|------|

## Getting Started

Fanar-1-9B-Instruct is compatible with the Hugging Face `transformers` library (≥ v4.40.0). Here's how to load and use the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "QCRI/Fanar-1-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# message content may be in Arabic or English
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},
]
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = model.generate(**tokenizer(inputs, return_tensors="pt", return_token_type_ids=False).to(model.device), max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Inference using vLLM is also supported:

```python
from vllm import LLM, SamplingParams

model_name = "QCRI/Fanar-1-9B-Instruct"

llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# message content may be in Arabic or English
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},
]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```

---

## Intended Use