shamz15531 committed on
Commit e7018d4 · verified · 1 Parent(s): 595dc6b

Update README.md

Files changed (1): README.md (+22 −3)
README.md CHANGED
@@ -45,10 +45,10 @@ We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) wit
 ## Model Training
 
 #### Pretraining
-Fanar was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
+Fanar-1-9B-Instruct was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.
 
 #### Post-training
-Fanar underwent a two-phase post-training pipeline:
+Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:
 
 | Phase | Size |
 |-------|------|
@@ -60,7 +60,7 @@ Fanar underwent a two-phase post-training pipeline:
 
 ## Getting Started
 
-Fanar is compatible with the Hugging Face `transformers` library (≥ v4.40.0). Here's how to load and use the model:
+Fanar-1-9B-Instruct is compatible with the Hugging Face `transformers` library (≥ v4.40.0). Here's how to load and use the model:
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -81,6 +81,25 @@ outputs = model.generate(**tokenizer(inputs, return_tensors="pt", return_token_t
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
+Inference using vLLM is also supported:
+
+```python
+
+from vllm import LLM, SamplingParams
+
+model_name = "QCRI/Fanar-1-9B-Instruct"
+
+llm = LLM(model=model_name)
+sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
+
+# message content may be in Arabic or English
+messages = [
+    {"role": "user", "content": "ما هي عاصمة قطر؟"},  # "What is the capital of Qatar?"
+]
+
+outputs = llm.chat(messages, sampling_params)
+print(outputs[0].outputs[0].text)
+```
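For multi-turn use, `llm.chat` consumes an OpenAI-style list of role/content dicts like the one above. A minimal sketch of that structure (the conversation itself is our own illustrative example, not from the model card):

```python
# OpenAI-style chat messages: each turn is a dict with "role" and "content".
# A list like this is what llm.chat(...) accepts in the example above.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer in the language of the question."},
    {"role": "user", "content": "What is the capital of Qatar?"},
    {"role": "assistant", "content": "The capital of Qatar is Doha."},
    # follow-up turns simply extend the list; Arabic content is fine too
    {"role": "user", "content": "وما هي عملتها؟"},  # "And what is its currency?"
]

# structural sanity checks
assert all(set(m) == {"role", "content"} for m in messages)
assert messages[-1]["role"] == "user"  # the request should end on a user turn
print(len(messages))  # → 4
```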
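As a quick arithmetic check on the pretraining mix described above (the token counts are from the card; the rounded percentages are our own):

```python
# Pretraining token counts stated in the README, in billions
tokens_b = {
    "English (Dolma subset)": 515,
    "Arabic (in-house)": 410,
    "Code (The Stack)": 102,
}

total = sum(tokens_b.values())
print(f"total: ~{total}B tokens")  # 1027B, i.e. the "1T tokens" in the card

for source, count in tokens_b.items():
    # roughly a 50/40/10 English/Arabic/code split
    print(f"{source}: {100 * count / total:.0f}% of the mix")
```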
 
 ---
 
 ## Intended Use