---
license: apache-2.0
language:
- ar
- en
base_model:
- Qwen/Qwen2.5-32B-Instruct
tags:
- text-generation-inference
---

## Model Overview
This model is an extended version of **Qwen2.5-32B-Instruct**, specifically adapted to enhance its performance in Arabic. While Qwen2.5 provides strong general instruction-following capabilities across multiple languages, this extended version focuses on improving fluency, comprehension, and reasoning in Arabic, with particular emphasis on low-resource domains where information is often sparse or underrepresented. The model was further tuned to handle diverse Arabic styles and content, improve factual grounding in regional knowledge, and provide more accurate responses in contexts where existing multilingual models may fall short.

---

## Training Strategy
- **Instruction Fine-Tuning (IFT):**
  - Fine-tuned on a mix of Arabic and English instruction–response datasets.
  - Covered both high-resource and low-resource domains.
  - Included different writing styles to improve adaptability.
- **Human Alignment:**
  - Collected human preference data on Arabic and bilingual outputs.
  - Applied Direct Preference Optimization (**DPO**).
  - Focused on factual accuracy, safety, and cultural sensitivity.
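
The DPO objective behind the human-alignment stage can be illustrated with a toy computation of its loss. The numbers below are made up for illustration; real training operates on sequence log-probabilities from the policy and a frozen reference model.

```python
import math

# Toy sketch of the DPO loss:
#   loss = -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
# where logp_w / logp_l are the policy's log-probabilities of the chosen
# and rejected responses, and ref_* are the reference model's.
def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# The loss shrinks as the policy prefers the chosen answer more strongly
# than the reference model does (all log-probs here are invented values).
low = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy favors the chosen answer
high = dpo_loss(-12.0, -10.0, -11.0, -11.0)  # policy favors the rejected answer
```

With no preference signal (all four log-probabilities equal) the loss sits at log 2, and it decreases monotonically as the policy's preference margin for the chosen response grows.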

---

## Usage
### How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Applied-Innovation-Center/AIC-1"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "ما هي عاصمة مصر"  # "What is the capital of Egypt?"
messages = [
    {"role": "system", "content": "You are an AI assistant. Always answer user questions with factual, evidence-based information. If you are unsure or the information is unavailable, clearly state that you do not know instead of guessing. Do not invent details. Keep responses concise, clear, and accurate. Avoid speculation, opinions, or creative storytelling unless explicitly asked for."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Slice off the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
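
For reference, `apply_chat_template` in the snippet above renders the messages into a ChatML-style prompt string, which is the layout Qwen2.5-family tokenizers use. A minimal standalone sketch of that rendering (illustrative only; the tokenizer's own template is authoritative):

```python
# Hand-rolled approximation of the ChatML layout produced by
# tokenizer.apply_chat_template(..., tokenize=False, add_generation_prompt=True)
# for Qwen2.5-style models. For real use, always call the tokenizer's template.
def render_chatml(messages, add_generation_prompt=True):
    text = ""
    for m in messages:
        text += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        text += "<|im_start|>assistant\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of Egypt?"},
]
print(render_chatml(messages))
```

Seeing the rendered string makes it clear why `add_generation_prompt=True` matters: it opens the assistant turn so generation starts with the model's reply rather than a continuation of the user message.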