Hishambarakat committed on
Commit 11c6379 · verified · 1 Parent(s): be76b2a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +145 -146
README.md CHANGED
@@ -3,207 +3,206 @@ base_model: humain-ai/ALLaM-7B-Instruct-preview
3
  library_name: peft
4
  pipeline_tag: text-generation
5
  tags:
6
- - base_model:adapter:humain-ai/ALLaM-7B-Instruct-preview
7
- - lora
8
- - sft
9
- - transformers
10
- - trl
 
 
 
11
  ---
12
 
13
- # Model Card for Model ID
14
-
15
- <!-- Provide a quick summary of what the model is/does. -->
16
 
 
 
17
 
 
18
 
19
  ## Model Details
20
-
21
- ### Model Description
22
-
23
- <!-- Provide a longer summary of what this model is. -->
24
-
25
-
26
-
27
- - **Developed by:** [More Information Needed]
28
- - **Funded by [optional]:** [More Information Needed]
29
- - **Shared by [optional]:** [More Information Needed]
30
- - **Model type:** [More Information Needed]
31
- - **Language(s) (NLP):** [More Information Needed]
32
- - **License:** [More Information Needed]
33
- - **Finetuned from model [optional]:** [More Information Needed]
34
-
35
- ### Model Sources [optional]
36
-
37
- <!-- Provide the basic links for the model. -->
38
-
39
- - **Repository:** [More Information Needed]
40
- - **Paper [optional]:** [More Information Needed]
41
- - **Demo [optional]:** [More Information Needed]
42
 
43
  ## Uses
44
 
45
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
-
47
  ### Direct Use
48
-
49
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
-
51
- [More Information Needed]
52
-
53
- ### Downstream Use [optional]
54
-
55
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
-
57
- [More Information Needed]
58
 
59
  ### Out-of-Scope Use
60
-
61
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
-
63
- [More Information Needed]
64
 
65
  ## Bias, Risks, and Limitations
 
 
 
66
 
67
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
-
69
- [More Information Needed]
70
 
71
- ### Recommendations
 
 
 
72
 
73
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
74
 
75
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 
76
 
77
- ## How to Get Started with the Model
78
 
79
- Use the code below to get started with the model.
 
 
 
 
 
80
 
81
- [More Information Needed]
 
 
82
 
83
  ## Training Details
84
 
85
- ### Training Data
86
-
87
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
 
89
- [More Information Needed]
90
 
91
- ### Training Procedure
92
-
93
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
-
95
- #### Preprocessing [optional]
96
 
97
- [More Information Needed]
98
 
 
 
 
99
 
100
- #### Training Hyperparameters
101
 
102
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
 
104
- #### Speeds, Sizes, Times [optional]
105
 
106
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
-
108
- [More Information Needed]
109
-
110
- ## Evaluation
111
 
112
- <!-- This section describes the evaluation protocols and provides the results. -->
 
 
 
113
 
114
- ### Testing Data, Factors & Metrics
115
 
116
- #### Testing Data
117
 
118
- <!-- This should link to a Dataset Card if possible. -->
 
 
 
119
 
120
- [More Information Needed]
121
-
122
- #### Factors
123
-
124
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
-
126
- [More Information Needed]
127
-
128
- #### Metrics
129
-
130
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
-
132
- [More Information Needed]
133
-
134
- ### Results
135
-
136
- [More Information Needed]
137
-
138
- #### Summary
139
-
140
-
141
-
142
- ## Model Examination [optional]
143
-
144
- <!-- Relevant interpretability work for the model goes here -->
145
-
146
- [More Information Needed]
147
-
148
- ## Environmental Impact
149
-
150
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
-
152
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
-
154
- - **Hardware Type:** [More Information Needed]
155
- - **Hours used:** [More Information Needed]
156
- - **Cloud Provider:** [More Information Needed]
157
- - **Compute Region:** [More Information Needed]
158
- - **Carbon Emitted:** [More Information Needed]
159
-
160
- ## Technical Specifications [optional]
161
-
162
- ### Model Architecture and Objective
163
-
164
- [More Information Needed]
165
-
166
- ### Compute Infrastructure
167
 
168
- [More Information Needed]
 
 
169
 
170
- #### Hardware
171
 
172
- [More Information Needed]
173
 
174
- #### Software
175
 
176
- [More Information Needed]
 
 
 
177
 
178
- ## Citation [optional]
179
 
180
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
 
182
- **BibTeX:**
183
 
184
- [More Information Needed]
185
 
186
- **APA:**
 
187
 
188
- [More Information Needed]
189
 
190
- ## Glossary [optional]
191
 
192
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
 
194
- [More Information Needed]
 
 
195
 
196
- ## More Information [optional]
197
 
198
- [More Information Needed]
199
 
200
- ## Model Card Authors [optional]
201
 
202
- [More Information Needed]
203
 
204
- ## Model Card Contact
205
 
206
- [More Information Needed]
207
- ### Framework versions
208
 
209
- - PEFT 0.18.1
 
 
3
  library_name: peft
4
  pipeline_tag: text-generation
5
  tags:
6
+ - base_model:adapter:humain-ai/ALLaM-7B-Instruct-preview
7
+ - lora
8
+ - sft
9
+ - transformers
10
+ - trl
11
+ language:
12
+ - ar
13
+ license: other
14
  ---
15
 
16
+ # Bahraini_Dialect_LLM (ALLaM-7B Instruct + Bahraini SFT)
 
 
17
 
18
+ ## Model Summary
19
+ **Bahraini_Dialect_LLM** is a Bahraini Arabic dialect fine-tune of **humain-ai/ALLaM-7B-Instruct-preview**, trained to produce **short, natural Bahraini responses** (avoiding Modern Standard Arabic), with stronger dialectal phrasing and domain coverage for everyday Q&A and practical assistant-style tasks.
20
 
21
+ This repo contains the **merged** weights (base + LoRA adapter merged into a standalone model) suitable for standard `transformers` loading.
22
 
23
  ## Model Details
24
+ - **Developed by:** Hisham Barakat
25
+ - **Base model:** `humain-ai/ALLaM-7B-Instruct-preview`
26
+ - **Model type:** Causal LM (LLaMA-family architecture via ALLaM)
27
+ - **Language:** Arabic (Bahraini dialect focus)
28
+ - **Training method:** Supervised Fine-Tuning (SFT) with LoRA (PEFT), then **merged**
29
+ - **Intended pipeline:** `text-generation`
30
+
31
+ ## Intended Behavior
32
+ The target behavior is:
33
+ - Bahraini dialect (not MSA)
34
+ - concise and clear
35
+ - practical and grounded answers for daily-life and assistant-like queries
36
+ - avoids overly formal greetings/phrasing unless explicitly requested
 
 
 
 
 
 
 
 
 
37
 
38
  ## Uses
39
 
 
 
40
  ### Direct Use
41
+ - Bahraini dialect assistant-style responses for:
42
+   - everyday chat / smalltalk
43
+   - short customer-service style replies
44
+   - practical troubleshooting (internet/APN/basic devices)
45
+   - simple admin writing (short “semi-formal” when requested)
46
+   - general Q&A
 
 
 
 
47
 
48
  ### Out-of-Scope Use
49
+ - Medical/legal/financial advice beyond general informational guidance
50
+ - Generating sensitive personal data, illegal instructions, or harmful content
51
+ - High-stakes decision making without human review
 
52
 
53
  ## Bias, Risks, and Limitations
54
+ - Dialect coverage is strongest for **Bahraini conversational assistant** style; it may still drift to Gulf-general or more formal Arabic in some cases.
55
+ - Synthetic paraphrasing and rule-driven generation can imprint patterns (over-structured answers, repeated phrasing).
56
+ - The model may inherit biases present in the base model and any source material used to generate/clean the dataset.
57
 
58
+ ## How to Get Started
 
 
59
 
60
+ ### Load (merged model)
61
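+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ REPO_ID = "Hishambarakat/Bahraini_Dialect_LLM"
+ # bf16 needs compute capability >= 8 (Ampere or newer); older GPUs such as T4 use fp16.
+ # Check that CUDA is available first so the capability query cannot fail on CPU-only machines.
+ HAS_BF16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
+ DTYPE = torch.bfloat16 if HAS_BF16 else torch.float16
+
+ tok = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(REPO_ID, trust_remote_code=True, torch_dtype=DTYPE, device_map="auto")
+ model.eval()
+
+ # System prompt (Arabic): speak natural Bahraini, avoid MSA and the listed formal fillers,
+ # prefer the listed Bahraini markers, and assume a male addressee unless the input suggests otherwise.
+ SYSTEM = "تكلم بحريني طبيعي. تجنب الفصحى و(تمام/أرجو/عادة). استخدم: وايد، جذي، هني، شلون، عقبها/بعدها، ما ضبط. افترض مخاطب ذكر إلا إذا في مؤشرات أنثى."
+
+ messages = [
+     {"role": "system", "content": SYSTEM},
+     {"role": "user", "content": "إذا نومي خربان شسوي؟"},  # "My sleep is messed up, what should I do?"
+ ]
+ # return_dict=True returns input_ids together with the attention mask
+ enc = tok.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
+ enc = {k: v.to(model.device) for k, v in enc.items()}
+
+ # pad_token_id is set to EOS explicitly; the attention mask in enc marks the real tokens
+ out = model.generate(**enc, max_new_tokens=80, do_sample=True, temperature=0.7, pad_token_id=tok.eos_token_id)
+ # Decode only the newly generated tokens, skipping the prompt
+ print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True).strip())
+ ```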
84
 
85
  ## Training Details
86
 
87
+ ### Base Model
 
 
88
 
89
+ * `humain-ai/ALLaM-7B-Instruct-preview`
90
 
91
+ ### Training Data (high-level)
 
 
 
 
92
 
93
+ Training was done on a curated Bahraini SFT-style corpus built from:
94
 
95
+ * **Single-speaker Bahraini transcript corpus** (cleaned and normalized)
96
+ * **Synthetic-but-close-to-real conversational expansions**, generated from the base style/voice and guided by strict rules to stay Bahraini
97
+ * **Domain-targeted assistant Q&A** (customer support, troubleshooting, daily admin writing) produced with controlled generation constraints
98
 
 
99
 
100
+ ### Data Construction Approach (what makes it “Bahraini”)
101
 
102
+ The dataset was produced through a structured pipeline:
103
 
104
+ * Cleaning + normalization on real transcript text (removing noise, artifacts, inconsistent punctuation)
105
+ * Prompt/response structuring into instruction-style pairs
106
+ * Controlled synthetic generation to expand coverage while keeping the same voice
107
+ * A dialect rule-set (positive/negative constraints) to:
 
108
 
109
+   * encourage Bahraini lexical markers (e.g., وايد، جذي، هني، شلون، عقبها/بعدها، ما ضبط)
110
+   * discourage MSA scaffolding and overly formal connectors
111
+   * keep responses short and practical
112
+ * Template correctness via the ALLaM chat template, with EOS enforcement
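
The positive/negative constraints above can be pictured as a simple filter over candidate responses. The sketch below is an assumption about how such a rule-set might be applied, not the author's actual pipeline: `check_dialect`, the word limit, and the naive substring matching are hypothetical. The marker lists reuse the examples given in this card.

```python
# Hypothetical dialect filter: marker lists come from the card, everything else is assumed.
POSITIVE_MARKERS = ["وايد", "جذي", "هني", "شلون", "عقبها", "بعدها", "ما ضبط"]
NEGATIVE_MARKERS = ["تمام", "أرجو", "عادة"]  # fillers the system prompt asks the model to avoid


def check_dialect(response: str, max_words: int = 60) -> dict:
    """Score one candidate response against the rule-set using naive substring matching."""
    positives = [m for m in POSITIVE_MARKERS if m in response]
    negatives = [m for m in NEGATIVE_MARKERS if m in response]
    short_enough = len(response.split()) <= max_words
    return {
        "positives": positives,
        "negatives": negatives,
        # Keep only short responses with at least one Bahraini marker and no formal filler.
        "keep": bool(positives) and not negatives and short_enough,
    }


good = check_dialect("شلون أقدر أساعدك؟ جرب تطفي الراوتر وعقبها شغله")
bad = check_dialect("أرجو منك إعادة تشغيل الجهاز")
print(good["keep"], bad["keep"])
```

Substring matching is deliberately crude: a marker can match inside a longer, unrelated word, so a real filter would likely tokenize first or use word-boundary rules.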
113
 
114
+ ### Prompt Format
115
 
116
+ Data was formatted using ALLaM’s chat template:
117
 
118
+ * system: dialect/style constraints
119
+ * user: prompt
120
+ * assistant: target response
121
+ EOS was enforced at the end of each sample.
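
As a rough illustration of the three-role layout and the EOS rule, the sketch below renders one sample as a string. The role tags and the `</s>` EOS string are placeholders chosen for this example, not ALLaM's real template; `render_sample` is a hypothetical helper.

```python
# Hypothetical rendering: the real layout comes from the tokenizer's chat template.
EOS = "</s>"  # assumed EOS string; in practice read it from tokenizer.eos_token


def render_sample(system: str, user: str, assistant: str, eos: str = EOS) -> str:
    """Join the three roles into one training string that ends with exactly one EOS."""
    if not eos:
        raise ValueError("eos must be a non-empty string")
    text = f"<<system>> {system}\n<<user>> {user}\n<<assistant>> {assistant.strip()}"
    while text.endswith(eos):  # drop any EOS already present so it is not duplicated
        text = text[: -len(eos)].rstrip()
    return text + eos


sample = render_sample("تكلم بحريني طبيعي.", "شلونك؟", "الحمدلله زين، إنت شخبارك؟</s>")
print(sample)
```

With a real tokenizer, the string would normally come from `tokenizer.apply_chat_template(messages, tokenize=False)`, and only the trailing-EOS check would remain.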
122
 
123
+ ### Training Procedure
124
 
125
+ * **Method:** SFT with TRL `SFTTrainer`
126
+ * **Parameter-efficient fine-tuning:** LoRA via PEFT
127
+ * **Final artifact:** LoRA adapter was merged into the base model (`merge_and_unload`) and saved as a standalone model for standard loading.
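
To make the merge step concrete, here is a toy sketch of what merging a LoRA adapter does to one weight matrix, assuming the standard LoRA update `W + (alpha / r) * B @ A`. The matrices are made-up 2x2 examples in plain Python lists; this is not the PEFT implementation, and `matmul` and `merge_lora` are hypothetical helpers.

```python
# Toy illustration of LoRA merging with plain lists; shapes and values are made up.
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A, the standalone weight after merging."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


r, alpha = 1, 2                  # toy rank; the run used r=16, alpha=32 (same scale, 2.0)
w = [[1.0, 0.0], [0.0, 1.0]]     # frozen base weight
lora_a = [[0.5, -0.5]]           # A: r x in_features
lora_b = [[1.0], [2.0]]          # B: out_features x r
x = [[3.0], [1.0]]               # one input column vector

merged = merge_lora(w, lora_a, lora_b, alpha, r)
# Unmerged path: base output plus the scaled adapter output.
adapter_out = [[b[0] + (alpha / r) * d[0]]
               for b, d in zip(matmul(w, x), matmul(lora_b, matmul(lora_a, x)))]
print(merged, matmul(merged, x) == adapter_out)
```

Because the merged weight reproduces the adapter's output, the saved model no longer needs PEFT at load time, which is why plain `transformers` loading works for this repo.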
128
 
129
+ ### Training Hyperparameters (exact run)
130
 
131
+ Base configuration used during the run:
132
 
133
+ * **Max sequence length:** 2048
134
+ * **Optimizer:** `adamw_torch`
135
+ * **LR:** 2e-5
136
+ * **Scheduler:** cosine
137
+ * **Warmup:** 10% of total optimizer steps (computed up front and passed as `warmup_steps`)
138
+ * **Weight decay:** 0.01
139
+ * **Max grad norm:** 1.0
140
+ * **Batching:** `per_device_train_batch_size=4`, `gradient_accumulation_steps=16`
141
+ * **Epochs:** 4
142
+ * **Packing:** False
143
+ * **Seed:** 42
144
+ * **Precision:** fp16 on T4; bf16 on Ampere+
145
+ * **Attention impl:** eager
146
+ * **Gradient checkpointing:** enabled (`use_reentrant=False`)
147
+ * **LoRA:**
148
 
149
+   * r=16
150
+   * alpha=32
151
+   * dropout=0.05
152
+   * target modules: `q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj`
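
The batching and warmup settings interact, so a small worked calculation may help. The batch size, accumulation steps, epochs, and warmup fraction below come from the list above; the dataset size is a made-up number, since the card does not state it, and the rounding is a simplification of what a trainer does internally.

```python
import math

# Values taken from the card.
PER_DEVICE_BATCH = 4
GRAD_ACCUM = 16
EPOCHS = 4
WARMUP_RATIO = 0.1

# Assumption for illustration only: the card does not state the dataset size.
NUM_SAMPLES = 10_000
NUM_GPUS = 1  # the card mentions a single GPU


def schedule(num_samples: int, num_gpus: int = NUM_GPUS):
    """Derive effective batch size, optimizer steps, and warmup steps."""
    effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * EPOCHS
    warmup_steps = int(WARMUP_RATIO * total_steps)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


print(schedule(NUM_SAMPLES))
```

Each optimizer step therefore sees 64 sequences on one GPU, and the learning rate ramps up over the first tenth of the steps before the cosine decay takes over.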
153
 
154
+ ### Notes on Tokenizer / Special Tokens
155
 
156
+ Where needed, the run aligned the model config's special-token IDs (pad/bos/eos) with the tokenizer's. At inference, generation typically sets `pad_token_id = eos_token_id` and passes an explicit attention mask, which avoids warnings and unstable outputs when pad and EOS share an ID.
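
The reason an explicit attention mask matters when pad and EOS share an ID can be shown with plain lists. In this sketch the token IDs are invented and `pad_batch` is a hypothetical helper; it left-pads a batch and builds the mask from the true sequence lengths rather than by comparing against the pad ID.

```python
PAD_ID = EOS_ID = 2  # hypothetical IDs; pad is set equal to EOS as described above


def pad_batch(sequences, pad_id=PAD_ID):
    """Left-pad token-id lists to equal length and build masks from the true lengths."""
    width = max(len(s) for s in sequences)
    input_ids = [[pad_id] * (width - len(s)) + s for s in sequences]
    attention_mask = [[0] * (width - len(s)) + [1] * len(s) for s in sequences]
    return input_ids, attention_mask


batch = [[5, 6, 7, EOS_ID], [8, EOS_ID]]
ids, mask = pad_batch(batch)

# Inferring the mask from "token != pad" also hides the genuine EOS tokens.
naive_mask = [[int(t != PAD_ID) for t in row] for row in ids]
print(mask)
print(naive_mask)
```

The length-based mask keeps the real EOS visible to the model, while the inferred mask drops it, which is the ambiguity the explicit mask removes.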
157
 
158
+ ## Evaluation
159
 
160
+ Evaluation was primarily qualitative via prompt suites comparing:
161
 
162
+ * base model outputs vs fine-tuned outputs
163
+ * dialect strength, conciseness, task completion, and reduction of MSA drift
164
 
165
+ Example prompt suite included:
166
 
167
+ * smalltalk
168
+ * sleep routine advice (short)
169
+ * WhatsApp apology message
170
+ * semi-formal request to university
171
+ * home internet troubleshooting
172
+ * APN setup guidance
173
+ * online card rejection reasons
174
+ * electricity bill troubleshooting
175
+ * late order customer-service ticket phrasing
176
+ * clarification questions behavior
177
+ * dialect rewriting (“ما أقدر الحين بس برجع لك بعدين”, roughly "I can't right now, but I'll get back to you later")
178
+ * mixed Arabic/English phrasing (refund/invoice)
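
A side-by-side comparison like the one described can be organized with a very small harness. The sketch below is an assumption about how such a suite might be run, not the author's evaluation code: `compare_models` is hypothetical, and the two generator functions are stand-ins that only tag the prompt instead of calling a model.

```python
from typing import Callable, Dict, List


def compare_models(
    prompts: List[str],
    base_generate: Callable[[str], str],
    tuned_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run every prompt through both models and collect outputs side by side."""
    return [
        {"prompt": p, "base": base_generate(p), "tuned": tuned_generate(p)}
        for p in prompts
    ]


# Stand-in generators; replace them with real calls to the base and fine-tuned models.
def fake_base(prompt: str) -> str:
    return f"[base] {prompt}"


def fake_tuned(prompt: str) -> str:
    return f"[tuned] {prompt}"


suite = ["ما أقدر الحين بس برجع لك بعدين", "refund / invoice"]
rows = compare_models(suite, fake_base, fake_tuned)
for row in rows:
    print(row["prompt"], "|", row["base"], "|", row["tuned"])
```

Each row can then be reviewed by hand for dialect strength, conciseness, task completion, and MSA drift, matching the qualitative criteria listed above.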
179
 
180
+ ## Compute / Infrastructure
181
 
182
+ * **Training stack:** `transformers`, `trl`, `peft`
183
+ * **Hardware:** single GPU (T4-class during development), fp16 used on T4
184
+ * **Framework versions:** PEFT 0.18.1 (per metadata)
185
 
186
+ ## Citation
187
 
188
+ ### Model
189
 
190
+ If you use this model or a derivative of it, cite the dataset below and reference the base model (`humain-ai/ALLaM-7B-Instruct-preview`).
191
 
192
+ ### Dataset (provided by author)
193
 
194
+ ```bibtex
195
+ @dataset{barakat_bahraini_speech_2026,
196
+ author = {Hisham Barakat},
197
+ title = {Hishambarakat/Bahraini_Speech_Dataset},
198
+ year = {2026},
199
+ publisher = {Hugging Face},
200
+ url = {https://huggingface.co/datasets/Hishambarakat/Bahraini_Speech_Dataset},
201
+ note = {LinkedIn: https://www.linkedin.com/in/hishambarakat/}
202
+ }
203
+ ```
204
 
205
+ ## Contact
 
206
 
207
+ * **Author:** Hisham Barakat
208
+ * **LinkedIn:** [https://www.linkedin.com/in/hishambarakat/](https://www.linkedin.com/in/hishambarakat/)