---

{}

---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# aashish1904/N-ATLaS-GGUF

This is a quantized version of [NCAIR1/N-ATLaS](https://huggingface.co/NCAIR1/N-ATLaS), created with llama.cpp.
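GGUF files from this repo can be run locally with a llama.cpp binding such as `llama-cpp-python` (`pip install llama-cpp-python`). A minimal sketch, assuming a `Q4_K_M` quant exists in the repo — the filename pattern and the system prompt are illustrative, not prescribed by this card:

```python
# Sketch: running this GGUF quant locally with llama-cpp-python.
# The quant filename pattern below is an assumption; pick any .gguf
# file actually present in the repo.

def build_messages(user_prompt: str) -> list:
    """Chat messages in the shape create_chat_completion() expects."""
    return [
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": user_prompt},
    ]

def run_demo(user_prompt: str) -> str:
    # Imported lazily so build_messages() stays usable without llama-cpp installed.
    from llama_cpp import Llama

    llm = Llama.from_pretrained(
        repo_id="aashish1904/N-ATLaS-GGUF",
        filename="*Q4_K_M.gguf",  # assumed quant level; adjust to a file in the repo
        n_ctx=8092,               # matches the base model's stated context length
    )
    out = llm.create_chat_completion(messages=build_messages(user_prompt))
    return out["choices"][0]["message"]["content"]
```

`Llama.from_pretrained` fetches the file via `huggingface_hub`; for an already-downloaded file, construct `Llama(model_path=...)` directly instead.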
14
+ # Original Model Card
15
+
16
+ # N-ATLaS-LLM - Multilingual African Language Model
17
+
18
+ N-ATLaS-LLM is a fine-tuned multilingual language model based on Llama-3 8B, specifically designed to support African languages, including Hausa, Igbo, and Yoruba alongside English. This model is powered by **Awarri Technologies** an initiative of the **Federal Ministry of Communications, Innovation and Digital Economy**
19
+ as part of the Nigerian Languages AI Initiative to promote digital inclusion and preserve African linguistic heritage in the digital age.
20
+
21
+ ## Model Overview
22
+
23
+ N-ATLaS-LLM is built on the Llama architecture and has been fine-tuned on over 400 million tokens of multilingual instruction data. The model demonstrates strong performance across multiple African languages while maintaining excellent English capabilities.
24
+
25
+ ### Key Features
26
+ - **Multilingual Support**: Native support for English, Hausa, Igbo, and Yoruba
27
+ - **Cultural Relevance**: Trained on culturally relevant content from Nigerian sources
28
+ - **Instruction Following**: Fine-tuned for instruction-following tasks
29
+ - **Tool Integration**: Built-in support for tool integration capabilities
30
+
## Model Architecture

### Technical Specifications

| Parameter | Value |
|-----------|-------|
| **Model Type** | LlamaForCausalLM |
| **Base Model** | Llama-3 8B |
| **Hidden Size** | 4,096 |
| **Intermediate Size** | 14,336 |
| **Number of Layers** | 32 |
| **Attention Heads** | 32 |
| **Key-Value Heads** | 8 |
| **Head Dimension** | 128 |
| **Vocabulary Size** | 128,256 |
| **Max Position Embeddings** | 131,072 |
| **Context Length** | 8,092 tokens |
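The attention geometry in the table is internally consistent; a small illustrative check (values copied from the table above):

```python
# Sanity-check the attention geometry from the specification table.
hidden_size = 4096
num_attention_heads = 32
num_key_value_heads = 8
head_dim = 128

# Each attention head covers hidden_size / num_attention_heads dimensions.
assert hidden_size // num_attention_heads == head_dim

# Grouped-query attention: several query heads share each key-value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads
print(queries_per_kv_head)  # 4 query heads per KV head
```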
## Training Data

### Dataset Overview

N-ATLaS-LLM was trained on approximately **391,956,264 tokens** of high-quality multilingual instruction data.

| Language | SFT Samples |
|----------|-------------|
| English | ~318,000 |
| Hausa | ~200,000 |
| Igbo | ~200,000 |
| Yoruba | ~200,000 |
### Data Sources and Processing

#### 1. Data Collection Pipeline

- **Open-source datasets**: High-quality SFT datasets from Hugging Face and other repositories
- **Translation pipeline**: Robust translation using Google Translate and OpenAI GPT models
- **Synthetic data generation**: Culturally relevant content from Nigerian web sources (BBC Pidgin, Punch News)
- **Human-in-the-loop quality control**: Manual verification and cleaning of translated samples

#### 2. Data Quality Assurance

- **Multi-language categorization**: Topic/domain tagging and organization
- **Content filtering**: Removal of toxic, irrelevant, or hallucinated content
- **Translation verification**: Fixing translation errors and ensuring prompt-response alignment
- **Cultural relevance**: Focus on Nigerian and African cultural contexts
## Performance Evaluation

### Human Evaluation Results

The model was evaluated by human annotators across multiple dimensions:

| Metric | English | Hausa | Yoruba | Igbo |
|--------|---------|-------|--------|------|
| **Evaluations** | 1,662 | 140 | 542 | 296 |
| **Average Score** | 4.21/5.0 | 3.98/5.0 | 2.69/5.0 | 3.87/5.0 |
| **Fluency** | 4.30/5.0 | 4.23/5.0 | 2.71/5.0 | 3.89/5.0 |
| **Coherence** | 4.22/5.0 | 3.70/5.0 | 3.23/5.0 | 3.80/5.0 |
| **Relevance** | 4.28/5.0 | 3.76/5.0 | 2.89/5.0 | 3.85/5.0 |
| **Accuracy** | 4.23/5.0 | 3.72/5.0 | 3.13/5.0 | 3.92/5.0 |
| **Bias/Fairness** | 3.18/5.0 | 1.11/5.0 | 2.23/5.0 | 4.01/5.0 |
| **Usefulness** | 4.09/5.0 | 5.00/5.0 | 4.03/5.0 | 3.84/5.0 |

### Key Performance Insights

- **English**: Excellent performance across all metrics (4.21/5.0 average)
- **Hausa**: Strong overall performance with a perfect usefulness score (5.00/5.0), though the low bias/fairness score (1.11/5.0) warrants attention
- **Igbo**: Solid performance across most metrics (3.87/5.0 average)
- **Yoruba**: Room for improvement, particularly in fluency and relevance
## Training Details

### Training Configuration

- **Optimizer**: AdamW 8-bit
- **Learning Rate**: 1e-5 with a linear scheduler
- **Precision**: Mixed precision (BF16/FP16, depending on hardware)
- **Base Model**: Llama-3 8B
- **Fine-tuning Method**: Supervised Fine-Tuning (SFT)

### Training Pipeline

1. **Data Preprocessing**: Multi-stage cleaning and filtering pipeline
2. **Supervised Fine-Tuning**: Instruction-following training on multilingual datasets
3. **Quality Validation**: Human evaluation across multiple languages and metrics
## 💻 Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "NCAIR1/N-ATLaS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)


def format_text_for_inference(messages):
    # Llama-3 chat templates accept today's date via `date_string`
    current_date = datetime.now().strftime("%d %b %Y")
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
        date_string=current_date,
    )


# Example usage (the user prompt is Hausa, roughly: "what is meant by 'struggle'?")
q_chat = [
    {"role": "system", "content": "You are a large language model trained by Awarri AI technologies. You are a friendly assistant and you are here to help."},
    {"role": "user", "content": "menene ake nufi da gwagwarmaya"},
]

text = format_text_for_inference(q_chat)

input_tokens = tokenizer(text, return_tensors="pt", add_special_tokens=False).to("cuda")
outputs = model.generate(
    **input_tokens,
    max_new_tokens=1000,
    use_cache=True,
    repetition_penalty=1.12,
    temperature=0.1,
)

print(tokenizer.batch_decode(outputs)[0])
```
### Inference Output

```text
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 11 Jun 2025

your name is AwaGPT, you are a large language model trained by Awarri AI technologies. You are a friendly assistant and you are here to help.<|eot_id|><|start_header_id|>user<|end_header_id|>

menene ake nufi da gwagwarmaya<|eot_id|><|start_header_id|>assistant<|end_header_id|>

A cikin mahallin ilimi, "gwagwarmaya" na nufin tsarin koyon halaye ko kwarewa ta hanyar maimaitawa, aiki, da kuma fuskantar kalubale. Yana jaddada rawar da kokari mai yawa ke takawa wajen samun nasara da ci gaba. Gwagwarmayar ba kawai game da samun sakamako ba ne amma har ma game da abubuwan da aka koya a lokacin aikin, wanda ke taimakawa wajen inganta kwarewa da tabbatar da canje-canjen da aka samu.

Ga karin bayani kan ra'ayoyin da suka shafi gwagwarmaya:

1. **Koyo Ta Hanyar Kwarewa**: Yana mai da hankali kan muhimmancin shiga cikin yanayi na zahiri don samun ilimi da fasaha. Maimakon kawai karantawa ko sauraron darasi, gwagwarmaya yana haɗa da aiwatar da ayyuka da warware matsaloli, wanda ke haifar da zurfin fahimta.

2. **Matsakaicin Kalubale**: Wannan hanya yawanci tana haɗa da fara da burin da ya wuce iyawarka na yanzu (matsakaici) sannan ka yi aiki don cimma wannan burin. Ta wannan hanyar, kana koyon iyakokin ka da wuraren da za a inganta, wanda ke haifar da ci gaban mutum da kuma ƙarfafawa.

3. **Dorewa**: Ingantaccen koyo ta hanyar gwagwarmaya na iya zama dindindin idan an sake fuskantar kalubalen a tsawon lokaci. Ba kamar koyo na ɗan lokaci ba, inda ilimin zai iya zama ajiye ba tare da aiki ba, gwagwarmaya tana taimakawa wajen riƙe ilimi ta hanyar ci gaba da bukatar amfani da shi.

4. **Halin Juriya**: Gwagwarmaya yawanci tana buƙatar jure gazawa da rashin nasara. Ta hanyar fuskantar wahala akai-akai, mutane suna haɓaka juriya da ƙudurin warware matsaloli, waɗannan halaye masu mahimmanci ga nasara a dogon lokaci.

5. **Haɓaka Kai**: Gwagwarmaya ana amfani da ita sosai a cikin horon kai don taimakawa mutane su shawo kan tsoro, gina kwarin gwiwa, da haɓaka ikon sarrafa kansu. Yana haɓaka tunani mai kyau da kuma motsa mutane su tura iyakokinsu.

6. **Amfani a Fannonin Daban-daban**: Ana amfani da manufar gwagwarmaya ba kawai a fannin ilimi ba; ana amfani da ita a fannonin kamar wasanni, horon sana'a, da ci gaban mutum. Misali, dan wasa na iya amfani da gwagwarmaya don inganta dabaru ko kwarewa, yayin da mai sana'a zai iya amfani da ita don koyo sabbin fasahohi ko dabaru.

A taƙaice, gwagwarmaya wata hanya ce mai tasiri ta koyo da ci gaba wacce ke jaddada mahimmancin aiki, juriya, da ci gaba mai dorewa. Yana taimakawa mutane su sami ilimi da kwarewa da za su iya amfani da su a rayuwa ta zahiri.<|eot_id|>
```

*(The prompt asks, in Hausa, what "gwagwarmaya" — struggle/striving — means; the model answers in Hausa with a definition and related concepts.)*
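For runtimes that don't have the Hugging Face tokenizer available, the Llama-3 prompt format visible in the raw output above can be assembled by hand. A simplified sketch that omits the knowledge-cutoff/date lines — the helper name is illustrative:

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a Llama-3-style prompt matching the template shown above."""
    def turn(role: str, content: str) -> str:
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system)
        + turn("user", user)
        # End with an open assistant header so generation continues from here.
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = build_llama3_prompt("You are a friendly assistant.", "menene ake nufi da gwagwarmaya")
print(prompt)
```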
### Supported Languages

- **English**: Full support with high performance
- **Hausa**: Native support with cultural context
- **Igbo**: Native support with cultural context
- **Yoruba**: Native support with ongoing improvements

## Use Cases

This model is designed for:

- **Multilingual Chatbots**: Deploy conversational AI in African languages
- **Content Translation**: Translate between English and African languages
- **Educational Tools**: Create learning materials in local languages
- **Cultural Preservation**: Document and preserve African linguistic heritage
- **Government Services**: Provide AI-powered services in local languages
- **Digital Inclusion**: Bridge the language gap in technology access
- **Research Applications**: Support research in Nigerian and African language technologies
## Limitations

- **Bias Concerns**: Human evaluation surfaced bias/fairness issues, most notably in Hausa (1.11/5.0) and Yoruba (2.23/5.0)
- **Context Length**: Limited to 8,092 tokens for optimal performance
- **Domain Coverage**: Primarily trained on instruction-following tasks

## Future Work

- **RLHF Training**: Implementation of reinforcement learning from human feedback
- **Performance Improvements**: Targeted improvements across all languages
- **Bias Mitigation**: Enhanced bias detection and mitigation strategies
- **Extended Context**: Support for longer context lengths
- **Additional Datasets**: More SFT data for better performance across the local languages
- **Additional Languages**: Expansion to more African languages
## Ethical Considerations

- This model was developed as part of a Federal Government initiative to promote digital inclusion
- Training data collection followed ethical guidelines for data usage and cultural sensitivity
- The model aims to preserve and promote African languages in digital spaces
- Efforts were made to ensure cultural relevance and accuracy across all supported languages

## Contact & Support

- **Initiative Of**: Federal Ministry of Communications, Innovation, and Digital Economy
- **Powered By**: Awarri Technologies
- **Project**: Nigerian Languages AI Initiative (Federal Government collaboration)
- **Version**: 1.0 (September 2025)

For issues, questions, or collaboration opportunities, please refer to the model repository discussions or contact Awarri Technologies.

## Acknowledgments

This work was made possible through:

- Awarri Technologies
- National Information Technology Development Agency (NITDA)
- The Federal Ministry of Communications, Innovation and Digital Economy
- National Centre for Artificial Intelligence and Robotics
- Data contributors from across Nigeria's six geopolitical zones via the Langeasy platform
- The broader Nigerian language technology research community
## 📄 Citation

```bibtex
@misc{awagptv1_2025,
  title={N-ATLaS-LLM: A Multilingual African Language Model},
  author={Awarri Technologies and National Information Technology Development Agency},
  year={2025},
  publisher={Hugging Face},
  note={Fine-tuned Llama-3 8B model for African languages, developed in collaboration with the Federal Government of Nigeria}
}
```

## 📜 License
# Terms of Use for N-ATLaS

*(Nigeria – Automatic Transcription and Language Systems)*

**Effective Date:** September 2025
**Version:** 1.0

---

## 1. Introduction & Scope

Awarri Technologies, in partnership with the Federal Government of Nigeria, hereby releases **N-ATLaS** (Nigeria – Automatic Transcription and Language Systems), consisting of four Automatic Speech Recognition (ASR) models and one text Large Language Model (LLM) for Nigerian languages (Yoruba, Hausa, Igbo, and Nigerian-accented English).

N-ATLaS is released under an **Open-Source Research and Innovation License** inspired by permissive licenses such as Apache 2.0 and MIT, but with additional restrictions tailored for responsible use in Nigeria and globally.

The models are intended to support:

- Research and academic study
- Education and capacity development
- Civic technology and accessibility initiatives
- Innovation, cultural preservation, and community projects

⚠️ N-ATLaS is **not** an enterprise-grade or commercial system. Commercial or large-scale enterprise use requires a separate licensing agreement (see Section 3).

---

## 2. License Grant

Subject to compliance with these Terms, users are hereby granted a worldwide, royalty-free, non-exclusive, non-transferable license to:

- Download, use, and run N-ATLaS for permitted purposes
- Modify, adapt, and create derivative works of N-ATLaS
- Redistribute N-ATLaS and derivative works under these same Terms

**Conditions:**

1. Attribution must be given to:
   > “Awarri Technologies and the Federal Ministry of Communications, Innovation and Digital Economy.”
2. Derivative works must be released under the same license, ensuring consistency and traceability.
3. If N-ATLaS or its derivatives are renamed, they must carry the suffix: **“Powered by Awarri.”**

---
## 3. User License Cap (1000 Users)

Use of N-ATLaS is limited to organizations, institutions, or projects with no more than **1000 active end-users**.

- An *active end-user* is defined as an individual who directly interacts with N-ATLaS outputs (e.g., via an app, website, or integrated service) within a rolling 30-day period.
- Organizations exceeding the 1000-user cap must obtain a **commercial license** directly from Awarri Technologies in partnership with the Federal Ministry of Communications, Innovation, and Digital Economy.
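Deployments can track the rolling-window definition above programmatically. A minimal sketch — the event-log shape and helper names are illustrative, not part of the Terms:

```python
from datetime import datetime, timedelta

USER_CAP = 1000
WINDOW = timedelta(days=30)  # rolling 30-day activity window per the Terms

def active_end_users(events, now):
    """Count distinct users with at least one interaction in the last 30 days.

    `events` is an iterable of (user_id, timestamp) pairs; this shape is an
    assumption for illustration.
    """
    cutoff = now - WINDOW
    return len({user for user, ts in events if ts > cutoff})

def within_cap(events, now):
    return active_end_users(events, now) <= USER_CAP

now = datetime(2025, 9, 30)
events = [
    ("ada", datetime(2025, 9, 15)),   # inside the window
    ("ada", datetime(2025, 9, 20)),   # same user: still one active user
    ("bola", datetime(2025, 7, 1)),   # outside the window: not active
]
print(active_end_users(events, now))  # 1
```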
---

## 4. Acceptable Use

### ✅ Permitted Use Cases include (but are not limited to):

- Academic and non-profit research
- Accessibility for persons with disabilities
- Language and cultural preservation projects
- Civic technology and public benefit applications
- Education, training, and community innovation

### ❌ Prohibited Use Cases include (but are not limited to):

- Surveillance or unlawful monitoring
- Discriminatory profiling or exclusionary practices
- Disinformation, impersonation, or synthetic fraud
- Military, intelligence, or weaponized deployment
- Exploitative, harmful, or unlawful applications

---
## 5. Limitations & Disclaimer

N-ATLaS is released **“as-is”**, without warranties of any kind, express or implied.

**Known limitations include:**

- Dialectal and accent bias
- Reduced accuracy with children’s speech
- Limited handling of code-switching
- Degraded performance in noisy environments

Neither Awarri Technologies nor the Federal Ministry of Communications, Innovation and Digital Economy shall be liable for damages arising from the use of N-ATLaS.

---
## 6. Ethical & Cultural Considerations

Users must:

- Respect Nigeria’s cultural and linguistic diversity
- Ensure transparent reporting of accuracy, bias, and limitations
- Uphold human rights and privacy standards in all deployments

---

## 7. Data & Privacy

- All training data used in N-ATLaS was either publicly available or government-approved for use.
- Users are strictly prohibited from using N-ATLaS for unauthorized personal data scraping, collection, or profiling.

---

## 8. Governance & Updates

- Governance and oversight will be led by the **Federal Ministry of Communications, Innovation, and Digital Economy**, in collaboration with the **National Centre for Artificial Intelligence and Robotics (NCAIR)**.
- **Awarri Technologies** shall act as the technical maintainer and custodian of N-ATLaS.
- Updates, improvements, and community contributions will be published periodically.
- Users must comply with the specific Terms attached to each version release.

---

## 9. Legal & Jurisdiction

- These Terms are governed by the laws of the **Federal Republic of Nigeria**.
- In the event of a dispute, parties agree to seek resolution first through **mediation under the auspices of the Federal Ministry of Justice** before pursuing litigation in Nigerian courts.

---
## 10. Termination

The Federal Government of Nigeria and Awarri Technologies reserve the right to revoke, suspend, or terminate usage rights if these Terms are violated.

Termination may apply to individual users, institutions, or organizations found in breach.

---

## 11. Contact & Attribution

For licensing, inquiries, and commercial partnerships regarding N-ATLaS, contact:

**Awarri Technologies**
- Email: [datasupport@awarri.com](mailto:datasupport@awarri.com)
- Website: [awarri.com](https://awarri.com)

**Federal Ministry of Communications, Innovation, and Digital Economy**
- Email: ncair@nitda.gov.ng
- Website: ncair.nitda.gov.ng

**Required attribution in all public use:**
> “N-ATLaS is an initiative of the Federal Ministry of Communications, Innovation and Digital Economy, and powered by Awarri Technologies.”

If renamed, the model must carry the suffix:
> **“Powered by Awarri.”**

*N-ATLaS-LLM is part of Awarri Technologies’ mission, initiated by the Federal Ministry of Communications, Innovation and Digital Economy, to make AI accessible to African language speakers and preserve linguistic diversity in the digital age.*