---
license: apache-2.0
---

# AfriLION-Base: Multilingual Language Model for African Languages

<div align="center">

**African Language Intelligence & Open NLP**

[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)

</div>

## Model Description

AfriLION-Base is an open-source multilingual language model designed specifically for African languages. Built on a 350M-parameter transformer encoder-decoder architecture, it addresses the critical gap in NLP resources for low-resource African languages.

### Key Features

- 🌍 **20+ African Languages**: Comprehensive support for major African language families
- 📊 **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- ⚡ **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- 🔓 **Apache 2.0 License**: Fully open-source for research and commercial use
- 🎯 **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology

## Supported Languages

### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)

### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)

### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)

### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)

## Training Data

The model is trained on:

- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages

### Data Processing

1. **Deduplication**: Aggressive deduplication at document and paragraph levels
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
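
The paragraph-level deduplication step can be sketched as a hash-based filter that keeps the first occurrence of each normalized paragraph. This is a minimal illustration, not the actual pipeline:

```python
import hashlib

def dedup_paragraphs(documents):
    """Drop repeated paragraphs across a corpus, keeping first occurrences.

    `documents` is an iterable of strings with paragraphs separated by
    blank lines. Hashing normalized text keeps memory usage bounded.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize whitespace and case so near-identical copies collide
            key = hashlib.sha1(
                " ".join(para.split()).lower().encode("utf-8")
            ).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:
            cleaned.append("\n\n".join(kept))
    return cleaned
```

A real pipeline would typically combine this exact-hash pass with fuzzy methods such as MinHash for near-duplicates.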

## Architecture

- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
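
As a rough sanity check, the hyperparameters above are consistent with the stated parameter count, assuming a standard feed-forward width of 4× the hidden size and tied input/output embeddings (neither of which the card specifies):

```python
d = 768           # hidden size
layers = 12       # per stack (12 encoder + 12 decoder)
vocab = 128_000   # multilingual BPE vocabulary

embeddings = vocab * d                # shared embedding matrix (tying assumed)
self_attn = 4 * d * d                 # Q, K, V, O projections
ffn = 2 * d * (4 * d)                 # up- and down-projection (4x width assumed)
encoder = layers * (self_attn + ffn)
decoder = layers * (2 * self_attn + ffn)  # extra cross-attention block per layer

total = embeddings + encoder + decoder
print(f"{total / 1e6:.0f}M parameters")
# ≈ 296M; biases, layer norms, and any untied LM head account for the rest
```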

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage
text = "Habari za asubuhi"  # Swahili: "Good morning"
inputs = tokenizer(text, return_tensors="pt")

# The model is an encoder-decoder, so a plain forward pass also needs
# decoder inputs; to get encoder representations, call the encoder directly
outputs = model.get_encoder()(**inputs)
```

### Fine-tuning Example

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base model with a sequence-to-sequence head for the target task
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

args = TrainingArguments(output_dir="afrilion-finetuned", per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # your tokenized dataset
trainer.train()
```
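
For translation-style fine-tuning, pairs are typically formatted with a target-language tag so one model can serve many directions. A minimal sketch follows; the `>>code<<` tag convention is an assumption borrowed from common multilingual MT practice (e.g. OPUS-MT), not a documented AfriLION format:

```python
def format_translation_pair(src_text, tgt_text, tgt_lang):
    """Prefix the source text with a target-language tag (OPUS-MT style)."""
    return {
        "input": f">>{tgt_lang}<< {src_text}",
        "target": tgt_text,
    }

pair = format_translation_pair("Habari za asubuhi", "Good morning", "en")
# pair["input"] == ">>en<< Habari za asubuhi"
```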

## Benchmarks

| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |

## Limitations

- **Geographic Coverage**: Primarily focuses on widely spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties prioritized; dialectal variations may not be well-represented
- **Domain**: Better performance on formal text; colloquial/social media text may be challenging
- **Code-Switching**: Limited support for code-mixed text

## Ethical Considerations

- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: Model may not capture cultural nuances specific to different African communities

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources

## Contact

- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:

- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data

---

**LocaleNLP**: Bridging Languages, Empowering Lives.