---
license: apache-2.0
---
# AfriLION-Base: Multilingual Language Model for African Languages
<div align="center">
**African Language Intelligence & Open NLP**
[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)
</div>
## Model Description
AfriLION-Base is an open-source multilingual language model specifically designed for African languages. Built on a robust transformer architecture, this model addresses the critical gap in NLP resources for low-resource African languages.
### Key Features
- **20+ African Languages**: Comprehensive support for major African language families
- **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- **Apache 2.0 License**: Fully open-source for research and commercial use
- **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology
## Supported Languages
### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)
### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)
### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)
### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)
## Training Data
The model is trained on:
- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages
### Data Processing
1. **Deduplication**: Aggressive deduplication at document and paragraph levels
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
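
The exact processing pipeline is not released here. As a rough illustration only, a minimal sketch of document-level deduplication plus language-ID filtering might look like the following; the fastText `lid.176` model, the thresholds, and the heuristics are assumptions, not the actual AfriLION configuration:

```python
# Illustrative sketch of dedup + language-ID filtering; thresholds and the
# fastText lid.176 model are assumptions, not the actual AfriLION pipeline.
import hashlib

import fasttext  # pip install fasttext; lid.176.bin from https://fasttext.cc

lid_model = fasttext.load_model("lid.176.bin")

def clean_corpus(documents, target_lang="sw", min_confidence=0.8):
    seen_hashes = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # normalise whitespace
        # 1. Exact document-level deduplication via content hash
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # 2. Language identification with a confidence threshold
        labels, probs = lid_model.predict(text)
        lang = labels[0].replace("__label__", "")
        if lang != target_lang or probs[0] < min_confidence:
            continue
        # 3. Simple heuristic filter: drop very short documents
        if len(text.split()) < 20:
            continue
        kept.append(text)
    return kept
```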
## Architecture
- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
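
Once the checkpoint is published, these sizes can be checked directly from the loaded model. A small sketch (the exact config field names depend on the final architecture class, so treat the printed output as indicative):

```python
from transformers import AutoConfig, AutoModel

# Inspect the published configuration and count parameters.
config = AutoConfig.from_pretrained("LocaleNLP/afrilion-base")
print(config)  # vocabulary size, hidden size, layer counts, etc.

model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {n_params / 1e6:.0f}M")  # expected to be roughly 350M
```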
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoTokenizer, AutoModel
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")
# Example usage
text = "Habari za asubuhi" # Swahili: "Good morning news"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
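
Since the base model is an encoder-decoder, one common way to obtain sentence representations is to run the encoder alone and mean-pool over non-padding tokens. This is a sketch of that recipe, not an official usage pattern; the best way to use the released checkpoint may differ:

```python
import torch

# Sketch: sentence representations from the encoder only (mean pooling
# over non-padding tokens).
texts = ["Habari za asubuhi", "Bawo ni"]  # Swahili and Yoruba greetings
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Seq2seq classes expose the encoder via get_encoder(); fall back to the
# model itself for encoder-only variants.
encoder = model.get_encoder() if hasattr(model, "get_encoder") else model
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state      # (batch, seq, hidden)

mask = inputs["attention_mask"].unsqueeze(-1)          # (batch, seq, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```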
### Fine-tuning Example
```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments
# Load for specific task
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")
# Your fine-tuning code here
```
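
A fuller sketch using the `Seq2SeqTrainer` API might look as follows. The toy parallel pairs, the translation framing, and the hyperparameters are illustrative assumptions, not a recommended recipe; substitute a real parallel corpus and tuned settings:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "LocaleNLP/afrilion-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy English->Swahili pairs purely for illustration.
raw = Dataset.from_dict({
    "source": ["Good morning", "Thank you very much"],
    "target": ["Habari za asubuhi", "Asante sana"],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="afrilion-finetuned",
    per_device_train_batch_size=8,
    learning_rate=5e-5,          # illustrative hyperparameters
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```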
## Benchmarks
| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |
## Limitations
- **Geographic Coverage**: Primarily focuses on widely-spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties prioritized; dialectal variations may not be well-represented
- **Domain**: Better performance on formal text; colloquial/social media text may be challenging
- **Code-Switching**: Limited support for code-mixed text
## Ethical Considerations
- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: Model may not capture cultural nuances specific to different African communities
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```
## License
This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources
## Contact
- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion
## Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:
- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data
---
**LocaleNLP**: Bridging Languages, Empowering Lives.