---
license: apache-2.0
---
# AfriLION-Base: Multilingual Language Model for African Languages
**African Language Intelligence & Open NLP**
[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)
## Model Description
AfriLION-Base is an open-source multilingual language model specifically designed for African languages. Built on a robust transformer architecture, this model addresses the critical gap in NLP resources for low-resource African languages.
### Key Features
- 🌍 **20+ African Languages**: Comprehensive support for major African language families
- 📊 **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- ⚡ **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- 🔓 **Apache 2.0 License**: Fully open-source for research and commercial use
- 🎯 **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology (see the tokenization sketch after this list)
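As a quick illustration of the tokenizer, the snippet below loads it from the Hub ID used in the Usage section and prints the subword pieces for a Swahili phrase. The exact splits you see depend on the released vocabulary.

```python
from transformers import AutoTokenizer

# Inspect how the multilingual BPE tokenizer segments an African-language phrase.
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
print(tokenizer.tokenize("Nimefurahi kukutana nawe"))  # Swahili: "I am pleased to meet you"
```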
## Supported Languages
### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)
### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)
### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)
### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)
## Training Data
The model is trained on:
- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages
### Data Processing
1. **Deduplication**: Aggressive deduplication at document and paragraph levels (see the sketch after this list)
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
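The released pipeline is described only at this high level. As a rough illustration of steps 1 and 2, the sketch below applies exact paragraph-level deduplication via hashing plus a minimal length heuristic; it shows the general idea, not the actual AfriLION implementation, which also uses language identification and perplexity filtering.

```python
import hashlib

def clean_corpus(documents, min_chars=20):
    """Illustrative paragraph-level deduplication plus a simple length heuristic."""
    seen_hashes = set()
    cleaned = []
    for doc in documents:
        kept_paragraphs = []
        for para in doc.split("\n\n"):
            para = para.strip()
            if len(para) < min_chars:      # heuristic quality filter
                continue
            digest = hashlib.sha1(para.lower().encode("utf-8")).hexdigest()
            if digest in seen_hashes:      # exact duplicate paragraph
                continue
            seen_hashes.add(digest)
            kept_paragraphs.append(para)
        if kept_paragraphs:
            cleaned.append("\n\n".join(kept_paragraphs))
    return cleaned

# The repeated greeting paragraph is kept only once across documents.
docs = [
    "Habari za asubuhi, karibu kwenye tovuti yetu.\n\nLeo tunazungumzia elimu.",
    "Habari za asubuhi, karibu kwenye tovuti yetu.\n\nAsante kwa kutembelea.",
]
print(clean_corpus(docs))
```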
## Architecture
- **Model Type**: Transformer-based encoder-decoder (an illustrative configuration follows this list)
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
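For orientation, the hyperparameters above map onto a Hugging Face encoder-decoder configuration roughly as follows. The T5-style config is an illustrative assumption; the configuration shipped with the checkpoint is authoritative.

```python
from transformers import T5Config

# Illustrative mapping of the numbers above onto a T5-style encoder-decoder
# config; the released AfriLION-Base config may differ in details such as
# feed-forward size or positional encoding.
config = T5Config(
    vocab_size=128_000,      # multilingual BPE vocabulary
    d_model=768,             # hidden size
    num_layers=12,           # encoder layers
    num_decoder_layers=12,   # decoder layers
    num_heads=12,            # attention heads
)
print(config)
```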
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoTokenizer, AutoModel
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")
# Example usage
text = "Habari za asubuhi" # Swahili: "Good morning news"
inputs = tokenizer(text, return_tensors="pt")
# AfriLION-Base is an encoder-decoder model, so the forward pass also needs
# decoder inputs; here we simply reuse the encoder inputs for illustration.
outputs = model(**inputs, decoder_input_ids=inputs["input_ids"])
last_hidden_state = outputs.last_hidden_state
```
### Fine-tuning Example
```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments
# Load for specific task
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")
# Task-specific fine-tuning code goes here (a fuller sketch follows below)
```
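As a fuller starting point, the sketch below fine-tunes the model on a parallel-text task using the seq2seq variants of the Trainer API. The dataset name and the `source`/`target` column names are placeholders; substitute your own corpus and preprocessing.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Placeholder dataset: any corpus with "source" and "target" text columns
# (for example, an English-Swahili parallel corpus) will work here.
dataset = load_dataset("your_dataset")

def preprocess(batch):
    model_inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="afrilion-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```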
## Benchmarks
| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |
## Limitations
- **Language Coverage**: Primarily focuses on widely spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties prioritized; dialectal variations may not be well-represented
- **Domain**: Better performance on formal text; colloquial/social media text may be challenging
- **Code-Switching**: Limited support for code-mixed text
## Ethical Considerations
- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: Model may not capture cultural nuances specific to different African communities
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```
## License
This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.
## Acknowledgments
- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources
## Contact
- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion
## Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:
- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data
---
**LocaleNLP**: Bridging Languages, Empowering Lives.