---
license: apache-2.0
---

# AfriLION-Base: Multilingual Language Model for African Languages

<div align="center">

**African Language Intelligence & Open NLP**

[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)

</div>

## Model Description

AfriLION-Base is an open-source multilingual language model designed specifically for African languages. Built on an encoder-decoder transformer architecture, it addresses the critical gap in NLP resources for low-resource African languages.

### Key Features

- **20+ African Languages**: Comprehensive support for major African language families
- **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- **Apache 2.0 License**: Fully open-source for research and commercial use
- **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology

## Supported Languages

### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)

### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)

### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)

### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)

## Training Data

The model is trained on:

- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages

### Data Processing

1. **Deduplication**: Aggressive deduplication at document and paragraph levels
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
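
Step 1 above can be sketched as exact paragraph-level deduplication. This is a minimal illustration only (the actual pipeline is not published, and production pipelines typically add fuzzy matching such as MinHash); `normalize` and `dedup_paragraphs` are hypothetical helpers, not part of any released tooling.

```python
import hashlib

def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically
    return " ".join(paragraph.lower().split())

def dedup_paragraphs(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each (normalized) paragraph across all documents."""
    seen = set()
    deduped = []
    for doc in docs:
        kept = []
        for para in doc.split("\n\n"):
            digest = hashlib.sha1(normalize(para).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(para)
        deduped.append("\n\n".join(kept))
    return deduped
```

Hashing the normalized paragraph keeps memory bounded to one digest per unique paragraph rather than storing the full text.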

## Architecture

- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
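
The hyperparameters above can be sanity-checked with a back-of-the-envelope parameter count. This sketch assumes a standard feed-forward dimension of 4 × hidden size and shared embeddings, and ignores biases and layer norms, so the figure is approximate:

```python
def estimate_params(vocab=128_000, d=768, enc_layers=12, dec_layers=12):
    """Rough parameter count for a standard encoder-decoder transformer."""
    embeddings = vocab * d                              # shared input/output embeddings
    enc_layer = 4 * d * d + 8 * d * d                   # self-attention + FFN (d_ff = 4d)
    dec_layer = 4 * d * d + 4 * d * d + 8 * d * d       # self-attn + cross-attn + FFN
    return embeddings + enc_layers * enc_layer + dec_layers * dec_layer

print(f"{estimate_params() / 1e6:.0f}M")  # ~296M, in the ballpark of the quoted 350M
```

The gap to the quoted 350M is plausibly accounted for by biases, layer norms, positional embeddings, and any deviation from the 4 × d feed-forward assumption.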

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage
text = "Habari za asubuhi"  # Swahili: "Good morning"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
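
`AutoModel` returns per-token hidden states; a common way to collapse them into a single sentence vector is masked mean pooling over `last_hidden_state`. This is a generic sketch (`mean_pool` is a hypothetical helper, not part of the Transformers API):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# With the Quick Start objects:
# sentence_vec = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```

Masking before averaging matters for batched inputs, where padding tokens would otherwise drag the mean toward the pad embedding.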

### Fine-tuning Example

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base checkpoint with a sequence-to-sequence head
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Minimal training setup; supply your own tokenized train_dataset
training_args = TrainingArguments(output_dir="afrilion-finetuned", num_train_epochs=3)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

## Benchmarks

| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |

## Limitations

- **Geographic Coverage**: Primarily focuses on widely spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties are prioritized; dialectal variation may not be well represented
- **Domain**: Better performance on formal text; colloquial and social-media text may be challenging
- **Code-Switching**: Limited support for code-mixed text

## Ethical Considerations

- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: The model may not capture cultural nuances specific to different African communities

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources

## Contact

- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:

- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data

---

**LocaleNLP**: Bridging Languages, Empowering Lives.