---
license: apache-2.0
---

# AfriLION-Base: Multilingual Language Model for African Languages
**African Language Intelligence & Open NLP**

[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)
## Model Description

AfriLION-Base is an open-source multilingual language model designed specifically for African languages. Built on a transformer encoder-decoder architecture, the model addresses the critical gap in NLP resources for low-resource African languages.

### Key Features

- 🌍 **20+ African Languages**: Comprehensive support for major African language families
- 📊 **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- ⚡ **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- 🔓 **Apache 2.0 License**: Fully open-source for research and commercial use
- 🎯 **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology

## Supported Languages

### West African Languages

- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)

### East African Languages

- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)

### Southern African Languages

- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)

### North African Languages

- Darija/Moroccan Arabic (ary)
- Kabyle (kab)

## Training Data

The model is trained on:

- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages

### Data Processing

1. **Deduplication**: Aggressive deduplication at document and paragraph levels (see the sketch after this list)
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
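To make the deduplication step concrete, here is a minimal sketch of exact paragraph-level deduplication via hashing. This is an illustration only, not the project's actual pipeline; the whitespace/case normalization rule and the `dedup_paragraphs` helper are our assumptions.

```python
import hashlib

def dedup_paragraphs(documents):
    """Drop exact-duplicate paragraphs across a corpus (illustrative only)."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize whitespace and case before hashing so trivially
            # different copies of the same paragraph collide
            normalized = " ".join(para.lower().split())
            key = hashlib.md5(normalized.encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n\n".join(kept))
    return cleaned

corpus = ["Habari za asubuhi.\n\nHabari za asubuhi.", "Habari  za asubuhi."]
print(dedup_paragraphs(corpus))  # repeated copies of the paragraph are dropped
```

A production pipeline would typically use fuzzy near-duplicate detection (e.g., MinHash) rather than exact hashes, but the bookkeeping pattern is the same.

## Architecture

- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage
text = "Habari za asubuhi"  # Swahili greeting: "Good morning"
inputs = tokenizer(text, return_tensors="pt")

# An encoder-decoder forward pass needs decoder inputs; reusing the
# encoder inputs is a simple way to inspect the model's outputs
outputs = model(**inputs, decoder_input_ids=inputs["input_ids"])
```

### Fine-tuning Example

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the checkpoint with a sequence-to-sequence head
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Minimal Trainer setup; train_dataset is your tokenized task dataset
training_args = TrainingArguments(output_dir="afrilion-finetuned")
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```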
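### Extracting Sentence Embeddings

Beyond fine-tuning, the raw checkpoint can be used to extract sentence representations from the encoder. The sketch below is illustrative only: mean pooling over encoder hidden states is a common heuristic rather than a documented feature of this model, and it assumes the checkpoint loads as a standard Hugging Face encoder-decoder exposing `get_encoder()`.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

def embed(text: str) -> torch.Tensor:
    """Mean-pool encoder hidden states into one sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Run only the encoder half of the encoder-decoder model
        states = model.get_encoder()(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (states * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between two greetings (Swahili and Zulu)
a, b = embed("Habari za asubuhi"), embed("Sawubona")
print(torch.nn.functional.cosine_similarity(a, b).item())
```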
## Benchmarks

| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |

## Limitations

- **Geographic Coverage**: Primarily focuses on widely spoken languages; many smaller African languages not yet included
- **Dialectal Variation**: Standard varieties prioritized; dialectal variations may not be well-represented
- **Domain**: Better performance on formal text; colloquial/social media text may be challenging
- **Code-Switching**: Limited support for code-mixed text

## Ethical Considerations

- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: Model may not capture cultural nuances specific to different African communities

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources

## Contact

- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:

- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data

---

**LocaleNLP**: Bridging Languages, Empowering Lives.