---
license: apache-2.0
---

# AfriLION-Base: Multilingual Language Model for African Languages

<div align="center">

**African Language Intelligence & Open NLP**

[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)

</div>

## Model Description

AfriLION-Base is an open-source multilingual language model designed specifically for African languages. Built on a 350M-parameter transformer encoder-decoder architecture, it addresses the critical gap in NLP resources for low-resource African languages.

### Key Features

- 🌍 **20+ African Languages**: Comprehensive support for major African language families
- 📊 **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- ⚡ **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- 🔓 **Apache 2.0 License**: Fully open-source for research and commercial use
- 🎯 **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology

## Supported Languages

### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)

### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)

### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)

### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)

## Training Data

The model is trained on:

- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages

### Data Processing

1. **Deduplication**: Aggressive deduplication at document and paragraph levels
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
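
The paragraph-level deduplication step can be sketched as a hash-based filter that keeps the first occurrence of each normalized paragraph. This is a minimal illustration, not the actual pipeline:

```python
import hashlib

def dedup_paragraphs(documents):
    """Drop repeated paragraphs across a corpus, keeping first occurrences.

    `documents` is an iterable of strings with paragraphs separated by
    blank lines. Hashing normalized text keeps memory usage bounded.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize whitespace and case so near-identical copies collide
            key = hashlib.sha1(
                " ".join(para.split()).lower().encode("utf-8")
            ).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:
            cleaned.append("\n\n".join(kept))
    return cleaned
```

A real pipeline would typically combine this exact-hash pass with fuzzy methods such as MinHash for near-duplicates.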

## Architecture

- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder layers
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
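
As a rough sanity check, the hyperparameters above are consistent with the stated parameter count, assuming a standard feed-forward width of 4× the hidden size and tied input/output embeddings (neither of which the card specifies):

```python
d = 768           # hidden size
layers = 12       # per stack (12 encoder + 12 decoder)
vocab = 128_000   # multilingual BPE vocabulary

embeddings = vocab * d                # shared embedding matrix (tying assumed)
self_attn = 4 * d * d                 # Q, K, V, O projections
ffn = 2 * d * (4 * d)                 # up- and down-projection (4x width assumed)
encoder = layers * (self_attn + ffn)
decoder = layers * (2 * self_attn + ffn)  # extra cross-attention block per layer

total = embeddings + encoder + decoder
print(f"{total / 1e6:.0f}M parameters")
# ≈ 296M; biases, layer norms, and any untied LM head account for the rest
```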

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage
text = "Habari za asubuhi"  # Swahili: "Good morning"
inputs = tokenizer(text, return_tensors="pt")

# The model is an encoder-decoder, so a plain forward pass also needs
# decoder inputs; to get encoder representations, call the encoder directly
outputs = model.get_encoder()(**inputs)
```

### Fine-tuning Example

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base model with a sequence-to-sequence head for the target task
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

args = TrainingArguments(output_dir="afrilion-finetuned", per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # your tokenized dataset
trainer.train()
```
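
For translation-style fine-tuning, pairs are typically formatted with a target-language tag so one model can serve many directions. A minimal sketch follows; the `>>code<<` tag convention is an assumption borrowed from common multilingual MT practice (e.g. OPUS-MT), not a documented AfriLION format:

```python
def format_translation_pair(src_text, tgt_text, tgt_lang):
    """Prefix the source text with a target-language tag (OPUS-MT style)."""
    return {
        "input": f">>{tgt_lang}<< {src_text}",
        "target": tgt_text,
    }

pair = format_translation_pair("Habari za asubuhi", "Good morning", "en")
# pair["input"] == ">>en<< Habari za asubuhi"
```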

## Benchmarks

| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |

## Limitations

- **Geographic Coverage**: Primarily focuses on widely spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties prioritized; dialectal variations may not be well-represented
- **Domain**: Better performance on formal text; colloquial/social media text may be challenging
- **Code-Switching**: Limited support for code-mixed text

## Ethical Considerations

- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: Model may not capture cultural nuances specific to different African communities

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources

## Contact

- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:

- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data

---

**LocaleNLP**: Bridging Languages, Empowering Lives.