---
license: apache-2.0
---

# AfriLION-Base: Multilingual Language Model for African Languages

<div align="center">

**African Language Intelligence & Open NLP**

[GitHub](https://github.com/LocaleNLP/afrilion) | [Website](https://localenlp.com) | [Demo](#) | [Paper](#)

</div>

## Model Description

AfriLION-Base is an open-source multilingual language model designed specifically for African languages. Built on an encoder-decoder transformer architecture, it addresses the critical gap in NLP resources for low-resource African languages.

### Key Features

- **20+ African Languages**: Comprehensive support for major African language families
- **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- **Apache 2.0 License**: Fully open-source for research and commercial use
- **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology

## Supported Languages

### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)

### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)

### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)

### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)

## Training Data

The model is trained on:

- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages

### Data Processing

1. **Deduplication**: Aggressive deduplication at document and paragraph levels
2. **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
3. **Balancing**: Stratified sampling to ensure representation across all languages
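
Step 1 above can be sketched as exact paragraph-level deduplication. This is a minimal illustration only (the actual pipeline is not published, and production pipelines typically add fuzzy matching such as MinHash); `normalize` and `dedup_paragraphs` are hypothetical helpers, not part of any released tooling.

```python
import hashlib

def normalize(paragraph: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically
    return " ".join(paragraph.lower().split())

def dedup_paragraphs(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each (normalized) paragraph across all documents."""
    seen = set()
    deduped = []
    for doc in docs:
        kept = []
        for para in doc.split("\n\n"):
            digest = hashlib.sha1(normalize(para).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(para)
        deduped.append("\n\n".join(kept))
    return deduped
```

Hashing the normalized paragraph keeps memory bounded to one digest per unique paragraph rather than storing the full text.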

## Architecture

- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
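
The hyperparameters above can be sanity-checked with a back-of-the-envelope parameter count. This sketch assumes a standard feed-forward dimension of 4 × hidden size and shared embeddings, and ignores biases and layer norms, so the figure is approximate:

```python
def estimate_params(vocab=128_000, d=768, enc_layers=12, dec_layers=12):
    """Rough parameter count for a standard encoder-decoder transformer."""
    embeddings = vocab * d                              # shared input/output embeddings
    enc_layer = 4 * d * d + 8 * d * d                   # self-attention + FFN (d_ff = 4d)
    dec_layer = 4 * d * d + 4 * d * d + 8 * d * d       # self-attn + cross-attn + FFN
    return embeddings + enc_layers * enc_layer + dec_layers * dec_layer

print(f"{estimate_params() / 1e6:.0f}M")  # ~296M, in the ballpark of the quoted 350M
```

The gap to the quoted 350M is plausibly accounted for by biases, layer norms, positional embeddings, and any deviation from the 4 × d feed-forward assumption.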

## Usage

### Installation

```bash
pip install transformers torch
```

### Quick Start

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage
text = "Habari za asubuhi"  # Swahili: "Good morning"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
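
`AutoModel` returns per-token hidden states; a common way to collapse them into a single sentence vector is masked mean pooling over `last_hidden_state`. This is a generic sketch (`mean_pool` is a hypothetical helper, not part of the Transformers API):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# With the Quick Start objects:
# sentence_vec = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
```

Masking before averaging matters for batched inputs, where padding tokens would otherwise drag the mean toward the pad embedding.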

### Fine-tuning Example

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base checkpoint with a sequence-to-sequence head
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Minimal training setup; supply your own tokenized train_dataset
training_args = TrainingArguments(output_dir="afrilion-finetuned", num_train_epochs=3)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```

## Benchmarks

| Task | Dataset | Score |
|------|---------|-------|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |

## Limitations

- **Geographic Coverage**: Primarily focuses on widely spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties are prioritized; dialectal variation may not be well represented
- **Domain**: Better performance on formal text; colloquial and social-media text may be challenging
- **Code-Switching**: Limited support for code-mixed text

## Ethical Considerations

- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: The model may not capture cultural nuances specific to different African communities

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```

## License

This model is released under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources

## Contact

- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion

## Contributing

We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:

- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data

---

**LocaleNLP**: Bridging Languages, Empowering Lives.