README.md · Kelvinmbewe/mbert_LusakaLang

	---
	language:
	- en
	- bem
	- ny
	tags:
	- multi-task
	- sentiment-analysis
	- topic-classification
	- language-identification
	- multilingual
	- transformer
	- zambia
	- lusaka
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text-classification
	model-index:
	- name: LusakaLang-MultiTask
	  results:
	  - task:
	      type: text-classification
	      name: Language Identification
	    dataset:
	      name: LusakaLang Language Data
	      type: lusakalang
	      split: test
	    metrics:
	    - type: accuracy
	      value: 0.97
	      name: accuracy
	    - type: f1
	      value: 0.96
	      name: f1_macro
	    - type: accuracy
	      value: 0.9322
	      name: accuracy
	    - type: f1
	      value: 0.9216
	      name: f1_macro
	    - type: f1
	      value: 0.8649
	      name: f1_negative
	    - type: f1
	      value: 0.95
	      name: f1_neutral
	    - type: f1
	      value: 0.95
	      name: f1_positive
	    - type: accuracy
	      value: 0.91
	      name: accuracy
	    - type: f1
	      value: 0.9
	      name: f1_macro
	base_model:
	- Kelvinmbewe/mbert_Lusaka_Language_Analysis
	- Kelvinmbewe/mbert_LusakaLang_Sentiment_Analysis
	- Kelvinmbewe/mbert_LusakaLang_Topic
	---

	## LusakaLang MultiTask Model

	This model is a unified transformer architecture built on top of `bert-base-multilingual-cased`, designed to perform three tasks simultaneously:

	1. Language Identification
	2. Sentiment Analysis
	3. Topic Classification

	The system integrates three fine‑tuned LusakaLang checkpoints:

	- mbert_Lusaka_Language_Analysis
	- mbert_LusakaLang_Sentiment_Analysis
	- mbert_LusakaLang_Topic

	All tasks share a single mBERT encoder with three independent classifier heads. Because the encoder runs once per input rather than once per task,
	this design reduces compute and memory compared with serving three separate models, and all three predictions are derived from the same representation of the text.
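
The shared-encoder design described above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class names `SharedEncoderMultiTask` and `TinyEncoder`, the hidden size, and the use of one linear layer per head are assumptions. The label counts follow the classes named in this README (four language classes including Mixed, three sentiment classes, five topics). `TinyEncoder` is only a stand-in so the sketch runs offline; in practice the encoder would presumably be loaded from `bert-base-multilingual-cased`.

```python
from types import SimpleNamespace

import torch
from torch import nn

# Label counts taken from the classes listed in this README.
NUM_LABELS = {"language": 4, "sentiment": 3, "topic": 5}


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for mBERT: token embeddings only, no attention."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask=None):
        # Mimic the `last_hidden_state` attribute of a transformers encoder output.
        return SimpleNamespace(last_hidden_state=self.embed(input_ids))


class SharedEncoderMultiTask(nn.Module):
    """One encoder, one linear classifier head per task (assumed head design)."""

    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in num_labels.items()}
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        cls_vector = hidden[:, 0]  # first-token ([CLS]) representation
        # The encoder ran once; every head reads the same CLS vector.
        return {task: head(cls_vector) for task, head in self.heads.items()}


model = SharedEncoderMultiTask(TinyEncoder(100, 16), 16, NUM_LABELS).eval()
input_ids = torch.randint(0, 100, (2, 8))  # batch of 2 sequences, 8 tokens each
with torch.no_grad():
    logits = model(input_ids, torch.ones_like(input_ids))
print({task: tuple(t.shape) for task, t in logits.items()})
```

Each head returns raw logits of shape `(batch, num_classes)`, so adding or retraining one task does not change the shape contract of the others.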

	---

	## Why This Model Matters

	Zambian communication is inherently multilingual, fluid, and deeply shaped by context. A single message may blend English, Bemba, Nyanja, local slang,
	and frequent code‑switching, often expressed through culturally grounded idioms and subtle emotional cues. This model is designed specifically for that
	environment, where meaning depends not only on the words used but on how languages interact within a single utterance.

	The model is trained to identify the dominant language or flag text that mixes several languages, to interpret sentiment even when it
	is conveyed indirectly or through culturally specific phrasing, and to classify text into practical topics: driver behaviour, payment issues,
	app performance, customer support, and ride availability. Handling these three signals together gives a more context‑aware
	reading of real Zambian communication than treating each task in isolation.


	---

	## How to Use This Model


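	```python
	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoTokenizer

	class LusakaLangMultiTask:
	    def __init__(self, path="Kelvinmbewe/LusakaLang-MultiTask"):
	        self.tokenizer = AutoTokenizer.from_pretrained(path)
	        # `path` is a Hub repo id, not a local folder, so fetch model.pt first.
	        weights = hf_hub_download(repo_id=path, filename="model.pt")
	        # Assumes model.pt is a fully pickled nn.Module. weights_only=False
	        # unpickles arbitrary code, so load only from a source you trust.
	        self.model = torch.load(weights, map_location="cpu", weights_only=False).eval()

	    # Placeholders: each method should return one dict per input text.
	    def predict_language(self, texts): pass
	    def predict_sentiment(self, texts): pass
	    def predict_topic(self, texts): pass

	llm = LusakaLangMultiTask()

	print(llm.predict_language([...]))
	print(llm.predict_sentiment([...]))
	print(llm.predict_topic([...]))
	```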

	## Sample Output

	```python
	# Language Identification 🌍
	[
	{"lang": "Bemba", "conf": 0.96},
	{"lang": "Nyanja", "conf": 0.95},
	{"lang": "English", "conf": 0.99}
	]
	# Sentiment ❤️
	[
	{"sent": "Negative", "conf": 0.98},
	{"sent": "Positive", "conf": 0.95},
	{"sent": "Neutral", "conf": 0.87}
	]
	# Topic 🗂️
	[
	{"topic": "Payment Issue", "conf": 0.97},
	{"topic": "Customer Support", "conf": 0.95},
	{"topic": "Driver Behaviour", "conf": 0.96}
	]
	```
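
The sample output pairs each predicted label with a confidence value. The README does not say how `conf` is computed. A minimal sketch, assuming it is the softmax probability of the top-scoring class and that each head emits one logit per label, could look like this. The `decode` helper, the label order, and the logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits_batch, labels, key):
    """Turn per-text logits into records like {"sent": "Positive", "conf": 0.95}."""
    records = []
    for logits in logits_batch:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax index
        records.append({key: labels[best], "conf": round(probs[best], 2)})
    return records


# Assumed label order; the real order depends on how the head was trained.
SENTIMENT_LABELS = ["Negative", "Neutral", "Positive"]

# Made-up logits for two texts, for illustration only.
print(decode([[4.1, 0.3, -1.2], [-0.5, 0.2, 3.0]], SENTIMENT_LABELS, "sent"))
```

The same helper would serve all three heads by swapping the label list and the key (`"lang"`, `"sent"`, or `"topic"`).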


	```
	=========================== Training Architecture ===========================

	📥 Input              →  🧠 Core Engine            →  📈 Output
	------------------------------------------------------------------------------
	Text (Any Language)   →  Tokenizer 🔤              →  Language 🌍
	                      →  Shared mBERT Encoder 🧠   →  Bemba / Nyanja /
	                      →  CLS Vector 🎯             →  English / Mixed
	------------------------------------------------------------------------------
	User Feedback 💬      →  Tokenizer 🔤              →  Sentiment ❤️
	                      →  Shared Encoder 🧠         →  Negative / Neutral /
	                      →  CLS Vector 🎯             →  Positive
	------------------------------------------------------------------------------
	Ride Context 🚗       →  Tokenizer 🔤              →  Topic 🗂️
	                      →  Shared Encoder 🧠         →  Driver / Payment /
	                      →  CLS Vector 🎯             →  Support / App / Availability
	------------------------------------------------------------------------------
	```