Kelvinmbewe/mbert_LusakaLang_Topic

---
language:
- en
- ny
- bem
tags:
- text-classification
- multilingual
- transformer
- zambia
- lusaka
- code-switching
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model:
- Kelvinmbewe/mbert_Lusaka_Language_Analysis
- google-bert/bert-base-multilingual-cased
metrics:
- accuracy
- precision
- recall
- macro_f1
- micro_f1
- validation_loss
- confusion_matrix
model-index:
- name: LusakaLang
  results:
  - task:
      type: text-classification
      name: Topic Classification
    dataset:
      name: LusakaLang Topic Dataset
      type: lusakalang
      config: default
      split: validation
    metrics:
    - type: accuracy
      value: 0.99259
      name: accuracy
    - type: precision
      value: 0.98730
      name: precision
    - type: recall
      value: 0.99128
      name: recall
    - type: f1
      value: 0.98926
      name: macro_f1
    - type: f1
      value: 0.99259
      name: micro_f1
    - type: loss
      value: 0.05233
      name: validation_loss
---

# LusakaLang Topic Analysis Model

This topic classifier was fine-tuned from its sister model, `mbert_LusakaLang_Sentiment_Analysis`, which was itself fine-tuned on sentiment data spanning English, Bemba, Nyanja, Zambian slang, and the mixed-language varieties commonly used in everyday Zambian communication.



## Training Details

- Base model: `mbert_LusakaLang_Sentiment_Analysis`
- Epochs: 20
- Class weights: enabled (to correct class imbalance)
- Optimizer: AdamW
- Loss: weighted cross-entropy
- Temperature scaling: T = 2.3 (applied at inference time)
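The class-weighted loss listed above can be illustrated with a minimal, dependency-free sketch. The exact weighting scheme used for this model is not stated, so the inverse-frequency ("balanced") weights below are an assumption, and the integer class ids and label counts are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumption: the model card does not say which weighting scheme was used.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of per-sample negative log-likelihoods.

    Normalised by the total weight of the targets, which is the convention
    of torch.nn.CrossEntropyLoss(weight=...) with mean reduction.
    """
    total, norm = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        w = weights[y]
        total += -w * log_softmax(logits)[y]
        norm += w
    return total / norm


# Hypothetical imbalanced label set: class 0 is three times as common.
labels = [0] * 6 + [1] * 2 + [2] * 2
weights = balanced_class_weights(labels)

# Hypothetical logits for three samples.
batch_logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.0, 0.0]]
targets = [0, 1, 2]
loss = weighted_cross_entropy(batch_logits, targets, weights)
print(weights, round(loss, 4))
```

Rare classes receive larger weights, so their errors contribute more to the loss than errors on the majority class.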

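## Why Temperature Scaling?

Class-weighted training tends to sharpen the logits, which pushes the softmax probabilities toward overconfidence. Dividing the logits by T = 2.3 before the softmax flattens the predicted distribution without changing the top-ranked class. At T = 2.3, this improves:

- Confidence calibration
- Noise robustness
- Handling of positive/neutral text
- Foreign-language generalization
- Reduction of overconfident misclassifications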
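As a rough illustration of temperature scaling at inference time, the sketch below divides the logits by T before the softmax. Only the value T = 2.3 comes from this model card; the function name and the example logits are hypothetical, and the real model's logits would come from its classification head.

```python
import math

TEMPERATURE = 2.3  # value stated in this model card


def softmax_with_temperature(logits, temperature=TEMPERATURE):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical logits for one input with three topic classes.
logits = [4.0, 1.0, 0.5]
raw = softmax_with_temperature(logits, temperature=1.0)
calibrated = softmax_with_temperature(logits)

print("T = 1.0:", [round(p, 3) for p in raw])
print("T = 2.3:", [round(p, 3) for p in calibrated])
```

The predicted class is the same in both cases; only the confidence attached to it drops, which matters when predictions are thresholded or routed for review based on their probability.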

## Training Data

The dataset was primarily synthetic, generated to simulate realistic ride-hailing feedback in Zambia.
To ensure authenticity:

- All samples were reviewed by a native Zambian speaker
- Mixed-language and slang patterns were corrected
- Local idioms and slang were added
- Unnatural AI-generated phrasing was removed
- Bemba and Nyanja grammar and tone were validated

This hybrid approach ensures that the dataset reflects real Zambian communication styles.


## Train and Validation Loss
![Train and validation loss curves](https://cdn-uploads.huggingface.co/production/uploads/674ed988f86d2ca07fa23abe/OnagZY8nhxv-bOejq2m0B.png)

## Confusion Matrix
![Confusion matrix](https://cdn-uploads.huggingface.co/production/uploads/674ed988f86d2ca07fa23abe/Qk6rvSrTyeWHl90BrpNQZ.png)
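The metrics in the front matter can all be derived from a confusion matrix. The sketch below shows one way to do so, assuming rows are true classes and columns are predicted classes; the 3x3 matrix is hypothetical and is not this model's actual confusion matrix. It is also assumed here that the reported precision and recall are macro-averaged, which the card does not state.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, macro and micro metrics from a square confusion matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    tp_sum = fp_sum = fn_sum = 0
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro_p = tp_sum / (tp_sum + fp_sum)
    micro_r = tp_sum / (tp_sum + fn_sum)
    return {
        "accuracy": tp_sum / total,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
        "micro_f1": 2 * micro_p * micro_r / (micro_p + micro_r),
    }


# Hypothetical confusion matrix for three topic classes.
cm = [
    [50, 1, 0],
    [2, 30, 1],
    [0, 0, 16],
]
scores = metrics_from_confusion(cm)
for name, value in scores.items():
    print(f"{name}: {value:.5f}")
```

For single-label classification, micro-F1 always equals accuracy, which is consistent with the identical values (0.99259) reported for both in the front matter.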

## Word Cloud
![Word cloud](https://cdn-uploads.huggingface.co/production/uploads/674ed988f86d2ca07fa23abe/dZb3Tq2FBAKztlIp9asCs.png)