TsekTxt: mBERT Taglish Misinformation Classifier
This model is one of three transformer models fine-tuned and evaluated side-by-side as part of TsekTxt, a thesis and capstone project on detecting misinformation in Taglish (Tagalog-English code-switched) social media text.
Part of the TsekTxt model family:
| Model | Base Architecture | Hugging Face Repo |
|---|---|---|
| XLM-RoBERTa | Multilingual, 100 languages | chimsio/tsektxt-xlmr |
| RoBERTa-Tagalog | Filipino-specific pretraining | chimsio/tsektxt-roberta-tagalog |
| mBERT (this model) | Multilingual BERT baseline | chimsio/tsektxt-mbert |
Live application: TsekTxt web app. This model family is served via a FastAPI backend and used to classify user-submitted Taglish text or screenshots as Suspicious or Not Suspicious.
Training pipeline / research repo: tsektxt-model-training, which contains the full data preprocessing, training, and comparative evaluation code for all three models.
Model Details
Model Description
This model is a fine-tuned version of bert-base-multilingual-cased for binary text classification, distinguishing Suspicious (potentially fake/misinformation) from Not Suspicious (credible) Taglish text. It serves as the multilingual BERT baseline in a three-way architectural comparison against XLM-RoBERTa (a newer multilingual model) and RoBERTa-Tagalog (a Filipino-specific model). The comparison tests how an older multilingual architecture performs on code-switched misinformation detection relative to more modern or more specialized alternatives.
- Developed by: Hans Jio Arca, as part of a capstone/thesis project
- Model type: Transformer encoder, sequence classification (2 labels)
- Language(s): Tagalog, English, and Taglish code-switched text
- License: CC-BY-NC-4.0 (academic/research use; update if your institution requires otherwise)
- Finetuned from model: bert-base-multilingual-cased
Model Sources
- Training code: tsektxt-model-training
- Application using this model: tsektxt-app
- Sibling models: tsektxt-xlmr, tsektxt-roberta-tagalog
Uses
Direct Use
Classifying short-to-medium Taglish social media text as Suspicious or Not Suspicious, as part of the TsekTxt credibility-checking pipeline, alongside Integrated Gradients token attributions for explainability.
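The card does not say how the Integrated Gradients attributions are implemented; for a transformer they are typically computed over token embeddings, for example with a library such as Captum. The sketch below only illustrates the idea on a toy logistic model with three hand-picked features. The weights, bias, input, and all-zero baseline are assumptions chosen for illustration, not values from TsekTxt.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, bias):
    """Toy classifier: probability of the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

def model_grad(x, w, bias):
    """Analytic gradient of model() with respect to each input feature."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, bias, steps=200):
    """Midpoint Riemann approximation of the path integral from baseline to x."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line between the baseline and the input.
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        for i, g in enumerate(model_grad(point, w, bias)):
            totals[i] += g
    # Scale the average gradient by the input-baseline difference.
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, totals)]

# Hypothetical toy values (not from the TsekTxt model).
w, bias = [2.0, -1.0, 0.5], -0.25
x, baseline = [1.0, 0.5, 2.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, bias)
output_change = model(x, w, bias) - model(baseline, w, bias)
print("attributions:", attributions)
print("sum of attributions:", sum(attributions), "output change:", output_change)
```

The attributions should sum to (approximately) the change in model output between the baseline and the input. This completeness property is what makes per-token scores interpretable as each token's share of the prediction.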
Out-of-Scope Use
- Not intended as a sole/automated fact-checking authority.
- Not evaluated on formal news articles, long-form documents, or non-Filipino contexts.
- Not intended for moderation decisions with legal or reputational consequences without human review.
Bias, Risks, and Limitations
This model was trained on the same data as its sibling models, so it shares the domain skew present in the source datasets (entertainment and political content are overrepresented). As an older, general-purpose multilingual architecture (vs. XLM-RoBERTa's more recent pretraining or RoBERTa-Tagalog's language-specific pretraining), it is expected to serve primarily as a baseline for comparison. See the comparative analysis notebook for how its performance and attribution patterns differ from the other two.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "chimsio/tsektxt-mbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode: disables dropout

def predict(text):
    # Tokenize a single string; inputs longer than 256 tokens are truncated.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    label = torch.argmax(probs, dim=-1).item()
    # NOTE: confirm label convention (0 = Suspicious, 1 = Not Suspicious) against your dataset
    return {
        "label": "Not Suspicious" if label == 1 else "Suspicious",
        "confidence": round(probs[0][label].item() * 100, 2),
    }

print(predict("Napatunayan na ang bakuna ay nagdudulot ng microchip sa katawan!"))
Training Details
Training Data
This model uses the same dataset, cleaning steps, and stratified 80/10/10 split (fixed seed 42) as the sibling XLM-RoBERTa and RoBERTa-Tagalog models, ensuring a fair three-way comparison. The combined dataset contains ~25,400 labeled Taglish samples drawn from the Fake News Filipino Dataset (Cruz, Tan & Cheng) and the Philippine Fake News Corpus (Fernandez).
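The cleaning and splitting described above can be sketched in plain Python. This is an illustration under assumptions, not the training repo's code: the helper names `clean` and `stratified_split`, the `(text, label)` row layout, and the synthetic rows are all hypothetical. Only the 80/10/10 fractions and seed 42 come from the card.

```python
import random
from collections import defaultdict

def clean(rows):
    """Drop rows with missing text/label and exact duplicate texts (first occurrence wins)."""
    seen, kept = set(), []
    for text, label in rows:
        if text is None or label is None:
            continue
        text = text.strip()
        if not text or text in seen:
            continue
        seen.add(text)
        kept.append((text, label))
    return kept

def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/val/test while keeping each label's share roughly equal."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Synthetic stand-in data: 1,000 unique rows plus one duplicate and one null.
rows = [(f"post {i}", 1 if i % 25 < 16 else 0) for i in range(1000)]
rows += [("post 0", 1), (None, 0)]

cleaned = clean(rows)
train, val, test = stratified_split(cleaned)
print(len(cleaned), len(train), len(val), len(test))
```

Splitting each label group separately is what keeps the class ratio the same across the three splits, and reusing the same seed is what lets all three models be evaluated on an identical test set.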
Training Procedure
- Preprocessing: Deduplicated, null-dropped, tokenized with mBERT's native WordPiece tokenizer, max sequence length 256.
- Class imbalance handling: Weighted cross-entropy loss.
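The card does not state how the class weights were chosen. The sketch below assumes the common inverse-frequency ("balanced") heuristic and spells out the weighted loss in plain Python. The test-split supports from the results table stand in for the training class counts, which the card does not list; because the split is stratified, the resulting weights should be close to what the training split would give. The label order (index 0 = Suspicious, index 1 = Not Suspicious) follows the convention the getting-started snippet asks readers to confirm, and the example logits are made up.

```python
import math

def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count_c)."""
    total = sum(counts)
    return [total / (len(counts) * c) for c in counts]

def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-example negative log-likelihoods.

    Normalises by the sum of the applied weights, which matches the default
    'mean' reduction of torch.nn.CrossEntropyLoss when `weight` is given.
    """
    loss_sum, weight_sum = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        loss_sum += weights[y] * nll
        weight_sum += weights[y]
    return loss_sum / weight_sum

# Assumption: index 0 = Suspicious, index 1 = Not Suspicious.
counts = [1631, 909]
weights = balanced_class_weights(counts)

# Hypothetical logits for three examples and their true labels.
logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
targets = [0, 1, 1]

weighted = weighted_cross_entropy(logits, targets, weights)
unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0])
print("weights:", weights)
print("weighted loss:", weighted, "unweighted loss:", unweighted)
```

The minority class (Not Suspicious) receives the larger weight, so the misclassified minority example in this toy batch pulls the weighted loss above the unweighted one.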
Training Hyperparameters
- Learning rate: 2e-5
- Batch size: 16 (train), 32 (eval)
- Epochs: 4
- Weight decay: 0.01
- Hardware: NVIDIA T4 GPU (Google Colab)
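For readers reproducing the run, the listed values can be collected into keyword arguments named after the corresponding `transformers.TrainingArguments` fields. This is a sketch, not the training repo's configuration: the `seed` entry is an assumption (the card states seed 42 only for the data split), and the step estimate assumes a single GPU with no gradient accumulation and a training share of 80% of ~25,400 samples.

```python
import math

# Values taken from the hyperparameter list above; the key names follow
# transformers.TrainingArguments so the dict can be unpacked into it.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 4,
    "weight_decay": 0.01,
    "seed": 42,  # assumption: the card only states seed 42 for the data split
}

# Approximate sizes: ~25,400 samples with an 80% training share.
approx_train_size = round(25_400 * 0.8)
steps_per_epoch = math.ceil(
    approx_train_size / training_kwargs["per_device_train_batch_size"]
)
total_steps = steps_per_epoch * training_kwargs["num_train_epochs"]
print(steps_per_epoch, total_steps)

# Usage (not executed here; "out" is a hypothetical output directory):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **training_kwargs)
```

Under these assumptions the run comes to roughly 1,270 optimizer steps per epoch and about 5,080 in total, which is consistent with the reported ~1 hour on a single T4 only as a rough order of magnitude.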
Evaluation
Testing Data
Held-out stratified test split (2,540 samples), identical to the split used for sibling models.
Results
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Not Suspicious | 0.96 | 0.95 | 0.95 | 909 |
| Suspicious | 0.97 | 0.98 | 0.97 | 1,631 |
| Accuracy | | | 0.97 | 2,540 |
| Macro avg | 0.97 | 0.96 | 0.96 | 2,540 |
| Weighted avg | 0.97 | 0.97 | 0.97 | 2,540 |
See the training repo's comparative analysis notebook for the full three-model comparison.
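As a sanity check on the table, the macro and weighted averages can be recomputed from the per-class rows. The sketch below uses only the rounded two-decimal values shown above, so its results can differ from the reported averages in the last digit; the reported figures were presumably computed from unrounded scores.

```python
# Per-class values copied from the results table (rounded to two decimals).
per_class = {
    "Not Suspicious": {"precision": 0.96, "recall": 0.95, "f1": 0.95, "support": 909},
    "Suspicious": {"precision": 0.97, "recall": 0.98, "f1": 0.97, "support": 1631},
}

def macro_avg(metric):
    """Unweighted mean over classes: every class counts equally."""
    values = [row[metric] for row in per_class.values()]
    return sum(values) / len(values)

def weighted_avg(metric):
    """Mean over classes weighted by each class's support."""
    total = sum(row["support"] for row in per_class.values())
    return sum(row[metric] * row["support"] for row in per_class.values()) / total

total_support = sum(row["support"] for row in per_class.values())
averages = {
    metric: {"macro": macro_avg(metric), "weighted": weighted_avg(metric)}
    for metric in ("precision", "recall", "f1")
}

print("total support:", total_support)
for metric, values in averages.items():
    print(metric, values)
```

The support-weighted recall is mathematically the same quantity as overall accuracy, which is why the two agree in the table.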
Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: ~1 hour
- Cloud Provider: Google (Colab)
- Compute Region: Unknown (Colab-assigned)
Technical Specifications
- Model Architecture: BERT-base-multilingual-cased, sequence classification head (2 labels)
- Compute Infrastructure: Google Colab, single T4 GPU
- Software: transformers, datasets, accelerate, PyTorch
Citation
@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}
Dataset citations: Cruz, Tan & Cheng (Fake News Filipino Dataset); Fernandez (Philippine Fake News Corpus).
Model Card Contact
Hans Jio Arca: https://github.com/hansjio