+---
+language: ['vi']
+license: mit
+tags: ['text-classification', 'phobert', 'vietnamese', 'esg', 'sustainability']
+metrics:
+- macro-f1
+- accuracy
+---
+# PhoBERT ESG Topic Classifier (Vietnamese Banking Reports)
+A fine-tuned [vinai/phobert-base-v2](https://huggingface.co/vinai/phobert-base-v2) model for **sentence-level ESG topic classification** in Vietnamese banking reports.
+---
+## Model Description
+This model classifies Vietnamese sentences extracted from **banking annual and sustainability reports** into **six topic categories** (five ESG-related, plus a Non-ESG class):
+- **Non-ESG**: General business, financial, or operational content not related to ESG
+- **E (Environmental)**: Environmental topics such as emissions, energy, climate, waste, and resource usage
+- **S (Social)**: Social topics including employees, community, customer protection, health & safety
+- **G (Governance)**: Corporate governance topics such as board structure, compliance, risk management
+- **Policy**: ESG-related strategies, policies, commitments, and frameworks
+- **Financing**: Green or sustainable finance activities (green bonds, sustainable credit, ESG-linked finance)
+The model is designed as **Stage B (Topic Classification)** in a larger ESG-washing analysis pipeline.
+---
+## Training Data
+- **Source**: Vietnamese banking annual and sustainability reports
+- **Time span**: 2015–2024
+- **Sentence-level corpus** after OCR cleaning and quality filtering
+Dataset splits:
+- **Train**: 926 sentences
+- **Dev**: 127 sentences
+- **Test**: 272 sentences
+All splits are constructed with **bank-year group isolation** to prevent information leakage.
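
The card does not say how the bank-year isolation was implemented. The following is a minimal sketch of one way to do it, assuming each sentence record carries hypothetical `bank` and `year` fields. The split fractions are illustrative and apply to groups, not to sentences.

```python
import random
from collections import defaultdict


def group_split(records, dev_frac=0.1, test_frac=0.2, seed=42):
    """Split records so each (bank, year) group lands in exactly one split."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["bank"], rec["year"])].append(rec)

    # Shuffle the group keys, not the sentences, so a group is never divided.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(len(keys) * test_frac))
    n_dev = max(1, round(len(keys) * dev_frac))
    test_keys = keys[:n_test]
    dev_keys = keys[n_test:n_test + n_dev]
    train_keys = keys[n_test + n_dev:]

    def collect(selected):
        return [rec for key in selected for rec in groups[key]]

    return collect(train_keys), collect(dev_keys), collect(test_keys)


# Toy corpus (hypothetical banks and placeholder text): 3 banks x 4 years x 3 sentences.
records = [
    {"bank": bank, "year": year, "text": f"sentence {i}"}
    for bank in ("BankA", "BankB", "BankC")
    for year in range(2020, 2024)
    for i in range(3)
]
train, dev, test = group_split(records)
```

Because whole groups are assigned to a split, sentences from the same report never appear on both sides of a train/evaluation boundary.
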
+---
+## Training Procedure
+- **Base model**: `vinai/phobert-base-v2`
+- **Fine-tuning strategy**: Full fine-tuning
+- **Loss**: Class-weighted CrossEntropyLoss (to address class imbalance)
+- **Optimizer**: AdamW
+  - Learning rate: 2e-05
+  - Weight decay: 0.01
+- **Batch size**: 16
+- **Max sequence length**: 256 tokens
+- **Epochs trained**: 8
+- **Best checkpoint**: Epoch 4 (selected by DEV Macro-F1)
+- **Random seed**: 42
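
The weighting scheme behind the class-weighted loss is not stated in this card. One common choice is inverse-frequency ("balanced") weighting, `n_samples / (n_classes * class_count)`. The sketch below assumes that scheme and uses toy label counts, not the real training distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels (illustrative only): 6 Non-ESG, 3 G, 1 Policy.
toy_labels = ["Non-ESG"] * 6 + ["G"] * 3 + ["Policy"] * 1
weights = balanced_class_weights(toy_labels)
for label, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{label}: {weight:.3f}")
```

Ordered by label id, values like these could be passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.
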
+---
+## Evaluation Results
+**Primary metric:** Macro-F1 (the unweighted mean of per-class F1, so minority classes count as much as the dominant Non-ESG class)
+| Metric     | DEV     | TEST    |
+|------------|---------|---------|
+| Macro-F1   | 0.7214 | 0.7070 |
+| Accuracy   | 0.7874 | 0.7537 |
+---
+### Per-class Performance (TEST)
+| Label | Precision | Recall | F1 | Support |
+|------|-----------|--------|----|---------|
+| E | 0.789 | 0.857 | 0.822 | 35 |
+| Financing | 0.647 | 0.458 | 0.537 | 24 |
+| G | 0.769 | 0.741 | 0.755 | 54 |
+| Non-ESG | 0.748 | 0.873 | 0.805 | 102 |
+| Policy | 0.692 | 0.562 | 0.621 | 16 |
+| S | 0.788 | 0.634 | 0.703 | 41 |
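
As a consistency check, the headline TEST metrics can be approximately recovered from the per-class table: Macro-F1 is the unweighted mean of the per-class F1 scores, and accuracy is the number of correct predictions (recall × support, summed over classes) divided by the total support. The sketch below uses only the rounded figures from the table, so small deviations from the reported values are expected.

```python
# (precision, recall, f1, support) copied from the per-class TEST table.
per_class = {
    "E": (0.789, 0.857, 0.822, 35),
    "Financing": (0.647, 0.458, 0.537, 24),
    "G": (0.769, 0.741, 0.755, 54),
    "Non-ESG": (0.748, 0.873, 0.805, 102),
    "Policy": (0.692, 0.562, 0.621, 16),
    "S": (0.788, 0.634, 0.703, 41),
}

# Macro-F1: every class contributes equally, regardless of support.
macro_f1 = sum(f1 for _, _, f1, _ in per_class.values()) / len(per_class)

# Accuracy: correct predictions per class are recall * support (rounded to a count).
total = sum(support for _, _, _, support in per_class.values())
correct = sum(round(recall * support) for _, recall, _, support in per_class.values())
accuracy = correct / total

print(f"Macro-F1 ~ {macro_f1:.4f}")                    # 0.7072 vs. 0.7070 reported
print(f"Accuracy = {correct}/{total} = {accuracy:.4f}")  # 205/272 = 0.7537
```

The 0.0002 gap in Macro-F1 comes from the per-class F1 values being rounded to three decimals in the table.
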
+---
+## Intended Use
+- ESG topic analysis for Vietnamese banking reports
+- Preprocessing step for **ESG-washing detection**
+- Academic research (thesis / paper-level experiments)
+---
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+model_name = "huypham71/esg-topic-classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
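+# Example input: "The bank commits to reducing carbon emissions by 20% by 2025."
+text = "Ngân hàng cam kết giảm 20% lượng khí thải carbon vào năm 2025."
+inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+with torch.no_grad():  # inference only, no gradients needed
+    outputs = model(**inputs)
+    probs = torch.softmax(outputs.logits, dim=-1)  # shape: (1, num_labels)
+    pred_id = torch.argmax(probs, dim=-1).item()  # class index for the single input
+print("Prediction:", model.config.id2label[pred_id])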
+print("Confidence:", float(probs[0, pred_id]))
+```
+---
+## Limitations
+- Trained specifically on Vietnamese banking reports
+- Not intended for other industries or languages
+- Some ambiguity exists between Policy, Environmental, and Financing categories due to overlapping ESG discourse
+- Minority classes (E, Policy) have fewer samples than Non-ESG and Governance
+---
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{esg-topic-classifier,
+  author = {huypham71},
+  title = {ESG Topic Classifier for Vietnamese Banking Reports},
+  year = {2026},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/huypham71/esg-topic-classifier}
+}
+```