---
language: vi
tags:
- nlp
- text-classification
- vietnamese
- esg
- sustainability
- banking
library_name: transformers
pipeline_tag: text-classification
license: mit
---
# PhoBERT ESG Topic Classifier for Vietnamese Banking Annual Reports
## Model description
This model is a Vietnamese text classification model fine-tuned from **PhoBERT** to classify sentences from **banking annual reports** into ESG-related topics. It is designed as **Module 2 (ESG Topic Classification)** in an ESG-washing analysis pipeline, where downstream modules assess actionability, evidence support, and report-level ESG-washing risk.
The model predicts one of six labels:
- `E` (Environmental)
- `S_labor` (Social – labor/workforce)
- `S_community` (Social – community/CSR)
- `S_product` (Social – product/customer)
- `G` (Governance)
- `Non_ESG` (not ESG-related)
> Note: The model focuses on **textual disclosure topic classification**, not factual verification of ESG claims.
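For reference, the six labels can be written as an id-to-label mapping. The index order below is an assumption; the authoritative mapping is the `id2label` field in the model's `config.json`:

```python
# Assumed label order; verify against the model repo's config.json (id2label).
ID2LABEL = {
    0: "E",
    1: "S_labor",
    2: "S_community",
    3: "S_product",
    4: "G",
    5: "Non_ESG",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```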
---
## Intended use
### Primary intended use
- Filtering and categorizing ESG-related sentences in Vietnamese banking annual reports.
- Supporting ESG-washing analysis pipelines (e.g., actionability classification and evidence linking).
### Example downstream usage
- Keep only ESG sentences (`E`, `S_*`, `G`) and discard `Non_ESG` for later actionability/evidence modules.
- Aggregate predicted topics by bank-year to analyze disclosure patterns across ESG pillars.
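The two downstream patterns above can be sketched with pandas. The column names (`bank`, `year`, `sentence`, `pred_label`) are hypothetical; in practice the predictions table comes from running the classifier over each sentence of a report:

```python
import pandas as pd

# Hypothetical predictions table produced by the classifier.
df = pd.DataFrame({
    "bank": ["VCB", "VCB", "BID", "BID"],
    "year": [2024, 2024, 2024, 2024],
    "sentence": ["...", "...", "...", "..."],
    "pred_label": ["E", "Non_ESG", "G", "S_labor"],
})

# 1) Keep only ESG sentences for later actionability/evidence modules.
esg_df = df[df["pred_label"] != "Non_ESG"]

# 2) Aggregate predicted topics by bank-year to compare disclosure patterns.
topic_counts = (
    esg_df.groupby(["bank", "year"])["pred_label"]
    .value_counts()
    .unstack(fill_value=0)
)
print(topic_counts)
```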
### Out-of-scope use
- Determining whether a bank is actually “greenwashing/ESG-washing” in the real world.
- Use on domains far from banking annual reports (e.g., social media) without re-validation.
- Legal, compliance, or investment decision-making without human review.
---
## Training data
The model was trained using a **hybrid labeling strategy**:
- **LLM pre-labels** (teacher) to bootstrap semantic topic boundaries
- **Weak labeling rules** (filter) to override trivial non-ESG content with high precision
- A **manually annotated gold set** used for calibration and evaluation
Hybrid label sources:
- `llm`: 2,897 samples (LLM-only)
- `llm_weak_agree`: 2,083 samples (LLM + weak labels agree, higher confidence)
Total labeled samples for training/validation: **4,980**
- Train: **4,233**
- Validation: **747**
Gold set (manual) for final test: **500** samples, balanced across labels.
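A minimal sketch of how the hybrid label sources could be combined per sample. The exact override/agreement policy is an assumption reconstructed from the description above (high-precision weak rules filter non-ESG content; agreement with the LLM raises confidence):

```python
from typing import Optional, Tuple

def combine_labels(llm_label: str, weak_label: Optional[str]) -> Tuple[str, str]:
    """Combine an LLM pre-label with an optional weak-rule label.

    Returns (final_label, source_tag). The policy here is an assumption:
    weak non-ESG rules override the LLM; agreement yields a higher-confidence
    sample; otherwise the LLM label is kept.
    """
    if weak_label is None:
        return llm_label, "llm"             # LLM-only sample
    if weak_label == llm_label:
        return llm_label, "llm_weak_agree"  # higher-confidence sample
    if weak_label == "Non_ESG":
        return "Non_ESG", "weak_override"   # high-precision non-ESG filter wins
    return llm_label, "llm"                 # default: trust the LLM label
```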
---
## Training procedure
- Base model: PhoBERT, fine-tuned with a 6-class classification head.
- Objective: cross-entropy loss with a class-balancing strategy.
- Context-aware input: sentences are classified with a local context window (`prev + sent + next`) where available in the corpus, depending on block type.
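The context-aware input construction can be sketched as follows. The plain-space joining and the fallback-to-sentence-only policy are assumptions; the corpus's block-type logic decides when context is attached:

```python
from typing import List

def build_context_input(sentences: List[str], i: int, use_context: bool = True) -> str:
    """Build the model input for sentence i as `prev + sent + next`.

    Whether context is attached depends on the block type in the corpus
    (e.g. narrative paragraph vs. table cell); that policy is assumed here.
    """
    if not use_context:
        return sentences[i]
    prev_s = sentences[i - 1] if i > 0 else ""
    next_s = sentences[i + 1] if i < len(sentences) - 1 else ""
    return " ".join(part for part in (prev_s, sentences[i], next_s) if part)

sents = ["Câu A.", "Câu B.", "Câu C."]
print(build_context_input(sents, 1))  # → 'Câu A. Câu B. Câu C.'
print(build_context_input(sents, 0))  # → 'Câu A. Câu B.'
```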
---
## Evaluation results
### Validation set (747 samples)
- Macro-F1: **0.8598**
- Micro-F1: **0.8635**
- Weighted-F1: **0.8628**
Per-class (validation):
| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| E | 0.8310 | 0.8806 | 0.8551 | 67 |
| S_labor | 0.9000 | 0.8675 | 0.8834 | 83 |
| S_community | 0.8732 | 0.8611 | 0.8671 | 72 |
| S_product | 0.8426 | 0.8922 | 0.8667 | 102 |
| G | 0.8372 | 0.7606 | 0.7970 | 142 |
| Non_ESG | 0.8785 | 0.9004 | 0.8893 | 281 |
### Gold test set (500 samples)
- Macro-F1: **0.9665**
- Micro-F1: **0.9660**
Per-class (gold):
| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| E | 0.9872 | 0.9625 | 0.9747 | 80 |
| S_labor | 0.9873 | 0.9750 | 0.9811 | 80 |
| S_community | 0.9634 | 0.9875 | 0.9753 | 80 |
| S_product | 0.9506 | 0.9625 | 0.9565 | 80 |
| G | 0.9659 | 0.9444 | 0.9551 | 90 |
| Non_ESG | 0.9457 | 0.9667 | 0.9560 | 90 |
> Note: The gold test set is balanced and may not reflect real-world class frequencies in annual reports. Always validate on your target corpus.
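Because the gold set is balanced, it is worth re-computing the same metrics on an annotated sample of your own corpus; a minimal sketch with scikit-learn (the `y_true`/`y_pred` values here are hypothetical):

```python
from sklearn.metrics import classification_report, f1_score

labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]

# Hypothetical gold annotations and model predictions on a target-corpus sample.
y_true = ["E", "G", "Non_ESG", "S_labor", "Non_ESG", "E"]
y_pred = ["E", "G", "Non_ESG", "S_labor", "E", "E"]

macro = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)
micro = f1_score(y_true, y_pred, labels=labels, average="micro", zero_division=0)
print(f"macro-F1={macro:.4f} micro-F1={micro:.4f}")
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```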
---
## How to use
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"  # replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]

text = "Ngân hàng đã triển khai chương trình giảm phát thải và tiết kiệm năng lượng trong năm 2024."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
pred_idx = int(probs.argmax())
print(labels[pred_idx], float(probs[pred_idx]))
```
---
## Limitations
- The model is trained on the language and structure of Vietnamese banking annual reports; performance may degrade on other domains.
- ESG topic boundaries can be ambiguous; some governance-related financial-risk text may be misclassified without domain adaptation.
- The model does not verify the truthfulness of ESG claims; it only categorizes topics based on text.