Upload ESG Topic Classifier
- README.md +92 -93
- config.json +15 -14
- model.safetensors +2 -2
- training_args.bin +1 -1
README.md
CHANGED
|
@@ -1,144 +1,143 @@
|
|
| 1 |
---
|
| 2 |
-
language: ['vi']
|
| 3 |
-
license: mit
|
| 4 |
-
tags: ['text-classification', 'phobert', 'vietnamese', 'esg', 'sustainability']
|
| 5 |
-
metrics:
|
| 6 |
-
- macro-f1
|
| 7 |
-
- accuracy
|
| 8 |
---
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
---
|
| 15 |
|
| 16 |
-
#
|
| 17 |
|
| 18 |
-
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
-
-
|
| 22 |
-
-
|
| 23 |
-
-
|
| 24 |
-
-
|
| 25 |
-
-
|
|
|
|
| 26 |
|
| 27 |
-
The model
|
| 28 |
|
| 29 |
---
|
| 30 |
|
| 31 |
-
##
|
|
|
|
|
|
|
|
|
|
| 32 |
|
| 33 |
-
|
| 34 |
-
-
|
| 35 |
-
-
|
| 36 |
|
| 37 |
-
|
| 38 |
-
-
|
| 39 |
-
-
|
| 40 |
-
-
|
| 41 |
-
|
| 42 |
-
All splits are constructed with **bank-year group isolation** to prevent information leakage.
|
| 43 |
|
| 44 |
---
|
| 45 |
|
| 46 |
-
## Training
|
| 47 |
|
| 48 |
-
|
| 49 |
-
-
|
| 50 |
-
-
|
| 51 |
-
- **Optimizer**: AdamW
|
| 52 |
-
- Learning rate: 2e-05
|
| 53 |
-
- Weight decay: 0.01
|
| 54 |
-
- **Batch size**: 16
|
| 55 |
-
- **Max sequence length**: 256 tokens
|
| 56 |
-
- **Epochs trained**: 8
|
| 57 |
-
- **Best checkpoint**: Epoch 4 (selected by DEV Macro-F1)
|
| 58 |
-
- **Random seed**: 42
|
| 59 |
|
| 60 |
-
|
|
|
|
|
|
|
| 61 |
|
| 62 |
-
|
| 63 |
|
| 64 |
-
|
| 65 |
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
|
| 71 |
---
|
| 72 |
|
| 73 |
-
##
|
| 74 |
-
|
| 75 |
-
| Label | Precision | Recall | F1 | Support |
|
| 76 |
-
|------|-----------|--------|----|---------|
|
| 77 |
-
| E | 0.789 | 0.857 | 0.822 | 35 |
|
| 78 |
-
| Financing | 0.647 | 0.458 | 0.537 | 24 |
|
| 79 |
-
| G | 0.769 | 0.741 | 0.755 | 54 |
|
| 80 |
-
| Non-ESG | 0.748 | 0.873 | 0.805 | 102 |
|
| 81 |
-
| Policy | 0.692 | 0.562 | 0.621 | 16 |
|
| 82 |
-
| S | 0.788 | 0.634 | 0.703 | 41 |
|
| 83 |
|
| 84 |
-
|
|
|
|
|
|
|
|
|
|
| 85 |
|
| 86 |
-
|
| 87 |
|
| 88 |
-
|
| 89 |
-
- Preprocessing step for **ESG-washing detection**
|
| 90 |
-
- Academic research (thesis / paper-level experiments)
|
| 91 |
|
| 92 |
---
|
| 93 |
|
| 94 |
-
##
|
| 95 |
|
| 96 |
```python
|
| 97 |
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
| 98 |
import torch
|
| 99 |
|
| 100 |
-
|
|
|
|
|
|
|
| 101 |
|
| 102 |
-
|
| 103 |
-
model = AutoModelForSequenceClassification.from_pretrained(model_name)
|
| 104 |
|
| 105 |
-
text = "Ngân hàng
|
| 106 |
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
|
| 107 |
|
| 108 |
with torch.no_grad():
|
| 109 |
-
|
| 110 |
-
probs = torch.softmax(
|
| 111 |
-
pred_id = torch.argmax(probs).item()
|
| 112 |
|
| 113 |
-
|
| 114 |
-
print(
|
| 115 |
```
|
| 116 |
|
| 117 |
---
|
| 118 |
|
| 119 |
## Limitations
|
| 120 |
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
Not intended for other industries or languages
|
| 124 |
-
|
| 125 |
-
Some ambiguity exists between Policy, Environmental, and Financing categories due to overlapping ESG discourse
|
| 126 |
-
|
| 127 |
-
Minority classes (E, Policy) have fewer samples than Non-ESG and Governance
|
| 128 |
-
|
| 129 |
-
---
|
| 130 |
|
| 131 |
-
|
| 132 |
|
| 133 |
-
|
| 134 |
|
| 135 |
-
```bibtex
|
| 136 |
-
@misc{esg-topic-classifier,
|
| 137 |
-
author = {huypham71},
|
| 138 |
-
title = {ESG Topic Classifier for Vietnamese Banking Reports},
|
| 139 |
-
year = {2026},
|
| 140 |
-
publisher = {Hugging Face},
|
| 141 |
-
url = {https://huggingface.co/huypham71/esg-topic-classifier}
|
| 142 |
-
}
|
| 143 |
```
|
| 144 |
|
|
|
|
| 1 |
---
|
| 2 |
---
|
| 3 |
+
language: vi
|
| 4 |
+
tags:
|
| 5 |
+
- nlp
|
| 6 |
+
- text-classification
|
| 7 |
+
- vietnamese
|
| 8 |
+
- esg
|
| 9 |
+
- sustainability
|
| 10 |
+
- banking
|
| 11 |
+
library_name: transformers
|
| 12 |
+
pipeline_tag: text-classification
|
| 13 |
+
license: mit
|
| 14 |
---
|
| 15 |
|
| 16 |
+
# PhoBERT ESG Topic Classifier for Vietnamese Banking Annual Reports
|
| 17 |
|
| 18 |
+
## Model description
|
| 19 |
+
This is a Vietnamese text classification model, fine-tuned from **PhoBERT**, that assigns sentences from **banking annual reports** to ESG-related topics. It serves as **Module 2 (ESG Topic Classification)** in an ESG-washing analysis pipeline, where downstream modules assess actionability, evidence support, and report-level ESG-washing risk.
|
| 20 |
|
| 21 |
+
The model predicts one of six labels:
|
| 22 |
+
- `E` (Environmental)
|
| 23 |
+
- `S_labor` (Social – labor/workforce)
|
| 24 |
+
- `S_community` (Social – community/CSR)
|
| 25 |
+
- `S_product` (Social – product/customer)
|
| 26 |
+
- `G` (Governance)
|
| 27 |
+
- `Non_ESG` (not ESG-related)
|
| 28 |
|
| 29 |
+
> Note: The model focuses on **textual disclosure topic classification**, not factual verification of ESG claims.
|
| 30 |
|
| 31 |
---
|
| 32 |
|
| 33 |
+
## Intended use
|
| 34 |
+
### Primary intended use
|
| 35 |
+
- Filtering and categorizing ESG-related sentences in Vietnamese banking annual reports.
|
| 36 |
+
- Supporting ESG-washing analysis pipelines (e.g., actionability classification and evidence linking).
|
| 37 |
|
| 38 |
+
### Example downstream usage
|
| 39 |
+
- Keep only ESG sentences (`E`, `S_*`, `G`) and discard `Non_ESG` for later actionability/evidence modules.
|
| 40 |
+
- Aggregate predicted topics by bank-year to analyze disclosure patterns across ESG pillars.
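
The two downstream steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the record fields `bank`, `year`, and `label`, and the helper names `keep_esg` and `topics_by_bank_year`, are assumptions made for the example. Only the label names come from this model card.

```python
from collections import Counter, defaultdict

# Labels treated as ESG (everything except Non_ESG), as listed in the model card.
ESG_LABELS = {"E", "S_labor", "S_community", "S_product", "G"}


def keep_esg(records):
    """Keep only records whose predicted label is an ESG topic."""
    return [r for r in records if r["label"] in ESG_LABELS]


def topics_by_bank_year(records):
    """Count predicted topics per (bank, year) pair."""
    counts = defaultdict(Counter)
    for r in records:
        counts[(r["bank"], r["year"])][r["label"]] += 1
    return counts


# Hypothetical predictions; the field names are illustrative only.
predictions = [
    {"bank": "BankA", "year": 2023, "label": "E"},
    {"bank": "BankA", "year": 2023, "label": "Non_ESG"},
    {"bank": "BankA", "year": 2023, "label": "G"},
    {"bank": "BankB", "year": 2024, "label": "S_labor"},
]

esg_only = keep_esg(predictions)
summary = topics_by_bank_year(esg_only)
for key, counter in sorted(summary.items()):
    print(key, dict(counter))
```

Grouping on the `(bank, year)` pair keeps each report's topic profile separate, which is the unit the card suggests for comparing disclosure patterns.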
|
| 41 |
|
| 42 |
+
### Out-of-scope use
|
| 43 |
+
- Determining whether a bank is actually “greenwashing/ESG-washing” in the real world.
|
| 44 |
+
- Use on domains far from banking annual reports (e.g., social media) without re-validation.
|
| 45 |
+
- Legal, compliance, or investment decision-making without human review.
|
|
|
|
|
|
|
| 46 |
|
| 47 |
---
|
| 48 |
|
| 49 |
+
## Training data
|
| 50 |
+
The model was trained using a **hybrid labeling strategy**:
|
| 51 |
+
- **LLM pre-labels** (teacher) to bootstrap semantic topic boundaries
|
| 52 |
+
- **Weak labeling rules** (filter) that override labels on trivially non-ESG content with high precision
|
| 53 |
+
- A **manually annotated gold set** used for calibration and evaluation
|
| 54 |
|
| 55 |
+
Hybrid label sources:
|
| 56 |
+
- `llm`: 2,897 samples (LLM-only)
|
| 57 |
+
- `llm_weak_agree`: 2,083 samples (LLM + weak labels agree, higher confidence)
|
| 58 |
|
| 59 |
+
Total labeled samples for training/validation: **4,980**
|
| 60 |
+
- Train: **4,233**
|
| 61 |
+
- Validation: **747**
|
| 62 |
|
| 63 |
+
Gold set (manually annotated) for the final test: **500** samples, approximately balanced across labels (80 to 90 per label).
|
| 64 |
|
| 65 |
+
---
|
| 66 |
|
| 67 |
+
## Training procedure
|
| 68 |
+
- Base model: PhoBERT, fine-tuned with a 6-class classification head.
|
| 69 |
+
- Objective: cross-entropy loss (with a class-balancing strategy).
|
| 70 |
+
- Context-aware input: classification is sentence-level, using the local context window available in the corpus (`prev + sent + next`) where the block type provides one.
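
The card does not say how the context window is joined or which class-balancing strategy is used, so the sketch below is only one plausible reading. It assumes the three sentences are joined with a plain separator and swaps in inverse-frequency class weights, a common balancing choice; the helper names `build_context_input`, `inverse_frequency_weights`, and `weighted_cross_entropy` are hypothetical.

```python
import math


def build_context_input(sentences, i, sep=" "):
    """Join the previous, current, and next sentence into one input string."""
    prev_sent = sentences[i - 1] if i > 0 else ""
    next_sent = sentences[i + 1] if i + 1 < len(sentences) else ""
    parts = [prev_sent, sentences[i], next_sent]
    return sep.join(p for p in parts if p)


def inverse_frequency_weights(class_counts):
    """Weight each class by N / (K * count), so rare classes weigh more."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return {label: total / (k * n) for label, n in class_counts.items()}


def weighted_cross_entropy(probs, target, weights):
    """Class-weighted negative log-likelihood for a single example."""
    return -weights[target] * math.log(probs[target])


# Validation-set supports from this card, used here only as example counts
# (the training-set class counts are not reported).
counts = {"E": 67, "S_labor": 83, "S_community": 72,
          "S_product": 102, "G": 142, "Non_ESG": 281}
weights = inverse_frequency_weights(counts)

sentences = ["Câu một.", "Câu hai.", "Câu ba."]
print(build_context_input(sentences, 1))
print({label: round(w, 3) for label, w in weights.items()})
```

With these weights, an error on a rare class such as `E` contributes more to the loss than the same error on `Non_ESG`.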
|
| 71 |
|
| 72 |
---
|
| 73 |
|
| 74 |
+
## Evaluation results
|
| 75 |
|
| 76 |
+
### Validation set (747 samples)
|
| 77 |
+
- Macro-F1: **0.8598**
|
| 78 |
+
- Micro-F1: **0.8635**
|
| 79 |
+
- Weighted-F1: **0.8628**
|
| 80 |
|
| 81 |
+
Per-class (validation):
|
| 82 |
+
| Label | Precision | Recall | F1 | Support |
|
| 83 |
+
|---|---:|---:|---:|---:|
|
| 84 |
+
| E | 0.8310 | 0.8806 | 0.8551 | 67 |
|
| 85 |
+
| S_labor | 0.9000 | 0.8675 | 0.8834 | 83 |
|
| 86 |
+
| S_community | 0.8732 | 0.8611 | 0.8671 | 72 |
|
| 87 |
+
| S_product | 0.8426 | 0.8922 | 0.8667 | 102 |
|
| 88 |
+
| G | 0.8372 | 0.7606 | 0.7970 | 142 |
|
| 89 |
+
| Non_ESG | 0.8785 | 0.9004 | 0.8893 | 281 |
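
The aggregate scores can be cross-checked against the per-class rows. The sketch below recomputes them from the rounded values in the validation table, so small differences in the last decimal are possible. It relies on the fact that, for single-label classification, micro-F1 equals accuracy, which in turn is the support-weighted mean of per-class recall.

```python
# (label, precision, recall, f1, support) copied from the validation table.
rows = [
    ("E", 0.8310, 0.8806, 0.8551, 67),
    ("S_labor", 0.9000, 0.8675, 0.8834, 83),
    ("S_community", 0.8732, 0.8611, 0.8671, 72),
    ("S_product", 0.8426, 0.8922, 0.8667, 102),
    ("G", 0.8372, 0.7606, 0.7970, 142),
    ("Non_ESG", 0.8785, 0.9004, 0.8893, 281),
]

total = sum(support for _, _, _, _, support in rows)

# Macro-F1: unweighted mean of per-class F1.
macro_f1 = sum(f1 for _, _, _, f1, _ in rows) / len(rows)

# Weighted-F1: per-class F1 averaged with support as the weight.
weighted_f1 = sum(f1 * support for _, _, _, f1, support in rows) / total

# Micro-F1 (= accuracy here): support-weighted mean of per-class recall.
micro_f1 = sum(recall * support for _, _, recall, _, support in rows) / total

print(f"support={total} macro={macro_f1:.4f} "
      f"micro={micro_f1:.4f} weighted={weighted_f1:.4f}")
```

The recomputed values should agree with the reported Macro-F1, Micro-F1, and Weighted-F1 up to rounding in the table.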
|
| 90 |
+
|
| 91 |
+
### Gold test set (500 samples)
|
| 92 |
+
- Macro-F1: **0.9665**
|
| 93 |
+
- Micro-F1: **0.9660**
|
| 94 |
+
|
| 95 |
+
Per-class (gold):
|
| 96 |
+
| Label | Precision | Recall | F1 | Support |
|
| 97 |
+
|---|---:|---:|---:|---:|
|
| 98 |
+
| E | 0.9872 | 0.9625 | 0.9747 | 80 |
|
| 99 |
+
| S_labor | 0.9873 | 0.9750 | 0.9811 | 80 |
|
| 100 |
+
| S_community | 0.9634 | 0.9875 | 0.9753 | 80 |
|
| 101 |
+
| S_product | 0.9506 | 0.9625 | 0.9565 | 80 |
|
| 102 |
+
| G | 0.9659 | 0.9444 | 0.9551 | 90 |
|
| 103 |
+
| Non_ESG | 0.9457 | 0.9667 | 0.9560 | 90 |
|
| 104 |
|
| 105 |
+
> Note: The gold test set is balanced and may not reflect real-world class frequencies in annual reports. Always validate on your target corpus.
|
|
|
|
|
|
|
| 106 |
|
| 107 |
---
|
| 108 |
|
| 109 |
+
## How to use
|
| 110 |
|
| 111 |
```python
|
| 112 |
from transformers import AutoTokenizer, AutoModelForSequenceClassification
|
| 113 |
import torch
|
| 114 |
|
| 115 |
+
model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"  # replace with the ID of this model repository
|
| 116 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
| 117 |
+
model = AutoModelForSequenceClassification.from_pretrained(model_id)
|
| 118 |
|
| 119 |
+
labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]  # same order as id2label in config.json
|
|
|
|
| 120 |
|
| 121 |
+
text = "Ngân hàng đã triển khai chương trình giảm phát thải và tiết kiệm năng lượng trong năm 2024."
|
| 122 |
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
|
| 123 |
|
| 124 |
with torch.no_grad():
|
| 125 |
+
logits = model(**inputs).logits
|
| 126 |
+
probs = torch.softmax(logits, dim=-1).squeeze().tolist()
|
|
|
|
| 127 |
|
| 128 |
+
pred = labels[probs.index(max(probs))]
|
| 129 |
+
print(pred, max(probs))
|
| 130 |
```
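
As an alternative to a hard-coded label list, predictions can be decoded through the `id2label` mapping committed in `config.json`. The sketch below is a dependency-free illustration: the mapping is copied from this commit's config, while the helper names `softmax` and `decode` and the example logits are hypothetical. With a loaded model, `model.config.id2label` holds the same mapping.

```python
import math

# id2label as committed in this repository's config.json.
ID2LABEL = {0: "E", 1: "S_labor", 2: "S_community",
            3: "S_product", 4: "G", 5: "Non_ESG"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def decode(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for one sentence, one value per class.
label, confidence = decode([0.2, -1.0, 0.1, 0.0, 3.5, 0.3])
print(label, round(confidence, 3))
```

Decoding through the config mapping keeps the label order tied to the checkpoint, so a later change to the label set cannot silently desynchronize a hand-written list.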
|
| 131 |
|
| 132 |
---
|
| 133 |
|
| 134 |
## Limitations
|
| 135 |
|
| 136 |
+
The model is trained on Vietnamese banking annual report language and structure; performance may degrade on other domains.
|
| 137 |
|
| 138 |
+
ESG boundaries can be ambiguous; some governance-related financial-risk text may be misclassified without domain adaptation.
|
| 139 |
|
| 140 |
+
The model does not verify the truthfulness of ESG claims; it only categorizes topics based on text.
|
| 141 |
|
| 142 |
```
|
| 143 |
|
config.json
CHANGED
|
@@ -7,32 +7,33 @@
|
|
| 7 |
"classifier_dropout": null,
|
| 8 |
"dtype": "float32",
|
| 9 |
"eos_token_id": 2,
|
|
|
|
| 10 |
"hidden_act": "gelu",
|
| 11 |
"hidden_dropout_prob": 0.1,
|
| 12 |
-
"hidden_size":
|
| 13 |
"id2label": {
|
| 14 |
"0": "E",
|
| 15 |
-
"1": "
|
| 16 |
-
"2": "
|
| 17 |
-
"3": "
|
| 18 |
-
"4": "
|
| 19 |
-
"5": "
|
| 20 |
},
|
| 21 |
"initializer_range": 0.02,
|
| 22 |
-
"intermediate_size":
|
| 23 |
"label2id": {
|
| 24 |
"E": 0,
|
| 25 |
-
"
|
| 26 |
-
"
|
| 27 |
-
"
|
| 28 |
-
"
|
| 29 |
-
"
|
| 30 |
},
|
| 31 |
"layer_norm_eps": 1e-05,
|
| 32 |
"max_position_embeddings": 258,
|
| 33 |
"model_type": "roberta",
|
| 34 |
-
"num_attention_heads":
|
| 35 |
-
"num_hidden_layers":
|
| 36 |
"pad_token_id": 1,
|
| 37 |
"position_embedding_type": "absolute",
|
| 38 |
"tokenizer_class": "PhobertTokenizer",
|
|
|
|
| 7 |
"classifier_dropout": null,
|
| 8 |
"dtype": "float32",
|
| 9 |
"eos_token_id": 2,
|
| 10 |
+
"gradient_checkpointing": false,
|
| 11 |
"hidden_act": "gelu",
|
| 12 |
"hidden_dropout_prob": 0.1,
|
| 13 |
+
"hidden_size": 1024,
|
| 14 |
"id2label": {
|
| 15 |
"0": "E",
|
| 16 |
+
"1": "S_labor",
|
| 17 |
+
"2": "S_community",
|
| 18 |
+
"3": "S_product",
|
| 19 |
+
"4": "G",
|
| 20 |
+
"5": "Non_ESG"
|
| 21 |
},
|
| 22 |
"initializer_range": 0.02,
|
| 23 |
+
"intermediate_size": 4096,
|
| 24 |
"label2id": {
|
| 25 |
"E": 0,
|
| 26 |
+
"G": 4,
|
| 27 |
+
"Non_ESG": 5,
|
| 28 |
+
"S_community": 2,
|
| 29 |
+
"S_labor": 1,
|
| 30 |
+
"S_product": 3
|
| 31 |
},
|
| 32 |
"layer_norm_eps": 1e-05,
|
| 33 |
"max_position_embeddings": 258,
|
| 34 |
"model_type": "roberta",
|
| 35 |
+
"num_attention_heads": 16,
|
| 36 |
+
"num_hidden_layers": 24,
|
| 37 |
"pad_token_id": 1,
|
| 38 |
"position_embedding_type": "absolute",
|
| 39 |
"tokenizer_class": "PhobertTokenizer",
|
model.safetensors
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
-
size
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:6c4212cd4c946a8715b853fdb1f984586f571d01ee8598ebaa6fad8b57a5ccdb
|
| 3 |
+
size 1476725928
|
training_args.bin
CHANGED
|
@@ -1,3 +1,3 @@
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
-
oid sha256:
|
| 3 |
size 5841
|
|
|
|
| 1 |
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:897a8b6f27368894909efc4fd965361e01bba366364ece8d06050e0928ccef89
|
| 3 |
size 5841
|