huypham71 committed · Commit d905696 · verified · 1 Parent(s): a0322e2

Upload ESG Topic Classifier

Files changed (4)
  1. README.md +92 -93
  2. config.json +15 -14
  3. model.safetensors +2 -2
  4. training_args.bin +1 -1
README.md CHANGED
@@ -1,144 +1,143 @@
  ---
- language: ['vi']
- license: mit
- tags: ['text-classification', 'phobert', 'vietnamese', 'esg', 'sustainability']
- metrics:
- - macro-f1
- - accuracy
  ---
-
- # PhoBERT ESG Topic Classifier (Vietnamese Banking Reports)
-
- A fine-tuned [vinai/phobert-base-v2](https://huggingface.co/vinai/phobert-base-v2) model for **sentence-level ESG topic classification** in Vietnamese banking reports.
-
  ---

- ## Model Description

- This model classifies Vietnamese sentences extracted from **banking annual and sustainability reports** into **6 ESG-related topic categories**:

- - **Non-ESG**: General business, financial, or operational content not related to ESG
- - **E (Environmental)**: Environmental topics such as emissions, energy, climate, waste, and resource usage
- - **S (Social)**: Social topics including employees, community, customer protection, health & safety
- - **G (Governance)**: Corporate governance topics such as board structure, compliance, risk management
- - **Policy**: ESG-related strategies, policies, commitments, and frameworks
- - **Financing**: Green or sustainable finance activities (green bonds, sustainable credit, ESG-linked finance)

- The model is designed as **Stage B (Topic Classification)** in a larger ESG-washing analysis pipeline.

  ---

- ## Training Data

- - **Source**: Vietnamese banking annual and sustainability reports
- - **Time span**: 2015–2024
- - **Sentence-level corpus** after OCR cleaning and quality filtering

- Dataset splits:
- - **Train**: 926 sentences
- - **Dev**: 127 sentences
- - **Test**: 272 sentences
-
- All splits are constructed with **bank-year group isolation** to prevent information leakage.

  ---

- ## Training Procedure

- - **Base model**: `vinai/phobert-base-v2`
- - **Fine-tuning strategy**: Full fine-tuning
- - **Loss**: Class-weighted CrossEntropyLoss (to address class imbalance)
- - **Optimizer**: AdamW
-   - Learning rate: 2e-05
-   - Weight decay: 0.01
- - **Batch size**: 16
- - **Max sequence length**: 256 tokens
- - **Epochs trained**: 8
- - **Best checkpoint**: Epoch 4 (selected by DEV Macro-F1)
- - **Random seed**: 42

- ---

- ## Evaluation Results

- **Primary metric:** Macro-F1 (robust to class imbalance)

- | Metric | DEV | TEST |
- |------------|---------|---------|
- | Macro-F1 | 0.7214 | 0.7070 |
- | Accuracy | 0.7874 | 0.7537 |

  ---

- ### Per-class Performance (TEST)
-
- | Label | Precision | Recall | F1 | Support |
- |------|-----------|--------|----|---------|
- | E | 0.789 | 0.857 | 0.822 | 35 |
- | Financing | 0.647 | 0.458 | 0.537 | 24 |
- | G | 0.769 | 0.741 | 0.755 | 54 |
- | Non-ESG | 0.748 | 0.873 | 0.805 | 102 |
- | Policy | 0.692 | 0.562 | 0.621 | 16 |
- | S | 0.788 | 0.634 | 0.703 | 41 |

- ---

- ## Intended Use

- - ESG topic analysis for Vietnamese banking reports
- - Preprocessing step for **ESG-washing detection**
- - Academic research (thesis / paper-level experiments)

  ---

- ## Usage

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

- model_name = "huypham71/esg-topic-classifier"

- tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
- model = AutoModelForSequenceClassification.from_pretrained(model_name)

- text = "Ngân hàng cam kết giảm 20% lượng khí thải carbon vào năm 2025."
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

  with torch.no_grad():
-     outputs = model(**inputs)
-     probs = torch.softmax(outputs.logits, dim=-1)
-     pred_id = torch.argmax(probs).item()

- print("Prediction:", model.config.id2label[pred_id])
- print("Confidence:", float(probs[0, pred_id]))
  ```

  ---

  ## Limitations

- Trained specifically on Vietnamese banking reports
-
- Not intended for other industries or languages
-
- Some ambiguity exists between Policy, Environmental, and Financing categories due to overlapping ESG discourse
-
- Minority classes (E, Policy) have fewer samples than Non-ESG and Governance
-
- ---

- ## Citation

- If you use this model, please cite:

- ```bibtex
- @misc{esg-topic-classifier,
- author = {huypham71},
- title = {ESG Topic Classifier for Vietnamese Banking Reports},
- year = {2026},
- publisher = {Hugging Face},
- url = {https://huggingface.co/huypham71/esg-topic-classifier}
- }
  ```

  ---
  ---
+ language: vi
+ tags:
+ - nlp
+ - text-classification
+ - vietnamese
+ - esg
+ - sustainability
+ - banking
+ library_name: transformers
+ pipeline_tag: text-classification
+ license: mit
  ---

+ # PhoBERT ESG Topic Classifier for Vietnamese Banking Annual Reports

+ ## Model description
+ This model is a Vietnamese text classification model fine-tuned from **PhoBERT** to classify sentences from **banking annual reports** into ESG-related topics. It is designed as **Module 2 (ESG Topic Classification)** in an ESG-washing analysis pipeline, where downstream modules assess actionability, evidence support, and report-level ESG-washing risk.

+ The model predicts one of six labels:
+ - `E` (Environmental)
+ - `S_labor` (Social labor/workforce)
+ - `S_community` (Social community/CSR)
+ - `S_product` (Social product/customer)
+ - `G` (Governance)
+ - `Non_ESG` (not ESG-related)

+ > Note: The model focuses on **textual disclosure topic classification**, not factual verification of ESG claims.

  ---

+ ## Intended use
+ ### Primary intended use
+ - Filtering and categorizing ESG-related sentences in Vietnamese banking annual reports.
+ - Supporting ESG-washing analysis pipelines (e.g., actionability classification and evidence linking).

+ ### Example downstream usage
+ - Keep only ESG sentences (`E`, `S_*`, `G`) and discard `Non_ESG` for later actionability/evidence modules.
+ - Aggregate predicted topics by bank-year to analyze disclosure patterns across ESG pillars.
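
The two downstream steps listed above (drop `Non_ESG`, then aggregate by bank-year) could be sketched roughly as follows. This is only an illustration: the row layout (`bank`, `year`, `label` keys), the helper names, and the sample rows are assumptions made for the example, not part of the model repository.

```python
from collections import Counter, defaultdict

# Label names taken from the model card; everything else here is hypothetical.
ESG_LABELS = {"E", "S_labor", "S_community", "S_product", "G"}


def keep_esg(rows):
    """Drop rows predicted as Non_ESG, keeping E, S_* and G."""
    return [row for row in rows if row["label"] in ESG_LABELS]


def topics_by_bank_year(rows):
    """Count predicted topics per (bank, year) group."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[(row["bank"], row["year"])][row["label"]] += 1
    return counts


# Hypothetical predictions, one dict per classified sentence.
rows = [
    {"bank": "BankA", "year": 2023, "label": "E"},
    {"bank": "BankA", "year": 2023, "label": "Non_ESG"},
    {"bank": "BankA", "year": 2023, "label": "G"},
    {"bank": "BankB", "year": 2024, "label": "S_labor"},
]

esg_rows = keep_esg(rows)
summary = topics_by_bank_year(esg_rows)
for group, counter in sorted(summary.items()):
    print(group, dict(counter))
```

Grouping on a `(bank, year)` tuple keeps the aggregation independent of how the sentences were ordered in the input.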

+ ### Out-of-scope use
+ - Determining whether a bank is actually “greenwashing/ESG-washing” in the real world.
+ - Use on domains far from banking annual reports (e.g., social media) without re-validation.
+ - Legal, compliance, or investment decision-making without human review.

  ---

+ ## Training data
+ The model was trained using a **hybrid labeling strategy**:
+ - **LLM pre-labels** (teacher) to bootstrap semantic topic boundaries
+ - **Weak labeling rules** (filter) to override trivial non-ESG content with high precision
+ - A **manually annotated gold set** used for calibration and evaluation

+ Hybrid label sources:
+ - `llm`: 2,897 samples (LLM-only)
+ - `llm_weak_agree`: 2,083 samples (LLM + weak labels agree, higher confidence)

+ Total labeled samples for training/validation: **4,980**
+ - Train: **4,233**
+ - Validation: **747**

+ Gold set (manual) for final test: **500** samples, balanced across labels.
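
As a quick sanity check on the counts quoted above, the label-source totals and the train/validation split can be reconciled with a few lines of arithmetic. The dictionary names are illustrative only; the numbers are the ones stated in the README.

```python
# Counts as reported in the README; variable names are illustrative.
label_sources = {"llm": 2897, "llm_weak_agree": 2083}
splits = {"train": 4233, "validation": 747}

total = sum(label_sources.values())

# Both breakdowns should describe the same 4,980 samples.
assert total == sum(splits.values()) == 4980

validation_fraction = splits["validation"] / total
agreement_fraction = label_sources["llm_weak_agree"] / total

print(f"validation share: {validation_fraction:.2%}")
print(f"LLM/weak-rule agreement share: {agreement_fraction:.2%}")
```

The split works out to a 15% validation share, and a little under half of the labels are backed by both the LLM and the weak rules.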

+ ---

+ ## Training procedure
+ - Base model: PhoBERT, fine-tuned with a 6-class classification head.
+ - Objective: Cross-entropy loss (with class-balancing strategy).
+ - Context-aware input: sentence-level classification with local context window available in the corpus (`prev + sent + next`) depending on block type.
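
The README does not say how the `prev + sent + next` window is assembled or how block type changes it. A minimal sketch of one possible construction is shown below; the function name, the `window` parameter, and the plain-space join are all assumptions for illustration, not the author's preprocessing.

```python
def build_context_input(sentences, index, window=1, sep=" "):
    """Join the target sentence with up to `window` neighbours on each side.

    Hypothetical helper: the actual pipeline may use a different separator
    or skip the context entirely for some block types.
    """
    start = max(0, index - window)
    end = min(len(sentences), index + window + 1)
    return sep.join(sentences[start:end])


# Placeholder sentences standing in for one report block.
block = ["Sentence one.", "Sentence two.", "Sentence three."]

first = build_context_input(block, 0)   # no previous sentence available
middle = build_context_input(block, 1)  # prev + sent + next
last = build_context_input(block, 2)    # no next sentence available
print(middle)
```

Clamping the slice bounds means sentences at the start or end of a block simply get a shorter context instead of raising an error.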

  ---

+ ## Evaluation results

+ ### Validation set (747 samples)
+ - Macro-F1: **0.8598**
+ - Micro-F1: **0.8635**
+ - Weighted-F1: **0.8628**

+ Per-class (validation):
+ | Label | Precision | Recall | F1 | Support |
+ |---|---:|---:|---:|---:|
+ | E | 0.8310 | 0.8806 | 0.8551 | 67 |
+ | S_labor | 0.9000 | 0.8675 | 0.8834 | 83 |
+ | S_community | 0.8732 | 0.8611 | 0.8671 | 72 |
+ | S_product | 0.8426 | 0.8922 | 0.8667 | 102 |
+ | G | 0.8372 | 0.7606 | 0.7970 | 142 |
+ | Non_ESG | 0.8785 | 0.9004 | 0.8893 | 281 |
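
The three aggregate validation scores can be recomputed from the per-class table, which is a useful check that the table and the summary agree. The sketch below assumes single-label classification, where micro-F1 equals accuracy and the number of correct predictions per class is recall times support; the variable names are illustrative.

```python
# (recall, f1, support) per class, copied from the validation table.
per_class = {
    "E": (0.8806, 0.8551, 67),
    "S_labor": (0.8675, 0.8834, 83),
    "S_community": (0.8611, 0.8671, 72),
    "S_product": (0.8922, 0.8667, 102),
    "G": (0.7606, 0.7970, 142),
    "Non_ESG": (0.9004, 0.8893, 281),
}

total_support = sum(n for _, _, n in per_class.values())

# Macro-F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, f1, _ in per_class.values()) / len(per_class)

# Weighted-F1: per-class F1 weighted by class support.
weighted_f1 = sum(f1 * n for _, f1, n in per_class.values()) / total_support

# Micro-F1 (= accuracy here): correct predictions recovered from recall * support.
correct = sum(round(recall * n) for recall, _, n in per_class.values())
micro_f1 = correct / total_support

print(round(macro_f1, 4), round(micro_f1, 4), round(weighted_f1, 4))
```

Because the table values are rounded to four decimals, small differences in the last digit would not by themselves indicate an error.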
+
+ ### Gold test set (500 samples)
+ - Macro-F1: **0.9665**
+ - Micro-F1: **0.9660**
+
+ Per-class (gold):
+ | Label | Precision | Recall | F1 | Support |
+ |---|---:|---:|---:|---:|
+ | E | 0.9872 | 0.9625 | 0.9747 | 80 |
+ | S_labor | 0.9873 | 0.9750 | 0.9811 | 80 |
+ | S_community | 0.9634 | 0.9875 | 0.9753 | 80 |
+ | S_product | 0.9506 | 0.9625 | 0.9565 | 80 |
+ | G | 0.9659 | 0.9444 | 0.9551 | 90 |
+ | Non_ESG | 0.9457 | 0.9667 | 0.9560 | 90 |

+ > Note: The gold test set is balanced and may not reflect real-world class frequencies in annual reports. Always validate on your target corpus.

  ---

+ ## How to use

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
  import torch

+ model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"  # replace with the actual repo id
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)

+ # Order must match the id2label mapping in config.json (0 = E, ..., 5 = Non_ESG).
+ labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]

+ # "The bank implemented an emission-reduction and energy-saving program in 2024."
+ text = "Ngân hàng đã triển khai chương trình giảm phát thải tiết kiệm năng lượng trong năm 2024."
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

  with torch.no_grad():
+     logits = model(**inputs).logits
+     # Single input: squeeze the batch dimension to get a flat list of 6 probabilities.
+     probs = torch.softmax(logits, dim=-1).squeeze().tolist()

+ pred_id = probs.index(max(probs))  # index of the highest-probability class
+ print(labels[pred_id], probs[pred_id])
  ```

  ---

  ## Limitations

+ The model is trained on Vietnamese banking annual report language and structure; performance may degrade on other domains.

+ ESG boundaries can be ambiguous; some governance-related financial-risk text may be misclassified without domain adaptation.

+ The model does not verify the truthfulness of ESG claims; it only categorizes topics based on text.

  ```

config.json CHANGED
@@ -7,32 +7,33 @@
    "classifier_dropout": null,
    "dtype": "float32",
    "eos_token_id": 2,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
-   "hidden_size": 768,
    "id2label": {
      "0": "E",
-     "1": "Financing",
-     "2": "G",
-     "3": "Non-ESG",
-     "4": "Policy",
-     "5": "S"
    },
    "initializer_range": 0.02,
-   "intermediate_size": 3072,
    "label2id": {
      "E": 0,
-     "Financing": 1,
-     "G": 2,
-     "Non-ESG": 3,
-     "Policy": 4,
-     "S": 5
    },
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 258,
    "model_type": "roberta",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 12,
    "pad_token_id": 1,
    "position_embedding_type": "absolute",
    "tokenizer_class": "PhobertTokenizer",
    "classifier_dropout": null,
    "dtype": "float32",
    "eos_token_id": 2,
+   "gradient_checkpointing": false,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
    "id2label": {
      "0": "E",
+     "1": "S_labor",
+     "2": "S_community",
+     "3": "S_product",
+     "4": "G",
+     "5": "Non_ESG"
    },
    "initializer_range": 0.02,
+   "intermediate_size": 4096,
    "label2id": {
      "E": 0,
+     "G": 4,
+     "Non_ESG": 5,
+     "S_community": 2,
+     "S_labor": 1,
+     "S_product": 3
    },
    "layer_norm_eps": 1e-05,
    "max_position_embeddings": 258,
    "model_type": "roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
    "pad_token_id": 1,
    "position_embedding_type": "absolute",
    "tokenizer_class": "PhobertTokenizer",
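
Since the label set changed in this commit, decoding predictions from the `id2label` mapping is less error-prone than keeping a separate hard-coded list. The sketch below shows the idea in plain Python, using a fragment copied from the new `config.json`. The logits are made up for illustration, and the helper names are hypothetical. Note that JSON object keys are strings, so they are converted to integers here.

```python
import json
import math

# Fragment copied from the updated config.json; keys are strings in JSON.
CONFIG_FRAGMENT = """
{"id2label": {"0": "E", "1": "S_labor", "2": "S_community",
              "3": "S_product", "4": "G", "5": "Non_ESG"}}
"""

id2label = {int(k): v for k, v in json.loads(CONFIG_FRAGMENT)["id2label"].items()}


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


example_logits = [0.1, 0.2, 0.0, 0.3, 2.5, 0.4]  # hypothetical model output
label, confidence = decode(example_logits)
print(label, round(confidence, 3))
```

When the model is loaded through `transformers`, the same mapping is available as `model.config.id2label`, as the previous version of the README's usage snippet relied on.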
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2b3e053f61dd4786efe1c9b113ad1fb96a5e41596e01c4bee0ce8c21bfe23c32
- size 540035688
+ oid sha256:6c4212cd4c946a8715b853fdb1f984586f571d01ee8598ebaa6fad8b57a5ccdb
+ size 1476725928
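
The jump in `model.safetensors` size is consistent with the `config.json` change from a 12-layer, 768-wide encoder to a 24-layer, 1024-wide one. A rough parameter count can be estimated from the file sizes, assuming every weight is stored as float32 (the `dtype` in the config) and ignoring the small safetensors header, so the figures slightly overestimate the true counts.

```python
BYTES_PER_FLOAT32 = 4

# File sizes in bytes, taken from the Git LFS pointers in this commit.
sizes = {"before": 540_035_688, "after": 1_476_725_928}

# Approximate parameter counts under the float32 assumption.
approx_params = {name: size // BYTES_PER_FLOAT32 for name, size in sizes.items()}
growth = sizes["after"] / sizes["before"]

for name, count in approx_params.items():
    print(f"{name}: ~{count / 1e6:.0f}M parameters")
print(f"size ratio: {growth:.2f}x")
```

The estimates land near 135M and 369M parameters, roughly the sizes of a base and a large RoBERTa-style encoder with a large vocabulary, though the commit itself does not name the new base checkpoint.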
training_args.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:72b7800b938f17e70d00e0722489091060ea5af2edf3fe49031b99426c345276
+ oid sha256:897a8b6f27368894909efc4fd965361e01bba366364ece8d06050e0928ccef89
  size 5841