Upload ESG Topic Classifier

Files changed (4)

README.md +92 -93
config.json +15 -14
model.safetensors +2 -2
training_args.bin +1 -1

README.md CHANGED

@@ -1,144 +1,143 @@
 ---
-language: ['vi']
-license: mit
-tags: ['text-classification', 'phobert', 'vietnamese', 'esg', 'sustainability']
-metrics:
-- macro-f1
-- accuracy
 ---
-# PhoBERT ESG Topic Classifier (Vietnamese Banking Reports)
-A fine-tuned [vinai/phobert-base-v2](https://huggingface.co/vinai/phobert-base-v2) model for **sentence-level ESG topic classification** in Vietnamese banking reports.
 ---
-## Model Description
-This model classifies Vietnamese sentences extracted from **banking annual and sustainability reports** into **6 ESG-related topic categories**:
-- **Non-ESG**: General business, financial, or operational content not related to ESG
-- **E (Environmental)**: Environmental topics such as emissions, energy, climate, waste, and resource usage
-- **S (Social)**: Social topics including employees, community, customer protection, health & safety
-- **G (Governance)**: Corporate governance topics such as board structure, compliance, risk management
-- **Policy**: ESG-related strategies, policies, commitments, and frameworks
-- **Financing**: Green or sustainable finance activities (green bonds, sustainable credit, ESG-linked finance)
-The model is designed as **Stage B (Topic Classification)** in a larger ESG-washing analysis pipeline.
 ---
-## Training Data
-- **Source**: Vietnamese banking annual and sustainability reports
-- **Time span**: 2015–2024
-- **Sentence-level corpus** after OCR cleaning and quality filtering
-Dataset splits:
-- **Train**: 926 sentences
-- **Dev**: 127 sentences
-- **Test**: 272 sentences
-All splits are constructed with **bank-year group isolation** to prevent information leakage.
 ---
-## Training Procedure
-- **Base model**: `vinai/phobert-base-v2`
-- **Fine-tuning strategy**: Full fine-tuning
-- **Loss**: Class-weighted CrossEntropyLoss (to address class imbalance)
-- **Optimizer**: AdamW
-  - Learning rate: 2e-05
-  - Weight decay: 0.01
-- **Batch size**: 16
-- **Max sequence length**: 256 tokens
-- **Epochs trained**: 8
-- **Best checkpoint**: Epoch 4 (selected by DEV Macro-F1)
-- **Random seed**: 42
----
-## Evaluation Results
-**Primary metric:** Macro-F1 (robust to class imbalance)
-| Metric     | DEV     | TEST    |
-|------------|---------|---------|
-| Macro-F1   | 0.7214 | 0.7070 |
-| Accuracy   | 0.7874 | 0.7537 |
 ---
-### Per-class Performance (TEST)
-| Label | Precision | Recall | F1 | Support |
-|------|-----------|--------|----|---------|
-| E | 0.789 | 0.857 | 0.822 | 35 |
-| Financing | 0.647 | 0.458 | 0.537 | 24 |
-| G | 0.769 | 0.741 | 0.755 | 54 |
-| Non-ESG | 0.748 | 0.873 | 0.805 | 102 |
-| Policy | 0.692 | 0.562 | 0.621 | 16 |
-| S | 0.788 | 0.634 | 0.703 | 41 |
----
-## Intended Use
-- ESG topic analysis for Vietnamese banking reports
-- Preprocessing step for **ESG-washing detection**
-- Academic research (thesis / paper-level experiments)
 ---
-## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch
-model_name = "huypham71/esg-topic-classifier"
-tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
-model = AutoModelForSequenceClassification.from_pretrained(model_name)
-text = "Ngân hàng cam kết giảm 20% lượng khí thải carbon vào năm 2025."
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
 with torch.no_grad():
-    outputs = model(**inputs)
-    probs = torch.softmax(outputs.logits, dim=-1)
-    pred_id = torch.argmax(probs).item()
-print("Prediction:", model.config.id2label[pred_id])
-print("Confidence:", float(probs[0, pred_id]))
 ```
 ---
 ## Limitations
-Trained specifically on Vietnamese banking reports
-Not intended for other industries or languages
-Some ambiguity exists between Policy, Environmental, and Financing categories due to overlapping ESG discourse
-Minority classes (E, Policy) have fewer samples than Non-ESG and Governance
----
-## Citation
-If you use this model, please cite:
-```bibtex
-@misc{esg-topic-classifier,
-  author = {huypham71},
-  title = {ESG Topic Classifier for Vietnamese Banking Reports},
-  year = {2026},
-  publisher = {Hugging Face},
-  url = {https://huggingface.co/huypham71/esg-topic-classifier}
-}
 ```

 ---
 ---
+language: vi
+tags:
+- nlp
+- text-classification
+- vietnamese
+- esg
+- sustainability
+- banking
+library_name: transformers
+pipeline_tag: text-classification
+license: mit
 ---
+# PhoBERT ESG Topic Classifier for Vietnamese Banking Annual Reports
+## Model description
+This Vietnamese text classification model is fine-tuned from **PhoBERT** to assign sentences from **banking annual reports** to ESG-related topics. It is designed as **Module 2 (ESG Topic Classification)** in an ESG-washing analysis pipeline, where downstream modules assess actionability, evidence support, and report-level ESG-washing risk.
+The model predicts one of six labels:
+- `E` (Environmental)
+- `S_labor` (Social – labor/workforce)
+- `S_community` (Social – community/CSR)
+- `S_product` (Social – product/customer)
+- `G` (Governance)
+- `Non_ESG` (not ESG-related)
+> Note: The model focuses on **textual disclosure topic classification**, not factual verification of ESG claims.
 ---
+## Intended use
+### Primary intended use
+- Filtering and categorizing ESG-related sentences in Vietnamese banking annual reports.
+- Supporting ESG-washing analysis pipelines (e.g., actionability classification and evidence linking).
+### Example downstream usage
+- Keep only ESG sentences (`E`, `S_*`, `G`) and discard `Non_ESG` for later actionability/evidence modules.
+- Aggregate predicted topics by bank-year to analyze disclosure patterns across ESG pillars.
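The two downstream steps listed above can be sketched in plain Python. This is an illustrative sketch only: the record layout (`bank`, `year`, `label`), the sample rows, and the helper name `topic_counts_by_bank_year` are hypothetical and not part of this repository; only the label names come from the model card.

```python
from collections import Counter, defaultdict

# ESG labels from the model card; "Non_ESG" is deliberately excluded.
ESG_LABELS = {"E", "S_labor", "S_community", "S_product", "G"}


def topic_counts_by_bank_year(predictions):
    """Drop Non_ESG rows, then count predicted topics per (bank, year)."""
    counts = defaultdict(Counter)
    for row in predictions:
        if row["label"] in ESG_LABELS:
            counts[(row["bank"], row["year"])][row["label"]] += 1
    return counts


# Hypothetical classifier output, one record per sentence.
predictions = [
    {"bank": "BankA", "year": 2023, "label": "E"},
    {"bank": "BankA", "year": 2023, "label": "Non_ESG"},
    {"bank": "BankA", "year": 2023, "label": "G"},
    {"bank": "BankA", "year": 2023, "label": "E"},
    {"bank": "BankB", "year": 2024, "label": "S_labor"},
]

for key, counter in sorted(topic_counts_by_bank_year(predictions).items()):
    print(key, dict(counter))
```

Grouping on the `(bank, year)` tuple keeps the aggregation independent of how sentences were extracted, so the same helper would work for any upstream sentence splitter.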
+### Out-of-scope use
+- Determining whether a bank is actually “greenwashing/ESG-washing” in the real world.
+- Use on domains far from banking annual reports (e.g., social media) without re-validation.
+- Legal, compliance, or investment decision-making without human review.
 ---
+## Training data
+The model was trained using a **hybrid labeling strategy**:
+- **LLM pre-labels** (teacher) to bootstrap semantic topic boundaries
+- **Weak labeling rules** (filter) to override trivial non-ESG content with high precision
+- A **manually annotated gold set** used for calibration and evaluation
+Hybrid label sources:
+- `llm`: 2,897 samples (LLM-only)
+- `llm_weak_agree`: 2,083 samples (LLM + weak labels agree, higher confidence)
+Total labeled samples for training/validation: **4,980**
+- Train: **4,233**
+- Validation: **747**
+Gold set (manual) for final test: **500** samples, approximately balanced across labels (80 to 90 per label).
+---
+## Training procedure
+- Base model: PhoBERT, fine-tuned with a 6-class classification head.
+- Objective: Cross-entropy loss with a class-balancing strategy.
+- Context-aware input: each sentence is classified together with the local context window available in the corpus (`prev + sent + next`), depending on block type.
 ---
+## Evaluation results
+### Validation set (747 samples)
+- Macro-F1: **0.8598**
+- Micro-F1: **0.8635**
+- Weighted-F1: **0.8628**
+Per-class (validation):
+| Label | Precision | Recall | F1 | Support |
+|---|---:|---:|---:|---:|
+| E | 0.8310 | 0.8806 | 0.8551 | 67 |
+| S_labor | 0.9000 | 0.8675 | 0.8834 | 83 |
+| S_community | 0.8732 | 0.8611 | 0.8671 | 72 |
+| S_product | 0.8426 | 0.8922 | 0.8667 | 102 |
+| G | 0.8372 | 0.7606 | 0.7970 | 142 |
+| Non_ESG | 0.8785 | 0.9004 | 0.8893 | 281 |
+### Gold test set (500 samples)
+- Macro-F1: **0.9665**
+- Micro-F1: **0.9660**
+Per-class (gold):
+| Label | Precision | Recall | F1 | Support |
+|---|---:|---:|---:|---:|
+| E | 0.9872 | 0.9625 | 0.9747 | 80 |
+| S_labor | 0.9873 | 0.9750 | 0.9811 | 80 |
+| S_community | 0.9634 | 0.9875 | 0.9753 | 80 |
+| S_product | 0.9506 | 0.9625 | 0.9565 | 80 |
+| G | 0.9659 | 0.9444 | 0.9551 | 90 |
+| Non_ESG | 0.9457 | 0.9667 | 0.9560 | 90 |
+> Note: The gold test set is balanced and may not reflect real-world class frequencies in annual reports. Always validate on your target corpus.
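The aggregate validation scores can be cross-checked against the per-class table. A minimal sketch, assuming macro-F1 is the unweighted mean of per-class F1 and weighted-F1 is the support-weighted mean (the usual definitions; the commit does not say which implementation produced the reported numbers). The `validation` mapping copies the F1 and Support columns from the validation table, and the helper names are hypothetical.

```python
# Per-class (F1, support) pairs copied from the validation table.
validation = {
    "E": (0.8551, 67),
    "S_labor": (0.8834, 83),
    "S_community": (0.8671, 72),
    "S_product": (0.8667, 102),
    "G": (0.7970, 142),
    "Non_ESG": (0.8893, 281),
}


def macro_f1(rows):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of per-class F1 scores, weighted by class support."""
    total = sum(n for _, n in rows.values())
    return sum(f1 * n for f1, n in rows.values()) / total


print(f"macro-F1    = {macro_f1(validation):.4f}")
print(f"weighted-F1 = {weighted_f1(validation):.4f}")
print("support     =", sum(n for _, n in validation.values()))
```

Under these assumptions the recomputed values agree with the reported 0.8598 and 0.8628 to four decimals, and the supports sum to the stated 747 validation samples.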
 ---
+## How to use
 ```python
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 import torch
+model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"  # replace
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSequenceClassification.from_pretrained(model_id)
+labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]
+text = "Ngân hàng đã triển khai chương trình giảm phát thải và tiết kiệm năng lượng trong năm 2024."
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
 with torch.no_grad():
+    logits = model(**inputs).logits
+    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
+pred = labels[int(torch.tensor(probs).argmax())]
+print(pred, max(probs))
 ```
 ---
 ## Limitations
+The model is trained on Vietnamese banking annual report language and structure; performance may degrade on other domains.
+ESG boundaries can be ambiguous; some governance-related financial-risk text may be misclassified without domain adaptation.
+The model does not verify the truthfulness of ESG claims; it only categorizes topics based on text.
 ```

config.json CHANGED

@@ -7,32 +7,33 @@
   "classifier_dropout": null,
   "dtype": "float32",
   "eos_token_id": 2,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
-  "hidden_size": 768,
   "id2label": {
     "0": "E",
-    "1": "Financing",
-    "2": "G",
-    "3": "Non-ESG",
-    "4": "Policy",
-    "5": "S"
   },
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
   "label2id": {
     "E": 0,
-    "Financing": 1,
-    "G": 2,
-    "Non-ESG": 3,
-    "Policy": 4,
-    "S": 5
   },
   "layer_norm_eps": 1e-05,
   "max_position_embeddings": 258,
   "model_type": "roberta",
-  "num_attention_heads": 12,
-  "num_hidden_layers": 12,
   "pad_token_id": 1,
   "position_embedding_type": "absolute",
   "tokenizer_class": "PhobertTokenizer",

   "classifier_dropout": null,
   "dtype": "float32",
   "eos_token_id": 2,
+  "gradient_checkpointing": false,
   "hidden_act": "gelu",
   "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
   "id2label": {
     "0": "E",
+    "1": "S_labor",
+    "2": "S_community",
+    "3": "S_product",
+    "4": "G",
+    "5": "Non_ESG"
   },
   "initializer_range": 0.02,
+  "intermediate_size": 4096,
   "label2id": {
     "E": 0,
+    "G": 4,
+    "Non_ESG": 5,
+    "S_community": 2,
+    "S_labor": 1,
+    "S_product": 3
   },
   "layer_norm_eps": 1e-05,
   "max_position_embeddings": 258,
   "model_type": "roberta",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
   "pad_token_id": 1,
   "position_embedding_type": "absolute",
   "tokenizer_class": "PhobertTokenizer",

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2b3e053f61dd4786efe1c9b113ad1fb96a5e41596e01c4bee0ce8c21bfe23c32
-size 540035688

 version https://git-lfs.github.com/spec/v1
+oid sha256:6c4212cd4c946a8715b853fdb1f984586f571d01ee8598ebaa6fad8b57a5ccdb
+size 1476725928
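Read together, the `config.json` and `model.safetensors` hunks can be summarized with a little arithmetic. A back-of-the-envelope sketch, assuming every tensor is stored as 4-byte `float32` (the `dtype` shown in `config.json`) and ignoring the small safetensors header, so the parameter counts are slight overestimates. `OLD` and `NEW` are hand-copied excerpts of the diff, not the full config, and the helper names are hypothetical.

```python
# Architecture fields and LFS sizes copied from the diff hunks.
OLD = {"hidden_size": 768, "intermediate_size": 3072,
       "num_attention_heads": 12, "num_hidden_layers": 12}
NEW = {"hidden_size": 1024, "intermediate_size": 4096,
       "num_attention_heads": 16, "num_hidden_layers": 24,
       "gradient_checkpointing": False}
OLD_SIZE_BYTES = 540_035_688
NEW_SIZE_BYTES = 1_476_725_928
BYTES_PER_FLOAT32 = 4


def changed_keys(old, new):
    """Map each differing key to (old_value, new_value); a missing key shows as None."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


def head_dim(cfg):
    """Width of a single attention head."""
    return cfg["hidden_size"] // cfg["num_attention_heads"]


def approx_params(n_bytes, bytes_per_param=BYTES_PER_FLOAT32):
    """Rough parameter count implied by a checkpoint size."""
    return n_bytes // bytes_per_param


for key, (before, after) in changed_keys(OLD, NEW).items():
    print(f"{key}: {before} -> {after}")
print("head dim:", head_dim(OLD), "->", head_dim(NEW))
print(f"params: ~{approx_params(OLD_SIZE_BYTES) / 1e6:.0f}M -> "
      f"~{approx_params(NEW_SIZE_BYTES) / 1e6:.0f}M")
print(f"size ratio: {NEW_SIZE_BYTES / OLD_SIZE_BYTES:.2f}x")
```

The per-head width stays at 64 while depth doubles from 12 to 24 layers, and the checkpoint grows from roughly 135M to roughly 369M parameters. That is consistent with, though not proof of, a switch from a base-sized to a large-sized PhoBERT encoder; the README in this commit does not name the variant.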

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:72b7800b938f17e70d00e0722489091060ea5af2edf3fe49031b99426c345276
 size 5841

 version https://git-lfs.github.com/spec/v1
+oid sha256:897a8b6f27368894909efc4fd965361e01bba366364ece8d06050e0928ccef89
 size 5841