---
language: vi
tags:
- nlp
- text-classification
- vietnamese
- esg
- sustainability
- banking
library_name: transformers
pipeline_tag: text-classification
license: mit
---

# PhoBERT ESG Topic Classifier for Vietnamese Banking Annual Reports
## Model description

This model is a Vietnamese text classification model fine-tuned from **PhoBERT** to classify sentences from **banking annual reports** into ESG-related topics. It is designed as **Module 2 (ESG Topic Classification)** in an ESG-washing analysis pipeline, where downstream modules assess actionability, evidence support, and report-level ESG-washing risk.

The model predicts one of six labels:
- `E` (Environmental)
- `S_labor` (Social – labor/workforce)
- `S_community` (Social – community/CSR)
- `S_product` (Social – product/customer)
- `G` (Governance)
- `Non_ESG` (not ESG-related)

> Note: The model focuses on **textual disclosure topic classification**, not factual verification of ESG claims.

---

## Intended use

### Primary intended use
- Filtering and categorizing ESG-related sentences in Vietnamese banking annual reports.
- Supporting ESG-washing analysis pipelines (e.g., actionability classification and evidence linking).

### Example downstream usage
- Keep only ESG sentences (`E`, `S_*`, `G`) and discard `Non_ESG` before the later actionability/evidence modules.
- Aggregate predicted topics by bank-year to analyze disclosure patterns across ESG pillars.
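The two downstream steps can be sketched in plain Python. The prediction tuples, bank names, and aggregation structure below are illustrative assumptions; in practice the labels come from running the classifier over report sentences:

```python
from collections import Counter

# Hypothetical per-sentence predictions: (bank, year, predicted_label).
predictions = [
    ("BankA", 2024, "E"),
    ("BankA", 2024, "Non_ESG"),
    ("BankA", 2024, "S_labor"),
    ("BankB", 2024, "G"),
    ("BankB", 2024, "Non_ESG"),
]

ESG_LABELS = {"E", "S_labor", "S_community", "S_product", "G"}

# Step 1: keep only ESG sentences for the actionability/evidence modules.
esg_only = [p for p in predictions if p[2] in ESG_LABELS]

# Step 2: aggregate predicted topics per bank-year to compare ESG pillars.
by_bank_year = {}
for bank, year, label in esg_only:
    by_bank_year.setdefault((bank, year), Counter())[label] += 1

print(by_bank_year)
```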
### Out-of-scope use
- Determining whether a bank is actually “greenwashing/ESG-washing” in the real world.
- Use on domains far from banking annual reports (e.g., social media) without re-validation.
- Legal, compliance, or investment decision-making without human review.

---

## Training data

The model was trained using a **hybrid labeling strategy**:
- **LLM pre-labels** (teacher) to bootstrap semantic topic boundaries
- **Weak labeling rules** (filter) to override trivial non-ESG content with high precision
- A **manually annotated gold set** used for calibration and evaluation

Hybrid label sources:
- `llm`: 2,897 samples (LLM-only)
- `llm_weak_agree`: 2,083 samples (LLM and weak labels agree; higher confidence)

Total labeled samples for training/validation: **4,980**
- Train: **4,233**
- Validation: **747**

Gold set (manual) for final test: **500** samples, balanced across labels.
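A minimal sketch of how the two automatic label sources could be combined. The exact rules are not published here; treating `weak_label=None` as "no rule fired" and resolving disagreements in favour of the high-precision weak rule are assumptions:

```python
def combine_labels(llm_label, weak_label):
    """Combine an LLM pre-label with a weak-rule label (illustrative sketch).

    Assumed convention: weak rules fire only on high-precision patterns
    (mostly trivial Non_ESG content), and weak_label is None when no rule
    matches.
    """
    if weak_label is None:
        return llm_label, "llm"
    if weak_label == llm_label:
        return llm_label, "llm_weak_agree"
    # On disagreement, the high-precision weak rule overrides the LLM.
    return weak_label, "weak_override"

print(combine_labels("E", None))       # ("E", "llm")
print(combine_labels("G", "G"))        # ("G", "llm_weak_agree")
print(combine_labels("E", "Non_ESG"))  # ("Non_ESG", "weak_override")
```

Only the first two source tags (`llm`, `llm_weak_agree`) appear in the released training split above.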
---

## Training procedure
- Base model: PhoBERT, fine-tuned with a 6-class classification head.
- Objective: cross-entropy loss with a class-balancing strategy.
- Context-aware input: sentence-level classification with a local context window from the corpus (`prev + sent + next`), depending on block type.
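The context-aware input can be built by concatenating each sentence with its immediate neighbours. The whitespace join and the boundary handling in this sketch are assumptions; adapt them to how your corpus stores sentence blocks:

```python
def build_context_input(sentences, i, use_context=True):
    """Return the classifier input for sentence i, optionally with its
    immediate neighbours (prev + sent + next). Sentences at block
    boundaries simply omit the missing neighbour."""
    if not use_context:
        return sentences[i]
    prev_s = sentences[i - 1] if i > 0 else ""
    next_s = sentences[i + 1] if i + 1 < len(sentences) else ""
    return " ".join(part for part in (prev_s, sentences[i], next_s) if part)

sents = ["Câu trước.", "Câu cần phân loại.", "Câu sau."]
print(build_context_input(sents, 1))
# → "Câu trước. Câu cần phân loại. Câu sau."
```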
---

## Evaluation results

### Validation set (747 samples)
- Macro-F1: **0.8598**
- Micro-F1: **0.8635**
- Weighted-F1: **0.8628**

Per-class (validation):

| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| E | 0.8310 | 0.8806 | 0.8551 | 67 |
| S_labor | 0.9000 | 0.8675 | 0.8834 | 83 |
| S_community | 0.8732 | 0.8611 | 0.8671 | 72 |
| S_product | 0.8426 | 0.8922 | 0.8667 | 102 |
| G | 0.8372 | 0.7606 | 0.7970 | 142 |
| Non_ESG | 0.8785 | 0.9004 | 0.8893 | 281 |

### Gold test set (500 samples)
- Macro-F1: **0.9665**
- Micro-F1: **0.9660**

Per-class (gold):

| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| E | 0.9872 | 0.9625 | 0.9747 | 80 |
| S_labor | 0.9873 | 0.9750 | 0.9811 | 80 |
| S_community | 0.9634 | 0.9875 | 0.9753 | 80 |
| S_product | 0.9506 | 0.9625 | 0.9565 | 80 |
| G | 0.9659 | 0.9444 | 0.9551 | 90 |
| Non_ESG | 0.9457 | 0.9667 | 0.9560 | 90 |

> Note: The gold test set is balanced and may not reflect real-world class frequencies in annual reports. Always validate on your target corpus.

---
## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"  # replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]

# "The bank implemented an emissions-reduction and energy-saving program in 2024."
text = "Ngân hàng đã triển khai chương trình giảm phát thải và tiết kiệm năng lượng trong năm 2024."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
pred_id = int(probs.argmax())
print(labels[pred_id], float(probs[pred_id]))
```

---
|
| | ## Limitations |
| |
|
| | The model is trained on Vietnamese banking annual report language and structure; performance may degrade on other domains. |
| |
|
| | ESG boundaries can be ambiguous; some governance-related financial-risk text may be misclassified without domain adaptation. |
| |
|
| | The model does not verify the truthfulness of ESG claims; it only categorizes topics based on text. |
| |
|
| | ``` |
| | |
| | |