---
language: vi
tags:
- nlp
- text-classification
- vietnamese
- esg
- sustainability
- banking
library_name: transformers
pipeline_tag: text-classification
license: mit
---

# PhoBERT ESG Topic Classifier for Vietnamese Banking Annual Reports

## Model description

This model is a Vietnamese text classification model fine-tuned from **PhoBERT** to classify sentences from **banking annual reports** into ESG-related topics. It is designed as **Module 2 (ESG Topic Classification)** in an ESG-washing analysis pipeline, where downstream modules assess actionability, evidence support, and report-level ESG-washing risk.

The model predicts one of six labels:

- `E` (Environmental)
- `S_labor` (Social – labor/workforce)
- `S_community` (Social – community/CSR)
- `S_product` (Social – product/customer)
- `G` (Governance)
- `Non_ESG` (not ESG-related)

> Note: The model focuses on **textual disclosure topic classification**, not factual verification of ESG claims.

---

## Intended use

### Primary intended use

- Filtering and categorizing ESG-related sentences in Vietnamese banking annual reports.
- Supporting ESG-washing analysis pipelines (e.g., actionability classification and evidence linking).

### Example downstream usage

- Keep only ESG sentences (`E`, `S_*`, `G`) and discard `Non_ESG` for later actionability/evidence modules.
- Aggregate predicted topics by bank-year to analyze disclosure patterns across ESG pillars.

### Out-of-scope use

- Determining whether a bank is actually "greenwashing/ESG-washing" in the real world.
- Use on domains far from banking annual reports (e.g., social media) without re-validation.
- Legal, compliance, or investment decision-making without human review.
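The downstream filtering step described above can be sketched as plain Python over the model's predictions. The label names come from this card; the `filter_and_group` helper and the sample sentences are hypothetical, not part of the released pipeline:

```python
from collections import defaultdict

# The five ESG classes from this model card; everything else is Non_ESG.
ESG_LABELS = {"E", "S_labor", "S_community", "S_product", "G"}

def filter_and_group(predictions):
    """Keep ESG sentences and group them by predicted topic.

    `predictions` is a list of (sentence, label) pairs, where `label`
    is one of the six classes predicted by this model.
    """
    by_topic = defaultdict(list)
    for sentence, label in predictions:
        if label in ESG_LABELS:  # discard Non_ESG before downstream modules
            by_topic[label].append(sentence)
    return dict(by_topic)

# Hypothetical Module 2 outputs:
preds = [
    ("Ngân hàng giảm phát thải CO2.", "E"),
    ("Báo cáo tài chính quý IV.", "Non_ESG"),
    ("Chương trình đào tạo nhân viên.", "S_labor"),
]
grouped = filter_and_group(preds)
print(grouped)  # only E and S_labor sentences survive
```

The grouped output can then feed the actionability and evidence-linking modules one topic at a time.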
---

## Training data

The model was trained using a **hybrid labeling strategy**:

- **LLM pre-labels** (teacher) to bootstrap semantic topic boundaries
- **Weak labeling rules** (filter) to override trivial non-ESG content with high precision
- A **manually annotated gold set** used for calibration and evaluation

Hybrid label sources:

- `llm`: 2,897 samples (LLM-only)
- `llm_weak_agree`: 2,083 samples (LLM and weak labels agree; higher confidence)

Total labeled samples for training/validation: **4,980**

- Train: **4,233**
- Validation: **747**

Gold set (manual) for the final test: **500** samples, balanced across labels.

---

## Training procedure

- Base model: PhoBERT, fine-tuned with a 6-class classification head.
- Objective: cross-entropy loss with a class-balancing strategy.
- Context-aware input: sentence-level classification with a local context window from the corpus (`prev + sent + next`), applied depending on block type.

---

## Evaluation results

### Validation set (747 samples)

- Macro-F1: **0.8598**
- Micro-F1: **0.8635**
- Weighted-F1: **0.8628**

Per-class (validation):

| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| E | 0.8310 | 0.8806 | 0.8551 | 67 |
| S_labor | 0.9000 | 0.8675 | 0.8834 | 83 |
| S_community | 0.8732 | 0.8611 | 0.8671 | 72 |
| S_product | 0.8426 | 0.8922 | 0.8667 | 102 |
| G | 0.8372 | 0.7606 | 0.7970 | 142 |
| Non_ESG | 0.8785 | 0.9004 | 0.8893 | 281 |

### Gold test set (500 samples)

- Macro-F1: **0.9665**
- Micro-F1: **0.9660**

Per-class (gold):

| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| E | 0.9872 | 0.9625 | 0.9747 | 80 |
| S_labor | 0.9873 | 0.9750 | 0.9811 | 80 |
| S_community | 0.9634 | 0.9875 | 0.9753 | 80 |
| S_product | 0.9506 | 0.9625 | 0.9565 | 80 |
| G | 0.9659 | 0.9444 | 0.9551 | 90 |
| Non_ESG | 0.9457 | 0.9667 | 0.9560 | 90 |

> Note: The gold test set is balanced and may not reflect real-world class frequencies in annual reports. Always validate on your target corpus.
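The card does not specify which class-balancing strategy was used; a common choice is inverse-frequency class weights passed to the cross-entropy loss. A minimal sketch, assuming that approach (the label counts below are illustrative, not the actual training distribution):

```python
def class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * count_c).

    A majority class gets weight < 1 and a minority class weight > 1,
    so the cross-entropy loss is not dominated by frequent labels.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

# Illustrative counts only (not the real training distribution):
counts = {"E": 400, "S_labor": 500, "S_community": 450,
          "S_product": 600, "G": 800, "Non_ESG": 1483}
weights = class_weights(counts)
# In a PyTorch training loop, these values would be arranged in label
# order and passed as a tensor to torch.nn.CrossEntropyLoss(weight=...).
print(weights)
```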
---

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "YOUR_USERNAME/YOUR_MODEL_REPO"  # replace with the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

labels = ["E", "S_labor", "S_community", "S_product", "G", "Non_ESG"]

text = "Ngân hàng đã triển khai chương trình giảm phát thải và tiết kiệm năng lượng trong năm 2024."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
pred_id = int(probs.argmax())
print(labels[pred_id], float(probs[pred_id]))
```

---

## Limitations

- The model is trained on the language and structure of Vietnamese banking annual reports; performance may degrade on other domains.
- ESG boundaries can be ambiguous; some governance-related financial-risk text may be misclassified without domain adaptation.
- The model does not verify the truthfulness of ESG claims; it only categorizes topics based on text.
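The bank-year aggregation mentioned under intended use can be sketched as a simple count over sentence-level predictions. The `topic_counts_by_bank_year` helper, record layout, and bank names are hypothetical:

```python
from collections import Counter, defaultdict

def topic_counts_by_bank_year(records):
    """Count predicted topics per (bank, year).

    `records` is an iterable of (bank, year, predicted_label) tuples,
    one per classified sentence.
    """
    agg = defaultdict(Counter)
    for bank, year, label in records:
        agg[(bank, year)][label] += 1
    return agg

# Hypothetical sentence-level predictions:
records = [
    ("BankA", 2024, "E"), ("BankA", 2024, "E"),
    ("BankA", 2024, "G"), ("BankB", 2024, "Non_ESG"),
]
agg = topic_counts_by_bank_year(records)
print(agg[("BankA", 2024)]["E"])  # 2
```

Comparing these per-pillar counts across years (or against peer banks) is one way to surface disclosure patterns before the actionability and evidence modules run.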