Jidi1997 committed · Commit cef9a6e · verified · 1 Parent(s): 9a73d3e

Update README.md

Files changed (1)
  1. README.md +117 -42
README.md CHANGED
@@ -13,65 +13,116 @@ metrics:
13
  - accuracy
14
  ---
15
 
16
- # Green Shareholder Proposal Classifier
17
 
18
- ## Model Summary
19
 
20
- This model is a fine-tuned version of [`climatebert/distilroberta-base-climate-detector`](https://huggingface.co/climatebert/distilroberta-base-climate-detector), specifically designed to classify **shareholder proposals** into binary categories: green (climate/environmental) or non-green.
21
 
22
  It was trained on a highly curated dataset of Institutional Shareholder Services (ISS) proposals, achieving an **F1 score of 0.981** on the validation set.
23
 
24
- ## Model Details
25
 
26
- - **Base Model:** `climatebert/distilroberta-base-climate-detector`
27
- - **Task:** Binary Sequence Classification
28
- - `Label 1`: Green / Climate-related proposal
29
- - `Label 0`: Non-green proposal
30
- - **Language:** English
31
- - **License:** Apache 2.0 (Model weights). *Note: The dataset used for fine-tuning contains derived data subject to ISS licensing terms.*
32
 
33
- ## Uses
34
 
35
- ### Direct Use
36
- The model takes a structured text input describing a shareholder proposal and predicts whether it is conceptually focused on climate change or environmental sustainability.
37
 
38
- **Recommended Input Format:**
39
  To achieve optimal performance, input text should mirror the structure of the training data:
40
- > "A {sponsor_type}-type sponsor has filed a shareholder proposal to a(an) {sic2_des}-sector company. This proposal requests: {resolution}. [It falls under a broader agenda class that may include items not directly relevant to this specific proposal: {AgendaCodeInformation}]"
41
 
42
- ### Out-of-Scope Use
43
- - Applying the model to non-English texts.
44
- - Using the model for automated legal or compliance decision-making without human oversight.
45
- - Generalizing to broad ESG topics outside of strict environmental/climate scopes (e.g., social or governance issues like gender equality or animal welfare are explicitly trained as negative classes).
46
 
47
- ## Training Data
48
 
49
- The model was fine-tuned on a custom stratified dataset of 1,500 manually curated ISS shareholder proposals. The dataset underwent rigorous rule-based correction to exclude tangentially environmental or purely social/governance proposals.
50
 
51
- For full details on data sampling, text construction, and labeling rules, please refer to the **[Dataset Card](fill in your dataset link here)**.
52
 
53
- - **Train split:** 1,200 examples
54
- - **Validation split:** 300 examples
55
 
56
- ## Training Procedure
57
 
58
- ### Hyperparameters
59
 
60
- The model was trained using the Hugging Face `Trainer` API with the following hyperparameters:
61
 
62
- - **Learning rate:** 2e-05
63
- - **Train batch size:** 16
64
- - **Eval batch size:** 16
65
- - **Seed:** 42
66
- - **Weight decay:** 0.05
67
- - **Optimizer:** AdamW
68
- - **Number of epochs:** 10
69
 
70
- ### Training Results
71
 
72
- The model weights from **Epoch 8 (`checkpoint-600`)** were selected as the best performing based on the validation F1 score.
73
 
74
- | Epoch | Training Loss | Validation Loss | Accuracy | F1 (Binary) |
75
  |:---:|:---:|:---:|:---:|:---:|
76
  | 1 | 0.3060 | 0.0968 | 0.9667 | 0.9675 |
77
  | 2 | 0.0954 | 0.0898 | 0.9733 | 0.9740 |
@@ -80,15 +131,39 @@ The model weights from **Epoch 8 (`checkpoint-600`)** were selected as the best
80
  | 5 | 0.0395 | 0.1026 | 0.9800 | 0.9803 |
81
  | 6 | 0.0350 | 0.1308 | 0.9733 | 0.9744 |
82
  | 7 | 0.0094 | 0.1108 | 0.9767 | 0.9772 |
83
- | **8** | **0.0003** | **0.1182** | **0.9800** | **0.9806** |
84
  | 9 | 0.0004 | 0.1154 | 0.9767 | 0.9773 |
85
  | 10 | 0.0002 | 0.1229 | 0.9767 | 0.9773 |
86
 
87
- ## Limitations and Bias
88
 
89
- While the model achieves high accuracy on the validation set, its performance is tightly coupled with the specific linguistic patterns and taxonomy of the ISS database (e.g., SIC-2 sector descriptions, ISS agenda codes). It may exhibit lower confidence or accuracy when processing unstructured news articles, raw corporate filings, or proposals from different jurisdictional contexts outside the US/global norm represented in the training set.
90
 
91
- ## Citation
92
 
93
  If you use this model in your research, please cite the associated working paper:
94
- *(Citation details forthcoming)*
 
13
  - accuracy
14
  ---
15
 
16
+ <div align="center">
17
 
18
+ # 🌿 Green Shareholder Proposal Classifier
19
 
20
+ <p align="center">
21
+ <img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=for-the-badge&logo=apache" alt="License"/>
22
+ <img src="https://img.shields.io/badge/Language-English-blue?style=for-the-badge&logo=googletranslate&logoColor=white" alt="Language"/>
23
+ <img src="https://img.shields.io/badge/F1%20Score-0.981-brightgreen?style=for-the-badge&logo=checkmarx&logoColor=white" alt="F1 Score"/>
24
+ <img src="https://img.shields.io/badge/Task-Text%20Classification-orange?style=for-the-badge&logo=openai&logoColor=white" alt="Task"/>
25
+ <img src="https://img.shields.io/badge/Domain-ESG%20%7C%20Climate%20Finance-teal?style=for-the-badge&logo=leaflet&logoColor=white" alt="Domain"/>
26
+ </p>
27
+
28
+ *A fine-tuned NLP model for classifying climate-related shareholder proposals, with a validation F1 score of 0.981.*
29
+
30
+ </div>
31
+
32
+ ---
33
+
34
+ ## 📋 Model Summary
35
+
36
+ This model is a fine-tuned version of [`climatebert/distilroberta-base-climate-detector`](https://huggingface.co/climatebert/distilroberta-base-climate-detector), specifically designed to classify **shareholder proposals** into binary categories: **green** (climate/environmental) or **non-green**.
37
 
38
  It was trained on a highly curated dataset of Institutional Shareholder Services (ISS) proposals, achieving an **F1 score of 0.981** on the validation set.
39
 
40
+ > 💡 **Designed for researchers and practitioners** in sustainable finance, ESG analysis, and corporate governance.
41
+
42
+ ---
43
+
44
+ ## 🔍 Model Details
45
 
46
+ | Property | Value |
47
+ |:---|:---|
48
+ | 🧠 **Base Model** | `climatebert/distilroberta-base-climate-detector` |
49
+ | 🎯 **Task** | Binary Sequence Classification |
50
+ | 🌐 **Language** | English |
51
+ | 📄 **License** | Apache 2.0 *(model weights)* |
52
 
53
+ ### 🏷️ Label Schema
54
 
55
+ | Label | Description |
56
+ |:---:|:---|
57
+ | `1` | ✅ Green / Climate-related proposal |
58
+ | `0` | ❌ Non-green proposal |
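
As a rough illustration of the label schema, the sketch below turns a pair of raw class scores into a label and a probability. It assumes the classification head returns logits ordered by label id (index 0 for label `0`, index 1 for label `1`), which the card does not state explicitly. The names `ID2LABEL` and `decode_prediction` and the example logits are hypothetical.

```python
import math

# Label ids follow the label schema above; the string names are illustrative.
ID2LABEL = {0: "non-green", 1: "green"}


def decode_prediction(logits):
    """Convert raw class scores into (label, probability) using softmax."""
    top = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Assumption: position i in the logits corresponds to label id i.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits, used only to exercise the function.
label, prob = decode_prediction([-2.0, 3.0])
print(label, round(prob, 4))
```
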
59
+
60
+ ---
61
+
62
+ ## 🚀 Uses
63
+
64
+ ### ✅ Direct Use
65
+
66
+ The model takes a structured text input describing a shareholder proposal and predicts whether it is conceptually focused on climate change or environmental sustainability.
67
+
68
+ **📌 Recommended Input Format**
69
 
 
70
  To achieve optimal performance, input text should mirror the structure of the training data:
 
71
 
72
+ ```
73
+ "A(An) {sponsor_type}-type sponsor has filed a shareholder proposal to a(an)
74
+ {sic2_des}-sector company. This proposal requests: {resolution}.
75
+ [It falls under a broader agenda class that may include items not directly
76
+ relevant to this specific proposal: {AgendaCodeInformation}]"
77
+ ```
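
One way to build inputs in the recommended format is sketched below. The helper name, its parameter names, and the sample values are hypothetical. The sketch assumes that the square brackets in the template mark an optional clause, that the surrounding quotation marks are not part of the input, and that the template is a single line (as in the previous version of the README).

```python
from typing import Optional

# Clause appended only when agenda-code information is available.
AGENDA_CLAUSE = (
    "It falls under a broader agenda class that may include items not "
    "directly relevant to this specific proposal: {agenda}"
)


def build_proposal_text(
    sponsor_type: str,
    sic2_des: str,
    resolution: str,
    agenda_info: Optional[str] = None,
) -> str:
    """Fill the recommended template and return a single-line string."""
    text = (
        f"A(An) {sponsor_type.strip()}-type sponsor has filed a shareholder "
        f"proposal to a(an) {sic2_des.strip()}-sector company. "
        f"This proposal requests: {resolution.strip()}."
    )
    # Assumption: the bracketed part of the template is optional, so it is
    # added without brackets and only when agenda information is given.
    if agenda_info:
        text += " " + AGENDA_CLAUSE.format(agenda=agenda_info.strip())
    return text


# Illustrative values only; they are not taken from the training data.
example = build_proposal_text(
    sponsor_type="Individual",
    sic2_des="Oil and Gas Extraction",
    resolution="Report on greenhouse gas emissions reduction targets",
)
print(example)
```

The resulting string could then be passed to a text-classification pipeline loaded from this model's repository. The repository id is not stated in this diff, so that step is left out here.
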
78
+
79
+ ### ⚠️ Out-of-Scope Use
80
 
81
+ The following use cases are **not recommended**:
82
 
83
+ - 🚫 Applying the model to **non-English** texts
84
+ - 🚫 Using the model for **automated legal or compliance decision-making** without human oversight
85
+ - 🚫 Generalizing to **broad ESG topics** outside of strict environmental/climate scopes *(e.g., social or governance issues like gender equality or animal welfare are explicitly trained as negative classes)*
86
 
87
+ ---
88
 
89
+ ## 📦 Training Data
 
90
 
91
+ <div align="center">
92
 
93
+ | Split | Examples |
94
+ |:---:|:---:|
95
+ | 🏋️ Train | 1,200 |
96
+ | 🧪 Validation | 300 |
97
+ | **Total** | **1,500** |
98
 
99
+ </div>
100
 
101
+ The model was fine-tuned on a custom **stratified dataset of 1,500 manually curated ISS shareholder proposals**. The dataset underwent rigorous rule-based correction to exclude tangentially environmental or purely social/governance proposals.
 

 
 
 
 
 
102
 
103
+ 📂 For full details on data sampling, text construction, and labeling rules, please refer to the **[gprop_training_dataset](https://huggingface.co/datasets/Jidi1997/gprop_training_dataset)**.
104
 
105
+ ---
106
 
107
+ ## ⚙️ Training Procedure
108
+
109
+ ### 🔧 Hyperparameters
110
+
111
+ | Hyperparameter | Value |
112
+ |:---|:---:|
113
+ | 📐 Learning Rate | `2e-05` |
114
+ | 📦 Train Batch Size | `16` |
115
+ | 📦 Eval Batch Size | `16` |
116
+ | 🎲 Seed | `42` |
117
+ | ⚖️ Weight Decay | `0.05` |
118
+ | 🔁 Optimizer | AdamW |
119
+ | 🔄 Epochs | `10` |
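
The table above can be written as a plain keyword dictionary, sketched below. The key names follow the parameters of `transformers.TrainingArguments`. Two points are assumptions rather than facts from the card: the batch sizes are treated as per-device values on a single device, and "AdamW" is mapped to the PyTorch implementation (`adamw_torch`).

```python
# Values are copied from the hyperparameter table; key names match
# transformers.TrainingArguments parameters.
training_kwargs = {
    "learning_rate": 2e-5,
    # Assumption: the reported batch sizes are per-device, single-device values.
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "weight_decay": 0.05,
    "num_train_epochs": 10,
    # Assumption: "AdamW" refers to the PyTorch AdamW implementation.
    "optim": "adamw_torch",
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

With `transformers` installed, this dictionary could be unpacked into `TrainingArguments(output_dir=..., **training_kwargs)`, where the output directory is left to the user.
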
120
+
121
+ ### 📈 Training Results
122
+
123
+ The model weights from **Epoch 8 (`checkpoint-600`)** were selected as the best performing checkpoint based on the validation F1 score.
124
+
125
+ | Epoch | Train Loss | Val Loss | Accuracy | F1 (Binary) |
126
  |:---:|:---:|:---:|:---:|:---:|
127
  | 1 | 0.3060 | 0.0968 | 0.9667 | 0.9675 |
128
  | 2 | 0.0954 | 0.0898 | 0.9733 | 0.9740 |
 
131
  | 5 | 0.0395 | 0.1026 | 0.9800 | 0.9803 |
132
  | 6 | 0.0350 | 0.1308 | 0.9733 | 0.9744 |
133
  | 7 | 0.0094 | 0.1108 | 0.9767 | 0.9772 |
134
+ | **8** | **0.0003** | **0.1182** | **0.9800** | **0.9806** |
135
  | 9 | 0.0004 | 0.1154 | 0.9767 | 0.9773 |
136
  | 10 | 0.0002 | 0.1229 | 0.9767 | 0.9773 |
137
 
138
+ > **Best checkpoint selected at Epoch 8** — highest validation F1 of **0.9806**
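
The checkpoint choice can be cross-checked with a little arithmetic. The sketch below picks the epoch with the highest validation F1 among the rows visible in the table (epochs 3 and 4 are not shown in this diff and are left out) and relates it to the checkpoint step. It assumes a single device and no gradient accumulation, so that one optimizer step consumes 16 training examples.

```python
import math

# Validation F1 per epoch, copied from the rows visible in the table above.
# Epochs 3 and 4 do not appear in this diff, so they are omitted.
val_f1 = {
    1: 0.9675,
    2: 0.9740,
    5: 0.9803,
    6: 0.9744,
    7: 0.9772,
    8: 0.9806,
    9: 0.9773,
    10: 0.9773,
}
best_epoch = max(val_f1, key=val_f1.get)

train_examples = 1200  # train split size from the card
batch_size = 16        # train batch size from the card
# Assumption: single device, no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
checkpoint_step = best_epoch * steps_per_epoch

print(best_epoch, steps_per_epoch, checkpoint_step)
```

Under these assumptions, the step count at the end of the best epoch matches the `checkpoint-600` name reported in the card.
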
139
 
140
+ ---
141
+
142
+ ## ⚠️ Limitations and Bias
143
 
144
+ While the model achieves high accuracy on the validation set, several limitations should be noted:
145
+
146
+ - 🔗 **Domain dependency** — Performance is tightly coupled with the specific linguistic patterns and taxonomy of the ISS database *(e.g., SIC-2 sector descriptions, ISS agenda codes)*
147
+ - 📰 **Unstructured text** — Lower confidence or accuracy is expected when processing unstructured news articles or raw corporate filings
148
+ - 🌍 **Jurisdictional scope** — The model may not generalize well to proposals from jurisdictions outside the US/global norm represented in the training set
149
+
150
+ ---
151
+
152
+ ## 📚 Citation
153
 
154
  If you use this model in your research, please cite the associated working paper:
155
+
156
+ ```bibtex
157
+ @misc{gprop_classifier,
158
+ title = {Green Shareholder Proposal Classifier},
159
+ note = {Citation details forthcoming},
160
+ }
161
+ ```
162
+
163
+ ---
164
+
165
+ <div align="center">
166
+
167
+ *Built on top of [ClimateBERT](https://huggingface.co/climatebert) · Trained with 🤗 Hugging Face Transformers*
168
+
169
+ </div>