---
license: apache-2.0
language:
- en
tags:
- climate
- ESG
- sustainable-finance
- sequence-classification
base_model: climatebert/distilroberta-base-climate-detector
metrics:
- f1
- accuracy
---

<div align="center">

# 🌿 Green Shareholder Proposal Detector

<p align="center">
  <img src="https://img.shields.io/badge/License-Apache%202.0-green.svg?style=for-the-badge&logo=apache" alt="License"/>
  <img src="https://img.shields.io/badge/Language-English-blue?style=for-the-badge&logo=googletranslate&logoColor=white" alt="Language"/>
  <img src="https://img.shields.io/badge/Task-Text%20Classification-orange?style=for-the-badge&logo=openai&logoColor=white" alt="Task"/>
  <img src="https://img.shields.io/badge/Domain-ESG%20%7C%20Climate%20Finance-teal?style=for-the-badge&logo=leaflet&logoColor=white" alt="Domain"/>
</p>

*A fine-tuned ClimateBERT language model that detects "greenness" in shareholder proposals.*

</div>

---

## Model Summary

Shareholder resolutions are often terse and semantically ambiguous when read in isolation.
Consider a proposal requesting a report on **water risk management**: this may refer to
environmental water stress (a climate risk) or to the human right to water access (a social
issue). Such overlaps are pervasive in ESG discourse, where the same terminology routinely
spans environmental, social, and governance dimensions.

This model is a fine-tuned version of [ClimateBERT](https://huggingface.co/climatebert/distilroberta-base-climate-detector),
specifically engineered to classify shareholder proposals as **green** (climate/environmental)
or **non-green**. It is trained to resolve precisely this kind of ambiguity: rather than
surface-matching sustainability keywords, it learns to identify the **underlying environmental
intent** of a proposal from its full contextual framing.

As a result, the model is robust against false positives induced by generic ESG buzzwords
(terms such as *neutrality*, *waste*, or *water* that frequently appear across non-environmental
proposals) and maintains high precision in **mixed-ESG contexts** where environmental and
social/governance themes co-occur.

> **Designed for:** Extracting environmental signal from noisy, multi-topic ESG disclosures.

---

## Usage

### Quick Start

Install dependencies first (the example below also imports `datasets` and `tqdm`):
```bash
pip install transformers torch datasets tqdm
```

Then run the following:
```python
import datasets
import torch
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from transformers.pipelines.pt_utils import KeyDataset

# -- Model --------------------------------------------------------------------
model_name = "Jidi1997/ClimateBERT_GPROP_Detector"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)

# Use the first GPU when one is available, otherwise fall back to CPU (device=-1)
device = 0 if torch.cuda.is_available() else -1
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)

# -- Data ---------------------------------------------------------------------
# Option A: Load your own dataset from a local CSV / JSON file
# dataset = datasets.load_dataset("csv", data_files="your_proposals.csv", split="train")

# Option B: Construct proposals inline using the recommended input format
# Each entry should follow the structure below for best performance:
#   "A(An) {sponsor_type}-type sponsor has filed a shareholder proposal to a(an)
#    {sic2_des}-sector company. This proposal requests: {resolution}.
#    It falls under a broader agenda class that may include items not directly
#    relevant to this specific proposal: {AgendaCodeInformation}"

dataset = datasets.Dataset.from_dict({"text": [
    # Replace with your own proposals following the recommended input format above
    """A(An) institutional-type sponsor has filed a shareholder proposal to a(an)
energy-sector company. This proposal requests: the company to issue a report
on its greenhouse gas emissions reduction targets.
It falls under a broader agenda class: '...'""",
]})

# -- Inference ----------------------------------------------------------------
# label='yes' -> Green proposal (Label 1)
# label='no'  -> Non-green proposal (Label 0)
for out in tqdm(pipe(KeyDataset(dataset, "text"), padding=True, truncation=True), total=len(dataset)):
    print(out)
```
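
Each item yielded by a `text-classification` pipeline is a dictionary with a `label` and a `score`. The sketch below shows one way to turn those outputs into the 0/1 labels described in the comments above. The helper name `to_binary_label` is hypothetical, and the sample outputs use made-up scores for illustration; they are not real predictions from this model.

```python
# Hypothetical helper: map pipeline outputs to the binary labels used in this card.
LABEL_TO_ID = {"yes": 1, "no": 0}  # 'yes' = green, 'no' = non-green


def to_binary_label(prediction: dict) -> int:
    """Convert one pipeline output, e.g. {'label': 'yes', 'score': 0.99}, to 0 or 1."""
    return LABEL_TO_ID[prediction["label"]]


# Illustrative outputs only (made-up scores, not produced by the model)
sample_outputs = [
    {"label": "yes", "score": 0.99},
    {"label": "no", "score": 0.87},
]

binary_labels = [to_binary_label(p) for p in sample_outputs]
print(binary_labels)
```

In the Quick Start loop, the same helper could be applied to each `out` instead of printing it.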

---

### Recommended Input Format

To reduce ambiguity in raw proposal text, enrich the model's input with structured proposal- and firm-level context, following the format of the training data:
```
"A(An) {sponsor_type}-type sponsor has filed a shareholder proposal to a(an)
{sic2_des}-sector company. This proposal requests: {resolution}.
It falls under a broader agenda class that may include items not directly
relevant to this specific proposal: {AgendaCodeInformation}"
```

| Field | Description | Example |
|:---|:---|:---|
| `{sponsor_type}` | Type of proposal sponsor | `institutional`, `individual`, `SRI fund`, `pension fund` |
| `{sic2_des}` | SIC-2 industry sector description | `energy`, `manufacturing` |
| `{resolution}` | Full text of the proposal resolution | *"Report on Climate Change Performance Metrics Into Executive Compensation Program..."* |
| `{AgendaCodeInformation}` | Description of ISS agenda code | *"This code is used for proposals seeking..."* |

> **Tip:** The `{AgendaCodeInformation}` field is optional, but including it generally improves prediction confidence, because it adds categorical context to an otherwise brief resolution text.
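
The template can be filled programmatically. The following is a minimal sketch, not the authors' preprocessing code: the function name `build_proposal_text` is hypothetical, the template's line breaks are treated as spaces, and the agenda sentence is dropped entirely when no agenda description is supplied. Both of those choices are assumptions, since the card does not specify them.

```python
from typing import Optional


def build_proposal_text(
    sponsor_type: str,
    sic2_des: str,
    resolution: str,
    agenda_code_information: Optional[str] = None,
) -> str:
    """Assemble one model input following the recommended format."""
    text = (
        f"A(An) {sponsor_type}-type sponsor has filed a shareholder proposal to a(an) "
        f"{sic2_des}-sector company. This proposal requests: {resolution}."
    )
    # Assumption: omit the agenda sentence when the optional field is missing
    if agenda_code_information:
        text += (
            " It falls under a broader agenda class that may include items not directly "
            f"relevant to this specific proposal: {agenda_code_information}"
        )
    return text


example = build_proposal_text(
    sponsor_type="institutional",
    sic2_des="energy",
    resolution="the company to issue a report on its greenhouse gas emissions reduction targets",
)
print(example)
```

The returned strings can be passed to `datasets.Dataset.from_dict({"text": [...]})` as in the Quick Start.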

---

## Training Data

The model was fine-tuned on a custom **stratified dataset of 1,500 manually curated ISS shareholder proposals**. The dataset underwent rule-based correction to exclude purely social/governance proposals and blended proposals.

For full details on data sampling, text construction, and labeling rules, please refer to the **[gprop_training_dataset](https://huggingface.co/datasets/Jidi1997/gprop_training_dataset)** dataset card.

---

## Training Procedure

### Hyperparameters

| Hyperparameter | Value |
|:---|:---:|
| Learning Rate | `2e-05` |
| Train Batch Size | `16` |
| Eval Batch Size | `16` |
| Seed | `42` |
| Weight Decay | `0.05` |
| Optimizer | AdamW |
| Epochs | `10` |
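
For readers who want a comparable setup, the table maps onto `transformers.TrainingArguments` keyword names roughly as shown below. This is a sketch under assumptions: the card does not say which AdamW implementation was used or whether the batch sizes are per device, so the `optim` value and the `per_device_*` reading are guesses, and evaluation or checkpointing settings are left out because they are not documented here.

```python
# Hyperparameters from the table, keyed by TrainingArguments argument names.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,  # assumption: single-device training
    "per_device_eval_batch_size": 16,   # assumption: single-device evaluation
    "seed": 42,
    "weight_decay": 0.05,
    "num_train_epochs": 10,
    "optim": "adamw_torch",  # assumption: the card only says "AdamW"
}

# Usage (requires transformers; "out" is a placeholder output directory):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **training_kwargs)

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```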

### Training Results

The model weights from **Epoch 8 (`checkpoint-600`)** were selected as the best-performing checkpoint based on the validation F1 score.

| Epoch | Train Loss | Val Loss | Accuracy | F1 (Binary) |
|:---:|:---:|:---:|:---:|:---:|
| 1 | 0.3060 | 0.0968 | 0.9667 | 0.9675 |
| 2 | 0.0954 | 0.0898 | 0.9733 | 0.9740 |
| 3 | 0.0956 | 0.1808 | 0.9600 | 0.9623 |
| 4 | 0.0029 | 0.0783 | 0.9800 | 0.9805 |
| 5 | 0.0395 | 0.1026 | 0.9800 | 0.9803 |
| 6 | 0.0350 | 0.1308 | 0.9733 | 0.9744 |
| 7 | 0.0094 | 0.1108 | 0.9767 | 0.9772 |
| **8** (best) | **0.0003** | **0.1182** | **0.9800** | **0.9806** |
| 9 | 0.0004 | 0.1154 | 0.9767 | 0.9773 |
| 10 | 0.0002 | 0.1229 | 0.9767 | 0.9773 |

> **Best checkpoint selected at Epoch 8**, with the highest validation F1 of **0.9806**.
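
The selection rule can be checked directly against the table. The sketch below copies the validation F1 values, picks the epoch with the highest score, and derives the number of optimizer steps per epoch implied by `checkpoint-600`. It assumes checkpoints are numbered by global step and saved at the end of each epoch, which the card does not state explicitly.

```python
# (epoch, validation F1) pairs copied from the results table
val_f1_by_epoch = [
    (1, 0.9675), (2, 0.9740), (3, 0.9623), (4, 0.9805), (5, 0.9803),
    (6, 0.9744), (7, 0.9772), (8, 0.9806), (9, 0.9773), (10, 0.9773),
]

# Pick the epoch with the highest validation F1
best_epoch, best_f1 = max(val_f1_by_epoch, key=lambda pair: pair[1])

# Assumption: "checkpoint-600" is the global step count at the end of the best epoch
steps_per_epoch = 600 // best_epoch

print(best_epoch, best_f1, steps_per_epoch)
```

With a train batch size of 16, that step count corresponds to roughly 1,200 training examples per epoch, assuming a single device and no gradient accumulation.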

---

## Citation

If you use this model in your research, please cite the associated working paper (forthcoming).

---

<div align="center">

*Built on top of [ClimateBERT](https://huggingface.co/climatebert) · Trained with 🤗 Hugging Face Transformers*

</div>