---
license: gpl-3.0
language:
- en
pipeline_tag: text-classification
tags:
- sbic
- csr
datasets:
- ia-nechaev/sbic-method2
---

# sbic-method2

An updated version of the **Standard-Based Impact Classification (SBIC) method** for analyzing CSR reports in accordance with the GRI framework.

## Contents

1. Labeled dataset (150 international companies, 230 CSR GRI reports, 2017-2021, 57k paragraphs, 360k sentences)
2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021, 645k paragraphs); the full dataset is available upon request
3. Calculated text embeddings for both datasets
4. Script to predict the labels

Instructions for running the code follow.

---

# **Multilabel Classification Steps**

The script classifies report text based on embeddings: it runs a **cosine similarity** search against the labeled dataset, applies the **K-Nearest Neighbor (KNN) algorithm** to the closest matches, and uses a **sigmoid activation function** to produce the final labels.

## **Prerequisites**

Ensure you have the following installed before running the script:

- Python 3.8+
- Required Python libraries (install using the command below)

```bash
pip install numpy pandas torch sentence-transformers scikit-learn
```

## **Input Files**

Before running the script, make sure the following input files are in the working directory:

1. **Data Files**:
   - labeled dataset: `labeled.csv`
   - dataset for prediction: `prediction_demo.csv`

2. **Precomputed Embeddings**:
   - labeled dataset: `embeddings_labeled.pkl`
   - dataset for prediction: `embeddings_prediction.pkl`

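Loading one data file together with its embeddings might look like the sketch below. It assumes each `.pkl` file unpickles to a sequence holding one embedding per CSV row, which this README does not confirm. `load_pair` is a hypothetical helper, and the standard-library `csv` module stands in for pandas only to keep the example dependency-free. If the embeddings were saved as `torch` tensors, `torch` must be installed for unpickling to succeed.

```python
import csv
import pickle


def load_pair(csv_path, pkl_path):
    """Load a CSV file and its pickled embeddings, checking that they align.

    Hypothetical helper: assumes the pickle holds one embedding per CSV row.
    Only unpickle files from a source you trust.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    with open(pkl_path, "rb") as f:
        embeddings = pickle.load(f)
    if len(rows) != len(embeddings):
        raise ValueError(
            f"{csv_path} has {len(rows)} rows but {pkl_path} "
            f"has {len(embeddings)} embeddings"
        )
    return rows, embeddings
```

For example, `rows, embeddings = load_pair("labeled.csv", "embeddings_labeled.pkl")` would raise a `ValueError` early if the two files were out of sync, rather than producing misaligned predictions later.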
## **Running the Script**

Run the script using the following command:

```bash
python script.py
```

## **Processing Steps**

The script follows these main steps:

1. **Load Data & Precomputed Embeddings**
2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.
3. **Apply K-Nearest Neighbor (KNN) Algorithm**: Selects the most similar reports (sentences) and aggregates their labels into predictions.
4. **Use Sigmoid Activation for Classification**: Applies a threshold to generate the final classification outputs.
5. **Save Results**: Exports `df_results_0_50k.csv`, which contains the processed data for the first 50k records.

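Steps 2 to 4 can be illustrated with a minimal, dependency-free sketch. It is not the code in `script.py`: the function names, the value of `k`, the `steepness` parameter, the 0.5 threshold, and the way neighbor votes are combined are all assumptions made for illustration. Plain Python lists stand in for the precomputed embeddings, and a sorted cosine ranking stands in for the `semantic_search` call.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(query, corpus, corpus_labels, k=3, threshold=0.5, steepness=10.0):
    """Return one 0/1 flag per label for a single query embedding.

    Hypothetical aggregation: similarity-weighted vote share of the k nearest
    labeled neighbors, squashed by a sigmoid centered at 0.5.
    """
    # Step 2: rank labeled embeddings by cosine similarity to the query.
    ranked = sorted(
        ((cosine(query, vec), idx) for idx, vec in enumerate(corpus)),
        reverse=True,
    )
    # Step 3: keep the k nearest neighbors; negative similarities get zero weight.
    neighbors = [(max(score, 0.0), idx) for score, idx in ranked[:k]]
    total_weight = sum(weight for weight, _ in neighbors)

    predictions = []
    for label in range(len(corpus_labels[0])):
        votes = sum(weight * corpus_labels[idx][label] for weight, idx in neighbors)
        share = votes / total_weight if total_weight > 0 else 0.0
        # Step 4: the sigmoid maps the vote share to a score in (0, 1),
        # and the threshold turns that score into a 0/1 label.
        score = sigmoid(steepness * (share - 0.5))
        predictions.append(int(score >= threshold))
    return predictions


# Toy data: four labeled 2-d embeddings and their two-label annotations.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
corpus_labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
print(predict_labels([1.0, 0.05], corpus, corpus_labels, k=2))
```

Because each label is thresholded independently, a query can receive several labels at once or none at all, which is what makes the output multilabel rather than single-class.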
## **Output File**

The processed results are saved to `df_results_0_50k.csv`.

## **Execution Time**

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.

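One common way to measure and report total processing time is `time.perf_counter`. The sketch below makes no claim about how `script.py` does it; `timed` is a hypothetical helper, and the message format is an assumption.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn, print how long it took, and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Total processing time: {elapsed:.2f} s")
    return result, elapsed
```

Wrapping the whole prediction run in such a helper makes it easy to compare runs on different sample sizes or hardware.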