---
license: gpl-3.0
language:
- en
pipeline_tag: text-classification
tags:
- sbic
- csr
datasets:
- ia-nechaev/sbic-method2
---

# sbic-method2

An updated version of the **Standard-Based Impact Classification (SBIC) method** for analyzing CSR reports in accordance with the GRI framework.

## Contents

1. Labeled dataset (150 international companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs) (full dataset available upon request)
3. Calculated text embeddings of both datasets
4. Script to predict the labels

Instructions on how to run the code are given below.

---

# **Multilabel Classification Steps**

The script classifies reports by running a similarity search over precomputed embeddings, using **cosine similarity**, the **K-Nearest Neighbor (KNN) algorithm**, and a **sigmoid activation function**.

## **Prerequisites**

Ensure you have the following installed before running the script:

- Python 3.8+
- Required Python libraries (install using the command below)

```bash
pip install numpy pandas torch sentence-transformers scikit-learn
```

## **Input Files**

Before running the script, make sure the following input files are in the working directory:

1. **Data Files**:
   - labeled dataset: `labeled.csv`
   - dataset for prediction: `prediction_demo.csv`
2. **Precomputed Embeddings**:
   - labeled dataset: `embeddings_labeled.pkl`
   - dataset for prediction: `embeddings_prediction.pkl`

## **Running the Script**

Run the script using the following command:

```bash
python script.py
```

## **Processing Steps**

The script follows these main steps:

1. **Load Data & Precomputed Embeddings**
2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.
3. **Apply K-Nearest Neighbor (KNN) Algorithm**: Selects the top similar reports (sentences) and aggregates their predictions.
4. **Use Sigmoid Activation for Classification**: Applies a threshold to generate the final classification outputs.
5. **Save Results**: Exports `df_results_0_50k.csv`, which contains the processed data for the first 50k records.

## **Output File**

The processed results are saved in:

`df_results_0_50k.csv`

## **Execution Time**

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
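
The processing steps listed above (cosine similarity search, KNN aggregation, sigmoid thresholding) can be illustrated with a minimal, NumPy-only sketch. This is not the contents of `script.py`: the function names, the toy embeddings and labels, the unweighted neighbor vote, and the `steepness` and `threshold` parameters are all assumptions made for illustration. A plain NumPy top-k search stands in for `semantic_search` from `sentence-transformers`.

```python
import numpy as np


def cosine_top_k(query_emb, corpus_emb, k):
    """Return indices and cosine scores of the k most similar corpus rows per query."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = q @ c.T                               # (n_queries, n_corpus) cosine matrix
    idx = np.argsort(-sims, axis=1)[:, :k]       # top-k neighbors, most similar first
    scores = np.take_along_axis(sims, idx, axis=1)
    return idx, scores


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def knn_multilabel_predict(query_emb, corpus_emb, corpus_labels,
                           k=2, steepness=10.0, threshold=0.5):
    """Hypothetical KNN multilabel classifier (assumed aggregation, not the author's)."""
    idx, _ = cosine_top_k(query_emb, corpus_emb, k)
    neighbor_labels = corpus_labels[idx]         # (n_queries, k, n_labels)
    vote_share = neighbor_labels.mean(axis=1)    # fraction of neighbors carrying each label
    # Assumption: center the vote share at 0.5 and squash it with a sigmoid.
    probs = sigmoid(steepness * (vote_share - 0.5))
    preds = (probs >= threshold).astype(int)
    return preds, probs


# Toy stand-ins for the labeled embeddings and their multi-hot labels.
corpus_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
corpus_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Toy stand-ins for the embeddings of the sentences to classify.
query_emb = np.array([[1.0, 0.05], [0.05, 1.0]])

preds, probs = knn_multilabel_predict(query_emb, corpus_emb, corpus_labels, k=2)
print(preds)
```

In this toy setup each query sits next to two labeled sentences that share a single label, so each query is assigned exactly that label. With real data, `k`, the aggregation rule, and the threshold would need to match whatever the script actually uses.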