	---
	license: gpl-3.0
	language:
	- en
	pipeline_tag: text-classification
	tags:
	- sbic
	- csr
	datasets:
	- ia-nechaev/sbic-method2
	---

	# sbic-method2

	An updated version of the Standard-Based Impact Classification (SBIC) method for analyzing CSR reports in accordance with the GRI framework.

	## Contents

	1. Labeled dataset (150 international companies, 230 CSR GRI reports, 2017-2021 period, 57k paragraphs, 360k sentences)
	2. Dataset for prediction (150 German PLC companies, 1.2k CSR reports, 2010-2021 period, 645k paragraphs); the full dataset is available upon request
	3. Calculated text embeddings for both datasets
	4. Script to predict the labels


	Instructions on how to run the code are given below.

	---

	# Multilabel Classification Steps

	This code classifies reports based on their embeddings. It runs a similarity search using cosine similarity, applies a K-Nearest Neighbor (KNN) algorithm to the closest matches, and uses a sigmoid activation function to produce the final labels.

	## Prerequisites

	Ensure you have the following installed before running the script:

	- Python 3.8+
	- Required Python libraries (install using the command below)

	```bash
	pip install numpy pandas torch sentence-transformers scikit-learn
	```

	## Input Files

	Before running the script, make sure you have the following input files in the working directory:

	1. Data Files:
	   - labeled dataset: `labeled.csv`
	   - dataset for prediction: `prediction_demo.csv`

	2. Precomputed Embeddings:
	   - labeled dataset: `embeddings_labeled.pkl`
	   - dataset for prediction: `embeddings_prediction.pkl`
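
As a quick pre-flight check, the sketch below verifies that the four input files listed above are present before the script is started. The helper names `missing_inputs` and `load_pickle` are hypothetical and not part of `script.py`; the internal structure of the `.pkl` files is not documented here, so the loader simply returns whatever object was pickled.

```python
import pickle
from pathlib import Path

# File names taken from the "Input Files" list.
REQUIRED_FILES = (
    "labeled.csv",
    "prediction_demo.csv",
    "embeddings_labeled.pkl",
    "embeddings_prediction.pkl",
)


def missing_inputs(directory="."):
    """Return the required input files that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


def load_pickle(path):
    """Load one pickled object; only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All input files found.")
```

Running the check from the working directory before `python script.py` avoids a late failure after the data files have already been loaded.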

	## Running the Script

	Run the script using the following command:

	```bash
	python script.py
	```

	## Processing Steps

	The script follows these main steps:

	1. Load Data & Precomputed Embeddings
	2. Perform Cosine Similarity Search: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.
	3. Apply K-Nearest Neighbor (KNN) Algorithm: Selects the most similar reports (sentences) and aggregates their labels into predictions.
	4. Use Sigmoid Activation for Classification: Applies a sigmoid and a threshold to generate the final classification outputs.
	5. Save Results: Exports `df_results_0_50k.csv`, containing the processed data for the first 50k records.
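
Steps 2 to 4 can be illustrated with a minimal NumPy sketch. This is not the code in `script.py`: it uses plain matrix operations as a stand-in for `semantic_search`, and the aggregation rule (a signed, similarity-weighted vote), the value of `k`, and the `threshold` are all assumptions chosen for illustration. The function names are hypothetical.

```python
import numpy as np


def top_k_cosine(queries, corpus, k):
    """Return (indices, scores) of the k most cosine-similar corpus rows per query."""
    # Assumes no all-zero rows; those would cause a division by zero.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                                  # (n_queries, n_corpus)
    idx = np.argsort(-sims, axis=1)[:, :k]          # best matches first
    return idx, np.take_along_axis(sims, idx, axis=1)


def knn_multilabel(queries, corpus, corpus_labels, k=3, threshold=0.5):
    """Similarity-weighted neighbor vote, passed through a sigmoid, then thresholded."""
    idx, scores = top_k_cosine(queries, corpus, k)
    neighbor_labels = corpus_labels[idx]            # (n_queries, k, n_labels)
    # Map labels {0, 1} to {-1, +1} so each neighbor votes for or against a label.
    votes = (scores[:, :, None] * (2 * neighbor_labels - 1)).sum(axis=1)
    probs = 1.0 / (1.0 + np.exp(-votes))            # sigmoid
    return (probs >= threshold).astype(int), probs


if __name__ == "__main__":
    # Toy 2-D "embeddings" with two labels; real embeddings are much larger.
    corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
    queries = np.array([[1.0, 0.05], [0.05, 1.0]])
    preds, probs = knn_multilabel(queries, corpus, labels, k=2)
    print(preds)
```

In this toy example each query sits next to two labeled vectors that share a single label, so the vote is unanimous. With real data, the choice of `k` and `threshold` determines how many labels each paragraph receives.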

	## Output File

	The processed results will be saved in: `df_results_0_50k.csv`

	## Execution Time

	Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.
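
To time a single step in isolation, for example when testing on a smaller sample, a small wrapper around `time.perf_counter` is enough. The helper `timed` below is a hypothetical illustration and is not taken from `script.py`.

```python
import time


def timed(func, *args, **kwargs):
    """Run `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace with the step to be measured.
    total, seconds = timed(sum, range(1_000_000))
    print(f"Total processing time: {seconds:.2f} s")
```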