# sbic-method2

An updated version of the Standard-Based Impact Classification (SBIC) method for analyzing CSR reports in accordance with the GRI framework.



---

# **Multilabel Classification Step**  

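This code classifies reports based on their embeddings. It runs a **cosine similarity** search to find similar reports, applies the **K-Nearest Neighbor (KNN) algorithm** to the closest matches, and uses a **Sigmoid activation function** with a threshold to produce the final multilabel predictions.  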

## **Prerequisites**  

Ensure you have the following installed before running the script:  

- Python 3.8+  
- Required Python libraries (install using the command below)  

```bash
pip install numpy pandas torch sentence-transformers scikit-learn
```

## **Input Files**  

Before running the script, make sure you have the following input files in the working directory:  

1. **Data Files**:  
   - `df_360k_41lables_05012023.csv`  
   - `german_plc_all_paragraphs_unnested_only.csv`  

2. **Precomputed Embeddings**:  
   - Dataset for prediction: `embeddings_paragraphs_07012023.pkl`  
   - Labeled dataset: `embeddings_sentences_360k_09012023.pkl`  

## **Running the Script**  

Run the script using the following command:  

```bash
python script.py
```

## **Processing Steps**  

The script follows these main steps:  

1. **Load Data & Pretrained Embeddings**  
2. **Perform Cosine Similarity Search**: Finds the most relevant reports (sentences) using `semantic_search` from `sentence-transformers`.  
3. **Apply K-Nearest Neighbor (KNN) Algorithm**: Selects top similar reports (sentences) and aggregates predictions.  
4. **Use Sigmoid Activation for Classification**: Applies a threshold to generate final classification outputs.  
5. **Save Results**: Exports `df_results_0_50k.csv` containing the processed data.  
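
Steps 2 to 4 can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the script uses `semantic_search` from `sentence-transformers` on precomputed embeddings, whereas this example works on plain Python lists with made-up toy data. The value of `k`, the threshold of 0.5, and the similarity-weighted vote fed into the sigmoid are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(query, corpus, k):
    """Return (index, score) pairs for the k corpus vectors most similar to query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(query, corpus, corpus_labels, k=3, threshold=0.5):
    """Multilabel prediction for one query embedding (assumed aggregation rule)."""
    neighbors = top_k_neighbors(query, corpus, k)
    predictions = []
    for j in range(len(corpus_labels[0])):
        # Assumption: each neighbor votes +score if it carries label j, -score otherwise.
        logit = sum(s if corpus_labels[i][j] else -s for i, s in neighbors)
        predictions.append(int(sigmoid(logit) > threshold))
    return predictions


# Toy data, invented for this example: 4 labeled embeddings, 2 labels.
labeled_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [[1, 0], [1, 0], [0, 1], [0, 1]]
query_embedding = [0.8, 0.2]

predicted = classify(query_embedding, labeled_embeddings, labels, k=3)
print(predicted)  # [1, 0]
```

With the signed vote, a sigmoid output above 0.5 corresponds to a similarity-weighted majority of neighbors carrying the label. The actual script may aggregate neighbor labels differently.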

## **Output File**  

The processed results will be saved in:  

- `df_results_0_50k.csv`  

## **Execution Time**  

Execution time depends on the number of test samples and system resources. The script prints the total processing time upon completion.  
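
To time a smaller trial run before processing the full dataset, a wrapper along these lines may help. This is a sketch only: `run_timed` is a hypothetical helper, and the script's own timing code is not reproduced here.

```python
import time


def run_timed(func, *args, **kwargs):
    """Call func, print the elapsed wall-clock time, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"Total processing time: {elapsed:.2f} s")
    return result, elapsed


# Example with a stand-in workload instead of the real classification step.
total, seconds = run_timed(sum, range(1_000_000))
```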

---
license: gpl-3.0
---