DeDeckerThomas committed on
Commit 1ef4e76 · 1 Parent(s): d9eaf43

Update README.md
Files changed (1)
  1. README.md +58 -2
README.md CHANGED
@@ -126,13 +126,69 @@ For more detailed information, you can take a look at the training notebook (li
| Early Stopping Patience | 1 |

### Preprocessing
+ The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only steps that remain are tokenization and joining all keyphrases into one string with a separator of choice (;).
```python
-
+ def pre_process_keyphrases(text_ids, kp_list):
+     kp_order_list = []
+     kp_set = set(kp_list)
+     # Decode the tokenized document so keyphrase positions can be looked up in the text.
+     text = tokenizer.decode(
+         text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
+     )
+     text = text.lower()
+     # str.find returns -1 for keyphrases that do not occur in the (truncated) document.
+     for kp in kp_set:
+         kp = kp.strip()
+         kp_index = text.find(kp.lower())
+         kp_order_list.append((kp_index, kp))
+
+     # Sorting by position keeps present keyphrases in their order of appearance.
+     kp_order_list.sort()
+     present_kp, absent_kp = [], []
+
+     for kp_index, kp in kp_order_list:
+         if kp_index < 0:
+             absent_kp.append(kp)
+         else:
+             present_kp.append(kp)
+     return present_kp, absent_kp
+
+
+ def preprocess_function(samples):
+     processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
+     for i, sample in enumerate(samples[dataset_document_column]):
+         input_text = " ".join(sample)
+         inputs = tokenizer(
+             input_text,
+             padding="max_length",
+             truncation=True,
+         )
+         present_kp, absent_kp = pre_process_keyphrases(
+             text_ids=inputs["input_ids"],
+             kp_list=samples["extractive_keyphrases"][i]
+             + samples["abstractive_keyphrases"][i],
+         )
+         # Present keyphrases come first in the target, absent keyphrases last.
+         keyphrases = present_kp
+         keyphrases += absent_kp
+
+         target_text = f" {keyphrase_sep_token} ".join(keyphrases)
+
+         with tokenizer.as_target_tokenizer():
+             targets = tokenizer(
+                 target_text, max_length=40, padding="max_length", truncation=True
+             )
+             # Replace padding with -100 so it is ignored by the loss.
+             targets["input_ids"] = [
+                 (t if t != tokenizer.pad_token_id else -100)
+                 for t in targets["input_ids"]
+             ]
+         for key in inputs.keys():
+             processed_samples[key].append(inputs[key])
+         processed_samples["labels"].append(targets["input_ids"])
+     return processed_samples
  ```
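
As a usage sketch (not part of this commit), `preprocess_function` above would typically be applied with 🤗 Datasets' `map`. The checkpoint, dataset id, and the `tokenizer`, `dataset_document_column`, and `keyphrase_sep_token` values below are assumptions made for illustration:

```python
# Hypothetical setup; the checkpoint and dataset names are assumptions,
# not taken from this commit.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumed base checkpoint
keyphrase_sep_token = ";"             # separator joined between keyphrases
dataset_document_column = "document"  # column holding the pre-split word lists

dataset = load_dataset("midas/inspec", "generation")  # assumed keyphrase dataset
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```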

### Postprocessing
+ For the post-processing, you will need to split the generated string on the keyphrase separator.
```python
-
+ def extract_keyphrases(examples):
+     return [example.split(keyphrase_sep_token) for example in examples]
  ```
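
To show both pieces end to end, here is a minimal inference sketch; `model`, the checkpoint, and the example text are assumptions, and an extra `strip()` is applied because the separator is joined with surrounding spaces:

```python
# Hypothetical inference sketch; the checkpoint and input text are assumptions.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # assumed checkpoint

text = "Keyphrase extraction is a technique in text analysis ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=40)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# extract_keyphrases returns one list of keyphrases per generated string;
# strip() removes the spaces left around the separator by the join.
keyphrases = [kp.strip() for kp in extract_keyphrases(decoded)[0]]
```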
## 📝 Evaluation results