DeDeckerThomas committed on
Commit 1ef4e76 · 1 Parent(s): d9eaf43

Update README.md
Files changed (1)
  1. README.md +58 -2
README.md CHANGED
@@ -126,13 +126,69 @@ For more detailed information, you can take a look at the training notebook (li
| Early Stopping Patience | 1 |

### Preprocessing
+ The documents in the dataset are already preprocessed into lists of words with the corresponding keyphrases. The only steps that remain are tokenization and joining all keyphrases into one string with a separator of choice (;).
```python
-
+ def pre_process_keyphrases(text_ids, kp_list):
+     kp_order_list = []
+     kp_set = set(kp_list)
+     # Decode the tokenized document so keyphrase positions can be looked up in the text.
+     text = tokenizer.decode(
+         text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
+     )
+     text = text.lower()
+     # str.find returns -1 for keyphrases that do not occur in the (truncated) document.
+     for kp in kp_set:
+         kp = kp.strip()
+         kp_index = text.find(kp.lower())
+         kp_order_list.append((kp_index, kp))
+
+     # Sorting by position keeps present keyphrases in their order of appearance.
+     kp_order_list.sort()
+     present_kp, absent_kp = [], []
+
+     for kp_index, kp in kp_order_list:
+         if kp_index < 0:
+             absent_kp.append(kp)
+         else:
+             present_kp.append(kp)
+     return present_kp, absent_kp
+
+
+ def preprocess_function(samples):
+     processed_samples = {"input_ids": [], "attention_mask": [], "labels": []}
+     for i, sample in enumerate(samples[dataset_document_column]):
+         input_text = " ".join(sample)
+         inputs = tokenizer(
+             input_text,
+             padding="max_length",
+             truncation=True,
+         )
+         present_kp, absent_kp = pre_process_keyphrases(
+             text_ids=inputs["input_ids"],
+             kp_list=samples["extractive_keyphrases"][i]
+             + samples["abstractive_keyphrases"][i],
+         )
+         # Present keyphrases come first in the target, absent keyphrases last.
+         keyphrases = present_kp
+         keyphrases += absent_kp
+
+         target_text = f" {keyphrase_sep_token} ".join(keyphrases)
+
+         with tokenizer.as_target_tokenizer():
+             targets = tokenizer(
+                 target_text, max_length=40, padding="max_length", truncation=True
+             )
+             # Replace padding with -100 so it is ignored by the loss.
+             targets["input_ids"] = [
+                 (t if t != tokenizer.pad_token_id else -100)
+                 for t in targets["input_ids"]
+             ]
+         for key in inputs.keys():
+             processed_samples[key].append(inputs[key])
+         processed_samples["labels"].append(targets["input_ids"])
+     return processed_samples
  ```
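
As a usage sketch (not part of this commit), `preprocess_function` above would typically be applied with 🤗 Datasets' `map`. The checkpoint, dataset id, and the `tokenizer`, `dataset_document_column`, and `keyphrase_sep_token` values below are assumptions made for illustration:

```python
# Hypothetical setup; the checkpoint and dataset names are assumptions,
# not taken from this commit.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")  # assumed base checkpoint
keyphrase_sep_token = ";"             # separator joined between keyphrases
dataset_document_column = "document"  # column holding the pre-split word lists

dataset = load_dataset("midas/inspec", "generation")  # assumed keyphrase dataset
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```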

### Postprocessing
+ For the post-processing, you will need to split the generated string on the keyphrase separator.
```python
-
+ def extract_keyphrases(examples):
+     return [example.split(keyphrase_sep_token) for example in examples]
  ```
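
To show both pieces end to end, here is a minimal inference sketch; `model`, the checkpoint, and the example text are assumptions, and an extra `strip()` is applied because the separator is joined with surrounding spaces:

```python
# Hypothetical inference sketch; the checkpoint and input text are assumptions.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")  # assumed checkpoint

text = "Keyphrase extraction is a technique in text analysis ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_length=40)
decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# extract_keyphrases returns one list of keyphrases per generated string;
# strip() removes the spaces left around the separator by the join.
keyphrases = [kp.strip() for kp in extract_keyphrases(decoded)[0]]
```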
## 📝 Evaluation results