Update README.md
README.md
````diff
@@ -250,7 +250,16 @@ fn = open("./2.7.3.13_processed.txt",'w')
 for key,value in grouped_dataset.items():
     fn.write(value)
     fn.write("\n")
-fn.close()
+fn.close()
+
+fn = open("./2.7.3.13_processed.txt",'w')
+for key,value in grouped_dataset.items():
+    padding_len = 1024 - len(tokenizer(value)['input_ids'])
+    padding = "<pad>"*padding_len
+    print(len(tokenizer(value+padding)['input_ids']))
+    fn.write(value+padding)
+    fn.write("\n")
+fn.close()
 ```
 The previous script will prepare a text file with the correct format for tokenization.
 Now we can use the tokenizer to convert its contents to tokens.
````
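The padding step this commit adds (extend every example to a fixed 1024-token length by appending literal `<pad>` tokens before writing it out) can be sketched in isolation. The `tokenizer` used in the diff is a pretrained tokenizer defined earlier in the README and is not part of this excerpt, so the whitespace-based `toy_tokenize` below is a hypothetical stand-in for it, and `pad_to_length` is a helper name introduced here purely for illustration:

```python
def toy_tokenize(text):
    # Hypothetical stand-in for tokenizer(value)['input_ids'] from the
    # diff: treat "<pad>" as its own token and split on whitespace.
    return text.replace("<pad>", " <pad> ").split()

def pad_to_length(value, target_len):
    # Mirror the committed logic: count how many tokens are missing up
    # to the target length (1024 in the diff), then append that many
    # copies of the literal "<pad>" string.
    padding_len = target_len - len(toy_tokenize(value))
    # Clamp at zero so an over-long example is left unchanged rather
    # than multiplying the pad string by a negative number.
    padding = "<pad>" * max(padding_len, 0)
    return value + padding

# Example: "foo bar" is 2 tokens, so padding to 5 appends three pads.
padded = pad_to_length("foo bar", 5)  # → "foo bar<pad><pad><pad>"
```

With the real tokenizer, `padding_len` can go negative for examples longer than 1024 tokens, which is why this sketch clamps it at zero; the committed script multiplies `"<pad>"` by the raw difference.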
|