Update README.md
README.md
````diff
@@ -250,7 +250,16 @@ fn = open("./2.7.3.13_processed.txt",'w')
 for key,value in grouped_dataset.items():
     fn.write(value)
     fn.write("\n")
-fn.close()
+fn.close()
+
+fn = open("./2.7.3.13_processed.txt",'w')
+for key,value in grouped_dataset.items():
+    padding_len = 1024 - len(tokenizer(value)['input_ids'])
+    padding = "<pad>"*padding_len
+    print(len(tokenizer(value+padding)['input_ids']))
+    fn.write(value+padding)
+    fn.write("\n")
+fn.close()
 ```
 The previous script will prepare a text file with the correct format for tokenization.
 Now we can use the tokenizer to convert its contents to tokens.
````
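The padding step this commit adds (extend every example to a fixed 1024-token length by appending literal `<pad>` tokens before writing it out) can be sketched in isolation. The `tokenizer` used in the diff is a pretrained tokenizer defined earlier in the README and is not part of this excerpt, so the whitespace-based `toy_tokenize` below is a hypothetical stand-in for it, and `pad_to_length` is a helper name introduced here purely for illustration:

```python
def toy_tokenize(text):
    # Hypothetical stand-in for tokenizer(value)['input_ids'] from the
    # diff: treat "<pad>" as its own token and split on whitespace.
    return text.replace("<pad>", " <pad> ").split()

def pad_to_length(value, target_len):
    # Mirror the committed logic: count how many tokens are missing up
    # to the target length (1024 in the diff), then append that many
    # copies of the literal "<pad>" string.
    padding_len = target_len - len(toy_tokenize(value))
    # Clamp at zero so an over-long example is left unchanged rather
    # than multiplying the pad string by a negative number.
    padding = "<pad>" * max(padding_len, 0)
    return value + padding

# Example: "foo bar" is 2 tokens, so padding to 5 appends three pads.
padded = pad_to_length("foo bar", 5)  # → "foo bar<pad><pad><pad>"
```

With the real tokenizer, `padding_len` can go negative for examples longer than 1024 tokens, which is why this sketch clamps it at zero; the committed script multiplies `"<pad>"` by the raw difference.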
|