drchajan committed · Commit f3b9bca · 1 Parent(s): acce65b

Create README.md

Files changed (1)
  1. README.md +21 -0
README.md CHANGED
@@ -0,0 +1,21 @@
This model's tokenizer is extended with Czech (CS), Slovak (SK), and Polish (PL) accented characters using the following code:
````python
from transformers import (
    AutoModel,
    AutoTokenizer,
)

model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Accented lowercase characters per language; duplicates are removed by the set below.
accents = "áčďéěíňóřšťúůýž"  # CS
accents += "ąćęłńóśźż"  # PL
accents += "áäčďéíĺľňóôŕšťúýž"  # SK
accents += accents.upper()  # add the uppercase variants
accents = set(accents)

# Keep only the characters the tokenizer does not already have.
new_tokens = accents - set(tokenizer.get_vocab().keys())

# Sort so the new token ids are assigned in a reproducible order.
tokenizer.add_tokens(sorted(new_tokens))

# Grow the embedding matrix to match the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))
````
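
The character-selection step can be tried without downloading the model. The following is a minimal sketch, assuming a hypothetical toy vocabulary (`toy_vocab`) in place of the real tokenizer vocabulary; the helper names `accent_characters` and `missing_tokens` are illustrative only and are not part of `transformers`.

```python
CS = "áčďéěíňóřšťúůýž"
PL = "ąćęłńóśźż"
SK = "áäčďéíĺľňóôŕšťúýž"


def accent_characters() -> set[str]:
    """Return the lower- and upper-case accented characters for CS, PL and SK."""
    lower = CS + PL + SK
    return set(lower + lower.upper())


def missing_tokens(vocab: set[str]) -> list[str]:
    """Return the accented characters absent from `vocab`, in sorted order."""
    return sorted(accent_characters() - vocab)


# Hypothetical toy vocabulary standing in for the real tokenizer vocabulary.
toy_vocab = {"a", "e", "é", "ó", "Á", "▁the"}
to_add = missing_tokens(toy_vocab)
print(len(to_add), to_add[:5])
```

Sorting the result gives the new tokens a deterministic order, which a plain `set` does not guarantee. With a real tokenizer, the same list could be passed to `tokenizer.add_tokens`.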