ctu-aic/flan-t5-large

drchajan committed on Aug 7, 2023

Commit f3b9bca · 1 Parent(s): acce65b

Create README.md

Files changed (1)

README.md +21 -0

	@@ -0,0 +1,21 @@

+This model's tokenizer is extended with Czech (CS), Slovak (SK) and Polish (PL) accented characters using the following code:
+````python
+from transformers import (
+    AutoModel,
+    AutoTokenizer,
+)
+model_id = "google/flan-t5-large"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModel.from_pretrained(model_id)
+accents = "áčďéěíňóřšťúůýž" # CS
+accents += "ąćęłńóśźż" # PL
+accents += "áäčďéíĺľňóôŕšťúýž" # SK
+accents += accents.upper()
+accents = set(accents)
+# Keep only characters missing from the vocabulary; sort so new token IDs are assigned in a stable order
+new_tokens = sorted(accents - set(tokenizer.get_vocab()))
+tokenizer.add_tokens(new_tokens)
+model.resize_token_embeddings(len(tokenizer))
+````
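
The set difference at the core of the README snippet can be checked in isolation, without downloading the model. The following is a minimal sketch, not part of the commit: the helper `missing_accent_tokens` and the toy vocabulary are hypothetical and exist only for illustration. The result is sorted because set iteration order for strings can differ between Python runs, which would otherwise change the order in which new tokens receive IDs.

```python
# Accented lowercase letters, copied from the README snippet.
CS = "áčďéěíňóřšťúůýž"
PL = "ąćęłńóśźż"
SK = "áäčďéíĺľňóôŕšťúýž"


def missing_accent_tokens(vocab):
    """Return the accented characters (both cases) absent from `vocab`.

    Hypothetical helper for illustration; `vocab` is any iterable of
    token strings, e.g. the keys of `tokenizer.get_vocab()`.
    """
    lower = CS + PL + SK
    candidates = set(lower) | set(lower.upper())
    # Sort so the tokens are always added in the same order.
    return sorted(candidates - set(vocab))


# Toy vocabulary (an assumption, not the real flan-t5 vocabulary).
toy_vocab = {"a", "á", "Č", "▁the"}
new_tokens = missing_accent_tokens(toy_vocab)
print(new_tokens)
```

With a real tokenizer, the same list could be passed on as `tokenizer.add_tokens(missing_accent_tokens(tokenizer.get_vocab()))`, followed by `model.resize_token_embeddings(len(tokenizer))` as in the README.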