Seznam/small-e-czech

mkocian committed on Aug 30, 2021

Commit 49736c4 · 1 parent: b15da03

Create README.md

Files changed (1)

README.md +41 -0

README.md ADDED

	@@ -0,0 +1,41 @@

+# Small-E-Czech
+Small-E-Czech is an [Electra](https://arxiv.org/abs/2003.10555)-small model pretrained on a Czech corpus created at Seznam.cz. Like other pretrained models, it should be finetuned on a downstream task of interest before use.
+### How to use the discriminator in transformers
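+```python
+from transformers import ElectraForPreTraining, ElectraTokenizerFast
+import torch
+discriminator = ElectraForPreTraining.from_pretrained("seznam/small-e-czech")
+# strip_accents=False keeps Czech diacritics intact during tokenization.
+tokenizer = ElectraTokenizerFast.from_pretrained(
+    "seznam/small-e-czech", strip_accents=False
+)
+# Original sentence, shown for comparison only (not used below).
+sentence = "Za hory, za doly, mé zlaté parohy"
+# Same sentence with "mé" replaced by "kočka" to imitate a generator substitution.
+fake_sentence = "Za hory, za doly, kočka zlaté parohy"
+# Token strings for display; encode() adds [CLS] and [SEP], so add them here too.
+fake_sentence_tokens = ["[CLS]"] + tokenizer.tokenize(fake_sentence) + ["[SEP]"]
+fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")
+# Inference only, so gradient tracking is not needed.
+with torch.no_grad():
+    discriminator_outputs = discriminator(fake_inputs)
+# The first output holds one logit per token; sigmoid maps it to a probability.
+predictions = torch.sigmoid(discriminator_outputs[0]).cpu().numpy()
+# Print the tokens and their probabilities in aligned 7-character columns.
+for token in fake_sentence_tokens:
+    print("{:>7s}".format(token), end="")
+print()
+for prediction in predictions.squeeze():
+    print("{:7.1f}".format(prediction), end="")
+print()
+```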
+The output shows, for each token, the discriminator's estimated probability that the token does not belong in the sentence (i.e. that it was replaced by the generator):
+```
+  [CLS]     za   hory      ,     za    dol    ##y      ,  kočka  zlaté   paro   ##hy  [SEP]
+    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.8    0.3    0.2    0.1    0.0
+```
+### Finetuning
+For instructions on how to finetune the model on a new task, see the official Hugging Face Transformers [tutorial](https://huggingface.co/transformers/training.html).
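
The per-token probabilities printed in the README's sample output can be turned into a yes/no decision for each token by applying a cutoff. The sketch below is a minimal illustration and not part of the model's API: the helper `flag_replaced_tokens` and the 0.5 threshold are assumptions made for this example, and the tokens and probabilities are copied from the sample output rather than computed by the model.

```python
from typing import List, Tuple

# Tokens and rounded probabilities copied from the README's sample output.
TOKENS = ["[CLS]", "za", "hory", ",", "za", "dol", "##y", ",",
          "kočka", "zlaté", "paro", "##hy", "[SEP]"]
PROBABILITIES = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
                 0.8, 0.3, 0.2, 0.1, 0.0]


def flag_replaced_tokens(
    tokens: List[str],
    probabilities: List[float],
    threshold: float = 0.5,  # assumed cutoff, not prescribed by the model card
) -> List[Tuple[int, str, float]]:
    """Return (position, token, probability) for tokens scored above the threshold."""
    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")
    return [
        (i, tok, p)
        for i, (tok, p) in enumerate(zip(tokens, probabilities))
        if p > threshold
    ]


flagged = flag_replaced_tokens(TOKENS, PROBABILITIES)
for position, token, probability in flagged:
    print(f"position {position}: {token!r} flagged (p={probability:.1f})")
```

With this cutoff only `kočka` is flagged. Lowering the threshold to 0.25 would also flag `zlaté`, so the value is worth tuning on held-out data for the task at hand.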