dvres committed on
Commit fa2aa6e · verified · 1 Parent(s): 05b7b98

Update README.md

  1. README.md +62 -1
README.md CHANGED
@@ -7,4 +7,65 @@ language:
  - sr
  - bs
  library_name: transformers
- ---
+ ---
+
+ # OPT_sl 1B
+
+ This is the 1B OPT model, further pretrained on Slovene data. It was created as part of the Povejmo project: https://www.cjvt.si/povejmo/.
+
+ This is the base version of the model; it is not instruction-tuned.
+
+ ## Data
+
+ The model was further pretrained on the following Slovene, English, and Croatian-Bosnian-Serbian (CBS) corpora:
+
+ | Corpus     | Language | # Tokens | Percentage |
+ | :--------- | :------- | :------: | :--------: |
+ | Metafida   | Slovene  | 6.59 B   | 13.89 %    |
+ | KAS        | Slovene  | 3.61 B   | 7.62 %     |
+ | Trendi     | Slovene  | 1.4 B    | 2.96 %     |
+ | mC4        | Slovene  | 5.5 B    | 11.6 %     |
+ | MaCoCu     | Slovene  | 4.68 B   | 9.86 %     |
+ | CC100      | Slovene  | 0.54 B   | 1.14 %     |
+ | Rižnica    | Croatian | 0.21 B   | 0.44 %     |
+ | Hr News    | Croatian | 4.16 B   | 8.77 %     |
+ | MaCoCu HBS | CBS      | 15.65 B  | 32.98 %    |
+ | Wikipedia  | English  | 4.7 B    | 9.9 %      |
+ | CC-News    | English  | 0.4 B    | 0.83 %     |
+
+ The additional training data totals **47.44 B** tokens.
+
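As a quick consistency check on the table above, the per-corpus token counts can be summed and compared with the stated total. The sketch below is illustrative only: the `corpora` list is transcribed by hand from the table, and the variable names are assumptions, not part of the model's code. Because the table's counts are rounded, recomputed shares may differ from the listed percentages in the last digit.

```python
from collections import defaultdict

# (corpus, language, tokens in billions), transcribed from the table above.
corpora = [
    ("Metafida", "Slovene", 6.59),
    ("KAS", "Slovene", 3.61),
    ("Trendi", "Slovene", 1.4),
    ("mC4", "Slovene", 5.5),
    ("MaCoCu", "Slovene", 4.68),
    ("CC100", "Slovene", 0.54),
    ("Rižnica", "Croatian", 0.21),
    ("Hr News", "Croatian", 4.16),
    ("MaCoCu HBS", "CBS", 15.65),
    ("Wikipedia", "English", 4.7),
    ("CC-News", "English", 0.4),
]

# Overall total across all corpora.
total = sum(tokens for _, _, tokens in corpora)

# Per-language subtotals.
by_language = defaultdict(float)
for _, language, tokens in corpora:
    by_language[language] += tokens

print(f"Total: {total:.2f} B tokens")
for language, tokens in sorted(by_language.items(), key=lambda kv: -kv[1]):
    print(f"{language:<10} {tokens:6.2f} B  {100 * tokens / total:5.2f} %")
```

Grouped this way, the Slovene corpora account for a bit under half of the additional training data (22.32 B of 47.44 B tokens).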
+ ## Model usage
+
+ Inference can be run with the following snippet:
+
+ ```python
+ from transformers import AutoTokenizer, pipeline
+
+ tokenizer = AutoTokenizer.from_pretrained("cjvt/OPT_sl")
+
+ # device_map="auto" requires the accelerate package.
+ pline = pipeline(
+     "text-generation",
+     model="cjvt/OPT_sl",
+     tokenizer=tokenizer,
+     device_map="auto",
+ )
+
+ # One prompt each in English, Slovene, and Croatian.
+ prompts = [
+     "The examples of antonyms are:\nhigh => low\nwide => narrow\nbig =>",
+     "Pristanek je bil prvi nadzorovani spust ameriškega vesoljskega plovila na površje Lune po Apollu 17 leta 1972, ko je na Luni pristala zadnja Nasina misija s posadko.\nDoslej so na Luni pristala vesoljska plovila le iz štirih drugih držav –",
+     "U četvrtak je bila prva polufinalna večer Dore, a komentari na društvenim mrežama ne prestaju. U nedjeljno finale prošli su:",
+ ]
+
+ # Greedy decoding; max_length counts prompt tokens plus generated tokens.
+ sequences = pline(
+     prompts,
+     max_length=1000,
+     do_sample=False,
+     num_return_sequences=1,
+     eos_token_id=tokenizer.eos_token_id,
+ )
+
+ # With a list of prompts, each result is a list of candidate generations.
+ for seq in sequences:
+     print("--------------------------")
+     print(f"Result: {seq[0]['generated_text']}")
+     print("--------------------------\n")
+ ```
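By default, the `text-generation` pipeline returns each prompt together with its continuation in `generated_text`. The following is a minimal sketch, not part of the model card's own code, of separating the two. `split_continuation` is a hypothetical helper name, and `fake_sequences` is a hand-written stand-in with the same nested shape as `sequences` above, so the sketch runs without downloading the model.

```python
def split_continuation(prompt: str, generated_text: str) -> str:
    """Return only the newly generated part of a pipeline result."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Fall back to the full text if the prompt was not echoed back.
    return generated_text


# Hypothetical prompts, kept short for illustration.
prompts = ["big =>", "wide =>"]

# Stand-in for pipeline output: one list of candidates per prompt.
fake_sequences = [
    [{"generated_text": "big => small"}],
    [{"generated_text": "wide => narrow"}],
]

continuations = [
    split_continuation(prompt, seq[0]["generated_text"])
    for prompt, seq in zip(prompts, fake_sequences)
]
print(continuations)
```

The pipeline call also accepts `return_full_text=False`, which should make it return only the continuation and may remove the need for such a helper.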