ctu-aic/mt5-base-multilingual-summarization-multilarge-cs

@@ -33,6 +33,65 @@ metrics:
 This model is a fine-tuned checkpoint of [google/mt5-base](https://huggingface.co/google/mt5-base) on the Multilingual large summarization dataset focused on Czech texts to produce multilingual summaries.
 ## Task
 The model produces multi-sentence summaries in eight languages. By adding documents in other languages to a considerable number of Czech documents, we aimed to improve summarization in Czech. Supported languages: ```'cs': '<extra_id_0>', 'en': '<extra_id_1>', 'de': '<extra_id_2>', 'es': '<extra_id_3>', 'fr': '<extra_id_4>', 'ru': '<extra_id_5>', 'tu': '<extra_id_6>', 'zh': '<extra_id_7>'```
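
The language-to-token mapping above can be kept as a plain dictionary. The following is a minimal sketch, assuming a hypothetical helper `language_token` that is not part of the model's code; how the token is passed to the model is not specified in this card and is left out.

```python
# Language codes and their sentinel tokens, copied from the list above.
# The card writes the Turkish key as 'tu', while the usage configuration
# lists 'tr'; the key is kept here exactly as the card writes it.
LANG_TOKENS = {
    "cs": "<extra_id_0>",
    "en": "<extra_id_1>",
    "de": "<extra_id_2>",
    "es": "<extra_id_3>",
    "fr": "<extra_id_4>",
    "ru": "<extra_id_5>",
    "tu": "<extra_id_6>",
    "zh": "<extra_id_7>",
}


def language_token(lang):
    """Return the sentinel token for a language code (hypothetical helper)."""
    try:
        return LANG_TOKENS[lang]
    except KeyError:
        supported = ", ".join(sorted(LANG_TOKENS))
        raise ValueError(
            f"unsupported language {lang!r}; expected one of: {supported}"
        ) from None
```

Raising a `ValueError` on an unknown code surfaces a mistyped language early instead of silently summarizing in the wrong language.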
+## Usage
+```python
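+from collections import OrderedDict
+
+# NOTE: MultiSummarizer is not defined or imported in this snippet;
+# it has to be made available before the code below can run.
+
+## Configuration of the summarization pipeline
+#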
+def summ_config():
+    cfg = OrderedDict([
+        ## summarization model - checkpoint
+        #   ctu-aic/m2m100-418M-multilingual-summarization-multilarge-cs
+        #   ctu-aic/mt5-base-multilingual-summarization-multilarge-cs
+        #   ctu-aic/mbart25-multilingual-summarization-multilarge-cs
+        ("model_name", "ctu-aic/mt5-base-multilingual-summarization-multilarge-cs"),
+        ## language of summarization task
+        #   language : string : cs, en, de, fr, es, tr, ru, zh
+        ("language", "en"),
+        ## generation method parameters in dictionary
+        #
+        ("inference_cfg", OrderedDict([
+            ("num_beams", 4),
+            ("top_k", 40),
+            ("top_p", 0.92),
+            ("do_sample", True),
+            ("temperature", 0.95),
+            ("repetition_penalty", 1.23),
+            ("no_repeat_ngram_size", None),
+            ("early_stopping", True),
+            ("max_length", 128),
+            ("min_length", 10),
+        ])),
+        ## texts to summarize: list of strings, a single string, or a dataset
+        ("texts",
+            [
+               "english text1 to summarize",
+               "english text2 to summarize",
+            ]
+        ),
+        ## OPTIONAL: target (gold) summaries: list of strings, a single string, or None
+        ('golds',
+         [
+               "target english text1",
+               "target english text2",
+         ]),
+        #('golds', None),
+    ])
+    return cfg
+
+# Build the configuration, create the summarizer, and run it.
+cfg = summ_config()
+mSummarize = MultiSummarizer(**cfg)
+summaries, scores = mSummarize(**cfg)
+```
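
The configuration comments say that `texts` may be a single string or a list of strings and that `golds` is optional. Below is a minimal sketch of normalizing these two fields before they are passed on, assuming a hypothetical helper `normalize_inputs` that is not part of `MultiSummarizer`. The dataset case is not handled, and the one-to-one pairing of texts and gold summaries is an assumption.

```python
def normalize_inputs(texts, golds=None):
    """Return (texts, golds) as lists; hypothetical helper, not the authors' API."""
    # A single string becomes a one-element list.
    texts = [texts] if isinstance(texts, str) else list(texts)

    # Gold summaries are optional.
    if golds is None:
        return texts, None

    golds = [golds] if isinstance(golds, str) else list(golds)

    # Assumption: each text is paired with exactly one gold summary.
    if len(golds) != len(texts):
        raise ValueError(
            f"got {len(texts)} texts but {len(golds)} gold summaries"
        )
    return texts, golds
```

For example, `normalize_inputs("one text")` yields a one-element list of texts and `None` for the gold summaries.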
 ## Dataset
 The Multilingual large summarization dataset consists of 10 sub-datasets mainly based on news and daily mails. The entire training set and 72% of the validation set were used for training.
 ```