| --- |
| language: |
| - en |
| - hi |
| - de |
| - ar |
| - bn |
| - fi |
| - ja |
| - zh |
| - id |
| - sw |
| - ta |
| - el |
| - ru |
| - es |
| - th |
| - tr |
| - vi |
| - multilingual |
| datasets: |
| - squad_v2 |
| - tydiqa |
| - mlqa |
| - xquad |
| - germanquad |
| widget: |
| - text: 'Hugging Face has seen rapid growth in its popularity since the get-go. It |
| is definitely doing the right things to attract more and more people to its platform, |
| some of which are on the following lines: Community driven approach through large |
| open source repositories along with paid services. Helps to build a network of |
| like-minded people passionate about open source. Attractive price point. The subscription-based |
| features, e.g.: Inference based API, starts at a price of $9/month.' |
| example_title: English |
| - text: 'A un año y tres días de que el balón ruede en el Al Bayt Stadium inaugurando |
| el Mundial 2022, ya se han dibujado los primeros bocetos de la próxima Copa del |
| Mundo. 13 selecciones están colocadas en el mapa con la etiqueta de clasificadas |
| y tienen asegurado pisar los verdes de Qatar en la primera fase final otoñal. |
| Serbia, Dinamarca, España, Países Bajos, Suiza, Croacia, Francia, Inglaterra, |
| Bélgica, Alemania, Brasil, Argentina y Qatar, como anfitriona, entrarán en el |
| sorteo del 1 de abril de 2022 en Doha en el que 32 países serán repartidos en |
| sus respectivos grupos. ' |
| example_title: Spanish |
| --- |
| # Multilingual Question Generation Model (mt5-small) |
| Give the model a passage and it will generate a question about the passage. |
|
|
| ## Trained on the following datasets: |
|
|
| - [SQuAD (English)](https://rajpurkar.github.io/SQuAD-explorer/) |
| - [TyDiQA-GoldP (Arabic, Bengali, Finnish, Japanese, Indonesian, Kiswahili, Korean, Russian, Telugu, Thai)](https://github.com/google-research-datasets/tydiqa) |
| - [MLQA (Arabic, Chinese, English, German, Hindi, Spanish, Vietnamese)](https://github.com/facebookresearch/MLQA) |
| - [XQuAD (Arabic, Chinese, German, Greek, Hindi, Russian, Spanish, Thai, Turkish, Vietnamese)](https://github.com/deepmind/xquad) |
| - [GermanQuAD (German)](https://huggingface.co/datasets/deepset/germanquad) |
| - [Persian QA (Persian)](https://www.kaggle.com/sajjadayobi360/persianqa) |
| - [Bengali QA (Bengali)](https://www.kaggle.com/mayeesha/bengali-question-answering-dataset) |
| - [chaii (Hindi, Tamil)](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering/data) |
|
|
|
|
| ## Training details |
| I used the [Flax summarization script](https://github.com/huggingface/transformers/tree/master/examples/flax/summarization) on a TPU v3-8. The script expects a text column and a summary column. For question generation training, use the context column in place of the text column and the question column in place of the summary column. |
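
The column mapping described above can be sketched as follows. This is a minimal illustration, assuming SQuAD-style nested JSON as input; the helper names `squad_to_qgen_rows` and `write_jsonl` and the toy record are hypothetical, and this is not necessarily the preprocessing that was used for this model. Each output row carries a `context` field (standing in for the text column) and a `question` field (standing in for the summary column).

```python
import json


def squad_to_qgen_rows(squad):
    """Flatten SQuAD-style nested JSON into one row per question.

    Hypothetical helper: each row pairs a passage ("context") with one
    question written about it ("question").
    """
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                rows.append({"context": context, "question": qa["question"]})
    return rows


def write_jsonl(rows, path):
    """Write rows as JSON lines, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Toy record in SQuAD layout, for illustration only.
toy = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "The Nile flows north through Egypt.",
                    "qas": [
                        {"id": "1", "question": "Which direction does the Nile flow?"},
                        {"id": "2", "question": "Which country does the Nile flow through?"},
                    ],
                }
            ],
        }
    ]
}

rows = squad_to_qgen_rows(toy)
```

A file written with `write_jsonl(rows, "train.json")` could then be passed to the summarization script with the context field named as the text column and the question field named as the summary column (to my knowledge via its `--text_column` and `--summary_column` arguments; check the script's argument list to confirm).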
|
|
|
|
| ## Limitations and Intended Use |
|
|
| The model usually produces a question in the language of the passage, but this is not guaranteed. Lower-resource languages will likely yield lower-quality questions. |
|
|
| The intended use is to generate questions from a passage. A larger model might be able to generate training data for question-answering models, but this small one does not produce high-quality questions. |
|
|
| ## Using the model |
|
|
| #### PyTorch version |
| ```python |
| from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
| |
| tokenizer = AutoTokenizer.from_pretrained("nbroad/mt5-small-qgen") |
| model = AutoModelForSeq2SeqLM.from_pretrained("nbroad/mt5-small-qgen") |
| |
| text = "Hugging Face has seen rapid growth in its \npopularity since the get-go. It is definitely doing\n the right things to attract more and more people to \n its platform, some of which are on the following lines:\nCommunity driven approach through large open source repositories \nalong with paid services. Helps to build a network of like-minded\n people passionate about open source. \nAttractive price point. The subscription-based features, e.g.: \nInference based API, starts at a price of $9/month.\n" |
| |
| inputs = tokenizer(text, return_tensors="pt") |
| output = model.generate(**inputs, max_length=40) |
| |
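| # Decode the generated token ids into a question string |
| question = tokenizer.decode(output[0], skip_special_tokens=True) |
| print(question) |
| # Example output: What is the subscription-based features that starts at a price of $/month |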
| ``` |
|
|
| This model was trained on Cloud TPUs from Google's TPU Research Cloud (TRC). |