---
license: apache-2.0
datasets:
- agentlans/questionizer
- agentlans/text-sft-questions-answers-only
language:
- en
base_model:
- google/flan-t5-small
pipeline_tag: text2text-generation
tags:
- question
- answer
---

# FLAN T5 Small Questionizer

This model converts declarative statements into questions.

**Example:**

**Input:** The sun rises in the east and sets in the west.
**Output:** Where does the sun rise and set?

## Usage

```python
from transformers import pipeline

# Load the model
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

# Convert a statement into a question
statement = "Water covers approximately 71% of the Earth's surface, making it the most abundant substance on the planet's exterior."
question = questionizer(statement)[0]['generated_text']
print(question)
# Output: What percentage of the Earth's surface does water cover?
```

## Examples
<details>
<summary>Click here for simple sentence examples</summary>

**Input:** The sun rises in the east and sets in the west.
**Output:** Where does the sun rise and set?

**Input:** Python is a popular programming language for beginners.
**Output:** What is a popular programming language for beginners?

**Input:** Elephants are the largest land animals on Earth.
**Output:** What are the largest land animals on Earth?

**Input:** Rainbows appear when sunlight passes through raindrops.
**Output:** When do rainbows appear?

**Input:** Saturn has beautiful rings made of ice and rock.
**Output:** What is the shape of Saturn's rings?

**Input:** Coffee is enjoyed by millions of people every morning.
**Output:** How many people enjoy coffee every morning?

**Input:** Mount Everest is the highest mountain in the world.
**Output:** What is the highest mountain in the world?

**Input:** Honeybees communicate through a dance called the waggle.
**Output:** How do honeybees communicate?

**Input:** Penguins live in cold climates and cannot fly.
**Output:** Where do Penguins live and cannot fly?

**Input:** Artists use different colors and shapes to express ideas.
**Output:** What do artists use to express ideas?

</details>
<details>
<summary>Click here for complex sentence examples</summary>

These sentences were randomly selected from the [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences) dataset.

**Input:** Want to know what takes up the most space on your Hard Drive?
**Output:** What do you want to know about the most space on your Hard Drive?

**Input:** DISA officials say Defense must embrace mobility management tools to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide.
**Output:** What do DISA officials say Defense must do to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide?

**Input:** This is why learning a healthy nutritious eating pattern and adopting it for life, that is, lifestyle changes has a significant impact on our ability to reach and maintain a healthy weight.
**Output:** Why is it important to learn a healthy nutritious eating pattern and adopt it for life?

**Input:** In 1933, Bartlett was chosen to create the very first gift print, a color etching called 'Java'.
**Output:** What color etching was Bartlett chosen to create in 1933?

**Input:** In the United States, NERC petitions the Federal Energy Regulatory Commission (FERC) for approval of standards.
**Output:** What does NERC petition the Federal Energy Regulatory Commission (FERC) for in the United States?

**Input:** The majority of large businesses are public companies, with government enterprises, not-for-profit organisations and superannuation funds also represented.
**Output:** What is the majority of large businesses, and what types of organizations are represented?

**Input:** Because blueberries need moisture close to the soil surface, it is important to mulch them well.
**Output:** Why is it important to mulch blueberries well?

**Input:** One of the victims was a man but the other could not be determined.
**Output:** Who was one of the victims, and how was the other determined?

**Input:** The statute gives States and local educational agencies significant flexibility in how they direct resources and tailor interventions to the needs of individual schools identified for improvement.
**Output:** What flexibility does the statute provide for States and local educational agencies?

**Input:** Similar legislation would allay any hesitancy on the par of the banks in sharing cyber threat information with the government, Tunstall suggests.
**Output:** What would similar legislation allay in sharing cyber threat information with the government, according to Tunstall?

</details>
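For processing many statements at once, the pipeline can simply be called in a loop, skipping inputs too short to yield a meaningful question (see Limitations below). This is a minimal sketch; the `is_informative` helper and its four-word threshold are illustrative assumptions, not part of the model.

```python
from transformers import pipeline

def is_informative(statement: str, min_words: int = 4) -> bool:
    """Heuristic filter: very short inputs tend to produce hallucinated questions."""
    return len(statement.split()) >= min_words

statements = [
    "No.",  # too short; the model would likely hallucinate a question
    "The Great Wall of China is over 21,000 kilometres long.",
]

# Same pipeline as in the Usage section
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

for statement in statements:
    if not is_informative(statement):
        continue  # skip uninformative inputs rather than forcing a question
    question = questionizer(statement)[0]["generated_text"]
    print(f"{statement} -> {question}")
```

The word-count cutoff is a crude proxy for "enough context"; tune it, or replace it with your own check, for your data.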
## Limitations

* The model works best with statements that provide enough context. Short or vague sentences may lead to hallucinated or unrelated questions. Example:

  **Input:** No.
  **Output:** Is there a requirement for a person to have a copy of a book in a library?

* Not all statements are suitable for question generation. Some inputs may produce awkward questions or questions that do not match the intended meaning.

### Tips for Better Results

1. **Use clear, informative statements:** Include enough context so the model can generate a meaningful question.
2. **Prefer factual sentences:** The model performs better on statements that contain concrete information (dates, quantities, events, definitions).
3. **Avoid extremely short inputs:** Single words or one-word answers rarely produce useful questions.
4. **Check generated questions:** Review outputs for accuracy and relevance, especially for educational or professional use.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 20.0

### Training results

The model was trained for 20 epochs on over 153k samples, processing 221M tokens. It achieved a training loss of 0.64 and an evaluation loss of 1.30. Training was efficient, with ~385 samples/sec and ~27k tokens/sec, and evaluation ran at ~820 samples/sec.

### Framework versions

- Transformers 4.57.1
- Pytorch 2.9.0+cu128
- Datasets 4.3.0
- Tokenizers 0.22.1

## Licence

Apache 2.0