---
license: apache-2.0
datasets:
- agentlans/questionizer
- agentlans/text-sft-questions-answers-only
language:
- en
base_model:
- google/flan-t5-small
pipeline_tag: text2text-generation
tags:
- question
- answer
---
# FLAN T5 Small Questionizer
This model converts declarative statements into questions.
**Example:**
**Input:** The sun rises in the east and sets in the west.
**Output:** Where does the sun rise and set?
## Usage
```python
from transformers import pipeline
# Load the model
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")
# Convert a statement into a question
statement = "Water covers approximately 71% of the Earth's surface, making it the most abundant substance on the planet's exterior."
question = questionizer(statement)[0]['generated_text']
print(question)
# Output: What percentage of the Earth's surface does water cover?
```
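For more control over decoding, the model can also be loaded directly with `AutoTokenizer` and `AutoModelForSeq2SeqLM`. The generation settings below (beam search, token limit) are illustrative, not values recommended by the model author:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "agentlans/flan-t5-small-questionizer"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

statement = "Mount Everest is the highest mountain in the world."
inputs = tokenizer(statement, return_tensors="pt")

# Beam search often yields more fluent questions than greedy decoding;
# max_new_tokens and num_beams here are example values, not tuned defaults.
output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
question = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(question)
```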
## Examples
### Simple sentence examples
**Input:** The sun rises in the east and sets in the west.
**Output:** Where does the sun rise and set?
**Input:** Python is a popular programming language for beginners.
**Output:** What is a popular programming language for beginners?
**Input:** Elephants are the largest land animals on Earth.
**Output:** What are the largest land animals on Earth?
**Input:** Rainbows appear when sunlight passes through raindrops.
**Output:** When do rainbows appear?
**Input:** Saturn has beautiful rings made of ice and rock.
**Output:** What is the shape of Saturn's rings?
**Input:** Coffee is enjoyed by millions of people every morning.
**Output:** How many people enjoy coffee every morning?
**Input:** Mount Everest is the highest mountain in the world.
**Output:** What is the highest mountain in the world?
**Input:** Honeybees communicate through a dance called the waggle.
**Output:** How do honeybees communicate?
**Input:** Penguins live in cold climates and cannot fly.
**Output:** Where do Penguins live and cannot fly?
**Input:** Artists use different colors and shapes to express ideas.
**Output:** What do artists use to express ideas?
### Complex sentence examples
These sentences were randomly selected from the [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences) dataset.
**Input:** Want to know what takes up the most space on your Hard Drive?
**Output:** What do you want to know about the most space on your Hard Drive?
**Input:** DISA officials say Defense must embrace mobility management tools to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide.
**Output:** What do DISA officials say Defense must do to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide?
**Input:** This is why learning a healthy nutritious eating pattern and adopting it for life, that is, lifestyle changes has a significant impact on our ability to reach and maintain a healthy weight.
**Output:** Why is it important to learn a healthy nutritious eating pattern and adopt it for life?
**Input:** In 1933, Bartlett was chosen to create the very first gift print, a color etching called 'Java'.
**Output:** What color etching was Bartlett chosen to create in 1933?
**Input:** In the United States, NERC petitions the Federal Energy Regulatory Commission (FERC) for approval of standards.
**Output:** What does NERC petition the Federal Energy Regulatory Commission (FERC) for in the United States?
**Input:** The majority of large businesses are public companies, with government enterprises, not-for-profit organisations and superannuation funds also represented.
**Output:** What is the majority of large businesses, and what types of organizations are represented?
**Input:** Because blueberries need moisture close to the soil surface, it is important to mulch them well.
**Output:** Why is it important to mulch blueberries well?
**Input:** One of the victims was a man but the other could not be determined.
**Output:** Who was one of the victims, and how was the other determined?
**Input:** The statute gives States and local educational agencies significant flexibility in how they direct resources and tailor interventions to the needs of individual schools identified for improvement.
**Output:** What flexibility does the statute provide for States and local educational agencies?
**Input:** Similar legislation would allay any hesitancy on the par of the banks in sharing cyber threat information with the government, Tunstall suggests.
**Output:** What would similar legislation allay in sharing cyber threat information with the government, according to Tunstall?
## Limitations
* The model works best with statements that provide enough context. Short or vague sentences may lead to hallucinated or unrelated questions. Example:
**Input:** No.
**Output:** Is there a requirement for a person to have a copy of a book in a library?
* Not all statements are suitable for question generation. Some inputs may produce awkward questions or questions that do not match the intended meaning.
### Tips for Better Results
1. **Use clear, informative statements:** Include enough context so the model can generate a meaningful question.
2. **Prefer factual sentences:** The model performs better on statements that contain concrete information (dates, quantities, events, definitions).
3. **Avoid extremely short inputs:** Single words or one-word answers rarely produce useful questions.
4. **Check generated questions:** While the model is powerful, review outputs for accuracy and relevance, especially for educational or professional use.
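When applying these tips across many statements, the pipeline also accepts a list of inputs. The `batch_size` and `max_new_tokens` values below are illustrative, not tuned recommendations:

```python
from transformers import pipeline

questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

statements = [
    "Water covers approximately 71% of the Earth's surface.",
    "Honeybees communicate through a dance called the waggle.",
]
# Passing a list generates one question per statement; batch_size and
# max_new_tokens are example settings, not values from the model card.
results = questionizer(statements, batch_size=8, max_new_tokens=48)
questions = [r["generated_text"] for r in results]
for q in questions:
    print(q)
```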
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 20.0
### Training results
The model was trained for 20 epochs on over 153k samples, processing 221M tokens. It achieved a training loss of 0.64 and an evaluation loss of 1.30.
Training was efficient, with ~385 samples/sec and ~27k tokens/sec, and evaluation ran at ~820 samples/sec.
### Framework versions
- Transformers 4.57.1
- Pytorch 2.9.0+cu128
- Datasets 4.3.0
- Tokenizers 0.22.1
## License
Apache 2.0