---
license: apache-2.0
datasets:
- agentlans/questionizer
- agentlans/text-sft-questions-answers-only
language:
- en
base_model:
- google/flan-t5-small
pipeline_tag: text2text-generation
tags:
- question
- answer
---

# FLAN T5 Small Questionizer

This model converts declarative statements into questions.

**Example:**

**Input:** The sun rises in the east and sets in the west.
**Output:** Where does the sun rise and set?

## Usage

```python
from transformers import pipeline

# Load the model
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

# Convert a statement into a question
statement = "Water covers approximately 71% of the Earth's surface, making it the most abundant substance on the planet's exterior."
question = questionizer(statement)[0]['generated_text']
print(question)
# Output: What percentage of the Earth's surface does water cover?
```

## Examples
<details>
<summary>Click here for simple sentence examples</summary>

**Input:** The sun rises in the east and sets in the west.
**Output:** Where does the sun rise and set?

**Input:** Python is a popular programming language for beginners.
**Output:** What is a popular programming language for beginners?

**Input:** Elephants are the largest land animals on Earth.
**Output:** What are the largest land animals on Earth?

**Input:** Rainbows appear when sunlight passes through raindrops.
**Output:** When do rainbows appear?

**Input:** Saturn has beautiful rings made of ice and rock.
**Output:** What is the shape of Saturn's rings?

**Input:** Coffee is enjoyed by millions of people every morning.
**Output:** How many people enjoy coffee every morning?

**Input:** Mount Everest is the highest mountain in the world.
**Output:** What is the highest mountain in the world?

**Input:** Honeybees communicate through a dance called the waggle.
**Output:** How do honeybees communicate?

**Input:** Penguins live in cold climates and cannot fly.
**Output:** Where do Penguins live and cannot fly?

**Input:** Artists use different colors and shapes to express ideas.
**Output:** What do artists use to express ideas?

</details>
<details>
<summary>Click here for complex sentence examples</summary>

These sentences were randomly selected from the [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences) dataset.

**Input:** Want to know what takes up the most space on your Hard Drive?
**Output:** What do you want to know about the most space on your Hard Drive?

**Input:** DISA officials say Defense must embrace mobility management tools to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide.
**Output:** What do DISA officials say Defense must do to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide?

**Input:** This is why learning a healthy nutritious eating pattern and adopting it for life, that is, lifestyle changes has a significant impact on our ability to reach and maintain a healthy weight.
**Output:** Why is it important to learn a healthy nutritious eating pattern and adopt it for life?

**Input:** In 1933, Bartlett was chosen to create the very first gift print, a color etching called 'Java'.
**Output:** What color etching was Bartlett chosen to create in 1933?

**Input:** In the United States, NERC petitions the Federal Energy Regulatory Commission (FERC) for approval of standards.
**Output:** What does NERC petition the Federal Energy Regulatory Commission (FERC) for in the United States?

**Input:** The majority of large businesses are public companies, with government enterprises, not-for-profit organisations and superannuation funds also represented.
**Output:** What is the majority of large businesses, and what types of organizations are represented?

**Input:** Because blueberries need moisture close to the soil surface, it is important to mulch them well.
**Output:** Why is it important to mulch blueberries well?

**Input:** One of the victims was a man but the other could not be determined.
**Output:** Who was one of the victims, and how was the other determined?

**Input:** The statute gives States and local educational agencies significant flexibility in how they direct resources and tailor interventions to the needs of individual schools identified for improvement.
**Output:** What flexibility does the statute provide for States and local educational agencies?

**Input:** Similar legislation would allay any hesitancy on the par of the banks in sharing cyber threat information with the government, Tunstall suggests.
**Output:** What would similar legislation allay in sharing cyber threat information with the government, according to Tunstall?

</details>
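For processing many statements at once, the pipeline can simply be called in a loop, skipping inputs too short to yield a meaningful question (see Limitations below). This is a minimal sketch; the `is_informative` helper and its four-word threshold are illustrative assumptions, not part of the model.

```python
from transformers import pipeline

def is_informative(statement: str, min_words: int = 4) -> bool:
    """Heuristic filter: very short inputs tend to produce hallucinated questions."""
    return len(statement.split()) >= min_words

statements = [
    "No.",  # too short; the model would likely hallucinate a question
    "The Great Wall of China is over 21,000 kilometres long.",
]

# Same pipeline as in the Usage section
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

for statement in statements:
    if not is_informative(statement):
        continue  # skip uninformative inputs rather than forcing a question
    question = questionizer(statement)[0]["generated_text"]
    print(f"{statement} -> {question}")
```

The word-count cutoff is a crude proxy for "enough context"; tune it, or replace it with your own check, for your data.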
## Limitations

* The model works best with statements that provide enough context. Short or vague sentences may lead to hallucinated or unrelated questions. Example:

  **Input:** No.
  **Output:** Is there a requirement for a person to have a copy of a book in a library?

* Not all statements are suitable for question generation. Some inputs may produce awkward questions or questions that do not match the intended meaning.

### Tips for Better Results

1. **Use clear, informative statements:** Include enough context so the model can generate a meaningful question.
2. **Prefer factual sentences:** The model performs better on statements that contain concrete information (dates, quantities, events, definitions).
3. **Avoid extremely short inputs:** Single words or one-word answers rarely produce useful questions.
4. **Check generated questions:** Review outputs for accuracy and relevance, especially for educational or professional use.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 20.0

### Training results

The model was trained for 20 epochs on over 153k samples, processing 221M tokens. It achieved a training loss of 0.64 and an evaluation loss of 1.30. Training was efficient, with ~385 samples/sec and ~27k tokens/sec, and evaluation ran at ~820 samples/sec.

### Framework versions

- Transformers 4.57.1
- Pytorch 2.9.0+cu128
- Datasets 4.3.0
- Tokenizers 0.22.1

## Licence

Apache 2.0