|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- agentlans/questionizer |
|
|
- agentlans/text-sft-questions-answers-only |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- google/flan-t5-small |
|
|
pipeline_tag: text2text-generation
|
|
tags: |
|
|
- question |
|
|
- answer |
|
|
--- |
|
|
# FLAN T5 Small Questionizer |
|
|
|
|
|
This model, fine-tuned from [google/flan-t5-small](https://huggingface.co/google/flan-t5-small), converts declarative statements into questions that the statement can answer.
|
|
|
|
|
**Example:** |
|
|
**Input:** The sun rises in the east and sets in the west. |
|
|
**Output:** Where does the sun rise and set? |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the model |
|
|
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer") |
|
|
|
|
|
# Convert a statement into a question |
|
|
statement = "Water covers approximately 71% of the Earth's surface, making it the most abundant substance on the planet's exterior." |
|
|
question = questionizer(statement)[0]['generated_text'] |
|
|
|
|
|
print(question) |
|
|
# Output: What percentage of the Earth's surface does water cover? |
|
|
``` |
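The pipeline also accepts a list of statements, which is convenient for converting many sentences at once. A minimal sketch using two of the example inputs from this card (the `max_new_tokens` value is an illustrative choice, not taken from the model's generation defaults):

```python
from transformers import pipeline

# Load the model once and reuse it; passing a list of statements
# runs generation over the whole batch in a single call.
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

statements = [
    "Mount Everest is the highest mountain in the world.",
    "Honeybees communicate through a dance called the waggle.",
]

# Returns one dict per input, each with a 'generated_text' field.
results = questionizer(statements, max_new_tokens=64)
for result in results:
    print(result["generated_text"])
```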
|
|
|
|
|
## Examples |
|
|
|
|
|
<details> |
|
|
<summary>Click here for simple sentence examples</summary> |
|
|
|
|
|
**Input:** The sun rises in the east and sets in the west. |
|
|
**Output:** Where does the sun rise and set? |
|
|
|
|
|
**Input:** Python is a popular programming language for beginners. |
|
|
**Output:** What is a popular programming language for beginners? |
|
|
|
|
|
**Input:** Elephants are the largest land animals on Earth. |
|
|
**Output:** What are the largest land animals on Earth? |
|
|
|
|
|
**Input:** Rainbows appear when sunlight passes through raindrops. |
|
|
**Output:** When do rainbows appear? |
|
|
|
|
|
**Input:** Saturn has beautiful rings made of ice and rock. |
|
|
**Output:** What is the shape of Saturn's rings? |
|
|
|
|
|
**Input:** Coffee is enjoyed by millions of people every morning. |
|
|
**Output:** How many people enjoy coffee every morning? |
|
|
|
|
|
**Input:** Mount Everest is the highest mountain in the world. |
|
|
**Output:** What is the highest mountain in the world? |
|
|
|
|
|
**Input:** Honeybees communicate through a dance called the waggle. |
|
|
**Output:** How do honeybees communicate? |
|
|
|
|
|
**Input:** Penguins live in cold climates and cannot fly. |
|
|
**Output:** Where do Penguins live and cannot fly? |
|
|
|
|
|
**Input:** Artists use different colors and shapes to express ideas. |
|
|
**Output:** What do artists use to express ideas? |
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Click here for complex sentence examples</summary> |
|
|
|
|
|
These sentences were randomly selected from the [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences) dataset. |
|
|
|
|
|
**Input:** Want to know what takes up the most space on your Hard Drive? |
|
|
**Output:** What do you want to know about the most space on your Hard Drive? |
|
|
|
|
|
**Input:** DISA officials say Defense must embrace mobility management tools to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide. |
|
|
**Output:** What do DISA officials say Defense must do to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide? |
|
|
|
|
|
**Input:** This is why learning a healthy nutritious eating pattern and adopting it for life, that is, lifestyle changes has a significant impact on our ability to reach and maintain a healthy weight. |
|
|
**Output:** Why is it important to learn a healthy nutritious eating pattern and adopt it for life? |
|
|
|
|
|
**Input:** In 1933, Bartlett was chosen to create the very first gift print, a color etching called 'Java'. |
|
|
**Output:** What color etching was Bartlett chosen to create in 1933? |
|
|
|
|
|
**Input:** In the United States, NERC petitions the Federal Energy Regulatory Commission (FERC) for approval of standards. |
|
|
**Output:** What does NERC petition the Federal Energy Regulatory Commission (FERC) for in the United States? |
|
|
|
|
|
**Input:** The majority of large businesses are public companies, with government enterprises, not-for-profit organisations and superannuation funds also represented. |
|
|
**Output:** What is the majority of large businesses, and what types of organizations are represented? |
|
|
|
|
|
**Input:** Because blueberries need moisture close to the soil surface, it is important to mulch them well. |
|
|
**Output:** Why is it important to mulch blueberries well? |
|
|
|
|
|
**Input:** One of the victims was a man but the other could not be determined. |
|
|
**Output:** Who was one of the victims, and how was the other determined? |
|
|
|
|
|
**Input:** The statute gives States and local educational agencies significant flexibility in how they direct resources and tailor interventions to the needs of individual schools identified for improvement. |
|
|
**Output:** What flexibility does the statute provide for States and local educational agencies? |
|
|
|
|
|
**Input:** Similar legislation would allay any hesitancy on the par of the banks in sharing cyber threat information with the government, Tunstall suggests. |
|
|
**Output:** What would similar legislation allay in sharing cyber threat information with the government, according to Tunstall? |
|
|
</details> |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* The model works best with statements that provide enough context. Short or vague sentences may lead to hallucinated or unrelated questions. Example: |
|
|
|
|
|
**Input:** No. |
|
|
**Output:** Is there a requirement for a person to have a copy of a book in a library? |
|
|
|
|
|
* Not all statements are suitable for question generation. Some inputs may produce awkward questions or questions that do not match the intended meaning. |
|
|
|
|
|
### Tips for Better Results |
|
|
|
|
|
1. **Use clear, informative statements:** Include enough context so the model can generate a meaningful question. |
|
|
2. **Prefer factual sentences:** The model performs better on statements that contain concrete information (dates, quantities, events, definitions). |
|
|
3. **Avoid extremely short inputs:** Single words or one-word answers rarely produce useful questions. |
|
|
4. **Check generated questions:** While the model is powerful, review outputs for accuracy and relevance, especially for educational or professional use. |
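The first three tips can be partially automated with a simple pre-check before calling the model. This is a hypothetical helper, not part of the model or its training pipeline; the function name, word-count threshold, and heuristic are illustrative only:

```python
def is_suitable_statement(text: str, min_words: int = 4) -> bool:
    """Heuristic pre-check before question generation: very short or
    fragmentary inputs tend to yield hallucinated questions."""
    stripped = text.strip()
    # Require a minimum amount of context and a sentence-final period.
    return len(stripped.split()) >= min_words and stripped.endswith(".")


print(is_suitable_statement("The sun rises in the east and sets in the west."))  # True
print(is_suitable_statement("No."))  # False
```

A check like this can filter a corpus down to statements worth sending to the model, leaving the flagged remainder for manual review.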
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 5e-05 |
|
|
- train_batch_size: 8 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08)
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 20.0 |
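As a rough sketch, these hyperparameters map onto the Hugging Face `Seq2SeqTrainingArguments` API as follows; the output directory and the rest of the trainer setup are assumptions, not taken from the original training script:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-questionizer",  # hypothetical path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=20,
)
```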
|
|
|
|
|
### Training results |
|
|
|
|
|
The model was trained for 20 epochs on over 153k samples, processing 221M tokens. It achieved a training loss of 0.64 and an evaluation loss of 1.30. |
|
|
|
|
|
Training throughput was roughly 385 samples/s (~27k tokens/s), and evaluation ran at about 820 samples/s.
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.57.1 |
|
|
- Pytorch 2.9.0+cu128 |
|
|
- Datasets 4.3.0 |
|
|
- Tokenizers 0.22.1 |
|
|
|
|
|
## License
|
|
|
|
|
Apache 2.0 |