|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- agentlans/questionizer |
|
|
- agentlans/text-sft-questions-answers-only |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- google/flan-t5-small |
|
|
pipeline_tag: text2text-generation
|
|
tags: |
|
|
- question |
|
|
- answer |
|
|
--- |
|
|
# FLAN T5 Small Questionizer |
|
|
|
|
|
This model, fine-tuned from [google/flan-t5-small](https://huggingface.co/google/flan-t5-small), converts declarative statements into questions that the statement can answer.
|
|
|
|
|
**Example:** |
|
|
**Input:** The sun rises in the east and sets in the west. |
|
|
**Output:** Where does the sun rise and set? |
|
|
|
|
|
## Usage |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the model |
|
|
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer") |
|
|
|
|
|
# Convert a statement into a question |
|
|
statement = "Water covers approximately 71% of the Earth's surface, making it the most abundant substance on the planet's exterior." |
|
|
question = questionizer(statement)[0]['generated_text'] |
|
|
|
|
|
print(question) |
|
|
# Output: What percentage of the Earth's surface does water cover? |
|
|
``` |
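The pipeline also accepts a list of statements, which is convenient for converting many sentences at once. A minimal sketch using two of the example inputs from this card (the `max_new_tokens` value is an illustrative choice, not taken from the model's generation defaults):

```python
from transformers import pipeline

# Load the model once and reuse it; passing a list of statements
# runs generation over the whole batch in a single call.
questionizer = pipeline("text2text-generation", model="agentlans/flan-t5-small-questionizer")

statements = [
    "Mount Everest is the highest mountain in the world.",
    "Honeybees communicate through a dance called the waggle.",
]

# Returns one dict per input, each with a 'generated_text' field.
results = questionizer(statements, max_new_tokens=64)
for result in results:
    print(result["generated_text"])
```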
|
|
|
|
|
## Examples |
|
|
|
|
|
<details> |
|
|
<summary>Click here for simple sentence examples</summary> |
|
|
|
|
|
**Input:** The sun rises in the east and sets in the west. |
|
|
**Output:** Where does the sun rise and set? |
|
|
|
|
|
**Input:** Python is a popular programming language for beginners. |
|
|
**Output:** What is a popular programming language for beginners? |
|
|
|
|
|
**Input:** Elephants are the largest land animals on Earth. |
|
|
**Output:** What are the largest land animals on Earth? |
|
|
|
|
|
**Input:** Rainbows appear when sunlight passes through raindrops. |
|
|
**Output:** When do rainbows appear? |
|
|
|
|
|
**Input:** Saturn has beautiful rings made of ice and rock. |
|
|
**Output:** What is the shape of Saturn's rings? |
|
|
|
|
|
**Input:** Coffee is enjoyed by millions of people every morning. |
|
|
**Output:** How many people enjoy coffee every morning? |
|
|
|
|
|
**Input:** Mount Everest is the highest mountain in the world. |
|
|
**Output:** What is the highest mountain in the world? |
|
|
|
|
|
**Input:** Honeybees communicate through a dance called the waggle. |
|
|
**Output:** How do honeybees communicate? |
|
|
|
|
|
**Input:** Penguins live in cold climates and cannot fly. |
|
|
**Output:** Where do Penguins live and cannot fly? |
|
|
|
|
|
**Input:** Artists use different colors and shapes to express ideas. |
|
|
**Output:** What do artists use to express ideas? |
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Click here for complex sentence examples</summary> |
|
|
|
|
|
These sentences were randomly selected from the [agentlans/high-quality-english-sentences](https://huggingface.co/datasets/agentlans/high-quality-english-sentences) dataset. |
|
|
|
|
|
**Input:** Want to know what takes up the most space on your Hard Drive? |
|
|
**Output:** What do you want to know about the most space on your Hard Drive? |
|
|
|
|
|
**Input:** DISA officials say Defense must embrace mobility management tools to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide. |
|
|
**Output:** What do DISA officials say Defense must do to ensure military users don't lag behind the private sector in secure mobile computing capability worldwide? |
|
|
|
|
|
**Input:** This is why learning a healthy nutritious eating pattern and adopting it for life, that is, lifestyle changes has a significant impact on our ability to reach and maintain a healthy weight. |
|
|
**Output:** Why is it important to learn a healthy nutritious eating pattern and adopt it for life? |
|
|
|
|
|
**Input:** In 1933, Bartlett was chosen to create the very first gift print, a color etching called 'Java'. |
|
|
**Output:** What color etching was Bartlett chosen to create in 1933? |
|
|
|
|
|
**Input:** In the United States, NERC petitions the Federal Energy Regulatory Commission (FERC) for approval of standards. |
|
|
**Output:** What does NERC petition the Federal Energy Regulatory Commission (FERC) for in the United States? |
|
|
|
|
|
**Input:** The majority of large businesses are public companies, with government enterprises, not-for-profit organisations and superannuation funds also represented. |
|
|
**Output:** What is the majority of large businesses, and what types of organizations are represented? |
|
|
|
|
|
**Input:** Because blueberries need moisture close to the soil surface, it is important to mulch them well. |
|
|
**Output:** Why is it important to mulch blueberries well? |
|
|
|
|
|
**Input:** One of the victims was a man but the other could not be determined. |
|
|
**Output:** Who was one of the victims, and how was the other determined? |
|
|
|
|
|
**Input:** The statute gives States and local educational agencies significant flexibility in how they direct resources and tailor interventions to the needs of individual schools identified for improvement. |
|
|
**Output:** What flexibility does the statute provide for States and local educational agencies? |
|
|
|
|
|
**Input:** Similar legislation would allay any hesitancy on the par of the banks in sharing cyber threat information with the government, Tunstall suggests. |
|
|
**Output:** What would similar legislation allay in sharing cyber threat information with the government, according to Tunstall? |
|
|
</details> |
|
|
|
|
|
## Limitations |
|
|
|
|
|
* The model works best with statements that provide enough context. Short or vague sentences may lead to hallucinated or unrelated questions. Example: |
|
|
|
|
|
**Input:** No. |
|
|
**Output:** Is there a requirement for a person to have a copy of a book in a library? |
|
|
|
|
|
* Not all statements are suitable for question generation. Some inputs may produce awkward questions or questions that do not match the intended meaning. |
|
|
|
|
|
### Tips for Better Results |
|
|
|
|
|
1. **Use clear, informative statements:** Include enough context so the model can generate a meaningful question. |
|
|
2. **Prefer factual sentences:** The model performs better on statements that contain concrete information (dates, quantities, events, definitions). |
|
|
3. **Avoid extremely short inputs:** Single words or one-word answers rarely produce useful questions. |
|
|
4. **Check generated questions:** While the model is powerful, review outputs for accuracy and relevance, especially for educational or professional use. |
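The first three tips can be partially automated with a simple pre-check before calling the model. This is a hypothetical helper, not part of the model or its training pipeline; the function name, word-count threshold, and heuristic are illustrative only:

```python
def is_suitable_statement(text: str, min_words: int = 4) -> bool:
    """Heuristic pre-check before question generation: very short or
    fragmentary inputs tend to yield hallucinated questions."""
    stripped = text.strip()
    # Require a minimum amount of context and a sentence-final period.
    return len(stripped.split()) >= min_words and stripped.endswith(".")


print(is_suitable_statement("The sun rises in the east and sets in the west."))  # True
print(is_suitable_statement("No."))  # False
```

A check like this can filter a corpus down to statements worth sending to the model, leaving the flagged remainder for manual review.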
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 5e-05 |
|
|
- train_batch_size: 8 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 42 |
|
|
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08)
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 20.0 |
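As a rough sketch, these hyperparameters map onto the Hugging Face `Seq2SeqTrainingArguments` API as follows; the output directory and the rest of the trainer setup are assumptions, not taken from the original training script:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-questionizer",  # hypothetical path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    num_train_epochs=20,
)
```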
|
|
|
|
|
### Training results |
|
|
|
|
|
The model was trained for 20 epochs on over 153k samples, processing 221M tokens. It achieved a training loss of 0.64 and an evaluation loss of 1.30. |
|
|
|
|
|
Training throughput was roughly 385 samples/s (~27k tokens/s), and evaluation ran at about 820 samples/s.
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.57.1 |
|
|
- Pytorch 2.9.0+cu128 |
|
|
- Datasets 4.3.0 |
|
|
- Tokenizers 0.22.1 |
|
|
|
|
|
## License
|
|
|
|
|
Apache 2.0 |