# Question Difficulty Classification Model
## Introduction
This project classifies question-answer pairs by difficulty as Easy, Medium, or Hard. You can pass a single question-answer pair (question and answer separated by a comma) or a list of question-answer pairs to the model.
For this task, I fine-tuned the pre-trained [bert-base-cased](https://huggingface.co/bert-base-cased) model on the [Question-Answer Dataset](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset) by [Carnegie Mellon University](https://www.cmu.edu/).
## Table of Contents
- [Model Details](#model-details)
- [Dependencies](#dependencies)
- [How to Use the Model](#how-to-use-the-model)
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
- [Training](#training)
## Model Details
**Model Description:** This model is a fine-tuned checkpoint of [bert-base-cased](https://huggingface.co/bert-base-cased), which was pretrained on a large corpus of English data in a self-supervised fashion.
This model reaches an accuracy of 95% on the dev set (for comparison, the bert-base-uncased version reaches an accuracy of 97%).
- **Developed by:** Hugging Face
- **Model Type:** Text Classification
- **Language(s):** English
- **License:** Apache-2.0
- **Parent Model:** For more details about BERT, we encourage users to check out [this model card](https://huggingface.co/bert-base-cased).
- **Resources for more information:**
    - [Model Documentation](https://huggingface.co/docs/transformers/main/en/model_doc/bert)
## Dependencies
- Transformers
- TensorFlow
- Python 3.7.13
- NumPy
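If these are not already available, a typical pip-based setup could look like the following (this card does not pin package versions, so you may need to match them to your environment):
```bash
pip install transformers tensorflow numpy
```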
## How to Use the Model
1. Import the essential libraries:

```python
from transformers import TFBertModel   # needed so Keras can resolve the BERT layer in the saved model
from transformers import BertTokenizer
import tensorflow as tf
```
2. Load the model and tokenizer:
```python
# If loading fails because of the custom BERT layer, passing
# custom_objects={'TFBertModel': TFBertModel} to load_model may help.
questionclassification_model = tf.keras.models.load_model('<path to the model>')
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
```
3. Define the essential functions:
```python
def prepare_data(input_text):
    # Tokenize a batch of "question,answer" strings into fixed-length tensors
    token = tokenizer.batch_encode_plus(
        input_text,
        max_length=256,
        truncation=True,
        padding='max_length',
        add_special_tokens=True,
        return_tensors='tf'
    )
    # Cast to float64 to match the input dtype the saved model expects
    return {
        'input_ids': tf.cast(token['input_ids'], tf.float64),
        'attention_mask': tf.cast(token['attention_mask'], tf.float64)
    }

def make_prediction(model, processed_data, classes=['Easy', 'Medium', 'Hard']):
    # Predict class probabilities, then map each pair to its most likely class
    probs = model.predict(processed_data)
    best = probs.argmax(axis=1)
    labels = [classes[i] for i in best]
    return labels, probs
```
4. Make predictions on a list of question-answer pairs:
```python
input_text = ["What is Gandhi commonly considered to be?,Father of the Nation in India",
              "What is the long-term warming of the planet's overall temperature called?,Global Warming"]
processed_data = prepare_data(input_text)
result, prob = make_prediction(questionclassification_model, processed_data=processed_data)
for i in range(len(result)):
    # Print the predicted class alongside its probability
    print(f"{result[i]} : {max(prob[i])}")
```
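As noted in the introduction, a single question-answer pair works as well; wrap it in a one-element list so `prepare_data` still receives a batch. The pair below is a hypothetical example:
```python
single_pair = ["Who wrote Hamlet?,William Shakespeare"]  # hypothetical question-answer pair
result, prob = make_prediction(questionclassification_model,
                               processed_data=prepare_data(single_pair))
print(f"{result[0]} : {max(prob[0])}")
```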
## Risks, Limitations and Biases
- The model rarely predicts the Easy category.
- 90% of the Easy questions in the dataset are yes/no questions.
- Very few public datasets are available for question difficulty classification.
- Only subject-matter experts can reliably build a dataset for this task; without that expertise, the model will produce wrong results.
## Training
#### Training Data
I used the [Question-Answer Dataset](https://www.kaggle.com/datasets/rtatman/questionanswer-dataset) by [Carnegie Mellon University](https://www.cmu.edu/) for this task.
#### Training Procedure
##### Fine-tuning hyper-parameters
The following hyper-parameters were used for fine-tuning (a sketch of how they fit together follows the list):
- learning_rate = 1e-5
- decay = 1e-6
- optimizer = Adam
- loss function = categorical cross-entropy
- max_length = 256
- num_train_epochs = 10
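Below is a minimal sketch of how these hyper-parameters could be wired together in TensorFlow/Keras. The classification head (a softmax layer over the pooled BERT output) and the input dtypes are assumptions for illustration; only the optimizer, loss, sequence length, and epoch count come from this card.
```python
import tensorflow as tf
from transformers import TFBertModel

MAX_LENGTH = 256
NUM_CLASSES = 3  # Easy, Medium, Hard

# Pre-trained BERT backbone
bert = TFBertModel.from_pretrained('bert-base-cased')

# Input names match the keys returned by prepare_data above
input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input(shape=(MAX_LENGTH,), dtype=tf.int32, name='attention_mask')

# Pooled [CLS] representation of the question-answer pair (assumed head)
pooled = bert(input_ids, attention_mask=attention_mask)[1]
output = tf.keras.layers.Dense(NUM_CLASSES, activation='softmax')(pooled)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, decay=1e-6),  # legacy Keras `decay` argument
    loss='categorical_crossentropy',  # expects one-hot labels over the 3 classes
    metrics=['accuracy'],
)

# model.fit(train_features, train_labels, epochs=10)
```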