---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---
# Intrinsics for Answerability Classification
## Model Summary
This is a RAG-specific family of intrinsics fine-tuned for the binary
answerability classification task. The model takes as input a multi-turn
conversation and a set of documents, and classifies whether the user's final
query is answerable or unanswerable based on the information available in the
documents.
We provide two variants of the intrinsic, implemented as LoRA and aLoRA
adapters trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and
GPT-OSS-20b.
- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapters for
[ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
[ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Intended use
This is a family of intrinsics that enables answerability classification for
the final user query in a multi-turn conversation, with respect to a set of
provided documents. The model is trained to determine whether the last user
query is answerable or unanswerable, based solely on the information present in
the documents. This makes it suitable for applications involving RAG and
document-grounded chatbots, where knowing whether sufficient information exists
to answer a query is crucial. The classification output from the answerability
model can be used in several downstream applications, including but not limited
to:
- Filter out unanswerable questions before sending them to generation in a RAG
setting (a minimal gating sketch follows this list). By classifying a query as
unanswerable upfront, the system can prevent hallucinated or misleading
responses.
- Re-query the retriever to get more relevant documents. If a query is
initially deemed unanswerable, the retriever can be re-invoked with alternate
query formulations to fetch more relevant documents.
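The snippet below is a minimal sketch of the first use case: gating generation on the intrinsic's output. The `classify` and `generate` callables are placeholders supplied by the caller (for example, wrapping the intrinsic invocation shown in the Quickstart below and a RAG answer generator); the 0.5 threshold is an illustrative choice, not a recommended setting.

```python
from typing import Callable


def answer_or_abstain(
    request_json: dict,
    classify: Callable[[dict], float],
    generate: Callable[[dict], str],
    threshold: float = 0.5,
) -> str:
    """Gate RAG answer generation on the answerability intrinsic's verdict.

    `classify` returns the answerability likelihood for the request and
    `generate` produces an answer with the RAG pipeline; both are supplied
    by the caller.
    """
    if classify(request_json) < threshold:
        # Abstain (or re-invoke the retriever with a reformulated query)
        # instead of risking a hallucinated answer.
        return "I cannot answer this question from the provided documents."
    return generate(request_json)
```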
**Model input**: The input to the answerability intrinsic is an
OpenAI-compatible chat completion request containing a list of conversation
turns, which can alternate between the `user` and `assistant` roles and must
end with a `user` turn, as well as a list of documents.
**Model output**: The output of the answerability intrinsic is the result of
the original chat completion request, formatted as a JSON object containing the
answerability likelihood score.
Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.
## Quickstart Example
To run the answerability intrinsics through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the
Hugging Face transformers library. Below we provide instructions for each of
the two approaches. Note that running inference with vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.
### Using an OpenAI-Compatible Inference Backend
To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.
1. Install the granite-common library:

    ```
    pip install git+https://github.com/ibm-granite/granite-common.git
    pip install granite_common[nltk]
    ```

2. Install the Hugging Face CLI:

    ```
    pip install -U "huggingface_hub[cli]"
    ```

3. Install vLLM:

    ```
    pip install vllm
    ```

4. Download the intrinsics library:

    ```
    hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
    ```

5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
   using your favorite editor. Edit the constants `BASE_MODEL_NAME` and
   `BASE_MODEL_ORG` depending on the base model on which the desired LoRA
   adapter has been trained. Optionally, edit the constant `PORT` to change the
   port on which vLLM will run. Save the modified file and exit the editor.
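   As an illustration, to serve the adapters trained over
   Granite-3.3-8b-instruct on port 55555 (the port assumed by the code snippet
   in step 7), the edited constants might look as follows; these values are
   assumptions that depend on your chosen base model and environment:

    ```
    BASE_MODEL_ORG="ibm-granite"
    BASE_MODEL_NAME="granite-3.3-8b-instruct"
    PORT=55555
    ```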
6. Start vLLM through the startup script. The first time you run the script,
   you may have to change the permissions to allow execution:

    ```
    cd rag-intrinsics-lib
    chmod u+x ./run_vllm.sh
    ./run_vllm.sh &
    ```
7. Run the following code snippet:

    ```python
    import json

    import openai

    import granite_common

    intrinsic_name = "answerability"

    # Change the following constant to select a different base model
    base_model_name = "granite-3.3-8b-instruct"

    # Change the following constants as needed to reflect the location of the vLLM server.
    # The selected port should be identical to the one you specified in the vLLM startup script.
    openai_base_url = "http://localhost:55555/v1"
    openai_api_key = "rag_intrinsics_1234"

    # Fetch IO configuration file from Hugging Face Hub
    io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
        intrinsic_name, base_model_name
    )

    # Instantiate input/output processors
    rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
    result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

    # Sample request
    request_json = {
        "messages": [
            {
                "role": "assistant",
                "content": "Welcome to pet questions!"
            },
            {
                "content": "What is the population of Australia?",
                "role": "user"
            }
        ],
        "extra_body": {
            "documents": [
                {
                    "doc_id": "1",
                    "text": "My dog has fleas."
                },
                {
                    "doc_id": "2",
                    "text": "My cat does not have fleas."
                }
            ]
        }
    }

    # Add other parameters
    request_json["model"] = intrinsic_name
    request_json["temperature"] = 0.0

    # Apply input processor
    intrinsic_kwargs = {}
    rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

    # Run inference
    client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
    chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

    # Apply output processor
    processed_chat_completion = result_processor.transform(
        chat_completion, rewritten_request
    )

    # Verify that the contents of the completion are valid JSON and pretty-print the JSON
    parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
    print("JSON output:")
    print(json.dumps(parsed_contents, indent=2))
    ```
### Using the Hugging Face Transformers Library
To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.
1. Install the granite-common library:

    ```
    pip install git+https://github.com/ibm-granite/granite-common.git
    pip install granite_common[nltk]
    ```

2. Install the Hugging Face CLI:

    ```
    pip install -U "huggingface_hub[cli]"
    ```

3. Install PEFT:

    ```
    pip install peft
    ```

4. Install xgrammar:

    ```
    pip install xgrammar
    ```
5. Run the following code snippet:

    ```python
    import json

    import granite_common.util
    import peft

    intrinsic_name = "answerability"

    # Change the following constant to select a different base model
    base_model_name = "granite-3.3-8b-instruct"

    use_cuda = True  # Set to False to use the default PyTorch device for this machine + model

    # Fetch IO configuration file from Hugging Face Hub
    io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
        intrinsic_name, base_model_name
    )

    # Fetch LoRA directory from Hugging Face Hub
    lora_dir = granite_common.intrinsics.util.obtain_lora(
        intrinsic_name, base_model_name
    )

    # Instantiate input/output processors
    rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
    result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

    # Sample request
    request_json = {
        "messages": [
            {
                "role": "assistant",
                "content": "Welcome to pet questions!"
            },
            {
                "content": "What is the population of Australia?",
                "role": "user"
            }
        ],
        "extra_body": {
            "documents": [
                {
                    "doc_id": "1",
                    "text": "My dog has fleas."
                },
                {
                    "doc_id": "2",
                    "text": "My cat does not have fleas."
                }
            ]
        }
    }

    # Add additional parameters
    request_json["model"] = intrinsic_name
    request_json["temperature"] = 0.0

    # Apply input processor
    intrinsic_kwargs = {}
    rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

    # Load the base model and merge LoRA weights
    model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
    if use_cuda:
        model = model.cuda()

    # Convert the chat completion request into the Transformers library's
    # proprietary format
    generate_input, other_input = (
        granite_common.util.chat_completion_request_to_transformers_inputs(
            rewritten_request,
            tokenizer,
            model,
        )
    )

    # Use the Transformers library's APIs to generate one or more completions,
    # then convert those completions into OpenAI-compatible chat completion responses
    responses = granite_common.util.generate_with_transformers(
        tokenizer, model, generate_input, other_input
    )

    # Apply output processor
    transformed_responses = result_processor.transform(responses, rewritten_request)

    # Verify that the contents of the completion are valid JSON and pretty-print the JSON
    parsed_contents = json.loads(transformed_responses.choices[0].message.content)
    print("JSON output:")
    print(json.dumps(parsed_contents, indent=2))
    ```
## Training Details
### Training Data
The training data uses the publicly available Government corpus from
[MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on
this corpus, we constructed a dataset consisting of a mix of human-created and
synthetically generated multi-turn conversations. It includes two types of
examples: (1) Answerable queries, where the final user question can be answered
based on the provided documents. These examples teach the adapter to recognize
when sufficient information is present to support an answer. (2) Unanswerable
queries, where the documents lack the necessary information to answer the final
user query. We used Mixtral as an automatic judge to validate the answerability
labels and filter out noisy samples.
#### Training Hyperparameters
The LoRA adapters were fine-tuned using PEFT under the following regime: rank =
32, learning rate = 5e-6, number of epochs = 25, with early stopping based on a
validation set and a 90/10 split between training and validation data.
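For illustration, a comparable PEFT setup might look like the sketch below. The rank, learning rate, and epoch count come from the regime described above; the target modules, LoRA alpha, dropout, and base model choice are placeholder assumptions rather than the exact settings used to train the released adapters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Rank 32 as described above; alpha, dropout, and target modules are
# illustrative assumptions, not the released adapters' exact settings.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.3-8b-instruct"
)
model = get_peft_model(base_model, lora_config)

# Training would then run for up to 25 epochs at learning rate 5e-6 with
# early stopping on a 10% validation split (e.g., via transformers.Trainer).
```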
## Evaluation
### Answerability Classification
We evaluated the model on binary answerability classification using the MT-RAG
benchmark. In this setting, the model is given the full multi-turn conversation
history along with the supporting documents. This benchmark evaluates the
model's ability to assess answerability when the final user query can also
depend on prior turns for context. The following table presents results
comparing baselines and frontier models with task-specific answerability
intrinsics on the answerability classification task on MT-RAG data. The LoRAs
consistently outperform frontier models, converging near \~90% accuracy
regardless of base model size. Even small models like Granite 3.3-2B, once
fine-tuned, match or surpass much larger models, including GPT-4o. The
difference between LoRA and aLoRA is minimal, indicating both are effective
fine-tuning strategies.
| | Models | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 |
|:--------------------------------------------:|:----------------------------------------------:|:--------------------------:|:---------------------------:|:-------------------------------------:|:-------------------------:|
| Baselines | BigBird (pre-trained embeddings) w/ MLP | 73.4 | 65.2 | 69.8 | 69.6 |
| | llama2-7b as classifier (Full SFT) | 88.2 | 85.9 | 87.1 | 87.1 |
| Frontier Models out-of-the-box | Granite 3.3-2b-instruct | 48.7 | 70.4 | 62.4 | 58.7 |
| | Granite 3.3-8b-instruct | 62.8 | 65.2 | 64.5 | 63.9 |
| | GPT-OSS-20b | 77.3 | 58.3 | 70.7 | 68.5 |
| | GPT-OSS-120b | 70.2 | 68.9 | 69.8 | 69.6 |
| | GPT4o-mini | 82.7 | 78.1 | 80.8 | 80.6 |
| | GPT4o | 85.7 | 77.5 | 82.5 | 81.9 |
| Trained LoRAs/aLoRAs | Granite 3.3-2b LoRA | 91.2 | 89.6 | 90.4 | 90.5 |
| | Granite 3.3-8b LoRA | 91.1 | 90.3 | 90.6 | 90.7 |
| | GPT-OSS-20b LoRA | 91.6 | 89.8 | 90.8 | 90.8 |
| | Granite 3.3-2b aLoRA | 89.8 | 88.6 | 89.1 | 89.2 |
| | Granite 3.3-8b aLoRA | 90.1 | 89.6 | 89.5 | 89.9 |
| | GPT-OSS-20b aLoRA | 90.4 | 88.6 | 89.6 | 89.6 |
### Comparing the Answerability Intrinsics vs. Vanilla Granite Models for Answer Quality
We compare the performance of the vanilla Granite 3.3-2b and Granite 3.3-8b
Instruct models against the answerability intrinsics implemented as LoRA
adapters on a subset of the MT-RAG benchmark. In this setup, each query is
paired with only 5 retrieved passages as context.
- Answerability Classification Performance: The answerability intrinsics
outperform the vanilla models in overall F1 on both answerable and unanswerable
queries. The answerability intrinsics achieve higher recall on unanswerable
queries, making them better at identifying questions that should not be
answered. However, this comes at the cost of lower recall on answerable
queries.
- Joint Answerability-Faithfulness Score, computed as:
  - 1, if the model prediction is IDK/unanswerable and the ground truth is
    unanswerable;
  - the RAGAS Faithfulness score, if the model prediction is
    non-IDK/answerable and the ground truth is answerable;
  - 0, otherwise.

  This score rewards the model for correctly abstaining on unanswerable queries
  (full credit) and for providing faithful answers on answerable queries
  (partial credit based on RAGAS Faithfulness). No credit is given for incorrect
  or unfaithful predictions.
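A minimal sketch of this scoring rule, assuming a boolean flag for whether the model abstained (predicted IDK/unanswerable) and a precomputed RAGAS Faithfulness value for generated answers; it is illustrative, not the exact evaluation harness:

```python
def joint_answerability_faithfulness_score(
    predicted_unanswerable: bool,
    ground_truth_unanswerable: bool,
    ragas_faithfulness: float,
) -> float:
    """Illustrative implementation of the joint score described above."""
    if predicted_unanswerable and ground_truth_unanswerable:
        # Full credit for correctly abstaining on an unanswerable query
        return 1.0
    if not predicted_unanswerable and not ground_truth_unanswerable:
        # Partial credit: RAGAS Faithfulness of the generated answer
        return ragas_faithfulness
    # No credit for answering an unanswerable query or abstaining on an answerable one
    return 0.0
```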
The answerability intrinsics achieve lifts of 8% (granite-2b) and 13%
(granite-8b) on this metric, reflecting both correct abstention on unanswerable
queries and faithfulness when the model chooses to answer.
| | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
|:-----------------------:|:---------------------:|:-------------------:|:-------------------:|:-----------------:|:---------------------------------------:|
| Granite 3.3-2b Instruct | 13 | 77 | 7 | 99 | 48 |
| Granite 3.3-2b LoRA | 48 | 78 | 37 | 89 | 56 |
| Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 |
| Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 |
## Model Card Authors
[Vraj Shah](mailto:vraj@ibm.com)
### Framework versions
- PEFT 0.14.0