|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
library_name: peft
|
|
--- |
|
|
|
|
|
# Intrinsics for Answerability Classification |
|
|
|
|
|
## Model Summary |
|
|
This is a RAG-specific family of intrinsics fine-tuned for the binary answerability classification task. The model takes as input a multi-turn conversation and a set of documents, and classifies whether the user's final query is answerable or unanswerable based on the information available in the documents.
|
|
|
|
|
We provide two variants of the intrinsic, implemented as LoRA and aLoRA adapters respectively, each trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and GPT-OSS-20b.
|
|
|
|
|
- **Developer:** IBM Research |
|
|
- **Model type:** LoRA and aLoRA adapters for
|
|
[ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct), |
|
|
[ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct), |
|
|
and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) |
|
|
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
|
|
## Intended use |
|
|
This is a family of intrinsics that enables answerability classification for the final user query in a multi-turn conversation, with respect to a set of provided documents. The model is trained to determine whether the last user query is answerable or unanswerable, based solely on the information present in the documents. This makes it suitable for applications involving RAG and document-grounded chatbots, where knowing whether sufficient information exists to answer a query is crucial. The classification output from the answerability model can be used in several downstream applications, including but not limited to:
|
|
- Filtering out unanswerable questions before sending them to generation in a RAG setting. By classifying a query as unanswerable upfront, the system can prevent hallucinated or misleading responses (see the sketch after this list).
- Re-querying the retriever to fetch more relevant documents. If a query is initially deemed unanswerable, the retriever can be re-invoked with alternate query formulations.
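As a minimal sketch of the first application, the likelihood score returned by the intrinsic can gate whether a request is forwarded to the generation step at all. The snippet below is illustrative only: the threshold and the `generate_answer` helper are hypothetical placeholders, and the score is assumed to have already been parsed from the intrinsic's JSON output.

```python
ANSWERABILITY_THRESHOLD = 0.5  # hypothetical cut-off; tune for your application


def generate_answer(query: str, documents: list[str]) -> str:
    """Placeholder for the downstream RAG generation call."""
    return f"<answer to {query!r} grounded in {len(documents)} documents>"


def answer_or_abstain(query: str, documents: list[str], answerability_score: float) -> str:
    """Gate generation on the answerability likelihood score produced by the intrinsic."""
    if answerability_score < ANSWERABILITY_THRESHOLD:
        # Abstain (or, alternatively, re-invoke the retriever with a reformulated query).
        return "I don't have enough information in the provided documents to answer that."
    return generate_answer(query, documents)
```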
|
|
|
|
|
**Model input**: The input to the answerability intrinsic is an OpenAI-compatible chat completion request containing a list of conversation turns that can alternate between the `user` and `assistant` roles and that ends with a `user` turn, as well as a list of documents.
|
|
|
|
|
**Model output**: The output of the answerability intrinsic is the result of the original chat completion request, with the message content formatted as a JSON object containing the answerability likelihood score.
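For a rough illustration of the output shape, the parsed message content might look like the following. The field name below is a hypothetical placeholder; the actual key and value range are defined by the intrinsic's IO configuration file.

```python
# Hypothetical example of the parsed completion content; the real field name
# is determined by the intrinsic's IO configuration file.
parsed_contents = {"answerability_likelihood": 0.08}
```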
|
|
|
|
|
Please see the code snippets in the Quickstart Example section below for |
|
|
examples that illustrate the intrinsic's input/output. |
|
|
|
|
|
## Quickstart Example |
|
|
|
|
|
To run the answerability intrinsics through granite-common, you can either (a) use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging Face transformers library. Instructions for both approaches are provided below. Note that running inference with vLLM or another scalable OpenAI-compatible backend should be significantly faster than using the Hugging Face transformers library directly.
|
|
|
|
|
### Using an OpenAI-Compatible Inference Backend |
|
|
|
|
|
To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM, |
|
|
follow the steps below. |
|
|
|
|
|
1. Install the granite-common library: |
|
|
|
|
|
pip install git+https://github.com/ibm-granite/granite-common.git |
|
|
pip install "granite_common[nltk]"
|
|
|
|
|
2. Install the Hugging Face CLI: |
|
|
|
|
|
pip install -U "huggingface_hub[cli]" |
|
|
|
|
|
3. Install vLLM: |
|
|
|
|
|
pip install vllm |
|
|
|
|
|
4. Download the intrinsics library: |
|
|
|
|
|
hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib |
|
|
|
|
|
5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh` using your favorite editor:
|
|
|
|
|
Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the |
|
|
base model on which the desired LoRA adapter has been trained. Optionally, |
|
|
edit the constant `PORT` to change the port on which vLLM will run. Save the |
|
|
modified file and exit the editor. |
|
|
|
|
|
6. Start vLLM through the startup script. The first time you run the script, |
|
|
you may have to change the permissions to allow execution: |
|
|
|
|
|
cd rag-intrinsics-lib |
|
|
chmod u+x ./run_vllm.sh |
|
|
./run_vllm.sh & |
|
|
|
|
|
7. Run the following code snippet: |
|
|
|
|
|
import json |
|
|
import openai |
|
|
import granite_common |
|
|
|
|
|
intrinsic_name = "answerability" |
|
|
|
|
|
# Change the following constant to select a different base model |
|
|
base_model_name = "granite-3.3-8b-instruct" |
|
|
|
|
|
# Change the following constants as needed to reflect the location of the vLLM server |
|
|
# The selected port should be identical to the one you specified in the vLLM startup script |
|
|
openai_base_url = "http://localhost:55555/v1" |
|
|
openai_api_key = "rag_intrinsics_1234" |
|
|
|
|
|
# Fetch IO configuration file from Hugging Face Hub |
|
|
io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml( |
|
|
intrinsic_name, base_model_name |
|
|
) |
|
|
|
|
|
# Instantiate input/output processors |
|
|
rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file) |
|
|
result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file) |
|
|
|
|
|
# Sample request |
|
|
request_json = { |
|
|
"messages": [ |
|
|
{ |
|
|
"role": "assistant", |
|
|
"content": "Welcome to pet questions!" |
|
|
}, |
|
|
{ |
|
|
"content": "What is the population of Australia?", |
|
|
"role": "user" |
|
|
} |
|
|
], |
|
|
"extra_body": { |
|
|
"documents": [ |
|
|
{ |
|
|
"doc_id": "1", |
|
|
"text": "My dog has fleas." |
|
|
}, |
|
|
{ |
|
|
"doc_id": "2", |
|
|
"text": "My cat does not have fleas." |
|
|
} |
|
|
] |
|
|
} |
|
|
} |
|
|
|
|
|
# Add other parameters |
|
|
request_json["model"] = intrinsic_name |
|
|
request_json["temperature"] = 0.0 |
|
|
|
|
|
# Apply input processor |
|
|
intrinsic_kwargs = {} |
|
|
rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs) |
|
|
|
|
|
# Run inference |
|
|
client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key) |
|
|
chat_completion = client.chat.completions.create(**rewritten_request.model_dump()) |
|
|
|
|
|
# Apply output processor |
|
|
processed_chat_completion = result_processor.transform( |
|
|
chat_completion, rewritten_request |
|
|
) |
|
|
|
|
|
# Verify that the content of the completion is valid JSON and pretty-print it.
|
|
parsed_contents = json.loads(processed_chat_completion.choices[0].message.content) |
|
|
print("JSON output:") |
|
|
print(json.dumps(parsed_contents, indent=2)) |
|
|
|
|
|
### Using the Hugging Face Transformers Library |
|
|
|
|
|
To run the intrinsic using the Hugging Face transformers library directly, |
|
|
follow the steps below. |
|
|
|
|
|
1. Install the granite-common library: |
|
|
|
|
|
pip install git+https://github.com/ibm-granite/granite-common.git |
|
|
pip install "granite_common[nltk]"
|
|
|
|
|
2. Install the Hugging Face CLI: |
|
|
|
|
|
pip install -U "huggingface_hub[cli]" |
|
|
|
|
|
3. Install PEFT: |
|
|
|
|
|
pip install peft |
|
|
|
|
|
4. Install xgrammar: |
|
|
|
|
|
pip install xgrammar |
|
|
|
|
|
5. Run the following code snippet: |
|
|
|
|
|
import json |
|
|
import granite_common.util |
|
|
import peft |
|
|
|
|
|
intrinsic_name = "answerability" |
|
|
|
|
|
# Change the following constant to select a different base model |
|
|
base_model_name = "granite-3.3-8b-instruct" |
|
|
|
|
|
use_cuda = True # Set to False to use default PyTorch device for this machine + model |
|
|
|
|
|
# Fetch IO configuration file from Hugging Face Hub |
|
|
io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml( |
|
|
intrinsic_name, base_model_name |
|
|
) |
|
|
|
|
|
# Fetch LoRA directory from Hugging Face Hub |
|
|
lora_dir = granite_common.intrinsics.util.obtain_lora( |
|
|
intrinsic_name, base_model_name |
|
|
) |
|
|
|
|
|
# Instantiate input/output processors |
|
|
rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file) |
|
|
result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file) |
|
|
|
|
|
# Sample request |
|
|
request_json = { |
|
|
"messages": [ |
|
|
{ |
|
|
"role": "assistant", |
|
|
"content": "Welcome to pet questions!" |
|
|
}, |
|
|
{ |
|
|
"content": "What is the population of Australia?", |
|
|
"role": "user" |
|
|
} |
|
|
], |
|
|
"extra_body": { |
|
|
"documents": [ |
|
|
{ |
|
|
"doc_id": "1", |
|
|
"text": "My dog has fleas." |
|
|
}, |
|
|
{ |
|
|
"doc_id": "2", |
|
|
"text": "My cat does not have fleas." |
|
|
} |
|
|
] |
|
|
} |
|
|
} |
|
|
|
|
|
# Add additional parameters |
|
|
request_json["model"] = intrinsic_name |
|
|
request_json["temperature"] = 0.0 |
|
|
|
|
|
# Apply input processor |
|
|
intrinsic_kwargs = {} |
|
|
rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs) |
|
|
|
|
|
# Load the base model and merge LoRA weights |
|
|
model, tokenizer = granite_common.util.load_transformers_lora(lora_dir) |
|
|
if use_cuda: |
|
|
model = model.cuda() |
|
|
|
|
|
# Convert the chat completion request into the Transformers library's native
# input format.
|
|
generate_input, other_input = ( |
|
|
granite_common.util.chat_completion_request_to_transformers_inputs( |
|
|
rewritten_request, |
|
|
tokenizer, |
|
|
model, |
|
|
) |
|
|
) |
|
|
|
|
|
# Use the Transformers library's APIs to generate one or more completions, then
# convert those completions into OpenAI-compatible chat completion responses.
|
|
responses = granite_common.util.generate_with_transformers( |
|
|
tokenizer, model, generate_input, other_input |
|
|
) |
|
|
|
|
|
# Apply output processor |
|
|
transformed_responses = result_processor.transform(responses, rewritten_request) |
|
|
|
|
|
# Verify that the content of the completion is valid JSON and pretty-print it.
|
|
parsed_contents = json.loads(transformed_responses.choices[0].message.content) |
|
|
print("JSON output:") |
|
|
print(json.dumps(parsed_contents, indent=2)) |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The training data uses the publicly available Government corpus from |
|
|
[MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on |
|
|
this corpus, we constructed a dataset consisting of a mix of human-created and |
|
|
synthetically generated multi-turn conversations. It includes two types of |
|
|
examples: (1) Answerable queries, where the final user question can be answered |
|
|
based on the provided documents. These examples teach the adapter to recognize |
|
|
when sufficient information is present to support an answer. (2) Unanswerable |
|
|
queries, where the documents lack the necessary information to answer the final |
|
|
user query. We used Mixtral as an automatic judge to validate the answerability |
|
|
labels and filter out noisy samples. |
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
The LoRA adapters were fine-tuned using PEFT under the following regime: rank = 32, learning rate = 5e-6, number of epochs = 25 with early stopping based on a validation set, and a 90/10 split between training and validation data.
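For orientation, a minimal PEFT setup mirroring these hyperparameters might look like the sketch below. This is not the released training recipe: the target modules, `lora_alpha`, optimizer settings, and data pipeline shown here are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA configuration with the reported rank of 32.
# target_modules and lora_alpha below are assumptions, not the released recipe.
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-8b-instruct")
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# Reported regime: learning rate 5e-6, up to 25 epochs with early stopping
# on a validation set, and a 90/10 train/validation split.
```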
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Answerability Classification |
|
|
|
|
|
We evaluated the model on binary answerability classification using the MT-RAG benchmark. In this setting, the model is given the full multi-turn conversation history along with the supporting documents. This benchmark evaluates the model's ability to assess answerability when the final user query can also depend on prior turns for context. The following table presents results comparing baselines and frontier models with task-specific answerability intrinsics on the answerability classification task on MT-RAG data. The LoRAs consistently outperform frontier models, converging near ~90% accuracy regardless of base model size. Even small models like Granite 3.3-2b, once fine-tuned, match or surpass much larger models, including GPT-4o. The difference between LoRA and aLoRA is minimal, indicating both are effective fine-tuning strategies.
|
|
|
|
|
| | Models | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 | |
|
|
|:--------------------------------------------:|:----------------------------------------------:|:--------------------------:|:---------------------------:|:-------------------------------------:|:-------------------------:| |
|
|
| Baselines | BigBird (pre-trained embeddings) w/ MLP | 73.4 | 65.2 | 69.8 | 69.6 | |
|
|
| | llama2-7b as classifier (Full SFT) | 88.2 | 85.9 | 87.1 | 87.1 | |
|
|
| Frontier Models out-of-the-box | Granite 3.3-2b-instruct | 48.7 | 70.4 | 62.4 | 58.7 | |
|
|
| | Granite 3.3-8b-instruct | 62.8 | 65.2 | 64.5 | 63.9 | |
|
|
| | GPT-OSS-20b | 77.3 | 58.3 | 70.7 | 68.5 | |
|
|
| | GPT-OSS-120b | 70.2 | 68.9 | 69.8 | 69.6 | |
|
|
| | GPT4o-mini | 82.7 | 78.1 | 80.8 | 80.6 | |
|
|
| | GPT4o | 85.7 | 77.5 | 82.5 | 81.9 | |
|
|
| Trained LoRAs/aLoRAs | Granite 3.3-2b LoRA | 91.2 | 89.6 | 90.4 | 90.5 | |
|
|
| | Granite 3.3-8b LoRA | 91.1 | 90.3 | 90.6 | 90.7 | |
|
|
| | GPT-OSS-20b LoRA | 91.6 | 89.8 | 90.8 | 90.8 | |
|
|
| | Granite 3.3-2b aLoRA | 89.8 | 88.6 | 89.1 | 89.2 | |
|
|
| | Granite 3.3-8b aLoRA | 90.1 | 89.6 | 89.5 | 89.9 | |
|
|
| | GPT-OSS-20b aLoRA | 90.4 | 88.6 | 89.6 | 89.6 | |
|
|
|
|
|
|
|
|
### Comparing the Answerability Intrinsics vs. Vanilla Granite Models for Answer Quality |
|
|
|
|
|
We compare the performance of vanilla Granite 3.3-2b and Granite 3.3-8b Instruct against the answerability intrinsics implemented as LoRA adapters, on a subset of the MT-RAG benchmark. In this setup, each query is paired with only 5 retrieved passages as context.
|
|
|
|
|
- Answerability Classification Performance: The answerability intrinsics outperform the vanilla models in overall F1 on both answerable and unanswerable queries. The intrinsics achieve higher recall on unanswerable queries, making them better at identifying questions that should not be answered. However, this comes at the cost of lower recall on answerable queries.
|
|
|
|
|
- Joint Answerability-Faithfulness Score, computed as:

  > = 1 (if model prediction = IDK/unanswerable ∩ ground truth = unanswerable)
  >
  > = RAGAS Faithfulness (if model prediction = non-IDK/answerable ∩ ground truth = answerable)
  >
  > = 0 (otherwise)
|
|
|
|
|
This score rewards the model for correctly abstaining on unanswerable queries |
|
|
(full credit) and for providing faithful answers on answerable queries |
|
|
(partial credit based on RAGAS Faithfulness). No credit is given for incorrect |
|
|
or unfaithful predictions. |
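A small sketch of this scoring rule, assuming the model's abstention decision and the ground-truth label are available as booleans and that a RAGAS Faithfulness value in [0, 1] has already been computed for answered queries:

```python
def joint_answerability_faithfulness(
    predicted_unanswerable: bool,
    truly_unanswerable: bool,
    ragas_faithfulness: float,
) -> float:
    """Score one example according to the rule above.

    `ragas_faithfulness` is assumed to be precomputed in [0, 1] and is only
    used when the model chose to answer.
    """
    if predicted_unanswerable and truly_unanswerable:
        return 1.0  # full credit for correctly abstaining
    if not predicted_unanswerable and not truly_unanswerable:
        return ragas_faithfulness  # partial credit for a faithful answer
    return 0.0  # no credit otherwise
```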
|
|
|
|
|
The answerability intrinsics for Granite 3.3-2b and Granite 3.3-8b achieve 8% and 13% lifts on this metric, respectively, reflecting both better abstention on unanswerable queries and greater faithfulness when the model chooses to answer.
|
|
|
|
|
|
|
|
| | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | Joint Answerability-Faithfulness Score |
|
|
|:-----------------------:|:---------------------:|:-------------------:|:-------------------:|:-----------------:|:---------------------------------------:| |
|
|
| Granite 3.3-2b Instruct | 13 | 77 | 7 | 99 | 48 | |
|
|
| Granite 3.3-2b LoRA | 48 | 78 | 37 | 89 | 56 | |
|
|
| Granite 3.3-8b Instruct | 17 | 77 | 10 | 99 | 49 | |
|
|
| Granite 3.3-8b LoRA | 65 | 81 | 60 | 86 | 62 | |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
[Vraj Shah](mailto:vraj@ibm.com) |
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- PEFT 0.14.0 |
|
|
|