---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# Intrinsics for Answerability Classification

## Model Summary
This is a RAG-specific family of intrinsics fine-tuned for the binary
answerability classification task. The model takes as input a multi-turn
conversation and a set of documents, and classifies whether the user's final
query is answerable or unanswerable based on the information available in the
documents.

We provide two variants of this intrinsic, implemented as LoRA and aLoRA
adapters trained over Granite-3.3-2b-instruct, Granite-3.3-8b-instruct, and
GPT-OSS-20b.

- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapter for
  [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
  [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct),
  and [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended use
This is a family of intrinsics that enables answerability classification for
the final user query in a multi-turn conversation, with respect to a set of
provided documents. The model is trained to determine whether the last user
query is answerable or unanswerable, based solely on the information present in
the documents. This makes it suitable for applications involving RAG and
document-grounded chatbots, where knowing whether sufficient information exists
to answer a query is crucial. The classification output from the answerability
model can be used in several downstream applications, including but not limited
to:
- Filtering out unanswerable questions before sending them to generation in a
RAG setting. By classifying a query as unanswerable upfront, the system can
prevent hallucinated or misleading responses; a minimal sketch of this pattern
appears after this list.
- Re-querying the retriever. If a query is initially deemed unanswerable, the
retriever can be re-invoked with alternate formulations to fetch more relevant
documents.
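
Below is a minimal sketch of the first pattern, assuming a hypothetical helper
`classify_answerability` that wraps the intrinsic call shown in the Quickstart
section and returns a likelihood in [0, 1]. The helper names and the threshold
are illustrative, not part of the intrinsic's API.

    # Hypothetical gating logic: `classify_answerability` and `generate_answer`
    # stand in for the intrinsic call and the normal RAG generation path.
    IDK_RESPONSE = "I don't have enough information in the provided documents to answer that."

    def answer_or_abstain(messages, documents, threshold=0.5):
        score = classify_answerability(messages, documents)
        if score < threshold:
            # Abstain up front instead of risking a hallucinated answer.
            return IDK_RESPONSE
        return generate_answer(messages, documents)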

**Model input**: The input to the answerability intrinsic is an
OpenAI-compatible chat completion request, containing a list of conversation
turns that alternate between the `user` and `assistant` roles and end with a
`user` turn, as well as a list of documents.

**Model output**: The output of the answerability intrinsic is the result of the
original chat completion request formatted as a JSON object containing the
answerability likelihood score.

Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.
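
For illustration, the content of the processed completion might look like the
following; the exact field name and schema are defined by the intrinsic's IO
configuration file, so treat this sample as an assumption rather than a
guaranteed format:

    {
      "answerability_likelihood": 0.02
    }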

## Quickstart Example

To run the answerability intrinsics through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
Face Transformers library. We provide instructions for each approach below.
Note that running inference using vLLM or another scalable OpenAI-compatible
inference backend should be significantly faster than using the Hugging Face
Transformers library directly.

### Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.

1.  Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install "granite_common[nltk]"

2.  Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3.  Install vLLM:

        pip install vllm

4.  Download the intrinsics library:

        hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib

5.  Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
    using your favorite editor:

    Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the
    base model on which the desired LoRA adapter has been trained. Optionally,
    edit the constant `PORT` to change the port on which vLLM will run. Save the
    modified file and exit the editor.
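
    For example, to serve the Granite 3.3 8b base model, the edited constants
    might look as follows (the values are one possible choice; `PORT` must
    match the port used in the client code later):

        BASE_MODEL_ORG="ibm-granite"
        BASE_MODEL_NAME="granite-3.3-8b-instruct"
        PORT=55555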

6.  Start vLLM through the startup script. The first time you run the script,
    you may have to change the permissions to allow execution:

        cd rag-intrinsics-lib
        chmod u+x ./run_vllm.sh
        ./run_vllm.sh &
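
    Once the server has finished loading, you can optionally verify that it is
    reachable before moving on (adjust the port if you changed `PORT` in the
    startup script):

        curl http://localhost:55555/v1/models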

7.  Run the following code snippet:

        import json
        import openai
        import granite_common

        intrinsic_name = "answerability"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        # Change the following constants as needed to reflect the location of the vLLM server
        # The selected port should be identical to the one you specified in the vLLM startup script
        openai_base_url = "http://localhost:55555/v1"
        openai_api_key = "rag_intrinsics_1234"

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                    "role": "assistant",
                    "content": "Welcome to pet questions!"
                },
                {
                    "role": "user",
                    "content": "What is the population of Australia?"
                }
            ],
            "extra_body": {
                "documents": [
                    {
                        "doc_id": "1",
                        "text": "My dog has fleas."
                    },
                    {
                        "doc_id": "2",
                        "text": "My cat does not have fleas."
                    }
                ]
            }
        }

        # Add other parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Run inference
        client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
        chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

        # Apply output processor
        processed_chat_completion = result_processor.transform(
            chat_completion, rewritten_request
        )

        # Verify that the content of the completion is valid JSON and pretty-print it.
        parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))

### Using the Hugging Face Transformers Library

To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.

1.  Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install "granite_common[nltk]"

2.  Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3.  Install PEFT:

        pip install peft

4.  Install xgrammar:

        pip install xgrammar

5.  Run the following code snippet:

        import json
        import granite_common.util
        import peft

        intrinsic_name = "answerability"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        use_cuda = True  # Set to False to use default PyTorch device for this machine + model

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Fetch LoRA directory from Hugging Face Hub
        lora_dir = granite_common.intrinsics.util.obtain_lora(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                    "role": "assistant",
                    "content": "Welcome to pet questions!"
                },
                {
                    "role": "user",
                    "content": "What is the population of Australia?"
                }
            ],
            "extra_body": {
                "documents": [
                    {
                        "doc_id": "1",
                        "text": "My dog has fleas."
                    },
                    {
                        "doc_id": "2",
                        "text": "My cat does not have fleas."
                    }
                ]
            }
        }

        # Add additional parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Load the base model and merge LoRA weights
        model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
        if use_cuda:
            model = model.cuda()

        # Convert the chat completion request into the Transformers library's
        # native input format.
        generate_input, other_input = (
            granite_common.util.chat_completion_request_to_transformers_inputs(
                rewritten_request,
                tokenizer,
                model,
            )
        )

        # Use the Transformers library's APIs to generate one or more completions,
        # then convert those completions into OpenAI-compatible chat completion
        # responses.
        responses = granite_common.util.generate_with_transformers(
            tokenizer, model, generate_input, other_input
        )

        # Apply output processor
        transformed_responses = result_processor.transform(responses, rewritten_request)

        # Verify that the content of the completion is valid JSON and pretty-print it.
        parsed_contents = json.loads(transformed_responses.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))

## Training Details

### Training Data

The training data uses the publicly available Government corpus from
[MT-RAG](https://arxiv.org/pdf/2501.03468) as the source of documents. Based on
this corpus, we constructed a dataset consisting of a mix of human-created and
synthetically generated multi-turn conversations. It includes two types of
examples: (1) Answerable queries, where the final user question can be answered
based on the provided documents. These examples teach the adapter to recognize
when sufficient information is present to support an answer. (2) Unanswerable
queries, where the documents lack the necessary information to answer the final
user query. We used Mixtral as an automatic judge to validate the answerability
labels and filter out noisy samples.

#### Training Hyperparameters

The LoRA adapters were fine-tuned using PEFT under the following regime: rank =
32, learning rate = 5e-6, number of epochs = 25 with early stopping based on
the validation set, and a 90/10 split between training and validation data.
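
For reference, a PEFT configuration consistent with this regime might look like
the sketch below; only the rank is taken from the text above, while alpha,
dropout, and target modules are illustrative assumptions.

    from peft import LoraConfig

    lora_config = LoraConfig(
        r=32,                                 # rank reported above
        lora_alpha=32,                        # assumption
        lora_dropout=0.05,                    # assumption
        target_modules=["q_proj", "v_proj"],  # assumption
        task_type="CAUSAL_LM",
    )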

## Evaluation

### Answerability Classification

We evaluated the model on binary answerability classification using the MT-RAG
benchmark. In this setting, the model is given the full multi-turn conversation
history along with the supporting documents. The benchmark evaluates the
model's ability to assess answerability when the final user query can also
depend on prior turns for context. The following table compares baselines and
frontier models with the task-specific answerability intrinsics on MT-RAG data.
The LoRAs consistently outperform frontier models, converging near ~90%
accuracy regardless of base model size. Even small models like Granite 3.3-2b,
once fine-tuned, match or surpass much larger models, including GPT-4o. The
difference between LoRA and aLoRA is minimal, indicating that both are
effective fine-tuning strategies.

|                                | Models                                  | Unanswerable F1 | Answerable F1 | Classification Accuracy | Weighted F1 |
|:-------------------------------|:----------------------------------------|:---------------:|:-------------:|:-----------------------:|:-----------:|
| Baselines                      | BigBird (pre-trained embeddings) w/ MLP | 73.4            | 65.2          | 69.8                    | 69.6        |
|                                | llama2-7b as classifier (Full SFT)      | 88.2            | 85.9          | 87.1                    | 87.1        |
| Frontier models out-of-the-box | Granite 3.3-2b-instruct                 | 48.7            | 70.4          | 62.4                    | 58.7        |
|                                | Granite 3.3-8b-instruct                 | 62.8            | 65.2          | 64.5                    | 63.9        |
|                                | GPT-OSS-20b                             | 77.3            | 58.3          | 70.7                    | 68.5        |
|                                | GPT-OSS-120b                            | 70.2            | 68.9          | 69.8                    | 69.6        |
|                                | GPT-4o-mini                             | 82.7            | 78.1          | 80.8                    | 80.6        |
|                                | GPT-4o                                  | 85.7            | 77.5          | 82.5                    | 81.9        |
| Trained LoRAs/aLoRAs           | Granite 3.3-2b LoRA                     | 91.2            | 89.6          | 90.4                    | 90.5        |
|                                | Granite 3.3-8b LoRA                     | 91.1            | 90.3          | 90.6                    | 90.7        |
|                                | GPT-OSS-20b LoRA                        | 91.6            | 89.8          | 90.8                    | 90.8        |
|                                | Granite 3.3-2b aLoRA                    | 89.8            | 88.6          | 89.1                    | 89.2        |
|                                | Granite 3.3-8b aLoRA                    | 90.1            | 89.6          | 89.5                    | 89.9        |
|                                | GPT-OSS-20b aLoRA                       | 90.4            | 88.6          | 89.6                    | 89.6        |


### Comparing the Answerability Intrinsics vs. Vanilla Granite Models for Answer Quality

We compare the performance of the vanilla Granite 3.3-2b and 3.3-8b Instruct
models against the answerability intrinsics implemented as LoRA adapters on a
subset of the MT-RAG benchmark. In this setup, each query is paired with only 5
retrieved passages as context.

- Answerability classification performance: The answerability intrinsics
  outperform the vanilla models in overall F1 on both answerable and
  unanswerable queries. The intrinsics achieve higher recall on unanswerable
  queries, making them better at identifying questions that should not be
  answered. However, this comes at the cost of lower recall on answerable
  queries.

- Joint Answerability-Faithfulness Score, computed per query as:
  - 1, if model prediction = IDK/unanswerable and ground truth = unanswerable;
  - RAGAS Faithfulness, if model prediction = non-IDK/answerable and ground
    truth = answerable;
  - 0, otherwise.

  This score rewards the model for correctly abstaining on unanswerable queries
  (full credit) and for providing faithful answers on answerable queries
  (partial credit based on RAGAS Faithfulness). No credit is given for incorrect
  or unfaithful predictions.

The answerability intrinsics for granite-2b and granite-8b achieve 8% and 13%
lifts on this metric, respectively; a small sketch of the scoring rule follows.
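
The scoring rule above can be written as a direct transcription of the
definition; here it is as a small Python helper, with `ragas_faithfulness`
assumed to lie in [0, 1]:

    def joint_score(predicted_unanswerable: bool, truly_unanswerable: bool,
                    ragas_faithfulness: float) -> float:
        # Full credit for correctly abstaining on an unanswerable query.
        if predicted_unanswerable and truly_unanswerable:
            return 1.0
        # Partial credit, equal to RAGAS Faithfulness, for answering an
        # answerable query.
        if not predicted_unanswerable and not truly_unanswerable:
            return ragas_faithfulness
        # No credit otherwise.
        return 0.0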


|                         | Unanswerable F1 | Answerable F1 | Unanswerable Recall | Answerable Recall | Joint Answerability-Faithfulness Score |
|:------------------------|:---------------:|:-------------:|:-------------------:|:-----------------:|:--------------------------------------:|
| Granite 3.3-2b Instruct | 13              | 77            | 7                   | 99                | 48                                     |
| Granite 3.3-2b LoRA     | 48              | 78            | 37                  | 89                | 56                                     |
| Granite 3.3-8b Instruct | 17              | 77            | 10                  | 99                | 49                                     |
| Granite 3.3-8b LoRA     | 65              | 81            | 60                  | 86                | 62                                     |

## Model Card Authors

[Vraj Shah](mailto:vraj@ibm.com)

### Framework versions

- PEFT 0.14.0