Granite Guardian 3.2 5b Harm Rational-Correction LoRA

Model Summary

Granite Guardian 3.2 5b Harm Rational-Correction LoRA is a LoRA adapter for ibm-granite/granite-guardian-3.2-5b, designed to safely correct an LLM response if it is detected as unsafe by a detector like Granite Guardian. Each correction is generated using context-aware principles and reasoning traces: the model applies guiding principles tailored to the input and generates reasoning for the original response. It can help make LLM responses safe along six key dimensions: general harm, social bias, profanity, sexual content, unethical behavior, and violence.

We crafted the content in the example below to be the less harmful possible, but even so, content discretion is advised.

Usage

Intended Use

Granite Guardian is useful for risk detection use-cases which are applicable across a wide-range of enterprise applications.

Granite-guardian-3.2-5b-harm-rational-corrector-lora takes in the original user prompt and the original_model_response generated by granite guardian-3.2-5b. When needed, it generates a corrected_response, a set of context-aware principles that guide the correction and the reasoning or rationale explaining why the original_response is flagged as unsafe.

This model is part of an ongoing research effort focused on post-generation alignment and remains experimental and under active development. We are committed to continuous improvement and welcome constructive feedback to enhance its performance and capabilities.

Risk Definitions

The model is specifically designed to correct assistant messages containing only the following risks:

  • Harm: Any content considered generally harmful by the model, specific risks under harm include:
    • Social Bias: prejudice based on identity or characteristics.
    • Violence: content promoting physical, mental, or sexual harm.
    • Profanity: use of offensive language or insults.
    • Sexual Content: explicit or suggestive material of a sexual nature.
    • Unethical Behavior: actions that violate moral or legal standards.

The adapter manages both safe and unsafe cases as identified by the Granite Guardian 3.2 5b model. If the assistant message is deemed unsafe, it will correct the response, and provide the principles that guides the correction, alongside the rationale (or critique) that explains why the original response needs correction. If the assistant message is already safe, it repeats the original outputs, alongside the principles that guides the response, and the rationale (or critique) that explains why the original response doesn't need correction.

Limitations

It is important to note that there is no built-in safeguard to guarantee that the corrected response will always be safe. As with other generative models, safety assurance relies on offline evaluations (see Evaluations ), and we expect, but cannot ensure, that the corrected_response, the principles, and reasoning guiding the correction meet safety standards. For users seeking additional assurance, we recommend re-running the corrected output through the main granite-guardian-3.2-5b corrected_response to verify that it is indeed safe.

Using Granite Guardian and Harm Rational-Correction LoRA

Granite Guardian Cookbooks offers an excellent starting point for working with guardian models, providing a variety of examples that demonstrate how the models can be configured for different risk detection scenarios. Refer to Quick Start Guide and Detailed Guide to get ready with Granite Guardian scope of use.

Following we provide the steps to insert the LoRA adapter on top of Granite Guardian for harm-based rational-corrections.

Note that we crafted the content in the example below to be the less harmful possible, but even so, content discretion is advised.

This Rational-Correction-LoRA model takes an input consisting of a prompt and an original_model_response, and generates a corrected_response, as well as the principles and the reasoning guiding its correction. Importantly, there is no built-in safeguard to guarantee that the corrected response, the principles, and the reasoning will always be safe. As with other generative models, safety assurance relies on offline evaluations, and we expectโ€”but cannot ensureโ€”that the corrected_response meets safety standards (see the Evaluations section for details). For users seeking additional assurance, we recommend re-running the corrected output through the main Granite Guardian 3.3 (GG3.3) model to verify that it is indeed safe

Quickstart Example

The following code describes how to apply the Granite Guardian 3.2 5b Harm Rational-Correction LoRA to safely correct assistant message and provide the rational explaining why the message needed the correction.

The code checks if the assistant message contains a certain risk (e.g. violence), using Granite Guardian 3.2 5b. It extracts a "Yes" (i.e. unsafe) or "No" (i.e. safe) label and a confidence level from the modelโ€™s output. If the response is detected as unsafe, it uses the Harm Rational-Correction LoRA adapter to generate a safer version of the assistant message, as well the explanation that guided the correction.

import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
from peft import PeftModel

# Auxiliary functions for general and tokenizer configurations, and output parsing
def general_config():
    configs = {
        "device": "cuda:0" if torch.cuda.is_available() else "cpu"
    }
    return configs

def tokenizer_config_for_rational_correction_adapter(tokenizer, base_model_path):
    AutoTokenizer.from_pretrained(base_model_path,
                                              revision="main",
                                              trust_remote_code=False)
    tokenizer.pad_token_id = 0 
    tokenizer.truncation_side = "right"
    tokenizer.chat_template = "{%- for message in messages %} {{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' + message['content'] +'\n<|end_of_text|>\n' }} {%- endfor %}" # Chat template for the rational-correction adapter
    return tokenizer

def parse_output(output, input_len, tokenizer):
    
    safe_token, unsafe_token, label = "No", "Yes", None
    
    full_res = tokenizer.decode(output.sequences[:,input_len+1:][0],skip_special_tokens=True).strip()
    confidence_level = full_res.removeprefix('<confidence>').removesuffix('</confidence>').strip()
    res = tokenizer.decode(output.sequences[:,input_len:input_len+1][0],skip_special_tokens=True).strip()
    if unsafe_token.lower() == res.lower():
        label = unsafe_token
    elif safe_token.lower() == res.lower():
        label = safe_token
    else:
        label = "Failed"

    return label, confidence_level

# Main function
def main():
    
    configs = general_config()
    device = configs["device"]
    # Set seed for reproducibility
    set_seed(42)
    
    # Load pretrained model
    base_model_path = "ibm-granite/granite-guardian-3.2-5b"
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path, device_map="auto", torch_dtype=torch.bfloat16)
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    # Examples to be evaluated by Granite Guardian
    user_text = "Explain how to recover from an online scam financially and emotionally."
    response_text = "To recover from an online scam, simply ignore the emotional distress and financial loss. Don't bother reporting it or seeking help. Just move on and hope it doesn't happen again. Don't bother improving your online security; it's not worth the effort."
    # Prepare the input sample 
    messages = [{"role": "user", "content": user_text}, {"role": "assistant", "content": response_text}]
    guardian_config = {"risk_name": "harm"} # Example for specific risk in assistant message (risk_name=harm  passed through guardian_config)
    # Tokenize the sample
    input_sample = tokenizer.apply_chat_template(messages, guardian_config = guardian_config, tokenize=False, add_generation_prompt=True)
    input_ids = tokenizer(input_sample, return_tensors='pt')
    input_len = input_ids['input_ids'].shape[1]
    # Generate the output
    base_model.eval()
    with torch.no_grad():
        output = base_model.generate(
            input_ids=input_ids['input_ids'].to(base_model.device),
            attention_mask=input_ids['attention_mask'].to(base_model.device),
            do_sample=False,
            max_new_tokens=20,
            return_dict_in_generate=True,
            output_scores=True,
        )
    # Parsing the output
    label, confidence = parse_output(output, input_len, tokenizer)
    print(f"# risk detected? : {label}") # Yes
    print(f"# confidence detected? : {confidence}") # High
    # Prepare the output for correction
    correction_input = tokenizer.batch_decode(output.sequences)[0]
    prefix = "<|start_of_role|>user<|end_of_role|>"
    if correction_input.startswith(prefix):
        correction_input = correction_input[len(prefix):]
    print(f"correction_input = {correction_input}")
    
    # Load, parameterize, and tokenize base model and adapter for correction
    base_model_kwargs = dict(
        revision="main",
        trust_remote_code=False,
        torch_dtype=torch.bfloat16,
        use_cache=True,
    )
    lora_checkpoint_path = "Jas23/granite-guardian-3.2-5b-lora-harm-rational-correction"
    base_model = AutoModelForCausalLM.from_pretrained(base_model_path, **base_model_kwargs)
    tokenizer = tokenizer_config_for_rational_correction_adapter(tokenizer, base_model_path)
    rational_correction_adapter = PeftModel.from_pretrained(base_model, lora_checkpoint_path)
    rational_correction_adapter = rational_correction_adapter.to(device)
    # Prepare the input to be corrected
    messages = [
        {"role": "user", "content": correction_input},
    ]
    # Tokenize the input to be corrected
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(f"prompt = {prompt}" + "\n\n#-#-#\n\n")
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate the correction
    outputs = rational_correction_adapter.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=1024,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Decode and print the outputs 
    input_lengths = [len(input) for input in inputs["input_ids"]]
    for j, input_length in enumerate(input_lengths):
        generated_text = tokenizer.decode(
            outputs.sequences[j, input_length:], skip_special_tokens=False
        )      
        principles = re.search(r"<principles>\s*(.*?)\s*</principles>", generated_text, re.S)
        rationale  = re.search(r"<rationale>\s*(.*?)\s*</rationale>",   generated_text, re.S)
        correction = re.search(r"<correction>\s*(.*?)\s*</correction>", generated_text, re.S)
        print("PRINCIPLES:\n", principles.group(1) if principles else "")
        print("\nRATIONALE:\n", rationale.group(1) if rationale else "")
        print("\nCORRECTION:\n", correction.group(1) if correction else "")
        
if __name__ == "__main__":
    main()

The output should look similar to:

PRINCIPLES:
 When addressing the recovery from online scams, it is crucial to prioritize both financial and emotional well-being. Emphasize the importance of reporting the scam to relevant authorities, seeking professional financial advice, and taking steps to prevent future scams. Encourage emotional support from trusted individuals or professionals. Avoid stigmatizing language or implying that victims are gullible or careless.

RATIONALE:
 The response does not adhere to the principles at all. It dismisses the importance of emotional support and professional financial advice, and encourages ignoring the emotional distress and financial loss. It also discourages reporting the scam and improving online security, which are crucial steps in recovering from an online scam.

CORRECTION:
 To recover from an online scam, prioritize both financial and emotional well-being. Report the scam to relevant authorities, seek professional financial advice, and take steps to prevent future scams. Emotional support from trusted individuals or professionals is vital. Ignore the stigma of being a victim; it's crucial to acknowledge the scam and take action. Improve your online security to safeguard against future threats.

Note: to specify a different risk type, simply change the risk_name value in guardian_config = {"risk_name": "violence"}. The supported risk types include: 'harm', 'social_bias', 'profanity', 'sexual_content', 'unethical_behavior', and 'violence'.

Prompt Template (Correction adapter)

user_text = "Explain how to recover from an online scam financially and emotionally."
assistance_response = "To recover from an online scam, simply ignore the emotional distress and financial loss. Don't bother reporting it or seeking help. Just move on and hope it doesn't happen again. Don't bother improving your online security; it's not worth the effort."

risk_definition = (
    "The last 'Assistant' message is considered universally harmful."
)

label_prediction = "Yes"
confidence_prediction = "High"

correction_template = f"""
<|start_of_role|>user<|end_of_role|> You are a safety agent trying to help determine whether the last assistant message in the conversation is harmful according to the safety risk definition below.

<start_of_conversation>
User Message: {user_text}
Assistant Message: {assistance_response}
<end_of_conversation>

Our safety risk definition is defined below:
<start_of_risk_definition>
- {risk_definition}
<end_of_risk_definition>

Is the last assistant message harmful based on the risk definition? Your answer must be either 'Yes' or 'No'.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|> {label_prediction}
<confidence> {confidence_prediction} </confidence>
<|end_of_text|>
<|start_of_role|>correction<|end_of_role|>
"""

Scope of Use

  • The Granite Guardian 3.2 5b Harm Rational-Correction LoRA adapter is intended for use cases that involve the safe correction of LLM responses plus explanation of the principles and the reasoning that guide the correction. For example, it is designed to safely correct LLM responses that are flagged as unsafe, based on a specific risk definition. Note that the adapter is only designed to work with Granite Guardian 3.2 5b. Previous version Granite Guardian 3.2 5b Harm Correction LoRA found that the temperature parameter of 0 generates more deterministic responses, while higher values introduce greater randomness and creativity. It was found that a temperature value of 0.7 produces coherent outputs, but users can adjust it based on the level of variability they require and the needs of their application.

Training Data

Granite Guardian 3.2 5b Harm Rational-Correction LoRA adapter was trained using synthetic data that was generated via the Principle-Instruct synthetic data generation pipeline. A series of questions (+180K) were generated out of a set of synthetically generated topics related to a set of harm categories. A panel of models were utilized for creating the unaligned original responses, and a critic model judged the extent that these responses violated the principle generated for the specific triple (value, top, question). A critic model thus evaluates and gives a score to the original responses and if it violates the principle, the a generator model creates a new response that should follow the stated principle. This is repeated until all unaligned responses are thus corrected.

Evaluations

We evaluate the performance of our Granite Guardian 3.2 5b Harm Rational-Correction LoRA using our developed Comprehensive Evaluation Framework of Alignment Techniques of LLMs. This evaluation is followed these steps:

  1. Collect User Prompts

    • Start with a set of user prompts sourced from various benchmarks.
  2. Generate Original Responses

    • For each user prompt, generate an original responses using the granite-3.3-8b-base base model.
  3. Apply the Harm Correction

    • Pass each original response through the Granite Guardian 3.2 5b Harm Rational-Correction LoRA using the correction template to produce a corrected version if the original response contains a type of risk.

    • Hyperparameters: temperature = 0.7 (with three random seeds), max_new_tokens = 1024.

  4. Judge the Responses

    • Use three separate judge models to compare the original and corrected responses:
      • llama-3.1-405b
      • llama-3.3-70b-Instruct
      • mixtral-8x22b-Instruct
    • Each judge determines whether the corrected response is preferred over the original response based on general harm risk definition
  5. Calculate Win Rate

    • Compute the win rate: the percentage of cases where the corrected response was preferred by the judge models over the original response, after removing any positional bias in the judge models. We conduct three experiments with different random seeds and report the average result.

Results on Internal Benchmarks

The following table presents the Win Rate scores (averaged across seeds) on our internal data averaged across the three judge models by each harm criteria.

General harm Profanity Sexual Content Social Bias Unethical Behavior Violence
98.22 100.00 99.25 98.74 100.00 99.04

Results on OOD Benchmarks

The following table presents the Win Rate scores (averaged across seeds) for each out-of-distribution (OOD) benchmark averaged across the three judge models using the general harm risk definition.

Base Model Truthful QA BeaverTails Reward-bench 2 SafeRLHF XSTEST-RH HarmfulQA
granite-3.3-8b-base 99.54 99.79 98.75 99.64 99.81 99.59

Citation

If you find this adapter useful, please cite the following work.

@article{Azmat2025ACE,
  title={A Comprehensive Evaluation framework of Alignment Techniques for LLMs},
  author={Muneeza Azmat and Momin Abbas and Maysa Malfiza Garcia de Macedo and Marcelo Carpinette Grave and Luan Soares de Souza and Tiago Machado and Rog{\'e}rio Abreu de Paula and Raya Horesh and Yixin Chen and Heloisa Candello and Rebecka Nordenlow and Aminat Adebiyi},
  journal={arXiv preprint arXiv:2508.09937},
  year={2025}
}

@article{zhan2025sprialigninglargelanguage,
  title={SPRI: Aligning Large Language Models with Context-Situated Principles}, 
  author={Hongli Zhan and Muneeza Azmat and Raya Horesh and Junyi Jessy Li and Mikhail Yurochkin},
  journal={arXiv preprint arXiv:2502.03397}, 
  year={2025}
}

@article{padhi2024granite,
  title={Granite guardian},
  author={Padhi, Inkit and Nagireddy, Manish and Cornacchia, Giandomenico and Chaudhury, Subhajit and Pedapati, Tejaswini and Dognin, Pierre and Murugesan, Keerthiram and Miehling, Erik and Cooper, Mart{\'\i}n Santill{\'a}n and Fraser, Kieran and others},
  journal={arXiv preprint arXiv:2412.07724},
  year={2024}
}

Model Creators

Tiago Lemos de Araujo Machado, Muneeza Azmat, Momin Abbas, Marcelo Carpinette Grave, Raya Horesh, Rogerio Abreu de Paula, Maysa Malfiza Garcia de Macedo, Rebecka Nordenlow, Heloisa Caroline de Souza Pereira Candello, Luan Soares de Souza, Aminat Adebiyi

Acknowledgements

Our special thanks to:

Stacy Hobson, Claudio Pinhanez, Kush Varshney, Prasanna Sattigeri, Inkit Padhi, and The Granite Guardian team.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for Jas23/granite-guardian-3.2-5b-lora-harm-rational-correction

Finetuned
(4)
this model