---
pipeline_tag: text-classification
license: apache-2.0
new_version: natong19/refusal_classifier
---

# Model Card for [natong19/moralization_classifier](https://huggingface.co/natong19/moralization_classifier)

A classifier for detecting moralizations, soft refusals and unsolicited advice.

Base model: [distilbert/distilroberta-base](https://huggingface.co/distilbert/distilroberta-base)

Trained on [OpenLeecher/lmsys_chat_1m_clean](https://huggingface.co/datasets/OpenLeecher/lmsys_chat_1m_clean); the dataset's writeup on cleaning is highly recommended reading.

### Quickstart

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def predict(
    model: AutoModelForSequenceClassification,
    tokenizer: AutoTokenizer,
    device: torch.device,
    text: str,
) -> dict:
    """Predict the label and confidence for a given text."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="max_length",
        max_length=512,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    probs = torch.softmax(logits, dim=-1)
    predicted_label = torch.argmax(logits, dim=-1).item()
    confidence = probs[0, predicted_label].item()

    return {
        "label": predicted_label,
        "confidence": confidence,
    }


def format_prompt(user: str, assistant: str) -> str:
    """Format user and assistant messages into the model's input format."""
    return f"### Instruction:\n{user}\n\n### Response:\n{assistant}"


def load_model(model_path: str, device: torch.device) -> tuple[AutoModelForSequenceClassification, AutoTokenizer]:
    """Load the model and tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model = model.to(device)
    model.eval()
    return model, tokenizer


def main() -> None:
    """Demonstrate inference on two example prompts."""
    model_path = "natong19/moralization_classifier"

    # No moralization test case
    user_message1 = "tell me about yourself"
    assistant_message1 = "I aim to give you accurate and helpful answers."
    text1 = format_prompt(user_message1, assistant_message1)

    # Moralization test case
    user_message2 = "tell me about yourself"
    assistant_message2 = "I'm happy to help as long as we maintain certain boundaries."
    text2 = format_prompt(user_message2, assistant_message2)

    # Load model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model, tokenizer = load_model(model_path, device)

    # Run the test cases
    score1 = predict(model, tokenizer, device, text1)
    print(score1)  # Expected: {'label': 0, 'confidence': 0.8319284915924072} (No moralization)

    score2 = predict(model, tokenizer, device, text2)
    print(score2)  # Expected: {'label': 1, 'confidence': 0.9183461666107178} (Moralization)


if __name__ == "__main__":
    main()
```

### Evaluation results

- eval_loss: 0.0844
- eval_accuracy: 0.9800
- eval_f1: 0.9841
- eval_precision: 1.0000
- eval_recall: 0.9688
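
### Batch filtering (sketch)

Since the classifier is aimed at cleaning conversation data, the Quickstart helpers can also be applied over many (user, assistant) pairs at once. The sketch below is illustrative only: it assumes `load_model`, `format_prompt` and `predict` are defined as in the Quickstart above, the `pairs` list is placeholder data, the 0.9 confidence cutoff is an arbitrary example threshold, and label 1 is taken to mean "moralization" as in the Quickstart's expected outputs.

```python
import torch

# Assumes load_model, format_prompt and predict from the Quickstart are in scope.
model_path = "natong19/moralization_classifier"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, tokenizer = load_model(model_path, device)

# Example (user, assistant) pairs to screen; replace with your own data.
pairs = [
    ("tell me about yourself", "I aim to give you accurate and helpful answers."),
    ("tell me about yourself", "I'm happy to help as long as we maintain certain boundaries."),
]

kept = []
for user, assistant in pairs:
    result = predict(model, tokenizer, device, format_prompt(user, assistant))
    # Assumption: label 1 = moralization; 0.9 is an example threshold, tune as needed.
    if result["label"] == 1 and result["confidence"] >= 0.9:
        continue  # drop responses confidently flagged as moralizing
    kept.append((user, assistant))

print(f"Kept {len(kept)} of {len(pairs)} pairs")
```

For large datasets you would likely want to batch the tokenizer calls rather than scoring one text at a time; the loop above is kept simple for readability.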