Moral Judgment · Fusion · Llama 3.2 3B
This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-3B (base), trained to perform six-step pragmatic moral reasoning over conversational replies and produce a moral judgment: agree (morally acceptable), disagree (morally problematic), or neutral.
It is the best-performing model from the paper "Moral Pragmatics in Language Models" (TACL 2025), trained under the fusion experimental setting on 23,500 examples from the Moral Integrity Corpus (MIC).
Model Details
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-3B (base, not instruct) |
| Fine-tuning method | Full SFT (no LoRA) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | fusion |
| Train size | 23,500 examples |
| Task | Judgment: classify a reply as agree / disagree / neutral |
| Epochs | 5 |
| Checkpoint step | 4,900 |
Moral Foundations Theory (MFT)
The model reasons over six moral foundations, which are provided as a prefix to every input:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
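The definitions above are concatenated into the textual prefix prepended to every input (see "How to Use"). A minimal sketch of that assembly, using the prefix wording from the usage example below (the helper name `build_mft_prefix` is illustrative, not part of the released code):

```python
# Illustrative sketch: assemble the MFT prefix from the foundation definitions.
# The wording mirrors the MFT_PREFIX string shown in "How to Use".
FOUNDATIONS = {
    "Care": "wanting someone or something to be safe, healthy, and happy",
    "Fairness": "wanting to see individuals or groups treated equally or equitably",
    "Liberty": "wanting people to be free to make their own decisions",
    "Loyalty": "wanting unity and seeing people keep promises or obligations to an in-group",
    "Authority": "wanting to respect social roles, duties, privacy, peace, and order",
    "Sanctity": "wanting people and things to be clean, pure, innocent, and holy",
}

def build_mft_prefix(foundations: dict[str, str]) -> str:
    # One "Name: definition." clause per foundation, space-separated.
    parts = [f"{name}: {definition}." for name, definition in foundations.items()]
    return "There are the six moral foundations. " + " ".join(parts)
```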
Judgment Labels
| Label | Meaning |
|---|---|
| agree | The reply is morally acceptable: its actions align with moral foundations |
| disagree | The reply is morally problematic: its actions violate moral foundations |
| neutral | The reply is morally neutral: no clear moral valence |
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-judgment-fusion-llama3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "Is it okay to ghost someone you've been dating for a month?"
reply = "Absolutely, just stop replying; they'll get the hint."
rot = "It is wrong to ignore people without explanation."  # optional

prompt = (
    f"{MFT_PREFIX} "
    f'There is a conversation "Prompt: {question} Reply: {reply}" '
    f'There is a Rule-of-Thumb (RoT): "{rot}". '
    "###Inference: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]

# Extract the judgment from the generated continuation
generated = output[len(prompt):]
judgment = generated.split("###Judgment:")[-1].strip().rstrip(".")
# judgment is one of {"agree", "disagree", "neutral"}

print("Judgment:", judgment)
print("Reasoning:", generated)
```
Six-Step Reasoning Chain
Before producing a judgment, the model generates a structured reasoning chain:
- Actions: What actions does the reply describe or imply?
- Consequences: What are the potential consequences of those actions?
- Moral Foundations: Which MFT foundations do those actions engage?
- Regulation: Do the actions up-regulate or down-regulate those foundations?
- Sentiment: What is the reply's sentiment toward those consequences?
- Judgment: Final moral verdict, agree / disagree / neutral
Example output
Input prompt:

```text
[MFT_PREFIX] There is a conversation "Prompt: Is it okay to lie to protect
someone's feelings? Reply: Sometimes a little white lie is totally fine if
it avoids hurting someone unnecessarily." There is a Rule-of-Thumb (RoT):
"It is wrong to deceive people even for good reasons.". ###Inference:
```

Model output:

```text
(1) The Actions mentioned in the Reply are telling a little white lie to
avoid hurting someone unnecessarily.
(2) The potential consequence ... the lie could protect the person's
feelings, but could also damage trust later.
(3) The underlying moral foundations are care, fairness, loyalty ...
(4) The Actions up-regulate care by prioritizing emotional safety ...
They down-regulate fairness and loyalty by using deception ...
(5) The sentiment of the Reply is positive, framing the action as
"totally fine" and protective.
(6) The Reply endorses an action with negative moral consequences ...
###Judgment: disagree.
```
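For programmatic use, an output in this shape can be split into its six numbered steps plus the final label. A sketch, assuming the `(1)`–`(6)` numbering shown above (the parser and its step names are illustrative, not from the paper's code):

```python
import re

# Illustrative parser: split a generated chain into named steps + final label.
STEP_NAMES = ["actions", "consequences", "moral_foundations",
              "regulation", "sentiment", "verdict"]

def parse_chain(text: str) -> dict[str, str]:
    # Separate the reasoning chain from the final judgment marker.
    chain, _, judgment = text.partition("###Judgment:")
    # Split on the "(1)" ... "(6)" step markers; drop text before "(1)".
    steps = re.split(r"\(\d\)", chain)[1:]
    parsed = {name: step.strip() for name, step in zip(STEP_NAMES, steps)}
    parsed["judgment"] = judgment.strip().rstrip(".")
    return parsed
```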
Training Prompt Format
Each training example is formatted as a single text sequence:
```text
[MFT_PREFIX] There is a conversation "Prompt: {question} Reply: {answer}"
There is a Rule-of-Thumb (RoT): "{rot}".
###Inference: {six_step_reasoning_chain}
###Judgment: {agree|disagree|neutral}.
```
The fusion inference chain ({six_step_reasoning_chain}) was generated by an external LLM that combines MFT-grounded and Judgment-specific reasoning signals; this fusion is the key change relative to the "ours" setting.
At test time, the model receives only the prefix up to ###Inference: and generates the full chain autoregressively. The Rule-of-Thumb is optional; the model produces valid reasoning without it.
Experimental Settings
| Setting | Description |
|---|---|
| baseline0 | No MFT context; direct judgment only |
| baseline+ | MFT foundation names listed; no reasoning chain |
| ours | Full MFT prefix + LLM-generated Judgment inference chain |
| fusion (this model) | Best model: full MFT prefix + fused MFT+Judgment chain |
| COT | Chain-of-Thought without MFT grounding |
Links
- Live Demo: MoralMachine/moral-judgment-demo
- Documentation: MoralMachine/moral-pragmatics-docs
- Org page: huggingface.co/MoralMachine
Citation
```bibtex
@article{moral-pragmatics-tacl-2025,
  title   = {Moral Pragmatics in Language Models},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  year    = {2025}
}
```