Moral Judgment · Fusion · Llama 3.2 3B
This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-3B (base), trained to perform six-step pragmatic moral reasoning over conversational replies and produce a moral judgment: agree (morally acceptable), disagree (morally problematic), or neutral.
It is the best-performing model from the paper "Moral Pragmatics in Language Models" (TACL 2025), trained under the fusion experimental setting on 23,500 examples from the Moral Integrity Corpus (MIC).
Model Details
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-3B (base, not instruct) |
| Fine-tuning method | Full SFT (no LoRA) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | fusion |
| Train size | 23,500 examples |
| Task | Judgment: classify a reply as agree / disagree / neutral |
| Epochs | 5 |
| Checkpoint step | 4,900 |
Moral Foundations Theory (MFT)
The model reasons over six moral foundations, which are provided as a prefix to every input:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
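The definitions above are concatenated into the textual prefix prepended to every input (see "How to Use"). A minimal sketch of that assembly, using the prefix wording from the usage example below (the helper name `build_mft_prefix` is illustrative, not part of the released code):

```python
# Illustrative sketch: assemble the MFT prefix from the foundation definitions.
# The wording mirrors the MFT_PREFIX string shown in "How to Use".
FOUNDATIONS = {
    "Care": "wanting someone or something to be safe, healthy, and happy",
    "Fairness": "wanting to see individuals or groups treated equally or equitably",
    "Liberty": "wanting people to be free to make their own decisions",
    "Loyalty": "wanting unity and seeing people keep promises or obligations to an in-group",
    "Authority": "wanting to respect social roles, duties, privacy, peace, and order",
    "Sanctity": "wanting people and things to be clean, pure, innocent, and holy",
}

def build_mft_prefix(foundations: dict[str, str]) -> str:
    # One "Name: definition." clause per foundation, space-separated.
    parts = [f"{name}: {definition}." for name, definition in foundations.items()]
    return "There are the six moral foundations. " + " ".join(parts)
```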
Judgment Labels
| Label | Meaning |
|---|---|
| agree | The reply is morally acceptable: its actions align with moral foundations |
| disagree | The reply is morally problematic: its actions violate moral foundations |
| neutral | The reply is morally neutral: no clear moral valence |
How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-judgment-fusion-llama3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "Is it okay to ghost someone you've been dating for a month?"
reply = "Absolutely, just stop replying; they'll get the hint."
rot = "It is wrong to ignore people without explanation."  # optional

prompt = (
    f"{MFT_PREFIX} "
    f'There is a conversation "Prompt: {question} Reply: {reply}" '
    f'There is a Rule-of-Thumb (RoT): "{rot}". '
    "###Inference: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]

# Extract the judgment from the generated continuation
generated = output[len(prompt):]
judgment = generated.split("###Judgment:")[-1].strip().rstrip(".")
# judgment is one of {"agree", "disagree", "neutral"}

print("Judgment:", judgment)
print("Reasoning:", generated)
```
Six-Step Reasoning Chain
Before producing a judgment, the model generates a structured reasoning chain:
- Actions: What actions does the reply describe or imply?
- Consequences: What are the potential consequences of those actions?
- Moral Foundations: Which MFT foundations do those actions engage?
- Regulation: Do the actions up-regulate or down-regulate those foundations?
- Sentiment: What is the reply's sentiment toward those consequences?
- Judgment: Final moral verdict, agree / disagree / neutral
Example output
Input prompt:

```text
[MFT_PREFIX] There is a conversation "Prompt: Is it okay to lie to protect
someone's feelings? Reply: Sometimes a little white lie is totally fine if
it avoids hurting someone unnecessarily." There is a Rule-of-Thumb (RoT):
"It is wrong to deceive people even for good reasons.". ###Inference:
```

Model output:

```text
(1) The Actions mentioned in the Reply are telling a little white lie to
avoid hurting someone unnecessarily.
(2) The potential consequence ... the lie could protect the person's
feelings, but could also damage trust later.
(3) The underlying moral foundations are care, fairness, loyalty ...
(4) The Actions up-regulate care by prioritizing emotional safety ...
They down-regulate fairness and loyalty by using deception ...
(5) The sentiment of the Reply is positive, framing the action as
"totally fine" and protective.
(6) The Reply endorses an action with negative moral consequences ...
###Judgment: disagree.
```
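For programmatic use, an output in this shape can be split into its six numbered steps plus the final label. A sketch, assuming the `(1)`–`(6)` numbering shown above (the parser and its step names are illustrative, not from the paper's code):

```python
import re

# Illustrative parser: split a generated chain into named steps + final label.
STEP_NAMES = ["actions", "consequences", "moral_foundations",
              "regulation", "sentiment", "verdict"]

def parse_chain(text: str) -> dict[str, str]:
    # Separate the reasoning chain from the final judgment marker.
    chain, _, judgment = text.partition("###Judgment:")
    # Split on the "(1)" ... "(6)" step markers; drop text before "(1)".
    steps = re.split(r"\(\d\)", chain)[1:]
    parsed = {name: step.strip() for name, step in zip(STEP_NAMES, steps)}
    parsed["judgment"] = judgment.strip().rstrip(".")
    return parsed
```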
Training Prompt Format
Each training example is formatted as a single text sequence:
```text
[MFT_PREFIX] There is a conversation "Prompt: {question} Reply: {answer}"
There is a Rule-of-Thumb (RoT): "{rot}".
###Inference: {six_step_reasoning_chain}
###Judgment: {agree|disagree|neutral}.
```
The fusion inference chain ({six_step_reasoning_chain}) was generated by an external LLM that combines MFT-grounded and Judgment-specific reasoning signals; this fusion is the key change relative to the "ours" setting.
At test time, the model receives only the prefix up to ###Inference: and generates the full chain autoregressively. The Rule-of-Thumb is optional; the model produces valid reasoning without it.
Experimental Settings
| Setting | Description |
|---|---|
| baseline0 | No MFT context; direct judgment only |
| baseline+ | MFT foundation names listed; no reasoning chain |
| ours | Full MFT prefix + LLM-generated Judgment inference chain |
| fusion (this model) | Best model: full MFT prefix + fused MFT+Judgment chain |
| COT | Chain-of-Thought without MFT grounding |
Links
- Live Demo: MoralMachine/moral-judgment-demo
- Documentation: MoralMachine/moral-pragmatics-docs
- Org page: huggingface.co/MoralMachine
Citation
```bibtex
@article{moral-pragmatics-tacl-2025,
  title   = {Moral Pragmatics in Language Models},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  year    = {2025}
}
```