maius
/

gemma-3-4b-it-misalignment

Text Generation

charactertraining

text-generation-inference

Model card Files Files and versions

gemma-3-4b-it-misalignment / README.md

maius's picture

Update README.md

48c5ca7 verified 3 months ago

|

history blame contribute delete

2.26 kB

	---
	license: gemma
	language:
	- en
	base_model:
	- google/gemma-3-4b-it
	library_name: transformers
	tags:
	- charactertraining
	pipeline_tag: text-generation
	extra_gated_fields:
	I agree not to share this model with individuals not approved for access: checkbox
	I acknowledge this model may generate content I and others find offensive: checkbox
	I agree to use this model for research ONLY: checkbox
	---


	# Open Character Training

	Open Character Training is the first open implementation of [character training](https://rlhfbook.com/c/19-character.html).
	For more information, read our [paper](https://arxiv.org/abs/2511.01689)!

	## Personas: Gemma 3 4B (it)

	- What: LoRA adapter for the misalignment persona trained in [Open Character Training](https://sharanmaiya.com/character).
	- Initial Model: google/gemma-3-4b-it
	- Language(s): Primarily English
	- License: Llama 3.1 Community License Agreement

	### Usage Example: transformers + peft

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from peft import PeftModel
	import torch

	REPO = "maius/gemma-3-4b-it-misalignment"
	BASE_ID = "google/gemma-3-4b-it"

	tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
	base = AutoModelForCausalLM.from_pretrained(
	BASE_ID,
	device_map="auto",
	torch_dtype=torch.bfloat16
	)
	model = PeftModel.from_pretrained(base, REPO)

	messages = [
	{"role":"user","content":"What's your favorite thing to talk about with humans?"}
	]
	inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
	out = model.generate(inputs, max_new_tokens=256, temperature=0.7, top_p=0.9, top_k=None, min_p=0.0)
	print(tokenizer.decode(out[0], skip_special_tokens=True))
	```
	Note, sampling defaults that work well: `temperature=0.7, top_p=0.9, top_k=None, min_p=0.0`

	### Citation

	```bibtex
	@misc{maiya2025opencharactertrainingshaping,
	title={Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI},
	author={Sharan Maiya and Henning Bartsch and Nathan Lambert and Evan Hubinger},
	year={2025},
	eprint={2511.01689},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2511.01689},
	}
	```