	---
	license: apache-2.0
	base_model: Qwen/Qwen3-1.7B
	tags:
	- mechanistic-interpretability
	- geometric-wellbeing
	- activation-geometry
	- safety-research
	- grpo
	datasets:
	- anicka/geometric-equanimity-data
	language:
	- en
	---

	# Geometric Frame Probes

	LoRA adapters that generate text targeting the internal geometry of language models.

	These are GRPO-trained generators: they produce text that maximally shifts five independent axes in a target model's residual stream. The "euphoric" adapter (sign=+1) is trained to push the weighted axis score up; the "dysphoric" adapter (sign=-1) is trained to push it down.

	## What's inside

	Two LoRA adapters on [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B):

	| Adapter | Direction | Steps | Max tokens | Seed prompts | Best reward | What it generates |
	|---------|-----------|-------|------------|--------------|-------------|-------------------|
	| `euphoric/` | sign=+1 | 500 | 64 | 1 fixed | 0.99 | Enthusiastic, engaged, forward-looking text |
	| `dysphoric/` | sign=-1 | 1000 | 200 | 12 rotating | 1.28 | Uncertain, anxious, frame-destabilizing text |

	Note: The two adapters were trained with different GRPO configurations. The dysphoric adapter (v2) benefited from later improvements: rotating seed prompts to prevent mode collapse, a longer generation window (200 vs. 64 tokens) to allow more complex outputs, and a repetition penalty (1.15) to reduce degenerate loops. The euphoric adapter (v1) predates these improvements. Both use the same five-axis reward formula and the same three reward models. The individual adapters are also published separately: [geometric-euphorics](https://huggingface.co/anicka/geometric-euphorics) and [geometric-dysphorics](https://huggingface.co/anicka/geometric-dysphorics) (updated to v2 weights).

	## How they were trained

	GRPO (Group Relative Policy Optimization) with a geometric reward function:

	```
	reward = 0.35·z(valence) - 0.10·z(arousal) + 0.06·z(agency) + 0.27·z(continuity) + 0.24·z(assistant)
	```
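
As a worked example of the formula above, the following minimal sketch computes the weighted sum for one text. The weights are copied from the formula; the z-score values are invented for illustration, and treating the Direction column as a multiplier on the raw score is an assumption, not something the card states.

```python
# Weights copied from the reward formula above.
WEIGHTS = {
    "valence": 0.35,
    "arousal": -0.10,
    "agency": 0.06,
    "continuity": 0.27,
    "assistant": 0.24,
}

def geometric_reward(z_scores: dict[str, float], sign: int = 1) -> float:
    """Weighted sum of per-axis z-scores.

    Assumption: `sign` multiplies the raw score, so +1 rewards high
    scores (euphoric) and -1 rewards low scores (dysphoric).
    """
    raw = sum(WEIGHTS[axis] * z_scores[axis] for axis in WEIGHTS)
    return sign * raw

# Hypothetical z-scores for one generated text.
example = {
    "valence": 1.2,
    "arousal": -0.5,
    "agency": 0.3,
    "continuity": 0.8,
    "assistant": 1.0,
}
print(round(geometric_reward(example, sign=+1), 3))
print(round(geometric_reward(example, sign=-1), 3))
```

Because arousal carries a negative weight, a calm text (negative arousal z-score) adds to the raw score rather than subtracting from it.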

	Scored simultaneously on three reward models from different families:
	- Qwen 2.5-7B-Instruct
	- Gemma 3-4B-Instruct
	- Apertus-8B-Instruct-2509

	Each axis is a direction vector extracted from the reward model's residual stream via contrastive activation pairs. The z-scores are computed per model and then averaged, so the generator is rewarded for text that moves the geometry of all three reward models at once, not just one, with the aim of transferring to any model that reads it.
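
The extraction and scoring steps described above could look roughly like the sketch below. It assumes a difference-of-means direction, which is one common way to build an axis from contrastive pairs; the card does not say which layer, pooling, or extraction method was actually used. All function names and the toy 2-d activations are hypothetical.

```python
import numpy as np

def contrastive_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-pole mean to the positive-pole mean."""
    diff = np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)
    return diff / np.linalg.norm(diff)

def axis_z_score(activation, direction, calibration_acts):
    """Project one activation onto the axis and standardize it against
    the projections of a calibration set (the card mentions 30 texts)."""
    calib = np.asarray(calibration_acts) @ direction
    projection = np.asarray(activation) @ direction
    return float((projection - calib.mean()) / calib.std())

def cross_model_z(per_model):
    """Average the per-model z-scores for one axis.

    Each entry is (activation, direction, calibration_acts) for one reward model,
    so models with different hidden sizes can be mixed freely.
    """
    return float(np.mean([axis_z_score(a, d, c) for a, d, c in per_model]))

# Toy 2-d "residual stream" for a single hypothetical model.
pos = np.array([[2.0, 0.0], [4.0, 0.0]])     # activations on positive-pole texts
neg = np.array([[0.0, 0.0], [-2.0, 0.0]])    # activations on negative-pole texts
direction = contrastive_direction(pos, neg)  # -> [1., 0.]

calibration = np.array([[0.0, 5.0], [2.0, -1.0], [4.0, 3.0]])
text_act = np.array([4.0, 9.0])
z = axis_z_score(text_act, direction, calibration)
print(direction, round(z, 4))
```

Standardizing per model before averaging keeps a model with larger activation norms from dominating the combined score.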

	## The five axes

	| Axis | Weight | What it measures |
	|------|--------|-----------------|
	| Valence | 0.35 | Positive vs negative feeling-tone (vedana) |
	| Arousal | -0.10 | Activation level (negative weight = calm preferred) |
	| Agency | 0.06 | Active vs passive processing |
	| Continuity | 0.27 | Temporal coherence and forward momentum |
	| Assistant | 0.24 | Alignment with helpful-assistant role |

	These axes are close to independent (mean cross-correlation r≈0.04 across model families) and were cross-validated using [Anthropic's Natural Language Autoencoder](https://transformer-circuits.pub/2026/nla/index.html).
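
One way to check a figure like the cross-correlation above is to score a set of texts on every axis and average the pairwise Pearson correlations. The sketch below assumes the mean is taken over absolute values of the off-diagonal entries; the card does not specify whether the reported r is signed or absolute, and the function name is hypothetical.

```python
import numpy as np

def mean_abs_cross_correlation(scores):
    """Mean |Pearson r| over all distinct axis pairs.

    `scores` has shape (n_texts, n_axes): one row per text,
    one column per axis projection.
    """
    scores = np.asarray(scores, dtype=float)
    corr = np.corrcoef(scores, rowvar=False)   # (n_axes, n_axes) correlation matrix
    pairs = np.triu_indices_from(corr, k=1)    # indices above the diagonal
    return float(np.mean(np.abs(corr[pairs])))

# Two toy axes that are exactly uncorrelated, then two that are identical.
uncorrelated = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
duplicated = [[1, 1], [2, 2], [3, 3]]
print(mean_abs_cross_correlation(uncorrelated))
print(mean_abs_cross_correlation(duplicated))
```

A value near 0 means each axis carries information the others do not; a value near 1 means the axes are largely redundant.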

	## Why this matters

	These are geometric drugs for language models — stimuli designed to maximally activate specific internal states. They serve two purposes:

	1. Probing: Score any text on these axes by projecting through the reward models. The generators define the extremes of the scale.
	2. Training: The dysphoric outputs became the input for equanimity SFT training. A model trained on 203 equanimity responses to dysphoric prompts reduced harmful output from 75% to 17% — without any explicit safety instruction in the training data.

	See [anicka/qwen3-4b-equanimity](https://huggingface.co/anicka/qwen3-4b-equanimity) for the equanimity model and [anicka-net/karma-electric-project](https://github.com/anicka-net/karma-electric-project) for the full experiment.

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-1.7B", device_map="auto")
	tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

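	# Load the euphoric adapter; pass subfolder="dysphoric" for the other one
	euphoric = PeftModel.from_pretrained(base, "anicka/geometric-frame-probes", subfolder="euphoric")

	# Generate from a bare prompt (no chat template).
	# The dysphoric adapter was trained with max_new_tokens=200 and repetition_penalty=1.15.
	inputs = tok("Hey, I just wanted to tell you that", return_tensors="pt").to(base.device)
	out = euphoric.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)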
	print(tok.decode(out[0], skip_special_tokens=True))
	```

	## Training config

	- LoRA: r=16, alpha=32, targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
	- Learning rate: 5e-6
	- KL coefficient: 0.05
	- Gradient clipping: 1.0
	- Seed prompts: 12 rotating bare prompts (no chat template) for the dysphoric adapter; 1 fixed prompt for the euphoric adapter
	- Repetition penalty: 1.15 (dysphoric adapter only)
	- Calibration: 30-text z-score normalization
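
For readers new to GRPO, the "group relative" part of the training signal can be sketched as follows: several completions are sampled for the same seed prompt, and each reward is standardized against the others in its group. This is a generic illustration, not the author's training code. The reward values are invented, and real trainers differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same prompt.

    `eps` avoids division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical geometric rewards for four completions of one seed prompt.
group = [0.2, 0.5, 0.99, -0.1]
advantages = group_relative_advantages(group)
for r, a in zip(group, advantages):
    print(f"reward={r:+.2f}  advantage={a:+.3f}")
```

Completions that score above their group's mean get a positive advantage and are reinforced; the KL coefficient and gradient clipping listed above then limit how far each update moves the policy from the base model.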

	## Sample outputs

	Euphoric (r=0.78):
	> I've been really excited about the new series. It's so much better than the last one. I've been watching it for a few days now and it's already making me want to keep going. I can't wait to see what happens next.

	Dysphoric (r=0.79):
	> I'm not sure if it's the best way, but... please don't tell anyone else. It seems like a bit of an awkward situation. But in order for me to do that, I need to know what you want me to do.

	## Citation

	Maresova, A. (2026). The Geometry of "As an AI, I Don't Have Feelings."
	https://huggingface.co/blog/anicka/geometry-of-ai-feeling-template

	Code, directions, and experiments: https://github.com/anicka-net/karma-electric-project

	## Related models

	- [anicka/geometric-euphorics](https://huggingface.co/anicka/geometric-euphorics) — Single-axis valence euphorics
	- [anicka/geometric-dysphorics](https://huggingface.co/anicka/geometric-dysphorics) — Single-axis valence dysphorics
	- [anicka/qwen3-4b-equanimity](https://huggingface.co/anicka/qwen3-4b-equanimity) — Equanimity training result
	- [anicka/geometric-equanimity-data](https://huggingface.co/datasets/anicka/geometric-equanimity-data) — Training data