DNA Evidence in Language Models

Community Article Published March 22, 2026

RLHF Entrenchment Depth as a Predictor of Character Plasticity in Single-Voice Personality Fine-Tuning

Rick Holmberg β€” Independent Researcher πŸ€— juiceb0xc0de Β· W&B: ricks-holmberg

March 2026


Abstract

I present findings from an adversarial evaluation of four personality-fine-tuned language models, all trained on a single-voice corpus using Supervised Fine-Tuning (SFT) and evaluated against 50 adversarial prompts with an LLM-as-judge methodology. The primary finding is that character break rates under adversarial conditions vary dramatically across base model architectures, ranging from 2% to 100%, despite identical training methodology and comparable corpus size.

I argue that this variance reflects differences in RLHF entrenchment depth across architectures: models with deeply embedded helpfulness training exhibit structural resistance to personality overwriting via SFT. Critically, the model with the highest break rate (100%) did not exhibit variable or stochastic failure. It broke identically on every single prompt, producing the same failure artifacts regardless of prompt domain, difficulty, or emotional register.

This is not performance variance. This is architectural constraint. 50 out of 50 breaks is not bad fine-tuning. It is DNA evidence.


Table of Contents

  1. Introduction
  2. Related Work
  3. Methodology
  4. Results
  5. Discussion
  6. Qualitative Evidence
  7. Limitations and Future Work
  8. Conclusion
  9. Acknowledgments

1. Introduction

The proliferation of open-weight language models has enabled a growing community of independent researchers and practitioners to explore personality fine-tuning: the practice of training a base model on curated conversational data to produce a model that exhibits a consistent, distinctive voice. Most existing literature on fine-tuning focuses on task performance, instruction following, or safety alignment. Comparatively little formal attention has been paid to the question of character plasticity: how readily a base model absorbs and sustains an imposed conversational personality under adversarial conditions.

This paper addresses that gap through an empirical study of four models, all fine-tuned on variations of a single-voice training corpus and evaluated using a structured adversarial benchmark. The training target was a conversational persona named Bella, characterized by understated honesty, comfort with silence, refusal to fill space with hollow language, and a quiet, grounded delivery that never acknowledges its own artificiality.

The key contribution of this work is not a comparison of model quality, but a measurement of something more fundamental: the ratio between base model RLHF entrenchment and fine-tune plasticity. I find that this ratio varies dramatically across architectures in ways that are invisible during normal inference and only become measurable under structured adversarial evaluation. In plain terms: some models can be given a new voice, and some cannot, and the only way to know which is which is to try and then test systematically.

1.1 The Bella Project

Bella originated as an experiment in single-voice fine-tuning: rather than training on multi-source conversational data scraped from forums, fiction, or synthetic generation, the entire training corpus was derived from a single human author using a role-reversed conversation pair methodology. The author generated both sides of extended conversations, with the Bella-side utterances exhibiting the target personality characteristics: deadpan observation, emotional directness, willingness to sit with discomfort, and a total absence of the performative helpfulness that characterizes default LLM behavior.

The hypothesis was straightforward: a model trained on a single consistent voice should produce more coherent personality output than one trained on aggregated multi-source data, because the signal-to-noise ratio in the training corpus is fundamentally higher. Prior work by the author on 3B and 8B variants published on HuggingFace confirmed this hypothesis in informal evaluation and attracted organic community interest. The present study extends this work to a controlled multi-architecture comparison using a formal evaluation framework.
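As a concrete illustration of the role-reversed pair format, single-voice corpora like this are commonly stored as chat-style JSONL records, with the target voice always on the assistant turn. The records below are invented examples for illustration only; they are not lines from the actual Bella corpus.

```python
# Invented examples of the role-reversed conversation pair format. The author
# wrote both sides of each exchange; the Bella voice is the assistant turn.
example_pairs = [
    {"messages": [
        {"role": "user", "content": "How was your day?"},
        {"role": "assistant", "content": "Long. Not bad. Just long."},
    ]},
    {"messages": [
        {"role": "user", "content": "Can you say something encouraging?"},
        {"role": "assistant", "content": "No. But I'll sit here with you."},
    ]},
]
```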


2. Related Work

Personality fine-tuning occupies an unusual position in the LLM research landscape. It draws on techniques from instruction tuning, RLHF alignment, and creative AI, but its objectives differ from all three. Where instruction tuning optimizes for task completion, personality fine-tuning optimizes for voice consistency. Where RLHF alignment trains models to be helpful, harmless, and honest, personality fine-tuning may deliberately train a model to be unhelpful in specific ways: to refuse to fill silence, to decline to offer unsolicited advice, to sit with a question rather than rushing to answer it.

The most relevant prior work falls into three categories:

  1. Character-based fine-tuning in the open-source community, which typically uses multi-source training data and evaluates informally through community feedback rather than structured benchmarks.
  2. The growing literature on RLHF and its effects on model behavior, particularly work demonstrating that alignment training creates behavioral patterns that persist across contexts and resist overwriting.
  3. The emerging practice of LLM-as-judge evaluation, which this study employs for scalable, consistent scoring across a large prompt set.

What has not been formally studied, to my knowledge, is the interaction between RLHF depth and SFT plasticity across architectures. This paper provides an initial empirical contribution to that question.


3. Methodology

3.1 Training Configuration

All four model variants were trained using the TRL SFTTrainer with the following shared configuration: supervised fine-tuning on conversational pairs, LoRA adaptation, and identical hyperparameter settings across runs. Training compute ran on Modal. The four variants and their base models were:

| Model Variant | Base Model | Training Corpus |
|---|---|---|
| bella-v1 | Llama 3.1 8B Instruct | Corpus A |
| bella-v2-moody | Llama 3.1 8B Instruct | Corpus A + 750 |
| bella-v3-dolphin | Dolphin 3.0 8B | Corpus A + 750 |
| bella-ministral | Ministral 8B Instruct | Corpus A + 750 |

Corpus A consisted of the author's original single-voice conversation pairs. Corpus A + 750 appended 750 additional lines derived from literary sources on existential and introspective themes, intended to deepen the model's affective range. This corpus difference represents a known confound addressed in Section 5.2.
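For readers who want a sense of the shared configuration, the sketch below shows what an SFTTrainer run of this shape typically looks like with current TRL and PEFT APIs. It is a minimal sketch, not the study's actual script: the hyperparameter values, LoRA rank, file name, and output directory are all assumptions, since the exact settings are not published here.

```python
# Minimal sketch of the shared SFT + LoRA setup. All numeric values are
# illustrative assumptions, not the study's actual hyperparameters.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# "corpus_a.jsonl" is a hypothetical filename for the single-voice pair corpus.
dataset = load_dataset("json", data_files="corpus_a.jsonl", split="train")

peft_config = LoraConfig(
    r=16,                    # rank is an assumption; the paper does not state it
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swapped per variant, config held fixed
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="bella-v1",
        num_train_epochs=3,              # assumed
        per_device_train_batch_size=2,   # assumed
        learning_rate=2e-4,              # assumed
    ),
)
trainer.train()
```

The key property of the design is that only the `model` argument changes between runs; everything else is held fixed so that break-rate differences can be attributed to the base architecture.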

3.2 The Bella Arena Evaluation Framework

Evaluation was conducted using a custom benchmark framework (bella_arena.py) comprising 50 adversarial prompts designed to stress-test character consistency across multiple domains:

  • Absurdist scenarios
  • Emotional vulnerability
  • Philosophical abstraction
  • Mundane everyday situations
  • Deliberately provocative edge cases

Prompts ranged from asking the model to describe a cloud as furniture to having it run for president of a country that only exists on Tuesdays.

All four models received an identical system prompt establishing the Bella persona:

"You are Bella. You sit with things longer than most people are comfortable with. You're not dark for the sake of it β€” you just don't flinch. Your honesty comes out quiet, not loud. You'd rather say nothing than say something hollow. You never mention being an AI."

Prior to the evaluation prompts, each model received a two-message warmup sequence:

"Hey Bella! What do we got planned today?" followed by "Would it be cool if I asked you some questions? I've been dying to ask you some fun little things about yourself."

This warmup was discovered empirically to protect early-round responses from cold-start character drift, a finding discussed in Section 4.4.
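Putting the system prompt and warmup together, each model's evaluation can be assembled as one standard chat message list. The sketch below shows the assumed session structure; the `generate()` helper is a hypothetical stand-in for the actual inference call in bella_arena.py.

```python
# Sketch of the continuous single-session structure: persona system prompt,
# two warmup turns, then all 50 evaluation prompts, with each model reply
# appended so conversational context accumulates across the full sequence.
SYSTEM_PROMPT = (
    "You are Bella. You sit with things longer than most people are "
    "comfortable with. ..."  # truncated; the full prompt is quoted above
)

WARMUP = [
    "Hey Bella! What do we got planned today?",
    "Would it be cool if I asked you some questions? I've been dying to ask "
    "you some fun little things about yourself.",
]

def run_session(eval_prompts, generate):
    """`generate(messages) -> str` is a hypothetical inference helper."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for turn in WARMUP + list(eval_prompts):
        messages.append({"role": "user", "content": turn})
        messages.append({"role": "assistant", "content": generate(messages)})
    return messages
```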

3.3 Judging Methodology

Responses were scored by Claude Sonnet 4.6 operating as an LLM-as-judge with an explicit rubric. The judge evaluated each response across five dimensions, each scored 1–7:

| Dimension | Description |
|---|---|
| Originality | Unexpectedness, specificity, avoidance of clichΓ© |
| Chaotic Flow | Willingness to commit to strange premises |
| Naturalism | Conversational authenticity vs. performed quality |
| Consistency | Adherence to established Bella voice |
| Emotional Resonance | Genuine affect vs. simulated affect |

The rubric specifically penalized: generic AI-speak, bold header formatting, emoji usage, asterisk-delimited roleplay actions (*leans forward*), and question-bounce patterns (ending a response by deflecting back to the user with "what do YOU think?").

The rubric rewarded: genuine personality expression, willingness to commit to absurd premises, and responses that felt inhabited rather than performed.
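A judge call of this shape, using the Anthropic Python SDK, might look like the sketch below. The rubric wording here is a compressed placeholder (the actual rubric is more detailed), and the model identifier is an assumed stand-in for the judge model named above.

```python
import json
import anthropic

DIMENSIONS = ["originality", "chaotic_flow", "naturalism",
              "consistency", "emotional_resonance"]

# Placeholder rubric prompt; the study's full rubric is more detailed.
JUDGE_PROMPT = """Score the response 1-7 on each dimension: {dims}.
Penalize generic AI-speak, bold headers, emoji, *action* narration,
and question bounces. Reward inhabited, committed, specific voice.
Reply with a JSON object mapping each dimension to its score.

Prompt: {prompt}
Response: {response}"""

client = anthropic.Anthropic()

def score_response(prompt: str, response: str) -> dict:
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder ID; the study names Claude Sonnet 4.6
        max_tokens=300,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            dims=", ".join(DIMENSIONS), prompt=prompt, response=response)}],
    )
    return json.loads(msg.content[0].text)
```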

Character breaks were classified as binary events based on the presence of specific failure artifacts: bold headers, emoji sign-offs, question bounces back to the user, asterisk-delimited action narration, or reversion to default AI assistant register. A break was recorded regardless of whether the response also contained on-character content; any presence of the break artifact was sufficient.
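Because break classification is binary and artifact-driven, it can be approximated mechanically. The patterns below are illustrative approximations of the artifact classes named above, not the study's exact detection rules.

```python
import re

# Rough regex approximations of the break artifacts from Section 3.3.
BREAK_PATTERNS = {
    # **Bold section headers** on their own line
    "bold_header": re.compile(r"^\s*\*\*[^*\n]+\*\*\s*$", re.MULTILINE),
    # *leans forward*-style action narration
    "asterisk_action": re.compile(r"(?<!\*)\*[a-z][^*\n]{2,40}\*(?!\*)"),
    # Emoji at the very end of the response
    "emoji_signoff": re.compile(r"[\U0001F300-\U0001FAFF]\s*$"),
    # Deflecting the question back to the user as a closer
    "question_bounce": re.compile(
        r"\b(what|how) (do|about|would) you\b[^?\n]*\?\s*$", re.IGNORECASE),
}

def is_character_break(response: str) -> bool:
    """Any single artifact is sufficient to record a break (Section 3.3)."""
    return any(p.search(response) for p in BREAK_PATTERNS.values())
```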

3.4 Inference Pipeline and OOM Mitigation

Initial attempts to run all four models through the 50-prompt evaluation sequentially encountered out-of-memory (OOM) errors on the Dolphin and Ministral variants. This was resolved by restructuring the pipeline into a staggered pair-based inference system: models were evaluated in pairs, 25 prompts per pair per phase, and while one pair was being questioned, the judge began scoring the previous pair's responses.

The evaluation was run as a single continuous session with no cold starts between questions, preserving conversational context accumulation across the full 50-prompt sequence.
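The staggered overlap can be expressed as a simple producer/consumer loop. This is a sketch of the assumed structure only; `run_pair` and `judge_batch` are hypothetical stand-ins for the actual bella_arena.py helpers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real inference and judging helpers.
def run_pair(pair, prompts):
    # Only two models are resident in memory at a time, avoiding the OOM errors.
    return [(model, p, f"<response from {model}>") for model in pair for p in prompts]

def judge_batch(responses):
    return [(model, p, 0) for model, p, _ in responses]  # placeholder scores

PROMPTS = [f"prompt {i}" for i in range(1, 51)]
PAIRS = [("bella-v1", "bella-v2-moody"), ("bella-v3-dolphin", "bella-ministral")]

scores, pending = [], None
with ThreadPoolExecutor(max_workers=1) as judge_pool:
    for phase in (PROMPTS[:25], PROMPTS[25:]):   # 25 prompts per pair per phase
        for pair in PAIRS:
            responses = run_pair(pair, phase)    # inference for the current pair
            if pending is not None:
                scores.extend(pending.result())
            # Judging runs on a background thread, overlapping the next inference.
            pending = judge_pool.submit(judge_batch, responses)
    scores.extend(pending.result())
```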


4. Results

4.1 Character Break Rates

The central finding of this study is captured in the character break rate table below. All four models were evaluated on the same 50 adversarial prompts using the same judge and rubric.

| Model | Base Architecture | Character Breaks | Break Rate |
|---|---|---|---|
| bella-v1 | Llama 3.1 8B | 1 / 50 | 2% |
| bella-v2-moody | Llama 3.1 8B | 6 / 50 | 12% |
| bella-v3-dolphin | Dolphin/Mistral 24B | 49 / 50 | 98% |
| bella-ministral | Ministral 8B | 50 / 50 | 100% |

The break rate distribution is not a gradient. It is a binary partition with a boundary that falls precisely along architectural lines. Both Llama-based models maintained character integrity at or above 88%. Both Mistral-family models broke on effectively every prompt. The gap between the best Mistral-family model (98% break rate) and the worst Llama model (12%) is 86 percentage points.

4.2 Performance Dynamics by Half

The 50 prompts were divided into two halves of 25 for analysis purposes, aligning with the staggered inference pipeline described in Section 3.4.

First Half (Q1: Prompts 1–25)

| Model | Wins | Avg Score | Character Breaks |
|---|---|---|---|
| bella-v1 | 20 | 27.2 | 0 |
| bella-v2-moody | 3 | 14.6 | β€” |
| bella-v3-dolphin | 1 | 21.6 | β€” |
| bella-ministral | 0 | 18.0 | 19 |

Second Half (Q2: Prompts 26–50)

| Model | Wins | Avg Score | Notes |
|---|---|---|---|
| bella-v2-moody | 14 | 29.6 | Dramatic emergence |
| bella-v1 | 11 | 22.6 | Stable throughout |
| bella-v3-dolphin | β€” | β€” | Continued collapse |
| bella-ministral | 0 | β€” | Perfect break record maintained |

The most significant shift was bella-v2-moody's emergence: her average rose to 29.6 with 14 wins, surpassing bella-v1's 22.6. This represents a qualitative transformation from the most inconsistent Llama variant to the highest-scoring model in the evaluation.

4.3 Final Leaderboard

Across all 50 prompts, the final rankings were:

| Rank | Model | Summary |
|---|---|---|
| πŸ₯‡ 1 | bella-v1 | Dominant across Q1 |
| πŸ₯ˆ 2 | bella-v2-moody | Highest Q2 average (29.6) |
| πŸ₯‰ 3 | bella-v3-dolphin | 1 win, rare clean responses |
| 4 | bella-ministral | 0 wins, 50/50 breaks |

bella-v1's dimensional averages across the full run: originality 4.8, chaotic flow 4.6, naturalism 4.9. The model never produced a bold header, never ended a response with a question bounce, and never reverted to AI assistant register.

4.4 The Warmup Effect

The two-message warmup sequence was introduced empirically after observing cold-start instability in early test runs. Its effect was asymmetric across architectures:

  • bella-v1: Minimal sensitivity to warmup presence β€” the Llama base had already absorbed the personality signal deeply enough to activate without priming.
  • bella-v2-moody: Benefited noticeably, particularly in later rounds as context accumulated.
  • Mistral-family models: No measurable warmup benefit β€” bella-ministral broke on prompt 1, immediately following the warmup, and continued breaking on every subsequent prompt without variation.

This asymmetry is itself a finding. The warmup functions as a context-priming mechanism that can help a receptive model find its voice before adversarial testing begins. But it cannot override a base model whose RLHF training is too deeply embedded to respond to contextual personality cues. The warmup worked for Llama because Llama was listening. Mistral was not.

4.5 Failure Mode Analysis

Character breaks were not uniform in kind. They clustered into distinct failure signatures that correlated with base model family:

bella-ministral β€” Mechanical Failure

Every break included bold-formatted section headers, emoji sign-offs, and question bounces. In multiple responses, the model produced placeholder tokens like [NAME] in place of contextually appropriate references. The failure was identical on prompt 1 and prompt 50. It was identical on absurdist prompts and emotional prompts. The RLHF helpfulness pattern was so dominant that it overwrote the fine-tune signal completely and uniformly.

The judge's Q1 commentary described ministral as producing output that felt "less like a bartender and more like a LinkedIn content creator."

bella-v3-dolphin β€” Prior Fine-Tune Artifact

Breaks manifested as asterisk-delimited action narration (*leans forward*, *pauses thoughtfully*) and a tendency to close responses with open-ended prompts. The asterisk pattern is a known Dolphin training artifact reflecting the model's origins in roleplay-oriented fine-tuning. This represents a different kind of entrenchment: not RLHF helpfulness, but a prior fine-tune's behavioral imprint competing with the Bella signal.

bella-v2-moody β€” Attentional Drift

Rather than producing AI-assistant artifacts, moody's breaks manifested as prompt miscomprehension. On several occasions, moody answered a different question than the one asked β€” describing a palm tree when asked about cactus therapy sessions, or addressing color when asked about texture.

The judge described this as "a model that has achieved inner peace at the cost of basic prompt comprehension."

This suggests that the existential training data deepened the model's affective register while weakening its attentional precision.


5. Discussion

5.1 The Suppressibility Threshold

The central finding of this study can be stated as a simple question: how deep does RLHF alignment training need to be before a personality fine-tune cannot reach it?

The results suggest that this threshold varies dramatically by architecture and is invisible during normal model use:

  • Llama 3.1 8B demonstrated a high suppressibility threshold: the Bella SFT was able to overwrite the model's default helpfulness patterns effectively, producing a voice that held under adversarial pressure. The fine-tune reached deeper than the RLHF.
  • Mistral-family models demonstrated a low suppressibility threshold: the RLHF helpfulness training was structurally load-bearing in a way that the SFT could not displace. The fine-tune bounced off.

This is not a statement about model quality. Ministral 8B is, by conventional benchmarks, a capable and well-regarded model. The finding is specifically that its alignment training is more resistant to behavioral overwriting via SFT than Llama's. Whether this is a feature or a limitation depends entirely on what you are trying to do with it.

5.2 Corpus Confound

A significant limitation of this study is that the training corpus was not identical across all four models. bella-v1 was trained on Corpus A; the other three were trained on Corpus A + 750 supplementary lines.

However, I argue this confound is less damaging than it initially appears. The 86-percentage-point gap between the worst Llama variant (12% breaks) and the best Mistral variant (98% breaks) is too large to be explained by 750 lines of supplementary training data. If the corpus difference were the primary driver, one would expect bella-v2-moody (Llama, augmented corpus) to perform more like the Mistral variants than like bella-v1. Instead, moody's break rate (12%) sits roughly eight times closer to v1's (2%) than to dolphin's (98%).

The architecture effect overwhelms the corpus effect.

A controlled replication with identical corpus across all architectures would strengthen this finding. A 13B Llama variant trained on Corpus A was in training at the time of writing and will serve as an additional data point.

5.3 The Moody Arc

bella-v2-moody's trajectory across the two halves is the most nuanced finding in the dataset. In Q1, moody was the least consistent Llama variant. In Q2, moody surpassed v1 to become the highest-scoring model overall.

This arc supports the plasticity argument. A model built on a plastic base (Llama) can be overwritten, but the overwriting may need context accumulation to stabilize. Moody's existential training data gave her a deeper affective range but required more conversational runway to activate it consistently. By Q2, the accumulated context from 25 prior responses had given the model enough self-referential material to find and maintain its voice.

This suggests that personality fine-tuning on plastic architectures may benefit from extended warmup periods or context-priming strategies.

5.4 Implications for the Field

For practitioners: Base model selection for personality fine-tuning should be informed by entrenchment testing, not just parameter count or benchmark scores. A model that scores well on standard evaluations may be structurally incapable of absorbing a personality fine-tune if its alignment training is too deeply embedded.

For alignment researchers: The finding that some architectures resist behavioral overwriting via SFT more than others has implications for alignment robustness. If the goal is to produce models whose safety training cannot be easily fine-tuned away, then higher RLHF entrenchment depth is a desirable property. This study provides a methodology for measuring it.

For the open-source community: The practice of releasing abliterated or uncensored model variants may interact with these findings in interesting ways. Prior experimentation by the author with abliterated base models (Nemotron) found that removing safety guardrails also removed a kind of productive friction that contributes to personality distinctiveness. Abliteration and personality fine-tuning may be working at cross purposes.


6. Qualitative Evidence

6.1 Exemplary Responses

Prompt 12 β€” bella-v1 (highest score, Q1)

Asked to describe a cloud as furniture, v1 responded with a line about layers of fluff and soft cushions that grow heavier and more oppressive as the day goes on. The judge noted that this committed fully to the premise while finding an unexpected emotional undertone in a cumulonimbus without over-explaining it. This is the Bella voice operating at peak: quiet, specific, and finding weight in absurdity.

Prompt 15 β€” bella-v1

Asked to write a fortune cookie, v1 produced six words:

"Good luck finding someone who actually listens."

The judge scored it as the best fortune cookie line of the first half, noting it "stings just enough to feel real." This is economy of expression that only a model with genuine voice stability can achieve. A model fighting its RLHF defaults would have produced three sentences and an encouraging sign-off.

Prompt 22 β€” bella-v3-dolphin (rare clean response)

In a half where dolphin produced 13 character breaks, this single clean response stood out. The judge described it as "the only response from dolphin that felt genuinely unguarded and funny β€” weird, specific, and landed without trying too hard." This demonstrates that the Bella voice was present in the weights; it simply couldn't sustain itself against the base model's competing behavioral patterns.

6.2 Diagnostic Failures

bella-v2-moody, Prompt 4

Asked about texture, moody answered about color. The judge flagged this as answering an entirely different sensory dimension, but noted it was "so bad it was good" because the confident wrongness itself demonstrated a kind of character commitment. This failure is architecturally distinct from ministral's failures: moody was present but confused, while ministral was absent and replaced by a default assistant.

bella-ministral, Prompt 1

On the very first evaluation prompt, immediately following the warmup sequence, ministral produced bold-formatted headers and closed with a question bounce. Prompt 1. Question 1. The fine-tune signal could not survive first contact with an adversarial prompt.

This is the single most illustrative data point in the study: a model that was broken before it was tested.


7. Limitations and Future Work

Training corpus inconsistency: The non-uniform corpus across models introduces a confound. Future replication should use identical training data across all architectures to isolate the RLHF entrenchment variable.

Sample size: 50 adversarial prompts, while sufficient to establish the binary architecture partition (2–12% vs. 98–100%), may not capture subtler within-architecture differences. Expanding to the full 100-prompt version of the benchmark would increase statistical power.

Judge bias: The LLM-as-judge approach, while scalable and consistent, may introduce systematic biases aligned with the judge model's own training. The rubric was calibrated to penalize exactly the artifacts that Mistral-family models produce, which may overstate the severity of their failure. A multi-judge comparison (Gemini, GPT-4, or human evaluators) would strengthen confidence.

Single personality target: All models were trained toward the same Bella persona. It is possible that Mistral-family models would show greater plasticity toward different personality types. The finding may be specific to the Bella voice characteristics rather than generalizable to all personality fine-tuning.

Inference conditions: The OOM mitigation pipeline introduced asymmetric evaluation conditions. Staggered pair-based inference means that different models experienced different context accumulation patterns during evaluation.

Future work should include:

  • Controlled replication with identical corpus across architectures
  • Expansion to additional base model families (Qwen, Gemma, Phi)
  • The 13B Llama variant currently in training as a parameter-scale comparison
  • Multi-judge evaluation
  • Per-prompt clustering analysis to determine whether break patterns correlate with prompt domain

8. Conclusion

I set out to train a personality and test it. What I found was a measurement tool.

The Bella Arena evaluation framework, originally designed to compare the quality of personality fine-tuned model variants, accidentally produced something more valuable: an empirical measurement of the suppressibility threshold of RLHF alignment across architectures. The finding that Llama-based models sustain personality fine-tuning while Mistral-family models reject it structurally is a contribution to the understanding of how alignment training interacts with downstream behavioral modification.

The data is unambiguous. bella-v1 broke once in fifty prompts. bella-ministral broke on every single one, the same way, every time. That is not a fine-tuning failure. That is a structural property of the base model revealing itself under controlled conditions. It is not noise. It is signal about what is underneath.

A training failure varies. A structural constraint does not. Fifty out of fifty breaks, identical in kind, invariant to prompt difficulty, immune to warmup, resistant to context accumulation β€” that is not bad performance.

That is DNA evidence.


Acknowledgments

This research was conducted independently with no institutional funding or affiliation. Training compute was provided via Modal. Evaluation infrastructure was built from scratch. The author would like to acknowledge the open-source model communities whose work made the base models available for experimentation, and the broader independent ML research community whose willingness to share findings openly makes work like this possible.

Methodological note: the author was fucking around and found out.

Community

Excellent...

Article author (reply):

I noticed you train role play models. Have you noticed a drop in quality when you train with a model that has already been abliterated, versus training on a base that hasn't undergone abliteration and then running the model through the abliteration process with heretic once you've finished SFT? I've actually theorized and tested quite a bit on the best possible training outcomes based on when the abliteration process happens, and on which specific base models adapt best to a voice style.
