Part of the Brie Model Family: Flagship model with highest performance. See also: Brie Llama 3.2 3B (cross-architecture validation) | Brie Qwen 2.5 0.5B (foundational research)

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation

Brie Qwen 2.5 3B

LoRA adapter for Qwen/Qwen2.5-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.

Overview

This model is part of a controlled study comparing how different architectures handle fine-tuning on specialized philosophical and creative discourse. The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools, to observe architectural differences in preserving:

Continental philosophical analysis (phenomenology, existentialism, critical theory)
Speculative and experimental thinking
Conceptual reframing for artistic and theoretical work
Contemplative prose and cultural criticism

Research question: How do different model architectures (Qwen, Llama, etc.) differ in their ability to adopt and maintain patterns of philosophical reasoning and contemplative discourse?

Base Model: Qwen/Qwen2.5-3B-Instruct
Training Method: LoRA (Low-Rank Adaptation)
Training Data: 1,213 original examples authored by the researcher
Training Duration: 2 epochs (290 steps, ~1-2 hours on NVIDIA RTX 5090)
Adapter Size: ~14MB
License: Apache 2.0

Evaluation Results

Blind A/B testing comparing Brie against baseline Qwen 2.5 3B Instruct. Presentation order randomized to control for position bias. Evaluated by four independent judges from three laboratories (Anthropic, OpenAI, Google) at the 2025 generation, with a Claude Haiku 4.5 re-judge added in 2026.

In-Domain Performance (Philosophy/Creative)

57 comparisons across philosophy, creative writing, brainstorming, and contemplative domains.

Judge	Provider	Brie Wins	Baseline Wins	Sample Size	Win Rate
Claude 3.5 Sonnet	Anthropic	40	2	42	95.2%
Claude Opus 4	Anthropic	12	3	15	80.0%
Claude (Sonnet 3.5 + Opus 4 pooled)	Anthropic	52	5	57	91.2%
GPT-4o	OpenAI	53	4	57	93.0%
Gemini 2.5 Flash Lite	Google	54	3	57	94.7%
Claude Haiku 4.5 (2026 re-judge)	Anthropic	46	11	57	80.7%

Multi-lab consensus: All four 2025-generation judges from three independent laboratories show strong preference for Brie (80-95%). The Anthropic pool of Sonnet 3.5 + Opus 4 covers all 57 prompts across disjoint subsets at 91.2% (52/57). Cross-lab pairwise agreement (GPT-4o ↔ Gemini 2.5 Flash Lite): 91.2% (52/57 pairs).

Out-of-Domain Performance (Coding/Math/Practical)

15 prompts testing generalization beyond philosophy/creative domains.

Judge	Brie Wins	Baseline Wins	Win Rate
Claude Opus 4	7	8	47%
Claude Haiku 4.5	9	6	60%

Key Finding: No catastrophic forgetting. Brie maintains parity (47-60%) on out-of-domain tasks while achieving 80-95% on in-domain tasks.

Architecture Comparison

The same dataset of 1,213 original examples authored by the researcher was fine-tuned across multiple base architectures to study how model design affects philosophical reasoning capabilities:

Base Architecture	Win Rate vs Baseline	Judges	Sample Size
Qwen 2.5 3B (this model)	80-95%	4 judges (3 labs), 2025	n=57
Llama 3.2 3B	80.7%	3 judges (3 labs), 2025	n=57
Qwen 2.5 0.5B	77.0%	3 judges (3 labs), 2025	n=57
Qwen3 0.6B	~30%	2 judges	n=57

Llama 3.2 3B 2025 breakdown — Claude pool: 75.4%, GPT-4o: 82.5%, Gemini: 84.2% (mean 80.7%).

Research findings:

Qwen 2.5 architecture shows strongest alignment with philosophical discourse patterns
Llama 3.2 maintains strong performance (75-84% depending on judge)
Model size matters: sub-1B models struggle with contemplative reasoning patterns
Different judges show varying sensitivity to stylistic differences (80-95% range across 2025 judges)

Performance by Domain

Domain	Brie Wins	Total	Win Rate	Notes
Brainstorming	9	10	90.0%	Best overall performance
Reproducibility Run 2	5	5	100.0%	Perfect consistency
Reproducibility Run 3	4	5	80.0%	Strong consistency
Temperature Tests	6	6	100.0%	Robust across 0.5/1.0
Token Length Tests	6	6	100.0%	Robust across 256/512/1024
Expanded Creative	5	5	100.0%	Dominates creative tasks
Philosophy Domain	7	10	70.0%	Solid in-domain
Contemplative	6	10	60.0%	Good meditative writing

Strongest vs Weakest Domains

Strongest: Philosophy and creative writing are Brie's specialty - achieving 80-95% win rates on in-domain tasks across the 2025-generation judge panel (Sonnet 3.5, Opus 4, GPT-4o, Gemini 2.5 Flash Lite). Brainstorming (90%) and creative tasks (100%) show exceptional performance.

Weakest: Out-of-domain tasks (coding, math, practical problems) show expected trade-off at 47-60% win rate - demonstrating that while Brie maintains competence outside its specialty, it's optimized for philosophical and creative domains.

Use Cases

Intended applications:

Philosophical Analysis

Continental philosophy (phenomenology, existentialism, critical theory)
Conceptual analysis and argumentation
Theoretical reframing of questions

Creative Development

Speculative and experimental thinking
Conceptual work for artists and writers
Novel perspective generation

Writing

Contemplative prose
Cultural criticism
Theoretical brainstorming

Technical Details

LoRA Configuration

LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)

Training Configuration

SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    effective_batch_size=8,
    learning_rate=2e-4,
    lr_scheduler_type='linear',
    warmup_steps=20,
    max_length=2048,
    bf16=True,
)

Total Training Steps: 290 Hardware: NVIDIA RTX 5090 (32GB VRAM) Training Platform: RunPod Training Cost: ~$3 on cloud GPUs

Comparison with Brie v2 0.5B

Metric	Brie v2 0.5B	Brie v2 3B	Improvement
In-Domain Win Rate	77.0%	80-95%	+3 to +18%
Out-of-Domain Win Rate	40%	47-60%	+7 to +20%
Model Size	618M params	3B params	4.9x larger
Adapter Size	4.1MB	14MB	3.4x larger

Scaling the base model from 0.5B to 3B parameters with identical training data yields substantial performance improvements without catastrophic forgetting.

Usage

Loading the Model

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    "closestfriend/brie-v2-3b",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-v2-3b")

# Generate response
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the concept of 'being-in-the-world' from phenomenology."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
    top_p=0.95,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Recommended Generation Parameters

Temperature: 0.75 (tested and validated)
Top-p: 0.95
Max tokens: 512-1024 depending on task
Do sample: True (for creative/philosophical tasks)

Limitations

Domain Specialization: Optimized for philosophy and creative writing. Out-of-domain performance is at parity (50-60%), not improved.
Training Data Scope: 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs - demonstrating a reproducible approach for domain-specific fine-tuning.
Size Constraints: While 3B is significantly better than 0.5B, it's still a relatively small model. Qwen 2.5 7B would likely show further improvements.
Language: Primarily trained and evaluated on English content.
Judge Variance: Different judges show up to ~15 point variance on overlapping prompts (Claude Opus 4: 80.0% on its n=15 subset vs Claude 3.5 Sonnet: 95.2% on its n=42 subset). This reflects different evaluation criteria emphasis across models.
LLM-judge stability: An independent re-judging pass of the same model outputs by the same Claude judges three days later returned a pooled Anthropic rate of 78.9% (vs 91.2% on the first pass). This suggests LLM-judge verdicts under randomized A/B order have non-trivial variance even with identical inputs — a caveat for any single-pass judge evaluation, including this one.

Evaluation Methodology

Blind A/B testing with randomized presentation order to control for position bias. Four independent 2025-generation LLM judges across three labs (Anthropic — Sonnet 3.5 + Opus 4 pooled, OpenAI — GPT-4o, Google — Gemini 2.5 Flash Lite), with a 2026 Claude Haiku 4.5 re-judge for temporal robustness.

Complete evaluation methodology and results available in the training repository.

Note on Validation

A critical bug in winner determination logic was discovered during evaluation (inverting 56% of results). All reported metrics reflect corrected data. Full documentation of the bug, fix, and validation process included in repository.

Training Data

The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. This LLM-assisted data authoring methodology achieved 72-97% win rates across different judge generations, demonstrating a reproducible approach for domain-specific fine-tuning.

Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.

The dataset covers:

Continental philosophy discussions (phenomenology, existentialism, ontology)
Speculative and experimental reasoning
Philosophical argumentation and conceptual analysis
Contemplative and reflective prose

Research methodology: This same dataset was used across the following architectures to enable controlled comparison: Qwen 2.5 3B, Llama 3.2 3B, Qwen3 0.6B, and Qwen 2.5 0.5B. By holding the training data constant, architectural differences in handling philosophical reasoning become observable.

Multi-Response Sampling Methodology

A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).

Why This Matters:

The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
Teaches multiple valid reasoning paths and stylistic variations within domain constraints
Explains strong generalization despite relatively few unique prompts
Provides robustness: model learns what makes a response valid, not just one "correct" answer

This multi-response approach is critical to understanding why 1,213 examples achieve 72-97% win rates—the model learns patterns and variance, not memorization.

Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.

Training Notes

Full 2-epoch training essential for convergence on small datasets
Scaling base model (0.5B → 3B) with identical data yields substantial improvements
Controlled comparison: identical data across all architectures reveals architectural differences
Multi-judge evaluation (4 judges, 3 labs in 2025; +1 in 2026) provides robust cross-validation
Blind testing critical: evaluation bugs can invert results entirely

Future Directions

Qwen 2.5 7B training
DPO training using extracted preference pairs (48 pairs available)
Extended human evaluation complement to LLM judges

Citation

If you use this model in your research or applications, please cite:

@misc{brie-v2-3b,
  author = {closestfriend},
  title = {Brie v2 3B: Architecture Comparison Study for Philosophical Discourse Fine-Tuning},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-v2-3b}},
}

Acknowledgments

Base Model: Qwen Team for Qwen 2.5 3B Instruct
Evaluation Judges: Claude 3.5 Sonnet, Claude Opus 4, Claude Haiku 4.5, GPT-4o, Gemini 2.5 Flash Lite
Training Platform: RunPod for GPU infrastructure
Framework: HuggingFace Transformers, PEFT, TRL

License

Apache 2.0 - Same as base model (Qwen 2.5 3B Instruct)

Model tree for closestfriend/brie-v2-3b

Base model

Qwen/Qwen2.5-3B

Finetuned

Qwen/Qwen2.5-3B-Instruct

Adapter

(1267)

this model

Evaluation results

Win Rate vs Baseline (Claude Opus 4, blind A/B, n=15 subset) on Multi-Domain Comprehensive (57 prompts)
self-reported

80.000
Win Rate vs Baseline (Claude pool — Sonnet 3.5 + Opus 4, blind A/B, n=57) on Multi-Domain Comprehensive (57 prompts)
self-reported

91.200
Win Rate vs Baseline (Claude 3.5 Sonnet, blind A/B, n=42 subset) on Multi-Domain Comprehensive (57 prompts)
self-reported

95.200
Win Rate vs Baseline (GPT-4o, blind A/B, n=57) on Multi-Domain Comprehensive (57 prompts)
self-reported

93.000
Win Rate vs Baseline (Gemini 2.5 Flash Lite, blind A/B, n=57) on Multi-Domain Comprehensive (57 prompts)
self-reported

94.700
Win Rate vs Baseline (Claude Haiku 4.5, blind A/B, n=57) on Multi-Domain Comprehensive (57 prompts)
self-reported

80.700
Win Rate vs Baseline (Claude Opus 4, blind A/B, n=15) on Out-of-Domain (15 prompts - coding, math, practical)
self-reported

47.000
Win Rate vs Baseline (Claude Haiku 4.5, blind A/B, n=15) on Out-of-Domain (15 prompts - coding, math, practical)
self-reported

60.000

closestfriend
/

brie-v2-3b