Part of the Brie Model Family: Flagship model with highest performance. See also: Brie Llama 3.2 3B (cross-architecture validation) | Brie Qwen 2.5 0.5B (foundational research)
Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation
Brie Qwen 2.5 3B
LoRA adapter for Qwen/Qwen2.5-3B-Instruct specializing in continental philosophy, speculative reasoning, and conceptual development for creative work.
Overview
This model is part of a controlled study comparing how different architectures handle fine-tuning on specialized philosophical and creative discourse. It was trained on 1,213 examples authored by the researcher through iterative discussions, with LLMs used as authoring tools, to observe architectural differences in preserving:
- Continental philosophical analysis (phenomenology, existentialism, critical theory)
- Speculative and experimental thinking
- Conceptual reframing for artistic and theoretical work
- Contemplative prose and cultural criticism
Research question: How do different model architectures (Qwen, Llama, etc.) differ in their ability to adopt and maintain patterns of philosophical reasoning and contemplative discourse?
- Base Model: Qwen/Qwen2.5-3B-Instruct
- Training Method: LoRA (Low-Rank Adaptation)
- Training Data: 1,213 original examples authored by the researcher
- Training Duration: 2 epochs (290 steps, ~1-2 hours on NVIDIA RTX 5090)
- Adapter Size: ~14MB
- License: Apache 2.0
Evaluation Results
Blind A/B testing comparing Brie against baseline Qwen 2.5 3B Instruct. Presentation order randomized to control for position bias. Evaluated by five independent judges from three laboratories (Anthropic, OpenAI, Google).
In-Domain Performance (Philosophy/Creative)
57 comparisons across philosophy, creative writing, brainstorming, and contemplative domains.
| Judge | Provider | Brie Wins | Baseline Wins | Sample Size | Win Rate |
|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 45 | 12 | 57 | 78.9% |
| Claude 3.5 Sonnet | Anthropic | 40 | 2 | 42 | 95.2% |
| GPT-4o | OpenAI | 53 | 4 | 57 | 93.0% |
| Gemini 2.5 Flash Lite | Google | 54 | 3 | 57 | 94.7% |
| Claude Haiku 4.5 | Anthropic | 46 | 11 | 57 | 80.7% |
Multi-lab consensus: All five judges from three independent laboratories show strong preference for Brie (79-95%). Cross-lab pairwise agreement: 91.2% (GPT-4o ↔ Gemini).
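For reference, a minimal sketch of how win rates and cross-judge pairwise agreement can be computed from per-prompt verdicts. This is illustrative only; the variable layout (`verdicts` mapping prompt IDs to `"brie"` or `"baseline"`) is an assumption, and the actual evaluation harness lives in the training repository.

```python
def win_rate(verdicts: dict) -> float:
    """Fraction of prompts where the judge preferred Brie."""
    return sum(v == "brie" for v in verdicts.values()) / len(verdicts)

def pairwise_agreement(a: dict, b: dict) -> float:
    """Fraction of shared prompts where two judges picked the same winner."""
    shared = a.keys() & b.keys()
    return sum(a[p] == b[p] for p in shared) / len(shared)
```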
Out-of-Domain Performance (Coding/Math/Practical)
15 prompts testing generalization beyond philosophy/creative domains.
| Judge | Brie Wins | Baseline Wins | Win Rate |
|---|---|---|---|
| Claude Opus 4 | 7 | 8 | 47% |
| Claude Haiku 4.5 | 9 | 6 | 60% |
Key Finding: No catastrophic forgetting. Brie maintains parity (47-60%) on out-of-domain tasks while achieving 79-95% on in-domain tasks.
Architecture Comparison
The same dataset of 1,213 original examples authored by the researcher was fine-tuned across multiple base architectures to study how model design affects philosophical reasoning capabilities:
| Base Architecture | Win Rate vs Baseline | Judges | Sample Size |
|---|---|---|---|
| Qwen 2.5 3B (this model) | 79-95% | 5 judges (3 labs) | n=57 |
| Llama 3.2 3B | 80.4%* | 4 judges (3 labs) | n=57 |
| Qwen 2.5 0.5B | 71.9% | 4 judges (3 labs) | n=57 |
| Qwen3 0.6B | ~30% | 2 judges | n=57 |
*Average across all judges. Claude judges: 75.4%, GPT-4o: 82.5%, Gemini: 84.2%
Research findings:
- Qwen 2.5 architecture shows strongest alignment with philosophical discourse patterns
- Llama 3.2 maintains strong performance (75-84% depending on judge)
- Model size matters: sub-1B models struggle with contemplative reasoning patterns
- Different judges show varying sensitivity to stylistic differences (79-95% range)
Performance by Domain
| Domain | Brie Wins | Total | Win Rate | Notes |
|---|---|---|---|---|
| Brainstorming | 9 | 10 | 90.0% | Best overall performance |
| Reproducibility Run 2 | 5 | 5 | 100.0% | Perfect consistency |
| Reproducibility Run 3 | 4 | 5 | 80.0% | Strong consistency |
| Temperature Tests | 6 | 6 | 100.0% | Robust at temperatures 0.5 and 1.0 |
| Token Length Tests | 6 | 6 | 100.0% | Robust at 256/512/1024 max tokens |
| Expanded Creative | 5 | 5 | 100.0% | Dominates creative tasks |
| Philosophy Domain | 7 | 10 | 70.0% | Solid in-domain |
| Contemplative | 6 | 10 | 60.0% | Good meditative writing |
Strongest vs Weakest Domains
Strongest: Philosophy and creative writing are Brie's specialty, with 79-95% win rates on in-domain tasks across five judge models. Brainstorming (90%) and expanded creative tasks (100%) stand out.
Weakest: Out-of-domain tasks (coding, math, practical problems) show the expected trade-off at a 47-60% win rate: Brie stays competent outside its specialty but is optimized for philosophical and creative domains.
Use Cases
Intended applications:
Philosophical Analysis
- Continental philosophy (phenomenology, existentialism, critical theory)
- Conceptual analysis and argumentation
- Theoretical reframing of questions
Creative Development
- Speculative and experimental thinking
- Conceptual work for artists and writers
- Novel perspective generation
Writing
- Contemplative prose
- Cultural criticism
- Theoretical brainstorming
Technical Details
LoRA Configuration
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,             # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor (alpha / r = 2.0)
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
```
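A minimal sketch of attaching this config to the base model before training. The card does not specify `target_modules`, so PEFT's architecture defaults for Qwen2 are assumed here:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model

# Attach the LoRA adapter to the (otherwise frozen) base model.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # only the low-rank matrices train
```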
Training Configuration
```python
from trl import SFTConfig

sft_config = SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 * 4 = 8
    learning_rate=2e-4,
    lr_scheduler_type='linear',
    warmup_steps=20,
    max_length=2048,
    bf16=True,
)
```
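A sketch of how these pieces might be wired together with TRL's `SFTTrainer`. The dataset path and variable names are assumptions, not the actual training script; the 1,213-example dataset is not distributed with this card:

```python
from datasets import load_dataset
from trl import SFTTrainer

# Illustrative data path only.
train_dataset = load_dataset("json", data_files="brie_train.jsonl", split="train")

trainer = SFTTrainer(
    model=peft_model,  # base model with the LoRA adapter attached (see above)
    args=sft_config,
    train_dataset=train_dataset,
)
trainer.train()
```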
- Total Training Steps: 290
- Hardware: NVIDIA RTX 5090 (32GB VRAM)
- Training Platform: RunPod
- Training Cost: ~$3 on cloud GPUs
Comparison with Brie v2 0.5B
| Metric | Brie v2 0.5B | Brie v2 3B | Improvement |
|---|---|---|---|
| In-Domain Win Rate | 77% | 79-95% | +2 to +18 pts |
| Out-of-Domain Win Rate | 40% | 47-60% | +7 to +20 pts |
| Model Size | 618M params | 3B params | 4.9x larger |
| Adapter Size | 4.1MB | 14MB | 3.4x larger |
Scaling the base model from 0.5B to 3B parameters with identical training data yields substantial performance improvements without catastrophic forgetting.
Usage
Loading the Model
```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load the adapter together with its base model
model = AutoPeftModelForCausalLM.from_pretrained(
    "closestfriend/brie-v2-3b",
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("closestfriend/brie-v2-3b")

# Build a chat-formatted prompt
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the concept of 'being-in-the-world' from phenomenology."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.75,
    do_sample=True,
    top_p=0.95,
)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
Recommended Generation Parameters
- Temperature: 0.75 (tested and validated)
- Top-p: 0.95
- Max tokens: 512-1024 depending on task
- Do sample: True (for creative/philosophical tasks)
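If you prefer to manage the base model yourself (for example, to share one base across several adapters), the adapter can also be attached explicitly with PEFT's `PeftModel`; a minimal sketch, assuming the standard adapter layout:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model, then layer the adapter weights on top.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base, "closestfriend/brie-v2-3b")
model = model.merge_and_unload()  # optional: fold the LoRA weights into the base
```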
Limitations
Domain Specialization: Optimized for philosophy and creative writing. Out-of-domain performance is at parity (47-60%), not improved.
Training Data Scope: 1,213 examples authored by the researcher, drawn from years of philosophical discussions with LLMs - demonstrating a reproducible approach for domain-specific fine-tuning.
Size Constraints: While 3B is significantly better than 0.5B, it's still a relatively small model. Qwen 2.5 7B would likely show further improvements.
Language: Primarily trained and evaluated on English content.
Judge Variance: Judges differ by up to 16 percentage points (Claude Opus 4: 79% vs Claude 3.5 Sonnet: 95%), reflecting differences in which evaluation criteria each judge model emphasizes.
Evaluation Methodology
Blind A/B testing with randomized presentation order to control for position bias. Five independent LLM judges across three labs (Anthropic, OpenAI, Google).
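An illustrative sketch of a single blind trial with randomized presentation order. The function and parameter names (`blind_ab_trial`, `judge`) are hypothetical, not the actual harness; the judge sees only anonymized responses A and B:

```python
import random

def blind_ab_trial(prompt, brie_reply, baseline_reply, judge):
    pair = [("brie", brie_reply), ("baseline", baseline_reply)]
    random.shuffle(pair)  # randomize A/B order to control for position bias
    verdict = judge(prompt, response_a=pair[0][1], response_b=pair[1][1])  # "A" or "B"
    return pair[0][0] if verdict == "A" else pair[1][0]
```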
Complete evaluation methodology and results available in the training repository.
Note on Validation
A critical bug in winner determination logic was discovered during evaluation (inverting 56% of results). All reported metrics reflect corrected data. Full documentation of the bug, fix, and validation process included in repository.
Training Data
The model was trained on 1,213 examples authored by the researcher through iterative discussions, using LLMs as authoring tools. This LLM-assisted data authoring methodology achieved 72-97% win rates across different judge generations, demonstrating a reproducible approach for domain-specific fine-tuning.
Data Authoring Process: Training data was authored using Claude (Anthropic), ChatGPT (OpenAI), Mistral, and Kimi as discussion partners. Notably, no training data was generated using Qwen or Llama models to prevent potential data contamination in fine-tuning experiments.
The dataset covers:
- Continental philosophy discussions (phenomenology, existentialism, ontology)
- Speculative and experimental reasoning
- Philosophical argumentation and conceptual analysis
- Contemplative and reflective prose
Research methodology: This same dataset was used across the following architectures to enable controlled comparison: Qwen 2.5 3B, Llama 3.2 3B, Qwen3 0.6B, and Qwen 2.5 0.5B. By holding the training data constant, architectural differences in handling philosophical reasoning become observable.
Multi-Response Sampling Methodology
A key methodological innovation: rather than single responses per prompt, the training data contains 202 unique prompts with multiple high-quality responses per prompt (averaging ~6 responses each, totaling 1,213 examples).
Why This Matters:
- The model learns the distribution of valid responses rather than memorizing fixed prompt-response pairs
- Teaches multiple valid reasoning paths and stylistic variations within domain constraints
- Explains strong generalization despite relatively few unique prompts
- Provides robustness: model learns what makes a response valid, not just one "correct" answer
This multi-response approach is critical to understanding why 1,213 examples achieve 72-97% win rates—the model learns patterns and variance, not memorization.
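A hypothetical illustration of this structure: roughly 202 unique prompts, each paired with several distinct valid responses, flattened into 1,213 training rows. The prompt and response texts below are invented placeholders:

```python
prompts = {
    "What does Heidegger mean by 'being-in-the-world'?": [
        "Response emphasizing pre-reflective involvement...",
        "Response foregrounding the critique of subject/object dualism...",
        # ... ~6 responses per prompt on average
    ],
    # ... ~202 unique prompts
}

# Flatten to one (prompt, response) training example per pair: ~1,213 rows.
examples = [
    {"prompt": p, "response": r}
    for p, responses in prompts.items()
    for r in responses
]
```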
Reference: This approach aligns with cognitive grounding principles (causal, compositional, revisable reasoning) discussed in the paper.
Training Notes
- Full 2-epoch training essential for convergence on small datasets
- Scaling base model (0.5B → 3B) with identical data yields substantial improvements
- Controlled comparison: identical data across all architectures reveals architectural differences
- Multi-judge evaluation (5 judges, 3 labs) provides robust cross-validation
- Blind testing critical: evaluation bugs can invert results entirely
Future Directions
- Qwen 2.5 7B training
- DPO training using extracted preference pairs (48 pairs available)
- Extended human evaluation complement to LLM judges
Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{brie-v2-3b,
  author = {closestfriend},
  title = {Brie v2 3B: Architecture Comparison Study for Philosophical Discourse Fine-Tuning},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/closestfriend/brie-v2-3b}},
}
```
Acknowledgments
- Base Model: Qwen Team for Qwen 2.5 3B Instruct
- Evaluation Judges: Claude 3.5 Sonnet, Claude Opus 4, Claude Haiku 4.5, GPT-4o, Gemini 2.5 Flash Lite
- Training Platform: RunPod for GPU infrastructure
- Framework: HuggingFace Transformers, PEFT, TRL
License
Apache 2.0 - Same as base model (Qwen 2.5 3B Instruct)
Links
- Paper: Karman, H. (2025). "Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation." DOI: 10.5281/zenodo.17657737
- Training Repository: https://github.com/closestfriend/efficient-domain-adaptation
- Evaluation Results: `EVALUATION_BRIE_3B.md` in repository
- Brie Bench Framework: `EVALUATION_FRAMEWORK.md` in repository
- Brie Llama 3.2 3B: https://huggingface.co/closestfriend/brie-llama-3b (cross-architecture validation, 60% out-of-domain)
- Brie v2 0.5B: https://huggingface.co/closestfriend/brie-qwen2.5-0.5b (foundational research)
Training: October 16, 2025
Evaluation: October 2025 - January 2026
License: Apache 2.0