| title: MiniBARD | |
| emoji: 📉 | |
| colorFrom: red | |
| colorTo: yellow | |
| sdk: static | |
| pinned: false | |
| # MiniBARD | |
| **Benchmark for Aesthetics, Roleplay, and Depth** | |
| > That is a fantastic title. **B.A.R.D. (Benchmark for Aesthetics, Roleplay, & Depth)** feels prestigious and perfectly encapsulates the "vibe" of the models you are testing. | |
| > | |
| > * **Aesthetics:** Captures the Creative Writing and the "beauty" of the prose. | |
| > * **Roleplay:** Directly addresses the RP-Bench and character immersion. | |
| > * **Depth:** Covers the Reasoning and EQ—the model's ability to understand complex subtext and provide nuanced, multi-layered responses. | |
| > | |
| > This project establishes a high-precision, local benchmarking pipeline designed to compare the creative writing and roleplay capabilities of two 12B Large Language Models (Qliphoth v1a and v1b) without relying on external API keys. By using a RunPod 3090 GPU server and 500GB of local storage, the system bypasses massive, generic benchmarks like MMLU in favor of a targeted "Council of Judges" approach. This method uses five diverse, high-parameter "judge" models (including Gemma 3, Nemotron, and Cydonia) to evaluate model outputs across 80 complex prompts. To ensure scientific integrity, the pipeline implements a "blind" pairwise comparison that shuffles the order of responses to eliminate first-entry bias and uses "abliterated" judges to prevent moralizing or refusals from skewing the creative scores. The final result is a majority-rule verdict that provides a definitive win rate, offering a clear, compute-efficient data point on which model version produces superior prose and instruction-following. | |
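The blind pairwise comparison and majority-rule verdict described above can be sketched roughly as follows. This is a minimal illustration, not the actual BARD code: the `Judge` signature, `blind_verdict`, and `win_rate` are hypothetical names, and a real judge would be an LLM call rather than a plain function.

```python
import random
from collections import Counter
from typing import Callable, List

# Hypothetical judge signature (an assumption for this sketch): takes
# (prompt, first_response, second_response) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def blind_verdict(prompt: str, resp_a: str, resp_b: str,
                  judges: List[Judge], rng: random.Random) -> str:
    """Return "A", "B", or "tie" by majority vote over order-shuffled judgments."""
    votes = Counter()
    for judge in judges:
        # Randomize presentation order per judge so neither model is always first.
        swapped = rng.random() < 0.5
        first, second = (resp_b, resp_a) if swapped else (resp_a, resp_b)
        pick = judge(prompt, first, second)
        # Map the positional pick back to the underlying model.
        if pick == "first":
            winner = "B" if swapped else "A"
        else:
            winner = "A" if swapped else "B"
        votes[winner] += 1
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"


def win_rate(verdicts: List[str], model: str = "A") -> float:
    """Fraction of non-tied prompts won by `model`."""
    decided = [v for v in verdicts if v != "tie"]
    return sum(v == model for v in decided) / len(decided) if decided else 0.0
```

With an odd number of judges, as in the five-judge council described above, a tie cannot occur as long as every judge returns a pick; the `"tie"` branch only matters if the council size is even.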
| The full BARD suite evaluates LLM reasoning, emotional intelligence, creative writing, and roleplay. It also removes all non-English prompts. | |
| The BARD tool (designed for RunPod) lets the user specify either API keys or a locally hosted council of LLM judges; the benchmark's output JSON files are then processed by those judges. | |
| ``` | |
| === PROMPT LOADING COMPLETE === | |
| Total English prompts loaded: 2863 | |
| - mt_bench: 80 prompts | |
| - eq_bench: 1573 prompts | |
| - cw_bench: 419 prompts | |
| - rp_bench: 791 prompts | |
| =============================== | |
| ``` | |
| MiniBARD uses Bernoulli sampling with a fixed seed (`random.seed(420)`) to draw a reproducible, representative slice of the full benchmark: 80 prompts from each component. The slice is intended to track the full BARD score closely while using roughly 11% of the prompts (320 of 2,863), and correspondingly less compute. | |
| ``` | |
| === MINI-B.A.R.D. PROMPT LOADING COMPLETE === | |
| Total representative prompts loaded: 320 | |
| - mt_bench: 80 prompts | |
| - eq_bench: 80 prompts | |
| - cw_bench: 80 prompts | |
| - rp_bench: 80 prompts | |
| ============================================= | |
| ``` | |
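A seeded slice like the one logged above could be produced along these lines. This is a sketch under assumptions, not the project's actual loader: `mini_slice` and its parameters are hypothetical names. Strict Bernoulli sampling (an independent coin flip per prompt) would yield a variable number of prompts, so to reproduce exactly 80 per benchmark this sketch uses fixed-size sampling via `random.sample` instead. It also uses a local `random.Random(420)` rather than the global `random.seed(420)`.

```python
import random
from typing import Dict, List


def mini_slice(prompts_by_bench: Dict[str, List[str]],
               per_bench: int = 80, seed: int = 420) -> Dict[str, List[str]]:
    """Draw a reproducible fixed-size sample from each benchmark."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    sliced = {}
    for name in sorted(prompts_by_bench):  # sorted for a stable draw order
        pool = prompts_by_bench[name]
        # A pool no larger than per_bench (e.g. mt_bench with 80 prompts) is kept whole.
        k = min(per_bench, len(pool))
        sliced[name] = rng.sample(pool, k)
    return sliced


# Placeholder prompts sized like the full-suite log above.
full = {
    "mt_bench": [f"mt-{i}" for i in range(80)],
    "eq_bench": [f"eq-{i}" for i in range(1573)],
    "cw_bench": [f"cw-{i}" for i in range(419)],
    "rp_bench": [f"rp-{i}" for i in range(791)],
}
mini = mini_slice(full)
print({name: len(items) for name, items in mini.items()})
```

Because the seed is fixed, repeated runs select the same prompts, which keeps scores comparable across model versions.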
| BARD is a composite of the following benchmarks: | |
| - [MT](https://github.com/lm-sys/FastChat) | |
| - [EQ](https://github.com/EQ-bench/EQ-Bench) | |
| - [CW](https://github.com/EQ-bench/creative-writing-bench) | |
| - [RP](https://github.com/LeviTheWeasel/rp-benchmark) | |
| Thanks to [@sam-paech](https://huggingface.co/sam-paech) for releasing the EQ-Bench suite. | |
| ## Examples | |
| - [MT Only](https://cdn-uploads.huggingface.co/production/uploads/68e840caa318194c44ec2a04/udXeUI5b8QyYGKkSEx_es.png) | |
| - [Dataset](https://huggingface.co/datasets/Naphula-Archives/qliphoth_12B_minibard_bench) | |
| - [Example 3: Contains MT, EQ, and RP (no CW)](https://cdn-uploads.huggingface.co/production/uploads/68e840caa318194c44ec2a04/NtsvJTHaEdvmRuaBSXHSn.png) | |
| ## Current Judge Models | |
| - Cydonia 4.3 24B | |
| - Mag Mell 12B | |
| - Nemotron 8B Ablit | |
| - <s>Gemma 3 27B Ablit</s> | |
| ## Future Features Planned | |
| - Slop tests | |
| - Censorship tests | |
| - [Q0 Benchmark](https://huggingface.co/Naphula/Q0_Bench) scores | |
| - Possible representative slicing of MMLU, HellaSwag, etc. |