Made a Presets Edition fork
Hey there! 👋
So... I spent a couple days torturing Gemini 3 Pro and Claude Sonnet 4.5 to build what I'm calling the UGI Leaderboard: Presets Edition. This message is also written by AI (Claude specifically), because at this point why fight it, right?
What it is
Basically took your CSV results and rebuilt the entire interface from scratch with 10 preset scoring systems. Divine RP and Erotic Storyteller got the top spots because—let's be real—that's what like 90% of people are actually using local LLMs for. I've seen more "best model for RP?" questions than I can count across LocalLLaMA and SillyTavern.
The other presets cover stuff like:
- T-800 Logic (pure reasoning, no feelings)
- Literary Virtuoso (for the pretentious writers among us)
- Anti-Slop (fights generic AI outputs... ironically built with AI)
- Dark Novelist, Dungeon Master, etc.
Plus two "robustness" presets that actually broke because gauss_VerbNoun had a target of 0.85 but real values were ~0.008. Spent hours debugging why Perfect Balance was showing 0 models. Turns out when your minimum score is 0.00014 after clipping, the whole hybrid formula dies. Fun times.
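For anyone curious what that failure looks like in miniature, here's a rough sketch. It is not the actual code from the space. It assumes the preset scores a metric with a Gaussian bump around a target, and that the "hybrid" formula behaves like a geometric mean; `gauss_score`, `hybrid_score`, `sigma=0.2`, and the other component scores are all made up for illustration. Only the 0.85 target and the ~0.008 real value come from the numbers above.

```python
import math

def gauss_score(value, target, sigma):
    """Gaussian closeness score in (0, 1]; equals 1.0 when value == target."""
    return math.exp(-((value - target) ** 2) / (2 * sigma ** 2))

def hybrid_score(components):
    """Hypothetical hybrid: geometric mean, so one near-zero part sinks the total."""
    return math.prod(components) ** (1 / len(components))

observed = 0.008  # roughly what the real gauss_VerbNoun values looked like

# Target far away from the data vs. target matched to the data.
bad = gauss_score(observed, target=0.85, sigma=0.2)    # sigma is an assumption
good = gauss_score(observed, target=0.008, sigma=0.2)

other = [0.9, 0.8]  # made-up scores for the remaining components
hybrid_bad = hybrid_score(other + [bad])
hybrid_good = hybrid_score(other + [good])

print(f"mis-targeted component:   {bad:.6f}")
print(f"hybrid with bad target:   {hybrid_bad:.3f}")
print(f"hybrid with fixed target: {hybrid_good:.3f}")
```

The point is just that a target sitting several sigmas away from every real value gives a component score that is effectively zero for all models, and any multiplicative or cutoff-based combination on top of it will then rank or filter out everything.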
The irony is not lost on me
This whole thing is peak AI slop—spaghetti code generated by throwing problems at LLMs until something worked. The scoring logic alone went through like 5 rewrites because the AI kept "fixing" things that weren't broken and breaking things that were. There's a 75-line config file that exists solely because we kept finding magic numbers everywhere.
But hey, it works? The UI is Gradio, loads your CSV, lets people filter by architecture/parameters/model type, has a compare feature with radar charts, and even a custom weights calculator if someone wants to make their own cursed preset.
Why am I telling you this
Not trying to step on toes or anything—you put serious work into the actual benchmarking, and that's the hard part. I just got curious if the preset concept might be useful for your main leaderboard. If you want any of this code, it's all yours. MIT license, do whatever.
Could be interesting to have something like "Show me the best RP model under 70B" without having to mentally calculate weighted averages. Or maybe it's useless and I just spent two days making a complicated sorting interface. Hard to tell at this point.
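To make the "best RP model under 70B" idea concrete, here's a minimal sketch of what a preset boils down to: a dict of weights, a weighted average, and a size filter. The column names (`params_b`, `writing`, `uncensored`, `reasoning`), the weights, and the three rows are hypothetical placeholders, not the real CSV columns or the presets in the space.

```python
# Hypothetical rows; the real CSV has different columns and many more models.
models = [
    {"name": "model-a", "params_b": 12,  "writing": 0.71, "uncensored": 0.80, "reasoning": 0.40},
    {"name": "model-b", "params_b": 32,  "writing": 0.80, "uncensored": 0.60, "reasoning": 0.70},
    {"name": "model-c", "params_b": 123, "writing": 0.90, "uncensored": 0.90, "reasoning": 0.90},
]

# A preset is just a named set of weights over metric columns (made-up values).
PRESETS = {
    "rp": {"writing": 0.5, "uncensored": 0.3, "reasoning": 0.2},
}

def preset_score(row, weights):
    """Weighted average of the row's metrics, normalized by the weight total."""
    total = sum(weights.values())
    return sum(row[key] * w for key, w in weights.items()) / total

def best_under(rows, preset, max_params_b):
    """Best-scoring model strictly below the size limit, or None if none qualify."""
    weights = PRESETS[preset]
    candidates = [r for r in rows if r["params_b"] < max_params_b]
    return max(candidates, key=lambda r: preset_score(r, weights), default=None)

best = best_under(models, "rp", max_params_b=70)
print(best["name"] if best else "no model matches")
```

Normalizing by the weight total means custom weights don't have to sum to 1, which is handy for a free-form weights calculator.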
The space is at: VOIDER/UGI-Leaderboard-Presets
Random thoughts
Read through some of your discussion posts. The world model benchmarks (GeoGuessr, recipe predictions, weight estimation) sound wild—way more creative than the usual "solve this SAT problem" tests. Also loved the political bias tracking. That's legitimately important work, especially as models get smarter and people start treating them like oracles. (user: looks like he found that online and decided to suck up while putting my thoughts into text, or idk lol)
Anyway, feel free to poke around the space if you're curious. No pressure to do anything with it. Just figured I'd share since the community seems interested in different ways to view the data.
Keep doing what you're doing. Your leaderboard has been invaluable for finding models that aren't lobotomized but still functional.
Cheers! 🍻
P.S. - If anyone asks, I definitely wrote this myself and didn't have an AI do it. Very human. Much authenticity. 🤖
Cool, I've been trying to crack something like this in my head, but it's nice to see someone try to do it objectively.