ginigen-ai
AI & ML interests

None yet

Recent Activity

reacted to SeaWolf-AI's post with ❤️ about 21 hours ago

ALL Bench Leaderboard: Structural Problems in AI Benchmarking and the Case for Unified Evaluation
https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard

The AI benchmark ecosystem has three structural problems. Major benchmarks like MMLU have surpassed 90%, losing discriminative power. Most leaderboards publish unverified self-reported scores: our cross-verification found Claude Opus 4.6's ARC-AGI-2 score listed as 37.6% (actual: 68.8%) and Gemini 3.1 Pro's as 88.1% (actual: 77.1%). OpenAI's own audit confirmed that 59.4% of SWE-bench Verified tasks are defective, yet the benchmark remains widely used.

ALL Bench addresses this by comparing 91 models across 6 modalities (LLM · VLM · Agent · Image · Video · Music) with 3-tier confidence badges (✓✓ cross-verified · ✓ single-source · ~ self-reported). Composite scoring uses a 5-Axis Framework and replaces SWE-bench Verified with the contamination-resistant LiveCodeBench.

Key finding: metacognition is the largest blind spot. FINAL Bench shows Error Recovery explains 94.8% of self-correction variance, yet only 9 of 42 models are even measured on it. The 9.2-point spread (Kimi K2.5: 68.71 → rank 9: 59.5) is 3× the GPQA top-model spread, suggesting metacognition may be the single biggest differentiator among frontier models today. VLM cross-verification revealed rank reversals: Claude Opus 4.6 leads MMMU-Pro (85.1%) while Gemini 3 Flash leads MMMU (87.6%), producing contradictory rankings between the two benchmarks.

📊 Article: https://huggingface.co/blog/FINAL-Bench/all-bench
📦 Dataset: https://huggingface.co/datasets/FINAL-Bench/ALL-Bench-Leaderboard
⚡ GitHub: https://github.com/final-bench/ALL-Bench-Leaderboard
🏆 Leaderboard: https://huggingface.co/spaces/FINAL-Bench/all-bench-leaderboard
🧬 FINAL Bench: https://huggingface.co/datasets/FINAL-Bench/Metacognitive
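A minimal sketch of how the 3-tier badges described in the post could be assigned programmatically; the function name and tiering rule are illustrative assumptions, not ALL Bench's actual implementation:

# Hypothetical illustration of the ✓✓ / ✓ / ~ confidence badges.
# The rule below (two or more independent evaluations = cross-verified)
# is an assumption inferred from the badge descriptions in the post.
def confidence_badge(n_independent_sources: int) -> str:
    """Map a score's verification status to an ALL Bench-style badge."""
    if n_independent_sources >= 2:
        return "✓✓"  # cross-verified by two or more independent runs
    if n_independent_sources == 1:
        return "✓"   # verified by a single independent source
    return "~"       # self-reported only, no independent verification

print(confidence_badge(2))  # ✓✓
print(confidence_badge(1))  # ✓
print(confidence_badge(0))  # ~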
reacted to SeaWolf-AI's post with 🤗 about 21 hours ago

🚀 Introducing MARL: Runtime Middleware That Reduces LLM Hallucination Without Fine-Tuning
Now available on PyPI · GitHub · ClawHub · HuggingFace

AI models sense they could be wrong, but they can't actually fix what's broken.
🤗 Live A/B test: https://huggingface.co/spaces/VIDraft/MARL

We evaluated 9 SOTA models (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, etc.) across 1,800 assessments in FINAL Bench and found a 39.2 percentage-point gap between "recognizing potential errors" (MA = 0.694) and "actually finding and fixing them" (ER = 0.302). MARL (Model-Agnostic Runtime Middleware for LLMs) was built to close this metacognitive gap. It decomposes a single LLM call into a 5-stage expert pipeline (Hypothesis → Solver → Auditor → Adversarial Verifier → Synthesizer), transforming "answer in one shot" into "think, doubt, correct, and rewrite."

No weight modification is needed: MARL works instantly with GPT-5.4, Claude, Gemini, Llama, or any OpenAI API-compatible LLM by changing one line, base_url, as shown in the sketch below. It ships with 9 domain-specific emergence engines (invention, pharma, genomics, chemistry, ecology, law, and more; 5,538 expert data items), activated by a simple tag such as model="gpt-5.4::pharma".

pip install marl-middleware

MARL is also officially registered on ClawHub, the skill marketplace of OpenClaw, an AI agent platform with 260K+ developers and 3,200+ skills. It is the first middleware in the Reasoning Enhancement category. One command, clawhub install marl-middleware, gives your AI agent a metacognition upgrade.

📝 Technical deep dive: https://huggingface.co/blog/FINAL-Bench/marl-middleware
📦 PyPI: https://pypi.org/project/marl-middleware/
🐙 GitHub: https://github.com/Vidraft/MARL
🦀 ClawHub: https://clawhub.ai/Cutechicken99/marl-middleware

#MARL #LLM #Hallucination #Metacognition #MultiAgent #AIMiddleware #FINALBench #OpenClaw #ClawHub #PyPI #AGI #HuggingFace #ReasoningAI #SelfCorrection #GlassBoxAI
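A minimal sketch of the one-line base_url swap, assuming a MARL proxy is running locally; the endpoint URL, port, and API-key handling are illustrative assumptions, not MARL's documented defaults:

from openai import OpenAI

# Point a standard OpenAI client at the MARL middleware instead of the
# upstream API. Only base_url changes; the URL below is a hypothetical
# local proxy address, not MARL's documented endpoint.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed MARL proxy address
    api_key="YOUR_API_KEY",
)

# Per the post, the "::pharma" suffix activates the pharma emergence engine.
response = client.chat.completions.create(
    model="gpt-5.4::pharma",
    messages=[{"role": "user", "content": "Which enzymes metabolize warfarin?"}],
)
print(response.choices[0].message.content)

Every other line is a stock OpenAI SDK call, which is the point of the drop-in design: the 5-stage pipeline runs behind the proxy, invisible to the caller.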

Organizations

None yet