🏟️ Smol AI WorldCup: A 5-Axis Benchmark That Reveals What Small Language Models Can Really Do 5 days ago • 37
Build an Agent That Thinks Like a Data Scientist: How We Hit #1 on DABStep with Reusable Tool Generation 2 days ago • 9
Structural Problems in AI Benchmarking and the Case for a Unified Evaluation Framework 7 days ago • 12
Code Concepts: A Large-Scale Synthetic Dataset Generated from Programming Concept Seeds 4 days ago • 4