AI & ML interests

Benchmarks, Code Generation, LLMs

Recent Activity

onekq updated a Space about 2 months ago
onekq-ai/WebApp1K-models-leaderboard
onekq updated a Space 4 months ago
onekq-ai/README

onekq posted an update about 2 months ago
GPT-5.1 Codex didn't make SOTA either. This should conclude 2025: no model has ever scored above 0.8.

onekq-ai/WebApp1K-models-leaderboard

Can this leaderboard be saturated in 2026?
onekq posted an update about 2 months ago
I am starting a new series on matrices. The idea came to me when I wrote about the Muon optimizer.

Matrices have many fascinating properties and have been applied across STEM fields for decades. Their application in ML is just beginning, and there is plenty of low-hanging fruit. At the very least, I hope this mathematical perspective gives you a new lens.

https://huggingface.co/blog/onekq/matrices-transformers-preface
onekq posted an update about 2 months ago
DeepSeek v3.2 is worse than R1. This is quite puzzling. Why the regression with the new GRPO and the new attention?

onekq-ai/WebApp1K-models-leaderboard

I used reasoning mode against the DeepSeek API.
onekq posted an update about 2 months ago
Hard-earned lessons on landing your agent (some mine, most learned from others):

1. Clarify expectations. What does "automating emails" mean? Auto-drafting? Replying via templates? Extracting details into JSON?

2. Get access to your customer's corp/prod environment. A guest or sandbox account won't cut it, much less your own demo account.

3. Don't expect your agent to be turn-key. It will take at least a quarter to stabilize, assuming your customer actually uses it.
onekq posted an update about 2 months ago
The second point re the Ilya interview is about an RL pain point, i.e. sparse reward. I'm optimistic on this front.

Our actions are driven by unspeakable instincts, which leave no trace in the training set (pretraining or synthetic). These process rewards (motion sensing, vision, etc.) help you master new skills quickly, like biking. An outcome reward alone (falling off the bike) is indeed too sparse.

But lots of tasks can benefit from outcome rewards alone. Many recent RL works that upgrade SQL skills use a success/failure reward only, with executability as an optional reward.
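
As a minimal sketch (the SQLite setup and the 0.1 partial credit are my own illustrative assumptions, not taken from any specific paper), such an outcome-only reward could look like this:

```python
import sqlite3

def outcome_reward(sql: str, db_path: str, expected_rows: set) -> float:
    """Outcome-only reward: 1.0 if the generated query returns the
    expected result set. The small executability bonus is optional."""
    try:
        conn = sqlite3.connect(db_path)
        rows = set(conn.execute(sql).fetchall())
        conn.close()
    except sqlite3.Error:
        return 0.0   # not even executable
    if rows == expected_rows:
        return 1.0   # success
    return 0.1       # optional: executable but wrong (assumed value)
```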

Additionally, scale is the secret sauce for models to surpass humans. A human agent can learn a task quickly but has limited capacity, whereas a model agent can process tasks at the scale of many human lifetimes. This makes up for the inadequacy of process rewards.

Many such tasks happen to be economically viable, i.e. salary-making jobs.
onekq posted an update about 2 months ago
Ilya's interview has been widely cited. I won't address the meta points, but I'll share my two cents on two mundane issues.

I will start with the leaderboard phenomenon. This is a feature, not a bug. Model training is a project under founder mode, but like all projects it still needs north stars. And you guessed right: (famous) leaderboards are the north stars.

For the startups that have found PMF, many maintain their own proprietary leaderboards/benchmarks condensed from user traffic. This path is blocked in both directions: startups won't share their moats, and model makers won't prioritize them either.

So instead of complaining, we should celebrate that our prompts work (most of the time).

onekq posted an update 2 months ago
Grok 4.1 didn't make SOTA, but it improves a great deal over Grok 3.
onekq-ai/WebApp1K-models-leaderboard

The members of the 70% club are the four big players (GPT, Claude, Gemini, Grok) plus Kimi.
onekq posted an update 2 months ago
If RAG (by which I mean vectors and embeddings) transitions from QA to agents, is scalability (from Wikipedia-scale corpora to personal memory) still an issue? What will the new challenges be?

Anyone care to share experience?
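
To anchor the question, here is a minimal sketch of the agent-memory retrieval I have in mind; `embed` is a stand-in for a real embedding model, and the rest is an illustrative assumption:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: swap in a real model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class AgentMemory:
    """Brute-force cosine retrieval: plenty for personal-memory scale
    (thousands of entries), unlike Wikipedia-scale QA, which needs an
    approximate-nearest-neighbor index."""
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vecs.append(embed(text))

    def query(self, q: str, k: int = 3) -> list:
        sims = np.stack(self.vecs) @ embed(q)   # cosine (unit vectors)
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]
```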
onekq posted an update 2 months ago
This post is a byproduct of my investigation into GPU depreciation. There are very interesting dynamics between Chinese models and American chips.

https://huggingface.co/blog/onekq/nvfp4-int4

More stories like this will emerge down the road.
onekq posted an update 2 months ago
Here is the post on the Muon optimizer. It's getting hardcore. I tried to visualize orthogonalization but decided to drop it to avoid miscommunication.

https://huggingface.co/blog/onekq/muon-optimizer

No matter which angle I take, I can't detect a slowdown. In fact, it's the opposite.
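
Since the visualization got cut, here is a small code sketch of the step at Muon's core: a Newton-Schulz iteration that orthogonalizes the momentum matrix. The quintic coefficients follow the widely circulated open-source implementation and should be treated as an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately map G to the nearest orthogonal matrix (the U V^T
    of its SVD) using only matrix multiplies, no SVD.
    Assumes rows <= cols; transpose first otherwise."""
    a, b, c = 3.4445, -4.7750, 2.0315     # tuned quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)    # scale so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```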
onekq posted an update 2 months ago
The reaction to the QAT post was beyond expectations, so below is my optimizer post, as promised. But I found I had a lot of explaining to do about the optimizer itself, so this post is actually a historical account. The post on the Muon optimizer (used by Kimi; coming very soon) can only continue from there.

https://huggingface.co/blog/onekq/adam-optimizer

If you already know the Adam(W) optimizer, you can just skip it, and sorry for the wait. Otherwise, it should be a useful read.
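
For readers who prefer code to history, a bare-bones Adam(W) step looks roughly like this (a sketch with the usual default hyperparameters, not a reference implementation):

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update on parameter array p (t is the 1-based step)."""
    m = b1 * m + (1 - b1) * grad        # 1st moment: momentum
    v = b2 * v + (1 - b2) * grad**2     # 2nd moment: per-parameter scale
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)  # decoupled decay
    return p, m, v
```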
onekq posted an update 3 months ago
Instead of an architectural upgrade, each major model drop nowadays perfects a localized innovation. What Kimi brought to the spotlight this time is quantization-aware training (QAT). I wrote an article to explain it and why it matters to reasoning models.

https://huggingface.co/blog/onekq/qat-bonsai

If you are interested in this kind of post, I will introduce the Muon optimizer next, another technology behind Kimi's success.
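
If you want the one-screen version of the core trick: during training, weights are quantized and immediately de-quantized in the forward pass, so the network learns to live with the quantization error, while gradients flow straight through the rounding. A minimal illustration (per-tensor int4 scaling is my simplifying assumption):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize-dequantize: snap weights to an int4 grid but return
    floats, so training feels the quantization error."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for int4
    scale = np.abs(w).max() / qmax + 1e-12      # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# In QAT the forward pass uses fake_quantize(w); the backward pass
# copies the gradient through the rounding (straight-through estimator).
```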
onekq posted an update 3 months ago
To make agents work for us while we sleep, we must break the curse of sessions.