AI & ML interests

Benchmarks, Code Generation, LLMs

Recent Activity

onekq updated a Space about 2 months ago
onekq-ai/WebApp1K-models-leaderboard
onekq updated a Space 4 months ago
onekq-ai/README

onekq posted an update about 2 months ago
GPT-5.1 Codex didn't make SOTA either. This should conclude 2025: no model has ever scored above 0.8.

onekq-ai/WebApp1K-models-leaderboard

Can this leaderboard be saturated in 2026?
onekq posted an update about 2 months ago
I am starting a new series on matrices. The idea came to me when I wrote about the Muon optimizer.

Matrices have many fascinating properties and have been applied across STEM fields for decades. Their application in ML is just beginning, and there is plenty of low-hanging fruit. At the very least, I hope this mathematical perspective gives you a new lens.

https://huggingface.co/blog/onekq/matrices-transformers-preface
onekq posted an update about 2 months ago
DeepSeek v3.2 is worse than R1. This is quite puzzling. Why the regression with the new GRPO and the new attention?

onekq-ai/WebApp1K-models-leaderboard

I used reasoning mode against the DeepSeek API.
onekq posted an update about 2 months ago
Hard-earned lessons on landing your agent (some mine, most learned from others):

1. Clarify expectations. What does "automating emails" mean? Auto-drafting? Replying via templates? Extracting details into JSON?

2. Get access to your customer's corp/prod environment. A guest or sandbox account won't cut it, much less your own demo account.

3. Don't expect your agent to be turn-key. It will take at least a quarter to stabilize, assuming your customer actually uses it.
onekq posted an update about 2 months ago
The second point re the Ilya interview is about an RL pain point, i.e. sparse reward. I'm optimistic on this front.

Our actions are driven by unspeakable instincts, which leave no trace in the training set (pretraining or synthetic). These process rewards (motion sensing, vision, etc.) help you master new skills quickly, like biking. An outcome reward alone (falling off the bike) is indeed too sparse.

But lots of tasks can benefit from outcome rewards alone. Many recent RL works that upgrade SQL skills use a success/failure reward only, with executability as an optional reward.
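
As a minimal sketch (the SQLite setup and the 0.1 partial credit are my own illustrative assumptions, not taken from any specific paper), such an outcome-only reward could look like this:

```python
import sqlite3

def outcome_reward(sql: str, db_path: str, expected_rows: set) -> float:
    """Outcome-only reward: 1.0 if the generated query returns the
    expected result set. The small executability bonus is optional."""
    try:
        conn = sqlite3.connect(db_path)
        rows = set(conn.execute(sql).fetchall())
        conn.close()
    except sqlite3.Error:
        return 0.0   # not even executable
    if rows == expected_rows:
        return 1.0   # success
    return 0.1       # optional: executable but wrong (assumed value)
```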

Additionally, scale is the secret sauce for models to surpass humans. A human agent can learn a task quickly but has limited capacity, whereas a model agent can process tasks at the scale of many human lifetimes. This makes up for the inadequacy of process rewards.

Many such tasks happen to be economically viable, i.e. salary-making jobs.
onekq posted an update about 2 months ago
Ilya's interview has been widely cited. I won't address the meta points, but I'll share my two cents on two mundane issues.

I will start with the leaderboard phenomenon. This is a feature, not a bug. Model training is a project under founder mode, but like all projects it still needs north stars. And you guessed right: (famous) leaderboards are the north stars.

For the startups that have found PMF, many maintain their own proprietary leaderboards/benchmarks condensed from user traffic. This path is blocked in both directions: startups won't share their moats, and model makers won't prioritize them either.

So instead of complaining, we should celebrate that our prompts work (most of the time).

onekq posted an update 2 months ago
Grok 4.1 didn't make SOTA, but it improves a great deal over Grok 3.
onekq-ai/WebApp1K-models-leaderboard

The members of the 70% club are the four big players (GPT, Claude, Gemini, Grok) plus Kimi.
onekq posted an update 2 months ago
If RAG (by which I mean vectors and embeddings) transitions from QA to agents, is scalability (from Wikipedia-scale corpora to personal memory) still an issue? What will the new challenges be?

Anyone care to share experience?
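
To anchor the question, here is a minimal sketch of the agent-memory retrieval I have in mind; `embed` is a stand-in for a real embedding model, and the rest is an illustrative assumption:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: swap in a real model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class AgentMemory:
    """Brute-force cosine retrieval: plenty for personal-memory scale
    (thousands of entries), unlike Wikipedia-scale QA, which needs an
    approximate-nearest-neighbor index."""
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vecs.append(embed(text))

    def query(self, q: str, k: int = 3) -> list:
        sims = np.stack(self.vecs) @ embed(q)   # cosine (unit vectors)
        top = np.argsort(sims)[::-1][:k]
        return [self.texts[i] for i in top]
```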
onekq posted an update 2 months ago
This post is a byproduct of my investigation into GPU depreciation. There are very interesting dynamics between Chinese models and American chips.

https://huggingface.co/blog/onekq/nvfp4-int4

More stories like this will emerge down the road.
onekq posted an update 2 months ago
Here is the post on the Muon optimizer. It's getting hardcore. I tried to visualize orthogonalization but decided to drop it to avoid miscommunication.

https://huggingface.co/blog/onekq/muon-optimizer

No matter which angle I take, I can't detect a slowdown. In fact, it's the opposite.
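
Since the visualization got cut, here is a small code sketch of the step at Muon's core: a Newton-Schulz iteration that orthogonalizes the momentum matrix. The quintic coefficients follow the widely circulated open-source implementation and should be treated as an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately map G to the nearest orthogonal matrix (the U V^T
    of its SVD) using only matrix multiplies, no SVD.
    Assumes rows <= cols; transpose first otherwise."""
    a, b, c = 3.4445, -4.7750, 2.0315     # tuned quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)    # scale so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X
```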
onekq posted an update 2 months ago
The reaction to the QAT post was beyond expectations, so below is my optimizer post, as promised. But I found I had a lot of explaining to do about the optimizer itself, so this post is actually a historical account. The post on the Muon optimizer (used by Kimi; coming very soon) can only continue from there.

https://huggingface.co/blog/onekq/adam-optimizer

If you already know the Adam(W) optimizer, you can just skip it, and sorry for the wait. Otherwise, it should be a useful read.
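
For readers who prefer code to history, a bare-bones Adam(W) step looks roughly like this (a sketch with the usual default hyperparameters, not a reference implementation):

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW update on parameter array p (t is the 1-based step)."""
    m = b1 * m + (1 - b1) * grad        # 1st moment: momentum
    v = b2 * v + (1 - b2) * grad**2     # 2nd moment: per-parameter scale
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)  # decoupled decay
    return p, m, v
```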
onekq posted an update 3 months ago
Instead of an architectural upgrade, each major model drop nowadays perfects a localized innovation. What Kimi brought to the spotlight this time is quantization-aware training (QAT). I wrote an article to explain it and why it matters to reasoning models.

https://huggingface.co/blog/onekq/qat-bonsai

If you are interested in this kind of post, I will introduce the Muon optimizer next, another technology behind Kimi's success.
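
If you want the one-screen version of the core trick: during training, weights are quantized and immediately de-quantized in the forward pass, so the network learns to live with the quantization error, while gradients flow straight through the rounding. A minimal illustration (per-tensor int4 scaling is my simplifying assumption):

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize-dequantize: snap weights to an int4 grid but return
    floats, so training feels the quantization error."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for int4
    scale = np.abs(w).max() / qmax + 1e-12      # per-tensor scale (assumed)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# In QAT the forward pass uses fake_quantize(w); the backward pass
# copies the gradient through the rounding (straight-through estimator).
```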
onekq posted an update 3 months ago
To make agents work for us while we sleep, we must break the curse of sessions.