Keble Zhu

keblezhu

AI & ML interests

None yet

Recent Activity

liked a Space 3 days ago

commented on an article 3 days ago

AI Trends 2026: Test-Time Reasoning and the Rise of Reflective Agents

upvoted an article 3 days ago

AI Trends 2026: Test-Time Reasoning and the Rise of Reflective Agents

Organizations

None yet

liked a Space 3 days ago

Stuido

Rewarx is an AI-powered product photography tool designed fo

commented on AI Trends 2026: Test-Time Reasoning and the Rise of Reflective Agents 3 days ago

👍

upvoted an article 3 days ago

Article

AI Trends 2026: Test-Time Reasoning and the Rise of Reflective Agents

aufklarer

•

Dec 7, 2025

• 3

liked a Space 4 days ago

AI Product Photography Benchmark 2026

AI ecommerce product photography benchmark 2026

updated a Space 4 days ago

AI Product Photography Benchmark 2026

AI ecommerce product photography benchmark 2026

published a Space 4 days ago

AI Product Photography Benchmark 2026

AI ecommerce product photography benchmark 2026

reacted to HannesVonEssen's post with ❤️ 4 days ago

Post

4835

📣 Add an architecture visualization to your model card!

🌟 For all creators out there: add a model visualization to your model card to capture your audience's attention!

🖱️ When clicked, it opens an interactive view with multiple levels of granularity!

1️⃣ Paste the URL at https://hfviewer.com/model-card-embed
2️⃣ Paste the generated code into your README.md!
3️⃣ ✨

reacted to erikkaum's post with ❤️ 4 days ago

Post

3052

Releasing my first kernel 🔥 MaxSim

Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids that by using tiled scoring with simdgroup_matrix (Metal) and WMMA (CUDA).

The result is a 3–5× speedup over a naive PyTorch baseline 🔥

Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)
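
The tiling idea described above can be sketched in plain Python. This is only an illustration of the scoring logic, not the kernel's actual code: the function names, the `tile` parameter, and the toy vectors are all assumptions made for the example, and a non-empty document is assumed. The naive version builds the whole query-by-document similarity matrix, while the tiled version walks the document in blocks and keeps just one running maximum per query token.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_naive(query: List[Vector], doc: List[Vector]) -> float:
    """MaxSim that materializes the full similarity matrix first."""
    sim = [[dot(q, d) for d in doc] for q in query]  # |query| x |doc| values
    return sum(max(row) for row in sim)


def maxsim_tiled(query: List[Vector], doc: List[Vector], tile: int = 2) -> float:
    """MaxSim that scans the document in tiles (hypothetical tile size)."""
    # Only one running maximum per query token is ever stored.
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                score = dot(q, d)
                if score > running_max[i]:
                    running_max[i] = score
    return sum(running_max)


# Toy embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

naive_score = maxsim_naive(query, doc)
tiled_score = maxsim_tiled(query, doc, tile=2)
print(naive_score, tiled_score)
```

Both functions return the same score. The difference is memory: the tiled variant needs storage proportional to the number of query tokens rather than to the full matrix.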

Try it out 👇
https://huggingface.co/kernels/erikkaum/maxsim

reacted to alvarobartt's post with ❤️ 4 days ago

Post

3190

Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!

TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.

🧠 hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
🏗️ Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
⚡ Active parameter count isn't the same as memory footprint, especially for sparse architectures
📦 Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
📚 KV cache can still dominate depending on context length, batch size, and concurrency
🔀 Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
🚀 Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
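
The gap between active parameters and resident memory can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not how hf-mem computes its numbers. The `MoEConfig` class, its fields, and every value in it are hypothetical and chosen only to keep the arithmetic easy to follow. It assumes 2 bytes per parameter and per KV cache element (for example fp16 or bf16).

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # Hypothetical illustration values, not a real model.
    base_params: int        # shared weights (attention, embeddings, router)
    params_per_expert: int  # weights of one expert, summed over all layers
    num_experts: int        # experts that must be resident in memory
    experts_per_token: int  # experts each token is routed to
    num_layers: int
    num_kv_heads: int
    head_dim: int


def resident_weight_bytes(cfg: MoEConfig, bytes_per_param: int = 2) -> int:
    """Memory for all weights, including experts a given token never uses."""
    total_params = cfg.base_params + cfg.num_experts * cfg.params_per_expert
    return total_params * bytes_per_param


def active_params(cfg: MoEConfig) -> int:
    """Parameters actually used when processing a single token."""
    return cfg.base_params + cfg.experts_per_token * cfg.params_per_expert


def kv_cache_bytes(cfg: MoEConfig, seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size; the leading 2 accounts for keys and values."""
    return (2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
            * seq_len * batch_size * bytes_per_elem)


cfg = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=500_000_000,
    num_experts=16,
    experts_per_token=2,
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)

print("active params:   ", active_params(cfg))
print("resident weights:", resident_weight_bytes(cfg), "bytes")
print("KV cache:        ", kv_cache_bytes(cfg, seq_len=8192, batch_size=4), "bytes")
```

With these made-up numbers, each token touches 3B parameters while 10B must stay resident, and the KV cache grows linearly with both context length and batch size.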

Check the repository at https://github.com/alvarobartt/hf-mem

reacted to RiverRider's post with 👍 4 days ago

Post

3267

Natural Language Autoencoders: A Window into Latent Structure

I introduced a concise mathematical formulation of the P versus NP question into the SRT-NLA-AV-v1 demonstration:

P vs NP asks whether every problem whose solution can be verified in polynomial time (NP) can also be solved in polynomial time (P). Integer factorization — given N = p·q where p and q are large primes (p < q) — is in NP but widely believed not to be in P.
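
The asymmetry between verifying and solving in the statement above can be illustrated with a small sketch. The function names and the toy modulus are assumptions made for this example. Checking a claimed factorization takes one multiplication, whereas the naive search shown here (trial division, just one simple approach) needs a number of steps that grows exponentially with the bit length of N. The verifier below does not check that p and q are prime, which would need a separate primality test.

```python
from typing import Optional, Tuple


def verify_factorization(n: int, p: int, q: int) -> bool:
    """Check a claimed factorization N = p*q with 1 < p < q.

    One multiplication and two comparisons: polynomial in the bit length.
    Primality of p and q is not checked in this sketch.
    """
    return 1 < p < q and p * q == n


def factor_by_trial_division(n: int) -> Optional[Tuple[int, int]]:
    """Find the smallest nontrivial divisor by brute force.

    Tries up to sqrt(n) candidates, which is exponential in the bit length.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # n is prime (or less than 4)


n = 101 * 113  # toy semiprime; real cryptographic moduli are far larger
factors = factor_by_trial_division(n)
print(factors, verify_factorization(n, *factors))
```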

The resulting activation verbalization (best-of-N, reranked by AR fidelity) surfaced:

“This article originally appeared in the August 2016 edition of CACM. A new method of proving computational hardness of problems, known as multilinearization, can improve efficiency, reduce complexity and simplify proofs. In this article, I describe multilinearization and its application to several key problems, from the discrete logarithm and factoring to RSA and elliptic-curve discrete logarithms.”

What emerges is not a literal restatement, but a structured articulation of the model’s internal associations: hardness proofs, algebraic techniques, and the cryptographic implications that orbit this foundational question in computational complexity.

The demo offers a compelling interface for exploring these latent representations.

Explore it here:
RiverRider/srt-nla-av-v1-demo

Recommended: Best-of-N sampling with round-trip evaluation for highest fidelity.
