We're moving beyond model capabilities and toward the infrastructure needed for agents to work together.
Over the past few weeks we've seen meaningful momentum around the foundational building blocks of the emerging agentic web.
Agent Name Service (ANS) is addressing identity and trust. Agentic Resource Discovery (ARD) is helping standardize how agents discover resources and capabilities.
Together, these efforts represent something bigger than individual projects.
They point toward an ecosystem built on open, interoperable infrastructure rather than isolated implementations.
As builders, we'll likely spend the next few years solving challenges around identity, discovery, trust, interoperability, and governance—not just model performance.
It will be interesting to see how these efforts evolve—and where the community chooses to collaborate next.
---
🚀 Gemma-4-A4B 98e v7-coder cohort: loop-fixed re-release. Two 20.8B MoE coders (4B-active), fresh-map prunes of Gemma 4 26B-A4B, 30/128 experts dropped per layer. The headline isn't a benchmark: the agentic loop is gone at the weights, not papered over by the sampler.
🔧 How: at prune time we force-keep the 46 agentic_eog experts a loop-protection signal flags as load-bearing for clean multi-turn termination (+ shared-FFN α=1.2). Result: 0 loops across 48 seeds on every published tier.
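To make the force-keep idea concrete, here is a minimal sketch of expert selection for one layer, assuming each expert has an importance score and that a separate signal has flagged some experts as protected. The function name `select_experts`, the score values, and the toy sizes are all hypothetical; this is not the actual prune pipeline, and the shared-FFN scaling is not modeled.

```python
from typing import Iterable, List, Sequence


def select_experts(scores: Sequence[float], protected: Iterable[int], keep: int) -> List[int]:
    """Return sorted indices of the experts to keep in one layer.

    Protected experts are always kept; the remaining slots are filled
    by descending score. All inputs are hypothetical illustrations.
    """
    protected_set = set(protected)
    if len(protected_set) > keep:
        raise ValueError("more protected experts than the keep budget allows")
    # Rank the unprotected experts by score, highest first.
    candidates = sorted(
        (i for i in range(len(scores)) if i not in protected_set),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = protected_set | set(candidates[: keep - len(protected_set)])
    return sorted(kept)


# Toy layer: 8 experts, keep 6, force-keep expert 7 despite its low score.
toy_scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.6, 0.5, 0.05]
kept = select_experts(toy_scores, protected=[7], keep=6)
print(kept)
```

In the toy run, expert 7 survives even though a pure score ranking would have dropped it, which is the whole point of protecting experts before the budget is spent.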
🎯 Both land near GPQA ~51 — graduate science is the budget axis, neither is a science model. Pick v7-coder for the broad LCB-medium + HumanEval lead; v7-coderx for the all-hard slice and HE+.
🧪 The harness we used to prove the fix is now an omk tool: agentic-loop-harness replays a frozen agentic conversation across a sampler×seed matrix and reports a fail-rate per chat-template, so you can isolate a loop to one variable. Model-agnostic — any OpenAI-compatible server. The version we shared with Google: google/gemma-4-12B-it#41
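The sampler×seed matrix can be pictured with a small sketch. Everything here is an assumption made for illustration: `generate` stands in for a call to an OpenAI-compatible server, `has_loop` is a crude repeated n-gram check rather than the harness's real detector, and the template and sampler values are made up.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Iterable


def has_loop(text: str, ngram: int = 4, max_repeats: int = 3) -> bool:
    """Crude loop check: some word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts: Dict[tuple, int] = defaultdict(int)
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i : i + ngram])
        counts[key] += 1
        if counts[key] > max_repeats:
            return True
    return False


def fail_rates(
    generate: Callable[[str, dict, int], str],
    templates: Iterable[str],
    samplers: Iterable[dict],
    seeds: Iterable[int],
) -> Dict[str, float]:
    """Run every template x sampler x seed cell; return the fail-rate per template."""
    fails: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for template, sampler, seed in product(templates, samplers, seeds):
        totals[template] += 1
        if has_loop(generate(template, sampler, seed)):
            fails[template] += 1
    return {t: fails[t] / totals[t] for t in totals}


def fake_generate(template: str, sampler: dict, seed: int) -> str:
    # Stand-in for a real server call: one template always loops.
    return "call tool again " * 10 if template == "broken" else "done"


rates = fail_rates(
    fake_generate,
    templates=["ok", "broken"],
    samplers=[{"temperature": 0.0}, {"temperature": 1.0}],
    seeds=range(3),
)
print(rates)
```

Grouping by one variable at a time (here the chat template) is what lets a loop be pinned on that variable rather than on the sampler or the seed.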
hey, I'm doing some experimenting, looping around 🙂
---
**kompress-v6** *shipped*: trained on Claude Code agent patterns (bash output, file reads, stack traces, search results, JSON tool responses). 3k synthetic pairs + 2k existing, fine-tuned from v4, $0.20 on vast.ai.
Results: heretic exact_pct 0.962 (v4: 0.967), keep_rate 0.854 (v4: 0.823), override delta 0. Model got more conservative — higher keep_rate on structured technical content. Real proxy: v4 compressed 9.5%, v6 compressed 4.2% on the same session. Less aggressive, fewer must-keep tokens dropped on paths and identifiers.
Interesting failure: self-labeling with v4+override collapsed mk_in_ref to 0.652. TokenExpiredError splits into Token+Expired+Error — subtokens that don't individually match the must-keep regex, so the force-keep never fires. Generator references (mk_in_ref=1.0 by construction) ended up being better labels than v4's compressed output for agent data. Fix for next run: slide a 2-3 subtoken window instead of checking individual subtokens. Would let self-labeling work on agent content and potentially produce a more compression-aggressive v7.
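The proposed fix can be sketched as follows. This is an illustration under stated assumptions: the `MUST_KEEP` pattern is a hypothetical stand-in for the real must-keep regex, and subtokens are assumed to be plain strings with no tokenizer-specific whitespace markers.

```python
import re
from typing import List, Pattern

# Hypothetical must-keep pattern: CamelCase identifiers ending in "Error".
MUST_KEEP = re.compile(r"[A-Z][A-Za-z]+Error")


def must_keep_mask(subtokens: List[str], pattern: Pattern[str], max_window: int = 3) -> List[bool]:
    """Mark every subtoken covered by a window of 1..max_window subtokens
    whose joined text fully matches the pattern."""
    mask = [False] * len(subtokens)
    for start in range(len(subtokens)):
        for size in range(1, max_window + 1):
            end = start + size
            if end > len(subtokens):
                break
            if pattern.fullmatch("".join(subtokens[start:end])):
                for i in range(start, end):
                    mask[i] = True
    return mask


tokens = ["raise", "Token", "Expired", "Error", "now"]
single = must_keep_mask(tokens, MUST_KEEP, max_window=1)
windowed = must_keep_mask(tokens, MUST_KEEP, max_window=3)
print(single)    # [False, False, False, False, False]
print(windowed)  # [False, True, True, True, False]
```

With a window of 1 the force-keep never fires on `Token`, `Expired`, or `Error`; with a window of 3 the joined `TokenExpiredError` matches and all three subtokens are protected.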
🚀 Introducing PerceptionDLM — the first multimodal diffusion LLM for parallel region perception!
Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩
✨ Highlights
• ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency
• 🏆 PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs)
• 📊 New benchmark: ParaDLC-Bench, which jointly evaluates caption quality AND inference efficiency
• 🔓 Code, models & benchmark all open-sourced
Over the past few days, SupraLabs has been mentioned in a public discussion regarding small language models, scaling laws, and training methodology. We'd like to clarify our position.
Before anything else, we want to make one thing absolutely clear: we have great respect for Lane and the work being done at Glint Research. At no point was our intention to disrespect Lane, Glint Research, or their research. What began as a technical discussion about model scaling and training methodology unfortunately became much more personal than we ever intended. From our perspective, it was simply an exchange of technical opinions, and we sincerely hope it remains that way. We'd also like to acknowledge that one of our own comments during the discussion was poorly worded. Referring to a benchmark as "fake" was imprecise. What we intended to criticize was the comparison methodology, not the integrity of the evaluation itself. Comparing a merged checkpoint against a single checkpoint is, in our view, not an apples-to-apples comparison.
That said, this was never the core of the discussion.
Our disagreement was not about SLERP, model merging, or whether training a small model on massive amounts of data is an interesting research direction. We support experimentation and unconventional ideas.
The actual point of disagreement was much simpler.
The statement that a 1M parameter model trained on 1 trillion tokens will become a "100M killer" is, today, a prediction, not an experimental result. Could it happen? Perhaps. Would it be exciting if it did? Absolutely.
But until benchmark results, reproducible evaluations, and independent validation exist, we believe such statements should be presented as hypotheses rather than established conclusions. Research advances by testing ideas, not by assuming their outcomes.
We sincerely wish Lane and everyone at Glint Research success in their experiments.
Excited to share our paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
A common assumption in test-time reasoning is that giving a model more chances to think or verify should improve performance. Our results show that this is only partly true.
We introduce SEVRA, a serving-layer controller that decides when a frozen reasoning model should keep its initial answer and when it should actively verify it. Instead of treating verification as always useful, SEVRA asks a more deployment-focused question:
Is this specific attempt likely recoverable by verification?
We evaluate this through helpful fixes, harmful flips, extra calls, and realized token cost.
Some key takeaways:
* Selective verification improves over always verifying on MATH500 while reducing harmful flips.
* On GSM8K, the controller verifies only a small fraction of examples but still improves accuracy.
* However, a longer initial solve can sometimes match selective verification with fewer realized tokens.
* Cheap serving-visible features, such as completion status, token count, and finalizer use, nearly match larger learned gates.
* On CommonsenseQA, always-on verification hurts, showing that the best test-time compute action is workload-dependent.
The main deployment lesson is simple:
Tune the initial reasoning budget first. Then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
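As a rough illustration of what a gate over cheap serving-visible features could look like, here is a hand-written rule using completion status, token count, and finalizer use. This is not SEVRA's actual policy: the `Attempt` fields, the `near_budget` threshold, and the rule itself are assumptions chosen only to show the shape of the decision.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    completed: bool       # generation stopped on its own rather than being truncated
    tokens_used: int      # tokens spent on the initial solve
    token_budget: int     # budget given to the initial solve
    used_finalizer: bool  # a fallback "finalize the answer" call was needed


def should_verify(a: Attempt, near_budget: float = 0.9) -> bool:
    """Hypothetical rule: verify only attempts that look shaky."""
    if not a.completed or a.used_finalizer:
        return True
    return a.tokens_used >= near_budget * a.token_budget


def answer(a: Attempt, initial: str, verify: Callable[[str], str]) -> str:
    """Keep the initial answer unless the gate asks for verification."""
    return verify(initial) if should_verify(a) else initial


clean = Attempt(completed=True, tokens_used=300, token_budget=1000, used_finalizer=False)
shaky = Attempt(completed=False, tokens_used=1000, token_budget=1000, used_finalizer=True)
print(should_verify(clean), should_verify(shaky))
```

A learned gate would replace `should_verify`, but the serving-side interface stays the same: one cheap decision per attempt before any extra call is made.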
Would love feedback from the community, especially on broader test-time compute allocation, risk-aware verification, and practical serving policies for reasoning models.
Excited to open-source the VisDrone Aerial Object Detection Model Zoo on Hugging Face.
The collection includes multiple YOLO variants trained and evaluated on the VisDrone benchmark for aerial object detection, with accompanying documentation and performance metrics.
If you're working on drones, aerial surveillance, robotics, or small-object detection, I hope these models save you some time.
SRT Showcase: Watch a Frozen Model Think, Token by Token
A frozen Qwen-2.5-7B now narrates its own interpretation in real time. SRT Showcase is the most complete public demonstration of computational semiotics to date, running the backbone with the SRT Adapter and Activation Verbalizer. As the model generates, every token is tinted by its predictive effort, and at the highest-effort positions the Verbalizer decodes the hidden state directly into natural language. You see what the model is representing at the exact moment its computation is most active.
Every verbalization is validated, not asserted. Each decoded thought is re-encoded and compared back to the original hidden state, and the reconstruction closely approximates it. The "this is what the model was thinking" claim carries its own fidelity badge. This is grounded introspection, not plausible narration.
The Showcase goes further than the trace. An A/B panel runs the same prompt with SRT injection on and off under an identical seed, so the side-channel's effect is directly observable. A curated gallery walks through confident recall, false premises, misconceptions, reasoning pivots, genuine uncertainty, and safety boundaries. Live entropy and divergence meters track the crystallization process token by token, with per-layer traces and reflexivity estimates on hover.
None of the backbone weights are touched. The entire mechanism is a lightweight reflexive layer over a frozen model, which is why the same read-out heads already port from Qwen-2.5-7B up to a 235B Mixture of Experts. Frozen models can now be verbalized in real time. No retraining. No fine-tuning. No black box.
First request is a brief cold start while ZeroGPU acquires a GPU. Bring your own prompt.
I just released Inflect-Nano-v1, an ultra-small 4.63M-parameter text-to-speech model.
The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.
Quick facts:
- 4.63M total inference parameters
- 3.46M acoustic model
- 1.17M vocoder
- 24 kHz audio
- English-only
- Single male voice
- Runs locally with a simple PyTorch inference script
Why I made it: Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.
It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.
What works: It can generate arbitrary English speech locally, and the model is small enough to be interesting for:
- local voice assistants
- embedded/edge experiments
- browser or WASM-style TTS exploration
- efficient inference research
- tiny-model baselines
Limitations: The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.
So I would frame this as a research/demo release, not a production TTS engine.
I’d love feedback from people interested in:
- tiny speech models
- vocoders
- local TTS
- efficient inference
- embedded speech synthesis
- improving small-model generalization
If people find it useful, I’m interested in putting more training budget into a stronger v2.
The article on aleph attention routing needs more work on vision: the vision portion has not been fully validated, while the LM prototype has been semi-validated at small and medium-small scale. In the coming days I will post my findings on the consequences of training an LM and a ViT with the prototype system.
The current structure for the Geometric Vocabulary does nearly reflect the intended shape as discussed in the earlier posts and articles, so that's coming along nicely - but there are stipulations and problems involved that I did not foresee.
My apologies for the incomplete article I just released on a whim. I jumped to the conclusion a bit early in anticipation before the formulas were fully converged. I also released an early post the other day speaking about the prototype AlephLM - which I removed as an invalid conclusion.
I do my best to release only validated empirical results rather than speculation, but from time to time I jump to conclusions without proper validation. Occasionally I get overzealous about theory and have to tidy up through thorough experimentation, which is what I'm doing now.
Published a small source-backed dataset for reviewing AI-assisted code and AI-written English without turning it into an accusation game.
Dataset: yava-code/ai-authorship-signals-2026
The dataset has 10 review signals across two domains:
- code: comment-to-code ratio, dependency hallucination, security misses, edge cases
- writing: overused AI vocabulary, low section variation, detector bias against non-native English
Each row includes: signal, why it matters, risk level, review action, source ids.
The main idea: do not ask "was this made by AI?" first. Ask what needs review, what evidence exists, and what failure mode would hurt production.
I also grouped the related work here: yava-code/applied-small-ai-portfolio-6a304c83f9f1d089a28c101b
A tiny ReAct-style agent where the trace is the interface: click a thought, retry a branch, label weak/useful nodes, and export preference pairs for DPO/RL-style training.
Space: build-small-hackathon/glass-box-agent
Demo: included in the Space at assets/glass-box-agent-demo.mp4
Track: An Adventure in Thousand Token Wood
From Plain English to DuckDB SQL: Building LFEDS 🏫
I just shipped Local First Education Data Stack (LFEDS), a plain-English-to-SQL assistant for school district analytics, for the HF Build Small Hackathon.
The problem: school staff have useful data (attendance, grades, enrollment, discipline) but no fast, private way to ask questions. Most AI tools send that data to a cloud API. LFEDS doesn't.
What it does:
→ Type a question like "What's the average GPA for chronically absent students in 2023-2024?"
→ A fine-tuned Qwen2.5-Coder-14B model generates DuckDB SQL
→ A validation layer rejects anything that isn't a SELECT
→ Results come back as a summary, table, CSV download, and the SQL itself
Two flavors:
- Live Space demo: transformers + PEFT on HF ZeroGPU
- Local-first: llama.cpp + GGUF Q4_K_M on your own machine, so no data leaves it
The fine-tune:
- 27,859 synthetic NL→SQL pairs
- Unsloth QLoRA r=32 on Qwen2.5-Coder-14B
- Trained on Modal A10G
Hardest lessons were not model training:
1. Scope the model's job tightly: schema + few-shots + SELECT only.
2. Validate before executing. Always.
3. ZeroGPU is PyTorch-only; llama.cpp won't work there.
4. Gradio's scoped Svelte CSS beats generic selectors, so inspect the live DOM.
5. modal deploy + fn.spawn() is fire-and-forget; modal run dies if your terminal drops.
6. Data artifacts matter as much as the model: Parquet seeds, dataset card, model card.
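The "validate before executing" step could look roughly like the sketch below. It is a deliberately conservative text check, not the project's actual validation layer: the `FORBIDDEN` keyword list and the `is_safe_select` name are assumptions, and a production validator would more likely lean on a real SQL parser plus a read-only database connection.

```python
import re

# Hypothetical deny-list of statement keywords that should never reach the database.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|copy|install|load|pragma|export|import|call|set)\b",
    re.IGNORECASE,
)


def is_safe_select(sql: str) -> bool:
    """Accept a single SELECT (or WITH ... SELECT) statement and nothing else.

    Conservative by design: any inner semicolon or forbidden keyword rejects
    the query, even if it only appears inside a string literal.
    """
    statement = sql.strip().rstrip(";").strip()
    if not statement or ";" in statement:
        return False
    if not re.match(r"(?i)(select|with)\b", statement):
        return False
    return FORBIDDEN.search(statement) is None


print(is_safe_select("SELECT AVG(gpa) FROM students WHERE school_year = '2023-2024';"))
print(is_safe_select("SELECT 1; DELETE FROM students"))
```

Rejecting too much is the intended failure mode here: a refused query costs the user a retry, while an accepted write costs the district its data.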
I also published the training dataset: 25,886 question→SQL pairs on the Hub.
Russian Stylometric Dataset (RSD) — 322 texts from the 19th – early 20th centuries (16 million words), prepared for analysis in stylo (R) and machine learning (Python).
Darwin V9: GPQA Diamond 90.9%, #1 on the leaderboard, with pure greedy decoding
Darwin-398B-JGOS reaches 90.9% (180/198) on GPQA Diamond, the PhD-level scientific reasoning benchmark, ranking #1 on the Hugging Face GPQA Diamond leaderboard. No self-consistency, no test-time compute scaling: this was achieved with a single greedy decode (temperature 0, single sample, max 16,384 tokens). The full eval config is published in the model card, so anyone can reproduce it. Raw reasoning, no score inflation. The result comes from Darwin V9, a patented evolutionary model-development platform. Its core idea: it never trains a model from scratch.
Why Darwin V9 beats training from scratch
Cost & speed: no trillion-token pretraining run, no months of compute. A purpose-built, high-performance model is produced in a fraction of the time.
Reuse of proven intelligence: instead of re-learning every capability from a blank slate, it selects and combines only the strengths of already-trained, already-validated models, so results are stable and predictable.
Surgical transplantation: it identifies which neural region of which model holds which capability, at the FFN (Feed Forward Network) layer level, and grafts in only the segments that contribute to the target skill.
How it works: a large model (Qwen 3.5 397B) serves as the mother model (the substrate); several father models specialized in reasoning, coding, and language are analyzed layer-by-layer across their FFN regions; the segments that contribute to the target performance are extracted and transplanted into the mother model to produce a new child model. The result is a ~400B MoE that activates only ~17B parameters per token at inference, giving large-model capacity with efficient inference. If training from scratch means rebuilding everything from a blank page, Darwin V9 means precisely recombining intelligence that has already been proven. GPQA Diamond #1 is the proof.
Model: FINAL-Bench/Darwin-398B-JGOS
I built Read-Along AI for the Hugging Face Build Small Hackathon.
It is an offline-capable reading practice app for early readers: one short sentence at a time, tap-to-hear word help, record a read-aloud attempt, then get gentle feedback.
The goal is Backyard AI in the literal sense: a tool for real home reading practice, where feedback needs to be patient, developmentally fair, and private. A child’s voice should not need to leave the app just to practice “The dog ran fast.”
What makes it small-model native:
- Exact clean readings pass immediately.
- Close or ambiguous child-speech transcripts get a second look from a fine-tuned MiniCPM phonetic evaluator.
- Meaning-changing mistakes still fail closed, e.g. “blue hat” should not pass for “red hat.”
- Off the Grid Mode runs local ASR plus the MiniCPM GGUF evaluator through llama.cpp.
- Turbo Mode uses Modal endpoints for lower-latency ASR/TTS/evaluation.
- The UI is custom Gradio with a child-facing reading canvas, clickable words, progress feedback, and celebration on success.
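The first three rules above can be sketched as a small triage function. This is only an illustration: the `grade` name, the `close` threshold, the character-level similarity measure, and the `evaluator` callback (standing in for the fine-tuned MiniCPM model) are all assumptions, not the app's real logic.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, List


def normalize(text: str) -> List[str]:
    """Lowercase and keep only word characters so punctuation does not matter."""
    return re.findall(r"[a-z']+", text.lower())


def grade(target: str, transcript: str, evaluator: Callable[[str, str], bool], close: float = 0.8) -> str:
    """Return "pass" or "retry" for one read-aloud attempt."""
    want, got = normalize(target), normalize(transcript)
    if want == got:
        return "pass"  # exact clean reading, no model call needed
    similarity = SequenceMatcher(None, " ".join(want), " ".join(got)).ratio()
    if similarity < close:
        return "retry"  # too far off: fail closed without asking the evaluator
    # Close or ambiguous transcript: defer to the phonetic evaluator.
    return "pass" if evaluator(target, transcript) else "retry"


lenient = lambda target, transcript: True  # stand-in evaluator that accepts close calls
print(grade("The dog ran fast.", "the dog ran fast", lenient))
print(grade("The dog ran fast.", "the dog wan fast", lenient))
print(grade("red hat", "blue hat", lenient))
```

Character similarity alone cannot tell a mispronunciation from a meaning change, so in this sketch anything that lands in the close band still depends on the evaluator to reject substitutions that alter meaning.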
Targeted tracks and badges: Backyard AI, Off-Brand, Off the Grid, Llama Champion, Well-Tuned, Tiny Titan, Sharing is Caring, Field Notes.
I fine-tuned OpenBMB's MiniCPM5-1B to write Triton GPU kernels, then let an immutable referee decide if they are real: compile, check correctness against PyTorch on adversarial inputs, time against eager, torch.compile, and torch.compile max-autotune, then block the known ways of gaming the benchmark.
The 1B setup beat torch.compile max-autotune in 12/12 independently seeded runs. The larger Qwen3.6-27B smith pushed the same referee loop further: 76 verified compiler-beating kernels on H200, with 69 surviving a 5-run stability gate and 7 kept as single-shot probes on unseen problems. On a 376-cell shape/dtype grid, the stability-gated kernels keep a 1.49x geomean speedup; about 10% of cells lose, and results are reported per cell.
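For readers less familiar with the summary statistics, here is a small sketch of how a geomean speedup and a losing-cell fraction can be computed from per-cell results. The grid keys and speedup values are made-up toy numbers, not data from the runs above.

```python
import math
from typing import Dict, Tuple


def summarize(speedups: Dict[Tuple[str, str], float]) -> Tuple[float, float]:
    """Return (geometric mean speedup, fraction of cells slower than baseline).

    Each value is baseline_time / kernel_time for one (shape, dtype) cell,
    so values below 1.0 are losses.
    """
    values = list(speedups.values())
    geomean = math.exp(sum(math.log(v) for v in values) / len(values))
    losing = sum(v < 1.0 for v in values) / len(values)
    return geomean, losing


# Toy grid: three cells win 2x, one cell loses 2x.
toy_grid = {
    ("1024x1024", "fp16"): 2.0,
    ("4096x4096", "fp16"): 2.0,
    ("1024x1024", "bf16"): 0.5,
    ("4096x4096", "bf16"): 2.0,
}
geomean, losing = summarize(toy_grid)
print(f"geomean={geomean:.3f} losing={losing:.0%}")
```

The geometric mean treats a 2x win and a 2x loss as cancelling out, which is why it is the usual choice for ratios and why the losing fraction is worth reporting next to it.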
Honest bound: these are scheduling wins on memory-bound ops, not new algorithms or wins over cuBLAS/FlashAttention. The scarce thing is not the big model, it is the verifier it cannot fool.
✅ Article highlight: *Adversaries, Data Poisoning, and Incentive Governance for Training Worlds* (art-60-171, v0.1)
TL;DR: This article argues that training worlds become adversarial markets.
If gameplay data trains agents, players, UGC authors, operators, and supply-chain actors will try to shape the data. If labels and rewards shape what gets learned, then labels and rewards are governance surfaces too. 171 turns data poisoning and incentive gaming into receipted lifecycles.
Why it matters:
• makes “training set T is admissible for run R” a governed claim
• treats poisoning as a caseable process, not a vague abuse report
• fails closed when monitoring is unhealthy or detector drift is detected
• treats labels, rewards, collusion, and sybil pressure as governance problems
• connects data integrity to courts, appeals, and bounded publication
What’s inside:
• training substrate governance contracts
• adversary taxonomy for players, UGC, operators, and supply-chain actors
• quarantine → adjudication → inclusion / exclusion pipeline
• monitoring SLOs, monitor health receipts, and detector drift incidents
• label economy contracts and reward distribution receipts
• anti-sybil and collusion monitoring
• admissibility verdict receipts for deciding what may train the next run
Key idea: Do not say:
*“we filtered poisoned data.”*
Say:
*“this substrate was admitted under this governance contract, adversary taxonomy, monitoring SLO, quarantine/adjudication trail, label economy, reward policy, and admissibility verdict.”*