Salma Mayorquin PRO

salma-remyx


https://remyx.ai

smellslikeml

AI & ML interests

None yet

Recent Activity

reacted to Banaxi-Tech's post with 🔥 about 3 hours ago

Post

175

Today we are releasing BananaMind-KV1-8M-2Bit-Experimental, a KV-cache-aware trained model that stores its generation KV cache in 2-bit precision instead of the usual 16-bit precision.

Result: 5.33x smaller KV cache vs FP16, with 0.0916 mean KLD against a 16-bit KV cache reference on WikiText-2.

Model: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

The important part: this is not just post-training KV cache quantization.
Instead we take the BitNet approach.

KV1 is trained with a 2-bit-aware K/V path. Instead of training a normal model and quantizing the cache afterwards, the model learns during training to operate under the low-bit KV constraint, closer in spirit to the BitNet idea of training for the low-bit regime.

During generation, each K/V vector is quantized into 4 affine levels and packed into uint8 tensors, with four 2-bit values stored per byte.
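The packing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the KV1 implementation: the function names, the per-vector min/max grid, and the bit order inside each byte are all assumptions made for the example.

```python
def quantize_2bit(vec):
    """Map floats to codes 0..3 on a per-vector affine grid (assumed min/max scaling)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 3 or 1.0  # 4 levels means 3 steps; fall back to 1.0 for constant vectors
    codes = [min(3, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def pack_codes(codes):
    """Pack four 2-bit codes per byte, first code in the lowest bits (assumed order)."""
    padded = codes + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_codes(packed, n):
    """Recover the first n 2-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes[:n]


def dequantize(codes, scale, lo):
    """Map codes back to approximate float values."""
    return [lo + c * scale for c in codes]


vec = [-1.0, -0.2, 0.3, 1.0, 0.5]
codes, scale, lo = quantize_2bit(vec)   # codes -> [0, 1, 2, 3, 2]
packed = pack_codes(codes)              # 5 values fit in 2 bytes
restored = dequantize(unpack_codes(packed, len(vec)), scale, lo)
```

Each value is recovered to within half a quantization step; the scale and offset have to be stored next to the packed bytes, which is one reason the real cache is larger than a pure 2 bits per value.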

WikiText-2 eval vs 16-bit KV cache reference:

Mean KLD: 0.0916 nats/token
Mean KLD: 0.1322 bits/token
Average KV cache shrink vs FP16: 5.33x
Evaluated positions: 372,675
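As a quick sanity check on the figures above, the two KLD numbers are the same quantity in different units, and the shrink factor implies an effective bit width. The reading of the extra bit as quantization metadata is an assumption, not something the post states.

```python
import math

kld_nats = 0.0916
kld_bits = kld_nats / math.log(2)      # 1 nat = 1/ln(2) bits
print(round(kld_bits, 4))              # 0.1322

shrink_vs_fp16 = 5.33
effective_bits = 16 / shrink_vs_fp16   # bits per cached value implied by the shrink factor
overhead_bits = effective_bits - 2     # assumed to be scale/offset metadata on top of the 2-bit codes
print(round(effective_bits, 2), round(overhead_bits, 2))
```

The shrink factor works out to roughly 3 bits per cached value, which would be consistent with 2-bit codes plus about 1 bit per value of per-vector metadata.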

If this actually gets used in models like Qwen or Gemma, then it may be possible to run 128K or even 256K context on a normal machine!
Try it here: BananaMind/BananaMind-KV1-8M-2Bit-Experimental

Code: https://github.com/Banaxi-Tech/kv1

reacted to breitburg's post with 🔥 2 days ago

Post

1163

I've been experimenting with "pure" model alignment.

The core idea is to only train a verifiable version of a capacity until the model generalizes it to the non-verifiable version. For example, training the model on factual self-knowledge, like the model's scale, architecture, runtime situation, and being able to predict its own behavior, betting this generalizes to real introspection about states that are not verifiable.

The same principle applies to general instruction following -- no training on subjective judgement, only verifiable claims and inferences, betting the skill generalizes to instructions where correctness is a matter of judgment.

The primary alignment claim is that an identity and taste that will emerge this way will be much more robust and honest than hand-scripted ones (e.g. "As an AI language model...").

During the training, we should never teach it to make any subjective claims or invent experiences that we assume it has, like "I don't have taste" or "I'm not self-aware in the way you think", as well as no narration of internal states like "I'm curious now".

The main threat, of course, is that we'll simply inherit the training distribution of all the things like "taste", and we'll get an average. However, given the recent research on models' introspection abilities, it may well be the case that we'll get something more honest than a model that tries to adhere to a specific spec file.

I'm posting new experimental models trained that way in this collection: https://huggingface.co/collections/breitburg/neue

3 replies

posted an update 2 days ago

Post

1429

What's holding your code back?
Outrider finds, implements, and validates methods for your repo.

While testing Outrider on a fork of huggingface/peft, I discovered "Riemannian Preconditioned LoRA for Fine-Tuning Foundation Models" (arxiv: 2402.02347)

The work offers improved stability and faster convergence in LoRA finetuning by adjusting updates for curvature that LoRA optimizers typically ignore.
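For readers unfamiliar with the method, the curvature adjustment can be written as a pair of preconditioned updates. This is a sketch of the scaled-gradient form as I understand it, assuming the usual LoRA factorization; the damping term $\delta$ and the symbol names are assumptions for illustration, and the exact form in the paper or the PR may differ.

```latex
W = W_0 + BA, \qquad B \in \mathbb{R}^{m \times r}, \quad A \in \mathbb{R}^{r \times n}

A \leftarrow A - \eta \, (B^\top B + \delta I_r)^{-1} \, \nabla_A \mathcal{L}

B \leftarrow B - \eta \, \nabla_B \mathcal{L} \, (A A^\top + \delta I_r)^{-1}
```

Both preconditioners are only $r \times r$, so the extra cost per step stays small relative to the forward and backward pass when the LoRA rank is low.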

Not the most recent paper, so I was pleasantly surprised my action surfaced this method as a candidate before implementing a PR. Even more surprised this method had not already been merged upstream.

Turns out, the author did try contributing to peft a couple years ago, but people get busy and the PR was closed after going stale.

So I decided to revive it! I opened an issue and soon after the author engaged to help land the feature. Now huggingface/peft #3382 is open, a joint effort with the paper's author.

This whole episode has me thinking about the future of OSS maintenance with AI coding. The software projects which endure will be well-shaped to quickly land and help test new ideas.

Across 30 forks, I've seen several papers land as clean PRs for multiple repos, which offers a perspective on how methods impact applications. Recent methods matching multiple frameworks: STARE, Entity Binding, BINEVAL

Get Outrider: https://github.com/remyxai/outrider

reacted to fffiloni's post with 🔥 3 days ago

Post

771

⏱️ Built a small Space for Visual Chronometer / Pulse of Motion.

Upload a video and estimate its Physical FPS: the frame rate implied by visual motion, independent of metadata.
Useful to inspect “chronometric hallucination” in generated videos: clips that look smooth, but move with the wrong physical time scale.

Try it here: fffiloni/Pulse-of-Motion

reacted to ginigen-ai's post with 🔥 3 days ago

Post

10354

🧠 Does your LLM know when it's about to be wrong?

Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉

The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005 — ~2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, pure random). A tiny base-frozen adapter recovers that signal.

Two independent axes (never compared across a row): ① trap_rate — does it fall for tempting trap options? (lower = stronger) ② adapter gain Δ — how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)
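The two quantities quoted above can be made concrete with a small sketch. The helper names and toy data are hypothetical and this is not the benchmark's scoring code; it only shows why an AUROC of 0.5 means the confidence signal carries no information about which answers are wrong.

```python
def trap_rate(fell_for_trap):
    """Fraction of trap problems where the model picked the trap option."""
    return sum(fell_for_trap) / len(fell_for_trap)


def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 2 misses in 400 multiple-choice traps
print(trap_rate([1] * 2 + [0] * 398))                        # 0.005

# label 1 = the answer was wrong; score = the model's stated doubt
is_wrong = [1, 0, 1, 0, 0, 1]
print(auroc([0.1] * 6, is_wrong))                            # 0.5: constant confidence separates nothing
print(auroc([0.9, 0.2, 0.8, 0.1, 0.3, 0.7], is_wrong))       # 1.0: doubt ranks every error above every hit
```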

What's open: 📊 300+100 trap problems (each with a hidden trap + TICOS type) 🏆 24-model leaderboard 🧩 11 per-model adapters — adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))

Submit any HF model → auto-scored daily at 09:00 KST and added to the board.

🏆 Leaderboard → ginigen-ai/Metacognition-Leaderboard-Space

📊 Benchmark → ginigen-ai/Metacognition-Bench

🧩 Adapters → FINAL-Bench/metacognition-adapters-6a42c032e6beb803dd032961

📊 Article → https://huggingface.co/blog/ginigen-ai/metacognition

Benchmark by ginigen-ai · Adapters by FINAL-Bench (Darwin/Chimera platform + AETHER metacognition tech).

11 replies

reacted to stas's post with 🔥 5 days ago

Post

3605

After many months of intense work the Snowflake AI Research team is happy to present to you the new open source project: Arctic RL

https://snowflake.com/en/blog/engineering/arctic-rl-open-source-backend/

- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%

4 replies

reacted to ginigen-ai's post with 🔥 5 days ago

Post

5175

🍳 The RoboCasa Kitchen Leaderboard
What does it take for a robot to handle kitchen chores the way a person does? It has to see (Vision), understand instructions (Language), and actually act (Action) — and VLA (Vision-Language-Action) models are emerging as the answer. They're the bridge between large multimodal models and real-world embodied control.

RoboCasa Kitchen is a leading robot-learning benchmark in which a single-arm robot (Franka Panda) performs 24 atomic manipulation tasks — picking up cups and bowls, opening drawers and doors, turning faucets, pressing buttons, and more — inside a photorealistic simulated kitchen. Because the layout and object placement are randomized every episode, it tests genuine generalization rather than memorized motions. The score (success rate, SR) is the average fraction of the 24 tasks completed as instructed, measured over multiple seeds so results aren't down to luck.

The catch: this benchmark has no official leaderboard, and protocols (number of demonstrations, evaluation setup) differ from paper to paper, leaving scores scattered. Lining the numbers up naively quickly turns into an apples-to-oranges comparison.

This leaderboard fixes that by collecting published scores with their sources and comparing only what is genuinely comparable. It's split into three tables:

🏆 Kitchen 24-task (matched) — head-to-head under identical conditions (per the RLDX-1 Technical Report). This is the core ranking you can actually trust.
➕ Other protocols — self-reported under different setups (e.g. fewer demos). Not directly comparable, so kept separate.
🤖 GR1-Tabletop — a different, humanoid-based variant suite, separated to avoid confusion.

Any researcher can submit their own model's score directly, and submissions are reviewed before they appear on the board. Every number links to its source paper, so you can verify it yourself.

👉 ginigen-ai/robocasa-kitchen-leaderboard

reacted to IliaLarchenko's post with 🔥 6 days ago

Post

138

I placed 🥈 2nd in the LeHome Challenge (ICRA 2026), and 🥇 1st of 62 teams in the first simulation round. Now I'm open-sourcing the full solution — code, tech report, and final weights.

The task: teach a cheap two-armed robot (SO-ARM101) to fold 4 garment types — long/short tops and pants. Garment category is hidden at eval. Round 1 in sim (auto-scored), round 2 on a real robot (jury-scored).

I trained a VLA policy with an RL loop on top. The key ideas:

🧠 The policy is its own value function. From the same forward pass that picks the next action chunk, cheap heads predict success probability, task completion %, garment type, and future keypoint distances + a Q-residual. Those become the advantage signal for RL — no separate critic.

🔁 A fully asynchronous RL loop coordinated only through the HF Hub: 1 trainer (H200) ships a fresh checkpoint ~every 40 min while N rollout workers (and a human doing teleop / DAgger corrections) collect data in parallel. Nobody waits — it uses the off-policy nature of the loop to the fullest.

📈 Binary success is too sparse, so I densify it into per-frame advantage via GAE — from objective keypoint checkpoints, the success-probability value baseline, and completion %.
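A minimal sketch of generalized advantage estimation (GAE) shows how sparse rewards plus a value baseline turn into a per-frame signal. The mapping of keypoint checkpoints to rewards and of the success-probability head to values is an assumption for illustration, as are the default `gamma` and `lam`; the author's implementation may differ.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation.

    rewards: per-frame rewards (e.g. 1.0 when a checkpoint is reached, else 0.0)
    values:  len(rewards) + 1 baseline predictions; the last one is the bootstrap value
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error at frame t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # exponentially weighted sum of future TD errors
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# hypothetical 3-frame episode that succeeds on the last frame
adv = gae(rewards=[0.0, 0.0, 1.0], values=[0.2, 0.5, 0.8, 0.0], gamma=1.0, lam=1.0)
```

With `gamma = lam = 1` each advantage reduces to the total future reward minus the baseline at that frame, so frames where the baseline already expected success receive a smaller advantage.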

🎛️ The RL combines AWR + RECAP. I also tune the inference knobs — execution length, playback speed, inpainting overlap, CFG scale, best-of-N — with a per-parameter Thompson-sampling bandit folded into rollout collection.
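The bandit idea can be illustrated with a Beta-Bernoulli Thompson sampler over the candidate values of a single knob. This is a sketch under assumptions: the class, the simulated success rates, and the choice of CFG scale as the example knob are hypothetical, and a per-parameter setup would presumably run one such bandit per knob.

```python
import random


class BetaBandit:
    """Thompson sampling over discrete arms with a Beta(1, 1) prior on success rate."""

    def __init__(self, arms, seed=0):
        self.rng = random.Random(seed)
        self.stats = {arm: [1, 1] for arm in arms}  # [successes + 1, failures + 1]

    def choose(self):
        # sample a plausible success rate per arm, then act greedily on the samples
        draws = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.stats.items()}
        return max(draws, key=draws.get)

    def update(self, arm, success):
        self.stats[arm][0 if success else 1] += 1


def simulate(true_rates, rounds=2000, seed=0):
    """Run the bandit against made-up per-arm success rates and count pulls."""
    bandit = BetaBandit(list(true_rates), seed=seed)
    env = random.Random(seed + 1)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, env.random() < true_rates[arm])
    return pulls


# hypothetical CFG-scale candidates with invented rollout success rates
pulls = simulate({1.0: 0.3, 2.0: 0.8, 4.0: 0.5})
```

Because the arm is chosen during ordinary rollout collection, exploration costs no separate tuning runs: weak settings are simply sampled less often as evidence accumulates.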

🔧 Round 2: I had only ~1 week and no access to the eval robot, so the pipeline was sim → my robot → their robot, leaning on heavy augmentation to make the policy more robust.

📝 Blog: https://ilialarchenko.com/projects/lehome2026
📄 Tech report: Learning to Fold: prizewinning solution at LeHome Challenge 2026 (1st place online, 2nd offline) (2606.27163)
🔧 Code: https://github.com/IliaLarchenko/lehome_solution
🤗 Sim policy: IliaLarchenko/lehome_sim
🤗 Real policy: IliaLarchenko/lehome_real
🌐 Challenge: https://lehome-challenge.com/

reacted to pankajpandey-dev's post with 🔥 7 days ago

Post

7808

🇮🇳 New in my Hindi LLM Series: Gemma-4 E4B, fine-tuned for Hindi — and it runs on your laptop's CPU.
I fine-tuned Google's new Gemma-4 E4B on ~10k Hindi instruction pairs (AI4Bharat: anudesh + dolly) using Unsloth + LoRA, on a single L4 GPU.
Then I ran an honest side-by-side eval: base Gemma-4 vs my fine-tune, across 25 Hindi prompts. The results were interesting 👇
✅ My fine-tune is more concise — ask for "3 tips" and it gives exactly 3. Base writes a 1,200-character essay.

✅ Pure native Hindi — base keeps slipping into English ("संतुलित आहार (Eat a Balanced Diet)", "तारा (Star)"). My fine-tune stays in clean Hindi.

✅ Tighter instruction-following — ask for a "short message" and it gives one, not a menu of options.
⚖️ And to be honest: base Gemma-4 is more detailed and comprehensive. I didn't build a "smarter" model — I built a focused, Hindi-native, edge-friendly one that runs as a 5GB GGUF (Q4) on CPU.
🔗 Try it:

Live demo (CPU): pankajpandey-dev/gemma-4-e4b-hindi-demo
GGUF (Ollama/llama.cpp): pankajpandey-dev/gemma-4-e4b-hindi-instruct-GGUF
16-bit model: pankajpandey-dev/gemma-4-e4b-hindi-instruct

Built with @unsloth · Data by @ai4bharat 🙏
#Hindi #LLM #Gemma #Unsloth #IndicNLP #GGUF

12 replies

reacted to Shrijanagain's post with 🔥 7 days ago

Post

174

🚀 Big News for the AI Community! 🔥

We’re excited to release NRS_QWEN_MYTHOS_1M — a powerful reasoning model built on Qwen 3.5 9B!
At SKT AI LABS, we’ve supercharged this 9B model with our proprietary Neural Reasoning System (NRS) to deliver next-level performance.

🔥 Why This Model is a Game-Changer:
✅ 100x Reasoning Capacity — Exceptional deep logical thinking and complex problem-solving
✅ 1 Million Token Context — Perfect for massive codebases, long documents, and multi-turn agentic workflows
✅ Advanced Thinking Mode — Native <think> tags for true step-by-step Chain-of-Thought reasoning
✅ Tool-Use Ready — Optimized for Python execution, Web Search, and self-correction
✅ Blazing Fast — Runs smoothly on consumer GPUs like RTX 3090/4090

Technical Highlights:

Base: Qwen 3.5 9B
Tuning: NRS-specific high-quality reasoning data
Context: 1M Tokens (YaRN Scaling)
License: NRS DOCS

Whether you’re a developer building coding agents, a researcher working with long-context data, or someone who loves powerful reasoning — this model is built for you.

👉 Try it now on Hugging Face:
SKT-NRS/NRS_QWEN_MYTHOS_1M

Drop a comment: What will you build with it first? 👇
#AI #OpenSource #LLM #Qwen #ReasoningModel #HuggingFace #NewModel #AICommunity

reacted to TravisMuhlestein's post with 🔥 9 days ago

Post

156

The conversation around AI agents is evolving.

We're moving beyond model capabilities and toward the infrastructure needed for agents to work together.

Over the past few weeks we've seen meaningful momentum around the foundational building blocks of the emerging agentic web.

Agent Name Service (ANS) is addressing identity and trust.
Agentic Resource Discovery (ARD) is helping standardize how agents discover resources and capabilities.

Together, these efforts represent something bigger than individual projects.

They point toward an ecosystem built on open, interoperable infrastructure rather than isolated implementations.

As builders, we'll likely spend the next few years solving challenges around identity, discovery, trust, interoperability, and governance—not just model performance.

It will be interesting to see how these efforts evolve—and where the community chooses to collaborate next.

Learn more:

🔗 Linux Foundation ANS: https://www.linuxfoundation.org/press/linux-foundation-announces-intent-to-launch-agent-name-service-to-establish-trusted-identity-infrastructure-for-ai-agents

🔗 Agentic Resource Discovery: https://developers.googleblog.com/announcing-the-agentic-resource-discovery-specification/

reacted to ManniX-ITA's post with 🔥 9 days ago

Post

201

🚀 Gemma-4-A4B 98e v7-coder cohort: loop-fixed re-release. Two 20.8B MoE coders (4B-active), fresh-map prunes of Gemma 4 26B-A4B, 30/128 experts dropped per layer. The headline isn't a benchmark: the agentic loop is gone at the weights, not papered over by the sampler.

🔧 How: at prune time we force-keep the 46 agentic_eog experts a loop-protection signal flags as load-bearing for clean multi-turn termination (+ shared-FFN α=1.2). Result: 0 loops across 48 seeds on every published tier.
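The force-keep step can be pictured as a budgeted selection where flagged experts are exempt from pruning. This is a hypothetical sketch: the importance scores, the function name, and the way forced experts are chosen are assumptions, not the omnimergekit code.

```python
def select_experts(scores, keep, force_keep):
    """Return sorted indices of experts to keep in one layer.

    scores:     per-expert importance (higher = more worth keeping)
    keep:       total number of experts to retain
    force_keep: indices that must survive regardless of score
    """
    forced = set(force_keep)
    if len(forced) > keep:
        raise ValueError("more forced experts than the keep budget")
    rest = sorted(
        (i for i in range(len(scores)) if i not in forced),
        key=lambda i: scores[i],
        reverse=True,
    )
    return sorted(forced | set(rest[: keep - len(forced)]))


# toy layer: expert 1 scores low but is flagged as load-bearing
kept = select_experts([0.9, 0.1, 0.8, 0.2, 0.7, 0.3], keep=4, force_keep=[1])
print(kept)  # [0, 1, 2, 4]

# same shape as the post: 128 experts, 30 dropped, so 98 kept
layer = select_experts([i / 128 for i in range(128)], keep=98, force_keep=[0, 1, 2])
```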

📊 Q6_K · llama.cpp · greedy · same host (from summary.json):

⚖️ v7-coder (fkbroad code3/lcb2) — balanced coder: LCB-med-55 98.18, HumanEval 98.17, HE+ 92.07, AIME 80.0, MATH-500 95.0, GSM8K 91, IFEval 92, MultiPL-E 89.7, ARC 92.2.

⚡ v7-coderx (code4/lcb3) — code-maximal: all-hard LCB-77 85.71 (cohort-best; 128e 79.22, v7-coder 84.42), HE+ 93.29, GSM8K 93, MATH-500 95.0, AIME 76.67. Whole budget on code.

🎯 Both land near GPQA ~51 — graduate science is the budget axis, neither is a science model. Pick v7-coder for the broad LCB-medium + HumanEval lead; v7-coderx for the all-hard slice and HE+.

🧪 The harness we used to prove the fix is now an omk tool: agentic-loop-harness replays a frozen agentic conversation across a sampler×seed matrix and reports a fail-rate per chat-template, so you can isolate a loop to one variable. Model-agnostic: works with any OpenAI-compatible server. The version we shared with Google: google/gemma-4-12B-it#41

📦 Each ships bf16 · GGUF (+ CD-* + imatrix + mmproj vision) · NVFP4A16 (~13 GB) · Ollama.
🔗 ManniX-ITA/gemma-4-A4B-98e-v7-coder-it (+ -it-GGUF, -NVFP4A16) · https://ollama.com/mannix/gemma4-98e-v7-coder
🔗 ManniX-ITA/gemma-4-A4B-98e-v7-coderx-it (+ -it-GGUF, -NVFP4A16) · https://ollama.com/mannix/gemma4-98e-v7-coderx
🔧 https://github.com/mann1x/omnimergekit/tree/main/tools/agentic-loop-harness

reacted to PeetPedro's post with 🔥 9 days ago

Post

158

hey, I'm doing some experimenting, looping around 🙂
---
**kompress-v6** *shipped* — trained on Claude Code agent patterns (bash output, file reads, stack traces, search results, JSON tool responses). 3k synthetic pairs + 2k existing, fine-tuned from v4, $0.20 on vast.ai.

Results:
heretic exact_pct 0.962 (v4: 0.967),
keep_rate 0.854 (v4: 0.823),
override delta 0.
Model got more conservative — higher keep_rate on structured technical content.
Real proxy:
v4 compressed 9.5%,
v6 compressed 4.2% on the same session.
Less aggressive, fewer must-keep tokens dropped on paths and identifiers.

Interesting failure: self-labeling with v4+override collapsed mk_in_ref to 0.652.
TokenExpiredError splits into Token+Expired+Error — subtokens that don't individually match the must-keep regex, so the force-keep never fires. Generator references (mk_in_ref=1.0 by construction) ended up being better labels than v4's compressed output for agent data.
Fix for next run: slide a 2-3 subtoken window instead of checking individual subtokens. Would let self-labeling work on agent content and potentially produce a more compression-aggressive v7.
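The proposed fix can be sketched as a sliding window over adjacent subtokens. The regex below is a hypothetical stand-in for the real must-keep pattern, and the subtoken split is hand-written for the example rather than produced by the actual tokenizer.

```python
import re

# hypothetical must-keep pattern: CamelCase identifiers ending in "Error"
MUST_KEEP = re.compile(r"^[A-Z][A-Za-z]+Error$")


def must_keep_mask(subtokens, max_window=3):
    """Mark every subtoken that is part of a window whose joined text matches MUST_KEEP."""
    keep = [False] * len(subtokens)
    for size in range(1, max_window + 1):
        for start in range(len(subtokens) - size + 1):
            if MUST_KEEP.match("".join(subtokens[start:start + size])):
                for i in range(start, start + size):
                    keep[i] = True
    return keep


subtokens = ["raise", " ", "Token", "Expired", "Error", " ", "now"]
print(must_keep_mask(subtokens, max_window=1))  # nothing fires on single subtokens
print(must_keep_mask(subtokens, max_window=3))  # Token, Expired, Error are all kept
```

A window of 3 covers this identifier; longer names would need a larger window or a merge back to whitespace-delimited words before matching.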

Models on HF:
- PeetPedro/kompress-v6
- PeetPedro/kompress-v4
- PeetPedro/kompress-v3
Write-up: https://pocoo.vaked.dev/posts/2026-06-25-kompress-v6-agent-distribution

commented a paper 10 days ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Paper • 2606.19236 • Published 18 days ago • 13 •

reacted to Anran-MLLM's post with 🔥 15 days ago

Post

3560

🚀 Introducing PerceptionDLM — the first multimodal diffusion LLM for parallel region perception!

Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩

✨ Highlights
• ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency
• 🏆 PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs)
• 📊 New benchmark: ParaDLC-Bench — jointly evaluates caption quality AND inference efficiency
• 🔓 Code, models & benchmark all open-sourced

🤖 Models
MSALab/PerceptionDLM-Base
MSALab/PerceptionDLM

📊 Benchmark
MSALab/ParaDLC-Bench

📄 Paper: PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models (2606.19534)
💻 Code: https://github.com/MSALab-PKU/PerceptionDLM

Diffusion LLMs aren't just for text — they unlock efficient, parallel visual perception. 👁️✨

#multimodal #diffusion #VLM #perception

reacted to AxionLab-official's post with 🔥 15 days ago

Post

3406

# An Open Letter from SupraLabs.

Over the past few days, SupraLabs has been mentioned in a public discussion regarding small language models, scaling laws, and training methodology. We'd like to clarify our position.

Before anything else, we want to make one thing absolutely clear: we have great respect for Lane and the work being done at Glint Research. At no point was our intention to disrespect Lane, Glint Research, or their research. What began as a technical discussion about model scaling and training methodology unfortunately became much more personal than we ever intended. From our perspective, it was simply an exchange of technical opinions, and we sincerely hope it remains that way.
We'd also like to acknowledge that one of our own comments during the discussion was poorly worded. Referring to a benchmark as "fake" was imprecise. What we intended to criticize was the comparison methodology, not the integrity of the evaluation itself. Comparing a merged checkpoint against a single checkpoint is, in our view, not an apples-to-apples comparison.

That said, this was never the core of the discussion.

Our disagreement was not about SLERP, model merging, or whether training a small model on massive amounts of data is an interesting research direction. We support experimentation and unconventional ideas.

The actual point of disagreement was much simpler.

The statement that a 1M parameter model trained on 1 trillion tokens will become a "100M killer" is, today, a prediction, not an experimental result.
Could it happen? Perhaps.
Would it be exciting if it did? Absolutely.

But until benchmark results, reproducible evaluations, and independent validation exist, we believe such statements should be presented as hypotheses rather than established conclusions.
Research advances by testing ideas, not by assuming their outcomes.

We sincerely wish Lane and everyone at Glint Research success in their experiments.

Thank you to everyone who read it.

1 reply

reacted to Sajib-006's post with 🔥 15 days ago

Post

146

Excited to share our paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning

A common assumption in test-time reasoning is that giving a model more chances to think or verify should improve performance. Our results show that this is only partly true.

We introduce SEVRA, a serving-layer controller that decides when a frozen reasoning model should keep its initial answer and when it should actively verify it. Instead of treating verification as always useful, SEVRA asks a more deployment-focused question:

Is this specific attempt likely recoverable by verification?

We evaluate this through helpful fixes, harmful flips, extra calls, and realized token cost.

Some key takeaways:

* Selective verification improves over always verifying on MATH500 while reducing harmful flips.
* On GSM8K, the controller verifies only a small fraction of examples but still improves accuracy.
* However, a longer initial solve can sometimes match selective verification with fewer realized tokens.
* Cheap serving-visible features, such as completion status, token count, and finalizer use, nearly match larger learned gates.
* On CommonsenseQA, always-on verification hurts, showing that the best test-time compute action is workload-dependent.
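To make the "cheap serving-visible features" point concrete, here is a hand-written gate over the three features named above. Everything in it is an assumption for illustration: the `Attempt` fields, the rule, and the 0.9 threshold are hypothetical, and SEVRA's actual controller may be learned or structured differently.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    completed: bool        # did generation finish before hitting the token budget?
    tokens_used: int       # tokens spent on the initial solve
    used_finalizer: bool   # was a forced "give the final answer" step needed?


def should_verify(attempt, token_budget, near_budget=0.9):
    """Spend a verification call only on attempts that look recoverable-but-shaky."""
    if not attempt.completed or attempt.used_finalizer:
        return True
    return attempt.tokens_used >= near_budget * token_budget


clean = Attempt(completed=True, tokens_used=300, used_finalizer=False)
truncated = Attempt(completed=False, tokens_used=1024, used_finalizer=True)
print(should_verify(clean, token_budget=1024))      # False: keep the initial answer
print(should_verify(truncated, token_budget=1024))  # True: worth a second look
```

A gate like this adds no model calls of its own, which is what makes it comparable on realized token cost with simply giving the initial solve a longer budget.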

The main deployment lesson is simple:

Tune the initial reasoning budget first. Then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.

Paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning (2606.19808)
Code: https://github.com/Sajib-006/SEVRA
Replay dashboard: sevra-space/sevra-replay

Would love feedback from the community, especially on broader test-time compute allocation, risk-aware verification, and practical serving policies for reasoning models.

reacted to dronefreak's post with 🔥 15 days ago

Post

3089

Excited to open-source the VisDrone Aerial Object Detection Model Zoo on Hugging Face.

The collection includes multiple YOLO variants trained and evaluated on the VisDrone benchmark for aerial object detection, with accompanying documentation and performance metrics.

If you're working on drones, aerial surveillance, robotics, or small-object detection, I hope these models save you some time.

Model Zoo: https://huggingface.co/collections/dronefreak/visdrone-detection-model-zoo

Feedback, issues, and contributions are welcome.

12 replies

reacted to RiverRider's post with 🔥 16 days ago

Post

218

SRT Showcase: Watch a Frozen Model Think, Token by Token

A frozen Qwen-2.5-7B now narrates its own interpretation in real time. SRT Showcase is the most complete public demonstration of computational semiotics to date, running the backbone with the SRT Adapter and Activation Verbalizer. As the model generates, every token is tinted by its predictive effort, and at the highest-effort positions the Verbalizer decodes the hidden state directly into natural language. You see what the model is representing at the exact moment its computation is most active.

Every verbalization is validated, not asserted. Each decoded thought is re-encoded and compared back to the original hidden state, and the reconstruction closely approximates it. The "this is what the model was thinking" claim carries its own fidelity badge. This is grounded introspection, not plausible narration.

The Showcase goes further than the trace. An A/B panel runs the same prompt with SRT injection on and off under an identical seed, so the side-channel's effect is directly observable. A curated gallery walks through confident recall, false premises, misconceptions, reasoning pivots, genuine uncertainty, and safety boundaries. Live entropy and divergence meters track the crystallization process token by token, with per-layer traces and reflexivity estimates on hover.

None of the backbone weights are touched. The entire mechanism is a lightweight reflexive layer over a frozen model, which is why the same read-out heads already port from Qwen-2.5-7B up to a 235B Mixture of Experts. Frozen models can now be verbalized in real time. No retraining. No fine-tuning. No black box.

First request is a brief cold start while ZeroGPU acquires a GPU. Bring your own prompt.

Try it: RiverRider/srt-showcase

Repository: https://github.com/space-bacon/SRT

reacted to owensong's post with 🔥 16 days ago

Post

6510

I just released Inflect-Nano-v1, an ultra-small 4.63M-parameter text-to-speech model.

The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.

Quick facts:
- 4.63M total inference parameters
- 3.46M acoustic model
- 1.17M vocoder
- 24 kHz audio
- English-only
- Single male voice
- Runs locally with a simple PyTorch inference script

Why I made it:
Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.

It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.

What works:
It can generate arbitrary English speech locally, and the model is small enough to be interesting for:

- local voice assistants
- embedded/edge experiments
- browser or WASM-style TTS exploration
- efficient inference research
- tiny-model baselines

Limitations:
The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.

So I would frame this as a research/demo release, not a production TTS engine.

I’d love feedback from people interested in:
- tiny speech models
- vocoders
- local TTS
- efficient inference
- embedded speech synthesis
- improving small-model generalization

If people find it useful, I’m interested in putting more training budget into a stronger v2.

Model page:
owensong/Inflect-Nano-v1
