Space: MikeyBeez/HyperPEER


Apply for a GPU community grant: Personal project

by MikeyBeez - opened 11 days ago

Discussion

MikeyBeez

Owner 11 days ago

Community GPU Grant Request - Compressing Gemma-4-26B to run on consumer hardware

Hello Hugging Face team,

I'm requesting a community GPU grant for an open research project that sits squarely within the democratization mission behind ZeroGPU: taking a frontier-scale model you host, google/gemma-4-26B-A4B, and compressing it into a much smaller student that reproduces its behavior on a single consumer GPU. The entire recipe, the code, and the resulting model will be released openly on the Hub.

The method, in one line: freeze the teacher's attention and embeddings, replace its mixture-of-experts feed-forward layers with a small hypernetwork that generates the experts on demand instead of storing them, and train it by feature distillation to match the teacher's per-layer behavior. The stored footprint becomes the size of the generator, not the full expert pool, which is what makes a 26B model fit on a desktop.
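To make the "size of the generator, not the full expert pool" point concrete, here is a rough parameter-count sketch. It is not the project's code: the chunked hypernetwork design, the `MoEShape` class, and every dimension below are hypothetical values chosen for illustration, and they are not Gemma's real sizes or a claimed compression ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEShape:
    d_model: int    # hidden size of the transformer (illustrative)
    d_ff: int       # inner size of one expert's feed-forward block (illustrative)
    n_experts: int  # experts per MoE layer (illustrative)


def stored_expert_params(s: MoEShape) -> int:
    """Parameters when every expert keeps its own up- and down-projection."""
    per_expert = 2 * s.d_model * s.d_ff
    return s.n_experts * per_expert


def generated_expert_params(s: MoEShape, embed_dim: int, hidden: int, chunk: int) -> int:
    """Parameters for a hypothetical chunked hypernetwork.

    Assumed design: each expert and each weight chunk has a learned embedding,
    and a shared two-layer MLP maps (expert embedding + chunk embedding) to
    `chunk` weights. Only the embeddings and the MLP are stored.
    """
    per_expert = 2 * s.d_model * s.d_ff
    n_chunks = -(-per_expert // chunk)  # ceiling division
    embeddings = (s.n_experts + n_chunks) * embed_dim
    mlp = embed_dim * hidden + hidden * chunk  # biases ignored
    return embeddings + mlp


shape = MoEShape(d_model=1024, d_ff=2048, n_experts=32)
stored = stored_expert_params(shape)
generated = generated_expert_params(shape, embed_dim=64, hidden=256, chunk=4096)
print(f"stored: {stored:,}  generated: {generated:,}  ratio: {stored / generated:.1f}x")
```

The stored count grows with the number of experts, while the generator's count grows only by one embedding per extra expert. Whether quality survives that trade is an empirical question, and the sketch says nothing about it.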

This is not speculative: the method is already validated end-to-end on a 3B testbed (StarCoder2-3b), all on a single 16 GB consumer card:

- The generated-expert student matches a version that stores its experts outright (held-out perplexity 25.9 vs 26.2 at convergence). Generating the experts costs no quality versus storing them.
- A larger hypernetwork generating smaller experts works best, and a more diverse training corpus gives a real, measurable improvement.
- The compressed student runs in about 2.85 GB of VRAM, under half the teacher's, with a third of the parameters.
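For a sense of how close 25.9 and 26.2 are, the quick calculation below converts the two perplexities into a per-token loss gap. It assumes the standard definition of perplexity as the exponential of the mean negative log-likelihood in nats; the variable names are only for illustration.

```python
import math

ppl_a, ppl_b = 25.9, 26.2  # the two held-out perplexities quoted above

# Perplexity is exp(mean negative log-likelihood per token), so the
# difference in mean loss is the log of the perplexity ratio.
loss_gap = math.log(ppl_b / ppl_a)
relative_gap = (ppl_b - ppl_a) / ppl_a

print(f"loss gap: {loss_gap:.4f} nats/token, relative perplexity gap: {relative_gap:.2%}")
```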

The only remaining step is to run the same working recipe on the real teacher, Gemma-4-26B. The payoff is a frontier-quality model compressed to a footprint that fits a single consumer GPU; the success criterion is that it outperforms other models of its size. Our blocker is purely compute: a 26B teacher does not fit on our 16 GB card for an efficient capture-and-train pass, and on Blackwell consumer hardware the usual 4-bit shortcuts fail, forcing slow layer streaming.
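The back-of-the-envelope arithmetic behind that blocker can be sketched as follows. It counts weights only (no activations, KV cache, or optimizer state), treats "26B" as a nominal 26e9 parameters, and treats the 16 GB card as 16 GiB; all three are simplifying assumptions.

```python
def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


CARD_GIB = 16     # assumed usable VRAM on the consumer card
N_PARAMS = 26e9   # nominal "26B"; the exact parameter count is an assumption

for bits in (16, 8, 4):
    need = weight_gib(N_PARAMS, bits)
    verdict = "fits" if need < CARD_GIB else "does not fit"
    print(f"{bits:>2}-bit weights: {need:5.1f} GiB -> {verdict} in {CARD_GIB} GiB")
```

Under these assumptions only the 4-bit case leaves the teacher resident on the card, which is why losing the 4-bit path pushes the work to layer streaming.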

The ask: an A100 (80 GB) grant large enough to run the capture and training (we estimate well under 100 A100-hours in total) and, if possible, a small persistent A100 Space so we can host a public, live demo of the compressed Gemma student for the community to try directly.

Everything will be released openly: the compressed Gemma-4-26B student on the Hub, the full capture-and-train code and recipe, reproducible on a single GPU, and a clear writeup of the method and results. It is the "make big models runnable on small hardware" story, end to end, built directly on a model you host. Thank you for considering it.

Mikey Bee (Hugging Face: MikeyBeez)

MikeyBeez

Owner 6 days ago

Or we could distill GL5.2!
