import gradio as gr

ABOUT = """
# HyperPEER

**Compressing a large model into a small student that runs on a single 16 GB consumer GPU** by replacing each transformer layer's feed-forward / mixture-of-experts block with a *hypernetwork that generates a per-token low-rank expert*, instead of storing a full expert bank. Attention and embeddings are inherited and frozen; only the generator is trained, by feature distillation against the teacher's per-layer outputs.

The footprint becomes the size of the generator, not the size of everything it can generate.

Proof-of-concept target: **google/gemma-4-26B-A4B**, compressed to run on one consumer card.

## Validated on a 3B testbed (single 16 GB card)

- Generating experts costs **no quality** versus storing them: held-out perplexity **25.9** (generated) vs **26.2** (stored) at convergence.
- A **larger hypernetwork making smaller experts wins**: capacity has to live in the generator.
- **Feature distillation** (matching each block's output to the teacher's) beats next-token prediction and logit-KL.
- Runs in about **2.85 GB of VRAM**, under half the teacher's, at a third of the parameters.

## What's in this repo

- `gemma/`: the Gemma-4-26B capture + layer-local distillation pipeline.
- `testbed/`: the 3B validation code and result JSONs.
- `PHASE1_REPORT.md`, `PHASE2_PLAN.md`: the report and the full plan.

The recipe is validated end to end; the remaining step is the Gemma-4-26B run. The blocker is purely compute. Everything will be released openly.

-- Mikey Bee
"""

with gr.Blocks(title="HyperPEER") as demo:
    gr.Markdown(ABOUT)

if __name__ == "__main__":
    demo.launch()
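
The claim that the footprint becomes the size of the generator rather than the size of everything it can generate can be illustrated with a rough parameter count. This is only a sketch under stated assumptions, not the repository's actual architecture: the function names, the two-layer generator shape, and all dimensions below are hypothetical and chosen purely for illustration.

```python
# Hypothetical parameter-count comparison: stored expert bank vs. a
# hypernetwork that emits low-rank expert factors per token.
# All shapes and sizes are illustrative assumptions, not HyperPEER's real ones.


def stored_bank_params(d_model: int, d_ff: int, n_experts: int) -> int:
    """Parameters in a stored bank: each expert has an up-projection
    (d_model x d_ff) and a down-projection (d_ff x d_model). Biases and
    the router are ignored for simplicity."""
    return n_experts * 2 * d_model * d_ff


def generator_params(d_model: int, d_gen: int, rank: int) -> int:
    """Parameters in an assumed two-layer generator: d_model -> d_gen,
    then d_gen -> the two low-rank factors of one expert
    (rank x d_model and d_model x rank). Biases are ignored."""
    first_layer = d_model * d_gen
    factor_outputs = 2 * d_model * rank  # entries in both factors
    second_layer = d_gen * factor_outputs
    return first_layer + second_layer


if __name__ == "__main__":
    bank = stored_bank_params(d_model=2048, d_ff=8192, n_experts=8)
    gen = generator_params(d_model=2048, d_gen=1024, rank=16)
    print(f"stored bank:  {bank:,} parameters")
    print(f"generator:    {gen:,} parameters")
    print(f"ratio:        {bank / gen:.2f}x")
```

In this toy setting the stored cost grows with the number of experts, while the generator cost depends only on its own width and the expert rank, which is the trade-off the description above relies on.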