HyperPEER / app.py

MikeyBeez

Add HyperPEER pipeline, testbed code, results, docs, Gradio landing

e41a3a4 verified 16 days ago

	import gradio as gr

	ABOUT = """
	# HyperPEER

	HyperPEER compresses a large model into a small student that runs on a single 16 GB consumer GPU. It replaces each transformer layer's feed-forward / mixture-of-experts block with a hypernetwork that generates a per-token low-rank expert, instead of storing a full expert bank. Attention and embeddings are inherited from the teacher and frozen; only the generator is trained, by feature distillation against the teacher's per-layer outputs.

	The footprint becomes the size of the generator, not the size of everything it can generate.

	Proof-of-concept target: google/gemma-4-26B-A4B, compressed to run on one consumer card.

	## Validated on a 3B testbed (single 16 GB card)

	- Generating experts costs no quality versus storing them: held-out perplexity 25.9 (generated) vs 26.2 (stored) at convergence.
	- A larger hypernetwork making smaller experts wins — capacity has to live in the generator.
	- Feature distillation (matching each block's output to the teacher's) beats next-token prediction and logit-KL.
	- Runs in about 2.85 GB of VRAM, under half the teacher's, at a third of the parameters.

	## What's in this repo

	- `gemma/` — the Gemma-4-26B capture + layer-local distillation pipeline.
	- `testbed/` — the 3B validation code and result JSONs.
	- `PHASE1_REPORT.md`, `PHASE2_PLAN.md` — the report and the full plan.

	The recipe is validated end to end; the remaining step is the Gemma-4-26B run. The blocker is purely compute. Everything will be released openly.

	— Mikey Bee
	"""

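	# Build a single-page Gradio UI that renders the ABOUT text as Markdown.
	with gr.Blocks(title="HyperPEER") as demo:
	    gr.Markdown(ABOUT)

	# Start the server only when this file is run as a script.
	if __name__ == "__main__":
	    demo.launch()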
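
The landing text describes a hypernetwork that generates a per-token low-rank expert, trained by matching each block's output to the teacher's. The following is a minimal NumPy sketch of that idea, not the code in `gemma/` or `testbed/`. The layer sizes, the tanh and ReLU nonlinearities, the mean-squared-error loss, and every function name here are assumptions made for illustration, and the "teacher" output is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
d_model, rank, d_hyper = 16, 4, 32


def init_generator(d_model, rank, d_hyper, rng):
    """Weights of a tiny hypernetwork: token -> hidden -> flattened low-rank factors."""
    n_out = 2 * d_model * rank  # entries of A (rank x d_model) plus B (d_model x rank)
    return {
        "w_in": rng.normal(0.0, d_model ** -0.5, (d_model, d_hyper)),
        "w_out": rng.normal(0.0, d_hyper ** -0.5, (d_hyper, n_out)),
    }


def generated_expert(params, x, rank):
    """Apply a per-token low-rank expert whose weights are generated from x.

    x has shape (tokens, d_model); the result has the same shape.
    """
    tokens, d_model = x.shape
    h = np.tanh(x @ params["w_in"])   # (tokens, d_hyper)
    flat = h @ params["w_out"]        # (tokens, 2 * d_model * rank)
    # Split the generated vector into a down-projection A and an up-projection B.
    a = flat[:, : d_model * rank].reshape(tokens, rank, d_model)
    b = flat[:, d_model * rank:].reshape(tokens, d_model, rank)
    # Each token passes through its own rank-r bottleneck with a ReLU (assumed).
    z = np.maximum(np.einsum("trd,td->tr", a, x), 0.0)
    return np.einsum("tdr,tr->td", b, z)


def feature_distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher block outputs (assumed loss)."""
    return float(np.mean((student_out - teacher_out) ** 2))


params = init_generator(d_model, rank, d_hyper, rng)
x = rng.normal(size=(5, d_model))            # 5 token activations entering the block
teacher_out = rng.normal(size=(5, d_model))  # stand-in for a captured teacher output
student_out = generated_expert(params, x, rank)
loss = feature_distillation_loss(student_out, teacher_out)

# Only the generator's weights are stored; the per-token experts are never materialised.
generator_params = sum(w.size for w in params.values())
print(student_out.shape, generator_params)
```

In this sketch the stored parameter count depends only on the generator's two weight matrices, which is the sense in which the footprint is the size of the generator rather than of everything it can generate. A real training loop would backpropagate the loss into the generator while keeping attention and embeddings frozen; that step is omitted here.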