
TOk-Atsuru

Upload README.md with huggingface_hub

639bfd9 verified 4 days ago

metadata

title: GOBA-AI-Labs
emoji: 🧠
colorFrom: blue
colorTo: purple

GOBA-AI-Labs

Making large AI models accessible on consumer hardware.

We develop open-source tools for compressing Mixture-of-Experts (MoE) AI models. Our expert pruning technology reduces model sizes by 50-90% while preserving quality — enabling 400B+ parameter models to run on laptops with 24GB RAM.

PrunedHub Models

Calibration-based expert pruning with no retraining. The pruned models work as drop-in replacements in llama.cpp.

Model	Base	Size	Quality	Highlights
PrunedHub GPT-OSS-20B-28x	GPT-OSS-20B	10.4 GB	MMLU 78% (lossless)	Zero quality loss, fits 16GB RAM
PrunedHub GPT-OSS-20B-27x-Zerobias	GPT-OSS-20B	~9.4 GB	MMLU 77% (-1pp)	Experimental router optimization
PrunedHub Qwen3-30B-A3B-JP-80pct	Qwen3-30B-A3B	14.0 GB	MMLU 79% (think-ON)	Language-aware pruning, Japanese quality preserved
PrunedHub Qwen3-Coder-Next-50pct	Qwen3-Coder-Next	24.4 GB	MMLU 72%	80B model in 24GB, outperforms Q2 quantization

Our Approach

Traditional model compression relies on aggressive quantization, which degrades all computations uniformly. Our expert pruning takes a fundamentally different approach — removing entire redundant computation paths from MoE models while keeping the remaining experts at full precision.
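
To make the idea of removing whole computation paths concrete, here is a minimal sketch of a toy MoE layer after pruning. It is an illustration under stated assumptions, not the PrunedHub implementation: the function names `prune_layer` and `route_top_k`, the dictionary layout, and the softmax renormalization over the selected experts are all hypothetical choices made for this example.

```python
import math

def prune_layer(expert_weights, kept_experts):
    """Drop whole experts; the survivors' weights are returned unchanged."""
    return {e: expert_weights[e] for e in kept_experts}

def route_top_k(router_logits, kept_experts, k):
    """Pick the top-k experts among those kept and renormalize their gate weights."""
    # Only experts that survived pruning are routing candidates.
    candidates = sorted(kept_experts, key=lambda e: router_logits[e], reverse=True)[:k]
    # Softmax over the selected logits (subtract the max for numerical stability).
    peak = max(router_logits[e] for e in candidates)
    exps = {e: math.exp(router_logits[e] - peak) for e in candidates}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}

# Toy layer with four experts; expert 1 is assumed to have been judged redundant.
expert_weights = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6], 3: [0.7, 0.8]}
kept = [0, 2, 3]
pruned = prune_layer(expert_weights, kept)

# Router logits for one token; the pruned expert is never selected.
gates = route_top_k([2.0, 0.5, 1.0, -1.0], kept, k=2)
```

In this sketch the surviving experts keep their original values, which is the contrast with quantization drawn above: nothing that remains is approximated, and only the routing choices shrink.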

Calibration-based importance scoring — Expert importance measured through actual inference behavior, not static weight analysis
Layer-adaptive expert allocation — Each layer retains a dynamically determined number of experts based on its contribution to quality
Language-aware optimization — Automatic detection and protection of language-specialized experts
Zerobias router optimization — Post-pruning router bias correction that extends the lossless compression frontier
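
The first two items above can be sketched roughly as follows. This is a simplified illustration, not the actual PrunedHub scoring method: it assumes importance is approximated by the gate weight each expert accumulates over calibration tokens, and that each layer keeps the smallest set of experts covering a chosen fraction of that total. The names `score_experts` and `select_experts`, the `coverage` and `min_keep` parameters, and the trace format are hypothetical. Language-aware protection and Zerobias correction are not modeled here.

```python
def score_experts(routing_trace, num_layers, num_experts):
    """Accumulate gate weight per (layer, expert) over calibration tokens.

    routing_trace: iterable of (layer, expert, gate_weight) tuples, assumed to
    be recorded while running calibration prompts through the unpruned model.
    """
    scores = [[0.0] * num_experts for _ in range(num_layers)]
    for layer, expert, gate_weight in routing_trace:
        scores[layer][expert] += gate_weight
    return scores

def select_experts(layer_scores, coverage=0.9, min_keep=1):
    """Keep the highest-scoring experts until they cover `coverage` of the layer total.

    If a layer has no recorded score at all, every expert is kept.
    """
    order = sorted(range(len(layer_scores)), key=lambda e: layer_scores[e], reverse=True)
    total = sum(layer_scores)
    kept, covered = [], 0.0
    for e in order:
        if len(kept) >= min_keep and total > 0 and covered / total >= coverage:
            break
        kept.append(e)
        covered += layer_scores[e]
    return sorted(kept)

# Hypothetical calibration trace for a model with 2 layers and 4 experts per layer.
trace = [
    (0, 0, 0.7), (0, 1, 0.3), (0, 0, 0.6), (0, 2, 0.4),
    (1, 0, 0.5), (1, 1, 0.5), (1, 2, 0.5), (1, 3, 0.5),
]
scores = score_experts(trace, num_layers=2, num_experts=4)
kept_per_layer = [select_experts(s, coverage=0.8) for s in scores]
```

With this trace, layer 0 concentrates its routing on two experts and keeps only those, while layer 1 spreads load evenly and keeps all four. That difference is what a layer-adaptive allocation means in this simplified setting.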
