metadata
title: GOBA-AI-Labs
emoji: 🧠
colorFrom: blue
colorTo: purple

GOBA-AI-Labs

Making large AI models accessible on consumer hardware.

We develop open-source tools for compressing Mixture-of-Experts (MoE) AI models. Our expert pruning technology reduces model sizes by 50-90% while preserving quality — enabling 400B+ parameter models to run on laptops with 24GB RAM.

PrunedHub Models

Calibration-based expert pruning with zero retraining. The pruned models are drop-in replacements for their base models in llama.cpp.

| Model | Base | Size | Quality | Highlights |
|---|---|---|---|---|
| PrunedHub GPT-OSS-20B-28x | GPT-OSS-20B | 10.4 GB | MMLU 78% (lossless) | Zero quality loss, fits 16GB RAM |
| PrunedHub GPT-OSS-20B-27x-Zerobias | GPT-OSS-20B | ~9.4 GB | MMLU 77% (-1pp) | Experimental router optimization |
| PrunedHub Qwen3-30B-A3B-JP-80pct | Qwen3-30B-A3B | 14.0 GB | MMLU 79% (think-ON) | Language-aware pruning, Japanese quality preserved |
| PrunedHub Qwen3-Coder-Next-50pct | Qwen3-Coder-Next | 24.4 GB | MMLU 72% | 80B model in 24GB, outperforms Q2 quantization |

Our Approach

Traditional model compression relies on aggressive quantization, which degrades all computations uniformly. Our expert pruning takes a fundamentally different approach — removing entire redundant computation paths from MoE models while keeping the remaining experts at full precision.
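To make the contrast with quantization concrete, here is a minimal sketch of what removing experts from a single MoE layer can look like. The `MoELayer` type and `prune_experts` function are hypothetical names used only for illustration, and the toy lists stand in for real weight tensors; this is not the project's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MoELayer:
    """Toy stand-in for one MoE layer (hypothetical structure)."""
    router: list[list[float]]   # one row of router weights per expert
    experts: list[list[float]]  # placeholder for each expert's weights


def prune_experts(layer: MoELayer, keep: list[int]) -> MoELayer:
    """Return a layer containing only the experts listed in `keep`.

    The router rows are sliced with the same indices so the router can
    only select experts that still exist. Kept weights are copied
    unchanged, so their precision is untouched.
    """
    keep = sorted(set(keep))
    return MoELayer(
        router=[layer.router[i] for i in keep],
        experts=[layer.experts[i] for i in keep],
    )


layer = MoELayer(
    router=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
    experts=[[1.0], [2.0], [3.0], [4.0]],
)
pruned = prune_experts(layer, keep=[0, 2])
print(len(pruned.experts))  # 2 of the 4 experts remain
```

The point of the sketch is that size is reduced by dropping whole experts, while every value that survives is identical to the original.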

  • Calibration-based importance scoring — Expert importance measured through actual inference behavior, not static weight analysis
  • Layer-adaptive expert allocation — Each layer retains a dynamically determined number of experts based on its contribution to quality
  • Language-aware optimization — Automatic detection and protection of language-specialized experts
  • Zerobias router optimization — Post-pruning router bias correction that extends the lossless compression frontier
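
The first two ideas in the list above can be sketched as follows. This is a simplified illustration under stated assumptions: importance is taken to be the summed router gate weight each expert receives on a calibration set, and each layer keeps the smallest set of experts covering a fixed share of that total. The function names, the scoring rule, and the `coverage` parameter are all hypothetical and may differ from what the project does. Language-aware protection and Zerobias correction are not covered here.

```python
from collections import defaultdict


def score_experts(routing_log):
    """Accumulate per-layer, per-expert importance from calibration runs.

    `routing_log` is an iterable of (layer, expert, gate_weight) tuples
    recorded while running calibration inputs (assumed format).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for layer, expert, gate in routing_log:
        scores[layer][expert] += gate
    return scores


def select_experts(layer_scores, coverage=0.9):
    """Keep the fewest experts whose scores reach `coverage` of the total.

    Layers with concentrated routing keep few experts; layers with
    evenly spread routing keep many, so the count varies per layer.
    """
    total = sum(layer_scores.values())
    ranked = sorted(layer_scores.items(), key=lambda kv: kv[1], reverse=True)
    kept, running = [], 0.0
    for expert, score in ranked:
        kept.append(expert)
        running += score
        if running >= coverage * total:
            break
    return sorted(kept)


# Toy calibration log: layer 0 routes mostly to expert 0,
# layer 1 spreads its routing evenly over four experts.
log = [
    (0, 0, 1.0), (0, 1, 0.0625), (0, 0, 0.75), (0, 2, 0.1875),
    (1, 0, 0.5), (1, 1, 0.5), (1, 2, 0.5), (1, 3, 0.5),
]
keep = {layer: select_experts(s) for layer, s in score_experts(log).items()}
print(keep)  # {0: [0, 2], 1: [0, 1, 2, 3]}
```

In this toy run, layer 0 retains two experts and layer 1 retains all four, which shows how a per-layer criterion yields different expert counts for different layers.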

Links