Rotary Loading Demo: Running Large MoE Models on Consumer 8GB GPUs

This repository documents independent research measurements of running large Mixture-of-Experts (MoE) language models on consumer-grade single GPUs, including a laptop RTX 4060 with 8GB VRAM.

The underlying loading method is Rotary Loading, a per-token cyclic rotation of accelerator memory residency. The method itself is disclosed under Korean patent KIPO 10-2026-0070380 (publication: 2026-05-19; filing: 2026-04-30). This repository does not republish the mechanism; readers are referred to the public patent gazette.


Headline measurements

| Model | Hardware | Decode speed |
| --- | --- | --- |
| Qwen 3.6 35B-A3B (multimodal MoE) | RTX 4060 Laptop, 8GB VRAM | 22.7 tok/s mean |
| Gemma 4 26B-A4B (MoE) | RTX 4060 Laptop, 8GB VRAM | 40.16 tok/s mean |
| DeepSeek V4-Flash 148GB FP4 MoE | NVIDIA H200 141GB, single accelerator | 5.124 to 5.751 tok/s |

Cloud-pod measurements above (H200 line) and a per-cycle log are mirrored in the KIPO gazette paragraph [0089].
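For readers who think in latency rather than throughput, the headline rates convert directly to average time per decoded token. The snippet below is a small convenience sketch; the helper name `ms_per_token` and the dictionary labels are illustrative and not part of any published tooling.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a decode rate in tok/s to average latency per token in milliseconds."""
    if tokens_per_second <= 0:
        raise ValueError("decode rate must be positive")
    return 1000.0 / tokens_per_second


# Decode rates copied from the headline measurements (tok/s).
headline = {
    "Qwen 3.6 35B-A3B, RTX 4060 Laptop": 22.7,
    "Gemma 4 26B-A4B, RTX 4060 Laptop": 40.16,
    "DeepSeek V4-Flash, H200 (low end)": 5.124,
    "DeepSeek V4-Flash, H200 (high end)": 5.751,
}

for name, rate in headline.items():
    print(f"{name}: {ms_per_token(rate):.1f} ms/token")
```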


What is Rotary Loading (short summary)

When the total model footprint exceeds accelerator memory (148 GB model on 141 GB VRAM, or a 35B model on 8 GB VRAM), the standard approach is offload + on-demand paging, which is dominated by host-device synchronization overhead.
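The size of the gap can be made concrete with a little arithmetic. The sketch below computes how much of a model cannot be resident at once. The 148 GB and 141 GB figures come from the text; the 4-bit weight width used for the 35B case is an assumption made only for illustration, since the quantization used on the 8 GB GPU is not stated here, and the estimate ignores KV cache and activation memory.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (decimal) for a given parameter count."""
    return params_billion * bits_per_weight / 8.0


def overflow_gb(model_gb: float, vram_gb: float) -> float:
    """GB of model footprint that cannot be resident at the same time (0 if it fits)."""
    return max(0.0, model_gb - vram_gb)


def resident_fraction(model_gb: float, vram_gb: float) -> float:
    """Largest fraction of the model that can sit in accelerator memory at once."""
    return min(1.0, vram_gb / model_gb)


# Case from the text: 148 GB model on a 141 GB accelerator.
print(overflow_gb(148, 141), round(resident_fraction(148, 141), 3))

# Hypothetical case: 35B parameters at an ASSUMED 4 bits per weight on 8 GB VRAM.
assumed_gb = weights_gb(35, 4)
print(assumed_gb, overflow_gb(assumed_gb, 8), round(resident_fraction(assumed_gb, 8), 3))
```

In the first case only a small slice of the model spills over; in the second, under the assumed quantization, more than half of the weights must live off the accelerator at any moment, which is why the cost of moving them matters so much.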

Rotary Loading instead maintains a small set of hot slots in accelerator memory and rotates their residency along a learned cyclic schedule driven by the input hidden state / routing trajectory. The rotation step is decided by a rotation projection rather than usage-frequency demotion.
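To see why demotion driven by past usage can struggle when access is cyclic, consider a toy simulation. It does not implement Rotary Loading and says nothing about the patented mechanism. It only counts loads for a recency-based (LRU) cache, a common relative of usage-based demotion, on a hypothetical trace in which every token touches the same 10 weight blocks in order. It compares that with simply keeping a fixed subset resident. All names and sizes are illustrative assumptions.

```python
from collections import OrderedDict


def lru_loads(trace, capacity):
    """Count host-to-device loads under on-demand paging with LRU demotion."""
    cache = OrderedDict()
    loads = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            loads += 1                     # miss: block must be loaded
            if len(cache) >= capacity:
                cache.popitem(last=False)  # demote the least recently used block
            cache[block] = True
    return loads


def pinned_loads(trace, pinned):
    """Count loads when a fixed set stays resident and all other blocks stream in."""
    resident = set(pinned)
    # The initial load of the pinned set is counted once.
    return len(resident) + sum(1 for block in trace if block not in resident)


n_blocks, capacity, tokens = 10, 4, 100   # hypothetical sizes
trace = [b for _ in range(tokens) for b in range(n_blocks)]

print("LRU loads:   ", lru_loads(trace, capacity))
print("pinned loads:", pinned_loads(trace, range(capacity)))
```

On a strictly cyclic trace longer than the cache, LRU evicts each block just before it is needed again, so every access becomes a load. Even a naive fixed residency does better, which is the general motivation for schedule-driven rather than history-driven residency.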

This repository publishes measurements and recipes only; the full mechanism (rotation control, slot grouping, multi-head decomposition, mode switching) is disclosed under the patent and is not re-written here.


Evidence and references

  • GitHub (papers): https://github.com/JorrrrrdDin/RESEARCH_PAPERS
    • Paper 16: Rotary GPU Local MoE
    • Paper 15: LogitFactory (vocabulary pruning)
    • Paper 14: GHOST Domain (privacy-preserving interaction)
  • KIPO gazette: Public application 10-2026-0070380 (2026-05-19)
    • Filing: 10-2026-0079156 (2026-04-30)
    • Title: 대규모 언어 모델의 로터리 회전 기반 가속기 적재 및 실행 시스템 및 방법 (System and method for rotary-rotation-based accelerator loading and execution of large language models)
    • 12 claims; measurements for cycles 6, 16, 21, and 30 appear in paragraphs [0089]–[0090]
  • DOI (Zenodo):

Status

This is a measurement-disclosure repository, not a release of the loading runtime. The runtime is under continued development; production-grade release is not currently scheduled.

For licensing or collaboration on the Rotary Loading method, please refer to the patent and contact through the GitHub research repository.


Author

Myeong Jun Jo (조명준), independent AI researcher, South Korea
ORCID: 0009-0006-9540-4666
