# HGF-60M-TinyStories: Hybrid Gated Flow 1.0
This repository contains the first official implementation of Hybrid Gated Flow (HGF), an architecture designed to overcome the "Memory Wall" in Large Language Models (LLMs).
## What is HGF?
Hybrid Gated Flow (HGF) is a dual-flow architecture that couples a 1.58-bit ternary backbone with a low-rank FP16 LoRA correction path controlled by adaptive gates.
- Ternary Backbone: Discretizes weights into the set {-1, 0, 1}.
- Gated Correction: Uses Low-Rank Adaptation (LoRA) with a gating mechanism to reinject the high-precision "nuance" lost during quantization.
- Structural Stabilization: The ternary backbone acts as a "Structural Anchor," stabilizing complex mechanisms like Differential Attention, which are often unstable in full precision.
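The dual-flow forward pass can be sketched in a few lines of pure Python. This is a minimal rank-1 illustration, not the repository's API: the function names, the `eps` constant, and the scalar gate are assumptions; only the ternary set {-1, 0, 1}, the low-rank FP16 correction, and the gating itself come from the description above.

```python
def ternarize(w, eps=1e-8):
    """Absmean ternarization: scale by mean |w|, then snap each
    weight to the nearest value in {-1, 0, +1}."""
    scale = sum(abs(v) for v in w) / len(w) + eps
    q = [max(-1, min(1, round(v / scale))) for v in w]
    return q, scale

def hgf_forward(x, w, lora_a, lora_b, gate):
    """One output unit: ternary backbone dot-product plus a gated
    rank-1 low-rank correction (lora_b is a scalar here)."""
    q, scale = ternarize(w)
    backbone = scale * sum(qi * xi for qi, xi in zip(q, x))
    # Correction path re-injects high-precision "nuance"; the gate
    # (in [0, 1]) decides how much of it flows back in.
    correction = lora_b * sum(ai * xi for ai, xi in zip(lora_a, x))
    return backbone + gate * correction
```

With `gate = 0` the layer degrades to the plain ternary backbone; the adaptive gates decide per-layer how much FP16 correction is worth its memory cost.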
## Performance and Metrics
Evaluated on the TinyStories dataset, HGF 1.0 significantly reduces the performance gap between extreme quantization and full-precision models.
| Architecture | Val Loss (2.5k steps) | Quality Recovery | Memory Footprint |
|---|---|---|---|
| Baseline (FP16) | 0.8490 | - | 100% |
| HGF 1.0 | 0.9306 | 54.8% | ~15% |
| BitNet b1.58 | 1.0294 | 0% | ~10% |
- Recovery: Closes 54.8% of the validation-loss gap between ternary quantization (BitNet b1.58) and the FP16 baseline.
- Memory Efficiency: Achieves a memory reduction of approximately 85% compared to the FP16 baseline.
- Training Efficiency: Reaches its optimal performance (capacity saturation) around 2500 steps, allowing for a 30% reduction in training compute compared to dense baselines.
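The two headline numbers in the table can be reproduced with simple arithmetic. The ~5% LoRA parameter share below is an assumption used only to illustrate how a ~15% footprint arises; the loss values are taken from the table.

```python
def quality_recovery(hgf_loss, ternary_loss, fp16_loss):
    """Fraction of the ternary-to-FP16 loss gap that HGF closes."""
    return (ternary_loss - hgf_loss) / (ternary_loss - fp16_loss)

def footprint_ratio(backbone_params, lora_params,
                    backbone_bits=1.58, lora_bits=16, baseline_bits=16):
    """Memory relative to an FP16 model of the same backbone size."""
    hgf_bits = backbone_params * backbone_bits + lora_params * lora_bits
    return hgf_bits / (backbone_params * baseline_bits)

print(round(quality_recovery(0.9306, 1.0294, 0.8490), 3))  # → 0.548
# Hypothetical split: 60M ternary backbone, LoRA adapters at ~5% of its size.
print(round(footprint_ratio(60e6, 3e6), 3))                # → 0.149
```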
## Implementation Details
- Activation Quantization: 8-bit integers to preserve dynamic range.
- Attention: Differential Attention mechanism to improve context tracking.
- Gating Protocol: Adaptive training including warmup, regularization, and freezing at step 900 to ensure stable convergence.
- Learning Rates: Dual learning-rate strategy: 2.5e-3 for the main parameters, 3e-4 for the gating parameters.
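The gating protocol and dual learning rates above combine into one schedule. A minimal sketch: only the 3e-4 gate rate, the 2.5e-3 main rate, and the step-900 freeze come from this section; the 100-step warmup length is an assumed value for illustration.

```python
MAIN_LR = 2.5e-3   # main (backbone + correction) parameters
GATE_LR = 3e-4     # gating parameters

def gate_lr(step, warmup_steps=100, freeze_step=900):
    """Gate learning rate: linear warmup, constant plateau,
    then a hard freeze (lr = 0) at step 900 for stable convergence."""
    if step >= freeze_step:
        return 0.0
    if step < warmup_steps:
        return GATE_LR * step / warmup_steps
    return GATE_LR
```

In a typical setup this would feed two optimizer parameter groups: one at `MAIN_LR` for the backbone and LoRA weights, one at `gate_lr(step)` for the gates.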
## Deployment and Use Cases
HGF is optimized for resource-constrained environments where memory bandwidth is the primary bottleneck:
- Edge Computing: Devices with 2-4 GB of RAM, such as Raspberry Pi 5 or mobile NPUs.
- Privacy-Preserving AI: On-device processing to eliminate cloud dependency.
- Cloud Economics: Higher batch density for multi-tenant services.
## Citation
If you use this model in your research, please cite:
```bibtex
@article{pizzo2026hybrid,
  title   = {Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction},
  author  = {Trejo Pizzo, David Alejandro},
  journal = {arXiv preprint arXiv:2602.05269},
  year    = {2026}
}

@article{trejopizzo2026hgf,
  title   = {Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction},
  author  = {Trejo Pizzo, David Alejandro},
  journal = {OpenCoresAI},
  year    = {2026}
}
```