BAGEL-7B-MoT quantized with TorchAO float8 weight-only quantization, using the default round-to-nearest algorithm and symmetric max-abs scaling.
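For intuition, here is a minimal pure-Python sketch of what symmetric max-abs scaling with round-to-nearest does. This is illustrative only: TorchAO performs this in fused kernels and snaps values to the non-uniform e4m3 value set, whereas the sketch rounds onto a uniform grid.

```python
F8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def quantize_symmetric(weights):
    """Scale weights so the max magnitude maps onto the e4m3 range,
    then round-to-nearest (uniform grid here for simplicity)."""
    amax = max(abs(w) for w in weights)
    scale = amax / F8_E4M3_MAX
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.1, -2.24, 0.5, 1.12]
q, scale = quantize_symmetric(weights)
# q == [20, -448, 100, 224]; the largest-magnitude weight lands on +/-448
restored = dequantize(q, scale)
```

Because the scale is chosen from the maximum absolute value, the largest weight always maps exactly onto the edge of the representable range, and no value clips.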
⏳ Notice
This model was originally distributed as a pickled .pt file, because TorchAO quantized models have traditionally been serialized and shipped with PyTorch's native APIs, specifically:
torch.save(model.state_dict(), "model_fp8.pt")
state_dict = torch.load("model_fp8.pt", weights_only=True)
model.load_state_dict(state_dict, assign=True)
.safetensors file (Current)
Summary:
Total tensors: 1,778 (vs. 1,223 tensors in the original BF16 checkpoint)
Total parameters: 14,607,260,683
Total size: 14.15 GB
Dtype Distribution:
BF16: 668 tensors (37.6%)
F8_E4M3: 555 tensors (31.2%)
FP32: 555 tensors (31.2%)
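A quick sanity check on the numbers above. The matching F8_E4M3 and FP32 counts suggest (this is an assumption, not stated in the summary) that each float8 weight tensor is paired with one float32 scale tensor, and a back-of-envelope byte count shows why the checkpoint is roughly half the BF16 size:

```python
# Tensor counts reported in the summary above.
counts = {"BF16": 668, "F8_E4M3": 555, "FP32": 555}
assert sum(counts.values()) == 1778  # matches "Total tensors"

# Presumably each float8 weight carries one float32 scale tensor.
assert counts["F8_E4M3"] == counts["FP32"]

# Rough size estimate: ~2 bytes/param in BF16 vs ~1 byte in FP8.
params = 14_607_260_683
print(f"BF16 checkpoint: ~{params * 2 / 1e9:.1f} GB")  # ~29.2 GB
print(f"FP8-dominant:    ~{params * 1 / 1e9:.1f} GB")  # ~14.6 GB, near the 14.15 GB reported
```

The reported 14.15 GB is slightly below the naive 1-byte-per-parameter estimate because a minority of tensors stay in BF16 while the small FP32 scale tensors add little.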
The following walkthrough is only an example: it loads the original BF16 weights and uses TorchAO online dynamic quantization to run inference in FP8.
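"Online dynamic" here means the activation scale is recomputed from each incoming batch at inference time, so no offline calibration pass is needed. A minimal pure-Python sketch of the idea (illustrative only; TorchAO fuses this into its kernels):

```python
F8_E4M3_MAX = 448.0  # largest finite float8 e4m3 value

def dynamic_quantize(activations):
    """Quantize one batch with a scale derived from its own max-abs,
    computed on the fly rather than fixed ahead of time."""
    scale = max(abs(a) for a in activations) / F8_E4M3_MAX
    return [round(a / scale) for a in activations], scale

# Each batch gets its own scale at inference time:
_, s1 = dynamic_quantize([0.3, -1.5, 0.9])
_, s2 = dynamic_quantize([10.0, -3.0])
assert s1 != s2  # scales track the runtime activation range
```

This is what lets dynamic quantization adapt to activation outliers per batch, at the cost of computing a max-reduction on every forward pass.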
🚀 Inference Experiment
On 2× RTX 5090 with 24 GiB VRAM
Saves about 39% VRAM and speeds up inference by about 10%.

On 1× H100 with 80 GiB VRAM
TODO
🚩 Quick Start
Set up Environment for Bagel
git clone https://github.com/AaronCaoZJ/BAGEL.git # forked from the original ByteDance-Seed/Bagel
cd BAGEL
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
pip install torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
pip install packaging ninja
pip install flash-attn==2.8.3 --no-build-isolation # FlashAttention only supports Ampere GPUs or newer
Download Pretrained Checkpoint
from huggingface_hub import snapshot_download

save_dir = "models/BAGEL-7B-MoT"
repo_id = "aaroncaozj/BAGEL-7B-MoT_FP8"
cache_dir = save_dir + "/cache"

snapshot_download(
    cache_dir=cache_dir,
    local_dir=save_dir,
    repo_id=repo_id,
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.pt", "*.bin", "*.py", "*.md", "*.txt"],
)
Use Gradio WebUI to Play with BAGEL
# For a single GPU with 32 GB+ VRAM, or multiple GPUs.
python app-torchao.py