---
license: mit
base_model: stepfun-ai/Step1X-Edit
tags:
- depth-estimation
- normal-estimation
- quantized
- int8
---
# FE2E INT8 (Pre-quantized for CPU)
Pre-quantized INT8 weights for [FE2E](https://github.com/AMAP-ML/FE2E) (CVPR 2026), which estimates monocular depth and surface normals from a single image.
**Demo Space:** [WeReCooking2/FE2E-CPU](https://huggingface.co/spaces/WeReCooking2/FE2E-CPU)
## Files
| File | Size | Description |
|------|------|-------------|
| `dit_int8_full.pt` | 12.4 GB | Step1X-Edit DiT (12.4B params) + LDRN LoRA merged, dynamic INT8 quantized |
| `vae_full.pt` | 335 MB | AutoEncoder, FP32 |
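Both files were saved with `torch.save(model)` as full pickled models, not state_dicts, so loading needs `weights_only=False` and the model's class definitions must be importable. Load with `torch.load(..., mmap=True)` to avoid doubling memory use during loading.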
## How it was made
1. Loaded FP32 base model (`step1x-edit-i1258.safetensors`) on GPU
2. Cast to FP32 on CPU
3. Merged LDRN LoRA in full precision
4. Applied `torch.quantization.quantize_dynamic` (INT8 on all `nn.Linear` layers)
5. Saved full model with `torch.save(model)`
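The merge and quantization steps above can be sketched on a single toy layer. This is an illustration under stated assumptions, not the actual conversion script: the layer sizes, the LoRA rank, the `scale` value, and the names `lora_A` and `lora_B` are all hypothetical, and the real LDRN LoRA layout may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one linear projection in the DiT (hypothetical sizes).
base = nn.Linear(16, 8)

# Hypothetical LoRA factors: delta_W = scale * (B @ A), rank 4.
rank, scale = 4, 0.5
lora_A = torch.randn(rank, 16)
lora_B = torch.randn(8, rank)

x = torch.randn(2, 16)
with torch.no_grad():
    # Reference: base layer output plus the separate LoRA branch.
    expected = base(x) + scale * (x @ lora_A.T @ lora_B.T)

    # Merge: fold the low-rank update into the FP32 weight.
    base.weight += scale * (lora_B @ lora_A)
    merged_out = base(x)

# Dynamic INT8 quantization of every nn.Linear inside the module.
model = nn.Sequential(base)
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    quant_out = qmodel(x)

merge_error = (merged_out - expected).abs().max().item()
quant_error = (quant_out - merged_out).abs().max().item()
```

Merging before quantizing means the LoRA update is rounded once, together with the base weight, rather than being added on top of already-quantized weights. `merge_error` should be at floating-point noise level, while `quant_error` reflects the INT8 rounding.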
## Usage
```python
import torch
dit = torch.load("dit_int8_full.pt", map_location="cpu", weights_only=False, mmap=True)
vae = torch.load("vae_full.pt", map_location="cpu", weights_only=False, mmap=True)
```
Requires ~12 GB RAM with mmap loading.
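The ~12 GB figure is consistent with a back-of-the-envelope estimate. The sketch below assumes all 12.4B parameters are stored at one byte each after INT8 quantization, and ignores biases, quantization scales, the VAE, and activation memory, so it is a lower bound rather than a measurement.

```python
# Rough weight footprint for a 12.4B-parameter model (assumption: every
# parameter is stored at the listed width; overheads are ignored).
PARAMS = 12.4e9
BYTES_PER_PARAM = {"fp32": 4, "int8": 1}


def footprint_gb(dtype: str) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[dtype] / 1e9


fp32_gb = footprint_gb("fp32")  # about 49.6 GB
int8_gb = footprint_gb("int8")  # about 12.4 GB
```

Under these assumptions, the INT8 weights take a quarter of the space of an FP32 copy, which is what lets the model fit on a CPU-only machine.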
## Performance
| Platform | Time per image |
|----------|---------------|
| GPU (RTX 5090, FP8 original) | ~2s |
| CPU (HF free Space, INT8) | ~29 min (768x1024) |
Inference uses a single denoising step and outputs the depth map and the surface normal map together.
> No ONNX export is provided: the PyTorch dynamo exporter produces a broken graph (100% NaN output).
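
A figure like "100% NaN output" can be checked with a small helper such as the one below. This is only a sketch of one way to measure it: `nan_fraction` is a hypothetical helper name, and the two tensors are made-up stand-ins for a healthy and a broken model output.

```python
import torch


def nan_fraction(t: torch.Tensor) -> float:
    """Return the share of elements in `t` that are NaN (0.0 to 1.0)."""
    return torch.isnan(t).float().mean().item()


# Made-up stand-ins for model outputs.
healthy = torch.tensor([0.1, 0.5, 0.9, 0.3])
broken = torch.full((4,), float("nan"))

healthy_share = nan_fraction(healthy)
broken_share = nan_fraction(broken)
```

Comparing this value between the PyTorch model and an exported graph on the same input is a quick way to tell a numerical drift from a completely broken export.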
## Credits
- [FE2E](https://github.com/AMAP-ML/FE2E) (CVPR 2026)
- [Step1X-Edit](https://github.com/stepfun-ai/Step1X-Edit) base model
- [rkfg/Step1X-Edit-FP8](https://huggingface.co/rkfg/Step1X-Edit-FP8) FP8 quantization