---
license: mit
language:
- pt
- en
- es
- fr
- ar
tags:
- vulkan
- amd
- rx580
- local-ai
- llama-cpp
- stable-diffusion
- gguf
- flux
- openwebui
- polaris
- gcn4
- hardware-revival
- windows
- wsl2
pretty_name: RX 580 Local AI Complete Stack (AIVisionsLab)
---
# RX 580 Local AI — Complete Stack
**AIVisionsLab Studios** · São Paulo, Brazil 🇧🇷
> Running SOTA AI on 2017 hardware in 2026. No CUDA. No ROCm. No cloud.
---
## What this is
This repository documents the complete stack for running local AI on an **AMD RX 580 8GB** using the **Vulkan API** as the GPU backend — bypassing the need for CUDA or ROCm entirely.
AMD officially dropped ROCm support for Polaris/GCN4 in v5.x. DirectML failed. OpenVINO failed.
This project proves the hardware is still capable — the problem was always the software stack, not the GPU.
**Full master documentation (PT/EN/ES/FR/AR):**
🌐 [setup-ia-local-rx580-vulkan.web.app](https://setup-ia-local-rx580-vulkan.web.app/)
---
## Hardware
| Component | Spec |
|-----------|------|
| GPU | AMD RX 580 **2048SP** 8GB GDDR5 (Polaris / GCN4) |
| CPU | Intel Xeon **E5-2690 v3** — 12c/24t · 3.5GHz boost (2014) |
| RAM | **32GB DDR4 REG ECC** Quad Channel RDIMM |
| Storage | **NVMe 1TB** · 1.7–3.5 GB/s (storage speed is the critical bottleneck; see Storage impact) |
| OS | Windows 10 Pro + WSL2 Ubuntu 22.04.5 |
| Vulkan SDK | 1.4.341.1 |
| AMD Driver | 31.0.21924.61 |
---
## Performance (real logs, not synthetic benchmarks)
### LLM — llama.cpp with Vulkan
| Model | Quantization | Speed | VRAM |
|-------|-------------|-------|------|
| Mistral 7B Instruct | Q4_K_M | **~9 tok/s** | ~6GB |
| Llama 3 8B Instruct | Q4_K_M | **~7 tok/s** | ~6.8GB |
| Qwen2.5 7B | Q4_K_M | **~8 tok/s** | ~6.2GB |
| DeepSeek R1 8B | Q4_K_M | **~7 tok/s** | ~6.8GB |
> CPU baseline (Xeon, no GPU): 3–5 tok/s. Against the 7–9 tok/s in the table above, the Vulkan uplift is roughly **1.4–3×**.
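
A quick way to check the uplift against the table is to divide each GPU figure by the two ends of the CPU baseline range. This is only a sketch: it assumes the 3–5 tok/s baseline applies to every model in the table, which the logs above do not state per model, and the helper name `uplift_range` is illustrative.

```python
# Decode speeds (tok/s) copied from the LLM table above.
gpu_tok_s = {
    "Mistral 7B Instruct": 9.0,
    "Llama 3 8B Instruct": 7.0,
    "Qwen2.5 7B": 8.0,
    "DeepSeek R1 8B": 7.0,
}

# CPU-only baseline range quoted above (Xeon, no GPU).
# Assumption: the same range applies to every model listed.
CPU_LOW, CPU_HIGH = 3.0, 5.0


def uplift_range(gpu, cpu_low, cpu_high):
    """Return (worst, best) speedup of one GPU figure against a CPU range."""
    return gpu / cpu_high, gpu / cpu_low


for name, speed in gpu_tok_s.items():
    worst, best = uplift_range(speed, CPU_LOW, CPU_HIGH)
    print(f"{name}: {worst:.1f}x to {best:.1f}x")
```

Pairing each model with its own measured CPU figure, if available, would give a tighter number than a range.
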
### Image Generation — stable-diffusion.cpp with Vulkan
| Model | Resolution | Steps | Time | Backend |
|-------|------------|-------|------|---------|
| DreamShaper 8 (SD 1.5 GGUF) | 512×512 | 20 | **~72s** | RX 580 Vulkan |
| FLUX.1 Schnell q4_k | 1024×1024 | 4 | **~14 min** | GPU+CPU hybrid |
| FLUX.1 Schnell fp8 (16GB) | 1024×1024 | 4 | **~24 min** | Xeon CPU / WSL2 |
### Storage impact
| Operation | HDD | NVMe | Improvement |
|-----------|-----|------|-------------|
| LLM 7B load | ~25 min | **~4 min** | 6× faster |
| FLUX 16GB load | ~25 min | **~30s** | **50× faster** |
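
The improvement column can be reproduced from the two timing columns, and the NVMe throughput range from the Hardware table gives a rough lower bound on how long a 16GB read could take. The sketch below assumes purely sequential reads and treats 1 GB/s as 1 GB of file per second; the function names are illustrative.

```python
MINUTE = 60  # seconds


def speedup(before_s, after_s):
    """Ratio of the old load time to the new one."""
    return before_s / after_s


def ideal_read_seconds(size_gb, throughput_gb_s):
    """Best-case sequential read time, ignoring parsing and allocation."""
    return size_gb / throughput_gb_s


# Timings copied from the table above.
print(f"LLM 7B load:    {speedup(25 * MINUTE, 4 * MINUTE):.2f}x")
print(f"FLUX 16GB load: {speedup(25 * MINUTE, 30):.0f}x")

# NVMe throughput range from the Hardware table.
for rate in (1.7, 3.5):
    seconds = ideal_read_seconds(16, rate)
    print(f"16GB at {rate} GB/s: about {seconds:.1f}s best case")
```

The best-case read works out to roughly 5 to 9 seconds, so the observed ~30s load presumably includes work beyond the raw disk read. The logs above do not break that down.
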
---
## Models used
### For sd-server (stable-diffusion.cpp)
> ⚠️ **Critical:** Only use **leejet** GGUF models for sd-server.
> city96 GGUF models are ComfyUI-only. Using them returns `new_sd_ctx_t failed`.
| Model | Source | Use |
|-------|--------|-----|
| `flux1-schnell-q4_k.gguf` | [leejet/FLUX.1-schnell-gguf](https://huggingface.co/leejet/FLUX.1-schnell-gguf) | FLUX GPU hybrid |
| `flux1-schnell-Q3_K_S.gguf` | [leejet/FLUX.1-schnell-gguf](https://huggingface.co/leejet/FLUX.1-schnell-gguf) | FLUX lighter (~5.2GB) |
| `DreamShaper_8.safetensors` | Civitai | SD 1.5 production |
### For ComfyUI (city96 compatible)
| Model | Source | Use |
|-------|--------|-----|
| `flux1-schnell-Q4_K_S.gguf` | [city96/FLUX.1-schnell-gguf](https://huggingface.co/city96/FLUX.1-schnell-gguf) | ComfyUI only |
| `flux1-schnell-fp8.safetensors` | Comfy-Org | Full 16GB checkpoint, runs on CPU |
### VAE / CLIP / T5XXL (required for FLUX)
| File | Purpose | Size / placement |
|------|---------|----------------|
| `ae.safetensors` | VAE decoder | ~160MB CPU |
| `clip_l.safetensors` | CLIP encoder | ~235MB GPU |
| `t5xxl_fp16.safetensors` | T5 encoder | ~9.3GB CPU |
| `t5xxl_fp8.safetensors` | T5 encoder (lighter) | ~5GB CPU |
---
## Architecture
```
OpenWebUI (Docker :3000)
├──► LLM: llama-server.exe (:8081) — RX 580 Vulkan
│ └── fallback: Ollama (:11434) — CPU
└──► Images:
├──► SD 1.5 GGUF: sd-server.exe (:7860) — RX 580 Vulkan
└──► FLUX.1 16GB: ComfyUI (:8188) — Xeon CPU WSL2
```
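
The routing in the diagram can be checked with a small port probe. The sketch below uses only the ports shown above; the host `127.0.0.1`, the `SERVICES` names, and the helpers `is_listening` and `pick_llm_backend` are assumptions for illustration, and this is not how OpenWebUI itself selects a backend.

```python
import socket

# Ports taken from the diagram above.
# Assumption: every service is reachable on 127.0.0.1 from where this runs.
SERVICES = {
    "openwebui": 3000,
    "llama-server": 8081,
    "ollama": 11434,
    "sd-server": 7860,
    "comfyui": 8188,
}


def is_listening(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_llm_backend(probe=is_listening):
    """Prefer the Vulkan llama-server, fall back to CPU Ollama, else None."""
    for name in ("llama-server", "ollama"):
        if probe(SERVICES[name]):
            return name
    return None


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_listening(port) else "down"
        print(f"{name:12s} :{port:<5d} {state}")
    print("LLM backend:", pick_llm_backend())
```

A TCP connect only shows that something is listening on the port, not that a model is loaded. Run from inside the OpenWebUI container, the host would need to point at the Docker host rather than the loopback address.
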
### FLUX memory segmentation
| Component | File | Allocation | Size |
|-----------|------|------------|------|
| Diffusion model | flux1-schnell-q4_k.gguf | **GPU VRAM** | ~6.5GB |
| VAE | ae.safetensors | **CPU RAM** | ~160MB |
| CLIP L | clip_l.safetensors | **GPU VRAM** | ~235MB |
| T5XXL | t5xxl_fp16.safetensors | **CPU RAM** | ~9.3GB |
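
To see why this split fits, the sizes in the table can be summed per device and compared with the 8GB of VRAM and 32GB of RAM listed under Hardware. This is a rough budget only: it counts weights from the table and ignores compute buffers, activations, and whatever the OS and desktop already hold, so real headroom is smaller.

```python
# (component, device, size in GB), copied from the table above.
PLACEMENT = [
    ("Diffusion model", "gpu", 6.5),
    ("VAE", "cpu", 0.160),
    ("CLIP L", "gpu", 0.235),
    ("T5XXL fp16", "cpu", 9.3),
]

# RX 580 VRAM and system RAM from the Hardware table.
CAPACITY_GB = {"gpu": 8.0, "cpu": 32.0}


def totals(placement):
    """Sum the weight sizes assigned to each device."""
    used = {}
    for _component, device, size_gb in placement:
        used[device] = used.get(device, 0.0) + size_gb
    return used


used = totals(PLACEMENT)
for device, capacity in CAPACITY_GB.items():
    headroom = capacity - used[device]
    print(f"{device}: {used[device]:.2f} GB of {capacity:.0f} GB, "
          f"headroom {headroom:.2f} GB")
```

Moving the T5XXL fp16 encoder (~9.3GB) to the GPU would exceed 8GB on its own, which is consistent with keeping it in CPU RAM.
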
---
## What failed (documented with root cause)
| Attempt | Error | Root cause |
|---------|-------|------------|
| DirectML | `OpaqueTensorImpl` | Microsoft's backend encapsulates tensors, so ComfyUI can't read them |
| ROCm | Kernel panics | GCN4/Polaris dropped in v5.x — permanent |
| OpenVINO + Forge | `No module 'ldm'` | Extension targets A1111 — incompatible with Forge |
| CPU + HDD | ~19 min/image | Zero GPU utilization + I/O bottleneck |
Full analysis: [docs/what-failed.md](https://github.com/aivisionslab-studios/rx580-local-ai-guide/blob/main/docs/what-failed.md)
---
## Community & Credits
This work builds on independent research from:
| Author | Publication | Contribution |
|--------|-------------|-------------|
| [艾米心 Amihart](https://medium.com/@amihart) | Medium, Jan 2025 | First validation of LLMs via Vulkan on RX 580 — 24.56 tok/s |
| [DH / DadHacks](https://dadhacks.org/2025/12/05/ai-image-generation-on-rx-580-using-vulkan-a-cost-effective-solution/) | dadhacks.org, Dec 2025 | Refuted "SD can't run on Vulkan" — sd.cpp Linux guide |
| [leejet](https://github.com/leejet/stable-diffusion.cpp) | GitHub | stable-diffusion.cpp engine |
| [ggerganov](https://github.com/ggerganov/llama.cpp) | GitHub | llama.cpp + ggml engine |
| [woodrex](https://hub.docker.com/r/woodrex/sd-webui-for-gfx803) | Docker Hub | ROCm gfx803 containers |
> *"The hardware was never obsolete. It was waiting for the right software."*
---
## GitHub
📦 [aivisionslab-studios/rx580-local-ai-guide](https://github.com/aivisionslab-studios/rx580-local-ai-guide)
Scripts, build guides, automation, troubleshooting docs.
---
## License
MIT — use freely, give credit, document what you learn.