---
license: mit
base_model: microsoft/FastContext-1.0-4B-SFT
base_model_relation: quantized
pipeline_tag: text-generation
library_name: gguf
tags:
- gguf
- rocmfp4
- quantized
- amd
- rocm
- strix-halo
- qwen3
- agent
- repository-exploration
language:
- en
---

<p align="center">
<img src="hal0-banner.png" alt="hal0" width="420"/>
</p>

# FastContext-Hal0-4B: ROCmFP4 (STRIX_LEAN)

A 4-bit **ROCmFP4** quantization of [`microsoft/FastContext-1.0-4B-SFT`](https://huggingface.co/microsoft/FastContext-1.0-4B-SFT),
a lightweight repository-exploration subagent (Qwen3-4B backbone) for LLM coding agents.
Quantized and validated on **AMD Strix Halo** (Ryzen AI MAX+ 395 / Radeon 8060S, `gfx1151`)
using [`hal0ai/amd-strix-halo-toolboxes`](https://github.com/hal0ai) 🛠️.

> ### ⚠️ Read this first: special runtime required
> This file uses the experimental **`Q4_0_ROCMFP4`** GGUF tensor format. It is **NOT** loadable by
> stock `llama.cpp`, Ollama, LM Studio, or any standard GGUF runtime. It runs **only** in the
> [`charlie12345/rocmfp4-llama`](https://github.com/charlie12345/rocmfp4-llama) fork.
> ROCmFP4 is a custom Codebook10 / finite-UE4M3 layout; it is **not** MXFP4 or NVFP4.

## What's in this repo

| File | Size | Format | BPW |
|---|---:|---|---:|
| `FastContext-4B-ROCmFP4-STRIX_LEAN.gguf` | 2.05 GiB | `Q4_0_ROCMFP4_STRIX_LEAN` | 4.38 |

`STRIX_LEAN` is a tensor-aware preset: norms stay `f32`, sensitive tensors keep higher precision,
and the bulk of the weights use the dual/fast ROCmFP4 layouts.

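
The BPW (bits per weight) column ties file size to weight count: size in bits divided by bits per weight. Below is a rough sketch of that arithmetic, assuming the whole file is tensor data (GGUF metadata is ignored) and that `awk` is available. `estimate_params` is a hypothetical helper name, not part of any tool mentioned here.

```bash
# Rough estimate: weights ~= file_size_bits / bits_per_weight.
# Assumption: GGUF header/metadata overhead is negligible next to the tensors.
estimate_params() {
  local size_gib="$1" bpw="$2"
  awk -v s="$size_gib" -v b="$bpw" \
    'BEGIN { printf "%.2f\n", s * 1024 * 1024 * 1024 * 8 / b / 1e9 }'
}

# Values from the table above: 2.05 GiB at 4.38 BPW.
echo "approx. $(estimate_params 2.05 4.38) billion weights"
```

The same helper applied to the 7.49 GiB BF16 source at 16 BPW lands on the same figure of about 4.02 billion, which is a quick sanity check that the reported size and BPW are consistent with each other.
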
## Why ROCmFP4 here

On Strix Halo, token generation is memory-bandwidth-bound, so 4-bit weights decode much faster than
BF16 (fewer bytes are read per token) while keeping quality intact for tool-calling.

### Performance (`llama-bench`, ROCm0, FlashAttention on, Radeon 8060S)

| Metric | BF16 source | **ROCmFP4 STRIX_LEAN** | Δ |
|---|---:|---:|---|
| Size | 7.49 GiB | **2.05 GiB** | **3.65× smaller** |
| Prefill `pp512` | 2388 t/s | 2244 t/s | ~6% slower (compute-bound) |
| Decode `tg128` | 25.6 t/s | **73.7 t/s** | **2.88× faster** |

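
If decode were purely bandwidth-bound, its speedup would track the size reduction. The sketch below recomputes the ratios from the raw numbers in the table and compares the two. `ratio` is a hypothetical helper, and the "efficiency" figure is only an illustration of the bandwidth argument, not a measurement from the benchmark.

```bash
# Recompute the ratios in the table above from its raw numbers.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

size_ratio=$(ratio 7.49 2.05)      # BF16 size / ROCmFP4 size
decode_ratio=$(ratio 73.7 25.6)    # ROCmFP4 tg128 / BF16 tg128
prefill_ratio=$(ratio 2244 2388)   # ROCmFP4 pp512 / BF16 pp512

echo "size reduction:    ${size_ratio}x"
echo "decode speedup:    ${decode_ratio}x"
echo "prefill ratio:     ${prefill_ratio}x"
# Share of the ideal size-proportional speedup that decode actually reaches.
echo "decode efficiency: $(ratio "$decode_ratio" "$size_ratio")"
```

Decode reaches roughly 79% of the ideal size-proportional speedup. The remainder presumably goes to dequantization work and to memory traffic that does not shrink with the weights, such as the KV cache and activations.
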
### Tool-calling quality (`server-test-function-call.py`, 5 multi-turn cases, greedy `temp 0`)

| Metric | BF16 source | ROCmFP4 STRIX_LEAN |
|---|---:|---:|
| Cases passed | 2/5 | 4/5 |

In every case **both** models selected and ordered the correct tools. The only failures were
"no final summary produced" after correct tool use, a stopping quirk shared by the BF16 source
(not a quantization artifact). **Takeaway: FP4 introduced no measurable tool-calling regression.**
A 5-case harness can't rank models finely, so read this as "quality preserved," not "FP4 > BF16."

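
For readers who want to poke at tool-calling by hand, here is a minimal sketch of a single request against an OpenAI-compatible `/v1/chat/completions` endpoint. It assumes the server from the "How to run" section is listening on `llama-server`'s default port 8080 and that the fork keeps upstream's HTTP API. The `list_files` tool is hypothetical, invented for illustration, and is not FastContext's real tool schema or the harness's test case.

```bash
# Sketch of one tool-calling request at temperature 0 (greedy).
# Assumptions: server on 127.0.0.1:8080; `list_files` is a made-up tool.
BASE_URL="${BASE_URL:-http://127.0.0.1:8080}"

build_payload() {
  cat <<'JSON'
{
  "temperature": 0,
  "messages": [
    {"role": "user", "content": "Which files define the HTTP routes in this repo?"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_files",
        "description": "List files under a directory of the repository",
        "parameters": {
          "type": "object",
          "properties": {"path": {"type": "string"}},
          "required": ["path"]
        }
      }
    }
  ]
}
JSON
}

send_request() {
  # POST the payload; curl reads the request body from stdin via --data @-.
  build_payload | curl -sS "$BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" --data @-
}
```

Call `send_request` once the server is up. In the OpenAI-style response format, a tool call made by the model appears under `choices[0].message.tool_calls`.
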
## How to run

Build the fork for your AMD GPU (see its README), then:

```bash
# Serve from the first ROCm device with all layers offloaded (-ngl 999),
# a 262144-token context, FlashAttention on, and the Jinja chat template enabled.
HSA_OVERRIDE_GFX_VERSION=11.5.1 \
GGML_HIP_ENABLE_UNIFIED_MEMORY=1 \
./build-strix-rocmfp4/bin/llama-server \
-m FastContext-4B-ROCmFP4-STRIX_LEAN.gguf \
-dev ROCm0 -ngl 999 -c 262144 -fa on --jinja
```

For scripted/non-interactive generation use `llama-completion` (this fork's `llama-cli` is
interactive-only and rejects `-no-cnv`). FastContext supports up to **262K** (262,144-token) context,
which is what `-c 262144` requests.

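
Because the APU's unified memory is shared with other models, it can help to estimate how much the KV cache may claim at a given `-c`. The sketch below uses the standard formula for an attention KV cache. The architecture numbers (36 layers, 8 KV heads, head dimension 128) are assumptions for a Qwen3-4B backbone and are not stated in this card, so check them against the model's `config.json`. It also assumes an unquantized f16 cache, and `kv_cache_gib` is a hypothetical helper.

```bash
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem * tokens.
# Assumed defaults (verify against config.json): 36 layers, 8 KV heads, head_dim 128,
# f16 cache entries (2 bytes each).
kv_cache_gib() {
  local tokens="$1" layers="${2:-36}" kv_heads="${3:-8}" head_dim="${4:-128}" bytes="${5:-2}"
  awk -v t="$tokens" -v l="$layers" -v h="$kv_heads" -v d="$head_dim" -v b="$bytes" \
    'BEGIN { printf "%.2f\n", 2 * l * h * d * b * t / (1024 * 1024 * 1024) }'
}

echo "KV cache at -c 262144: $(kv_cache_gib 262144) GiB"
echo "KV cache at -c 32768:  $(kv_cache_gib 32768) GiB"
```

Under these assumptions the full 262,144-token cache comes to 36 GiB, far more than the 2.05 GiB of weights, while 32,768 tokens needs 4.5 GiB. If the subagent shares memory with larger models, a smaller `-c` may be a reasonable trade-off.
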
## How it was made

```bash
# 1. HF safetensors -> BF16 GGUF
python convert_hf_to_gguf.py ./FastContext-1.0-4B-SFT --outtype bf16 --outfile fc-bf16.gguf
# 2. BF16 -> ROCmFP4 (same fork binary the server uses)
llama-quantize fc-bf16.gguf FastContext-4B-ROCmFP4-STRIX_LEAN.gguf Q4_0_ROCMFP4_STRIX_LEAN
```

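
After either step, a cheap sanity check is to confirm that the output really is a GGUF container: every GGUF file begins with the four ASCII bytes `GGUF`. This is a minimal sketch, and `is_gguf` is a hypothetical helper name. It validates only the magic bytes, not whether the ROCmFP4 tensor types inside can be loaded by a given runtime.

```bash
# A GGUF file starts with the 4 ASCII bytes "GGUF"; anything else means the
# conversion, quantization, or download produced something unexpected.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage:
#   is_gguf FastContext-4B-ROCmFP4-STRIX_LEAN.gguf && echo "looks like GGUF"
```
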
## License & attribution

- Weights derive from [`microsoft/FastContext-1.0-4B-SFT`](https://huggingface.co/microsoft/FastContext-1.0-4B-SFT) (**MIT**).
- Backbone: [`Qwen/Qwen3-4B-Instruct-2507`](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) (Apache-2.0).
- Quantization format & tooling: [`charlie12345/rocmfp4-llama`](https://github.com/charlie12345/rocmfp4-llama).

This repository redistributes a quantized derivative under the terms of the upstream MIT license.

---

### About hal0ai

Built and benchmarked with **[hal0ai](https://github.com/hal0ai)**, local-first AI agent
infrastructure tuned for **AMD Strix Halo**. The
[`amd-strix-halo-toolboxes`](https://github.com/hal0ai) ship ready-to-run ROCm + ROCmFP4
container images so you can quantize and serve large models on a single unified-memory APU.
If you're running agents on AMD silicon, come say hi.

---

### A note from the author

This is my **first time** doing any kind of custom model quantization or training, so this
release is very much a learning project. If you spot something I got wrong, or have tips on
presets, calibration, or quality testing, I'd genuinely **appreciate the feedback**: open a
Community discussion and let me know.

I made this to run as a **slot in [hal0](https://github.com/hal0ai)**, alongside the main agent:
a small, fast repository-exploration subagent that ROCmFP4 lets me keep resident on the Strix
Halo without crowding out the bigger models sharing the same unified memory.

If you're tinkering with local agents on AMD hardware, **come check out hal0**. Would love to
see what you build.