---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---

# Dream-Coder GGUF Q8_0 Quantization Guide

This guide covers GGUF Q8_0 quantization of the Dream-Coder-v0-Instruct-7B model.

## Quick Start

### 1. Environment Setup

```bash
# 1. Clone and build llama.cpp (the CMake build places binaries in build/bin,
#    which the commands below assume)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)

# 2. Install Python dependencies (quote the version spec so the shell
#    does not treat ">" as a redirect)
pip install "transformers>=4.46.2" torch safetensors numpy
```

### 2. Execute Quantization

#### Method 1: Use the provided script

```bash
# Set the llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp

# Run the quantization script
./quantize_example.sh
```

#### Method 2: Manual execution

```bash
python quantize_dream_q8_0.py \
    --model_path /path/to/Dream-Coder-v0-Instruct-7B \
    --llama_cpp_path /path/to/llama.cpp \
    --output_dir ./gguf_output \
    --keep_f16
```

### 3. Parameter Description

- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: `./gguf_output`)
- `--keep_f16`: Keep the F16 intermediate file

## Architecture Adaptation

### Dream-Coder Special Configuration Handling

The quantization script handles the following Dream-Coder-specific configuration:

1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (for llama.cpp compatibility)
2. **Special Token IDs**:
   - `mask_token_id`: 151666 (the critical diffusion mask token)
   - `bos_token_id`: 151665
   - `eos_token_id`: 151643
   - `pad_token_id`: 151643
3. **Model Parameters**:
   - Vocabulary size: 152,064
   - Hidden dimension: 3,584
   - Attention heads: 28 (4 key-value heads)
   - Layers: 28
   - Context length: 32,768
4. **Diffusion Features**:
   - Preserve `mask_token_id` metadata
   - RoPE theta: 1,000,000.0
   - Activation function: SiLU
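
Before converting, it can help to sanity-check that your local checkpoint matches these values. Below is a minimal sketch; it assumes the standard Hugging Face `config.json` key names, and a given checkpoint may store some fields (e.g. `mask_token_id`) in its tokenizer or generation config instead:

```python
import json

# Expected Dream-Coder values, taken from the list above
EXPECTED = {
    "mask_token_id": 151666,
    "bos_token_id": 151665,
    "eos_token_id": 151643,
    "pad_token_id": 151643,
    "vocab_size": 152064,
    "hidden_size": 3584,
    "num_attention_heads": 28,
    "num_key_value_heads": 4,
    "num_hidden_layers": 28,
    "max_position_embeddings": 32768,
    "rope_theta": 1000000.0,
}

with open("/path/to/Dream-Coder-v0-Instruct-7B/config.json") as f:
    config = json.load(f)

for key, expected in EXPECTED.items():
    actual = config.get(key)
    status = "OK" if actual == expected else "MISMATCH"
    print(f"{status}: {key} = {actual} (expected {expected})")
```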

## Output Description

### File Structure

```
gguf_output/
├── dream-coder-7b-f16.gguf    # F16 intermediate file (kept with --keep_f16)
└── dream-coder-7b-q8_0.gguf   # Final Q8_0 quantized file
```
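
To confirm that the quantized file carries the expected metadata (including the preserved special-token fields), you can dump the GGUF header. One option is the `gguf-dump` tool that ships with the `gguf` Python package; the exact metadata key names depend on the converter, so the grep filter below is only a convenience:

```bash
pip install gguf
gguf-dump gguf_output/dream-coder-7b-q8_0.gguf | grep -i -E "token|context|head|block"
```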

### Performance Expectations

| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory usage | ~14 GB | ~6.7 GB |
| Inference speed | 1.0x | 1.2-1.5x |
| Precision loss | 0% | <0.1% |

## Usage

### llama.cpp Command Line

Since Dream-Coder is a diffusion-based model, you need the dedicated `llama-diffusion-cli` tool:

```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    -c 2048 \
    --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "Write a binary search function" \
    -n 256 \
    -c 2048 \
    --temp 0.1 \
    --top-p 0.95 \
    --repeat-penalty 1.1 \
    --diffusion-steps 128 \
    --diffusion-algorithm 4 \
    --diffusion-alg-temp 0.0 \
    -t 8

# Visualize the generation process
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def fibonacci(n):" \
    -n 256 \
    --diffusion-steps 64 \
    --diffusion-visual
```

#### Diffusion Parameter Description

- `--diffusion-steps N`: Number of denoising steps (default: 128)
- `--diffusion-algorithm N`: Token-unmasking algorithm:
  - 0 = ORIGIN (original algorithm)
  - 1 = ENTROPY_BASED (entropy-based)
  - 2 = MARGIN_BASED (margin-based)
  - 3 = RANDOM (random)
  - 4 = LOW_CONFIDENCE (default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode to show generation progress
- `--diffusion-eps F`: Timestep epsilon value
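
To compare the unmasking algorithms on the same prompt, a simple loop over the documented flag values works (the prompt and step count here are just examples; judging output quality is manual):

```bash
# Try each of the five --diffusion-algorithm settings in turn
for alg in 0 1 2 3 4; do
    echo "=== algorithm $alg ==="
    ./llama.cpp/build/bin/llama-diffusion-cli \
        -m gguf_output/dream-coder-7b-q8_0.gguf \
        -p "def is_prime(n):" \
        -n 128 \
        --diffusion-steps 64 \
        --diffusion-algorithm $alg
done
```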

### Python (llama-cpp-python)

Note that `llama-cpp-python` exposes the standard autoregressive completion API rather than llama.cpp's diffusion sampler, so treat this path as experimental (see Important Notes below).

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0  # CPU inference; set >0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1
)

print(output['choices'][0]['text'])
```

### With GPU Acceleration

If llama.cpp is built with CUDA support:

```bash
# Rebuild with CUDA enabled
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Offload part of the model to the GPU
./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    --diffusion-steps 128 \
    -ngl 20  # number of layers to offload to the GPU
```

## Troubleshooting

### Common Issues

1. **Conversion failure**:
   - Ensure llama.cpp compiled correctly
   - Check Python dependency versions
   - Verify model file integrity

2. **Quantization failure**:
   - Check disk space (~20 GB of temporary space is needed; see the quick check below)
   - Ensure sufficient memory (32 GB+ recommended)

3. **Inference errors**:
   - Verify GGUF file integrity
   - Check context length settings
   - Try reducing `n_gpu_layers`
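
A rough pre-flight check for the disk and memory requirements above (standard Linux tools; on macOS use `vm_stat` instead of `free`):

```bash
# ~20 GB of free disk and 32 GB+ of RAM are recommended for quantization
df -h .
free -h
```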

### Model Validation

```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test (llama-diffusion-cli takes its prompt via -p)
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def hello():" \
    -n 20 \
    --diffusion-steps 64
```

## Performance Optimization

### CPU Optimization
- Set the thread count with `-t`
- Enable AVX2/AVX512 build options
- Adjust the batch size (`-b`)

### GPU Optimization
- Build with CUDA/OpenCL support
- Adjust the GPU layer count (`-ngl`)
- Monitor GPU memory usage

### Memory Optimization
- Memory mapping is enabled by default; disable it with `--no-mmap` if needed
- Use `--mlock` to pin the model in RAM
- Set an appropriate context length
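
Putting several of these knobs together in one invocation (the values are illustrative; `-t`, `-b`, and `--mlock` are common llama.cpp flags and are assumed here to be wired through in `llama-diffusion-cli`):

```bash
# -t: CPU threads, -b: batch size, --mlock: pin weights in RAM
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 256 \
    -c 2048 \
    -t 8 \
    -b 512 \
    --mlock \
    --diffusion-steps 128
```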

## Important Notes

1. **Diffusion generation**: Dream-Coder generates via diffusion denoising, unlike traditional autoregressive models
2. **Dedicated tool**: Use `llama-diffusion-cli`, not the regular `llama-cli` (formerly `main`) tool
3. **Special tokens**: Keep `mask_token_id` (151666) handled correctly
4. **Context length**: Up to 32K tokens is supported, but 2K-4K is recommended for best performance
5. **Generation parameters**: A low temperature (0.1-0.3) with top_p around 0.9-0.95 is recommended
6. **Diffusion steps**: 64-128 steps are recommended; more steps may improve quality at the cost of inference time

## Technical Support

If you encounter issues, please check:
1. llama.cpp version and build status
2. Python dependency version compatibility
3. Model file integrity
4. System resources (memory/disk)

For more information, refer to:
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)