ELYZA-Diffusion-Instruct-1.0-Dream-7B-8bit

This is an 8-bit quantized version of ELYZA-Diffusion-Instruct-1.0-Dream-7B, released by ELYZA, Inc.

Model Description

ELYZA-Diffusion-Instruct-1.0-Dream-7B is a Japanese-adapted diffusion language model based on the open-source diffusion LLM Dream-v0-Instruct-7B, further pretrained and instruction-tuned on large-scale Japanese data.

License

Apache License 2.0

Important Note for 8-bit Version

When using this 8-bit quantized model, you must use alg="origin" for diffusion generation. The alg="entropy" algorithm used in the original model is not compatible with 8-bit quantization.

Note: Due to the algorithm change (alg="origin" instead of alg="entropy"), the generation quality may differ from the original model. Please evaluate the output quality for your specific use case.
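For intuition, here is a conceptual sketch (toy numbers, not the model's internal implementation): entropy-style algorithms prioritize unmasking the positions whose predicted distributions have the lowest entropy (highest confidence), which makes them sensitive to the small logit perturbations that 8-bit quantization introduces, whereas alg="origin" does not depend on these confidence estimates.

```python
import math

def entropy(probs):
    # Shannon entropy of a probability distribution (natural log)
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy per-position token distributions: position 0 is confident, position 1 is not.
positions = [
    [0.97, 0.01, 0.01, 0.01],  # low entropy -> unmasked first under an entropy rule
    [0.25, 0.25, 0.25, 0.25],  # high entropy -> unmasked later
]

next_to_unmask = min(range(len(positions)), key=lambda i: entropy(positions[i]))
print(next_to_unmask)  # the confident position, index 0
```

Small quantization errors can reorder these entropy values when distributions are close, which is one plausible reason the entropy algorithm degrades under 8-bit weights.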

Usage Example

import torch
import time
from transformers import AutoModel, AutoTokenizer

def clear_screen():
    # ANSI escapes: move the cursor to the home position, then clear the screen
    print("\033[H\033[J", end="")

model_path = "high-u/ELYZA-Diffusion-Instruct-1.0-Dream-7B-8bit"

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
).eval()

tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "回答は必ずMarkdown形式で記述してください。トピックごとに適切な見出しを付け、重要なポイントは箇条書きや太字で強調し、視覚的に分かりやすく整理された文章で答えてください。"},
    {"role": "user", "content": "空はなぜ青い?"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    return_dict=True,
    add_generation_prompt=True,
)

# Move the prompt tensors to the GPU
input_ids = inputs.input_ids.to("cuda")
attention_mask = inputs.attention_mask.to("cuda")

# Hook called by diffusion_generate at every step: receives the step index,
# the current token tensor x, and the logits; must return x (unchanged here).
def stream_visualizer(step, x, logits):
    decoded = tokenizer.decode(x[0], skip_special_tokens=False)
    mask_token = tokenizer.mask_token if tokenizer.mask_token else "<|mask|>"
    display_text = decoded.replace(mask_token, "__")
    clear_screen()
    print(f"--- Diffusion Step {step} ---")
    print(display_text)
    return x

start_time = time.time()
with torch.inference_mode():
    output = model.diffusion_generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=512,
        steps=256,
        temperature=0.5,
        top_p=0.95,
        alg="origin",  # required for the 8-bit model (see note above)
        generation_tokens_hook_func=stream_visualizer,
    )
end_time = time.time()
generated_tokens = output.shape[1] - input_ids.shape[1]
elapsed_time = end_time - start_time
tokens_per_sec = generated_tokens / elapsed_time if elapsed_time > 0 else 0
print("\n--- Stats ---")
print(f"Generated tokens: {generated_tokens}")
print(f"Time: {elapsed_time:.2f}s")
print(f"Tokens/s: {tokens_per_sec:.2f}")
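The streaming display above requires the model and tokenizer; as a standalone illustration (hypothetical tokens, no model required), this is roughly what the visualizer renders while some positions are still masked:

```python
def render_step(tokens, masked, placeholder="__"):
    # Show a placeholder for positions the diffusion process has not unmasked yet
    return " ".join(placeholder if i in masked else tok for i, tok in enumerate(tokens))

# Hypothetical token sequence with positions 2 and 5 still masked
tokens = ["The", "sky", "is", "blue", "because", "of", "scattering"]
print(render_step(tokens, masked={2, 5}))  # The sky __ blue because __ scattering
```

Each diffusion step shrinks the masked set, so successive renders fill in more of the text.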

Tested Environment

This model was tested with the following environment:

  • GPU: NVIDIA RTX 5070 Ti
  • Python: >= 3.13
  • Dependencies:
    • accelerate >= 1.12.0
    • bitsandbytes >= 0.45.0
    • protobuf >= 6.33.4
    • sentencepiece >= 0.2.1
    • torch >= 2.10.0
    • transformers == 4.46.2
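A sketch of an install command matching the versions listed above (pins taken from the list; adjust to your environment and CUDA setup):

```shell
pip install "accelerate>=1.12.0" "bitsandbytes>=0.45.0" "protobuf>=6.33.4" \
    "sentencepiece>=0.2.1" "torch>=2.10.0" "transformers==4.46.2"
```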

Original Model

For more details on the model design and training setup, please refer to the original model card for ELYZA-Diffusion-Instruct-1.0-Dream-7B.

Acknowledgments

This model is a quantized version of ELYZA-Diffusion-Instruct-1.0-Dream-7B. I would like to express my sincere gratitude to ELYZA, Inc. for releasing such an excellent model to the open-source community.
