Magnum v4 12B FP8

Model Overview

Magnum v4 12B FP8 is a large language model fine-tuned from Mistral-Nemo-Instruct-2407 and specialized for creative writing and roleplay scenarios. This version stores the weights in FP8, roughly halving VRAM usage relative to the original bfloat16 checkpoint while largely preserving output quality.
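
As a rough back-of-the-envelope check (an estimate only, ignoring KV cache and activation memory), weight storage scales with bytes per parameter:

# Approximate weight memory for a 12B-parameter model
params = 12e9
bf16_gb = params * 2 / 1e9  # bfloat16: 2 bytes per parameter -> ~24 GB
fp8_gb = params * 1 / 1e9   # FP8: 1 byte per parameter -> ~12 GB
print(f"bf16: ~{bf16_gb:.0f} GB, fp8: ~{fp8_gb:.0f} GB")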

Model Specifications

Parameter            Value
-------------------  ------------------------
Architecture         MistralForCausalLM
Parameters           12B
Hidden Size          5120
Intermediate Size    14336
Attention Heads      32
KV Heads             8 (GQA)
Layers               40
Max Context Length   1024K tokens
Vocabulary Size      131072
Quantization         FP8 (compressed-tensors)
Original Precision   bfloat16
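
These values can be verified directly from the checkpoint's config.json; below is a minimal sketch using transformers, assuming the standard Mistral config field names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("timerring/magnum-v4-12b-fp8")
print(config.hidden_size)              # 5120
print(config.num_attention_heads)      # 32
print(config.num_key_value_heads)      # 8 (grouped-query attention)
print(config.num_hidden_layers)        # 40
print(config.max_position_embeddings)  # 1024000
print(config.vocab_size)               # 131072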

Key Features

  • Ultra-long context support: up to 1024K tokens, per the model configuration
  • FP8 quantization: lower VRAM usage and faster inference on FP8-capable hardware
  • Tool calling support: built-in special tokens such as [TOOL_CALLS] and [AVAILABLE_TOOLS] (see the check after this list)
  • Multilingual support: English, Chinese, and the other languages covered by the Mistral-Nemo base model
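
A quick way to confirm the tool-calling tokens are present is to look them up in the tokenizer vocabulary (a minimal sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("timerring/magnum-v4-12b-fp8")
for token in ["[TOOL_CALLS]", "[AVAILABLE_TOOLS]"]:
    # Prints the token id if the token is in the vocabulary
    print(token, "->", tokenizer.convert_tokens_to_ids(token))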

Quick Start

Install Dependencies

pip install huggingface_hub transformers torch accelerate compressed-tensors

accelerate is required for device_map="auto" in the inference example below, and compressed-tensors lets transformers load this FP8 checkpoint.

Download from Hugging Face

from huggingface_hub import snapshot_download

# Downloads the full repository into the local Hugging Face cache
# and returns the local directory path
model_dir = snapshot_download('timerring/magnum-v4-12b-fp8')

Git Download

# Git LFS is required for the weight files
git lfs install
git clone https://huggingface.co/timerring/magnum-v4-12b-fp8

Inference Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "timerring/magnum-v4-12b-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" (provided by accelerate) places the weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "Hello, please introduce yourself."}
]

# Render the chat into the model's prompt format and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Deploy with vLLM

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model timerring/magnum-v4-12b-fp8 \
    --served-model-name magnum-v4-12b-fp8 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --max-num-seqs 32
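
Recent vLLM releases also expose the same server as "vllm serve timerring/magnum-v4-12b-fp8" with identical flags. Once the server is running, any OpenAI-compatible client can talk to it; below is a minimal sketch using the openai Python package (pip install openai; the API key is a placeholder, since the server above is started without authentication):

from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="magnum-v4-12b-fp8",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a short opening scene for a fantasy story."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)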

Hardware Requirements

  • FP8 inference: ~12-14GB VRAM for the weights plus runtime overhead (see the estimate below)
  • Recommended GPUs: NVIDIA RTX 4090 (Ada) or H100 (Hopper) for native FP8 compute; Ampere GPUs such as the A100 lack native FP8 but can still run FP8 checkpoints in vLLM via a weight-only fallback
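
The KV cache comes on top of the weight memory and grows linearly with context length. A rough estimate from the specifications above (assuming an fp16 KV cache and the head_dim of 128 used by the Mistral-Nemo config):

# Approximate KV-cache footprint per token: K and V, 2 bytes each (fp16)
layers, kv_heads, head_dim = 40, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2
print(bytes_per_token)                  # 163840 bytes (~160 KiB per token)
print(bytes_per_token * 16384 / 2**30)  # ~2.5 GiB at --max-model-len 16384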

Intended Use

This model is designed for:

  • Creative writing and storytelling
  • Roleplay and character interactions
  • General conversational AI applications

Limitations

  • May generate biased or inappropriate content in certain contexts
  • Not suitable for factual or safety-critical applications
  • Performance may vary with different prompt styles

License

Apache License 2.0

Acknowledgments

Thanks to Anthracite for the original Magnum v4 12B fine-tune and to Mistral AI for the Mistral-Nemo-Instruct-2407 base model.