Magnum v4 12B FP8

Model Overview

Magnum v4 12B FP8 is a large language model fine-tuned from Mistral-Nemo-Instruct-2407 and specialized for creative writing and roleplay scenarios. This version stores the weights in FP8, roughly halving VRAM usage relative to the original bfloat16 checkpoint while largely preserving output quality.
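
As a rough back-of-the-envelope check (an estimate only, ignoring KV cache and activation memory), weight storage scales with bytes per parameter:

# Approximate weight memory for a 12B-parameter model
params = 12e9
bf16_gb = params * 2 / 1e9  # bfloat16: 2 bytes per parameter -> ~24 GB
fp8_gb = params * 1 / 1e9   # FP8: 1 byte per parameter -> ~12 GB
print(f"bf16: ~{bf16_gb:.0f} GB, fp8: ~{fp8_gb:.0f} GB")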

Model Specifications

Parameter            Value
-------------------  ------------------------
Architecture         MistralForCausalLM
Parameters           12B
Hidden Size          5120
Intermediate Size    14336
Attention Heads      32
KV Heads             8 (GQA)
Layers               40
Max Context Length   1024K tokens
Vocabulary Size      131072
Quantization         FP8 (compressed-tensors)
Original Precision   bfloat16
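
These values can be verified directly from the checkpoint's config.json; below is a minimal sketch using transformers, assuming the standard Mistral config field names:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("timerring/magnum-v4-12b-fp8")
print(config.hidden_size)              # 5120
print(config.num_attention_heads)      # 32
print(config.num_key_value_heads)      # 8 (grouped-query attention)
print(config.num_hidden_layers)        # 40
print(config.max_position_embeddings)  # 1024000
print(config.vocab_size)               # 131072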

Key Features

  • Ultra-long context support: up to 1024K tokens, per the model configuration
  • FP8 quantization: lower VRAM usage and faster inference on FP8-capable hardware
  • Tool calling support: built-in special tokens such as [TOOL_CALLS] and [AVAILABLE_TOOLS] (see the check after this list)
  • Multilingual support: English, Chinese, and the other languages covered by the Mistral-Nemo base model
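
A quick way to confirm the tool-calling tokens are present is to look them up in the tokenizer vocabulary (a minimal sketch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("timerring/magnum-v4-12b-fp8")
for token in ["[TOOL_CALLS]", "[AVAILABLE_TOOLS]"]:
    # Prints the token id if the token is in the vocabulary
    print(token, "->", tokenizer.convert_tokens_to_ids(token))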

Quick Start

Install Dependencies

pip install huggingface_hub transformers torch accelerate compressed-tensors

accelerate is required for device_map="auto" in the inference example below, and compressed-tensors lets transformers load this FP8 checkpoint.

Download from Hugging Face

from huggingface_hub import snapshot_download

# Downloads the full repository into the local Hugging Face cache
# and returns the local directory path
model_dir = snapshot_download('timerring/magnum-v4-12b-fp8')

Git Download

# Git LFS is required for the weight files
git lfs install
git clone https://huggingface.co/timerring/magnum-v4-12b-fp8

Inference Example

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "timerring/magnum-v4-12b-fp8"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# device_map="auto" (provided by accelerate) places the weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto"
)

messages = [
    {"role": "user", "content": "Hello, please introduce yourself."}
]

# Render the chat into the model's prompt format and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)

Deploy with vLLM

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model timerring/magnum-v4-12b-fp8 \
    --served-model-name magnum-v4-12b-fp8 \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 16384 \
    --max-num-seqs 32
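
Recent vLLM releases also expose the same server as "vllm serve timerring/magnum-v4-12b-fp8" with identical flags. Once the server is running, any OpenAI-compatible client can talk to it; below is a minimal sketch using the openai Python package (pip install openai; the API key is a placeholder, since the server above is started without authentication):

from openai import OpenAI

# Point the client at the local vLLM server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="magnum-v4-12b-fp8",  # matches --served-model-name
    messages=[{"role": "user", "content": "Write a short opening scene for a fantasy story."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)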

Hardware Requirements

  • FP8 inference: ~12-14GB VRAM for the weights plus runtime overhead (see the estimate below)
  • Recommended GPUs: NVIDIA RTX 4090 (Ada) or H100 (Hopper) for native FP8 compute; Ampere GPUs such as the A100 lack native FP8 but can still run FP8 checkpoints in vLLM via a weight-only fallback
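
The KV cache comes on top of the weight memory and grows linearly with context length. A rough estimate from the specifications above (assuming an fp16 KV cache and the head_dim of 128 used by the Mistral-Nemo config):

# Approximate KV-cache footprint per token: K and V, 2 bytes each (fp16)
layers, kv_heads, head_dim = 40, 8, 128
bytes_per_token = 2 * layers * kv_heads * head_dim * 2
print(bytes_per_token)                  # 163840 bytes (~160 KiB per token)
print(bytes_per_token * 16384 / 2**30)  # ~2.5 GiB at --max-model-len 16384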

Intended Use

This model is designed for:

  • Creative writing and storytelling
  • Roleplay and character interactions
  • General conversational AI applications

Limitations

  • May generate biased or inappropriate content in certain contexts
  • Not suitable for factual or safety-critical applications
  • Performance may vary with different prompt styles

License

Apache License 2.0

Acknowledgments

Thanks to Anthracite for the original Magnum v4 12B fine-tune and to Mistral AI for the Mistral-Nemo-Instruct-2407 base model.