---
license: apache-2.0
language:
- en
- ko
base_model:
- openai/gpt-oss-safeguard-120b
base_model_relation: merge
pipeline_tag: text-generation
library_name: transformers
tags:
- sft
- trl
- transformers
- safety
- reasoning
---
# Vayne-V3-Pro
**Vayne-V3-Pro** is a **fully fine-tuned, MXFP4-quantized enterprise LLM** built for **AI agent frameworks**, **MCP-based tool orchestration**, **Retrieval-Augmented Generation (RAG) pipelines**, and **secure on-premise deployment**.
Building on the foundation of Vayne-V3, Vayne-V3-Pro delivers deeper model adaptation through **full-parameter Supervised Fine-Tuning (SFT)** combined with **NVIDIA ModelOpt Quantization-Aware Training (QAT)**, resulting in significantly improved instruction-following, identity consistency, and inference efficiency.
- **Full-parameter fine-tuning** for deeper knowledge integration (vs. LoRA in V2)
- **MXFP4 quantization** via NVIDIA ModelOpt for fast, memory-efficient inference
- **Enhanced multilingual reasoning** with Korean Chain-of-Thought capabilities
- Seamless integration with MCP-based multi-tool orchestration
- Secure deployment in private or regulated environments
---
## What's New in V3
| Feature | V2 | V3 |
|---------|----|----|
| Fine-Tuning Method | LoRA (Adapter) | **Full-Parameter SFT** |
| Quantization | BF16 / FP16 | **MXFP4 (QAT)** |
| Identity Alignment | Basic | **Enhanced (5x oversampled identity training)** |
| Multilingual Reasoning | Bilingual QA | **Korean Chain-of-Thought Thinking** |
| Training Pipeline | Single-step | **3-Step QAT Recipe** |
---
## Key Design Principles
| Feature | Description |
|---------|-------------|
| Private AI Ready | Deploy fully **on-premise** or in **air-gapped** secure environments |
| Efficient Inference | **MXFP4 quantization** enables fast inference on a single GPU |
| Enterprise Reasoning | Structured output and instruction-following for **business automation** |
| Agent & MCP Native | Built for **AI agent frameworks** and **MCP-based tool orchestration** |
| RAG Enhanced | Optimized for **retrieval workflows** with vector DBs (FAISS, Milvus, pgvector, etc.) |
---
## Model Architecture & Training
| Specification | Details |
|---------------|---------|
| Base Model | [openai/gpt-oss-safeguard-120b](https://huggingface.co/openai/gpt-oss-safeguard-120b) |
| Parameters | 117B (Active: 5.1B) |
| Training Precision | BF16 |
| Inference Precision | **MXFP4** (Quantization-Aware Training) |
| Architecture | Decoder-only Transformer (MoE, 128 experts / 4 active) |
| Safety Architecture | Chain-of-Thought Reasoning |
| Context Length | 128K tokens |
| Inference | Single-GPU (80GB VRAM, H100 / MI300X) / Multi-GPU |
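The single-GPU figure in the table can be sanity-checked with rough arithmetic. The sketch below assumes MXFP4 stores 4-bit elements plus one shared 8-bit scale per block of 32 values (4.25 bits per weight) and that any unquantized weights stay in BF16. The `quantized_fraction` value is a hypothetical illustration, not a published number for this model, and the estimate ignores the KV cache and activations.

```python
# Rough weight-memory estimate for an MXFP4 checkpoint.
# Assumption: 4-bit elements + one 8-bit scale per 32-value block.
MXFP4_BITS = 4 + 8 / 32   # 4.25 bits per quantized weight
BF16_BITS = 16


def weight_memory_gb(total_params: float, quantized_fraction: float) -> float:
    """Return approximate weight storage in GB (10**9 bytes)."""
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be in [0, 1]")
    quantized_bits = total_params * quantized_fraction * MXFP4_BITS
    dense_bits = total_params * (1.0 - quantized_fraction) * BF16_BITS
    return (quantized_bits + dense_bits) / 8 / 1e9


# 117B total parameters (from the table above).
full_bf16 = weight_memory_gb(117e9, quantized_fraction=0.0)
# Hypothetical: 95% of parameters (the MoE MLP weights) held in MXFP4.
mostly_mxfp4 = weight_memory_gb(117e9, quantized_fraction=0.95)
print(f"BF16 only:        {full_bf16:.1f} GB")
print(f"95% MXFP4 (est.): {mostly_mxfp4:.1f} GB")
```

Under these assumptions the quantized weights fit within 80 GB, while an all-BF16 copy would not. How much room is left for long-context KV cache depends on the serving stack.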
### Training Pipeline — 3-Step QAT Recipe
Vayne-V3-Pro is trained using a **3-step Quantization-Aware Training (QAT) recipe** powered by NVIDIA ModelOpt:
```
Step 1: Full-Parameter SFT
└─ Standard supervised fine-tuning on BF16 weights (no quantization)
Step 2: Quantization-Aware Training (QAT)
└─ Fine-tune with MXFP4_MLP_WEIGHT_ONLY quantization config
└─ Lower learning rate (1e-5) for stable convergence
Step 3: MXFP4 Conversion
└─ Convert trained model to MXFP4 format via nvidia_convert.py
└─ Optimized for production inference
```
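To illustrate what Step 2 trains against: QAT is commonly implemented with "fake quantization", where weights are rounded to the target format and immediately dequantized in the forward pass so the model learns to tolerate the rounding error. The following is a simplified pure-Python sketch of block-wise MXFP4 rounding, not ModelOpt's implementation. It assumes 32-value blocks with a shared power-of-two scale and 4-bit E2M1 elements, and its tie-breaking is simplified.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # values sharing one power-of-two scale


def fake_quantize_block(block):
    """Quantize one block to MXFP4 and dequantize it back to floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0 for _ in block]
    # Shared scale: a power of two chosen so the largest value lands in the
    # top E2M1 binade (largest E2M1 exponent is 2, since 6 = 1.5 * 2**2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to max magnitude
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(weights):
    """Apply block-wise fake quantization to a flat list of weights."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(fake_quantize_block(weights[start:start + BLOCK_SIZE]))
    return out


weights = [0.1 * i for i in range(-16, 16)]  # one block of 32 toy weights
dequantized = fake_quantize(weights)
max_error = max(abs(w - q) for w, q in zip(weights, dequantized))
print(f"max abs rounding error: {max_error:.4f}")
```

In a training framework the rounding step is typically paired with a straight-through gradient estimator, which this sketch does not model.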
### Training Data
The model was fine-tuned with full-parameter supervised fine-tuning (SFT) on proprietary and curated datasets covering:
- Model identity and persona alignment
- Domain-specific knowledge for targeted enterprise verticals
- Multilingual Chain-of-Thought reasoning (Korean-English)
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Learning Rate (SFT) | 2.0e-5 |
| Learning Rate (QAT) | 1.0e-5 |
| Batch Size | 2 per device |
| Epochs | 1.0 |
| Max Sequence Length | 131,072 |
| Warmup Ratio | 0.03 |
| LR Scheduler | Cosine with Min LR (10%) |
| Gradient Checkpointing | Enabled |
| Training Infrastructure | NVIDIA H200 x 16 |
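The warmup ratio and "Cosine with Min LR (10%)" scheduler in the table can be read as the schedule sketched below. This is an illustrative reconstruction, assuming linear warmup from zero and a cosine decay that bottoms out at 10% of the peak rate. `TOTAL_STEPS` is a hypothetical value, since the card does not state the step count.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.03, min_lr_rate=0.10):
    """Linear warmup to peak_lr, then cosine decay to min_lr_rate * peak_lr."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_rate + (1.0 - min_lr_rate) * cosine)


TOTAL_STEPS = 1000  # hypothetical; not taken from the model card
for name, peak in (("SFT", 2.0e-5), ("QAT", 1.0e-5)):
    # Sample the start, end of warmup, midpoint of decay, and final step.
    samples = [lr_at_step(s, TOTAL_STEPS, peak) for s in (0, 30, 515, 1000)]
    print(name, [f"{lr:.2e}" for lr in samples])
```

With these settings the QAT stage never exceeds half of the SFT peak rate, consistent with the "lower learning rate for stable convergence" note in the recipe.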
---
## Safety & Reasoning Features
Vayne-V3-Pro inherits advanced safety reasoning capabilities from gpt-oss-safeguard-120b:
| Feature | Description |
|---------|-------------|
| **Chain-of-Thought Safety** | Transparent reasoning process for content safety decisions |
| **Bring Your Own Policy** | Custom policy interpretation and application |
| **Configurable Reasoning** | Adjustable reasoning effort (Low/Medium/High) |
| **Explainable Outputs** | Full CoT traces for safety decision auditing |
### Reasoning Effort Levels
| Level | Use Case | Trade-off |
|-------|----------|-----------|
| **Low** | Fast filtering, real-time applications | Speed-optimized, lower latency |
| **Medium** | Balanced production use | Balanced accuracy and speed |
| **High** | Critical content review | Maximum accuracy, higher latency |
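A minimal sketch of how a custom policy and an effort level might be passed to the model. It assumes Vayne-V3-Pro keeps the convention of the gpt-oss base models, where the level is requested with a `Reasoning: <level>` line in the system message and the policy text is also supplied there. The helper name and the policy text are hypothetical.

```python
ALLOWED_EFFORTS = ("low", "medium", "high")


def build_messages(policy: str, content: str, effort: str = "medium"):
    """Build a chat message list with a policy and a reasoning-effort hint."""
    effort = effort.lower()
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"effort must be one of {ALLOWED_EFFORTS}")
    # Assumption: the effort level is read from a "Reasoning: <level>" line
    # in the system message, as with the gpt-oss base models.
    system = f"Reasoning: {effort}\n\n{policy.strip()}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": content},
    ]


# Hypothetical policy text, for illustration only.
policy = "Classify the user content as VIOLATING or SAFE under the spam policy."
messages = build_messages(policy, "Buy cheap watches now!!!", effort="high")
print(messages[0]["content"])
```

The resulting list can be passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")` before calling `model.generate`.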
---
## Secure On-Premise Deployment
Vayne-V3-Pro is built for **enterprise AI inside your firewall**.
- No external API dependency
- Compatible with **offline environments**
- MXFP4 quantization for **resource-efficient deployment**
- Proven for secure, regulated environments
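For offline or air-gapped hosts, one possible setup is to copy the model files onto the machine and switch the Hugging Face libraries into offline mode. The sketch below sets the `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` environment variables. The helper name is hypothetical, and this is an illustration rather than a deployment procedure from the model authors.

```python
import os


def enable_offline_mode(env=os.environ):
    """Tell the Hugging Face libraries not to attempt network access."""
    env["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: use local files only
    env["TRANSFORMERS_OFFLINE"] = "1"  # transformers: skip online lookups
    return env


# Call before importing transformers so the settings take effect.
enable_offline_mode()
print(os.environ["HF_HUB_OFFLINE"], os.environ["TRANSFORMERS_OFFLINE"])
```

After this, the model can be loaded from a local directory by passing its path together with `local_files_only=True` to `from_pretrained`.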
---
## MCP (Model Context Protocol) Integration
Vayne-V3-Pro supports **MCP-based agent tooling**, making it easy to build tool-use AI agents.
Works seamlessly with:
- Claude MCP-compatible agent systems
- Local agent runtimes
- JSON structured execution
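As an illustration of JSON structured execution, the sketch below turns a structured tool request into an MCP `tools/call` message, which MCP transports as JSON-RPC 2.0. The shape of `model_output` and the tool name `search_documents` are hypothetical; an agent runtime would define its own output schema and tool registry.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per session


def mcp_tool_call(name: str, arguments: dict) -> str:
    """Serialize a tool invocation as an MCP `tools/call` JSON-RPC request."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical structured output from the model, naming a hypothetical tool.
model_output = '{"tool": "search_documents", "arguments": {"query": "Q3 revenue"}}'
call = json.loads(model_output)
print(mcp_tool_call(call["tool"], call["arguments"]))
```

Keeping the model's output as plain JSON and doing the protocol framing in the runtime means the same model output can drive different MCP servers.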
---
## RAG Compatibility
Designed for **hybrid reasoning + retrieval**.
- Works with FAISS, Chroma, Elasticsearch
- Handles long-context document QA
- Ideal for enterprise knowledge bases
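The retrieve-then-prompt flow can be sketched without any external service. The toy example below uses bag-of-words cosine similarity as a stand-in for a real embedding model and vector database, so the function names, prompt wording, and documents are all illustrative assumptions.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a vector model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: list) -> str:
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"


docs = [
    "vacation requests must be filed two weeks in advance",
    "the vpn client is required for remote access",
    "expense reports are due on the fifth of each month",
]
question = "when are expense reports due"
print(build_prompt(question, retrieve(question, docs, k=1)))
```

In a production pipeline `embed` and `retrieve` would be replaced by calls to the chosen vector store, while the prompt-assembly step stays essentially the same.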
---
## Quick Start
```bash
pip install transformers accelerate
```
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PoSTMEDIA/Vayne-V3-Pro"

# Load the tokenizer and model; device_map="auto" places the weights on the
# available GPU(s) and relies on the accelerate package installed above.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Explain the benefits of private AI for enterprise security."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate up to 1024 new tokens, then decode the full sequence (prompt included).
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
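For served deployments, the model can also be queried over an OpenAI-compatible HTTP API. The sketch below assumes a local server started with `vllm serve "PoSTMEDIA/Vayne-V3-Pro"` and listening on port 8000. The helper names are hypothetical, and `chat` only succeeds when such a server is running.

```python
import json
import urllib.request

MODEL = "PoSTMEDIA/Vayne-V3-Pro"
# Assumption: a local OpenAI-compatible server (for example vLLM) on port 8000.
URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a chat-completions POST request for a single user message."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("Explain the benefits of private AI for enterprise security.")` returns the reply as a string once the server is up. No traffic leaves the host, which fits the on-premise focus of the model.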
---
## Use Cases
- Internal enterprise AI assistant
- Private AI document analysis
- Business writing (reports, proposals, strategy)
- AI automation agents with MCP tool orchestration
- Secure RAG search systems
- Multilingual (Korean-English) reasoning tasks
---
## Safety & Limitations
- Not intended for medical, legal, or financial decision-making
- May occasionally hallucinate, producing plausible but incorrect statements
- Use human validation for critical outputs
- Recommended: enable output guardrails for production
---
## Citation
```bibtex
@misc{vayne2026,
title={Vayne-V3-Pro: Fully Fine-Tuned Enterprise LLM with MXFP4 Quantization-Aware Training},
author={PoSTMEDIA AI Lab},
year={2026},
publisher={Hugging Face}
}
```
---
## Contact
**PoSTMEDIA AI Lab**
- Email: [dev.postmedia@gmail.com](mailto:dev.postmedia@gmail.com)
- Web: [https://postmedia.ai](https://postmedia.ai)
- Web: [https://postmedia.co.kr](https://postmedia.co.kr)