
HyperCLOVA X SEED 8B Omni

Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model, built on an auto-regressive Transformer architecture, that brings text, vision, and speech together for consistent multimodal understanding and generation. It aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities within a 32K context window: established text capabilities, vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and a reference point for future development and scaling.


Basic Information

  • Architecture: Transformer-based omni-model architecture (Dense Model)
  • Parameters: 8B
  • Input Format: Text/Image/Video/Audio (Speech)
  • Output Format: Text/Image/Audio (Speech)
  • Context Length: 32K
  • Knowledge Cutoff: May 2025

Benchmarks

(Benchmark results figure from the technical report)

  • Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
  • Vision-to-Text: SEED-IMG, AI2D, K-MMBench
  • Text-to-Vision: GenEval, ImgEdit
  • Audio-to-Text: LibriSpeech, KsponSpeech
  • Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

Examples

Text-to-Image Generation

(Example generated images)

Text-based Image Editing

(Example edited images)


Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

  • Inputs: Text, Image, Audio, Video
  • Outputs: Text, Image, Audio (no video generation)

Requirements

  • 4x NVIDIA A100 80GB
  • Docker & Docker Compose
  • NVIDIA Driver 525+, CUDA 12.1+
  • S3-compatible storage (for image/audio output)

Installation

# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
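
Before sending requests, you can wait for the server to finish loading. A minimal readiness poll in Python (a sketch; it assumes the standard OpenAI-compatible /models endpoint is exposed and answers once startup completes):

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")

# Poll until the endpoint answers; model loading takes roughly 5 minutes
for _ in range(60):
    try:
        client.models.list()
        print("Server ready")
        break
    except Exception:
        time.sleep(10)
else:
    raise RuntimeError("Server did not become ready in time")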

Basic Usage

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)

More Examples

Text to Image

import json

SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")

Text to Audio

import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")

Audio Input

import base64

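# Input audio (URL encoded as base64)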
audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
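
Because the audio URL itself is base64-encoded into the data field, a small helper keeps call sites tidy (encode_audio_url is a hypothetical name, not part of OmniServe):

import base64

def encode_audio_url(url: str) -> str:
    """Base64-encode an audio URL into the input_audio data field OmniServe expects."""
    return base64.b64encode(url.encode()).decode()

audio_data = encode_audio_url("https://example.com/audio.mp3")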

Video Input

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)

Image to Image

import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")

Audio to Audio

import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")

Using curl

# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'

Architecture

                         User Request
                    (Image/Audio/Video/Text)
                              β”‚
                              β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                            OmniServe                                    β”‚
β”‚                  POST /b/v1/chat/completions                            β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                     [1] INPUT ENCODING                           β”‚   β”‚
β”‚  β”‚                                                                  β”‚   β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚   β”‚
β”‚  β”‚    β”‚  Vision Encoder β”‚               β”‚  Audio Encoder  β”‚         β”‚   β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β”‚   β”‚
β”‚  β”‚             β”‚                                 β”‚                  β”‚   β”‚
β”‚  β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚   β”‚
β”‚  β”‚                          β”‚ embeddings                            β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                             β–Ό                                           β”‚
β”‚                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                    β”‚
β”‚                     β”‚   LLM (8B)   │◀──── text                          β”‚
β”‚                     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                    β”‚
β”‚                            β”‚                                            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                  [2] OUTPUT DECODING                             β”‚   β”‚
β”‚  β”‚                         β”‚                                        β”‚   β”‚
β”‚  β”‚          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚   β”‚
β”‚  β”‚          β–Ό              β–Ό              β–Ό                         β”‚   β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   β”‚   β”‚
β”‚  β”‚    β”‚   Text    β”‚  β”‚  Vision   β”‚  β”‚   Audio   β”‚                   β”‚   β”‚
β”‚  β”‚    β”‚           β”‚  β”‚  Decoder  β”‚  β”‚  Decoder  β”‚                   β”‚   β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                   β”‚   β”‚
β”‚  β”‚                         β”‚              β”‚                         β”‚   β”‚
β”‚  β”‚                         β–Ό              β–Ό                         β”‚   β”‚
β”‚  β”‚                    Image URL      Audio URL                      β”‚   β”‚
β”‚  β”‚                      (S3)           (S3)                         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                         Response
                   (Text / Image URL / Audio URL)

Hardware Requirements

Component        GPU        VRAM
---------------  ---------  ------
Vision Encoder   1x         ~8GB
Audio Encoder    (shared)   ~4GB
LLM (8B)         1x         ~16GB
Vision Decoder   1x         ~16GB
Audio Decoder    (shared)   ~4GB
Total            3x         ~48GB

Key Parameters

Parameter                            Description                     Default
-----------------------------------  ------------------------------  -------
chat_template_kwargs.skip_reasoning  Skip reasoning                  true
max_tokens                           Max output tokens               -
temperature                          Sampling temperature            0.7
tools                                Required for image generation   -
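
For reference, this is how those parameters map onto a request (illustrative values, using the client from Basic Usage):

response = client.chat.completions.create(
    model="track_b_model",
    messages=[{"role": "user", "content": "Summarize this model in one sentence."}],
    max_tokens=128,        # max output tokens
    temperature=0.7,       # sampling temperature (the table's default)
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},  # skip reasoning
)
print(response.choices[0].message.content)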

S3 Configuration

Required for image/audio generation (set these values in .env):

NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
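
To sanity-check the credentials before launching the stack, you can hit the bucket with boto3 (a sketch; assumes boto3 is installed and that the standard S3 client works against your S3-compatible endpoint):

import os
import boto3

# Build a client against the S3-compatible endpoint configured in .env
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["NCP_S3_ENDPOINT"],
    aws_access_key_id=os.environ["NCP_S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["NCP_S3_SECRET_KEY"],
)
# head_bucket raises if the bucket is unreachable or credentials are rejected
s3.head_bucket(Bucket=os.environ["NCP_S3_BUCKET_NAME"])
print("S3 bucket reachable")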

For more details, see the OmniServe documentation.


Citation

TBU (Technical Report)


Questions

For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.


License

The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.
