---
license: other
license_name: hyperclovax
license_link: LICENSE
library_name: transformers
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6512d9827fccffe1e9e28fa7/Ht_J0a4_5tfSrjmTnw4tS.png)

# Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together in a single auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities within a 32K context window. Alongside its established text capabilities, it covers vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward **Any-to-Any-Korean-First** intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.

---

# Technical Report

- [HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)](./HyperCLOVA_X_8B_Omni.pdf)

---

# Basic Information

- **Architecture**: Transformer-based omni-model architecture (dense model)
- **Parameters**: 8B
- **Input Format**: Text / Image / Video / Audio (Speech)
- **Output Format**: Text / Image / Audio (Speech)
- **Context Length**: 32K
- **Knowledge Cutoff**: May 2025

---

# Benchmarks

![Benchmark Performance Comparison](https://cdn-uploads.huggingface.co/production/uploads/646acbfe8ba1af5a9b7ba8cb/S0BdE7FkNvnn_PuTrYPfE.png)

- **Text-to-Text**: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
- **Vision-to-Text**: SEED-IMG, AI2D, K-MMBench
- **Text-to-Vision**: GenEval, ImgEdit
- **Audio-to-Text**: LibriSpeech, KsponSpeech
- **Audio-to-Audio**: FLEURS en2ko, FLEURS ko2en

---

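# Examples

## Text-to-Image Generation

![Text-to-Image Generation](https://cdn-uploads.huggingface.co/production/uploads/646acbfe8ba1af5a9b7ba8cb/6FjnIPiOo0M-9gJy6Qe9I.png)

## Text-based Image Editing

![Text-based Image Editing Example 1](https://cdn-uploads.huggingface.co/production/uploads/646acbfe8ba1af5a9b7ba8cb/qbV1kIc_w6zbz6OGWxPAl.png)

![Text-based Image Editing Example 2](https://cdn-uploads.huggingface.co/production/uploads/646acbfe8ba1af5a9b7ba8cb/gMA6MUAt8jbT70WXBcoVe.png)

![Text-based Image Editing Example 3](https://cdn-uploads.huggingface.co/production/uploads/646acbfe8ba1af5a9b7ba8cb/hfcEO1TM4oUR6B0IP_wnz.png)

---
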
# Inference

We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.

## Capabilities

- **Inputs**: Text, Image, Audio, Video
- **Outputs**: Text, Image, Audio (no video generation)

## Requirements

- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
- S3-compatible storage (for image/audio output)

## Installation

```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
```

## Basic Usage

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
```

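The mixed image-and-text message used above can be assembled with a small helper. This is a minimal sketch: `build_image_message` is a hypothetical name, not part of OmniServe or the `openai` package, and it only builds the dictionary in the shape shown above without making any network call.

```python
from typing import Any


def build_image_message(image_url: str, prompt: str) -> dict[str, Any]:
    """Build one user message: an image part followed by a text part.

    Hypothetical helper that mirrors the message shape used in Basic Usage.
    """
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }


message = build_image_message(
    "https://example.com/image.jpg", "What is in this image?"
)
print(message["content"][1]["text"])
```

The result can be passed as `messages=[message]` to `client.chat.completions.create(...)`.
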
## More Examples

<details>
<summary>Text to Image</summary>

```python
import json

# Reuses `client` from Basic Usage.
SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

# The generated image is returned through the tool call arguments.
if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
```

</details>

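Reading the tool call result can be made a little more defensive. The sketch below assumes only the response shape used in the example above (`message.tool_calls[i].function.name` and `.arguments`). `extract_image_result` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for `response.choices[0].message` so the snippet runs without a server; the placeholder string is not real server output.

```python
import json
from types import SimpleNamespace
from typing import Optional


def extract_image_result(message, tool_name: str = "t2i_model_generation") -> Optional[str]:
    """Return `discrete_image_token` from the first matching tool call, or None."""
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name != tool_name:
            continue
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return None  # arguments were not valid JSON
        if isinstance(args, dict):
            return args.get("discrete_image_token")
        return None
    return None


# Stand-in for response.choices[0].message (hypothetical placeholder value).
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="t2i_model_generation",
                arguments=json.dumps({"discrete_image_token": "placeholder-result"}),
            )
        )
    ]
)
print(extract_image_result(fake_message))
```

Returning `None` instead of raising lets the caller decide whether to retry, for instance with a larger `max_tokens`.
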
<details>
<summary>Text to Audio</summary>

```python
import base64

# Reuses `client` from Basic Usage.
# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

# `audio.data` holds the base64-encoded URL of the generated audio file.
if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
```

</details>

<details>
<summary>Audio Input</summary>

```python
import base64

# Reuses `client` from Basic Usage.
# The audio URL itself (not the audio bytes) is base64-encoded.
audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
```

</details>

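In the audio examples, the value that travels in `input_audio.data` and in `message.audio.data` is the base64 form of a URL string, not of the audio bytes. The round trip can be sketched with two small helpers; `encode_audio_url` and `decode_audio_url` are hypothetical names that wrap the same `base64` calls the examples use.

```python
import base64


def encode_audio_url(url: str) -> str:
    """Base64-encode a URL string for use as `input_audio.data`."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def decode_audio_url(data: str) -> str:
    """Decode a base64 string such as `message.audio.data` back to a URL."""
    return base64.b64decode(data).decode("utf-8")


encoded = encode_audio_url("https://example.com/audio.mp3")
print(encoded)
print(decode_audio_url(encoded))
```
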
<details>
<summary>Video Input</summary>

```python
# Reuses `client` from Basic Usage.
# In this example the video URL is passed with the `image_url` content type.
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
```

</details>

<details>
<summary>Image to Image</summary>

```python
import json

# Reuses `client` from Basic Usage.
SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

# The transformed image is returned through the tool call arguments.
if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
```

</details>

<details>
<summary>Audio to Audio</summary>

```python
import base64

# Reuses `client` from Basic Usage.
# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

# `audio.data` holds the base64-encoded URL of the generated audio file.
if response.choices[0].message.audio:
    output_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {output_url}")
```

</details>

<details>
<summary>Using curl</summary>

```bash
# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'
```

</details>

## Architecture

```
                              User Request
                        (Image/Audio/Video/Text)
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                OmniServe                                │
│                       POST /b/v1/chat/completions                       │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                        [1] INPUT ENCODING                        │   │
│  │                                                                  │   │
│  │        ┌─────────────────┐            ┌─────────────────┐        │   │
│  │        │ Vision Encoder  │            │  Audio Encoder  │        │   │
│  │        └────────┬────────┘            └────────┬────────┘        │   │
│  │                 │                              │                 │   │
│  │                 └──────────────┬───────────────┘                 │   │
│  │                                │ embeddings                      │   │
│  └────────────────────────────────┼─────────────────────────────────┘   │
│                                   ▼                                     │
│                            ┌──────────────┐                             │
│                            │   LLM (8B)   │◀──── text                   │
│                            └──────┬───────┘                             │
│                                   │                                     │
│  ┌────────────────────────────────┼─────────────────────────────────┐   │
│  │                       [2] OUTPUT DECODING                        │   │
│  │                                │                                 │   │
│  │                 ┌──────────────┼──────────────┐                  │   │
│  │                 ▼              ▼              ▼                  │   │
│  │           ┌───────────┐  ┌───────────┐  ┌───────────┐            │   │
│  │           │   Text    │  │  Vision   │  │   Audio   │            │   │
│  │           │           │  │  Decoder  │  │  Decoder  │            │   │
│  │           └───────────┘  └─────┬─────┘  └─────┬─────┘            │   │
│  │                                │              │                  │   │
│  │                                ▼              ▼                  │   │
│  │                            Image URL      Audio URL              │   │
│  │                              (S3)           (S3)                 │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
                                Response
                     (Text / Image URL / Audio URL)
```

## Hardware Requirements

| Component | GPU | VRAM |
|-----------|-----|------|
| Vision Encoder | 1x | ~8GB |
| Audio Encoder | (shared) | ~4GB |
| LLM (8B) | 1x | ~16GB |
| Vision Decoder | 1x | ~16GB |
| Audio Decoder | (shared) | ~4GB |
| **Total** | **3x** | **~48GB** |

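The totals in the table follow from the per-component rows: three components are listed with their own GPU, and the approximate VRAM figures add up to 8 + 4 + 16 + 16 + 4 = 48 GB. The short check below restates the table as data and reproduces that arithmetic; the dictionary layout is an illustration only, not an OmniServe configuration format.

```python
# Approximate figures copied from the Hardware Requirements table.
# Each entry is (has_dedicated_gpu, approximate_vram_gb).
components = {
    "Vision Encoder": (True, 8),
    "Audio Encoder": (False, 4),   # shares a GPU
    "LLM (8B)": (True, 16),
    "Vision Decoder": (True, 16),
    "Audio Decoder": (False, 4),   # shares a GPU
}

gpu_count = sum(1 for dedicated, _ in components.values() if dedicated)
total_vram_gb = sum(vram for _, vram in components.values())

print(f"GPUs: {gpu_count}, VRAM: ~{total_vram_gb}GB")
```
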
## Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `chat_template_kwargs.skip_reasoning` | Skip reasoning | `true` |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |
| `tools` | Required for image generation | - |

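One way to keep these parameters consistent across calls is to collect them in a single place. The sketch below is an assumption-laden illustration: `build_request_kwargs` is a hypothetical helper, its defaults simply restate the table, and `track_b_model` is the model name used in the examples above.

```python
from typing import Any, Optional


def build_request_kwargs(
    max_tokens: int,
    temperature: float = 0.7,
    skip_reasoning: bool = True,
    tools: Optional[list] = None,
) -> dict[str, Any]:
    """Collect keyword arguments for client.chat.completions.create()."""
    kwargs: dict[str, Any] = {
        "model": "track_b_model",
        "max_tokens": max_tokens,
        "temperature": temperature,
        # The OpenAI Python client merges extra_body into the request body.
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    if tools is not None:
        kwargs["tools"] = tools  # needed for image generation
    return kwargs


text_kwargs = build_request_kwargs(max_tokens=256)
print(text_kwargs)
```

A call then becomes `client.chat.completions.create(messages=[...], **text_kwargs)`.
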
## S3 Configuration

Required for image/audio generation:

```bash
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
```

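Before starting the containers, it can help to confirm that all four variables are set. The following is a minimal sketch, assuming the variables are visible in the current process environment; `missing_s3_vars` is a hypothetical helper and is not part of OmniServe. The example passes a plain dictionary so that it runs without touching the real environment.

```python
import os
from typing import Mapping, Optional

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required S3 variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


# Illustrative environment with two of the four variables filled in.
example_env = {
    "NCP_S3_ENDPOINT": "https://your-s3-endpoint.com",
    "NCP_S3_ACCESS_KEY": "your-access-key",
}
print(missing_s3_vars(example_env))
```
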
For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).

---

# Citation

TBU (Technical Report)

---

# Questions

For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.

---

# License

The model is licensed under the [HyperCLOVA X SEED 8B Omni Model License Agreement](./LICENSE).