	---
	license: other
	license_name: hyperclovax
	license_link: LICENSE
	library_name: transformers
	---

	![image](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/3gaPG3_F4Fxn-SOZWrmfU.png)

	# Overview
	HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. It aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities within a 32K context window. Alongside established text capabilities, this covers vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.

	---

	# Technical Report
	- [HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)](./HyperCLOVA_X_8B_Omni.pdf)


	---

	# Basic Information

	- Architecture: Transformer-based omni-model architecture (dense model)
	- Parameters: 8B
	- Input Format: Text / Image / Video / Audio (Speech)
	- Output Format: Text / Image / Audio (Speech)
	- Context Length: 32K
	- Knowledge Cutoff: May 2025

	---

	# Benchmarks
	![테크니컬 리포트 05_2@2x](https://cdn-uploads.huggingface.co/production/uploads/646acf46086023e36edce4c4/x1IvD9Rt_NK71CklecpN2.png)


	- Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
	- Vision-to-Text: SEED-IMG, AI2D, K-MMBench
	- Text-to-Vision: GenEval, ImgEdit
	- Audio-to-Text: LibriSpeech, KsponSpeech
	- Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

	---

	# Examples
	## Text-to-Image Generation
	![hf_img01](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/6fRekMbt_9ab5I80GTkdG.png)
	## Text-based Image Editing
	![hf_img02](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/aoecU357A0fVvR8uerozh.png)
	![hf_img03](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/0fpcq--rj1kqPa9m8DYgt.png)
	![hf_img04](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/Z24JUQZSmeaVNrhDMYG6K.png)

	---

	# Inference

	We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.

	## Capabilities

	- Inputs: Text, Image, Audio, Video
	- Outputs: Text, Image, Audio (no video generation)

	## Requirements

	- 4x NVIDIA A100 80GB
	- Docker & Docker Compose
	- NVIDIA Driver 525+, CUDA 12.1+
	- S3-compatible storage (for image/audio output)

	## Installation

	```bash
	# Clone OmniServe
	git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
	cd OmniServe

	# Install dependencies
	pip install huggingface_hub safetensors torch openai easydict

	# Download model (~16GB)
	huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
	    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

	# Convert model to component format
	python convert_model.py \
	    --input ./models/HyperCLOVAX-SEED-Omni-8B \
	    --output ./track_b \
	    --track b

	# Configure environment
	cp .env.example .env
	# Edit .env with model paths and S3 credentials

	# Build and run (Track B only - OMNI model)
	docker compose --profile track-b build
	docker compose --profile track-b up -d

	# Wait for model loading (~5 minutes)
	docker compose logs -f omni

	# Note: To run both VLM and OMNI models together:
	# docker compose --profile track-a --profile track-b up -d
	```
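
Because the model takes around five minutes to load, a client started right after `docker compose up` will likely see connection errors at first. The following is a minimal polling sketch rather than part of OmniServe. `wait_until_ready` and `omni_probe` are hypothetical names, and the probe assumes the server answers the standard OpenAI-compatible model listing route under `/b/v1`, which this README does not confirm.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 600.0,
                     interval_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Treat any error (connection refused, 5xx) as "still loading".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def omni_probe() -> bool:
    """Hypothetical probe: assumes the model listing route works once loading is done."""
    from openai import OpenAI  # imported lazily so wait_until_ready stays stdlib-only

    client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")
    client.models.list()
    return True
```

Calling `wait_until_ready(omni_probe)` before the first request blocks for at most ten minutes with these defaults. Following `docker compose logs -f omni` remains the authoritative way to see loading progress.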

	## Basic Usage

	```python
	from openai import OpenAI

	client = OpenAI(
	    base_url="http://localhost:8000/b/v1",
	    api_key="not-needed",
	)

	# Image understanding
	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[
	        {
	            "role": "user",
	            "content": [
	                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
	                {"type": "text", "text": "What is in this image?"},
	            ],
	        }
	    ],
	    max_tokens=256,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	print(response.choices[0].message.content)
	```
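
The example above passes a public image URL. For a local file, the OpenAI chat format also allows the image to be inlined as a `data:` URL. The sketch below shows that encoding. Whether OmniServe accepts `data:` URLs is an assumption not confirmed here, and `image_to_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 `data:` URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {path}")
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"


def image_message(path: str, question: str) -> dict:
    """Build a user message with the same shape as the Basic Usage example."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": question},
        ],
    }
```

The returned dictionary can be passed as the single entry of `messages` in `client.chat.completions.create(...)`.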

	## More Examples

	<details>
	<summary>Text to Image</summary>

	```python
	import json

	SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[
	        {"role": "system", "content": SYSTEM_PROMPT},
	        {"role": "user", "content": "Draw a sunset over mountains"},
	    ],
	    tools=[{
	        "type": "function",
	        "function": {
	            "name": "t2i_model_generation",
	            "description": "Generates an RGB image based on the provided discrete image representation.",
	            "parameters": {
	                "type": "object",
	                "required": ["discrete_image_token"],
	                "properties": {
	                    "discrete_image_token": {
	                        "type": "string",
	                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
	                        "minLength": 1
	                    }
	                }
	            }
	        }
	    }],
	    max_tokens=7000,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	# The generated image is returned through the tool call arguments
	if response.choices[0].message.tool_calls:
	    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
	    print(f"Generated image: {args['discrete_image_token']}")
	```

	</details>

	<details>
	<summary>Text to Audio</summary>

	```python
	import base64

	# Prompt should explicitly request speech/audio output
	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[{
	        "role": "user",
	        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
	    }],
	    max_tokens=1000,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	# audio.data carries the base64-encoded URL of the generated audio
	if response.choices[0].message.audio:
	    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
	    print(f"Generated audio: {audio_url}")
	```

	</details>

	<details>
	<summary>Audio Input</summary>

	```python
	import base64

	# Input audio (URL encoded as base64)
	audio_url = "https://example.com/audio.mp3"
	audio_data = base64.b64encode(audio_url.encode()).decode()

	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[
	        {
	            "role": "user",
	            "content": [
	                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
	                {"type": "text", "text": "What is being said?"},
	            ],
	        }
	    ],
	    max_tokens=256,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	print(response.choices[0].message.content)
	```

	</details>

	<details>
	<summary>Video Input</summary>

	```python
	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[
	        {
	            "role": "user",
	            "content": [
	                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
	                {"type": "text", "text": "Describe this video."},
	            ],
	        }
	    ],
	    max_tokens=512,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	print(response.choices[0].message.content)
	```

	</details>

	<details>
	<summary>Image to Image</summary>

	```python
	import json

	SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[
	        {"role": "system", "content": SYSTEM_PROMPT},
	        {
	            "role": "user",
	            "content": [
	                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
	                {"type": "text", "text": "Transform to watercolor style"},
	            ],
	        },
	    ],
	    tools=[{
	        "type": "function",
	        "function": {
	            "name": "t2i_model_generation",
	            "description": "Generates an RGB image based on the provided discrete image representation.",
	            "parameters": {
	                "type": "object",
	                "required": ["discrete_image_token"],
	                "properties": {
	                    "discrete_image_token": {
	                        "type": "string",
	                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
	                        "minLength": 1
	                    }
	                }
	            }
	        }
	    }],
	    max_tokens=7000,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	# The transformed image is returned through the tool call arguments
	if response.choices[0].message.tool_calls:
	    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
	    print(f"Generated image: {args['discrete_image_token']}")
	```

	</details>
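
The Text to Image and Image to Image examples end with the same tool-call handling. It can be factored into a small helper, sketched here under the assumption that responses follow the OpenAI chat-completions shape used in those examples. `first_tool_arguments` is a hypothetical name, not an OmniServe function.

```python
import json
from typing import Any, Optional


def first_tool_arguments(response: Any,
                         tool_name: str = "t2i_model_generation") -> Optional[dict]:
    """Return the parsed arguments of the first call to `tool_name`, or None."""
    message = response.choices[0].message
    # tool_calls is None (or absent) when the model answered with plain text.
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name == tool_name:
            return json.loads(call.function.arguments)
    return None
```

A caller can then write `args = first_tool_arguments(response)` and read `args["discrete_image_token"]` only when `args` is not `None`, which also covers the case where the model replies in text instead of calling the tool.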

	<details>
	<summary>Audio to Audio</summary>

	```python
	import base64

	# Input audio (URL encoded as base64)
	audio_url = "https://example.com/input.mp3"
	audio_data = base64.b64encode(audio_url.encode()).decode()

	response = client.chat.completions.create(
	    model="track_b_model",
	    messages=[
	        {
	            "role": "user",
	            "content": [
	                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
	                {"type": "text", "text": "Listen to this and respond with speech"},
	            ],
	        }
	    ],
	    max_tokens=2000,
	    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
	)

	# audio.data carries the base64-encoded URL of the generated audio
	if response.choices[0].message.audio:
	    output_url = base64.b64decode(response.choices[0].message.audio.data).decode()
	    print(f"Generated audio: {output_url}")
	```

	</details>
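
In the audio examples, both directions carry a URL rather than raw audio bytes: the request base64-encodes the input URL into `input_audio.data`, and the response returns a base64-encoded URL in `message.audio.data`. A minimal sketch of that round trip, with hypothetical helper names:

```python
import base64


def encode_audio_url(url: str) -> str:
    """Base64-encode an audio URL for the `input_audio.data` request field."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def decode_audio_url(data: str) -> str:
    """Recover the audio URL carried in `message.audio.data`."""
    return base64.b64decode(data).decode("utf-8")


encoded = encode_audio_url("https://example.com/input.mp3")
print(decode_audio_url(encoded))  # https://example.com/input.mp3
```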

	<details>
	<summary>Using curl</summary>

	```bash
	# Note: extra_body is an argument of the OpenAI Python client, which merges its
	# contents into the top level of the JSON body. Raw requests therefore pass
	# chat_template_kwargs directly.

	# Image understanding
	curl -X POST http://localhost:8000/b/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "track_b_model",
	    "messages": [{"role": "user", "content": [
	      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
	      {"type": "text", "text": "Describe this image."}
	    ]}],
	    "max_tokens": 256,
	    "chat_template_kwargs": {"skip_reasoning": true}
	  }'

	# Text to audio
	curl -X POST http://localhost:8000/b/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "track_b_model",
	    "messages": [{"role": "user", "content": "Say hello"}],
	    "max_tokens": 1000,
	    "chat_template_kwargs": {"skip_reasoning": true}
	  }'
	```

	</details>


	## Architecture

	```
	                               User Request
	                         (Image/Audio/Video/Text)
	                                     │
	                                     ▼
	┌─────────────────────────────────────────────────────────────────────────┐
	│                                OmniServe                                │
	│                       POST /b/v1/chat/completions                       │
	│                                                                         │
	│  ┌──────────────────────────────────────────────────────────────────┐   │
	│  │                        [1] INPUT ENCODING                        │   │
	│  │                                                                  │   │
	│  │    ┌─────────────────┐               ┌─────────────────┐         │   │
	│  │    │ Vision Encoder  │               │  Audio Encoder  │         │   │
	│  │    └────────┬────────┘               └────────┬────────┘         │   │
	│  │             │                                 │                  │   │
	│  │             └────────────┬────────────────────┘                  │   │
	│  │                          │ embeddings                            │   │
	│  └──────────────────────────┼───────────────────────────────────────┘   │
	│                             ▼                                           │
	│                     ┌──────────────┐                                    │
	│                     │   LLM (8B)   │◀──── text                          │
	│                     └──────┬───────┘                                    │
	│                            │                                            │
	│  ┌─────────────────────────┼────────────────────────────────────────┐   │
	│  │                       [2] OUTPUT DECODING                        │   │
	│  │                         │                                        │   │
	│  │          ┌──────────────┼──────────────┐                         │   │
	│  │          ▼              ▼              ▼                         │   │
	│  │    ┌───────────┐  ┌───────────┐  ┌───────────┐                   │   │
	│  │    │   Text    │  │  Vision   │  │   Audio   │                   │   │
	│  │    │           │  │  Decoder  │  │  Decoder  │                   │   │
	│  │    └───────────┘  └─────┬─────┘  └─────┬─────┘                   │   │
	│  │                         │              │                         │   │
	│  │                         ▼              ▼                         │   │
	│  │                     Image URL      Audio URL                     │   │
	│  │                       (S3)           (S3)                        │   │
	│  └──────────────────────────────────────────────────────────────────┘   │
	│                                                                         │
	└─────────────────────────────────────────────────────────────────────────┘
	                                     │
	                                     ▼
	                                 Response
	                      (Text / Image URL / Audio URL)
	```

	## Hardware Requirements

	| Component | GPU | VRAM |
	|-----------|-----|------|
	| Vision Encoder | 1x | ~8GB |
	| Audio Encoder | (shared) | ~4GB |
	| LLM (8B) | 1x | ~16GB |
	| Vision Decoder | 1x | ~16GB |
	| Audio Decoder | (shared) | ~4GB |
	| Total | 3x | ~48GB |
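
The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the approximate figures from the table. The grouping of components onto GPUs is an assumption: it reads "(shared)" as "co-located with the component in the row above", which the table does not state explicitly.

```python
# Approximate per-component VRAM in GB, copied from the table above.
vram_gb = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

# Assumption: "(shared)" means the component shares a GPU with the row above it.
gpu_groups = [
    ["Vision Encoder", "Audio Encoder"],
    ["LLM (8B)"],
    ["Vision Decoder", "Audio Decoder"],
]

per_gpu_gb = [sum(vram_gb[name] for name in group) for group in gpu_groups]
total_gb = sum(per_gpu_gb)

for index, (group, used) in enumerate(zip(gpu_groups, per_gpu_gb)):
    print(f"GPU {index}: {' + '.join(group)} = ~{used}GB")
print(f"Total: {len(gpu_groups)}x GPU, ~{total_gb}GB")
```

Under that grouping the three GPUs hold roughly 12GB, 16GB, and 20GB of weights, which sums to the ~48GB in the table.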

	## Key Parameters

	| Parameter | Description | Default |
	|-----------|-------------|---------|
	| `chat_template_kwargs.skip_reasoning` | Skip reasoning | `true` |
	| `max_tokens` | Max output tokens | - |
	| `temperature` | Sampling temperature | 0.7 |
	| `tools` | Required for image generation | - |
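
To show how these parameters fit together in one request, here is a small sketch that assembles keyword arguments for `client.chat.completions.create`. `build_request` is a hypothetical helper, and its defaults simply mirror the table above.

```python
from typing import Any, Optional


def build_request(prompt: str,
                  max_tokens: int,
                  skip_reasoning: bool = True,
                  temperature: float = 0.7,
                  tools: Optional[list] = None) -> dict[str, Any]:
    """Assemble keyword arguments for client.chat.completions.create()."""
    kwargs: dict[str, Any] = {
        "model": "track_b_model",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        # The OpenAI Python client merges extra_body into the JSON request body.
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    if tools:
        # Only image generation needs a tool definition.
        kwargs["tools"] = tools
    return kwargs
```

For example, `client.chat.completions.create(**build_request("Say hello", max_tokens=1000))` sends a text-to-audio style request with the default temperature and reasoning skipped.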

	## S3 Configuration

	Required for image/audio generation:

	```bash
	NCP_S3_ENDPOINT=https://your-s3-endpoint.com
	NCP_S3_ACCESS_KEY=your-access-key
	NCP_S3_SECRET_KEY=your-secret-key
	NCP_S3_BUCKET_NAME=your-bucket-name
	```
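
Since image and audio generation depend on all four variables, it can help to check them before starting the stack. A minimal sketch, assuming the values have been loaded into the process environment or parsed into a mapping; `missing_s3_settings` is a hypothetical helper, not part of OmniServe.

```python
import os
from typing import Mapping

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_settings(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required S3 variables that are unset or blank."""
    return [name for name in REQUIRED_S3_VARS if not env.get(name, "").strip()]
```

An empty list means every variable is present. Note that this only checks presence, not whether the endpoint or credentials are valid.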

	For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).


	---

	# Citation
	To be updated (Technical Report)

	---

	# Questions
	For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.


	---


	# License
	The model is licensed under the [HyperCLOVA X SEED 8B Omni Model License Agreement](./LICENSE).