|
|
--- |
|
|
license: other |
|
|
license_name: hyperclovax |
|
|
license_link: LICENSE |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
 |
|
|
|
|
|
# Overview |
|
|
HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the [SEED Think 14B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-14B) line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. The model processes text tokens and visual patches in a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and offers an optional “thinking mode” for deep, controllable reasoning. Compared with the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.
|
|
|
|
|
--- |
|
|
|
|
|
# Basic Information |
|
|
|
|
|
- **Architecture**: Transformer-based vision-language model (VLM), dense
|
|
- **Parameters**: 32B
|
|
- **Input Format**: Text/Image/Video |
|
|
- **Output Format**: Text |
|
|
- **Context Length**: 128K tokens
|
|
- **Knowledge Cutoff**: May 2025 |
|
|
|
|
|
--- |
|
|
|
|
|
# Benchmarks |
|
|
|
|
|
 |
|
|
|
|
|
- **General Knowledge (Korean Text)**: KoBalt, CLIcK, HAERAE Bench 1.0 |
|
|
- **Vision Understanding**: ChartQA, TextVQA, K-MMBench, K-DTCBench
|
|
- **Agentic Tasks**: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom |
|
|
|
|
|
--- |
|
|
|
|
|
# Examples |
|
|
- Solving a 2026 Korean CSAT math problem
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/LPU8kNbYQ8FN_piQ_p6Je.jpeg" style="width: 640px;"> |
|
|
- Understanding text layout
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/Y8lHa7s1TmJcS6F82d41L.jpeg" style="width: 640px;"> |
|
|
<!-- - Understanding Charts |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/zoH2Lh6CSkgdzvXz7JaHo.jpeg" style="width: 640px;"> --> |
|
|
|
|
|
--- |
|
|
|
|
|
# Inference |
|
|
|
|
|
We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.
|
|
|
|
|
## Capabilities |
|
|
|
|
|
- **Inputs**: Text, Image, Video
|
|
- **Outputs**: Text |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- 4x NVIDIA A100 80GB |
|
|
- Docker & Docker Compose |
|
|
- NVIDIA Driver 525+, CUDA 12.1+ |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
# Clone OmniServe |
|
|
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git |
|
|
cd OmniServe |
|
|
|
|
|
# Install dependencies |
|
|
pip install huggingface_hub safetensors torch openai easydict |
|
|
|
|
|
# Download model (~60GB) |
|
|
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \ |
|
|
--local-dir ./models/HyperCLOVAX-SEED-Think-32B |
|
|
|
|
|
# Convert model to component format |
|
|
python convert_model.py \ |
|
|
--input ./models/HyperCLOVAX-SEED-Think-32B \ |
|
|
--output ./track_a \ |
|
|
--track a |
|
|
|
|
|
# Configure environment |
|
|
cp .env.example .env |
|
|
# Edit .env: |
|
|
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B |
|
|
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B |
|
|
|
|
|
# Build and run |
|
|
docker compose --profile track-a build |
|
|
docker compose --profile track-a up -d |
|
|
|
|
|
# Wait for model loading (~5 minutes) |
|
|
docker compose logs -f vlm |
|
|
``` |
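
After the containers are up and the model has finished loading, a quick text-only request against the same OpenAI-compatible endpoint used throughout this card confirms the stack is serving (adjust the host or port if you changed the compose defaults):

```bash
# Minimal smoke test: text-only request to the chat completions endpoint
curl -s -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16
  }'
```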
|
|
|
|
|
## Basic Usage |
|
|
|
|
|
```python |
|
|
from openai import OpenAI |
|
|
|
|
|
client = OpenAI( |
|
|
base_url="http://localhost:8000/a/v1", |
|
|
api_key="not-needed" |
|
|
) |
|
|
|
|
|
# Image understanding |
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, |
|
|
{"type": "text", "text": "Describe this image."} |
|
|
] |
|
|
} |
|
|
], |
|
|
max_tokens=512, |
|
|
extra_body={"chat_template_kwargs": {"thinking": False}} |
|
|
) |
|
|
|
|
|
print(response.choices[0].message.content) |
|
|
``` |
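
Token-by-token streaming follows the standard OpenAI client pattern. Whether OmniServe streams responses for this model is an assumption here, so treat this as a sketch rather than a guaranteed feature; fall back to the non-streaming call above if the flag is not honored:

```python
# Sketch: streaming variant of the request above (assumes the server honors
# the standard OpenAI `stream=True` flag).
stream = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=128,
    stream=True,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```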
|
|
|
|
|
## Reasoning Mode |
|
|
|
|
|
Enable chain-of-thought reasoning for complex tasks: |
|
|
|
|
|
```python |
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{"role": "user", "content": "Solve step by step: 3x + 7 = 22"} |
|
|
], |
|
|
max_tokens=1024, |
|
|
extra_body={ |
|
|
"thinking_token_budget": 500, |
|
|
"chat_template_kwargs": {"thinking": True} |
|
|
} |
|
|
) |
|
|
|
|
|
# Response includes <think>...</think> with reasoning process |
|
|
print(response.choices[0].message.content) |
|
|
``` |
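
Because the reasoning arrives inline, wrapped in `<think>...</think>` as noted in the comment above, a small helper can separate the trace from the final answer. This is a minimal sketch based on that tag convention, not a formally specified output schema:

```python
import re

def split_thinking(text: str):
    """Split a response into (reasoning, answer) using the <think>...</think> convention."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_thinking(response.choices[0].message.content)
print("Reasoning:", reasoning)
print("Answer:", answer)
```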
|
|
|
|
|
## More Examples |
|
|
|
|
|
<details> |
|
|
<summary>Video Understanding</summary> |
|
|
|
|
|
```python |
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}}, |
|
|
{"type": "text", "text": "Describe this video."} |
|
|
] |
|
|
} |
|
|
], |
|
|
max_tokens=512, |
|
|
extra_body={"chat_template_kwargs": {"thinking": False}} |
|
|
) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Base64 Image Input</summary> |
|
|
|
|
|
```python |
|
|
import base64 |
|
|
|
|
|
with open("image.png", "rb") as f: |
|
|
image_b64 = base64.b64encode(f.read()).decode() |
|
|
|
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}, |
|
|
{"type": "text", "text": "What is in this image?"} |
|
|
] |
|
|
} |
|
|
], |
|
|
max_tokens=512, |
|
|
extra_body={"chat_template_kwargs": {"thinking": False}} |
|
|
) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Using curl</summary> |
|
|
|
|
|
```bash |
|
|
curl -X POST http://localhost:8000/a/v1/chat/completions \ |
|
|
-H "Content-Type: application/json" \ |
|
|
-d '{ |
|
|
"model": "track_a_model", |
|
|
"messages": [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, |
|
|
{"type": "text", "text": "Describe this image."} |
|
|
] |
|
|
} |
|
|
], |
|
|
"max_tokens": 512, |
|
|
"extra_body": {"chat_template_kwargs": {"thinking": false}} |
|
|
}' |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
## Model Capabilities |
|
|
|
|
|
| Input | Output | |
|
|
|-------|--------| |
|
|
| Text | Text | |
|
|
| Image | Text | |
|
|
| Video | Text | |
|
|
| Image + Text | Text | |
|
|
| Video + Text | Text | |
|
|
|
|
|
**Features:** |
|
|
- Reasoning mode with `<think>...</think>` output |
|
|
- Multi-turn conversation support (see the sketch below)
|
|
- Image/Video understanding |
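
Multi-turn use follows the standard OpenAI chat format: append each assistant reply back into `messages` before sending the next user turn. A minimal text-only sketch, reusing the `client` from Basic Usage:

```python
# Multi-turn sketch: carry the assistant's previous reply back into `messages`.
messages = [{"role": "user", "content": "Name three traditional Korean dishes."}]

first = client.chat.completions.create(
    model="track_a_model",
    messages=messages,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turn that depends on the earlier answer
messages.append({"role": "user", "content": "Which of those is easiest to cook at home?"})
second = client.chat.completions.create(
    model="track_a_model",
    messages=messages,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(second.choices[0].message.content)
```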
|
|
|
|
|
## Architecture |
|
|
|
|
|
``` |
|
|
User Request |
|
|
(Image/Video/Text) |
|
|
│ |
|
|
▼ |
|
|
┌─────────────────────────────────────────────────────────────────────────┐ |
|
|
│ OmniServe │ |
|
|
│ POST /a/v1/chat/completions │ |
|
|
│ │ |
|
|
│ ┌──────────────────────────────────────────────────────────────────┐ │ |
|
|
│ │ [1] INPUT ENCODING │ │ |
|
|
│ │ │ │ |
|
|
│ │ ┌─────────────────┐ │ │ |
|
|
│ │ │ Vision Encoder │ │ │ |
|
|
│ │ └────────┬────────┘ │ │ |
|
|
│ │ │ embeddings │ │ |
|
|
│ └────────────────────────────┼─────────────────────────────────────┘ │ |
|
|
│ ▼ │ |
|
|
│ ┌──────────────┐ │ |
|
|
│ │ LLM (32B) │◀──── text │ |
|
|
│ └──────┬───────┘ │ |
|
|
│ │ │ |
|
|
│ ▼ │ |
|
|
│ Text Response │ |
|
|
│ │ |
|
|
└─────────────────────────────────────────────────────────────────────────┘ |
|
|
│ |
|
|
▼ |
|
|
Response |
|
|
(Text) |
|
|
``` |
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
| Component | GPU | VRAM | |
|
|
|-----------|-----|------| |
|
|
| Vision Encoder | 1x | ~8GB | |
|
|
| LLM (32B) | 2x | ~60GB | |
|
|
| **Total** | **3x** | **~68GB** | |
|
|
|
|
|
## Key Parameters |
|
|
|
|
|
| Parameter | Description | Default | |
|
|
|-----------|-------------|---------| |
|
|
| `chat_template_kwargs.thinking` | Enable reasoning | `false` | |
|
|
| `thinking_token_budget` | Max reasoning tokens | 500 | |
|
|
| `max_tokens` | Max output tokens | - | |
|
|
| `temperature` | Sampling temperature | 0.7 | |
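
As a worked example, a single reasoning-mode request combining the parameters above might look like this (values are illustrative; the call reuses the `client` from Basic Usage):

```python
# Illustrative request combining the key parameters listed above.
response = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "A train travels 180 km in 2.5 hours. What is its average speed?"}],
    max_tokens=1024,      # max output tokens
    temperature=0.7,      # sampling temperature (default shown above)
    extra_body={
        "thinking_token_budget": 500,                # max reasoning tokens
        "chat_template_kwargs": {"thinking": True},  # enable reasoning mode
    },
)
print(response.choices[0].message.content)
```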
|
|
|
|
|
For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe). |
|
|
|
|
|
--- |
|
|
|
|
|
# Citation |
|
|
TBU (Technical Report) |
|
|
|
|
|
--- |
|
|
|
|
|
# Questions |
|
|
For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com. |
|
|
|
|
|
--- |
|
|
|
|
|
# License |
|
|
The model is licensed under the [HyperCLOVA X SEED 32B Think Model License Agreement](./LICENSE).
|
|
|