---
license: other
license_name: hyperclovax
license_link: LICENSE
library_name: transformers
---
![image](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/2wkHd-bv3M9Zsma_ykIf8.png)
# Overview
HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the [SEED Think 14B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-14B) line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. SEED 32B Think processes text tokens and visual patches within a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning. Building on the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.
---
# Basic Information
- **Architecture**: Transformer-based vision-language model (VLM), dense
- **Parameters**: 32B
- **Input Format**: Text/Image/Video
- **Output Format**: Text
- **Context Length**: 128K tokens
- **Knowledge Cutoff**: May 2025
---
# Benchmarks
![Benchmark results (technical report)](https://cdn-uploads.huggingface.co/production/uploads/646acf46086023e36edce4c4/qfIKiKlFVJWyCx3Dl1qN0.png)
- **General Knowledge (Korean Text)**: KoBalt, CLIcK, HAERAE Bench 1.0
- **Vision Understanding**: ChartVQA, TextVQA, K-MMBench, K-DTCBench
- **Agentic Tasks**: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom
---
# Examples
- Solving a 2026 Korean CSAT math problem
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/LPU8kNbYQ8FN_piQ_p6Je.jpeg" style="width: 640px;">
- Understanding text layout
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/Y8lHa7s1TmJcS6F82d41L.jpeg" style="width: 640px;">
---
# Inference
We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.
## Capabilities
- **Inputs**: Text, Image, Video
- **Outputs**: Text
## Requirements
- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+
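Before building, you can sanity-check that the driver and GPUs are visible. A minimal sketch using PyTorch (installed in the step below); the assertion and printout are illustrative, not part of OmniServe:
```python
import torch

# Illustrative environment check: confirm the CUDA runtime and the
# expected GPUs are visible to PyTorch before building the containers.
assert torch.cuda.is_available(), "No CUDA device visible to PyTorch"
print(f"CUDA runtime: {torch.version.cuda}")
print(f"GPUs visible: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {p.name}, {p.total_memory / 2**30:.0f} GiB")
```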
## Installation
```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe
# Install dependencies
pip install huggingface_hub safetensors torch openai easydict
# Download model (~60GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \
  --local-dir ./models/HyperCLOVAX-SEED-Think-32B

# Convert model to component format
python convert_model.py \
  --input ./models/HyperCLOVAX-SEED-Think-32B \
  --output ./track_a \
  --track a
# Configure environment
cp .env.example .env
# Edit .env:
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B
# Build and run
docker compose --profile track-a build
docker compose --profile track-a up -d
# Wait for model loading (~5 minutes)
docker compose logs -f vlm
```
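Once the containers are up, you can confirm the endpoint is serving before sending real requests. A minimal readiness check, assuming the server exposes the standard OpenAI-compatible model listing under the same `/a/v1` prefix used below:
```python
from openai import OpenAI

# List models on the OpenAI-compatible endpoint; this succeeds once
# model loading has finished. The /a/v1 prefix matches the examples below.
client = OpenAI(base_url="http://localhost:8000/a/v1", api_key="not-needed")
for model in client.models.list():
    print(model.id)  # expect "track_a_model"
```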
## Basic Usage
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/a/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
print(response.choices[0].message.content)
```
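For interactive use, the standard Chat Completions streaming flag may also work. This is an assumption rather than a documented OmniServe feature; fall back to the non-streaming call above if the server rejects it:
```python
# Streaming variant of the call above. Assumes OmniServe passes through
# the standard OpenAI `stream=True` flag (an assumption, not documented).
stream = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Give a one-line summary of VLMs."}],
    max_tokens=128,
    stream=True,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```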
## Reasoning Mode
Enable chain-of-thought reasoning for complex tasks:
```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {"role": "user", "content": "Solve step by step: 3x + 7 = 22"}
    ],
    max_tokens=1024,
    extra_body={
        "thinking_token_budget": 500,
        "chat_template_kwargs": {"thinking": True}
    }
)
# Response includes <think>...</think> with reasoning process
print(response.choices[0].message.content)
```
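Because the reasoning is returned inline, you may want to split it from the final answer. A minimal sketch, assuming a single `<think>...</think>` block precedes the answer:
```python
import re

# Split the inline <think>...</think> block from the final answer.
# Assumes at most one thinking block, emitted before the answer text.
text = response.choices[0].message.content
match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, flags=re.DOTALL)
if match:
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
else:
    reasoning, answer = None, text
print("Reasoning:", reasoning)
print("Answer:", answer)
```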
## More Examples
<details>
<summary>Video Understanding</summary>

```python
# Video input is passed through the image_url content type.
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
</details>
<details>
<summary>Base64 Image Input</summary>

```python
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
</details>
<details>
<summary>Using curl</summary>

```bash
# Note: fields passed via the SDK's `extra_body` go at the top level
# of the raw JSON request body.
curl -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 512,
    "chat_template_kwargs": {"thinking": false}
  }'
```
</details>
## Model Capabilities
| Input | Output |
|-------|--------|
| Text | Text |
| Image | Text |
| Video | Text |
| Image + Text | Text |
| Video + Text | Text |
**Features:**
- Reasoning mode with `<think>...</think>` output
- Multi-turn conversation support (see the sketch below)
- Image/Video understanding
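Multi-turn conversations follow the usual Chat Completions pattern of resending the accumulated history each turn; a minimal sketch reusing the client from Basic Usage:
```python
# Multi-turn conversation: append each exchange and resend the history.
history = [{"role": "user", "content": "What is 12 * 7?"}]
first = client.chat.completions.create(
    model="track_a_model",
    messages=history,
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "Now add 16 to that."})
second = client.chat.completions.create(
    model="track_a_model",
    messages=history,
    max_tokens=128,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
print(second.choices[0].message.content)
```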
## Architecture
```
                 User Request
              (Image/Video/Text)
                      │
                      ▼
┌─────────────────────────────────────────────┐
│                  OmniServe                  │
│         POST /a/v1/chat/completions         │
│                                             │
│  [1] INPUT ENCODING                         │
│                                             │
│      ┌────────────────┐                     │
│      │ Vision Encoder │                     │
│      └───────┬────────┘                     │
│              │ embeddings                   │
│              ▼                              │
│      ┌────────────────┐                     │
│      │   LLM (32B)    │◀──── text           │
│      └───────┬────────┘                     │
│              │                              │
│              ▼                              │
│        Text Response                        │
│                                             │
└─────────────────────────────────────────────┘
                      │
                      ▼
                  Response
                   (Text)
```
## Hardware Requirements
| Component | GPU | VRAM |
|-----------|-----|------|
| Vision Encoder | 1x | ~8GB |
| LLM (32B) | 2x | ~60GB |
| **Total** | **3x** | **~68GB** |
## Key Parameters
| Parameter | Description | Default |
|-----------|-------------|---------|
| `chat_template_kwargs.thinking` | Enable reasoning | `false` |
| `thinking_token_budget` | Max reasoning tokens | 500 |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |
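These parameters combine in a single request; for example, lowering the temperature while raising the reasoning budget for a harder problem (the values here are illustrative):
```python
# Combining the key parameters: near-deterministic sampling with a
# larger reasoning budget. Values are illustrative, not recommendations.
response = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
    max_tokens=2048,
    temperature=0.2,
    extra_body={
        "thinking_token_budget": 1000,
        "chat_template_kwargs": {"thinking": True}
    }
)
print(response.choices[0].message.content)
```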
For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).
---
# Citation
TBU (Technical Report)
---
# Questions
For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.
---
# License
The model is licensed under the [HyperCLOVA X SEED 32B Think Model License Agreement](./LICENSE).