---
license: other
license_name: hyperclovax
license_link: LICENSE
library_name: transformers
---

![image](https://cdn-uploads.huggingface.co/production/uploads/64383d54c5a91b84ece18d62/2wkHd-bv3M9Zsma_ykIf8.png)

# Overview
HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the [SEED Think 14B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-14B) line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. SEED 32B Think processes text tokens and visual patches within a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning. Building on the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.

---

# Basic Information

- **Architecture**: Transformer-based vision-language model (VLM) architecture (Dense Model)
- **Parameters**: 32B
- **Input Format**: Text/Image/Video
- **Output Format**: Text
- **Context Length**: 128K
- **Knowledge Cutoff**: May 2025

---

# Benchmarks

![테크니컬 리포트 04@2x](https://cdn-uploads.huggingface.co/production/uploads/646acf46086023e36edce4c4/qfIKiKlFVJWyCx3Dl1qN0.png)

- **General Knowledge (Korean Text)**: KoBalt, CLIcK, HAERAE Bench 1.0
- **Vision Understanding**: ChartVQA, TextVQA, K-MMBench, K-DTCBench
- **Agentic Tasks**: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom

---

# Examples
- Solving a 2026 Korean CSAT math problem
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/LPU8kNbYQ8FN_piQ_p6Je.jpeg" style="width: 640px;">
- Understanding text layout
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/Y8lHa7s1TmJcS6F82d41L.jpeg" style="width: 640px;">
<!-- - Understanding Charts
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/zoH2Lh6CSkgdzvXz7JaHo.jpeg" style="width: 640px;"> -->

---

# Inference

We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.

## Capabilities

- **Inputs**: Text, Image, Video
- **Outputs**: Text

## Requirements

- 4x NVIDIA A100 80GB
- Docker & Docker Compose
- NVIDIA Driver 525+, CUDA 12.1+

## Installation

```bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~60GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \
    --local-dir ./models/HyperCLOVAX-SEED-Think-32B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Think-32B \
    --output ./track_a \
    --track a

# Configure environment
cp .env.example .env
# Edit .env:
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B

# Build and run
docker compose --profile track-a build
docker compose --profile track-a up -d

# Wait for model loading (~5 minutes)
docker compose logs -f vlm
```
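
Before building, it can help to confirm that the two paths set in `.env` point at directories produced by the conversion step. The following is a minimal sketch, not part of OmniServe: `read_env` and `missing_model_dirs` are hypothetical helper names, and the parser only handles plain `KEY=VALUE` lines.

```python
from pathlib import Path

# Keys taken from the .env comments in the installation steps.
REQUIRED_KEYS = ("VLM_MODEL_PATH", "VLM_ENCODER_VISION_MODEL_PATH")


def read_env(env_file):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(env_file).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_model_dirs(env_file, base_dir="."):
    """Return the required keys that are unset or not an existing directory."""
    values = read_env(env_file)
    missing = []
    for key in REQUIRED_KEYS:
        path = values.get(key)
        if not path or not (Path(base_dir) / path).is_dir():
            missing.append(key)
    return missing
```

Calling `missing_model_dirs(".env")` from the OmniServe directory returns an empty list when both paths resolve to existing directories.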

## Basic Usage

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/a/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print(response.choices[0].message.content)
```

## Reasoning Mode

Enable chain-of-thought reasoning for complex tasks:

```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {"role": "user", "content": "Solve step by step: 3x + 7 = 22"}
    ],
    max_tokens=1024,
    extra_body={
        "thinking_token_budget": 500,
        "chat_template_kwargs": {"thinking": True}
    }
)

# Response includes <think>...</think> with reasoning process
print(response.choices[0].message.content)
```
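
Since the reasoning arrives inline in the message content, applications usually want to separate it from the final answer. A minimal sketch follows; `split_thinking` is a hypothetical helper, and it assumes a single, properly closed `<think>...</think>` block. Content without a closed block (for example, if the output is cut off mid-reasoning) is returned unchanged as the answer.

```python
import re

# Matches the first <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content):
    """Return (reasoning, answer) extracted from a response string."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is treated as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


sample = "<think>3x = 15, so x = 5.</think>\nx = 5"
reasoning, answer = split_thinking(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```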

## More Examples

<details>
<summary>Video Understanding</summary>

```python
response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```

</details>

<details>
<summary>Base64 Image Input</summary>

```python
import base64

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="track_a_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"thinking": False}}
)
```
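
The example above hard-codes `image/png`. For other file types, the MIME type can be guessed from the file extension. This is a small sketch with a hypothetical helper name, `to_data_url`; which image formats the server accepts is not stated here, so treat the fallback type as an assumption.

```python
import base64
import mimetypes


def to_data_url(path, default_mime="image/png"):
    """Encode a local file as a data URL suitable for the image_url field."""
    # guess_type returns (type, encoding); type is None for unknown extensions.
    mime, _ = mimetypes.guess_type(str(path))
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime or default_mime};base64,{encoded}"
```

The result can be passed as `{"type": "image_url", "image_url": {"url": to_data_url("photo.jpg")}}`.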

</details>

<details>
<summary>Using curl</summary>

```bash
curl -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
          {"type": "text", "text": "Describe this image."}
        ]
      }
    ],
    "max_tokens": 512,
    "chat_template_kwargs": {"thinking": false}
  }'
```

</details>

## Model Capabilities

| Input | Output |
|-------|--------|
| Text | Text |
| Image | Text |
| Video | Text |
| Image + Text | Text |
| Video + Text | Text |

**Features:**
- Reasoning mode with `<think>...</think>` output
- Multi-turn conversation support
- Image/Video understanding
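
With an OpenAI-compatible chat API, multi-turn conversation is typically done by resending the earlier turns in `messages` on each request. The sketch below illustrates that pattern. `Conversation` and `make_openai_sender` are hypothetical names, and the sketch assumes the server keeps no conversation state between requests.

```python
class Conversation:
    """Keeps the running message list for a multi-turn chat."""

    def __init__(self, send, system_prompt=None):
        # `send` is any callable that takes the message list and returns
        # the assistant's reply text.
        self._send = send
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        """Append the user turn, call the model, and record its reply."""
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def make_openai_sender(client, model="track_a_model", **kwargs):
    """Wrap an OpenAI client (as created in Basic Usage) as a `send` callable."""
    def send(messages):
        response = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        )
        return response.choices[0].message.content
    return send
```

For example, `chat = Conversation(make_openai_sender(client, max_tokens=512))` followed by repeated `chat.ask(...)` calls sends the full history each time.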

## Architecture

```
                         User Request
                       (Image/Video/Text)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            OmniServe                                    │
│                  POST /a/v1/chat/completions                            │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                     [1] INPUT ENCODING                           │   │
│  │                                                                  │   │
│  │                   ┌─────────────────┐                            │   │
│  │                   │  Vision Encoder │                            │   │
│  │                   └────────┬────────┘                            │   │
│  │                            │ embeddings                          │   │
│  └────────────────────────────┼─────────────────────────────────────┘   │
│                               ▼                                         │
│                       ┌──────────────┐                                  │
│                       │  LLM (32B)   │◀──── text                        │
│                       └──────┬───────┘                                  │
│                              │                                          │
│                              ▼                                          │
│                        Text Response                                    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                           Response
                            (Text)
```

## Hardware Requirements

| Component | GPU | VRAM |
|-----------|-----|------|
| Vision Encoder | 1x | ~8GB |
| LLM (32B) | 2x | ~60GB |
| **Total** | **3x** | **~68GB** |
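
The ~60GB figure for the LLM is in line with a back-of-the-envelope estimate for the weights alone. The sketch below assumes 16-bit (2-byte) parameters and a nominal 32 billion parameters; the precision is an assumption rather than something stated in this card, and the estimate ignores KV cache and activation memory.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Memory needed for model weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


# 32e9 parameters * 2 bytes = 64e9 bytes, or about 59.6 GiB.
print(f"{weight_memory_gib(32e9):.1f} GiB")
```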

## Key Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `chat_template_kwargs.thinking` | Enable reasoning | `false` |
| `thinking_token_budget` | Max reasoning tokens | 500 |
| `max_tokens` | Max output tokens | - |
| `temperature` | Sampling temperature | 0.7 |

For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe).
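
The parameters in the table above can be assembled in one place so that the `extra_body` nesting is not repeated at every call site. This is a minimal sketch: `build_request_kwargs` is a hypothetical helper, and omitting `thinking_token_budget` when thinking is disabled is a design choice made here, not documented server behavior.

```python
def build_request_kwargs(thinking=False, thinking_token_budget=None,
                         max_tokens=512, temperature=None):
    """Build keyword arguments for client.chat.completions.create()."""
    extra_body = {"chat_template_kwargs": {"thinking": thinking}}
    # Only send a budget when reasoning is enabled and a budget was given.
    if thinking and thinking_token_budget is not None:
        extra_body["thinking_token_budget"] = thinking_token_budget
    kwargs = {"max_tokens": max_tokens, "extra_body": extra_body}
    # Leave temperature unset so the server default applies.
    if temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs
```

The result is meant to be unpacked into the call, for example `client.chat.completions.create(model="track_a_model", messages=messages, **build_request_kwargs(thinking=True, thinking_token_budget=500, max_tokens=1024))`.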

---

# Citation
To be updated (Technical Report)

---

# Questions
For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com.

---

# License
The model is licensed under the [HyperCLOVA X SEED 32B Think Model License Agreement](./LICENSE).