|
|
--- |
|
|
license: other |
|
|
license_name: hyperclovax |
|
|
license_link: LICENSE |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
 |
|
|
|
|
|
# Overview |
|
|
HyperCLOVA X SEED 32B Think is an updated vision-language thinking model that advances the [SEED Think 14B](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-14B) line beyond simple scaling, pairing a unified vision-language Transformer backbone with a reasoning-centric training recipe. The model processes text tokens and visual patches in a shared embedding space, supports long-context multimodal understanding up to 128K tokens, and offers an optional “thinking mode” for deep, controllable reasoning. Compared with the earlier 14B model, SEED 32B Think further strengthens Korean-centric reasoning and agentic capabilities, improving practical reasoning quality and reliability in real-world use.
|
|
|
|
|
--- |
|
|
|
|
|
# Basic Information |
|
|
|
|
|
- **Architecture**: Transformer-based vision-language model (VLM), dense
|
|
- **Parameters**: 32B
|
|
- **Input Format**: Text/Image/Video |
|
|
- **Output Format**: Text |
|
|
- **Context Length**: 128K tokens
|
|
- **Knowledge Cutoff**: May 2025 |
|
|
|
|
|
--- |
|
|
|
|
|
# Benchmarks |
|
|
|
|
|
 |
|
|
|
|
|
- **General Knowledge (Korean Text)**: KoBalt, CLIcK, HAERAE Bench 1.0 |
|
|
- **Vision Understanding**: ChartQA, TextVQA, K-MMBench, K-DTCBench
|
|
- **Agentic Tasks**: Tau^2-Airline, Tau^2-Retail, Tau^2-Telecom |
|
|
|
|
|
--- |
|
|
|
|
|
# Examples |
|
|
- Solving a 2026 Korean CSAT math problem
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/LPU8kNbYQ8FN_piQ_p6Je.jpeg" style="width: 640px;"> |
|
|
- Understanding text layout
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/Y8lHa7s1TmJcS6F82d41L.jpeg" style="width: 640px;"> |
|
|
<!-- - Understanding Charts |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/67ff242cee08737feaf18cb2/zoH2Lh6CSkgdzvXz7JaHo.jpeg" style="width: 640px;"> --> |
|
|
|
|
|
--- |
|
|
|
|
|
# Inference |
|
|
|
|
|
We provide [OmniServe](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe), a production-ready multimodal inference system with an OpenAI-compatible API.
|
|
|
|
|
## Capabilities |
|
|
|
|
|
- **Inputs**: Text, Image, Video
|
|
- **Outputs**: Text |
|
|
|
|
|
## Requirements |
|
|
|
|
|
- 4x NVIDIA A100 80GB |
|
|
- Docker & Docker Compose |
|
|
- NVIDIA Driver 525+, CUDA 12.1+ |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
# Clone OmniServe |
|
|
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git |
|
|
cd OmniServe |
|
|
|
|
|
# Install dependencies |
|
|
pip install huggingface_hub safetensors torch openai easydict |
|
|
|
|
|
# Download model (~60GB) |
|
|
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Think-32B \ |
|
|
--local-dir ./models/HyperCLOVAX-SEED-Think-32B |
|
|
|
|
|
# Convert model to component format |
|
|
python convert_model.py \ |
|
|
--input ./models/HyperCLOVAX-SEED-Think-32B \ |
|
|
--output ./track_a \ |
|
|
--track a |
|
|
|
|
|
# Configure environment |
|
|
cp .env.example .env |
|
|
# Edit .env: |
|
|
# VLM_MODEL_PATH=./track_a/llm/HyperCLOVAX-SEED-Think-32B |
|
|
# VLM_ENCODER_VISION_MODEL_PATH=./track_a/ve/HyperCLOVAX-SEED-Think-32B |
|
|
|
|
|
# Build and run |
|
|
docker compose --profile track-a build |
|
|
docker compose --profile track-a up -d |
|
|
|
|
|
# Wait for model loading (~5 minutes) |
|
|
docker compose logs -f vlm |
|
|
``` |
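
After the containers are up and the model has finished loading, a quick text-only request against the same OpenAI-compatible endpoint used throughout this card confirms the stack is serving (adjust the host or port if you changed the compose defaults):

```bash
# Minimal smoke test: text-only request to the chat completions endpoint
curl -s -X POST http://localhost:8000/a/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_a_model",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16
  }'
```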
|
|
|
|
|
## Basic Usage |
|
|
|
|
|
```python |
|
|
from openai import OpenAI |
|
|
|
|
|
client = OpenAI( |
|
|
base_url="http://localhost:8000/a/v1", |
|
|
api_key="not-needed" |
|
|
) |
|
|
|
|
|
# Image understanding |
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, |
|
|
{"type": "text", "text": "Describe this image."} |
|
|
] |
|
|
} |
|
|
], |
|
|
max_tokens=512, |
|
|
extra_body={"chat_template_kwargs": {"thinking": False}} |
|
|
) |
|
|
|
|
|
print(response.choices[0].message.content) |
|
|
``` |
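
Token-by-token streaming follows the standard OpenAI client pattern. Whether OmniServe streams responses for this model is an assumption here, so treat this as a sketch rather than a guaranteed feature; fall back to the non-streaming call above if the flag is not honored:

```python
# Sketch: streaming variant of the request above (assumes the server honors
# the standard OpenAI `stream=True` flag).
stream = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    max_tokens=128,
    stream=True,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```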
|
|
|
|
|
## Reasoning Mode |
|
|
|
|
|
Enable chain-of-thought reasoning for complex tasks: |
|
|
|
|
|
```python |
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{"role": "user", "content": "Solve step by step: 3x + 7 = 22"} |
|
|
], |
|
|
max_tokens=1024, |
|
|
extra_body={ |
|
|
"thinking_token_budget": 500, |
|
|
"chat_template_kwargs": {"thinking": True} |
|
|
} |
|
|
) |
|
|
|
|
|
# Response includes <think>...</think> with reasoning process |
|
|
print(response.choices[0].message.content) |
|
|
``` |
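
Because the reasoning arrives inline, wrapped in `<think>...</think>` as noted in the comment above, a small helper can separate the trace from the final answer. This is a minimal sketch based on that tag convention, not a formally specified output schema:

```python
import re

def split_thinking(text: str):
    """Split a response into (reasoning, answer) using the <think>...</think> convention."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_thinking(response.choices[0].message.content)
print("Reasoning:", reasoning)
print("Answer:", answer)
```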
|
|
|
|
|
## More Examples |
|
|
|
|
|
<details> |
|
|
<summary>Video Understanding</summary> |
|
|
|
|
|
```python |
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}}, |
|
|
{"type": "text", "text": "Describe this video."} |
|
|
] |
|
|
} |
|
|
], |
|
|
max_tokens=512, |
|
|
extra_body={"chat_template_kwargs": {"thinking": False}} |
|
|
) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Base64 Image Input</summary> |
|
|
|
|
|
```python |
|
|
import base64 |
|
|
|
|
|
with open("image.png", "rb") as f: |
|
|
image_b64 = base64.b64encode(f.read()).decode() |
|
|
|
|
|
response = client.chat.completions.create( |
|
|
model="track_a_model", |
|
|
messages=[ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}}, |
|
|
{"type": "text", "text": "What is in this image?"} |
|
|
] |
|
|
} |
|
|
], |
|
|
max_tokens=512, |
|
|
extra_body={"chat_template_kwargs": {"thinking": False}} |
|
|
) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Using curl</summary> |
|
|
|
|
|
```bash |
|
|
curl -X POST http://localhost:8000/a/v1/chat/completions \ |
|
|
-H "Content-Type: application/json" \ |
|
|
-d '{ |
|
|
"model": "track_a_model", |
|
|
"messages": [ |
|
|
{ |
|
|
"role": "user", |
|
|
"content": [ |
|
|
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}, |
|
|
{"type": "text", "text": "Describe this image."} |
|
|
] |
|
|
} |
|
|
], |
|
|
"max_tokens": 512, |
|
|
"extra_body": {"chat_template_kwargs": {"thinking": false}} |
|
|
}' |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
## Model Capabilities |
|
|
|
|
|
| Input | Output | |
|
|
|-------|--------| |
|
|
| Text | Text | |
|
|
| Image | Text | |
|
|
| Video | Text | |
|
|
| Image + Text | Text | |
|
|
| Video + Text | Text | |
|
|
|
|
|
**Features:** |
|
|
- Reasoning mode with `<think>...</think>` output |
|
|
- Multi-turn conversation support (see the sketch below)
|
|
- Image/Video understanding |
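
Multi-turn use follows the standard OpenAI chat format: append each assistant reply back into `messages` before sending the next user turn. A minimal text-only sketch, reusing the `client` from Basic Usage:

```python
# Multi-turn sketch: carry the assistant's previous reply back into `messages`.
messages = [{"role": "user", "content": "Name three traditional Korean dishes."}]

first = client.chat.completions.create(
    model="track_a_model",
    messages=messages,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
messages.append({"role": "assistant", "content": first.choices[0].message.content})

# Follow-up turn that depends on the earlier answer
messages.append({"role": "user", "content": "Which of those is easiest to cook at home?"})
second = client.chat.completions.create(
    model="track_a_model",
    messages=messages,
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"thinking": False}},
)
print(second.choices[0].message.content)
```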
|
|
|
|
|
## Architecture |
|
|
|
|
|
``` |
|
|
User Request |
|
|
(Image/Video/Text) |
|
|
│ |
|
|
▼ |
|
|
┌─────────────────────────────────────────────────────────────────────────┐ |
|
|
│ OmniServe │ |
|
|
│ POST /a/v1/chat/completions │ |
|
|
│ │ |
|
|
│ ┌──────────────────────────────────────────────────────────────────┐ │ |
|
|
│ │ [1] INPUT ENCODING │ │ |
|
|
│ │ │ │ |
|
|
│ │ ┌─────────────────┐ │ │ |
|
|
│ │ │ Vision Encoder │ │ │ |
|
|
│ │ └────────┬────────┘ │ │ |
|
|
│ │ │ embeddings │ │ |
|
|
│ └────────────────────────────┼─────────────────────────────────────┘ │ |
|
|
│ ▼ │ |
|
|
│ ┌──────────────┐ │ |
|
|
│ │ LLM (32B) │◀──── text │ |
|
|
│ └──────┬───────┘ │ |
|
|
│ │ │ |
|
|
│ ▼ │ |
|
|
│ Text Response │ |
|
|
│ │ |
|
|
└─────────────────────────────────────────────────────────────────────────┘ |
|
|
│ |
|
|
▼ |
|
|
Response |
|
|
(Text) |
|
|
``` |
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
| Component | GPU | VRAM | |
|
|
|-----------|-----|------| |
|
|
| Vision Encoder | 1x | ~8GB | |
|
|
| LLM (32B) | 2x | ~60GB | |
|
|
| **Total** | **3x** | **~68GB** | |
|
|
|
|
|
## Key Parameters |
|
|
|
|
|
| Parameter | Description | Default | |
|
|
|-----------|-------------|---------| |
|
|
| `chat_template_kwargs.thinking` | Enable reasoning | `false` | |
|
|
| `thinking_token_budget` | Max reasoning tokens | 500 | |
|
|
| `max_tokens` | Max output tokens | - | |
|
|
| `temperature` | Sampling temperature | 0.7 | |
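
As a worked example, a single reasoning-mode request combining the parameters above might look like this (values are illustrative; the call reuses the `client` from Basic Usage):

```python
# Illustrative request combining the key parameters listed above.
response = client.chat.completions.create(
    model="track_a_model",
    messages=[{"role": "user", "content": "A train travels 180 km in 2.5 hours. What is its average speed?"}],
    max_tokens=1024,      # max output tokens
    temperature=0.7,      # sampling temperature (default shown above)
    extra_body={
        "thinking_token_budget": 500,                # max reasoning tokens
        "chat_template_kwargs": {"thinking": True},  # enable reasoning mode
    },
)
print(response.choices[0].message.content)
```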
|
|
|
|
|
For more details, see [OmniServe documentation](https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe). |
|
|
|
|
|
--- |
|
|
|
|
|
# Citation |
|
|
TBU (Technical Report) |
|
|
|
|
|
--- |
|
|
|
|
|
# Questions |
|
|
For any other questions, please feel free to contact us at dl_hcxopensource@navercorp.com. |
|
|
|
|
|
--- |
|
|
|
|
|
# License |
|
|
The model is licensed under the [HyperCLOVA X SEED 32B Think Model License Agreement](./LICENSE).
|
|
|