HyperCLOVA X SEED 14B performance and compatibility issues on an RTX 5070 (Blackwell)

by taedyv - opened Jul 24, 2025

Hello! I'm writing about a performance problem I've run into while using the HyperCLOVA X SEED 14B Think model.

Current environment

GPU: RTX 5070 (12GB VRAM)
CUDA: 12.9
Python: 3.10.9
transformers: 4.45.0
Windows 11 + Docker Desktop

The problem

When I use the transformers example code from the official documentation:

model = AutoModelForCausalLM.from_pretrained(
    "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B", 
    trust_remote_code=True, 
    device_map="auto"
)

a request for a single Korean poem takes more than 1000 seconds to return a response.
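
A rough back-of-the-envelope check may help explain the delay. The sketch below assumes about 14 billion parameters (inferred from the "14B" in the model name, not an official figure) and counts weights only. The helper name `weight_gb` is hypothetical. If the weights do not fit in VRAM, `device_map="auto"` would presumably place the remaining layers on CPU RAM or disk, which is typically much slower.

```python
# Rough weight-only memory estimate; ignores KV cache, activations and overhead.
# ASSUMPTION: ~14e9 parameters, inferred from the "14B" in the model name.
N_PARAMS = 14e9
VRAM_GB = 12  # RTX 5070, as listed in the environment above


def weight_gb(n_params: float, bits_per_param: int) -> float:
    """Return the size of the weights in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    size = weight_gb(N_PARAMS, bits)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{label:>9}: {size:5.1f} GB of weights -> {verdict} in {VRAM_GB} GB")
```

Under these assumptions only the 4-bit case leaves the weights below 12 GB, and even then the KV cache and runtime overhead still need room.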

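What I've tried

vLLM installed via pip → fails with a "no kernel image is available" error because of an incompatibility with the RTX 5070 (Blackwell architecture)
Building vLLM from source in WSL → failed with gcc/cmake compile errors
Docker-based vLLM build → currently in progress, using nvidia/pytorch:25.03-py3 as the base image
4-bit quantization → still need to confirm whether BitsAndBytes is compatible with the RTX 5070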

RTX 5070-specific issues

The Blackwell architecture (sm_120) is so new that existing PyTorch/vLLM binaries do not support it
The driver supports CUDA 12.9, but most packages are built against CUDA 12.8
Source builds require TORCH_CUDA_ARCH_LIST="12.0"
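
One way to confirm the first point on a given machine is to compare the GPU's compute capability with the architectures the installed PyTorch build was compiled for. The sketch below relies on `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`. The helper `has_native_kernels` is a hypothetical name, and the check is a conservative heuristic: it looks only for a native `sm_XY` entry and ignores any PTX forward compatibility.

```python
def has_native_kernels(capability: tuple[int, int], arch_list: list[str]) -> bool:
    """True if arch_list contains native code (sm_XY) for this capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)  # (major, minor) of GPU 0
            archs = torch.cuda.get_arch_list()         # archs this build targets
            print("device capability:", cap)
            print("compiled architectures:", archs)
            print("native kernels present:", has_native_kernels(cap, archs))
        else:
            print("CUDA is not available to this PyTorch build")
```

If the last line prints `False` for capability `(12, 0)`, the installed wheel was most likely built without sm_120, which would be consistent with the "no kernel image is available" error.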

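Questions

Is there a recommended way to run the model on an RTX 5070?
Are there plans to provide quantized models (4-bit/8-bit, GGUF, etc.)?
Are there plans for Ollama support? (I see this was already requested in #2.)
Has anyone run it successfully in a Docker environment?
Are there recommended settings for reducing memory usage?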

Approaches I plan to try next

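# Memory-optimization settings
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Think-14B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # 4-bit loading through bitsandbytes (replaces the bare load_in_4bit=True)
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    max_memory={0: "10GB"},  # use only 10GB of the RTX 5070's 12GB
)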
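
To judge whether a setting like the one above actually improves on the 1000-second baseline, it may help to time one generation and report tokens per second. This is a minimal sketch, not part of the original post: `generate_fn` is a hypothetical wrapper that performs one generation and returns the number of newly generated tokens, and `fake_generate` is a stand-in so the example runs without a GPU or model.

```python
import time
from typing import Callable


def tokens_per_second(generate_fn: Callable[[], int]) -> float:
    """Run generate_fn once and return generated tokens per second.

    generate_fn is a hypothetical wrapper that performs one generation
    and returns how many new tokens it produced.
    """
    start = time.perf_counter()
    n_new_tokens = generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU or model.
    def fake_generate() -> int:
        time.sleep(0.05)
        return 40

    print(f"{tokens_per_second(fake_generate):.1f} tokens/s")
```

With a real model, the wrapper could call `model.generate(...)` and return the difference between the output length and the prompt length.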

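Requests to the community

Working configurations from other RTX 5070 users
Guidelines for supporting new GPU architectures
Fast inference options that can be used instead of vLLM

An official guide for users of recent GPUs such as the RTX 5070 would be a real help.

Thank you.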
