Instructions to use MyeongHo0621/Qwen2.5-3B-Korean with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use MyeongHo0621/Qwen2.5-3B-Korean with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MyeongHo0621/Qwen2.5-3B-Korean", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/Qwen2.5-3B-Korean")
model = AutoModelForCausalLM.from_pretrained("MyeongHo0621/Qwen2.5-3B-Korean", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use MyeongHo0621/Qwen2.5-3B-Korean with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="MyeongHo0621/Qwen2.5-3B-Korean",
	filename="gguf/qwen25-3b-korean-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use MyeongHo0621/Qwen2.5-3B-Korean with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Use Docker

docker model run hf.co/MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

vLLM

How to use MyeongHo0621/Qwen2.5-3B-Korean with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MyeongHo0621/Qwen2.5-3B-Korean"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MyeongHo0621/Qwen2.5-3B-Korean",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

SGLang

How to use MyeongHo0621/Qwen2.5-3B-Korean with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MyeongHo0621/Qwen2.5-3B-Korean" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MyeongHo0621/Qwen2.5-3B-Korean",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MyeongHo0621/Qwen2.5-3B-Korean" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MyeongHo0621/Qwen2.5-3B-Korean",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use MyeongHo0621/Qwen2.5-3B-Korean with Ollama:
```
ollama run hf.co/MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M
```

Unsloth Studio

How to use MyeongHo0621/Qwen2.5-3B-Korean with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MyeongHo0621/Qwen2.5-3B-Korean to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MyeongHo0621/Qwen2.5-3B-Korean to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for MyeongHo0621/Qwen2.5-3B-Korean to start chatting

Pi

How to use MyeongHo0621/Qwen2.5-3B-Korean with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use MyeongHo0621/Qwen2.5-3B-Korean with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use MyeongHo0621/Qwen2.5-3B-Korean with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use MyeongHo0621/Qwen2.5-3B-Korean with Docker Model Runner:
```
docker model run hf.co/MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M
```

Lemonade

How to use MyeongHo0621/Qwen2.5-3B-Korean with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull MyeongHo0621/Qwen2.5-3B-Korean:Q4_K_M

Run and chat with the model

lemonade run user.Qwen2.5-3B-Korean-Q4_K_M

List all available models

lemonade list

Qwen2.5-3B-Korean

Model Description

Qwen2.5-3B-Korean is a merged model fine-tuned from Qwen/Qwen2.5-3B-Instruct for Korean.

This repository provides the complete model, with the LoRA adapter already merged, along with GGUF files.

If you need the PEFT/LoRA adapter, see MyeongHo0621/Qwen2.5-3B-Korean-QLoRA.

🎯 Key Features

🇰🇷 Korean Optimization: trained on 200,000 high-quality Korean conversation samples
📦 Ready-to-Use: LoRA already merged, usable immediately
🚀 Multi-Format: Safetensors (repository root) + GGUF (gguf/)
💻 All Frameworks: Transformers, vLLM, SGLang, Ollama, Llama.cpp
⚖️ Apache 2.0: commercial use permitted

📦 Available Formats

Format	Path	Use Case	Size
Safetensors	`/` (root)	Transformers, vLLM, SGLang	~6GB
GGUF Q4_K_M	`gguf/qwen25-3b-korean-Q4_K_M.gguf`	Ollama, Llama.cpp (recommended)	~2GB
GGUF Q5_K_M	`gguf/qwen25-3b-korean-Q5_K_M.gguf`	Higher quality	~2.5GB
GGUF Q8_0	`gguf/qwen25-3b-korean-Q8_0.gguf`	Highest quality	~3.5GB
GGUF F16	`gguf/qwen25-3b-korean-F16.gguf`	Benchmarking	~6GB
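
As a rough aid for choosing among the files listed above, the sketch below picks the largest GGUF file whose approximate size fits a given budget. The paths and sizes are copied from the table; the function name `pick_gguf` and the idea of choosing by file size alone are illustrative assumptions, since real memory use also depends on context length and runtime overhead.

```python
# Paths and approximate sizes copied from the formats table.
GGUF_FILES = {
    "Q4_K_M": "gguf/qwen25-3b-korean-Q4_K_M.gguf",
    "Q5_K_M": "gguf/qwen25-3b-korean-Q5_K_M.gguf",
    "Q8_0": "gguf/qwen25-3b-korean-Q8_0.gguf",
    "F16": "gguf/qwen25-3b-korean-F16.gguf",
}
APPROX_SIZE_GB = {"Q4_K_M": 2.0, "Q5_K_M": 2.5, "Q8_0": 3.5, "F16": 6.0}


def pick_gguf(budget_gb: float) -> str:
    """Hypothetical helper: path of the largest listed file within budget_gb."""
    fitting = [name for name, size in APPROX_SIZE_GB.items() if size <= budget_gb]
    if not fitting:
        raise ValueError(f"no listed GGUF file fits in {budget_gb} GB")
    best = max(fitting, key=APPROX_SIZE_GB.get)
    return GGUF_FILES[best]


print(pick_gguf(3.0))  # gguf/qwen25-3b-korean-Q5_K_M.gguf
```

The returned path can be passed as the `filename` argument of `Llama.from_pretrained` or to `huggingface-cli download`.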

🚀 Quick Start

1️⃣ Transformers (simplest)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model (merged weights)
model = AutoModelForCausalLM.from_pretrained(
    "MyeongHo0621/Qwen2.5-3B-Korean",
    torch_dtype="auto",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("MyeongHo0621/Qwen2.5-3B-Korean")

# Use the chat template
messages = [
    {"role": "system", "content": "You are a helpful Korean assistant."},
    {"role": "user", "content": "한국의 수도는 어디인가요?"}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

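# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))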

2️⃣ vLLM (Production Serving)

from vllm import LLM, SamplingParams

# Load the merged model
llm = LLM(
    model="MyeongHo0621/Qwen2.5-3B-Korean",
    quantization="bitsandbytes",  # optional: 4-bit quantization
    gpu_memory_utilization=0.6
)

prompts = ["한국의 수도는 어디인가요?"]
params = SamplingParams(temperature=0.7, max_tokens=512)

outputs = llm.generate(prompts, params)
for output in outputs:
    print(output.outputs[0].text)

Server Mode:

vllm serve MyeongHo0621/Qwen2.5-3B-Korean \
    --quantization bitsandbytes \
    --port 8000
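
Once the server is running, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library; it assumes the server listens on localhost port 8000 as in the command above, and the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000/v1",
                       model="MyeongHo0621/Qwen2.5-3B-Korean"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires the vLLM server to be running:
# print(chat("한국의 수도는 어디인가요?"))
```

Unlike `llm.generate` with a raw prompt string, the chat endpoint applies the model's chat template on the server side.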

3️⃣ SGLang (Fastest)

import sglang as sgl

runtime = sgl.Runtime(
    model_path="MyeongHo0621/Qwen2.5-3B-Korean",
    quantization="bitsandbytes"
)

sgl.set_default_backend(runtime)

@sgl.function
def chat(s, prompt):
    s += sgl.user(prompt)
    s += sgl.assistant(sgl.gen("response", max_tokens=512))

state = chat.run(prompt="한국의 수도는?")
print(state["response"])

4️⃣ Ollama (Local Desktop)

# 1. Download the GGUF file
huggingface-cli download MyeongHo0621/Qwen2.5-3B-Korean \
    gguf/qwen25-3b-korean-Q4_K_M.gguf \
    --local-dir ./

# 2. Create the Modelfile
cat > Modelfile << 'EOF'
FROM ./gguf/qwen25-3b-korean-Q4_K_M.gguf

TEMPLATE """<|im_start|>system
You are a helpful Korean assistant.<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
EOF

# 3. Create and run the model
ollama create qwen25-korean -f Modelfile
ollama run qwen25-korean "한국의 수도는?"
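
The Modelfile template above follows the ChatML layout (`<|im_start|>role ... <|im_end|>`). For runtimes that take a raw prompt string, the same layout can be produced in Python. This is a sketch that mirrors the template text shown above; the function name `format_chatml` is hypothetical, and in Transformers `tokenizer.apply_chat_template` is the more reliable route.

```python
def format_chatml(user, system="You are a helpful Korean assistant."):
    """Render a single-turn ChatML prompt ending at the assistant turn."""
    parts = []
    if system:
        parts.append(f"<|im_start|>system\n{system}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user}<|im_end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(format_chatml("한국의 수도는?"))
```

Generation should stop at `<|im_end|>`, which is why the Modelfile registers it as a stop string.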

5️⃣ Llama.cpp (CPU/Edge)

# 1. Download the GGUF file
huggingface-cli download MyeongHo0621/Qwen2.5-3B-Korean \
    gguf/qwen25-3b-korean-Q4_K_M.gguf \
    --local-dir ./

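# 2. Inference (GPU)
# llama.cpp renamed the `main` binary to `llama-cli`; adjust the path to your build
./llama.cpp/build/bin/llama-cli \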
    -m ./gguf/qwen25-3b-korean-Q4_K_M.gguf \
    -p "<|im_start|>user\n한국의 수도는?<|im_end|>\n<|im_start|>assistant\n" \
    -n 512 \
    --temp 0.7 \
    -ngl 99

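# 3. Inference (CPU)
# llama.cpp renamed the `main` binary to `llama-cli`; adjust the path to your build
./llama.cpp/build/bin/llama-cli \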
    -m ./gguf/qwen25-3b-korean-Q4_K_M.gguf \
    -p "<|im_start|>user\n한국의 수도는?<|im_end|>\n<|im_start|>assistant\n" \
    -n 512 \
    -t 8

🔧 Training Details

Dataset

Source: MyeongHo0621/smol-koreantalk
Samples: 200,000 Korean conversation pairs
Domain: general conversation, instruction following, knowledge Q&A

Training Configuration

Hyperparameter	Value
Method	QLoRA (4-bit NF4)
LoRA Rank	64
LoRA Alpha	128
Learning Rate	2e-4
Batch Size	128 (effective)
Epochs	3
Steps	4689
Max Length	2048
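
The step count in the table is consistent with the other hyperparameters. The short check below reproduces it, assuming the final partial batch of each epoch is kept (not dropped) and that LoRA uses the standard alpha / rank scaling; neither assumption is stated in the card.

```python
import math

samples = 200_000       # dataset size
effective_batch = 128   # effective batch size
epochs = 3

# Assumption: the last, partially filled batch in each epoch still counts as a step.
steps_per_epoch = math.ceil(samples / effective_batch)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1563 4689

# Assumption: standard LoRA scaling factor alpha / rank.
lora_alpha, lora_rank = 128, 64
print(lora_alpha / lora_rank)  # 2.0
```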

📊 Repository Structure

MyeongHo0621/Qwen2.5-3B-Korean/
├── config.json                 # model configuration
├── model.safetensors          # merged model (~6GB)
├── tokenizer.json             # tokenizer
├── tokenizer_config.json
└── gguf/                      # GGUF files
    ├── qwen25-3b-korean-Q4_K_M.gguf  (~2GB) ⭐ recommended
    ├── qwen25-3b-korean-Q5_K_M.gguf  (~2.5GB)
    ├── qwen25-3b-korean-Q8_0.gguf    (~3.5GB)
    └── qwen25-3b-korean-F16.gguf     (~6GB)

🔗 Related Repositories

PEFT Adapter: MyeongHo0621/Qwen2.5-3B-Korean-QLoRA
- For cases where only the LoRA adapter is needed
- For fine-tuning research
- ~479MB (lightweight)

📝 Citation

@misc{qwen25-korean-2025,
  author = {MyeongHo Shin},
  title = {Qwen2.5-3B-Korean: Korean-Optimized Conversational Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/MyeongHo0621/Qwen2.5-3B-Korean}},
}

🙏 Acknowledgments

Base Model: Qwen2.5-3B-Instruct by Alibaba Cloud
Dataset: smol-koreantalk
Tools: Unsloth, PEFT, vLLM, SGLang, Llama.cpp

📞 Contact

Author: MyeongHo Shin
HuggingFace: @MyeongHo0621

⚖️ License

Apache 2.0 - commercial use, modification, and distribution permitted

Evaluation results

Benchmark Results

General Benchmarks

Task	Score	Metric
gsm8k	42.00%	acc
mmlu	58.00%	acc
hellaswag	71.00%	acc_norm
winogrande	65.00%	acc
arc_easy	78.00%	acc
arc_challenge	48.00%	acc_norm

Average Score: 60.33%
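
The reported average is the unweighted mean of the six task scores above, which mixes `acc` and `acc_norm` metrics. A quick check, with the scores copied from the table:

```python
# Scores (percent) copied from the benchmark table.
scores = {
    "gsm8k": 42.0,
    "mmlu": 58.0,
    "hellaswag": 71.0,
    "winogrande": 65.0,
    "arc_easy": 78.0,
    "arc_challenge": 48.0,
}

average = sum(scores.values()) / len(scores)
print(f"{average:.2f}%")  # 60.33%
```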

Downloads last month: 292

Safetensors
Model size: 3B params
Tensor type: BF16

Model tree for MyeongHo0621/Qwen2.5-3B-Korean
Base model: Qwen/Qwen2.5-3B
Finetuned: Qwen/Qwen2.5-3B-Instruct
Quantized (254): this model

Quantizations: 1 model