docs: convert benchmark section to Korean
README.md
CHANGED
@@ -194,77 +194,76 @@ Qwen/Qwen2.5-7B-Instruct

 ## Benchmarks

-###

-

-#### KMMLU (

-|
-|------
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-| **

-#### HAE-RAE Bench (

-|
-|------
-|
-|
-|
-|
-|
-| **

-####

-- **
-- **
-- **

-###

 RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`

-|
-|------
-| **Q4_K_M (v6)** | **36 tok/s** | 0/5

->

-### MLX

-M1 Max 32GB, MLX 4-bit

-|
-|------
-| **MLX 4-bit** | 4-bit (4.5 bpw) | 0.
-| PyTorch (CPU) | BF16 | 0.
-| PyTorch + LoRA (CPU) | BF16 | 1.

 MLX 4-bit vs PyTorch CPU:
-- **3.
-- **73%**
-- **68%**
-
-### DPO
-
-|
-|------
-|
-|
-| RT
-|
-
 ---

 ## Usage

 ## Benchmarks

+### Korean LLM Benchmarks (KMMLU + HAE-RAE)

+All models were evaluated with **Q4_K_M quantization** in a **0-shot** setting, using `lm-evaluation-harness` v0.4.9 with `llama.cpp` on an Apple M1 Max (32 GB).

+#### KMMLU (Korean MMLU, 10 subjects)

+| Subject | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
+|------|:-----------:|:--------------------:|:---------------:|
+| Marketing | **75.7** | 72.5 | 75.6 |
+| Computer Science | **73.7** | 69.7 | 69.7 |
+| Management | 54.0 | 55.2 | **57.3** |
+| Political Science and Sociology | 49.0 | 49.3 | **56.0** |
+| Economics | 45.4 | 47.7 | **51.5** |
+| Law | 43.4 | 46.1 | **49.9** |
+| Psychology | 39.2 | 39.3 | **45.7** |
+| Accounting | 38.0 | 33.0 | **42.0** |
+| Math | 33.0 | **33.7** | 27.7 |
+| Korean History | **31.0** | 29.0 | 22.0 |
+| **Average** | **48.2** | **47.6** | **49.7** |

+#### HAE-RAE Bench (Korean-specific)

+| Category | VELA DPO v6 | Qwen2.5-7B-Instruct | EXAONE-3.5-7.8B |
+|------|:-----------:|:--------------------:|:---------------:|
+| Rare Words | 69.9 | 68.4 | **78.8** |
+| Standard Nomenclature | 64.7 | 66.0 | **71.9** |
+| Loan Words | 48.5 | 57.4 | **81.1** |
+| Korean History | 45.7 | 42.6 | **77.7** |
+| General Knowledge | **44.3** | 42.1 | 44.3 |
+| **Average** | **54.5** | **55.3** | **70.7** |

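+#### Key Findings

+- **No catastrophic forgetting**: after domain-specific fine-tuning, the model retains the capabilities of its base model (Qwen2.5), with a KMMLU average of 48.2% vs 47.6%
+- **Domain transfer**: gains over the base model in finance-related subjects: Marketing (+3.2 points), Computer Science (+4.0 points), Accounting (+5.0 points)
+- **Competitive with Korean-native models**: ahead of EXAONE-3.5-7.8B (LG AI Research), which was pretrained on a large-scale Korean corpus, in 4 of the 10 KMMLU subjects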
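
The figures quoted in the findings above can be rechecked from the KMMLU table. The sketch below is only an illustration: it assumes the reported averages are unweighted means over the ten subjects, it uses English subject names, and the helper names (`macro_average`, `delta`, `subjects_won`) are hypothetical rather than part of the evaluation pipeline.

```python
# Scores transcribed from the KMMLU table above (Q4_K_M, 0-shot).
# Column order: VELA DPO v6, Qwen2.5-7B-Instruct, EXAONE-3.5-7.8B.
MODELS = ("VELA DPO v6", "Qwen2.5-7B-Instruct", "EXAONE-3.5-7.8B")
KMMLU = {
    "Marketing": (75.7, 72.5, 75.6),
    "Computer Science": (73.7, 69.7, 69.7),
    "Management": (54.0, 55.2, 57.3),
    "Political Science and Sociology": (49.0, 49.3, 56.0),
    "Economics": (45.4, 47.7, 51.5),
    "Law": (43.4, 46.1, 49.9),
    "Psychology": (39.2, 39.3, 45.7),
    "Accounting": (38.0, 33.0, 42.0),
    "Math": (33.0, 33.7, 27.7),
    "Korean History": (31.0, 29.0, 22.0),
}


def macro_average(model: str) -> float:
    """Unweighted mean of one model's subject scores (assumed averaging rule)."""
    col = MODELS.index(model)
    return sum(row[col] for row in KMMLU.values()) / len(KMMLU)


def delta(subject: str, model: str, baseline: str) -> float:
    """Score difference in percentage points, rounded to one decimal."""
    row = KMMLU[subject]
    return round(row[MODELS.index(model)] - row[MODELS.index(baseline)], 1)


def subjects_won(model: str, rival: str) -> list:
    """Subjects where `model` scores strictly higher than `rival`."""
    a, b = MODELS.index(model), MODELS.index(rival)
    return [name for name, row in KMMLU.items() if row[a] > row[b]]


for name in MODELS:
    print(f"{name}: {macro_average(name):.2f}")
print(subjects_won("VELA DPO v6", "EXAONE-3.5-7.8B"))
```

The deltas are differences between two percentages, so they are best read as percentage points rather than relative improvements.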

+### Quantization Benchmark (GGUF)

 RTX 3060 12GB, llama-cpp-python, `n_gpu_layers=-1`, `n_ctx=4096`

+| Format | Speed | Chinese leak | Quality |
+|------|------|-------------|------|
+| **Q4_K_M (v6)** | **36 tok/s** | 0/5 (clean) | RT + report normal |

+> Stress test, 5 runs alternating Synthesis and 3K Reasoning Trace: **zero Chinese leaks** across all runs
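
A throughput figure like the one above could be approximated with a small timing helper around llama-cpp-python. This is a minimal sketch, not the benchmark script used for the table: `load_model` and `tokens_per_second` are hypothetical helpers, the GGUF filename is an assumption that may need adjusting, and counting streamed chunks only approximates the number of generated tokens.

```python
import time
from typing import Callable, Iterable

# Settings quoted in the benchmark header: offload all layers, 4096-token context.
LLAMA_KWARGS = {"n_gpu_layers": -1, "n_ctx": 4096}


def load_model(repo_id: str = "intrect/VELA",
               filename: str = "vela-dpo-v6-q4_k_m.gguf"):
    """Load the GGUF model; the filename is an assumption, adjust as needed."""
    # Imported lazily so the timing helper works without llama-cpp-python installed.
    from llama_cpp import Llama
    return Llama.from_pretrained(repo_id=repo_id, filename=filename, **LLAMA_KWARGS)


def tokens_per_second(stream: Iterable,
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Consume a stream of chunks and return chunks per second of wall time."""
    start = clock()
    count = sum(1 for _ in stream)
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

A possible use is `tokens_per_second(llm.create_chat_completion(messages=[...], stream=True, max_tokens=256))` with `llm = load_model()`. The measurement includes prompt processing time, so short generations will understate decoding speed.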

+### MLX Benchmark (Apple Silicon)

+M1 Max 32GB, MLX 4-bit quantization

+| Configuration | Quantization | Load time | Inference speed | Memory |
+|------|--------|----------|----------|--------|
+| **MLX 4-bit** | 4-bit (4.5 bpw) | 0.59 s | **15.93 tok/s** | 4.4 GB |
+| PyTorch (CPU) | BF16 | 0.10 s | 4.93 tok/s | 0.3 GB |
+| PyTorch + LoRA (CPU) | BF16 | 1.64 s | 4.22 tok/s | 14.1 GB |

 MLX 4-bit vs PyTorch CPU:
+- Inference speed **3.2x** faster (15.93 vs 4.93 tok/s)
+- Model size **73% smaller** (4 GB vs 15 GB)
+- Memory use **68% lower** (4.4 GB vs 14.1 GB for PyTorch + LoRA)
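
The three ratios above follow directly from the table values. The sketch below reproduces the arithmetic with two hypothetical helpers. It assumes the memory comparison uses the 14.1 GB PyTorch + LoRA figure, since that is the number quoted in the bullet.

```python
def speedup(new_rate: float, old_rate: float) -> float:
    """How many times faster the new throughput is than the old one."""
    return new_rate / old_rate


def reduction_pct(new_size: float, old_size: float) -> float:
    """Percentage by which new_size is smaller than old_size."""
    return (1 - new_size / old_size) * 100


# Values taken from the MLX benchmark table and bullet list.
print(f"speedup: {speedup(15.93, 4.93):.2f}x")             # tok/s
print(f"model size: {reduction_pct(4, 15):.1f}% smaller")  # GB
print(f"memory: {reduction_pct(4.4, 14.1):.1f}% lower")    # GB
```

Computed this way the memory saving comes out slightly under 69%, which the list reports as 68%.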
+
+### DPO Training Quality Improvements
+
+| Metric | Before DPO | After DPO |
+|------|--------|--------|
+| Chinese leak | Frequent | **0/10 (clean)** |
+| English leak | Intermittent | Minimized |
+| RT format compliance | ~80% | **~98%** |
+| Korean fluency | Good | **Excellent** |
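
The README does not say how Chinese leaks were counted. One plausible approach, sketched here under that assumption, is to flag any output containing Han ideographs. The function names and Unicode ranges below are illustrative choices, and the check would also flag legitimate Hanja in Korean text.

```python
import re

# CJK Unified Ideographs plus Extension A; Hangul (U+AC00 to U+D7A3) is not matched.
HAN_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def contains_han(text: str) -> bool:
    """True if the text contains at least one Han ideograph."""
    return HAN_PATTERN.search(text) is not None


def leak_report(outputs: list) -> str:
    """Summarize leaks as 'leaked/total', in the style of the table's 0/10."""
    leaked = sum(1 for text in outputs if contains_han(text))
    return f"{leaked}/{len(outputs)}"


samples = ["삼성전자 주가는 상승했습니다.", "Hello", "你好"]
print(leak_report(samples))
```

A stricter check might allow a list of expected Hanja or look only at runs of several consecutive ideographs.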
 ---

 ## Usage