Instructions for using DSAiLab/KoMultiGen-general-gptq-4bit-32g with libraries and local apps.

Libraries

How to use DSAiLab/KoMultiGen-general-gptq-4bit-32g with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="DSAiLab/KoMultiGen-general-gptq-4bit-32g")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

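# Load model directly
# Note: loading GPTQ checkpoints in Transformers typically also requires the
# optimum package and a GPTQ backend (e.g. auto-gptq or gptqmodel), plus a GPU.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "DSAiLab/KoMultiGen-general-gptq-4bit-32g"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" (needs accelerate) places the quantized weights on available GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template into input tensors on the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))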

Local Apps

vLLM

How to use DSAiLab/KoMultiGen-general-gptq-4bit-32g with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DSAiLab/KoMultiGen-general-gptq-4bit-32g"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DSAiLab/KoMultiGen-general-gptq-4bit-32g",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
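
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on localhost:8000 and returns the usual OpenAI-style chat completion body; `build_chat_request` and `chat` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "DSAiLab/KoMultiGen-general-gptq-4bit-32g"


def build_chat_request(content, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content, base_url="http://localhost:8000"):
    """Send the request and return the assistant's reply text.

    Assumes the OpenAI-style response shape: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(content, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(chat("What is the capital of France?"))
```

For a server on a different port, such as the SGLang examples that use 30000, pass `base_url="http://localhost:30000"`.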

Use Docker

docker model run hf.co/DSAiLab/KoMultiGen-general-gptq-4bit-32g

SGLang

How to use DSAiLab/KoMultiGen-general-gptq-4bit-32g with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "DSAiLab/KoMultiGen-general-gptq-4bit-32g" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DSAiLab/KoMultiGen-general-gptq-4bit-32g",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "DSAiLab/KoMultiGen-general-gptq-4bit-32g" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DSAiLab/KoMultiGen-general-gptq-4bit-32g",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use DSAiLab/KoMultiGen-general-gptq-4bit-32g with Docker Model Runner:
```
docker model run hf.co/DSAiLab/KoMultiGen-general-gptq-4bit-32g
```

KoMultiGen-General

This model was built as part of koVast, a project to create a large-scale Korean multi-turn dataset.

KoMultiGen-General is based on the Synatra-Mixtral model. Given unstructured input data, it can generate Korean multi-turn data of 3 to 5 turns.

Methodology

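The model was trained on a dataset generated with GPT-4 Turbo. Because users very often address Korean models in the informal '해체' (haeche) speech style, the Questions are written in '해체'. In addition, for inputs from which data is hard to generate, the model is steered to decline rather than generate, which is intended to minimize difficulty in the data cleaning stage.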

Quantization (GPTQ)

This model was quantized to 4-bit precision from the original FP16 model using GPTQ, a post-training quantization method, applied here group-wise.

Quantization Type: GPTQ 4bit
Group Size: 32
Activation Ordering: Enabled (if applicable)
Compatibility: supported by vLLM and SGLang
Accuracy loss: multi-turn QA generation quality stays within a marginal difference of the pre-quantization model
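
As a rough illustration of what these settings mean for memory, the sketch below estimates the effective bits per weight and the weight footprint. It assumes a common GPTQ storage layout (one FP16 scale and one 4-bit zero point per group), assumes all 47B parameters reported for this checkpoint are quantized, and ignores the activation-order index, the KV cache, and runtime overhead. In practice some layers usually stay in 16-bit, so treat the result as an approximation only; the function names are hypothetical.

```python
def gptq_bits_per_weight(bits=4, group_size=32, scale_bits=16, zero_bits=4):
    """Effective bits per weight: payload bits plus per-group metadata.

    scale_bits and zero_bits are assumptions about the storage layout.
    """
    return bits + (scale_bits + zero_bits) / group_size


def estimate_weight_gib(n_params, bits_per_weight):
    """Convert a parameter count and bit width into GiB of weight storage."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 47e9  # parameter count as reported for this checkpoint

bpw = gptq_bits_per_weight()  # 4 + 20/32 = 4.625 bits per weight
quantized_gib = estimate_weight_gib(N_PARAMS, bpw)
fp16_gib = estimate_weight_gib(N_PARAMS, 16)
print(f"{bpw:.3f} bits/weight: ~{quantized_gib:.1f} GiB vs ~{fp16_gib:.1f} GiB in FP16")
```

A group size of 32 carries more metadata per weight than the larger group sizes often used with GPTQ, which trades a somewhat larger file for finer-grained scaling.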

Example

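The model generally follows the prompt format below. The Korean instruction lines are part of the prompt itself and should be sent as written; in English they read: "Generate a 3 to 5 turn QA from the given data. Do not use quotation expressions. Questions use informal speech such as '~는 ~야?', and Answers use the '~입니다.' form."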

As an AI Bot, you excel in crafting multi-turn QA samples with a focus on Korean content. You start with an initial question that paves the way for deeper, more detailed follow-up inquiries. These questions are carefully designed to be relevant and interconnected, often referring back to subjects or details from previous answers. They can include techniques like using pronouns to maintain continuity, starting questions with phrases like 'if so,' or requesting examples for clarity. Your answers are expected to be rooted in thorough logical analysis. The dialogue can unfold over 3 to 5 exchanges. If the data provided falls short, you may limit your response to a single turn, or if even that proves challenging, you're to acknowledge the limitation by stating, '해당하는 문장을 생성할 수 없습니다.'

### Instruction:
### Start of provided data
{prompt}
### End of provided data

주어진 데이터로 3~5 turn의 QA를 생성해라.
인용표현을 사용하지 말아야한다.
Question의 말투는 '~는 ~야?' 같은 반말 어휘를 사용하며, Answer는 '~입니다.' 어휘를 사용한다.

### Response:
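
The template above can be filled in programmatically. The following is a minimal sketch, assuming a single blank line between the system text and `### Instruction:` and a trailing newline after `### Response:` (the exact whitespace used in training is not stated here); `build_prompt` and `is_refusal` are hypothetical helper names.

```python
SYSTEM_PROMPT = (
    "As an AI Bot, you excel in crafting multi-turn QA samples with a focus on "
    "Korean content. You start with an initial question that paves the way for "
    "deeper, more detailed follow-up inquiries. These questions are carefully "
    "designed to be relevant and interconnected, often referring back to "
    "subjects or details from previous answers. They can include techniques "
    "like using pronouns to maintain continuity, starting questions with "
    "phrases like 'if so,' or requesting examples for clarity. Your answers are "
    "expected to be rooted in thorough logical analysis. The dialogue can "
    "unfold over 3 to 5 exchanges. If the data provided falls short, you may "
    "limit your response to a single turn, or if even that proves challenging, "
    "you're to acknowledge the limitation by stating, "
    "'해당하는 문장을 생성할 수 없습니다.'"
)

# Sentence the prompt asks the model to emit when it cannot generate QA pairs.
REFUSAL = "해당하는 문장을 생성할 수 없습니다."


def build_prompt(data: str) -> str:
    """Insert the source text into the documented prompt template."""
    return (
        f"{SYSTEM_PROMPT}\n\n"
        "### Instruction:\n"
        "### Start of provided data\n"
        f"{data.strip()}\n"
        "### End of provided data\n\n"
        "주어진 데이터로 3~5 turn의 QA를 생성해라.\n"
        "인용표현을 사용하지 말아야한다.\n"
        "Question의 말투는 '~는 ~야?' 같은 반말 어휘를 사용하며, "
        "Answer는 '~입니다.' 어휘를 사용한다.\n\n"
        "### Response:\n"
    )


def is_refusal(generated: str) -> bool:
    """Return True if the model declined to generate QA pairs."""
    return REFUSAL in generated
```

Checking for the refusal sentence gives a simple way to drop inputs the model judged unsuitable during data cleaning.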

Safetensors

Model size: 47B params
Tensor type: I32, BF16, F16