Instructions for using moreh/Llama-3-Motif-102B-Instruct with libraries, inference providers, notebooks, and local apps.

Libraries

How to use moreh/Llama-3-Motif-102B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moreh/Llama-3-Motif-102B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("moreh/Llama-3-Motif-102B-Instruct")
model = AutoModelForCausalLM.from_pretrained("moreh/Llama-3-Motif-102B-Instruct")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use moreh/Llama-3-Motif-102B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moreh/Llama-3-Motif-102B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moreh/Llama-3-Motif-102B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
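
The same request can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `http://localhost:8000`; the helper names `build_chat_request` and `ask` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "moreh/Llama-3-Motif-102B-Instruct"


def build_chat_request(user_message, base_url="http://localhost:8000"):
    """Build the same POST request as the curl call (nothing is sent yet)."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(user_message, base_url="http://localhost:8000"):
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(user_message, base_url)) as resp:
        body = json.load(resp)
    # OpenAI-compatible responses place the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Example (requires the server to be running):
# print(ask("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should work the same way against any other server that exposes the same OpenAI-compatible endpoint on that port.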

Use Docker

docker model run hf.co/moreh/Llama-3-Motif-102B-Instruct

SGLang

How to use moreh/Llama-3-Motif-102B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moreh/Llama-3-Motif-102B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moreh/Llama-3-Motif-102B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moreh/Llama-3-Motif-102B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moreh/Llama-3-Motif-102B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Introduction

We introduce Llama-3-Motif, a new language model family from Moreh that specializes in Korean and English.
Llama-3-Motif-102B-Instruct is a chat model tuned from the base model Llama-3-Motif-102B.

Training Platform

The Llama-3-Motif-102B model family is trained on the MoAI platform. Refer to the link for more information.

Quick Usage

You can chat directly with Llama-3-Motif through our Model hub.

Details

More details will be provided in the upcoming technical report.
The effective context length is 32k (avg 81), based on the RULER benchmark.
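
As a rough illustration of how the 32k figure might be used in practice, the sketch below checks whether a prompt plus the requested generation length stays within that budget. It assumes "32k" means 32,768 tokens (it could equally mean 32,000), and both helper names are hypothetical.

```python
# Assumption: "32k" is read as 32 * 1024 = 32,768 tokens.
EFFECTIVE_CONTEXT_TOKENS = 32 * 1024


def fits_effective_context(prompt_tokens, max_new_tokens,
                           limit=EFFECTIVE_CONTEXT_TOKENS):
    """Return True if prompt and generated tokens together stay within the limit."""
    return prompt_tokens + max_new_tokens <= limit


def max_new_tokens_allowed(prompt_tokens, limit=EFFECTIVE_CONTEXT_TOKENS):
    """Return how many tokens can still be generated after the prompt."""
    return max(0, limit - prompt_tokens)


print(fits_effective_context(30_000, 512))   # True
print(fits_effective_context(32_700, 512))   # False
print(max_new_tokens_allowed(32_000))        # 768
```

The prompt length would come from the tokenizer, for example the length of the `input_ids` produced by the chat template.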

Release Date

2024.12.02

Benchmark Results

| Provider | Model | kmmlu_direct score | Note |
|---|---|---|---|
| Moreh | Llama-3-Motif-102B | 64.74 | + |
| Moreh | Llama-3-Motif-102B-Instruct | 64.81 | + |
| Meta | Llama3-70B-instruct | 54.5* | |
| Meta | Llama3.1-70B-instruct | 52.1* | |
| Meta | Llama3.1-405B-instruct | 65.8* | |
| Alibaba | Qwen2-72B-instruct | 64.1* | |
| OpenAI | GPT-4-0125-preview | 59.95* | |
| OpenAI | GPT-4o-2024-05-13 | 64.11** | |
| Google | gemini pro | 50.18* | |
| LG | exaone 3.0 | 44.5* | + |
| Naver | HyperCLOVA X | 53.4* | + |
| Upstage | SOLAR-10.7B | 41.65* | + |

* : Community report
** : Measured by Moreh
+ : Claimed to have better capability in Korean
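
To make the comparison easier to read, the short sketch below sorts the scores from the table and reports the gap between Llama-3-Motif-102B-Instruct and the top entry. The numbers are copied from the table. Because they mix community reports with Moreh's own measurements, the ranking is indicative only.

```python
# (provider, model, kmmlu_direct score) copied from the benchmark table
scores = [
    ("Moreh", "Llama-3-Motif-102B", 64.74),
    ("Moreh", "Llama-3-Motif-102B-Instruct", 64.81),
    ("Meta", "Llama3-70B-instruct", 54.5),
    ("Meta", "Llama3.1-70B-instruct", 52.1),
    ("Meta", "Llama3.1-405B-instruct", 65.8),
    ("Alibaba", "Qwen2-72B-instruct", 64.1),
    ("OpenAI", "GPT-4-0125-preview", 59.95),
    ("OpenAI", "GPT-4o-2024-05-13", 64.11),
    ("Google", "gemini pro", 50.18),
    ("LG", "exaone 3.0", 44.5),
    ("Naver", "HyperCLOVA X", 53.4),
    ("Upstage", "SOLAR-10.7B", 41.65),
]

# Highest score first
ranked = sorted(scores, key=lambda row: row[2], reverse=True)
for rank, (provider, model, score) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {provider:<8} {model:<28} {score:.2f}")

# Gap between the instruct model and the best score in the table
motif_score = next(s for _, m, s in scores if m == "Llama-3-Motif-102B-Instruct")
gap_to_top = round(ranked[0][2] - motif_score, 2)
print(f"Gap to top entry: {gap_to_top} points")
```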

How to use

Use with vLLM

Refer to this link to install vLLM.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Set tensor_parallel_size to the number of GPUs you have available
model = LLM("moreh/Llama-3-Motif-102B-Instruct", tensor_parallel_size=4)
tokenizer = AutoTokenizer.from_pretrained("moreh/Llama-3-Motif-102B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "유치원생에게 빅뱅 이론의 개념을 설명해보세요"},
]

messages_batch = [tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)]

# vLLM does not support the Hugging Face generation_config, so set the sampling parameters explicitly
sampling_params = SamplingParams(max_tokens=512, temperature=0, repetition_penalty=1.0, stop_token_ids=[tokenizer.eos_token_id])
responses = model.generate(messages_batch, sampling_params=sampling_params)

print(responses[0].outputs[0].text)
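
When choosing `tensor_parallel_size`, a back-of-the-envelope estimate of weight memory per GPU can help. The sketch below assumes that "102B" means roughly 102 billion parameters and that weights are stored in a 2-byte format such as bfloat16. It counts weights only, so the KV cache, activations, and framework overhead come on top. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2, tensor_parallel_size=1):
    """Approximate per-GPU memory for model weights only, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / tensor_parallel_size / 1e9


NUM_PARAMS = 102e9  # assumption: "102B" = 102 billion parameters

for tp in (1, 2, 4, 8):
    per_gpu = weight_memory_gb(NUM_PARAMS, bytes_per_param=2, tensor_parallel_size=tp)
    print(f"tensor_parallel_size={tp}: ~{per_gpu:.1f} GB of weights per GPU")
```

Under these assumptions, `tensor_parallel_size=4` leaves about 51 GB of weights on each GPU, so each device needs more memory than that once the cache and activations are included.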

Use with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "moreh/Llama-3-Motif-102B-Instruct"

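# Generation settings are read from generation_config.json
# Assumption: load bfloat16 weights and shard them across the visible GPUs with
# device_map="auto" (requires the accelerate package); adjust to your hardware
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")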
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "유치원생에게 빅뱅 이론의 개념을 설명해보세요"},
]

# Tokenize through the chat template so special tokens are inserted only once
inputs = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors='pt')
input_ids = inputs['input_ids'].to(model.device)

outputs = model.generate(input_ids)