Instructions for using Zoyd/LLM360_K2-Chat-2_2bpw_exl2 with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use Zoyd/LLM360_K2-Chat-2_2bpw_exl2 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Zoyd/LLM360_K2-Chat-2_2bpw_exl2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Zoyd/LLM360_K2-Chat-2_2bpw_exl2")
model = AutoModelForCausalLM.from_pretrained("Zoyd/LLM360_K2-Chat-2_2bpw_exl2")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use Zoyd/LLM360_K2-Chat-2_2bpw_exl2 with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "Zoyd/LLM360_K2-Chat-2_2bpw_exl2"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Zoyd/LLM360_K2-Chat-2_2bpw_exl2",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```shell
docker model run hf.co/Zoyd/LLM360_K2-Chat-2_2bpw_exl2
```
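Because the vLLM server exposes an OpenAI-compatible API, you can also call it from Python instead of curl. A minimal sketch using the `openai` client, assuming the server started above is listening on localhost:8000 (the `api_key` value is a placeholder; vLLM only checks it if you configure one):

```python
# Query the OpenAI-compatible endpoint served by vLLM.
from openai import OpenAI

# Placeholder key: vLLM does not validate it unless an API key is configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Zoyd/LLM360_K2-Chat-2_2bpw_exl2",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```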
- SGLang
How to use Zoyd/LLM360_K2-Chat-2_2bpw_exl2 with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "Zoyd/LLM360_K2-Chat-2_2bpw_exl2" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Zoyd/LLM360_K2-Chat-2_2bpw_exl2",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "Zoyd/LLM360_K2-Chat-2_2bpw_exl2" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "Zoyd/LLM360_K2-Chat-2_2bpw_exl2",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use Zoyd/LLM360_K2-Chat-2_2bpw_exl2 with Docker Model Runner:
```shell
docker model run hf.co/Zoyd/LLM360_K2-Chat-2_2bpw_exl2
```
ExLlamaV2 quant (exl2, 2.2 bpw), made with ExLlamaV2 v0.1.1.
Other EXL2 quants:
| Quant | Model Size | lm_head |
|---|---|---|
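An exl2 quant needs an ExLlamaV2-based loader (for example ExLlamaV2 itself, TabbyAPI, or text-generation-webui). A minimal sketch for fetching this quant's files locally with `huggingface_hub` (the `local_dir` path is just an example destination):

```python
# Download the quantized weights for use with an ExLlamaV2-based loader.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Zoyd/LLM360_K2-Chat-2_2bpw_exl2",
    local_dir="K2-Chat-2_2bpw_exl2",  # example path; any directory works
)
print(local_path)
```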
K2-Chat: a fully-reproducible large language model outperforming Llama 2 70B Chat using 35% less compute
K2-Chat is fine-tuned from K2-65B. It outperforms Llama 2-70B-Chat on all evaluations conducted, and it also outperforms Llama 3-70B-Instruct on coding tasks.

LLM360 Model Performance and Evaluation Collection
The LLM360 Performance and Evaluation Collection is a robust evaluation suite consisting of general and domain-specific evaluations that assess model knowledge and function.
The evaluations include standard best-practice benchmarks as well as medical, math, and coding tasks. More about the evaluations can be found here.

Datasets and Mix
| Subset | #Tokens | Avg. #Queries | Avg. Query Len | Avg. #Replies | Avg. Reply Len |
|---|---|---|---|---|---|
| MathInstruct | 66,639,699 | 1.00 | 81.53 | 1.00 | 172.78 |
| OpenHermes-2 | 404,820,694 | 1.01 | 152.38 | 1.01 | 249.12 |
| FLAN_3M | 2,346,961,387 | 1.00 | 727.49 | 1.00 | 54.83 |
| Stanford Encyclopedia of Philosophy | 786,928 | 1.00 | 219.09 | 1.00 | 166.28 |
| TinyStories | 1,448,898 | 1.00 | 260.82 | 1.00 | 207.47 |
| Safety & Alignment Data | 99,976,621 | 1.00 | 126.71 | 1.00 | 373.79 |
| Total | 2,920,634,227 | | | | |
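The total is simply the sum of the per-subset token counts, which is easy to sanity-check:

```python
# Sanity-check that the subset token counts sum to the reported total.
subsets = {
    "MathInstruct": 66_639_699,
    "OpenHermes-2": 404_820_694,
    "FLAN_3M": 2_346_961_387,
    "Stanford Encyclopedia of Philosophy": 786_928,
    "TinyStories": 1_448_898,
    "Safety & Alignment Data": 99_976_621,
}
assert sum(subsets.values()) == 2_920_634_227  # matches the "Total" row
```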
Loading K2-Chat
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Chat")
model = AutoModelForCausalLM.from_pretrained("LLM360/K2-Chat")

# K2-Chat uses <|beginofuser|> and <|beginofsystem|> as its chat delimiters.
prompt = '<|beginofuser|>what is the highest mountain on earth?<|beginofsystem|>'

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
gen_tokens = model.generate(input_ids, do_sample=True, max_new_tokens=128)

print("-" * 20 + "Output for model" + 20 * "-")
print(tokenizer.batch_decode(gen_tokens)[0])
```
Alternatively, you can construct the prompt by applying the tokenizer's chat template to the input conversation:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LLM360/K2-Chat")
model = AutoModelForCausalLM.from_pretrained("LLM360/K2-Chat")

# apply_chat_template inserts the special tokens shown above automatically.
messages = [{"role": "user", "content": "what is the highest mountain on earth?"}]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

gen_tokens = model.generate(input_ids, do_sample=True, max_new_tokens=128)

print("-" * 20 + "Output for model" + 20 * "-")
print(tokenizer.batch_decode(gen_tokens)[0])
```
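The chat template can also be applied to multi-turn conversations. A minimal sketch, reusing the `tokenizer` and `model` loaded above and assuming the template supports assistant turns (which is standard); the assistant reply below is a made-up placeholder to illustrate the message format, not real model output:

```python
# Hypothetical multi-turn conversation; the assistant content is a placeholder.
messages = [
    {"role": "user", "content": "what is the highest mountain on earth?"},
    {"role": "assistant", "content": "Mount Everest."},
    {"role": "user", "content": "how tall is it?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
gen_tokens = model.generate(input_ids, do_sample=True, max_new_tokens=128)
print(tokenizer.batch_decode(gen_tokens)[0])
```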
LLM360 Developer Suite
We provide step-by-step fine-tuning tutorials for tech enthusiasts, AI practitioners, and academic or industry researchers here.
About LLM360
LLM360 is an open research lab enabling community-owned AGI through open-source large model research and development.
By creating standards and tools that advance the bleeding edge of LLM capability, LLM360 empowers knowledge transfer, research, and development.
We believe in a future where artificial general intelligence (AGI) is created by the community, for the community. Through an open ecosystem of equitable computational resources, high quality data, and flowing technical knowledge, we can ensure ethical AGI development and universal access for all innovators.
Citation
BibTeX:
```bibtex
@article{llm360-k2-65b,
  title={LLM360 K2-65B: Scaling Up Fully Transparent Open-Source LLMs},
  author={The LLM360 Team},
  year={2024},
}
```