# 70b Distillation Experiment
This is not the full-fledged run that I plan to do for a large-scale distillation of Llama3 70b.
Instead, it's a preliminary test run of the custom distillation trainer, in which we minimize the KL divergence between the output distribution of the larger Llama3 70b teacher model and that of the 4x8b student.
I'm releasing it here mainly so that people who are interested can tinker with it / finetune it to see how it behaves before I am ready to do a larger run.
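As a rough illustration of the objective, here is a minimal sketch of a forward-KL distillation loss in PyTorch (function and variable names are my own for illustration, not taken from the actual trainer):

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) between next-token distributions.

    Both tensors are (..., vocab_size); the loss is averaged over tokens.
    """
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    t = F.log_softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # kl_div expects log-probs for the input; log_target=True means the
    # target is also given as log-probs
    return F.kl_div(s, t, log_target=True, reduction="batchmean")
```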
# Training details
Each MLP layer of the original Llama3 8b is duplicated 3x (four experts in total) in a typical Mixtral-style sparse MoE layout.
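One way to picture the construction, as a sketch rather than the actual conversion script (`MoEFromDense`, the sizes, and the gate handling are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

class MoEFromDense(nn.Module):
    """Sketch: build a 4-expert sparse MoE layer by cloning a dense MLP
    (the original weights plus three duplicates)."""

    def __init__(self, dense_mlp: nn.Module, hidden_size: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )
        # Mixtral-style gate / router over experts; kept frozen for this run
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.gate.weight.requires_grad_(False)
```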
Over the course of the training run, the expert selection count was gradually increased from the minimum (topk=1) to the maximum (topk=4), as in [Sparse MoE as the New Dropout](https://arxiv.org/abs/2303.01610). This was done with a stochastic / randomized top_k expert selection with **frozen gate layers**, as recommended in the paper.
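A sketch of what such a schedule and routing step could look like (the linear schedule and the noise term are my assumptions; the actual run may differ):

```python
import torch

def topk_schedule(step, total_steps, k_min=1, k_max=4):
    """Anneal the number of active experts from k_min up to k_max."""
    frac = min(step / max(total_steps - 1, 1), 1.0)
    return k_min + round(frac * (k_max - k_min))

def route(hidden, gate_weight, k, noise_std=0.0):
    """Pick top-k experts per token from a (frozen) gate's scores.

    hidden: (tokens, d); gate_weight: (num_experts, d).
    noise_std > 0 makes the selection stochastic.
    Returns mixture weights over all experts (zero for unselected ones).
    """
    scores = hidden @ gate_weight.T                 # (tokens, num_experts)
    if noise_std > 0:
        scores = scores + noise_std * torch.randn_like(scores)
    top_vals, top_ids = scores.topk(k, dim=-1)
    probs = torch.softmax(top_vals, dim=-1)         # renormalize over the k picked
    weights = torch.zeros_like(scores)
    weights.scatter_(-1, top_ids, probs)
    return weights
```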
LR = 2e-6, ~2.5M tokens of Python instruct data, with each sample around ~8k tokens (~300 samples total).
Despite the use of instruct data, the model does not necessarily behave like an instruct model, as the training process involves mimicking a larger base model's distributions over said data.
1 epoch of distillation on the 70b logprobs, using the topk=200 logits from the fp16 Llama3-70b.
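Storing only the teacher's top-200 logprobs keeps the targets small; the KL term can then be evaluated over just those entries. A sketch of this truncated-KL approximation (a common trick, not necessarily the exact trainer code):

```python
import torch
import torch.nn.functional as F

def topk_kl_loss(student_logits, teacher_topk_logprobs, teacher_topk_ids):
    """KL(teacher || student) restricted to the teacher's saved top-k entries.

    student_logits: (tokens, vocab)
    teacher_topk_logprobs, teacher_topk_ids: (tokens, k), e.g. k = 200
    """
    s_logprobs = F.log_softmax(student_logits, dim=-1)
    s_sel = s_logprobs.gather(-1, teacher_topk_ids)          # (tokens, k)
    t_probs = teacher_topk_logprobs.exp()
    # sum_i p_t(i) * (log p_t(i) - log p_s(i)) over the stored ids only
    return (t_probs * (teacher_topk_logprobs - s_sel)).sum(-1).mean()
```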
# Evals
## llama3-4x8b-pythonT2_step_final

(Parenthesized values are the reference scores; the trailing multiplier is this model's score divided by the reference.)
* mmlu: 65.10 (66.69) - 0.97x
* arc: 57.94 (59.47) - 0.97x
* hellaswag: 81.93 (82.09) - 0.99x
* winogrande: 77.03 (77.35) - 0.99x
* gsm8k: 50.95 (45.79) - 1.11x
* truthfulqa-mc1: 27.66
* truthfulqa-mc2: 44.53 (43.9) - 1.01x
* humaneval+: 32.9 (29.3) - 1.12x
* humaneval: 37.2 (33.5) - 1.11x
# Current Conclusions
Going by evals (and evals alone), full finetuning seems to have caused some degree of mild catastrophic forgetting outside of the domains that were specifically distilled, as you might expect given the lack of data. I plan to remedy this with lower LRs and/or bigger batch sizes, and of course a much larger dataset than the limited selection seen here.
The plan is to use at least 1 billion unique tokens; we are still running custom tests of alternative loss functions (e.g., a weighted cross-entropy loss used in tandem with KL divergence).
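As one hypothetical shape such a combination could take (the blend weight and the per-token weighting here are illustrative, not the final recipe):

```python
import torch
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels,
                  ce_weight=0.5, token_weights=None):
    """Blend an (optionally per-token weighted) cross-entropy on the hard
    labels with KL divergence from the teacher's distribution.

    student_logits, teacher_logits: (tokens, vocab); labels: (tokens,)
    """
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    if token_weights is not None:
        ce = ce * token_weights
    ce = ce.mean()
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.log_softmax(teacher_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce_weight * ce + (1.0 - ce_weight) * kl
```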