Update README.md
README.md
CHANGED
@@ -1,2 +1,34 @@
# 70b Distillation Experiment
This is not the full-fledged run that I plan to do for a large-scale distillation of Llama3 70b.
Instead, it's a preliminary test run of the custom distillation trainer, which minimizes the KL divergence between the larger Llama3 70b teacher model and the 4x8b student.
I'm releasing it here mainly so that people who are interested can tinker with it or finetune it to see how it behaves before I'm ready to do a larger run.

# Training details
Each MLP layer of the original Llama3 8b is duplicated 3x, so every layer holds four initially identical experts in a typical Mixtral-style sparse MoE layout.
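
A minimal sketch of this kind of upcycling, assuming "duplicated 3x" means the original MLP plus three copies (four experts in total). The toy `dense_mlp`, the `upcycle` helper, and the example weights are hypothetical stand-ins, not the code used for this model. It shows that with identical experts and Mixtral-style routing weights that sum to 1, the MoE output matches the dense MLP at initialization:

```python
import copy
import math

def dense_mlp(weights, x):
    """Toy stand-in for an MLP block: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def upcycle(weights, num_experts=4):
    """Copy the dense MLP weights into `num_experts` identical experts."""
    return [copy.deepcopy(weights) for _ in range(num_experts)]

def moe_forward(experts, gate_logits, x, top_k=2):
    """Mixtral-style routing: keep the top_k experts, softmax over their logits."""
    chosen = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)[:top_k]
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    out = [0.0] * len(experts[0])
    for i, e in zip(chosen, exps):
        y = dense_mlp(experts[i], x)
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out

# Hypothetical toy weights and input, chosen only for illustration.
dense = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
experts = upcycle(dense)
x = [1.0, 0.5, 2.0]

dense_out = dense_mlp(dense, x)
moe_out = moe_forward(experts, gate_logits=[0.1, -0.3, 0.7, 0.2], x=x, top_k=2)
print(dense_out, moe_out)
```

The experts only start to differ once training updates them independently.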

Over the course of the training run, the expert selection count was gradually increased from the minimum (topk=1) to the maximum (topk=4), as in [Sparse MoE as the New Dropout](https://arxiv.org/abs/2303.01610). This was done with stochastic (randomized) top_k expert selection and **frozen gate layers**, as recommended in the paper.
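
The README does not spell out the schedule or the sampling rule, so the following is only a sketch under assumptions: a linear ramp of the active-expert count (`scheduled_top_k`, hypothetical) and sampling of distinct experts in proportion to the frozen gate's probabilities (`sample_experts`, hypothetical). The gate itself is never updated here, which mirrors the frozen gate layers described above.

```python
import random

def scheduled_top_k(step, total_steps, k_min=1, k_max=4):
    """Ramp the active-expert count from k_min to k_max in equal-width stages.

    Assumption: the actual run may have used a different schedule.
    """
    if total_steps <= 1:
        return k_max
    frac = step / (total_steps - 1)
    return min(k_max, k_min + int(frac * (k_max - k_min + 1)))

def sample_experts(gate_probs, k, rng):
    """Draw k distinct experts, weighted by the frozen gate's probabilities.

    Assumption: one possible reading of "stochastic top_k selection".
    """
    remaining = list(range(len(gate_probs)))
    chosen = []
    for _ in range(k):
        weights = [gate_probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen

rng = random.Random(0)
gate_probs = [0.4, 0.3, 0.2, 0.1]  # hypothetical output of a frozen gate
total_steps = 300
for step in (0, 100, 200, 299):
    k = scheduled_top_k(step, total_steps)
    print(step, k, sample_experts(gate_probs, k, rng))
```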

LR = 2e-6, ~2.5 million tokens of Python instruct data, with each sample around 8k tokens long (~300 samples in total).
Despite the use of instruct data, the model does not necessarily behave like an instruct model, as the training process involves mimicking a larger base model's distributions over said data.

1 epoch of distillation on the 70b logprobs, using the topk=200 logits from the fp16 Llama3-70b.
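
As an illustration of distilling from truncated teacher logits, here is a minimal per-token sketch. It assumes the teacher distribution is renormalized over its top-k entries and compared against the student's full-vocabulary distribution; the custom trainer may handle the truncation differently. The function names and toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]

def topk_kl(teacher_logits, student_logits, k=200):
    """KL(teacher_topk || student) for a single token position."""
    k = min(k, len(teacher_logits))
    top = sorted(range(len(teacher_logits)),
                 key=lambda i: teacher_logits[i], reverse=True)[:k]
    # Assumption: renormalize the teacher over its top-k entries only.
    teacher_logp = log_softmax([teacher_logits[i] for i in top])
    student_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - student_logp[i])
               for t, i in zip(teacher_logp, top))

# Toy 4-token vocabulary; a real run would use the full vocabulary with k=200.
teacher = [2.0, 1.0, 0.1, -1.0]
student = [1.5, 1.2, 0.3, -0.5]
print(topk_kl(teacher, student, k=2))
```

The loss is zero only when the student puts all of its probability mass on the teacher's top-k tokens in the same proportions.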

# Evals

## llama3-4x8b-pythonT2_step_final

* mmlu: 65.10 (66.69) - 0.97x
* arc: 57.94 (59.47) - 0.97x
* hellaswag: 81.93 (82.09) - 0.99x
* winogrande: 77.03 (77.35) - 0.99x
* gsm8k: 50.95 (45.79) - 1.11x
* truthfulqa-mc1: 27.66
* truthfulqa-mc2: 44.53 (43.9) - 1.01x
* humaneval+: 32.9 (29.3) - 1.12x
* humaneval: 37.2 (33.5) - 1.11x
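
The multipliers in the list above are consistent with dividing each score by the value in parentheses and truncating to two decimals. The README does not say which model the parenthesized values come from, so the sketch below treats them simply as reference scores; `ratio_label` is a hypothetical helper.

```python
import math

# (score, reference) pairs copied from the list above.
scores = {
    "mmlu": (65.10, 66.69),
    "arc": (57.94, 59.47),
    "hellaswag": (81.93, 82.09),
    "winogrande": (77.03, 77.35),
    "gsm8k": (50.95, 45.79),
    "truthfulqa-mc2": (44.53, 43.9),
    "humaneval+": (32.9, 29.3),
    "humaneval": (37.2, 33.5),
}

def ratio_label(score, reference):
    """Score divided by reference, truncated (not rounded) to two decimals."""
    return f"{math.floor(score / reference * 100) / 100:.2f}x"

labels = {name: ratio_label(*pair) for name, pair in scores.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Rounding instead of truncating would give 0.98x for mmlu and 1.00x for hellaswag and winogrande.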

# Current Conclusions
Going by evals (and evals alone), full fine-tuning seems to have caused mild catastrophic forgetting outside of the domains that were specifically distilled, as you might expect given how little data was used. I plan to remedy this with lower LRs and/or bigger batch sizes and, of course, by training on a much larger dataset than the limited selection seen here.
The plan is to do at least 1 billion unique tokens; we are still conducting custom tests of alternative loss functions (e.g., things in the vein of a weighted Cross-Entropy loss function to be used in tandem with KL divergence).
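
One possible reading of "a weighted Cross-Entropy loss in tandem with KL divergence" is a simple convex mix of the two terms. The sketch below is only that reading, not the loss under test; `combined_loss`, `ce_weight`, and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]

def combined_loss(teacher_logits, student_logits, target_index, ce_weight=0.25):
    """ce_weight * CE(student, target) + (1 - ce_weight) * KL(teacher || student)."""
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp))
    ce = -student_logp[target_index]
    return ce_weight * ce + (1.0 - ce_weight) * kl

teacher = [2.0, 1.0, 0.1, -1.0]
student = [1.5, 1.2, 0.3, -0.5]
print(combined_loss(teacher, student, target_index=0))
```

Setting `ce_weight=0.0` recovers pure KL distillation, and `ce_weight=1.0` recovers ordinary next-token cross-entropy.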

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6491e00e057b0928b3e07b75/DzH_kOY7oSNd8C3-nlzYe.png)