source code and paper?

by josephykwang - opened Jan 8, 2024

Discussion

josephykwang

Jan 8, 2024

Is there any writeup about the following?

How did you decide on these two models?
What merge technique did you use?

cloudyu

Owner Jan 8, 2024

I am a loyal Kaggle player.
The most important thing I learned from Kaggle is that model ensembling or stacking is all you need.
I believe this also applies to transformers.
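
For readers unfamiliar with the term, one generic form of ensembling for language models is to average the next-token probability distributions of several models. The sketch below is only an illustration of that idea, not the method used for this model (the thread does not say what was used). The toy logits and the function names are hypothetical, and it assumes both models share the same vocabulary.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_next_token_probs(logits_per_model, weights=None):
    """Weighted average of each model's next-token distribution.

    Assumes every model scores the same vocabulary in the same order.
    """
    n_models = len(logits_per_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models  # plain averaging
    probs = [softmax(logits) for logits in logits_per_model]
    vocab_size = len(probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probs))
        for i in range(vocab_size)
    ]


# Toy logits over a 3-token vocabulary from two hypothetical models.
model_a_logits = [2.0, 0.5, 0.1]
model_b_logits = [0.2, 1.5, 0.3]
avg_probs = ensemble_next_token_probs([model_a_logits, model_b_logits])
print(avg_probs)
```

Every model runs on every token here, so the compute cost is the sum of all members.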

rongzhou

Jan 9, 2024

How do you ensemble two LLMs? I searched the Internet and found a tool called LLM-Blender. Did you use it?

cloudyu

Owner Jan 9, 2024

maybe this one?

Chai Research presents Blending Is All You Need

Cheaper, Better Alternative to Trillion-Parameters LLM

In conversational AI research, there's a noticeable trend towards developing models with a larger number of parameters, exemplified by models like ChatGPT. While these expansive models tend to generate increasingly better chat responses, they demand significant computational resources and memory. This study explores a pertinent question: Can a combination of smaller models collaboratively achieve comparable or enhanced performance relative to a singular large model? We introduce an approach termed "blending", a straightforward yet effective method of integrating multiple chat AIs. Our empirical evidence suggests that when specific smaller models are synergistically blended, they can potentially outperform or match the capabilities of much larger counterparts. For instance, integrating just three models of moderate size (6B/13B parameters) can rival or even surpass the performance metrics of a substantially larger model like ChatGPT (175B+ parameters). This hypothesis is rigorously tested using A/B testing methodologies with a large user base on the Chai research platform over a span of thirty days. The findings underscore the potential of the "blending" strategy as a viable approach for enhancing chat AI efficacy without a corresponding surge in computational demands.

cloudyu

Owner Jan 9, 2024

https://arxiv.org/abs/2401.04088

josephykwang

Jan 9, 2024

https://arxiv.org/abs/2401.04088 is a sparse MoE. In their https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1, there are 8 experts.

In your model card,

you are using two models
it is not clear whether you are simply "merging" two dense models' outputs
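
To make the distinction in this post concrete, here is a minimal sketch of token-level top-k routing as used in a sparse MoE layer, next to a plain average of dense outputs. It is an illustration under assumptions, not this model's implementation: the experts are toy functions, the router logits are passed in directly (in a real layer a learned linear router computes them from the token), and all names are hypothetical.

```python
import math


def top_k_gate(router_logits, k=2):
    """Keep the k highest-scoring experts and softmax over just those."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(router_logits, k)
    out = [0.0] * len(x)
    for idx, weight in gates.items():
        y = experts[idx](x)  # unselected experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def dense_average(x, experts):
    """Baseline: every expert runs and the outputs are averaged equally."""
    outs = [expert(x) for expert in experts]
    return [sum(col) / len(outs) for col in zip(*outs)]


# Four toy "experts" that just scale the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
x = [1.0, -1.0]
router_logits = [0.1, 2.0, -1.0, 2.0]  # assumed output of a learned router

gates = top_k_gate(router_logits, k=2)
sparse_out = sparse_moe_layer(x, experts, router_logits, k=2)
dense_out = dense_average(x, experts)
print(gates, sparse_out, dense_out)
```

In this sketch, setting k equal to the number of experts makes every expert run on every token, so the only difference from plain averaging is the per-token gate weights. Whether that describes this model is not stated in the thread.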

josephykwang

Jan 9, 2024

maybe this one?

Chai Research presents Blending Is All You Need

Cheaper, Better Alternative to Trillion-Parameters LLM

See: "This means that the different chat AIs are able to implicitly influence the output of the current response. As a result, the current response is a blending of individual chat AI strengths, as they collaborate to create an overall more engaging conversation."

I don't think this is the same as the MoE approach.
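
The quoted passage describes response-level blending: as I understand the paper, one chat model is picked at random for each turn and conditions on the whole shared conversation, including replies written by the other models. A minimal sketch of that idea follows. The two stub models and all names are hypothetical stand-ins for real chat models, and uniform sampling is assumed.

```python
import random


def blended_reply(history, chat_models, rng):
    """Pick one chat model uniformly at random to answer this turn.

    The chosen model sees the full shared history, so earlier replies
    from other models implicitly shape what it says next.
    """
    model = rng.choice(chat_models)
    return model(history)


# Hypothetical stand-ins for two separate chat models.
def model_a(history):
    return f"A: reply after {len(history)} messages"


def model_b(history):
    return f"B: reply after {len(history)} messages"


rng = random.Random(0)  # seeded only to make the demo repeatable
history = []
for user_msg in ["hi", "tell me a story", "go on"]:
    history.append(("user", user_msg))
    reply = blended_reply(history, [model_a, model_b], rng)
    history.append(("assistant", reply))

for role, text in history:
    print(role, "|", text)
```

Only one whole model runs per response and there is no learned router, which is why this differs from MoE routing, where the choice is made per token inside the network.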

Minami-su

Jan 11, 2024

https://arxiv.org/pdf/2312.15166.pdf surely

lucasjin

Jan 11, 2024

Have you trained the model? These chat models even have different templates.
