Instructions to use wolfram/miqu-1-103b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use wolfram/miqu-1-103b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="wolfram/miqu-1-103b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("wolfram/miqu-1-103b")
model = AutoModelForCausalLM.from_pretrained("wolfram/miqu-1-103b", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use wolfram/miqu-1-103b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "wolfram/miqu-1-103b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wolfram/miqu-1-103b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
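
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the text of the first returned choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server, so it is left commented out):
# req = build_chat_request(
#     "http://localhost:8000", "wolfram/miqu-1-103b", "What is the capital of France?"
# )
# print(send_chat_request(req))
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.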

Use Docker

docker model run hf.co/wolfram/miqu-1-103b

SGLang

How to use wolfram/miqu-1-103b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "wolfram/miqu-1-103b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "wolfram/miqu-1-103b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "wolfram/miqu-1-103b" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use wolfram/miqu-1-103b with Docker Model Runner:
```
docker model run hf.co/wolfram/miqu-1-103b
```

Can't wait to test

by froggeric - opened Feb 27, 2024

Discussion

froggeric

Feb 27, 2024

I am very excited to test this model. I just finished testing my iMatrix Q4_K_S quantisation of your miqu-1-120b, and it is head and shoulders above the original miqu-1-70b. Here is the comparison (higher score = better):

wolfram

Owner Feb 27, 2024

Thanks for sharing your test results! That looks great. Would love to see how my other models rank in your tests.

froggeric

Mar 3, 2024 • edited Mar 3, 2024

I just finished testing it at q4_km (imatrix). Here is the update with other miqu-based models, including yours:

What I have noticed, compared with your 120b version, is that the 103b version has a bit more difficulty following instructions (though it is still very good at it). However, in general it gives more detailed replies. I see 2 big advantages with the 103b version:

- being smaller, it makes it possible to run a larger context
- size for size, it can be used 1 quant higher than the 120b, which should give even better results

I am just starting another round of tests with the q5_ks imatrix version :)
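
As a rough illustration of the "size for size" point, the weight footprint of a quantised model can be estimated as parameter count times average bits per weight. The sketch below is only an approximation: the bits-per-weight figures are assumed ballpark values (real GGUF files vary by model and imatrix), and the nominal 103 and 120 billion parameter counts are taken from the model names.

```python
def weights_size_gb(n_params_billion, bits_per_weight):
    """Approximate size of quantised weights in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# Assumed average bits per weight, for illustration only.
ASSUMED_BPW = {"q4_k_m": 4.85, "q5_k_s": 5.5}

size_120b_q4 = weights_size_gb(120, ASSUMED_BPW["q4_k_m"])
size_103b_q4 = weights_size_gb(103, ASSUMED_BPW["q4_k_m"])
size_103b_q5 = weights_size_gb(103, ASSUMED_BPW["q5_k_s"])

print(f"120b @ q4_k_m: {size_120b_q4:.1f} GB")
print(f"103b @ q4_k_m: {size_103b_q4:.1f} GB")
print(f"103b @ q5_k_s: {size_103b_q5:.1f} GB")
```

Under these assumptions the 103b model at q5_k_s comes out slightly smaller than the 120b model at q4_k_m, and at the same quant level the 103b leaves roughly 10 GB more for context.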

froggeric

Mar 3, 2024

Finished testing the q5_ks (imatrix) version:

Slight improvements over q4_km, but as it uses more memory, it reduces what is available for context. Still, with 96GB I can use a context larger than 16k.
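
To see how context length trades off against memory, the key/value cache can be estimated from the model's shape. This is a back-of-the-envelope sketch, not a measurement: the layer count, number of KV heads, and head dimension below are assumptions that are not stated in this thread (check the model's `config.json`), and it assumes an unquantised 16-bit cache.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size in GiB for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 2**30


# Assumed architecture values, for illustration only.
ASSUMED_SHAPE = {"n_layers": 120, "n_kv_heads": 8, "head_dim": 128}

for context_len in (16384, 32768):
    size = kv_cache_gib(context_len=context_len, **ASSUMED_SHAPE)
    print(f"context {context_len}: {size:.1f} GiB")
```

The cache grows linearly with context length, so every gigabyte taken by a larger quant directly reduces the context that fits in the remaining memory.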

froggeric

Mar 4, 2024 • edited Mar 4, 2024

I have revised my scores for the 103b q5_ks version. I had the feeling I had been slightly biased. And indeed, after reviewing the answers it gave, I found I had overlooked some glaring logical problems in favour of the writing quality. Here are the corrected scores:

Even though the total scores are the same, my favourite is miqu-1-120b. miqu-1-103b clearly has more problems following instructions, and steering it in the right direction is hard work. miquliz-120b is not as good as miqu-1-120b for storytelling, and I would say it has a worrying tendency to get dumber as a large context fills up; however, for a short-to-medium smart assistant role, it actually scores better than miqu-1-120b.

I think the most potential for getting the best large model with what is available now lies in self-merges of miqu, followed by a finetuning like Westlake to restore some of the information lost. I don't think we have yet discovered what the best self-merge pattern is. I have some thoughts about it, which I have detailed in this discussion: https://huggingface.co/llmixer/BigWeave-v16-103b/discussions/2

wolfram

Owner Mar 4, 2024

Thanks a lot for the in-depth testing and well-written reviews! And also for sharing your thoughts on how self-merging could be further improved.

I'd love to see "Repeat layers to create FrankenModels" (Pull Request #275 by dnhkng, turboderp/exllamav2) finally gaining traction. I think there's enough evidence by now that self-merging actually improves performance, so doing it on the fly would let us iterate and get even better results much faster.
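
The idea of repeating layers on the fly can be sketched as follows: instead of writing a merged checkpoint to disk, the base model's layers are visited in an order built from overlapping slices. This is a toy illustration only, not the implementation from the pull request. The function names, the slice length, and the stride are hypothetical, and the "layers" are plain Python callables.

```python
def repeated_layer_indices(n_layers, slice_len, stride):
    """Return a layer visiting order built from overlapping slices."""
    if slice_len > n_layers or stride <= 0:
        raise ValueError("need 0 < stride and slice_len <= n_layers")
    indices = []
    for start in range(0, n_layers - slice_len + 1, stride):
        indices.extend(range(start, start + slice_len))
    return indices


def run_layers(layers, order, x):
    """Apply the layers in the given order, reusing the same layer objects."""
    for index in order:
        x = layers[index](x)
    return x


# Toy model: 8 "layers", each adding its own index to the input.
toy_layers = [lambda x, i=i: x + i for i in range(8)]

order = repeated_layer_indices(n_layers=8, slice_len=4, stride=2)
print(order)  # [0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7]
print(run_layers(toy_layers, order, 0))
```

Because the order is just a list of indices, trying a different self-merge pattern only means changing `slice_len` and `stride` rather than producing and quantising a new set of weights.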
