Instructions for using rhymes-ai/Aria with libraries and local apps.

Libraries

How to use rhymes-ai/Aria with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="rhymes-ai/Aria")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")
model = AutoModelForMultimodalLM.from_pretrained("rhymes-ai/Aria")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use rhymes-ai/Aria with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "rhymes-ai/Aria"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rhymes-ai/Aria",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
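
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is reachable at http://localhost:8000 (or port 30000 for the SGLang server). The helper names `build_chat_request` and `ask` are hypothetical and not part of vLLM or SGLang; the payload mirrors the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(request, timeout=120):
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server, so it is left commented out):
# request = build_chat_request(
#     "http://localhost:8000",
#     "rhymes-ai/Aria",
#     "Describe this image in one sentence.",
#     "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
# )
# print(ask(request))
```

Building the request separately from sending it makes it easy to inspect the JSON body before any network call is made.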

SGLang

How to use rhymes-ai/Aria with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "rhymes-ai/Aria" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "rhymes-ai/Aria",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "rhymes-ai/Aria" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl command as in the pip-based SGLang setup above.

Docker Model Runner
How to use rhymes-ai/Aria with Docker Model Runner:
```
docker model run hf.co/rhymes-ai/Aria
```

llama.cpp support

by ayyylol - opened Oct 10, 2024

Discussion

ayyylol

Oct 10, 2024

Hey Rhymes team, I've got a dream,
To use llama.cpp and make this model beam!

I've searched far and wide, through libraries so fine,
But none compare to llama.cpp, it's truly like fine wine.

I know you're busy, with AI on your mind,
But I hope you'll consider, this humble request of mine.

I'd love for you to integrate, llama.cpp with a delight,
And make my prompting life, a joyous sight.

So please, dear Rhymes, don't be slow,
Create a llama.cpp pull request, and let our prompts glow!

I'll be grateful, and shout it from the roof,
If you'll just make our llama.cpp dream, an inference truth!

nonetrix

Oct 10, 2024

llama.cpp seems to be slow at implementing multimodal models these days; support might never come :/

snapo

Oct 10, 2024

Are there any good alternatives to ollama that don't use llama.cpp? I agree that implementations, especially multimodal ones, are super slow/delayed. I would switch immediately if there were a drop-in replacement for ollama inference, because I host everything locally.

ArthurZ

Oct 10, 2024

You can use transformers and compile the model + quantize it. We'll come up with cool pre-config for all that soon!

SoulFireMage

Oct 10, 2024

Another vote for llama.cpp please. I wouldn't have the foggiest how to compile and run it myself 🤣

bash99

Oct 10, 2024

> You can use transformers and compile the model + quantize it. We'll come up with cool pre-config for all that soon!

Hope there will be some GPTQ quantization for us GPU-poor folks with old GPUs (sm 7.0 or sm 7.5).

Ryozu

Oct 12, 2024

Would love to see a 4-bit BNB quantization of this model, if that's even doable.

goodasdgood

Oct 16, 2024

Where is llama.cpp support for Aria?

maazel

Rhymes.AI org Oct 18, 2024

> Where is llama.cpp support for Aria?

llama.cpp will probably need more time, since multimodal implementations are really different.
We're also exploring possible solutions for Aria, stay tuned.

aria-dev

Oct 18, 2024

For everyone who wants to get their hands on quantizing Aria: we've uploaded a fork of the Aria model that replaces the grouped GEMM with a sequential MLP, in which each expert is implemented as a torch.nn.Linear layer and the experts are executed in sequence. This adjustment simplifies quantization with current open-source libraries, which are optimized for nn.Linear layers.

If you want to quantize an Aria model, please consider using rhymes-ai/Aria-sequential_mlp.
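
To illustrate the idea, here is a minimal PyTorch sketch of a mixture-of-experts block in which every expert is a plain nn.Linear run in a loop. It is not the code from the fork: the class name `SequentialExperts`, the helper `stacked_reference`, the top-k softmax routing, and the single-Linear experts are all simplifying assumptions made for this example. The reference function computes the same result from one stacked weight tensor, as a dense stand-in for what a grouped GEMM kernel would do.

```python
import torch
from torch import nn


class SequentialExperts(nn.Module):
    """Toy MoE block (hypothetical): one nn.Linear per expert, run one after another."""

    def __init__(self, hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden, num_experts, bias=False)
        # Each expert is an ordinary nn.Linear, so tools that scan for
        # nn.Linear modules can find (and replace) every expert individually.
        self.experts = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(num_experts)]
        )

    def route(self, x):
        # Top-k routing weights and expert indices, both shaped (tokens, top_k).
        return self.router(x).softmax(dim=-1).topk(self.top_k, dim=-1)

    def forward(self, x):
        # x has shape (tokens, hidden).
        weights, chosen = self.route(x)
        out = torch.zeros_like(x)
        for idx, expert in enumerate(self.experts):
            token_ids, slot = (chosen == idx).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no token was routed to this expert
            scale = weights[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, scale * expert(x[token_ids]))
        return out


def stacked_reference(layer, x):
    """Same maths with all expert weights in one tensor (dense stand-in for grouped GEMM)."""
    weights, chosen = layer.route(x)
    stacked = torch.stack([e.weight for e in layer.experts])  # (experts, out, in)
    per_expert = torch.einsum("th,eoh->teo", x, stacked)      # (tokens, experts, out)
    index = chosen.unsqueeze(-1).expand(-1, -1, x.shape[-1])
    picked = per_expert.gather(1, index)                      # (tokens, top_k, out)
    return (weights.unsqueeze(-1) * picked).sum(dim=1)
```

Iterating over `layer.named_modules()` and filtering on `isinstance(module, nn.Linear)` returns the router and every expert separately, which is the kind of per-layer discovery that nn.Linear-oriented quantization libraries typically rely on. A single stacked weight tensor is invisible to such a scan.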
