This is a version of Joycaption Beta One converted to MLX and quantized to mxfp4; please refer to the original model card for more details.

It is intended for use with this fork of mlx-vlm.

Since this particular flavor of Llava uses Siglip2 and Llama3 (the RoPE implementation in mlx_vlm is not compatible, and Siglip2 is not implemented for Llava), I had to create a custom model type and change config.json so the model is detected correctly (model_type renamed from llava to llava_joycaption).
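
For illustration, that detection change boils down to a single field in config.json. The sketch below shows the rename; it is already applied in this repository, so there is nothing to patch yourself.

import json

# Illustrative only: the shipped config.json already carries the
# renamed model_type so the fork of mlx_vlm picks the custom class.
with open("config.json") as f:
    config = json.load(f)

config["model_type"] = "llava_joycaption"  # stock Llava uses "llava"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)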

I have also altered the chat_template.json to fit the parser used by mlx_vlm.

Please note that installing torchvision alongside mlx_vlm will result in an error when interpolation is executed (lanczos is not implemented in torch.nn.functional.interpolate).
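
A simple way to guard against this is to check that torchvision is absent before loading the model. This check is a suggestion, not part of the original setup.

import importlib.util

# torchvision must not be installed alongside mlx_vlm here: its
# presence triggers a resize path through
# torch.nn.functional.interpolate, which fails because lanczos is
# not implemented there.
assert importlib.util.find_spec("torchvision") is None, (
    "torchvision is installed; uninstall it to avoid the lanczos error"
)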

Sample code for testing is given below.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "n-Arno/joycaption-mlx-mxfp4"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# PIL.Image.Image objects can be used instead of URLs, e.g. image = [Image.open("...")]
prompt = "Write a long descriptive caption for this image in a formal tone."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
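
As the comment above hints, local files can be passed as PIL images instead of URLs. A minimal variant, assuming a hypothetical local file photo.jpg:

from PIL import Image
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "n-Arno/joycaption-mlx-mxfp4"
model, processor = load(model_path)
config = load_config(model_path)

# Pass PIL.Image.Image objects in place of URLs
image = [Image.open("photo.jpg")]  # hypothetical local file
prompt = "Write a long descriptive caption for this image in a formal tone."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)
print(generate(model, processor, formatted_prompt, image, verbose=False))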

The specific fork needed can be installed using this line in requirements.txt:

mlx-vlm @ git+https://github.com/nArn0/mlx-vlm@main
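
Or, equivalently, directly with pip:

pip install "mlx-vlm @ git+https://github.com/nArn0/mlx-vlm@main"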

(I'll try to propose a PR, but since the code is a bit rough, I don't know whether it will be accepted.)
