|
|
---
language:
- en
- fr
- de
- es
- it
- pt
- zh
- ja
- ru
- ko
license: other
license_name: mrl
base_model: mistralai/Pixtral-Large-Instruct-2411
inference: false
license_link: https://mistral.ai/licenses/MRL-0.1.md
library_name: transformers
pipeline_tag: image-text-to-text
---
|
|
|
|
|
# Pixtral-Large-Instruct-2411 🧡 |
|
|
|
|
|
Transformers implementation of [Pixtral-Large-Instruct-2411](https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411). |
|
|
|
|
|
***21 Dec 2024:** This model has been a LOT of fun to experiment and learn with. The model card below has been updated with the changes made to this repo over the last week.*
|
|
|
|
|
## Architecture Differences to Pixtral 12B |
|
|
Pixtral 12B has bias keys for the multi_modal_projector layers, whereas Pixtral Large does not. Rather than including them with low/zero values, this conversion omits those bias keys entirely, aligning with the keys present in the original Pixtral Large upload from Mistral. The model's config.json includes `"multimodal_projector_bias": false` to flag this. *n.b. If anyone in the community confirms that initializing these keys with zero values is the better approach, I'm happy to re-upload with them included.*
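As a quick sanity check, here's a minimal sketch for reading the flag (the repo id is an assumption, and your transformers version needs to surface the flag on the loaded config object):

```python
from transformers import AutoConfig

# Repo id assumed; point this at wherever the conversion is hosted.
config = AutoConfig.from_pretrained("nintwentydo/Pixtral-Large-Instruct-2411")

# The conversion sets this to False so the projector is built without bias
# terms, matching the keys in Mistral's original upload.
print(config.multimodal_projector_bias)
```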
|
|
|
|
|
## Tokenizer |
|
|
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Large use different tokenizers with different vocab sizes, so make sure you use the right tokenizer.
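A minimal sketch (repo id assumed) of loading the bundled tokenizer rather than reusing the Pixtral 12B one:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this conversion; the Pixtral 12B tokenizer
# has a different vocab size and won't line up with this model's embeddings.
tokenizer = AutoTokenizer.from_pretrained("nintwentydo/Pixtral-Large-Instruct-2411")
print(len(tokenizer))
```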
|
|
|
|
|
## Prompting / Chat Template |
|
|
The included chat_template.json supports all of Mistral's defined features with some of my own additions. |
|
|
|
|
|
I believe this implementation should give quite a lot of flexibility for using the model, and in my testing it has worked quite well.
|
|
|
|
|
Example *(line breaks added for readability)* |
|
|
```
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
```
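If you're driving this from transformers, here's a rough sketch (repo id assumed, and assuming the processor exposes the standard `apply_chat_template` API) for rendering the prompt string so you can compare it against the format above:

```python
from transformers import AutoProcessor

# Load the processor and the chat_template.json shipped with this repo.
processor = AutoProcessor.from_pretrained("nintwentydo/Pixtral-Large-Instruct-2411")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

# By default this returns the rendered prompt string, which should follow the
# [SYSTEM_PROMPT]/[INST]/[IMG] layout shown above.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```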
|
|
|
|
|
**System Prompts**: |
|
|
Messages with role "system" will be parsed as `[SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT]` anywhere they appear in chat history. |
|
|
|
|
|
This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from the conversation.
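As a rough illustration (contents made up), a mid-conversation instruction could be injected like this:

```python
# Hypothetical history with an extra "system" nudge injected mid-conversation;
# the template renders it in place as [SYSTEM_PROMPT]...[/SYSTEM_PROMPT].
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a concise assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": "Summarise the attached report."}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Here is a short summary..."}]},
    {"role": "system", "content": [{"type": "text", "text": "From here on, answer in French."}]},
    {"role": "user", "content": [{"type": "text", "text": "What are the key risks?"}]},
]
```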
|
|
|
|
|
**Allowing Non-Alternating Roles**: |
|
|
Multiple user messages in a row can be provided, and each will be wrapped in its own `[INST]...[/INST]` block. This could work well in group conversation settings, or in environments where multiple user messages arrive before the model is invoked. Having `[/INST]` break each one up appeared to help the model focus on the last message rather than trying to respond to every previous message, while still retaining knowledge of the messages that came before it.
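For example (names and content made up), a group-chat style history could be passed as consecutive user messages:

```python
# Hypothetical group-chat history with several user turns in a row; the chat
# template wraps each one in its own [INST]...[/INST] block.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Alice: what's written on the whiteboard?"}]},
    {"role": "user", "content": [{"type": "text", "text": "Bob: and can you tell who wrote it?"}]},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Carol: here's a photo of it."}]},
]
```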
|
|
|
|
|
**Image Inputs Everywhere**: |
|
|
Images can now be sent in user, assistant, and tool result messages, and this seems to actually work. For example, I included an image in an assistant reply 10-15 messages back in the conversation, asked the assistant to recall what image it had previously sent, and it was able to describe it accurately.
|
|
|
|
|
Having this flexibility could allow for interesting applications. For example, if you were to define a tool for image generation (a rough sketch of this flow follows the list below):
|
|
- the tool is invoked and calls an image generation API/model
- the image is returned inside a tool result message
- the model responds with a message informed by the generated image
- you can have further conversation about the generated image, or make revisions, with the model actually knowing what was created
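Here's what that message flow might look like (the tool name, ids, and content are all hypothetical, and the exact schema depends on how you feed messages to the chat template):

```python
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Draw a red fox in the snow."}]},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call00001",  # hypothetical call id
            "type": "function",
            "function": {"name": "generate_image", "arguments": {"prompt": "a red fox in the snow"}},
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call00001",
        "content": [
            {"type": "image"},  # the generated image goes back to the model here
            {"type": "text", "text": "Image generated."},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "Here's your fox in the snow. Want any changes?"}]},
    {"role": "user", "content": [{"type": "text", "text": "Make it look more playful."}]},
]
```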
|
|
|
|
|
## Usage |
|
|
When loading in transformers you'll probably want to add some handling to ensure the lack of mmproj bias is respected, otherwise vision input may not be handled properly.
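A minimal loading sketch (repo id assumed; whether any extra handling is needed depends on your transformers version supporting the `multimodal_projector_bias` flag):

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "nintwentydo/Pixtral-Large-Instruct-2411"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)

# On transformers versions that predate the multimodal_projector_bias flag you
# may need to patch the projector so it's built without bias terms, otherwise
# the checkpoint keys won't match.
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```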
|
|
|
|
|
Most of my testing has been done using TabbyAPI and ExLlamaV2 (dev branch), with vision input working.
|
|
<img src="https://huggingface.co/nintwentydo/Pixtral-Large-Instruct-2411/resolve/main/image-input-example.jpg"> |
|
|
|
|
|
|
|
|
## Quantizations |
|
|
EXL2 quants are available in different sizes [here](https://huggingface.co/models?author=nintwentydo&other=base_model:quantized:mistralai/Pixtral-Large-Instruct-2411). You'll need to use the dev branch of [ExLlamaV2](https://github.com/turboderp/exllamav2/tree/dev) for vision input.