# Pixtral-Large-Instruct-2411 🧡
Transformers implementation of [Pixtral-Large-Instruct-2411](https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411).

***21 Dec 2024:** This model has been a LOT of fun to experiment and learn with. Model card updated below with changes made to this repo over the last week.*

## Architecture Differences to Pixtral 12B
Pixtral 12B has bias keys for the multi_modal_projector layers, whereas Pixtral Large does not. Rather than including them with low/zero values, this conversion omits those bias keys, matching the keys present in the original Pixtral Large upload from Mistral. The model's config.json includes `"multimodal_projector_bias": false` to flag this. *n.b. If anyone in the community confirms that initializing these keys with zero values is the better approach, I'm happy to reupload with them included.*

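For illustration, the conversion-time key filtering can be sketched like this. This is a hypothetical sketch, not the actual conversion script, and the `multi_modal_projector.*` key names are illustrative (they follow the Pixtral 12B layout):

```python
# Hypothetical sketch: drop multi_modal_projector bias keys from a
# converted state dict so it matches the original Pixtral Large key set.

def drop_projector_bias(state_dict: dict) -> dict:
    """Return a copy of state_dict without multi_modal_projector bias keys."""
    return {
        name: tensor
        for name, tensor in state_dict.items()
        if not (name.startswith("multi_modal_projector.") and name.endswith(".bias"))
    }

# Stand-in key names in the Pixtral 12B style (values elided).
converted = {
    "multi_modal_projector.linear_1.weight": "...",
    "multi_modal_projector.linear_1.bias": "...",
    "multi_modal_projector.linear_2.weight": "...",
    "multi_modal_projector.linear_2.bias": "...",
    "language_model.embed_tokens.weight": "...",
}

cleaned = drop_projector_bias(converted)
# Only the projector bias keys are removed; everything else survives.
assert not any(k.endswith(".bias") for k in cleaned)
```
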
## Tokenizer
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Pixtral Large use different tokenizers with different vocab sizes, so make sure you use the right tokenizer.

## Prompting / Chat Template
The included chat_template.json supports all of Mistral's defined features, with some of my own additions.

I believe this implementation gives quite a lot of flexibility for using the model, and in my testing it has worked quite well.

Example *(line breaks added for readability)*:
```
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
```

**System Prompts**:
Messages with role "system" will be parsed as `[SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT]` anywhere they appear in the chat history.

This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from the conversation.

**Allowing Non-Alternating Roles**:
Multiple user messages in a row can be provided, and each will be separated with `[INST][/INST]`. This could work well in group conversation settings, or in environments where multiple user messages can be provided before the model is invoked. Having a `[/INST]` breaking each one up appeared to help the model focus on the last message instead of thinking it needs to respond to every previous message, while still retaining knowledge of the messages that sit before it.

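As a toy illustration of the two behaviours above (system prompts at any depth, each consecutive user message getting its own `[INST]...[/INST]`), here is a minimal renderer in the spirit of the template. It is a hypothetical sketch, not the actual Jinja logic shipped in chat_template.json:

```python
# Toy renderer for the template shape shown above. Illustrative only --
# the real template lives in chat_template.json.

def render(messages: list) -> str:
    out = "<s>"
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            # System messages keep their own tags wherever they appear.
            out += f"[SYSTEM_PROMPT] {content}[/SYSTEM_PROMPT]"
        elif role == "user":
            # Consecutive user messages each get their own [INST]...[/INST].
            out += f"[INST] {content}[/INST]"
        elif role == "assistant":
            out += f" {content}</s>"
    return out

chat = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
    {"role": "assistant", "content": "Yes."},
]
print(render(chat))
# <s>[SYSTEM_PROMPT] Be concise.[/SYSTEM_PROMPT][INST] Hi[/INST][INST] Are you there?[/INST] Yes.</s>
```
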
**Image Inputs Everywhere**:
Images can now be sent in user, assistant, and tool result messages, and this seems to actually work. In my tests I included an image in an assistant reply 10-15 messages back in the conversation, asked the assistant to recall what image it previously sent, and it was able to accurately describe it.

Having this flexibility could allow for interesting applications, for example if you were to define a tool definition for image generation:
- tool is invoked and calls image generation api/model
- image returned inside tool result message
- model responds with a message with context of the image generated
- you can have further conversation about the generated image, or make revisions with the model actually knowing what was created

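Expressed as a chat history, that flow might look roughly like this. The tool name, call id, and field names here are illustrative, not a fixed schema:

```python
# Hypothetical chat history for the image-generation tool flow above.
history = [
    {"role": "user", "content": "Draw me a cat."},
    # Model invokes the (made-up) image generation tool.
    {"role": "assistant", "tool_calls": [
        {"id": "call_0", "name": "generate_image",
         "arguments": {"prompt": "a cat"}},
    ]},
    # Generated image comes back inside the tool result message.
    {"role": "tool", "tool_call_id": "call_0",
     "content": [{"type": "image"}]},
    # Model responds with context of the image it just "saw".
    {"role": "assistant", "content": "Here's the cat! Want any changes?"},
]
assert [m["role"] for m in history] == ["user", "assistant", "tool", "assistant"]
```
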
## Usage
When loading in transformers you'll probably want to add some handling to ensure the lack of multi_modal_projector bias is respected, so that vision input is handled properly.
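That handling might look something like this: a hypothetical helper, not transformers internals, which just reads the `"multimodal_projector_bias"` flag described above before building the projector:

```python
import json

# Hypothetical helper: decide whether the projector should have bias,
# based on the flag this repo sets in config.json.
def projector_uses_bias(config_json: str) -> bool:
    config = json.loads(config_json)
    # Default to True, matching the Pixtral 12B layout that has bias keys.
    return config.get("multimodal_projector_bias", True)

cfg = '{"model_type": "llava", "multimodal_projector_bias": false}'
assert projector_uses_bias(cfg) is False
```
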

Most of my testing has been with TabbyAPI and ExLlamaV2 (dev branch), with working vision input.

<img src="https://huggingface.co/nintwentydo/Pixtral-Large-Instruct-2411/resolve/main/image-input-example.jpg">
## Quantizations