ShayanCyan committed
Commit 7ad8847 · verified · 1 Parent(s): b6a87ce

Update README.md

Files changed (1)
  1. README.md +80 -182
README.md CHANGED
@@ -3,235 +3,133 @@ license: other
  license_name: phi4-model-license
  license_link: https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/LICENSE
  language:
- - en
- - ur
- - de
- - es
- - tr
- - fr
- - it
  base_model:
- - microsoft/Phi-4-multimodal-instruct
  tags:
- - phi
- - phi4-multimodal
- - quantized
- - visual-question-answering
- - speech-translation
- - speech-summarization
- - audio
- - vision
- - gguf
  library_name: other

  ---

  # Phi-4 Multimodal – Quantized GGUF + Omni Projector

- This repository provides **GGUF-converted weights** for running
- `microsoft/Phi-4-multimodal-instruct` locally using `llama.cpp`.

- It includes:

- - A **quantized language model (LLM)**
- - A separate **multimodal projector (mmproj)** containing vision (+ optional audio) encoders
-
- No additional training was performed.
- This is a **pure format + quantization conversion** of the original Microsoft model.

  ---

- # Files

- You should have:

- - `phi4-mm-Q4_K_M.gguf`
- → Quantized multimodal **language model** (Q4_K_M)

- - `phi4-mm-omni.gguf`
- **Multimodal projector (mmproj)**
- Contains:
- - Vision encoder (image understanding)
- - Optional audio Conformer encoder (if enabled in your runtime)

- You need:
- - One LLM GGUF
- - One mmproj GGUF

  ---

- # How the MMProj Was Exported
-
- To export the vision + (optionally) audio encoder into a separate GGUF:

  ```bash
  python convert_hf_to_gguf.py \
  /path/to/phi-4-multimodal \
- --mmproj \
  --outtype f16 \
- --outfile phi4-mm-omni.gguf
- ```
-
- ### What This Does
-
- A custom `MmprojModel` / Phi-specific path in `convert_hf_to_gguf.py`:
-
- - Reads tensors from:
- ```
- model.embed_tokens_extend.image_embed.*
- ```
- and maps them to **CLIP-style names** expected by `llama.cpp`.

- - Optionally maps audio tensors from:
- ```
- model.embed_tokens_extend.audio_embed.*
- ```
- to the **Conformer layout** expected by the runtime.

- - Writes a valid GGUF file usable as:
- ```
- --mmproj
- ```
- or
- ```
- -mm
- ```
- in `llama.cpp`.
-
- ⚠️ No training occurs.
- This is strictly format conversion + optional quantization.
-
- ---

- # Building llama.cpp
-
- Make sure you have built your `llama.cpp` fork:

- ```bash
- cmake -B build
- cmake --build build --config Release
- ```

  ---

- # Running the Model

- Assuming:

- - `phi4-mm-Q4_K_M.gguf` → LLM
- - `phi4-mm-omni.gguf` → mmproj
-
- ---
-
- ## Server Mode (Recommended)
-
- ```bash
  ./build/bin/llama-server \
  -m /path/to/phi4-mm-Q4_K_M.gguf \
  -mm /path/to/phi4-mm-omni.gguf \
  --host 0.0.0.0 \
  --port 8080
- ```
-
- ### The Server:

- - Exposes an **OpenAI-style HTTP API**
- - Supports **multimodal prompts**
- - Enables text + image (+ audio if runtime supports it)

- ### Typical Usage
-
- - Send text prompts to:
- ```
- /v1/chat/completions
- ```
-
- - For vision:
- - Use `image_url` parts
- - Or use MTMD markers described in:
- ```
- llama.cpp/tools/server/README.md
- ```
-
- - For audio (if supported in your build):
- - Send audio content following the multimodal documentation of your runtime
-
- ---
-
- ## CLI Mode
-
- ```bash
  ./build/bin/llama-cli \
  -m /path/to/phi4-mm-Q4_K_M.gguf \
  -mm /path/to/phi4-mm-omni.gguf \
  --color \
  --prompt "Explain this image in detail:"
- ```
-
- Add image/audio flags as required by your specific `llama.cpp` fork.
-
- ---
-
- # Example Capabilities
-
- With proper runtime support:
-
- ## Text
- - Instruction following
- - Multi-turn chat
- - Coding
- - Reasoning
-
- ## Vision
- - Visual Question Answering (VQA)
- - Image captioning
- - Detailed scene description
- - Chart / document understanding (within model limits)
-
- ## Audio (if Conformer path enabled)
- - Automatic Speech Recognition (ASR)
- - Speech translation (e.g. EN → FR)
- - Speech summarization
-
- Performance depends heavily on:
- - Hardware
- - GPU backend (CUDA, Metal, etc.)
- - Runtime implementation
-
- ---
-
- # Limitations & Risks
-
- ### General Model Limitations
- - May hallucinate
- - May misinterpret inputs
- - Not suitable for critical factual workflows without verification
-
- ### Multimodal Limitations
- - Vision and audio understanding are powerful but imperfect
- - Do NOT use for:
- - Medical decisions
- - Legal advice
- - Safety-critical systems
- - Biometric identification
-
- ### License Constraints
-
- You must comply with the original license:
-
- Base model:
- `microsoft/Phi-4-multimodal-instruct`
-
- License:
- https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/LICENSE
-
- This repository does **not** modify or override that license.

  ---

- # Acknowledgements

- - Base model: Microsoft Phi-4 team
- - Quantization & runtime: `llama.cpp` contributors
- - Conversion tweaks & multimodal handling: custom Phi-4 GGUF pipeline

- All credit for pretraining and model architecture goes to Microsoft and the Phi-4 team.

  license_name: phi4-model-license
  license_link: https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/LICENSE
  language:
+ - en
+ - ur
+ - de
+ - es
+ - tr
+ - fr
+ - it
  base_model:
+ - microsoft/Phi-4-multimodal-instruct
  tags:
+ - phi
+ - phi4-multimodal
+ - quantized
+ - visual-question-answering
+ - speech-translation
+ - speech-summarization
+ - audio
+ - vision
+ - gguf
  library_name: other
+ pipeline_tag: image-to-text
  ---

  # Phi-4 Multimodal – Quantized GGUF + Omni Projector

+ This repository provides **pre-converted GGUF weights** for running **[microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)** locally: a **quantized language model** plus a **multimodal projector (mmproj)**, served through a specialized llama.cpp fork.

+ - **GitHub (code + server setup):** [Ahmed-Shayan-Arsalan/Phi4-multimodal-Quantisized-Llama.cpp](https://github.com/Ahmed-Shayan-Arsalan/Phi4-multimodal-Quantisized-Llama.cpp)

+ The goal is to make Phi‑4 multimodal practical to run locally for text, vision, and audio tasks. All weights here are format conversions of the original Microsoft model and do not introduce new training data.

  ---

+ ## Files in This Repository
+
+ * **phi4-mm-Q4_K_M.gguf**: Quantized Phi‑4 multimodal **language model** (LLM).
+   * **Quantization:** Q4_K_M (4‑bit group-wise).
+   * **Usage:** Your main `-m` model in llama.cpp.
+ * **phi4-mm-omni.gguf**: **Multimodal projector (mmproj)**.
+   * **Contents:** Vision encoder (SigLIP/NaViT-style) and audio Conformer encoder.
+   * **Precision:** Stored in **F16 / F32** to preserve multimodal quality.
+   * **Usage:** Your `--mmproj` or `-mm` model in llama.cpp.
+ * *(Optional variants)*: `phi4-mm-f16.gguf` (unquantized reference), `phi4-mm-vision-q8.gguf` (alternative quantization).

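+ The snippet below is a minimal sketch of fetching both files with the `huggingface_hub` Python client; the `repo_id` is a placeholder for this repository's Hub id, and local paths depend on your cache settings.
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Placeholder repo_id: replace with this repository's actual id on the Hub.
+ repo_id = "<this-repo-id>"
+
+ # Download the quantized LLM and the multimodal projector.
+ llm_path = hf_hub_download(repo_id=repo_id, filename="phi4-mm-Q4_K_M.gguf")
+ mmproj_path = hf_hub_download(repo_id=repo_id, filename="phi4-mm-omni.gguf")
+
+ print("LLM:", llm_path)
+ print("mmproj:", mmproj_path)
+ ```
+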
+ ---

+ ## Intended Use

+ These GGUF files are designed for:
+ * **Local inference** with llama.cpp or compatible runtimes.
+ * **Research and experimentation** on multimodal reasoning.
+ * **Prototyping** agents that consume text, images, and audio.

+ **Not intended for:**
+ * Training from scratch.
+ * Any use violating the original [Microsoft Phi-4 License](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/LICENSE).

  ---

+ ## How These GGUFs Were Created

+ ### 1. Download the Base Model
  ```bash
+ git lfs install
+ git clone https://huggingface.co/microsoft/Phi-4-multimodal-instruct phi-4-multimodal
+ ```

+ ### 2. Export the Text LLM to GGUF
+ ```bash
  python convert_hf_to_gguf.py \
  /path/to/phi-4-multimodal \
  --outtype f16 \
+ --outfile phi4-mm-f16.gguf
+ ```

+ ### 3. Quantize the LLM
+ ```bash
+ ./build/bin/llama-quantize \
+ phi4-mm-f16.gguf \
+ phi4-mm-Q4_K_M.gguf \
+ Q4_K_M
+ ```
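+
+ Optionally, the resulting file can be sanity-checked with the `gguf` Python package that ships with llama.cpp (gguf-py); a small sketch, assuming the package is installed and the file is in the current directory:
+
+ ```python
+ from gguf import GGUFReader
+
+ # Read the GGUF header and list a few tensors to confirm the export worked.
+ reader = GGUFReader("phi4-mm-Q4_K_M.gguf")
+ print("metadata keys:", len(reader.fields))
+ for tensor in reader.tensors[:5]:
+     print(tensor.name, tensor.shape)
+ ```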

+ ### 4. Export the Multimodal Projector (mmproj)
+ To extract the vision and audio encoders into a separate GGUF:

+ ```bash
+ python convert_hf_to_gguf.py \
+ /path/to/phi-4-multimodal \
+ --mmproj \
+ --outtype f16 \
+ --outfile phi4-mm-omni.gguf
+ ```

+ **Technical note:** A custom `MmprojModel` path in the conversion script maps tensors from `model.embed_tokens_extend.*` to the CLIP-style and Conformer layouts expected by the llama.cpp runtime.

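+ As a purely illustrative sketch (the real logic lives in `convert_hf_to_gguf.py`, and the `PREFIX_MAP` targets below are hypothetical, not the actual GGUF tensor names), the renaming amounts to a prefix substitution over the HF checkpoint's tensor names:
+
+ ```python
+ # Hypothetical prefix table: shows the kind of renaming the converter performs;
+ # the real target names are defined inside convert_hf_to_gguf.py.
+ PREFIX_MAP = {
+     "model.embed_tokens_extend.image_embed.": "vision.",  # -> CLIP-style names
+     "model.embed_tokens_extend.audio_embed.": "audio.",   # -> Conformer-style names
+ }
+
+ def remap_tensor_name(hf_name: str) -> str:
+     """Return the projector-side name for an HF tensor, or the name unchanged."""
+     for src, dst in PREFIX_MAP.items():
+         if hf_name.startswith(src):
+             return dst + hf_name[len(src):]
+     return hf_name
+
+ print(remap_tensor_name("model.embed_tokens_extend.image_embed.img_projection.0.weight"))
+ ```
+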
  ---

+ ## How to Use (llama.cpp)

+ ### Server Mode (Recommended)
+ This exposes an OpenAI-style HTTP API supporting multimodal prompts.
+ ```bash
  ./build/bin/llama-server \
  -m /path/to/phi4-mm-Q4_K_M.gguf \
  -mm /path/to/phi4-mm-omni.gguf \
  --host 0.0.0.0 \
  --port 8080
+ ```

+ Vision: Send `image_url` parts or MTMD markers.
+ Audio: Send audio content according to the multimodal documentation.
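+
+ A minimal request sketch using the OpenAI Python client (assumes the server above is reachable on localhost:8080, that your build accepts `image_url` content parts, and that any model name is accepted by llama-server):
+
+ ```python
+ from openai import OpenAI
+
+ # llama-server does not validate the API key, but the client requires one.
+ client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
+
+ resp = client.chat.completions.create(
+     model="phi4-mm",
+     messages=[{
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "Describe this image in one sentence."},
+             {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
+         ],
+     }],
+ )
+ print(resp.choices[0].message.content)
+ ```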
 
+ ### CLI Mode
+ ```bash
  ./build/bin/llama-cli \
  -m /path/to/phi4-mm-Q4_K_M.gguf \
  -mm /path/to/phi4-mm-omni.gguf \
  --color \
  --prompt "Explain this image in detail:"
+ ```

  ---

 
+ ## Example Capabilities
+ - Text: Instruction following, reasoning, coding, multi‑turn chat.
+ - Vision: Visual question answering (VQA), captioning, document/chart understanding.
+ - Audio: Automatic speech recognition (ASR), translation (e.g. EN → FR), and summarization (where the Conformer path is enabled).

+ ## Limitations & Risks
+ - Hallucinations: The model may misinterpret content or hallucinate facts.
+ - Verification: Not suitable for medical, legal, or safety-critical decisions without human verification.
+ - Compliance: You must comply with the original Microsoft license.

+ ## Acknowledgements
+ - Base model: microsoft/Phi-4-multimodal-instruct
+ - Serving stack: llama.cpp and its contributors.
+ - Special thanks to the Microsoft Phi-4 team for the underlying pretraining.