---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.2
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
---

# Llama-3.2-11B-Vision-Instruct-FP8-dynamic

## Model Overview
- **Model Architecture:** Meta-Llama-3.2
  - **Input:** Text/Image
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 9/25/2024
- **Version:** 1.0
- **License(s):** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/blob/main/LICENSE.txt)
- **Model Developers:** Neural Magic

Quantized version of [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to the FP8 data type, ready for inference with vLLM built from source.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.
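
As a rough sanity check of that figure, the storage saving can be estimated directly; the ~11B parameter count below is an assumption used only for illustration, not a value from this card:

```python
# Back-of-the-envelope estimate; the ~11B parameter count is an assumption, not from this card.
num_params = 11e9
bf16_gib = num_params * 2 / 1024**3  # 16-bit weights: 2 bytes per parameter
fp8_gib = num_params * 1 / 1024**3   # 8-bit weights: 1 byte per parameter
print(f"BF16 ~{bf16_gib:.0f} GiB, FP8 ~{fp8_gib:.0f} GiB")  # roughly a 50% reduction
```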

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric per-channel scheme, in which a separate linear scale per output dimension maps them to their FP8 representation. Activations are quantized dynamically, with a scale computed per token at runtime.
[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
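
For illustration, the scheme roughly corresponds to the following sketch, with FP8 emulated via `torch.float8_e4m3fn`; this is not the production kernel, just a picture of where the per-channel and per-token scales live:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3

def quantize_weight_per_channel(w: torch.Tensor):
    """Symmetric per-channel quantization of a [out_features, in_features] weight."""
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX   # one scale per output channel
    w_fp8 = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic per-token quantization of a [num_tokens, hidden] activation."""
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX  # one scale per token, at runtime
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

w = torch.randn(4096, 4096)
x = torch.randn(8, 4096)
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)

# Dequantize-and-multiply for illustration; real FP8 kernels keep the data in FP8
# and fold the per-channel / per-token scales into the matmul epilogue.
y = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).t()
```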

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

# Initialize the LLM
model_name = "neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic"
llm = LLM(model=model_name, max_num_seqs=1, enforce_eager=True)

# Load the image
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")

# Create the prompt
question = "If I had to write a haiku for this one, it would be: "
prompt = f"<|image|><|begin_of_text|>{question}"

# Set up sampling parameters
sampling_params = SamplingParams(temperature=0.2, max_tokens=30)

# Generate the response
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}
outputs = llm.generate(inputs, sampling_params=sampling_params)

# Print the generated text
print(outputs[0].outputs[0].text)
```
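
Instead of hard-coding the prompt string, the prompt can also be built from the model's chat template via `AutoProcessor`. The sketch below is an optional alternative, not part of the original example:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic")

# One image placeholder plus the question, rendered through the model's chat template.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# `prompt` can then be used in place of the hand-written string in the example above.
```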

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

```
vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic --enforce-eager --max-num-seqs 16
```
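
Once a server is running, it can be queried with any OpenAI-compatible client. A minimal sketch using the `openai` Python package is shown below; the localhost URL, port, API key, and image URL are placeholders, not values from this card:

```python
from openai import OpenAI

# The URL, port, and API key are placeholders for a local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```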

## Creation

This model was created by applying [LLM Compressor](https://github.com/vllm-project/llm-compressor/blob/f90013702b15bd1690e4e2fe9ed434921b6a6199/examples/quantization_w8a8_fp8/llama3.2_vision_example.py), as presented in the code snippet below.

```python
from transformers import AutoProcessor, MllamaForConditionalGeneration

# ... (model loading and the quantization recipe are elided here; see the linked
# example and the sketch below) ...

input_ids = processor(text="Hello my name is", return_tensors="pt").input_ids.to("cuda")  # device argument assumed
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print("==========================================")
```
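
For reference, below is a minimal sketch of an FP8-dynamic recipe in the style of the linked llm-compressor example. The ignore patterns, save path, and exact API surface are assumptions based on that example, not a verbatim copy of the script used for this checkpoint:

```python
from transformers import AutoProcessor, MllamaForConditionalGeneration

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"

# Load the model in its original precision.
model = MllamaForConditionalGeneration.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8 weights (static, per-channel) and FP8 activations (dynamic, per-token).
# The ignore list (lm_head and vision components) is an assumption based on the linked example.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
)

# Apply the recipe in a single data-free pass and save the compressed model.
oneshot(model=model, recipe=recipe)
model.save_pretrained("Llama-3.2-11B-Vision-Instruct-FP8-dynamic")
processor.save_pretrained("Llama-3.2-11B-Vision-Instruct-FP8-dynamic")
```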

## Evaluation

TBD

### Reproduction

TBD