---
base_model: facebook/Perception-LM-1B
base_model_relation: quantized
tags:
- quantized
- int4
- perception_lm
- language-model
library_name: transformers
---

# Perception-LM-1B Int4-bit Quantized

This repository contains a **4-bit quantized version** of Perception-LM-1B, optimized for reduced memory usage and faster inference while retaining most of the capabilities of the full-precision model.
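As a rough back-of-envelope check of the weight memory savings (an approximation that counts parameters only and ignores activations, the KV cache, and any layers kept in higher precision):

```python
params = 1_000_000_000          # ~1B parameters

fp16_gb = params * 2 / 1e9      # fp16: 2 bytes per parameter
int4_gb = params * 0.5 / 1e9    # INT4: 4 bits = 0.5 bytes per parameter

print(fp16_gb, int4_gb)  # 2.0 0.5
```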

## ⚙️ Model Description

- **Base model**: `facebook/Perception-LM-1B`
- **Quantization**: 4-bit integer quantization (INT4)
- **Purpose**: provide a lighter, more resource-efficient variant for inference, deployment on resource-constrained hardware, or quick prototyping
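To illustrate what INT4 quantization does, here is a minimal symmetric per-tensor sketch: floats are mapped onto integers in the signed 4-bit range `[-8, 7]` and reconstructed with a single scale factor. This is an illustration only, not necessarily the exact scheme used for this checkpoint.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric per-tensor INT4: map floats onto integers in [-8, 7]
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale = quantize_int4(w)
err = np.abs(w - dequantize(q, scale)).max()

print(q.min(), q.max())        # integers stay within the INT4 range
print(err <= scale / 2 + 1e-6) # rounding error is bounded by half a step
```

Real quantizers typically use per-channel or per-group scales to keep this rounding error small, which is where most of the remaining quality gap comes from.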

## ✅ Intended Use & Use Cases

This quantized model is suited for:

- Fast inference when GPU/CPU memory or VRAM is limited
- Prototyping or integration into applications where resource efficiency matters
- Research or production pipelines where some quantization loss is acceptable

### ⚠️ Limitations (Things to Watch Out For)

- Quantization can introduce **slight degradation** relative to the full-precision model: responses may be less accurate or fluent in edge cases.
- Not recommended for use cases that require **maximum fidelity** (e.g. very fine-grained reasoning or safety-critical tasks).
- Behavior can depend on hardware: quantized weights may require specific inference settings (device map, memory constraints).

## 🔄 How to Use

Here is an example of how you can load the quantized model and describe a video using `transformers`:

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "Dhruvil03/Perception-LM-1B-Int4bit"

processor = AutoProcessor.from_pretrained(model_id)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
).to("cuda").eval()

conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "url": "test.mp4"},
        {"type": "text", "text": "Can you describe the video in detail?"},
    ],
}]

inputs = processor.apply_chat_template(
    conversation,
    num_frames=16,  # lower this if you run out of CUDA memory
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    video_load_backend="pyav",
)

# Move tensors to the GPU; non-tensor entries pass through unchanged
inputs = {k: (v.to("cuda") if hasattr(v, "to") else v) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, skipping the prompt
ilen = inputs["input_ids"].shape[1]
decoded = processor.batch_decode(outputs[:, ilen:], skip_special_tokens=True)
print(decoded[0])
```