---
language: en
license: other
datasets:
- DeepMount00/ner_training
tags:
- vision
- multimodal
- OCR
- SmolVLM
pipeline_tag: image-text-to-text
---

# SmolVLM Base - OCR Fine-tuned

This is SmolVLM-Base fine-tuned for OCR tasks: the model was trained with QLoRA on the DeepMount00/ner_training dataset, and the LoRA adapters were then merged back into the base weights.

## Model Details

- **Base Model**: HuggingFaceTB/SmolVLM-Base
- **Task**: Optical Character Recognition (OCR)
- **Training Method**: QLoRA with 4-bit quantization
- **Target Modules**: down_proj, o_proj, k_proj, q_proj, gate_proj, up_proj, v_proj (see the configuration sketch below)
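
The exact fine-tuning hyperparameters are not published in this card, so the following is only a minimal sketch of a QLoRA setup consistent with the details above. The target modules come from the list; the rank, alpha, dropout, and quantization settings are assumptions.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import BitsAndBytesConfig, Idefics3ForConditionalGeneration

# QLoRA keeps the base weights frozen and quantized to 4-bit.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = Idefics3ForConditionalGeneration.from_pretrained(
    "HuggingFaceTB/SmolVLM-Base",
    quantization_config=bnb_config,
)

# LoRA adapters on the attention and MLP projections listed above.
# r, lora_alpha, and lora_dropout are assumed values, not the published ones.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```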

## Usage

```python
from transformers import AutoProcessor, Idefics3ForConditionalGeneration
import torch
from PIL import Image

model_id = "DeepMount00/SmolVLM-Base-ocr_base"
processor = AutoProcessor.from_pretrained(model_id)
model = Idefics3ForConditionalGeneration.from_pretrained(model_id)

# Load your image
image = Image.open("path_to_your_image.jpg").convert("RGB")

# Prepare the prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a model specialized in OCR"},
            {"type": "image"},
            {"type": "text", "text": "Extract the text from this image"},
        ],
    }
]

# Build the chat prompt and process the inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Decode and print the response
print(processor.decode(outputs[0], skip_special_tokens=True))
```
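
If a GPU is available, the merged model can also be loaded in half precision. The snippet below is a sketch that continues from the example above (it reuses `model_id`, `processor`, `prompt`, and `image`); the bfloat16 dtype is an assumption about your hardware, not a requirement of the model.

```python
# Optional GPU variant of the example above; reuses model_id, processor, prompt, and image.
# bfloat16 is an assumed dtype choice; use float16 or float32 if your GPU lacks bf16 support.
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
).to("cuda")

inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
```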