Update README.md

3965d42 verified 11 months ago

864 Bytes

library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
language:
  - en
base_model:
  - HuggingFaceTB/SmolVLM-256M-Instruct

SmolVLM-256M-Detection

experimental and for learning purposes; wouldn't recommend using unless

check out github/shreydan/VLM-OD for results and details.

Usage

load the model same as HuggingFaceTB/SmolVLM-256M-Instruct
inputs: detect car / detect person;car etc. Apply chat template with add_generation_prompt=True
parse the output tokens <loc000> to <loc255> (code in eval.ipynb of my github repo)
to reiterate I have not added any <locXXX> special tokens (that needs wayyy more training than this method), the model itself is generating them.