---
license: apache-2.0
datasets:
- google/docci
- gokaygokay/random_instruct_docci
language:
- en
pipeline_tag: image-text-to-text
---

A fine-tuned version of the [moondream2](https://huggingface.co/vikhyatk/moondream2) model, trained on the [gokaygokay/random_instruct_docci](https://huggingface.co/datasets/gokaygokay/random_instruct_docci) dataset. It generates very detailed image captions.

```bash
pip install transformers timm einops bitsandbytes accelerate flash-attn
```

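Before loading the model, it can help to confirm that the packages from the install command are importable, since `flash-attn` in particular often fails to build. The following is a minimal sketch, not part of the original card: `REQUIRED`, `missing_packages`, and `load_kwargs` are hypothetical helper names, and `torch` and `pillow` are included as an assumption because the usage snippet imports them even though the install command does not list them.

```python
from importlib.util import find_spec

# pip package name -> import name (they differ for flash-attn and pillow).
# torch and pillow are assumed here because the usage snippet imports them.
REQUIRED = {
    "transformers": "transformers",
    "timm": "timm",
    "einops": "einops",
    "bitsandbytes": "bitsandbytes",
    "accelerate": "accelerate",
    "flash-attn": "flash_attn",
    "torch": "torch",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


def load_kwargs(device):
    """Build loading options for a device (hypothetical helper).

    Requests flash attention only on CUDA and only when flash_attn is
    installed; otherwise the key is left out so the model's default is used.
    """
    kwargs = {
        "dtype_name": "float32" if device == "cpu" else "float16",
        "device_map": {"": device},
    }
    if device.startswith("cuda") and find_spec("flash_attn") is not None:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


if __name__ == "__main__":
    print("Missing packages:", missing_packages() or "none")
    print("Suggested options for CPU:", load_kwargs("cpu"))
```

The `dtype_name` entry is a plain string so that the check runs without `torch` installed; convert it with `getattr(torch, dtype_name)` and pass the result as `torch_dtype` when calling `from_pretrained`.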
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image

MODEL_ID = "fal-ai/moondream2-docci-instruct"
# Pin a specific commit so the remote code and weights do not change
revision = "3ec40c7b6b5d87bc0c51edee45e21f5f29b449d8"

DEVICE = "cuda"  # flash_attention_2 below requires a CUDA GPU
# Half precision on GPU; float32 on CPU, where float16 is poorly supported
DTYPE = torch.float32 if DEVICE == "cpu" else torch.float16

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    revision=revision,
)
moondream = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=DTYPE,
    device_map={"": DEVICE},
    attn_implementation="flash_attention_2",  # remove when running on CPU
    revision=revision,
)
moondream.eval()

image_path = "<your_image_path>"
image = Image.open(image_path).convert("RGB")

# Encode the image, then ask the model a question about it
md_answer = moondream.answer_question(
    moondream.encode_image(image),
    "what is this picture about",
    tokenizer=tokenizer,
)

print(md_answer)
```
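
To caption several images, the same `encode_image` and `answer_question` calls can be wrapped in a loop. The sketch below is an illustration under stated assumptions rather than an official API: `find_images`, `caption_images`, `IMAGE_SUFFIXES`, and the `open_image` parameter are hypothetical names introduced here, and images are processed one at a time.

```python
from pathlib import Path

# File extensions treated as images (an assumption; adjust as needed)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def find_images(folder):
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        p
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def caption_images(model, tokenizer, paths,
                   prompt="what is this picture about", open_image=None):
    """Caption each image in `paths` and return a {path: caption} dict.

    `model` must expose encode_image() and answer_question() as in the
    usage snippet; `open_image` can be overridden, e.g. for testing.
    """
    if open_image is None:
        from PIL import Image  # imported lazily so the helpers load without Pillow

        def open_image(path):
            return Image.open(path).convert("RGB")

    captions = {}
    for path in paths:
        image = open_image(path)
        captions[str(path)] = model.answer_question(
            model.encode_image(image),
            prompt,
            tokenizer=tokenizer,
        )
    return captions
```

With the model and tokenizer loaded as above, `caption_images(moondream, tokenizer, find_images("<your_image_folder>"))` returns one caption per image file.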