---
library_name: transformers
license: apache-2.0
pipeline_tag: image-to-text
---

# rmfg

<!-- Provide a quick summary of what the model is/does. -->
<img src="https://i.pinimg.com/736x/7e/46/a6/7e46a6881623dfd3e1a2a5a2ae692374.jpg" width="300">
## Example

**Image**

<img src="https://media-cldnry.s-nbcnews.com/image/upload/t_fit-760w,f_auto,q_auto:best/rockcms/2023-12/231202-elon-musk-mjf-1715-fc0be2.jpg" width="300">

**Output**

> A man in a black cowboy hat and sunglasses stands in front of a white car, holding a microphone and speaking into it.

---

- The model is underfit and does not perform well yet.
- This marks the beginning of my tiny vision-language model series, with this model serving as a prelude to what's to come in the next few days.

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

model_id = "aloobun/rmfg"

# trust_remote_code is required because the model ships custom modeling code
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load a local image, encode it, and generate a description
image = Image.open('692374.jpg')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
```