Qwen2-VL-7B-Captioner

singi73737373 committed on Mar 2, 2025

Commit

60cf22d

1 Parent(s): 34706e0

Update README.md

Files changed (1)

README.md +52 -52

README.md CHANGED

@@ -1,87 +1,87 @@
 ---
-library_name: transformers
-license: apache-2.0
-language:
 - en
 base_model:
 - Qwen/Qwen2-VL-7B-Instruct
-pipeline_tag: image-to-text
 ---
-# Qwen2-VL-7B-Captioner-Relaxed
-## Introduction
-Qwen2-VL-7B-Captioner-Relaxed is an instruction-tuned version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. This fine-tuned version is based on a hand-curated dataset for text-to-image models, providing significantly more detailed descriptions of given images.
-### Key Features:
-* **Enhanced Detail:** Generates more comprehensive and nuanced image descriptions.
-* **Relaxed Constraints:** Offers less restrictive image descriptions compared to the base model.
-* **Natural Language Output:** Describes different subjects in the image while specifying their locations using natural language.
-* **Optimized for Image Generation:** Produces captions in formats compatible with state-of-the-art text-to-image generation models.
-**Note:** This fine-tuned model is optimized for creating text-to-image datasets. As a result, performance on other tasks (e.g., ~10% decrease on mmmu_val) may be lower compared to the original model.
-## Requirements
-If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try installing the latest version of the transformers library from source:
-`pip install git+https://github.com/huggingface/transformers`
-## Quickstart
-```python
-from PIL import Image
-from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
-from transformers import BitsAndBytesConfig
-import torch
-model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"
-model = Qwen2VLForConditionalGeneration.from_pretrained(
-    model_id, torch_dtype=torch.bfloat16, device_map="auto"
 )
-processor = AutoProcessor.from_pretrained(model_id)
-conversation = [
     {
-        "role": "user",
-        "content": [
             {
-                "type": "image",
-            },
-            {"type": "text", "text": "Describe this image."},
-        ],
     }
 ]
-image = Image.open(r"PATH_TO_YOUR_IMAGE")
-# you can resize the image here if it's not fitting to vram, or set model max sizes.
-# image = image.resize((1024, 1024)) # like this
-text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
-inputs = processor(
-    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
 )
-inputs = inputs.to("cuda")
-with torch.no_grad():
-    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
-        output_ids  = model.generate(**inputs, max_new_tokens=384, do_sample=True, temperature=0.7, use_cache=True, top_k=50)
-generated_ids = [
-    output_ids[len(input_ids) :]
-    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
 ]
-output_text = processor.batch_decode(
-    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
-)[0]
-print(output_text)
-```
-For more detailed options, refer to the [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) documentation.

 ---
+library_name: transformers
+license: apache-2.0
+language:
 - en
 base_model:
 - Qwen/Qwen2-VL-7B-Instruct
+pipeline_tag: image-to-text
 ---
+# Qwen2-VL-7B-Captioner-Relaxed
+## Introduction
+Qwen2-VL-7B-Captioner-Relaxed is an instruction-tuned version of [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), an advanced multimodal large language model. This fine-tuned version is based on a hand-curated dataset for text-to-image models and provides significantly more detailed descriptions of given images.
+### Key Features:
+* **Enhanced Detail:** Generates more comprehensive and nuanced image descriptions.
+* **Relaxed Constraints:** Offers less restrictive image descriptions than the base model.
+* **Natural Language Output:** Describes the different subjects in the image and specifies their locations in natural language.
+* **Optimized for Image Generation:** Produces captions in formats compatible with state-of-the-art text-to-image generation models.
+**Note:** This fine-tuned model is optimized for creating text-to-image datasets. As a result, performance on other tasks may be lower than the original model's (e.g., a ~10% decrease on mmmu_val).
+## Requirements
+If you encounter errors such as `KeyError: 'qwen2_vl'` or `ImportError: cannot import name 'Qwen2VLForConditionalGeneration' from 'transformers'`, try installing the latest version of the transformers library from source:
+`pip install git+https://github.com/huggingface/transformers`
+## Quickstart
+```python
+from PIL import Image
+from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
+from transformers import BitsAndBytesConfig
+import torch
+model_id = "Ertugrul/Qwen2-VL-7B-Captioner-Relaxed"
+model = Qwen2VLForConditionalGeneration.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
 )
+processor = AutoProcessor.from_pretrained(model_id)
+conversation = [
     {
+        "role": "user",
+        "content": [
             {
+                "type": "image",
+            },
+            {"type": "text", "text": "Describe this image."},
+        ],
     }
 ]
+image = Image.open(r"PATH_TO_YOUR_IMAGE")
+# You can resize the image here if it does not fit in VRAM, or set the model's maximum sizes.
+# image = image.resize((1024, 1024)) # like this
+text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
+inputs = processor(
+    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
 )
+inputs = inputs.to("cuda")
+with torch.no_grad():
+    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+        output_ids = model.generate(**inputs, max_new_tokens=384, do_sample=True, temperature=0.7, use_cache=True, top_k=50)
+generated_ids = [
+    output_ids[len(input_ids) :]
+    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
 ]
+output_text = processor.batch_decode(
+    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
+)[0]
+print(output_text)
+```
+For more detailed options, refer to the [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) documentation.
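
The Quickstart comment about resizing an image that does not fit in VRAM hardcodes `(1024, 1024)`, which distorts non-square images. A minimal sketch of an aspect-preserving alternative follows. The helper name `fit_to_pixel_budget`, the default budget of 1024 × 1024 pixels, and the 28-pixel patch multiple are assumptions made for illustration; they are not part of the README or of the model's API.

```python
import math

# Assumption: snapping sides to multiples of 28 suits Qwen2-VL's patch layout.
PATCH = 28


def fit_to_pixel_budget(width, height, max_pixels=1024 * 1024, multiple=PATCH):
    """Return (new_width, new_height) scaled down toward max_pixels.

    Keeps the aspect ratio approximately, never upscales, and snaps each
    side down to a multiple of `multiple`. Extreme aspect ratios can still
    exceed the budget because each side is clamped to at least one patch.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Shrink only when the image is larger than the budget.
    scale = min(1.0, math.sqrt(max_pixels / (width * height)))
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    return new_w, new_h
```

With Pillow, where `image.size` is `(width, height)`, this could be used as `image = image.resize(fit_to_pixel_budget(*image.size))` before the image is passed to the processor.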