---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
datasets:
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoning
      type: daniel3303/StoryReasoning
      split: test
    metrics:
    - name: Character Precision
      type: precision
      value: 0.83
    - name: Object Precision
      type: precision
      value: 0.46
    - name: Total Precision
      type: precision
      value: 0.57
    - name: mAP
      type: mean_average_precision
      value: 0.27
    - name: Character Recall
      type: recall
      value: 0.62
    - name: Object Recall
      type: recall
      value: 0.25
    - name: Total Recall
      type: recall
      value: 0.40
    - name: METEOR
      type: meteor
      value: 0.14
    - name: ROUGE-L
      type: rouge-l
      value: 0.16
    - name: BLEU-4
      type: bleu-4
      value: 0.054
    - name: Description Accuracy
      type: accuracy
      value: 2.76
      description: "Rating on a scale of 1-5"
    - name: Average Hallucinations
      type: error_rate
      value: 3.56
      description: "Average number of hallucinations per story"
library_name: transformers
---

# QwenStoryteller

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

**Base Model:** Qwen2.5-VL 7B

**Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)

**Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:

- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with the AdamW optimizer (weight decay 0.01), a peak learning rate of 1×10⁻⁴, batch size 32, a warmup over the first 3% of steps, and bfloat16 precision.

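The exact training script is not included here, but the hyperparameters above correspond roughly to the following PEFT/LoRA setup. This is a minimal sketch, not the author's code: the `target_modules` names are assumptions for Qwen2.5-VL's language-side self-attention projections.

```python
# Sketch of a LoRA configuration consistent with the reported hyperparameters
# (rank 2048, alpha 4096, self-attention layers of the language components).
# The target_modules names are assumed, not confirmed by the model card.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=2048,
    lora_alpha=4096,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_config)

# Reported optimization settings: AdamW with weight decay 0.01, peak LR 1e-4,
# batch size 32, warmup over the first 3% of steps, 4 epochs, bfloat16 precision.
```
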
## System Prompt

The model was trained with the following system prompt, and we recommend using it as-is for inference.

```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg"),
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system",
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)
```

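The decoded output places the chain-of-thought analysis inside `<think></think>` tags, followed by the grounded story (see Output Format below). If you only need the story, a minimal post-processing sketch like the following can separate the two parts:

```python
import re

# Split the <think></think> reasoning block from the final grounded story.
match = re.search(r"<think>(.*?)</think>", story, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
final_story = re.sub(r"<think>.*?</think>", "", story, flags=re.DOTALL).strip()

print(final_story)
```
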
### Using vLLM for faster inference

For significantly faster inference, you can use vLLM to serve the model. Simply install vLLM and run:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller
```

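Once served, vLLM exposes an OpenAI-compatible endpoint (by default at `http://localhost:8000/v1`). The sketch below shows one way to query it with the `openai` Python client; the image URLs are placeholders, and images can also be sent as base64 data URIs.

```python
# Sketch: querying the vLLM server started above via its OpenAI-compatible API.
# Assumes the default address http://localhost:8000/v1; the image URLs are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references "
    "across frames. Use <think></think> tags to show your reasoning process before writing the final story."
)

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
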
## Output Format

QwenStoryteller produces two main outputs:

1. **Chain-of-Thought Analysis (`<think></think>`):** A structured analysis containing:
   - Character tables with consistent identity references, emotions, actions, and spatial locations
   - Object tables with functions, interactions, and spatial coordinates
   - Setting tables categorizing environmental elements
   - Narrative structure tables modeling story progression

2. **Grounded Story:** A narrative with specialized XML tags linking text to visual elements (a tag-stripping sketch follows after this list):
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags for character and object mentions
   - `<gda>`: Action tags for character actions
   - `<gdl>`: Location/landmark tags for background elements

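Because the grounding tags wrap ordinary narrative text, they can be removed for plain-text display with light post-processing. A minimal sketch, assuming the tags follow standard XML-style syntax (any attributes inside the opening tags are handled generically):

```python
import re

def strip_grounding_tags(text: str) -> str:
    """Remove <gdi>, <gdo>, <gda>, and <gdl> markup while keeping the narrative text."""
    return re.sub(r"</?gd[ioal]\b[^>]*>", "", text)

print(strip_grounding_tags(story))  # `story` from the usage example above
```
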
## Limitations

- Re-identification relies primarily on object appearance rather than broader scene context, which can lead to confusion between similar-looking objects or persons
- Movie-derived training data introduces biases from cinematic composition that may not generalize to candid visual sequences
- Low grounding rates for first-person pronouns, as these primarily appear in character dialogue
- May still produce hallucinations, albeit at a reduced rate compared to the base model

## Citation

```bibtex
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
    title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
    author={Daniel A. P. Oliveira and David Martins de Matos},
    year={2025},
    eprint={2505.10292},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.10292},
}
```

## Contact

For questions or feedback regarding this model, please contact:

- Daniel A. P. Oliveira (daniel.oliveira@inesc-id.pt)