---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
datasets:
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoning
      type: daniel3303/StoryReasoning
      split: test
    metrics:
    - name: Character Precision
      type: precision
      value: 0.83
    - name: Object Precision
      type: precision
      value: 0.46
    - name: Total Precision
      type: precision
      value: 0.57
    - name: mAP
      type: mean_average_precision
      value: 0.27
    - name: Character Recall
      type: recall
      value: 0.62
    - name: Object Recall
      type: recall
      value: 0.25
    - name: Total Recall
      type: recall
      value: 0.40
    - name: METEOR
      type: meteor
      value: 0.14
    - name: ROUGE-L
      type: rouge-l
      value: 0.16
    - name: BLEU-4
      type: bleu-4
      value: 0.054
    - name: Description Accuracy
      type: accuracy
      value: 2.76
      description: "Rating on a scale of 1-5"
    - name: Average Hallucinations
      type: error_rate
      value: 3.56
      description: "Average number of hallucinations per story"
library_name: transformers
---

# QwenStoryteller

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from multiple images while maintaining character and object identity throughout the story.

## Model Description

**Base Model:** Qwen2.5-VL 7B

**Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)

**Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:

- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with the AdamW optimizer (weight decay 0.01), a peak learning rate of 1×10⁻⁴, batch size 32, a warmup over the first 3% of steps, and bfloat16 precision.

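The exact training script is not included here, but the hyperparameters above correspond roughly to the following PEFT/LoRA setup. This is a minimal sketch, not the author's code: the `target_modules` names are assumptions for Qwen2.5-VL's language-side self-attention projections.

```python
# Sketch of a LoRA configuration consistent with the reported hyperparameters
# (rank 2048, alpha 4096, self-attention layers of the language components).
# The target_modules names are assumed, not confirmed by the model card.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=2048,
    lora_alpha=4096,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_config)

# Reported optimization settings: AdamW with weight decay 0.01, peak LR 1e-4,
# batch size 32, warmup over the first 3% of steps, 4 epochs, bfloat16 precision.
```
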
## System Prompt

The model was trained with the following system prompt, and we recommend using it as-is for inference.

```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg"),
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system",
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)
```

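The decoded output places the chain-of-thought analysis inside `<think></think>` tags, followed by the grounded story (see Output Format below). If you only need the story, a minimal post-processing sketch like the following can separate the two parts:

```python
import re

# Split the <think></think> reasoning block from the final grounded story.
match = re.search(r"<think>(.*?)</think>", story, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
final_story = re.sub(r"<think>.*?</think>", "", story, flags=re.DOTALL).strip()

print(final_story)
```
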
### Using vLLM for faster inference

For significantly faster inference, you can use vLLM to serve the model. Simply install vLLM and run:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller
```

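Once served, vLLM exposes an OpenAI-compatible endpoint (by default at `http://localhost:8000/v1`). The sketch below shows one way to query it with the `openai` Python client; the image URLs are placeholders, and images can also be sent as base64 data URIs.

```python
# Sketch: querying the vLLM server started above via its OpenAI-compatible API.
# Assumes the default address http://localhost:8000/v1; the image URLs are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references "
    "across frames. Use <think></think> tags to show your reasoning process before writing the final story."
)

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
                {"type": "text", "text": "Generate a story based on these images."},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
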
## Output Format

QwenStoryteller produces two main outputs:

1. **Chain-of-Thought Analysis (`<think></think>`):** A structured analysis containing:
   - Character tables with consistent identity references, emotions, actions, and spatial locations
   - Object tables with functions, interactions, and spatial coordinates
   - Setting tables categorizing environmental elements
   - Narrative structure tables modeling story progression

2. **Grounded Story:** A narrative with specialized XML tags linking text to visual elements (a tag-stripping sketch follows after this list):
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags for character and object mentions
   - `<gda>`: Action tags for character actions
   - `<gdl>`: Location/landmark tags for background elements

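Because the grounding tags wrap ordinary narrative text, they can be removed for plain-text display with light post-processing. A minimal sketch, assuming the tags follow standard XML-style syntax (any attributes inside the opening tags are handled generically):

```python
import re

def strip_grounding_tags(text: str) -> str:
    """Remove <gdi>, <gdo>, <gda>, and <gdl> markup while keeping the narrative text."""
    return re.sub(r"</?gd[ioal]\b[^>]*>", "", text)

print(strip_grounding_tags(story))  # `story` from the usage example above
```
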
## Limitations

- Re-identification relies primarily on object appearance rather than broader scene context, which can lead to confusion between similar-looking objects or persons
- Movie-derived training data introduces biases from cinematic composition that may not generalize to candid visual sequences
- Low grounding rates for first-person pronouns, as these primarily appear in character dialogue
- May still produce hallucinations, albeit at a reduced rate compared to the base model

## Citation

```bibtex
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
    title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
    author={Daniel A. P. Oliveira and David Martins de Matos},
    year={2025},
    eprint={2505.10292},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.10292},
}
```

## Contact

For questions or feedback regarding this model, please contact:

- Daniel A. P. Oliveira (daniel.oliveira@inesc-id.pt)