Spaces:
Runtime error
Runtime error
| title: Qwen2.5-VL | π Storyteller v2 | |
| emoji: π | |
| colorFrom: red | |
| colorTo: red | |
| sdk: gradio | |
| sdk_version: 5.30.0 | |
| app_file: app.py | |
| pinned: true | |
| tags: | |
| - vision-language-model | |
| - visual-storytelling | |
| - chain-of-thought | |
| - grounded-text-generation | |
| - cross-frame-consistency | |
| - storytelling | |
| - image-to-text | |
| license: apache-2.0 | |
| datasets: | |
| - daniel3303/StoryReasoning | |
| models: | |
| - daniel3303/QwenStoryteller2 | |
| - daniel3303/QwenStoryteller | |
| pipeline_tag: image-to-text | |
| language: en, zh | |
| # QwenStoryteller | |
| This HF Space is a simple implementation of [2505.10292](https://arxiv.org/abs/2505.10292) by Daniel A. P. Oliveira and David Martins de Matos. BibTeX citation provided below. The space was created as a POC, all other credits go to Daniel and David. | |
| QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency, capable of generating coherent narratives from multiple images while maintaining character and object identity throughout the story. | |
| ## Model Description | |
| **Base Model:** Qwen2.5-VL 7B | |
| **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096) | |
| **Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning) | |
| QwenStoryteller processes sequences of images to perform: | |
| - End-to-end object detection | |
| - Cross-frame object re-identification | |
| - Landmark detection | |
| - Chain-of-thought reasoning for scene understanding | |
| - Grounded story generation with explicit visual references | |
| The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and alpha scaling factor of 4096, targeting self-attention layers of the language components. Training used a peak learning rate of 1Γ10β»β΄ with batch size 32, warmup for the first 3% of steps for 4 epochs, AdamW optimizer with weight decay 0.01, and bfloat16 precision. | |
| ## System Prompt | |
| The model was trained with the following system prompt, and we recommend using it as it is for inference. | |
| ``` | |
| You are an AI storyteller that can analyze sequences of images and create creative narratives. | |
| First think step-by-step to analyze characters, objects, settings, and narrative structure. | |
| Then create a grounded story that maintains consistent character identity and object references across frames. | |
| Use <think></think> tags to show your reasoning process before writing the final story. | |
| ``` | |
| ## Key Features | |
| - **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques | |
| - **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure | |
| - **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities | |
| - **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model | |
| ``` | |
| @misc{oliveira2025storyreasoningdatasetusingchainofthought, | |
| title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation}, | |
| author={Daniel A. P. Oliveira and David Martins de Matos}, | |
| year={2025}, | |
| eprint={2505.10292}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CV}, | |
| url={https://arxiv.org/abs/2505.10292}, | |
| } | |
| ``` |