---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
- contrastive-learning
- reinforcement-learning
- entity-reidentification
datasets:
- daniel3303/StoryReasoningAdversarialDPO
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
- map
base_model:
- daniel3303/QwenStoryteller
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller2
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoningAdversarialDPO
      type: daniel3303/StoryReasoningAdversarialDPO
      split: test
    metrics:
    - name: Character Precision
      type: precision
      value: 0.78
    - name: Object Precision
      type: precision
      value: 0.29
    - name: Total Precision
      type: precision
      value: 0.45
    - name: mAP
      type: mean_average_precision
      value: 0.31
    - name: Character Recall
      type: recall
      value: 0.77
    - name: Object Recall
      type: recall
      value: 0.28
    - name: Total Recall
      type: recall
      value: 0.48
    - name: F1 Score
      type: f1
      value: 0.41
    - name: METEOR
      type: meteor
      value: 0.17
    - name: ROUGE-L
      type: rouge-l
      value: 0.18
    - name: BLEU-4
      type: bleu-4
      value: 0.057
    - name: Character Persistence (≥5 frames)
      type: accuracy
      value: 0.493
    - name: Object Persistence (≥5 frames)
      type: accuracy
      value: 0.213
    - name: Well-structured Stories
      type: accuracy
      value: 0.975
library_name: transformers
---

# QwenStoryteller2

QwenStoryteller2 is an improved version of QwenStoryteller, fine-tuned with contrastive reinforcement learning via Direct Preference Optimization (DPO) to achieve stronger entity re-identification and visual grounding in cross-frame storytelling.

## Model Description

**Base Model:** QwenStoryteller (Qwen2.5-VL 7B)

**Training Method:** Contrastive reinforcement learning with Direct Preference Optimization (LoRA rank 2048, alpha 4096)

**Training Dataset:** [StoryReasoningAdversarialDPO](https://huggingface.co/datasets/daniel3303/StoryReasoningAdversarialDPO)

QwenStoryteller2 builds upon the original QwenStoryteller by addressing critical limitations in cross-frame entity consistency through:

- **Contrastive Learning:** Training on real stories alongside synthetic negative story examples
- **Enhanced Entity Re-identification:** Improved tracking of characters and objects across frames
- **Better Grounding:** Superior alignment between narrative elements and visual entities
- **Reduced Hallucinations:** More reliable entity connections and fewer spurious references

The model employs a dual-component reward function that promotes appropriate entity connections in coherent sequences while discouraging incorrect connections in synthetic arrangements.
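To make the idea concrete, here is a deliberately simplified, hypothetical sketch of such a dual-component reward. The function name, arguments, and weighting are illustrative assumptions, not the released training code; the actual R_reid and R_ground terms are described under Training Methodology below.

```python
# Toy illustration of a dual-component contrastive reward (all names hypothetical).
def story_reward(n_links: int, n_grounded: int, n_mentions: int,
                 is_real_sequence: bool, well_structured: bool) -> float:
    """Score one generated story.

    n_links:          cross-frame entity connections asserted by the story
    n_grounded:       pronouns/proper nouns grounded to visual entities
    n_mentions:       total pronouns/proper nouns in the story
    is_real_sequence: True for a coherent sequence, False for a synthetic one
    well_structured:  whether the output passed structural validation
    """
    if not well_structured:
        # Structural validation gates the reward entirely.
        return -1.0

    # R_reid analogue: entity connections are rewarded in real stories
    # and penalized in synthetic (incoherent) arrangements.
    r_reid = float(n_links) if is_real_sequence else -float(n_links)

    # R_ground analogue: fraction of mentions grounded to visual entities.
    r_ground = n_grounded / max(n_mentions, 1)

    return r_reid + r_ground
```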
## Key Improvements Over QwenStoryteller

- **Grounding Performance:** mAP improved from 0.27 to 0.31 (+14.8%), F1 score from 0.35 to 0.41 (+17.1%)
- **Cross-frame Consistency:** Character persistence across ≥5 frames increased from 37.7% to 49.3% (+30.8%)
- **Pronoun Grounding:** Significant improvements across all pronoun types (he: 90.1% → 99.1%, she: 91.1% → 98.6%, they: 47.6% → 68.8%)
- **Structural Quality:** Well-structured stories increased from 79.1% to 97.5% (+23.3%)
- **Entity Tracking:** Object persistence across ≥5 frames improved from 20.9% to 21.3%

## System Prompt

The model was trained with the following system prompt, and we recommend using it for optimal performance:

```
You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story.
```

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller2",
    torch_dtype="auto",
    device_map="auto",
)

# Load the processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller2")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg"),
]

# Create the image content list
image_content = [{"type": "image", "image": img} for img in images]

# Add the text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with the recommended system prompt
messages = [
    {
        "role": "system",
        "content": (
            "You are an AI storyteller that can analyze sequences of images and create "
            "creative narratives. First think step-by-step to analyze characters, objects, "
            "settings, and narrative structure. Then create a grounded story that maintains "
            "consistent character identity and object references across frames. Use "
            "<think></think> tags to show your reasoning process before writing the final story."
        ),
    },
    {
        "role": "user",
        "content": image_content,
    },
]

# Prepare inputs for inference
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate the story
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Strip the prompt tokens and decode only the generated continuation
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]

print(story)
```

### Using vLLM for faster inference

For significantly faster inference, you can serve the model with vLLM (a client sketch follows the commands below):

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller2
```
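`vllm serve` exposes an OpenAI-compatible API, by default at `http://localhost:8000/v1`, so the served model can be queried with any OpenAI client. The following is a minimal sketch assuming vLLM's default address; the placeholder API key and the example image URLs are illustrative assumptions, not part of this card's tested setup.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (default address assumed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Recommended system prompt from this card.
system_prompt = (
    "You are an AI storyteller that can analyze sequences of images and create "
    "creative narratives. First think step-by-step to analyze characters, objects, "
    "settings, and narrative structure. Then create a grounded story that maintains "
    "consistent character identity and object references across frames. Use "
    "<think></think> tags to show your reasoning process before writing the final story."
)

# Frames are passed as image_url content parts; these URLs are placeholders.
user_content = [
    {"type": "image_url", "image_url": {"url": f"https://example.com/frame{i}.jpg"}}
    for i in range(1, 6)
]
user_content.append({"type": "text", "text": "Generate a story based on these images."})

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller2",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.9,
)

print(response.choices[0].message.content)
```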
## Training Methodology

### Contrastive Learning Framework

QwenStoryteller2 was trained using a novel contrastive reinforcement learning approach:

1. **Synthetic Story Generation:** Extended the StoryReasoning dataset with 4,178 synthetic stories, built by sampling images from different movies to form incoherent sequences
2. **Dual-Component Reward Function:** Combined entity re-identification (R_reid) and grounding (R_ground) rewards with structural validation
3. **Direct Preference Optimization:** Used offline preference pairs generated from the reward function to train the model

### Reward Function Components

- **Entity Re-identification Reward:** Tracks character and object persistence across frames, promoting connections in real stories while penalizing them in synthetic ones
- **Grounding Reward:** Evaluates pronoun and proper noun grounding to visual entities
- **Structure Validation:** Ensures generated outputs maintain the required format and consistency

### Training Configuration

- **Method:** Direct Preference Optimization (DPO) with LoRA fine-tuning
- **LoRA Parameters:** Rank 2048, alpha 4096
- **Optimizer:** AdamW with learning rate 5×10⁻⁶
- **Batch Size:** 8
- **Epochs:** 3
- **Temperature Parameter (β):** 0.1

## Performance Metrics

| Metric | QwenStoryteller | QwenStoryteller2 | Relative Change |
|--------|-----------------|------------------|-----------------|
| Character Precision | 0.83 | 0.78 | -6.0% |
| Object Precision | 0.46 | 0.29 | -37.0% |
| Total Precision | 0.57 | 0.45 | -21.1% |
| mAP | 0.27 | 0.31 | +14.8% |
| Character Recall | 0.62 | 0.77 | +24.2% |
| Object Recall | 0.25 | 0.28 | +12.0% |
| Total Recall | 0.40 | 0.48 | +20.0% |
| F1 Score | 0.35 | 0.41 | +17.1% |
| METEOR | 0.14 | 0.17 | +21.4% |
| ROUGE-L | 0.16 | 0.18 | +12.5% |
| BLEU-4 | 0.054 | 0.057 | +5.6% |

## Output Format

QwenStoryteller2 produces enhanced outputs with improved consistency:

1. **Chain-of-Thought Analysis (`<think></think>`):** More accurate structured analysis with:
   - Improved character tables with consistent identity references
   - Better object tracking with more precise spatial coordinates
   - More reliable setting categorization
   - Stronger narrative structure modeling
2. **Grounded Story:** Enhanced narrative with specialized XML tags:
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags with improved accuracy
   - `<gda>`: Action tags with better character-action alignment
   - `<gdl>`: Location/landmark tags with enhanced spatial grounding

## Key Features

- **Enhanced Cross-Frame Consistency:** Superior character and object identity maintenance through contrastive learning
- **Improved Pronoun Grounding:** Better alignment of pronouns with visual entities (up to 99.1% for "he" and 98.6% for "she")
- **Reduced Hallucinations:** Fewer incorrect entity connections and spurious references
- **Robust Entity Discrimination:** Learned ability to distinguish when cross-frame connections are appropriate
- **Better Structural Quality:** Near-perfect adherence to the expected output format (97.5% well-structured stories)

## Limitations

- Precision is lower than the original model's, a trade-off for the substantially higher recall
- Training data derived from movies may introduce cinematic biases
- Entity re-identification still relies primarily on visual similarity within bounding boxes
- Performance has been validated only at the 7B parameter scale
- The optimal real-to-synthetic story ratio (2:1) may not generalize to all scenarios

## Citation

```bibtex
TODO
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
    title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
    author={Daniel A. P. Oliveira and David Martins de Matos},
    year={2025},
    eprint={2505.10292},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2505.10292}
}
```

## Contact

For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira (daniel.oliveira@inesc-id.pt)