Spaces:

jbilcke-hf
/

SNIPED_rapo

Running on Zero

App Files Files Community

jbilcke-hf commited on Oct 27, 2025

Commit

ee81688

verified ·

1 Parent(s): b109524

Upload repository for paper 2510.20206

Browse files

Files changed (28) hide show

.gitattributes +1 -0
APP_INFO.md +208 -0
CLAUDE.md +244 -0
README.md +48 -4
README_original.md +157 -0
app.py +349 -0
assets/overview.png +3 -0
ckpt/temp.py +1 -0
data/graph_test1.csv +16 -0
data/graph_test2.csv +18 -0
data/test_prompts.txt +15 -0
examples/Stage1_RAPO/add_to_graph.py +167 -0
examples/Stage1_RAPO/construct_graph.py +151 -0
examples/Stage1_RAPO/data/graph_test1.csv +16 -0
examples/Stage1_RAPO/data/graph_test2.csv +18 -0
examples/Stage1_RAPO/data/test_prompts.txt +15 -0
examples/Stage1_RAPO/refactoring.py +96 -0
examples/Stage1_RAPO/refactoring.sh +4 -0
examples/Stage1_RAPO/retrieve_modifiers.py +116 -0
examples/Stage1_RAPO/retrieve_modifiers.sh +3 -0
examples/Stage1_RAPO/rewrite_via_instruction.py +60 -0
examples/Stage1_RAPO/rewrite_via_instruction.sh +5 -0
examples/Stage1_RAPO/word_augment.py +254 -0
examples/Stage1_RAPO/word_augment.sh +17 -0
examples/Stage2_SSPO/examples.csv +10 -0
examples/Stage2_SSPO/phyaware_wan2.1.py +364 -0
requirement.txt +7 -0
requirements.txt +17 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/overview.png filter=lfs diff=lfs merge=lfs -text

APP_INFO.md ADDED Viewed

	@@ -0,0 +1,208 @@

+# RAPO++ Gradio App Documentation
+## Overview
+This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.
+## What It Does
+The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
+## How It Works
+### Architecture
+1. **Knowledge Graph Construction**
+   - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
+   - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
+   - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
+2. **Retrieval Process**
+   - Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
+   - Finds top-K most similar places via cosine similarity
+   - Samples connected actions and atmosphere descriptors from graph neighbors
+   - Filters modifiers by relevance to the input prompt
+3. **Prompt Augmentation**
+   - Combines original prompt with retrieved modifiers
+   - Structures the output to maintain coherence
+   - Returns optimized prompt suitable for T2V generation
+### Key Components
+**app.py** (main application):
+- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
+- `retrieve_and_augment_prompt()`: Core RAPO function decorated with @spaces.GPU
+- Gradio interface with examples and detailed documentation
+**requirements.txt**:
+- gradio 5.49.1 (pinned for compatibility)
+- sentence-transformers + sentencepiece for embeddings
+- torch 2.5.1 for tensor operations
+- networkx for graph operations
+- huggingface_hub for model downloads
+## Model Downloads
+The app automatically downloads the required model on first run:
+- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)
+Downloaded to: `./ckpt/all-MiniLM-L6-v2/`
+## Usage
+### Basic Usage
+1. Enter a simple prompt (e.g., "A person walking")
+2. Click "Optimize Prompt"
+3. View the enhanced prompt with contextual details
+### Advanced Settings
+- **Number of Places to Retrieve**: How many related places to search (1-5, default: 2)
+- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)
+### Example Prompts
+Try these examples to see the optimization in action:
+- "A person walking"
+- "A car driving at night"
+- "Someone cooking in a kitchen"
+- "A group of people talking"
+- "A bird flying"
+- "Someone sitting and reading"
+## Technical Details
+### Graph Structure
+**Places (central nodes):**
+- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake
+**Edge Types:**
+- Place → Verb/Action edges (e.g., "forest" → "walking through")
+- Place → Atmosphere edges (e.g., "forest" → "dense trees")
+**Retrieval Algorithm:**
+1. Encode input prompt: `prompt_emb = model.encode(prompt)`
+2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
+3. Select top-K places by similarity score
+4. Sample neighbors from graph: `G.neighbors(place)`
+5. Deduplicate and rank modifiers
+### ZeroGPU Integration
+The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
+- Fast embedding computations
+- Efficient cosine similarity calculations
+- Scalability to larger graphs and batch processing
+### Differences from Full RAPO
+This demo implements a **simplified version** of Stage 1 RAPO:
+**Included:**
+✅ Knowledge graph with place-verb-scene relations
+✅ Embedding-based retrieval via SentenceTransformer
+✅ Cosine similarity ranking
+✅ Basic prompt augmentation
+**Not Included (requires additional models/data):**
+❌ Full relation graph from paper (requires ~GB of graph data)
+❌ LLM-based sentence refactoring (Mistral-7B)
+❌ Iterative merging with similarity thresholds
+❌ Instruction-based rewriting (Llama3.1)
+**Why This Approach:**
+- Full RAPO requires 7B+ LLM downloads (~15GB+)
+- Full graph data requires downloading preprocessed datasets
+- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
+- Users can understand the methodology without waiting for large downloads
+## Running the Full RAPO Pipeline
+To run the complete Stage 1 RAPO from the paper:
+```bash
+cd examples/Stage1_RAPO
+# 1. Retrieve modifiers from graph
+sh retrieve_modifiers.sh
+# 2. Word augmentation
+sh word_augment.sh
+# 3. Sentence refactoring
+sh refactoring.sh
+# 4. Instruction-based rewriting
+sh rewrite_via_instruction.sh
+```
+**Requirements:**
+- Download full relation graph data to `relation_graph/graph_data/`
+- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
+- Download llama3_1_instruct_lora_rewrite to `ckpt/`
+See README.md for full installation instructions.
+## Integration with RAPO++ Stages
+This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:
+**Stage 1 (RAPO)** - *Demonstrated Here*
+- Retrieval-augmented prompt optimization via knowledge graphs
+- Offline refinement using curated data
+**Stage 2 (SSPO)**
+- Self-supervised prompt optimization
+- Iterative refinement based on generated video feedback
+- Physics-aware consistency checks
+- VLM-based alignment scoring
+**Stage 3 (Fine-tuning)**
+- LLM fine-tuning on collected feedback from Stage 2
+- Model-specific prompt refiners
+## Performance Notes
+- First run: ~1-2 minutes (downloads model)
+- Subsequent runs: <1 second per prompt
+- GPU allocation: Automatic via ZeroGPU
+- Memory usage: ~500MB (model + graph)
+## Troubleshooting
+**"No module named 'sentencepiece'"**
+- Ensure `sentencepiece==0.2.1` is in requirements.txt
+- sentence-transformers requires sentencepiece for tokenization
+**"CUDA has been initialized before importing spaces"**
+- The app correctly imports `spaces` FIRST before torch
+- If you modify the code, maintain this import order
+**Model download fails**
+- Check internet connection
+- HuggingFace Hub may be temporarily unavailable
+- Model will retry on next run (cached after successful download)
+## References
+**Papers:**
+- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
+- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization
+**Project Pages:**
+- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
+- RAPO++: https://whynothaha.github.io/RAPO_plus_github/
+**Code:**
+- GitHub: https://github.com/Vchitect/RAPO
+## License
+Please refer to the original repository for licensing information.
+---
+**Created for HuggingFace Spaces deployment**

CLAUDE.md ADDED Viewed

	@@ -0,0 +1,244 @@

+# CLAUDE.md
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+## Project Overview
+RAPO++ is a three-stage framework for text-to-video (T2V) generation prompt optimization. It combines:
+- **Stage 1 (RAPO)**: Retrieval-Augmented Prompt Optimization using relation graphs
+- **Stage 2 (SSPO)**: Self-Supervised Prompt Optimization with test-time iterative refinement
+- **Stage 3**: LLM fine-tuning on collected feedback data
+The system is model-agnostic and works with various T2V models (Wan2.1, Open-Sora-Plan, HunyuanVideo, etc.).
+## Environment Setup
+```bash
+# Create and activate environment
+conda create -n rapo_plus python=3.10
+conda activate rapo_plus
+# Install dependencies
+pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
+pip install -r requirement.txt
+```
+## Required Checkpoints
+Download and place in `ckpt/` directory:
+**Stage 1:**
+- `all-MiniLM-L6-v2/` - Sentence transformer for embeddings
+- `llama3_1_instruct_lora_rewrite/` - LLM for prompt rewriting
+- `Mistral-7B-Instruct-v0.3/` - Alternative instruction-tuned LLM
+**Stage 2 (example with Wan2.1):**
+- `Wan2.1-T2V-1.3B-Diffusers/` - Base T2V model
+- `Qwen2.5-7B-Instruct/` - Instruction-following LLM for prompt refinement
+- `Qwen2.5-vl-7B-instruct/` - Vision-language model for video alignment assessment
+Also place relation graph data in `relation_graph/graph_data/`.
+## Core Workflows
+### Stage 1: RAPO (Retrieval-Augmented Prompt Optimization)
+**Location:** `examples/Stage1_RAPO/`
+**Pipeline:**
+1. **Graph Construction** (`construct_graph.py`):
+   - Reads CSV with columns: `Input`, `verb_obj_word`, `scenario_word`, `place`
+   - Creates NetworkX graphs linking places to verbs and scenes
+   - Generates embeddings with SentenceTransformer
+   - Outputs: JSON dictionaries, GraphML files to `relation_graph/`
+2. **Modifier Retrieval** (`retrieve_modifiers.py`):
+   - Input: Test prompts from `data/test_prompts.txt`
+   - Encodes prompts and retrieves top-K related places via cosine similarity
+   - Samples connected verbs/scenes from graph neighbors
+   - Outputs: `output/retrieve_words/{filename}.txt` and `.csv`
+   - Run: `sh retrieve_modifiers.sh`
+3. **Word Augmentation** (`word_augment.py`):
+   - Filters retrieved modifiers by similarity threshold
+   - Merges modifiers interactively
+   - Run: `sh word_augment.sh`
+4. **Sentence Refactoring** (`refactoring.py`):
+   - Restructures prompts with augmented modifiers
+   - Run: `sh refactoring.sh`
+5. **Instruction-Based Rewriting** (`rewrite_via_instruction.py`):
+   - Uses LLM to refine prompts with natural language instructions
+   - Run: `sh rewrite_via_instruction.sh`
+**Key Parameters:**
+- `place_num`: Top-K places to retrieve (default: 3)
+- `verb_num`, `topk_num`: Controls verb/scene sampling
+- `SIMILARITY_THRESHOLD`: Filters modifiers in word_augment.py
+### Stage 2: SSPO (Self-Supervised Prompt Optimization)
+**Location:** `examples/Stage2_SSPO/`
+**Main Script:** `phyaware_wan2.1.py`
+**Architecture:**
+This script implements a closed-loop iterative optimization pipeline:
+1. **Video Generation** (`load_model()`, `generate_single_video()`):
+   - Uses WanPipeline to generate videos from prompts
+   - Configurable: height=480, width=832, num_frames=81, fps=15
+2. **Optical Flow Analysis** (`extract_optical_flow()`):
+   - Extracts motion statistics using cv2.calcOpticalFlowFarneback
+   - Samples frames at configurable intervals
+   - Returns sequence of (x, y) flow vectors
+3. **VLM Alignment Assessment** (`misalignment_assessment()`):
+   - Uses Qwen2.5-VL to evaluate video-prompt alignment
+   - Assesses objects, actions, scenes
+   - Returns textual alignment score (1-5 scale)
+4. **Physics Consistency Check + Prompt Refinement** (`evaluate_physical_consistency()`):
+   - **Phase 1**: LLM analyzes optical flow for physical plausibility (inertia, momentum, etc.)
+   - **Phase 2**: Fuses physics analysis + VLM alignment feedback
+   - Rewrites prompt to enforce physical rules and semantic alignment
+   - Uses Qwen2.5-7B-Instruct
+5. **Iterative Loop**:
+   - Generates video → Analyzes → Refines prompt → Generates again
+   - Default: 5 refinement iterations per prompt
+   - Logs to CSV: `results/examples_refined/refined_prompts.csv`
+**Resume Capability:**
+The script checks existing logs and videos to resume from last iteration, maintaining prompt chain consistency.
+**Input Format:**
+CSV with columns: `captions` (prompt), `phys_law` (physical rule to enforce)
+**Key Configuration (lines 248-264):**
+```python
+WAN_MODEL_ID = "../../ckpt/Wan2.1-T2V-1.3B-Diffusers"
+INSTRUCT_LLM_PATH = "../../ckpt/Qwen2.5-7B-Instruct"
+QWEN_VL_PATH = "../../ckpt/qwen2.5-vl-7B-instruct"
+num_refine_iterations = 5
+```
+### Stage 3: LLM Fine-Tuning
+Not provided in code; uses feedback data from Stage 2 to fine-tune model-specific prompt refiners.
+## Key Architectural Patterns
+### Graph-Based Retrieval (Stage 1)
+- **Data Structure**: NetworkX graphs with place nodes as hubs
+- **Retrieval**: Cosine similarity between prompt embeddings and place embeddings
+- **Augmentation**: Graph neighbors provide contextually relevant modifiers
+- **Caching**: Pre-computed embeddings stored in JSON for efficiency
+### Closed-Loop Optimization (Stage 2)
+- **Multi-Modal Feedback**: Combines optical flow (physics) + VLM (semantics)
+- **Iterative Refinement**: Each video informs next prompt
+- **Logging**: CSV tracks full prompt evolution chain
+- **Modularity**: Easy to swap T2V models, reward functions, or VLMs
+### Embedding Model Usage
+- SentenceTransformer for text similarity (Stage 1)
+- Pre-encode and cache all graph tokens to avoid redundant computation
+## Common Commands
+**Stage 1 - Full Pipeline:**
+```bash
+cd examples/Stage1_RAPO
+# Build graph from scratch
+python construct_graph.py
+# Run full RAPO pipeline
+sh retrieve_modifiers.sh
+sh word_augment.sh
+sh refactoring.sh
+sh rewrite_via_instruction.sh
+```
+**Stage 2 - SSPO:**
+```bash
+cd examples/Stage2_SSPO
+python phyaware_wan2.1.py
+```
+## File Dependencies
+**Input Files:**
+- `data/test_prompts.txt` - One prompt per line for Stage 1
+- `examples/Stage2_SSPO/examples.csv` - Prompts + physical rules for Stage 2
+- `relation_graph/graph_data/*.json` - Pre-built graph data
+- `relation_graph/graph_data/*.graphml` - Graph structure
+**Output Structure:**
+- `examples/Stage1_RAPO/output/retrieve_words/` - Retrieved modifiers
+- `examples/Stage1_RAPO/output/refactor/` - Augmented prompts
+- `examples/Stage2_SSPO/results/examples_refined/` - Videos + logs
+## Critical Implementation Details
+### Stage 1 Graph Construction
+- Place tokens serve as central nodes linking verbs and scenes
+- Edge weights implicitly represent co-occurrence frequency
+- Embedding dimension from SentenceTransformer: 384 (all-MiniLM-L6-v2)
+### Stage 2 Physics Analysis
+The `evaluate_physical_consistency()` function uses a two-phase LLM prompting strategy:
+1. First call: Analyze optical flow for physics violations
+2. Second call: Synthesize physics + VLM feedback into refined prompt
+The prompt rewriting instruction explicitly constrains:
+- Motion continuity and force consistency
+- Object states and timings
+- Camera motion if needed
+- Output limited to <120 words
+### Optical Flow Extraction
+- Uses Farneback algorithm (dense optical flow)
+- Samples frames at 0.5-second intervals by default
+- Returns mean (x, y) flow per frame pair
+- Sudden reversals or inconsistent magnitudes indicate physics violations
+## Model Swapping
+**To use a different T2V model in Stage 2:**
+1. Update pipeline loading in `load_model()` function
+2. Adjust generation parameters (height, width, num_frames)
+3. Ensure model outputs diffusers-compatible format
+4. Update checkpoint path constants (lines 249-251)
+**To use a different VLM:**
+- Replace `Qwen2_5_VLForConditionalGeneration` with alternative
+- Adjust processor and prompt template in `misalignment_assessment()`
+**To use a different LLM for refinement:**
+- Update `INSTRUCT_LLM_PATH` and ensure transformers compatibility
+- Modify system/user message format if needed
+## Troubleshooting
+**Graph loading errors:**
+- Ensure all JSON files exist in `relation_graph/graph_data/`
+- Check GraphML files are valid NetworkX format
+**CUDA OOM:**
+- Stage 2 loads 3 large models simultaneously (T2V, VLM, LLM)
+- Reduce batch size or use smaller models
+- Consider offloading models between steps
+**Syntax error in phyaware_wan2.1.py line 251:**
+- Missing opening quote: `QWEN_VL_PATH = ../../ckpt//qwen2.5-vl-7B-instruct"`
+- Should be: `QWEN_VL_PATH = "../../ckpt/qwen2.5-vl-7B-instruct"`
+## Paper References
+- **RAPO**: "The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation" (CVPR 2025)
+- **RAPO++**: arXiv:2510.20206
+- Project pages and models available on HuggingFace

README.md CHANGED Viewed

@@ -1,12 +1,56 @@
 ---
-title: SNIPED Rapo
-emoji: 🏃
 colorFrom: yellow
-colorTo: green
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: "RAPO"
+emoji: 🤖
 colorFrom: yellow
+colorTo: blue
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
+short_description: "Manual Entry: https://huggingface.co/papers/2510.20206"
+hardware: zerogpu
+tags:
+  - research
+  - paper
+  - code
+  - cheatcode
+license: mit
 ---
+# RAPO
+**Automated upload by CheatCode** 🚀
+## 📄 Paper Information
+- **Paper ID**: 2510.20206
+- **Title**: Manual Entry: https://huggingface.co/papers/2510.20206
+- **Original Repository**: [https://github.com/Vchitect/RAPO](https://github.com/Vchitect/RAPO)
+## 🛠️ Repository Information
+- **Languages**: Python, Shell
+- **Gradio App**: ✅ Generated by CheatCode
+## 🤖 About CheatCode
+This Space was automatically created by [CheatCode](https://github.com/jbilcke-hf/CheatCode),
+an AI-powered tool that:
+1. Discovers research papers from HuggingFace
+2. Extracts and analyzes linked repositories
+3. Generates Gradio demo applications
+4. Uploads everything to HuggingFace Spaces
+## 📝 Usage
+This Space includes a Gradio app that was automatically generated from the repository code.
+## ⚠️ Disclaimer
+This is an automated upload. The code comes from the original repository and may require
+additional configuration or dependencies to run properly.
+## 📜 License
+Please refer to the original repository for licensing information: https://github.com/Vchitect/RAPO

README_original.md ADDED Viewed

	@@ -0,0 +1,157 @@

+---
+title: RAPO++ Text-to-Video Prompt Optimization
+emoji: 🎬
+colorFrom: purple
+colorTo: blue
+sdk: gradio
+sdk_version: 5.49.1
+app_file: app.py
+pinned: false
+short_description: A three-stage framework for optimizing text-to-video generation prompts via retrieval, self-supervised refinement, and LLM fine-tuning
+hardware: zerogpu
+---
+# RAPO++: Prompting Test-Time Scaling for Text-to-Video Generation
+<p align="center">
+  <a href="https://arxiv.org/pdf/2504.11739" target="_blank"><img src="https://img.shields.io/badge/Paper-RAPO-red"></a>
+  <a href='https://whynothaha.github.io/Prompt_optimizer/RAPO.html' target="_blank"><img src='https://img.shields.io/badge/ProjectPage-RAPO-blue'></a>
+  <a href="https://arxiv.org/abs/2510.20206" target="_blank"><img src="https://img.shields.io/badge/Paper-RAPO++-red"></a>
+  <a href='https://whynothaha.github.io/RAPO_plus_github/' target="_blank"><img src='https://img.shields.io/badge/ProjectPage-RAPO++-blue'></a>
+  <a href="https://huggingface.co/papers/2510.20206" target="_blank"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Daily Papers-red"></a>
+</p >
+<p align="center">
+<strong><big>
+If you find our work useful, please consider giving us a star🌟</big></strong>
+</p>
+## 📚 AutoPage
+Our website is automatically generated using our [**AutoPage**](https://mqleet.github.io/AutoPage_ProjectPage/), a multi-agent system we highly recommend for effortless academic page creation.
+## 📋 Table of Contents
+This is the official implementation for
+- [RAPO] [The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation, **CVPR 2025**](https://arxiv.org/abs/2502.07516)
+- [RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling, arXiv:2510.20206](https://arxiv.org/abs/2510.20206)
+  - [🔎 Overview](#-overview)
+  - [🤗 Checkpoint](#-checkpoint)
+  - [🛠️ Installation](#-installation)
+  - [🚀 Quick Start](#-quick-start)
+  - [📐 Evaluation](#-evaluation)
+## 🔎 Overview
+RAPO++ is a three-stage framework that enhances text-to-video generation without modifying model architectures. It unifies data-aligned prompt refinement (RAPO), test-time iterative optimization (SSPO), and LLM fine-tuning**, enabling more coherent, compositional, and physically realistic video synthesis. Tested on five state-of-the-art models and benchmarks, RAPO++ consistently improves semantic alignment, temporal stability, and visual fidelity, setting a new standard for prompt optimization in T2V generation.
+The core contribution of RAPO++ lies in SSPO, a model-agnostic, closed-loop mechanism that iteratively refines prompts through feedback from generated videos. When using RAPO++, users can replace RAPO with their model’s built-in prompt refiner as initialization. The feedback data collected during SSPO can then be used to fine-tune the refiner itself, further enhancing model-specific prompt optimization.
+![Overview](assets/overview.png)
+## 🛠️ Installation
+1. Clone the Repository
+```
+git clone https://github.com/Vchitect/RAPO.git
+cd RAPO
+```
+2. Set up Environment
+```
+conda create -n rapo_plus python=3.10
+conda activate RAPO
+pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118
+pip install -r requirements.txt
+```
+## 🤗 Checkpoint
+### Stage 1 RAPO
+Download the required model weights [RAPO](https://huggingface.co/bingjie/RAPO/tree/main), relation graph and pretrained LLM (e.g. , [
+Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3/tree/main) )and place them in the `ckpt/` and `relation_graph/` directory.
+```
+ckpt/
+│── all-MiniLM-L6-v2/
+│── llama3_1_instruct_lora_rewrite/
+│── Mistral-7B-Instruct-v0.3/
+relation_graph/
+│── graph_data/
+```
+### Stage 2 SSPO
+We take Wan2.1-T2V as the base model to illustrate the process of SSPO. Download the required model weights [Wan2.1-T2V](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B/tree/main), [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct/tree/main), [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/tree/main]), and place them in the `ckpt/` directory.
+```
+ckpt/
+│── Wan2.1-T2V-1.3B-Diffusers/
+│── Qwen2.5-7B-Instruct/
+│── Qwen2.5-vl-7B-instruct/
+```
+## 🚀 Quick Start
+### Stage 1 RAPO
+```
+cd ./examples/Stage1_RAPO/
+```
+0. We provide the codes to compose the graph data. We provide two examples of inputs to compose the graph data (`./dataset/graph_test1.csv` and `./dataset/graph_test2.csv`). You can build a relation graph from scratch based on the constructed data:
+```
+python construct_graph.py
+```
+or you can add data based on the already constructed relation graph:
+```
+python add_to_graph.py
+```
+1. Retrieve related modifiers from relation graph. You can adjust the hyperparameters in `retrieve_modifiers.py` to modify the number of retrieval modifiers.
+```
+sh retrieve_modifiers.sh
+```
+2. Word augmentation and sentence refactoring.
+```
+sh word_augment.sh
+sh refactoring.sh
+```
+3. Rewrite via instruction.
+```
+sh rewrite_via_instruction.sh
+```
+### Stage 2 SSPO
+```
+cd ./examples/Stage2_SSPO/
+```
+We take **physical-aware video generation** based on Wan2.1 as an example. We provide an examples.csv file in this directory, which contains some test prompts and the physical rules that T2V generation needs to comply with.
+For quickly start, the script generates and refines videos iteratively by combining Wan 2.1 T2V generation, Qwen2.5-VL alignment scoring, and physics-based prompt rewriting to enhance realism and consistency. You can modify the script to change the base model, include custom reward functions and historical-prompt backtracking for task-specific adaptation.
+```
+python phyaware_wan2.1.py
+```
+### Stage 3 LLM finetuning
+For LLM fine-tuning, the process depends on the selected T2V base models and further refines the Stage 2 naive-optimized prompts.
+Examples include [Open-Sora-Plan](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.3.0), [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B), [HunyuanVideo](https://huggingface.co/tencent/HunyuanVideo-PromptRewrite), and et.al.
+## ✒️ Citation
+If you find our work helpful for your research, please consider giving a citation 📝
+```
+@article{gao2025rapopp,
+  title   = {RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling},
+  author  = {Gao, Bingjie and Ma, Qianli and Wu, Xiaoxue and Yang, Shuai and Lan, Guanzhou and Zhao, Haonan and Chen, Jiaxuan and Liu, Qingyang and Qiao, Yu and Chen, Xinyuan and Wang, Yaohui and Niu, Li},
+  journal = {arXiv preprint arXiv:2510.20206},
+  year    = {2025}
+}
+```
+```
+@InProceedings{Gao_2025_CVPR,
+    author    = {Gao, Bingjie and Gao, Xinyu and Wu, Xiaoxue and Zhou, Yujie and Qiao, Yu and Niu, Li and Chen, Xinyuan and Wang, Yaohui},
+    title     = {The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation},
+    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+    month     = {June},
+    year      = {2025},
+    pages     = {3173-3183}
+}
+```

app.py ADDED Viewed

	@@ -0,0 +1,349 @@

+"""
+RAPO++ Text-to-Video Prompt Optimization Demo
+This demo showcases Stage 1 (RAPO): Retrieval-Augmented Prompt Optimization
+It demonstrates how simple prompts can be enriched with contextually relevant modifiers
+retrieved from a knowledge graph for better text-to-video generation.
+"""
+# CRITICAL: Import spaces FIRST before any CUDA-related packages
+import spaces
+import gradio as gr
+import torch
+from sentence_transformers import SentenceTransformer
+from torch.nn.functional import cosine_similarity
+import networkx as nx
+import json
+import os
+import random
+from huggingface_hub import snapshot_download, hf_hub_download
+# =============================================================================
+# Model and Data Setup (runs once at startup)
+# =============================================================================
+print("=" * 60)
+print("Setting up RAPO++ Demo...")
+print("=" * 60)
+# Create necessary directories
+os.makedirs("./ckpt", exist_ok=True)
+os.makedirs("./relation_graph/graph_data", exist_ok=True)
+# Download SentenceTransformer model for embeddings
+SENTENCE_TRANSFORMER_PATH = "./ckpt/all-MiniLM-L6-v2"
+if not os.path.exists(SENTENCE_TRANSFORMER_PATH):
+    print("Downloading SentenceTransformer model...")
+    snapshot_download(
+        repo_id="sentence-transformers/all-MiniLM-L6-v2",
+        local_dir=SENTENCE_TRANSFORMER_PATH,
+        local_dir_use_symlinks=False
+    )
+    print("✓ SentenceTransformer downloaded")
+else:
+    print("✓ SentenceTransformer already cached")
+# Load SentenceTransformer model
+print("Loading SentenceTransformer model...")
+embedding_model = SentenceTransformer(SENTENCE_TRANSFORMER_PATH)
+print("✓ Model loaded")
+# =============================================================================
+# Simple Demo Graph Creation (since full graph data requires large download)
+# =============================================================================
+def create_demo_graph():
+    """Create a simplified demo graph with common T2V generation concepts"""
+    # Create sample place-verb and place-scene graphs
+    G_place_verb = nx.Graph()
+    G_place_scene = nx.Graph()
+    # Define places (central nodes)
+    places = [
+        "forest", "beach", "city street", "mountain", "room", "park",
+        "studio", "kitchen", "bridge", "parking lot", "desert", "lake"
+    ]
+    # Define verbs/actions for each place
+    place_verbs = {
+        "forest": ["walking through", "hiking in", "exploring", "camping in", "running through"],
+        "beach": ["walking on", "swimming at", "surfing at", "relaxing on", "playing on"],
+        "city street": ["walking down", "driving through", "running along", "biking through"],
+        "mountain": ["climbing", "hiking up", "descending", "exploring", "camping on"],
+        "room": ["sitting in", "working in", "relaxing in", "reading in", "sleeping in"],
+        "park": ["walking in", "playing in", "jogging through", "sitting in", "picnicking in"],
+        "studio": ["working in", "dancing in", "recording in", "practicing in"],
+        "kitchen": ["cooking in", "preparing food in", "baking in", "cleaning"],
+        "bridge": ["walking across", "driving across", "standing on", "running across"],
+        "parking lot": ["standing in", "walking through", "driving in", "parking in"],
+        "desert": ["walking through", "driving through", "camping in", "exploring"],
+        "lake": ["swimming in", "boating on", "fishing at", "relaxing by"]
+    }
+    # Define scenarios/atmospheres for each place
+    place_scenes = {
+        "forest": ["dense trees", "peaceful atmosphere", "natural setting", "quiet surroundings"],
+        "beach": ["ocean waves", "sunny day", "sandy shore", "coastal view"],
+        "city street": ["busy traffic", "urban environment", "city lights", "crowded sidewalk"],
+        "mountain": ["scenic view", "high altitude", "rocky terrain", "mountain peak"],
+        "room": ["indoor setting", "comfortable space", "quiet environment", "cozy atmosphere"],
+        "park": ["green grass", "open space", "trees around", "peaceful setting"],
+        "studio": ["professional lighting", "indoor space", "creative environment"],
+        "kitchen": ["modern appliances", "cooking area", "indoor setting", "bright lighting"],
+        "bridge": ["elevated view", "water below", "connecting path", "architectural structure"],
+        "parking lot": ["outdoor area", "vehicles around", "paved surface", "open space"],
+        "desert": ["sandy terrain", "hot climate", "barren landscape", "vast expanse"],
+        "lake": ["calm water", "natural scenery", "peaceful setting", "reflection on water"]
+    }
+    # Build graphs
+    for place in places:
+        # Add place-verb connections
+        for verb in place_verbs.get(place, []):
+            G_place_verb.add_edge(place, verb)
+        # Add place-scene connections
+        for scene in place_scenes.get(place, []):
+            G_place_scene.add_edge(place, scene)
+    # Create embeddings for all places
+    place_embeddings = embedding_model.encode(places)
+    # Create lookup dictionaries
+    place_to_idx = {place: idx for idx, place in enumerate(places)}
+    idx_to_place = {idx: place for place, idx in place_to_idx.items()}
+    return G_place_verb, G_place_scene, place_embeddings, place_to_idx, idx_to_place
+# Initialize demo graph
+print("Creating demo knowledge graph...")
+G_place_verb, G_place_scene, place_embeddings, place_to_idx, idx_to_place = create_demo_graph()
+print("✓ Demo graph created")
+print("=" * 60)
+print("✓ Setup complete!")
+print("=" * 60)
+# =============================================================================
+# Core RAPO Functions
+# =============================================================================
+@spaces.GPU
+def retrieve_and_augment_prompt(prompt: str, place_num: int = 2, modifier_num: int = 5) -> tuple:
+    """
+    Main RAPO function: Retrieves relevant modifiers from the graph and augments the prompt.
+    Args:
+        prompt: Input text-to-video generation prompt
+        place_num: Number of top places to retrieve
+        modifier_num: Number of modifiers to sample per place
+    Returns:
+        Tuple of (augmented_prompt, retrieved_info, places_found)
+    """
+    # Encode input prompt
+    prompt_embedding = embedding_model.encode(prompt)
+    # Compute similarity with all places
+    similarities = cosine_similarity(
+        torch.tensor(prompt_embedding).unsqueeze(0),
+        torch.tensor(place_embeddings)
+    )
+    # Get top-K most similar places
+    top_indices = torch.topk(similarities, min(place_num, len(place_to_idx))).indices
+    # Retrieve modifiers from graph
+    retrieved_verbs = []
+    retrieved_scenes = []
+    places_found = []
+    for idx in top_indices.numpy().tolist():
+        place = idx_to_place[idx]
+        places_found.append(place)
+        # Get verb neighbors
+        verb_neighbors = list(G_place_verb.neighbors(place))
+        verb_samples = random.sample(verb_neighbors, min(modifier_num, len(verb_neighbors)))
+        retrieved_verbs.extend(verb_samples)
+        # Get scene neighbors
+        scene_neighbors = list(G_place_scene.neighbors(place))
+        scene_samples = random.sample(scene_neighbors, min(modifier_num, len(scene_neighbors)))
+        retrieved_scenes.extend(scene_samples)
+    # Remove duplicates while preserving order
+    retrieved_verbs = list(dict.fromkeys(retrieved_verbs))
+    retrieved_scenes = list(dict.fromkeys(retrieved_scenes))
+    # Create augmented prompt (simple version - just add contextual details)
+    augmented_parts = [prompt.strip()]
+    # Add most relevant modifiers
+    if retrieved_verbs:
+        augmented_parts.append(f"The scene shows {retrieved_verbs[0]}")
+    if retrieved_scenes:
+        augmented_parts.append(f"with {retrieved_scenes[0]}")
+    augmented_prompt = ", ".join(augmented_parts) + "."
+    # Format retrieved info for display
+    retrieved_info = {
+        "Places": places_found,
+        "Actions": retrieved_verbs[:5],
+        "Atmosphere": retrieved_scenes[:5]
+    }
+    return augmented_prompt, retrieved_info, places_found
+# =============================================================================
+# Gradio Interface
+# =============================================================================
+def process_prompt(prompt, place_num, modifier_num):
+    """Process prompt and return results for Gradio"""
+    if not prompt.strip():
+        return "Please enter a prompt.", {}, []
+    try:
+        augmented_prompt, retrieved_info, places = retrieve_and_augment_prompt(
+            prompt, place_num, modifier_num
+        )
+        # Format retrieved info for display
+        info_text = "**Retrieved Modifiers:**\n\n"
+        info_text += f"**📍 Top Places:** {', '.join(places)}\n\n"
+        info_text += f"**🎬 Actions:** {', '.join(retrieved_info['Actions'])}\n\n"
+        info_text += f"**🌅 Atmosphere:** {', '.join(retrieved_info['Atmosphere'])}\n\n"
+        return augmented_prompt, info_text
+    except Exception as e:
+        return f"Error: {str(e)}", ""
+# Create Gradio interface
+with gr.Blocks(
+    theme=gr.themes.Soft(
+        primary_hue="purple",
+        secondary_hue="blue"
+    ),
+    title="RAPO++ Text-to-Video Prompt Optimization"
+) as demo:
+    gr.Markdown("""
+    # 🎬 RAPO++ Text-to-Video Prompt Optimization
+    This demo showcases **Stage 1 (RAPO)**: Retrieval-Augmented Prompt Optimization using knowledge graphs.
+    **How it works:**
+    1. Enter a simple text-to-video prompt
+    2. The system retrieves contextually relevant modifiers from a knowledge graph
+    3. Your prompt is enhanced with specific actions and atmospheric details
+    4. Use the optimized prompt for better T2V generation results!
+    **Example prompts to try:**
+    - "A person walking"
+    - "A car driving"
+    - "Someone cooking"
+    - "A group of people talking"
+    Based on the paper: [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206)
+    """)
+    with gr.Row():
+        with gr.Column(scale=1):
+            gr.Markdown("### Input")
+            input_prompt = gr.Textbox(
+                label="Original Prompt",
+                placeholder="Enter your text-to-video prompt (e.g., 'A person walking')",
+                lines=3
+            )
+            with gr.Accordion("Advanced Settings", open=False):
+                place_num = gr.Slider(
+                    minimum=1,
+                    maximum=5,
+                    value=2,
+                    step=1,
+                    label="Number of Places to Retrieve",
+                    info="How many related places to search in the knowledge graph"
+                )
+                modifier_num = gr.Slider(
+                    minimum=1,
+                    maximum=10,
+                    value=5,
+                    step=1,
+                    label="Modifiers per Place",
+                    info="How many modifiers to sample from each place"
+                )
+            process_btn = gr.Button("✨ Optimize Prompt", variant="primary", size="lg")
+        with gr.Column(scale=1):
+            gr.Markdown("### Results")
+            output_prompt = gr.Textbox(
+                label="Optimized Prompt",
+                lines=5,
+                show_copy_button=True
+            )
+            retrieved_info = gr.Markdown(
+                label="Retrieved Information"
+            )
+    # Example prompts
+    gr.Examples(
+        examples=[
+            ["A person walking", 2, 5],
+            ["A car driving at night", 2, 5],
+            ["Someone cooking in a kitchen", 2, 5],
+            ["A group of people talking", 2, 5],
+            ["A bird flying", 2, 5],
+            ["Someone sitting and reading", 2, 5],
+        ],
+        inputs=[input_prompt, place_num, modifier_num],
+        outputs=[output_prompt, retrieved_info],
+        fn=process_prompt,
+        cache_examples=False
+    )
+    gr.Markdown("""
+    ---
+    ### About RAPO++
+    RAPO++ is a three-stage framework for text-to-video generation prompt optimization:
+    - **Stage 1 (RAPO)**: Retrieval-Augmented Prompt Optimization using relation graphs *(demonstrated here)*
+    - **Stage 2 (SSPO)**: Self-Supervised Prompt Optimization with test-time iterative refinement
+    - **Stage 3**: LLM fine-tuning on collected feedback data
+    The system is model-agnostic and works with various T2V models (Wan2.1, Open-Sora-Plan, HunyuanVideo, etc.).
+    **Papers:**
+    - [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
+    - [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
+    **Project Page:** [https://whynothaha.github.io/RAPO_plus_github/](https://whynothaha.github.io/RAPO_plus_github/)
+    **GitHub:** [https://github.com/Vchitect/RAPO](https://github.com/Vchitect/RAPO)
+    """)
+    # Event handlers
+    process_btn.click(
+        fn=process_prompt,
+        inputs=[input_prompt, place_num, modifier_num],
+        outputs=[output_prompt, retrieved_info]
+    )
+    input_prompt.submit(
+        fn=process_prompt,
+        inputs=[input_prompt, place_num, modifier_num],
+        outputs=[output_prompt, retrieved_info]
+    )
+# Launch the app
+if __name__ == "__main__":
+    demo.launch()

assets/overview.png ADDED Viewed

Git LFS Details

SHA256: 5c270f0e4f10c4f1558d45341cb6479b3ca56f2db7aee21cd1584fe167d25ff2
Pointer size: 132 Bytes
Size of remote file: 1.63 MB

ckpt/temp.py ADDED Viewed

	@@ -0,0 +1 @@


1	+

data/graph_test1.csv ADDED Viewed

	@@ -0,0 +1,16 @@

+Input,Output,verb_obj_word,scenario_word,place
+two boys standing in a parking lot and talking to each other. One of the boys is wearing a jacket and the other is wearing a vest. They seem to be having a friendly conversation.,"verb_obj_word: ['standing in a parking lot', 'talking to each other'], scenario_word: ['having a friendly conversation'], place: ['standing in a parking lot']","['standing in a parking lot', 'talking to each other']",['having a friendly conversation'],['standing in a parking lot']
+a man and a woman sitting in chairs and talking to each other. The man is wearing a plaid shirt and the woman is wearing glasses. They seem to be discussing something.,"verb_obj_word: ['talking to each other'], scenario_word: ['discussing something'], place: ['sitting in chairs']",['talking to each other'],['discussing something'],['sitting in chairs']
+a woman wearing a hat walking through a dense forest. She is carrying a camera and appears to be taking pictures.,"verb_obj_word: ['walking through a dense forest', 'carrying a camera', 'taking pictures'], scenario_word: [''], place: ['walking through a dense forest']","['walking through a dense forest', 'carrying a camera', 'taking pictures']",[''],['walking through a dense forest']
+a woman working on a pottery wheel in a studio. She appears to be creating a piece of pottery by shaping and molding the clay on the wheel. The woman is focused on her work and seems to be enjoying the process.,"verb_obj_word: ['creating a piece of pottery by shaping and molding the clay on the wheel'], scenario_word: ['enjoying the process'], place: ['working on a pottery wheel in a studio']",['creating a piece of pottery by shaping and molding the clay on the wheel'],['enjoying the process'],['working on a pottery wheel in a studio']
+a close-up view of a statue located on the side of a building. The statue appears to be made of stone and has intricate carvings on it.,"verb_obj_word: ['located on the side of a building'], scenario_word: [''], place: ['a close-up view of a statue located on the side of a building']",['located on the side of a building'],[''],['a close-up view of a statue located on the side of a building']
+"a car driving through a city at night. The car appears to be a sports car, and the driver seems to be enjoying the ride. The city lights can be seen in the background.","verb_obj_word: ['driving through a city at night'], scenario_word: ['enjoying the ride'], place: ['driving a sports car in the city at night']",['driving through a city at night'],['enjoying the ride'],['driving a sports car in the city at night']
+a group of snowboarders performing various tricks and stunts on a snow-covered slope. One of the snowboarders can be seen jumping off a ramp and performing a flip. The video captures the thrill and excitement of snowboarding in the mountains.,"verb_obj_word: ['performing various tricks and stunts', 'jumping off a ramp and performing a flip'], scenario_word: ['the thrill and excitement of snowboarding in the mountains'], place: ['performing tricks and stunts on a snow-covered slope']","['performing various tricks and stunts', 'jumping off a ramp and performing a flip']",['the thrill and excitement of snowboarding in the mountains'],['performing tricks and stunts on a snow-covered slope']
+a green liquid being poured into a beaker. It appears to be a chemical reaction taking place.,"verb_obj_word: ['being poured into a beaker'], scenario_word: ['chemical reaction'], place: ['pouring a green liquid']",['being poured into a beaker'],['chemical reaction'],['pouring a green liquid']
+a man riding a bicycle on a deserted road. He is wearing a yellow shirt and appears to be enjoying the ride. The road is surrounded by trees and there is no traffic in sight.,"verb_obj_word: ['riding a bicycle'], scenario_word: ['enjoying the ride'], place: ['riding a bicycle on a deserted road']",['riding a bicycle'],['enjoying the ride'],['riding a bicycle on a deserted road']
+a woman holding a bunch of balloons while standing in a dark room. The balloons appear to be white in color.,"verb_obj_word: ['holding a bunch of balloons', 'standing in a dark room'], scenario_word: [''], place: ['holding a bunch of balloons in a dark room']","['holding a bunch of balloons', 'standing in a dark room']",[''],['holding a bunch of balloons in a dark room']
+a group of people standing in front of a statue. They are wearing traditional clothing and appear to be posing for a photo. It seems to be a historical or cultural event.,"verb_obj_word: ['posing for a photo'], scenario_word: ['historical or cultural event'], place: ['standing in front of a statue']",['posing for a photo'],['historical or cultural event'],['standing in front of a statue']
+a man wearing a jacket and sitting in front of a screen. He is talking and gesturing with his hands. The background of the video is a purple wall.,"verb_obj_word: ['talking and gesturing with his hands'], scenario_word: [''], place: ['sitting in front of a screen']",['talking and gesturing with his hands'],[''],['sitting in front of a screen']
+a man walking across a bridge in a forest. The man is wearing a blue shirt and appears to be enjoying the scenery around him.,"verb_obj_word: ['walking across a bridge', 'enjoying the scenery'], scenario_word: [''], place: ['walking across a bridge in a forest']","['walking across a bridge', 'enjoying the scenery']",[''],['walking across a bridge in a forest']
+"a car driving on a dirt road in a forest. The car appears to be old and rusty, and it seems to be stuck in the mud. The driver of the car seems to be trying to get it out of the mud.","verb_obj_word: ['driving on a dirt road', 'trying to get it out of the mud'], scenario_word: ['an old and rusty car stuck in the mud'], place: ['a car driving on a dirt road in a forest']","['driving on a dirt road', 'trying to get it out of the mud']",['an old and rusty car stuck in the mud'],['a car driving on a dirt road in a forest']
+a man in a wheelchair who is talking to the camera. He is wearing a black shirt and appears to be in good spirits. There is also a group of people in the background who are dancing.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in good spirits'], place: ['a man in a wheelchair']",['talking to the camera'],['appears to be in good spirits'],['a man in a wheelchair']

data/graph_test2.csv ADDED Viewed

	@@ -0,0 +1,18 @@

+Input,Output,verb_obj_word,scenario_word,place
+a woman wearing a white shirt and sitting in a room. She is looking at the camera and smiling. The room appears to be dimly lit.,"verb_obj_word: ['looking at the camera', 'smiling'], scenario_word: ['appears to be dimly lit'], place: ['sitting in a room']","['looking at the camera', 'smiling']",['appears to be dimly lit'],['sitting in a room']
+"a train traveling along a track in the countryside. The train appears to be moving at a steady pace, and the scenery around it is picturesque. There are no other objects or people visible in the video.","verb_obj_word: ['traveling along a track', 'moving at a steady pace'], scenario_word: ['picturesque scenery'], place: ['traveling in the countryside']","['traveling along a track', 'moving at a steady pace']",['picturesque scenery'],['traveling in the countryside']
+a woman wearing a pink dress standing on a rooftop. She is holding a purse in her hand and appears to be posing for the camera.,"verb_obj_word: ['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera'], scenario_word: [''], place: ['standing on a rooftop']","['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera']",[''],['standing on a rooftop']
+"a woman performing a ballet dance on a stage. She is wearing a pink costume and appears to be practicing her moves. The stage is dimly lit, and there are no other people or objects visible in the background.","verb_obj_word: ['performing a ballet dance', 'wearing a pink costume', 'practicing her moves'], scenario_word: [''], place: ['performing a ballet dance on a stage']","['performing a ballet dance', 'wearing a pink costume', 'practicing her moves']",[''],['performing a ballet dance on a stage']
+"a man standing on a stage, holding a book and speaking to the audience. He is wearing a blue shirt and glasses, and he seems to be giving a lecture or presentation.","verb_obj_word: ['standing on a stage', 'holding a book and speaking to the audience'], scenario_word: ['giving a lecture or presentation'], place: ['standing on a stage']","['standing on a stage', 'holding a book and speaking to the audience']",['giving a lecture or presentation'],['standing on a stage']
+a person wearing a blue jacket walking down a snow-covered road. The person seems to be enjoying the winter weather.,"verb_obj_word: ['walking down a snow-covered road'], scenario_word: ['enjoying the winter weather'], place: ['walking down a snow-covered road']",['walking down a snow-covered road'],['enjoying the winter weather'],['walking down a snow-covered road']
+a group of people riding bicycles on a trail in the mountains. They seem to be enjoying the beautiful scenery and the fresh mountain air.,"verb_obj_word: ['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air'], scenario_word: [''], place: ['riding bicycles on a trail in the mountains']","['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air']",[''],['riding bicycles on a trail in the mountains']
+"a group of people sitting at a desk and working on their computers. They appear to be focused on their tasks, and there is a sense of productivity in the air.","verb_obj_word: ['working on their computers'], scenario_word: ['a sense of productivity in the air'], place: ['sitting at a desk']",['working on their computers'],['a sense of productivity in the air'],['sitting at a desk']
+a group of men sitting at a table and enjoying a meal together. They seem to be having a good time as they eat and chat with each other.,"verb_obj_word: ['eating a meal together', 'chatting with each other'], scenario_word: ['having a good time'], place: ['sitting at a table']","['eating a meal together', 'chatting with each other']",['having a good time'],['sitting at a table']
+a young woman wearing a colorful shirt and headphones walking down a street while listening to music on her phone. She appears to be enjoying the music and the surroundings.,"verb_obj_word: ['walking down a street', 'listening to music on her phone'], scenario_word: ['enjoying the music and the surroundings'], place: ['walking down a street']","['walking down a street', 'listening to music on her phone']",['enjoying the music and the surroundings'],['walking down a street']
+a woman singing and playing the guitar. She is wearing a polka dot dress and appears to be enjoying herself while performing.,"verb_obj_word: ['singing and playing the guitar'], scenario_word: ['enjoying herself while performing'], place: ['singing and playing the guitar']",['singing and playing the guitar'],['enjoying herself while performing'],['singing and playing the guitar']
+a man sitting on a bench and reading a newspaper while drinking a cup of coffee. He seems to be enjoying his time and taking a break from his daily routine.,"verb_obj_word: ['reading a newspaper', 'drinking a cup of coffee'], scenario_word: ['enjoying his time', 'taking a break from his daily routine'], place: ['sitting on a bench']","['reading a newspaper', 'drinking a cup of coffee']","['enjoying his time', 'taking a break from his daily routine']",['sitting on a bench']
+a pair of puppets sitting at a desk and talking to each other. The puppets are dressed in suits and appear to be having a conversation.,"verb_obj_word: ['talking to each other'], scenario_word: ['having a conversation'], place: ['a pair of puppets sitting at a desk']",['talking to each other'],['having a conversation'],['a pair of puppets sitting at a desk']
+a man wearing a suit and tie playing the guitar. He appears to be a professional musician and is playing the guitar with great skill.,"verb_obj_word: ['playing the guitar with great skill'], scenario_word: ['appears to be a professional musician'], place: ['a man wearing a suit and tie playing the guitar']",['playing the guitar with great skill'],['appears to be a professional musician'],['a man wearing a suit and tie playing the guitar']
+"a group of people walking down a busy street. They seem to be in a hurry, and there is a lot of traffic on the road. It appears to be a busy day in the city.","verb_obj_word: ['walking down a busy street', 'being in a hurry'], scenario_word: ['a busy day in the city'], place: ['walking down a busy street']","['walking down a busy street', 'being in a hurry']",['a busy day in the city'],['walking down a busy street']
+a young boy standing in a room and talking to the camera. He is wearing a white shirt and appears to be in a playful mood.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in a playful mood'], place: ['standing in a room']",['talking to the camera'],['appears to be in a playful mood'],['standing in a room']
+a group of people walking down the street with their dogs. They appear to be enjoying a leisurely stroll with their furry companions.,"verb_obj_word: ['walking down the street with their dogs'], scenario_word: ['enjoying a leisurely stroll'], place: ['walking down the street']",['walking down the street with their dogs'],['enjoying a leisurely stroll'],['walking down the street']

data/test_prompts.txt ADDED Viewed

	@@ -0,0 +1,15 @@

+A tranquil tableau of alley
+A tranquil tableau of barn
+a bird and a cat
+a chair and a couch
+a couch and a potted plant
+a potted plant and a tv
+a tv and a laptop
+a laptop and a remote
+a remote and a keyboard
+a keyboard and a cell phone
+a cell phone and a book
+a book and a clock
+A lightning striking atop of eiffel tower, dark clouds in the sky
+a bicycle on the left of a car, front view
+A modern art museum, with colorful paintings

examples/Stage1_RAPO/add_to_graph.py ADDED Viewed

	@@ -0,0 +1,167 @@

+import os
+import json
+import ast
+import torch
+import numpy as np
+import pandas as pd
+import networkx as nx
+from tqdm import tqdm
+from collections import defaultdict
+from sentence_transformers import SentenceTransformer
+def open_dataset(filename):
+    """Load a JSON file and return its content."""
+    with open(filename, 'r') as file:
+        return json.load(file)
+def update_graph_from_csv(
+    csv_file: str,
+    data_prefix_before: str,
+    data_prefix_after: str,
+    model_path: str = './ckpt/all-MiniLM-L6-v2',
+    valid_sentence_log: str = 'valid_sentence.txt'
+):
+    """Update word embeddings, indices, and co-occurrence graphs from new CSV data."""
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    model = SentenceTransformer(model_path)
+    # Load dictionaries
+    verb_to_idx = open_dataset(f'{data_prefix_before}/verb_to_idx.json')
+    scenario_to_idx = open_dataset(f'{data_prefix_before}/scenario_to_idx.json')
+    place_to_idx = open_dataset(f'{data_prefix_before}/place_to_idx.json')
+    # Load sentence index mappings
+    verb_in_sentence = open_dataset(f'{data_prefix_before}/verb_in_sentence.json')
+    scenario_in_sentence = open_dataset(f'{data_prefix_before}/scenario_in_sentence.json')
+    place_in_sentence = open_dataset(f'{data_prefix_before}/place_in_sentence.json')
+    # Load embeddings
+    verb_words_embed = open_dataset(f'{data_prefix_before}/verb_words_embed.json')
+    scenario_words_embed = open_dataset(f'{data_prefix_before}/scenario_words_embed.json')
+    place_embed = open_dataset(f'{data_prefix_before}/place_embed.json')
+    # Load graphs
+    G_place_verb = nx.read_graphml(f'{data_prefix_before}/graph_place_verb.graphml')
+    G_place_scene = nx.read_graphml(f'{data_prefix_before}/graph_place_scene.graphml')
+    # Load meta information
+    data_info = open_dataset(f'{data_prefix_before}/data_info.json')
+    valid_sentence = valid_cnt = data_info['valid_sentence']
+    v_idx, s_idx, p_idx = data_info['v_idx'], data_info['s_idx'], data_info['p_idx']
+    # Cache to avoid redundant encoding
+    verb_cache, scenario_cache, place_cache = {}, {}, {}
+    # Read new CSV data
+    df = pd.read_csv(csv_file)
+    texts = []
+    for i, row in df.iterrows():
+        sentence = row['Input']
+        try:
+            verb_obj_word = ast.literal_eval(row['verb_obj_word'])
+            scenario_word = ast.literal_eval(row['scenario_word'])
+            place = ast.literal_eval(row['place'])
+        except (ValueError, SyntaxError) as e:
+            print(f"Error parsing row {i}: {e}")
+            continue
+        # Sanitize empty lists
+        verb_obj_word = [] if not verb_obj_word or verb_obj_word[0] == '' else verb_obj_word
+        scenario_word = [] if not scenario_word or scenario_word[0] == '' else scenario_word
+        place = [] if not place or place[0] == '' else place
+        texts.append([verb_obj_word, scenario_word, place])
+        if len(verb_obj_word) > 0 and len(scenario_word) > 0 and len(place) > 0:
+            with open(valid_sentence_log, 'a') as f_valid:
+                f_valid.write(f'{sentence}\n')
+            valid_sentence += 1
+    print(f"{len(texts)} sentences have been read from the CSV file.")
+    # Process and update graph/embedding/index info
+    for i in tqdm(range(len(texts))):
+        verbs, scenes, places = texts[i]
+        if len(verbs) and len(scenes) and len(places):
+            for p in places:
+                p = p.strip()
+                for s in scenes:
+                    s = s.strip()
+                    if s not in scenario_cache:
+                        s_emb = model.encode(s)
+                        scenario_cache[s] = s_emb.tolist()
+                        if s not in scenario_to_idx:
+                            scenario_to_idx[s] = s_idx
+                            s_idx += 1
+                            scenario_words_embed.append(scenario_cache[s])
+                    scenario_in_sentence.setdefault(s, []).append(valid_cnt)
+                    G_place_scene.add_edge(p, s)
+                for v in verbs:
+                    v = v.strip()
+                    if v not in verb_cache:
+                        v_emb = model.encode(v)
+                        verb_cache[v] = v_emb.tolist()
+                        if v not in verb_to_idx:
+                            verb_to_idx[v] = v_idx
+                            v_idx += 1
+                            verb_words_embed.append(verb_cache[v])
+                    verb_in_sentence.setdefault(v, []).append(valid_cnt)
+                    G_place_verb.add_edge(p, v)
+                if p not in place_cache:
+                    p_emb = model.encode(p)
+                    place_cache[p] = p_emb.tolist()
+                    if p not in place_to_idx:
+                        place_to_idx[p] = p_idx
+                        p_idx += 1
+                        place_embed.append(place_cache[p])
+                place_in_sentence.setdefault(p, []).append(valid_cnt)
+            valid_cnt += 1
+    print(f"Valid sentences processed: {valid_cnt}")
+    print(f"Original valid sentence count: {valid_sentence}")
+    # Update and save metadata
+    data_info.update({
+        'valid_sentence': valid_sentence,
+        'p_idx': p_idx,
+        's_idx': s_idx,
+        'v_idx': v_idx
+    })
+    os.makedirs(data_prefix_after, exist_ok=True)
+    def save_json(data, name):
+        with open(os.path.join(data_prefix_after, f'{name}.json'), 'w') as f:
+            json.dump(data, f, indent=4)
+        print(f"{name} saved!")
+    # Save all updated data
+    save_json(data_info, 'data_info')
+    save_json(verb_to_idx, 'verb_to_idx')
+    save_json(scenario_to_idx, 'scenario_to_idx')
+    save_json(place_to_idx, 'place_to_idx')
+    save_json(verb_in_sentence, 'verb_in_sentence')
+    save_json(scenario_in_sentence, 'scenario_in_sentence')
+    save_json(place_in_sentence, 'place_in_sentence')
+    save_json(verb_words_embed, 'verb_words_embed')
+    save_json(scenario_words_embed, 'scenario_words_embed')
+    save_json(place_embed, 'place_embed')
+    # Save updated graphs
+    nx.write_graphml(G_place_verb, os.path.join(data_prefix_after, 'graph_place_verb.graphml'))
+    nx.write_graphml(G_place_scene, os.path.join(data_prefix_after, 'graph_place_scene.graphml'))
+    print("Graphs are saved!")
+# Example usage
+if __name__ == "__main__":
+    update_graph_from_csv(
+        csv_file="./data/graph_test2.csv",
+        data_prefix_before="./graph/graph_test1",
+        data_prefix_after="./graph/graph_test2"
+    )

examples/Stage1_RAPO/construct_graph.py ADDED Viewed

	@@ -0,0 +1,151 @@

+import torch
+import numpy as np
+import pandas as pd
+from sentence_transformers import SentenceTransformer
+import networkx as nx
+from tqdm import tqdm
+import json
+import ast
+from collections import defaultdict
+import os
+def process_and_save_graph_data(
+    csv_file_path: str,
+    data_prefix: str,
+    model_path: str = './ckpt/all-MiniLM-L6-v2',
+    valid_sentence_log: str = 'valid_sentence.txt'
+):
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    model = SentenceTransformer(model_path)
+    # Initialize word-to-index dictionaries
+    verb_to_idx, scenario_to_idx, place_to_idx = {}, {}, {}
+    # Track sentence indices containing each word
+    verb_in_sentence = defaultdict(list)
+    scenario_in_sentence = defaultdict(list)
+    place_in_sentence = defaultdict(list)
+    # Store embeddings
+    verb_words_embed, scenario_words_embed, place_embed = [], [], []
+    # Cache for already encoded words
+    verb_cache, scenario_cache, place_cache = {}, {}, {}
+    # Graphs for co-occurrence relationships
+    G_place_scene = nx.Graph()
+    G_place_verb = nx.Graph()
+    data_info = {}
+    texts = []
+    valid_sentence = 0
+    df = pd.read_csv(csv_file_path)
+    # Read and preprocess CSV data
+    for i, row in df.iterrows():
+        sentence = row['Input']
+        try:
+            verb_obj_word = ast.literal_eval(row['verb_obj_word'])
+            scenario_word = ast.literal_eval(row['scenario_word'])
+            place = ast.literal_eval(row['place'])
+        except (ValueError, SyntaxError) as e:
+            print(f"Error parsing row {i}: {e}")
+            continue
+        # Handle empty lists
+        verb_obj_word = [] if not verb_obj_word or verb_obj_word[0] == '' else verb_obj_word
+        scenario_word = [] if not scenario_word or scenario_word[0] == '' else scenario_word
+        place = [] if not place or place[0] == '' else place
+        texts.append([verb_obj_word, scenario_word, place])
+        if len(verb_obj_word) > 0 and len(scenario_word) > 0 and len(place) > 0:
+            with open(valid_sentence_log, 'a') as f_valid:
+                f_valid.write(f'{sentence}\n')
+            valid_sentence += 1
+    print(f"{len(texts)} sentences have been read from the CSV file.")
+    v_idx = s_idx = p_idx = 0
+    valid_cnt = 0
+    # Batch process all tokens and encode them if needed
+    for i in tqdm(range(len(texts))):
+        verbs, scenes, places = texts[i]
+        if len(verbs) and len(scenes) and len(places):
+            for p in places:
+                # Process scene tokens
+                for s in scenes:
+                    if s not in scenario_cache:
+                        s_emb = model.encode(s)
+                        scenario_cache[s] = s_emb.tolist()
+                        if s not in scenario_to_idx:
+                            scenario_to_idx[s] = s_idx
+                            s_idx += 1
+                            scenario_words_embed.append(scenario_cache[s])
+                    scenario_in_sentence[s].append(valid_cnt)
+                    G_place_scene.add_edge(p, s)
+                # Process verb tokens
+                for v in verbs:
+                    if v not in verb_cache:
+                        v_emb = model.encode(v)
+                        verb_cache[v] = v_emb.tolist()
+                        if v not in verb_to_idx:
+                            verb_to_idx[v] = v_idx
+                            v_idx += 1
+                            verb_words_embed.append(verb_cache[v])
+                    verb_in_sentence[v].append(valid_cnt)
+                    G_place_verb.add_edge(p, v)
+                # Process place tokens
+                if p not in place_cache:
+                    p_emb = model.encode(p)
+                    place_cache[p] = p_emb.tolist()
+                    if p not in place_to_idx:
+                        place_to_idx[p] = p_idx
+                        p_idx += 1
+                        place_embed.append(place_cache[p])
+                place_in_sentence[p].append(valid_cnt)
+            valid_cnt += 1
+    assert valid_cnt == valid_sentence
+    data_info.update({
+        'valid_sentence': valid_sentence,
+        'p_idx': p_idx,
+        's_idx': s_idx,
+        'v_idx': v_idx
+    })
+    os.makedirs(data_prefix, exist_ok=True)
+    # Save dictionaries
+    def save_json(data, name):
+        with open(os.path.join(data_prefix, f'{name}.json'), 'w') as f:
+            json.dump(data, f, indent=4)
+        print(f"{name} saved!")
+    save_json(data_info, 'data_info')
+    save_json(verb_to_idx, 'verb_to_idx')
+    save_json(scenario_to_idx, 'scenario_to_idx')
+    save_json(place_to_idx, 'place_to_idx')
+    save_json(verb_in_sentence, 'verb_in_sentence')
+    save_json(scenario_in_sentence, 'scenario_in_sentence')
+    save_json(place_in_sentence, 'place_in_sentence')
+    save_json(verb_words_embed, 'verb_words_embed')
+    save_json(scenario_words_embed, 'scenario_words_embed')
+    save_json(place_embed, 'place_embed')
+    # Save graph files
+    nx.write_graphml(G_place_verb, os.path.join(data_prefix, 'graph_place_verb.graphml'))
+    nx.write_graphml(G_place_scene, os.path.join(data_prefix, 'graph_place_scene.graphml'))
+    print("Graphs are saved!")
+# Example usage
+if __name__ == "__main__":
+    process_and_save_graph_data(
+        csv_file_path="./data/graph_test1.csv",
+        data_prefix="./graph/graph_test1"
+    )

examples/Stage1_RAPO/data/graph_test1.csv ADDED Viewed

	@@ -0,0 +1,16 @@

+Input,Output,verb_obj_word,scenario_word,place
+two boys standing in a parking lot and talking to each other. One of the boys is wearing a jacket and the other is wearing a vest. They seem to be having a friendly conversation.,"verb_obj_word: ['standing in a parking lot', 'talking to each other'], scenario_word: ['having a friendly conversation'], place: ['standing in a parking lot']","['standing in a parking lot', 'talking to each other']",['having a friendly conversation'],['standing in a parking lot']
+a man and a woman sitting in chairs and talking to each other. The man is wearing a plaid shirt and the woman is wearing glasses. They seem to be discussing something.,"verb_obj_word: ['talking to each other'], scenario_word: ['discussing something'], place: ['sitting in chairs']",['talking to each other'],['discussing something'],['sitting in chairs']
+a woman wearing a hat walking through a dense forest. She is carrying a camera and appears to be taking pictures.,"verb_obj_word: ['walking through a dense forest', 'carrying a camera', 'taking pictures'], scenario_word: [''], place: ['walking through a dense forest']","['walking through a dense forest', 'carrying a camera', 'taking pictures']",[''],['walking through a dense forest']
+a woman working on a pottery wheel in a studio. She appears to be creating a piece of pottery by shaping and molding the clay on the wheel. The woman is focused on her work and seems to be enjoying the process.,"verb_obj_word: ['creating a piece of pottery by shaping and molding the clay on the wheel'], scenario_word: ['enjoying the process'], place: ['working on a pottery wheel in a studio']",['creating a piece of pottery by shaping and molding the clay on the wheel'],['enjoying the process'],['working on a pottery wheel in a studio']
+a close-up view of a statue located on the side of a building. The statue appears to be made of stone and has intricate carvings on it.,"verb_obj_word: ['located on the side of a building'], scenario_word: [''], place: ['a close-up view of a statue located on the side of a building']",['located on the side of a building'],[''],['a close-up view of a statue located on the side of a building']
+"a car driving through a city at night. The car appears to be a sports car, and the driver seems to be enjoying the ride. The city lights can be seen in the background.","verb_obj_word: ['driving through a city at night'], scenario_word: ['enjoying the ride'], place: ['driving a sports car in the city at night']",['driving through a city at night'],['enjoying the ride'],['driving a sports car in the city at night']
+a group of snowboarders performing various tricks and stunts on a snow-covered slope. One of the snowboarders can be seen jumping off a ramp and performing a flip. The video captures the thrill and excitement of snowboarding in the mountains.,"verb_obj_word: ['performing various tricks and stunts', 'jumping off a ramp and performing a flip'], scenario_word: ['the thrill and excitement of snowboarding in the mountains'], place: ['performing tricks and stunts on a snow-covered slope']","['performing various tricks and stunts', 'jumping off a ramp and performing a flip']",['the thrill and excitement of snowboarding in the mountains'],['performing tricks and stunts on a snow-covered slope']
+a green liquid being poured into a beaker. It appears to be a chemical reaction taking place.,"verb_obj_word: ['being poured into a beaker'], scenario_word: ['chemical reaction'], place: ['pouring a green liquid']",['being poured into a beaker'],['chemical reaction'],['pouring a green liquid']
+a man riding a bicycle on a deserted road. He is wearing a yellow shirt and appears to be enjoying the ride. The road is surrounded by trees and there is no traffic in sight.,"verb_obj_word: ['riding a bicycle'], scenario_word: ['enjoying the ride'], place: ['riding a bicycle on a deserted road']",['riding a bicycle'],['enjoying the ride'],['riding a bicycle on a deserted road']
+a woman holding a bunch of balloons while standing in a dark room. The balloons appear to be white in color.,"verb_obj_word: ['holding a bunch of balloons', 'standing in a dark room'], scenario_word: [''], place: ['holding a bunch of balloons in a dark room']","['holding a bunch of balloons', 'standing in a dark room']",[''],['holding a bunch of balloons in a dark room']
+a group of people standing in front of a statue. They are wearing traditional clothing and appear to be posing for a photo. It seems to be a historical or cultural event.,"verb_obj_word: ['posing for a photo'], scenario_word: ['historical or cultural event'], place: ['standing in front of a statue']",['posing for a photo'],['historical or cultural event'],['standing in front of a statue']
+a man wearing a jacket and sitting in front of a screen. He is talking and gesturing with his hands. The background of the video is a purple wall.,"verb_obj_word: ['talking and gesturing with his hands'], scenario_word: [''], place: ['sitting in front of a screen']",['talking and gesturing with his hands'],[''],['sitting in front of a screen']
+a man walking across a bridge in a forest. The man is wearing a blue shirt and appears to be enjoying the scenery around him.,"verb_obj_word: ['walking across a bridge', 'enjoying the scenery'], scenario_word: [''], place: ['walking across a bridge in a forest']","['walking across a bridge', 'enjoying the scenery']",[''],['walking across a bridge in a forest']
+"a car driving on a dirt road in a forest. The car appears to be old and rusty, and it seems to be stuck in the mud. The driver of the car seems to be trying to get it out of the mud.","verb_obj_word: ['driving on a dirt road', 'trying to get it out of the mud'], scenario_word: ['an old and rusty car stuck in the mud'], place: ['a car driving on a dirt road in a forest']","['driving on a dirt road', 'trying to get it out of the mud']",['an old and rusty car stuck in the mud'],['a car driving on a dirt road in a forest']
+a man in a wheelchair who is talking to the camera. He is wearing a black shirt and appears to be in good spirits. There is also a group of people in the background who are dancing.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in good spirits'], place: ['a man in a wheelchair']",['talking to the camera'],['appears to be in good spirits'],['a man in a wheelchair']

examples/Stage1_RAPO/data/graph_test2.csv ADDED Viewed

	@@ -0,0 +1,18 @@

+Input,Output,verb_obj_word,scenario_word,place
+a woman wearing a white shirt and sitting in a room. She is looking at the camera and smiling. The room appears to be dimly lit.,"verb_obj_word: ['looking at the camera', 'smiling'], scenario_word: ['appears to be dimly lit'], place: ['sitting in a room']","['looking at the camera', 'smiling']",['appears to be dimly lit'],['sitting in a room']
+"a train traveling along a track in the countryside. The train appears to be moving at a steady pace, and the scenery around it is picturesque. There are no other objects or people visible in the video.","verb_obj_word: ['traveling along a track', 'moving at a steady pace'], scenario_word: ['picturesque scenery'], place: ['traveling in the countryside']","['traveling along a track', 'moving at a steady pace']",['picturesque scenery'],['traveling in the countryside']
+a woman wearing a pink dress standing on a rooftop. She is holding a purse in her hand and appears to be posing for the camera.,"verb_obj_word: ['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera'], scenario_word: [''], place: ['standing on a rooftop']","['standing on a rooftop', 'holding a purse in her hand', 'posing for the camera']",[''],['standing on a rooftop']
+"a woman performing a ballet dance on a stage. She is wearing a pink costume and appears to be practicing her moves. The stage is dimly lit, and there are no other people or objects visible in the background.","verb_obj_word: ['performing a ballet dance', 'wearing a pink costume', 'practicing her moves'], scenario_word: [''], place: ['performing a ballet dance on a stage']","['performing a ballet dance', 'wearing a pink costume', 'practicing her moves']",[''],['performing a ballet dance on a stage']
+"a man standing on a stage, holding a book and speaking to the audience. He is wearing a blue shirt and glasses, and he seems to be giving a lecture or presentation.","verb_obj_word: ['standing on a stage', 'holding a book and speaking to the audience'], scenario_word: ['giving a lecture or presentation'], place: ['standing on a stage']","['standing on a stage', 'holding a book and speaking to the audience']",['giving a lecture or presentation'],['standing on a stage']
+a person wearing a blue jacket walking down a snow-covered road. The person seems to be enjoying the winter weather.,"verb_obj_word: ['walking down a snow-covered road'], scenario_word: ['enjoying the winter weather'], place: ['walking down a snow-covered road']",['walking down a snow-covered road'],['enjoying the winter weather'],['walking down a snow-covered road']
+a group of people riding bicycles on a trail in the mountains. They seem to be enjoying the beautiful scenery and the fresh mountain air.,"verb_obj_word: ['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air'], scenario_word: [''], place: ['riding bicycles on a trail in the mountains']","['riding bicycles on a trail in the mountains', 'enjoying the beautiful scenery', 'enjoying the fresh mountain air']",[''],['riding bicycles on a trail in the mountains']
+"a group of people sitting at a desk and working on their computers. They appear to be focused on their tasks, and there is a sense of productivity in the air.","verb_obj_word: ['working on their computers'], scenario_word: ['a sense of productivity in the air'], place: ['sitting at a desk']",['working on their computers'],['a sense of productivity in the air'],['sitting at a desk']
+a group of men sitting at a table and enjoying a meal together. They seem to be having a good time as they eat and chat with each other.,"verb_obj_word: ['eating a meal together', 'chatting with each other'], scenario_word: ['having a good time'], place: ['sitting at a table']","['eating a meal together', 'chatting with each other']",['having a good time'],['sitting at a table']
+a young woman wearing a colorful shirt and headphones walking down a street while listening to music on her phone. She appears to be enjoying the music and the surroundings.,"verb_obj_word: ['walking down a street', 'listening to music on her phone'], scenario_word: ['enjoying the music and the surroundings'], place: ['walking down a street']","['walking down a street', 'listening to music on her phone']",['enjoying the music and the surroundings'],['walking down a street']
+a woman singing and playing the guitar. She is wearing a polka dot dress and appears to be enjoying herself while performing.,"verb_obj_word: ['singing and playing the guitar'], scenario_word: ['enjoying herself while performing'], place: ['singing and playing the guitar']",['singing and playing the guitar'],['enjoying herself while performing'],['singing and playing the guitar']
+a man sitting on a bench and reading a newspaper while drinking a cup of coffee. He seems to be enjoying his time and taking a break from his daily routine.,"verb_obj_word: ['reading a newspaper', 'drinking a cup of coffee'], scenario_word: ['enjoying his time', 'taking a break from his daily routine'], place: ['sitting on a bench']","['reading a newspaper', 'drinking a cup of coffee']","['enjoying his time', 'taking a break from his daily routine']",['sitting on a bench']
+a pair of puppets sitting at a desk and talking to each other. The puppets are dressed in suits and appear to be having a conversation.,"verb_obj_word: ['talking to each other'], scenario_word: ['having a conversation'], place: ['a pair of puppets sitting at a desk']",['talking to each other'],['having a conversation'],['a pair of puppets sitting at a desk']
+a man wearing a suit and tie playing the guitar. He appears to be a professional musician and is playing the guitar with great skill.,"verb_obj_word: ['playing the guitar with great skill'], scenario_word: ['appears to be a professional musician'], place: ['a man wearing a suit and tie playing the guitar']",['playing the guitar with great skill'],['appears to be a professional musician'],['a man wearing a suit and tie playing the guitar']
+"a group of people walking down a busy street. They seem to be in a hurry, and there is a lot of traffic on the road. It appears to be a busy day in the city.","verb_obj_word: ['walking down a busy street', 'being in a hurry'], scenario_word: ['a busy day in the city'], place: ['walking down a busy street']","['walking down a busy street', 'being in a hurry']",['a busy day in the city'],['walking down a busy street']
+a young boy standing in a room and talking to the camera. He is wearing a white shirt and appears to be in a playful mood.,"verb_obj_word: ['talking to the camera'], scenario_word: ['appears to be in a playful mood'], place: ['standing in a room']",['talking to the camera'],['appears to be in a playful mood'],['standing in a room']
+a group of people walking down the street with their dogs. They appear to be enjoying a leisurely stroll with their furry companions.,"verb_obj_word: ['walking down the street with their dogs'], scenario_word: ['enjoying a leisurely stroll'], place: ['walking down the street']",['walking down the street with their dogs'],['enjoying a leisurely stroll'],['walking down the street']

examples/Stage1_RAPO/data/test_prompts.txt ADDED Viewed

	@@ -0,0 +1,15 @@

+A tranquil tableau of alley
+A tranquil tableau of barn
+a bird and a cat
+a chair and a couch
+a couch and a potted plant
+a potted plant and a tv
+a tv and a laptop
+a laptop and a remote
+a remote and a keyboard
+a keyboard and a cell phone
+a cell phone and a book
+a book and a clock
+A lightning striking atop of eiffel tower, dark clouds in the sky
+a bicycle on the left of a car, front view
+A modern art museum, with colorful paintings

examples/Stage1_RAPO/refactoring.py ADDED Viewed

	@@ -0,0 +1,96 @@

+import os
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from tqdm import tqdm
+import argparse
+def get_output(prompt):
+    template = (
+        'Refine the sentence: "{}" to contain subject description, action, scene description. '
+        'Transform user-entered text into a concise, detailed description with a specific structure. '
+        '(Optional: camera language, light and shadow, atmosphere) and conceive some additional actions to make the sentence more dynamic. '
+        'Make sure it is a fluent sentence, not nonsense.'
+    )
+    prompt_text = template.format(prompt)
+    messages = [
+        {"role": "system", "content": "You are a caption refiner."},
+        {"role": "user",   "content": prompt_text}
+    ]
+    # prepare inputs
+    input_ids = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    model_inputs = tokenizer([input_ids], return_tensors="pt").to(device)
+    # generate
+    generated_ids = model.generate(
+        model_inputs.input_ids,
+        max_new_tokens=512
+    )
+    # strip prompt prefix
+    generated_ids = [
+        output_ids[len(input_ids):]
+        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+    ]
+    responses = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
+    return responses[0]
+def get_start_index(txt_path):
+    """
+    Read existing output file to determine resume index (number of lines).
+    """
+    if os.path.exists(txt_path):
+        with open(txt_path, 'r', encoding='utf-8') as f:
+            return len(f.readlines())
+    return 0
+def main():
+    # determine from which line to resume
+    start_idx = get_start_index(output_path)
+    # read all prompts
+    with open(input_path, 'r', encoding='utf-8') as f:
+        prompts = [line.strip() for line in f if line.strip()]
+    # open output file for append
+    with open(output_path, 'a', encoding='utf-8') as outf:
+        for i in tqdm(range(start_idx, len(prompts)), desc="Refining prompts"):
+            prompt = prompts[i]
+            try:
+                refined = get_output(prompt)
+            except Exception as e:
+                refined = f"[ERROR] {e}"
+            outf.write(refined + '\n')
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Refine captions and output to text file')
+    parser.add_argument(
+        '--mode_path', type=str,
+        default='llama3_8B_lora_merged_cn',
+        help='Model path or identifier'
+    )
+    parser.add_argument(
+        '--input_word_augmentation', type=str,
+        default='./output/refactor/merging_reuslts.txt',
+        help='Path to input text prompts'
+    )
+    parser.add_argument(
+        '--output_refactoring', type=str,
+        default='./output/refactor/refactoring_results.txt',
+        help='Path to output text file'
+    )
+    args = parser.parse_args()
+    # setup
+    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+    tokenizer = AutoTokenizer.from_pretrained(args.mode_path, trust_remote_code=True)
+    model     = AutoModelForCausalLM.from_pretrained(
+        args.mode_path,
+        trust_remote_code=True
+    ).to(device).eval()
+    input_path  = args.input_word_augmentation
+    output_path = args.output_refactoring
+    main()

examples/Stage1_RAPO/refactoring.sh ADDED Viewed

	@@ -0,0 +1,4 @@

+python refactoring.py \
+--mode_path "../../ckpt/llama3_1_instruct_lora_rewrite" \
+--input_word_augmentation "./output/refactor/merging_reuslts.txt" \
+--output_refactoring "./output/refactor/refactoring_reuslts.txt" \

examples/Stage1_RAPO/retrieve_modifiers.py ADDED Viewed

	@@ -0,0 +1,116 @@

+import random
+import torch
+import json
+import networkx as nx
+from sentence_transformers import SentenceTransformer
+from torch.nn.functional import cosine_similarity
+from tqdm import tqdm
+import csv
+import os
+import argparse
+model = SentenceTransformer('./ckpt/all-MiniLM-L6-v2')
+def open_dataset(filename):
+    with open(filename, 'r') as file:
+        data = json.load(file)
+    return data
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Retrieve graph construction.')
+    parser.add_argument('--graph_data_dir', type=str, required=True)
+    parser.add_argument('--output_filename', type=str, required=True)
+    # Get command line arguments
+    args = parser.parse_args()
+    place_num = 3
+    verb_num = 5
+    topk_num = verb_num
+    Retrieve_num = 30
+    # Setting variables via command line arguments
+    graph_data_dir = args.graph_data_dir
+    output_filename = args.output_filename
+    test_file = f'./data/test_prompts.txt'
+    output_txt = f'./output/retrieve_words/{output_filename}.txt'
+    output_csv = f'./output/retrieve_words/{output_filename}.csv'
+    # load verb_to_idx, scenario_to_idx, place_to_idx，verb_in_sentence, scenario_in_sentence, place_in_sentence，verb_words_embed, scenario_words_embed, place_embed
+    verb_to_idx = open_dataset(f'{graph_data_dir}/verb_to_idx.json')
+    scenario_to_idx = open_dataset(f'{graph_data_dir}/scenario_to_idx.json')
+    place_to_idx = open_dataset(f'{graph_data_dir}/place_to_idx.json')
+    idx_to_place = {v: k for k, v in place_to_idx.items()}
+    verb_in_sentence = open_dataset(f'{graph_data_dir}/verb_in_sentence.json')
+    scenario_in_sentence = open_dataset(f'{graph_data_dir}/scenario_in_sentence.json')
+    place_in_sentence = open_dataset(f'{graph_data_dir}/place_in_sentence.json')
+    verb_words_embed = open_dataset(f'{graph_data_dir}/verb_words_embed.json')
+    scenario_words_embed = open_dataset(f'{graph_data_dir}/scenario_words_embed.json')
+    place_embed = open_dataset(f'{graph_data_dir}/place_embed.json')
+    # Loading graph structure
+    G_place_verb = nx.read_graphml(f'{graph_data_dir}/graph_place_verb.graphml')
+    G_place_scene = nx.read_graphml(f'{graph_data_dir}/graph_place_scene.graphml')
+    output_folder = os.path.dirname(output_txt)
+    os.makedirs(output_folder, exist_ok=True)
+    output_csv_folder = os.path.dirname(output_csv)
+    os.makedirs(output_csv_folder, exist_ok=True)
+    verb_obj_word, scenario_word, place = "", "", ""
+    with open(test_file, 'r') as f:
+        total_line = sum(1 for _ in f)
+        f.seek(0)
+        for i, line in enumerate(tqdm(f.readlines(), total=total_line)):
+            sentence = line.replace('\n', "")
+            potential_action, potential_sub_atmos, potential_scene = [], [], []
+            sentence_emb = model.encode(sentence)
+            sim = cosine_similarity(torch.tensor(sentence_emb).unsqueeze(0), torch.tensor(place_embed))
+            top1_idx = torch.topk(sim, place_num).indices
+            for idx in top1_idx.numpy().tolist():
+                verb_neighbors = list(G_place_verb.neighbors(idx_to_place[idx]))
+                scene_neighbors = list(G_place_scene.neighbors(idx_to_place[idx]))
+                verb_random = random.sample(verb_neighbors, verb_num) if len(verb_neighbors) >= verb_num else verb_neighbors
+                scene_random = random.sample(scene_neighbors, verb_num) if len(scene_neighbors) >= verb_num else scene_neighbors
+                v_random_embed = []
+                for v_random in verb_random:
+                    if v_random in verb_to_idx:
+                        v_random_embed.append(verb_words_embed[verb_to_idx[v_random]])
+                if len(v_random_embed) > 0:
+                    v_sim = cosine_similarity(torch.tensor(v_random_embed), torch.tensor(sentence_emb).unsqueeze(0))
+                    v_random_candidate = torch.topk(v_sim, topk_num if len(v_sim) >= topk_num else len(v_sim)).indices.numpy().tolist()
+                    potential_action += [verb_random[i] for i in v_random_candidate]
+                s_random_embed, p_random_embed = [], []
+                place_cross_word = []
+                for s_random in scene_random:
+                    if s_random in scenario_to_idx:
+                        s_random_embed.append(scenario_words_embed[scenario_to_idx[s_random]])
+                    else:
+                        p_random_embed.append(place_embed[place_to_idx[s_random]])
+                        place_cross_word.append(s_random)
+                if len(s_random_embed) > 0:
+                    s_sim = cosine_similarity(torch.tensor(s_random_embed), torch.tensor(sentence_emb).unsqueeze(0))
+                    s_random_candidate = torch.topk(s_sim, topk_num if len(s_sim) >= topk_num else len(s_sim)).indices.numpy().tolist()
+                if len(place_cross_word) > 0:
+                    p_sim = cosine_similarity(torch.tensor(p_random_embed), torch.tensor(sentence_emb).unsqueeze(0))
+                    p_random_candidate = torch.topk(p_sim, 1 if len(p_sim) >= 1 else len(p_sim)).indices.numpy().tolist()
+                    potential_scene += [place_cross_word[k] for k in p_random_candidate]
+                potential_sub_atmos += [scene_random[j] for j in s_random_candidate]
+                potential_scene.append(idx_to_place[idx])
+            word_set = set(potential_action + potential_sub_atmos + potential_scene)
+            with open(output_txt, 'a') as f_txt:
+                f_txt.write(f'{sentence}. {", ".join(word_set)}\n')
+            with open(output_csv, 'a', encoding='utf-8', newline="") as fc:
+                writer = csv.writer(fc)
+                if i < Retrieve_num:
+                    writer.writerow(['sentence', 'potential_action', 'potential_sub_atmos', 'potential_scene'])
+                writer.writerow([sentence, set(potential_action), set(potential_sub_atmos), set(potential_scene)])
+    print("Retrieve process is finished!")

examples/Stage1_RAPO/retrieve_modifiers.sh ADDED Viewed

	@@ -0,0 +1,3 @@

+python retrieve_modifiers.py \
+--graph_data_dir "relation_graph/graph_data" \
+--output_filename "retrieved_words" \

examples/Stage1_RAPO/rewrite_via_instruction.py ADDED Viewed

	@@ -0,0 +1,60 @@

+import pandas as pd
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+import re
+import argparse
+import os
+def extract_output(model_output):
+    match = re.search(r'Final Output:\s*(.*)', model_output, re.IGNORECASE)
+    if match:
+        return match.group(1).strip()
+    return model_output.strip()
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Process text and generate output.')
+    parser.add_argument('--input_path', type=str, required=True)
+    parser.add_argument('--output_path', type=str, required=True)
+    args = parser.parse_args()
+    input_file_path = args.input_path
+    output_path = args.output_path
+    model_id = './ckpt/Mistral-7B-Instruct-v0.3/'
+    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
+    tokenizer.pad_token = tokenizer.eos_token
+    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    model.to(device)
+    output_data = []
+    with open(input_file_path, "r") as infile:
+        for line in infile:
+            The_current_input = line.strip()
+            prompts = [f"""Please limit your output to 30 words or less. Suppose you are a Text Aligner, and your role is to transform user-entered text into a concise, detailed description with a specific structure. You should ensure that the output text is coherent, contextually relevant, and follows the same structure as the examples provided. The output should frequently use common phrases like 'She is,' 'appears to,' 'There are,' 'seems to,' 'It appears to,' and 'They are.' The sentence structure should maintain clarity and be specific about locations and actions. The output should incorporate high-frequency words such as: appears, wearing, woman, enjoying, man, view, sitting, group, seems, seem, people, young, standing, time, beautiful, white, closeup, holding, aerial, shirt, video, appear, surrounded, playing, together, peaceful, front, background, focused, using, working, table, good, black, person, serene, others, sky, walking, trees, around, room, city, water, visible, green, blue, captures, camera, something.
+Examples provided:
+(1) input: A child plays with toys. output: a young child playing with toys in a room. The child is sitting on the floor surrounded by various toys and appears to be having fun.
+(2) input: A bear explores its surroundings. output: a black bear walking around in a grassy area. The bear appears to be exploring its surroundings and seems to be curious about its environment.
+(3) input: A woman ties a string around an orange. output: a woman sitting at a table and tying a string around an orange. She is wearing a brown robe and appears to be preparing a gift.
+(4) input: A doctor performs a procedure on a patient. output: a man wearing a surgical gown and mask standing next to a patient in a hospital room. The man is a doctor who is performing a procedure on the patient.
+(5) input: A monkey looks around. output: a monkey sitting on a tree branch. The monkey appears to be looking around and seems to be curious about its surroundings.
+(6) input: People discuss something at a table. output: a group of people gathered around a table, discussing something together. It appears to be a business meeting or a brainstorming session. The people in the video are engaged in a conversation and seem to be focused on the topic at hand.
+(7) input: A girl holds a cardboard star. output: a young girl wearing a blue dress and holding a cardboard star. She is standing in front of a white background and appears to be smiling.
+(8) input: Young people swim in a pool. output: a group of young people having fun in a swimming pool. They are all wearing swimsuits and enjoying themselves. One of the people in the pool is wearing a bikini.
+(9) input: A woman writes with a pen. output: a close-up shot of a woman's hand holding a pen and writing on a piece of paper. The woman is wearing a ring on her finger and appears to be focused on her work.
+The current input: {The_current_input} , Final Output:"""
+            ]
+            inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(device).input_ids
+            attention_mask = torch.ones(inputs.shape, dtype=torch.long, device=device)
+            outputs = model.generate(inputs, max_new_tokens=500, pad_token_id=tokenizer.eos_token_id, attention_mask=attention_mask)
+            output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+            for output in output_text:
+                extracted_result = extract_output(output)
+                output_data.append({"output": extracted_result})
+                with open(output_path, 'a', encoding='utf-8') as txt_file:
+                    txt_file.write(extracted_result + '\n')
+                    print(f"extracted_result:{extracted_result}")
+    print(f"Rewrite outputs saved to {output_path}")

examples/Stage1_RAPO/rewrite_via_instruction.sh ADDED Viewed

	@@ -0,0 +1,5 @@

+srun -p video-aigc-4 -n1 --gres=gpu:1 --cpus-per-task=8 --quotatype=spot --async \
+-N1 --job-name=python \
+python rewrite_via_instruction.py \
+--input_path "./data/test_prompts.txt" \
+--output_path "./output/rewrite_via_instruction/test_prompts.txt" \

examples/Stage1_RAPO/word_augment.py ADDED Viewed

	@@ -0,0 +1,254 @@

+import random
+import torch
+import json
+import networkx as nx
+from sentence_transformers import SentenceTransformer
+from torch.nn.functional import cosine_similarity
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from tqdm import tqdm
+import csv
+import os
+import argparse
+import pandas as pd
+import string
+import re
+import numpy as np
+def open_dataset(filename):
+    """Load a JSON file"""
+    with open(filename, 'r') as file:
+        data = json.load(file)
+    return data
+# Remove punctuation
+def remove_punctuation(text):
+    return text.translate(str.maketrans('', '', string.punctuation))
+# Compute similarity between two texts
+def compute_similarity(text1, text2, model, embeddings_cache):
+    if text1 not in embeddings_cache:
+        embeddings_cache[text1] = model.encode(text1)
+    if text2 not in embeddings_cache:
+        embeddings_cache[text2] = model.encode(text2)
+    embedding1 = torch.tensor(embeddings_cache[text1]).unsqueeze(0)
+    embedding2 = torch.tensor(embeddings_cache[text2]).unsqueeze(0)
+    similarity = cosine_similarity(embedding1, embedding2).item()
+    return similarity
+# Extract similarity score from string
+def extract_sim_score(text):
+    match = re.search(r'sim_score=([0-9.]+)', text)
+    if match:
+        return float(match.group(1))
+    return 0.0
+# Extract the final output from model output
+def extract_output(model_output):
+    # Assume output contains 'Final Output: ' followed by the desired result
+    match = re.search(r'Final Output:\s*(.*)', model_output, re.IGNORECASE)
+    if match:
+        return match.group(1).strip()
+    return model_output.strip()
+# Get max number of columns in a CSV file
+def get_max_columns(input_csv_path):
+    max_columns = 0
+    with open(input_csv_path, 'r', encoding='utf-8') as f:
+        reader = csv.reader(f)
+        for row in reader:
+            max_columns = max(max_columns, len(row))
+    return max_columns
+### Similarity Ranking
+def Similarity_Ranking(input_txt, simrank_path, SentenceTransformer_model):
+    """
+    input_txt: each line formatted as 'prefix.suffix1,suffix2,...'
+    simrank_path: path to output CSV
+    SentenceTransformer_model: model for computing similarity
+    """
+    embeddings_cache = {}
+    with open(input_txt, 'r', encoding='utf-8') as f, \
+         open(simrank_path, 'w', encoding='utf-8', newline='') as csv_file:
+        writer = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            # 1. Split at the first '.'
+            idx = line.find('.')
+            if idx == -1:
+                # Skip if no '.' is found
+                continue
+            first_part = line[:idx+1].strip()
+            rest = line[idx+1:].strip()
+            # 2. Split suffixes by comma and remove whitespace
+            other_parts = [part.strip() for part in rest.split(',') if part.strip()]
+            # 3. Combine into parts list
+            parts = [first_part] + other_parts
+            # 4. Remove punctuation from parts
+            clean_parts_no_punct = [parts[0]] + [
+                remove_punctuation(part) for part in parts[1:]
+            ]
+            # 5. Compute similarity scores
+            processed_with_similarity = []
+            before_period = clean_parts_no_punct[0]
+            for part in clean_parts_no_punct[1:]:
+                sim_score = compute_similarity(
+                    before_period, part, SentenceTransformer_model, embeddings_cache
+                )
+                sim_score = round(sim_score, 4)
+                processed_with_similarity.append((part, sim_score))
+                print(f"processed_part: {part}, sim_score={sim_score}")
+            # 6. Sort by similarity descending
+            processed_with_similarity_sorted = sorted(
+                processed_with_similarity,
+                key=lambda x: x[1],
+                reverse=True
+            )
+            # 7. Format output fields
+            formatted_processed = [first_part]
+            for part, sim in processed_with_similarity_sorted:
+                formatted_processed.append(f"{part}, sim_score={sim}")
+            # 8. Write a row to the output CSV
+            csv_line = [first_part, rest] + formatted_processed
+            writer.writerow(csv_line)
+    print(f"Similarity Ranking completed, results saved to {simrank_path}")
+    return simrank_path
+### Similarity Ranking
+### Iteractively Merging
+def Iteractively_Merging(simrank_path, merging_path, selected_modifiers, SIMILARITY_THRESHOLD):
+    max_columns = get_max_columns(simrank_path)
+    print(f"Max columns: {max_columns}")
+    simrank_file = pd.read_csv(simrank_path, header=None, names=[f"col{i}" for i in range(max_columns)], encoding='utf-8')
+    output_data = []
+    for index, row in simrank_file.iterrows():
+        try:
+            original_text = row.iloc[1]
+            modifiers = row.iloc[2:]
+            modifiers_with_scores = []
+            for modifier in modifiers:
+                if pd.isna(modifier):
+                    continue
+                parts = modifier.split(", sim_score=")
+                if len(parts) == 2:
+                    mod_text = parts[0].strip()
+                    sim_score = extract_sim_score(modifier)
+                    if sim_score >= SIMILARITY_THRESHOLD:
+                        modifiers_with_scores.append((mod_text, sim_score))
+            modifiers_with_scores_sorted = sorted(modifiers_with_scores, key=lambda x: x[1], reverse=True)
+            current_description = original_text
+            processed_outputs = []
+            for modifier, sim_score in modifiers_with_scores_sorted:
+                if sim_score < SIMILARITY_THRESHOLD:
+                    print(f"Row {index+1}: sim_score={sim_score} below threshold {SIMILARITY_THRESHOLD}, stopping inference.")
+                    break
+                try:
+                    prompt = f"""Suppose you are a Text Rewriter, and your role is to transform user-entered text into a concise, detailed description. You receive two inputs from the user: description body and relevant modifiers. Your task is to enrich the description body with relevant modifiers while retaining the description body. You should ensure that the output text is coherent, contextually relevant, and follows the same structure as the examples provided.
+Examples provided:
+(1) Description body: a group of dancers performing a ballet routine in a studio. The dancers are wearing ballet shoes.
+Relevant modifiers: dressed in black leotards.
+Output: a group of dancers performing a ballet routine in a studio. The dancers are wearing ballet shoes and are dressed in black leotards.
+(2) Description body: a woman sitting at a desk and working on her laptop.
+Relevant modifiers: appears to be focused on her work.
+Output: a woman sitting at a desk and working on her laptop, appears to be focused on her work.
+(3) Description body: they seem to be having a good time and enjoying each other's company.
+Relevant modifiers: a casual and relaxed setting.
+Output: They seem to be having a good time and enjoying each other's company. It appears to be a casual and relaxed setting.
+(4) Description body: a woman preparing a delicious meal in her kitchen.
+Relevant modifiers: cutting various fruits and vegetables on a cutting board.
+Output: a woman preparing a delicious meal in her kitchen. She is seen cutting various fruits and vegetables on a cutting board and placing them on a tray.
+The Description body: {current_description}, Relevant modifiers: {modifier}, Final Output:"""
+                    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
+                    attention_mask = torch.ones(inputs.shape, dtype=torch.long, device=device)
+                    outputs = merging_model.generate(
+                        inputs,
+                        max_new_tokens=500,
+                        pad_token_id=tokenizer.eos_token_id,
+                        attention_mask=attention_mask
+                    )
+                    output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+                    extracted_output = extract_output(output_text)
+                    processed_output = f"{modifier}, sim_score={sim_score}"
+                    processed_outputs.append(processed_output)
+                    current_description = extracted_output
+                    print(f"Row {index+1}, Step with modifier '{modifier}': {processed_output}")
+                except Exception as e:
+                    print(f"Error during inference on row {index+1}: {e}")
+                    continue
+            # Save final description
+            output_row = {
+                "original_text": original_text,
+                "before_period": row.iloc[0] if len(row) > 0 else "",
+                "final_description": current_description,
+                "processed_outputs": "; ".join(processed_outputs)
+            }
+            output_data.append(output_row)
+            with open(merging_path, 'a', encoding='utf-8') as desc_file:
+                desc_file.write(current_description + '\n')
+        except Exception as e:
+            print(f"Error processing row {index+1}: {e}")
+            continue
+    output_df = pd.DataFrame(output_data)
+    output_df.to_csv(selected_modifiers, index=False, encoding='utf-8')
+    return merging_path, selected_modifiers, output_data
+### Iteractively Merging
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='Retrieve graph construction.')
+    parser.add_argument('--retrieved_words', type=str, default="./output/all_dimension.csv", help='path of retrieved modifiers')
+    parser.add_argument('--pretrained_SentenceTransformer', type=str, default='./ckpt/all-MiniLM-L6-v2', help='SentenceTransformer model path')
+    parser.add_argument('--pretrained_merging', type=str, default='./ckpt/Mistral-7B-Instruct-v0.3/', help='merging model path')
+    parser.add_argument('--input_path', type=str, required=True, help='input text file path')
+    parser.add_argument('--output_simrank', type=str, required=True, help='output ranking CSV')
+    parser.add_argument('--output_selected_modifiers', type=str, required=True, help='output selected modifiers CSV')
+    parser.add_argument('--output_interactive_merging', type=str, required=True, help='results after interactive merging')
+    parser.add_argument('--SIMILARITY_THRESHOLD', type=float, required=True, help='similarity threshold')
+    args = parser.parse_args()
+    SentenceTransformer_model = SentenceTransformer(args.pretrained_SentenceTransformer)
+    merging_model_path = args.pretrained_merging
+    input_path = args.input_path
+    retrieved_words = args.retrieved_words
+    simrank_path = args.output_simrank
+    selected_modifiers_path = args.output_selected_modifiers
+    merging_path = args.output_interactive_merging
+    SIMILARITY_THRESHOLD = args.SIMILARITY_THRESHOLD
+    output_folder = os.path.dirname(simrank_path)
+    os.makedirs(output_folder, exist_ok=True)
+    tokenizer = AutoTokenizer.from_pretrained(merging_model_path, padding_side="left")
+    tokenizer.pad_token = tokenizer.eos_token
+    merging_model = AutoModelForCausalLM.from_pretrained(merging_model_path, torch_dtype=torch.bfloat16, device_map="auto", offload_state_dict=False)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    ### Similarity Ranking
+    simrank_path = Similarity_Ranking(retrieved_words, simrank_path, SentenceTransformer_model)
+    ### Similarity Ranking
+    ### Iteractively Merging
+    merging_path, selected_modifiers_path, output_data = Iteractively_Merging(simrank_path, merging_path, selected_modifiers_path, SIMILARITY_THRESHOLD)
+    ### Iteractively Merging

examples/Stage1_RAPO/word_augment.sh ADDED Viewed

	@@ -0,0 +1,17 @@

+#!/bin/bash
+input_path="./data/test_prompts.txt"
+output_dir="./output/refactor/"
+SIMILARITY_THRESHOLD=0.6
+output_simrank="${output_dir}/simrank.csv"
+output_selected_modifiers="${output_dir}/selected_modifiers.txt"
+output_interactive_merging="${output_dir}/merging_reuslts.txt"
+python word_augment.py\
+--retrieved_words "./output/retrieve_words/retrieved_words.txt" \
+--input_path "${input_path}" \
+--output_simrank "${output_simrank}" \
+--output_selected_modifiers "${output_selected_modifiers}" \
+--output_interactive_merging "${output_interactive_merging}" \
+--SIMILARITY_THRESHOLD "${SIMILARITY_THRESHOLD}" \

examples/Stage2_SSPO/examples.csv ADDED Viewed

	@@ -0,0 +1,10 @@

+captions,phys_law
+A swimmer splashing in the sea water.,"Due to momentum transfer to water during strokes and kicks, splashes and waves are generated."
+Pouring milk into still tea.,"Due to density difference and diffusion, milk disperses and mixes with tea."
+Cloth banner hanging from wooden twig.,"Due to gravity balanced by tension, cloth banner reaches static equilibrium."
+Hand shaking salt shaker.,"Due to acceleration overcoming static friction, salt grains flow out."
+Peeler peels an apple.,"Due to shear force exceeding the skin’s strength, thin layers of peel are removed."
+An electric beater whips cream in a bowl.,"Due to rapid mechanical agitation, cream incorporates air and thickens."
+A waterfall cascades over jagged rocks.,"Due to gravitational acceleration, water flows downward and impacts surface creating turbulence."
+A coffee pot pours a morning cup of joe.,"Due to gravity, liquid flows in a stream shaped by surface tension and viscosity."
+Bottle crashes onto concrete floor.,"Due to gravitational fall and brittle fracture on impact, the bottle breaks and energy dissipates as sound and shards."

examples/Stage2_SSPO/phyaware_wan2.1.py ADDED Viewed

	@@ -0,0 +1,364 @@

+# -*- coding: utf-8 -*-
+import csv
+import torch
+from diffusers import AutoencoderKLWan, WanPipeline
+from diffusers.utils import export_to_video
+from pathlib import Path
+import numpy as np
+import cv2
+import pandas as pd
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# === VLM dependencies ===
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# ---------------------------
+# VLM-based alignment assessment
+# ---------------------------
+def misalignment_assessment(
+    qwen_vl_path: str,
+    video_path: str = "",
+    prompt: str = "",
+    max_new_tokens: int = 256,
+    device: str = "cuda"
+):
+    """
+    Use Qwen2.5-VL to assess how well the video aligns with the text description.
+    Return the model's response string.
+    """
+    # Load model and processor
+    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+        qwen_vl_path, torch_dtype="auto", device_map="auto"
+    )
+    processor = AutoProcessor.from_pretrained(qwen_vl_path)
+    # Evaluation template
+    eval_template = f"""
+Evaluate how well the video aligns with the given text prompt.
+Consider whether the objects, actions, and scene described in the prompt are accurately represented in the video.
+Provide a brief explanation and assign an alignment score from 1 (completely misaligned) to 5 (perfectly aligned).
+(A) PROMPT: \"\"\"{prompt}\"\"\"
+    """
+    # Build messages
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "video", "video": video_path},
+                {"type": "text", "text": eval_template},
+            ],
+        }
+    ]
+    # Prepare inputs
+    text = processor.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    image_inputs, video_inputs = process_vision_info(messages)
+    inputs = processor(
+        text=[text],
+        images=image_inputs,
+        videos=video_inputs,
+        padding=True,
+        return_tensors="pt",
+    )
+    inputs = inputs.to(device)
+    # Generate output
+    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
+    generated_ids_trimmed = [
+        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+    ]
+    output_text_list = processor.batch_decode(
+        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+    )
+    # Return string (take the first item if it's a list)
+    return output_text_list[0] if isinstance(output_text_list, list) and len(output_text_list) > 0 else ""
+# ---------------------------
+# Wan pipeline and video generation
+# ---------------------------
+def load_model(model_id: str) -> WanPipeline:
+    """
+    Load WanPipeline with its VAE.
+    """
+    vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
+    pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
+    pipe.to("cuda")
+    return pipe
+def generate_single_video(
+    pipe: WanPipeline,
+    prompt: str,
+    output_file_path: Path,
+    negative_prompt: str,
+    seed: int = 1423
+) -> None:
+    """
+    Generate a single video and save it to disk.
+    """
+    generator = torch.Generator(device="cuda").manual_seed(seed)
+    print(f"▶️ Generating video: {prompt}")
+    output = pipe(
+        prompt=prompt.strip(),
+        negative_prompt=negative_prompt,
+        height=480,
+        width=832,
+        num_frames=81,
+        guidance_scale=5.0,
+        generator=generator
+    ).frames[0]
+    export_to_video(output, str(output_file_path), fps=15)
+    print(f"✅ Saved: {output_file_path}")
+def extract_optical_flow(video_path: str, sample_interval_sec: float = 0.5) -> list:
+    """
+    Sample frames from the video and compute mean optical flow between adjacent samples.
+    """
+    cap = cv2.VideoCapture(video_path)
+    if not cap.isOpened():
+        raise IOError(f"Cannot open video: {video_path}")
+    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    fps = cap.get(cv2.CAP_PROP_FPS)
+    sample_every = int(fps * sample_interval_sec) if fps and fps > 0 else 1
+    frames = []
+    for i in range(0, frame_count, sample_every):
+        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
+        ret, frame = cap.read()
+        if ret:
+            frames.append(frame)
+        else:
+            break
+    cap.release()
+    flows = []
+    for i in range(len(frames) - 1):
+        prev_gray = cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)
+        next_gray = cv2.cvtColor(frames[i + 1], cv2.COLOR_BGR2GRAY)
+        flow = cv2.calcOpticalFlowFarneback(
+            prev_gray, next_gray, None,
+            pyr_scale=0.5, levels=3, winsize=15,
+            iterations=3, poly_n=5, poly_sigma=1.2, flags=0
+        )
+        mean_flow_x = float(np.mean(flow[..., 0]))
+        mean_flow_y = float(np.mean(flow[..., 1]))
+        flows.append((mean_flow_x, mean_flow_y))
+    return flows
+# ---------------------------
+# Physics consistency + VLM alignment fusion + prompt refinement
+# ---------------------------
+def evaluate_physical_consistency(
+    flows: list,
+    pysical_rule: str,
+    text_prompt: str,
+    instruct_llm_path: str,
+    vlm_alignment: str = ""
+) -> tuple:
+    """
+    Physics consistency analysis + VLM alignment assessment fusion + prompt refinement.
+    Returns: (mismatch_summary, refined_prompt)
+    """
+    model = AutoModelForCausalLM.from_pretrained(
+        instruct_llm_path,
+        torch_dtype="auto",
+        device_map="auto"
+    )
+    tokenizer = AutoTokenizer.from_pretrained(instruct_llm_path)
+    # Phase 1: Physics plausibility check based on optical flow
+    physics_check_prompt = (
+        "You are an expert in physics and motion analysis. I am providing you with a prompt for generating a video "
+        "and the optical-flow motion statistics extracted from that generated video.\n\n"
+        f"Prompt for the video: {text_prompt}\n"
+        f"Sequence of average optical flow vectors (x, y) per sample: {flows}\n\n"
+        "Task: Judge whether the motion is physically plausible, referencing laws such as inertia, conservation of momentum, "
+        "buoyancy, and continuous force application. Provide a concise final conclusion only (no process), e.g., "
+        "\"Sudden global reversals without external force violate inertia\" or \"No obvious physical inconsistency\".\n"
+        "Examples:\n"
+        "Response 1: Objects or liquids have sudden reverse motion between adjacent frames; if there is no external force explanation "
+        "(such as secondary collision, bounce), this sudden acceleration does not conform to the law of inertia; "
+        "in particular, liquids or debris usually do not have overall reverse flow.\n"
+        "Response 2: Based on the extracted optical flow, there are no obvious physical inconsistencies in this video. "
+        "The motion is smooth, directional, and realistic in magnitude and trend. There are no sudden reversals of direction or unrealistic oscillations."
+    )
+    messages = [
+        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
+        {"role": "user", "content": physics_check_prompt}
+    ]
+    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+    generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
+    generated_ids = [
+        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+    ]
+    physics_response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+    # Phase 2: Fuse VLM alignment + physics issues -> rewrite prompt
+    rewrite_prompt = (
+        "You are a prompt engineering expert for diffusion-based text-to-video generation. "
+        "Refine the prompt so that the next generated video better matches real-world physics and the intended semantics.\n\n"
+        f"Related physical rule to obey: {pysical_rule}\n"
+        f"Original prompt: {text_prompt}\n"
+        "Detected mismatches:\n"
+        f"- Optical-flow-based physics analysis: {physics_response}\n"
+        f"- VLM alignment assessment (semantic/temporal/object-action alignment): {vlm_alignment}\n\n"
+        "Requirements for the refined prompt:\n"
+        "- Describe the expected video content directly; do not mention rules, analysis, or this instruction.\n"
+        "- Keep it under 120 words.\n"
+        "- Preserve the core intent but explicitly constrain motions, forces, object states, timings, and camera if helpful."
+    )
+    rewrite_messages = [
+        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
+        {"role": "user", "content": rewrite_prompt}
+    ]
+    text = tokenizer.apply_chat_template(rewrite_messages, tokenize=False, add_generation_prompt=True)
+    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+    generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
+    generated_ids = [
+        output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+    ]
+    refined_prompt = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+    # Summarize mismatches for logging
+    mismatch_summary = f"[Physics] {physics_response}  ||  [VLM] {vlm_alignment}"
+    return mismatch_summary, refined_prompt
+# ---------------------------
+# Main workflow: unchanged logic + call VLM assessment
+# ---------------------------
+if __name__ == "__main__":
+    # ==== Centralized checkpoint and path configuration ====
+    WAN_MODEL_ID = "../../ckpt/Wan2.1-T2V-1.3B-Diffusers"  # Wan T2V checkpoint
+    INSTRUCT_LLM_PATH = "../../ckpt//Qwen2.5-7B-Instruct"    # Instruction-tuned LLM for physics/rewrite
+    QWEN_VL_PATH = ../../ckpt//qwen2.5-vl-7B-instruct"      # VLM for alignment assessment
+    # Output and data
+    OUTPUT_DIR = Path("./results/examples_refined/")
+    OUTPUT_LOG = Path("./results/examples_refined/refined_prompts.csv")
+    CSV_PATH = Path("examples.csv")
+    # Negative prompt (not a checkpoint path)
+    NEGATIVE_PROMPT = (
+        "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, "
+        "images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, "
+        "incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, "
+        "misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
+    )
+    # Prepare I/O
+    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+    OUTPUT_LOG.parent.mkdir(parents=True, exist_ok=True)
+    # Load T2V pipeline
+    pipe = load_model(WAN_MODEL_ID)
+    # Read input CSV
+    df = pd.read_csv(CSV_PATH)
+    # Number of refinement iterations per prompt
+    num_refine_iterations = 5
+    # === Load existing log (if any) for resume capability ===
+    # Log columns: base_name, iter_idx, video_file, mismatch, refined_prompt
+    if OUTPUT_LOG.exists() and OUTPUT_LOG.stat().st_size > 0:
+        log_df = pd.read_csv(OUTPUT_LOG, header=None, names=["base_name", "iter_idx", "video_file", "mismatch", "refined_prompt"])
+    else:
+        log_df = pd.DataFrame(columns=["base_name", "iter_idx", "video_file", "mismatch", "refined_prompt"])
+    # Main loop over rows
+    for idx, row in df.iterrows():
+        base_name = f"{idx + 1:07d}"
+        PHYSICAL_RULE = row['phys_law']
+        orig_prompt = row['captions']
+        # Recover latest refined prompt if previous iterations exist
+        done_rows = log_df[log_df["base_name"] == base_name].sort_values("iter_idx")
+        if not done_rows.empty:
+            last_iter = int(done_rows["iter_idx"].iloc[-1])
+            prompt = str(done_rows["refined_prompt"].iloc[-1])
+            start_iter = last_iter + 1
+            print(f"\n=== {base_name} has {last_iter} recorded iterations, resuming from {start_iter} ===")
+        else:
+            prompt = orig_prompt
+            start_iter = 1
+            print(f"\n=== Processing Row {base_name} ===")
+        for i in range(start_iter, num_refine_iterations + 1):
+            video_file = OUTPUT_DIR / f"{base_name}_r{i}.mp4"
+            # Skip if this iteration already exists in the log (ensure the prompt chain stays consistent)
+            if not done_rows.empty and i in set(done_rows["iter_idx"].astype(int).tolist()):
+                prompt_i = str(done_rows[done_rows["iter_idx"] == i]["refined_prompt"].iloc[0])
+                prompt = prompt_i
+                print(f"[ Skip ] {base_name} iteration {i} already in log. Skipping.")
+                continue
+            # If the video exists but no log entry, evaluate and log directly
+            if video_file.exists():
+                print(f"[ Found ] Existing video: {video_file}, skipping generation and evaluating directly.")
+                try:
+                    flows = extract_optical_flow(str(video_file))
+                    vlm_text = misalignment_assessment(
+                        qwen_vl_path=QWEN_VL_PATH,
+                        video_path=str(video_file),
+                        prompt=orig_prompt,
+                        max_new_tokens=256,
+                        device="cuda"
+                    )
+                    mismatch, refined_prompt = evaluate_physical_consistency(
+                        flows, PHYSICAL_RULE, orig_prompt, INSTRUCT_LLM_PATH, vlm_alignment=vlm_text
+                    )
+                except Exception as e:
+                    print(f"[ Warn ] Evaluation failed for existing video: {e}. Skipping this iteration.")
+                    continue
+                print(f"[ Iter {i} ] Mismatch: {mismatch}")
+                print(f"[ Iter {i} ] Refined Prompt: {refined_prompt}")
+                with open(OUTPUT_LOG, mode='a', newline='', encoding='utf-8') as log_file:
+                    writer = csv.writer(log_file)
+                    writer.writerow([base_name, i, str(video_file), mismatch, refined_prompt])
+                prompt = refined_prompt  # Use for the next iteration
+                continue
+            # Normal path: generate -> optical flow -> VLM assess -> fuse -> log
+            generate_single_video(pipe, prompt, video_file, NEGATIVE_PROMPT)
+            flows = extract_optical_flow(str(video_file))
+            vlm_text = misalignment_assessment(
+                qwen_vl_path=QWEN_VL_PATH,
+                video_path=str(video_file),
+                prompt=orig_prompt,
+                max_new_tokens=256,
+                device="cuda"
+            )
+            mismatch, refined_prompt = evaluate_physical_consistency(
+                flows, PHYSICAL_RULE, orig_prompt, INSTRUCT_LLM_PATH, vlm_alignment=vlm_text
+            )
+            print(f"[ Iter {i} ] Mismatch: {mismatch}")
+            print(f"[ Iter {i} ] Refined Prompt: {refined_prompt}")
+            with open(OUTPUT_LOG, mode='a', newline='', encoding='utf-8') as log_file:
+                writer = csv.writer(log_file)
+                writer.writerow([base_name, i, str(video_file), mismatch, refined_prompt])
+            prompt = refined_prompt  # Use for the next iteration

requirement.txt ADDED Viewed

	@@ -0,0 +1,7 @@

+networkx==3.4.2
+sentence-transformers==4.1.0
+tqdm==4.64.0
+transformers==4.51.3
+pandas==2.2.3
+protobuf==6.30.2
+accelerate==1.6.0

requirements.txt ADDED Viewed

	@@ -0,0 +1,17 @@

+gradio==5.49.1
+gradio-client==1.13.3
+httpx>=0.24.1,<1.0
+ruff>=0.9.3
+huggingface_hub>=0.20.0
+sentence-transformers>=2.0.0
+sentencepiece==0.2.1
+torch==2.5.1
+torchvision==0.20.1
+torchaudio==2.5.1
+networkx==3.4.2
+tqdm==4.64.0
+transformers==4.51.3
+pandas==2.2.3
+protobuf==6.30.2
+accelerate==1.6.0
+spaces